Hi, I have a question about sequence files: why should I use one, and under what circumstances?
Say I have a CSV file. I can store it directly in HDFS. But if I know that the first 2 fields are some kind of key, and most MR jobs will query on that key, does it make sense to store the data as a sequence file instead? What benefits would that bring?
The main benefit I want is reduced IO for MR jobs, but I'm not sure a sequence file can give me that. If the data is stored as key/value pairs in the sequence file, and the mapper/reducer will mostly use only the key part to compare/sort, what difference does it make if I just store it as a flat file and use the first 2 fields as the key?
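To show what I mean by the flat-file alternative, here is a rough sketch in plain Java (no Hadoop classes). The class name `CsvKeySplit` is made up for illustration, and it assumes simple comma-separated text with no quoted or escaped commas.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper: treats the first 2 CSV fields as the key
// and the rest of the line as the value.
final class CsvKeySplit {

    // Split "f1,f2,rest..." into key "f1,f2" and value "rest...".
    // Assumption: no quoted or escaped commas in the data.
    static Map.Entry<String, String> split(String line) {
        int first = line.indexOf(',');
        int second = first < 0 ? -1 : line.indexOf(',', first + 1);
        if (second < 0) {
            // Fewer than three fields: the whole line is the key.
            return new SimpleImmutableEntry<>(line, "");
        }
        return new SimpleImmutableEntry<>(
                line.substring(0, second), line.substring(second + 1));
    }
}
```

With this approach the key is found by scanning only up to the second comma, but the whole line has still been read from the file by then, which is why I doubt it saves any IO by itself.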
A mapper reading a sequence file will scan the whole content of the file anyway. If only the key part is compared, do we save IO by NOT deserializing the value part, assuming some optimization is done there? It sounds like we could avoid deserializing the value part when it isn't needed. Is that the benefit? If not, why would I use the key/value format instead of just (Text, Text)? Assume my data doesn't contain any binary data.
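To make the "skip the value" idea concrete, here is a minimal sketch of what I have in mind. It is plain Java with no Hadoop dependency. The class `KeyOnlyScan` and its length-prefixed record layout are made up for illustration; this is not the actual SequenceFile on-disk format, and I am not claiming Hadoop works this way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: NOT the real SequenceFile format.
final class KeyOnlyScan {

    // Each record is written as: keyLength, valueLength, key bytes, value bytes.
    // records[i][0] is the key, records[i][1] is the value.
    static byte[] writeRecords(String[][] records) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (String[] record : records) {
                byte[] key = record[0].getBytes(StandardCharsets.UTF_8);
                byte[] value = record[1].getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);
                out.writeInt(value.length);
                out.write(key);
                out.write(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decode only the keys. Value bytes are skipped and never decoded.
    static List<String> readKeysOnly(byte[] data) {
        List<String> keys = new ArrayList<>();
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int keyLength = in.readInt();
                int valueLength = in.readInt();
                byte[] key = new byte[keyLength];
                in.readFully(key);
                keys.add(new String(key, StandardCharsets.UTF_8));
                // skipBytes may skip fewer bytes than requested, so loop.
                int skipped = 0;
                while (skipped < valueLength) {
                    int n = in.skipBytes(valueLength - skipped);
                    if (n <= 0) {
                        throw new IOException("truncated value");
                    }
                    skipped += n;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }
}
```

Because the value length is stored up front, the reader can jump over the value without parsing it. Even so, in this sketch every byte is still part of the input that was loaded, so what is saved is the decoding work, and I can't tell whether that would translate into less disk IO in a real MR job.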