Although many MapReduce applications deal with text files, there are many cases where processing binary data is required. In this case, you basically have two options: encode the binary data so that it can be safely represented as text, or use a file format designed for binary records, such as Hadoop's SequenceFile. Here we focus on the latter.
Using Sequence Files within Pydoop is easy. To write output key/value pairs in the SequenceFile format, add the following properties to your job configuration file:
<property>
  <name>mapred.output.format.class</name>
  <value>org.apache.hadoop.mapred.SequenceFileOutputFormat</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>desired_compression_type</value>
</property>
where desired_compression_type can be NONE, RECORD or BLOCK (see the SequenceFile documentation for details). To read key/value pairs written in the SequenceFile format, add the following property:
<property>
  <name>mapred.input.format.class</name>
  <value>org.apache.hadoop.mapred.SequenceFileInputFormat</value>
</property>
SequenceFile is mostly useful for handling complex objects such as C-style structs or images. To keep our example as simple as possible, we consider a situation where a MapReduce task needs to emit the raw bytes of an integer value.
We wrote a trivial application that reads the output of a previous word count run and filters out words whose count falls below a configurable threshold. Of course, the filter could have been applied directly in the word count reducer: the job has been artificially split into two runs to provide a SequenceFile read/write example.
Suppose you know in advance that most counts will be large, but small enough to fit in a 32-bit integer: since the decimal representation can require as many as 10 bytes, you can save space by having the word count reducer emit the raw four bytes of the integer instead:
import struct

class WordCountReducer(Reducer):

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), struct.pack(">i", s))
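As a quick sanity check (plain Python, independent of Hadoop), packing with struct always yields four bytes, while the decimal text form of a large count needs up to ten:

```python
import struct

count = 1234567890              # a large count that still fits in 32 bits
text = str(count)               # decimal text representation
raw = struct.pack(">i", count)  # big-endian 4-byte representation

print(len(text))  # 10
print(len(raw))   # 4
```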
Since newline characters can appear in the serialized values, you cannot use the standard text format, where each line contains a tab-separated key/value pair. The problem is solved by using SequenceFileOutputFormat for the word count job and SequenceFileInputFormat for the filtering application.
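On the read side, the filtering application recovers the count with struct.unpack. The following is a minimal sketch of the filtering logic only; the function name and threshold parameter are illustrative, not part of the actual example code:

```python
import struct

def keep_record(word, raw_count, threshold):
    """Core of a filtering mapper: unpack the 4-byte big-endian count
    emitted by the word count reducer and keep the pair only if the
    count meets the threshold."""
    count = struct.unpack(">i", raw_count)[0]
    return (word, count) if count >= threshold else None
```

In the real application this logic would run inside a mapper's map method, with the word and raw count coming from context.getInputKey() and context.getInputValue(), and the threshold read from the job configuration.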
For further details, see the source code under examples/sequence_file.