
Input/output
There is one aspect of our driver classes that we have mentioned several times without getting into a detailed explanation: the format and structure of the data input into and output from MapReduce jobs.
Files, splits, and records
We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper.
InputFormat and RecordReader
Hadoop has the concept of an InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce package provides two methods, as shown in the following code:
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
These methods reflect the two responsibilities of the InputFormat class:
- To provide the details on how to split an input file into the splits required for map processing
- To create a RecordReader class that will generate the series of key/value pairs from a split
The RecordReader class is also an abstract class within the org.apache.hadoop.mapreduce package:
public abstract class RecordReader<Key, Value> implements Closeable {
  public abstract void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;
  public abstract boolean nextKeyValue()
      throws IOException, InterruptedException;
  public abstract Key getCurrentKey()
      throws IOException, InterruptedException;
  public abstract Value getCurrentValue()
      throws IOException, InterruptedException;
  public abstract float getProgress()
      throws IOException, InterruptedException;
  public abstract void close() throws IOException;
}
A RecordReader instance is created for each split. Its nextKeyValue method returns a Boolean indicating whether another key/value pair is available; if so, the getCurrentKey and getCurrentValue methods are used to access the key and value respectively.
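The following sketch shows, in simplified form, how these methods are driven for a single split. The class and method names (SplitDriverSketch, dumpSplit) are purely illustrative and not part of the Hadoop API; the framework performs an equivalent loop internally before handing each pair to the mapper.

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative helper (not part of Hadoop): consumes one RecordReader over one split
// in the same style the framework uses before passing each pair to the Mapper.
public class SplitDriverSketch {
  public static <K, V> void dumpSplit(RecordReader<K, V> reader, InputSplit split,
      TaskAttemptContext context) throws Exception {
    reader.initialize(split, context);
    while (reader.nextKeyValue()) {          // false once the split is exhausted
      K key = reader.getCurrentKey();        // key of the current pair
      V value = reader.getCurrentValue();    // value of the current pair
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}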
The combination of the InputFormat and RecordReader classes is therefore all that is required to bridge between any kind of input data and the key/value pairs required by MapReduce.
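To make this concrete, the following sketch shows a common pattern that treats each input file as a single record, producing one key/value pair per file. Neither class ships with Hadoop; the names WholeFileInputFormat and WholeFileRecordReader are our own, and the code is a sketch rather than a hardened implementation.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative InputFormat: each file becomes one split and one record.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split; one mapper per file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  // Illustrative RecordReader: reads the whole file into a single BytesWritable value.
  public static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;  // only one record per split
      }
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
  }
}

Because isSplitable returns false, each file is processed by a single mapper, which is the appropriate trade-off when a record cannot be cut at arbitrary byte boundaries.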
Hadoop-provided InputFormat
There are some Hadoop-provided InputFormat implementations within the org.apache.hadoop.mapreduce.lib.input package:
- FileInputFormat: This is an abstract base class that can be the parent of any file-based input
- SequenceFileInputFormat: This reads from the efficient binary SequenceFile format, which will be discussed in an upcoming section
- TextInputFormat: This is used for plain text files (see the driver fragment after this list)
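A typical driver selects the InputFormat explicitly. The fragment below is a sketch only: the job object and the "/input" path are assumptions standing in for whatever the rest of the driver sets up.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch of the input-related lines of a driver; "job" and the path are assumptions.
public class InputConfigSketch {
  public static void configureInput(Job job) throws Exception {
    job.setInputFormatClass(TextInputFormat.class);          // plain text input
    FileInputFormat.addInputPath(job, new Path("/input"));   // hypothetical input directory
  }
}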
Tip
The pre-0.20 API has additional InputFormats defined in the org.apache.hadoop.mapred package.
Note that InputFormats are not restricted to reading from files; FileInputFormat is itself only one subclass of the more general InputFormat. It is possible to have Hadoop use data that is not based on files as the input to MapReduce jobs; common sources are relational databases or HBase.
Hadoop-provided RecordReader
Similarly, Hadoop provides a few common RecordReader implementations, which are also present within the org.apache.hadoop.mapreduce.lib.input package:
- LineRecordReader: This implementation is the default RecordReader class for text files; it presents the byte offset of each line as the key and the line contents as the value (a matching mapper sketch follows this list)
- SequenceFileRecordReader: This implementation reads the key/value pairs from the binary SequenceFile container
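To make the pairing concrete, here is a minimal mapper sketch whose input types match what TextInputFormat and LineRecordReader produce. The class name OffsetEchoMapper is our own, and the mapper simply echoes each pair unchanged.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal mapper matching the pairs produced by TextInputFormat/LineRecordReader:
// the key is the byte offset of the line (LongWritable), the value is the line (Text).
public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    context.write(offset, line);  // emit the pair unchanged
  }
}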
Again, the pre-0.20 API has additional RecordReader classes in the org.apache.hadoop.mapred package, such as KeyValueLineRecordReader, that have not yet been ported to the new API.
OutputFormat and RecordWriter
There is a similar pattern for writing the output of a job, coordinated by subclasses of OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce package. We'll not explore these in any detail here, but the general approach is similar, though OutputFormat does have a more involved API, as it has methods for tasks such as validation of the output specification.
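For comparison, the heart of the OutputFormat abstract class consists of three methods. The simplified listing below reflects the org.apache.hadoop.mapreduce API as we understand it; checkOutputSpecs is the output-specification validation just mentioned.

public abstract class OutputFormat<K, V> {
  public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException;
  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;
  public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException;
}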
Hadoop-provided OutputFormat
The following OutputFormats are provided in the org.apache.hadoop.mapreduce.lib.output package:
- FileOutputFormat: This is the base class for all file-based OutputFormats
- NullOutputFormat: This is a dummy implementation that discards the output and writes nothing
- SequenceFileOutputFormat: This writes to the binary SequenceFile format
- TextOutputFormat: This writes a plain text file (see the driver fragment below)
Note that these classes define their required RecordWriter implementations as inner classes, so there are no separately provided RecordWriter implementations.
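As with input, the driver selects the OutputFormat. The fragment below is again a sketch; the job object and the "/output" path are assumptions standing in for the rest of the driver.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Sketch of the output-related lines of a driver; "job" and the path are assumptions.
public class OutputConfigSketch {
  public static void configureOutput(Job job) throws Exception {
    job.setOutputFormatClass(TextOutputFormat.class);          // plain text output
    FileOutputFormat.setOutputPath(job, new Path("/output"));  // hypothetical output directory
  }
}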
Don't forget Sequence files
The SequenceFile class within the org.apache.hadoop.io package provides an efficient binary file format that is often useful as an output from a MapReduce job. This is especially true if the output from the job is processed as the input of another job. Sequence files have several advantages, as follows:
- As binary files, they are intrinsically more compact than text files
- They additionally support optional compression, which can also be applied at different levels, that is, compressing each record or a block of records
- The file can be split and processed in parallel
This last characteristic is important, as most binary formats—particularly those that are compressed or encrypted—cannot be split and must be read as a single linear stream of data. Using such files as input to a MapReduce job means that a single mapper will be used to process the entire file, causing a potentially large performance hit. In such a situation, it is preferable to either use a splittable format such as SequenceFile, or, if you cannot avoid receiving the file in another format, perform a preprocessing step that converts it into a splittable format. This will be a trade-off, as the conversion will take time; but in many cases—especially with complex map tasks—this will be outweighed by the time saved.
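A driver fragment along these lines might configure block-compressed SequenceFile output as shown below. The class and method names here are our own, and DefaultCodec is only an example codec choice; block compression keeps the file splittable because the compression is applied per block inside the SequenceFile.

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Sketch: write the job's output as a block-compressed SequenceFile so that a
// follow-on job can read it in parallel; "job" is assumed to come from the driver.
public class SequenceFileOutputConfigSketch {
  public static void configure(Job job) {
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
  }
}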