
Input/output

There is one aspect of our driver classes that we have mentioned several times without getting into a detailed explanation: the format and structure of the data input into and output from MapReduce jobs.

Files, splits, and records

We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper.

InputFormat and RecordReader

Hadoop has the concept of an InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce package provides two methods as shown in the following code:

public abstract class InputFormat<K, V>
{
  public abstract List<InputSplit> getSplits(JobContext context)
    throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
    TaskAttemptContext context) throws IOException, InterruptedException;
}

These methods display the two responsibilities of the InputFormat class:

  • To provide the details on how to split an input file into the splits required for map processing
  • To create a RecordReader class that will generate the series of key/value pairs from a split

The RecordReader class is also an abstract class within the org.apache.hadoop.mapreduce package:

public abstract class RecordReader<Key, Value> implements Closeable
{
  public abstract void initialize(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException;
  public abstract boolean nextKeyValue()
    throws IOException, InterruptedException;
  public abstract Key getCurrentKey()
    throws IOException, InterruptedException;
  public abstract Value getCurrentValue()
    throws IOException, InterruptedException;
  public abstract float getProgress()
    throws IOException, InterruptedException;
  public abstract void close() throws IOException;
}

A RecordReader instance is created for each split. The framework calls nextKeyValue() on it, which returns a Boolean indicating whether another key/value pair is available; if so, the getCurrentKey() and getCurrentValue() methods are used to access the key and value respectively.
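To make this calling sequence concrete, the following is a minimal sketch of the loop that drains a RecordReader. It is illustrative only: in a real job this loop lives inside the framework, not in user code, and the RecordReaderLoop class name is invented for the example. It uses LineRecordReader, one of the Hadoop-provided implementations discussed shortly.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Simplified sketch of how the framework consumes a RecordReader for one split.
public class RecordReaderLoop
{
  public static void drain(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException
  {
    RecordReader<LongWritable, Text> reader = new LineRecordReader();
    reader.initialize(split, context);
    while (reader.nextKeyValue())
    {
      LongWritable key = reader.getCurrentKey();
      Text value = reader.getCurrentValue();
      // Each key/value pair would be passed to the mapper's map() method here.
    }
    reader.close();
  }
}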

The combination of the InputFormat and RecordReader classes is therefore all that is required to bridge between any kind of input data and the key/value pairs required by MapReduce.
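As a simple illustration of how little glue can be required, the following hypothetical InputFormat inherits its split logic from FileInputFormat and delegates record parsing to LineRecordReader (both classes are described in the next sections); this is essentially what the built-in TextInputFormat does. The SimpleTextInputFormat name is invented for the example.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// getSplits() is inherited from FileInputFormat, so only the RecordReader
// factory method needs to be supplied.
public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text>
{
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
    TaskAttemptContext context)
  {
    // The framework calls initialize() on the returned reader itself.
    return new LineRecordReader();
  }
}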

Hadoop-provided InputFormat

There are some Hadoop-provided InputFormat implementations within the org.apache.hadoop.mapreduce.lib.input package; a short driver fragment showing how one of them is selected follows the list:

  • FileInputFormat: This is an abstract base class that can be the parent of any file-based input
  • SequenceFileInputFormat: This reads data from SequenceFiles, an efficient binary file format that will be discussed in an upcoming section
  • TextInputFormat: This is used for plain text files
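The following hypothetical driver fragment shows how one of these is selected; TextInputFormat is the default, so setting it explicitly is shown only for clarity, and the job name and input path are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Minimal driver fragment: choose the InputFormat and point it at an input path.
public class InputFormatFragment
{
  public static Job configure(Configuration conf, String inputPath) throws Exception
  {
    Job job = new Job(conf, "input format example");
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    // Mapper, reducer, and output settings would be configured here as usual.
    return job;
  }
}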

Tip

The pre-0.20 API has additional InputFormats defined in the org.apache.hadoop.mapred package.

Note that InputFormats are not restricted to reading from files; FileInputFormat is only one specialization of InputFormat. It is possible to have Hadoop use data that is not based on files as the input to MapReduce jobs; common sources are relational databases or HBase.
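For example, the DBInputFormat class in the org.apache.hadoop.mapreduce.lib.db package reads its input from a relational database over JDBC. The following is only a sketch of how it might be wired up; the NameRecord class, the JDBC driver, connection URL, credentials, table, and column names are all invented for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DatabaseInputFragment
{
  // A record class maps a row to fields; it must be usable both as a
  // Hadoop Writable and as a DBWritable.
  public static class NameRecord implements Writable, DBWritable
  {
    private String name;

    public void readFields(ResultSet resultSet) throws SQLException
    {
      name = resultSet.getString("name");
    }
    public void write(PreparedStatement statement) throws SQLException
    {
      statement.setString(1, name);
    }
    public void readFields(DataInput in) throws IOException
    {
      name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException
    {
      out.writeUTF(name);
    }
  }

  public static void configure(Job job)
  {
    // Register the JDBC connection details on the job configuration.
    DBConfiguration.configureDB(job.getConfiguration(), "com.mysql.jdbc.Driver",
      "jdbc:mysql://dbhost/mydb", "user", "password");
    // Read the "name" column of the "employees" table, ordered by "name".
    DBInputFormat.setInput(job, NameRecord.class, "employees", null, "name", "name");
    job.setInputFormatClass(DBInputFormat.class);
  }
}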

Hadoop-provided RecordReader

Similarly, Hadoop provides a few common RecordReader implementations, which are also present within the org.apache.hadoop.mapreduce.lib.input package:

  • LineRecordReader: This implementation is the default RecordReader class for text files; it presents the byte offset of the line within the file as the key and the line contents as the value
  • SequenceFileRecordReader: This implementation reads the key/value from the binary SequenceFile container

Again, the pre-0.20 API has additional RecordReader classes in the org.apache.hadoop.mapred package, such as KeyValueLineRecordReader, that have not yet been ported to the new API.

OutputFormat and RecordWriter

There is a similar pattern for writing the output of a job, coordinated by subclasses of OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce package. We will not explore these in any detail here, but the general approach is similar, although OutputFormat has a more involved API, as it includes methods for tasks such as validation of the output specification.

Tip

It is this validation step (the checkOutputSpecs() method) that causes a job to fail if the specified output directory already exists. If you want different behavior, you need a subclass of OutputFormat that overrides this method.
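If you did need that behavior, a subclass along the following lines would be one way to sketch it. The LenientTextOutputFormat name is invented, and note that the real FileOutputFormat.checkOutputSpecs() also performs other work (such as obtaining security tokens) that is simply skipped here, so existing output files may be silently overwritten.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Sketch of an OutputFormat that does not refuse to run when the output
// directory already exists.
public class LenientTextOutputFormat<K, V> extends TextOutputFormat<K, V>
{
  @Override
  public void checkOutputSpecs(JobContext context) throws IOException
  {
    Path outDir = getOutputPath(context);
    if (outDir == null)
    {
      throw new IOException("Output directory not set.");
    }
    // Deliberately omit the check that makes the parent class throw
    // FileAlreadyExistsException when the directory is already present.
  }
}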

Hadoop-provided OutputFormat

The following OutputFormats are provided in the org.apache.hadoop.mapreduce.lib.output package:

  • FileOutputFormat: This is the base class for all file-based OutputFormats
  • NullOutputFormat: This is a dummy implementation that discards its output and writes nothing at all
  • SequenceFileOutputFormat: This writes to the binary SequenceFile format
  • TextOutputFormat: This writes a plain text file

Note that these classes define their required RecordWriter implementations as inner classes so there are no separately provided RecordWriter implementations.

Don't forget Sequence files

The SequenceFile class within the org.apache.hadoop.io package provides an efficient binary file format that is often useful as an output from a MapReduce job. This is especially true if the output from the job is processed as the input of another job (a sketch of such a chain appears at the end of this section). Sequence files have several advantages, as follows:

  • As binary files, they are intrinsically more compact than text files
  • They additionally support optional compression, which can be applied at different levels, that is, compressing each record individually or compressing records together in blocks
  • The file can be split and processed in parallel

This last characteristic is important, as most binary formats (particularly those that are compressed or encrypted) cannot be split and must be read as a single linear stream of data. Using such files as input to a MapReduce job means that a single mapper will be used to process the entire file, causing a potentially large performance hit. In such a situation, it is preferable to either use a splittable format such as SequenceFile or, if you cannot avoid receiving the file in the other format, perform a preprocessing step that converts it into a splittable format. This is a trade-off, as the conversion takes time, but in many cases (especially with complex map tasks) the time saved outweighs the cost.
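To illustrate the job-chaining pattern mentioned earlier, the following hypothetical driver fragments write an intermediate SequenceFile from the first job and read it back in the second; the paths, job names, and key/value types are invented, and the mapper and reducer settings are omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// The first job writes a SequenceFile that the second job reads, so the
// intermediate data stays compact and splittable.
public class SequenceFileChain
{
  public static Job firstJob(Configuration conf) throws Exception
  {
    Job job = new Job(conf, "produce sequence file");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/raw"));
    FileOutputFormat.setOutputPath(job, new Path("/data/intermediate"));
    return job;
  }

  public static Job secondJob(Configuration conf) throws Exception
  {
    Job job = new Job(conf, "consume sequence file");
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/intermediate"));
    FileOutputFormat.setOutputPath(job, new Path("/data/final"));
    return job;
  }
}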