
Time for action – WordCount with a combiner
Let's add a combiner to our first WordCount example. In fact, let's use our reducer as the combiner. Since the combiner must have the same interface as the reducer, this is something you'll often see, though note that the type of processing involved in the reducer will determine if it is a true candidate for a combiner; we'll discuss this later. Since we are looking to count word occurrences, we can do a partial count on the map node and pass these subtotals to the reducer.
- Copy
WordCount1.java
toWordCount2.java
and change the driver class to add the following line between the definition of theMapper
andReducer
classes:job.setCombinerClass(WordCountReducer.class);
- Also change the class name to
WordCount2
and then compile it.$ javac WordCount2.java
- Create the JAR file.
$ jar cvf wc2.jar WordCount2*class
- Run the job on Hadoop.
$ hadoop jar wc2.jar WordCount2 test.txt output
- Examine the output.
$ hadoop fs -cat output/part-r-00000
What just happened?
This output may not be what you expected, as the value for the word is
is now incorrectly specified as 1 instead of 2.
The problem lies in how the combiner and reducer will interact. The value provided to the reducer, which was previously (is, 1, 1)
, is now (is, 2)
because our combiner did its own summation of the number of elements for each word. However, our reducer does not look at the actual values in the Iterable
object, it simply counts how many are there.
You need to be careful when writing a combiner. Remember that Hadoop makes no guarantees on how many times it may be applied to map output, it may be 0, 1, or more. It is therefore critical that the operation performed by the combiner can effectively be applied in such a way. Distributive operations such as summation, addition, and similar are usually safe, but, as shown previously, ensure the reduce logic isn't making implicit assumptions that might break this property.