
Running a simple MapReduce program
In this recipe, we will look at how to make sense of the data stored on HDFS and extract useful information from it, such as the number of occurrences of a string or a pattern, estimations, and various benchmarks. For this purpose, we can use MapReduce, a computation framework that helps us answer many of the questions we might have about the data.
With Hadoop, we can process huge amounts of data. However, to get an understanding of how it works, we'll start with simple programs such as pi estimation and word count.
ResourceManager is the master for Yet Another Resource Negotiator (YARN). The Namenode stores the file metadata, and the actual blocks/data reside on the slave nodes, called Datanodes. All jobs are submitted to the ResourceManager, which then assigns tasks to its slaves, called NodeManagers.
When a job is submitted to the ResourceManager (RM), it checks which queue the job is submitted to and whether the user has permission to submit to it. It then asks the Application Master launcher (AM Launcher) to launch an Application Master (AM) container on a node. From then on, the AM is responsible for running the map and reduce containers, with the help of inputs from the RM.
The AM takes care of failures and relaunches containers if needed. Although there are many more concepts, such as resource grants, the AppManager, and retry logic, this being a recipe book we will keep the theory to a minimum. The following diagram shows the relationship between the AM, the RM, and the NodeManagers.

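On a running cluster, you can also see this master/slave relationship from the command line. For example, the following command (its output will vary with your setup) lists the NodeManagers currently registered with the ResourceManager:
yarn node -list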
Getting ready
To step through the recipes in this chapter, make sure you have a running cluster with HDFS and YARN set up correctly, as discussed in the previous chapters. This can be a single-node or a multinode cluster, as long as the cluster is configured correctly.
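As a quick sanity check before proceeding, you can run jps on the nodes to confirm that the daemons are up; which daemons you see depends on the node's role:
jps
# On the master nodes you should see NameNode and ResourceManager;
# on the slave nodes, DataNode and NodeManager.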
How to do it...
- Connect to an edge node in the cluster, as this is the preferred way; however, you can connect to any node in the cluster. If you are connecting to a Datanode, please go through Chapter 2, Maintain Hadoop Cluster HDFS, to understand the disadvantages of doing so.
- Switch to the user hadoop.
- Navigate to the directory where Hadoop is installed. Hadoop is bundled with many examples to run, test the cluster, and get the user started.
- Under /opt/cluster/hadoop/share/hadoop/mapreduce/, there are a lot of example JARs.
- To see all the programs a JAR can run, execute the JAR without any arguments:
yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar
- To run a pi estimation job, execute hadoop-mapreduce-examples-2.7.2.jar with arguments as shown in the following screenshot; a complete command line also appears in the worked session at the end of this recipe:
Similarly, there are many other examples that can be run using the example JAR. To run a wordcount example, create a simple test file, copy it to HDFS, and execute the program as follows (the worked session at the end of this recipe shows the full sequence of commands):
yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input_file /out_directory
- The JAR hadoop-mapreduce-examples-2.7.2.jar is a bundled JAR that contains various classes we can call, such as wordcount, pi, and many more. Each class we call takes some input and writes some output. We can write our own classes/JARs and use just a part of the above bundled JAR.
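The following is a worked session for the two examples above, assuming the installation path used in this chapter; the pi arguments (number of map tasks and samples per map) and the contents of the test file are illustrative:
# Pi estimation with 10 map tasks and 100 samples per map
yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 10 100
# Word count: create a small input file, copy it to HDFS, and run the job
echo "hello hadoop hello yarn" > test.txt
hdfs dfs -put test.txt /input_file
# The output directory must not already exist
yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input_file /out_directory
# Read the result written by the reducer
hdfs dfs -cat /out_directory/part-r-00000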