Batch layer for data processing
The core strength of Hadoop has been its ability to run fast, well-optimized batch processes. It proved to be a big success in solving some of the more complex problems of long-running batch processing within organizations. The initial implementations of Hadoop were based on the open source Hadoop distributions; however, as the need for professional support grew, a number of features were added to make Hadoop feasible for enterprise use in terms of provisioning, management, monitoring, and alerting. This resulted in more customized distributions, led by MapR, Cloudera, and Hortonworks.
Figure 03: The Hadoop 1 framework
As shown in this image, the Hadoop 1 framework can be broadly divided into Storage and Processing. Storage is represented by the Hadoop Distributed File System (HDFS), while processing is represented by the MapReduce API. Hadoop 2 added a number of improved capabilities with the introduction of YARN, and this representation changed as shown here:
Figure 04: The Hadoop 2 framework
A number of other frameworks could also have been considered here, such as Pig scripts, Hive queries, and Impala queries. Pig scripts and Hive queries have traditionally been executed internally as MapReduce batch jobs (Impala, by contrast, runs queries on its own execution engine), and such jobs can additionally be orchestrated with actions, stages, and parameters in an Oozie workflow or cascade.
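To make the MapReduce model behind these batch jobs more concrete, the following is a minimal, in-memory sketch of its three phases (map, shuffle, and reduce) written in plain Python. It is not the Hadoop API: the names map_words, reduce_counts, and run_job are hypothetical and chosen only for illustration, and a real job would read its input from HDFS and run its map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: collapse all values seen for one key into a single total."""
    return word, sum(counts)


def run_job(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical single-process driver (an assumption, not a Hadoop class)."""
    # Shuffle: group every value emitted by the mapper under its key.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # Reduce: one call per key, visiting keys in sorted order.
    return dict(reducer(key, grouped[key]) for key in sorted(grouped))


if __name__ == "__main__":
    sample = [
        "Hadoop stores data in HDFS",
        "Hadoop processes data with MapReduce",
    ]
    print(run_job(sample, map_words, reduce_counts))
```

The point of the sketch is the separation of concerns: the mapper and reducer contain only per-record and per-key logic, while grouping and scheduling are left to the driver. In Hadoop, that driver role is played by the framework itself.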
Let's have a closer look at how Hadoop MapReduce batch jobs work. A Hadoop MapReduce job relies on various key components that were introduced as part of the initial Hadoop framework. The Hadoop framework itself has evolved from Hadoop 1 to Hadoop 2. While Hadoop 1 established the capability of running MapReduce batch jobs, it suffered from a single point of failure. With the introduction of Hadoop 2, this single point of failure was eliminated and a few other key capabilities were added. For this discussion, we will cover Hadoop 2 and compare it with Hadoop 1 as we go through the details. Hadoop 2, for batch frameworks, consists of the following key components:
Figure 05: A typical Hadoop cluster