
Apache Hadoop
Apache Hadoop is an open source software framework for storing and processing very large datasets in a distributed environment, pooling both storage and computational power, and it is designed to run mainly on low-cost commodity hardware. It has its roots in the Google File System paper published in October 2003, and in a second Google research paper on MapReduce, which described a simplified model for data processing on large clusters.
Apache Nutch is a highly scalable and extensible open source web crawler project that implemented a MapReduce facility and a distributed file system based on Google's research papers. These components were later split out into a sub-project called Apache Hadoop.
Apache Hadoop is designed to scale easily from a few servers to thousands. It processes data on the nodes where that data is stored, so work runs in parallel across the cluster. One of the benefits of Hadoop is that it handles failure at the software level rather than relying on specialized, highly available hardware. The following figure illustrates the overall architecture of the Hadoop Ecosystem and where the different frameworks sit within it:

Figure-3.0.1
Within the Hadoop Ecosystem, Apache Hadoop itself provides the file system layer, the cluster management layer, and the processing layer. The ecosystem is deliberately open, so other projects and frameworks can work alongside Hadoop and supply their own implementations of any of these layers. We will discuss other Apache projects that fit into the remaining layers later.
Apache Hadoop comprises four main modules: the Hadoop Distributed File System (HDFS, the file system layer), Hadoop MapReduce (the processing layer, which also handled cluster management in early Hadoop releases), Yet Another Resource Negotiator (YARN, the cluster management layer), and Hadoop Common. We will discuss each of these in turn as we proceed through this chapter.
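
To make the division of responsibilities a little more concrete, here is a minimal Java sketch of a client talking to the file system layer. It is only an illustration: the NameNode address hdfs://namenode.example.com:8020 and the path /tmp/hello.txt are placeholders, and the sketch assumes the Hadoop client libraries are on the classpath. Hadoop Common supplies the Configuration class, while HDFS sits behind the FileSystem abstraction.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Hadoop Common provides Configuration; the NameNode address below
        // is a placeholder for your own cluster's fs.defaultFS setting.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // FileSystem is the abstraction over the file system layer (HDFS here).
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");

            // Write a small file; HDFS splits it into blocks and replicates
            // them across DataNodes according to the cluster's settings.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello from the Hadoop client API\n"
                        .getBytes(StandardCharsets.UTF_8));
            }

            // Confirm the file exists and report its size.
            System.out.println("Wrote " + fs.getFileStatus(path).getLen()
                    + " bytes to " + path);
        }
    }
}

The same FileSystem abstraction also fronts the local file system and other stores, which is part of what allows other projects in the ecosystem to plug their own implementations into the file system layer.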