
The execution flow of MapReduce
In a traditional system, when a client program executes, it requests data, which is then brought to the client's machine for further processing. In Hadoop, the program is sent to the data instead: if the data resides on multiple data nodes, a copy of the program is sent to each of those nodes so that they can process their local blocks in parallel. This is where the MapReduce framework comes in, as it defines how a program executes, what the flow of data is, and how the final result is produced.
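To make the map and reduce phases concrete, here is a minimal sketch written against the standard Hadoop Java MapReduce API. The word-count logic and the class names (TokenizerMapper, IntSumReducer) are illustrative choices, not something prescribed by the discussion above:

// File: TokenizerMapper.java (illustrative word-count example)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs on each data node, close to the block it processes.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);  // emit (word, 1) for each token
        }
    }
}

// Reducer: aggregates the intermediate (word, 1) pairs into totals.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);  // emit (word, total count)
    }
}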
The JobTracker assigns tasks to the TaskTrackers before processing starts on the data nodes. Suppose we have a file named log.txt that is 200 MB in size. With the default configuration (a 64 MB block size), it will be split into four data blocks: three of 64 MB each and a fourth of 8 MB. The following figure illustrates how this file is stored across different data nodes:
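The arithmetic behind that split can be checked with a few lines of plain Java; the numbers below are the ones from the example (a 200 MB file and the 64 MB default block size):

// A tiny sketch of the block-split arithmetic described above.
public class BlockSplitDemo {
    public static void main(String[] args) {
        long fileSizeMb = 200;   // size of log.txt
        long blockSizeMb = 64;   // default HDFS block size
        long fullBlocks = fileSizeMb / blockSizeMb;   // 3 full blocks of 64 MB
        long remainderMb = fileSizeMb % blockSizeMb;  // 8 MB left over
        System.out.printf("%d full blocks of %d MB%n", fullBlocks, blockSizeMb);
        if (remainderMb > 0) {
            System.out.printf("1 final block of %d MB%n", remainderMb);
        }
    }
}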

The input and output of a MapReduce program are read from and written to the file system (HDFS).
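A typical driver class wires these file-system paths into the job configuration. This sketch reuses the illustrative mapper and reducer classes from the earlier example, and the input and output paths are placeholders, not paths from the text:

// File: WordCountDriver.java (illustrative driver for the sketch above)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The input is read from the file system, and the final
        // output is written back to it (placeholder paths).
        FileInputFormat.addInputPath(job, new Path("/input/log.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}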