Big Data Architect’s Handbook

Shuffle and Sort

Between the Map and Reduce steps, there is a step called Shuffle. It is the process of transferring the results of the Mapper phase and making them available to the Reducer function. All Mappers execute independently on different data nodes and produce their results as key-value pairs. The shuffle step does not wait for all Mappers to finish their jobs; it starts processing results as they become available. As soon as any Mapper completes its job, Shuffle begins transferring that Mapper's data.

The key-value pairs of the intermediate result are sorted before being passed to the Reducer. This sort is performed on keys only, not on values; the values are passed to the Reducer unsorted. The following figure illustrates this process:
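The shuffle-and-sort step can be sketched in miniature outside Hadoop. The following is a simplified Python simulation, not Hadoop's actual implementation: the `mapper`, `shuffle_and_sort`, and `reducer` functions are hypothetical names chosen for illustration, and a word-count job is assumed as the example workload. It shows the two behaviors described above: intermediate values are grouped and sorted by key only, and each key's list of values reaches the reducer unsorted.

```python
from collections import defaultdict

def mapper(document):
    # Each Mapper emits intermediate key-value pairs; here, (word, 1).
    for word in document.split():
        yield (word, 1)

def shuffle_and_sort(mapper_outputs):
    # Shuffle: group intermediate values by key across all Mapper outputs.
    groups = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            groups[key].append(value)
    # Sort: order the groups by key only; the value lists themselves
    # keep their arrival order and are not sorted.
    return sorted(groups.items())

def reducer(key, values):
    # The Reducer receives one key with all of its values.
    return (key, sum(values))

docs = ["big data big ideas", "data pipelines"]
intermediate = [mapper(d) for d in docs]
result = [reducer(k, vs) for k, vs in shuffle_and_sort(intermediate)]
# result: [('big', 2), ('data', 2), ('ideas', 1), ('pipelines', 1)]
```

In a real Hadoop job, this grouping happens across the network: map outputs are partitioned, fetched by the Reducers, and merge-sorted by key before the reduce function is invoked.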