Machine Learning with Apache Spark Quick Start Guide

上QQ阅读APP看书，第一时间看更新

Distributed filesystems

Think of the hard drive inside your desktop, laptop, smartphone, or other personal device you own. Files are written to and stored on local hard drives and retrieved as and when you need them. Your local operating system manages read and write requests to your local hard drive by maintaining a local filesystem—a means by which the operating system keeps track of how the disk is organized and where files are located.

As your personal data footprint grows, you take up more and more space on your local hard drive until it reaches its capacity. At this time, you may seek to purchase a larger capacity hard drive to replace the one inside your device, or you may seek to purchase an extra hard drive to complement your existing one. In the case of the latter, you personally manage which of your personal files reside on which hard drive, or perhaps use one of them to archive files you rarely use to free up space on your primary drive. Hopefully, you also maintain backups of your personal files should the worse happen and your device or primary hard drive malfunctions!

A distributed filesystem (DFS) extends the notion of local filesystems, while offering a number of useful benefits. In a distributed filesystem within the context of our big data ecosystem, data is physically split across the nodes and disks in a cluster. Like distributed data stores in general, a distributed filesystem provides a layer of abstraction and manages read and write requests across the cluster itself, meaning that the physical split is invisible to requesting client applications which view the distributed filesystem as one logical entity just like a conventional local filesystem.

Furthermore, distributed filesystems provide useful benefits out of the box, including the following:

Data replication, where data can be configured to be automatically replicated across the cluster for fault tolerance in the event one or more of the nodes or disks should fail
Data integrity checking
The ability to persist huge files, typically gigabytes (GB) to terabytes (TB) in size, which would not normally be possible on conventional local filesystems

The HDFS is a well-known example of a distributed filesystem within the context of our big data ecosystem. In the HDFS, a master/slave architecture is employed, consisting of a single NameNode that manages the distributed filesystem, and multiple DataNodes, which typically reside on each node in the cluster and manage the physical disks attached to that node as well as where the data is physically persisted to. Just as with traditional filesystems, HDFS supports standard filesystem operations, such as opening and closing files and directories. When a client application requests a file to be written to the HDFS, it is split into one or more blocks that are then mapped by the NameNode to the DataNodes, where they are physically persisted. When a client application requests a file to be read from the HDFS, the DataNodes fulfill this request.

One of the core benefits of HDFS is that it provides fault tolerance inherently through its distributed architecture, as well as through data replication. Since typically there will be multiple nodes (potentially thousands) in an HDFS cluster, it is resilient to hardware failure as operations can be automatically offloaded to the healthy parts of the cluster while the non-functional hardware is being recovered or replaced. Furthermore, when a file is split into blocks and mapped by the NameNode to the DataNodes, these blocks can be configured to automatically replicate across the DataNodes, taking into account the topology of the HDFS cluster.

Therefore, if a failure did occur, for example, a disk failure on one of the DataNodes, data would still be available to client applications. The high-level architecture of an HDFS cluster is illustrated in Figure 1.3:

Figure 1.3: Apache Hadoop distributed filesystem high-level architecture

To learn more about the Apache Hadoop framework, please visit http://hadoop.apache.org/.