
Apache Spark

Apache Spark is a well-known example of a general-purpose distributed processing engine capable of handling petabytes (PB) of data. Because it is a general-purpose engine, it is suited to a wide variety of use cases at scale, including the engineering and execution of Extract-Transform-Load (ETL) pipelines using its Spark SQL library, interactive analytics, stream processing using its Spark Streaming library, graph-based processing using its GraphX library, and machine learning using its MLlib library. We will be employing Apache Spark's machine learning library in later chapters. For now, however, it is important to get an overview of how Apache Spark works under the hood.
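To illustrate this general-purpose design, the following minimal PySpark sketch (the application name, sample data, and temporary view name are purely illustrative) shows a single SparkSession driving both a Spark SQL query and an MLlib transformation from the same application:

# A minimal sketch: one SparkSession exposes both Spark SQL and MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler  # MLlib (DataFrame-based API)

spark = SparkSession.builder.appName("general-purpose-demo").getOrCreate()

# Spark SQL: register a small DataFrame as a temporary view and query it with SQL.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("samples")
spark.sql("SELECT COUNT(*) AS n FROM samples").show()

# MLlib: the same session can feed data straight into machine learning pipelines.
features = VectorAssembler(inputCols=["id"], outputCol="features").transform(df)
features.show()

spark.stop()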

Apache Spark software services run in Java Virtual Machines (JVMs), but that does not mean Spark applications must be written in Java. In fact, Spark exposes its API and programming model to a variety of languages, including Java, Scala, Python, and R, any of which may be used to write a Spark application. Logically, Spark employs a master/worker architecture, as illustrated in Figure 1.9:

Figure 1.9: Apache Spark logical architecture

Every application written in Apache Spark consists of a Driver Program. The driver program is responsible for splitting a Spark application into tasks, which are then distributed across the Worker nodes in the cluster and scheduled by the driver for execution. The driver program also instantiates a SparkContext, which tells the application how to connect to the Spark cluster and its underlying services.
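To make this concrete, here is a minimal sketch of a driver program written in Python (the application name and master URL are placeholders, not prescribed values): it builds a SparkConf describing how to reach the cluster and passes it to a SparkContext.

# A minimal driver program sketch.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("my-driver-program")   # how the application appears in the Spark UI
        .setMaster("local[*]"))            # where to find the cluster; local[*] runs the
                                           # driver and executors together in one local JVM

sc = SparkContext(conf=conf)               # the SparkContext connects to the cluster
print(sc.version)
sc.stop()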

The worker nodes, also known as slaves, are where the computational processing physically takes place. Typically, Spark worker nodes are co-located with the nodes on which the underlying data is persisted, in order to improve performance. Worker nodes spawn processes called Executors, and it is these executors that are responsible for executing the computational tasks and for storing any locally cached data. Executors communicate with the driver program in order to receive scheduled functions, such as map and reduce functions, which they then execute. The Cluster Manager is responsible for scheduling and allocating resources across the cluster, and must therefore be able to communicate with every worker node as well as with the driver. The driver program requests executors from the cluster manager, which is aware of the resources available, so that it can schedule tasks.
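The following sketch (using arbitrary sample numbers) illustrates this flow from the Python API: the lambda functions passed to map() and reduce() are defined in the driver program, shipped to the executors, and applied there to the data partitions each executor holds.

# Functions defined in the driver are serialised and executed on the executors.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(1, 1001), numSlices=4)    # 4 partitions spread across executors
squares_sum = (rdd.map(lambda x: x * x)              # map runs on the executors, per partition
                  .reduce(lambda a, b: a + b))       # partial results are combined on the driver
print(squares_sum)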

Apache Spark is bundled with its own simple cluster manager which, when used, is referred to as Spark Standalone mode. Spark applications deployed to a standalone cluster will, by default, utilize all nodes in the cluster and are scheduled in a First-In-First-Out (FIFO) manner. Apache Spark also supports other cluster managers, including Apache Mesos and Apache Hadoop YARN, both of which are beyond the scope of this book.
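For illustration only, the following snippet (the host name is a placeholder, and it assumes a standalone master is already running) shows that targeting a standalone cluster is simply a matter of the master URL passed to SparkConf:

# Pointing an application at a Spark Standalone cluster via the master URL.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("standalone-example")
        .setMaster("spark://spark-master.example.com:7077"))  # placeholder host; 7077 is the
                                                              # default standalone master port

# For local development, the same application could instead use:
# conf = SparkConf().setAppName("standalone-example").setMaster("local[*]")

sc = SparkContext(conf=conf)
# ... application logic ...
sc.stop()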