
Spark is scalable

Spark scales data analysis problems by running on top of a cluster manager. Your actual Spark scripts are just everyday scripts written in Python, Java, or Scala, and they behave like any other script. The script you write is called the "driver program", and it runs on your desktop or on one master node of your cluster. Under the hood, though, when you run it, Spark knows how to take the work and farm it out to different computers on your cluster, or even to different CPUs on the same machine.

Spark can run on top of different cluster managers. It has its own built-in cluster manager that you can use by default, but if you have access to a Hadoop cluster, there's a component called YARN that Spark can also run on top of to distribute work across a huge Hadoop cluster, if you have one available. For example, you can use Amazon's Elastic MapReduce service to get cheap and easy access to a Hadoop cluster, which we'll do later on in this course. As illustrated in the following diagram, the cluster manager splits up the work and coordinates it among various executors. Spark creates multiple executors per machine; ideally, you want one per CPU core. Using the cluster manager, together with your driver program itself, Spark farms out work, distributes it to different nodes, and gives you fault tolerance: if one of your executors goes down, it can recover without stopping your entire job and making you start it all over again.
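To make that concrete, here is a minimal sketch of what a PySpark driver program looks like. The app name, the input path, and the word-count logic are illustrative assumptions rather than anything specific to this chapter; the only thing tying the script to a particular cluster manager is the master setting passed to SparkConf.

```python
# A minimal sketch of a Spark driver program in Python (PySpark).
# The script itself is ordinary Python; the master URL tells Spark
# where the executors should run.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("WordCountExample")   # illustrative app name
        .setMaster("local[*]"))           # run locally, one worker thread per CPU core

sc = SparkContext(conf=conf)

# The driver only describes the work; Spark farms the actual
# computation out to its executors.
lines = sc.textFile("file:///tmp/sample.txt")   # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
sc.stop()
```

The same code works whether the executors are threads on your laptop or processes spread across a cluster; only the master setting (or the `--master` flag you pass to `spark-submit`) changes.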

The beauty of it is that it scales out to an entire cluster of computers, giving you horizontal partitioning and horizontal scalability; basically, the sky is the limit. However, from a user's and a developer's standpoint, it's all just one simple little program running on one computer, and it feels a lot like writing any other script. This is a really nice aspect of Spark.
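As a rough illustration of that point, the only line that typically has to change when you move from your desktop to a real cluster is the master setting. The host name below is hypothetical, the exact YARN master string depends on your Spark version, and on a managed Hadoop cluster you would often leave the master out of the script entirely and supply it through spark-submit instead.

```python
from pyspark import SparkConf

# The same driver code can target different cluster managers just by
# swapping the master URL (all of these are standard Spark master strings):
local_conf      = SparkConf().setAppName("MyJob").setMaster("local[*]")                  # every core on this one machine
standalone_conf = SparkConf().setAppName("MyJob").setMaster("spark://master-host:7077")  # Spark's built-in cluster manager (hypothetical host)
yarn_conf       = SparkConf().setAppName("MyJob").setMaster("yarn")                      # a Hadoop/YARN cluster, e.g. on Elastic MapReduce
```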