Frank Kane's Taming Big Data with Apache Spark and Python

Understanding the code

The first couple of lines are just boilerplate stuff. One thing you'll see in every Python Spark script is an import statement that brings in SparkConf and SparkContext from the pyspark library that Spark includes. You will, at a minimum, need those two objects:

from pyspark import SparkConf, SparkContext 
import collections 

SparkContext, as we talked about earlier, is the fundamental starting point that the Spark framework gives you to create RDDs from. You can't create a SparkContext without a SparkConf, which allows you to configure the SparkContext and tell it things such as, "do you want to run on just one computer, or on a cluster, and if so, in what way?" The other bit of housekeeping at the beginning of our script is importing the collections module from Python. That's just because we want to sort the final results when we're done; it's standard Python stuff.
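
To make that concrete, here's a minimal sketch of how those two objects typically get wired together, and where collections comes in at the end. The master setting "local" means "run on this one computer"; the app name "MyApp" and the results dictionary are hypothetical placeholders standing in for whatever the real script uses and computes:

from pyspark import SparkConf, SparkContext
import collections

# Configure how Spark runs: "local" means run on this one computer,
# rather than on a cluster. The app name is a hypothetical label
# that shows up in the Spark UI.
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)

# ... create RDDs with sc and compute something here ...

# Hypothetical results standing in for whatever the script produces;
# collections.OrderedDict keeps the keys in sorted order for printing.
results = {3: 300, 1: 100, 2: 200}
sortedResults = collections.OrderedDict(sorted(results.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))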