
Understanding the code
The first couple of lines are just boilerplate stuff. One thing you'll see in every Python Spark script is the import statement to import SparkConf and SparkContext from the pyspark library that Spark includes. You will, at a minimum, need those two objects:
from pyspark import SparkConf, SparkContext
import collections
SparkContext, as we talked about earlier, is the fundamental starting point that the Spark framework gives you to create RDDs from. You can't create a SparkContext without a SparkConf, which allows you to configure the SparkContext and tell it things such as, "do you want to run on just one computer, or do you want to run on a cluster, and if so, in what way?" The other bit of housekeeping at the beginning of our script is importing the collections package from Python. This is just because we want to sort the final results when we're done; that's just standard Python stuff.
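To make that concrete, here is a minimal sketch of how a SparkConf and SparkContext are typically wired together. The master setting of "local" and the application name "MyApp" are illustrative assumptions here, not values taken from the script in this section:

from pyspark import SparkConf, SparkContext

# Configure Spark to run locally on a single machine; "MyApp" is just an
# illustrative application name chosen for this sketch.
conf = SparkConf().setMaster("local").setAppName("MyApp")

# The SparkContext is built from that configuration and is the entry
# point you use to create RDDs.
sc = SparkContext(conf=conf)

Setting the master to "local" tells Spark to run on a single machine using one thread; pointing it at a cluster manager instead is what lets the same script scale out without changing the rest of the code.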