Frank Kane's Taming Big Data with Apache Spark and Python

Loading the data

Now we're going to load our data file. As you may remember from Chapter 1, Getting Started With Spark, a very common way of creating an RDD is the sc.textFile method. The line of code shown here goes out to our local file system, finds the ml-100k ratings dataset from MovieLens, and loads the data file that contains all of the movie ratings:

lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")

If you were to open the u.data file in a text editor, it would look something like the following, only with a hundred thousand lines:

textFile breaks up that input file line by line, so that every line of text corresponds to one value in your RDD. The first value of the lines RDD is going to be this entire line of text:

The second value will be this line of text:

The third value will be this line of text and so on and so forth.

So if this were my entire u.data file, my RDD would consist of five values where each value is a string that represents a line of text:

Later on, we'll actually break that up and look at what that string means.
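
To make the line-per-value idea concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a five-row sample in the tab-separated layout that MovieLens documents for u.data (user ID, movie ID, rating, timestamp); the sample values, the temporary file, and the helper name read_lines_like_textfile are illustrative assumptions, not part of the course code. It only mimics what the values of the lines RDD would look like for a tiny file; a real RDD is distributed and evaluated lazily.

```python
import os
import tempfile

# Illustrative rows in the tab-separated layout of u.data
# (user ID, movie ID, rating, timestamp). Assumed sample values.
SAMPLE_ROWS = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
]


def read_lines_like_textfile(path):
    """Hypothetical stand-in for the line-per-value behavior of sc.textFile."""
    with open(path, encoding="utf-8") as handle:
        # One string per line of text, without the trailing newline.
        return [line.rstrip("\n") for line in handle]


# Write the sample rows to a temporary file standing in for u.data.
with tempfile.NamedTemporaryFile(
    "w", suffix=".data", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(SAMPLE_ROWS) + "\n")
    sample_path = tmp.name

lines = read_lines_like_textfile(sample_path)
os.remove(sample_path)

print(len(lines))      # 5 values, one per line of the sample file
print(repr(lines[0]))  # the first value is the whole first line, as one string
```

Each element is still an unparsed string at this point, which is why the fields have to be split out in a later step.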