Frank Kane's Taming Big Data with Apache Spark and Python

Extract (MAP) the data we care about

This image demonstrates what our mapper function does:

The lambda syntax gives us a shorthand for defining the function that we want to pass into our map. So you can see, we have lines.map, and inside the parentheses we have lambda x, where each line x gets passed into split, and we extract field number 2:

ratings = lines.map(lambda x: x.split()[2]) 

What this code is going to do, for every line of input, is take the line and split it into individual fields based on whitespace. Running split on the first line, for example, results in the list of values 196 242 3 881250949. These numbers represent the user ID, 196, the movie ID, 242, the rating value, 3, and a timestamp, 881250949. So, the way to interpret the u.data file is: user ID 196 watched movie 242, gave it a rating of three out of five, and did this at that particular timestamp. The timestamp can be translated into an actual human-readable time; it's in epoch seconds, if you're curious. In computer programming, we start counting from 0, so field number 2 is actually the third field: the rating itself, 3. So what's this map function actually doing? Again, it's splitting up each line into its individual fields based on whitespace, and then it's taking field number 2, which is the actual rating value. For every line of data, it pulls out the rating value and puts it into a new RDD that we're calling ratings.
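If you want to see exactly what the lambda does to a single line, you can try the same split-and-index operation in plain Python, using the sample line shown above:

```python
# One raw line from the u.data file: user ID, movie ID, rating, timestamp
line = "196 242 3 881250949"

# split() with no arguments breaks the string on any run of whitespace
fields = line.split()
print(fields)   # ['196', '242', '3', '881250949']

# Field number 2 (zero-indexed) is the rating
rating = fields[2]
print(rating)   # '3'
```

Note that split returns strings, so the rating comes back as the string '3' rather than the number 3.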

Make sure you understand what's going on here; this is a very fundamental part of understanding Spark. If you need to stare at this some more, do what you have to do, because it's very important that you get this concept. If you don't, you're going to have a hard time going forward. Understand this little one line of code:

ratings = lines.map(lambda x: x.split()[2]) 

This is going to take every individual input line from our lines RDD, which contains the raw input data, split it up into its fields (in this case, a user ID, a movie ID, a rating, and a timestamp), extract the rating, which is field number 2, and put that into a new RDD called ratings. So we start with this:

After this map operation, the ratings RDD gets populated with these values:

Our new RDD, called ratings, will contain 3, 3, 1, 2, and 1, because these are the rating values extracted from the source data. It's also very important to remember that map, like all RDD transformations, doesn't modify the RDD in place. So your lines RDD remains untouched. What map does is create a new RDD, and you need to remember to assign the result of that transformation to a new RDD, or the result just goes nowhere. This is a very common mistake; you can't just call lines.map and expect it to change everything in your lines RDD. You need to assign the result to a new RDD; in this case, we've called it ratings. So, if you understand all of that, great, let's move forward. If not, again, take your time and do what you have to do to understand how we extract the ratings field in this instance; it's a very important concept.
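You can see this "transformations return a new collection, they don't mutate the old one" behavior with plain Python's built-in map as well; this is only a simplified stand-in for the RDD version, and the sample lines below are illustrative u.data-style rows, not necessarily the exact ones pictured above:

```python
# A stand-in for the lines RDD: a few raw u.data-style lines
# (user ID, movie ID, rating, timestamp)
lines = [
    "196 242 3 881250949",
    "186 302 3 891717742",
    "22 377 1 878887116",
]

# Like RDD.map, Python's map returns a NEW sequence; lines is untouched
ratings = list(map(lambda x: x.split()[2], lines))

print(ratings)   # ['3', '3', '1']
print(lines[0])  # still the original raw line: '196 242 3 881250949'
```

If we had just called map on lines without assigning the result to ratings, the extracted values would have been thrown away, which is exactly the mistake described above.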