Frank Kane's Taming Big Data with Apache Spark and Python

Transforming RDDs

Once you have an RDD, one of the most common things you'll do is transform it in some way. The following list shows the basic transformations you can apply to an RDD. It isn't a complete list, but it covers the most common ones:

map
flatMap
filter
distinct
sample
union, intersection, subtract, cartesian

There's not a whole lot to wrap your head around here. Although there aren't many different operations for transforming an RDD, they're all very powerful. Let's start with the map function. It lets you take a set of data and transform it into some other set of data, given a function that is applied to each element of the RDD. For example, if I want to square all the numbers in an RDD, I might call map with a function that multiplies each number by itself. The map function has a one-to-one relationship: every entry in your original RDD gets mapped to exactly one new value in your new RDD, so the new RDD will have just as many entries as the original.
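To make the one-to-one behaviour of map concrete, here is a minimal sketch. It is an assumption of this example that a plain Python list stands in for the RDD, so the snippet runs without a Spark installation; the names `numbers` and `square` are made up for illustration.

```python
# Plain-Python model of map. With a real RDD (hypothetically named `rdd`),
# the equivalent call would be rdd.map(square).

def square(x):
    """Multiply a single number by itself."""
    return x * x

numbers = [1, 2, 3, 4]                   # stands in for the original RDD
squared = [square(x) for x in numbers]   # exactly one output per input

print(squared)                           # [1, 4, 9, 16]
print(len(squared) == len(numbers))      # True: same number of entries
```

The function only ever sees one element at a time, which is why the output always has the same number of entries as the input.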

The flatMap function is very similar to map, except that it can produce multiple values for every input value from the original RDD. So the RDD you get from flatMap may be larger, or even smaller, than the RDD you started with. Fundamentally, though, it still transforms one RDD into another using some function; it just has the ability to blow each original entry out into multiple results, or even no results at all.
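A small sketch of that "zero or more results per entry" behaviour, again assuming a plain Python list in place of an RDD. The sample lines and the `split_words` helper are hypothetical.

```python
from itertools import chain

# Plain-Python model of flatMap. With a real RDD (hypothetically named `rdd`),
# the equivalent call would be rdd.flatMap(split_words).

def split_words(line):
    """Return a list of words; an empty line yields an empty list."""
    return line.split()

lines = ["the quick red fox", "", "jumped over"]   # stands in for the RDD

# Apply the function to every line, then flatten the per-line lists
# into one sequence of words.
words = list(chain.from_iterable(split_words(line) for line in lines))

print(words)                     # ['the', 'quick', 'red', 'fox', 'jumped', 'over']
print(len(lines), len(words))    # 3 6
```

Here three input entries become six output entries, and the empty line contributes nothing at all.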

The filter function can be used to trim out information you don't need. Let's say, for example, you have an RDD filled with weblog data and you want to filter out everything but the error lines in that weblog. You could pass filter a function that looks for the word error in a line of text and returns true or false; any line for which it returns false gets thrown away. That's what a filter would do in an RDD.
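The weblog example might look roughly like this. As an assumption of this sketch, a plain Python list models the RDD, and the log lines and the `is_error` helper are invented for illustration.

```python
# Plain-Python model of filter. With a real RDD (hypothetically named `rdd`),
# the equivalent call would be rdd.filter(is_error).

def is_error(line):
    """Keep a line only if it contains the word 'error'."""
    return "error" in line

log_lines = [                      # made-up weblog data
    "INFO service started",
    "error: disk full",
    "WARN retrying request",
    "error: timeout",
]

errors = [line for line in log_lines if is_error(line)]

print(errors)   # ['error: disk full', 'error: timeout']
```

Note that this check is case-sensitive; a line containing only "ERROR" would be dropped unless the function lowercases the line first.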

There are some other, less common operations. You would use distinct if you just want the unique values in an RDD and want to throw out all the duplicates. You can call sample if you want to take a random sample from the RDD and get a smaller dataset to work and experiment with. The sample operation can be very useful while you're testing a script on a large dataset and you just want to run it locally to work out the bugs.

Finally, you can combine two RDDs in set-like ways. These methods take two different RDDs as input and output a single RDD: union, intersection, and subtract, which removes one RDD's values from another. There is also cartesian, which gives you the Cartesian product: every possible pairing of an element from one RDD with an element from the other. Obviously, that blows up really quickly, because the result has as many entries as the sizes of the two RDDs multiplied together. So again, there are not a lot of different operations, but they are all very powerful, and they let you supply your own functions for transforming one dataset into another using RDDs.
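The remaining operations can be sketched the same way. This is a plain-Python model under stated assumptions, not Spark itself: lists stand in for RDDs, all variable names are hypothetical, and results are sorted only to make the printed output predictable (a real RDD does not promise any particular order). The comments name the PySpark method each line is meant to mirror.

```python
import random
from itertools import product

values = [3, 1, 3, 2, 1]

# Models rdd.distinct(): drop duplicates.
unique = sorted(set(values))
print(unique)                                  # [1, 2, 3]

# Models rdd.sample(False, 0.3, seed): without replacement, each element
# is kept independently with probability 0.3, so the size of the sample
# is only approximately 30% of the input.
data = list(range(100))
rng = random.Random(42)                        # seeded for repeatability
sampled = [x for x in data if rng.random() < 0.3]
print(len(sampled) <= len(data))               # True

a = [1, 2, 3, 4]
b = [3, 4, 5]

# Models a.union(b): everything from both, duplicates kept.
union = a + b
print(union)                                   # [1, 2, 3, 4, 3, 4, 5]

# Models a.intersection(b): values present in both.
intersection = sorted(set(a) & set(b))
print(intersection)                            # [3, 4]

# Models a.subtract(b): values of a that do not appear in b.
b_values = set(b)
subtract = [x for x in a if x not in b_values]
print(subtract)                                # [1, 2]

# Models a.cartesian(b): every (x, y) pairing across the two inputs.
cartesian = list(product(a, b))
print(len(cartesian))                          # 12, that is 4 * 3
```

Even with inputs this small, the Cartesian product has 12 entries, which shows why it gets expensive quickly on large datasets.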