Frank Kane's Taming Big Data with Apache Spark and Python

上QQ阅读APP看书，第一时间看更新

Map example

Here's an example of a map operation, just to make these explanations a little more real. Let's say we have an RDD that's just, you know, hard-coded to contain the values 1, 2, 3, and 4:

rdd=sc.parallelize([1,2,3,4])

In Python, I could just say rdd.map and pass this lambda function of the x points to x times x:

rdd.map(lambda x:x*x)

This one line will go ahead and square every value of that original RDD and then will return a new RDD that contains the values 1, 4, 9, and 16, which are the squares of 1, 2, 3, and 4.

What was that lambda thing? Well, if you're new to Python, this might be a new concept for you. Basically, a lot of Spark is centered around this concept called functional programming; it sounds fancy, but it's really not that complicated. The only thing is, many of the methods in RDDs actually accept a function as a parameter, as opposed to a more typical parameter, such as a float or string or something like that. Instead, you're passing in some function, some operation that you want to perform on every value of the RDD. Lambda is just shorthand really, so it's exactly the same thing as passing in a function directly. So just like I could say rdd.map, lambda x gets transformed into x times x:

rdd.map(lambda x: x*x)

I can alternatively just define a function called squareIt, or whatever I want to call it. This takes x as a parameter and returns the square of that argument:

def squareIt (x) : 
      return x*x

Then, I can say rdd.map and then the name of that function:

rdd.map(squareIt)

It does exactly the same thing as this line:

rdd.map(lambda x: x*x)

The lambda shorthand is just a little bit more compact. For simple functions that you want to use for your transformations on RDDs, it can save you a little bit of typing and give you a little bit more clarity. If you want to do something more complicated while you're mapping one RDD into another, you'll probably want to write a separate function for it and that's an option too. The basic concept is that you're passing in functions to your RDD methods instead of more traditional project parameters. That's all there is to that, I mean that's all there is to functional programming really, it wasn't that complicated, right? The more examples we look at, the more sense it makes.