RDD actions
In addition to transforming one RDD into another, you can also perform actions on an RDD. So once you have the data that you want in an RDD, you can then perform an action on it to actually get a result. This is a list of RDD actions:
- collect
- count
- countByValue
- take
- top
- reduce
- and more...
So you can call collect to just dump out all the values that are in the RDD right now and print them, or do whatever you want with them. You can call count to get a count of all the values that are in it. Call countByValue, which will actually give you a breakdown by unique value of how many times each value occurs in your RDD. There are actions such as take and top that let you pull back just a few values from the final RDD: take returns the first few elements, and top returns the largest few. More powerful is the reduce function, which lets you supply a function that combines two values at a time; Spark applies it across all the values in the RDD and boils things down into a single summation or aggregation. (Combining the values for each key is the job of reduceByKey, which is a transformation on key/value RDDs rather than an action.) This will make more sense when you look at more examples. There are more actions available as well, but these are the more common ones.
Another thing to understand regarding RDDs is that nothing actually happens until you call an action. We talked earlier about how Spark is so fast because it constructs a directed acyclic graph as soon as you ask for an action to happen; at that point, it knows what actually needs to be done to get the results that you want, and it can compute the optimal path to make that happen. So it's important to remember, when you're writing your Spark driver scripts, that your script isn't actually going to do anything until you call one of these action methods. At that point, it will start farming things out to your cluster, or run it on your own computer, whatever you've told it to do, and actually start executing that program. You won't have any results or any processing whatsoever until one of these action methods is called within your script, so keep that in mind.
That's what an RDD is: it's sort of the foundation of Spark. Now that you understand how to use RDDs and what they're for, let's go back to that ratings histogram example and figure out what it's actually doing under the hood.