Frank Kane's Taming Big Data with Apache Spark and Python

Running Spark code

Let's go ahead and start up Enthought Canopy. Once you get to the Welcome screen, go to the Tools menu and then to Canopy Command Prompt. This will give you a little Command Prompt you can use; it has all the right permissions and environment variables you need to actually run Python.

So type in cd c:\spark, which is where we installed Spark in our previous steps.

Let's make sure that Spark is actually in there by typing dir and hitting Enter. You should see all the contents of the pre-built Spark distribution.

Now, depending on the distribution that you downloaded, there might be a README.md file or a CHANGES.txt file, so pick one or the other; whatever you see there, that's what we're going to use.

We will set up a simple little Spark program here that just counts the number of lines in that file, so let's type in pyspark to kick off the Python version of the Spark interpreter.

If everything is set up properly, you should see Spark's startup banner followed by the Python >>> prompt.

If you're not seeing this and you're seeing some weird Windows error about not being able to find pyspark, go back and double-check all those environment variables. The odds are that there's something wrong with your path or with your SPARK_HOME environment variable. Sometimes you need to log out of Windows and log back in for environment variable changes to be picked up by the system; so, if all else fails, try this. Also, if you got cute and installed things to a different path than I recommended in the setup sections, make sure that your environment variables reflect those changes. If you put it in a folder that has spaces in the name, that can cause problems as well. You might run into trouble if your path is too long or if you have too much stuff in your path, so have a look at that if you're encountering problems at this stage. Another possibility is that you're running on a managed PC that doesn't actually allow you to change environment variables, so you might have thought you did it, but there might be some administrative policy preventing you from doing so. If so, try running the setup steps again under a new account that's an administrator, if possible. However, assuming you've gotten this far, let's have some fun.
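If you'd rather not eyeball those environment variables, here is a minimal sketch of a checker you could run with plain Python from the same Command Prompt. The function name check_spark_env is made up for illustration, and it assumes your setup added the bin folder under SPARK_HOME to your path; if your setup differs, adjust the last check accordingly. It only covers a few of the problems described above (a missing SPARK_HOME, spaces in the folder name, and a path that doesn't include Spark), not administrative policies or overly long paths.

```python
import os


def check_spark_env(env=None, pathsep=os.pathsep):
    """Return a list of likely problems with the Spark environment variables.

    env is any dict-like mapping of variable names to values; it defaults to
    the real environment. This is a hypothetical helper, not part of Spark.
    """
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems

    if " " in spark_home:
        problems.append("SPARK_HOME contains spaces: " + spark_home)

    def norm(p):
        # Compare paths without caring about case, quotes, slash direction,
        # or a trailing backslash
        return p.strip().strip('"').replace("/", "\\").rstrip("\\").lower()

    # Assumption: the setup steps put SPARK_HOME\bin on the path
    expected_bin = norm(spark_home) + "\\bin"
    entries = [norm(p) for p in env.get("PATH", "").split(pathsep) if p.strip()]
    if expected_bin not in entries:
        problems.append("PATH does not include " + spark_home + "\\bin")

    return problems


if __name__ == "__main__":
    found = check_spark_env()
    for problem in found:
        print("Problem: " + problem)
    if not found:
        print("Environment variables look reasonable")
```

If this reports nothing wrong and pyspark still won't start, the cause is probably one of the other issues mentioned above, such as needing to log out and back in.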

Let's write some Spark code, shall we? We should get some payoff for all this work that we have done, so follow along with me here. I'm going to type in rdd = sc.textFile("README.md"), with a capital F in textFile – case does matter. Again, if your version of Spark has a CHANGES.txt instead, just use CHANGES.txt there.

Make sure you get that exactly right; remember those are parentheses, not brackets. What this is doing is creating something called a Resilient Distributed Dataset (RDD), which is built from the lines of input text in that README.md file, one element per line. We're going to talk about RDDs a lot more shortly. Spark can actually distribute the processing of this object across an entire cluster. Now let's just find out how many lines are in it, that is, how many lines we imported into that RDD. So type in rdd.count(), and we'll get our answer. It actually ran a full-blown Spark job just for that. The answer is 104 lines in that file.
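For reference, here is roughly what those two lines look like as a standalone script rather than something typed into the pyspark prompt. Treat it as a sketch under a few assumptions: the helper names (pick_sample_file, count_lines_spark, count_lines_plain) are made up for illustration, the script creates its own SparkContext in local mode because the sc object only exists for free inside the pyspark shell, and the plain-Python count is just there as a sanity check that should normally agree with Spark's answer.

```python
import io
import os


def pick_sample_file(spark_dir):
    """Return the path of README.md or CHANGES.txt in spark_dir, or None."""
    for name in ("README.md", "CHANGES.txt"):
        candidate = os.path.join(spark_dir, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def count_lines_spark(path):
    """Count lines with Spark, the same way the interactive session does."""
    # Imported here so the other helpers work even if pyspark isn't importable
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("LineCount")
    sc = SparkContext(conf=conf)
    try:
        rdd = sc.textFile(path)  # one RDD element per line of text
        return rdd.count()       # this triggers the actual Spark job
    finally:
        sc.stop()


def count_lines_plain(path):
    """Count lines without Spark, as a cross-check."""
    with io.open(path, encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)


if __name__ == "__main__":
    # Run this from the Spark folder, just like the interactive session
    sample = pick_sample_file(".")
    if sample is None:
        print("No README.md or CHANGES.txt found in the current folder")
    else:
        print("Plain Python count: {}".format(count_lines_plain(sample)))
        try:
            print("Spark count: {}".format(count_lines_spark(sample)))
        except ImportError:
            print("pyspark could not be imported from this Python")
```

Keep in mind that a plain Python interpreter may not find the pyspark module unless it is on your Python path; launching scripts through the tools that ship in Spark's bin folder takes care of that for you.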

Now your answer might be different depending on what version of Spark you installed, but the important thing is that you got a number there, and you actually ran a Spark program that could do that in a distributed manner if it was running on a real cluster. Everything's set up properly; you have run your first Spark program already on Windows, and now we can get into how it's all working and do some more interesting stuff with Spark. To get out of the pyspark prompt, just type in quit(), and once that's done, you can close this window and move on. So, congratulations, you got everything set up; it was a lot of work but I think it's worth it. You're now set up to learn Spark using Python, so let's do it.