Frank Kane's Taming Big Data with Apache Spark and Python


Running the ratings counter script

If you go to the Tools menu in Canopy, you'll find a shortcut there for Command Prompt, or you can open Command Prompt any other way you like. Once it's open, make sure you get into the SparkCourse directory where you downloaded the script that we're going to be using. So, type in cd C:\SparkCourse (or navigate to the directory if it's in a different location), then type dir, and you should see the contents of the directory. The ratings-counter.py script and the ml-100k folder should both be in there.
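If you'd rather check the folder from Python than eyeball the dir listing, a minimal sketch might look like the following. The helper name missing_entries is hypothetical, and C:\SparkCourse is just the location assumed above, so adjust it if your copy lives somewhere else.

```python
from pathlib import Path

# Names this section expects to find in the course directory.
EXPECTED = ("ratings-counter.py", "ml-100k")


def missing_entries(directory, expected=EXPECTED):
    """Return the expected names that are not present in `directory`.

    Hypothetical helper for illustration; it is not part of the course files.
    """
    root = Path(directory)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # Assumed location; change it if you unpacked the course files elsewhere.
    missing = missing_entries(r"C:\SparkCourse")
    print("Missing:", missing if missing else "nothing, ready to run")
```

An empty list means both the script and the data folder are where spark-submit will look for them.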

All I need to do to run it is type in spark-submit ratings-counter.py. Follow along with me here.

I'm going to hit Enter, and that runs the script I saved for Spark. Off it goes, and we soon get our results. So it made short work of those 100,000 ratings. 100,000 ratings isn't really big data, but we're just playing around on our desktop for now.

The results are kind of interesting. It turns out that the most common rating is four stars, so people are most generous with four-star ratings, with 34,000 of them in the dataset. People seem to reserve one star for the worst of the worst: there are only about 6,000 one-star ratings out of our 100,000 ratings. It might be fun to go and see what actually got rated one star if you want to find some really bad movies to watch.
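To make the idea of a ratings tally concrete, here is a plain-Python sketch of that kind of count. It is not the ratings-counter.py script and it doesn't use Spark at all. It assumes each line follows the MovieLens u.data layout of user ID, movie ID, rating, and timestamp separated by whitespace. The function name count_ratings and the sample rows are made up for illustration.

```python
from collections import Counter


def count_ratings(lines):
    """Tally the rating field (third column) of u.data-style lines.

    Hypothetical helper: a plain-Python stand-in for the kind of count
    described above, not the Spark script itself.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample rows: user ID, movie ID, rating, timestamp.
    sample = [
        "1\t10\t4\t0",
        "2\t10\t4\t0",
        "3\t11\t1\t0",
        "4\t12\t3\t0",
        "5\t12\t4\t0",
    ]
    # Print each star rating with its count, lowest rating first.
    for rating, count in sorted(count_ratings(sample).items()):
        print(rating, count)
```

To try this on the real data, you could pass an open file handle for the ratings file inside ml-100k instead of the sample list. The counts should then line up with what the Spark run reports.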