Frank Kane's Taming Big Data with Apache Spark and Python

Components of Spark

Spark is made up of many components. We're going to be focusing a lot on Spark Core, which means looking at what you can do with RDD objects and how you can use them to distribute the processing of data and the mapping and reducing of large datasets. In addition to Spark Core, Spark also includes several libraries that run on top of it, as shown in the following diagram:

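Before we get to those libraries, here's a minimal PySpark sketch of the Spark Core idea: distribute a dataset as an RDD, map a function over it, and reduce the results back down. The local master setting, app name, and data are illustrative, not from the book:

    from pyspark import SparkConf, SparkContext

    # Illustrative local setup: run Spark on all local CPU cores.
    conf = SparkConf().setMaster("local[*]").setAppName("RDDExample")
    sc = SparkContext(conf=conf)

    # Distribute a small dataset as an RDD, transform it with map,
    # and combine the results with reduce.
    numbers = sc.parallelize([1, 2, 3, 4, 5])
    squared = numbers.map(lambda x: x * x)
    total = squared.reduce(lambda a, b: a + b)
    print(total)  # 55

    sc.stop()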
We have Spark Streaming, which gives you the ability to analyze real-time data streams, for instance, a set of web logs coming in from a fleet of web servers that needs to be analyzed continually as it arrives. We'll talk about Spark Streaming later in the book.

We have Spark SQL, which lets you run Spark on top of a Hive context, deal with structured data within Spark, and run SQL queries on top of it. So if you are familiar with SQL and want to treat Spark as a sort of data warehouse, you can do that too.

Spark also includes MLlib, a library of machine learning algorithms. If you're going to be doing any machine learning or data mining with Spark, it contains many useful tools that simplify common operations; for example, if you need to compute a Pearson correlation or get statistical properties of your dataset, MLlib makes that very easy to do.

Finally, we have GraphX. That's not the kind of graph you draw on graph paper; it's for managing graphs of information, network-theory sort of stuff. If you have a graph, for example, a social graph of people who are friends with each other, or citations between scholarly articles, GraphX can help you make sense of those networks and give you high-level information about their properties.

All these libraries run on top of Spark Core. Spark has a lot to offer, and it's expanding all the time. To give you a flavor of a few of these libraries in Python, some minimal sketches of Spark Streaming, Spark SQL, and MLlib follow.
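First, a minimal Spark Streaming sketch. It assumes log lines arrive as text on a local network socket; the host, port, and one-second batch interval are illustrative:

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    # Streaming needs at least two local threads: one to receive
    # data and one to process it.
    conf = SparkConf().setMaster("local[2]").setAppName("StreamingExample")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)  # one-second micro-batches

    # Illustrative source: lines of web-log text arriving on a socket.
    lines = ssc.socketTextStream("localhost", 9999)
    lines.count().pprint()  # print the number of lines in each batch

    ssc.start()
    ssc.awaitTermination()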
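Next, a minimal Spark SQL sketch, using the SparkSession entry point and a small hypothetical table of people:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SQLExample").getOrCreate()

    # Illustrative structured data: a DataFrame of (name, age) rows.
    people = spark.createDataFrame([("Alice", 30), ("Bob", 25)],
                                   ["name", "age"])
    people.createOrReplaceTempView("people")

    # Treat Spark like a data warehouse and query it with plain SQL.
    spark.sql("SELECT name FROM people WHERE age > 28").show()

    spark.stop()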
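Finally, since Pearson correlation was mentioned, here is a minimal MLlib sketch; the two data series are made up for illustration:

    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.stat import Statistics

    conf = SparkConf().setMaster("local[*]").setAppName("MLlibExample")
    sc = SparkContext(conf=conf)

    # Two illustrative series of measurements as RDDs of floats.
    x = sc.parallelize([1.0, 2.0, 3.0, 4.0])
    y = sc.parallelize([2.0, 4.0, 6.0, 8.0])

    # Pearson correlation between the two series (1.0 here,
    # since y is a perfect linear function of x).
    print(Statistics.corr(x, y, method="pearson"))

    sc.stop()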