Apache Spark 2: Data Processing and Real-Time Analytics

Overview

The following diagram shows potential data sources for Apache Spark Streaming, such as Kafka, Flume, and HDFS:

These feed into the Spark Streaming module and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process stream-based data.

The fully processed data can then be output to HDFS, databases, or dashboards. This diagram is based on the one on the Spark Streaming website, but we have extended it to show the wider Spark module functionality:

When discussing Spark Discrete Streams, we would like to use the previous figure, taken from the Spark website at http://spark.apache.org/.

The green boxes in the previous figure show the continuous data stream sent to Spark being broken down into a Discrete Stream (DStream).

A DStream is nothing more than an ordered sequence of RDDs, so Apache Spark Streaming is not true record-by-record streaming, but micro-batching. Each batch interval produces one RDD in the DStream, so the batch time, which might be two seconds, determines how much data each backing RDD holds. In this way, DStreams can make use of all the functionality provided by RDDs, including fault tolerance and the ability to spill to disk.
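To make this RDD-per-batch structure concrete, the following minimal sketch (not part of the book's example code) builds a DStream from a queue of pre-created RDDs and processes each micro-batch with ordinary RDD operations; the two-second batch time, the queue contents, and the timeout are illustrative choices only:

import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing Spark Context, sc, as in the later examples
val ssc = new StreamingContext(sc, Seconds(2))   // 2-second batch time

// queueStream turns a queue of pre-built RDDs into a DStream, making the
// one-RDD-per-batch-interval structure explicit
val rddQueue = new Queue[RDD[Int]]()
rddQueue += sc.parallelize(1 to 10)
rddQueue += sc.parallelize(11 to 20)

val dstream = ssc.queueStream(rddQueue)

// Each batch interval delivers one RDD; normal RDD operations apply to it
dstream.foreachRDD { rdd =>
  println(s"batch with ${rdd.count()} elements, sum = ${rdd.sum()}")
}

ssc.start()
ssc.awaitTerminationOrTimeout(10000)   // run briefly, then stop the sketch
ssc.stop(stopSparkContext = false)

Because each batch is an ordinary RDD, anything that works on an RDD, including the machine learning functionality mentioned earlier, can be applied inside foreachRDD.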

It is also possible to create a window over the DStream, shown as the red box in the previous figure. For instance, when carrying out trend analysis in real time, it might be necessary to determine the top ten Twitter hashtags over a ten-minute window, as in the sketch that follows.
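As an illustration only (this is a sketch rather than the book's example code), the following function counts hashtags over a ten-minute window that slides every 60 seconds and prints the current top ten; it assumes a DStream[String] of raw tweet texts called tweets, obtained, for example, from the Twitter stream created next:

import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.dstream.DStream

// tweets is assumed to be a DStream[String] of raw tweet texts
def printTopHashtags(tweets: DStream[String]): Unit = {
  val hashtagCounts = tweets
    .flatMap(_.split("\\s+"))          // split each tweet into words
    .filter(_.startsWith("#"))         // keep only the hashtags
    .map(tag => (tag, 1))
    .reduceByKeyAndWindow(_ + _, Minutes(10), Seconds(60))   // 10-minute window, 60-second slide

  hashtagCounts.foreachRDD { rdd =>
    val top10 = rdd.sortBy(_._2, ascending = false).take(10)
    println("Top hashtags over the last ten minutes:")
    top10.foreach { case (tag, count) => println(s"$tag : $count") }
  }
}

Note that the window and slide durations must be multiples of the batch time chosen when the StreamingContext is created.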

So, given that Spark can be used for stream processing, how is a stream created? The following Scala-based code shows how a Twitter stream can be created. This example is simplified because Twitter authorization has not been included, but you get the idea. (The full example code is in the Checkpointing section.)

The Spark Streaming Context (ssc) is created using the Spark Context, sc. A batch time is specified when it is created; in this case, 5 seconds. A Twitter-based DStream, called stream, is then created from the StreamingContext using a window of 60 seconds:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc    = new StreamingContext(sc, Seconds(5) )
val stream = TwitterUtils.createStream(ssc, None).window( Seconds(60) )

Stream processing can be started with the streaming context start method (shown next), and the awaitTermination method indicates that it should process until stopped. So, if this code is embedded in a library-based application, it will run until the session is terminated, perhaps with Ctrl + C:

ssc.start()
ssc.awaitTermination()
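As an alternative to awaitTermination, the StreamingContext can also be stopped programmatically rather than with Ctrl + C; a minimal sketch, with a purely illustrative ten-minute timeout, follows:

ssc.start()
// Wait up to ten minutes; if the stream is still running, stop it gracefully
// so that in-flight batches complete, while keeping the Spark Context alive
if (!ssc.awaitTerminationOrTimeout(600000)) {
  ssc.stop(stopSparkContext = false, stopGracefully = true)
}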

This explains what Spark Streaming is and what it does, but it does not explain error handling or what to do if your stream-based application fails. The next section will examine Spark Streaming error management and recovery.