
Apache Spark
Apache Spark is an open-source framework designed for a wide range of data processing workloads. It is made up of the following processing components:
- Spark Core: It is the foundation of Spark. It handles the most critical functions, such as task scheduling, memory management, and fault recovery.
- Spark SQL: It is designed to work with structured data. It can run SQL as well as HiveQL queries, which makes it easy to integrate Spark into existing applications. It also supports the industry-standard JDBC and ODBC connectors for connecting to Business Intelligence tools. A combined Core and SQL sketch follows this list.
- Spark Streaming: It is designed to process streams of data in a scalable and fault-tolerant manner. It splits the incoming stream into mini-batches and then runs a job on each batch. It can consume existing data streams, such as those from HDFS, Kafka, and Flume, and you can also create your own data source if needed (see the streaming sketch below).
- MLlib: It is designed to run machine learning algorithms and perform statistical analysis on your big data with Spark's fast in-memory processing. Its algorithms cover classification, regression, decision trees, random forests, and gradient-boosted trees, as well as recommendation, clustering, topic modeling, frequent itemsets, association rules, and sequential pattern mining (see the MLlib sketch below).
- GraphX: It is designed for graph-related analytics. It has a vast list of algorithms, many of which have been contributed by its users (see the GraphX sketch below).
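To make the first two components concrete, here is a minimal Scala sketch, assuming a local run; the application name, the HDFS path, and the sample rows are hypothetical placeholders. It uses Spark Core's RDD API for a word count and Spark SQL to query a temporary view:

```scala
import org.apache.spark.sql.SparkSession

object CoreAndSqlSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point; "local[*]" uses all local cores.
    val spark = SparkSession.builder()
      .appName("core-and-sql-sketch")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // Spark Core: a classic RDD word count (the HDFS path is a placeholder).
    val counts = spark.sparkContext
      .textFile("hdfs:///data/sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)

    // Spark SQL: build a DataFrame, register it as a view, and query it with SQL.
    import spark.implicits._
    val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()

    spark.stop()
  }
}
```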
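The mini-batch model of Spark Streaming can be sketched as follows; the socket source, host, and port are placeholder assumptions. Each five-second batch of lines is turned into a small word-count job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, at least one for processing.
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")

    // Incoming data is split into 5-second mini-batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    // socketTextStream is a built-in source; host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```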
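As a small MLlib illustration, the following sketch fits a logistic regression classifier; the four labeled rows are toy data invented for the example:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mllib-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy training data: (label, features). Real input would come from a data source.
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    // Fit a regularized logistic regression model.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)
    println(s"Coefficients: ${model.coefficients}")

    spark.stop()
  }
}
```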
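And a brief GraphX sketch, assuming a tiny property graph invented for the example; it runs the built-in PageRank algorithm over it:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("graphx-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A tiny property graph: vertices carry names, edges carry relationship labels.
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph = Graph(vertices, edges)

    // PageRank is one of GraphX's built-in algorithms.
    val ranks = graph.pageRank(tol = 0.001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }

    spark.stop()
  }
}
```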
The following figure illustrates Spark's components in the Hadoop ecosystem:

Figure-3.8.1
Apache Spark is one of the top choices for big data architects because of the components and libraries it bundles: an application developer does not have to reach for a different framework for any of the functionalities listed above. Some of its key features are as follows:
- Spark's main attraction is its in-memory processing of large datasets. Spark persists data in memory, and also on disk when necessary (see the persistence sketch after this list), and its DAG execution engine optimizes how jobs are processed. According to the Spark team's benchmarks, it runs workloads up to 100 times faster than Hadoop MapReduce when the data fits in memory, and up to 10 times faster when working from disk.
- From a developer's perspective, Spark provides APIs in Java, Python, Scala, and R, while many comparable frameworks support only Java.
- It runs on multiple distributed processing frameworks, including Hadoop YARN and Mesos, as well as in the cloud, and it can also run in a standalone cluster mode (see the deployment sketch after this list).
- It supports multiple data sources, including HDFS, Cassandra, HBase, and any other Hadoop-compatible data source.
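The persistence behavior described in the first point can be sketched like this; the Parquet path and the type column are hypothetical. StorageLevel.MEMORY_AND_DISK keeps partitions in memory and spills to disk only when memory runs short:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-sketch")
      .master("local[*]")
      .getOrCreate()

    // The dataset path is a placeholder.
    val events = spark.read.parquet("hdfs:///data/events.parquet")

    // Keep the data in memory, spilling partitions to disk when memory is tight.
    events.persist(StorageLevel.MEMORY_AND_DISK)

    // Both actions reuse the cached data instead of re-reading it from storage.
    println(events.count())
    events.groupBy("type").count().show()   // "type" is a hypothetical column

    events.unpersist()
    spark.stop()
  }
}
```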
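Finally, a sketch of how the cluster manager and data source choices appear in code; the master URL and the HDFS path are placeholders, and in practice the master is usually supplied via spark-submit rather than hard-coded:

```scala
import org.apache.spark.sql.SparkSession

object DeploySketch {
  def main(args: Array[String]): Unit = {
    // The master URL selects the cluster manager: "yarn" for Hadoop YARN,
    // "mesos://host:5050" for Mesos, "spark://host:7077" for a standalone
    // cluster, or "local[*]" for a single machine.
    val spark = SparkSession.builder()
      .appName("deploy-sketch")
      .master("local[*]")
      .getOrCreate()

    // Any Hadoop-compatible path works as a source; this path is hypothetical.
    val logs = spark.read.textFile("hdfs:///data/logs.txt")
    println(logs.count())

    spark.stop()
  }
}
```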
Apache Spark contains the different libraries we have discussed, all relating to big data processing. To understand Spark as a big data architect, you will need more than an overview; we will cover Spark's libraries in depth in a later section of this book.