Apache Spark 2: Data Processing and Real-Time Analytics

To Get the Most out of This Book

Operating system: A Linux distribution is preferable (Debian, Ubuntu, Fedora, RHEL, or CentOS); for Ubuntu specifically, a complete 14.04 (LTS) 64-bit (or later) installation is recommended, either natively or in VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).
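
Whichever platform you use, it is worth confirming the JVM and Scala runtime before installing Spark. The following is a minimal sanity-check sketch (not part of the book's code bundle) that you can paste into any Scala REPL or run as a script:

    // Prints the OS, JVM, and Scala versions -- paste into a Scala REPL
    // or save as EnvCheck.scala and run with `scala EnvCheck.scala`.
    println(s"OS:    ${System.getProperty("os.name")} ${System.getProperty("os.arch")}")
    println(s"Java:  ${System.getProperty("java.version")}")
    println(s"Scala: ${scala.util.Properties.versionString}")

Java should report 1.7 or 1.8 and Scala 2.11.x to match the requirements listed below.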

Hardware configuration: Processor Core i3, Core i5 (recommended), or Core i7 (to get the best results); in any case, a multicore processor will give you faster data processing and better scalability. You will need at least 8-16 GB RAM (recommended) for standalone mode, at least 32 GB RAM for a single VM, and more for a cluster. You will also need enough storage to run heavy jobs (depending on the size of the datasets you will be handling), preferably at least 50 GB of free disk space (for a standalone cluster and for an SQL warehouse).
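
To see where this RAM guidance surfaces in practice, the sketch below sizes a Spark application accordingly. The application name and memory values are illustrative assumptions, not settings from the book; note that driver memory must be supplied before the driver JVM starts, typically at submit time:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Sizing sketch: assumes submission to a standalone cluster, e.g.
    //   spark-submit --driver-memory 8g --class Sizing app.jar
    val conf = new SparkConf()
      .setAppName("SizingSketch")            // hypothetical app name
      .set("spark.executor.memory", "8g")    // within the 8-16 GB guideline
      .set("spark.executor.cores", "4")      // more cores => more parallelism

    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(s"Default parallelism: ${spark.sparkContext.defaultParallelism}")
    spark.stop()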

Along with this, you will require the following (a sample build.sbt wiring several of these dependencies together appears after the list):

  • VirtualBox 5.1.22 or above
  • Hortonworks HDP Sandbox V2.6 or above
  • Eclipse Neon or above
  • Eclipse Scala Plugin
  • Eclipse Git Plugin
  • Spark 2.0.0 (or higher)
  • Hadoop 2.7 (or higher)
  • Java (JDK and JRE) 1.7+/1.8+
  • Scala 2.11.x (or higher)
  • Python 2.7+/3.4+
  • R 3.1+ and RStudio 1.0.143 (or higher)
  • Maven Eclipse plugin (2.9 or higher)
  • Maven compiler plugin for Eclipse (2.3.2 or higher)
  • Maven assembly plugin for Eclipse (2.4.1 or higher)
  • Oracle JDK SE 1.8.x
  • JetBrains IntelliJ IDEA Community Edition 2016.2.x or later
  • Scala plug-in for IntelliJ 2016.2.x
  • JFreeChart 1.0.19
  • breeze-core 0.12
  • Cloud9 1.5.0 JAR
  • Bliki-core 3.0.19
  • hadoop-streaming 2.2.0
  • JCommon 1.0.23
  • Lucene-analyzers-common 6.0.0
  • Lucene-core 6.0.0
  • Spark-streaming-flume-assembly 2.0.0
  • Spark-streaming-kafka-assembly 2.0.0
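
If you build with sbt rather than the Eclipse Maven plugins, the list above translates roughly into the build file below. This is a minimal sketch: the coordinates shown are the customary Maven Central ones for these libraries, and the Kafka and Flume entries use the compile-time artifacts corresponding to the assembly JARs listed above (spark-streaming-kafka-0-8 in Spark 2.0's naming), so verify them against the versions you actually install:

    // build.sbt -- a minimal sketch; versions mirror the list above.
    name := "spark2-examples"                // hypothetical project name
    version := "0.1.0"
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "2.0.0",
      "org.apache.spark" %% "spark-sql"       % "2.0.0",
      "org.apache.spark" %% "spark-streaming" % "2.0.0",
      // Compile-time counterparts of the streaming assembly JARs above
      "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0",
      "org.apache.spark" %% "spark-streaming-flume"     % "2.0.0",
      "org.scalanlp"     %% "breeze"            % "0.12",
      "org.jfree"          % "jfreechart"       % "1.0.19",
      "org.jfree"          % "jcommon"          % "1.0.23",
      "org.apache.lucene"  % "lucene-core"             % "6.0.0",
      "org.apache.lucene"  % "lucene-analyzers-common" % "6.0.0",
      "info.bliki.wiki"    % "bliki-core"       % "3.0.19",
      "org.apache.hadoop"  % "hadoop-streaming" % "2.2.0"
    )

Run sbt compile once to confirm the coordinates resolve before importing the project into Eclipse or IntelliJ.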