
To Get the Most out of This Book
Operating system: A Linux distribution is preferable (Debian, Ubuntu, Fedora, RHEL, or CentOS). For Ubuntu specifically, a complete 14.04 (LTS) 64-bit installation (or later) is recommended, running natively or under VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).
Hardware configuration: A Core i3 processor, a Core i5 (recommended), or a Core i7 (for the best results); a multicore processor will provide faster data processing and better scalability. You will need at least 8-16 GB of RAM (recommended) for standalone mode and at least 32 GB of RAM for a single VM, and more for a cluster. You will also need enough storage to run heavy jobs (depending on the size of the datasets you will be handling), preferably at least 50 GB of free disk space (for standalone mode and for an SQL warehouse).
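As a rough illustration of how these figures map onto Spark's own settings, the sketch below sizes executor memory and cores when building a SparkConf for a standalone cluster; the master URL and application name are hypothetical placeholders, not values from this book:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sizing sketch, assuming a Spark standalone cluster.
// The master URL and app name below are hypothetical placeholders.
val conf = new SparkConf()
  .setAppName("ResourceSizingSketch")
  .setMaster("spark://master-host:7077")  // hypothetical cluster master
  .set("spark.executor.memory", "16g")    // per-executor RAM, per the guidance above
  .set("spark.executor.cores", "4")       // cores per executor on a multicore CPU

val sc = new SparkContext(conf)
// ... run your jobs here ...
sc.stop()
```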
Along with this, you will need the following:
- VirtualBox 5.1.22 or above
- Hortonworks HDP Sandbox V2.6 or above
- Eclipse Neon or above
- Eclipse Scala Plugin
- Eclipse Git Plugin
- Spark 2.0.0 (or higher)
- Hadoop 2.7 (or higher)
- Java (JDK and JRE) 1.7+/1.8+
- Scala 2.11.x (or higher)
- Python 2.7+/3.4+
- R 3.1+ and RStudio 1.0.143 (or higher)
- Maven Eclipse plugin (2.9 or higher)
- Maven compiler plugin for Eclipse (2.3.2 or higher)
- Maven assembly plugin for Eclipse (2.4.1 or higher)
- Oracle JDK SE 1.8.x
- JetBrains IntelliJ IDEA Community Edition 2016.2.x or a later version
- Scala plugin for IntelliJ 2016.2.x
- JFreeChart 1.0.19
- breeze-core 0.12
- Cloud9 1.5.0 JAR
- Bliki-core 3.0.19
- hadoop-streaming 2.2.0
- JCommon 1.0.23
- Lucene-analyzers-common 6.0.0
- Lucene-core 6.0.0
- Spark-streaming-flume-assembly 2.0.0
- Spark-streaming-kafka-assembly 2.0.0
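Once everything is installed, a quick way to confirm that the Spark, Scala, and Java versions match the list above is to run a few lines in spark-shell, which pre-defines the SparkContext as `sc`. This is just a sanity-check sketch, not one of the book's examples:

```scala
// Run inside spark-shell, which pre-defines the SparkContext as `sc`.
println(s"Spark version: ${sc.version}")                          // expect 2.0.0 or higher
println(s"Scala version: ${util.Properties.versionString}")       // expect 2.11.x
println(s"Java version:  ${System.getProperty("java.version")}")  // expect 1.7+/1.8+
```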