Apache Spark 2: Data Processing and Real-Time Analytics

To Get the Most out of This Book

Operating system: A Linux distribution is preferable (Debian, Ubuntu, Fedora, RHEL, or CentOS). For Ubuntu, a complete 14.04 (LTS) 64-bit (or later) installation is recommended, along with VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

Hardware configuration: A Core i3, Core i5 (recommended), or Core i7 (for the best results) processor. Multicore processing will provide faster data processing and better scalability. You will need at least 8-16 GB of RAM (recommended) for standalone mode and at least 32 GB of RAM for a single VM, and more for a cluster. You will also need enough storage to run heavy jobs (depending on the size of the datasets you will be handling), preferably at least 50 GB of free disk space (for standalone mode and for an SQL warehouse).
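As a quick sanity check of the figures above, the following minimal Python sketch reports the number of logical CPU cores and the free disk space on the current drive, and compares the latter with the 50 GB guideline. The function names and the constant are illustrative only and are not part of any Spark tooling; the sketch assumes Python 3.3 or later, where `shutil.disk_usage` is available.

```python
import os
import shutil

# Guideline from the text above; the constant name is an illustrative choice.
MIN_FREE_DISK_GB = 50


def free_disk_gb(path="."):
    """Return the free disk space at `path` in gigabytes (1 GB = 1024**3 bytes)."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_enough_disk(free_gb, minimum_gb=MIN_FREE_DISK_GB):
    """Return True if `free_gb` meets or exceeds the minimum."""
    return free_gb >= minimum_gb


if __name__ == "__main__":
    cores = os.cpu_count()  # may be None if the count cannot be determined
    free_gb = free_disk_gb()
    print("Logical CPU cores:", cores)
    print("Free disk space: %.1f GB" % free_gb)
    if not has_enough_disk(free_gb):
        print("Warning: less than %d GB of free disk space" % MIN_FREE_DISK_GB)
```

The RAM check is left out on purpose, because the Python standard library has no portable call for total memory; use your operating system's own tools for that.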

Along with this, you will need the following:

  • VirtualBox 5.1.22 or above
  • Hortonworks HDP Sandbox V2.6 or above
  • Eclipse Neon or above
  • Eclipse Scala Plugin
  • Eclipse Git Plugin
  • Spark 2.0.0 (or higher)
  • Hadoop 2.7 (or higher)
  • Java (JDK and JRE) 1.7+/1.8+
  • Scala 2.11.x (or higher)
  • Python 2.7+/3.4+
  • R 3.1+ and RStudio 1.0.143 (or higher)
  • Maven Eclipse plugin (2.9 or higher)
  • Maven compiler plugin for Eclipse (2.3.2 or higher)
  • Maven assembly plugin for Eclipse (2.4.1 or higher)
  • Oracle JDK SE 1.8.x
  • JetBrains IntelliJ IDEA Community Edition 2016.2.x or later
  • Scala plug-in for IntelliJ 2016.2.x
  • JFreeChart 1.0.19
  • breeze-core 0.12
  • Cloud9 1.5.0 JAR
  • Bliki-core 3.0.19
  • hadoop-streaming 2.2.0
  • JCommon 1.0.23
  • Lucene-analyzers-common 6.0.0
  • Lucene-core 6.0.0
  • Spark-streaming-flume-assembly 2.0.0
  • Spark-streaming-kafka-assembly 2.0.0
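
Several of the entries above are minimum versions ("or higher"). One way to compare an installed version string against such a minimum is sketched below in Python. This is only an illustration: the `MINIMUMS` dictionary and the helper names are hypothetical, the version numbers are copied from the list above, and the helpers expect a bare version string such as `2.1.0` rather than the full output of a command-line tool.

```python
import re

# Minimum versions copied from the list above; the keys are illustrative labels.
MINIMUMS = {
    "spark": "2.0.0",
    "hadoop": "2.7",
    "scala": "2.11",
    "r": "3.1",
}


def parse_version(text):
    """Turn a bare version string into a tuple of ints, e.g. '2.11.8' -> (2, 11, 8)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`.

    Shorter versions are padded with zeros, so '2.7' and '2.7.0' compare equal.
    """
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # Hypothetical installed versions, for demonstration only.
    installed = {"spark": "2.1.0", "hadoop": "2.7.3", "scala": "2.10.6", "r": "3.4.0"}
    for name, minimum in MINIMUMS.items():
        status = "ok" if meets_minimum(installed[name], minimum) else "too old"
        print("%s %s (minimum %s): %s" % (name, installed[name], minimum, status))
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall where "2.10" sorts after "2.9" numerically but before it alphabetically.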