Last updated: 2021-07-02 18:56:09
Cover Page
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
A First Taste and What's New in Apache Spark V2
Spark machine learning
Spark Streaming
Spark SQL
Spark graph processing
Extended ecosystem
What's new in Apache Spark V2?
Cluster design
Cluster management
Local
Standalone
Apache YARN
Apache Mesos
Cloud-based deployments
Performance
The cluster structure
Hadoop Distributed File System
Data locality
Memory
Coding
Cloud
Summary
Apache Spark SQL
The SparkSession--your gateway to structured data processing
Importing and saving data
Processing the text files
Processing JSON files
Processing the Parquet files
Understanding the DataSource API
Implicit schema discovery
Predicate push-down on smart data sources
DataFrames
Using SQL
Defining schemas manually
Using SQL subqueries
Applying SQL table joins
Using Datasets
The Dataset API in action
User-defined functions
RDDs versus DataFrames versus Datasets
The Catalyst Optimizer
Understanding the workings of the Catalyst Optimizer
Managing temporary views with the catalog API
The SQL abstract syntax tree
How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
Internal class and object representations of LEPs
How to optimize the Resolved Logical Execution Plan
Physical Execution Plan generation and selection
Code generation
Practical examples
Using the explain method to obtain the PEP
How smart data sources work internally
Project Tungsten
Memory management beyond the Java Virtual Machine Garbage Collector
Understanding the UnsafeRow object
The null bit set region
The fixed length values region
The variable length values region
Understanding the BytesToBytesMap
A practical example on memory usage and performance
Cache-friendly layout of data in memory
Cache eviction strategies and pre-fetching
Understanding columnar storage
Understanding whole stage code generation
A practical example on whole stage code generation performance
Operator fusing versus the volcano iterator model
Apache Spark Streaming
Overview
Errors and recovery
Checkpointing
Streaming sources
TCP stream
File streams
Flume
Kafka