Big Data Architect’s Handbook

Last updated: 2021-06-25 20:57:54

Latest chapter: Leave a review - let other readers know what you think

Cover

Copyright information

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Why Big Data?

What is big data?

Characteristics of big data

Volume

Velocity

Variety

Veracity

Variability

Value

Solution-based approach for data

Data – the most valuable asset

Traditional approaches to data storage

Clustered computing

High availability

Resource pooling

Easy scalability

Big data – how does it make a difference?

Big data solutions – cloud versus on-premises infrastructure

Cost

Security

Current capabilities

Scalability

Big data glossary

Big data

Batch processing

Cluster computing

Data warehouse

Data lake

Data mining

ETL

Hadoop

In-memory computing

Machine learning

MapReduce

NoSQL

Stream processing

Summary

Big Data Environment Setup

Oracle VM VirtualBox installation

Ubuntu installation

Hadoop prerequisite installation

Java installation

SSH installation and configuration

Hadoop system user

Apache Hadoop installation

Hadoop configuration

Path configuration for Hadoop commands

Hadoop server start and stop

Summary

Hadoop Ecosystem

Apache Hadoop

Hadoop Distributed File System

HDFS hands-on

Creating a directory in HDFS

Copying files from a local file system to HDFS

Copying files from HDFS to a local file system

Deleting files and folders in HDFS

Hadoop MapReduce

Job Tracker and Task Tracker

The execution flow of MapReduce

Mapper

Shuffle and Sort

Reducer

Example program

Preparing the data file for analysis

Program code

Driver program

Mapper program

Reducer program

Observations and results

YARN

Resource Manager

Node Manager

Container

Application Master

Apache Projects related to big data

Apache Zookeeper

Apache Kafka

Apache Flume

Apache Cassandra

Apache HBase

Apache Spark