
Apache Flume
Apache Flume is a framework that helps move large amounts of streaming data from one place to another. It is primarily designed for log collection and aggregation: it gathers data from many different servers and moves it to a centralized store such as Hadoop for processing and analysis. Its usage is not limited to log aggregation, though. Flume's data source connectors are customizable and can be used to transport large volumes of event-generated data such as network traffic data, social media data, and so on. It is a fault-tolerant system with many failover and recovery mechanisms. The following figure illustrates how Apache Flume works:

There are three main components inside a Flume agent. They are as follows:
- Flume source: External data sources, such as web servers or social media platforms, generate log or event data; these events are consumed by a Flume source. For example, if a Flume source is configured to monitor a Twitter account, it will generate an event for every tweet and pass it on. Once a Flume source receives an event, it passes it on to a Flume channel; it has no control over how the Flume channel stores or treats the data.
- Flume channel: A Flume channel is connected to both a Flume source and a Flume sink. Any event that reaches a Flume source is immediately passed on to a Flume channel, which acts as a buffer whose capacity you configure. The channel is responsible for holding the data until it is consumed by the Flume sink; a file-backed channel can store events on the local file system for greater reliability.
- Flume sink: A Flume sink is responsible for retrieving events from a Flume channel and delivering them to an external repository, such as HDFS. An event is removed from the Flume channel once it has been consumed by a Flume sink. A minimal configuration wiring these three components together is sketched after this list.
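To make the pipeline concrete, here is a minimal sketch of an agent configuration in Flume's standard properties format. The agent and component names (agent1, src1, ch1, sink1) are placeholders, and it uses the built-in netcat source, memory channel, and logger sink so that it runs without any external dependencies:

    # Declare the components of a single agent named agent1.
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: listens on a local TCP port and turns each line of text
    # it receives into a Flume event.
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # Channel: an in-memory buffer; capacity is the maximum number of
    # events it can hold at once.
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 1000
    agent1.channels.ch1.transactionCapacity = 100

    # Sink: simply logs each event it consumes (useful for testing).
    agent1.sinks.sink1.type = logger
    agent1.sinks.sink1.channel = ch1

Assuming the file is saved as example.conf, the agent can be started with the flume-ng launcher that ships with Flume: bin/flume-ng agent --conf conf --conf-file example.conf --name agent1. Note that a source lists its channels (plural, because one source can fan out to several channels), whereas a sink is bound to exactly one channel.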
Flume is a framework that helps handle data while streaming it into Hadoop. It is often compared to Apache Kafka, although they are quite different systems. Some of their functionality overlaps, but that doesn't mean Apache Kafka is an improved version of Apache Flume or vice versa. Kafka is a general-purpose system that can be used for messaging or in other kinds of applications that need a fault-tolerant backbone. Flume, on the other hand, is a reliable data handler built specifically for the Hadoop ecosystem, bringing large amounts of log and streaming data into a Hadoop cluster. It provides many built-in sources and sinks, which can be used directly in your application.
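As an illustration of those built-in components, the logger sink in the earlier sketch could be swapped for Flume's HDFS sink by replacing just the sink section of the configuration; the source and channel are untouched, because the channel decouples the two sides. The namenode address, path, and roll interval below are hypothetical values, not recommendations:

    # Hypothetical replacement sink section: write events to HDFS as plain
    # text, bucketed into one directory per day.
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.hdfs.rollInterval = 300
    # Resolve the %Y-%m-%d escapes from the agent's local clock, since the
    # netcat source does not add a timestamp header to its events.
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
    agent1.sinks.sink1.channel = ch1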