Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale

Using Apache Flume to Acquire Data Streams

In addition to structured data in databases, another common source of data is log files, which usually come in the form of continuous (streaming) incremental files, often from multiple source machines. In order to use this type of data for data science with Hadoop, we need a way to ingest such data into HDFS.

Apache Flume is designed to collect, transport, and store data streams into HDFS. Often data transport involves a number of Flume agents that may traverse a series of machines and locations. Flume is often used for log files, social-media-generated data, email messages, and pretty much any continuous data source.

As shown in Figure 4.4, a Flume agent is composed of three components:

Source-The source component receives data and sends it to a channel. It can send the data to more than one channel. The input data can be from a real-time source (e.g., a web log) or another Flume agent.

Channel-A channel is a data queue that forwards the source data to the sink destination. It can be thought of as a buffer that manages input (source) and output (sink) flow rates.

Sink-The sink delivers data to destinations such as HDFS, a local file, or another Flume agent.

Figure 4.6 A Flume consolidation network.

There are many possible ways to construct Flume transport networks. The full scope of Flume functionality is beyond the scope of this book; Flume offers many additional features, such as plug-ins and interceptors, that can enhance Flume pipelines. For more information and example configurations, please see the Flume User Guide.

In this example, web logs from the local machine will be placed into HDFS using Flume. The example is easily modified to use other web logs from different machines. Two files are needed to configure Flume (see the sidebar "Flume Configuration Files"): one defining the source Flume agent that captures the web log data, and one defining the target Flume agent that writes the data to HDFS. The full source code and further implementation notes are available from the book web page; see Appendix A, "Book Web Page and Code Download." The web log is also mirrored on the local file system by the agent that writes to HDFS.

To run the example, create the directory as root. Next, as user hdfs, make a Flume data directory in HDFS:

$ hdfs dfs -mkdir /user/hdfs/flume-channel/

Now that the data directories are created, the Flume target agent can be started (as user hdfs):

$ flume-ng agent -c conf -f nf -n collector

This agent writes the data into HDFS and should be started before the source agent. In some Hadoop distributions, Flume can be started as a service when the system boots (e.g., "service flume start"); this configuration allows for automatic use of the Flume agent. When started as a service, Flume reads its configuration from /etc/flume/conf/.

In the target file, the hdfs.fileType = DataStream setting writes the data to HDFS as an uncompressed stream. The target file also defines the port and two channels (mc1 and mc2). One of the channels writes the data to the local file system and the other writes to HDFS.
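As a sketch of what the source-agent configuration file might contain, the following uses an exec source to tail the web log, a memory channel, and an Avro sink that forwards events to the target agent. The agent name, log path, hostname, and port are illustrative assumptions, not the book's exact listing:

```properties
# Hypothetical source agent: tail the local web log and forward it via Avro
source_agent.sources = apache_log
source_agent.channels = memoryChannel
source_agent.sinks = avro_forward

# Exec source runs a command and turns its output lines into Flume events
source_agent.sources.apache_log.type = exec
source_agent.sources.apache_log.command = tail -F /var/log/httpd/access_log
source_agent.sources.apache_log.channels = memoryChannel

# Memory channel buffers events between source and sink
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 1000

# Avro sink sends the events to the target (collector) agent
source_agent.sinks.avro_forward.type = avro
source_agent.sinks.avro_forward.hostname = localhost
source_agent.sinks.avro_forward.port = 4545
source_agent.sinks.avro_forward.channel = memoryChannel
```

An agent defined this way would be started with `flume-ng agent -c conf -f <file> -n source_agent`, where `<file>` is whatever name the configuration is saved under.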
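The target ("collector") agent described in the text defines a port and two channels, mc1 and mc2, with one channel feeding a local-file sink and the other feeding HDFS with hdfs.fileType = DataStream. A minimal sketch of such a file might look like the following; the port, directory paths, and sink names are assumptions for illustration:

```properties
# Hypothetical target agent: one Avro source fanned out to two channels
collector.sources = AvroIn
collector.channels = mc1 mc2
collector.sinks = LocalOut HadoopOut

# Avro source listens for events from the source agent
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

collector.channels.mc1.type = memory
collector.channels.mc2.type = memory

# mc1 -> local file system mirror of the web log
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume-hdfs
collector.sinks.LocalOut.channel = mc1

# mc2 -> HDFS, written as an uncompressed data stream
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.hdfs.path = /user/hdfs/flume-channel
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.channel = mc2
```

Declaring both channels on the source (`AvroIn.channels = mc1 mc2`) replicates every event to each channel, which is what allows the same web log to land in both the local mirror and HDFS.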