In this post we will discuss a well-known real-time use case of Hadoop's Flume tool: Twitter data analysis using Apache's distribution of Flume. We will also touch on the corresponding distribution from Cloudera.
For this purpose we will use the experimental Twitter Source provided by Apache's Flume distribution to stream tweets into HDFS. We will then process them in Hive and create structured tables that can be used for further analytics by tools such as Tableau, QlikView or Hunk.
Twitter Data Analysis Using Hadoop Flume
Flume TwitterAgent Setup
In this section, we will set up a Twitter agent in the Apache Flume distribution (apache-flume-1.5.2-bin, the latest version at the time of writing this post). In this agent, we will use the Twitter Source provided by Apache, a File Channel and an HDFS Sink as the primary components.
Twitter Source Overview
As per Apache's Flume documentation, the Twitter Source org.apache.flume.source.twitter.TwitterSource is highly experimental and may change between minor versions of Flume. It should be used at our own risk, out of curiosity to analyze real-time streaming data.
The Twitter Source connects to the Twitter firehose via the Streaming API and continuously downloads tweets. These tweets are converted into Avro format, and the Avro events are sent to the downstream Flume sink (the HDFS sink in our use case). To connect to Twitter streaming data, we need the consumer and access tokens and secrets of a Twitter developer account.
Creation of Twitter Developer Account
A Twitter developer account can be created at the Twitter Developers apps page. On this page, in the website field, we need to provide a valid Twitter account page from which we want to get streaming data. If we provide valid details on this page, our app will be created as shown in the screenshots below. For security reasons, I have blurred the consumer key and secret values in the screens below.
We need the below four values to authenticate with Twitter.
- Consumer Key (API Key)
- Consumer Secret (API Secret)
- Access Token
- Access Token Secret
On the same screen, we can create access tokens by clicking on 'Create my access token'. The access tokens will appear as shown in the screen below.
Below is the property configuration table for the Twitter Source; mandatory fields are highlighted in bold.

| Property | Default | Description |
| --- | --- | --- |
| **type** | – | Must be org.apache.flume.source.twitter.TwitterSource |
| **consumerKey** | – | OAuth consumer key |
| **consumerSecret** | – | OAuth consumer secret |
| **accessToken** | – | OAuth access token |
| **accessTokenSecret** | – | OAuth token secret |
| maxBatchSize | 1000 | Maximum number of Twitter messages to put in a single batch |
| maxBatchDurationMillis | 1000 | Maximum number of milliseconds to wait before closing a batch |
Creation of Agent in flume.conf
Let's create the agent named TwitterAgent in the flume.conf file under the FLUME_CONF_DIR directory with the below configuration properties. In the Twitter Source setup, we need to specify the access tokens collected from Twitter correctly.
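The exact keys, directories and HDFS path will differ per setup; the following is a minimal sketch of such a flume.conf, with placeholder token values and assumed local directories and an assumed HDFS output path (/user/flume/tweets) that should be adjusted to your environment:

```properties
# Name the components of TwitterAgent
TwitterAgent.sources = Twitter
TwitterAgent.channels = FileChannel
TwitterAgent.sinks = HDFS

# Twitter Source: replace the placeholder tokens with your app's credentials
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <consumer-key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer-secret>
TwitterAgent.sources.Twitter.accessToken = <access-token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access-token-secret>
TwitterAgent.sources.Twitter.maxBatchSize = 1000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 1000
TwitterAgent.sources.Twitter.channels = FileChannel

# File Channel: durable buffering on local disk (assumed directories)
TwitterAgent.channels.FileChannel.type = file
TwitterAgent.channels.FileChannel.checkpointDir = /var/log/flume/checkpoint
TwitterAgent.channels.FileChannel.dataDirs = /var/log/flume/data

# HDFS Sink: write the Avro events into HDFS (assumed path)
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = FileChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
```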
Start The Flume-ng Agent
Make sure Hadoop is set up properly and the Hadoop daemons are running before triggering the flume-ng agent TwitterAgent. Once that is done, we can start the agent with the below command on the terminal.
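Assuming flume-ng is on the PATH and FLUME_CONF_DIR points at the directory containing flume.conf, the agent can be started with a command along these lines (the agent name passed to --name must match the name defined in flume.conf):

```
# Start the TwitterAgent; log to the console to watch events flow in
flume-ng agent \
  --conf "$FLUME_CONF_DIR" \
  --conf-file "$FLUME_CONF_DIR/flume.conf" \
  --name TwitterAgent \
  -Dflume.root.logger=INFO,console
```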
As the time intervals given in the conf file are in terms of minutes, please wait at least 5-10 minutes for a good number of messages to accumulate in the output HDFS files. We can verify these files in the HDFS web UI at http://localhost:50070 .
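The same files can also be listed from the command line; the path shown here is an assumption and should match whatever hdfs.path is configured for the HDFS sink:

```
hdfs dfs -ls /user/flume/tweets
```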
If you are curious to see the raw data in these Avro files, copy them from HDFS to the local file system and browse them with the below command.
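Assuming the Avro Tools jar has been downloaded (the version shown is illustrative), and using a hypothetical output file name, the copy-and-inspect steps look roughly like this:

```
# Copy one of the Avro output files from HDFS to the local file system
hdfs dfs -get /user/flume/tweets/FlumeData.1443358226122 .

# Dump its records as JSON, one tweet per line
java -jar avro-tools-1.7.7.jar tojson FlumeData.1443358226122
```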
Below is sample output from the tojson tool: