Data Collection from HTTP Client into HBase

This post provides a proof of concept of data collection from an HTTP client into HBase. We will set up a Flume agent with an HTTP Source, a JDBC Channel and an AsyncHBase Sink.

We first concentrate on the proof of concept of HTTP client data collection into HBase, and at the end of this post we will go into the details of each component used in this agent.

Now let's create our agent, Agent6, in the flume.conf properties file under the <FLUME_HOME>/conf directory.

Data collection from HTTP client into HBase – Flume Agent – HTTP Source, AsyncHBase and JDBC Channel:

Add the configuration properties below to the flume.conf file to create Agent6 with an HTTP source, a JDBC channel and an AsyncHBase sink.
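The original configuration was shown as a screenshot; a flume.conf along the following lines matches the components described. The bind address, port and column family name (cf1) are assumptions, while table_t1 is the table name used later in this post.

```
# Agent6: HTTP source -> JDBC channel -> AsyncHBase sink
Agent6.sources = http-source
Agent6.channels = jdbc-channel
Agent6.sinks = hbase-sink

Agent6.sources.http-source.type = http
Agent6.sources.http-source.bind = localhost
Agent6.sources.http-source.port = 41414
Agent6.sources.http-source.channels = jdbc-channel

Agent6.channels.jdbc-channel.type = jdbc

Agent6.sinks.hbase-sink.type = asynchbase
Agent6.sinks.hbase-sink.table = table_t1
Agent6.sinks.hbase-sink.columnFamily = cf1
Agent6.sinks.hbase-sink.channel = jdbc-channel
```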

Configuration Before Agent Start-up:
  • Start the Hadoop and YARN daemons, then start the HBase daemons. Make sure all daemons have started successfully.
  • In HBase, create the table with the column family specified in the flume.conf file.

Below are the screenshots from the terminal performing the above activities.

(Screenshot: start of all daemons)

(Screenshot: HBase table creation)
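For reference, the table can be created from the HBase shell roughly as follows. The column family name cf1 is an assumption (use whatever column family your flume.conf specifies); table_t1 is the table name used in this post.

```
$ hbase shell
create 'table_t1', 'cf1'
list
```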

  • Create an HTTP client to POST our input file to the HTTP source at the configured hostname and port number.

Flume HTTP Source's default handler is org.apache.flume.source.http.JSONHandler. We need to create the input for this handler in JSON format, as shown below. Further details on this handler and its supported formats are discussed at the bottom of this post. For now, let's put the following lines of text into the input file.
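The actual input file was shown in a screenshot; as an illustrative stand-in, a JSONHandler-compatible input file is a JSON array of events, each with optional headers and a body. The header and body values here are made up:

```
[
  { "headers" : { "host" : "host1" }, "body" : "event body 1" },
  { "headers" : { "host" : "host2" }, "body" : "event body 2" }
]
```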

To create the HTTP client, we use the Java code below. This application can send a JSON document to a remote web server using HTTP POST.
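The original code was attached as a screenshot; below is a minimal stand-in sketch that does the same job using only the JDK (java.net.HttpURLConnection), so it needs no external jars. The class and method names are my own choices, not necessarily those of the original program.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal HTTP client that POSTs a JSON file to the Flume HTTP source.
public class PostFile {

    // POST the given JSON string to the URL; returns the HTTP status code.
    static int postJson(String url, String json) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // JSONHandler expects a JSON content type; charset defaults to UTF-8.
        conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        int code = conn.getResponseCode();
        conn.disconnect();
        return code;
    }

    // Usage: java PostFile http://hostname:port /path/to/input.json
    public static void main(String[] args) throws Exception {
        String json = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        int status = postJson(args[0], json);
        System.out.println("HTTP response code: " + status);
    }
}
```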

Copy this code into a new Java project in Eclipse named PostFile and add the required jars (all org.apache-*.jar files present in the Eclipse plugins folder) to the build path.

(Screenshot: PostFile project in Eclipse)

To add the required jars, right-click on JRE System Library –> Build Path –> Configure Build Path –> Add External JARs –> select all org.apache.*.jar files from the Eclipse plugins folder (usually /opt/eclipse/plugins) as shown below.

(Screenshot: adding external JARs)

Now compile the program so it is ready to run.

Start the Agent:

Once we have confirmed that all the above configuration is in place and there are no issues, we are ready to start the agent; otherwise we will end up with undesired error messages or Java exceptions.

Start the agent with the command below.
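The exact command was shown in a screenshot; it would typically look something like this (paths and logger options per your installation):

```
$ flume-ng agent --conf $FLUME_HOME/conf \
    --conf-file $FLUME_HOME/conf/flume.conf \
    --name Agent6 -Dflume.root.logger=INFO,console
```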

(Screenshot: starting Agent6)

Run the HTTP client from Eclipse (run the Java program with arguments in the format below).

In Eclipse, go to Run –> Run Configurations –> Main (specify the project name and PostFile as the main class), then in the Arguments tab specify the remote address and file path as arguments, as shown in the screenshot below.

(Screenshot: HTTP client run configuration)

Run this and verify the output in the HBase table, but do not stop the Flume agent after verifying the output. We will keep it running to test table increments.

Verify the Output:

Verify the output of the table_t1 table in HBase. As shown in the screenshot below, we can see table_t1 with 3 rows added to it (of these three rows, 2 carry data and 1 is the incRow used for table increments).

(Screenshot: table_t1 output in HBase)
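The same check can be made from the HBase shell:

```
$ hbase shell
scan 'table_t1'
```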

So, we have successfully configured the Flume agent, transferred a file from the HTTP client through an HTTP POST request, and received the same events in the AsyncHBase sink.

Let's test whether HBase table increments work with this agent.

Feed Another File through HTTP Client to check HBase Table Increments:

As we didn't stop our Flume agent, let's run the HTTP client with another input file, input2.json, whose contents are shown below.
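The file contents were shown in a screenshot; as an illustrative stand-in, input2.json would again be a JSON array of events (the values here are made up):

```
[
  { "headers" : { "host" : "host3" }, "body" : "event body 3" },
  { "headers" : { "host" : "host4" }, "body" : "event body 4" }
]
```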

Run the HTTP client from Eclipse as shown in the screenshot below.

(Screenshot: second HTTP client run)

Now verify the contents of table_t1 in HBase; we will be able to see the new events added to the table as new rows, as the screenshot below confirms.

(Screenshot: table increments)

We can also observe that the incRow value has now been incremented by 2, from \x02 to \x04.

This confirms that HBase table increments also work with our agent setup.

The sections below provide broader details of the HTTP Source, JSON Handler and AsyncHBase Sink components.

Details of Components:

HTTP Source:

This source accepts Flume events via HTTP POST and GET. HTTP requests are converted into Flume events by a pluggable "handler" which must implement the HTTPSourceHandler interface. This handler takes an HttpServletRequest and returns a list of Flume events. All events sent in one POST request are considered one batch and are inserted into the channel in one transaction.

Below is the property table for HTTP Source and mandatory properties are in bold.

Property Name   Default                                    Description
type            –                                          The component type name; must be http
port            –                                          The port the source should bind to
bind            –                                          The hostname or IP address to listen on
handler         org.apache.flume.source.http.JSONHandler   Handler class implementing HTTPSourceHandler
enableSSL       false                                      Set to true to enable SSL

JSON Handler:

It can handle events represented in JSON format, and supports the UTF-8, UTF-16 and UTF-32 character sets. The handler accepts an array of events (even a single event has to be sent in an array) and converts them to Flume events based on the encoding specified in the request. That is why we provide all our input data in [] brackets, to represent the array. If no encoding is specified, UTF-8 is assumed.

To set the charset, the request must have content type specified as application/json; charset=UTF-8 (replace UTF-8 with UTF-16 or UTF-32 as required).

AsyncHBase Sink:

This sink is similar to the HBase sink: it takes events from the channel and writes them to HBase. The only difference from HBaseSink is that AsyncHBaseSink uses an asynchronous API internally and is likely to perform better. An AsyncHbaseEventSerializer is used to convert the events into HBase puts or increments.

Below is the property table for AsyncHBaseSink and mandatory properties are in bold.

Property Name       Default                                                       Description
type                –                                                             The component type name; must be asynchbase
table               –                                                             The name of the table in HBase to write to
zookeeperQuorum     –                                                             The value of hbase.zookeeper.quorum in hbase-site.xml
znodeParent         /hbase                                                        The base path for the znode for the -ROOT- region; value of zookeeper.znode.parent in hbase-site.xml
columnFamily        –                                                             The column family in HBase to write to
batchSize           100                                                           Number of events to be written per transaction
coalesceIncrements  false                                                         Whether the sink should coalesce multiple increments to a cell per batch
timeout             60000                                                         Time in milliseconds the sink waits for ACKs from HBase for all events in a transaction
serializer          org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer   The event serializer class

In this post we can optionally use the curl utility as well. For example, POST a JSON document via:

curl -X POST -H 'Content-Type: application/json; charset=UTF-8' -d '[{ "headers" : { "ip" : "", "host" : "" }, "body" : "random_body" }, { "headers" : { "ip" : "", "host" : "" }, "body" : "really_random_body" }]' http://hostname:port

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.


7 thoughts on "Data Collection from HTTP Client into HBase"

  • sandeep

    i want to pull data from a http source to hdfs:

    My Source URL is as below:

    source url:

    I need to extract json from the above URL to hdfs

    ########## NEW AGENT ##########
    # flume-ng agent -f /etc/flume/conf/flume.httptest.conf -n httpagent

    # slagent = SysLogAgent
    httpagent.sources = http-source
    httpagent.sinks = hdfs
    httpagent.channels = ch3

    # Define / Configure Source (multiport seems to support newer "stuff")
    httpagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
    #httpagent.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
    httpagent.sources.http-source.channels = ch3
    httpagent.sources.http-source.port = 8989
    #httpagent.sources.http-source.bind = localhost
    #httpagent.sources.http-source.url =

    # Local File Sink
    httpagent.sinks.local-file-sink.type = hdfs
    httpagent.sinks.local-file-sink.channel = ch3
    httpagent.sinks.local-file-sink.hdfs.path = hdfs://localhost:9000/user/training/flumefolder
    httpagent.sinks.local-file-sink.rollInterval = 5

    # Channels
    httpagent.channels.ch3.type = memory
    httpagent.channels.ch3.capacity = 100

    Please let me know in case I need to make any corrections in the agent.


  • kumar

    Hi Siva ,
    my use case is to collect the WebSphere logs and store them in HDFS for reporting. Does the Flume agent need to be installed on every WebSphere server box from which I need to pull the logs, or only on one server where I give the webserver box address and the HDFS address for each Flume agent?
    Sorry, it may be a very silly question but I am new to Hadoop.


  • kumae

    Hi Siva,

    my use case is to collect the webserver logs from 5 nodes and collect data from one MQ queue. In this case do I need to install Flume on only one server, or on all the webserver nodes? Can you please suggest? I am new to Hadoop, please help me.

  • Saurabh

    Hi Siva

    I have a similar application where, instead of JSON data, I am receiving whole files (zip files), and I am using the BLOB handler and an HDFS sink.

    It works fine, but I want to distribute the load between different channels. Is it possible to do so using some kind of load balancer or something?