Avro MapReduce Word Count Example


In this post, we will walk through the famous word count example in MapReduce and use it to create a sample Avro data file in the Hadoop Distributed File System (HDFS).

Prerequisite:

In order to execute the MapReduce word count program given in this post, the avro-mapred-1.7.4-hadoop2.jar file needs to be present in the $HADOOP_HOME/share/hadoop/common/lib directory. This jar contains the classes used for Avro serialization and deserialization through the MapReduce framework. For instructions on installing and integrating Avro with Hadoop 2, refer to the post Avro Installation. If the wrong version of this jar file is present in the common/lib directory, we will end up with a lot of errors/exceptions, so we need to be careful to place the right version of this jar file into the right directory of the Hadoop distribution.

It is usually preferable to place this jar in the $HADOOP_HOME/share/hadoop/common/lib directory, as this location is included in the Hadoop classpath by default.

Avro MapReduce Word Count:

After setting up Avro in the Hadoop cluster, we can run the MapReduce program below. Copy the code snippet into a file named MapReduceAvroWordCount.java. It is a traditional MapReduce word count program, except that it reads its input from a plain text file and writes its output to an Avro data file as Avro Pair<CharSequence, Integer> records instead of text.
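The snippet below is a minimal sketch of such a program, closely following the standard Avro word count example that ships with avro-mapred; the mapper and reducer class names (TokenizeMapper, SumReducer) are illustrative choices, not fixed API names.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.Pair;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceAvroWordCount {

  // Plain text mapper: emits (word, 1) for every token in the input line.
  public static class TokenizeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts and writes each (word, count) as a single Avro
  // Pair<CharSequence, Integer> record in the output key. The output value
  // is NullWritable because the key already carries both fields.
  public static class SumReducer
      extends Reducer<Text, IntWritable,
                      AvroKey<Pair<CharSequence, Integer>>, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(
          new AvroKey<Pair<CharSequence, Integer>>(
              new Pair<CharSequence, Integer>(key.toString(), sum)),
          NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MapReduceAvroWordCount <input> <output>");
      System.exit(-1);
    }

    Job job = Job.getInstance(new Configuration(), "AvroWordCount");
    job.setJarByClass(MapReduceAvroWordCount.class);

    // Input is ordinary text.
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(TokenizeMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Output is an Avro container file of Pair<string, int> records.
    job.setReducerClass(SumReducer.class);
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    AvroJob.setOutputKeySchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING),
                           Schema.create(Schema.Type.INT)));
    job.setOutputValueClass(NullWritable.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}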

In the above code we use the AvroWrapper class (through its AvroKey subclass) to write pairs of <String, Integer> values, and this pair is carried in the reducer's output key. The reducer's output value is kept as NullWritable because there is no need for it: both the string and its count are already included in the reducer's output key.

  • Compile this program by providing a classpath argument containing the paths to the avro-mapred-1.7.4-hadoop2.jar and avro-1.7.4.jar files, and build a jar file from the generated classes.

[Screenshot: MR compilation]

In the above screenshot, the compiled classes are stored in the avromr directory, and the same directory is used to build the avromr.jar file; an illustrative set of commands is given below.
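Assuming the two Avro jars sit under $HADOOP_HOME/share/hadoop/common/lib and the source file is in the current directory, the compile-and-package step might look like this (the avromr directory and avromr.jar names match the screenshot above):

mkdir -p avromr
javac -classpath `hadoop classpath`:$HADOOP_HOME/share/hadoop/common/lib/avro-mapred-1.7.4-hadoop2.jar:$HADOOP_HOME/share/hadoop/common/lib/avro-1.7.4.jar \
      -d avromr MapReduceAvroWordCount.java
jar -cvf avromr.jar -C avromr .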

  • Create a sample input file and copy it to HDFS, for example as shown below.
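The file name wordcount.txt, its contents, and the HDFS path /in below are hypothetical; any small text file will do:

echo "hadoop avro mapreduce
avro hadoop
hadoop" > wordcount.txt
hdfs dfs -mkdir -p /in
hdfs dfs -put wordcount.txt /in/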

  • Run the MapReduce program with the command below.
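Assuming the jar, class, and HDFS paths from the previous steps, the invocation takes this form (the output path /out/avrowc is again illustrative):

hadoop jar avromr.jar MapReduceAvroWordCount /in/wordcount.txt /out/avrowc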

[Screenshot: avro mapreduce wc run]

  • Verify the output of the above MapReduce job.
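The output directory holds an Avro container file. One way to inspect it is to pull the part file to the local file system and dump it as JSON with the avro-tools jar (downloaded separately; the part file name may differ on your cluster). Each record prints as a JSON object with key and value fields, matching the Pair schema:

hdfs dfs -ls /out/avrowc
hdfs dfs -get /out/avrowc/part-r-00000.avro
java -jar avro-tools-1.7.4.jar tojson part-r-00000.avro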

[Screenshot: avro mr job output]

We can observe the key-value pairs in the Avro output file: the key part is a string and the value part is an integer.

So, we have successfully run the classic MapReduce word count program with Avro-format output in the Hadoop environment.

