Avro MapReduce 2 API Example


Avro provides support for both the old MapReduce API (org.apache.hadoop.mapred package) and the new MapReduce API (org.apache.hadoop.mapreduce package). Avro data can be used as both input to and output from a MapReduce job, as well as the intermediate format.

In this post we will walk through an example run of the Avro MapReduce 2 API. It can be treated as a continuation of the previous post on the Avro MapReduce API. We will create a sample schema, generate an Avro data file with Ruby, and run a MapReduce program that counts the colors in that file.

Create a Schema:

For testing Avro data files via MapReduce, we will create a sample schema as shown below. Copy it into a sample.avsc file.
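The field names below are an assumption, modeled on the User schema from the Avro getting-started guide; the only field the color-count job actually relies on is a nullable favorite_color:

{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}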

Generate Avro Data:

Generate some sample Avro records into a samplecolors.avro file, conforming to the above schema, using the Ruby code below.
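A minimal sketch using the Ruby avro gem; the record values are made-up sample data, and two of them deliberately leave favorite_color empty (nil) to match the output discussed later:

require 'avro'

# Parse the schema created above.
schema = Avro::Schema.parse(File.read('sample.avsc'))

# Open an Avro container file and write a few sample records.
file   = File.open('samplecolors.avro', 'wb')
writer = Avro::IO::DatumWriter.new(schema)
dw     = Avro::DataFile::Writer.new(file, writer, schema)

dw << {'name' => 'alice',   'favorite_number' => 256, 'favorite_color' => 'blue'}
dw << {'name' => 'bob',     'favorite_number' => 7,   'favorite_color' => 'red'}
dw << {'name' => 'charlie', 'favorite_number' => 12,  'favorite_color' => 'blue'}
dw << {'name' => 'dave',    'favorite_number' => 3,   'favorite_color' => nil}
dw << {'name' => 'eve',     'favorite_number' => 9,   'favorite_color' => nil}

dw.close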

Run the above Ruby program from the command terminal to generate the samplecolors.avro file containing Avro records, as shown in the screenshot below.

[Screenshot: generating the samplecolors.avro data file]

MapReduce Color Count Example:

Below is a sample MapReduce program to count the colors in the above Avro data file. Copy the code snippet into a MapReduceColorCount.java file.
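A sketch along the lines of the Avro documentation's MapReduceColorCount example, reading records as GenericRecord; the favorite_color field name and the example package come from the assumed schema above:

package example;

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapReduceColorCount extends Configured implements Tool {

  // Mapper: reads each record generically and emits (color, 1).
  public static class ColorCountMapper
      extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, IntWritable> {

    @Override
    public void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
        throws IOException, InterruptedException {
      CharSequence color = (CharSequence) key.datum().get("favorite_color");
      if (color == null) {
        color = "none";   // records with no color are grouped under "none"
      }
      context.write(new Text(color.toString()), new IntWritable(1));
    }
  }

  // Reducer: sums the counts and writes an Avro (color, count) pair.
  public static class ColorCountReducer
      extends Reducer<Text, IntWritable, AvroKey<CharSequence>, AvroValue<Integer>> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(new AvroKey<CharSequence>(key.toString()), new AvroValue<Integer>(sum));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MapReduceColorCount <input path> <output path>");
      return -1;
    }

    Job job = Job.getInstance(getConf(), "Color Count");
    job.setJarByClass(MapReduceColorCount.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Input is an Avro container file; with GenericRecord the writer schema
    // embedded in the file is used, so no reader schema is set here.
    job.setInputFormatClass(AvroKeyInputFormat.class);
    job.setMapperClass(ColorCountMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Output is an Avro key/value container file of (string color, int count).
    job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
    job.setReducerClass(ColorCountReducer.class);
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT));

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MapReduceColorCount(), args));
  }
}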

In the above program, we used the GenericRecord class to read records against the schema embedded in the input Avro data file (i.e., without code generation), instead of generating a specific Java class for the schema.

If we want to use code generation, we need to use the generated specific class for the schema in the Mapper class. In the above program, the ColorCountMapper class changes as shown below, and we need to import the package containing the generated schema class. The rest of the program can be used without any changes.
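For example, assuming the schema above was compiled into a specific class example.avro.User (the package and class names are assumptions), ColorCountMapper would change roughly as follows:

import example.avro.User;  // generated by the Avro compiler from sample.avsc

public static class ColorCountMapper
    extends Mapper<AvroKey<User>, NullWritable, Text, IntWritable> {

  @Override
  public void map(AvroKey<User> key, NullWritable value, Context context)
      throws IOException, InterruptedException {
    CharSequence color = key.datum().getFavoriteColor();
    if (color == null) {
      color = "none";
    }
    context.write(new Text(color.toString()), new IntWritable(1));
  }
}

In the driver, the reader schema can then be set with AvroJob.setInputKeySchema(job, User.getClassSchema()).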

  • Compile the above program and build the jar file with the commands below.
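A sketch of those commands; the Avro jar names and version 1.7.7 are assumptions, so use the jars that ship with your cluster:

$ mkdir classes
$ javac -cp $(hadoop classpath):avro-1.7.7.jar:avro-mapred-1.7.7-hadoop2.jar -d classes MapReduceColorCount.java
$ jar cvf colorcount.jar -C classes/ .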

[Screenshot: compiling the program and building the jar]

  • Copy the samplecolors.avro file created in the above section into an HDFS input directory and run the MapReduce program with the hadoop jar command, as shown below.
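For example (the /in/avro input directory and the jar versions are assumptions; /out/mrwc is the output directory used below):

$ hadoop fs -mkdir -p /in/avro
$ hadoop fs -put samplecolors.avro /in/avro/
# make the Avro MapReduce classes available to the client and the tasks
$ export HADOOP_CLASSPATH=avro-mapred-1.7.7-hadoop2.jar
$ hadoop jar colorcount.jar example.MapReduceColorCount \
      -libjars avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar \
      /in/avro /out/mrwc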

[Screenshot: running the MapReduce job]

  • Verify the output of the MapReduce job in the /out/mrwc directory, as shown below.
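For example:

$ hadoop fs -ls /out/mrwc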

[Screenshot: MapReduce job output]

  • Copy the part-r-00000 file to the local file system and use the avro-tools tojson command to view the Avro output in JSON format, as shown below.
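A sketch of those two steps, assuming the avro-tools jar is available locally (the version is an assumption):

$ hadoop fs -get /out/mrwc/part-r-00000 .
$ java -jar avro-tools-1.7.7.jar tojson part-r-00000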

[Screenshot: output in JSON format]

We can verify that the color counts correctly match the input data. There are two records with no color in the samplecolors.avro file, and the same is reflected in the output file.

In the above example, we have used the following classes from the org.apache.avro.mapreduce and org.apache.avro.mapred packages.

  • AvroKey<T>: the wrapper of keys for jobs configured with AvroJob.
  • AvroValue<T>: the wrapper of values for jobs configured with AvroJob.
  • AvroJob: utility methods for configuring jobs that work with Avro.
  • AvroKeyInputFormat<T>: a MapReduce InputFormat that can handle Avro container files.
  • AvroKeyValueOutputFormat<K,V>: a FileOutputFormat for writing Avro container files of key/value pairs.
