In this post, we will walk through the well-known word count example using MapReduce and create a sample Avro data file in the Hadoop Distributed File System (HDFS).
To run the MapReduce word count program given in this post, the avro-mapred-1.7.4-hadoop2.jar file must be present in the $HADOOP_HOME/share/hadoop/common/lib directory. This jar contains the classes used for Avro serialization and deserialization through the MapReduce framework. For instructions on installing Avro and integrating it with Hadoop 2, refer to the post Avro Installation. If the correct version of this jar is not present in the common/lib directory, the job will fail with a series of errors and exceptions, so take care to place the right version of the jar in the right directory of the Hadoop distribution.
It is usually preferable to place this jar in the $HADOOP_HOME/share/hadoop/common/lib directory, because this location is included in the Hadoop classpath by default.
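Since a missing or mismatched jar is the most common source of trouble here, it can help to confirm what is actually in the lib directory before submitting the job. The following is only a minimal sketch: the class AvroJarCheck and its method findAvroMapredJars are hypothetical helpers written for this post, not part of Hadoop or Avro, and the sketch assumes HADOOP_HOME is set in the environment.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for this post; not part of Hadoop or Avro.
class AvroJarCheck {

    // Returns the sorted file names of all avro-mapred jars found directly in libDir.
    // An empty list means the directory is missing or holds no such jar.
    static List<String> findAvroMapredJars(Path libDir) throws IOException {
        List<String> names = new ArrayList<>();
        if (!Files.isDirectory(libDir)) {
            return names;
        }
        try (DirectoryStream<Path> jars =
                 Files.newDirectoryStream(libDir, "avro-mapred-*.jar")) {
            for (Path jar : jars) {
                names.add(jar.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: HADOOP_HOME points at the root of the Hadoop distribution.
        String hadoopHome = System.getenv("HADOOP_HOME");
        if (hadoopHome == null) {
            System.err.println("HADOOP_HOME is not set");
            return;
        }
        Path libDir = Paths.get(hadoopHome, "share", "hadoop", "common", "lib");
        List<String> found = findAvroMapredJars(libDir);
        if (found.isEmpty()) {
            System.out.println("No avro-mapred jar found in " + libDir);
        } else {
            System.out.println("Found in " + libDir + ": " + found);
        }
    }
}
```

If more than one version shows up in the list, that is worth resolving before running the job, since conflicting versions on the classpath can cause the same kind of exceptions as a missing jar. The hadoop classpath command can also be used to print the classpath Hadoop actually uses.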
Avro MapReduce Word Count:
Once Avro is set up in the Hadoop cluster, we can run the MapReduce program below. Copy the code into a file named MapReduceAvroWordCount.java. It is a traditional MapReduce word count program, except that it reads its input from a text file and writes its output to an Avro data file as Avro Pair<CharSequence, Integer> records instead of text.
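Before the Hadoop program itself, here is a rough picture of what the job computes. This is not the MapReduceAvroWordCount.java listing: it is a plain-Java sketch with no Hadoop or Avro dependency, and the class name WordCountSketch and the whitespace-based tokenization are assumptions made for illustration. Each (word, count) entry it produces corresponds to one Pair<CharSequence, Integer> record in the job's Avro output.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration only; no Hadoop or Avro classes are used.
class WordCountSketch {

    // Collapses the map and reduce steps into one pass:
    // split each line on whitespace (the "map" side emits (word, 1)),
    // then sum the ones per word (the "reduce" side).
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by word
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countWords(Arrays.asList("to be or not to be", "to do"));
        // Each entry stands in for one Pair<CharSequence, Integer> record.
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In the real job the counting is distributed across mappers and reducers, and the pairs are serialized with Avro rather than printed as text.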