Merging Small Files Into Avro File


This post is a continuation of the previous post on the small files issue. In the previous post we merged a huge number of small files in an HDFS directory into a SequenceFile; in this post we will merge a huge number of small files on the local file system into an Avro file in an HDFS output directory.

We will store the file names as keys and file contents as values in Avro file. We will use the below Avro Schema to store the files.

Merging Small Files Into Avro File:

In the below program we parse the above schema and write each small file into the Avro file according to that schema. We also use the Snappy codec to compress the Avro data file. For each file in the input directory, we create a new Avro record. We also print the MD5 hash value of each file on the console, so that the written files can be verified.
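The MD5 check described above can be tried on its own, without Avro or Hadoop on the classpath. The following is only a minimal sketch using the Java standard library, not the program from this post: the class name Md5Check and the helper names md5Hex and md5OfFiles are hypothetical, and it simply assumes the input is a flat local directory of regular files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class (not part of the original program).
class Md5Check {

    // Returns the lowercase hex MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Maps each regular file name in dir to the MD5 of its contents.
    // Assumption: files are small enough to be read fully into memory.
    static Map<String, String> md5OfFiles(Path dir) throws IOException {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    result.put(file.getFileName().toString(),
                               md5Hex(Files.readAllBytes(file)));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        md5OfFiles(dir).forEach(
            (name, hash) -> System.out.println(name + " " + hash));
    }
}
```

Running this against the local input directory gives one hash per file name, which can then be compared with the hashes printed while the files are written to the Avro file.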

Compile this program and build the SmallFiles.jar file, which we will use below to run it. Below is the local input directory structure we are using for this run.

Local Input directory

Run the below command.

SmallFilesToAvroFile

Let's verify the contents of the output /out/avro file with the help of the tojson tool. We used tail -n1 to display only the last file's contents, which keeps the console output short.

JSON View of avro file



About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
