Merging Small Files Into Avro File


This post continues the previous post on the small files problem. In the previous post we merged a large number of small files in an HDFS directory into a SequenceFile. In this post we will merge a large number of small files on the local file system into an Avro file in an HDFS output directory.

We will store the file names as keys and the file contents as values in the Avro file. We will use the Avro schema below to store the files.

Merging Small Files Into Avro File:

In the program below, we parse the above schema and write each small file into the Avro file according to that schema. We also use the Snappy codec to compress the Avro data file. For each file in the input directory, we create a new Avro record. We also print the MD5 hash of each file to the console so that the written files can be verified.
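
The MD5 verification step can be tried on its own, without Avro or Hadoop on the classpath. The following is a minimal sketch using only the Java standard library; the class name FileMd5Lister and its methods md5Hex and md5ForDirectory are hypothetical names chosen for illustration and are not taken from the original program. It covers only the hashing of each file in a local input directory, not the Avro writing or the Snappy compression.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: lists the MD5 hash of every regular file in a directory.
public class FileMd5Lister {

    // Returns the MD5 digest of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException(e);
        }
    }

    // Maps each file name in dir (non-recursive) to the MD5 hash of its contents.
    // Each file is read fully into memory, which suits small files only.
    static Map<String, String> md5ForDirectory(Path dir) throws IOException {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    result.put(p.getFileName().toString(), md5Hex(Files.readAllBytes(p)));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the input directory is passed as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        md5ForDirectory(dir).forEach((name, hash) -> System.out.println(name + "  " + hash));
    }
}
```

Running the same hashing over the contents read back from the Avro file and comparing the two listings is one way to confirm that every file was written intact.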