Merging Small Files into SequenceFile


In this post, we will discuss one of the well-known use cases of SequenceFiles: merging a large number of small files into a single SequenceFile. This requirement arises mainly because Hadoop and MapReduce cannot efficiently process a large number of small files.

Need For Merging Small Files:

Hadoop stores all HDFS file metadata in the namenode's main memory (which is limited) for fast metadata retrieval, so Hadoop is better suited to storing a small number of large files than a huge number of small files. Below are the two main disadvantages of maintaining small files in Hadoop.

Below is a sample calculation of namenode main memory usage.

On average, each file occupies about 600 bytes of space in memory. Suppose we need to store 1 billion files of 100 KB each: that requires about 600 GB of main memory on the namenode and about 100 TB of total storage. If we instead merge these files into files of 100 MB each, we end up with about 1 million files, and roughly 600 MB of main memory is sufficient. Maintaining 600 MB of metadata in main memory is far easier than maintaining 600 GB. So, we need to merge small files into large files.
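Spelling out the arithmetic behind these figures:

1,000,000,000 files x 600 bytes/file  = 600 GB of namenode memory
1,000,000,000 files x 100 KB/file    ~= 100 TB of raw data
100 TB / (100 MB per merged file)    ~= 1,000,000 files
1,000,000 files x 600 bytes/file     = 600 MB of namenode memory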

One more disadvantage of maintaining small files, from the MapReduce perspective, is that processing them requires 1 billion map tasks for 1 billion files of 100 KB each, since each file is processed as a separate input split in a MapReduce job. If we merge them into files of 100 MB each, with a block size of 128 MB, then we need only about 1 million map tasks. So, if we merge small files into large files, the MapReduce job can be completed much more quickly.

So, it is not optional but mandatory to convert a huge number of small files into a smaller number of large files.

Solutions:

In order to solve both of the problems mentioned above, we need a file format that can store both the file name and the file contents in a single file, and it would be a great value-add if it also supports splitting and compression, so that these files can be processed efficiently in MapReduce jobs.

We will discuss two possible solutions for this:

  • Merging Small Files into SequenceFile
  • Merging Small Files into Avro File

Both SequenceFile and Avro files support splitting and compression. In this post, we will discuss the first technique, merging small files into a SequenceFile; in the next post we will provide details on merging small files into an Avro file.

Merging Small Files Into SequenceFile:

We will merge small files into a SequenceFile with the help of custom record reader and custom input format classes, built by extending the InputFormat and RecordReader classes from the Hadoop API.

We will process each file's contents as a single record, and we need the two classes below to process a full file as one record. In this FullFileInputFormat, keys are not needed and only the contents are needed. So, keys are given as NullWritable and values as BytesWritable.

In order to prevent file splitting, we override the isSplitable() method and return false.

And we override createRecordReader() to return a custom record reader instance, as shown in the sketch below.
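Here is a minimal sketch of what FullFileInputFormat can look like, using the class names from the post and the standard org.apache.hadoop.mapreduce API:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FullFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Each small file must be read as exactly one record, so splitting is disabled.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FullFileRecordReader reader = new FullFileRecordReader();
        reader.initialize(split, context);
        return reader;
    }
}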

Below is a custom FullFileRecordReader implementation. In the nextKeyValue() method, we open the file, create a byte array whose length is the length of the file, and use the Hadoop IOUtils class to copy the entire file contents into the byte array.
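A sketch of that record reader, assuming the new mapreduce API; only nextKeyValue() does real work, since the whole file is a single record:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FullFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (!processed) {
            // Allocate a byte array as long as the file and read the whole file into it.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
        // Nothing to close; the input stream is closed inside nextKeyValue().
    }
}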

Below is the conversion program, SmallFilesToSequenceFile, which makes use of the above FullFileInputFormat class and merges small files from an input HDFS directory into a sequence file in an output HDFS directory. We store the file names as keys and the file contents as values in the SequenceFile (key, value) pairs.

Here we use a single reducer to produce a single output sequence file, since we are testing this program on only a few files, and we rely on an identity reducer, which copies the map output directly to the output partition files.
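A sketch of the driver and its mapper, under the assumption that the mapper turns each whole-file record into a (file path, file contents) pair, and that the default Reducer (which is an identity reducer in the new API) writes them out:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SmallFilesToSequenceFile extends Configured implements Tool {

    static class SequenceFileMapper
            extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

        private Text filenameKey;

        @Override
        protected void setup(Context context) {
            // Use the input file's path as the key for its contents.
            InputSplit split = context.getInputSplit();
            Path path = ((FileSplit) split).getPath();
            filenameKey = new Text(path.toString());
        }

        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(filenameKey, value);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "SmallFilesToSequenceFile");
        job.setJarByClass(SmallFilesToSequenceFile.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setInputFormatClass(FullFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setMapperClass(SequenceFileMapper.class);

        // No reducer class is set: the default Reducer acts as an identity
        // reducer, and a single reduce task yields a single sequence file.
        job.setNumReduceTasks(1);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new SmallFilesToSequenceFile(), args));
    }
}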

We compile these three programs from the example package and build a jar file (SmallFiles.jar) from Eclipse.

We run the above program against the input HDFS path below, which contains 9 small files.

(Screenshot: listing of the input HDFS directory showing the 9 small files.)

We run this program with the command below.
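The exact invocation depends on the package name of the driver class; assuming the jar name, driver class, and HDFS paths used in this post, it looks like:

hadoop jar SmallFiles.jar SmallFilesToSequenceFile /in/ /out/seq/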

(Screenshot: the SmallFilesToSequenceFile job run and the resulting output sequence file.)

Below is a sample program to extract the keys from the SequenceFile, to verify the names of the input files present in the combined sequence file created by the above job run.
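A minimal sketch of such a key extractor; the class name SequenceFileKeyExtractor is assumed here for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileKeyExtractor {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. the job's output sequence file

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // Print only the keys, i.e. the names of the original small files.
            while (reader.next(key, value)) {
                System.out.println(key);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}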

We compile this program and add it to the SmallFiles.jar file, then run it on the output sequence file to verify the file names present in it.

(Screenshot: SequenceFile key extractor output listing the 9 input file names.)

So we have successfully merged all 9 input files present in the /in/ HDFS directory into the /out/seq/part-r-00000 sequence file.



About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.



7 thoughts on “Merging Small Files into SequenceFile”

  • Aditya

    Hi,

    For the above example, I want to know how many mappers are working. Is it 9, i.e. number of mappers = number of input splits? Or is it one mapper processing all the files, treating each file as a single record?

    Please clarify.

    Thanks,
    Aditya Kumar

      • Aditya

        Hi Siva,
        Thanks for the reply. If the number of mappers is equal to the number of input splits then, for example, if we want to combine 1000 small files into one sequence file, 1000 mappers will be instantiated, which is very slow.
        What is the use of CombineFileInputFormat? And what benefit do we get by reading the whole file as a single record?
        Please explain.

