HAR Files – Hadoop Archive Files



Hadoop archive files, or HAR files, are a facility to pack HDFS files into archives. They are a good option for storing a large number of small files in HDFS, because storing many small files directly in HDFS is inefficient: each file, however small, consumes its own metadata entry on the NameNode.

The advantage of HAR files is that they can be used directly as input files to MapReduce jobs.

HAR Files Creation

Hadoop archive files are created with the hadoop archive command.

Here the NAME should have the .har extension, the src paths are given relative to the parent path, and dest is relative to the user's home directory.
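A typical invocation, assuming an illustrative user directory /user/hadoop containing a subdirectory dir1 to be archived, looks like this:

```shell
# Archive everything under /user/hadoop/dir1 into foo.har,
# placing the archive in /user/hadoop.
# -p gives the parent path; dir1 is the src relative to it;
# the final argument is the dest directory.
hadoop archive -archiveName foo.har -p /user/hadoop dir1 /user/hadoop
```

Archive creation runs as a MapReduce job, so it should be launched on a node with access to the cluster.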

The destination directory of the archive contains index files and one or more part files.
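A HAR file is really a directory in HDFS; listing it (the path /user/hadoop/foo.har is illustrative) typically shows the index and part files:

```shell
# A HAR archive is a directory containing master index,
# index and part files (plus _SUCCESS from the MapReduce job
# that created it).
hdfs dfs -ls /user/hadoop/foo.har
#   /user/hadoop/foo.har/_SUCCESS
#   /user/hadoop/foo.har/_index
#   /user/hadoop/foo.har/_masterindex
#   /user/hadoop/foo.har/part-0
```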

The part files contain the contents of the original files concatenated together, while the index files record the offset and length of each original file within the part files.

For example, after archiving the files merged.txt and user.avsc, part-0 holds the data of both files concatenated together.
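The original files remain addressable individually through the har:// URI scheme; a sketch, reusing the illustrative foo.har archive of dir1 from above:

```shell
# List the original files inside the archive via the har:// scheme
# (har:/// resolves against the default file system)
hdfs dfs -ls har:///user/hadoop/foo.har/dir1

# Read one archived file back out
hdfs dfs -cat har:///user/hadoop/foo.har/dir1/merged.txt
```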

HAR Files Deletion:

To delete a HAR file, we need to use the recursive form of remove, because the archive is stored as a directory in HDFS.
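With the illustrative archive path from earlier, deletion looks like:

```shell
# Recursive remove: deletes the archive directory and the
# index/part files inside it
hdfs dfs -rm -r /user/hadoop/foo.har
```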

Limitations of HAR Files:
  • Creating a HAR file makes a copy of the original files, so we need as much additional disk space as the size of the files being archived. The original files can be deleted after the archive is created to reclaim that space.
  • Archives are immutable. Once an archive is created, files cannot be added to it or removed from it; the archive must be re-created.
  • HAR files can be used as input to MapReduce, but there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split. Processing lots of small files, even inside a HAR file, therefore still requires lots of map tasks, which is inefficient.
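Because the har:// scheme is understood by the FileSystem layer, an archive path can be passed straight to a MapReduce job as input; a sketch using the stock wordcount example (jar name and paths are illustrative):

```shell
# Run wordcount over the archived dir1; each archived file still
# becomes at least one split, hence the small-files caveat above
hadoop jar hadoop-mapreduce-examples.jar wordcount \
    har:///user/hadoop/foo.har/dir1 /user/hadoop/wc-out
```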


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved with several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
