Hadoop Archive Files
Hadoop archive files, or HAR files, are a facility for packing HDFS files into archives. They are a good option for storing a large number of small files in HDFS, because storing many small files directly in HDFS is inefficient: each file, however small, occupies an entry in the NameNode's memory.
An added advantage of HAR files is that they can be used directly as input files in MapReduce jobs.
HAR Files Creation
A Hadoop archive can be created with the command below:
$ hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Here NAME must end with the .har extension, the src paths are given relative to the parent path, and dest is resolved relative to the user's home directory.
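As a concrete sketch (the paths and names here are hypothetical), archiving a directory of small files might look like this; the call is guarded so the snippet is safe to run on a machine without Hadoop installed:

```shell
# Illustrative values only -- adjust to your cluster layout.
NAME="data.har"                # archive name; must end in .har
PARENT="/user/hadoop1"         # -p: common parent of the sources
SRC="input"                    # source directory, relative to PARENT
DEST="archive"                 # destination, relative to the user directory

# 'hadoop archive' launches a MapReduce job, so a running cluster is required.
if command -v hadoop >/dev/null 2>&1; then
  hadoop archive -archiveName "$NAME" -p "$PARENT" "$SRC" "$DEST"
fi
```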
An archive contains index files and one or more part files. The part files contain the contents of the original files concatenated together, and the index files record the offset and length of each original file within the part files. For example, if merged.txt and user.avsc are archived, part-0 contains the data of both files concatenated.
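The archived files remain individually addressable through the har:// URI scheme, so their original names and contents can still be listed and read (the archive path below is hypothetical):

```shell
HAR="har:///user/hadoop1/archive/test.har"   # hypothetical archive location

if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls "$HAR"               # list the original file names
  hadoop fs -cat "$HAR/merged.txt"   # read one archived file's contents
fi
```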
HAR Files Deletion
To delete a HAR file, use the recursive form of remove, as shown below (on newer Hadoop releases, hadoop fs -rm -r replaces the deprecated -rmr):
$ hadoop fs -rmr /user/hadoop1/archive/test.har
Limitations of HAR Files
- Creating a HAR file makes a copy of the original files, so we need as much additional disk space as the total size of the files being archived. The originals can be deleted after the archive is created to reclaim that space.
- Archives are immutable. Once an archive is created, adding or removing files requires re-creating the archive.
- HAR files can be used as input to MapReduce, but there is no archive-aware InputFormat that can pack multiple archived files into a single split, so processing lots of small files, even inside a HAR file, still requires many map tasks, which is inefficient.
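To illustrate the MapReduce point above: an archive can be fed to a job simply by passing its har:/// path as the input, but the job still gets at least one map task per archived file. The jar path and HDFS paths below are hypothetical:

```shell
IN="har:///user/hadoop1/archive/test.har"   # archived input (hypothetical)
OUT="/user/hadoop1/wordcount-out"           # job output directory

if command -v hadoop >/dev/null 2>&1; then
  # Run the stock wordcount example over every file inside the archive.
  hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount "$IN" "$OUT"
fi
```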