HDFS Distributed File Copy Tool – distcp


HDFS Distributed File Copy

Hadoop provides the HDFS Distributed File Copy (distcp) tool for copying large volumes of files within a single HDFS cluster or between HDFS clusters.

It is implemented on top of the MapReduce framework: it submits a map-only MapReduce job to parallelize the copy process. This tool is typically used to copy files between clusters, for example from production to development environments.

It supports a number of command options that control the copy. The most useful of these are listed below.

The basic syntax for using this command is shown below (the namenode host names, ports and paths are placeholders; substitute your own):
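    # namenode host names, ports and paths here are placeholders -- substitute your own
    hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/destination/path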

This command can be run from the source machine/environment.

In this post, parallel copying within the same cluster is described; in that case the hdfs://namenode prefix can be omitted from the paths.

1. First, check the source and destination directory structure before copying (the commands are sketched after this list).

2. Trigger the distcp command (a representative invocation is sketched after this list).

3. The entire source directory /sample is copied into the destination directory /example, resulting in the directory structure /example/sample.
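A minimal sketch of these three steps, assuming the /sample source and /example destination from this walkthrough and a copy within the same cluster (the listings are commands to inspect the directories, not captured output):

    # 1. Inspect the source and destination directory structure before copying
    hdfs dfs -ls -R /sample
    hdfs dfs -ls -R /example

    # 2. Trigger distcp within the same cluster (no hdfs://namenode prefix needed)
    hadoop distcp /sample /example

    # 3. Verify the result: the source directory now appears as /example/sample
    hdfs dfs -ls -R /example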

Some of the most frequently useful command options are listed below (a couple of combined examples are sketched after the list). None of them is mandatory; they are all optional.

i).   -atomic:  Either all changes are committed at once or none are committed. This ensures that no partial copy is left behind: either all files are copied entirely or no file is copied.

ii).  -overwrite:  By default, distcp skips copying files that already exist in the destination directory; this option overwrites them unconditionally.

iii). -update:  Copies only missing or changed files. This option is very helpful and minimizes copy time by transferring only the missing/updated files instead of all the source files.

iv).  -m <arg>:  Lets the user specify the maximum number of mappers to be used.

v).  -delete:  Deletes files that exist in the destination directory but not in the source directory.
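For instance, here is a minimal sketch of combining these options, assuming placeholder /source and /destination paths on the same cluster: -update together with -delete keeps the destination in sync with the source, while -overwrite recopies everything unconditionally.

    # /source and /destination are placeholder paths
    # Copy only missing/changed files and delete destination files that are absent from the source
    hadoop distcp -update -delete /source /destination

    # Recopy all files, overwriting whatever already exists in the destination
    hadoop distcp -overwrite /source /destination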

In the below example, we copy only the missing files from /test to /input, using a maximum of 5 mappers.
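A sketch of the corresponding command, reconstructed from the paths and options described above:

    # Copy only missing/changed files from /test to /input using at most 5 mappers
    hadoop distcp -update -m 5 /test /input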

The directory structures before and after issuing distcp can be compared to confirm that only the missing files were copied.

Advantages over the hadoop fs -put and hadoop fs -cp commands:

The hadoop fs -put and hadoop fs -cp commands can be used to copy files from the local file system into a Hadoop cluster and from one Hadoop cluster to another, respectively, but in both cases the process is sequential: only one process runs, copying file by file. The advantage of hadoop distcp is that it lets us specify how many parallel tasks should run to copy files between clusters.

Thus, with its parallel processing, hadoop distcp is the better option for copying bulk data or a huge number of files from one machine to another (or cluster to cluster) than the fs -put or fs -cp commands.
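As an illustration of the difference, assuming a placeholder /sample directory to be copied to /example: the fs -cp invocation runs as a single client-side process, while the distcp invocation fans the work out over map tasks (up to 10 here).

    # Sequential: one client process copies file by file
    hadoop fs -cp /sample /example

    # Parallel: a map-only job with up to 10 mappers
    hadoop distcp -m 10 /sample /example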


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, and has been involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.



7 thoughts on “HDFS Distributed File Copy Tool – distcp”

  • John

    Hello,

    Great post. Can you explain one thing for me? If I wanted to use distcp to create a replica of my current HDFS, I would run hadoop distcp hdfs://namenode1 hdfs://namenode2 on my source machine, aka namenode1. How does my machine recognize namenode2 if namenode2 is a separate machine?

    Thanks

    • Siva Post author

      They will be in the same VPN, and the network team ensures that each machine recognizes the other’s hostname or IP address. This works in real time on both the target machine and the source machine; I have used it many times.

  • Manoj Donga

    Can we use distcp across data centers? We are planning to migrate our data center. Can you suggest how to use distcp across data centers, if possible? Do I need to sync the Hive metadata to the new data center if my tables were created in Hive before distcp?

