Hadoop Best Practices

  • Avoid small files (smaller than one HDFS block, typically 128 MB), since each small file is processed by its own map task.
  • Maintain an optimal HDFS block size, generally >= 128 MB, to avoid tens of thousands of map tasks when processing large data sets.
  • Use combiners wherever applicable to reduce network traffic from mapper nodes to reducer nodes.
  • Process large data sets with an optimal number of reducers; avoid extremes such as using very few reducers (e.g., 1).
  • Use the PARALLEL keyword in Pig scripts when processing large data sets.
  • Avoid applications that write out multiple small output files from each reducer.
  • Use the DistributedCache to distribute artifacts of around tens of MB each.
  • Enable compression for both the intermediate map outputs and the final output of the application, that is, the output of the reducers.
  • Implementing automated processes to screen-scrape the web UI is strictly prohibited. Some parts of the web UI, such as browsing job history, are very resource-intensive on the JobTracker, and screen-scraping them can lead to severe performance problems.
  • Applications should not perform any metadata operations on the file system from the backend; such operations should be confined to the job client during job submission. Furthermore, applications should be careful not to contact the JobTracker from the backend.
  • Compose Oozie workflows of a smaller number of medium-to-large Map-Reduce jobs, in terms of processing, rather than a large number of small Map-Reduce jobs.
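As an illustration of the compression recommendation above, a minimal configuration fragment might look like the following sketch. The property names are the standard MRv2 keys for map-output and job-output compression; SnappyCodec is just one common codec choice, assumed to be available on the cluster.

```xml
<!-- Sketch of a mapred-site.xml fragment enabling compression.
     Property names are the standard MRv2 configuration keys;
     SnappyCodec is one common codec choice, assumed installed. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Compressing the intermediate map outputs reduces the volume of data shuffled from mappers to reducers, which complements the use of combiners.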

Misc

  • Counters are very expensive, since the JobTracker has to maintain every counter of every map/reduce task for the entire duration of the application.

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
