Formula to Calculate HDFS Node Storage

Formula to Calculate HDFS Node Storage (H)

Below is the formula to calculate the HDFS storage size required when building a new Hadoop cluster.

H = C * R * S / (1 - i) * 120%

Where:
C = Compression ratio, i.e., the ratio of compressed size to raw size. It depends on the type of compression used (Snappy, LZOP, etc.) and the size of the data. When no compression is used, C = 1.

R = Replication factor. It is usually 3 in a production cluster.

S = Initial size of the data to be moved to Hadoop. This could be a combination of historical data and incremental data. We also need to factor in the expected growth of this data over at least the next 3 to 6 months. For example, if we have 500 TB of data now, expect 50 TB to be ingested over the next three months, and expect output files from MapReduce jobs to add at least 10% of the initial data, then we should take 600 TB as the initial data size.

i = Intermediate data factor, usually 1/4 or 1/3. This is Hadoop's intermediate working space: the storage dedicated to intermediate results of map tasks and to any temporary storage used by Pig or Hive. This is a common guideline for many production applications; Cloudera also recommends reserving 25% for intermediate results.

120% (i.e., 1.2 times the total above) allows room for the file system underlying HDFS, usually ext3 or ext4, which becomes very unhappy at much above 80% fill. For example, if your total cluster capacity is 1200 TB, it is recommended to use only up to about 1000 TB.
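To make the arithmetic easy to repeat, here is a minimal Python sketch of the formula above; the function name and default values are illustrative only, not part of any Hadoop tooling.

    def estimate_hdfs_storage(c, r, s, i=0.25, fs_headroom=1.2):
        """Estimate the raw HDFS capacity H (same units as s).

        c           -- compression ratio (compressed size / raw size); 1 when uncompressed
        r           -- replication factor (usually 3 in production)
        s           -- initial data size, including expected near-term growth
        i           -- intermediate data factor (usually 1/4 or 1/3)
        fs_headroom -- extra room for the underlying filesystem (~20%, i.e. 1.2)
        """
        return c * r * s / (1 - i) * fs_headroom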

Example:

With no compression (C = 1), a replication factor of 3, and an intermediate factor of 1/4 (i = 0.25), ignoring the 20% filesystem headroom:
H = 1 * 3 * S / (1 - 1/4) = 3S / (3/4) = 4S
With the assumptions above, the Hadoop storage required is estimated to be 4 times the size of the initial data.
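Plugging the same assumptions into the sketch above (using, purely for illustration, the 600 TB initial data size from the earlier example, and leaving out the filesystem headroom just as the worked example does):

    # C = 1, R = 3, i = 1/4, S = 600 TB; no filesystem headroom, to match the example
    h = estimate_hdfs_storage(c=1, r=3, s=600, i=0.25, fs_headroom=1.0)
    print(h)  # 2400.0, i.e. 4 times the 600 TB of initial data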

Formula to Calculate the Number of Data Nodes:

Number of data nodes (n):
n = H / d = 1.2 * C * R * S / ((1 - i) * d)

where d = disk space available per node. Here we also need to consider the RAM, IOPS, network bandwidth, and CPU configuration of the nodes as well.
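A minimal sketch of the node-count calculation, assuming d is the usable disk per node in the same units as H and rounding up because you cannot deploy a fraction of a node; the 48 TB per-node figure is only an illustrative assumption.

    import math

    def estimate_data_nodes(h, d):
        """Number of data nodes needed to hold h of HDFS storage with d usable disk per node."""
        return math.ceil(h / d)

    # e.g. 2400 TB of required HDFS storage on nodes with 48 TB of usable disk each
    print(estimate_data_nodes(2400, 48))  # 50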

RAM Considerations:

We need enough RAM for our own processes to run, as well as buffer space for transferring data through the shuffle step. Too little memory means we cannot run as many mappers in parallel as our CPU hardware supports, which slows down processing. The number of reducers is often limited more by how much random I/O the reducers cause on the source nodes than by memory, but some reducers are very memory hungry.

64 GB should be enough for moderately sized dual-socket motherboards, but there are definitely applications where 128 GB would improve speed. Moving above 128 GB is unlikely to improve speed for most motherboards.

Network speed is moderately important during normal operations.

This calculation works for data nodes, but it assumes that:

  • No node or storage failure
  • We are only running MapReduce, Hive, or Pig jobs (probably not a fair way of putting it); the formula needs to be revisited when analytic tools that create 3x or more of the processed data as intermediate storage are used in the cluster.
  • Other performance characteristics are not relevant (processor, memory)
  • Adding new hardware is instantaneous.


Comments:

  1. 1) Suppose I use the Snappy or LZOP compression codecs; what would the value of C be in each case?

     2) Can you also explain an example of calculating the number of data nodes?

  2. Thanks for the article! If you don't mind, can you explain the formula? Specifically, why is the compression ratio in the numerator and not the denominator? If your compression ratio is 2:1, then you basically need half the size for your cluster.

     Second, I did not understand the (1 - i) term. Shouldn't the i-th fraction of the total size be added to the equation instead?