Most Popular Hadoop Distributions
There are currently many Hadoop distributions available in the big data market, but the main free, open source distribution comes from the Apache Software Foundation. Most other Hadoop vendors also offer free versions, along with customized distributions tailored to the needs of client organizations. Using Apache Hadoop as the core framework, these companies build their own cluster setup tooling and services and provide commercial support to organizations working with big data. These are known as commercial Hadoop distributions. The vendors provide services such as managing updates, support, training, and consulting, and add innovations of their own that make Hadoop manageable for an enterprise.
In the open source market, Red Hat makes money by taking the Linux kernel (the core of an open source operating system), bundling it with the required components, building a simple installer, and providing paid support to customers.
In the same way, many companies provide enterprise editions and paid support on top of the Apache Hadoop distribution.
Free Open Source Hadoop Distribution
- Core Hadoop distribution, used as the base by all other distributions
- Complex cluster setup, with no commercial support
- Manual installation and integration of Hadoop ecosystem components such as Hive, HBase, and Pig
- A good choice for free trials, tests, and demos
Other Popular Hadoop Distributions
Cloudera
- Hadoop’s co-creator, Doug Cutting, is its chief architect.
- Cloudera is regarded as the market leader in the Hadoop space; it released the first commercial Hadoop distribution.
- Highly active contributor of code to the Hadoop ecosystem.
- Provides Cloudera Distribution for Hadoop (CDH) parcels as well as Cloudera Manager, a powerful management and monitoring tool for Hadoop administration.
- Its approach is to take components it deems mature and retrofit them into the existing production-ready open source libraries included in its distribution.
- Formed in 2008, with its core distribution based on 100% open source Apache Hadoop.
- CDH may be downloaded from Cloudera’s website at no charge for clusters of up to 50 data nodes, but without technical support or Cloudera Manager.
Hortonworks
- Fast-growing company, founded in 2011.
- Another major player in the Hadoop market.
- Originated from Yahoo, and has the largest number of committers and code contributors for Hadoop ecosystem components.
- Releases the Hortonworks Data Platform (HDP), which includes Hadoop as well as related tooling and projects.
- Collaborates with major data management companies such as Teradata, Microsoft, Informatica, and SAS to provide Hadoop solutions integrated with their own product sets.
- Uses Apache Ambari for management, Stinger for queries, and Solr for search.
Amazon Web Services Elastic MapReduce (AWS EMR) Hadoop
- Hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
- Provides management software and GUI support.
- Provides enhanced data protection.
MapR
- Provides a complete distribution of Apache Hadoop and related projects that is independent of the Apache Software Foundation.
- MapR is promoted as the only Hadoop distribution that provides full data protection, no single points of failure, and significant ease-of-use advantages.
- It replaces the underlying HDFS with its own proprietary file system, MapRFS, which is intended to improve data management efficiency, reliability, and ease of use.
- Three MapR editions are available: M3, M5, and M7.
- The M3 Edition is free and available for unlimited production use.
- MapR M5 is an intermediate-level subscription software offering.
- MapR M7 is a complete distribution for Apache Hadoop and HBase that includes Pig, Hive, Sqoop, and much more.
Pivotal Greenplum Hadoop
- Integrates EMC’s massively parallel processing (MPP) database technology (formerly known as Greenplum, and now known as HAWQ) with Apache Hadoop
- High-performance Hadoop distribution with true SQL processing for Hadoop.
- SQL-based queries and other business intelligence tools can be used to analyze data that is stored in HDFS
Intel Distribution for Apache Hadoop
- Provides excellent performance with optimizations for Intel Xeon processors, Intel SSD storage, and Intel 10GbE networking.
- Provides data security via encryption and decryption in HDFS
- Supports role-based access control with cell-level granularity in HBase.
- Improved Hive query performance.
- Support for statistical analysis with open source statistical package R, and analytical graphics through Intel Graph Builder.
IBM InfoSphere BigInsights
- Focuses on adding value on top of the open source Hadoop stack
- BigInsights comes with a built-in, browser-based spreadsheet tool called BigSheets
- Strong support for adaptive real-time analytics, and good text analytics capabilities through AQL and Jaql
Microsoft Hadoop on Windows Azure
- Microsoft HDInsight is an Apache Hadoop offering, based on the Hortonworks Data Platform, that runs on Microsoft’s cloud platform, Windows Azure
- Currently supports Pig, Hive, and Sqoop
DataStax Enterprise
- The DataStax Enterprise (DSE) big data platform consists of open source tools such as Apache Hadoop, Cassandra, Solr, Hive, Pig, and Mahout.
- DSE is designed to manage real-time and enterprise search data in the same database cluster.
- It also comes with OpsCenter Enterprise, which allows DSE clusters to be managed through a central web interface.
Apart from these, there are many other Hadoop distributions, all of them built on Apache Hadoop, which is open source under the Apache License 2.0.
Below is a comparison chart prepared by Robert Schneider in the Hadoop Buyer’s Guide (2014) for Cloudera, Hortonworks, and MapR, the three leading commercial Hadoop distributions.
Looking at the above comparison chart, one may feel that MapR is the best of the three, but before concluding that, we need to consider a few of its characteristics.
- Proprietary file system: MapR has its own proprietary file system (MapRFS). Because it differs from the native HDFS used in other distributions, it will be painful for an organization to switch from MapR to any other Hadoop vendor.
- Mutable keys: The MapR file system allows mutable keys while HDFS does not. Being able to change an established key (mutability) is risky and potentially endangers the ability of all previously built applications to use the data if keys are changed inadvertently. MapR argues that mutability has real advantages and that good data management practices eliminate this risk.
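The mutable-key risk described in the list above can be pictured with a small toy model. The sketch below is only an in-memory illustration written for this article: `KeyedStore`, `allow_rekey`, and the sample keys are hypothetical names, and nothing here uses or imitates the MapR or HDFS APIs.

```python
class KeyedStore:
    """Toy in-memory store (hypothetical, not a MapR or HDFS API).

    `allow_rekey` models whether established keys may be changed.
    """

    def __init__(self, allow_rekey):
        self.allow_rekey = allow_rekey
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        # Returns None when the key is unknown.
        return self._rows.get(key)

    def rekey(self, old_key, new_key):
        # A store with immutable keys refuses the change outright.
        if not self.allow_rekey:
            raise PermissionError("keys are immutable in this store")
        self._rows[new_key] = self._rows.pop(old_key)


def demo():
    store = KeyedStore(allow_rekey=True)
    store.put("customer-42", {"name": "Acme"})
    # An older application remembers the key it originally wrote.
    saved_reference = "customer-42"
    # Someone later changes the key, perhaps inadvertently.
    store.rekey("customer-42", "cust-0042")
    # The old application's lookup now misses; only the new key works.
    return store.get(saved_reference), store.get("cust-0042")
```

With mutable keys, the older application's lookup silently returns nothing after the change, while a store created with `allow_rekey=False` rejects the change instead. Which behavior is preferable depends on the data management practices in place, as the MapR argument suggests.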
So, an organization has to review the strengths and weaknesses of each vendor before choosing a Hadoop distribution for its big data business intelligence platform.
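One simple way to structure such a review is a weighted scoring matrix. The sketch below is a minimal illustration, assuming the organization picks its own criteria, weights, and scores: the function name `rank_distributions`, the criteria names, and the placeholder vendors "Vendor A" and "Vendor B" with their numbers are all hypothetical and do not rate any real distribution.

```python
def rank_distributions(weights, scores):
    """Rank vendors by weighted average score, best first.

    weights: {criterion: importance}
    scores:  {vendor: {criterion: score}}
    """
    total_weight = sum(weights.values())
    ranked = []
    for vendor, vendor_scores in scores.items():
        weighted = sum(weights[c] * vendor_scores[c] for c in weights)
        ranked.append((vendor, weighted / total_weight))
    # Highest weighted average first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Placeholder inputs for illustration only; not real vendor ratings.
weights = {"support": 3, "portability": 2, "management_tools": 1}
scores = {
    "Vendor A": {"support": 4, "portability": 2, "management_tools": 5},
    "Vendor B": {"support": 3, "portability": 5, "management_tools": 3},
}

for vendor, score in rank_distributions(weights, scores):
    print(f"{vendor}: {score:.2f}")
```

Giving file system portability a higher weight, for example, is one way to make the lock-in concern discussed above visible in the final ranking.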
For additional details, refer to the Apache Wiki.