In case you missed it, Cloudera released a new version of its Hadoop distribution, MapR announced a planned release and Hortonworks unveiled its new Hadoop platform, Hortonworks Data Platform 1.0.
It’s getting to be quite a crowded market, distribution-wise. All of which raises the question: How do you choose among these distributions?
It’s a bit tricky, because really, they all go back to Apache Hadoop for the same source code. So at this point, it boils down to their philosophy and reputation. To help you decide which distribution will work best for you, here's a summary of their reputation, the features they're currently boasting about, and their marketing advantage at this point.
Cloudera
Reputation: The Hadoop veteran.
Features It Boasts About: High availability, multi-cluster management, better integration with other tools thanks to an API manager, and increased security.
Advantage: “The category leader that is setting the standard for Apache Hadoop in the Enterprise,” reads a recent press release, and that pretty much sums up Cloudera’s market advantage right now. It’s the most established distribution by far, with a four-year track record. Hadoop’s creator, Doug Cutting, is also part of the management team as architect. Cloudera also has 250 members in its partner program, including Oracle.
Hortonworks Data Platform 1.0
Reputation: The conservative choice.
The new kid on the block is playing it conservative by using the Apache Hadoop 1.0 code instead of the newly available 2.0 code, which Cloudera used. Hortonworks felt the 2.0 code needed to “bake a bit more” before going to enterprises, according to InformationWeek. In Cloudera’s defense, the 2.0 code does patch a known vulnerability.
Features It Boasts About: High availability via integration between Hadoop and VMware vSphere. It is also the first to use Apache HCatalog functionality for metadata services.
Advantage: Hortonworks is all about the power partnerships with IBM, Microsoft and Teradata. One partnership that may be a strategic differentiator is the inclusion of Talend’s drag-and-drop data-integration software.
MapR Version 2.0 (available in third quarter)
Reputation: A high-performance Hadoop distribution with high availability. Replaces some “flawed components,” such as the Hadoop Distributed File System (HDFS), with non-open-source components.
Features It Boasts About: Version 2.0 will support multi-tenancy, which “enables physical Hadoop clusters to be logically partitioned to provide separate systems administration, data placement, and job management,” according to InformationWeek.
Advantage: Integration with Amazon Web Services. MapR may have the ultimate fast-pass: Amazon is offering it as part of its Elastic MapReduce (EMR) service. Specifically, MapR’s M3 and M5 editions are available as options from an EMR drop-down menu, at no additional charge for the M3-based service. Since billing is done through Amazon and level-one support comes from Amazon, this may make MapR a sort of default Hadoop option for Amazon users, and that’s no small thing.
EMC Greenplum also offers a distribution, but it's tied to its hardware.