Hadoop, and big data in general, is like some sort of tech primordial soup these days. No one's quite sure what's going to work, but every week or so, a new business or product emerges to give it a go.
Most recently, Yahoo spun off its Hadoop group into a new company called Hortonworks. Its business model is to offer training and support for Hadoop, and given the skill shortage and the demand, that's a smart start for a new business.
In a recent Q&A with Information Management's Jim Ericson, Hortonworks CEO Eric Baldeschwieler said the company's leaders expect that half the world's data will be in Hadoop within five years. Sit with that a minute: half the world's data, running on Hadoop. That's bound to keep Larry Ellison up at night.
But in May, EMC unveiled its Greenplum HD Enterprise Edition, and guess what? It's not based on the standard distribution. It's built around MapR Technologies' Hadoop distribution, according to Howard, who adds that the distribution is now also available directly from MapR.
This isn't just a matter of market dominance. As Howard explains, the standard distribution of Hadoop has three major problems: resiliency, compression and performance. He contends that the MapR distribution addresses these flaws. If you're considering Hadoop at all, you'll want to take Howard's critique into account.
Of course, my focus has been on the integration front, where a slew of companies are offering various connectors and ways of accessing and analyzing Hadoop-stored information. Here the question isn't about distribution, but rather about how you use the information stored within Hadoop. Informatica, Composite, Talend, Syncsort, Pentaho, SnapLogic, IBM: I'm still working my way through the product briefings as company after company unveils Hadoop-focused solutions.
One topic that has come up several times, particularly from Informatica and IBM, is the concept of a "big data platform." The word "platform" always raises more questions in my mind, so during a recent interview, I asked David Corrigan, the director of strategy for IBM's InfoSphere portfolio, what, exactly, constitutes a "big data platform."
Corrigan said that in IBM's view, a big data platform would incorporate five core capabilities: