I'll be honest with you: The first time I saw the name "MapReduce," I thought it had to do with road maps and possibly GPS. And I thought Apache Hadoop sounded a bit like a medieval weapon.
This week, I had an opportunity to interview Richard Daley, the CEO and founder of open source BI vendor Pentaho, about how his company is responding to Hadoop's big data potential. It was an informative interview, which I hope will be published next week.
In preparation for that interview, I did a bit more digging into Hadoop and MapReduce, and I thought it'd be useful to share some of that information with you, particularly since this is in many ways an emerging technology with big implications. Despite its youth, Hadoop is taking off. Daley told me that big data is such a big problem that Hadoop job opportunities have exploded in the past nine to 12 months.
It turns out, MapReduce has nothing to do with roads. It's a patented approach (aka, software framework or algorithm) created by Google as a way to process large amounts of data through distributed computing. I think Wikipedia actually explains "MapReduce" well:
"'Map' step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
'Reduce' step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output - the answer to the problem it was originally trying to solve."
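To make that description a bit more concrete, here's a minimal single-machine sketch in Python of the classic word-count example. This is not Hadoop code or Google's implementation. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the grouping step in the middle is my assumption about how to show what a real framework does between the two steps. In a real cluster, each chunk would go to a different worker node.

```python
from collections import defaultdict


def map_phase(chunk):
    """'Map' step: turn one chunk of input into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate pairs by key so each word's counts sit together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """'Reduce' step: combine all the partial answers for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle and reduce in sequence over a list of text chunks."""
    # Each chunk stands in for the sub-problem handed to one worker node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big problem"]))
```

The point of splitting the work this way is that the map calls don't depend on each other, so they can run on as many machines as you have chunks of data.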
Google described MapReduce in a paper published in 2004, which led to the creation of Apache Hadoop, an open source implementation with an impressive community contributing to its development.
For the most part, when I've read about Hadoop, it's been about how it's an alternative to SQL, although I think it'd be more accurate to say its big data use case falls outside the use case for a SQL database. It is, in essence, a NoSQL approach, as James Kobielus pointed out in a recent Intelligent Enterprise piece on Hadoop. As such, it's for Big Data, which, as Daley put it, generally means the ridiculous amounts of data that machines produce.
Hadoop does have some problems. As IT Business Edge's Mike Vizard pointed out, two major challenges with Hadoop are performance and accessibility. Vizard writes that this is where companies like Cloudera, Netezza and BI companies like Pentaho come into play.
Recently, there's been a bit of buzz about how some vendors are applying data integration to Hadoop to give companies the ability to better put those large amounts of data to good use. IBM has made several announcements around Hadoop, including plans to provide a range of professional services to support Hadoop implementations, as well as unveiling plans for its own distribution of Apache Hadoop and a "Hadoop-powered BigData portfolio, dubbed InfoSphere BigInsights," according to a recent TDWI article.
The TDWI piece is a particularly useful read, since it mentions many of the vendors leveraging Hadoop, including a detailed look at what the open source data integration vendor Talend is doing with Hadoop. And, as I mentioned earlier, Pentaho is among those doing interesting work with integration of data stored in Hadoop. The company is applying ETL to Hadoop at the node level and then allowing you to bring that information into its BI system.
Keep in mind we're really ramping up on the Hadoop hype cycle here. You'll see a lot of vendors declaring they're "the only" or "the first" to offer this or that, and it's still hard to separate fact from marketing at this point. But it certainly seems promising, particularly if you've got Big Data problems of your own.