“Will Hadoop Steal Work from ETL?” I asked in an August post that looked at how the Hadoop ecosystem is being used in situations where ETL is simply too slow.
To clarify, when I said “Hadoop,” I meant using the Hadoop Distributed File System with some combination of tools from the Hadoop stack, such as Pig and Hive, or perhaps hand-coded MapReduce jobs.
At the end of that post, I wondered how vendors of ETL-based tools would respond to this use case. So when I finally got the chance to ask one, I did.
“Hadoop, to the extent that it does anything that ETL tools do, does some amount of the ‘T’ (transform) portion. It does almost zero ‘E’ (extract) and it does almost zero ‘L’ (load) from an enterprise perspective,” said James Markarian, the CTO of Informatica. “Hadoop does a fairly narrow set of the ‘T’ by itself,” he added, meaning that the more complex forms of ‘T’ people have come to expect from ETL platforms, such as fuzzy matching, identity resolution and address validation, are not something Hadoop handles on its own.
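To make that “narrow set of the ‘T’” concrete, here is a minimal sketch of the kind of transform a hand-coded Hadoop job typically handles: normalizing fields and removing exact duplicates with Hadoop Streaming. The script, field layout and record format are hypothetical; the point is only to contrast this simple work with the fuzzy matching and address validation Markarian mentions.

```python
#!/usr/bin/env python
# transform.py -- a minimal Hadoop Streaming sketch (hypothetical field layout).
# Mapper: normalizes a comma-separated customer record and keys it by email.
# Reducer: keeps one record per email address, a crude exact-match dedup,
# nothing like the fuzzy matching or address validation Markarian describes.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 3:
            continue  # drop malformed rows
        cust_id = fields[0].strip()
        name = fields[1].strip().title()
        email = fields[2].strip().lower()
        print("\t".join([email, cust_id, name]))

def reducer():
    last_email = None
    for line in sys.stdin:
        email, cust_id, name = line.rstrip("\n").split("\t")
        if email != last_email:  # streaming delivers reducer input sorted by key
            print("\t".join([cust_id, name, email]))
            last_email = email

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

It would be submitted with the streaming jar that ships with Hadoop, along the lines of `hadoop jar hadoop-streaming.jar -files transform.py -mapper "python transform.py map" -reducer "python transform.py reduce" -input /raw/customers -output /clean/customers`, with the jar path and HDFS directories depending on the installation.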
Still, using Hadoop for ETL is not new, according to Yves de Montcheuil of Talend, which offers an open source ETL-based data integration solution. In fact, he adds, it’s one of the most proven use cases for this young technology.
It’s like using any other data management technology to “help” ETL, he writes in a post responding to my original piece.
“Using Hadoop for ETL is not very different — except for one aspect: Hadoop is not always the target,” de Montcheuil states. “If Hadoop is used only as the transformation engine, data still needs to be loaded in its final destination.”
It’s even possible — although perhaps not effective — to build an ETL system based on Hadoop, Markarian said.
“Now, what you wouldn’t necessarily have is something that could handle real-time workloads, which are more and more part of the analytic world that we live in,” he said. “You would do a pretty good job of re-writing half of what ETL is used for today and you would still be looking for a solution that could solve some of the low latency, complex event processing-style problems that people do have.”
But this whole Hadoop-as-ETL-replacement question is probably academic anyway. In the real world, enterprises aren't interested in replacing their ETL tools with Hadoop.
What they want to do is augment ETL with Hadoop, in effect using Hadoop as an “auxiliary engine (a booster engine if you want) for ETL,” de Montcheuil writes.
He identifies three common use cases for Hadoop ETL (you should read his full post for the details):
- The source data is in Hadoop and the target is a big data analytics system running against Hadoop, a situation he sees as the “most optimal use case for Hadoop ETL,” because the data is transformed and stays in the cluster.
- The source data is stored elsewhere, but the target is Hadoop.
- The target is a non-Hadoop system, so the data is transformed in Hadoop but the results are extracted and brought into a datamart, analytics engine or some other location. “This is perfectly possible with the proper connectivity but will create additional latency and require more resources,” he warns.
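For that third case, a rough sketch of how the flow is often wired together looks something like the following: run the transform on the cluster, then push the result into a relational datamart. Every path, host name and table name below is a placeholder, and the Sqoop export is just one of several ways to handle the final load.

```python
#!/usr/bin/env python
# etl_pipeline.py -- sketch of "transform in Hadoop, load elsewhere" (the third case above).
# All paths, hostnames and table names are placeholders.
import subprocess

RAW = "/data/raw/customers"        # source files already sitting in HDFS
CLEAN = "/data/clean/customers"    # output of the Hadoop-side transform
JDBC = "jdbc:mysql://warehouse.example.com/mart"  # hypothetical datamart

def run(cmd):
    print("running: " + " ".join(cmd))
    subprocess.check_call(cmd)

# The "T": run the transform job on the cluster (here, the streaming job sketched earlier).
run(["hadoop", "jar", "hadoop-streaming.jar",
     "-files", "transform.py",
     "-mapper", "python transform.py map",
     "-reducer", "python transform.py reduce",
     "-input", RAW, "-output", CLEAN])

# The "L": export the transformed records out of HDFS into the datamart with Sqoop.
# This is the extra hop, and the extra latency, that de Montcheuil warns about.
run(["sqoop", "export",
     "--connect", JDBC, "--username", "etl", "--password-file", "/user/etl/.pw",
     "--table", "customers_clean",
     "--export-dir", CLEAN,
     "--input-fields-terminated-by", "\\t"])
```

In the first use case the export step disappears entirely, which is why de Montcheuil sees it as the best fit: the transformed data simply stays on the cluster for the analytics layer to query.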
A combination of the last two use cases is what Markarian says he sees most often, since most of Informatica’s customers are interested in the most cost-effective way to perform a certain amount of processing.
“The conclusion that a lot of the high-scale guys are coming to is that, rather than doing a lot of work in some of these data warehouse appliance environments, they can pull some of that workload out and do it more cost effectively in Hadoop,” Markarian said in a recent interview. By offloading that work, he said, they’ve been able to offer better service-level agreements on their data warehouse appliances.