Enterprises Giving ETL a Boost with Hadoop

Loraine Lawson

“Will Hadoop Steal Work from ETL?” I asked in an August post that looked at how the Hadoop ecosystem is being used in situations where ETL is simply too slow.

To clarify, when I said “Hadoop,” I meant using the Hadoop Distributed File System with some combination of tools from the Hadoop stack — Pig, Hive — or perhaps hand-coded MapReduce jobs.
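For readers who haven't written one, a hand-coded MapReduce job boils down to two functions: a map step that emits key/value pairs and a reduce step that aggregates them per key. The toy Python sketch below simulates that flow locally — it is not the real Hadoop API (which runs these phases in parallel across a cluster, with a shuffle in between), just an illustration of the shape of the "T" such a job performs; the click-log records are made up.

```python
from collections import defaultdict

def map_phase(records):
    # Map step: parse each raw record and emit a (key, value) pair.
    # The "transform" here is trivial: normalize the user key to lowercase.
    for line in records:
        user, page = line.strip().split(",")
        yield user.lower(), 1

def reduce_phase(pairs):
    # Reduce step: aggregate values by key, as Hadoop does after the shuffle.
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

raw = ["Alice,/home", "BOB,/search", "alice,/cart"]
counts = reduce_phase(map_phase(raw))
print(counts)  # {'alice': 2, 'bob': 1}
```

Pig and Hive generate essentially this kind of job from higher-level scripts or SQL-like queries, which is why they count as the "some combination of the Hadoop stack" above.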

At the end of the post, I wondered how vendors with ETL-based tools would respond to this use case. So when I finally got a chance to ask one, I did.

“Hadoop, to the extent that it does anything that ETL tools do, does some amount of the ‘T’ (transform) portion. It does almost zero ‘E’ (extract) and it does almost zero ‘L’ (load) from an enterprise perspective,” said James Markarian, the CTO of Informatica. “Hadoop does a fairly narrow set of the ‘T’ by itself, meaning that if you want to do fuzzy matching or identity resolution or address validation or kind of the complex forms of ‘T’ that people have come to expect from ETL platforms[, you need more than Hadoop alone].”

Still, using Hadoop for ETL is not new, according to Yves de Montcheuil of Talend, which offers an open source ETL-based data integration solution. In fact, he adds, it’s one of the most proven use cases for this young technology.

It’s like using any other data management technology to “help” ETL, he writes in a post responding to my original piece.

“Using Hadoop for ETL is not very different — except for one aspect: Hadoop is not always the target,” de Montcheuil states. “If Hadoop is used only as the transformation engine, data still needs to be loaded in its final destination.”

It’s even possible — although perhaps not effective — to build an ETL system based on Hadoop, Markarian said.

“Now, what you wouldn’t necessarily have is something that could handle real-time workloads, which are more and more part of the analytic world that we live in,” he said. “You would do a pretty good job of rewriting half of what ETL is used for today, and you would still be looking for a solution that could solve some of the low-latency, complex event processing-style problems that people do have.”


But this whole Hadoop-as-ETL-replacement question is probably academic anyway. In the real world, enterprises aren’t looking to replace their ETL tools with Hadoop.

What they want to do is augment ETL with Hadoop, in effect using Hadoop as an “auxiliary engine (a booster engine if you want) for ETL,” de Montcheuil writes.

He identifies three common use cases for Hadoop ETL (you should read his full post for the details):

  1. The source data is in Hadoop and the target is a big data analytics system running against Hadoop, a situation he sees as the “most optimal use case for Hadoop ETL,” because the data is transformed and stays in the cluster.
  2. The source data is stored elsewhere, but the target is Hadoop.
  3. The target is a non-Hadoop system, so the data is transformed in Hadoop but the results are extracted and brought into a datamart, analytics engine or some other location. “This is perfectly possible with the proper connectivity but will create additional latency and require more resources,” he warns.
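The third case — transform in Hadoop, then land the results elsewhere — can be sketched as three stages. This is a hypothetical Python illustration of the flow, not real connector code: the function names, the sample rows, and the list standing in for a datamart are all invented for the example.

```python
def extract_from_source():
    # Stand-in for pulling raw rows from an operational system:
    # (day, country, page_views) tuples with inconsistent casing.
    return [("2012-10-01", "US", 3), ("2012-10-01", "us", 2), ("2012-10-02", "DE", 5)]

def transform_in_hadoop(rows):
    # Stand-in for a Pig/Hive/MapReduce job: normalize country codes
    # and aggregate page views per (day, country).
    totals = {}
    for day, country, views in rows:
        key = (day, country.upper())
        totals[key] = totals.get(key, 0) + views
    return totals

def load_to_datamart(totals, mart):
    # The extra hop de Montcheuil warns about: results leave the cluster
    # and get loaded into a separate target, adding latency and resources.
    for (day, country), views in sorted(totals.items()):
        mart.append({"day": day, "country": country, "views": views})

mart = []
load_to_datamart(transform_in_hadoop(extract_from_source()), mart)
```

The connectivity he mentions lives in that last stage: the transformed results are small relative to the raw input, but shipping them out of the cluster is still a second data movement that the first use case avoids entirely.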

A combination of the last two is what Markarian says he sees most often, since most of Informatica’s customers are interested in the most cost-effective way to perform a given amount of processing.

“The conclusion that a lot of the high-scale guys are coming to is that, rather than doing a lot of work in some of these data warehouse appliance environments, they can pull some of that workload out and do it more cost effectively in Hadoop,” Markarian said in a recent interview. By doing so, they’ve been able to achieve better service-level agreements in their data warehousing appliances.



Comments
Oct 8, 2012 11:33 AM, H.M. says:
ETL is rather difficult with Big Data. Data quality, data hygiene, data consistency, sampling, automated testing and standardization are some of the tasks that come to mind. Tools like Informatica have built tools on top of relational stores that perform these tasks. Trying to replicate an equivalent tool in the big data space with Hadoop is not trivial and might take many years. One promising platform seems to be the one from LexisNexis (HPCC Systems), which seems to have all these capabilities. I am not surprised, because LexisNexis has been dealing with Big Data for a long time now and would need these tools on a daily basis.
Apr 16, 2015 5:17 AM, Ratnesh says:
Thanks for your post! ETL no doubt needs to continue to evolve and adapt to developer preferences and the performance, scale and latency needs of modern applications. Hadoop is just another engine upon which ETL and its associated technologies can run. Renaming what is commonly referred to as ETL, or worse, ignorantly dismissing data challenges and enterprise-wide data needs, is just irresponsible. More at www.youtube.com/watch?v=1jMR4cHBwZE
