Could Hadoop replace ETL? Not really and not yet, says James Markarian, Informatica’s CTO. In this Q&A with IT Business Edge’s Loraine Lawson, Markarian explains how Informatica’s customers are using Hadoop with ETL and data integration tools, as well as the company’s approach to developing new Hadoop tools. (Several pre-built transformations and data flows are now in beta, with plans for release later this year.)
Lawson: I believe you wanted a chance to respond to a post about the possibility of Hadoop usurping ETL’s role, at least in terms of processing. I also know that ETL use has grown with Hadoop. What are you seeing in terms of how ETL technology is playing with Hadoop?
Markarian: ETL has evolved past what ETL once meant, to the point where we generally use a different term, data integration, to describe what we do. It’s important to go back and define what traditional ETL means, because that is how people are using the term in this context: it is strictly about loading and pre-processing data for analytic environments, primarily.
Hadoop, to the extent that it does anything that ETL tools do, does some amount of the “T” (transform) portion. It does almost zero “E” (extract) and almost zero “L” (load) from an enterprise perspective. That means supporting all the different interfaces into SAP, doing log-based change data capture out of everything from, and then all the downstream loading, which, again, in integration cases could be into data warehousing appliances or even other operational systems that could benefit from the analytics.
Hadoop does a fairly narrow set of the “T” by itself. If you want to do fuzzy matching or identity resolution or address validation, the kind of complex forms of “T” that people have come to expect from ETL platforms, Hadoop is a good platform for plugging some of those components in, but it doesn’t natively support the higher-value forms of transformation that you would expect from any ETL platform nowadays.
Now, they have a framework where you can plug those things in, and you have developers building some of these components out there. But even for very basic things like an efficient join, which is a very common ETL capability, MapReduce is really an inefficient place right now. I think it’s got a ways to go to get from being a good platform for eventually becoming an ETL solution to actually being an ETL solution.
What we’re seeing is customers trying to figure out the most cost-effective way to perform a certain amount of processing. The conclusion that a lot of the high-scale guys are coming to is that, rather than doing a lot of work in some of these data warehouse appliance environments, they can pull some of that workload out and do it more cost-effectively in Hadoop. What this allows them to do is achieve better SLAs in their data warehousing appliances, because those are still really important parts of their business intelligence stack.
Hadoop doesn’t serve a large number of users, but a lot of the pre-processing work that goes into readying information for BI reports or even ad hoc query can happen offline, which is what Hadoop is really good at. And I think this use case in particular is where some of the confusion creeps in around how Hadoop affects ETL or data integration.
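To make the remark about joins more concrete, the following is a minimal sketch of a reduce-side join, simulated in plain Python rather than run on a real Hadoop cluster. It is an illustration only and does not come from the interview: the sample data and the function names (`map_phase`, `shuffle`, `reduce_phase`, `reduce_side_join`) are hypothetical. It shows why a join is awkward in this model: every record from both inputs has to be tagged with its source, grouped by key, and only then paired up in the reducer.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample data, invented purely for illustration.
customers = [(1, "Acme"), (2, "Globex")]
orders = [(1, "order-100"), (1, "order-101"), (2, "order-200"), (3, "order-300")]


def map_phase(records, tag):
    """Emit (join_key, (tag, value)) so the reducer can tell the two sources apart."""
    for key, value in records:
        yield key, (tag, value)


def shuffle(pairs):
    """Group every emitted pair by key.

    On a real cluster this step sorts the records and moves them
    across the network, which is where much of the cost of a join goes.
    """
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """For each key, pair every customer row with every order row (inner join)."""
    for key in sorted(groups):
        left = [value for tag, value in groups[key] if tag == "customer"]
        right = [value for tag, value in groups[key] if tag == "order"]
        for name, order in product(left, right):
            yield key, name, order


def reduce_side_join(customers, orders):
    """Run the simulated map, shuffle and reduce steps in sequence."""
    mapped = list(map_phase(customers, "customer")) + list(map_phase(orders, "order"))
    return list(reduce_phase(shuffle(mapped)))


joined = reduce_side_join(customers, orders)
for row in joined:
    print(row)
```

In this sketch the order with key 3 is dropped because no customer shares that key, which is ordinary inner-join behavior. A database or ETL engine can usually choose among several join strategies, whereas here the developer has to hand-build the tagging and grouping.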
Lawson: Is it possible that you could build a new generation of ETL tools using the Hadoop stack and that whole process of distributing nodes?
Markarian: Absolutely you could. You could start writing from scratch and re-write all the connectors and all the transformations and everything in Hadoop. And, at the end of it, what you would have is a very good, scalable batch ETL system.
Now, what you wouldn’t necessarily have is something that could handle real-time workloads, which are more and more part of the analytic world that we live in. So you would do a pretty good job of re-writing half of what ETL is used for today, and you would still be looking for a solution that could solve some of the low-latency, complex event processing-style problems that people do have.
Otherwise, you’d be looking at potentially using multiple technologies to support your analytics infrastructure rather than just one. But, again, anything’s possible. You can do it and I’m sure that there will be vendors out there just trying to crack the batch Hadoop/ETL part of the problem.
But if I were starting a company today, knowing what I know about ETL problems, that’s probably not directly the way I would solve this problem.
If you look at how we’re approaching it, we’re taking our tools and our existing metadata and saying that anything you’re running on Informatica today, you can deploy to Hadoop and get the scale-out benefits. For all of our 5,000+ customers, we’re saying that’s a smarter way than re-writing natively in Hadoop.
Lawson: You mentioned hand coding, which has been blamed for a lot of the “integration spaghetti” of the past. Is that an issue with hand coding for Hadoop?
Markarian: In Hadoop environments, yes. There’s no question that Hadoop right now is dominated by people who are writing MapReduce, Pig and Hive scripts. That’s not really a criticism of Hadoop; it’s more a statement of where we are in the adoption cycle. Hadoop is in the really early phases of adoption. Even though its adoption may be broader than that of other technologies that have been rolled out, it’s still relatively early. Most Hadoop deployments right now are in a kind of experimentation mode, and very few customers, at least that we’ve come across, even at the biggest banks and pharma companies, have real production Hadoop environments.
It’s like other sorts of hobbyist environments: There’s a lot of hand coding. Eventually I think we’ll see that change as more tools become available and as users start understanding what costs they’re seeing and where they can really make the process more efficient. But, yes, right now there is lots of hand coding, so it is a lot like data warehousing and the like in the early to mid-’90s.
Lawson: Earlier, when you were describing using Informatica’s tools with Hadoop, it sounded like what you get is an amplifier for the functionalities of Informatica’s tools. Would that be a fair assessment of how it’s used right now?
Markarian: Amplifier is an interesting word. I think it has the ability to scale out certain jobs that Informatica is executing today, so part of that is true, and as long as we qualify it, I think that’s fair. But there are cases where we’re introducing new stuff to help complement Hadoop environments, things like our parsing technology, packaged up in something called H-Parser, to address some of these new use cases, at least new for Informatica, where we haven’t really been involved at large scale, like Web log processing or other application log processing.
We are introducing new things specifically around the types of things that customers are doing with their Hadoop environments. We’re also seeing it as a new opportunity and not just what you’d characterize as an amplification of something that we have been doing previously.
What we’re trying to do for our customers is protect the existing investment that they’ve built up in Informatica and say, “Hey, look, if you have integration jobs or if you have data quality rules that you’re enforcing or if you have parsing routines or other business logic that you’ve built in Informatica, whether you’re running it inside of our engine or using us to push it into database environments (something we call ‘push-down optimization’), you can take all of that and actually run it and scale it out inside of Hadoop without making code changes.”