New Code Uses APIs to Simplify ETL Flows, MapReduce on Hadoop

Loraine Lawson

This week, Syncsort announced that the code it submitted to Apache Hadoop has officially been committed to Apache Hadoop 2.0.3 alpha. The code will make it easier to build ETL flows and MapReduce jobs. Syncsort’s Josh Rogers, senior vice president, Data Integration Business, explains why Syncsort decided to make this contribution and why it’s important.

Lawson: I know you submitted code to Apache Hadoop in January. What’s the news about that?

Rogers: The contribution that we created and submitted to the open source community in January, and which has now been committed, allows Hadoop MapReduce to be pluggable from a sort perspective. The reason we believe that’s valuable is that we see Hadoop becoming kind of the operating system for Big Data, and we see that the number-one use case people are deploying on Hadoop is ETL. I think you wrote a blog post maybe a week or two ago responding to the Gartner article about Hadoop not being a data integration tool. (Editor’s note: If you’re not a Gartner client, see “Hadoop and DI: A Platform Is Not a Solution.”)

We believe that while Hadoop is very powerful, it’s still very immature and not quite ready for prime time, but we also believe that organizations aren’t waiting. They're rushing to Hadoop and building lots of applications to derive value out of it. And in the ETL space, there are clear barriers to being successful with that strategy. We believe this contribution helps us and helps the community start to lower those barriers.

So we had announced earlier that we had submitted it, but now it’s actually being committed and we’ll see it start to appear in all the major distributions in the coming months.

Lawson: What does that mean, the code is being committed?

Rogers: The way open source software works is that all sorts of individuals make contributions to the framework. Then there’s a commit process that determines which contributions get incorporated into the framework and which don’t. Ours has been incorporated, so Apache Hadoop 2.0.3 will include pluggable sort.

Lawson: For people who are using Hadoop now, but maybe didn’t read the past story, can you quickly explain what this code accomplishes?

Rogers: It enables a more sophisticated manipulation of data within the MapReduce framework. Some examples of that would be things like data sampling, hash joins or hash aggregations. That type of data manipulation is fairly complicated to achieve because of the way MapReduce works.
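(Editor’s note: To make "hash aggregation" concrete, here is a minimal sketch in plain Java. It does not use any Hadoop or Syncsort API; the class and method names are hypothetical and chosen only for illustration. It totals the values for each key in a single pass with a hash table, rather than sorting all records by key first, which is what MapReduce does by default between the map and reduce steps.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Hadoop or any Syncsort product.
final class HashAggregationSketch {

    // Sums the values for each key in one pass over the input.
    // A hash table tracks the running total per key, so the
    // records never need to be sorted by key.
    static Map<String, Long> sumByKey(List<Map.Entry<String, Long>> records) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> record : records) {
            totals.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = List.of(
            Map.entry("east", 10L),
            Map.entry("west", 5L),
            Map.entry("east", 7L));
        // Prints the per-key totals; HashMap does not guarantee key order.
        System.out.println(sumByKey(records));
    }
}
```

The trade-off is that the hash table has to fit in memory, whereas a sort-based approach can spill to disk more naturally.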

Lawson: Does this change the way MapReduce actually works?

Rogers: Correct, so it actually creates a set of APIs that you can call out in the middle of the map or the reduce step, at the sort phase, to access the data that’s being processed with the MapReduce.
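(Editor’s note: The idea of a "pluggable" sort can be sketched as follows. This is a conceptual illustration in plain Java, not the actual Hadoop API. The interface and class names are hypothetical. The point is that the framework calls an interface at the sort step, so a different implementation can be swapped in without changing the framework itself.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical plug point: the framework depends only on this interface.
interface RecordSorter {
    List<String> sort(List<String> records);
}

// Stand-in for a framework's built-in sort.
final class DefaultSorter implements RecordSorter {
    public List<String> sort(List<String> records) {
        List<String> copy = new ArrayList<>(records);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }
}

// Stand-in for a plugged-in alternative. This one skips sorting, which is
// an assumption about what a hash-based operation might want.
final class PassThroughSorter implements RecordSorter {
    public List<String> sort(List<String> records) {
        return new ArrayList<>(records);
    }
}

final class PluggableSortSketch {

    // The "framework" hands map output to whichever sorter it was given.
    static List<String> collectMapOutput(List<String> mapOutput, RecordSorter sorter) {
        return sorter.sort(mapOutput);
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("pear", "apple", "fig");
        // prints [apple, fig, pear]
        System.out.println(collectMapOutput(mapOutput, new DefaultSorter()));
        // prints [pear, apple, fig]
        System.out.println(collectMapOutput(mapOutput, new PassThroughSorter()));
    }
}
```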

Lawson: So why did you decide to contribute this function?

Rogers: What we’ve contributed is actually technically called a new feature patch and it’s a set of APIs that allow a developer to call out. The benefit for the community is the ability to code more sophisticated use cases.

The benefit for Syncsort is now you can call out to our technology. So if you want to use our sort technology in the middle of the map sort or in the reduce merge, you can do that. If you want to call out to our ETL technology and actually have it handle some of the more complex data flows such as, for example, a large join, you can do that.

That carries a whole set of benefits, in terms of allowing developers to more easily design complex data flows. We don’t generate code on the back end. We actually directly integrate this with the MapReduce framework and at the same time, we’ll make that entire flow feel much faster.

We’ve done some initial tests around join. We’re seeing data points that are five times faster than a join written in Pig. We’re seeing examples that are 30 times faster than a join produced by code-generation approaches.

All the other data integration vendors are offering the benefit of ease of use by allowing you to use their graphical user interface to design data integration flows. On the back end, they're generating some sort of language that gets run in Hadoop. Talend generates Pig, Informatica generates HiveQL, but what we’re doing is different. What we’re doing is we’re actually integrating directly into the MapReduce framework. We’re not generating any code and we’re delivering that same ease of use, but we’re also delivering a massive performance benefit.

Lawson: You mentioned the conversation about using Hadoop as a sort of data integration platform. How do you see this impacting that?

Rogers: If you go to that article that Ted Friedman wrote, I think what he says is it’s still lacking the maturity and capability you need to do data integration. And we believe that we have an ability to lower those barriers significantly, particularly given the commitment of our contribution as well as our ability to plug our technology into that new patch.

Lawson: All of the barriers or some of the barriers?

Rogers: The barriers to be able to write complex data integration flows for enterprise IT and have them run on top of Hadoop. Today, certain processes are difficult to code in MapReduce and a good example would be a large join, because it’s a distributed processing framework. We make that very simple.
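(Editor’s note: For readers unfamiliar with why a join is awkward in a distributed framework, it helps to see how simple one is on a single machine. The sketch below is a basic in-memory hash join in plain Java. The class, method and sample data are hypothetical, and it is not Syncsort’s implementation. It builds a hash table from the smaller input, then probes it with each row of the larger one. In MapReduce, matching rows from the two inputs may sit on different nodes, so they first have to be brought together by key.)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Hadoop or any Syncsort product.
final class HashJoinSketch {

    // Inner join on the key. Each row is a two-element array: {key, value}.
    static List<String> join(List<String[]> small, List<String[]> large) {
        // Build phase: index the smaller input by key.
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : small) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Probe phase: look up each row of the larger input.
        List<String> joined = new ArrayList<>();
        for (String[] row : large) {
            for (String match : table.getOrDefault(row[0], List.of())) {
                joined.add(row[0] + "," + match + "," + row[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> customers = List.of(
            new String[] {"1", "Ann"},
            new String[] {"2", "Bob"});
        List<String[]> orders = List.of(
            new String[] {"1", "book"},
            new String[] {"3", "lamp"},
            new String[] {"1", "pen"},
            new String[] {"2", "mug"});
        // prints [1,Ann,book, 1,Ann,pen, 2,Bob,mug]
        System.out.println(join(customers, orders));
    }
}
```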
