Loraine Lawson spoke with Pentaho founder and CEO Richard Daley about Hadoop, a distributed approach to storing and processing large amounts of data. And where there's data, there's a need to integrate it and give business users access to it. Until recently, you needed a Java developer to do that, according to Daley. He explains the unique integration challenges of Hadoop and how the open source data integration and BI company is tackling them.
Lawson: How does Hadoop impact or change data integration?
"We've given a visual environment to people who were only allowed to do things in Java code before. So we've opened up the accessibility to this new data store and processing system to a whole new audience of people who wouldn't have gotten there before."
Daley: What happens when you store data? You've got to get it out. And when you get it out, sometimes it's fine all by itself; other times you want to integrate it with other sources. So even if I'm just pulling out Web log information, I still want to be able to integrate that and potentially even cleanse it and run it through some data quality steps where I'm pulling out IP addresses. I want to find out what's the right geography for that, you know, what domain it specifically attaches to.
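The kind of enrichment Daley describes — pulling an IP address out of a raw log line and attaching a geography to it — can be sketched in a few lines of Java. This is purely illustrative: the log format and the in-memory lookup table are assumptions standing in for a real GeoIP service, not Pentaho's actual implementation.

```java
import java.util.Map;

// Illustrative sketch of the cleansing/enrichment step described above:
// extract the client IP from a web-log line, then look up its geography.
public class LogEnricher {

    // Assumes a common-log-format line that begins with the client IP.
    static String extractIp(String logLine) {
        int firstSpace = logLine.indexOf(' ');
        return firstSpace < 0 ? logLine : logLine.substring(0, firstSpace);
    }

    // Stand-in for a real GeoIP lookup; maps an IP to a region.
    static String geographyFor(String ip, Map<String, String> geoTable) {
        return geoTable.getOrDefault(ip, "unknown");
    }

    public static void main(String[] args) {
        Map<String, String> geoTable = Map.of("203.0.113.9", "US-East");
        String line = "203.0.113.9 - - [10/Oct/2023:13:55:36] \"GET /index.html HTTP/1.1\" 200";
        String ip = extractIp(line);
        System.out.println(ip + " -> " + geographyFor(ip, geoTable));
    }
}
```

In a visual ETL tool this would be two drag-and-drop steps — a field extraction and a lookup — rather than hand-written code.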
Other times, you want to take it out of Hadoop, aggregate it, and then integrate it with other sources that provide additional richness, such as customer data and/or industry benchmarks.
You had to be a Java programmer to get things in and out of Hadoop for the last couple of years. We've taken our data-integration tools, as well as our BI products, and we've integrated them with Hadoop. What does that mean? Our data-integration product has a visual design environment, so if we're building transformations and building steps and doing cleansing and so forth, that's all a drag-and-drop environment. Now we've given a visual environment to people who were only allowed to do things in Java code before. So we've opened up the accessibility to this new data store and processing system to a whole new audience of people who wouldn't have gotten there before. Then, in addition to the visual environment, we give them the scheduling capabilities and all of those things, which is making their lives much, much easier.
In addition to pulling things out of Hadoop and putting things into Hadoop, our ETL engine is actually all Java, so we can also run inside the Hadoop nodes. All of these nodes are basically cores, regular processors, and, as I mentioned before, Hadoop is also a file-processing system. So we can literally deploy our ETL engine out into each and every one of these nodes. Whether it's 50 nodes or 5,000 nodes, it doesn't make any difference. We perform a bunch of the cleansing and the transformations and even the integration at that node level in a highly distributed environment. When you're talking about this mass quantity of data, it makes a lot of sense to do it that way versus trying to extract a huge amount of data, put it through transformations and write it back out. This way we can do everything, if you will, on the same processors where that data is residing. So that's a big, big technical advantage that nobody else has out there in the market.
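The "move the work to the data" idea Daley describes can be illustrated with a toy Java sketch: each simulated node transforms and aggregates its own slice of records locally, and only the small aggregated results cross node boundaries. The node layout, record format, and cleansing step here are assumptions for illustration, not how Pentaho's engine is actually deployed.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy illustration of node-local processing: transform where the data
// lives, then ship only the aggregates instead of the raw records.
public class NodeLocalTransform {

    // One node's local work: cleanse its own records, then aggregate
    // them down to counts per domain.
    static Map<String, Long> transformLocally(List<String> nodeRecords) {
        return nodeRecords.stream()
                .map(String::toLowerCase) // stand-in cleansing step
                .collect(Collectors.groupingBy(r -> r, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Two "nodes", each holding a slice of the data.
        List<List<String>> nodes = List.of(
                List.of("example.com", "Example.com", "other.org"),
                List.of("other.org", "example.com"));

        // Run the transform on every node, then merge only the aggregates.
        Map<String, Long> merged = nodes.stream()
                .map(NodeLocalTransform::transformLocally)
                .flatMap(m -> m.entrySet().stream())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Long::sum));

        System.out.println(merged); // counts per domain across all nodes
    }
}
```

The contrast with the extract-transform-write-back approach is that here the full record lists never leave their "node"; only the per-domain counts are merged.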