Pentaho is best known for its business analytics tool and open source heritage, but it is also one of the few BI vendors focused on solving integration problems. Founder and Chief Strategy Officer Richard Daley recently explained to IT Business Edge's Loraine Lawson what Pentaho is doing to address a critical Big Data issue: connecting Hadoop to other enterprise systems for better business intelligence analysis — not just for Pentaho BI customers, but for all BI systems.
"We like to think of ourselves as that fabric that lies across all of these types. For people who are already familiar with our tool, this just makes the step into those environments much, much easier. ... This is not just marketing speak; this is actually what we hear from prospects and partners and people out in the field using these things."
Lawson: Pentaho had two big announcements around Big Data recently, correct?
Daley: We announced the open sourcing of Kettle, which is our data integration product. We also changed the license over to an Apache license to make it, if you will, more consumable by that community, because most of the things that are around Big Data - Hadoop and a lot of the NoSQL technologies - are all in Apache. We thought about that for quite a while. We talked with a lot of different partners and community members from Apache, and they all welcomed it with open arms: bring it over, we'd love to have it. They like to keep things pure in that environment, so it's a much more natural fit to put it in there. It was a good move for us.
Lawson: What's the other announcement?
Daley: The next announcement is a strategic joint development partnership with DataStax, which is the company behind another Apache open source project called Cassandra. It's one of the more popular NoSQL technologies on the market, and the founders of DataStax were the original creators and architects of Cassandra. We've been working with them for about a quarter now, doing a lot of development. People can already download Pentaho Kettle, which now has tight integration with Cassandra. One of the biggest things the guys over there were asking for was help showing people how to use it - how to get data in and out of Cassandra, how to report and do analytics against it - so we've also created a lot of how-to video documentation.
Hadoop is more of a file processing system and uses MapReduce. In terms of storage, you use things like Cassandra or HBase. Once Hadoop is done processing the data, you're going to store it in some type of file or some type of database. So Hadoop then uses MapReduce to get the results over into things like Cassandra, which is a NoSQL database. It's very capable of handling unstructured data and real-time data, as well as structured data.
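The MapReduce pattern Daley refers to can be sketched in miniature. The following is a plain-Python illustration of the map/shuffle/reduce stages that Hadoop runs at scale - it is not Hadoop, Kettle, or Cassandra code, and all the names in it are illustrative; a real Hadoop job would implement Mapper and Reducer classes and write results out to a store such as HBase or Cassandra.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word, as a Hadoop mapper would.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key before they reach the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values - here, sum the counts per word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big systems", "data pipelines"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'systems': 1, 'pipelines': 1}
```

In Hadoop proper, the framework handles the shuffle across machines and the reduce output lands in files or a database, which is the hand-off to stores like Cassandra that Daley describes.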
Lawson: We're starting to see a fairly dense stack of technology around Big Data, and I think it's very confusing for people to understand what they're going to need and how it all plays together. So could you talk a bit about what you add to the stack, why you felt it was important to integrate with Cassandra, and how you see the Big Data stack playing out?
Daley: First of all, in terms of the environment, it's not going to be a one-stop shop for any of these technologies. What we see out in the field, where customers are actually putting these things into production, is a very clear hybrid environment. So Hadoop will coexist with NoSQL stores like Cassandra, and they're also going to coexist with existing relational data technologies and even some of the high-performance analytical engines. If you think about Hadoop or a NoSQL database, they're not meant to rip and replace existing relational environments entirely. They're there to augment those environments.