I’m always curious about use cases with Hadoop, mostly because I feel there’s a lot of unexplored potential still.
For example, could Hadoop make it easier to achieve master data management’s goal of a “single version of the customer” from large datasets? During a recent interview with IT Business Edge, Ciaran Dynes said the idea has a lot of potential, especially when you consider that customer records from, say, banks can have up to 150 different attributes.
Hadoop can allow you to explore as many dimensions and attributes as you want, he explained.
“They have every flavor of your address and duplications of your address, for that matter, in that same record,” Dynes, Talend’s senior director of product management and product marketing, said. “What Hadoop allows you to consider is, ‘Let’s put it all up there for the problems that they’re presenting like a single version of the customer.’”
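To make the idea concrete, collapsing many variants of the same customer into one record is essentially a grouping-and-merging pass. The sketch below is purely illustrative (the field names and the simple match key are hypothetical; real MDM matching across 150 attributes uses far more sophisticated fuzzy logic):

```python
# Illustrative sketch of "single version of the customer":
# group raw records by a match key and merge their attributes.
# Field names and the match key are hypothetical, not Talend's method.

def merge_customer_records(records):
    """Collapse duplicate customer records into one per match key."""
    merged = {}
    for record in records:
        # Hypothetical match key: normalized name + last 4 phone digits
        key = (record["name"].strip().lower(), record["phone"][-4:])
        combined = merged.setdefault(key, {})
        for field, value in record.items():
            # Keep the first non-empty value seen for each attribute
            combined.setdefault(field, value)
    return list(merged.values())

raw = [
    {"name": "Jane Doe ", "phone": "555-0101", "address": "1 Main St"},
    {"name": "jane doe",  "phone": "555-0101", "address": "1 Main Street"},
    {"name": "John Smith", "phone": "555-0202", "address": "9 Elm Ave"},
]

golden = merge_customer_records(raw)
# The two Jane Doe variants collapse into one record, leaving two customers
```

The point of running this on Hadoop rather than a single server is that the grouping step can be distributed across the cluster, so every attribute variant can stay in play instead of being discarded up front.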
Dynes also thinks we’re still exploring the edges of Hadoop’s potential to change information management.
“We genuinely think it is going to probably have a bigger effect on the industry than the cloud,” he said. “It’s opening up possibilities that we didn’t think we could look at in terms of analytics, would be one thing. But I think there’s so many applications for this technology and so many ways of thinking about how you integrate your entire company that I do think it’ll have a profound effect on the industry.”
Talend released its Big Data edition last month. The company, which specializes in open source data management solutions, believes it offers some unique selling points, including better scalability and the ability to run jobs without an agent or engine.
“The reason Hortonworks is using Talend is because … one, they wanted something that was Apache, so that was open source, but the second thing they wanted was they didn’t want to be dependent on somebody else’s technology,” he said. “So what that basically means is what we give you is a tool. You generate the source codes, which is MapReduce and that is what you run on Hadoop. There’s nothing else required from Talend. You don’t require an agent, an engine. All we’re basically doing is making your job easier to basically do data integration and data quality with Hadoop.”
This “ready-to-run MapReduce code,” as he calls it, is designed for parallel programming, so it scales across the entire Hadoop cluster.
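Talend’s generated code is Java MapReduce that Hadoop distributes for you; the sketch below is not that code, just a minimal plain-Python illustration of the map → shuffle → reduce pattern such jobs follow, and of why it parallelizes so naturally:

```python
# Minimal sketch of the map -> shuffle -> reduce pattern behind
# generated MapReduce jobs. Plain Python for illustration only;
# real Hadoop jobs are Java that the framework runs in parallel.
from collections import defaultdict

def map_phase(lines):
    # Each mapper emits (key, 1) pairs independently -- here,
    # counting tokens in (hypothetical) weblog lines
    for line in lines:
        for token in line.split():
            yield token, 1

def shuffle(pairs):
    # Group values by key, as Hadoop's shuffle/sort step does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each key is aggregated independently of every other key --
    # that independence is what lets reducers run on many nodes
    return {key: sum(values) for key, values in groups.items()}

logs = ["GET /home", "GET /about", "GET /home"]
counts = reduce_phase(shuffle(map_phase(logs)))
# counts["GET"] == 3, counts["/home"] == 2
```

Because no mapper or reducer depends on any other’s state, the same logic scales from a laptop to the entire cluster, which is the property the generated code exploits.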
“The developers themselves never really have to learn MapReduce. They don’t have to learn Pig,” he told me during the Q&A. “All they basically need to know is what is your source data. It could be Oracle. It could be SQL, it could be Salesforce.com, it could be data from Apache weblogs or whatever it may be. Put it into Hadoop. And that’s it; it makes it really, really simple.”
Talend supports all distributions of Hadoop (Apache, Cloudera, Greenplum, Hortonworks and MapR, in case you’re wondering).