How Talend’s Solution Simplifies Hadoop MapReduce and Big Data Integration

Loraine Lawson

Ciaran Dynes, senior director of product management and product marketing at Talend, explains to IT Business Edge’s Loraine Lawson how the company’s Big Data edition helps organizations generate MapReduce code for Hadoop without hiring specialized developers.

Lawson: Can you talk a little bit about what you're offering and what problem it is you're trying to solve for companies with your solution?

Dynes: When we look at the integration market, there are typically three silos: data, applications and business processes. We offer different integration solutions for those three silos, and the way we build and manufacture the software, it’s all based on a common core platform.

What we provide on the Big Data side is very similar. Even though the technology of Hadoop is different and Big Data is different, the actual activities from an integration vendor’s perspective are much the same: taking existing data sources within an enterprise and loading them into something like a Hadoop cluster, and then, on the other side, where you might want to present that information for analysis, providing access to the information in the Hadoop cluster.

Where we at Talend have a differentiation today is that the way we’ve been doing integration from the get-go is based on a code generator. It’s already optimized to take advantage of the parallelism we get from Hadoop, because what we actually generate at the end of the day is MapReduce code.

“MapReduce” is effectively a fancy term for parallel programming, and that parallel programming is very sophisticated and very complex. We abstract away that complexity by giving you intuitive, simple-to-use developer tools, and what we generate at the end of the day is ready-to-run MapReduce code. So we hit that middle ground of developers who may not necessarily understand Big Data and Hadoop, but who would dearly love to integrate their data on top of Hadoop. And we don’t compromise on performance, because what we give you is MapReduce.

So jobs are generated to be parallel by default, and that’s the difference: when we generate MapReduce, it runs across the entire cluster. You get all of the scalability and management features that you get with Hortonworks, Cloudera and the other vendors.

Lawson: They’re not running analytics with Talend; they’re running integration jobs, right? How are you simplifying that?

Dynes: It’s running integration jobs, absolutely. I’m not sure if you’ve ever taken a look at a MapReduce job; it is actually pretty sophisticated in terms of building an algorithm. There are things provided by the Apache world, like Pig, which is basically a language that helps you write MapReduce, but even then, you’ve got to go and learn the language.
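For readers who have never seen one, here is roughly what even a trivial hand-written MapReduce job looks like: the classic word count, written in Java against the standard Hadoop API. This is a generic illustration of the boilerplate being discussed, not Talend-generated output.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The classic hand-written word count: a generic Hadoop example,
// not Talend-generated code.
public class WordCount {

  // Map phase: emit (word, 1) for every word in every input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  // Driver: wire the pieces together and submit the job to the cluster.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this toy example needs a mapper class, a reducer class and a driver before it will run; a realistic integration job multiplies that boilerplate considerably.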


What we give you is the familiar tool that you’ve been using for years around ETL, and it’s basically a drag-and-drop set of components on a palette. You wire those together, like you would any other ETL job from Talend, and you click “generate.” What it generates under the covers this time around is MapReduce code.

So it means the developers themselves never really have to learn MapReduce. They don’t have to learn Pig. All they basically need to know is their source data. It could be Oracle, it could be SQL, it could be Salesforce.com, it could be data from Apache weblogs or whatever it may be. Put it into Hadoop, and that’s it; it makes it really, really simple.
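As a rough illustration of the “put it into Hadoop” step, here is a minimal hand-coded sketch that copies a local Apache weblog into HDFS using the standard Hadoop FileSystem API. The class name and file paths are hypothetical, and this shows the general mechanism rather than what Talend generates.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: load a local Apache access log into HDFS so
// that MapReduce jobs on the cluster can process it.
public class WeblogLoader {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and the cluster address from the Hadoop
    // configuration files on the classpath.
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);

    Path local = new Path("/var/log/apache2/access.log"); // hypothetical
    Path remote = new Path("/data/weblogs/access.log");   // hypothetical
    hdfs.copyFromLocalFile(local, remote);
    System.out.println("Loaded " + local + " into " + remote);
  }
}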

That simplicity doesn’t come for free: if you were to get a developer to build the same thing by hand, it would take an advanced computer scientist to actually do it.

Lawson: Do you have some examples of the types of situations where businesses are using Talend’s Big Data solution?

Dynes: We’ve taken the product itself, which is Talend Open Studio, and released a Big Data edition. One of the consumers of that today is Hortonworks, which is maybe not as well known as Cloudera but will be vying for that number one position over the next 12 to 24 months.

They embed Talend Open Studio for Big Data in their solution, so we’re starting to see some use cases coming from Hortonworks. But closer to home, we’ve been working with a couple of financial institutions, who have been our primary use case, and we have one example today in telecom. By and large, there are a couple of use cases they typically focus on.

In the banks, I guess it’s no great surprise, they’re looking at fraud and fraud analytics. The piece we helped them with is pulling their existing data from a data warehouse or traditional databases and combining it with weblog information, or with other types of raw information they hold in flat file structures rather than relational tables.
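A common hand-coded pattern behind this kind of combination is a reduce-side join: the mapper tags each record with its source, and the reducer pairs up records that share a key. The sketch below is a simplified, hypothetical version; the CSV layouts, directory names and the customer-ID join key are all invented for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical reduce-side join of warehouse rows and weblog events,
// keyed by a shared customer ID in the first CSV column.
public class CustomerJoin {

  public static class TaggingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = record.toString().split(",", 2);
      if (fields.length < 2) {
        return; // skip malformed records
      }
      // Tag each record with the input directory it came from.
      String path = ((FileSplit) ctx.getInputSplit()).getPath().toString();
      String tag = path.contains("warehouse") ? "DB" : "WEB";
      ctx.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text custId, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      List<String> dbRows = new ArrayList<String>();
      List<String> webRows = new ArrayList<String>();
      for (Text r : records) {
        String[] parts = r.toString().split("\t", 2);
        (parts[0].equals("DB") ? dbRows : webRows).add(parts[1]);
      }
      // Emit every warehouse row paired with every weblog event
      // for the same customer.
      for (String db : dbRows) {
        for (String web : webRows) {
          ctx.write(custId, new Text(db + "\t" + web));
        }
      }
    }
  }
}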

The other use case we’re working on in finance at the moment is looking at trending information: the combination of MongoDB and other NoSQL databases, basically using that information with Hadoop to process very large volumes of log information. A couple of the banks we work with have some pretty advanced POCs (proofs of concept) in this space. I think it’s fair to say that unless you’re in search or in the advertising/marketing area, not many people have a lot of this type of technology in production today. A lot of what you see is early-stage analytics within finance, and people looking to explore what it may mean for their companies over the next number of years.

Then on the telco side, it’s basically processing CDRs, or call data records, which are again largely machine-generated. Millions of these records are generated every year because, effectively, a record is generated for every call you make on a telephone or a mobile phone. A lot of telcos look to analyze this information for numerous activities: some fraud, some trending, some customer experience. They look for unusual patterns; for example, if they see a lot of calls across a given territory being terminated and reconnected within a certain timeframe, they can start to analyze the health of the network. Many of these are in the proof-of-concept phase. Some are in production.
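To make the dropped-call pattern concrete, here is a hypothetical sketch of the reduce side of such a job: CDR events arrive grouped by subscriber, and the reducer counts terminations followed by a reconnect within a short window. The record format and the 30-second threshold are invented for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer for dropped-call detection. Each input value is
// assumed to look like "END,1362998400000" or "START,1362998410000"
// (event type plus a millisecond timestamp).
public class DroppedCallReducer extends Reducer<Text, Text, Text, LongWritable> {
  private static final long RECONNECT_WINDOW_MS = 30000L; // assumed threshold

  @Override
  protected void reduce(Text subscriberId, Iterable<Text> events, Context ctx)
      throws IOException, InterruptedException {
    // Collect the subscriber's events as [timestamp, 0 = START / 1 = END].
    List<long[]> timeline = new ArrayList<long[]>();
    for (Text event : events) {
      String[] parts = event.toString().split(",");
      timeline.add(new long[] {
          Long.parseLong(parts[1]),
          parts[0].equals("END") ? 1L : 0L
      });
    }
    timeline.sort((a, b) -> Long.compare(a[0], b[0]));

    // Count an END immediately followed by a START within the window:
    // a likely dropped-and-redialed call.
    long suspectedDrops = 0;
    for (int i = 0; i + 1 < timeline.size(); i++) {
      boolean endThenStart =
          timeline.get(i)[1] == 1L && timeline.get(i + 1)[1] == 0L;
      long gap = timeline.get(i + 1)[0] - timeline.get(i)[0];
      if (endThenStart && gap <= RECONNECT_WINDOW_MS) {
        suspectedDrops++;
      }
    }
    if (suspectedDrops > 0) {
      ctx.write(subscriberId, new LongWritable(suspectedDrops));
    }
  }
}

Aggregating the per-subscriber counts by cell tower or region would then give the network-health view Dynes describes.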

Lawson: Are there things you had to do to your tool to prepare it for Hadoop?

Dynes: The answer to that is no, and what you’ve just stumbled upon is the unique differentiation of Talend. I can prove it by way of example. The reason Hortonworks is using Talend comes down to a couple of things. One is they wanted something that was Apache, so open source; the second is they didn’t want to be dependent on somebody else’s technology.

So what that basically means is that what we give you is a tool. You generate the source code, which is MapReduce, and that is what you run on Hadoop. There’s nothing else required from Talend; you don’t require an agent or an engine. All we’re doing is making it easier to do data integration and data quality with Hadoop.
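In practice, “nothing else required” means the output is an ordinary Hadoop job: a jar with a driver like the hypothetical one below (reusing the CustomerJoin sketch from earlier), submitted with the standard hadoop jar command. No vendor agent runs on the cluster; the MapReduce framework does all the work.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for the CustomerJoin sketch shown earlier.
public class JoinDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "customer join");
    job.setJarByClass(JoinDriver.class);
    job.setMapperClass(CustomerJoin.TaggingMapper.class);
    job.setReducerClass(CustomerJoin.JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // e.g. warehouse exports
    FileInputFormat.addInputPath(job, new Path(args[1])); // e.g. weblog directory
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    // Submits to the cluster and blocks until the job finishes; once it
    // completes, nothing of ours is left running on the nodes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}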

We added a number of different connectors and components to generate the code, but our platform is uniquely different from anybody else’s. What you’ll find with other vendors’ solutions is that they do need to deploy their own engines on the Hadoop clusters, and therein lies the problem. It’s fine on a small cluster, but when you start looking at, you know, 10-, 15-, 20-node clusters, which is what some organizations will eventually get up to, you’re looking at incremental software costs and at maintaining software that’s got nothing to do with MapReduce except running the engine. We just generate code, and that’s the beauty of the solution.



Comments

Ajay Barve (Mar 11, 2013): Ciaran, Loraine, thanks for explaining MapReduce and Talend. Very informative interview. I am working on developing an ETL process with MapReduce, where I need to decide whether we should write the mapper and reducer ourselves or use Talend to generate the MapReduce code, which will run in a Cloudera environment. Unfortunately, I have not been able to find a single code sample or white paper showing the MapReduce code Talend generates. Can you ask one of your team members to share some sample code and a link to a document?

Pcoffre (May 27, 2013): Hi Ajay, hope I’m not answering your request too late! First, thanks for your message. Here is a link to a video dedicated to MapReduce on our Talend.com site: http://www.talend.com/resources/webinars/leverage-hadoop-mapreduce-in-talend Best, Pcoffre.
