Better Data Integration with Hadoop? It’s Possible

Loraine Lawson

More organizations are using Hadoop not just to process large datasets, but also to replace the transformation engines in traditional ETL tools.

But is Hadoop capable of being a data integration platform, complete with data quality functions?

Gartner analyst Ted Friedman (@Ted_Friedman) thinks not. Friedman recently wrote a research paper, “Hadoop is Not a Data Integration Solution,” on the topic. The description sums up his point:

“As use of the Hadoop stack continues to grow, organizations are asking if it is a suitable solution for data integration. Today, the answer is no. Not only are many key data integration capabilities immature or missing from the stack, but many have not been addressed in current projects.”


I haven’t read the paper, because I’m not a client and it’s $195, but Todd Goldman has. Goldman is vice president and general manager for Enterprise Data Integration at Informatica. He wrote a response to the paper.

He says many companies are turning Hadoop into a data integration platform.

“Gartner is correct in that, Hadoop, by itself, is NOT a data integration platform,” Goldman writes. “However, it can be made into a data integration platform. Lots of companies are investing in making Hadoop based integration easier.”

Informatica did this by porting its Virtual Data Machine onto Hadoop, he adds, giving companies the same integration development environment they use for ETL jobs, with Hadoop as the underlying engine.

Not surprisingly, Informatica is not the only vendor investing in adding full data integration platform capabilities to Hadoop.

“The market in general is moving in this direction so expect to see some exciting capabilities emerging over the next six months,” he states, adding that some companies are already using graphical development environments with Hadoop rather than hand-coding MapReduce jobs. As a result, he says, they're able to create code five times faster.
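To give a sense of what "hand-coding MapReduce jobs" involves, here is a minimal Hadoop Streaming-style sketch in Python (a hypothetical illustration, not any vendor's product): a mapper that parses records into key/value pairs and a reducer that aggregates per key, the kind of plumbing that graphical tools generate for you.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map step: parse raw "customer_id,amount" records into
    (key, value) pairs, as a hand-written streaming mapper would."""
    for line in lines:
        customer_id, amount = line.strip().split(",")
        yield customer_id, float(amount)

def reducer(pairs):
    """Reduce step: sum amounts per customer. Hadoop's shuffle phase
    delivers pairs grouped by key; we sort here to simulate that."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)

if __name__ == "__main__":
    records = ["c1,10.0", "c2,5.0", "c1,2.5"]
    print(dict(reducer(mapper(records))))  # {'c1': 12.5, 'c2': 5.0}
```

Even this toy aggregation requires explicit parsing, key handling, and grouping logic; a graphical integration environment hides all of it behind a drag-and-drop transformation, which is where the claimed productivity gain comes from.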

Hadoop has already made it possible to run more complex transformations in substantially less time than traditional ETL tools. Some companies are even running sophisticated integration jobs, he adds, without hiring expensive data scientists or MapReduce specialists.

If you’d like to read more about Big Data integration, check out this piece by Richard Daley, industry veteran and co-founder of Pentaho. Daley looks at all the tools in the Hadoop stack and discusses supporting integration for other NoSQL solutions, such as MongoDB, Cassandra and HBase.

Feb 13, 2013 8:47 AM — Yves de Montcheuil says:
@Loraine, the right approach to leveraging Hadoop for ETL isn't to port a runtime engine to Hadoop, since that would require a pretty complex deployment on every single node. Rather, it is to generate native Hadoop code (MapReduce, Pig, Hive, etc.) and let Hadoop do what it's designed for: execute this generated job leveraging the parallel architecture.
