Better Data Integration with Hadoop? It’s Possible

    More organizations are using Hadoop not just to process large datasets, but as a replacement for the transformation engines in ETL.

    But is Hadoop capable of being a data integration platform, complete with data quality functions?

    Gartner analyst Ted Friedman (@Ted_Friedman) thinks not. Friedman recently wrote a research paper, “Hadoop is Not a Data Integration Solution,” on the topic. The description sums up his point:

    “As use of the Hadoop stack continues to grow, organizations are asking if it is a suitable solution for data integration. Today, the answer is no. Not only are many key data integration capabilities immature or missing from the stack, but many have not been addressed in current projects.”

    I haven’t read the paper, because I’m not a client and it’s $195, but Todd Goldman has. Goldman is vice president and general manager for Enterprise Data Integration at Informatica. He wrote a response to the paper.

    He says many companies are turning Hadoop into a data integration platform.

    “Gartner is correct in that, Hadoop, by itself, is NOT a data integration platform,” Goldman writes. “However, it can be made into a data integration platform. Lots of companies are investing in making Hadoop based integration easier.”

    Informatica did this by porting its Virtual Data Machine onto Hadoop, he adds, giving companies the same integration development environment they use for ETL jobs, with Hadoop as the underlying engine.

    Not surprisingly, Informatica is not the only vendor investing in adding full data integration platform capabilities to Hadoop.

    “The market in general is moving in this direction so expect to see some exciting capabilities emerging over the next six months,” he states, adding that some companies are already using graphical development environments with Hadoop instead of hand-coding MapReduce jobs. Those companies, he says, are able to create code five times faster.
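    To make concrete what "hand-coding MapReduce jobs" involves, here is a minimal sketch of the map/shuffle/reduce pattern a developer would otherwise build in Hadoop's Java API. It is written in plain Python for readability, uses the canonical word-count example, and the function names are illustrative only, not any vendor's API:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here, one (word, 1) per token.
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values -- here, a simple sum.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop is not ETL", "Hadoop is an engine"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # → 2
```

    Even for this trivial job, the developer must reason about key design, grouping, and aggregation by hand; a graphical integration tool generates this plumbing, which is the productivity gap Goldman describes.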

    Hadoop has already made it possible to run more complex transformations in substantially less time than with traditional ETL tools. Some companies are even running sophisticated integration jobs, he adds, without hiring expensive data scientists or MapReduce specialists.
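    As a hypothetical example of the kind of transformation such jobs perform, a common ETL step, cleansing and aggregating customer records, fits the same key/value pattern, which is what lets Hadoop spread it across a cluster. The data and field names below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical raw records, as an extract step might deliver them.
raw = [
    {"customer": " Acme ", "amount": "100.50"},
    {"customer": "acme",   "amount": "49.50"},
    {"customer": "Globex", "amount": "200.00"},
]

def transform(record):
    # Cleanse: trim and case-fold names, parse amounts to numbers.
    return record["customer"].strip().lower(), float(record["amount"])

totals = defaultdict(float)
for record in raw:
    key, amount = transform(record)
    totals[key] += amount  # Aggregate spend per customer.

print(totals["acme"])  # → 150.0
```

    In a real Hadoop job the transform would run in the map phase and the per-customer aggregation in the reduce phase, with the framework handling distribution.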

    If you’d like to read more about Big Data integration, check out this Big Data integration piece by Richard Daley, industry veteran and co-founder of Pentaho. Daley looks at the tools in the Hadoop stack and discusses supporting integration for other NoSQL solutions, such as MongoDB, Cassandra and HBase.

    Loraine Lawson
    Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, health care IT, cloud and Big Data.