Companies can spend, and have spent, two to three years and millions of dollars building a Big Data solution just to analyze one specific set of data, points out Scott Hedrick, who leads Informatica's Big Data Ecosystems. And at the end of the day, that solution still answers only that one specific problem.
What organizations want is the data lake: a way to pool hundreds of data sources for analytics. For most companies, though, building one is still prohibitively costly and time-consuming, he added during a recent interview.
Capgemini would like to change that. The consultancy is partnering with Pivotal, which contributes its analytics technology and Hadoop distribution, and with integration powerhouse Informatica to give organizations a complete solution. Promotional materials promise a real-time, unified approach to information management that acts like “search in a box.”
“Capgemini has dubbed it a business lake, meaning there is some level of manageability and tools that you put on data lakes so it does not become a swamp,” Hedrick explained.
Informatica has a long history of working with both companies, going back to its high-speed connectivity to Pivotal's Greenplum.
Informatica's role in this partnership is obviously integration, but it also brings data quality capabilities to the table. Data quality is a serious concern with Big Data and data lakes, which generally run on Hadoop clusters. The solution uses Informatica's Vibe engine and library, which takes advantage of YARN's support for multiple data processing engines. Essentially, that means Vibe runs directly on the Hadoop nodes to deliver that much-touted Hadoop processing power.
In addition to running integration across thousands of nodes, Informatica can run data quality processing at the same scale, Hedrick said. Informatica also brings to Hadoop its MDM matching capabilities and address validation, as well as data cleansing, governance and metadata management.
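To make that concrete, here is a minimal sketch, in plain Python, of the kind of cleansing and matching work a data quality step performs: standardizing addresses and flagging likely duplicate records. The records, abbreviation table and similarity threshold are hypothetical, and the snippet uses only the standard library; it is not Informatica's Vibe engine or any of its APIs, just an illustration of the general technique.

```python
from difflib import SequenceMatcher

# Toy customer records pulled from two hypothetical source systems.
records = [
    {"id": 1, "name": "Acme Corp.",       "address": "100 Main St., Suite 4"},
    {"id": 2, "name": "ACME Corporation", "address": "100 main street ste 4"},
    {"id": 3, "name": "Globex Inc.",      "address": "742 Evergreen Terrace"},
]

# Crude address standardization: lowercase, strip punctuation, expand abbreviations.
ABBREVIATIONS = {"st": "street", "ste": "suite", "rd": "road", "ave": "avenue"}

def normalize_address(address: str) -> str:
    words = address.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def similarity(a: str, b: str) -> float:
    """Fuzzy string similarity between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()

# Pairwise matching: flag record pairs whose normalized addresses look like duplicates.
THRESHOLD = 0.85
for i, left in enumerate(records):
    for right in records[i + 1:]:
        score = similarity(normalize_address(left["address"]),
                           normalize_address(right["address"]))
        if score >= THRESHOLD:
            print(f"Possible duplicate: records {left['id']} and {right['id']} "
                  f"(address similarity {score:.2f})")
```

In the partnership Hedrick describes, logic of this sort would be expressed in Informatica's mappings and pushed down to run in parallel on the Hadoop nodes, rather than in a single process like this toy example.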
Obviously, that matters because it brings some order to the potential chaos of Hadoop. But here’s the other reason it matters: You don’t have to learn Pig, Hive or any of the other Hadoop languages to leverage it. You just have to know Informatica’s tools.
“We take care of all that complexity in our mappings,” he said. “One of the ways Capgemini likes to describe Informatica’s role in this Business Data Lake is ‘distillation’: taking data of varying quality from various sources and putting it through a process so you can distill it into great data that companies can put to work.”
If you’d like to learn more, PC Advisor has a good summary article with more from Capgemini. You can also read Informatica’s news announcement.
Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson on Google+ and Twitter.