I don’t think anyone really thought that Hadoop and other Big Data technologies would liberate us from the basics of data, such as integration and governance. It was just so easy to ignore those issues in the heady first years of Big Data hype and pilot projects. Now, it’s time to do the hard work of figuring out how to make all this data useful.
And, frankly, the to-do list just keeps growing.
Data integration expert David Linthicum added his concerns about data integration tools in a recent Informatica blog post. Linthicum is piggybacking on an idea proposed by analytics expert Tom Davenport. After interviewing data scientists for his research, Davenport concluded that the only way to meet the demand for Big Data analytics is to give data scientists better tools.
“I don’t think you can do all this without adopting some new approaches to data integration and curation,” Davenport is quoted as saying in VentureBeat. “It’s just not going to happen without that.”
Linthicum agrees wholeheartedly that the major challenge in moving to Big Data is cleaning and moving all that data. What he seems to disagree with, very politely and circumspectly, is Davenport’s contention that current tools are not enough.
“The good news is that data integration is not a new concept, and the technology is more than mature,” Linthicum writes. “What’s more, data cleansing tools can now be a part of the data integration technology offerings, and actually clean the data as it moves from place to place, and do so in near real-time.”
Of course, he’s also writing for Informatica’s blog, so maybe he has to say that. Regardless, I think the more important take-away for CIOs is this:
“The fact is, just implementing Hadoop-based databases won’t make a big data system work. Indeed, the data must come from existing operational data stores, and leverage all types of interfaces and database models. The fundamental need to translate the data structure and content to effectively move from one data store (or stores, typically) to the big data systems has more complexities than most enterprises understand.”
From Linthicum’s point of view, this means investing in a broader data strategy. For some, that may mean first investing in data integration and other enabling technologies before moving forward with Big Data.
You’ll need to judge for yourself where you are with that, but I have to warn you: Other analysts also worry about how all this will play out with large data sets.
As I pointed out in a prior post, experts say data lakes are more dream than reality at this point. Data lakes are a Big Data concept: The idea is to draw in data from all sorts of sources, structured and unstructured, to create a comprehensive “pool” of data.
Experts say the concept is lofty and, to a certain extent, technically possible, but they hasten to add that it’s also essentially useless, since there’s currently no way to properly manage the data.