Strange are the ways of the technology market gods. While technology itself follows a fairly predictable bell curve of hype, the terms seem to come and go in spurts.
Several years ago, experts and vendors, such as those in this Forbes piece, would often talk about “data lakes” as a way of explaining Big Data’s capabilities. Big Data was going to change everything: No more silos, no more separation of structured and unstructured data, and no more need for data marts.
It was more of a metaphor for the capabilities than anything specific, as I recall.
But that changed at the end of 2013. Booz Allen Hamilton, and then Capgemini and Pivotal, decided to apply the term to proprietary solutions. The term also made some 2014 technology predictions lists.
“Business leaders, disappointed by the lack of business-relevant progress from IT, will assume more responsibility for Big Data initiatives,” IT services vendor EMC predicted in a Forbes blog post. “CIOs will respond by embracing the development of new data architectures that bring together silos of data into a single data lake.”
CIOs would come out ahead by switching to Big Data, which would lead to “supporting simple queries over complex tools.”
Uh-huh. Well, that would be nice, wouldn’t it?
Consider yourself forewarned: The analysts are talking about “data lakes,” and they are skeptical.
During a wide-ranging email exchange, Gartner researchers attempted to pin down the definition and value of a data lake, according to analyst Andrew White.
“In a nutshell there does seem to be confusion in the market place,” White writes. “And vendors, some vendors, are (predictably) taking advantage of the confusion.”
Now, this is my personal commentary and should in no way be seen as representing what White actually said, but I suspect that’s analyst-speak for: Nobody knows exactly how, or if, a data lake would actually work, so we can’t believe people are claiming to sell them.
You can read what he says to make up your own mind about that interpretation.
But his piece goes beyond “Buyer Beware.” He also offers a very good reason why data lakes are problematic at best and, at worst, a total waste of money that costs you your job:
“There is no attempt to manage the data in the lake; there is no need to qualify the data, its format, its quality, its consistency, or anything that would allow one data to relate to another. … That is the part that is winding me up: a data lake does not include any intrinsic, persistent, data lake-wide information governance.”
His advice is to steer clear of the hype, and, I assume, of data lakes themselves, at least until Gartner can research the concept further and create a “common understanding for what a data lake is and isn’t.”
Edd Dumbill, a principal analyst for O'Reilly Radar, may have the jump on Gartner with this topic. Dumbill wrote an often-cited column attempting to define and create a Hadoop maturity model that leads to the data lake.
That said, Dumbill would no doubt agree with White that data lakes are more dream than reality at this point.
“The data lake dream is of a place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment,” he wrote in January. “I call it a dream, because we’ve a way to go to make the vision come true. It is, however, an accessible dream.”
The idea of a data lake raises tricky issues beyond the problems of building it, governing it and ensuring data quality within it. Smart CIOs will also want to consider whether data lakes will require a new, more granular approach to security, which is explained in this recent CSO article.