Edd Dumbill may have just won the argument over whether data lakes are a practical, achievable idea.
Data lakes are a simple enough idea: You dump a wide range of data into a Hadoop cluster and then put that data to use across the enterprise.
The problem is what Gartner calls the “Data Lake Fallacy,” which is the challenge of managing data lakes in a governable and secure way.
Dumbill acknowledges the barriers to data lake adoption in a recent O’Reilly Radar Podcast. Ultimately, though, the VP of strategy at Silicon Valley Data Science says data lakes will happen for one reason: Data lakes free data from enterprise silos.
“One of the hardest things for organizations to get their head around is getting data in the first place,” Dumbill told O’Reilly’s Mac Slocum. “A lot of CIOs will be, ‘Great, I want to do data science but I’ve got this database over here and this one over here and these all need to speak to each other and they’re in different formats and so on.’ In many ways, having data in a data lake provides you with a foundation (with) which you can start to integrate data with and then make it accessible as a building block in an organization.”
By creating a data lake, whether internally or in the cloud, you can make data more widely available for business intelligence or data analytics via APIs or data services, he said. In this way, data becomes the “building block of new things” and changes application development.
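To make the "data services" idea a little more concrete, here is a minimal sketch, not anything Dumbill described in the podcast. It assumes a hypothetical lake that is just a directory of raw CSV and JSON Lines files, and the function names `read_dataset` and `list_datasets` are invented for illustration. A real Hadoop-based lake would use its own storage and access layers; the point is only that consumers ask for a dataset by name and get uniform records back, whatever the source format.

```python
import csv
import json
from pathlib import Path


def read_dataset(lake_dir, name):
    """Return one dataset as a list of dicts, whatever its raw format.

    Assumes a hypothetical layout: <lake_dir>/<name>.csv or <name>.jsonl.
    """
    base = Path(lake_dir)
    csv_path = base / f"{name}.csv"
    jsonl_path = base / f"{name}.jsonl"
    if csv_path.exists():
        # CSV values come back as strings; no type inference is attempted.
        with csv_path.open(newline="") as f:
            return [dict(row) for row in csv.DictReader(f)]
    if jsonl_path.exists():
        # One JSON object per line; blank lines are skipped.
        with jsonl_path.open() as f:
            return [json.loads(line) for line in f if line.strip()]
    raise FileNotFoundError(f"no dataset named {name!r} in {base}")


def list_datasets(lake_dir):
    """List dataset names so consumers can discover what the lake holds."""
    return sorted(
        p.stem for p in Path(lake_dir).iterdir()
        if p.suffix in {".csv", ".jsonl"}
    )
```

An application would call `list_datasets("/path/to/lake")` to see what is available and `read_dataset("/path/to/lake", "customers")` to pull records, without knowing which database or file format the data originally came from.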
“I think it’s something the whole industry is going to come to, and the reason I’m kinda getting behind it is, it necessitates a change of thinking about the way you build applications,” Dumbill said. “We talk about data as a raw resource. The data lake as a technology helps us to focus on it and then we think about data and what we can make from it and how it can help the business instead of thinking I need to buy tool A, buy program B and use application C.”
That brings us full circle to Hadoop as the next application development platform, an idea Dumbill introduced last January.
My research for a recent Enterprise Apps Today article, “The Down Low on Data Lakes,” leads me to believe that most experts agree with Dumbill, but are urging caution because the concept is still so immature. As Teradata’s GM Dan Graham told me:
“It’s still so new there’s more worst practices than there are best practices right now. There are just not enough repeatable implementations. In fact, the vision of the data lake is not exactly harmonious across the vendors and the customers.”
There’s also the question of how you keep data lakes from becoming data swamps. Dumbill acknowledges these issues in the podcast, noting that you “have to get search right,” but experts cite further concerns about:
- Security
- Maintaining data lineage for auditing and compliance
- Data quality
- Data governance
Right now, there are no established answers to these concerns, and finding them may take the next five years.
“It’s going to be a five-year journey to get this nailed down to where it’s really humming along and doing what everybody wants it to do,” Graham said. “They may believe the data lake vision, but a fraction of them are actually building it right now.”
Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist.