The data experts are still sounding the warning bell about data lakes, prognosticating a list of problems that data lakes will cause you.
Meanwhile, word on the street is that enterprises are building data lakes anyway, because everyone else thinks it’s a great idea. This means that many enterprises are now stuck looking for ways out of the prognosticated problems.
It’s going to get interesting for the rest of us—and possibly very expensive for some.
Gartner Director of Public Relations Christy Pettey revisited the problems of data lakes, drawing on Research Director Nick Heudecker’s presentation at the Business Intelligence & Analytics Summit. Pettey’s article identifies the three main problem areas with data lakes:
- Data governance. “By its definition, a data lake accepts any data, without oversight or governance,” Pettey writes. “Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp.” She adds that without metadata, every analyst must start from scratch. This is a point that I’ve seen over and over, but it’s worth noting that there are vendor tools that can scrape metadata from the data.
- Security and access control. This is a big issue for several reasons. First, data lakes do make a rich target for hackers. Second, since governance isn’t innate to data lakes, there is no lifecycle management, which further puts you at risk of creating compliance problems. Third, as Pettey points out, the “security capabilities of central data lake technologies are still emerging.”
- Performance questions. “Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure,” her article notes.
Cost is also a key question here—while Hadoop can run on any server, it still has an associated cost, SAS Principal Business Solutions Manager Anne Buff recently pointed out.
These points tend to be the big issues you see repeated over and over, but if you really want to drill down on both the pros and cons of data lakes, check out the Data Lake Debate, in which Buff is participating on the “con” side.
It’s an eight-week series on SAS’s website, moderated by industry analyst and SAS Best Practices VP Jill Dyché. Tamara Dull, director of emerging technologies on the SAS Best Practices team, advocates for data lakes. All three experts are quite witty, which makes the debate much less tedious than it might otherwise sound.
They’re about halfway through, with five posts so far, and the debate is loosely modeled on the Lincoln-Douglas format. I found it funny that each expert structured her first response exactly as you might expect a data expert to: with a definition of terms.
One of my favorite points, though, was made by Buff, and I think it is an overlooked issue in data lake discussions.
“I cannot stress enough that data brought into a data lake is co-located not integrated,” Buff states. “Even with schema on read, the integration happens outside of the storage environment—on the banks of this beautiful data lake.”
She explains why this is a problem: a lack of Hadoop talent for integrating that data, the need for custom code, and the associated costs. Dull counters that there are options for virtually accessing the data, but the point remains: Hadoop is essentially a storage system. It’s not an integration solution (although it can be used to build one). To solve that problem, you’ll need to look elsewhere.
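To make Buff’s “co-located, not integrated” point concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the in-memory `LAKE` dictionary stands in for a storage layer such as Hadoop, and the file names, field names and figures are made up. It shows two raw files sitting side by side, with the schema applied and the join performed in custom code outside the store.

```python
import csv
import io
import json

# Hypothetical "lake": raw files stored side by side, exactly as they arrived.
# The store knows nothing about their fields or how they relate to each other.
LAKE = {
    "crm/customers.csv": "cust_id,name\n1,Acme\n2,Globex\n",
    "web/orders.jsonl": (
        '{"customerId": "1", "total": 250.0}\n'
        '{"customerId": "2", "total": 75.5}\n'
        '{"customerId": "1", "total": 100.0}\n'
    ),
}


def read_customers(raw):
    """Apply a schema at read time: map customer id (int) to name."""
    reader = csv.DictReader(io.StringIO(raw))
    return {int(row["cust_id"]): row["name"] for row in reader}


def read_orders(raw):
    """Apply a schema at read time: yield (customer id, total) pairs."""
    for line in raw.splitlines():
        if line.strip():
            record = json.loads(line)
            yield int(record["customerId"]), float(record["total"])


def total_by_customer(lake):
    """Integrate the two sources outside the store by joining on customer id."""
    names = read_customers(lake["crm/customers.csv"])
    totals = {}
    for cust_id, total in read_orders(lake["web/orders.jsonl"]):
        name = names.get(cust_id, "unknown")
        totals[name] = totals.get(name, 0.0) + total
    return totals


print(total_by_customer(LAKE))
```

Note that the two files disagree on the key’s name (`cust_id` versus `customerId`) and type, and nothing in the store reconciles them. That reconciliation lives entirely in the reader functions, which is the custom code, and the cost, that Buff is describing.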
Check out the debate because it’s fun and informative, regardless of where your organization currently stands on data lakes.
Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson at Google+ and on Twitter.