Data lakes may be controversial among technologists, but for many companies, they solve a pressing technology problem: bad data integration.
Andrew Oliver, president and founder of Open Software Integrators, has helped large companies build data lakes. There is debate about what data lakes are and whether they’re even practical at this point (Gartner, in particular, has spoken against the data lake hype; see Oliver’s rebuttal). In practice, though, data lakes are typically Hadoop clusters that draw data from multiple sources, Oliver said.
Often, he said, business leaders are just fed up with the bad, point-to-point integration they’ve seen with traditional BI projects.
“Business users who have been going through this sort of not-quite-integrated-as-much-as-they-need-to-be integrated and having to deal with point-to-point data integration, they tend to oftentimes drive this,” Oliver said. “The people who have to be dragged along are actually technical people, sysadmin-type people but technical people.”
Oliver defines data lake architecture as a “Hadoop File System with lots of directories and files on it.” So it makes sense that the use cases he cites for data lakes mirror Hadoop use cases. For instance, Oliver said data lakes could be used for:
- A repository that pools data from your systems so you can process it without crashing essential IT systems. For instance, some data warehouses can’t deliver what BI projects need at high speed without loss.
- Retaining data that would otherwise be lost. “In the past, you threw the data away unless you knew it was useful,” he said. “So it’s about data being innocent of being useless until proven useless.”
- Providing data for decision support systems. This might be used to automate something currently handled by a human with a spreadsheet, such as reordering a supply, or it might be used to respond to machine events, such as identifying credit fraud.
- Supporting machine learning and predictive analytics, as in the case of IP intrusion detection and network analytics.
- Storing sensor data.
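To make the “lots of directories and files” description concrete, here is a minimal sketch of that layout. It is an illustration only, not Oliver’s implementation: it uses the local filesystem as a stand-in for a Hadoop file system, and the function names (`land_record`, `read_source`), the source/year/month/day directory scheme, and the `part-0000.jsonl` file name are all assumptions made for this example.

```python
from __future__ import annotations

import json
from datetime import date
from pathlib import Path


def land_record(root: Path, source: str, day: date, record: dict) -> Path:
    """Append one raw record under <root>/<source>/<YYYY>/<MM>/<DD>/.

    Hypothetical layout: one directory per source system and per day, so
    history is retained without querying the source system again.
    """
    directory = root / source / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "part-0000.jsonl"  # assumed file name
    with target.open("a", encoding="utf-8") as handle:
        # Store the record as-is; no schema is imposed when writing.
        handle.write(json.dumps(record) + "\n")
    return target


def read_source(root: Path, source: str) -> list[dict]:
    """Read back every retained record for one source, oldest day first."""
    records: list[dict] = []
    # Zero-padded date directories sort chronologically as plain strings.
    for path in sorted((root / source).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records
```

In this sketch, data is written untouched and only interpreted when it is read, which matches the idea of keeping data until it is proven useless. A real cluster would use HDFS tooling rather than local paths.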
“For myself, working on some of these BI projects where the company is trying to become a data-driven company, not being able to get any of the information because it will bring the source systems down or none of the source systems keep enough history is a huge problem,” Oliver said. “So Hadoop and this concept of the data lake is one tool to help drive those changes that companies are trying to accomplish.”
Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist.