Data experts might be confounded by the concept of data lakes, but apparently that isn’t stopping organizations from building them. A recent PWC feature article contends that data lakes are not only possible, but already exist in the form of Hadoop-based repositories.
The article highlights UC Irvine Medical Center’s data lake as one example. The health care provider deployed the Hadoop-based data lake as a single place to store records — structured, semi-structured and unstructured data — for more than a million patients.
What UC Irvine’s and other data lakes have in common is that they use Hadoop to store data in its native format for later parsing. You extract the data and load it into Hadoop, but skip the “transform” step of traditional ETL. This solves several problems:
- No integration work is required as there would be with a data warehouse. You’re simply extracting and loading the data — not transforming it.
- The data’s integrity and fidelity are maintained, so you can reuse it for different analyses in different contexts.
- The architecture is less expensive, less rigid and easier to modify than a relational data warehouse.
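The extract-and-load step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not UC Irvine’s or anyone’s actual pipeline; the directory layout, `load_raw` helper, and record shapes are all invented for the example:

```python
import json
import tempfile
from pathlib import Path

def load_raw(records, lake_dir, source_name):
    """Extract and load: write each record to the lake in its native
    form -- no transform step, no schema enforced on the way in."""
    lake = Path(lake_dir)
    lake.mkdir(parents=True, exist_ok=True)
    out = lake / f"{source_name}.jsonl"
    with out.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")  # stored exactly as received
    return out

# Records from different sources can have entirely different shapes,
# and the lake accepts all of them as-is.
path = load_raw(
    [{"patient_id": 1, "note": "free-text clinical note"},
     {"patient_id": 2, "vitals": {"hr": 72, "bp": "120/80"}}],
    tempfile.mkdtemp(), "clinical_feed",
)
```

The point is what’s missing: there is no schema mapping, no conformance layer, no integration logic. That work is deferred until someone reads the data.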
This significantly changes data integration’s role in how companies handle and manage data. Essentially, it eliminates data integration as an up-front bottleneck:
“Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model. Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential for operational insight and data discovery.”
It also changes data integration requirements as the data is accessed. Data lakes require fewer integration steps because they don’t enforce a rigid metadata schema.
“Instead, data lakes support a concept known as late binding, or schema on read, in which users build custom schema into their queries,” the PWC article states. “Data is bound to a dynamic schema created upon query execution.”
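That late-binding idea can be sketched in plain Python. This is an illustration of the concept only, not the PWC article’s implementation; the `query` helper, field names, and sample records are invented:

```python
import json

# Raw records stored exactly as they arrived -- no upfront schema.
raw = [
    '{"id": 1, "dept": "cardiology", "amount": "120.50"}',
    '{"id": 2, "dept": "radiology", "amount": "80"}',
    '{"id": 3, "note": "unstructured text, no amount field"}',
]

def query(raw_lines, schema):
    """Schema on read: the schema (field name -> cast function) is
    bound at query time, and records that don't fit are skipped."""
    for line in raw_lines:
        rec = json.loads(line)
        try:
            yield {field: cast(rec[field]) for field, cast in schema.items()}
        except (KeyError, ValueError):
            continue  # record doesn't match this query's schema

# Two users bind two different schemas to the same raw data.
billing = list(query(raw, {"id": int, "amount": float}))
depts = list(query(raw, {"id": int, "dept": str}))
```

Each caller imposes only the structure their question needs, which is the “custom schema built into the query” the article describes.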
What does that mean? It means integration will no longer require a massive project by the data warehouse teams and DBAs, but can be done by “localized teams of business analysts and data scientists…”
This actually makes metadata more important: the more you know about the data, the easier it is to build a query against it.
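One lightweight way to keep that knowledge from evaporating (a hypothetical sketch, not something prescribed by the article) is to register what each dataset contains when it lands, so later query authors don’t have to rediscover the fields by trial and error:

```python
# A minimal metadata catalog: dataset name -> what we know about it.
catalog = {}

def register(dataset, source, fields, description=""):
    """Record the known fields (name -> type) for a dataset in the lake."""
    catalog[dataset] = {
        "source": source,
        "fields": fields,
        "description": description,
    }

register(
    "clinical_notes", "ehr_export",
    {"patient_id": "int", "note": "string"},
    "Free-text notes, one record per encounter.",
)
```

A query author can then check the catalog before writing a schema-on-read query, instead of sampling raw files to guess what’s in them.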
Sounds great, right? What’s the catch?
Well… it turns out that many companies struggle to leverage the data once they’ve built the data lake. They dump everything in, and then forget what they’ve put there. In effect, they set out to build a data lake and ended up with a data graveyard, as the CTO of Cambridge Semantics so cleverly puts it.
The PWC article also explains how you can avoid that, of course. It supplies lots of reader-friendly graphics to explain data lakes and how they should function.
For more on data lakes, check out my other recent posts: