Data experts might be confounded by the concept of data lakes, but apparently that isn’t stopping organizations from building them. A recent PWC feature article contends that data lakes are not only possible, but already exist in the form of Hadoop-based repositories.
The article highlights UC Irvine Medical Center’s data lake as one example. The health care provider deployed the Hadoop-based data lake as a single place to store records — structured, semi-structured and unstructured data — for more than a million patients.
What UC Irvine’s and other data lakes have in common is that they use Hadoop to store data in its native format for later parsing. You extract the data and load it into Hadoop, but skip the “transform” step of traditional ETL. This solves several problems:
- No integration work is required as there would be with a data warehouse. You’re simply extracting and loading the data — not transforming it.
- The data’s integrity and fidelity are maintained, so you can reuse it for different analyses in different contexts.
- The architecture is less expensive, less rigid and easier to modify than a relational data warehouse.
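The extract-and-load step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not UC Irvine’s or anyone’s actual pipeline; the directory layout, `load_raw` helper, and record shapes are all invented for the example:

```python
import json
import tempfile
from pathlib import Path

def load_raw(records, lake_dir, source_name):
    """Extract and load: write each record to the lake in its native
    form -- no transform step, no schema enforced on the way in."""
    lake = Path(lake_dir)
    lake.mkdir(parents=True, exist_ok=True)
    out = lake / f"{source_name}.jsonl"
    with out.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")  # stored exactly as received
    return out

# Records from different sources can have entirely different shapes,
# and the lake accepts all of them as-is.
path = load_raw(
    [{"patient_id": 1, "note": "free-text clinical note"},
     {"patient_id": 2, "vitals": {"hr": 72, "bp": "120/80"}}],
    tempfile.mkdtemp(), "clinical_feed",
)
```

The point is what’s missing: there is no schema mapping, no conformance layer, no integration logic. That work is deferred until someone reads the data.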
This significantly changes data integration’s role in how companies handle and manage data. Essentially, it eliminates data integration as an up-front bottleneck:
“Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model. Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential for operational insight and data discovery.”
It also changes data integration requirements as the data is accessed. Data lakes require fewer integration steps because they don’t enforce a rigid metadata schema.
“Instead, data lakes support a concept known as late binding, or schema on read, in which users build custom schema into their queries,” the PWC article states. “Data is bound to a dynamic schema created upon query execution.”
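That late-binding idea can be sketched in plain Python. This is an illustration of the concept only, not the PWC article’s implementation; the `query` helper, field names, and sample records are invented:

```python
import json

# Raw records stored exactly as they arrived -- no upfront schema.
raw = [
    '{"id": 1, "dept": "cardiology", "amount": "120.50"}',
    '{"id": 2, "dept": "radiology", "amount": "80"}',
    '{"id": 3, "note": "unstructured text, no amount field"}',
]

def query(raw_lines, schema):
    """Schema on read: the schema (field name -> cast function) is
    bound at query time, and records that don't fit are skipped."""
    for line in raw_lines:
        rec = json.loads(line)
        try:
            yield {field: cast(rec[field]) for field, cast in schema.items()}
        except (KeyError, ValueError):
            continue  # record doesn't match this query's schema

# Two users bind two different schemas to the same raw data.
billing = list(query(raw, {"id": int, "amount": float}))
depts = list(query(raw, {"id": int, "dept": str}))
```

Each caller imposes only the structure their question needs, which is the “custom schema built into the query” the article describes.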
What does that mean? It means integration will no longer require a massive project by the data warehouse teams and DBAs, but can be done by “localized teams of business analysts and data scientists…”
This actually makes metadata more important: the more you know about the data, the easier it is to build a query against it.
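One lightweight way to keep that knowledge from evaporating (a hypothetical sketch, not something prescribed by the article) is to register what each dataset contains when it lands, so later query authors don’t have to rediscover the fields by trial and error:

```python
# A minimal metadata catalog: dataset name -> what we know about it.
catalog = {}

def register(dataset, source, fields, description=""):
    """Record the known fields (name -> type) for a dataset in the lake."""
    catalog[dataset] = {
        "source": source,
        "fields": fields,
        "description": description,
    }

register(
    "clinical_notes", "ehr_export",
    {"patient_id": "int", "note": "string"},
    "Free-text notes, one record per encounter.",
)
```

A query author can then check the catalog before writing a schema-on-read query, instead of sampling raw files to guess what’s in them.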
Sounds great, right? What’s the catch?
Well… it turns out that many companies struggle to leverage the data once they’ve built the data lake. They dump everything in, and then forget what they’ve put there. In effect, they set out to build a data lake and ended up with a data graveyard, as the CTO of Cambridge Semantics so cleverly puts it.
The PWC article also explains how you can avoid that, of course. It supplies lots of reader-friendly graphics to explain data lakes and how they should function.
For more on data lakes, check out my other recent posts: