Why Data Lakes Turn into Data Swamps

    Integration and data aren’t exactly what you’d call “controversial” topics, but occasionally, folks do get their dander up.

    Recently, we’ve seen some heated words about the concept of data lakes. Data lakes are a Big Data idea, which basically means you pump all of your data into one huge Hadoop system and use that as the foundation for analytics, queries, reports or what-have-you. Gartner called out the idea as a fallacy last month, leading to a biting backlash from Andrew C. Oliver. I heard from several companies after writing about this issue, including Teradata, which sells both Hadoop appliances and its own large-capacity analytics data platform.

    Teradata has actually been involved in building data lakes, so you’d think its leaders would argue against Gartner’s Nick Heudecker and Andrew White on this issue.

    Not so, it turns out. Dan Graham, Teradata’s general manager of enterprise systems, told me during a recent phone interview that he’d talked with Heudecker and agreed with him.

    “I just wish he’d written it two years sooner, because a lot of customers have gotten in trouble with data lakes, and they didn’t have to,” Graham told me.

    Companies that jumped in too soon, without the proper support, tended to wind up with what Graham calls “data swamps.” Data swamps, of course, are the opposite of what organizations intend when they set out to build a data lake: They are filled with disorganized, bad or even lost data.

    Data swamps are what happens when you inadvertently follow worst practices, Graham said. And that’s much easier to do than you’d think.

    “It’s still so new, there’s still more worst practices than best practices right now,” he explained.

    While organizations now understand that they can’t expect to replace their enterprise data warehouses with Hadoop clusters — “We have not lost one data warehouse to Hadoop,” Graham added — there are still dangerous misconceptions about data lakes. For instance, most companies don’t have the internal expertise to properly deploy a data lake, he said.

    Even those who do have the talent and are prepared to start now should expect the journey to take five or more years, he said.

    Graham said companies also underestimate the challenges of managing security and privacy in Hadoop.

    “Security is not something where you can pile on software,” he said. “As it stands now, securing Hadoop takes custom hardening, and even then, it’s not as secure as enterprise data warehouses.”

    The harsh realities of Hadoop are good business for Teradata, though. The company has long offered a Hadoop appliance, which runs Hortonworks’ distribution of the open source software. But this year, the company stepped up its Hadoop and Big Data offerings, primarily through acquiring companies.

    Its first release was Teradata QueryGrid, which Teradata developed years ago for eBay, according to Graham. It allows you to run queries in parallel on Teradata’s machine and a Hadoop cluster, using Teradata’s tools. A unique aspect of the tool is that it supports true parallel data exchange, he added, which significantly increases query speeds.

    In July, the company acquired Revelytix and Hadapt. Revelytix’s product, Loom, solves a major Hadoop challenge: Generating metadata and statistical information on large data sets as they’re migrated to Hadoop. It also addresses another Hadoop problem: Tracking data lineage.

    At the same time, the company addressed another common Hadoop complaint by acquiring Hadapt, a start-up with software that integrates SQL with Hadoop.

    This week, Teradata rounded out its offering by acquiring Think Big Analytics, an Americas-based consultancy that focuses on Hadoop and Big Data. The addition will bolster Teradata’s Center of Excellence services.

    Graham wouldn’t rule out future acquisitions, quipping that something was always in the works but he never knew when one might happen. There are fields of Big Data start-ups blooming, he pointed out.

    And, I might add, plenty of data reservoirs with which to water them.

    Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson at Google+ and on Twitter.

    Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, health care IT, cloud and Big Data.
