What the Heck Is Hadoop, Anyway?

    By now, we’re familiar with the neat capabilities the Hadoop stack brings to data. But it can still be hard to understand what Hadoop actually is, particularly if you think about technology as fitting into certain buckets.

    In fact, there’s a tendency to think about Hadoop as a database, which is easy to understand, given the early conversations about its ability to store large sets of data.

    But that’s not what Hadoop is. In fact, that’s one of several limiting ways we talk about Hadoop.

    “The Hadoop stack is a data processing platform,” explains Mark Madsen (@markmadsen), an analyst and CEO of the research firm Third Nature, in a recent post, “What Hadoop Is. What Hadoop Isn’t.”

    It combines elements of databases, data integration tools and parallel coding environments into a new and interesting mix. The problem with the IT market today is that it distorts the view of Hadoop by looking at it as a replacement for one of these technologies.

    For instance, many see Hadoop (thanks to Hive) as competing with the data warehouse, but it’s not really up to the task, writes Madsen. He lists Hadoop’s weaknesses when compared to a parallel analytic database. Surprisingly, Hadoop’s shortcomings include scaling the workload he describes to the petabyte range.

    But on the other hand, Hadoop outdoes databases in other areas, including:

    • Cheaper storage and retrieval (through a limited SQL interface, he adds as a caveat).
    • Ease of use with parallel programming.
    • Scalability for storage and retrieval, as long as it’s not used for “interactive, concurrent, complex query use,” he writes.
    • Compatibility with cloud infrastructures.

    If it’s not a database, what is it?

    You may also see pieces about Hadoop’s data integration capabilities, with the focus on it as a replacement for the ETL engine. You’ll also see pieces about Hadoop as an analytics tool.

    And both of those are true to an extent. But just as with the database example, Hadoop brings new strengths and capabilities in some areas while coming up short in others.

    “What’s overlooked by all of these vendors is that the Hadoop stack is a processing platform,” Madsen explains. “It combines data storage, retrieval and programming into a single highly scalable package. This marriage of capabilities is what makes Hadoop unique.”
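
    To make the “processing platform” idea a bit more concrete, here is a minimal sketch of the map-and-reduce style of programming that Hadoop is known for. It is plain Python running on one machine, not Hadoop’s own API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, chosen only for illustration. On a real cluster the framework would spread these steps across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the step a framework
    performs between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values collected for one key into a total."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    lines = ["Hadoop is a platform", "a platform is not a database"]
    print(word_count(lines))
```

    The point of the sketch is the shape of the work: the data stays where it is stored, small functions are applied to each piece, and the results are combined afterward. That is storage, retrieval and programming working together, which is the combination Madsen describes.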

    Hadoop is a victim of our own tendency to think about technology in silos.

    So what does this mean for you as you explore Hadoop in your own organization?

    Think about Hadoop as a buddy technology. In some situations, it’s going to outperform existing tools — but in some cases, it won’t. Listen to how vendors talk about integrating these new or improved capabilities. And realize that it may take time for vendors and organizations to sort it all out, he writes.

    The reality of the market is that the technology needs to settle in the areas where it offers new capabilities or more effective or efficient replacement of the old. Vendors with products in areas of significant overlap need to integrate in new ways, extend their own tools or risk oblivion.

    Loraine Lawson
    Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, health care IT, cloud and Big Data.
