Confused About Spark? Six Facts You Need to Know

    Everything’s turning up Spark this summer—particularly after IBM announced plans to devote massive education, research and development resources to Spark projects. Big Blue is even opening a Spark Technology Center in San Francisco.

    But, as Information Management rightly points out, plenty of others are promoting Apache Spark, too: The project counts more than 500 contributors from more than 200 organizations. So, obviously, this is something IT should jump on right away if you don’t want to be hopelessly behind.

    But, let’s just hit pause a minute, shall we?

    Before you pay membership dues and start wearing a sparkly “Team Spark” t-shirt, read our business-focused mini-FAQ on Spark.

    What the heck is it? A lot of articles struggle with this, so here’s the short answer: It’s an open source cluster-computing framework for fast and flexible large-scale data analysis, according to Apache. Put more simply, it’s a scalable data processing engine. As you’ve no doubt heard, it began as a project at Berkeley AMPLab. It was submitted to Apache in June 2013, and within eight months it had attracted enough involvement to become a top-level project. That leads us to a common misconception…

    Will Spark usurp Hadoop? Nope. It’s apples and oranges, really. At its core, Hadoop is about storing data across a cluster, and Spark is about processing that data.

    What it most likely will replace is MapReduce, which is the processing model that shipped with Hadoop. Why? Because it’s faster, and by a lot. Apache reports that Spark runs programs up to 100 times faster than MapReduce in memory, or 10 times faster on disk. Databricks, a company founded by Spark’s creators, used Spark to sort 100 terabytes of records within 23 minutes.
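    If MapReduce itself is fuzzy, here is a rough feel for the model: a toy, single-machine word count in plain Python. It is not Hadoop's API, and the function names (map_phase, shuffle_phase, reduce_phase) are made up for illustration. A real MapReduce job spreads these phases across many machines and persists intermediate results to disk along the way.

```python
# Toy sketch of the map -> shuffle -> reduce pattern. Not Hadoop code;
# all function names here are hypothetical.
from collections import defaultdict


def map_phase(lines):
    # Emit a (word, 1) pair for every word in every line.
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    # Group the emitted values by key, the step that sits between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Collapse each key's list of values into a single count.
    return {key: sum(values) for key, values in groups.items()}


lines = ["Spark is fast", "Hadoop is storage", "Spark is processing"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"], counts["is"])  # prints: 2 3
```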

    “Spark is beautiful,” Rajiv Bhat, senior vice president of data sciences and marketplace at InMobi, told the Economic Times. “With Hadoop, it would take us six-seven months to develop a machine learning model. Now, we can do about four models a day.”

    What makes it fast? Spark is built around Resilient Distributed Datasets (RDDs). Here’s a lovely tech document about RDDs, but basically what you need to know is this: Spark uses clusters, yes, but it skips a bunch of that read/write-to-disk work by processing the data in memory on the clusters.

    So, it can leverage two Big Data tools, drawing on Hadoop for the stored data, but processing it in-memory for faster results.
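    To make the in-memory idea concrete, here is a toy sketch in plain Python. It is not Spark's API: ToyDataset, read_from_disk and run_three_times are hypothetical names invented for this illustration, and the "disk" is just a function that counts how often it is called. The dataset is a recipe (a loader plus recorded transformations) that is only run when results are requested, and it can optionally keep its result in memory.

```python
# Toy illustration only. This is NOT Spark's API; every name is hypothetical.

class ToyDataset:
    """A dataset described by a loader plus a chain of recorded transformations."""

    def __init__(self, loader, steps=()):
        self._loader = loader        # function standing in for a disk read
        self._steps = tuple(steps)   # transformations recorded, not yet run
        self._use_cache = False
        self._cached = None          # in-memory copy, filled on first collect()

    def map(self, fn):
        # Record the transformation; nothing is computed here.
        return ToyDataset(self._loader, self._steps + (fn,))

    def cache(self):
        # Ask for the computed result to be kept in memory after first use.
        self._use_cache = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)    # served from memory, no disk read
        rows = self._loader()
        for fn in self._steps:
            rows = [fn(row) for row in rows]
        if self._use_cache:
            self._cached = rows
        return list(rows)


reads = {"count": 0}


def read_from_disk():
    reads["count"] += 1
    return [1, 2, 3, 4]


def run_three_times(dataset):
    # Use the same dataset three times and report how many disk reads it cost.
    reads["count"] = 0
    totals = [sum(dataset.collect()) for _ in range(3)]
    return totals, reads["count"]


squares = ToyDataset(read_from_disk).map(lambda x: x * x)
uncached_result = run_three_times(squares)
cached_result = run_three_times(squares.cache())
print(uncached_result)  # prints: ([30, 30, 30], 3)
print(cached_result)    # prints: ([30, 30, 30], 1)
```

    Same answers, a third of the trips to storage. Real RDDs add a good deal more, such as splitting the data across machines and rebuilding lost pieces from the recorded recipe, which is where the "resilient" comes from.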

    Can you use Spark only with Hadoop? No. It’s commonly deployed on Mesos or Hadoop, but it’s not limited to either, according to Gartner analyst Nick Heudecker.

    What is the business application for Spark? The killer Spark use case seems to be highly iterative processing, like machine learning, according to Heudecker. It’s also useful in real-time analytics and potentially faster data integration, which is one of the uses start-up ClearData is exploring.
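    What does "highly iterative" look like in practice? A minimal sketch, again in plain Python rather than Spark, with made-up data and a hypothetical fit_slope function: a simple model is fitted by gradient descent, and every iteration scans the entire dataset. In a model that goes back to storage for each pass, that is one read per iteration; with the data held in memory, the passes are cheap.

```python
# Toy illustration in plain Python, not Spark code. All names are hypothetical.

def fit_slope(points, iterations=200, learning_rate=0.01):
    """Fit y = w * x by gradient descent; return (w, number of passes over the data)."""
    w = 0.0
    passes = 0
    n = len(points)
    for _ in range(iterations):
        # Every iteration scans the full dataset once.
        gradient = sum(2 * (w * x - y) * x for x, y in points) / n
        w -= learning_rate * gradient
        passes += 1
    return w, passes


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # made-up data, y = 2x
w, passes = fit_slope(points)
print(round(w, 3), passes)  # prints: 2.0 200
```

    Two hundred passes over four rows is nothing. Two hundred passes over terabytes is exactly the kind of job where skipping the disk starts to matter.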

    Doesn’t Tez do the same thing? Whereas Spark is a general-purpose data processing framework that is not limited to Hadoop, Tez is more of a scheduler and API management tool for Hadoop. Heudecker said it makes sense to look at Tez instead of Spark only if you’re building an app that will reside on the Hadoop ecosystem.

    Great! How do I start? Not so fast, warns Heudecker. It’s still very early days for Spark, and frankly, it doesn’t yet have the level of maturity enterprises need, he said. For instance, Spark does not ship with a resource manager, although you do tend to get that through Hadoop. It also requires a different skill set, since it’s written in Scala rather than Java.

    So far, people are very interested in Spark, but that’s not enough to drive enterprise adoption, Heudecker added. In fact, most of the companies using Spark thus far are start-ups with unique Spark-based solutions for data preparation or companies using it to power dashboards.

    “There’s no 80 percent use case,” Heudecker said. “There’s a whole lot of three percent use cases. I’m curious to see how adoption will progress over time.”

    The bottom line is, Spark has a lot going for it, including its processing speed and the full support of giant IBM. That said, it’s far from enterprise ready, and it’s not clear when, or even if, it will be.

    Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson at Google+ and on Twitter.

    Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, health care IT, cloud and Big Data.
