    Vendors Align to Drive Apache Spark Adoption

    As an in-memory cluster computing framework that extends Hadoop, the Apache Spark project has been gaining momentum as a vehicle for running a variety of real-time applications. This week at the Spark Summit 2014 conference, Cloudera, Databricks, IBM, Intel, and MapR Technologies announced an effort to standardize how Apache Spark is implemented.

    Justin Erickson, director of product management at Cloudera, says vendors will still compete in terms of how Apache Spark is used as a programming tool to build applications, but by agreeing to standardize the lower-level functions, the vendors are making sure those applications can be ported across different implementations.

    Hadoop was originally built as a framework for processing massive amounts of data in batch mode using the relatively arcane MapReduce programming model. As an alternative to MapReduce, Spark keeps working data sets in memory, which allows certain classes of Hadoop applications to run much faster.
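
    To make the difference concrete, here is a minimal sketch of a Spark job written in Python. The HDFS path and application name are illustrative; the point is that the data set is loaded into cluster memory once and reused across operations, rather than being reread from disk between the map and reduce phases of a batch MapReduce job.

    ```python
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    # Load the file once and cache it in cluster memory; later actions
    # reuse the in-memory copy instead of rereading it from disk.
    lines = sc.textFile("hdfs:///logs/events.txt").cache()  # hypothetical path

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))  # take() triggers the actual computation
    sc.stop()
    ```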

    Apache Spark is already being used as the basis for several projects that span everything from streaming for continuous data processing to graph analytics and machine learning. In addition, other programming environments, including Apache Crunch, Apache Mahout, and Cascading from Concurrent, now support Spark.
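
    As one example of those higher-level projects, the sketch below uses Spark Streaming's Python API to count words arriving over a TCP socket in five-second micro-batches. The host and port are placeholders, and this assumes a Spark build recent enough to include the Python streaming bindings.

    ```python
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingWordCount")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    # Treat text arriving on a TCP socket as a continuous stream of lines.
    lines = ssc.socketTextStream("localhost", 9999)  # placeholder host/port

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # print each micro-batch's counts

    ssc.start()
    ssc.awaitTermination()
    ```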

    The vendor alliance also announced this week that it will work collectively to port the Apache Hive SQL engine to run on Spark, improving the performance of SQL applications that run directly against Hadoop. The group is investigating ways to adapt Apache Pig, Sqoop and Search to utilize Apache Spark as well.
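
    For a flavor of what that means in practice: Spark already ships a HiveContext that can run HiveQL queries against tables registered in an existing Hive metastore, with the query plan executing on Spark's engine rather than MapReduce. This is Spark's own SQL path, not the Hive-on-Spark project itself, and the web_logs table below is hypothetical.

    ```python
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="SqlOnSpark")
    hive = HiveContext(sc)  # picks up the Hive metastore if one is configured

    # HiveQL is parsed as usual, but the plan runs on Spark's engine.
    top_pages = hive.sql("""
        SELECT page, COUNT(*) AS hits
        FROM web_logs            -- hypothetical table
        GROUP BY page
        ORDER BY hits DESC
        LIMIT 10
    """)

    for row in top_pages.collect():
        print(row)
    ```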

    With massive amounts of data already stored in Hadoop, it often makes more sense to run applications in that same environment than to move huge amounts of data elsewhere for processing.

    Erickson says that while Apache Spark represents one extension to Hadoop, Cloudera will continue to invest in other projects such as Impala, its engine for optimizing the performance of SQL queries on Hadoop.

    As Hadoop continues to evolve into a data hub, the programming environments surrounding it are starting to proliferate. Rather than remaining a batch processing engine, Hadoop is becoming a platform that an array of programming tools can put to work across a broad range of applications. In fact, the limiting factor for Hadoop may soon be the imagination of the developers using it rather than how the underlying data is actually stored.

    Mike Vizard
    Michael Vizard is a seasoned IT journalist, with nearly 30 years of experience writing and editing about enterprise IT issues. He is a contributor to publications including Programmableweb, IT Business Edge, CIOinsight and UBM Tech. He formerly was editorial director for Ziff-Davis Enterprise, where he launched the company’s custom content division, and has also served as editor in chief for CRN and InfoWorld. He also has held editorial positions at PC Week, Computerworld and Digital Review.
