As an in-memory cluster extension to Hadoop, the Apache Spark project has been gaining a fair amount of momentum as a vehicle for running a variety of real-time applications. This week, an effort to standardize how Apache Spark is implemented was announced at the Spark Summit 2014 conference by Cloudera, Databricks, IBM, Intel, and MapR Technologies.
Justin Erickson, director of product management at Cloudera, says vendors will still compete in terms of how Apache Spark is used as a programming tool to build applications, but by agreeing to standardize the lower-level functions, the vendors are making sure those applications can be ported across different implementations.
Hadoop was originally built as a framework for processing massive amounts of data in batch mode using the relatively arcane MapReduce programming model. As an alternative to MapReduce, Spark allows certain classes of Hadoop applications to run much faster by keeping data in memory.
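For readers unfamiliar with the MapReduce model, a toy sketch in plain Python (not actual Hadoop or Spark code) illustrates the map, shuffle, and reduce phases behind a classic word count. In a real Hadoop job these phases run across a cluster and write intermediate results to disk between stages; it is that disk round-trip that Spark's in-memory execution largely avoids.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key (the word).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["spark runs in memory", "hadoop runs in batch"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # e.g. {'spark': 1, 'runs': 2, 'in': 2, ...}
```

The same computation in Spark collapses to a couple of chained transformations over an in-memory dataset, which is one reason iterative workloads such as machine learning see the largest speedups.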
Apache Spark is already being used as the basis for several projects that span everything from streaming for continuous data processing to graph analytics and machine learning. In addition, various other programming environments, including Crunch, Mahout, and Cascading from Concurrent, now support Spark.
The vendor alliance also announced this week that it will collectively work to move the Apache Hive SQL engine to Spark to improve the performance of SQL applications running directly against Hadoop. The group is investigating ways to adapt Apache Pig, Sqoop and Search to utilize Apache Spark as well.
With massive amounts of data already residing in Hadoop, it often makes more sense to run applications in the same environment than to try to move huge volumes of data elsewhere for processing.
Erickson says that while Apache Spark represents one extension to Hadoop, Cloudera will continue to invest in other projects such as Impala, its effort to optimize SQL application performance on Hadoop.