Joining the momentum building behind the open source Apache Spark in-memory computing framework, IBM today announced a major commitment to Spark in the form of its IBM SystemML machine learning software, which it will donate to a project that 3,500 IBM researchers in a dozen labs are already working on.
Joel Horwitz, director of portfolio marketing for the IBM Analytics Platform, says that IBM views the in-memory cluster computing framework as a foundational component of an emerging “insight economy” where analytics are processed in real time alongside transactions. As such, IBM will embed Apache Spark software into all of its analytics and e-commerce software, says Horwitz.
In addition, Horwitz says that IBM will offer Spark on its SoftLayer cloud, along with an instance of Spark that can be invoked as a service on the IBM Bluemix platform-as-a-service (PaaS) environment and provisioned in as little as 10 minutes. One of the things that makes this possible, says Horwitz, is that the application programming interfaces (APIs) created for Apache Spark are already well defined.
Horwitz says IBM is committed to making additional contributions to the project as it continues to invest in machine learning applications designed to, for example, advance gene sequencing or optimize transportation routes using data collected from millions of Internet of Things (IoT) endpoints. Horwitz adds that IBM is committed to extending the number of programming languages that can be used to create Spark applications. Spark itself, notes Horwitz, is written in Scala, a language that runs on the Java Virtual Machine.
IBM will also open a Spark Technology Center in San Francisco. The company is pledging to educate at least 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.
Although Spark is an in-memory compute engine layered on top of Hadoop and does not actually store data itself, Horwitz says it is becoming part of the logical data warehouse that is starting to emerge in Big Data environments. In fact, Spark can not only run orders of magnitude faster than Hadoop MapReduce, it also sharply reduces the number of machines needed in a cluster to process Big Data.
Horwitz says that IBM views Spark, a top-level Apache open source project originally developed at the AMPLab at the University of California, Berkeley, as being as significant an open source project today as Linux itself. The challenge, of course, is turning what is clearly still an emerging, immature technology into something that can be deployed in support of production applications across the enterprise.