Databricks Marries Apache Spark to Serverless Computing


At the Spark Summit 2017 conference this week, Databricks announced that it will make an instance of the Apache Spark in-memory computing framework available as a managed cloud service running on top of a serverless computing environment.

In addition, Databricks revealed it will make curated instances of machine learning algorithms and tools available via its cloud service, along with an application programming interface (API) through which IT organizations can stream data into Spark as much as five times faster.

Databricks CTO Matei Zaharia says all three announcements are intended to reduce the amount of time it takes for organizations to start seeing a return on their investments in Apache Spark. Rather than having to acquire, configure and deploy all the infrastructure needed to run an instance of Apache Spark, Zaharia says, it's much simpler for the average end user to invoke a cloud service.

“In the early days of Spark, most of the end users were very technical,” says Zaharia. “Now we’re seeing a lot more other types of users that just want to use the data.”

The Databricks serverless computing framework, says Zaharia, is based on an event-driven architecture that makes infrastructure resources instantly available and makes it simpler to scale Apache Spark usage up and down as required using infrastructure provided by Amazon Web Services (AWS).

Zaharia notes that as usage of Apache Spark has expanded, the range of data sources it is being tied into now extends well beyond Apache Hadoop clusters. In addition, Zaharia notes that SQL is now the primary interface being used to launch queries against data stored in Apache Spark systems.

In general, serverless computing frameworks are still in their infancy. But as they continue to evolve, it's already clear that organizations of all sizes are about to experience much less friction in terms of how any given application consumes IT infrastructure resources.