The last two years have seen a proliferation of SQL engines that can be layered on top of Hadoop. Now Pivotal, a unit of EMC, is trying to consolidate those engines by making the Pivotal Hawq software available as an open source project managed by the Apache Software Foundation.
At the same time, Pivotal announced it is making a library of machine learning algorithms, called MADlib, available under an open source Apache license. Those algorithms were developed by Pivotal in conjunction with the University of California, Berkeley; Stanford University; the University of Florida; and Pivotal customers.
Gavin Sherry, vice president and CTO for data at Pivotal, says that rather than letting the SQL engines that IT organizations might need to support on top of Hadoop continue to fragment over time, Pivotal is making the case for rallying around a single open source SQL engine. Initial support for that effort is coming from Hortonworks, one of the leading providers of a Hadoop distribution, and AltiScale, a provider of a Hadoop service delivered via the cloud.
For all the hype surrounding Hadoop, the primary way that organizations of all sizes continue to interact with data is SQL. Billions of dollars of investment in SQL applications have made SQL the lingua franca for business applications. Over time, other approaches to interacting with large amounts of data, such as Apache Spark and document databases based on JSON, will continue to gain traction. But for the foreseeable future, SQL will continue to dominate the query landscape.
The degree to which individual SQL engines running on top of Hadoop will add value to that equation, however, is debatable. Rather than force IT organizations and developers to master the nuances of individual SQL engines, Sherry says the time has come to accelerate usage of SQL within a Hadoop environment by creating a single open source implementation.