Hadoop is big, but there’s no doubt that the game changer will be marrying SQL, the primary language used by business analysts for ad hoc analysis, with Hadoop.
If you don’t want the information in Hadoop to become an inaccessible silo, then you have to address the SQL problem, as a recent Silicon Angle article, “Hadapt Furthers SQL-NoSQL-Hadoop Integration,” points out:
‘The promise of Hadoop is that the scalable, cost-efficient framework means companies can now afford to store all their data and break down the barriers between existing databases. But in reality at many early adopters Hadoop turns into yet another data silo, limiting Hadoop’s value.’
The problem is, SQL needs data in a traditional table format of rows and columns, but this is not how Hadoop organizes data.
Instead, Hadoop processes data with MapReduce, a framework that few people already know and that is difficult to learn.
Running a SQL query on Hadoop data stores can also consume all of the cluster’s resources, triggering performance issues for other applications and jobs running in the cluster, Hortonworks founder and architect Arun Murthy told IDG.
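To make the gap between the two models concrete, here is a minimal, single-machine sketch of the map, shuffle and reduce steps needed to reproduce one short SQL aggregate. It is an illustration only, not Hadoop's actual API: the sample records, the field names and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    yield record["region"], record["amount"]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three phases in sequence, in memory, on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# The same question in SQL is a single statement:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
totals = run_job(RECORDS)
print(totals)
```

An analyst who knows SQL writes the one-line query in the comment; the MapReduce version requires thinking in keys, values and phases, which is the learning curve described above.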
Vendors know that data stored in Hadoop is only useful if the SQL problem is solved, and since Hadoop first emerged on the enterprise scene, a number of announcements and solutions have targeted that sweet spot, as Ars Technica reported in its coverage of this year’s Hadoop Summit.
“It looks as if there’s not a Hadoop-related vendor here who isn’t promoting an SQL solution,” Hadoop Summit speaker and Chief Scientist at Mesosphere Paco Nathan told Ars Technica. “And a few of them sound too good to be true.”
In February, Hortonworks and the Apache community announced the Stinger Initiative, a three-phase project targeted at improving the SQL interface with Hadoop.
Because Hive is the de facto SQL engine for Hadoop, the group focused primarily on Hive, but it also addressed core Hadoop issues, such as boosting performance with YARN and, eventually, relying on Tez, a latency-reducing framework for complex data tasks, Jaxenter reports.
The goals are not small: they include improving Hive query performance 100-fold to allow interactive query times measured in seconds. The group also wants SQL queries to scale from terabytes to petabytes and Hive to support a broad range of SQL semantics for analytics applications running against Hadoop.
The project was set up in three phases, two of which are already completed. That means the initiative finished roughly two-thirds of its phases in about seven months.
Shaun Connolly, Hortonworks’ vice president of corporate strategy, said Stinger is already paying off. A few years ago, a Hive 0.10 query might take 1,500 seconds simply because Hive was batch-oriented, he explained. In the latest release, Hive 0.12, the same query runs in 10 to 12 seconds.
“So you’re seeing 60, 70, 100 times performance increase just because of this Stinger Initiative works at making Apache Hive, the de facto SQL interface, faster as well as more SQL compliant,” Connolly said. “Why that is important is you have this classic BI infrastructure of Tableau, MicroStrategy and others, even Excel, that want to interact with data in Hadoop, leveraging SQL.”
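As a quick arithmetic check on the single example query Connolly described, the sketch below divides the Hive 0.10 time by the Hive 0.12 times. It uses only the figures given above; the function and constant names are hypothetical, and the result says nothing about other queries or about the broader range he quoted.

```python
def speedup(before_seconds, after_seconds):
    """Return how many times faster the newer run is than the older one."""
    return before_seconds / after_seconds


# Figures taken from the example above.
HIVE_010_SECONDS = 1500
HIVE_012_FASTEST_SECONDS = 10
HIVE_012_SLOWEST_SECONDS = 12

# The slowest new time gives the smallest speedup, and vice versa.
low = speedup(HIVE_010_SECONDS, HIVE_012_SLOWEST_SECONDS)
high = speedup(HIVE_010_SECONDS, HIVE_012_FASTEST_SECONDS)
print(f"Speedup for the example query: {low:.0f}x to {high:.0f}x")
```

For that one query, the numbers work out to a speedup of 125 to 150 times.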
So far, the Stinger Initiative has delivered base (phase I) and advanced (phase II) optimizations, performance boosts by leveraging YARN, and support for:
- SQL analytic functions
- SQL types
- ORCFile, a modern file format
Phase II was delivered in September and included in Hortonworks Data Platform 2.0 GA, which was released yesterday.
The last phase of Stinger will include:
- Hive on Apache Tez
- Always on query service
- Buffer cache
- Cost-based optimizer (Optiq)
Hortonworks offers a three-step process, including tutorials, to help SQL users learn about Hive.