If today’s release of Hortonworks Data Platform 2.0 (GA) is any indication, the Apache Hadoop community is working pretty darn hard to make sure Hadoop doesn’t become a silo unto itself.
Hortonworks is one of the few companies that offer an enterprise version of Hadoop. Shaun Connolly, Hortonworks’ vice president of corporate strategy, said the 2.0 GA updates incorporate the latest innovations across the Hadoop ecosystem.
For CIOs, that means Hadoop is no longer a single-use data platform that relies on MapReduce for batch processing. The platform can now handle multiple types of workloads, including interactive, online and streaming processing.
In other words, Hadoop can now deliver more, and deliver it faster.
“It basically is about enabling Hadoop to go beyond its first generation MapReduce batch-only roots to actually become a multi-purpose data processing platform,” Connolly explained. “That’s pretty significant in that we’re seeing a lot of customers putting a good bit of data into the platform, but wanting to interact with that data in a wide range of ways, not just the classic MapReduce batch ways.”
Primarily, that’s because of YARN, Hadoop’s new operating system, he explained.
“Basically, the prior MapReduce had a split brain, if you will. It had the classic MapReduce data processing, but it also had all of the operational, resource management of how to spread the loads out in a cluster,” Connolly said. “All of that secondary work has been generalized into YARN as a general purpose operating system that other engines can plug into.”
In practical terms, what that means is that you can perform more jobs, faster. So, existing MapReduce jobs will still run unaltered on this platform, but will get a speed boost. Hortonworks has seen jobs run twice as fast on the 2.0 version.
The platform can handle twice the number of jobs, and sometimes more, on the same hardware. For some companies, this means they won’t have to expand the cluster nearly as quickly, and it may even mean that some companies can shrink their cluster.
For CIOs, these updates make Hadoop a more flexible and dependable platform. Hadoop tends to become a data reservoir, holding not just existing enterprise data, but new data types from sensors, machines and clickstream, Connolly said. However, the challenge is ensuring that Hadoop doesn’t become a Big Data island.
The new release opens up your Big Data stores, allowing you to access that data in broader ways. So, for instance, you can offer access to the Hadoop data in an online database.
It also gives you more predictable performance, he said.
“You’re able to run many different types of workloads and it’s able to give you a consistency of service and performance that you would expect and that’s really what YARN does,” he said. “It opens up the platform for broader data processing, not just classic, and I think that’s important particularly as many organizations begin to extend their use of the Hadoop platform.”
These revisions also could impact enterprise integration work.
“Really the ultimate goal is to minimize the number of fragmented data clusters and sort of centralize all of that and you’ll be able to operate in the same central place,” he said.
The new edition of Hortonworks Data Platform is the first distribution to offer the latest releases of Hadoop and related Apache projects, such as Hive and HBase, according to Connolly. Some of these GA releases were made available only within the last month.
One other important update: more support for SQL. Apache Hive, which acts as a SQL interface for Hadoop, has been revamped. Hive queries that took 1,500 seconds to run last year can now finish in 10 to 12 seconds, he explained.