Now that the Apache Software Foundation (ASF) has shipped version 3.0 of the Hadoop platform, providers of distributions of the Big Data platform are gearing up to support a wider range of application workloads in 2018.
Vinod Vavilapalli, development lead for Hadoop, MapReduce and YARN at Hortonworks, says the first of the two most significant additions to Hadoop is the ability to run applications that incorporate machine learning and deep learning algorithms on graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). Most of those applications require access to massive amounts of data that are most efficiently made available via platforms such as Hadoop, Vavilapalli notes.
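To give a sense of what this looks like in practice, YARN's resource-type framework lets administrators declare hardware such as GPUs as schedulable resources. The fragment below is a minimal sketch based on YARN's documented GPU support; the exact resource names and configuration files vary by Hadoop version and distribution.

```
<!-- resource-types.xml: declare GPUs as a resource YARN can schedule,
     so applications can request them alongside memory and vcores -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```

Once a resource type like this is registered, an application can ask the scheduler for, say, two GPUs per container the same way it asks for memory.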
In fact, Vavilapalli says, there has been very little innovation in the algorithms themselves in the last several years. What has changed is that the cost of making available the data needed to train models based on those algorithms has dropped considerably.
The second major class of applications that will find its way onto Hadoop will come in the form of Docker containers. The portability of Docker containers, Vavilapalli says, will make it possible to run many new and existing applications on top of Hadoop. He's also hopeful the Hadoop community will revisit how to integrate the YARN scheduler employed by Hadoop with Kubernetes, the container orchestration engine that is rapidly becoming a de facto standard.
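On the YARN side, Docker support is switched on per node. The fragment below is a rough sketch of the relevant NodeManager settings drawn from Hadoop's Docker-on-YARN documentation; property names and defaults differ across releases, so treat it as illustrative rather than a drop-in configuration.

```
<!-- yarn-site.xml: let the NodeManager launch containers through Docker
     in addition to the default Linux container runtime -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
```

With a setup along these lines, an application can opt into the Docker runtime at submission time and pull its own image, which is what makes existing containerized applications portable onto Hadoop.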
“YARN and Kubernetes are really just different types of schedulers,” says Vavilapalli.
Other notable new capabilities in Hadoop 3.0 include optimizations for real-time queries and long-running services, in addition to a faster data ingest engine. Integration with cloud storage services such as Amazon S3 (S3Guard), Microsoft Azure Data Lake, and Aliyun Object Storage System has also been improved.
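As one concrete example of the cloud storage work, S3Guard adds a consistent metadata layer on top of Amazon S3's object store. The fragment below sketches how a cluster might opt a bucket into S3Guard, based on the feature's documented configuration; the property name and store implementation are taken from the Hadoop S3A documentation and may differ by release.

```
<!-- core-site.xml: back S3A listings with a DynamoDB metadata store
     (S3Guard) instead of relying on S3's eventual consistency -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
```

The point of the design is that directory listings and file metadata come from a strongly consistent store, while the data itself still lives in S3.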
Vavilapalli says it will take some time for all the tooling that needs to be wrapped around Hadoop in an enterprise setting to become compatible with Hadoop 3.0. But once that happens, he says, it's only a matter of time before the range of applications running on top of Hadoop-based data lakes expands well beyond traditional batch-oriented analytics.