There is little doubt that Hadoop adoption is growing, and not just among enterprise-sized organizations, but by small- and medium-sized businesses as well. In an effort to understand this maturing market more deeply, Pepperdata conducted a survey about how and why Hadoop is used for business operations.
The 134 survey respondents varied in experience, but all work at companies currently running Hadoop in production. Most held software engineering/development, data scientist, or data architect titles (25 percent, 17 percent, and 12 percent, respectively). The largest share (40 percent) came from the information technology industry, with education (11 percent) and financial services (10 percent) second and third. Over 45 percent have been in production for two years or more, and 15 percent of those are "advanced users" with four or more years in production.
In this slideshow, Pepperdata shares findings from the survey, such as key use cases, the size of Hadoop environments, and biggest challenges to production deployment.
Hadoop Use in BizOps
Click through for findings from a survey, conducted by Pepperdata, on how and why Hadoop is used for business operations.
Size Doesn’t Matter
The size of an organization does not always correlate with cluster size. Some of the largest Hadoop deployments belong to small shops, such as ad tech companies, digital marketing firms, and analytics departments that do not necessarily have large headcounts. Most organizations just starting out with Hadoop have one cluster for production and one for test/dev. For those who have not figured out how to reliably run multi-tenant environments, isolating clusters is common.
Types of Workloads Do Matter
There is an interesting correlation between workload type and the number of Hadoop clusters in production. Among respondents who cited "streaming/real-time" as one of their workloads, 46 percent had four or more clusters in production; among those without streaming or real-time workloads, only 20 percent did. The move to real time is adding cost and complexity to Hadoop deployments, because cluster isolation has become the de facto way to guarantee performance. To run Hadoop successfully in production, organizations need to move away from cluster isolation and toward Quality of Service for Hadoop, so that real-time/streaming applications (e.g., Spark) can run alongside batch workloads (e.g., MapReduce) on a single cluster.
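As one illustration of sharing a single cluster rather than isolating workloads, YARN's CapacityScheduler can partition cluster resources into queues with guaranteed minimums and elastic maximums. The sketch below is a hypothetical two-queue layout (the queue names and percentages are assumptions for illustration, not drawn from the survey):

```xml
<!-- capacity-scheduler.xml: hypothetical two-queue layout for mixed workloads -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>streaming,batch</value>
  </property>
  <property>
    <!-- Guarantee 40% of cluster resources to streaming/real-time jobs -->
    <name>yarn.scheduler.capacity.root.streaming.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- Let streaming elastically borrow idle capacity, but cap it at 60%
         so batch jobs are never starved -->
    <name>yarn.scheduler.capacity.root.streaming.maximum-capacity</name>
    <value>60</value>
  </property>
  <property>
    <!-- Guarantee the remaining 60% to batch (e.g., MapReduce) jobs -->
    <name>yarn.scheduler.capacity.root.batch.capacity</name>
    <value>60</value>
  </property>
</configuration>
```

Jobs are then directed to a queue at submission time, for example `spark-submit --queue streaming ...`, so both engines share the same cluster with predictable resource guarantees.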
Mixed Workloads Bring Cluster Chaos
In terms of the workloads organizations are running, MapReduce leads the pack, with an overwhelming 70 percent of respondents currently running it in production. Spark and Hive are close on its heels at 65 percent and 57 percent, respectively. Respondents also run HBase, batch workloads, Pig, streaming/real-time workloads, Impala, and Flume in smaller proportions. Given this breakdown, it is clear that many organizations run mixed workloads in production, increasing the risk of cluster chaos.
Hadoop Is Still Hard
Respondents face a number of challenges when working with Hadoop. The biggest reported challenge was a lack of expertise, or skills gap. Too much time spent troubleshooting, resource contention, and lack of visibility also came up as common problems. The list confirms that, for all the progress of the past decade, Hadoop remains challenging, especially in production environments.
It’s Going to Get Harder
The Hadoop ecosystem is not only growing but maturing at a rapid pace. New processing engines running on Hadoop are driving new real-time production use cases, each bringing its own performance challenges that must be managed to realize true operational value from Hadoop. To keep up, organizations using Hadoop and other tools in the ecosystem need solutions that help jobs complete on time, raise utilization of existing hardware, and guarantee Quality of Service for Hadoop.