Where Will You Find Hadoop Talent?



My colleague Loraine Lawson, who has written extensively about Hadoop, including interviewing founder Doug Cutting, recently posed the question to me: Where will companies find Hadoop talent?


Job aggregator Indeed.com's graph of demand for Hadoop skills has shown a hockey stick, if there ever was one, since January 2010. The site named Hadoop among its most recent top 10 job skill trends. Open source vendor OpenLogic ranks it No. 4 on its list of the hottest open source projects in the past year. As with most new, hot technologies, there aren't enough people with those skills to go around.


(And with apologies to rivals such as LexisNexis, the name of the set of Apache projects seems to have become generic, just as Xerox came to mean "to copy.")


I posed Loraine's question to the Twitterverse, and James Kobielus, an analyst with Forrester Research, responded:

Companies will find Hadoop talent from within. People are teaching themselves. It's open source. It's made for that.

Kobielus has said that companies have yet to fully embrace Hadoop, telling SDTimes that "we're at the beginning of the maturation of this market." In other words, there's far to go yet. And since Oracle has just jumped into the market with Cloudera, it'll be interesting to see what comes of that.


Kobielus is quoted in this useful Computerweek article on what exactly these skills are. The article's definition of Hadoop:

Hadoop allows companies to store and manage far larger volumes of structured and unstructured data than can be managed affordably by today's relational database management systems.

Though it's often thought of solely in terms of Big Data - Yahoo is reported to have installed a 50,000-node Hadoop network - this IT World post makes the point that its ability to scale also allows it to effectively scale down to meet business needs.


There have been many references to two roles in Big Data - managing data and interpreting it - but the Computerweek article lists three:

  1. Data analysts or data scientists - Those who glean useful insight from the massive amounts of stored information. The skills: multivariate statistical analysis, data mining, predictive modeling, natural language processing, content analysis, text analysis and social network analysis. It also mentions experience in areas such as SAS, IBM's SPSS software and programming languages such as R. Lack of training or skills in this area, not just related to Hadoop, was seen as a major limitation on use of Big Data in a recent EMC survey.
  2. Data engineers - Those who create the data-processing jobs and build the distributed MapReduce algorithms for use by data analysts. Those with Java and C++ skills will have the edge.
  3. IT data management professionals - Those who choose, install, manage, provision and scale Hadoop clusters. It says these skills will be similar to those in traditional relational database and data warehouse environments.


However, the IT World article adds this:

Since Hadoop is Java-based, and MapReduce makes use of Java classes, a lot of the interaction is the kind where experience as a developer (and as a Java developer in particular) will be very handy. ... Hadoop, Hive, Sqoop, and other tools in the Hadoop ecosystem are controlled from the command line. ...
Hadoop-related jobs typically call for experience with large-scale, distributed systems, and a clear understanding of system design and development through scaling, performance, and scheduling. In addition to experience in Java, programmers should be hands-on and have a good background in data structures and parallel programming techniques. Cloud experience of any kind is a big plus.
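
To give a feel for the kind of Java work the IT World piece is describing, here is a minimal sketch of the map-and-reduce pattern as a word count in plain Java. It deliberately does not use the Hadoop API: the class name WordCountSketch and its methods are made up for illustration, and everything runs in a single process. On a real cluster, the framework would spread the map and reduce steps across many machines and handle the grouping between them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: plain Java, not the Hadoop MapReduce API.
public class WordCountSketch {

    // "Map" step: turn one line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: add up the counts emitted for each word.
    // A TreeMap keeps the result sorted by word for readable output.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Run the map step over every line, then reduce all pairs together.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String line : lines) {
            allPairs.addAll(map(line));
        }
        return reduce(allPairs);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hadoop stores data", "Hadoop scales");
        System.out.println(countWords(lines));
    }
}
```

The point of the split is that each map call looks at only one line and each reduce total depends on only one word, which is what lets a system like Hadoop farm the work out in parallel.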


At this point, though, Matt Asay, in a post at The Register, worries that the lack of Hadoop talent could stunt the technology's adoption. He tells of a London-based friend whose difficulty in finding Hadoop talent prompted him to put off his project and start a training business instead.


Indeed, Hadoop talent is overwhelmingly located in the Silicon Valley area, according to a study by North Carolina State University's Institute for Advanced Analytics. The 451 Group compared that with NoSQL and other projects, noting that the NC State study looked at the geographic distribution of LinkedIn members who mentioned Hadoop skills, and calling it "by no means perfect, but an insightful measure nonetheless."


Asay, who formerly worked for Ubuntu sponsor Canonical, suggests adopting one of its successful strategies:

Canonical has managed to hire a very strong team of Linux talent by paying well and letting developers work from home, whether that home is in Des Moines, Iowa or Villa Gesell, Argentina. ... For many top-quality engineers, the greatest perk of all, and one that might steer them to a Sears instead of Zynga, is the chance to stay in Canterbury, England, rather than moving to Menlo Park, California.

As for teaching yourself Hadoop, Loren Siebert, a San Francisco entrepreneur and software developer, wrote in a post on Cloudera's site:

The big challenge in my opinion is not that any one piece of the puzzle is too difficult. Any reasonably smart (or in my case stubborn) engineer can set themselves on the task of learning about a new technology once they know that it needs to be learned. The challenge with the Hadoop ecosystem is that it presents the newbie with the meta-problem of figuring out which of these tools are appropriate for their use case at all ...


My advice ... is to break down problems into a few discrete use cases and then work on ferreting out the technologies that are designed for that use case. ... Work toward putting something simple into production. Lather, rinse, and repeat.

Those looking for some outside help can check out the Hadoop Support wiki on the Apache site, as well as training courses through Cloudera University, MapR Academy, Hortonworks, IBM and others.