Great Resources for Tackling Big Data

Mike Loukides, senior editor for O'Reilly Media, last week issued a "watchlist" of trends to track in 2011. At the top of his list is the Hadoop family of products.


Hadoop is a distributed way to store and process large amounts of data, and by large, we mean terabytes and petabytes. "Big Data" became a more mainstream topic in 2010, as Loukides points out, so his prediction for 2011 is that we'll see the Hadoop platform and its supporting technologies grow and mature.


We often talk about "Hadoop" and "MapReduce," but as Loukides points out, it's actually "an ecosystem of tools that interoperate - and the total is more than the sum of its parts." So this year, look for more on related solutions, including HBase, Pig, Hive, Mahout, Flume and ZooKeeper, he writes.
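
If "MapReduce" is still just a buzzword to you, here's a toy sketch of the idea in plain Python. To be clear, this is not Hadoop's API and nothing here comes from Loukides; the function names (map_phase, shuffle, reduce_phase, word_count) are my own, made up for illustration, and it all runs in one process on one machine. The point is only the shape of the work: map each record to key/value pairs, group the pairs by key, then reduce each group to a result. Hadoop's contribution is doing those same steps across many servers and very large files.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one line of text."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all the values collected for one key into a single total."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    lines = ["big data is big", "hadoop stores big data"]
    print(word_count(lines))
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across machines, which is what makes the pattern a good fit for terabytes of data.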


But I figure, why wait? I've found two excellent resources that explain some of these solutions and what they do.


If you're a CliffsNotes person, start with this ReadWriteCloud article, "How Twitter Uses NoSQL." It focuses only on solutions used by Twitter, but given that Twitter, Facebook, Google and their mega-data ilk pioneered the Big Data space, it's no small thing. The article nicely sums up Hadoop, Scribe, Pig, HBase, FlockDB (which is actually built on MySQL, so it's a bit different) and Cassandra, an open-source NoSQL Apache database created by Facebook.


It's a great quick resource, but really more of a glossary than anything else.


If you'd like a real nitty-gritty, in-depth discussion of Big Data solutions, and of how businesses like Disney and TransUnion are using these tools, then you'll want to explore the recent PricewaterhouseCoopers Technology Forecast, entitled "Making Sense of Big Data." It's available for online reading (follow the menu to the left of the page to catch all the articles), or you can download it as a free PDF, which I found easier given that it's 50 pages long.


I honestly cannot say how much I love this collection, and I think that if you're concerned with Big Data at all, you'll find something useful here, whether you care more about the business issues or the technologies. Frankly, it's a must-read for CIOs who want to stay abreast of Big Data and how it can be accessed, processed and used for answering real business questions. In fact, CIOs and other executive-level technologists are the primary audience; it even includes a piece entitled "Revising the CIO's Data Playbook," which looks at the business case and at getting started now with Big Data.


One point that's worth repeating from this article: This doesn't have to be an expensive endeavor. What has prompted many to start down the road of tackling Big Data with Hadoop and other tools is that it can be accomplished with aging servers and open-source software. In fact, Disney estimates its project cost around $300,000 to $500,000, roughly a tenth of the $3 million to $5 million it would normally have needed to do the same analysis without Hadoop.


As the PricewaterhouseCoopers article for CIOs advises:

Tools are not the issue. Many evolving tools, as noted in the previous article, come from the open-source community; they can be downloaded and experimented with for low cost and are certainly up to supporting any pilot project. More important is the aforementioned mind-set and a new kind of talent IT will need.

The article outlines three baby steps you can take to explore Big Data and then elaborates on how to take them:

  1. Start adding the needed skill sets for Big Data, either by training existing staff or recruiting from outside. Given how few people have experience with Big Data right now, I have to think the former would be easier, but that's just my two cents.
  2. Set up sandboxes to experiment with Big Data technologies.
  3. "Understand the open-source nature of the tools and how to manage risk." Their words, not mine, but I noticed while reading the other articles that TransUnion's Acting CTO John Parkinson cautioned that the software available today "has some bugs in it, and it doesn't behave very well in an enterprise environment."


I suggest you check this out sooner rather than later, since big research firms don't always leave these things unlocked indefinitely.