Five Facts for Understanding Big Data and Hadoop

Loraine Lawson

Big Data is confusing — and not just for business leaders. CIOs find it befuddling, too. 

Scott Schlesinger, vice president and head of North American Business Information Management at Capgemini, shared a great anecdote about this. He attended a CIO symposium with some of the best and brightest CIOs in the nation. The speaker asked, “How many people here feel as though their organization has a Big Data problem?” and every hand in the room went up.

The speaker then asked, “How many of you feel as though you have an understanding of what that means and how to address the Big Data problem?”

All the hands disappeared.

So, if you’re confused about Big Data, you’re in good company.

That’s why I thought it might be helpful to clarify a few points about Big Data, and Hadoop in particular.

Big Data means more than a lot of data. While definitions vary, Big Data experts typically say Big Data fits one or more of the “Three Vs”: volume, velocity and variety. Often, Big Data is described as data that falls outside the scope of traditional relational databases because it’s too big, too unstructured or needs to be processed too fast.

Hadoop is a misnomer. What’s called “Hadoop” is really shorthand for a collection of open source software projects. Apache lists eight other solutions as Hadoop-related projects, which I’ll discuss in a future post. What people really mean by “Hadoop” is the “Hadoop Core,” which is:

The Hadoop Distributed File System, which is a way to create clustered, or distributed storage. Its advantage: It’s fast, self-healing and fault tolerant — which makes it hard to break. Best news: You can run it on any server (techies will call these “nodes”) you want, and since it’s open source, it’s cheap to do so. Need more space? Add more nodes.

MapReduce, which programmers use to write applications to very rapidly process large amounts of data on this file system.
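To make the “self-healing” storage idea a bit more concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not how the Hadoop Distributed File System is actually implemented: the function names `place_blocks` and `heal`, the node names and the round-robin placement are all made up for illustration. The one borrowed detail is the replication factor of three, which is the file system's default.

```python
REPLICATION = 3  # copies kept of each block (assumed here; also the HDFS default)

def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if len(nodes) < replication:
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def heal(placement, live_nodes, replication=REPLICATION):
    """Drop replicas on failed nodes, then copy blocks to spare nodes until
    each block is back to `replication` copies (or as many as nodes allow)."""
    healed = {}
    for block, holders in placement.items():
        alive = [n for n in holders if n in live_nodes]
        if not alive:
            raise RuntimeError(f"block {block} lost: no surviving replica")
        spares = [n for n in live_nodes if n not in alive]
        healed[block] = alive + spares[: replication - len(alive)]
    return healed

# Hypothetical five-node cluster storing a file split into four blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(4, nodes)

# node2 fails; the surviving copies are used to restore full replication.
live = [n for n in nodes if n != "node2"]
healed = heal(placement, live)
```

Because every block lives on several machines, losing one node loses no data, and adding a node simply gives the placement step another place to put copies.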

MapReduce is really the workhorse of Hadoop. MapReduce “divides and conquers” the data, according to Hadoop creator Doug Cutting. With MapReduce, you can put all those nodes to work processing the data locally, which makes it fast and powerful. It can process a whole data set that’s spread across hundreds, even thousands, of computers in parallel chunks. MapReduce then collects and sorts the processed data.
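For readers who like to see the idea in code, the “divide and conquer” pattern can be sketched in a few lines of plain Python. This is a single-machine imitation, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are invented for illustration, and a real cluster would run the map step on the nodes that hold each chunk.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Collect and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: combine all the values seen for one key."""
    return key, sum(values)

def word_count(chunks):
    mapped = []
    for chunk in chunks:  # each chunk stands in for data stored on one node
        mapped.extend(map_phase(chunk))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))

# Two chunks standing in for a data set split across two nodes.
chunks = ["big data is big", "hadoop handles big data"]
counts = word_count(chunks)
print(counts)
```

Each chunk is mapped independently, which is what lets the work spread across many machines; only the grouped results have to be brought together at the end.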

James Kobielus, who now works at IBM, told me while he was still Forrester Research’s Hadoop expert that MapReduce is really the heart of Hadoop.

And since this blog is about integration, it’s worth noting that MapReduce can be used to “enable” the data integration layer for Hadoop stores.

Hadoop is not an analytics platform. Although Hadoop is most frequently talked about and used for analytics, it’s not really an analytics platform. Actually, it can also be used with a more traditional analytic environment, like SAS, according to Ravi Kalakota, Ph.D. and partner at LiquidAnalytics. Another common way to use Hadoop for analytics is to use the R programming language — favored by statisticians — to write MapReduce jobs.

In addition to analytics, Hadoop can be used for archiving and for ETL (extract, transform and load) and filtering.
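As a rough picture of what an ETL-and-filter step looks like, here is a small Python sketch. It assumes a made-up log format of three comma-separated fields (timestamp, level, message); the function name `etl_mapper` and the sample rows are hypothetical. On a cluster, logic like this would typically run as the map step of a job, with each node cleaning its own slice of the data.

```python
import csv

def etl_mapper(lines):
    """Extract CSV records, filter out noise, and emit cleaned (timestamp, message) pairs."""
    for record in csv.reader(lines):
        if len(record) != 3:
            continue  # filter: skip malformed rows
        timestamp, level, message = (field.strip() for field in record)
        if level.upper() != "ERROR":
            continue  # filter: keep only error events
        yield timestamp, message.lower()  # transform: normalize the message text

# Hypothetical raw log lines, including one malformed row.
raw = [
    "2012-08-20 14:39, INFO, Job started",
    "2012-08-20 14:41, error, Disk FULL",
    "garbage line",
]
cleaned = list(etl_mapper(raw))
print(cleaned)
```

The “load” half of ETL is left out here; in practice the cleaned pairs would be written back to the file system or handed to a downstream database.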

Hadoop is not the only way to handle Big Data. Big Data has been around for a long time, as established vendors will tell you. In fact, you can handle high volumes of data with massively parallel-processing (MPP) databases, such as those offered by Greenplum, Aster Data (now owned by Teradata) and Vertica. And, of course, the big enterprise tech companies, including IBM, Oracle and Microsoft, all offer Big Data solutions, although, increasingly, they’re incorporating Hadoop into these platforms.

Aug 20, 2012 2:39 PM H.M. says:
Loraine, I agree that Hadoop is not the only way to handle Big Data. We are already hearing some complaints from companies trying to use Hadoop for critical production systems. As an alternative to Hadoop, LexisNexis has open sourced the HPCC Systems platform, which is a complete enterprise-ready solution. Designed by data scientists, it provides for a single architecture, a consistent data-centric programming language (ECL), a data processing and a data delivery cluster. Their built-in analytics libraries for Machine Learning and BI integration provide an integrated solution from data ingestion and data processing to data delivery, which is significantly more efficient to support and requires a lower number of resources. In contrast, the complexity of the Hadoop ecosystem requires a higher investment in technology and resources, up front and throughout. If you are interested, please take a look at http://hpccsystems.com.
Aug 31, 2012 4:39 PM Ken Rosen says:
As Hadoop's fame spreads, this mainstream clarity is really helpful. We've found a critical addition to the 3Vs: Distribution. That is, Volume, Velocity, and Variety are absolutely challenges. And so is the fact that real-world data is almost always spread out. The catch-all answer, "Just consolidate it in one data center," is increasingly expensive, risky, difficult...and simply impractical. More on this here: http://www.chiliad.com/are-you-in-big-data-or-just-medium-data/ People need another answer. They need to consider *Virtual* Consolidation--the same result without moving the data. Most don't know it's possible, so it is rarely discussed. Cheers, Ken
Apr 30, 2015 8:51 AM Ratnesh says:
Thanks for your post! Big Data has grown in significance over the last few years because of the pervasiveness of its application across areas ranging from weather forecasting to analyzing business trends, fighting crime and preventing epidemics. Big data sets are so large that traditional data management tools are incapable of analyzing all the data effectively and extracting valuable information from it. Hadoop is an open source Java framework that enables distributed parallel processing of large volumes of data across servers, and it has emerged as the solution for extracting potential value from all this data. More at www.youtube.com/watch?v=1jMR4cHBwZE
Apr 18, 2017 11:33 AM Big Data Hadoop Online Trainin says:
Good Knowledge sharing about Big Data Hadoop. Big Data Hadoop has a huge demand in IT Industry.
Sep 5, 2018 9:50 AM Hadoop Online Training says:
Hi, thanks for sharing this article and the nice explanation. It's very useful for beginners who want to build a career on the Hadoop platform.
Nov 28, 2018 9:39 AM venky says:
Inspirational content. I gained good knowledge from the above content on Hadoop Training, which is useful for all the aspirants of Hadoop Training.
Dec 8, 2018 7:25 AM Himagirish says:
Great post! Thank you for sharing such a nice article on Hadoop. Really very useful to Hadoop learners. Keep sharing.
