Five Facts for Understanding Big Data and Hadoop

Loraine Lawson

Big Data is confusing — and not just for business leaders. CIOs find it befuddling, too. 

Scott Schlesinger, vice president and head of North American Business Information Management at Capgemini, shared a great anecdote about this. He attended a CIO symposium with some of the best and brightest CIOs in the nation. The speaker asked, “How many people here feel as though their organization has a Big Data problem?” and every hand in the room went up.

The speaker then asked, “How many of you feel as though you have an understanding of what that means and how to address the Big Data problem?”

All the hands disappeared.

So, if you’re confused about Big Data, you’re in good company.

That’s why I thought it might be helpful to clarify a few points about Big Data, and Hadoop in particular.

Big Data means more than a lot of data. While definitions vary, Big Data experts typically say Big Data fits one or more of the “Three Vs”: volume, velocity and variety. Often, Big Data is described as data that falls outside the scope of traditional relational databases, either because it’s too big, too unstructured, or must be processed too quickly.

Hadoop is a misnomer. What’s called “Hadoop” is really slang for a bunch of different open source software. Apache lists eight other solutions as part of Hadoop-related projects, which I’ll discuss in a future post. What people really mean by “Hadoop” is the “Hadoop Core,” which is:

The Hadoop Distributed File System (HDFS), which is a way to create clustered, or distributed, storage. Its advantage: It’s fast, self-healing and fault tolerant — which makes it hard to break. Best news: You can run it on any server (techies will call these “nodes”) you want, and since it’s open source, it’s cheap to do so. Need more space? Add more nodes.

MapReduce, which programmers use to write applications that very rapidly process large amounts of data on this file system.
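HDFS's core storage idea — split each file into fixed-size blocks and replicate every block on several nodes, so losing a node loses no data — can be sketched in plain Python. This is a toy model for illustration only, not real HDFS; the block size, replication factor, and round-robin placement here are deliberately tiny and simplified (real HDFS defaults are a 128 MB block size and a replication factor of 3, with rack-aware placement):

```python
BLOCK_SIZE = 4   # bytes per block (toy value; HDFS default is 128 MB)
REPLICATION = 3  # copies of each block (HDFS default is also 3)

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chop a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(i + r) % len(nodes)]
            placement[node].append((i, block))
    return placement

nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"hello big data world")
layout = place_blocks(blocks, nodes)

# Simulate losing a node: every block still survives on other nodes.
del layout["node2"]
surviving = {i for replicas in layout.values() for i, _ in replicas}
assert surviving == set(range(len(blocks)))
```

The final assertion is the "fault tolerant" property in miniature: with three copies of each block, any single node can disappear and the full file is still reconstructable — and adding capacity really is just appending another entry to `nodes`.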

MapReduce is really the workhorse of Hadoop. MapReduce “divides and conquers” the data, according to Hadoop creator Doug Cutting. With MapReduce, you can put all those nodes to work processing the data locally, which makes it fast and powerful. It can process a whole data set that's spread across hundreds, even thousands, of computers in parallel chunks. MapReduce then collects and sorts the processed data.
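The "divide and conquer" flow can be sketched in plain Python with the classic word-count example — no Hadoop required. Each chunk stands in for a file split that a node would map locally; the shuffle step groups results by key, as the framework does between the map and reduce phases:

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of input."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Chunks stand in for file splits processed in parallel on different nodes.
chunks = ["big data is big", "data is everywhere"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The key design point is that `map_phase` needs no knowledge of the other chunks, which is exactly what lets Hadoop run it on thousands of nodes at once against local data.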

James Kobielus, who now works at IBM, told me while he was still Forrester Research’s Hadoop expert that MapReduce is really the heart of Hadoop.

By the way, since this blog is about integration, MapReduce can be used to “enable” the data integration layer for Hadoop stores.

Hadoop is not an analytics platform. Although Hadoop is most frequently talked about and used for analytics, it’s not really an analytics platform. Instead, it can feed a more traditional analytic environment, like SAS, according to Ravi Kalakota, Ph.D., a partner at LiquidAnalytics. Another common way to use Hadoop for analytics is to use the R programming language — favored by statisticians — to write MapReduce jobs.

In addition to analytics, Hadoop can be used for archiving and for ETL (extract, transform and load) and filtering.

Hadoop is not the only way to handle Big Data. Big Data has been around for a long time, as established vendors will tell you. In fact, you can handle high volumes of data with massively parallel-processing (MPP) databases, such as those offered by Greenplum, Aster Data (now owned by Teradata) and Vertica. And, of course, the big enterprise tech companies — IBM, Oracle, Microsoft, Greenplum — all offer Big Data solutions; although, increasingly, they’re incorporating Hadoop into these platforms.



Aug 20, 2012 7:39 AM H.M. says:
Loraine, I agree that Hadoop is not the only way to handle Big Data. We are already hearing some complaints from companies trying to use Hadoop for critical production systems. As an alternative to Hadoop, LexisNexis has open sourced the HPCC Systems platform, which is a complete enterprise-ready solution. Designed by data scientists, it provides for a single architecture, a consistent data-centric programming language (ECL), a data processing and a data delivery cluster. Their built-in analytics libraries for Machine Learning and BI integration provide an integrated solution from data ingestion and data processing to data delivery, which is significantly more efficient to support and requires fewer resources. In contrast, the complexity of the Hadoop ecosystem requires a higher investment in technology and resources, up front and throughout. If you are interested, please take a look at http://hpccsystems.com.
Aug 31, 2012 9:39 AM Ken Rosen says:
As Hadoop fame spreads, this mainstream clarity is really helpful. We've found a critical addition to the 3Vs: Distribution. That is, Volume, Velocity, and Variety are absolutely challenges. And so is the fact that real-world data is almost always spread out. The catch-all answer, "Just consolidate it in one data center," is increasingly expensive, risky, difficult...and simply impractical. More on this here: http://www.chiliad.com/are-you-in-big-data-or-just-medium-data/ People need another answer. They need to consider *Virtual* Consolidation--the same result without moving the data. Most don't know it's possible, so it is rarely discussed. Cheers, Ken
