Five Facts for Understanding Big Data and Hadoop

Written By
Loraine Lawson
Aug 13, 2012

Big Data is confusing — and not just for business leaders. CIOs find it befuddling, too. 

Scott Schlesinger, vice president and head of North American Business Information Management at Capgemini, shared a great anecdote about this. He attended a CIO symposium with some of the best and brightest CIOs in the nation. The speaker asked, “How many people here feel as though their organization has a Big Data problem?” and every hand in the room went up.

The speaker then asked, “How many of you feel as though you have an understanding of what that means and how to address the Big Data problem?”

All the hands disappeared.

So, if you’re confused about Big Data, you’re in good company.

That’s why I thought it might be helpful to clarify a few points about Big Data, and Hadoop in particular.

Big Data means more than a lot of data. While definitions vary, Big Data experts typically say Big Data fits one or more of the “Three Vs”: volume, velocity and variety. Often, Big Data is described as data that falls outside the scope of traditional relational databases because it’s too big, too unstructured or needs to be processed too quickly.

Hadoop is a misnomer. What’s called “Hadoop” is really shorthand for a collection of different open source software projects. Apache lists eight other solutions as Hadoop-related projects, which I’ll discuss in a future post. What people really mean by “Hadoop” is the “Hadoop Core,” which is:

The Hadoop Distributed File System, which is a way to create clustered, or distributed, storage. Its advantage: It’s fast, self-healing and fault tolerant, which makes it hard to break. Best news: You can run it on any server you want (techies will call these “nodes”), and since it’s open source, it’s cheap to do so. Need more space? Add more nodes.

MapReduce, which programmers use to write applications that rapidly process large amounts of data stored on this file system.
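For readers who want to see what “self-healing” might look like, here is a toy sketch in Python. It is not how the Hadoop Distributed File System is actually implemented; it only illustrates the idea of keeping several copies of each block of a file on different nodes and re-copying blocks when a node dies. The function names (place_blocks, heal), the node names and the round-robin placement are all assumptions made for this illustration. The replication factor of 3 matches HDFS’s default setting.

```python
REPLICATION = 3  # HDFS's default replication factor; fixed here for simplicity


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (toy round-robin placement).

    Assumes len(nodes) >= replication.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement


def heal(placement, live_nodes, replication=REPLICATION):
    """Forget replicas on dead nodes, then re-copy blocks onto live nodes
    until every block has `replication` copies again."""
    for holders in placement.values():
        holders.intersection_update(live_nodes)  # drop replicas on dead nodes
        for node in live_nodes:
            if len(holders) >= replication:
                break
            holders.add(node)  # stands in for copying the block to this node
    return placement


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
placement = place_blocks(4, nodes)

# Simulate node2 failing: the surviving nodes take over its replicas.
live = [n for n in nodes if n != "node2"]
heal(placement, live)

for block, holders in placement.items():
    print(block, sorted(holders))
```

After the simulated failure, every block still has three copies, and none of them lives on the dead node. Adding a node to the list is all it takes to give the toy cluster more room, which mirrors the “add more nodes” point above.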

MapReduce is really the workhorse of Hadoop. MapReduce “divides and conquers” the data, according to Hadoop creator Doug Cutting. With MapReduce, you can put all those nodes to work processing the data locally, which makes it fast and powerful. It can process a whole data set that’s spread across hundreds, even thousands, of computers in parallel chunks. MapReduce then collects and sorts the processed data.
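To make the “divide and conquer” idea concrete, here is a minimal sketch of the pattern in plain Python, using the classic word-count example. It runs on one machine and does not use Hadoop’s actual API; the function names (map_phase, shuffle_and_sort, reduce_phase) are made up for illustration. In a real cluster, each chunk would be mapped on the node that stores it, and the framework would handle the collecting and sorting.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(pairs):
    """Shuffle/sort: group all values by key, returning keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently, which is what lets a real
    # cluster do this step in parallel on many nodes.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


# Two "chunks" stand in for pieces of a data set stored on different nodes.
chunks = ["big data is big", "hadoop handles big data"]
print(word_count(chunks))
# [('big', 3), ('data', 2), ('hadoop', 1), ('handles', 1), ('is', 1)]
```

The key point is that the map step never needs to see more than one chunk at a time, so adding nodes adds processing power as well as storage.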

James Kobielus, who now works at IBM, told me while he was still Forrester Research’s Hadoop expert that MapReduce is really the heart of Hadoop.

By the way, since this blog is about integration, MapReduce can be used to “enable” the data integration layer for Hadoop stores.

Hadoop is not an analytics platform. Although Hadoop is most frequently talked about and used for analytics, it’s not really an analytics platform. Actually, it can also be used with a more traditional analytic environment, like SAS, according to Ravi Kalakota, Ph.D. and partner at LiquidAnalytics. Another common way to use Hadoop for analytics is to use the R programming language — favored by statisticians — to write MapReduce jobs.

In addition to analytics, Hadoop can be used for archiving and for ETL (extract, transform and load) and filtering.
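As a rough illustration of the ETL and filtering use case, the sketch below shows a “map-only” pass in plain Python: each record is extracted, filtered and transformed on its own, so no reduce step is needed. The comma-separated log format, the field names and the function names are all hypothetical, and the code does not use Hadoop itself. It only shows the kind of per-record logic such a job might contain.

```python
def extract(line):
    """Extract: split one raw line of the (assumed) form 'timestamp,level,message'."""
    timestamp, level, message = line.strip().split(",", 2)
    return {"timestamp": timestamp, "level": level, "message": message}


def keep(record):
    """Filter: keep only the error records."""
    return record["level"] == "ERROR"


def transform(record):
    """Transform: reshape a record into the form the target system expects."""
    return {
        "timestamp": record["timestamp"],
        "message": record["message"].strip().lower(),
    }


def etl_map(lines):
    """Map-only pass: every line is handled independently of the others."""
    return [transform(record) for record in map(extract, lines) if keep(record)]


raw_lines = [
    "2012-08-13T09:00:00,INFO,Job started",
    "2012-08-13T09:00:05,ERROR,Disk full, retrying",
    "2012-08-13T09:00:09,INFO,Job finished",
]
for row in etl_map(raw_lines):
    print(row)
```

Because each line is processed in isolation, the same logic could in principle be run on every node against its local slice of the data, with the “load” step writing the results to wherever they are needed.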

Hadoop is not the only way to handle Big Data. Big Data has been around for a long time, as established vendors will tell you. In fact, you can handle high volumes of data with massively parallel processing (MPP) databases, such as those offered by Greenplum, Aster Data (now owned by Teradata) and Vertica. And, of course, the big enterprise tech companies (IBM, Oracle, Microsoft, Greenplum) all offer Big Data solutions, although they are increasingly incorporating Hadoop into these platforms.
