Big Data is confusing — and not just for business leaders. CIOs find it befuddling, too.
Scott Schlesinger, vice president and head of North American Business Information Management at Capgemini, shared a great anecdote about this. He attended a CIO symposium with some of the best and brightest CIOs in the nation. The speaker asked, “How many people here feel as though their organization has a Big Data problem?” and every hand in the room went up.
The speaker then asked, “How many of you feel as though you have an understanding of what that means and how to address the Big Data problem?”
All the hands disappeared.
So, if you’re confused about Big Data, you’re in good company.
That’s why I thought it might be helpful to clarify a few points about Big Data, and Hadoop in particular.
Big Data means more than a lot of data. While definitions vary, Big Data experts typically say Big Data exhibits one or more of the “Three Vs”: volume, velocity and variety. Often, Big Data is described as data that falls outside the scope of traditional relational databases, because it’s too big, too unstructured or arriving too fast for them to handle.
Hadoop is a misnomer. What’s called “Hadoop” is really shorthand for a collection of different open source software projects. Apache lists eight other solutions as Hadoop-related projects, which I’ll discuss in a future post. What people really mean by “Hadoop” is the “Hadoop Core,” which consists of two pieces:
The Hadoop Distributed File System (HDFS), which is a way to create clustered, or distributed, storage. Its advantage: it’s fast, self-healing and fault tolerant, which makes it hard to break. Best news: you can run it on any servers you want (techies call these “nodes”), and since it’s open source, it’s cheap to do so. Need more space? Add more nodes. (For a feel of what talking to HDFS looks like, see the sketch after this list.)
MapReduce, the programming framework developers use to write applications that process large amounts of data on this file system very rapidly.
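To make the file system part a little more concrete, here’s a minimal sketch, in Java, of copying a file into HDFS and reading it back. To be clear, the NameNode address, paths and class name are placeholders I’ve made up for illustration, not anything tied to a real cluster.

```java
// Minimal sketch of talking to HDFS through Hadoop's Java FileSystem API.
// The NameNode URI and file paths below are illustrative placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster's NameNode (placeholder host and port).
        FileSystem fs = FileSystem.get(
                new URI("hdfs://namenode:8020"), new Configuration());

        // Copy a local file in; HDFS splits it into blocks and replicates
        // each block across nodes -- that's the fault tolerance at work.
        fs.copyFromLocalFile(new Path("/tmp/local-data.txt"),
                             new Path("/data/input/data.txt"));

        // Read the first line back to prove the round trip worked.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                fs.open(new Path("/data/input/data.txt"))))) {
            System.out.println(in.readLine());
        }
    }
}
```

The point to notice: the application never worries about which node holds which block of the file; the file system handles that.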
MapReduce is really the workhorse of Hadoop. MapReduce “divides and conquers” the data, according to Hadoop creator Doug Cutting. With MapReduce, you can put all those nodes to work processing the data locally, which makes it fast and powerful. It can chew through a whole data set that’s spread across hundreds, even thousands, of computers, working on the chunks in parallel. MapReduce then collects and sorts the processed data.
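To give you a feel for that divide-and-conquer style, here’s the classic word-count job sketched against Hadoop’s Java MapReduce API. The class names and input/output paths are, again, placeholders of my own, but the map-then-reduce shape is the real thing.

```java
// Sketch of the classic word-count job: the map step runs locally on each
// node's chunk of the data; Hadoop then sorts and groups by key, and the
// reduce step totals the counts. Names and paths are placeholders.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // "Divide": each mapper sees one slice of the input and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // "Conquer": the framework groups by word; each reducer sums the 1s.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner pre-aggregates on each node before anything crosses
        // the network -- that's the "processing locally" advantage.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Notice the mapper only ever sees its node’s slice of the data; Hadoop’s shuffle-and-sort step is what brings all the counts for a given word together before the reducer runs. That’s the “collects and sorts” part mentioned above.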
James Kobielus, who now works at IBM, told me while he was still Forrester Research’s Hadoop expert that MapReduce is really the heart of Hadoop.
By the way, since this blog is about integration: MapReduce can also be used to power the data integration layer for Hadoop stores.
Hadoop is not an analytics platform. Although Hadoop is most frequently talked about and used for analytics, it isn’t an analytics platform in its own right. It can, however, be paired with a more traditional analytic environment, such as SAS, according to Ravi Kalakota, Ph.D. and partner at LiquidAnalytics. Another common way to use Hadoop for analytics is to write MapReduce jobs in the R programming language, which statisticians favor.
In addition to analytics, Hadoop can be used for archiving, for ETL (extract, transform and load) and for filtering.
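As a sketch of that last use, here’s what a simple filtering job can look like: a map-only job that keeps just the records matching some condition. The class name and the “ERROR” condition are placeholders for illustration, as in the earlier examples.

```java
// Sketch of Hadoop as a filter/ETL engine rather than an analytics tool:
// a mapper that passes through only the lines containing "ERROR".
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ErrorFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The filter step: emit only the records we care about.
        if (value.toString().contains("ERROR")) {
            context.write(value, NullWritable.get());
        }
    }
}
// In the driver, job.setNumReduceTasks(0) skips the reduce phase entirely,
// so mapper output is written straight back to HDFS -- a simple ETL pattern.
```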
Hadoop is not the only way to handle Big Data. Big Data has been around for a long time, as established vendors will tell you. In fact, you can handle high volumes of data with massively parallel processing (MPP) databases, such as those offered by Greenplum (now owned by EMC), Aster Data (now owned by Teradata) and Vertica (now owned by HP). And, of course, the big enterprise tech companies, IBM, Oracle, Microsoft and EMC among them, all offer Big Data solutions, although, increasingly, they’re incorporating Hadoop into those platforms.