Everyone tends to focus on the “big” in Big Data, so much so that it’s easy to lose sight of the fact that Hadoop is really about data. Let’s regroup for a minute and really look at what’s going on with the data on Hadoop.
First, there’s the core. When people say “Hadoop,” they’re usually referring to the Hadoop core, which, as Loraine Lawson explains, consists of two pieces:
The Hadoop Distributed File System (HDFS). What’s it doing with the data? It’s distributing it across the cluster’s nodes and storing it there.
MapReduce. This does the real work in the Hadoop core. If you want to run a process or computation on the data, it “maps” the work out to the nodes where the data lives, runs the process there, and then “reduces” the partial results into your answer. So, it’s processing the data.
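To make that concrete, here’s a minimal sketch of the classic word-count job written against Hadoop’s Java MapReduce API. The input and output paths are hypothetical HDFS locations; the rest is the standard Mapper/Reducer pattern.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "Map": runs on the nodes holding blocks of the input file in HDFS,
  // emitting (word, 1) for every word it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // "Reduce": gathers all the counts for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Hypothetical HDFS input and output locations.
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

When the job runs, the mappers are scheduled on the nodes that already hold the input blocks, which is the “move the computation to the data” trick that makes the core scale.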
Now, if you’re familiar with data at all, you’ll notice there are a whole lot of things missing from that equation, such as:
- Modeling
- Metadata
- Job scheduling
- Workflow
- Data management
This is where the growing list of Apache Hadoop-related projects comes into play.
These projects go by an odd assortment of names (Pig, Hive, Flume, Zookeeper), but they’re often short-changed when we talk about Hadoop. Loraine has seen them referred to as the “Hadoop stack,” though some programmers prefer “Hadoop ecosystem.” Forrester refers to them as “functional layers.”
For the most part, they’re of interest to developers more than executives, but hopefully a high-level view of these solutions will add some depth to your understanding of Hadoop and its capabilities.
Here are a few of the more common names you’ll hear.
Pig is an analytical tool and runtime engine. Yahoo developed the Pig platform as a way of analyzing data without constantly hand-coding MapReduce jobs. Pig includes a high-level data flow language called, predictably enough, Pig Latin, which is designed to work with any kind of data (like a pig, get it?), plus a runtime environment that compiles Pig Latin programs into MapReduce jobs and executes them. Cloudera’s distribution of Hadoop uses Pig and Hive as analytical languages; Forrester lists it under Hadoop’s “modeling and development” layer.
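For a flavor of what that looks like, here’s a minimal sketch that embeds Pig in a Java program through its PigServer class. The users.tsv input file and its schema are hypothetical; each registered Pig Latin statement builds up a data flow that Pig compiles into MapReduce work when the result is stored.

```java
import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws IOException {
    // LOCAL mode runs on one machine; ExecType.MAPREDUCE would use a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Hypothetical tab-separated input: one user per line.
    pig.registerQuery("users = LOAD 'users.tsv' AS (name:chararray, age:int);");
    pig.registerQuery("adults = FILTER users BY age >= 18;");
    pig.registerQuery("grouped = GROUP adults BY age;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(adults);");
    // store() is what actually triggers the underlying job(s).
    pig.store("counts", "age_counts");
  }
}
```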
Mahout is a data-mining tool. A mahout is a person who rides an elephant, a nod to the fact that Hadoop is named after a stuffed elephant treasured by Doug Cutting’s son. Mahout applies what’s called machine learning to data, allowing it to cluster, filter and categorize data and content. It can be used to build recommendation engines and to do data mining. This also falls under Forrester’s “modeling and development” layer.
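As an illustration, here’s a minimal sketch of a user-based recommendation engine built with Mahout’s Taste API. The ratings.csv file (one userID,itemID,preference triple per line) and the neighborhood size of 10 are hypothetical choices.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical ratings file: userID,itemID,preference per line.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Score users against each other, then find the 10 most similar.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 item recommendations for user 1.
    List<RecommendedItem> recs = recommender.recommend(1, 3);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " scored " + rec.getValue());
    }
  }
}
```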
Flume is a way to collect, aggregate and move large amounts of event or log data into the Hadoop Distributed File System. Cloudera’s distribution of Hadoop uses Flume for data integration, according to Ravi Kalakota. Forrester classifies it under Hadoop’s data collection, aggregation and analysis layer.
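Flume agents aren’t programmed so much as wired together in a properties file. Here’s a minimal sketch of a single-agent pipeline, with all names and paths hypothetical: an exec source tails a log file, a memory channel buffers the events, and an HDFS sink writes them out.

```properties
# Hypothetical agent "agent1" with one source, one channel, one sink.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: follow an application log as it grows.
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/access.log
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink.
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: land the events in HDFS.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```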
Chukwa is a data collection system used to process and analyze huge logs. It includes a toolkit for displaying, monitoring and analyzing the data. It’s a newbie among the Apache projects. Chukwa, by the way, is also a disappearing Nepalese language.
Hive is Hadoop’s data warehouse infrastructure. What’s great about Hive, at least from the enterprise IT point of view, is that queries are written in a SQL-like language (HiveQL) and converted to MapReduce jobs. This makes it easier to integrate Hadoop data with other enterprise tools, such as BI and visualization tools, according to Jeff Kelly of Wikibon. It can also be used for metadata management.
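That SQL-like interface is what makes the integration story work: anything that can speak JDBC can query Hadoop through HiveServer2. Here’s a minimal sketch in Java; the host, credentials and web_logs table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Register HiveServer2's JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this query into MapReduce behind the scenes.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```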
While Hive helps you integrate Hadoop data into existing IT systems, Sqoop goes the other way and helps you move data out of traditional relational databases and data warehouses into Hadoop. “It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target,” writes Kelly.
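Sqoop is driven from the command line. A minimal, hypothetical import (connection string, credentials, table and target directory are all placeholders) looks like this:

```sh
sqoop import \
  --connect jdbc:oracle:thin:@//db-host:1521/ORCL \
  --username etl_user -P \
  --table ORDERS \
  --target-dir /warehouse/orders \
  --num-mappers 4
```

Under the covers, Sqoop turns that command into a MapReduce job that reads the table in parallel slices (four mappers here) and writes the rows into HDFS.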
HCatalog is a table and storage management service for Hadoop data. It manages schemas and supports interoperability across the other data processing tools (Pig, MapReduce, Hive, etc.).
Just as you need a way to manage multiple servers, the distributed pieces of a Hadoop deployment need a way to coordinate with one another. Zookeeper fills this role, acting as a centralized service for configuration, naming and synchronization.
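As a small illustration, here’s a minimal sketch of a worker process registering itself with a (hypothetical) Zookeeper ensemble using the standard Java client. Because the znode it creates is ephemeral, it disappears automatically when the process’s session ends, which is how the rest of the cluster learns a member has gone away.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkRegisterSketch {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Connect to a hypothetical ensemble; wait until the session is live.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();
    // Ephemeral node: vanishes if this process (or its session) dies.
    zk.create("/workers/worker-1", "host:port".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    // ... do work; the registration disappears when the session ends.
    zk.close();
  }
}
```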
Even though Cassandra isn’t part of the Apache Hadoop project, Loraine is mentioning it because it’s in the trade press a lot. It can be used with Hadoop, but more often it’s paired with the rest of the Hadoop ecosystem in place of the Hadoop Distributed File System. It’s a scalable, multi-master database with no single points of failure.