Why the Hoopla over Hadoop?
Hadoop in seven easy-to-understand facts.
There's a lot of confusion about Hadoop right now among business intelligence and data warehouse professionals, according to TDWI. But don't worry - you'll soon figure it out. Although the Hadoop family has its own query and database technologies, TDWI says they're similar enough to SQL and relational databases that you'll learn them quickly.
To help you cut through the noise around Hadoop, TDWI clarifies seven commonly misunderstood facts about Hadoop:
- Hadoop is an ecosystem, not a single product. Everybody talks about "Hadoop" as if it were one thing - and I'm certainly guilty of this. But that's just shorthand for a long list of products, starting with the Hadoop Distributed File System (HDFS), which is what people typically mean when they say Hadoop. The ecosystem also includes MapReduce, Hive, Pig, HBase, Flume and so on. Some are ready for use; some are still in the early stages.
- HDFS is a file system, not a database management system. That's why it can handle even unstructured data - it's managing the files that contain the data, not the data itself. So, files, not data - got it?
- MapReduce provides control for analytics - not analytics per se. TDWI warns you'll see MapReduce discussed as an analytical tool - but that's not right, either. MapReduce is more of a general-purpose execution engine that works with a variety of storage technologies, including file systems (like HDFS) and some database management systems, the paper explains before diving into the technicalities of MapReduce. I'll let you get the details from the source.
- Hive resembles SQL, but is not standard SQL. It uses an SQL-esque language called HiveQL, which is not SQL-compliant. The important takeaway here: Data integration and BI vendors are doing the hard work for you, simplifying connections to Hive through various interfaces, connectors and front ends to Hadoop.
- Hadoop is about more than volume. If you read this blog at all, you should know this by now. Hadoop also handles different types of data (well, technically, the files wherein that data lives) - not just large amounts of data. Text and sensor data, anyone?
- Hadoop is rarely a replacement for BI tools or data warehouses. That "rarely" really grabs your attention, doesn't it? In general, people look to Hadoop to add that unstructured data their current systems can't handle. However, there are those brave pioneers out there with plans to test Hadoop as a full replacement for their BI data platform.
- Hadoop isn't just about Web analytics. It's also for exploratory analytics, Big Data from other sources (RFID, other sensors, etc.), unstructured data, semi-structured data, larger statistical samples - you get the idea.
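If the MapReduce point above feels abstract, here's a tiny sketch of the map, shuffle and reduce steps as a word count in plain Python. To be clear, this is not Hadoop code and not anything from the TDWI paper: it runs in memory on one machine, and the function names (map_phase, shuffle, reduce_phase, run_job) are made up for illustration. It only shows why MapReduce is an execution pattern you plug your own logic into, not an analytics tool in itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines.

    A real framework would spread these steps across many machines;
    this hypothetical helper does everything in one process.
    """
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


word_counts = run_job(["big data big files", "files not data"])
print(word_counts)
```

Swap in a different map_phase and reduce_phase and the same skeleton computes something else entirely, which is the sense in which the engine supplies the control and you supply the analytics.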
I've just hit the highlights. Unless you're already a Hadoop expert, this paper is a must-read if you're curious about Big Data and where Hadoop fits in. It's easy to follow, although it does sometimes speak to the IT crowd more than the business side.
Oh yeah, and it's free, so all you have to lose is your confusion.