Why the Hoopla over Hadoop?
Hadoop in nine easy-to-understand facts.
It can be confusing to sort through the Big Data solutions. For one thing, it seems like Big Data tools are coming out of the woodwork. It's hard not to be suspicious of that, since Hadoop and NoSQL have only recently emerged as viable solutions.
Well, you're probably right to be suspicious and investigate. But the fact is, a lot of these solutions have been in the works for a while, and some older approaches - enterprise search, for example - can also be applied to Big Data.
First, Proffitt explains NoSQL databases and how they differ from relational databases. He explains the five types of NoSQL (read: non-relational) databases:
- Distributed key-value store (DKVS) databases, aka eventually consistent key-value store databases, which are designed to handle data spread across a large number of servers. Dynamo is the name to remember here, since most DKVS databases are either Dynamo itself or Dynamo-based, he writes.
- Key-value store (KVS) databases, which store data on disk or in RAM, mapping keys to values. Solutions include the open source Redis, Berkeley DB and MemcacheDB.
- Column-oriented store databases, which he describes as a "single, giant database table, with embedded tables of data found within." Hadoop's HBase, Cassandra and Google's Bigtable are column-oriented store databases.
- Document-oriented store databases, which, as the name suggests, store and sort entire documents' worth of data. They do not use the familiar table/row/column approach, he notes. They also tend to use "schema-less JSON-style objects-as-documents" as opposed to XML documents. The big names here are MongoDB and CouchDB.
- Graph-oriented store databases. This is the most difficult type for me, personally, to understand, so I'll just quote him: "Data is manipulated in an object-oriented architecture, using graphs to map keys, values, and their relationships to each other, instead of just tables," he writes. Neo4j, HypergraphDB and Bigdata (yes, there's a Big Data solution named "Bigdata" - don't you wish you'd thought of it?) are examples.
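To make the simplest of these categories concrete, here's a minimal sketch in Python of the key-value store idea - a toy in-memory store with the same get/put/delete shape that systems like Redis or MemcacheDB expose over the network. The class and method names are illustrative, not any product's actual API:

```python
# Toy in-memory key-value store: keys map directly to values,
# with no tables, rows, or joins -- the defining KVS trait.
class ToyKVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Remove the key if present; ignore it otherwise.
        self._data.pop(key, None)

store = ToyKVStore()
store.put("user:42", {"name": "Ada", "visits": 3})
print(store.get("user:42")["name"])  # -> Ada
```

The real systems add persistence, networking and replication on top, but the core data model really is this simple: opaque keys mapped to values.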
OK, so that's the nitty-gritty storage stuff, and it's well explained. But what about actually putting all that data to use?
Proffitt also delves into the processing options, and that, my friends, is a much shorter list. Relational database people are lucky here. They've got processing built into their databases. Not so the NoSQL solutions. Here, your options are:
- MapReduce, which tends to be complicated to use and often requires hand-coding. CouchDB and Hadoop use this approach to processing.
- Enterprise search products, which are great for going through large sets of documents and work best with structured data. He lists Apache Lucene, Apache Solr and ElasticSearch as examples.
And that's it: two options.
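To show what that "hand-coding" looks like, here's a minimal word-count sketch in plain Python - the canonical MapReduce example, stripped of the distributed machinery that Hadoop layers on top. The function names are mine, not Hadoop's:

```python
from collections import defaultdict

# Word count in the MapReduce style: a map step emits (key, value)
# pairs, a shuffle groups them by key, and a reduce step folds each
# group into a result. Hadoop runs these same phases in parallel
# across many machines.

def map_phase(doc):
    for word in doc.lower().split():
        yield (word, 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big hoopla", "big hadoop"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # -> 3
```

Even this toy version shows why MapReduce gets called complicated: you're writing the map and reduce logic yourself for every question you want to ask of the data.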
But here's the best part of Proffitt's article: He includes two-and-a-half "pages" of Big Data vendors and solutions. It's a must-read for anyone considering investing in the emerging Big Data market - or who's just curious about it.