Loraine Lawson spoke with Doug Cutting, creator of Hadoop and the chief architect at Cloudera. How do you pronounce Hadoop? And what can this open source, "big data" solution do besides store Web log data? Cutting answers these and other questions in part one of this two-part interview.
Lawson: My first question is how do you pronounce Hadoop?
Cutting: I pronounce it differently than almost everybody else. I don't know if you really want to listen to me. The project was named after my son's stuffed elephant, Hadoop, and so I pronounce the name the way he pronounced it, which is HA-doop (Editor's Note: "a" as in "adder"). Then, once it was written down, people started pronouncing it huh-DOOP. Both are acceptable.
"A vast majority of the problems that people want to solve with big data, sorting is a big component. ...Then you can do some analysis of it. You can say, 'Now that I've massaged it in this way, I want to reorganize it in this other way.'"
Cutting: I wish I understood, too. He was only about 2; I don't remember exactly when it was. He was a very clear speaker early on, clearer than most kids his age, and at some point we asked him, you know, "What's your elephant's name?" And he just said, "Hadoop." And we're like, "Really?" You know, with a 2-year-old, you can't really question him too deeply. He just knew that was the name.
Lawson: How do you explain Hadoop to people that you talk to who are non-techies?
Cutting: It takes me a long time sometimes. Usually, I say it's a new way of handling massive amounts of data. It's sort of a new problem in a lot of ways: the price of computer hardware has gone down so dramatically that people can afford to hang on to much more data than they ever did before. Hard drives have gotten bigger and bigger, and they've also gotten cheaper and cheaper, so you can have lots of data. At the same time, and for similar reasons, networking equipment has gotten cheaper. Because of all of these economies, people have more data in their businesses. Things like transactions at Web companies generate a lot of logs about what users have done, but you also see, in other industries, things like sensor networks measuring real data out in the world. People are able to inexpensively store lots of data related to their businesses.
The classic method for processing data, throwing it all into a relational database, isn't really applicable in this case. Relational databases tend not to scale to the size of some of these collections. The hardware configurations that are recommended for relational databases aren't the most economical way to store all that data. And the way you have to first fit all of your data into a schema in a relational database isn't always what people want to do. They might just want to save it all in some sort of raw form and then later on figure out what they want to do with it, how they want to structure it and what fields they want to index. But first, they just want to store it and then be able to process it. The relational database model is a different technology developed to solve different kinds of problems.
The technology stack that has proven popular for handling these kinds of problems, when you've got terabytes, petabytes, huge quantities of data, and you can easily afford large amounts of commodity hardware to store and process it, has been Hadoop, more than any other platform, I think.
It came out of papers that were written by Google about the way they operated internally. Then I and another fellow (Mike Cafarella, a professor at the University of Washington) re-implemented these ideas as an open source project at Apache, and now Cloudera is about providing and supporting them for enterprises, along with the whole platform of related technologies growing up around Hadoop as the kernel.