Hadoop use cases often focus on performing complicated processes, but its real power, says Charles Zedlewski, is in simplifying tasks that are now complicated. Last week, Zedlewski, the vice president of product at Cloudera, explained how businesses are using Hadoop for fast processing of large amounts of imagery data and videos. This week, he drills down into the technical aspects of combining Hadoop with image and video software.
Lawson: You said many of the companies that want to use Hadoop for processing images and videos already have software that has the capability to run the particular process they want. If you have a program that you want to use with Hadoop, how is that done?
Zedlewski: MapReduce is the primary form of data computation and data processing in Hadoop. And, really, MapReduce is just three steps, right? You do Map, Shuffle, Reduce.
“Map” is where you divvy up your data by some arbitrary key that you decide. “Shuffle” means you dole out pieces of data to different reducers; that’s a logical step, so it doesn’t necessarily mean physically moving the data around. Then “Reduce” is where you process it, and “Reduce” can be almost anything you want. Reduce can be find the max in this array; it can be take an average; it can be count. Or it can be call this other program and run it on this slice of the data. That’s typically what winds up happening with these specialized programs.
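The three steps can be sketched in plain Python. This is a toy illustration of the Map, Shuffle, Reduce flow Zedlewski describes, not Hadoop’s actual API:

```python
from itertools import groupby
from operator import itemgetter

def map_step(records, key_fn):
    """Map: tag each record with an arbitrary key you decide."""
    return [(key_fn(r), r) for r in records]

def shuffle_step(keyed):
    """Shuffle: group records by key so each reducer sees one slice."""
    keyed.sort(key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(keyed, key=itemgetter(0))}

def reduce_step(grouped, reduce_fn):
    """Reduce: apply any function to each slice -- max, average, count, etc."""
    return {k: reduce_fn(vs) for k, vs in grouped.items()}

# Count word occurrences: key by the word itself, reduce by counting.
words = ["map", "shuffle", "reduce", "map", "reduce", "reduce"]
keyed = map_step(words, key_fn=lambda w: w)
grouped = shuffle_step(keyed)
counts = reduce_step(grouped, reduce_fn=len)
print(counts)  # {'map': 2, 'reduce': 3, 'shuffle': 1}
```

In real Hadoop the shuffle happens across machines, but the logical shape of the computation is the same.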
These specialized programs aren’t necessarily re-implemented. What’s really happening is that whatever algorithm the program already had is just being called in the Reduce step of the MapReduce process.
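One minimal sketch of that pattern, calling an existing program from a reducer via Python’s `subprocess` module. The image-processing binary Zedlewski has in mind is stood in for here by the ordinary POSIX `wc` utility, purely so the example is runnable; the command and the keying by year are my assumptions:

```python
import subprocess

def reduce_with_external(slices, command):
    """For each key's slice of data, invoke an existing program on it
    instead of re-implementing its algorithm inside the reducer."""
    results = {}
    for key, values in slices.items():
        proc = subprocess.run(
            command,
            input="\n".join(values).encode() + b"\n",
            capture_output=True,
            check=True,
        )
        results[key] = proc.stdout.decode().strip()
    return results

# Stand-in "existing program": `wc -l` counts the lines in each slice.
slices = {"2020": ["img_a", "img_b"], "2021": ["img_c"]}
results = reduce_with_external(slices, ["wc", "-l"])
print(results)  # {'2020': '2', '2021': '1'}
```

In Hadoop this is commonly done with Hadoop Streaming, where the map and reduce steps can be any executable that reads stdin and writes stdout.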
Lawson: Do you have to do custom code for that part or are there tools that will handle it?
Zedlewski: For some of these things, there is some level of integration that still has to be done. But it’s not like a herculean effort. Ultimately, a MapReduce job is just user code. It’s not like you have to have a whole development project.
Lawson: Are you hearing any enterprise or more mainstream IT uses for processing video or photos with Hadoop? Obviously these places are a bit niche.
Zedlewski: Image processing comes up in more and more places. Right now, the people who have really advanced the image work are the ones with a strong economic motivation to do it.
Google actually does some of this, too. Google essentially has the equivalent of Hadoop in-house, and in Google’s case they just published a paper on image recognition in which they looked for pictures of cats, just to prove the relative accuracy of their image processing algorithms.
Lawson: What about medical imagery? Have you seen anything done with that kind of image or video?
Zedlewski: We’re doing more and more in the health care field, but mostly it’s been text. In terms of things like PET scans, CAT scans and sonograms and what have you, I don’t know of any examples like that today.
But that’s probably more just because it hasn’t been tried.
The main things we see a lot of with Hadoop right now in the medical field are bioinformatics and different kinds of analyses of the effectiveness of different treatments or therapies.
Lawson: I ask because another writer was talking to a CIO in the medical field who was stuck on using Hadoop for structured data. With all the images health care deals with, it seems like there’s much more potential there. You hear about these really high-level uses right now, and I have to think: if you can do that, what could Hadoop do for simpler problems that seem insurmountable today? Is that something Cloudera sees happening?
Zedlewski: Yeah, I think your observation is valid. I think there are two parts to it. One is that Hadoop eventually is just going to become part of the background. There are all kinds of things people do today with data, and everybody just assumes the database is behind it. They don’t really think about it anymore, because databases have become so commonplace, right? At some point, that will become true for Hadoop, too.
The other thing, which you also pointed out, is that what people are doing with Hadoop is not really about doing the complicated thing. It’s about making the complicated thing simple, because you can work with so much more data.
My favorite anecdote, which articulates this point, is spell check. I was interviewing a guy recently, and I asked him to imagine I told him to design a system that looks at a sentence and finds spelling errors. How would you design that system?
This guy has a computer science degree and says, “It would probably be a nearest-neighbor problem. What you’d do is look for letters that are close to other letters, and then you’d probably put some language model on top of that as well. Basically, you’d do an analysis of how close letters are to one another, understand the distinction between consonants and vowels, and have a language model on top of that.”
By contrast, if you look at how spell check works on Google search, what actually happens when you type something into the search box is much, much simpler than that. All they do is store every search attempt that everybody ever made. They also store whether you clicked on anything when the search results came back, and whether, a few seconds later, you did another search and then clicked on that. That’s it.
They basically assume that if you search, don’t click, re-search and do click in fast sequence, that must mean you had a misspelling, corrected it, re-spelled and then were successful.
Then all they did was count the instances of that happening, and it turns out that if you store a few trillion searches, you create a more effective spell checker than anything that’s come before it. Oh, and by the way, it also works in 30 languages.
So, actually, there’s no need to understand semantics, no need to build some kind of model of the English language, no need to come up with some complicated algorithm. Just count stuff, and if you count a lot of stuff, it turns out you can discover a lot of things that were otherwise difficult to find out.
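A toy version of that counting approach looks like the sketch below. This is my illustration of the idea, not Google’s actual pipeline; the log format and the five-second window are assumptions:

```python
from collections import Counter, defaultdict

def build_corrections(search_log, max_gap=5.0):
    """search_log: (timestamp, query, clicked) events in time order.
    A no-click search quickly followed by a different, clicked search is
    counted as a (misspelling, correction) pair -- no language model needed."""
    counts = defaultdict(Counter)
    for (t1, q1, c1), (t2, q2, c2) in zip(search_log, search_log[1:]):
        if not c1 and c2 and q1 != q2 and (t2 - t1) <= max_gap:
            counts[q1][q2] += 1
    return counts

def suggest(counts, query):
    """Suggest the correction most users converged on, if any."""
    if counts[query]:
        return counts[query].most_common(1)[0][0]
    return None

log = [
    (0.0, "recieve", False), (2.0, "receive", True),
    (10.0, "recieve", False), (12.0, "receive", True),
    (20.0, "recieve", False), (23.0, "recive", True),
]
counts = build_corrections(log)
print(suggest(counts, "recieve"))  # receive
```

With a handful of events this is noisy; the point of the anecdote is that at the scale of trillions of searches, simple counting beats the noise and the clever algorithm alike.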