Much of the focus on Hadoop has dealt with unstructured text — but IT Business Edge’s Loraine Lawson wondered how Hadoop changes what can be done with video and photos. A lot, it turns out, as Charles Zedlewski, the vice president of product at Cloudera, explains in this Q&A.
Lawson: There’s so much information about how Hadoop is used to process text-based unstructured data, but I haven’t heard much about using Hadoop for video and imaging. Can you tell me about that? I’m particularly curious about whether it’s being used with medical imaging, such as X-rays or MRIs.
Zedlewski: Obviously there are plenty of use cases where the data is actually structured. I’d say for more than half our customers today, if you looked at the data in the Hadoop system, it would be garden-variety things: bank transaction records, point-of-sale records or other things that look very row- and column-oriented.
There are also cases where people are using Hadoop for non-text-based things; that includes both images and video. So that entire range, all the way from very database-looking, record-oriented data through semi-database-looking log data to free-form text to images to video, is all fair game for what can be processed and analyzed inside a Hadoop system.
Lawson: So when you talk about Hadoop with video or images, obviously it would be able to store them. That makes sense. But what does it mean for enterprises and businesses in terms of what they can do with video and photos? What do you think is not known about that?
Zedlewski: Let’s take an extreme kind of contrast. If you were to store a bunch of images in a database, what you would really see in the database is that they’re just blobs. You can take a database and store the images in it, but you’ve basically turned the database into just storage. You can’t really express a SQL function that does anything meaningful with those images.
In Hadoop, yes, you can store the images, but you can also process and analyze them. This is one of the nice bits of flexibility that you get with the MapReduce framework: When you do MapReduce in Hadoop, what you do in your Reduce function can be quite varied. That Reduce step could be something like studying an image.
For example, one of our customers is a satellite company called Skybox Imaging. They launch a bunch of commodity satellites and take all kinds of overhead images of things going on in the world and then run that through image processing, which is scaled out using MapReduce. They’re basically grabbing lots of pictures of the world and they’re using Hadoop to refine that and turn it into valuable information. Then they’re selling that information as a service to large corporate customers.
For example, in the oil and gas industry, you might want to count how many refining pads there are at a given site in the world because you’re trying to understand the aggregate capacity of your industry. That’s something that’s actually possible to do. If you want to count how many cars are in a retailer’s parking lot, that’s an image-processing thing you can also do. So these are the types of things they’re doing.
Sonar is another one, which is not quite an image, but close to an image, where there are examples for Hadoop today. If you think about how an oil and gas company tries to find oil in the ocean, you have a boat that’s basically dragging a big sonar thing behind it. It’s bouncing sound waves off the bottom of the ocean and then, based on what comes back up, it’s getting lots of data about the potential geology of the bottom of the ocean. And as you can imagine, with these boats roaming around the oceans, you acquire a tremendous amount of data, right? The vast majority of that data is noise. It’s really just that one interesting percent that you want to find.
So all that sonar information comes back into Hadoop and, again, you can basically design algorithms in Hadoop that will let you process that sonar information, extract the signal from the noise and start to refine where you think you’re most likely to find oil.
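As a rough illustration of the pattern Zedlewski describes, the sketch below mimics a map step and a reduce step in plain Python rather than on a real Hadoop cluster. It is not drawn from any customer's code: the record layout (a survey region paired with a list of echo amplitudes), the function names and the fixed noise threshold are all assumptions made for the example. The map step discards readings at or below the threshold and emits a count for each reading above it; the reduce step totals those counts per region.

```python
from collections import defaultdict


def map_ping(record, threshold):
    """Map step: emit (region, 1) for every amplitude above the noise threshold.

    `record` is a hypothetical (region, amplitudes) pair; a real sonar feed
    would look different.
    """
    region, amplitudes = record
    for amplitude in amplitudes:
        if amplitude > threshold:
            yield region, 1


def reduce_counts(pairs):
    """Group the emitted pairs by key and sum the values for each key.

    In a real Hadoop job the framework groups values by key between the
    map and reduce steps; here both happen in one local function.
    """
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(records, threshold):
    """Run the map step over every record, then reduce the results."""
    mapped = (
        pair
        for record in records
        for pair in map_ping(record, threshold)
    )
    return reduce_counts(mapped)


if __name__ == "__main__":
    # Made-up readings: mostly low-level noise with a few strong echoes.
    surveys = [
        ("region-a", [0.9, 1.1, 42.0, 1.0, 37.5]),
        ("region-b", [1.2, 0.8, 1.0]),
        ("region-a", [1.0, 55.0]),
    ]
    print(run_job(surveys, threshold=10.0))
```

On a cluster, the map calls would run in parallel across many machines, which is where the throughput gain comes from; the logic inside each step would typically be the customer's own domain-specific processing.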
Lawson: When businesses come to you, do they have the capabilities on staff to use Hadoop? Or do they come to you with a problem and say, “Can we use it?” and then you help them work through it?
Zedlewski: It’s a combination. In all these cases, really the most sophisticated part is something like taking an image and finding the interesting features about that image, or taking the results of some sonar and discerning whether or not it’s interesting or uninteresting. That know-how almost always exists at our customers already, because it’s their business to know that, right?
There are often purpose-built software programs that know how to do something like extract an interesting feature of an image. The same story is true for a video.
What they’re really getting out of Hadoop is much greater throughput and the ability to work on a much bigger set of data than was previously possible. It’s letting you hold on to much, much larger volumes of data and it’s letting you analyze that data.