There’s a lot of talk about Big Data as if it were one entity. We hear: How do you manage Big Data? How do you govern Big Data? What’s the ROI for Big Data? The problem is that this framing puts too much focus on the technology while obscuring one of the major challenges in Big Data sets: the unstructured data.
I suspect CIOs haven’t forgotten that component since about 80 percent of data in organizations today is unstructured data, according to Gartner. That’s a lot of value currently hiding in social media, customer call transcripts, emails and other text-based or image-based files.
Leaving it hidden is a problem, because unstructured data also happens to be where you may find the real value in Big Data. These disparate data sets were previously unanalyzed or sitting in application silos. Obviously, Hadoop will let you migrate them into one location, but what then? How do you turn them into valuable information?
This recent Datamation column by Salil Godika goes a long way toward answering these questions. Godika is the chief strategy & marketing officer and Industry Group head at Happiest Minds. I admit this gave me pause, because pieces by chief marketing officers can be too self-serving.
But I give kudos to Godika for proving my misgivings unfounded. He’s written a great piece on dealing with unstructured data, even breaking it down into nine manageable, byte-sized (get it?) steps. Unlike other Big Data how-to articles, his puts the focus on the data, not the technology.
He does recommend creating a data lake, which is actually pretty controversial in the analyst world. Gartner has done a good job of outlining the cons of data lakes, which include the lack of a set definition and a lot of vendor hype.
That’s definitely worth knowing, but I’m not sure it’s relevant here, since Godika isn’t suggesting you dump everything into a Hadoop data lake. In fact, he does just the opposite by requiring you to pinpoint what’s relevant in the first two steps.
“If the information being analyzed is only tangentially related to the topic at hand, it should be set aside,” Godika writes. “Instead, only use information sources that are absolutely relevant.”
That may seem like obvious advice, but given that 80 percent of enterprise data is unstructured, and given how excited IT can get about technology projects… Well, I suspect it’s advice you’ll need to emphasize early and often.
Godika also doesn’t dwell on the technology specifics, but that’s actually what I like about the piece. There’s no shortage of articles about Big Data technologies. By skimming over the technology, he’s free to focus on a much under-discussed aspect of Big Data, which is how to structure (excuse the pun) the data part of the project.
Hopefully, we’ll see more on this topic as Big Data transitions from the sandbox to part of the real data architecture. Recent research suggests that this should happen this year. Deutsche Bank interviewed 26 CIOs at global companies, who reported that they are now more comfortable with Hadoop in particular and foresee the technology playing a “significant part of the future data architecture.” The Wall Street Journal includes the details of Deutsche Bank’s research and notes that Gartner says approximately 1,000 companies currently use Hadoop in production.
Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist.