Did you know that Gartner is predicting data will grow by 800 percent over the next five years?
More noteworthy is the prediction that 80 percent of this data will be unstructured: emails, texts, pictures, log data, social media data, XML files, videos, audio and the like.
That’s the prediction. It’s a bit intimidating when you consider the problems we already have with unstructured data.
One of the ultimate goals of information management is to take that unstructured data and integrate it with traditional, structured data. That’s not exactly something IT departments typically know how to do.
That needs to change, according to a recent TDWI Checklist Report, “Integrating Structured and Unstructured Data.”
“As more organizations evaluate the potential and scope of big data projects and understand the ramifications, there will be greater recognition that establishing a sound foundation for data integration is a critical factor in any information utilization strategy,” states David Loshin, the president of Knowledge Integrity, Inc. and author of the report. “The challenge of integrating structured and unstructured data will be a key factor for big data success.”
Loshin identifies seven steps you’ll need to take to manage and integrate unstructured data.
I’m not going to lie to you: This is not a simple checklist. Using unstructured data means making it meaningful. It’s not enough to be able to search for “Bob.” You have to know the context and how that impacts the meaning of “Bob.”
To do this, you’ll need what Loshin calls “meaning-based computing techniques,” and that will require layers of technology. You’ll need to be able to scan and parse text; add meta-tagging; and add concept tags (when is “Bob” Bob Newhart and when is it Billy Bob Thornton?). And, of course, all of this will need to be automated.
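To make those layers a little more concrete, here is a minimal Python sketch of the three steps: scan and parse the text, attach meta-tags, then add a concept tag by looking at the words around "Bob." To be clear, this is my own toy illustration and not anything from Loshin's report. The CONCEPTS table, its context words and the tag_document function are all hypothetical, and a real meaning-based system would lean on far richer language models than simple word overlap.

```python
import re

# Hypothetical concept table: for each ambiguous term, the candidate concepts
# and the context words that hint at each one. Illustrative values only.
CONCEPTS = {
    "Bob": {
        "Bob Newhart": {"comedian", "sitcom", "newhart", "stand-up"},
        "Billy Bob Thornton": {"billy", "thornton", "director", "sling"},
    }
}


def tokenize(text):
    """Scan and parse: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def tag_document(text, source):
    """Return meta-tags and concept tags for one piece of unstructured text."""
    tokens = tokenize(text)
    token_set = set(tokens)

    # Meta-tagging: descriptive facts about the document itself.
    metadata = {"source": source, "token_count": len(tokens)}

    # Concept tagging: pick the candidate whose context words overlap most
    # with the document; leave the tag as None when nothing matches.
    concept_tags = {}
    for term, candidates in CONCEPTS.items():
        if term.lower() not in token_set:
            continue
        scores = {name: len(ctx & token_set) for name, ctx in candidates.items()}
        best = max(scores, key=scores.get)
        concept_tags[term] = best if scores[best] > 0 else None

    return {"metadata": metadata, "concepts": concept_tags}


if __name__ == "__main__":
    print(tag_document("Bob starred in a classic sitcom as a comedian.", "email"))
    print(tag_document("Billy Bob Thornton directed the film.", "blog"))
    print(tag_document("Bob sent the report.", "email"))
```

Note the third case: with no helpful context, the sketch declines to guess and leaves "Bob" untagged, which is usually safer than a wrong tag.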
In addition to the technology issues, this kind of integration will require close collaboration with the business. For instance, you’ll need to establish a lexicon of key business terms and phrases, as well as the contexts in which these terms and phrases are used.
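What might such a lexicon look like once IT and the business have agreed on it? One simple possibility, and it is only my assumption about the shape, is a table keyed by term and then by context. The terms, contexts and definitions below are invented for illustration, as is the define helper.

```python
# Hypothetical business lexicon: term -> context -> agreed definition.
# Every entry here is made up for illustration.
LEXICON = {
    "account": {
        "sales": "A customer organization we sell to",
        "finance": "A ledger category used to record transactions",
    },
    "lead": {
        "sales": "A prospective customer who has shown interest",
    },
}


def define(term, context):
    """Look up the agreed meaning of a term in a given business context.

    Returns None when the business has not yet defined the term for that
    context, which flags a gap to take back to the business owners.
    """
    return LEXICON.get(term.lower(), {}).get(context.lower())


if __name__ == "__main__":
    print(define("Account", "finance"))
    print(define("lead", "finance"))  # None: no agreed finance meaning yet
```

The point of the design is that the same word can carry a different agreed meaning per context, and a missing entry is a visible prompt for a conversation rather than a silent guess.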
Loshin offers a good dose of technical guidance in this checklist, but it seems vendor-neutral to me. HP is listed as the sponsor, and sometimes it’s hard to tell if a paper has been skewed to favor a vendor’s capabilities while downplaying weaknesses, but from what I’ve seen, that’s not an issue with TDWI reports. I’m sure a competitor will contest that statement if I’m wrong.
So check it out. Even if you know this is still a long way off for your organization, it’s smart to read it for an idea of what you need to build into your systems and data architecture over the next few years. Plus, it’s a free download, so why not?