Data without any context isn't all that useful. Unfortunately, most of the data that we do collect in the enterprise has little to no context.
To compensate for this, IT organizations have been investing billions, maybe even trillions, of dollars in data warehouses and business intelligence applications. We are collecting more data than ever, which in turn is straining not only our ability to make sense of that information, but also our storage budgets. What if we took an entirely different approach to collecting data in the first place, one inherently more efficient at assigning context to information? That's the strategic goal of a G2 project that Jeff Jonas, IBM distinguished engineer and chief scientist for Entity Analytics, is working on.
Jonas argues that one of the reasons IT is fundamentally flawed is that we keep collecting data and then trying to ascribe context to it after the fact. Working with other elements of IBM, Jonas is developing a system in which context is ascribed to each piece of data as it is collected. That new data then informs the rest of the applicable data sets about its context, which in turn would not only continuously update the context surrounding the data set but also inform other related sets of data about changes in context. IBM acquired much of the capability to do this when it bought SRD, the company Jonas founded, in 2005. Under this new project, Jonas is researching how to apply that technology, currently offered in the form of IBM's Identity Insight Server, to a much broader range of entities, including names, places and events.
On one level, the idea is pretty simple. In order to acquire knowledge, systems need to correlate information. To achieve context today, we take samples of data sets and make a "best guess" about the context of the information. That best guess, however, is heavily dependent on both the quality of the information and the assumption that there are no missing variables that could render the data model irrelevant. Given how often end users complain that IT reports don't reflect the true state of the business, it's clear that poor-quality data, coupled with missing critical components of a data model, is eroding the value end users place on BI applications. Many of them would much rather rely on spreadsheets, where they can better assign all the variables, than on a report from IT. It's only when the problem scales beyond human ability to process on a spreadsheet that we see business users acquiescing.
Jonas says we could eliminate much of this distrust of the data by assigning context to each piece of data as it's collected. Each new piece of data would first be identified in the context of existing data in the system, then used to update the rest of the system. In turn, that would not only reduce processing requirements, because each new report would not have to reprocess the entire data set, but would also let us work with much larger sets of streaming data, using less processing power to provide more real context than ever in less time.
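To make the idea concrete, here is a minimal sketch of context-at-ingest in Python. This is not IBM's Identity Insight or G2 technology, which is far more sophisticated; the class, matching rule, and attribute names are all invented for illustration. The point it demonstrates is the one Jonas makes: each record is resolved against existing entities as it arrives, so no later batch reprocessing of the full data set is needed.

```python
from collections import defaultdict

class ContextStore:
    """Toy incremental entity store. Each incoming record is
    resolved against existing entities at ingest time instead of
    being reprocessed in bulk later."""

    def __init__(self):
        self.entities = {}             # entity_id -> merged attributes
        self.index = defaultdict(set)  # (key, value) -> entity ids sharing it
        self.next_id = 0

    def _matches(self, record):
        """Find entities that share any attribute value with the record."""
        hits = set()
        for kv in record.items():
            hits |= self.index[kv]
        return hits

    def ingest(self, record):
        """Resolve one record as it arrives: merge it into a matching
        entity if one exists, otherwise create a new entity."""
        hits = self._matches(record)
        if hits:
            eid = min(hits)  # naive: merge into the lowest-numbered match
        else:
            eid = self.next_id
            self.next_id += 1
            self.entities[eid] = {}
        merged = self.entities[eid]
        merged.update(record)          # new data updates existing context
        for kv in merged.items():
            self.index[kv].add(eid)    # re-index so future records can match
        return eid

store = ContextStore()
a = store.ingest({"name": "J. Smith", "phone": "555-0100"})
b = store.ingest({"phone": "555-0100", "email": "js@example.com"})
# Both records share a phone number, so they resolve to one entity
# that now carries the combined context of both observations.
```

A real system would need probabilistic matching, entity splitting and merging, and propagation of context changes to related entities, but even this sketch shows why ingest-time resolution avoids rerunning correlation over the whole data set for every new report.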
More importantly, it would become possible for systems to have a memory of processes, thus overcoming what Jonas describes as the "corporate amnesia" that plagues all our IT applications. In effect, systems would be able to correlate data sets representing separate processes to identify patterns of interest in real time without having to wait to respond to a specific query.
We're still in the very early days of reinventing how we process data, and the issues at hand are as much cultural as they are technical. But Jonas, the founder of SRD, says IBM has the best seat in the house to realize this vision now that the company, under the auspices of its "Smarter Planet" initiative, has added SPSS and Cognos to its already formidable investments in business intelligence. And to its credit, while other vendors are touting the importance of business intelligence, IBM seems to be the only vendor spending billions of dollars trying to fundamentally improve it.