John Lucker, principal for Deloitte’s Advanced Analytics and Modeling Sector, works with organizations that want to improve their advanced analytics, predictive modeling and data mining capabilities. He explains to IT Business Edge’s Loraine Lawson how data virtualization works as an integration option and why it’s a good way to reduce data corruption.
Lawson: I recently read integration is a huge part of BI projects, which I understand is one reason why organizations are trying data virtualization and other options for BI. But the analyst interviewed said the integration part is often poorly done or underestimated. Can you talk about that and perhaps how data virtualization can address that problem?
Lucker: One way is actually physically integrating the data into an enterprise warehouse and creating a central repository for all data inside of a company. There are companies that have tried to do that. There are companies that have succeeded at it, but I would say there are a lot more that have tried and stumbled than tried and succeeded.
Data virtualization allows the primary data to remain in its source systems and stay pristine, which I think is an important point: the more you move data around and put it in secondary repositories that are not the primary repository, the more chance you have for corruption and disruption.
Lawson: Now why is that? I mean, I understand that corruption could happen. It seems like that would be a rare thing, but in fact it's not?
Lucker: Well, it’s not corruption in a technical sense. It’s corruption in the business sense: somebody has the opportunity to restate that data in some way, whether through a BI tool or perhaps inside the database itself, and then when some piece of data shows up on a report, it’s very difficult to go back and reconcile it. Where’d this number come from? Did it come from the source system? Did it come from this secondary repository? They go back and look at that, and they find out that somebody’s created perhaps a tertiary repository for it.

Data governance is an enormous issue in making sure that the integration of BI and the source data stays well documented and well understood throughout the organization. That in itself is a gigantic problem, and it continues to be a huge problem in a lot of the companies I work with, despite truly heroic efforts to get data in shape and data governance processes together, because people will continue to use BI tools to pull custom extracts and store them in a variety of places for reporting.
Lawson: Is data virtualization a preferred method for accessing data for BI?
Lucker: It is. At the very minimum, it allows companies to make sure that if you're using data, you're at least using it from a primary source, that you're letting software and metadata glue the view of the data together, and that you're eliminating the need to pull or extract it from all these disparate locations. So it lets you create a kind of virtual view of data, and it isolates the user from having to know where the data is and how it’s structured.
Lawson: How do you virtualize data?
Lucker: There’s a variety of different technologies, but the way to think of it conceptually is as a map. When you ask for piece of data A, I need to invoke a query to get it from this database over here in the warehouse data system. When you ask for piece of data B, I need to create a query that will go out and get it from the supply chain, the trucking system, where these things are loaded from the warehouse onto a truck. And when you ask for piece of data C and you're interested in customers, I need to go out to the order entry system and retrieve customer data.
All of that (where the data is, how it’s structured, how to go get it and how to put it in a usable condition) is handled by the virtualization layer. Think of it as a server that knows how to do that stuff, and the user, who is using a BI tool, doesn’t have to know how to do it. All they have to do is refer to a piece of data.
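Conceptually, the map Lucker describes could be sketched as a catalog that routes each logical data name to the right source system and query. This is a minimal illustration only; the source names, queries and connector behavior below are hypothetical, not any vendor's actual API.

```python
# Minimal sketch of a data-virtualization routing map (hypothetical names;
# a real product would resolve sources via JDBC/ODBC drivers, web services, etc.).

# The "map": each logical piece of data points at the system it lives in
# and the query needed to fetch it there.
CATALOG = {
    "inventory": {"source": "warehouse_db",
                  "query": "SELECT sku, qty FROM stock"},
    "shipments": {"source": "trucking_system",
                  "query": "SELECT sku, truck_id FROM loads"},
    "customers": {"source": "order_entry",
                  "query": "SELECT cust_id, name FROM customers"},
}

def run_query(source, query):
    """Stand-in for real source connectors; here it just echoes the routing."""
    return f"[{source}] {query}"

def fetch(name):
    """Resolve a logical data name: look up where it lives and how to get it,
    then run the right query against the right system."""
    entry = CATALOG[name]
    return run_query(entry["source"], entry["query"])
```

The point of the sketch: a BI user only refers to a piece of data, such as `fetch("customers")`, and the layer knows it must go to the order entry system to satisfy that request.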
Lawson: So, OK, it sounds like the Internet in that way. But how is it doing that?
Lucker: Well, you're right, it's like a URL: you click on something, and you don’t need to know that the server is in, wherever, China. It knows how to resolve the name, route its way through the Internet, get the HTML page, bring it up, translate it and show it to you.
Lawson: But what I find confusing is that data virtualization is capable of doing some data quality work and presenting the data as one piece, right?
Lucker: Yes, if it’s taught that, yes.
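"Taught" here amounts to attaching transformation rules to the virtual view so disparate rows come back cleaned and unified. The sketch below is hypothetical (the rule, field names and data are invented for illustration); real tools typically configure such rules declaratively rather than in code.

```python
# Sketch: a virtual view that has been "taught" a data-quality rule
# (hypothetical example, not a real product's configuration).

def normalize_state(record):
    """Quality rule: canonicalize a free-text state field to a two-letter code."""
    aliases = {"connecticut": "CT", "conn.": "CT", "ct": "CT"}
    value = record["state"].strip()
    record["state"] = aliases.get(value.lower(), value.upper())
    return record

def virtual_customer_view(raw_rows):
    """Present rows from different sources as one cleaned, unified piece.
    Copies each row so the pristine source data is never modified."""
    return [normalize_state(dict(row)) for row in raw_rows]

# Rows as they might arrive from two differently formatted source systems.
rows = [{"name": "Acme", "state": "Connecticut "},
        {"name": "Bolt", "state": "ct"}]
```

Both rows come back from the view with the state normalized to "CT", while the underlying source rows stay untouched, which is the "pristine source" property Lucker emphasizes.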
Lawson: Do other people have a hard time understanding how that functions?
Lucker: It’s a very new area, so it’s evolving. How it works and what it can do depend largely upon how sophisticated a particular company wants virtualization to be.
I will say very few companies are using the theoretical concept of virtualization in a pure sense, where nobody ever moves data from its source system, nobody ever creates an intermediate warehouse where the data has been cleaned or normalized in any way, and any transformations are invisible to the end user. I would say that is an aspiration versus reality.
But pieces of that are being done in many companies, particularly with systems they want to insulate or isolate, or with types of data that are particularly challenging for an end user to deal with, either technically or from a business-meaning perspective.