For some reason, I find data virtualization very hard to wrap my head around — specifically in terms of when you would use it and when you wouldn’t. Maybe I’m alone in this, but I suspect not.
I re-encountered this problem while reading a SearchDataManagement.co.UK article, “Data virtualisation on rise as ETL alternative for data integration.” It talks a lot about how data virtualization differs from ETL, and in particular points out that it can be a better option when you’re dealing with a growing number of data sources (and who isn’t these days, right?).
But while it focuses a lot on the benefits of data virtualization, it doesn’t really provide firm guidance on when you might not want to use it. I’ve actually written about this a bit previously, but I still had a lot of questions. So I thought it might be worthwhile to share my questions and what I learned, just in case other people are wondering, too.
What’s the big deal about data virtualization all of a sudden? Data virtualization adoption has been growing for some time now, and Forrester Research predicts it’s going to boom, with $8 billion being spent on data virtualization licenses, maintenance and services within two years, the article notes. One reason it’s becoming more popular is that it solves a lot of the data integration challenges you encounter with BI reporting.
The benefit for BI purposes is two-fold, as far as I can tell:
1. You’re not moving or changing the original data. That means you can tinker with things like data quality or integration without worrying about messing up the original. And because physically integrating data tends to introduce errors of its own, leaving it in place can actually mean you’re working with better data.
2. It’s closer to real time. No more waiting for that overnight batch process. Because data virtualization maps back to the original data, you’re already up to date.
And how does it work again? It can work a couple of ways, John Lucker, a principal analyst for Deloitte’s Advanced Analytics and Modeling Sector, explained to me recently, but the best way to think of it is that the virtualization software basically brings the data to a server by mapping to the original location. So you might think of it as functioning very much like the Internet — it’s “browsing” the data and making it possible to view it, without actually moving it to your computer or a new server.
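To make that “browsing, not moving” idea concrete, here’s a toy sketch in Python. Everything in it — the source names, schemas and the join — is invented for illustration; real products expose this kind of thing through SQL views rather than application code. The point is just that the “virtual view” runs against the sources at query time instead of copying data into a new store.

```python
# Two "source systems" that stay exactly where they are.
# (These records and field names are made up for the example.)
crm_customers = [
    {"id": 1, "name": "Acme Corp", "region": "EMEA"},
    {"id": 2, "name": "Globex", "region": "APAC"},
]
billing_invoices = [
    {"customer_id": 1, "amount": 1200},
    {"customer_id": 2, "amount": 450},
    {"customer_id": 1, "amount": 300},
]

def virtual_customer_spend():
    """A 'virtual view': joins the two sources at query time.

    Nothing is copied or transformed ahead of time, so each call
    reflects whatever is currently in the sources -- the real-time
    benefit described above.
    """
    for customer in crm_customers:
        total = sum(inv["amount"] for inv in billing_invoices
                    if inv["customer_id"] == customer["id"])
        yield {"name": customer["name"], "total_spend": total}

for row in virtual_customer_spend():
    print(row)
```

If a new invoice lands in `billing_invoices`, the next call to the view picks it up immediately — no overnight batch required.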
OK, vendors say you can do data quality on that or use it for master data management. How does that work? This is one of the key differences between data virtualization and data federation: Whereas federation allows you to aggregate the data and read it, virtualization allows you to actually do stuff with it — like address data problems.
Lucker says it depends on how advanced the system is, but it’s something you would do for reporting purposes, not a way of fixing the data at the source. Again, the benefit is that you’re keeping your source data as is, which can be important if, for instance, something goes wrong and you need to revert to the original. If you’d like to learn more about data virtualization and data quality, David Loshin of Knowledge Integrity wrote a lengthy whitepaper on the topic in 2010. It’s still pertinent.
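Here’s a quick sketch of what “data quality in the virtual layer” means in practice. The field names and cleansing rules are invented for illustration; the thing to notice is that the standardization happens as records are read, while the source rows are never modified — so you can always fall back to the original.

```python
# Made-up "source" records with some messy values.
source_records = [
    {"email": " Alice@Example.COM ", "country": "uk"},
    {"email": "bob@example.com", "country": "UK"},
]

def cleansed_view(records):
    """Apply cleansing rules on read; the source list is untouched."""
    for r in records:
        yield {
            "email": r["email"].strip().lower(),  # trim and normalize case
            "country": r["country"].upper(),      # standardize country codes
        }

cleaned = list(cleansed_view(source_records))
print(cleaned[0])         # the standardized copy the report sees
print(source_records[0])  # the original still has the messy values
```

That’s the safety net Lucker describes: if a cleansing rule turns out to be wrong, you just change the rule — the source data was never altered.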
OK, but what if I want to change the data at the source? Does it support write-back? It can, but the ability to write back to the source depends upon the application, according to a response tweeted by Composite Software, which sells a data virtualization solution. “About 10 percent of the apps include some writeback,” according to Composite.
If you’d like to learn more, in addition to the SearchDataManagement article and Loshin’s whitepaper, check out:
If you're attending the TDWI conference in San Diego at the end of the month, you might also check out Dave Wells' class, "W5 Data Virtualization: Solving Complex Data Integration Challenges."