Lora Bentley spoke with Yves de Montcheuil, marketing VP at open source data integration provider Talend. The company recently released the first open source data profiler.
Bentley: Can you explain why data quality is important when it comes to data integration?
de Montcheuil: Data integration is about moving data between applications, databases, systems, etc. It can be used for business intelligence and data warehousing, and it's also used for what we call operational data integration. One of the big issues when you're moving data around is, if you have bad quality data, it will propagate extremely quickly to other applications and databases in the information system.
It's a little bit like human viruses. In the 18th century, they wouldn't travel very far, but in the 20th and 21st centuries, when there's an epidemic in Asia, it propagates immediately to America and Europe because people are traveling. The same goes for data integration. If you have corrupted data in one system and you move it to another system and then to another system, you end up having big data problems everywhere.
Bentley: So how does a data profiler fit into the process?
de Montcheuil: Well, the first step before cleaning your data or fixing it, if you will, is to know what kind of shape it's in. That's what data profiling is: understanding the level of quality of your data and understanding where the trouble spots lie.
Bentley: What kind of "trouble spots" are you looking for?
de Montcheuil: For example, talking about customer data, if you don't have phone numbers for half of your customers and you start to do telemarketing, you're not going to be very successful. If you don't have zip codes and you are shipping products, they are probably never going to arrive.
In the human resources world, you need to know the Social Security Number and date of birth of your employees. If you don't have them, or if they are corrupted, you have data quality problems.
If you have a product catalog and your average product description length is the same as the average product name length, guess what? People are just copying the product names and pasting them into the description field to save time....
So profiling is the process of understanding the quality or the non-quality of your data, knowing exactly what you're dealing with, and then after you make corrections, doing the profiling over and over again to determine the improvement or the degradation of your data quality. You want to make sure that it's improving rather than getting worse over time.
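The kinds of checks described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how Talend's profiler works: the function names, the sample records, and the 10 percent threshold for flagging copied descriptions are all assumptions made for the example.

```python
from statistics import mean


def missing_rate(records, field):
    """Fraction of records where `field` is absent, None, or blank."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not str(r.get(field) or "").strip())
    return missing / len(records)


def average_length(records, field):
    """Mean character length of `field`, counting missing values as empty."""
    if not records:
        return 0.0
    return mean(len(str(r.get(field) or "").strip()) for r in records)


# Hypothetical sample data, invented purely for illustration.
customers = [
    {"name": "A. Martin", "phone": "555-0100", "zip": "94105"},
    {"name": "B. Chen", "phone": "", "zip": "10001"},
    {"name": "C. Okafor", "phone": None, "zip": ""},
    {"name": "D. Silva", "phone": "555-0199", "zip": "60601"},
]
products = [
    {"name": "Steel bolt M8", "description": "Steel bolt M8"},
    {"name": "Copper washer", "description": "Copper washer"},
]

# Completeness checks: how many customers lack a phone number or zip code?
phone_missing = missing_rate(customers, "phone")
zip_missing = missing_rate(customers, "zip")

# Heuristic check: descriptions barely longer than names suggest copy-paste.
# The 1.1 factor is an arbitrary threshold chosen for this sketch.
name_len = average_length(products, "name")
desc_len = average_length(products, "description")
descriptions_look_copied = desc_len <= name_len * 1.1

print(f"missing phone: {phone_missing:.0%}, missing zip: {zip_missing:.0%}")
print(f"descriptions look copied from names: {descriptions_look_copied}")
```

Running the same measurements again after each round of corrections gives the before-and-after comparison that shows whether quality is improving or degrading.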
Bentley: What is the advantage of an open source data profiler over a proprietary one? Are there advantages other than cost savings?
de Montcheuil: Historically there have been a handful of data profiling players, and all of them were proprietary. There has been no open source option until now. But all of the data profiling vendors were acquired by data quality vendors, and most of the data quality vendors were then acquired by data integration vendors, who in turn were acquired by BI vendors, ERP vendors or database vendors.
As a result, today data profiling is a very small component of some very large and broad product offerings, and it doesn't get the kind of attention it deserves. And those offerings are very expensive. If you look at a data quality solution from IBM or Informatica or SAP, you're probably going to end up paying $100,000 or $200,000 or more just for the data quality part of the story...
Where open source really brings big value, in addition to the financial savings, is in putting data quality in the hands of companies that think they need to do data quality but can't go to their management and justify the budget to initiate the project. It's the same way we've been democratizing data integration...but there is a difference. Data integration you can do manually with coding. It's lengthy, it's time-consuming, etc., but it's doable. Data quality cannot be done manually. The queries you have to run against the database are just too complex. And data profiling is usually done by business users who have no idea how to access the database.
As a result, a lot of companies don't have a data quality strategy in place. Open source will help them get there.