Data profiling sounds a bit dicey, doesn't it? I picture an IT "officer" taking mug shots of the database, maybe picking "the wrong kind of data" out of lineups.
Oddly enough, that's not so far off. Data profiling tools (aka engines) take a statistical "picture" of the data and, voila, very quickly - albeit very generally - you can see where the problems are. Details vary by which data profiling engine you're using, but basically it will give you statistics on how complete the fields are. It'll tell you things like, for each field in a database, how many entries are null, how many are missing, how many contain an actual value and, given those numbers, the percentage of "completeness" each field has.
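To make that concrete, here's a minimal sketch of the field-completeness statistics described above, assuming a table is represented as a list of dicts (one per row); the field names and sample rows are hypothetical, and a real profiling engine would report much more:

```python
def profile_completeness(rows):
    """Per-field counts of null, missing, and present values, plus a
    completeness percentage - the basic output of a profiling pass."""
    fields = set()
    for row in rows:
        fields.update(row)
    total = len(rows)
    report = {}
    for field in sorted(fields):
        null_count = sum(1 for r in rows if field in r and r[field] is None)
        missing = sum(1 for r in rows if field not in r)
        present = sum(
            1 for r in rows
            if field in r and r[field] is not None and r[field] != ""
        )
        report[field] = {
            "null": null_count,
            "missing": missing,
            "present": present,
            "completeness_pct": round(100.0 * present / total, 1) if total else 0.0,
        }
    return report

# Hypothetical sample: one complete row, one null, one missing field.
rows = [
    {"name": "Ada", "dob": "1970-01-01"},
    {"name": "Grace", "dob": None},
    {"name": "Alan"},
]
report = profile_completeness(rows)
```

Here `dob` comes back one-third complete (one present, one null, one missing), while `name` is 100 percent complete - exactly the kind of at-a-glance signal a profiling run gives you.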
There's a bit more to it - Jim Harris offers some generic samples of the type of information data profiling provides on his Obsessive-Compulsive Data Quality blog - but basically, it's a tool for quickly assessing the completeness and quality of your data so you can fix problems before you integrate it or use it for analysis.
... we were doing data profiling on some fields like date of birth. We were noticing some weird trends - people tended to be born on Jan. 1, Feb. 2, March 3, April 4, May 5 - and we were like, "Okay, what's going on? Why are those particular dates coming up?" We found out that the date of birth field was a required field on an insurance application that most people - like if they were applying for automobile insurance - didn't feel the need to provide. But the data entry clerks were being paid based on how many applications they could get in per hour, so when they came across a required field that they didn't have a value for, they just made a date up: they basically picked their birth year, but used Jan. 1 or Feb. 2. So you had a bunch of bogus dates entered into the system. The data value was accurate - Jan. 1, 1970, is legitimately somebody's birthday - but it's not the birthday of the customer that was associated with the insurance transaction.
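The check that surfaces bogus birthdays like these is a simple frequency analysis: tally month/day pairs and flag any that occur far more often than the rest. Here's a hedged sketch of that idea with hypothetical data (the `factor` threshold is an illustrative choice, not a standard):

```python
from collections import Counter
from datetime import date

def suspicious_month_days(dobs, factor=5.0):
    """Flag (month, day) pairs whose count exceeds `factor` times the
    average count - a common symptom of made-up required-field values."""
    counts = Counter((d.month, d.day) for d in dobs)
    avg = sum(counts.values()) / len(counts)
    return sorted(md for md, n in counts.items() if n > factor * avg)

# Hypothetical data: a pile of Jan. 1 "filler" birthdays mixed into a
# plausible spread of real ones.
dobs = (
    [date(1970, 1, 1)] * 40
    + [date(1980 + i, 3, i % 27 + 1) for i in range(60)]
)
flags = suspicious_month_days(dobs)  # -> [(1, 1)]
```

Against this sample, only Jan. 1 stands out - the kind of pattern that prompted the "Why are those particular dates coming up?" question in the story above.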
It turns out, data profiling is pretty common when applications are new and shiny, but not so common when people start to actually use, move or integrate the data, according to middleware and data management expert Hollis Tibbetts:
When developers write a new application for the input of some new data, it's normal for input fields to be 'validated' - a simple 'hard coded' form of profiling. ... Yet people have far fewer reservations about integrating data from here, there and everywhere - often not checking for even the most egregious data errors, and thereby polluting the organizational drinking water. Data profiling engines are a great technology for quickly improving the quality of data as it is integrated from one system into another.
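A minimal sketch of the integration-time check Tibbetts is arguing for - screening incoming records for egregious errors before loading them, rather than trusting that the source system validated anything. The field names and rules here are hypothetical:

```python
import re

# Hypothetical validation rules for two fields on incoming records.
RULES = {
    "email": lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "")),
    "zip": lambda v: bool(re.fullmatch(r"\d{5}", v or "")),
}

def reject_invalid(records, rules=RULES):
    """Split incoming records into (clean, rejected) before loading,
    so bad data doesn't pollute the target system."""
    clean, rejected = [], []
    for rec in records:
        if all(check(rec.get(field)) for field, check in rules.items()):
            clean.append(rec)
        else:
            rejected.append(rec)
    return clean, rejected

incoming = [
    {"email": "a@example.com", "zip": "02134"},
    {"email": "not-an-email", "zip": "02134"},
]
clean, rejected = reject_invalid(incoming)
```

The point isn't the specific rules - it's that the same gate developers hard-code into input forms should also sit on the integration path.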
That's why data profiling is an important preliminary step to data modeling. It's also used in data quality improvement programs and master data management initiatives to help "ensure the consistency of key non-transactional reference data used across the enterprise," writes Elliot King on The Data Quality Authority Blog.
It sounds like a very IT-focused function, and for a long time it has been - but interestingly enough, that's changing, according to Julie Hunt, a software industry analyst who writes the blog Highly Competitive.
Data profiling started off as a technology and methodology for IT use. But data profiling is emerging as an important tool for business users to gain full value from data assets. When given the right tools and practices for data profiling, business users can quickly identify inconsistencies and problems with data before it is used for reporting and intelligence purposes.
That makes sense. Business users know what they need, and they know what they're willing to tolerate in terms of problems. And even if they aren't completely sure what they should tolerate in terms of errors, at least by doing data profiling, they know what they're getting into when they run analysis and make business judgments on that data.
In other words, when business users have access to data profiling tools, the data quality issues become transparent to them from the get-go and they can decide what to fix or not fix. IT doesn't have to play the guessing game about what's a tolerable margin of error anymore.
Of course, it's going to take some time for business users to really catch on to how to use this traditionally IT-only tool. Hunt suggests business users keep their goals and projects well defined, so the data profiling scope stays narrow and only looks at the most significant data sources. She also offers a series of questions that business users will need to address when they're planning to analyze the data.
Armed with "data intelligence," she writes, business users will know what questions they need to be asking for BI projects. But even more important, it seems to me, is the second benefit she mentions: "Business users will also know if they're tapping into the data that they need to answer their questions," she writes.
That would certainly put a stop to a common IT/business disconnect. And it also elevates data profiling to a strategic discipline, according to King:
In the long run, data profiling can be used both tactically and strategically. Tactically, it can serve as an integral part of data improvement programs. Strategically, it can help managers determine the appropriateness of different data source systems under consideration for deployment in a particular project.
If you'd like to learn more about data profiling, I suggest you check out Harris' eight-part (yes, eight parts!) series, "Adventures in Data Profiling," and download the "Practitioner's Guide to Data Profiling," which was written by DataFlux and David Loshin, president of Knowledge Integrity.