Last week, I wrote about the three paradigm shifts required for Big Data, as outlined in a recent Foreign Affairs article. One of the required shifts: understanding that Big Data is inherently messy data.
To me, that raises the question: How does Big Data change data quality and governance?
It has to, right? In practice, most large data sets are messy and, as it turns out, many also come from external sources such as social media or open data. That makes it even harder to practice traditional data governance.
Maria C. Villar, a managing partner of the business data management firm Business Data Leadership, recently tackled this tough topic in an excellent Information Management article.
With Big Data, you need to let go of any need for perfection and total control when it comes to governance and quality.
Likewise, you can’t govern every aspect of Big Data, she warns. Instead, Big Data governance is all about collaborating and being very specific about what matters and what’s “good enough.”
“First, it is important to go back to the basic definition of data quality as ‘fit for purpose,’” Villar writes. “In this case, fit for purpose means evaluating how this data will be used in the intended business use cases and just how ‘good’ the data needs to be in order to provide meaningful value to the business goal.”
Data governance should be a collaborative process that identifies “the most critical, strategic, shared big data sources and specific data fields that are the most important to govern,” Villar states.
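To make “fit for purpose” a bit more concrete, here is a minimal Python sketch of how a team might write down what “good enough” means for a handful of governed fields in one use case. Everything in it is an assumption for illustration: the `FieldRule` class, the completeness thresholds, and the sample social media records are hypothetical and do not come from Villar’s article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical 'good enough' rule for one governed field."""
    name: str
    min_completeness: float  # share of records that must have a value, 0.0 to 1.0


def completeness(records, field):
    """Return the share of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def fit_for_purpose(records, rules):
    """Map each governed field to (observed completeness, passes threshold)."""
    report = {}
    for rule in rules:
        observed = completeness(records, rule.name)
        report[rule.name] = (observed, observed >= rule.min_completeness)
    return report


# Illustrative social media records; the fields and values are made up.
posts = [
    {"user_id": "u1", "text": "love the new app", "location": "Austin"},
    {"user_id": "u2", "text": "checkout is slow", "location": None},
    {"user_id": "u3", "text": "", "location": None},
    {"user_id": "u4", "text": "great support", "location": ""},
]

# Assumed thresholds: a sentiment use case needs text on most posts
# but can tolerate sparse location data.
sentiment_rules = [
    FieldRule("user_id", 1.0),
    FieldRule("text", 0.7),
    FieldRule("location", 0.2),
]

report = fit_for_purpose(posts, sentiment_rules)
for field, (observed, ok) in report.items():
    print(f"{field}: {observed:.0%} complete, good enough: {ok}")
```

The point of the sketch is that the same messy data set can pass for one use case and fail for another. Swap in a rule set that demands location on 90 percent of posts, and the very same records would no longer be fit for purpose.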
In short, Big Data will force a conversation about data quality and governance, but this time the conversation will be about how much risk you can live with and what’s practical.
Among the areas Villar suggests require a new game plan:
- Data quality criteria
- Lifecycle management (you really can’t keep Big Data sets indefinitely, nor should you, since doing so can drive up the cost of ownership and degrade analytics results, she adds)
- Compliance
- Metadata requirements
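Of the areas listed above, lifecycle management is probably the easiest to picture in code. Here is a rough Python sketch of a retention check that flags data sets that have outlived an agreed retention period. The `RETENTION_DAYS` table, the catalog entries, and the dates are all hypothetical; real retention periods would come out of the governance discussion, not from this example.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, agreed per data source.
RETENTION_DAYS = {"social_media": 90, "open_data": 365}


def expired_datasets(catalog, today):
    """Return the names of data sets whose retention period has passed."""
    expired = []
    for entry in catalog:
        limit = timedelta(days=RETENTION_DAYS[entry["source"]])
        if today - entry["loaded_on"] > limit:
            expired.append(entry["name"])
    return expired


# Made-up catalog entries for illustration only.
catalog = [
    {"name": "tweets_q1", "source": "social_media", "loaded_on": date(2013, 1, 15)},
    {"name": "tweets_q2", "source": "social_media", "loaded_on": date(2013, 5, 1)},
    {"name": "census_extract", "source": "open_data", "loaded_on": date(2012, 9, 1)},
]

to_purge = expired_datasets(catalog, today=date(2013, 6, 1))
print(to_purge)
```

With these assumed values, only the oldest social media set is past its 90-day limit, so it is the only one flagged for review or deletion.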
As I’m sure any analyst or consultant will tell you, it’s never too early to start the data governance and data quality conversation. And, humans being human, it’s a common mistake to ignore that advice and push ahead without having the governance discussion at the start of the program.
But perhaps it will help to think of it this way: Data governance is just agreeing to the standards and processes you’ll use for collecting, identifying, storing and using data, according to Capgemini’s Scott Schlesinger, who specializes in Business Information Management.
When you think about it, that’s really something you have to do at some point, anyway.
But beware, oh ye data governance laggards: The governance conversation becomes even more important if your data governance is non-existent or weak.
“Your data governance may not be that strong to start with. Big Data will make it worse,” Schlesinger warns.
If you’d like to read more, here are some recommended readings:
“Big Data Governance Maturity”: IBM’s Steve Adler lists nine fundamental questions every organization should be able to answer about Big Data.
“Big Data Governance: Why Do You Govern Big Data Outside of Databases?”: EMC’s April Reeves talks about the potential role of content management systems in governing unstructured data.
“Mission Impossible? Data Governance Process Takes on ‘Big Data’”: Although it was written last year, this TechTarget article’s recommendations for quality checks on incoming Big Data and mapping new data to reference data are still relevant.
Sunil Soares, who left his position as director of IBM’s Information Governance Practice to start Information Asset, has written a book on Big Data Governance. Many people will recognize Soares’ name from his previously published books on data governance. Here’s a recent review of the Big Data governance book by data analyst/technician Andrew Robinson. DataQuality Pro also reviewed the book in November.