Data Should Be Governed, No Matter How Big, No Matter How Small

For Sunil Soares, director of information governance at IBM, it's simple: Big data is just another category of information. And what that means, in a very practical way, is it needs governance just like any other data.


"I've come to the conclusion over the next 12-18 months, as companies implement big data programs, they're going to have to get their arms around what we call big data governance," Soares said during a recent ITBE interview. "I think of big data as just another category of information."


Most people seem to assume that with Big Data, the sample size is so large, it negates any data quality problems, according to Jim Harris of the OCDQ Blog. In a recent OCDQ Radio spot, Harris questioned statistician and data expert Dr. Thomas C. Redman, aka the "Data Doc," about that assumption.


"My initial reaction is to be very skeptical of any claim that data quality doesn't matter in big data," Redman said. "At least from my experience from Bell Labs, the expression 'To err is human, to really foul things up requires a computer' seems to hold."


Redman explained that when he looked at automatically generated data, it could be fine, but when it wasn't, he found "massive quantities of it were fouled up."


"Maybe the analogy is in big data, indeed, the fat finger stuff is like a teaspoon full of red dye in the lake but machine-generated stuff is the spillage from a raw sewage and stuff that gets dumped in from people who are trying to get rid of stuff," he said. "There's lots and lots of sources of pollution."


What that means is, the only way to ensure your analysis is sound is to ensure you have a governance program in place for Big Data.


In the typical information governance program, that means dealing with data quality and governance, whether it's master data governance, metadata governance or reference data governance, Soares points out. And increasingly, it will mean dealing with Big Data governance.


"I think the traditional disciplines in terms of data quality, metadata, information lifecycle management and security and privacy will still apply," Soares said.


In a January blog post, Soares outlined a framework for the scope of information governance, dividing it into three areas:


  • Master data governance
  • Reference data governance
  • Big Data governance, which includes managing social media, GPS data, sensor data and any data that falls into the Big Data definition of volume, velocity and variety.

He shared an example to illustrate the importance of governing data. An oil and gas company used seismic, or geological, data to understand where it might find deposits of oil and gas. The company already owned a seismic data set internally, but ended up purchasing a similar data set from an external provider just because it was called something else.


"You can see now they overspent, but they had a metadata problem at the end of the day around big data," he said. "I think that's a pretty good example."


Another good example of why Big Data should be governed: social media data.


"When you start dealing with social media data, you get into the realm of privacy issues as well," Soares said. "If you do social media data in the aggregate that's one thing, but if you start going down to what did an individual customer tweet about, that can get into some significant privacy issues."


Soares isn't the only one who thinks governance should top the to-do list for those exploring Big Data.


In December, Jill Dyche, head of Thought Leadership and Education at DataFlux and co-founder of the former Baseline Consulting, pointed out that companies are forgetting about data governance in their rush to embrace Big Data:

But in our conversations about big data, we overlook something just as important as the enabling technologies: the business-driven policy-making and oversight of all that big data. Yep. We're forgetting data governance.

Part of that, she notes, is that many organizations are still in research mode. But if data governance programs are correctly designed in the first place, then there should be guiding principles that come into play for each data category, no matter how big your data, no matter how small.