The New Data Quality Tool: Crowdsourcing, Externally or Internally

Loraine Lawson

Crowdsourcing inevitably raises questions about data quality, but a number of companies and experts believe it can also be used to improve data quality.

GigaOm recently profiled one of these companies, CrowdFlower, after it raised $12.5 million in its Series C round of venture capital — just under half of the $28 million it’s raised since its launch four years ago.

CrowdFlower doesn’t so much crowdsource its work as rely on the crowd to do it. For instance, Unilever hired CrowdFlower to extract sentiment, location, sex and other information from tweets, GigaOm reports. eBay used the company to clean up its product taxonomies.

Traditionally, collecting this data meant using your own back-office data operations employees to research gaps or correct errors in customer data. Crowdsourcing companies try to achieve similar results by collecting and validating data from online sources, according to Maria C. Villar, managing partner of Business Data Leadership.

“Instead of relying on bots to troll the Internet, they use people who can make valuable decisions about the data they find,” Villar writes in an Information Management column. “Because workers are sourced around the world, crowdsourcing companies can perform these tasks in a cost-effective and timely manner.”

Researchers are still exploring how effective crowdsourcing is compared with traditional methods, and a long list of research on the topic is available online. One emerging best practice seems to be a system of checks and validations, with higher-paid experts working at one level and volunteers or lower-compensated workers performing validations at another.

For instance, Maribel Acosta, a Ph.D. student at the Karlsruhe Institute of Technology (KIT), describes a two-tiered payment system. Linked Data experts were paid a higher rate for domain-specific tasks than the Mechanical Turk (MTurk) workers who simply compared the data and validated the experts’ findings. You can read more in her presentation, “Crowdsourcing Linked Data Quality Assessment,” which is available on SlideShare.
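To make the two-tier idea concrete, here is a minimal Python sketch of the checks-and-validations pattern: experts flag suspect records, and a larger pool of lower-paid validators votes on each flag. This is not Acosta's actual pipeline. The function name `verify_with_crowd`, the record IDs and the simple-majority threshold are all assumptions made for illustration.

```python
def verify_with_crowd(expert_flags, crowd_votes, min_agreement=0.5):
    """Sort expert-flagged records by how the validating crowd voted.

    expert_flags: record ids flagged as errors by the (hypothetical) expert tier.
    crowd_votes:  dict of record id -> list of booleans from validators,
                  where True means "I agree this is an error".
    min_agreement: share of True votes that must be exceeded to confirm a flag
                   (an assumed threshold, not one taken from the research).
    """
    confirmed, rejected, unreviewed = [], [], []
    for record_id in expert_flags:
        votes = crowd_votes.get(record_id, [])
        if not votes:
            # No validator has looked at this flag yet.
            unreviewed.append(record_id)
            continue
        share_agreeing = sum(votes) / len(votes)
        if share_agreeing > min_agreement:
            confirmed.append(record_id)
        else:
            rejected.append(record_id)
    return confirmed, rejected, unreviewed


# Illustrative data only: three flagged records, two of them voted on.
flags = ["r1", "r2", "r3"]
votes = {"r1": [True, True, False], "r2": [False, False, True]}
confirmed, rejected, unreviewed = verify_with_crowd(flags, votes)
print(confirmed, rejected, unreviewed)
```

A real system would also need to decide how many validators to assign per record and how to handle ties, neither of which this sketch addresses.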

Panos Ipeirotis describes a similar but much more complicated approach in “Crowdsourcing: Achieving Data Quality with Imperfect Humans,” which is available on YouTube. Ipeirotis holds a Ph.D. in computer science from Columbia and is now an associate professor and George A. Kellner Faculty Fellow in the Department of Information, Operations, and Management Sciences at New York University’s Leonard N. Stern School of Business.


What good is all of this, though, if you’re just not ready to trust crowdsourcing?

Actually, Villar suggests that organizations apply crowdsourcing data quality practices internally. As it stands now, employees encounter substantial barriers to correcting data errors, she notes. For instance, if an error is found, they have to track down the DBA or the data owner.

“Most employees simply give up and pursue other means to get the information they need, keeping the problem to themselves,” she writes.

By using crowdsourcing techniques, you can change that.

“Using a ‘like’ approach (think Facebook) will allow all employees to rate the usefulness of the data in the core applications,” Villar states. “While software application vendors would need to provide some capability to do this inside packaged applications, as data management professionals we should be requesting this capability. … Put your own crowd to work.”
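As a rough illustration of what such a “like” capability might look like, the Python sketch below lets employees mark a data field as useful or not and then surfaces the fields that most people rate poorly. Villar does not describe an implementation, so everything here is an assumption: the `DataFeedback` class, the field names, and the thresholds are hypothetical.

```python
from collections import defaultdict


class DataFeedback:
    """Collects one useful / not-useful vote per employee per data field."""

    def __init__(self):
        # field name -> {employee id: True if rated useful, False otherwise}
        self._votes = defaultdict(dict)

    def rate(self, field, employee, useful):
        # A repeat vote from the same employee overwrites the earlier one.
        self._votes[field][employee] = bool(useful)

    def score(self, field):
        """Share of employees who rated the field useful, or None if unrated."""
        votes = self._votes.get(field, {})
        if not votes:
            return None
        return sum(votes.values()) / len(votes)

    def needs_review(self, threshold=0.5, min_votes=3):
        """Fields with at least min_votes ratings and a score below threshold."""
        return sorted(
            field
            for field, votes in self._votes.items()
            if len(votes) >= min_votes
            and sum(votes.values()) / len(votes) < threshold
        )


# Illustrative usage with made-up field and employee names.
feedback = DataFeedback()
feedback.rate("customer.phone", "ann", False)
feedback.rate("customer.phone", "bob", False)
feedback.rate("customer.phone", "cy", True)
feedback.rate("customer.email", "ann", True)
print(feedback.needs_review())
```

The `min_votes` guard is a design choice to keep a single unhappy user from sending a field to the data owner; the right value would depend on how many people use the application.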

Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson at Google+ and on Twitter.



Sep 24, 2014 5:20 AM Michele Goetz says:
Excited to read your post! The value of social data governance has been vastly ignored and underrated in data governance strategies. I came across CrowdFlower a little over a year ago and am only now just starting to see both start-ups (Reltio MDM) and established vendors (IBM Watson Analytics) introducing the "easy button" to the edge of data consumption and interaction. We need more for three reasons: Data governance teams can't scale to today's data demand and agility requirements, social governance instills the role of data citizenship making all responsible for trusted data, and there is greater transparency between data and how it is used to optimize data governance priorities and practices. Social data governance will push data management from back office processing to front office value faster and more effectively. We need to be talking about this more and develop these capabilities further if we are going to finally infuse data responsibility into our corporate culture.
Oct 6, 2014 2:35 AM Robert Hillard says:
You are spot on! I've argued for years that data quality is better served by including stakeholders such as customers and staff. A post I wrote some time ago argued that it could help engage staff, engage stakeholders and even improve confidence in the identity of customers: http://www.infodrivenbusiness.com/post.php?post=2010/07/03/the-power-of-the-crowd-can-improve-your-data-quality/
