Crowdsourcing inevitably raises questions about data quality, but a number of companies and experts believe it can actually be used to improve data quality.
GigaOm recently profiled one of these companies, CrowdFlower, after it raised $12.5 million in its Series C round of venture capital — just under half of the $28 million it’s raised since its launch four years ago.
CrowdFlower doesn’t so much crowdsource its own work as rely on the crowd to do work for its clients. For instance, Unilever hired CrowdFlower to extract sentiment, location, sex and other information from tweets, GigaOm reports. eBay used the company to clean up its product taxonomies.
Traditionally, collecting this data meant using your own back-office data operations employees to research gaps or correct errors in customer data. Crowdsourcing companies try to achieve similar results by collecting and validating data from online sources, according to Maria C. Villar, managing partner of Business Data Leadership.
“Instead of relying on bots to troll the Internet, they use people who can make valuable decisions about the data they find,” Villar writes in an Information Management column. “Because workers are sourced around the world, crowdsourcing companies can perform these tasks in a cost-effective and timely manner.”
Researchers are still exploring how effective this approach is compared with traditional methods, and a substantial body of research on the question is available online. One emerging best practice is a system of checks and validations, with higher-paid experts working at one level and volunteers or lower-compensated workers performing validations.
For instance, KIT Ph.D. student Maribel Acosta describes a two-tiered payment system in which Linked Data experts were paid a higher rate for domain-specific tasks than the Mechanical Turk workers who simply compared the data and validated the experts’ findings. You can read more in her presentation, “Crowdsourcing Linked Data Quality Assessment,” which is available on Slideshare.
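Acosta’s actual pipeline isn’t detailed here, but the core idea of a two-tiered system — an expert’s answer is accepted only when enough cheaper validators agree with it — can be sketched in a few lines of Python. The function name, data shapes and agreement threshold below are illustrative assumptions, not her implementation:

```python
from collections import Counter

def validate_labels(expert_labels, validator_votes, min_agreement=0.6):
    """Accept an expert's label only if enough validators agree with it.

    expert_labels: {item_id: label} produced by higher-paid domain experts.
    validator_votes: {item_id: [label, ...]} from lower-paid validation workers.
    Returns {item_id: label} for items that pass validation.
    """
    accepted = {}
    for item_id, label in expert_labels.items():
        votes = validator_votes.get(item_id, [])
        if not votes:
            continue  # not yet validated; hold the item back
        agreement = Counter(votes)[label] / len(votes)
        if agreement >= min_agreement:
            accepted[item_id] = label
    return accepted
```

So if an expert tags a tweet “positive” and three of four validators concur, the label is kept; if the validators split against the expert, the item is dropped for re-review rather than trusted blindly.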
Panos Ipeirotis describes a similar, though more complicated, approach in “Crowdsourcing: Achieving Data Quality with Imperfect Humans,” which is available on YouTube. Ipeirotis holds a Ph.D. in computer science from Columbia and is now an associate professor and George A. Kellner Faculty Fellow in the Department of Information, Operations, and Management Sciences at New York University’s Leonard N. Stern School of Business.
What good is all of this, though, if you’re just not ready to trust crowdsourcing?
Actually, Villar suggests that organizations apply crowdsourcing data quality practices internally. As it stands now, employees face substantial barriers to correcting data errors, she notes. When they find an error, for instance, they have to track down the DBA or the data owner to get it fixed.
“Most employees simply give up and pursue other means to get the information they need, keeping the problem to themselves,” she writes.
By using crowdsourcing techniques, you can change that.
“Using a ‘like’ approach (think Facebook) will allow all employees to rate the usefulness of the data in the core applications,” Villar states. “While software application vendors would need to provide some capability to do this inside packaged applications, as data management professionals we should be requesting this capability. … Put your own crowd to work.”
Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson at Google+ and on Twitter.