    Four Reasons Not to Trust Crowdsourced Data Yet

    There are signs that crowdsourcing is becoming a legitimate data strategy. What remains unclear, though, is whether it’s a reliable one.

    One crowdsourcing project, PredictIt.org, illustrates the issues at play. PredictIt is an academic project under the auspices of the Victoria University of Wellington, New Zealand, with U.S. university affiliates. Essentially, it allows people to wager on political races (yes, it’s legal) and tap the “wisdom of the crowd.”

    In the four days leading up to the elections, it ran a market on this year’s U.S. Congressional midterms. The crowd successfully predicted the overall outcome in Congress, foreseeing a Republican takeover of the Senate and gains in the House. As of Monday, the site was predicting that Republicans would hold 53 or more seats in the Senate. The final outcome (thus far) in the Senate was 52 Republican seats to 43 Democratic seats.

    “There are 25 or more years of data that show prediction markets do a better job predicting outcomes than polls,” Dr. Emile Servan-Schreiber, founder and CEO of Lumenogic and an expert in prediction markets, told Politico.
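    The mechanics behind that track record are straightforward: a prediction-market contract typically pays a fixed $1 if its outcome occurs, so a contract trading at 72 cents can be read as the crowd putting roughly a 72 percent probability on that outcome. The short Python sketch below illustrates the arithmetic with made-up prices; it is a toy example, not PredictIt’s actual data or API.

        # Minimal sketch: reading prediction-market prices as implied probabilities.
        # Prices here are hypothetical, not actual PredictIt quotes.

        # Each contract pays $1.00 if its outcome occurs, so a $0.72 price implies
        # the crowd assigns roughly a 72% chance to that outcome.
        contracts = {
            "GOP holds 53+ Senate seats": 0.72,
            "GOP holds 50-52 Senate seats": 0.21,
            "GOP holds fewer than 50 seats": 0.09,
        }

        # Prices across mutually exclusive outcomes rarely sum to exactly $1.00
        # (fees, spreads, thin trading), so normalize before reading them as odds.
        total = sum(contracts.values())
        for outcome, price in contracts.items():
            implied = price / total
            print(f"{outcome}: price ${price:.2f} -> implied probability {implied:.0%}")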

    When it comes to crowdsourcing, Google seems to set the tone, as M.C. Srivas, chief technology officer of Hadoop distributor MapR Technologies, explained in a recent ITWorld article on Big Data.

    “One of the basic rules to pick up from Google is that ‘more data beats complex algorithms,’” Srivas said. “This is something that Google has demonstrated again and again: The company that can process the most data will have an advantage over everybody else in the future.”

    IBM seems to agree. Recently, it penned a data analytics partnership with Twitter that will integrate Twitter data with its cloud analytics tool. Big Blue will also offer custom Big Data analytics services that leverage Twitter’s data for marketing and customer insights.

    “The more data you can bring to that problem, especially when it’s an incredibly unique dataset like Twitter, the analytics become better and the decisions become clearer,” IBM’s Alistair Rennie, general manager of the company’s Business Analytics group, told Fortune this week.

    Obviously, these are just seedling attempts at crowdsourcing. Given time, crowdsourcing may take root, but there are technology and reliability problems to sort out first.

    First, is there a confirmation bias when we look at what crowdsourcing predicts well? The press person for PredictIt told me that the market favorites went on to win the Senate races in Iowa, Colorado and New Hampshire, as well as the mayoral race in D.C. That leaves a lot of races unaccounted for, including North Carolina, where the market (as of early Tuesday) had predicted a win for incumbent Kay Hagan. She lost by a slim margin to Republican Thom Tillis. Now, PredictIt is still analyzing its final results, but shouldn’t we pay closer attention to the misses than the hits when determining whether crowdsourced data is a meaningful strategy?


    There are also the dual problems of broad audience bias and technology’s limited ability to analyze sentiment. These discussions tend to assume that, given enough users, biases will be evenly distributed, but that’s not necessarily true. Even if it were, that doesn’t mean the technology can correctly interpret what people say in unstructured text. As Open Data Now Insights’ “Live-Blogging report from the Sentiment Analysis Symposium” notes:

    Augie Ray of Amex is interested in using sentiment analysis to supplement survey data – but not by turning to Twitter or other kinds of social media that lack depth. … Aggregate sentiment analysis may not be accurate, and Twitter users may not mirror your customer base: “If you looked at Twitter during the last election, Rand Paul should have won hands down.”

    The full report is well worth reading, since it provides relevant links and explores other crowdsourcing issues, such as developing useful algorithms and the use of video versus text.
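    The Rand Paul example in that quote bundles two distinct failure modes: crude sentiment scoring and a sample of users that doesn’t reflect the population you actually care about. The deliberately naive Python sketch below (the word lists, scorer and “tweets” are all invented for illustration, not drawn from any real sentiment tool or Twitter data) shows how a vocal minority can dominate an aggregate score.

        # Minimal sketch: lexicon-based sentiment scoring over a skewed sample.
        # The word lists and "tweets" are invented; real sentiment tools are far
        # more sophisticated, but the sampling problem remains.

        POSITIVE = {"great", "love", "win", "best"}
        NEGATIVE = {"bad", "hate", "lose", "worst"}

        def score(text: str) -> int:
            """Count +1 for each positive word and -1 for each negative word."""
            words = [w.strip(",.!?") for w in text.lower().split()]
            return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

        # Hypothetical tweets: an enthusiastic minority that tweets often can
        # drown out a quieter majority.
        tweets = [
            "I love candidate A, best choice by far",
            "Candidate A is great, a clear win",
            "Candidate A all the way, love it",
            "Not a fan of candidate A, bad ideas",
        ]

        total = sum(score(t) for t in tweets)
        print(f"Aggregate sentiment: {total:+d} across {len(tweets)} tweets")
        # A positive total here says more about who is tweeting than about how
        # the broader electorate actually feels.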

    Finally, the fourth problem with crowdsourcing is that it can be compromised. Yelp has become the poster child for problems with social media ratings, and there are signs that trust issues, along with the availability of other options, are pushing some users away. Informatica’s blog post, “Crowdsourced data: Can you trust it?” explores the problems companies have encountered dealing with fake or purchased input.

    Can vendors and technologists overcome these issues? Certainly, for some uses, I think so. For instance, we’ve already seen that crowdsourced data can be used to identify ongoing product or service problems. And there are some creative uses, including crowdsourced performance reviews and crowdsourced data quality, that could bear fruit.

    Beyond that — who knows? Perhaps we should crowdsource that question.

    Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson on Google+ and Twitter.
