It’s always interesting to me to see technology issues addressed in non-IT publications. So, naturally, I was intrigued when I saw Foreign Affairs publish “The Rise of Big Data: How It’s Changing the Way We Think About the World.”
It’s written by Kenneth Cukier, data editor of The Economist, and Viktor Mayer-Schoenberger, a professor of Internet Governance and Regulation at the Oxford Internet Institute.
What’s clever about the piece is that it puts Big Data into a larger context, going all the way back to the third century BC, when the Library of Alexandria was believed to hold the sum of all human knowledge. Today, the authors note, the world’s stored information is estimated at 1,200 exabytes. By drawing that comparison, Cukier and Mayer-Schoenberger point out that Big Data changes the playing field, not just for technology, but for all data and how we understand it.
They identify three major ways Big Data shifts how we think about data:
From some to all. Before Big Data, the only way to try to understand human behavior on any scale was to sample. Now, with Big Data, the sample size becomes incredibly large, to the point of nearing “all.”
“Big data is a matter not just of creating somewhat larger samples but of harnessing as much of the existing data as possible about what is being studied,” the authors write. “We still need statistics; we just no longer need to rely on small samples.”
From clean to messy. To me, this is the most important takeaway for IT people, and it’s actually best explained through example.
Data management experts — with good cause — spend a lot of time talking about the importance of data quality and data governance. There’s been more talk about how to bring these disciplines to Big Data, particularly so you can avoid making wrong decisions.
The problem, of course, is that you’re no longer talking about data that’s completely under your control or that fits neatly into a governance program.
When you’re dealing with large data sets, though, messy data isn’t always a problem. You know this, because statisticians tell you this, but it’s worth remembering when you talk about things like data quality with Big Data.
The article makes this clear by looking at the history of computer-aided language translation.
In the 1990s, IBM tried to create a translation machine using probability, math and perfect translations of parliamentary transcripts. It wasn’t so great.
Fast forward to today: Who do you look to for on-the-spot translations? Google. And the company accomplished it by using “messy” data from the Internet — billions of translations from all over the web. Its results are better than IBM’s and cover 65 languages, the article points out.
“Large amounts of messy data trumped small amounts of cleaner data,” the authors write.
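For anyone who wants to see the statistical point in action, here is a minimal sketch in Python. It is a toy simulation, not anything from the article: the quantity being estimated, the sample sizes and the noise levels are all made-up assumptions, and the "messy" data is assumed to be wrong only in a random, unbiased way. It compares how far off an average is when it comes from a small set of perfect measurements versus a large set of noisy ones.

```python
import random
import statistics

# All values below are hypothetical, chosen only for illustration.
TRUE_MEAN = 50.0      # the quantity we are trying to estimate
POPULATION_SD = 10.0  # natural spread of the real values
NOISE_SD = 20.0       # assumed zero-mean measurement error in the messy data


def clean_sample_mean(rng, n):
    """Mean of n values measured without error."""
    return statistics.fmean(rng.gauss(TRUE_MEAN, POPULATION_SD) for _ in range(n))


def messy_sample_mean(rng, n):
    """Mean of n values, each corrupted by random zero-mean noise."""
    return statistics.fmean(
        rng.gauss(TRUE_MEAN, POPULATION_SD) + rng.gauss(0.0, NOISE_SD)
        for _ in range(n)
    )


def average_error(estimator, n, trials=50, seed=0):
    """Average absolute distance from TRUE_MEAN over repeated trials."""
    rng = random.Random(seed)
    errors = [abs(estimator(rng, n) - TRUE_MEAN) for _ in range(trials)]
    return statistics.fmean(errors)


small_clean_error = average_error(clean_sample_mean, n=30)
large_messy_error = average_error(messy_sample_mean, n=20_000)

print(f"small clean sample: average error {small_clean_error:.3f}")
print(f"large messy sample: average error {large_messy_error:.3f}")
```

Under these assumptions, the large messy sample lands much closer to the true value, because random errors tend to cancel out as the sample grows. The caveat is that this only holds for unbiased noise. If the messy data is wrong in one consistent direction, more of it will not fix the problem.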
Now, it’s one thing to know that, but it’s another thing to really wrap your head around it and keep it in mind as you move forward with Big Data.
And it’s also the reason for the third shift in how we view data: Thinking about correlation rather than causation. The piece uses UPS’s truck sensors, which trigger alerts about heat and other factors that correlate with parts breaking, as one example.
I usually don’t promote paid content, but I had a chance to read this in the doctor’s office for free and thought these three shifts were worth pointing out. You can preview the article online, but if you have a chance to get the magazine, it’s in the current issue of Foreign Affairs on stands now.