Why XML's 'Too Wordy' for Twitter's Big Data

Loraine Lawson

Is XML too "wordy?" It is if you're Twitter and planning for a world of "a trillion Tweets," according to a recent IDG News article published on The New York Times website, among other places.


OK, you're not Twitter and you're not going to be storing any tweets, much less a trillion. But big sets of data are a growing concern for organizations across industries, particularly as we get more data from online use and sensors.


So it's fascinating to see how cutting-edge companies such as Twitter and Google are tackling the problem, and the article offers a readable account of the challenge and how these companies are solving it.


It turns out that common standards for dealing with structured data, particularly XML but also the related JSON (JavaScript Object Notation), are too "wordy." Once you add all the data tags and whatnot, the 1 petabyte needed to store a trillion tweets can become 10 petabytes in XML. Keep in mind we're talking tweets here: 140 characters max, with 17 associated fields, plus the occasional subfield.
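To get a feel for where the extra bytes come from, here is a minimal Python sketch that writes the same small record three ways: as XML, as JSON and as a hand-packed binary layout. The record and its field names are made up for illustration (they are not Twitter's actual schema), and the binary layout is a simple fixed format, not Protocol Buffers. It will not reproduce the article's tenfold figure, but the ordering of the sizes shows the effect of repeating tag names around every value.

```python
import json
import struct
import xml.etree.ElementTree as ET

# Hypothetical tweet-like record; field names are illustrative only.
tweet = {
    "id": 1234567890,
    "user_id": 42,
    "created_at": 1270000000,
    "text": "Is XML too wordy?",
}


def to_xml(record):
    """Wrap every value in an opening and a closing tag."""
    root = ET.Element("tweet")
    for name, value in record.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode").encode("utf-8")


def to_json(record):
    """Field names appear once per value, with no closing tags."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")


def to_binary(record):
    """Assumed fixed layout: three 64-bit integers, a 1-byte length, then the text.
    Field names are not stored at all; both sides must agree on the layout."""
    text = record["text"].encode("utf-8")
    header = struct.pack(
        "<QQQB", record["id"], record["user_id"], record["created_at"], len(text)
    )
    return header + text


sizes = {
    "xml": len(to_xml(tweet)),
    "json": len(to_json(tweet)),
    "binary": len(to_binary(tweet)),
}

for name, size in sizes.items():
    print(f"{name:>6}: {size} bytes")
```

The trade-off is that the binary form is unreadable without knowing the layout in advance, which is exactly the gap a schema-driven format is meant to fill.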


This jump in storage can cost big money, and companies want to minimize that cost. So, Twitter is opting for a new and little-known format called Protocol Buffers, developed by (no surprise here) Google. The really cool thing about this approach is "it can automate the process of recreating the data structures within applications," according to the article. That's cool because you can define the data structure once and then easily generate the source code for using it in different programs. What's more, you can update the data structure without breaking all the programs that still use the old format, which, I think your development team will agree, is super great.
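How can a format change without breaking older programs? One common trick is to label each value with a small field number instead of a name, so a reader can skip numbers it does not recognize. The Python sketch below illustrates that idea only. It is not the real Protocol Buffers wire format or API (those rely on a schema file and generated code), and the field numbers and names are hypothetical.

```python
import struct

# Hypothetical schemas: version 2 adds a "lang" field as number 3.
FIELDS_V1 = {1: "id", 2: "text"}
FIELDS_V2 = {1: "id", 2: "text", 3: "lang"}


def encode(record, fields):
    """Write each known field as: 1-byte field number, 2-byte length, payload."""
    out = bytearray()
    for number, name in fields.items():
        if name in record:
            payload = str(record[name]).encode("utf-8")
            out += struct.pack("<BH", number, len(payload)) + payload
    return bytes(out)


def decode(data, fields):
    """Read fields in order, silently skipping field numbers this reader does not know."""
    record = {}
    pos = 0
    while pos < len(data):
        number, length = struct.unpack_from("<BH", data, pos)
        pos += 3  # size of the field-number-plus-length header
        payload = data[pos:pos + length]
        pos += length
        if number in fields:
            record[fields[number]] = payload.decode("utf-8")
    return record


# A newer program writes a message that includes the new field.
new_message = encode({"id": 7, "text": "hello", "lang": "en"}, FIELDS_V2)

# An older program still reads it and simply never sees "lang".
old_view = decode(new_message, FIELDS_V1)
new_view = decode(new_message, FIELDS_V2)

print(old_view)
print(new_view)
```

In this simplified version every value comes back as a string. The point is only that the older reader keeps working when a field is added, and that a newer reader can likewise read older messages that lack the new field.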


The article also talks about Hadoop, an open source framework for storing and processing huge amounts of data. You can learn more about Hadoop's business uses by reading Mike Vizard's post, "The Need for Speed with Big Data," and my recent interview with Doug Cutting, which ran in two parts: "Creator of Hadoop Explains Why It's More than Just Storage" and "How Companies Are Using Hadoop."
