Facebook announced a major new addition to the Hadoop toolset: It’s called Corona, and it makes Hadoop more efficient at processing data.
For most people, that’s not much of an issue — yet. But for Facebook, with its ridiculous amount of data, it’s become an issue. In a blog post on Corona, Facebook engineering said its largest cluster has more than 100 PB of data and the company runs more than 60,000 Hive queries a day.
ReadWrite Enterprise gave an excellent explanation of how Hadoop currently works and how Corona improves it. As the piece explains, it’s important to understand first that Hadoop isn’t really a database “in the usual sense” but a two-part data processing system. MapReduce, which is a key part of what we call Hadoop, processes data in batches, and it relies on a scheduler to hand out the cluster’s task slots (by default, first come, first served).
On top of that, a single job tracker has to both schedule every job and manage the cluster’s resources, and when you get into Facebook-sized data, the result is the equivalent of a traffic jam.
Corona changes the job scheduling function, basically opening up some lanes to let more traffic flow through. It separates cluster resource management from job coordination: a dedicated cluster manager keeps track of the nodes and free resources, while each job gets its own lightweight job tracker. That shift is, again, well explained by ReadWrite Enterprise. For the nitty-gritty technical issues, check out the Facebook post by the company’s own engineers.
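To make the “traffic jam” concrete, here’s a minimal, purely illustrative Python sketch of the difference between one job tracker handling all scheduling and cluster bookkeeping, and a Corona-style split into a cluster manager plus a tracker per job. The class names are mine, not Corona’s (Facebook’s implementation is in Java), so treat this as a sketch of the idea rather than the real thing.

```python
# Toy sketch only: class names are illustrative, not Corona's actual (Java) APIs.
from dataclasses import dataclass, field


@dataclass
class ClassicJobTracker:
    """Pre-Corona model: one process both tracks cluster resources and
    coordinates every running job, so everything funnels through it."""
    free_slots: int
    jobs: list = field(default_factory=list)

    def submit(self, job_name: str, slots_needed: int) -> None:
        # Scheduling *and* per-job bookkeeping happen in this one object.
        if slots_needed > self.free_slots:
            print(f"{job_name}: queued (cluster busy)")
            return
        self.free_slots -= slots_needed
        self.jobs.append(job_name)
        print(f"{job_name}: running with {slots_needed} slots")


@dataclass
class ClusterManager:
    """Corona-style split, part 1: only tracks nodes and free resources."""
    free_slots: int

    def grant(self, slots_needed: int) -> int:
        if slots_needed > self.free_slots:
            return 0
        self.free_slots -= slots_needed
        return slots_needed


class PerJobTracker:
    """Corona-style split, part 2: one lightweight tracker per job asks the
    cluster manager for resources, then manages only its own tasks."""

    def __init__(self, job_name: str, manager: ClusterManager) -> None:
        self.job_name = job_name
        self.manager = manager

    def run(self, slots_needed: int) -> None:
        granted = self.manager.grant(slots_needed)
        if granted:
            print(f"{self.job_name}: running with {granted} slots")
        else:
            print(f"{self.job_name}: waiting for resources")


if __name__ == "__main__":
    # Old world: every job goes through one JobTracker.
    classic = ClassicJobTracker(free_slots=10)
    classic.submit("hive-query-1", 6)
    classic.submit("hive-query-2", 6)

    # Corona-style world: a shared cluster manager hands out resources,
    # and each job coordinates itself.
    manager = ClusterManager(free_slots=10)
    PerJobTracker("hive-query-1", manager).run(6)
    PerJobTracker("hive-query-2", manager).run(6)
```

The point of the split is that the shared component does nothing but hand out resources, so the per-job coordination work no longer piles up behind a single process.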
The really good news is that Corona is open source, which means the whole Hadoop community can benefit from this advancement.
Even though Hadoop isn’t the only approach to Big Data, it’s certainly a popular one, especially with social media and Web-based companies. People like it because it’s open source, which makes it more affordable, and because it’s highly scalable and fault-tolerant.
That said, Hadoop is still evolving, and Facebook obviously isn’t the only one trying to polish it for more efficient and wider use.
Another effort that could have far-reaching impact for Hadoop’s use in the enterprise is IBM’s recent move to promote Hadoop standards.
Writing for IBM Data Management, Big Blue’s Big Data Evangelist James Kobielus argues in a two-part piece that it’s time for the Hadoop market to embrace “unifying standards and visions.”
“The Hadoop market won’t fully mature and may face increasing obstacles to growth and adoption if the industry does not begin soon to converge on a truly standardized core stack,” Kobielus, a former Forrester analyst, writes. “Right now, silos reign in the Hadoop world — a situation that is aggravated by the lack of open interoperability standards.”
And, of course, silos and a lack of interoperability standards generally are a quick way to create integration problems.
In particular, Kobielus sees a need for:
- A coherent vision for Hadoop’s ongoing development.
- A reference architecture for developing Hadoop’s technologies.
- A decision on where Hadoop ends and other NoSQL technologies begin, so that the community doesn’t get bogged down in side projects.
This week, he continued the discussion with an outline of what he considers the minimum requirements for a Hadoop industry reference framework.
“Who will take the first necessary step to move the Hadoop community toward more formal standardization?” he asks at the end of his post. “That’s a big open issue.”
It’s very possible, of course, that IBM will take the lead — as it often has with standards groups.
While standards groups seem like a no-brainer to the rest of us, creating them is often a bit controversial within IT circles because it usually involves some level of vendor politics and bickering.
But it’s hard to disagree with Kobielus about the need for standards as another important step in the polishing of Hadoop.