When it comes to physics, CERN may be at the top of the food chain. But when it comes to managing Big Data, it’s trying to become more like Google, CERN infrastructure manager Tim Bell said at Structure: Europe.
“We conceded that our challenge is not special — Google is way ahead of us in scale. We need to build on what they’ve done,” Bell said, according to this recent GigaOm article.
Normally, I’d leave infrastructure to IT Business Edge blogger Arthur Cole, but as is so often the case with Big Data, this piece falls somewhere between data software and hardware. Here’s the challenge CERN faces today:
- CERN takes 40 million pictures a second of proton collisions, which with a 100-megapixel camera translates into 1 petabyte of data per second.
- CERN currently has 35 petabytes of data to record per year, but that will double when CERN upgrades the collider.
- Physicists want to keep all that data for 20 years.
- The result is a hardware problem: Archiving currently requires 45,000 tape drives.
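For readers who like to check the arithmetic, here is a rough back-of-envelope sketch of the figures above. The frame rate, pixel count, yearly volume and retention period come from the list; the bytes-per-pixel value and the decimal definition of a petabyte are my own assumptions, since the article does not say how the detector data is encoded.

```python
# Back-of-envelope check of the figures quoted in the list above.
# FRAMES_PER_SECOND, PIXELS_PER_FRAME, the 35 PB/year volume and the
# 20-year retention period come from the article; everything else is
# an assumption made for illustration.

FRAMES_PER_SECOND = 40_000_000   # "40 million pictures a second"
PIXELS_PER_FRAME = 100_000_000   # "100 megapixel camera"
PETABYTE = 10**15                # decimal petabyte (assumption)


def raw_rate_pb_per_s(bytes_per_pixel: float) -> float:
    """Raw data rate in petabytes per second for a given pixel size."""
    return FRAMES_PER_SECOND * PIXELS_PER_FRAME * bytes_per_pixel / PETABYTE


def archive_total_pb(pb_per_year: float, years: int) -> float:
    """Total archive size if every year's recorded data is kept."""
    return pb_per_year * years


if __name__ == "__main__":
    # Hypothetical pixel sizes: a full byte per pixel, and a quarter byte.
    print(raw_rate_pb_per_s(1))      # petabytes per second at 1 byte/pixel
    print(raw_rate_pb_per_s(0.25))   # petabytes per second at 2 bits/pixel
    # Retention at today's volume and after the planned doubling.
    print(archive_total_pb(35, 20))
    print(archive_total_pb(70, 20))
```

Under these assumptions, one byte per pixel works out to about 4 petabytes per second, so the quoted 1 petabyte per second is best read as an order-of-magnitude figure. Keeping 35 petabytes a year for 20 years comes to roughly 700 petabytes, or 1,400 petabytes if the doubled volume applied for the whole period.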
The solution, however, is a blend of infrastructure and software changes.
Last year, I had the privilege of interviewing CERN physicist Axel Naumann about how the organization handles Big Data. CERN scientists use a home-grown, object-oriented program and library called ROOT, which was developed by physicists to perform the actual analysis of the data. Managing that was a challenge, since developers used 15 different platforms.
CERN will still use ROOT, as far as I can tell, but it will no longer rely on in-house software for managing the supporting Big Data clusters. Instead, it’s replacing that custom software with open source tools such as Puppet, which will handle configuration management.
It will also use OpenStack, the open source infrastructure-as-a-service cloud platform, for a virtualized infrastructure. That should work well, since one of the major IT challenges at CERN is its distributed and changing workforce.
What made me happy about this shift is the reason behind it:
“Users also want to provision an analysis cluster with 50 machines themselves for an afternoon that then goes away again. It is about providing those kinds of services,” Ian Bird, the Large Hadron Collider computing grid project leader, told PC World.
If you’d like to see CERN’s new architecture data flow, check out page 38 of this slide show presented by Bell at the 2012 Puppet Conference.