The Big Data Software Problem Behind CERN's Higgs Boson Hunt

Physicists at the European Organization for Nuclear Research, better known as CERN, announced this week that they’ve discovered evidence of a Higgs boson, possibly even the Higgs boson that’s responsible for giving mass to the universe.

But behind this discovery, there’s a very real Big Data challenge. CERN scientists had to sift through more than 800 trillion collisions looking for the Higgs.

Thanks to vendor Electric Cloud, I was lucky enough to talk with CERN physicist Axel Naumann recently about how CERN deals with its Big Data problem.

Naumann explained that the monitoring “monsters” in the collider emit electronic signals, tons of them, which are written out into files. The scientists then need to make sense of those signals, so they sift through the data, trying to find particle traces.

“Because most of that is actually incredibly boring, we need to sift through them and find the interesting ones, and even those, we don’t see immediately because we can’t really tell what we see,” he explained. “We can only give it a probability, and so by doing this billions and billions and billions of times, we are pretty certain that we see something or don’t see something. … We do a statistical analysis on a huge amount of data and at the end we can give the results.”
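To see why repetition matters, consider a toy counting experiment. The sketch below is not CERN's analysis; the per-collision rates and function names are made up for illustration, and it relies on the common rule of thumb that significance is roughly the signal count divided by the square root of the background count.

```python
import math


def approx_significance(signal_events: float, background_events: float) -> float:
    """Rule-of-thumb significance: signal excess over the typical
    statistical fluctuation of the background (sqrt of its count)."""
    return signal_events / math.sqrt(background_events)


def significance_after(signal_rate: float, background_rate: float,
                       n_collisions: int) -> float:
    """Expected significance after recording n_collisions events."""
    expected_signal = signal_rate * n_collisions
    expected_background = background_rate * n_collisions
    return approx_significance(expected_signal, expected_background)


# Hypothetical per-collision probabilities, chosen only for illustration.
SIGNAL_RATE = 1e-6       # chance a collision contains the rare process
BACKGROUND_RATE = 1e-4   # chance a boring collision merely looks like it

if __name__ == "__main__":
    for n in (10**6, 10**8, 10**10):
        sigma = significance_after(SIGNAL_RATE, BACKGROUND_RATE, n)
        print(f"{n:>14,d} collisions: about {sigma:.2f} standard deviations")
```

Under these assumptions the significance grows with the square root of the number of collisions, so a hundred times more data yields a result ten times more convincing. That is why a signal that is invisible in any single event can become near certain over billions of them.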

So far, CERN has amassed about 200 petabytes of data, stored in a format they defined, he added.

To really understand the challenge, though, there are a few things you should know about CERN. First, its scientists are not sitting in France, watching through microscopes. No, CERN scientists are located across the world. So whatever system or application or analytics tool CERN decides to use has to run on a wide, wide variety of platforms and versions.

Second, you don’t roll out an application and expect these people to use it, questions unasked. CERN physicists are, obviously, among the brightest people in the world; they write their own code for analyzing data, thank you very much.

Third, CERN uses ROOT to analyze all this data. ROOT is an open source data analysis framework that is also used outside physics, including by financial institutions. Naumann is actually very involved in helping to evolve it.

And finally, CERN had one IT guy supporting ROOT. It goes without saying that this unusual situation can create major technology challenges.

The first challenge? CERN scientists kept “breaking” ROOT. As it turned out, most of the problem wasn’t actually the code, but some unexpected problem in the OS — a patch wasn’t applied or it had been applied and now something functioned differently than it had previously.

Any developer can relate — so often it’s not about the code, but about a patch that was or was not installed or some other weird configuration problem.
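One generic way to catch this kind of drift is to record a fingerprint of the environment alongside each known-good run and compare it when something breaks. The sketch below is only an illustration of that idea, not how CERN or ROOT handle it; the function names and the choice of fields are assumptions, and it uses nothing beyond Python's standard `platform` module.

```python
import platform


def environment_fingerprint() -> dict:
    """Collect a few OS-level facts that often change when a patch lands."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }


def diff_fingerprints(known_good: dict, current: dict) -> dict:
    """Return {field: (known_good_value, current_value)} for every mismatch."""
    fields = sorted(known_good.keys() | current.keys())
    return {
        field: (known_good.get(field), current.get(field))
        for field in fields
        if known_good.get(field) != current.get(field)
    }


if __name__ == "__main__":
    current = environment_fingerprint()
    # Hypothetical record saved from the last run that worked.
    known_good = dict(current, os_release="(value recorded earlier)")
    for field, (before, after) in diff_fingerprints(known_good, current).items():
        print(f"{field}: was {before!r}, now {after!r}")
```

A report like this does not fix anything by itself, but it turns "it worked last week" into a concrete list of what changed underneath the code.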

The second challenge? The one IT guy retired. This meant the physicists had exactly zero IT guys supporting ROOT.

So, CERN had a $10 billion Large Hadron Collider literally pumping out the answers to fundamental questions about the universe, while some of the brightest people in the world were spending precious time trying to get the software to run.

You can read how Naumann and CERN solved the problem in the full Q&A, "Monitoring Monsters: How CERN Stopped Breaking Its Big Data Analytics Tool." The problem and solution fall more into the domain of deployment issues than what I more typically cover with Big Data, integration or data management. But this is CERN, one of the leading technology projects in the world; if they’re willing to share, I’m willing to listen.

Plus, I suspect in the end, the experience is not so unusual. Even when you’re talking about cutting-edge IT, whether it’s Big Data or advanced analytics, the work can still be undermined by more basic, but poorly managed, IT processes and practices.