How do you deal with the biggest data sets of all? Bob Jones, a project leader for the European Organization for Nuclear Research – commonly known as CERN – described how the world’s largest particle physics laboratory manages 100 petabytes of data.
The first step is not to collect everything. “We can’t keep all the data; the key is knowing what to keep,” says Jones. This is understandable given that the cameras capturing the collisions have 150 million sensors delivering data 40 million times per second.
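To get a feel for those numbers, here is a rough back-of-the-envelope sketch. It assumes every one of the 150 million sensors reports once on each of the 40 million readouts per second, which is a simplification for illustration rather than a description of how CERN's detectors actually behave; the variable names are made up for this example.

```python
# Figures quoted in the article.
SENSORS = 150_000_000              # sensors in the collision cameras
READOUTS_PER_SECOND = 40_000_000   # times per second the sensors deliver data

# Assumption for illustration: every sensor reports on every readout.
readings_per_second = SENSORS * READOUTS_PER_SECOND

print(f"{readings_per_second:.1e} sensor readings per second")
```

Under that assumption the detectors would produce about six quadrillion readings every second, which gives some sense of why keeping everything is not an option.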
Jones was speaking at the ADMA Global Conference’s Advancing Analytics stream, where he described how CERN manages and analyses the vast amounts of data its experiments generate.
Adding to the task facing Jones and CERN’s boffins, the data has to be preserved and verifiable so scientists can review the results of experiments.
Discovering the Higgs boson, for instance, required finding 400 positive results out of 600,000,000,000,000,000 events. Sifting through that many events takes massive processing and storage power.
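The scale of that needle-in-a-haystack search is easier to grasp as a ratio. The short sketch below uses only the two figures quoted above; the variable names are illustrative.

```python
from fractions import Fraction

# Figures quoted in the article.
POSITIVE_RESULTS = 400
TOTAL_EVENTS = 600_000_000_000_000_000

# Exact share of events that were positive results.
signal_fraction = Fraction(POSITIVE_RESULTS, TOTAL_EVENTS)

# How many events had to be examined, on average, per positive result.
events_per_positive = TOTAL_EVENTS // POSITIVE_RESULTS

print(f"one positive result per {events_per_positive:.1e} events")
```

On those figures, roughly one event in every 1.5 quadrillion was a positive result.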
Part of the solution is to have a chain of data centres across the world to carry out both the analytics and data storage, supplemented by tape archiving, something that creates other issues.
“Tape is a magnetic medium, which means it deteriorates over time,” Jones says. “We have to repack this data every two years.”
Another advantage of a two-year refresh is that it allows CERN to apply the latest advances in data storage to pack more data onto the medium.
CERN itself is funded by its 21 member states – Pakistan is its latest member – which contribute its $1.5 billion annual budget. The organisation provides data and processing power to other multinational projects like the European Space Agency and to private sector partners.
For the private sector, CERN’s computing power gives the opportunity to do in-depth analytics of large data sets, while the unique hardware and software requirements mean the project is a proving ground for high-performance equipment.
Despite the high tech, Jones says the real smarts behind CERN and the Large Hadron Collider lie in the people. “All of the people analysing the data are trained physicists with detailed, multi-year domain knowledge.”
“The reason being is the experiment and the technology changes so quickly, it’s not written down. It’s in the heads of those people.”
In some respects this is comforting for those of us worrying about the machines taking over.