Computing and Data Management for CMS in the LHC Era

Koen Holtman1, Frank van Lingen2,3,5, Heinz Stockinger4,5 and Ian Willers5
1California Institute of Technology, Pasadena, USA;
2University of the West of England, Bristol, UK;
3Eindhoven University of Technology, Eindhoven, The Netherlands;
4Institute for Computer Science and Business Informatics, University of Vienna, Austria;
5European Laboratory for Particle Physics, CERN, Geneva, Switzerland

Abstract

In 2006, when the CMS experiment will commence taking data, CMS offline computing will consist of a hierarchy of large computing facilities spread around the world: a large, central tier-0 computing site at CERN, near the experiment; five tier-1 regional centres at major research institutions; and about 25 tier-2 centres at universities around the world.

Around the year 2003, major elements of this structure will already be in place, taking part in the enormous simulation effort that precedes the running experiment. Already, CMS has computing and data management needs equivalent to those of one of today's experiments. This paper describes the CMS requirements and activities for Grid computing, the work that has already taken place and is in use in the present simulation efforts, and the techniques that deal with the problems of heterogeneous data sources.

The CMS Grid will be based on a layered architecture, and CMS will be responsible for assembling the components that form the necessary layers. Many of the components will be generic, off-the-shelf and commercially available software. Some will come from the current projects in which CMS participates: the EU DataGrid, GriPhyN and PPDG. CMS will also have to incorporate HEP-specific non-Grid components such as GEANT 4 and ANAPHE. Finally, our software framework, COBRA, will be adapted so that it is Grid-aware.
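To illustrate what making the framework Grid-aware could mean in practice (this is only a sketch; the class and method names below are hypothetical and do not represent the actual COBRA or Grid middleware interfaces), a thin data-access layer could hide from the application whether a file is already local or must first be located and staged through Grid services:

// Illustrative sketch only: the names are hypothetical, not the actual
// COBRA or Grid middleware APIs.
#include <iostream>
#include <string>

// Abstract layer through which the framework opens data, so that the
// layers above need not know where a file physically resides.
class DataAccess {
public:
    virtual ~DataAccess() {}
    // Returns a local path for the given logical file name.
    virtual std::string open(const std::string& logicalName) = 0;
};

// Plain local access: the logical name is already a local path.
class LocalAccess : public DataAccess {
public:
    std::string open(const std::string& logicalName) {
        return logicalName;
    }
};

// Grid-aware access: a replica would first be located and staged locally.
class GridAccess : public DataAccess {
public:
    std::string open(const std::string& logicalName) {
        // A real implementation would query a replica catalogue and
        // transfer the chosen copy; here we only indicate the idea.
        std::cout << "locating and staging a replica of " << logicalName << "\n";
        return "/local/cache/" + logicalName;
    }
};

int main() {
    GridAccess grid;
    std::cout << grid.open("simulated_events.db") << "\n";
    return 0;
}

The point of such a layer is that the rest of the framework remains unchanged whether data are read from a local disk or fetched from a remote regional centre.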

The present CMS Grid efforts have focused on data replication for simulation. The simulated data are generated and stored in Objectivity databases at various centres at CERN and in the USA, Italy, Russia, the United Kingdom and France. Since the year 2000 these data have been distributed using the Grid Data Management Pilot (GDMP) software. This is now being extended with a distributed replica catalogue that covers many types of files.
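As a concrete, if simplified, picture of what a replica catalogue provides (the logical names and site URLs below are invented for illustration and are not the GDMP interface), one can think of it as a mapping from a logical file name to the physical copies held at the different centres:

// Minimal illustration of a replica catalogue: a logical file name is
// mapped to the physical locations of its copies. The names and URLs are
// invented for this example.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::map<std::string, std::vector<std::string> > catalogue;

    // Register two replicas of the same logical file at different centres.
    catalogue["lfn:cms/sim/sample_001"].push_back(
        "gsiftp://store.cern.ch/cms/sim/sample_001");
    catalogue["lfn:cms/sim/sample_001"].push_back(
        "gsiftp://tier1.example.org/cms/sim/sample_001");

    // A job asks for the logical file and chooses among its physical copies.
    const std::vector<std::string>& replicas = catalogue["lfn:cms/sim/sample_001"];
    for (size_t i = 0; i < replicas.size(); ++i) {
        std::cout << replicas[i] << "\n";
    }
    return 0;
}

In the distributed case the catalogue itself is spread over several sites, but the lookup it offers to users and jobs is of this form.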

Simulation depends on the availability of a good description of the detector. For example, GEANT 4 requires a highly granular description of all sensitive and non-sensitive parts of the detector, while the event reconstruction program needs the positions and shapes of the sensitive tracker elements together with their material properties. The information from the construction databases is held in the sub-detector databases in various formats: flat files, Excel spreadsheets, Oracle, Objectivity, etc. An eXtensible Markup Language (XML) based framework will be used for data integration and model management.
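As a hedged illustration of such XML-based integration (the tags, attributes and structure names below are invented and do not represent the actual CMS detector description schema), a detector element exported from any of the heterogeneous sources could be expressed in one uniform XML form and then mapped onto a simple in-memory structure:

// Illustration only: the XML tags and the struct are invented for this
// example, not the actual CMS detector description schema.
#include <iostream>
#include <string>

// One uniform XML view of a tracker element, regardless of whether the
// original data came from a flat file, Excel, Oracle or Objectivity.
const char* kExampleXml =
    "<DetectorElement name=\"TrackerModule_12\" material=\"Silicon\">\n"
    "  <Shape type=\"Box\" dx=\"6.0\" dy=\"12.0\" dz=\"0.03\" unit=\"cm\"/>\n"
    "  <Position x=\"25.4\" y=\"0.0\" z=\"110.2\" unit=\"cm\"/>\n"
    "</DetectorElement>\n";

// In-memory structure that simulation or reconstruction code could use
// once the integration framework has parsed the XML.
struct DetectorElement {
    std::string name;
    std::string material;
    double dx, dy, dz;   // half-lengths of the box shape, in cm
    double x, y, z;      // position of the element centre, in cm
};

int main() {
    // A real framework would fill the structure by parsing the XML with a
    // standard parser; here the two representations are simply shown together.
    DetectorElement module = {"TrackerModule_12", "Silicon",
                              6.0, 12.0, 0.03, 25.4, 0.0, 110.2};
    std::cout << kExampleXml;
    std::cout << "element " << module.name << " made of " << module.material << "\n";
    return 0;
}

The benefit of this approach is that GEANT 4, the reconstruction program and other clients can all be served from the same XML model, while the mapping from each original source format needs to be written only once.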