Data Analysis in Experimental Particle Physics

C. Javier Solano S., Grupo de Física Fundamental, Facultad de Ciencias, Universidad Nacional de Ingeniería

Data Analysis in Particle Physics: Outline of Lecture
- Characteristics of data from particle experiments
- From DAQ data to Event Records: Event Building
- From hits to tracks and clusters
- From tracks and clusters to particles: Correlating sub-detector information
- Uncertainties and resolution
- Data reconstruction and production: Data Summary Tapes
- Personal data analysis: n-tuples

Data Analysis in Particle Physics: Outline of Lecture (cont.)
- Monte Carlo simulation
- Statistics and error analysis
- Hypothesis testing
- Simulation of particle production and interactions with the detector
- Digital representations of event data
- Monitoring and Calibration
- Why physicists don't (yet) use Excel and Oracle for their daily analysis
- The challenge of analysis for the LHC experiments
- The challenge of computing for the LHC
- Solving the LHC computing challenge

Characteristics of data from particle experiments

Characteristics of data from particle experiments
- Most data comes from digitized information from sensors activated by particles crossing them.
- We call the data resulting from the observation of a particle collision an event.
- Over hours, days, weeks, months, years or even decades, we observe many events.
- We group them into runs according to the time-varying experimental conditions.
- Calibration and environmental information is also stored, usually periodically.
- For practical reasons, this data is stored in data files of many events.
- Almost always, events are independent of each other.

Characteristics of data from particle experiments
[Figure: The Experimental Particle Physics Data Worm, showing data files 418 and 419, runs 137 to 140, calibration records, and event number 31896.]

From DAQ data to Event Records: Event Building

From hits to tracks and clusters

From hits to tracks and clusters Occupancy and point resolution are related to ambiguities in track finding

From hits to tracks and clusters Calibration, monitoring and software are needed to resolve these ambiguities

From hits to tracks and clusters What you see is not always what there was! Nuclear interaction

Monitoring and Calibration
- Particles deposit energy in sensors.
- Sensors give voltages, currents, charges.
- The space position of each sensor is known.
- On-detector Analog-to-Digital Converters (ADCs) change these into numbers representing these or other quantities (for example, clock ticks between voltage pulses).
- Calibration establishes the relationship between the ADC units and the physical units (eV, {x,y,z}, ns).
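As an illustration of the last point, here is a minimal sketch of a calibration that maps raw counts to physical units. It assumes a purely linear sensor response (pedestal plus gain), which real detectors only approximate; the class name, function names, and all constants are hypothetical and do not come from any experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelCalibration:
    """Linear calibration constants for one read-out channel (hypothetical)."""
    pedestal_adc: float      # ADC reading with no particle signal
    gain_ev_per_adc: float   # deposited energy per ADC count above pedestal

    def to_energy_ev(self, adc_counts: float) -> float:
        """Convert a raw ADC reading into deposited energy in eV."""
        return (adc_counts - self.pedestal_adc) * self.gain_ev_per_adc


def ticks_to_ns(clock_ticks: int, tick_period_ns: float) -> float:
    """Convert a count of clock ticks between pulses into a time in ns."""
    return clock_ticks * tick_period_ns


if __name__ == "__main__":
    calibration = ChannelCalibration(pedestal_adc=100.0, gain_ev_per_adc=2500.0)
    print(calibration.to_energy_ev(140))          # energy in eV
    print(ticks_to_ns(12, tick_period_ns=25.0))   # time in ns
```

In practice the constants vary with time and environmental conditions, which is why calibration records are stored periodically alongside the events.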

From tracks and clusters to particles: Correlating sub-detector information

Uncertainties and resolution
- Each measurement or hit has some uncertainty, due to alignment and the characteristics of the sensor.
- These uncertainties get propagated, often in a non-linear manner, to resolution functions for the physics quantities used in analysis.
- Resolution has various consequences:
  - Direct on measurements
  - Signal-Background confusion
  - Combinatorics
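A small sketch of non-linear propagation, using the standard relation p_T [GeV/c] = 0.3 B [T] R [m] for a unit-charge track, so that p_T is inversely proportional to the measured curvature kappa = 1/R. The function names and the numerical values are illustrative assumptions. The first-order (linear) estimate is compared with a toy Monte Carlo in which the curvature is smeared with a Gaussian.

```python
import random
import statistics


def pt_from_curvature(kappa_per_m: float, b_tesla: float) -> float:
    """Transverse momentum in GeV/c for a unit-charge track: 0.3 * B / kappa."""
    return 0.3 * b_tesla / kappa_per_m


def linear_sigma_pt(kappa_per_m: float, sigma_kappa: float, b_tesla: float) -> float:
    """First-order propagation: sigma_pT = |d pT / d kappa| * sigma_kappa."""
    return pt_from_curvature(kappa_per_m, b_tesla) * sigma_kappa / kappa_per_m


def toy_pt_values(kappa_per_m, sigma_kappa, b_tesla, n_toys, seed=1):
    """Smear the curvature with a Gaussian and recompute pT for each toy."""
    rng = random.Random(seed)
    return [
        pt_from_curvature(rng.gauss(kappa_per_m, sigma_kappa), b_tesla)
        for _ in range(n_toys)
    ]


if __name__ == "__main__":
    # Assumed values: B = 1.5 T, curvature 0.045 / m measured to 10%.
    kappa, sigma_kappa, b_field = 0.045, 0.0045, 1.5
    toys = toy_pt_values(kappa, sigma_kappa, b_field, n_toys=20000)
    print("nominal pT      :", pt_from_curvature(kappa, b_field))
    print("linear sigma_pT :", linear_sigma_pt(kappa, sigma_kappa, b_field))
    print("toy mean, spread:", statistics.mean(toys), statistics.stdev(toys))
```

Because 1/kappa is a convex function, the toy distribution is skewed towards high momentum and its mean sits slightly above the nominal value, something the linear estimate cannot show.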

Data reconstruction and production: Data Summary Tapes
- Reconstruction turns hits + calibration + geometry into particle hypotheses.
- Reconstruction is time consuming and must be done coherently: centrally organized production.
- Output is one or more levels of so-called Data Summary Tapes (DST), which are used as input to Personal Analysis.
- In practice, there is a lot of utility software to organize these data for easy analysis (bookkeeping).
- Programming of complicated event structures:
  - Old: FORTRAN with home-made memory managers
  - Today: Object-Oriented design using C++ or Java

Personal data analysis
- Most modern detectors can address multiple physics topics.
- Hundreds or thousands of professors and students are distributed around the world.
- Modern experimental collaborations are an early example of virtual communities.
- Historical enablers for virtual communities:
  - Fellowship and exchange programmes
  - Telegraph, telex, telephone and telefax
  - National and International Laboratories
  - Reasonably priced airline tickets
  - Computer inter-networking, e-mail and ftp
  - The World Wide Web
  - Multi-media applications on the Internet

Personal data analysis
- Today, physics analysis topics are increasingly tackled by virtual teams within these virtual communities.
- The virtual team must maintain coherency of data and algorithms.
- Production for a modern detector is very complex and consumes many resources.
- A DST contains all imagined reconstruction objects for all foreseen analyses, so DSTs are big.
- Handling a DST often requires installing special software libraries and writing code in the reconstruction dialect.

Personal data analysis
Solution: Each virtual team develops code to extract a common analysis dataset for a given topic, which is written and manipulated using a lingua franca: n-tuples and the Physics Analysis Workstation (PAW)/ROOT. This is the physicist's version of business data mining with Excel.
Iterative process (time-scale of weeks or months):
1. The team agrees on complex algorithms to be coded in the extraction program.
2. The algorithms are coded and tested, followed by extraction from the DST.
3. The n-tuple file is rapidly distributed via computer network.
4. The n-tuple is analyzed using non-compiled, platform-independent code (PAW/ROOT macros today, Java in future?) that is easily modified and shared by e-mail.
5. Eventually limitations are reached; go back to step 1.
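The n-tuple idea can be sketched in a few lines: an extraction step reduces each event record, with its variable-length lists, to one flat row of fixed columns, and a short analysis step then applies cuts to those rows. This is only an illustration in plain Python, not PAW or ROOT code; the column names, the event dictionary layout, and the sample values are all hypothetical.

```python
from collections import namedtuple

# One n-tuple row: a fixed set of columns per event (hypothetical choice).
NtupleRow = namedtuple(
    "NtupleRow",
    ["run", "event", "n_tracks", "sum_cluster_energy", "max_track_momentum"],
)


def extract_row(event):
    """Extraction program: reduce one event record to a flat row."""
    momenta = [track["momentum"] for track in event["tracks"]]
    energies = [cluster["energy"] for cluster in event["clusters"]]
    return NtupleRow(
        run=event["run"],
        event=event["event"],
        n_tracks=len(momenta),
        sum_cluster_energy=sum(energies),
        max_track_momentum=max(momenta, default=0.0),
    )


def build_ntuple(events):
    """Run the extraction over all events read from the (mock) DST."""
    return [extract_row(event) for event in events]


def select(ntuple, predicate):
    """Analysis 'macro': keep the rows passing a cut chosen a posteriori."""
    return [row for row in ntuple if predicate(row)]


# Mock event records standing in for DST content (made-up numbers).
SAMPLE_EVENTS = [
    {"run": 137, "event": 31896,
     "tracks": [{"momentum": 4.0}, {"momentum": 12.5}],
     "clusters": [{"energy": 10.0}, {"energy": 5.0}]},
    {"run": 137, "event": 31897, "tracks": [], "clusters": [{"energy": 2.0}]},
    {"run": 138, "event": 12,
     "tracks": [{"momentum": 1.5}],
     "clusters": []},
]

if __name__ == "__main__":
    ntuple = build_ntuple(SAMPLE_EVENTS)
    for row in select(ntuple, lambda r: r.n_tracks >= 1 and r.max_track_momentum > 2.0):
        print(row)
```

The fixed columns are what make the file small and easy to share, and also what eventually forces a return to step 1 when an analysis needs a quantity that was not extracted.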

Personal data analysis
PAW was the killer application for physics in the 90s:
- Interactive, just as powerful workstations became available
- Platform independent, in a very diverse workstation world
- Graphical, just as X-windows gave graphics over network
- Simple to write analysis macros, just as the complexity of FORTRAN programming required in experiments decoupled most of the collaborators from the experiment's code
In summary, PAW was like going from DOS to Macintosh.
One major limitation of PAW is the lack of variable-length structures or, more generally, data objects. ROOT overcomes these limitations while keeping a philosophy similar to that of PAW. Java Analysis Studio tries to go further with agents.

Personal data analysis Which will be the killer application for LHC analysis? Is a Mac Classic on Appletalk enough or do we need the conceptual leap equivalent of Web + Java-enabled browser? Will the personal n-tuple model work for LHC? Do we need and can we afford to support our own interactive data analysis tool? Will one of the newer tools, such as Java Analysis Studio, go exponential in the open source world? Many questions, one simple answer: It will be young people like you who will make the next step happen.

Monte Carlo simulation
Monte Carlo simulation uses random numbers (see mathematics textbooks). Try the following:
1. Find a source of random numbers in the interval [0,1] (calculator, Excel, etc.).
2. Take a function that you want to simulate (e.g. y = x^2) and normalize it to fit in the interval [0,1] for both x and y.
3. Find graph paper to histogram values of x.
4. Repeat this at least 20 times:
   - Throw two random numbers.
   - Use the first as the value for x.
   - Evaluate the function y and compare its value to the second random number.
   - If the second random number is less than the function value, add a count to the histogram in the correct bin for x.
   - If the second random number is more than the function value, forget it.
5. Compare your histogram to the shape of the function.
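For those who prefer a program to graph paper, here is a minimal sketch of this accept-reject recipe for y = x^2. A throw is kept when the second random number falls under the curve, so the histogram of kept x values follows the shape of the function. The function names, the bin count, and the seed are arbitrary choices made for this example.

```python
import random


def square(x: float) -> float:
    """The function to simulate, already confined to [0,1] for x in [0,1]."""
    return x * x


def sample_accept_reject(func, n_trials: int, n_bins: int = 10, seed: int = 12345):
    """Histogram the x values accepted by the two-random-number recipe."""
    rng = random.Random(seed)
    histogram = [0] * n_bins
    for _ in range(n_trials):
        x = rng.random()          # first random number: candidate x
        u = rng.random()          # second random number: compared with y
        if u < func(x):           # under the curve: keep this x
            bin_index = min(int(x * n_bins), n_bins - 1)
            histogram[bin_index] += 1
    return histogram


if __name__ == "__main__":
    n_trials = 100
    histogram = sample_accept_reject(square, n_trials)
    print("entries per bin:", histogram)
    print("efficiency     :", sum(histogram) / n_trials)
```

The fraction of accepted throws estimates the area under the curve, which for y = x^2 on [0,1] is 1/3, so an efficiency near 30% for a small number of trials is what one would expect.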

Monte Carlo simulation
If you don't know how to program, you can pick up an Excel file from http://cern.ch/manuel.delfino/brazil
Here is the result for 100 trials:
[Figure: Example of Monte Carlo simulation of y=x*x. Legend: Y, SAMPLE. The x axis runs from 0 to 0.9 and the y axis from 0 to 10.]
- Note there are 30 entries, so the efficiency is 30%.
- Note the statistical fluctuations.
- Homework: How is the normalization done?

Statistics and error analysis
- Analysis involves selecting, counting and normalizing.
- Things are easier when you actually have a signal:
  - Understand the underlying statistics: Poisson, Binomial, Multinomial, etc.
  - If measuring a differential distribution, understand the relation between the normalization of binned counts and total counts.
  - Understand selection biases and their impact on observed distributions.
- Things are a lot harder when you place limits.
- Two observations:
  - If you cannot make an analytical estimate of the uncertainties, I won't believe your result.
  - The expression "n-sigma effect" should be banned.
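As a small example of an analytical uncertainty estimate, the sketch below applies the textbook binomial formula to a selection efficiency and the Poisson square-root rule to a raw count. The function names are invented for this example, and the formulas are the usual large-sample approximations, which become unreliable for efficiencies close to 0 or 1 and for very small counts.

```python
import math


def binomial_efficiency(n_pass: int, n_total: int):
    """Efficiency k/n and its binomial uncertainty sqrt(eff * (1 - eff) / n)."""
    efficiency = n_pass / n_total
    sigma = math.sqrt(efficiency * (1.0 - efficiency) / n_total)
    return efficiency, sigma


def poisson_sigma(n_observed: int) -> float:
    """Approximate uncertainty on a Poisson-distributed count: sqrt(n)."""
    return math.sqrt(n_observed)


if __name__ == "__main__":
    # 30 accepted throws out of 100 trials, as in the Monte Carlo exercise.
    efficiency, sigma = binomial_efficiency(30, 100)
    print(f"efficiency = {efficiency:.3f} +/- {sigma:.3f}")
    print(f"Poisson uncertainty on 30 counts = {poisson_sigma(30):.2f}")
```

The two numbers answer different questions: the binomial case has a fixed number of trials, while the Poisson case treats the count itself as the random quantity.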

Hypothesis testing
- You must understand Bayes' theorem. And every time you think you understand it, you must make a big effort to understand it better!
- Compare differential distributions of data with predictions of theory or model:
  - Different theories
  - Different parameters for the same model
- Setting up the statistical test is often straightforward, which is why it is surprising that most people do it wrong.
- Taking account of resolution and systematic uncertainties is hard:
  - Make simulation look like data to get your answers
  - Even if the graphics look better the other way around!
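A worked example of Bayes' theorem, P(H|D) = P(D|H) P(H) / P(D), may help. The scenario and every number in it are invented for illustration: a particle-identification selector tags 95% of true kaons and wrongly tags 5% of pions, and kaons make up 10% of the tracks.

```python
def posterior(prior: float, p_data_given_h: float, p_data_given_not_h: float) -> float:
    """Bayes' theorem for two exclusive hypotheses H and not-H."""
    evidence = p_data_given_h * prior + p_data_given_not_h * (1.0 - prior)
    return p_data_given_h * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the calculation.
    kaon_fraction = 0.10        # prior P(kaon)
    tag_given_kaon = 0.95       # P(tagged | kaon)
    tag_given_pion = 0.05       # P(tagged | pion), the misidentification rate
    purity = posterior(kaon_fraction, tag_given_kaon, tag_given_pion)
    print(f"P(kaon | tagged) = {purity:.3f}")
```

With these inputs only about two thirds of the tagged tracks are kaons, even though the selector is "95% efficient": the prior matters as much as the likelihood.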

Simulation of particle production and interactions with the detector
- For particle production, combine Monte Carlo with:
  - Detailed particle properties
  - Detailed cross-sections predicted by theory or phenomenology
  - Computation of phase-space
- Output consists of event records containing simulated particles (often called 4-vectors by experimentalists).
- For simulating the detector, combine MC with:
  - Detailed description of the detector
  - Detailed cross-sections for interaction with detector materials
  - Detailed phenomenology of the mechanism producing the signal
  - Transport (ray-tracing) algorithms including B fields
  - Digitization model: mapping of {x,y,z} to read-out channel
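To make the "4-vector" event record concrete, here is a minimal sketch that stores each simulated particle as (E, px, py, pz) in natural units and computes the invariant mass of a set of particles from m^2 = E^2 - |p|^2. The type and function names and the sample values are assumptions made for this example.

```python
import math
from typing import Iterable, NamedTuple


class FourVector(NamedTuple):
    """Energy and momentum components in GeV (natural units, c = 1)."""
    e: float
    px: float
    py: float
    pz: float


def add_four_vectors(vectors: Iterable[FourVector]) -> FourVector:
    """Component-wise sum of the 4-vectors of several particles."""
    items = list(vectors)
    return FourVector(
        sum(v.e for v in items),
        sum(v.px for v in items),
        sum(v.py for v in items),
        sum(v.pz for v in items),
    )


def invariant_mass(vectors: Iterable[FourVector]) -> float:
    """Invariant mass of the system: sqrt(E^2 - |p|^2), clipped at zero."""
    total = add_four_vectors(vectors)
    mass_squared = total.e ** 2 - (total.px ** 2 + total.py ** 2 + total.pz ** 2)
    return math.sqrt(max(mass_squared, 0.0))


if __name__ == "__main__":
    # Two massless particles emitted back to back (made-up example).
    particles = [FourVector(45.6, 0.0, 0.0, 45.6), FourVector(45.6, 0.0, 0.0, -45.6)]
    print("invariant mass [GeV]:", invariant_mass(particles))
```

The clipping at zero guards against small negative values of m^2 caused by floating-point rounding for nearly massless systems.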

Simulation of particle production and interactions with the detector
Example: Small part of the design of GEANT4. Note the reference to Jackson's textbook in the documentation!

Digital representations of event data
- In principle, representing event data digitally should be very simple, except:
  - everything comes in variable numbers: hits, tracks, clusters
  - ambiguities lead to multiple relations
  - particle identification may depend on analysis hypothesis
  - etc.
- In simple terms, events don't look like bank account data; they look like collections of objects.
- You can do a reasonable representation using relational tables, but actually using the data structures from Fortran/ROOT programs is still cumbersome.
- Object-Oriented Programming is a better match, but C++ does not resolve all problems.
  - Frameworks
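A short sketch of what "collections of objects" can mean in code. The classes below are a hypothetical, heavily simplified event model, not the data model of any experiment. Lists give the variable multiplicities, and a hit that is referenced by two tracks shows how an ambiguity becomes a multiple relation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Position = Tuple[float, float, float]


@dataclass
class TrackerHit:
    position: Position
    response: float


@dataclass
class Track:
    curvature: float
    hits: List[TrackerHit] = field(default_factory=list)


@dataclass
class Cluster:
    energy: float
    position: Position


@dataclass
class ParticleHypothesis:
    mass: float
    charge: int
    tracks: List[Track] = field(default_factory=list)
    clusters: List[Cluster] = field(default_factory=list)


@dataclass
class Event:
    run: int
    number: int
    tracks: List[Track] = field(default_factory=list)
    clusters: List[Cluster] = field(default_factory=list)
    particles: List[ParticleHypothesis] = field(default_factory=list)


def tracks_using_hit(event: Event, hit: TrackerHit) -> List[Track]:
    """Return every track that references this exact hit object."""
    return [t for t in event.tracks if any(h is hit for h in t.hits)]


if __name__ == "__main__":
    shared_hit = TrackerHit(position=(1.0, 2.0, 3.0), response=0.8)
    first = Track(curvature=0.02, hits=[shared_hit])
    second = Track(curvature=-0.01, hits=[shared_hit])
    event = Event(run=137, number=31896, tracks=[first, second])
    print(len(tracks_using_hit(event, shared_hit)), "tracks share one hit")
```

Expressing the same shared hit in relational tables would need an extra link table per relation, which is part of why the square view of data fits poorly.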

Why physicists don't (yet) use Excel and Oracle for their daily analysis.
- Spreadsheets like Excel and relational databases like Oracle have a very square view of data. This is not a good match to the Data Worm.
- Normal people (banks and insurance companies) can define a priori the quantities that they will select on (the keys of the database).
- We usually derive selection criteria a posteriori, using quantities calculated from the stored data.
- We like (need?) to express queries as individualistic, detailed, low-level computer code, which is difficult to support in a database.
- But this is changing very rapidly due to Data Mining: businesses are interested in analyzing their raw data in unpredictable ways. Example: cash register tickets used to choose sale items.
- Support for this requires a more organic view of data, for example object-relational databases.

Why physicists don't (yet) use Excel and Oracle for their daily analysis.
[Figure: Idealized data model. Entities and attributes: Particle hypothesis (mass, charge, momentum, origin); Cluster (position, width, depth, energy, number of hits); Track (origin, curvature, extrapolation, number of hits); Calorimeter hit (position, response); Tracker hit (position, response). Relations are labelled "Simple relation" and "One to Many".]

Why physicists don't (yet) use Excel and Oracle for their daily analysis.
[Figure: Data model in reality. Entities and attributes: Particle hypothesis (mass, charge, momentum, origin); Cluster (position, width, depth, energy, number of hits); Track (origin, curvature, extrapolation, number of hits); Calorimeter hit (position, response); Tracker hit (position, response). Relations are labelled "Complicated algorithmic relation" and "Many to Many".]

The challenge of analysis for the LHC experiments

The challenge of analysis for the LHC experiments
[Figure labels: 1:10^12; Online 1:10^7; Analysis 1:10^5]

The challenge of analysis for the LHC experiments

The challenge of analysis for the LHC experiments
[Figure: Data flow for one experiment. Labels: Detector; Raw data; 0.1 to 1 GB/sec; Event Filter (selection & reconstruction); 35K SI95; 1 PB/year; 500 TB; Event Summary Data; ~200 MB/sec; 350K SI95; 64 GB/sec; Batch Physics Analysis; ~100 MB/sec; Event analysis objects; Reconstruction; 250K SI95; Event Simulation; Thousands of scientists distributed around the planet.]
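A back-of-the-envelope sketch of how a sustained data rate translates into a yearly volume. The live time of 10^7 seconds per year is an assumption, a common rule of thumb for accelerator running that is not stated on the slide, and the function name is invented. Decimal units are used (1 PB = 10^6 GB).

```python
# Assumption, not from the slide: about 1e7 seconds of data taking per year.
ASSUMED_LIVE_SECONDS_PER_YEAR = 1.0e7
GB_PER_PB = 1.0e6


def yearly_volume_pb(rate_gb_per_s: float,
                     live_seconds: float = ASSUMED_LIVE_SECONDS_PER_YEAR) -> float:
    """Data volume in PB accumulated in one year at a constant rate."""
    return rate_gb_per_s * live_seconds / GB_PER_PB


if __name__ == "__main__":
    for rate in (0.1, 1.0):  # the raw-data range quoted in the figure, GB/sec
        print(f"{rate} GB/sec -> {yearly_volume_pb(rate):.1f} PB/year")
```

Under this assumption the lower end of the quoted rate range gives a volume of the same order as the 1 PB/year label, although the slide does not say how that figure was derived.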

The challenge of computing for the LHC
[Figure: Long Term Tape Storage Estimates, in TeraBytes (0 to 14,000) versus year (1995 to 2006), for current experiments, COMPASS and LHC.]

The challenge of computing for the LHC
[Figure: Long Term Tape Storage Estimates, in TeraBytes (0 to 14,000) versus year (1995 to 2006), for current experiments, COMPASS and LHC. Annotations: Accumulation: 10 PB/year; Signal/Background up to 1:10^12.]

The challenge of computing for the LHC
[Figure: Estimated CPU Capacity required at CERN, in K SI95 (0 to 5,000) versus year (1998 to 2010). Curves: LHC; Moore's law (some measure of the capacity technology advances provide for a constant number of processors or investment). Annotation: Jan 2000: 3.5K SI95.]

The challenge of computing for the LHC
[Figure: CERN Centre Physics Computing Capacity, in thousands of CERN Units (0 to 120) versus year (1988 to 2000), compared with Moore's law (based on 1988). Annotations: Mainframes decommissioned; First PC services; CERN RD47 project; RISC decommissioning agreed.]

The challenge of computing for the LHC
[Figure: CERN Centre Physics Computing Capacity, in thousands of CERN Units (0 to 120) versus year (1988 to 2000), compared with Moore's law (based on 1988). Annotations: Mainframes decommissioned; First PC services; CERN RD47 project; RISC decommissioning agreed; Continued innovation.]

Solving the LHC Computing Challenge: Technology Development Domains
[Figure labels: DEVELOPER VIEW; FABRIC; GRID; APPLICATION; USER VIEW]

Solving the LHC Computing Challenge
[Figure: Computing fabric at CERN (2006). Labels: Farm Network; Storage Network; LAN-WAN Routers; Grid Interface; Real-time detector data; Thousand dual-cpu boxes; Hundreds of tape drives; Thousand disk units. Numbers on the links give data rates in Gbps.]

Solving the LHC Computing Challenge: Data-Intensive Grid Research
Grid Protocol Architecture (top to bottom):
- Application
- User: specialized services (user- or application-specific distributed services)
- Collective: managing multiple resources (ubiquitous infrastructure services)
- Resource: sharing single resources (negotiating access, controlling use)
- Connectivity: talking to things (communication via Internet protocols, and security)
- Fabric: controlling things locally (access to, and control of, resources)
Internet Protocol Architecture (top to bottom): Application, Transport, Internet, Link

Acknowledgements Many of the figures in this talk are from the Web sites of ATLAS, CMS, Aleph and Delphi.