Data Analysis in Experimental Particle Physics

Data Analysis in Experimental Particle Physics
C. Javier Solano S.
Grupo de Física Fundamental, Facultad de Ciencias, Universidad Nacional de Ingeniería

Data Analysis in Particle Physics: Outline of Lecture
- Characteristics of data from particle experiments
- From DAQ data to Event Records: Event Building
- From hits to tracks and clusters
- From tracks and clusters to particles: correlating sub-detector information
- Uncertainties and resolution
- Data reconstruction and production: Data Summary Tapes
- Personal data analysis: n-tuples

Data Analysis in Particle Physics: Outline of Lecture (cont.)
- Monte Carlo simulation
- Statistics and error analysis
- Hypothesis testing
- Simulation of particle production and interactions with the detector
- Digital representations of event data
- Monitoring and Calibration
- Why physicists don't (yet) use Excel and Oracle for their daily analysis
- The challenge of analysis for the LHC experiments
- The challenge of computing for the LHC
- Solving the LHC computing challenge

Characteristics of data from particle experiments

Characteristics of data from particle experiments. Most data comes from digitized information from sensors activated by particles crossing them. We call the data resulting from the observation of a particle collision an event. Over hours, days, weeks, months, years or even decades, we observe many events. We group them into runs according to the time-varying experimental conditions. Calibration and environmental information is also stored, usually periodically. For practical reasons, this data is stored in data files of many events. Almost always, events are independent of each other.

Characteristics of data from particle experiments. [Figure: the Experimental Particle Physics "Data Worm": a continuous stream of events (e.g. event number 31896) grouped into runs (Runs 137 to 140) and stored across data files (files 418 and 419), interleaved with calibration records.]

From DAQ data to Event Records Event Building

From hits to tracks and clusters

From hits to tracks and clusters Occupancy and point resolution are related to ambiguities in track finding

From hits to tracks and clusters Calibration, monitoring and software are needed to resolve these ambiguities

From hits to tracks and clusters. What you see is not always what there was! [Figure: example of a nuclear interaction.]

Monitoring and Calibration. Particles deposit energy in sensors. Sensors produce voltages, currents and charges, and the spatial position of each sensor is known. On-detector Analog-to-Digital Converters turn these into numbers representing these or other quantities (for example, clock ticks between voltage pulses). Calibration establishes the relationship between ADC units and physical units (eV, {x,y,z}, ns).
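As a minimal sketch of what such a calibration looks like in code (the channel structure, pedestal and gain values below are invented for illustration; real experiments load per-channel constants from dedicated calibration runs or a conditions database):

```cpp
#include <cstdint>

// Hypothetical per-channel calibration constants.
struct ChannelCalibration {
    double pedestal; // ADC counts with no signal present
    double gain;     // physical units (here MeV) per ADC count above pedestal
};

// Convert raw ADC counts to energy: subtract the pedestal, scale by the gain.
double adcToEnergy(std::uint16_t adcCounts, const ChannelCalibration& cal) {
    return (static_cast<double>(adcCounts) - cal.pedestal) * cal.gain;
}

int main() {
    ChannelCalibration cal{50.0, 0.25};       // assumed: 50-count pedestal, 0.25 MeV/count
    double energyMeV = adcToEnergy(850, cal); // (850 - 50) * 0.25 = 200 MeV
    return energyMeV > 0.0 ? 0 : 1;
}
```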

From tracks and clusters to particles Correlating sub-detector information

Uncertainties and resolution. Each measurement or hit has some uncertainty, due to alignment and the characteristics of the sensor. These uncertainties get propagated, often in a non-linear manner, into resolution functions for the physics quantities used in analysis. Resolution has various consequences: a direct effect on measurements; signal-background confusion; combinatorics.
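As a small worked example of how a hit-level uncertainty propagates non-linearly (a sketch with invented numbers, using the common approximation that a track's transverse momentum is inversely proportional to its measured curvature):

```cpp
#include <cstdio>

// First-order error propagation for pT = k / kappa, where kappa is the
// measured track curvature and k absorbs the magnetic field and units.
// All numbers are invented for illustration.
int main() {
    const double k = 0.3;           // assumed: GeV per unit curvature (1/m)
    const double kappa = 0.01;      // measured curvature, 1/m
    const double sigmaKappa = 5e-4; // curvature uncertainty from hit resolution

    double pT = k / kappa;                               // 30 GeV
    // sigma_pT = |d(pT)/d(kappa)| * sigma_kappa = (k / kappa^2) * sigma_kappa
    double sigmaPT = (k / (kappa * kappa)) * sigmaKappa; // 1.5 GeV
    // Note the non-linearity: the same sigma_kappa costs more at high pT.
    std::printf("pT = %.1f +- %.1f GeV\n", pT, sigmaPT);
    return 0;
}
```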

Data reconstruction and production: Data Summary Tapes. Reconstruction turns hits + calibration + geometry into particle hypotheses. Reconstruction is time consuming and must be done coherently: centrally organized production. The output is one or more levels of so-called Data Summary Tapes (DSTs), which are used as input to personal analysis. In practice, there is a lot of utility software to organize these data for easy analysis (bookkeeping). Programming of complicated event structures: in the old days, FORTRAN with home-made memory managers; today, object-oriented design using C++ or Java.

Personal data analysis. Most modern detectors can address multiple physics topics, with hundreds or thousands of professors and students distributed around the world. Modern experimental collaborations are an early example of virtual communities. Historical enablers for virtual communities: fellowship and exchange programmes; telegraph, telex, telephone and telefax; national and international laboratories; reasonably priced airline tickets; computer inter-networking, e-mail and ftp; the World Wide Web; multi-media applications on the Internet.

Personal data analysis. Today, physics analysis topics are increasingly tackled by virtual teams within these virtual communities, which must maintain coherency of data and algorithms within the team. Production for a modern detector is very complex and consumes many resources. DSTs contain all imagined reconstruction objects for all foreseen analyses, so they are big. Handling a DST often requires installing special software libraries and writing code in the reconstruction dialect.

Personal data analysis. Solution: each virtual team develops code to extract a common analysis dataset for a given topic, which is written and manipulated using a lingua franca: n-tuples and the Physics Analysis Workstation (PAW)/ROOT. This is the physicist's version of business data mining with Excel. It is an iterative process (time-scale of weeks or months):
1. The team agrees on complex algorithms to be coded in the extraction program.
2. The algorithms are coded and tested, and the extraction is run over the DST.
3. The n-tuple file is rapidly distributed via the computer network.
4. The n-tuple is analyzed using non-compiled, platform-independent code (PAW/ROOT macros today, Java in the future?) that is easily modified and shared by e-mail (see the macro sketched below).
5. Eventually limitations are reached; go back to step 1.
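A minimal ROOT macro sketching steps 2 to 4 of this workflow (the file name, variable names and cut values are invented for illustration):

```cpp
// ntuple_demo.C : run with  root -l ntuple_demo.C
#include "TFile.h"
#include "TNtuple.h"
#include "TRandom.h"

void ntuple_demo() {
    // "Extraction" step: write a compact n-tuple (here filled with fake
    // values standing in for quantities computed from a DST).
    TFile out("analysis.root", "RECREATE");
    TNtuple nt("nt", "demo n-tuple", "mass:pt:nhits");
    for (int i = 0; i < 1000; ++i) {
        nt.Fill(gRandom->Gaus(90.0, 2.5),  // invented "mass" in GeV
                gRandom->Exp(20.0),        // invented "pt" in GeV
                static_cast<float>(10 + i % 30));
    }
    // Personal-analysis step: histogram one variable with a selection cut.
    nt.Draw("mass", "pt > 10 && nhits > 15");
    nt.Write();
    out.Close();
}
```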

Personal data analysis. PAW was the killer application for physics in the 90s:
- interactive, just as powerful workstations became available;
- platform independent, in a very diverse workstation world;
- graphical, just as X-windows gave graphics over the network;
- simple for writing analysis macros, just as the complexity of the FORTRAN programming required in experiments decoupled most collaborators from the experiment's code.
In summary, PAW was like going from DOS to Macintosh. One major limitation of PAW is the lack of variable-length structures, or more generally of data objects. ROOT overcomes these limitations while keeping a philosophy similar to PAW's. Java Analysis Studio tries to go further with agents.

Personal data analysis. Which will be the killer application for LHC analysis? Is a Mac Classic on AppleTalk enough, or do we need the conceptual leap equivalent of the Web plus a Java-enabled browser? Will the personal n-tuple model work for the LHC? Do we need, and can we afford, to support our own interactive data analysis tool? Will one of the newer tools, such as Java Analysis Studio, go exponential in the open-source world? Many questions, one simple answer: it will be young people like you who make the next step happen.

Monte Carlo simulation. Monte Carlo simulation uses random numbers (see mathematics textbooks). Try the following:
- Find a source of random numbers in the interval [0,1] (calculator, Excel, etc.).
- Take a function that you want to simulate (e.g. y = x^2) and normalize it to fit in the interval [0,1] for both x and y.
- Find graph paper to histogram values of x.
- Repeat this at least 20 times: throw two random numbers; use the first as the value of x; evaluate the function y(x) and compare it to the second random number; if the function value is greater than the random number, add a count to the histogram in the correct bin of x; if it is smaller, forget it (see the code version below).
- Compare your histogram to the shape of the function.
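The same recipe in a few lines of C++ (a self-contained sketch; the seed, trial count and bin count are arbitrary choices):

```cpp
#include <array>
#include <cstdio>
#include <random>

// Accept-reject sampling of y = x^2 on [0,1], following the recipe above.
int main() {
    std::mt19937 gen(12345);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    const int nTrials = 100;
    std::array<int, 10> histogram{}; // 10 bins in x
    int accepted = 0;

    for (int i = 0; i < nTrials; ++i) {
        double x = uniform(gen); // first random number: candidate x
        double r = uniform(gen); // second random number: accept/reject
        if (x * x > r) {         // accept when the function value exceeds r
            ++histogram[static_cast<int>(x * 10.0)];
            ++accepted;
        }
    }
    // The efficiency should be close to the integral of x^2 over [0,1], i.e. 1/3.
    std::printf("accepted %d of %d trials (%.0f%%)\n",
                accepted, nTrials, 100.0 * accepted / nTrials);
    for (int b = 0; b < 10; ++b)
        std::printf("bin [%.1f,%.1f): %d\n", b / 10.0, (b + 1) / 10.0, histogram[b]);
    return 0;
}
```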

Monte Carlo simulation. If you don't know how to program, you can pick up an Excel file from http://cern.ch/manuel.delfino/brazil. Here is the result for 100 trials: there are 30 entries, so the efficiency is 30%. Note the statistical fluctuations. Homework: how is the normalization done? [Figure: example histogram from a Monte Carlo simulation of y = x*x, with x binned from 0 to 1 in steps of 0.1 and counts on the y axis from 0 to 10.]

Statistics and error analysis. Analysis involves selecting, counting and normalizing. Things are easier when you actually have a signal. Understand the underlying statistics: Poisson, binomial, multinomial, etc. If measuring a differential distribution, understand the relation between the normalization of binned counts and of total counts. Understand selection biases and their impact on observed distributions. Things are a lot harder when you place limits. Two observations: if you cannot make an analytical estimate of the uncertainties, I won't believe your result; and the expression "n-sigma effect" should be banned.
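As a reminder of the simplest counting case behind such analyses (standard results, not from the slides): for a Poisson-distributed event count N with mean \mu,

```latex
P(N;\mu) = \frac{\mu^{N} e^{-\mu}}{N!},
\qquad
\sigma_N = \sqrt{\mu} \approx \sqrt{N},
\qquad
\frac{\sigma_N}{N} \approx \frac{1}{\sqrt{N}}
% e.g. N = 100 selected events gives roughly a 10% statistical uncertainty
```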

Hypothesis testing. You must understand Bayes' theorem. And every time you think you understand it, you must make a big effort to understand it better! Compare differential distributions of data with the predictions of a theory or model: different theories, or different parameters of the same model. Setting up the statistical test is often straightforward, which is why it is surprising that most people do it wrong. Taking account of resolution and systematic uncertainties is hard: make the simulation look like the data to get your answers, even if the graphics look better the other way around!
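For reference, Bayes' theorem in the form used when testing hypotheses H_i against data:

```latex
P(H_i \mid \text{data})
  = \frac{P(\text{data} \mid H_i)\, P(H_i)}{P(\text{data})},
\qquad
P(\text{data}) = \sum_j P(\text{data} \mid H_j)\, P(H_j)
```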

Simulation of particle production and interactions with the detector. For particle production, combine Monte Carlo with: detailed particle properties; detailed cross-sections predicted by theory or phenomenology; computation of phase space. The output consists of event records containing simulated particles (often called 4-vectors by experimentalists). For simulating the detector, combine Monte Carlo with: a detailed description of the detector; detailed cross-sections for interactions with detector materials; detailed phenomenology of the mechanism producing the signal; transport (ray-tracing) algorithms, including magnetic fields; and a digitization model mapping {x,y,z} to read-out channels.
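A toy version of the last ingredient, the digitization mapping from a simulated position to a read-out channel (the barrel geometry of 64 phi-sectors by 50 z-slices is invented for illustration):

```cpp
#include <cmath>
#include <cstdio>

struct ChannelId { int phiSector; int zSlice; };

// Map an energy deposit at {x,y,z} (metres) onto a read-out channel of an
// assumed cylindrical barrel detector: 64 sectors in phi, 50 slices in z.
ChannelId digitize(double x, double y, double z) {
    const double kPi = 3.14159265358979323846;
    const int nPhi = 64;
    const int nZ = 50;
    const double zMin = -2.5, zMax = 2.5; // assumed half-length: 2.5 m

    double phi = std::atan2(y, x);        // in [-pi, pi]
    int phiSector = static_cast<int>((phi + kPi) / (2.0 * kPi) * nPhi) % nPhi;
    int zSlice = static_cast<int>((z - zMin) / (zMax - zMin) * nZ);
    return {phiSector, zSlice};
}

int main() {
    ChannelId id = digitize(0.7, 0.7, 0.1); // a hit at 45 degrees, z = 10 cm
    std::printf("channel: phi sector %d, z slice %d\n", id.phiSector, id.zSlice);
    return 0;
}
```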

Simulation of particle production and interactions with the detector. Example: a small part of the design of GEANT4. (There is a reference to Jackson's textbook in the documentation!)

Digital representations of event data. In principle, representing event data digitally should be very simple, except that: everything comes in variable numbers (hits, tracks, clusters); ambiguities lead to multiple relations; particle identification may depend on the analysis hypothesis; etc. In simple terms, events don't look like bank-account data; they look like collections of objects. You can do a reasonable representation using relational tables, but actually using the data structures from Fortran/ROOT programs is still cumbersome. Object-oriented programming is a better match, but C++ does not resolve all problems: hence frameworks.
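A sketch of what "collections of objects" means in C++ (all class and member names are illustrative, not from any experiment's framework):

```cpp
#include <vector>

struct Hit { double x, y, z, response; };

// Variable-length, possibly shared relations: a hit may belong to several
// tracks or clusters (the many-to-many case discussed below).
struct Track {
    double curvature;
    std::vector<const Hit*> hits;
};

struct Cluster {
    double energy;
    std::vector<const Hit*> hits;
};

// Which tracks and clusters support a particle hypothesis can depend on
// the analysis hypothesis, so the relations live in the candidate itself.
struct ParticleHypothesis {
    double mass, charge, momentum;
    std::vector<const Track*> tracks;
    std::vector<const Cluster*> clusters;
};

struct Event {
    std::vector<Hit> hits; // everything comes in variable numbers
    std::vector<Track> tracks;
    std::vector<Cluster> clusters;
    std::vector<ParticleHypothesis> particles;
};

int main() {
    Event evt;
    evt.hits.push_back({0.1, 0.2, 0.3, 512.0});
    return 0;
}
```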

Why physicists don't (yet) use Excel and Oracle for their daily analysis. Spreadsheets like Excel and relational databases like Oracle have a very "square" view of data. This is not a good match to the Data Worm. Normal people (banks and insurance companies) can define a priori the quantities that they will select on (the keys of the database). We usually derive selection criteria a posteriori, using quantities calculated from the stored data. We like (need?) to express queries as individualistic, detailed, low-level computer code, which is difficult to support in a database. But this is changing very rapidly due to data mining: businesses are interested in analyzing their raw data in unpredictable ways (example: using cash-register tickets to choose sale items). Support for this requires a more organic view of data, for example object-relational databases.

Why physicists don't (yet) use Excel and Oracle for their daily analysis. [Figure: the idealized data model. A particle hypothesis (mass, charge, momentum, origin) has simple one-to-many relations to clusters (position, width, depth, energy, number of hits) and tracks (origin, curvature, extrapolation, number of hits); clusters and tracks in turn have one-to-many relations to calorimeter hits and tracker hits (position, response).]

Why physicists don't (yet) use Excel and Oracle for their daily analysis. [Figure: the reality. The same entities as above, but the particle hypothesis is related to clusters and tracks by complicated algorithmic, many-to-many relations, and clusters and tracks are likewise related many-to-many to calorimeter hits and tracker hits.]

The challenge of analysis for the LHC experiments

The challenge of analysis for the LHC experiments. [Figure: event selection in stages, reaching an overall selectivity of about 1:10^12: roughly 1:10^7 online, followed by roughly 1:10^5 in offline analysis.]

The challenge of analysis for the LHC experiments. [Figure: the data flow for one experiment. The detector delivers 0.1 to 1 GB/s to the event filter (selection and reconstruction, 35K SI95); raw data accumulates at about 1 PB/year; reconstruction (250K SI95) produces Event Summary Data (about 500 TB, written at roughly 200 MB/s); batch physics analysis (350K SI95, with an aggregate read rate of 64 GB/s) produces event analysis objects at roughly 100 MB/s; event simulation feeds in as well. All of this serves thousands of scientists distributed around the planet.]

The challenge of computing for the LHC. [Figure: long-term tape storage estimates, 1995 to 2006, in TeraBytes (axis up to 14,000). Current experiments and COMPASS stay comparatively modest, while the LHC curve climbs steeply. Annotations: accumulation of 10 PB/year; signal/background up to 1:10^12.]

The challenge of computing for the LHC. [Figure: estimated CPU capacity required at CERN for the LHC, 1998 to 2010, in K SI95 (axis up to 5,000), compared with a Moore's-law curve (some measure of the capacity that technology advances provide for a constant number of processors or a constant investment). Marked point: January 2000, 3.5K SI95.]

The challenge of computing for the LHC. [Figure: CERN Centre physics computing capacity, 1988 to 2000, in thousands of CERN Units (axis up to 120). Annotations: mainframes decommissioned, first PC services, CERN RD47 project, RISC decommissioning agreed, continued innovation. The growth is compared with a Moore's-law curve based on 1988.]

Solving the LHC Computing Challenge: Technology Development Domains. [Figure: three technology domains, fabric, grid and application, spanning from the developer view to the user view.]

Solving the LHC Computing Challenge. [Figure: the computing fabric at CERN (2006): thousands of dual-CPU boxes on a farm network; around ten thousand disk units and hundreds of tape drives on storage networks; LAN-WAN routers; a grid interface; and real-time detector data coming in. Link data rates range from under 1 Gbps up to 960 Gbps.]

Solving the LHC Computing Challenge: Data-Intensive Grid Research. [Figure: the Grid protocol architecture shown alongside the Internet protocol architecture (Application, Transport, Internet, Link).] The Grid layers are:
- Application: specialized services; user- or application-specific distributed services.
- Collective: managing multiple resources; ubiquitous infrastructure services.
- Resource: sharing single resources; negotiating access, controlling use.
- Connectivity: talking to things; communication (Internet protocols) and security.
- Fabric: controlling things locally; access to, and control of, resources.

Acknowledgements Many of the figures in this talk are from the Web sites of ATLAS, CMS, Aleph and Delphi.