pyframe Ryan Reece A light-weight Python framework for analyzing ROOT ntuples in ATLAS University of Pennsylvania

Similar documents
Data Analysis in ATLAS. Graeme Stewart with thanks to Attila Krasznahorkay and Johannes Elmsheuser

The Run 2 ATLAS Analysis Event Data Model

PARALLEL PROCESSING OF LARGE DATA SETS IN PARTICLE PHYSICS

CMS Analysis Workflow

TreeSearch User Guide

Toward real-time data query systems in HEP

Considerations for a grid-based Physics Analysis Facility. Dietrich Liko

Principles of Programming Languages. Objective-C. Joris Kluivers

VISPA: Visual Physics Analysis Environment

Event Displays and LArg

PROOF-Condor integration for ATLAS

Analysis & Tier 3s. Amir Farbin University of Texas at Arlington

Recasting with. Eric Conte, Benjamin Fuks. (Re)interpreting the results of new physics searches at the LHC June CERN

Building a Multi-Language Interpreter Engine

Big Data Analytics Tools. Applied to ATLAS Event Data

Table of Contents. TB2009DataAnalysis...1. Introduction...2 People Involved in Analysis/Software...2 Communication...2. Data Preparation...

RELEASE OF GANGA. Basics and organisation What Ganga should do tomorrow Ganga design What Ganga will do today Next steps

An SQL-based approach to physics analysis

MadAnalysis5 A framework for event file analysis.

Ganga The Job Submission Tool. WeiLong Ueng

Data Structures (list, dictionary, tuples, sets, strings)

Control Structures. Code can be purely arithmetic assignments. At some point we will need some kind of control or decision making process to occur

Early experience with the Run 2 ATLAS analysis model

And Parallelism. Parallelism in Prolog. OR Parallelism

Physics Analysis Software Framework for Belle II

Meeting. Wataru OKAMURA 26/12/2011 Osaka ATLAS meeting

PyROOT: Seamless Melting of C++ and Python. Pere MATO, Danilo PIPARO on behalf of the ROOT Team

HOMEWORK 9. M. Neumann. Due: THU 8 NOV PM. Getting Started SUBMISSION INSTRUCTIONS

Overview of OOP. Dr. Zhang COSC 1436 Summer, /18/2017

Large Scale Software Building with CMake in ATLAS

DAL ALGORITHMS AND PYTHON

SCHEME 8. 1 Introduction. 2 Primitives COMPUTER SCIENCE 61A. March 23, 2017

Project #1: Tracing, System Calls, and Processes

Assignment 6: The Power of Caches

Parallel MATLAB at VT

The Dynamic Typing Interlude

Make sure you have the latest Hive trunk by running svn up in your Hive directory. More detailed instructions on downloading and setting up

Reliable programming

Implementing a VSOP Compiler. March 20, 2018

Performance quality monitoring system (PQM) for the Daya Bay experiment

Rivet. July , CERN

Simulation and Visualization in Hall-D (a seminar in two acts) Thomas Britton

Spring 2018 Discussion 7: March 21, Introduction. 2 Primitives

CS103 Handout 29 Winter 2018 February 9, 2018 Inductive Proofwriting Checklist

Magento 2 Certified Professional Developer. Exam Study Guide

Evaluation of Apache Hadoop for parallel data analysis with ROOT

CS 31: Intro to Systems Pointers and Memory. Kevin Webb Swarthmore College October 2, 2018

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

xmljson Documentation

2. Write style rules for how you d like certain elements to look.

Chapter Two Bonus Lesson: JavaDoc

CS 11 java track: lecture 1

Stitched Together: Transitioning CMS to a Hierarchical Threaded Framework

Data Analysis R&D. Jim Pivarski. February 5, Princeton University DIANA-HEP

Fall 2017 Discussion 7: October 25, 2017 Solutions. 1 Introduction. 2 Primitives

UW-ATLAS Experiences with Condor

UNIT I Programming Language Syntax and semantics. Kainjan Sanghavi

L25: Modern Compiler Design Exercises

Reverse Engineering Swift Apps. Michael Gianarakis Rootcon X 2016

Challenges and Evolution of the LHC Production Grid. April 13, 2011 Ian Fisk

Git. CSCI 5828: Foundations of Software Engineering Lecture 02a 08/27/2015

New data access with HTTP/WebDAV in the ATLAS experiment

Day 15: Science Code in Python

Validation of Missing ET reconstruction in xaod

How to discover the Higgs Boson in an Oracle database. Maaike Limper

Martinos Center Compute Cluster

HEP data analysis using ROOT

Django Test Utils Documentation

Induction and Semantics in Dafny

wagtailmenus Documentation

Boosted ZHLL analysis. Wataru OKAMURA 27/12/2011

Github/Git Primer. Tyler Hague

ECE 353 Lab 3. (A) To gain further practice in writing C programs, this time of a more advanced nature than seen before.

Getting Started with Serial and Parallel MATLAB on bwgrid

C# Language. CSE 409 Advanced Internet Technology

AIDA-2020 Advanced European Infrastructures for Detectors at Accelerators. Scientific/Technical Note

D Programming Language

CS143 Handout 05 Summer 2011 June 22, 2011 Programming Project 1: Lexical Analysis

Launch Store. University

CS56 final (E03) W15, Phill Conrad, UC Santa Barbara Wednesday, 03/18/2015. Name: Umail umail.ucsb.edu. Circle one: 4pm 5pm 6pm

Core to Murphi XML version

1) How many gigabytes make a terabyte? A) 10 B) 100 C) 1000 D) 1024 E) 512

Overview. Rationale Division of labour between script and C++ Choice of language(s) Interfacing to C++ Performance, memory

CS61C Machine Structures. Lecture 3 Introduction to the C Programming Language. 1/23/2006 John Wawrzynek. www-inst.eecs.berkeley.

DEDICATED SERVERS WITH EBS

CS 231 Data Structures and Algorithms, Fall 2016

Lecture 4. Wednesday, January 27, 2016

CIS192 Python Programming

Intro to Programming. Unit 7. What is Programming? What is Programming? Intro to Programming

Introduction to ROOT. Sebastian Fleischmann. 06th March 2012 Terascale Introductory School PHYSICS AT THE. University of Wuppertal TERA SCALE SCALE

Counter & LED (LED Blink)

Outline. Java Models for variables Types and type checking, type safety Interpretation vs. compilation. Reasoning about code. CSCI 2600 Spring

turbo-hipster Documentation

Employing HPC DEEP-EST for HEP Data Analysis. Viktor Khristenko (CERN, DEEP-EST), Maria Girone (CERN)

Book keeping. Will post HW5 tonight. OK to work in pairs. Midterm review next Wednesday

wagtailtrans Documentation

Performance quality monitoring system for the Daya Bay reactor neutrino experiment

High Performance Python Micha Gorelick and Ian Ozsvald

ALICE ANALYSIS PRESERVATION. Mihaela Gheata DASPOS/DPHEP7 workshop

An overview of batch processing. 1-June-2017

Transcription:

pyframe A light-weight Python framework for analyzing ROOT ntuples in ATLAS Ryan Reece University of Pennsylvania ryan.reece@cern.ch Physics Analysis Tools Meeting, CERN September 21, 2011

What is this framework? pyframe is a light-weight python framework for analyzing ROOT ntuples in ATLAS. It allows you to read virtually any kind of flat ntuples in python, quickly, and in a minimal framework that can scale in complexity to include several algorithms and tools. It s design goals are: 1. No EDM 2. Flat data 3. But treat objects like objects 4. Blackboard pattern for sharing data among algorithms 5. Easy addition of user data 6. Be fast, enough 2

1. No EDM Eliminate the need of a user to maintain classes describing the entire content of the data. Variables are accessed dynamically, on-demand. Of course the execution will halt if you try to access a variable that doesn't exist, but you aren't responsible for knowing more about the data than you ask of it. If you don t read a variable, it can t hurt you. No MakeClass. No ReaderMaker. 3

2. Flat data Data formats in active experiments are subject to change. pyframe makes no assumptions about the kind of data you want to analyze except: 1. Your trees are flat, meaning that they are not filled with rich, userdefined types. The branches are basic types, arrays, or std::vectors. 2. Groups of variables that represent the attributes of a common class of objects, should have a common prefix in their branch names, for example: int el_n; std::vector<float> el_eta; std::vector<float> el_phi; std::vector<float> el_pt; pyframe was designed with the use case of reading ATLAS D3PDs in mind, but there is no reason you can't read any flat ntuple with common branch prefixes. 4

3. But treat objects like objects Flat data has the huge advantages that it is easy to inspect and there is no schema evolution. But our problems in high energy physics really are object oriented. pyframe benefits from assuming flat data (D3PDs), but allows you to group related variables into objects that behave as if they were a class; objects that can be collected, filtered, and sorted. This goal is realized with the VarProxy class. 5

VarProxy A VarProxy internally holds a reference to the tree being read, a string prefix, and an integer index. It has its getattribute method overridden such that when one calls p.pt it returns tree.el_pt[1] You can build a list of VarProxies for all the electrons in your tree with the build_var_proxies function: electrons = pyframe.core.build_var_proxies(chain, chain.el_n, prefix='el_') Building lists of VarProxies with a common prefix and saving the collection in the store dictionary is made more user friendly with the ListBuilder algorithm shown later. Now you can pt-sort the electrons electrons.sort(lambda x, y: cmp(x.pt, y.pt), reverse=true) and treat each instance as if it were an instance of a class with members pt, eta, etc. el = electrons[0] el.pt 6

4. Blackboard pattern for data sharing among algorithms A pyframe job executes a list of algorithms in order, for each event in the data. Like the Gaudi framework's StoreGate, pyframe allows algorithms to share data with each other by storing any event-level derived data in the store dictionary, which can later be retrieved by any subsequent algorithms. There is also a hists dictionary, with the only difference that it is not cleared event-by-event, but store is cleared. 7

5. Easy addition of user data In the course of data analysis, one should be able to calculate new object-level derived quantities and associate them to the object. Python's ability to dynamically set any object's attributes makes attaching new variables to objects trivial: p.my_new_var = 11.0 8

6. Be fast, enough Being written in python, pyframe's design favors the programmer's time and sanity, over the CPU. But of course your analysis needs to be quick enough to sample the data effectively. pyframe has been designed with event processing rates (per 2-3 GHz core) of 100 Hz being considered sufficienct and 1 khz being preferred. My analysis currently runs at 400 Hz. Processing lots of data should be done in parallel. pyframe has been designed for easy parallelization using Python's standard module multiprocessing, or by submitting jobs to a batch system. 9

Selectors pyframe has modules for each object type: egamma, muon, tau, jet, met. electron_selector = \ pyframe.egamma.electronselector( allowed_authors = [1, 3], min_pt = 15.0*GeV, allowed_etas = [(-2.47, -1.52), (-1.37, 1.37),(1.52, 2.47)], flags = ['tightwithtrack'] ) Each selector inherits from a base selector that uses Python s built-in filter. Each selector just has to implement a select function that returns True/ False. 10

Configuring an EventLoop One adds algorithms to an EventLoop: loop = pyframe.core.eventloop('myloop') loop += pyframe.grl.grlfilter(config['grl']) loop += pyframe.algs.listbuilder( prefixes = ['vxp_'], keys = ['all_vertices'], ) loop += \ pyframe.selectors.selectoralg('vertexselectoralg', selector = vxp_primary_selector, key_in='all_vertices', key_out='primary_vertices', )... loop.run(chain, 0, config['max_events']) 11

Turning Off Branches with TreeProxy O(10) speed improvements can be achieved by simply turning off branches that are not read in the analysis. The TreeProxy class makes this easy by logging every variable that is read, and optionally dumping the list to a file. When running an EventLoop, one can configure to dump the variable log, and/or select which branches to turn on by: loop.run(chain, 0, config['max_events'], branches_on_file = config.get('branches_on_file'), do_var_log = config.get('do_var_log'), ) 12

Profile your analysis as you develop A pyframe EventLoop automatically monitors the execution time for each algorithm. A report of the the time used is dumped at the end of the job: ALGORITHM TIME SUMMARY # ALGORITHM TIME [s] RATE [khz] 0 MCEventWeight 0 31.3 1 GRLFilter 0 30.9 2 ListBuilder 0 10.5 3 PrimaryVertexSelectorAlg 8 3.67... 13

Parallelization On a multicore machine, one can make use of Python s multiprocessing module just by supplying a command line argument specifying how many cores to use:./job.py -p 4 This gives functionality similar to PROOF-lite, with pure python. On a cluster, I ve been running Condor jobs easily. You just need Python and ROOT on all the worker nodes. pyutils/condor.py is a module I wrote for dividing up the input files and generating condor submission scripts. 14

Installation Requirements: Python 2.6 ROOT SVN Instructions and documentation: https://twiki.cern.ch/twiki/bin/view/sandbox/pyframe checkout pyframe and pyutils add to your PYTHONPATH only thing to compile is external packages using RootCore has a helloworld that runs out of the box 15

pyframe summary Pure python framework for reading flat ntuples VarProxy allows you to treat related variables like objects, without maintaining classes. Parallelization with multiprocessing and condor. Uses RootCore to use external tools and ATLAS recommendations. Get the full advantage of Python clean syntax duck typing string formatting standard modules garbage collection (no memory leaks) 16

My workflow x.d3pd.root tree_trimmer.py x.skim.root pyframe x.hist.root metaroot x.canv.root root2html.py www tree_trimmer.py is a general python script for skimming events and dropping branches. pyframe is used to make an event loop for filling histograms. metaroot is a python package for formatting histograms, producing a file of TCanvases. root2html.py is a general script that generates an html document of figures with captions with detailed statistics from any ROOT file of TCanvases. Some example output: ditau analysis: https://reece.web.cern.ch/reece/share/ditau.tau-mu.2011-09-05/ tau performance studies: https://reece.web.cern.ch/reece/share/plot_vars/ TRT performance studies: https://reece.web.cern.ch/reece/share/trt_eff.data_182787.mc_j2/ 17

More info tree_trimmer.py https://svnweb.cern.ch/trac/penn/browser/reece/rel/ tree_trimmer/trunk pyframe https://twiki.cern.ch/twiki/bin/view/sandbox/pyframe metaroot https://svnweb.cern.ch/trac/penn/browser/reece/rel/ metaroot/trunk root2html.py https://svnweb.cern.ch/trac/penn/browser/reece/rel/ root2html/trunk 18