
Journal of Physics: Conference Series. To cite this article: Niklaus Berger 2011 J. Phys.: Conf. Ser. 331 032005.

Partial wave analysis at BES III harnessing the power of GPUs

Niklaus Berger
Institute of High Energy Physics, 19B Yuquan Lu, Shijingshan, Beijing 100049, China
E-mail: nberger@ihep.ac.cn

Abstract. Partial wave analysis is a core tool in hadron spectroscopy. With the high statistics data available at facilities such as the Beijing Spectrometer III, this procedure becomes computationally very expensive. We have successfully implemented a framework for performing partial wave analysis on graphics processors. We discuss the implementation, the parallel computing frameworks employed and the performance achieved, with a focus on the recent transition to the OpenCL framework.

1. Introduction

The Beijing Electron-Positron Collider (BEPC) II is currently the world's highest luminosity machine in the τ-charm energy region. The Beijing Spectrometer (BES) III experiment was designed to make the best possible use of this luminosity [1]; one of the core parts of the physics program [2] is a thorough study of the light hadron spectrum. For many final states, a wealth of often broad intermediate resonances contributes to the amplitude. In order to disentangle interference effects and determine the spin and parity of the resonances, a partial wave analysis (PWA) is performed. As this involves computing the amplitude for every event in every iteration of a fit, it becomes computationally very expensive for large data samples. Since events are independent and the amplitude calculation does not vary from event to event, the task is trivially parallelizable. This, together with its floating-point intensity, makes PWA an ideal candidate for implementation on graphics processing units (GPUs). We have successfully implemented a framework performing tensor calculations and partial wave fits on GPUs. Both the software architecture, which completely encapsulates the complexities of GPU-based computing, and the raw power of today's GPUs lead to significantly shortened analysis turnaround times.

2. Partial wave analysis

In a typical PWA (we use the radiative decay J/ψ → γX, X → K⁺K⁻ [3] as an example), the decay is modelled by coherently summing the contributions from a variety of intermediate resonances X. The relative magnitudes and phases of these resonances are determined from a fit, and the fit result is compared to the data. The set of resonances and their properties is changed until sufficient agreement with the data is found.

The intensity I (relative number of events) at a particular point Ω in phase space can be expressed as

$$ I(\Omega) = \left| \sum_\alpha V_\alpha A_\alpha(\Omega) \right|^2, \qquad (1) $$

where the sum runs over all intermediate resonances α, V_α is the complex production amplitude for α and A_α(Ω) is the complex decay amplitude at that point in phase space. The likelihood for a particular model is

$$ \mathcal{L} = \prod_{i=1}^{N_{\mathrm{Data}}} \frac{I(\Omega_i)}{\int \epsilon(\Omega)\, I(\Omega)\, \mathrm{d}\Omega}, \qquad (2) $$

where the product runs over all N_Data events in the sample and the integral is proportional to the total cross section, corrected for the detector efficiency ε(Ω). The integral is usually performed numerically by summing over a large number N_MC^Gen of simulated events (Monte Carlo, MC) generated evenly in phase space. The limited acceptance and efficiency of the detector can be taken into account by summing only over the N_MC^Acc simulated events that pass the reconstruction and analysis cuts. In an iterative fit, a minimum of the negative logarithm of the likelihood, corresponding to the best set of parameters for the model used, is searched for:

$$ -\ln \mathcal{L} = -\sum_{i=1}^{N_{\mathrm{Data}}} \ln\!\left( \sum_{\alpha} \sum_{\alpha'} V_\alpha V_{\alpha'}^{*}\, A_\alpha(\Omega_i) A_{\alpha'}^{*}(\Omega_i) \right) + N_{\mathrm{Data}} \ln\!\left( \sum_{\alpha} \sum_{\alpha'} V_\alpha V_{\alpha'}^{*}\, \frac{1}{N_{\mathrm{MC}}^{\mathrm{Gen}}} \sum_{j=1}^{N_{\mathrm{MC}}^{\mathrm{Acc}}} A_\alpha(\Omega_j) A_{\alpha'}^{*}(\Omega_j) \right). \qquad (3) $$

The first sum runs over all data events, the second over all MC events. If the widths and masses of the resonances are kept constant in the fit (i.e. the V_α are the only free parameters), the last (inner) bracket and the A_α(Ω_i)A_{α'}*(Ω_i) terms for each data event can be pre-calculated. The number of floating point operations required is then dominated by the sum over the data events and scales with N_iterations · N_data · N_waves², whilst the lookup table takes up storage space scaling with N_data · N_waves². The storage space problem can be addressed by increasing the memory of the relevant machine (about 1.5 GB are required per million events for a model with 20 partial waves), or with appropriate caching mechanisms for data samples of several million events. If the required floating point operations are performed sequentially, however, the time required can become very long.

There is no way of knowing whether the fit found just a local maximum or the sought global maximum of the likelihood. To gain confidence in the result, the fits are usually repeated with various sets of starting parameters. In addition, various models have to be tried out, especially in the study of possible new resonances and of systematic effects. The thousands of fits needed in a typical partial wave analysis should thus be as fast as possible, especially as feedback from the physicist is required between fits, which therefore cannot easily be run in parallel.
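The following CPU-side sketch illustrates the structure of one such fit iteration using a pre-calculated lookup table of amplitude products; the names and data layout are illustrative assumptions and not GPUPWA's actual interface. It makes the N_data · N_waves² scaling of equation (3) explicit. As a rough consistency check, storing the N_waves(N_waves+1)/2 independent complex products per event in single precision would, for 20 waves, amount to roughly 210 × 8 bytes ≈ 1.7 kB per event, of the same order as the 1.5 GB per million events quoted above.

#include <complex>
#include <cmath>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// lookup[i][a*nWaves + b] holds A_a(Omega_i) * conj(A_b(Omega_i)), precomputed once.
// mcIntegral[a*nWaves + b] holds (1/N_MC_Gen) * sum_j A_a(Omega_j) * conj(A_b(Omega_j))
// over the accepted MC events. V holds the production amplitudes (the fit parameters).
double negLogLikelihood(const std::vector<std::vector<cplx>>& lookup,
                        const std::vector<cplx>& mcIntegral,
                        const std::vector<cplx>& V,
                        std::size_t nWaves)
{
    double nll = 0.0;
    for (const auto& evt : lookup) {                  // first term of eq. (3): sum over data events
        cplx intensity(0.0, 0.0);
        for (std::size_t a = 0; a < nWaves; ++a)
            for (std::size_t b = 0; b < nWaves; ++b)  // N_waves^2 terms per event
                intensity += V[a] * std::conj(V[b]) * evt[a * nWaves + b];
        nll -= std::log(intensity.real());
    }
    // second term of eq. (3): normalisation from the phase-space MC integral
    cplx norm(0.0, 0.0);
    for (std::size_t a = 0; a < nWaves; ++a)
        for (std::size_t b = 0; b < nWaves; ++b)
            norm += V[a] * std::conj(V[b]) * mcIntegral[a * nWaves + b];
    nll += static_cast<double>(lookup.size()) * std::log(norm.real());
    return nll;
}

On the GPU it is the outer loop over events that is parallelized, since each event's intensity can be evaluated independently of all others.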

3. GPU computing

The advent of computer games with realistic three-dimensional depictions of a virtual world computed in real time has pushed the development of ever more powerful graphics processing units (GPUs). GPUs provide a large number of floating point units optimized for executing the same code on many sets of input data (Single Instruction, Multiple Data, SIMD). The theoretical performance of current GPUs surpasses 2 TFlop/s, at commodity prices of a few hundred dollars.

Whilst early efforts to use graphics processors for general purpose computing relied on graphics interfaces such as OpenGL [4], several frameworks providing easy access to GPU computing have since become available. Besides the vendor-supplied frameworks (CUDA from Nvidia [5] and CAL/Brook+ from ATI [6]) there is now a vendor-independent standard, OpenCL [7, 8]. Most of these frameworks offer a great deal of low-level control over the granularity of the parallelism, memory management and so on. For typical high energy physics applications (and for PWA in particular), this fine-grained control is not really needed, as events are fully independent. For such cases, a framework with a high level of abstraction such as Brook+ is very convenient¹. As Brook+ is, however, no longer being developed because AMD/ATI is concentrating on OpenCL, we built an abstraction layer on top of OpenCL.

4. The GPUPWA framework

We have developed a framework for GPU-assisted partial wave analysis called GPUPWA. It supports amplitude creation and calculation in the covariant tensor formalism on the GPU as well as GPU-based likelihood calculations. Sophisticated caching mechanisms ensure that the memory footprint on the GPU stays within the rather tight limits (especially when using commodity hardware). The ROOT framework [9] is used for data input and output, for histogramming and, via the Minuit2 interface [10], for the minimisation process². The framework completely hides all GPU internals from the user, who can program in pure C++, allowing a PWA to be created within hours (a few days in cases with many complicated amplitudes). More details on the interface and implementation (and also the striking similarities between PWA and game graphics calculations) can be found in [12], which describes the Brook+ implementation; it is very close to the current OpenCL version.

Most of the calculations on the GPU are performed with single precision floating point numbers. The final sum, however, is always performed in double precision. For a large enough number of events, the final precision is comparable to that of a fully double precision CPU implementation, or even better, as the parallel tree summation employed on the GPU suffers from fewer rounding issues than a traditional accumulator sum.
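As an illustration of why the tree-shaped reduction is numerically benign, the following small CPU sketch (an assumed illustration, not the GPUPWA kernel) contrasts a sequential accumulator with a pairwise summation whose partial results are combined in double precision, mirroring the double-precision final sum described above.

#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential accumulation: every term is rounded against an ever-growing sum,
// so the rounding error grows roughly linearly with the number of terms.
float sequentialSum(const std::vector<float>& terms)
{
    float sum = 0.0f;
    for (float t : terms) sum += t;
    return sum;
}

// Pairwise (tree) summation in single precision: halves of comparable magnitude
// are combined, so the rounding error grows roughly with log2(N) instead of N.
float pairwiseSum(const float* terms, std::size_t n)
{
    if (n == 1) return terms[0];
    const std::size_t half = n / 2;
    return pairwiseSum(terms, half) + pairwiseSum(terms + half, n - half);
}

// Combine the (few) per-block partial results in double precision, mirroring
// the double-precision final sum performed on the GPU.
double blockedTreeSum(const std::vector<float>& terms, std::size_t blockSize = 256)
{
    double total = 0.0;
    for (std::size_t i = 0; i < terms.size(); i += blockSize) {
        const std::size_t n = std::min(blockSize, terms.size() - i);
        total += static_cast<double>(pairwiseSum(terms.data() + i, n));
    }
    return total;
}

In the pairwise scheme the error grows roughly with log₂N rather than with N, which is why a single-precision GPU reduction with a double-precision final combination can match or exceed the accuracy of a sequential double-precision accumulator for large event samples.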
5. Performance

All performance measurements quoted here were performed on an Intel Core 2 Quad 2.4 GHz workstation with 2 GB RAM running Scientific Linux 5.2. The GPU is an ATI Radeon 4870 with 512 MB RAM³. We compared a FORTRAN implementation used in BES II (without multi-threading and vectorization and fully in double precision) and the Brook+ GPUPWA implementation of a J/ψ → γK⁺K⁻ partial wave analysis, both minimizing using Fumili with analytical gradients and Hessian. The time for one fit iteration⁴ was measured with the system timer for various sizes of the data samples (see figure 1). During the fit, data transfers over the PCIe bus are restricted to the fit parameters (to the GPU) and the likelihoods, gradients and Hessian matrix (to the CPU); bus bandwidth is thus not the limiting factor. For large enough samples, the Brook+ GPUPWA code is more than 150 times faster than the FORTRAN code.

The same analysis was then ported to OpenCL in the AMD/ATI implementation. Without further optimisation of the code, another 35% speed-up was obtained for large event numbers (see figure 2); in part this is due to fewer restrictions on the size of memory allocations in OpenCL compared to Brook+, the rest is probably due to a more efficient compiler.

¹ We initially based our project on Brook+ and ATI hardware because at the time, only ATI allowed double precision calculations on consumer cards. This had the disadvantage that we were working with a rather experimental product with a small user base compared to Nvidia's CUDA. Brook+, however, offers a higher level of abstraction, a narrow interface to the GPU and more elegant programming.
² For all cases where we can provide analytic gradients, we rely on the FUMILI fitter [11], as it converges with the fewest iterations and we do not require reliable parameter errors from the fitter in most cases.
³ In the meantime, much more powerful hardware is available at both the CPU and the GPU end; the machine has, however, been available during the complete GPUPWA development process and thus remains a useful reference.
⁴ Typical Fumili fits require 5-10 iterations, Minuit fits rather 50-100 (which, however, come with useful error estimates).

[Figure 1: time per fit iteration (0.001 s to 10 s, logarithmic scale) versus number of events (up to 500 000) for the Fortran, OpenCL CPU, Brook+ and OpenCL implementations.]

Figure 1. Comparison of the running time of one fit iteration of the benchmark J/ψ → γK⁺K⁻ analysis in different implementations on the same machine. Both OpenCL datasets were obtained using the AMD implementation. The hardware used for the tests is described in section 5.

Table 1. Time taken by an example analysis in various steps of the calculation in the OpenCL implementation on the GPU. For details of the set-up, see section 5. Note that the fit step includes initialization and cleanup; the times are thus not directly comparable with those shown in figures 1 and 2.

Step                                              Time [s], 50 000 events    Time [s], 500 000 events
Start-up (initialization and reading of files)    0.75                       3.67
MC integral                                       1.01                       4.65
Lookup table creation                             1.64                       4.92
Fit (5 iterations)                                0.10                       0.99
Plots                                             1.48                       9.82
Total                                             4.98                       24.05

The AMD/ATI OpenCL implementation also allows the code to be run purely on the CPU, making use of multiple cores and vector units. On our test machine, this leads to a speed-up of more than one order of magnitude with respect to the FORTRAN implementation, but still remains an order of magnitude below the GPU performance.

[Figure 2: time per fit iteration (0 s to 0.06 s) versus number of events (up to 500 000) for the Brook+ and OpenCL implementations.]

Figure 2. Comparison of the running time of one fit iteration of the benchmark J/ψ → γK⁺K⁻ analysis using the Brook+ and the OpenCL framework (in the ATI implementation) on the same machine/GPU. Both OpenCL datasets were obtained using the AMD implementation. The hardware used for the tests is described in section 5.

A rough estimate of the number of floating point operations performed (both in single and double precision) indicates that we reach on the order of 10 GFlop/s on the GPU, about 1% of the theoretical floating point performance of the card; most of the time is thus spent on memory accesses, communication over the PCIe bus and the MINUIT/Fumili step on the CPU. As, however, the run time of a typical fit is dominated by reading input data and producing plots (see table 1 for example total timings), we do not put a high priority on further optimization of the fit proper.

6. Open issues

OpenCL finally offers vendor-independent access to these resources. However, OpenCL leaves many details of the implementation to the vendors, possibly leading to portability issues; our attempt to port the OpenCL version of GPUPWA to an Apple computer running the Snow Leopard operating system (which natively incorporates OpenCL) was hampered by the differing implementations of the vector data types in the CPU part of the code between Apple and AMD (GPUPWA uses the OpenCL vector data types also in the C++ part of the code, e.g. for the initialisation of data structures and event-independent tensor manipulations). Obviously, it would be desirable to overcome these portability issues in the future. It would also be helpful for many high energy physics applications if there were a higher-level interface to the GPU than OpenCL, something like Brook for OpenCL.
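One possible way to contain such differences, sketched below under the assumption that only the buffer-upload boundary needs to know the vendor header's vector layout, is to keep the CPU-side tensor code on a plain, framework-independent four-component type; the type and function names here are illustrative and are not part of GPUPWA.

#include <cstddef>
#include <vector>

// Plain, vendor-independent four-vector used throughout the C++ tensor code.
struct Float4 {
    float x, y, z, w;
};

// Flatten to a contiguous float array in the x, y, z, w order expected by the
// device code; only this routine would need adjusting if a vendor's host-side
// vector type were packed or aligned differently.
std::vector<float> flattenForDevice(const std::vector<Float4>& vectors)
{
    std::vector<float> raw(4 * vectors.size());
    for (std::size_t i = 0; i < vectors.size(); ++i) {
        raw[4 * i + 0] = vectors[i].x;
        raw[4 * i + 1] = vectors[i].y;
        raw[4 * i + 2] = vectors[i].z;
        raw[4 * i + 3] = vectors[i].w;
    }
    return raw;  // handed to the OpenCL buffer upload (e.g. clEnqueueWriteBuffer) elsewhere
}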

Most of the technical issues in PWA⁵ are, however, related to the fitting; typical fits have 20 to 100 free parameters, with all the issues this entails. Most of the parameters represent complex numbers, which are not handled well by the usual fitters. If a Cartesian representation is used, large correlations between the components arise whenever the phase is not well constrained, hampering fit convergence. A polar representation relies on a bounded parameter and a periodic parameter; the first is problematic for the fitter if the bound is a valid value, the second is not handled by the fitters at all and often requires manual intervention. A minimization algorithm for complex numbers (including the proper handling of the derivatives) would be extremely useful here.
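To make the two representations concrete (these are the standard parameterizations, not anything specific to GPUPWA): in Cartesian form

$$ V_\alpha = x_\alpha + i\, y_\alpha, $$

the free parameters x_α and y_α become strongly correlated whenever the phase is poorly constrained, whereas the polar form

$$ V_\alpha = r_\alpha\, e^{i\varphi_\alpha}, \qquad r_\alpha \ge 0,\quad \varphi_\alpha \in [0, 2\pi), $$

trades this for a bounded magnitude and a periodic phase, both of which standard minimizers handle poorly.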
7. Conclusion and outlook

Using massively parallel computing, especially in the form of GPUs, is a powerful tool to overcome speed limits in partial wave analyses. The GPUPWA framework developed at the BES III experiment offers easy access to these resources with a pure C++ interface and all core functionalities provided by the framework. GPUPWA has been ported to use the OpenCL standard to interface parallel computing devices. The speed gain compared to a single-threaded, non-vectorized version is more than an order of magnitude when using multiple threads and vector units on the CPU, and more than two orders of magnitude when employing the GPU. Future development will focus on adding physics functionality to GPUPWA and on making the application truly portable. This will then also allow for hardware comparisons between the major GPU vendors.

Acknowledgments
I would like to thank my colleagues at IHEP who are testing and using GPUPWA for their many helpful suggestions, bug reports and additions to the framework. The work presented was supported by the Chinese Academy of Sciences and the Swiss National Science Foundation.

References
[1] Ablikim M et al. (BESIII) 2010 Nucl. Instr. Meth. A 614 345–399 (Preprint 0911.4960)
[2] Asner D M et al. 2008 Physics at BES-III (Preprint arXiv:0809.1869 [hep-ex])
[3] Bai J Z et al. (BES) 2003 Phys. Rev. D 68 052003 (Preprint hep-ex/0307058)
[4] Segal M and Akeley K 2008 The OpenGL graphics system: A specification (version 3.0) Tech. rep. The Khronos Group URL http://www.opengl.org/registry/doc/glspec30.20080811.pdf
[5] Halfhill T R 2008 Microprocessor Report 01/28/08 1–6
[6] Advanced Micro Devices 2007 AMD stream computing, software stack AMD white paper URL http://ati.amd.com/technology/streamcomputing/firestream-sdk-whitepaper.pdf
[7] Munshi A 2008 The OpenCL specification, version 1.0 Tech. rep. The Khronos Group URL http://www.khronos.org/registry/cl/specs/opencl-1.0.33.pdf
[8] Munshi A 2010 The OpenCL specification, version 1.1 Tech. rep. The Khronos Group URL http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
[9] Brun R and Rademakers F 1997 Nucl. Instrum. Meth. A 389 81–86
[10] James F and Roos M 1975 Comput. Phys. Commun. 10 343–367
[11] FUMILI Tech. rep. CERN Program Library D510
[12] Berger N, Liu B and Wang J 2010 J. Phys. Conf. Ser. 219 042031
[13] 2009 The Jefferson Laboratory Upgrade to 12 GeV: Dedicated Workshop on Hadron Spectroscopy, Seattle URL http://www.int.washington.edu/programs/09-3.html
[14] 2011 Amplitude Analysis in Hadron Spectroscopy, Trento URL http://www.ect.it/

⁵ There is also a wealth of issues related to the physics, especially as to which amplitude models to employ. These issues at the interface between theory, experiment and computing have been discussed in depth in recent workshops in Seattle [13] and Trento [14].