arxiv: v1 [physics.ins-det] 11 Jul 2015

Similar documents
The GAP project: GPU applications for High Level Trigger and Medical Imaging

Track pattern-recognition on GPGPUs in the LHCb experiment

Performance Study of GPUs in Real-Time Trigger Applications for HEP Experiments

Fast pattern recognition with the ATLAS L1Track trigger for the HL-LHC

Tracking and flavour tagging selection in the ATLAS High Level Trigger

First LHCb measurement with data from the LHC Run 2

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

ATLAS NOTE. December 4, ATLAS offline reconstruction timing improvements for run-2. The ATLAS Collaboration. Abstract

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Introduction to GPGPU and GPU-architectures

PoS(High-pT physics09)036

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

3D-Triplet Tracking for LHC and Future High Rate Experiments

REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS

General Purpose GPU Computing in Partial Wave Analysis

Online Tracking Algorithms on GPUs for the. ANDA Experiment at. Journal of Physics: Conference Series. Related content.

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Numerical Algorithms on Multi-GPU Architectures

Tracking and Vertex reconstruction at LHCb for Run II

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Parallel FFT Program Optimizations on Heterogeneous Computers

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

ATLAS, CMS and LHCb Trigger systems for flavour physics

The Compact Muon Solenoid Experiment. Conference Report. Mailing address: CMS CERN, CH-1211 GENEVA 23, Switzerland

Use of hardware accelerators for ATLAS computing

LUNAR TEMPERATURE CALCULATIONS ON A GPU

arxiv: v2 [physics.ins-det] 21 Nov 2017

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Faster Simulations of the National Airspace System

PARALLEL FRAMEWORK FOR PARTIAL WAVE ANALYSIS AT BES-III EXPERIMENT

GPUs and GPGPUs. Greg Blanton John T. Lubia

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Optimization on the Power Efficiency of GPU and Multicore Processing Element for SIMD Computing

arxiv: v1 [physics.comp-ph] 4 Nov 2013

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)

Tesla Architecture, CUDA and Optimization Strategies

Fast 3D tracking with GPUs for analysis of antiproton annihilations in emulsion detectors

Introduction to GPU hardware and to CUDA

Large scale Imaging on Current Many- Core Platforms

A New Segment Building Algorithm for the Cathode Strip Chambers in the CMS Experiment

Primary Vertex Reconstruction at LHCb

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

arxiv: v4 [physics.comp-ph] 15 Jan 2014

Reconstruction Improvements on Compressive Sensing

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

CMS Conference Report

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

Tracking and compression techniques

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

GPU Architecture. Alan Gray EPCC The University of Edinburgh

Multigrid algorithms on multi-gpu architectures

Optimization solutions for the segmented sum algorithmic function


World s most advanced data center accelerator for PCIe-based servers

Integrated CMOS sensor technologies for the CLIC tracker

B. Tech. Project Second Stage Report on

n N c CIni.o ewsrg.au

Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model

Accelerating image registration on GPUs

Performance of the ATLAS Inner Detector at the LHC

GPU Programming Using NVIDIA CUDA

Dense Linear Algebra. HPC - Algorithms and Applications

Accelerating Molecular Modeling Applications with Graphics Processors

Accelerating CFD with Graphics Hardware

Technology for a better society. hetcomp.com

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

Approaches to acceleration: GPUs vs Intel MIC. Fabio AFFINITO SCAI department

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

PoS(EPS-HEP2017)492. Performance and recent developments of the real-time track reconstruction and alignment of the LHCb detector.

high performance medical reconstruction using stream programming paradigms

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Software and computing evolution: the HL-LHC challenge. Simone Campana, CERN

Charged Particle Reconstruction in HIC Detectors

TUNING CUDA APPLICATIONS FOR MAXWELL

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

Splotch: High Performance Visualization using MPI, OpenMP and CUDA

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

CUDA Performance Optimization. Patrick Legresley

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

RAMSES on the GPU: An OpenACC-Based Approach

CMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN

Alignment of the CMS silicon tracker

Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards

MPEXS benchmark results

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura

Stefania Beolè (Università di Torino e INFN) for the ALICE Collaboration. TIPP Chicago, June 9-14

Track reconstruction of real cosmic muon events with CMS tracker detector

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

arxiv: v1 [physics.ins-det] 26 Dec 2017

Massively Parallel Architectures

Transcription:

GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University and INFN, via Irnerio 46, 7 Bologna, Italy INFN-Bologna,v.le Berti Pichat 6/, 7 Bologna, Italy 3 INFN-CNAF,v.le Berti Pichat 6/, 7 Bologna, Italy DOI: will be assigned The LHC experiments are designed to detect large amount of physics events produced with a very high rate. Considering the future upgrades, the data acquisition rate will become even higher and new computing paradigms must be adopted for fast data-processing: General Purpose Graphics Processing Units (GPGPU) is a novel approach based on massive parallel computing. The intense computation power provided by Graphics Processing Units (GPU) is expected to reduce the computation time and to speed-up the low-latency applications used for fast decision taking. In particular, this approach could be hence used for high-level triggering in very complex environments, like the typical inner tracking systems of the multi-purpose experiments at LHC, where a large number of charged particle tracks will be produced with the luminosity upgrade. In this article we discuss a track pattern recognition algorithm based on the Hough Transform, where a parallel approach is expected to reduce dramatically the execution time. Introduction Modern High Energy Physics (HEP) experiments are designed to detect large amount of data with very high rate. In addition to that weak signatures of new physics must be searched in complex background condition. In order to reach these achievements, new computing paradigms must be adopted. A novel approach is based on the use of high parallel computing devices, like Graphics Processing Units (GPU), which delivers such high performance solutions to be used in HEP. In particular, a massive parallel computation based on General Purpose Graphics Processing Units (GPGPU) [] could dramatically speed up the algorithms for charged particle tracking and fitting, allowing their use for fast decision taking and triggering. In this paper we describe a tracking recognition algorithm based on the Hough Transform [, 3, 4] and its implementation on Graphics Processing Units (GPU). Tracking with the Hough Transform The Hough Transform (HT) is a pattern recognition technique for features extraction in image processing, and in our case we will use a HT based algorithm to extract the tracks parameters from the hits left by charged particles in the detector. A preliminary result on this study has been already presented in [5]. Our model is based on a cylindrical multi-layer silicon detector GPUHEP4

installed around the interaction point of a particle collider, with the detector axis on the beamline. The algorithm works in two serial steps. In the first part, for each hit having coordinates (x H, y H, z H ) the algorithm computes all the circles in the x y transverse plane passing through that hit and the interaction point, where the circle equation is x + y Ax By =, and A and B are the two parameters corresponding to the coordinates of the circle centre. The circle detection is performed taking into account also the longitudinal (θ) and polar (φ) angles. For all the θ, φ, A, B, satisfying the circle equation associated to a given hit, the corresponding M H (A, B, θ, φ) Hough Matrix (or Vote Matrix) elements are incremented by one. After computing all the hits, all the M H elements above a given threshold would correspond to real tracks. Thus, the second step is a local maxima search among the M H elements. In our test, we used a dataset of simulated events (pp collisions at LHC energy, Minimum Bias sample with tracks having transverse momentum p T > MeV), each event containing up to particle hits on a cylindrical -layer silicon detector centred on the nominal collision point. The four hyper-dimensions of the Hough space have been binned in 4 6 4 4 along the corresponding A, B, θ, φ parameters. The algorithm performance compared to a χ fit method is shown in Fig. : the ρ = A + B and ϕ = tan (B/A) are shown together with the corresponding resolutions. 3 (a) χ fit Hough Transform 3 3 (b) 5 5 5 3 ρ[m] 3 4 5 6 ϕ [rad] 8 (c) 7 6 σ= % 3.5.4.3.....3.4.5 ρresolution (d) 8 6 σ=.5% 8 6..8.6.4...4.6.8. ϕ resolution Figure : Hough Transform algorithm compared to χ fit. (a) ρ distribution; (b) ϕ distribution; (c) ρ resolution; (d) ϕ resolution. 3 SINGLE-GPU implementation The HT tracking algorithm has been implemented in GPGPU splitting the code in two kernels, for Hough Matrix filling and searching local maxima on it. Implementation has been performed both in CUDA [] and OpenCL [6]. GPGPU implementation schema is shown in Fig.. Concerning the CUDA implementation, for the M H filling kernel, we set a -D grid over all the hits, the grid size being equal to the number of hits of the event. Fixed the (θ, φ) values, a thread-block has been assigned to the A values, and for each A, the corresponding GPUHEP4

B is evaluated. The M H (A, B, θ, φ) matrix element is then incremented by a unity with an atomicadd operation. The M H initialisation is done once at first iteration with cudamallochost (pinned memory) and initialised on device with cudamemset. In the second kernel, the local maxima search is carried out using a -D grid over the θ, φ parameters, the grid dimension being the product of all the parameters number over the maximum number of threads per block (N φ N θ N A N B )/maxthreadsperblock, and -D threadblocks, with dimxblock=n A and dimyblock=maxthreadperblock/n A. Each thread compares one M H (A, B, θ, φ) element to its neighbours and, if the biggest, it is stored in the GPU shared memory and eventually transferred back. With such big arrays the actual challenge lies in optimizing array allocation and access and indeed for this kernel a significant speed up has been achieved by tuning matrix access in a coalesced fashion, thus allowing to gain a crucial computational speed-up. The OpenCL Figure : GPGPU implementation schema of the two Hough Transform algorithm kernels. implementation has been done using a similar structure used for CUDA. Since in OpenCL there is no direct pinning memory, a device buffer is mapped to an already existing memallocated host buffer (clenqueuemapbuffer) and dedicated kernels are used for matrices initialisation in the device memory. The memory host-to-device buffer allocation is performed concurrently and asynchronously, saving overall transferring time. 3. SINGLE-GPU results The test has been performed using the NVIDIA [] GPU boards listed in table. The GTX77 board is mounted locally on a desktop PC, the Tesla K and K are installed in the INFN- CNAF HPC cluster. The measurement of the execution time of all the algorithm components has been carried out as a function of the number of hits to be processed, and averaging the results over independent runs. The result of the test is summarised in Fig. 3. The total execution time comparison between GPUs and CPU is shown in Fig. 3a, while in Fig. 3b the details about GPUHEP4 3

Device NVIDIA NVIDIA NVIDIA specification GeForce GTX77 Tesla Km Tesla Km Performance (Gflops) 33 354 49 Mem. Bandwidth (GB/s) 4. 8 88 Bus Connection PCIe3 PCIe3 PCIe3 Mem. Size (MB) 48 5 8 Number of Cores 536 496 88 Clock Speed (MHz) 46 76 Table : Computing resources setup. the execution on different GPUs are shown. The GPU execution is up to 5 times faster with respect to the CPU implementation, and the best result is obtained for the CUDA algorithm version on the GTX77 device. The GPUs timing are less dependent on the number of the hits with respect to CPU timing. The kernels execution on GPUs is even faster with respect to CPU timing, with two orders of magnitude GPU-CPU speed up, as shown in Figs. 3c and 3e. When comparing the kernel execution on different GPUs (Figs. 3d) and 3f), CUDA is observed to perform slightly better than OpenCL. Figure 3g shows the GPU-to-CPU data transfer timings for all devices together with the CPU I/O timing, giving a clear idea of the dominant part of the execution time. 4 MULTI-GPU implementation Assuming that the detector model we considered could have multiple readout boards working independently, it is interesting to split the workload on multiple GPUs. We have done this by splitting the transverse plane in four sectors to be processed separately, since the data across sectors are assumed to be read-out independently. Hence, a single HT is executed for each sector, assigned to a single GPU, and eventually the results are merged when each GPU finishes its own process. The main advantage is to reduce the load on a single GPU by using lightweight Hough Matrices and output structures. Only CUDA implementation has been tested, using the same workload schema discussed in Sec. 3, but using four M H (A, B, θ), each matrix processing the data of a single φ sector. 4. MULTI-GPU results The multi-gpu results are shown in Fig. 4. The test has been carried out in double configuration, separately, with two NVIDIA Tesla K and two NVIDIA Tesla K. The overall execution time is faster with double GPUs in both cases, even if timing does not scale with the number of GPUs. An approximate half timing is instead observed when comparing kernels execution times. On the other hand, the transferring time is almost independent on the number of GPUs, this leading the overall time execution. 4 GPUHEP4

6 3 (a) Total Execution 3 65 6 55 45 35 (b) Total Execution (GPUs only) 3 3 (c) Hough Matrix Filling 3 (d) Hough Matrix Filling (GPUs only) 3 (e) Local Maxima Search 8 6 (f) Local Maxima Search (GPUs only) 4 3 3 3 6 3 (g) Data Transfer CPU intel i7 GeForce GTX77 CUDA Tesla Km CUDA Tesla Km CUDA GeForce GTX77 OpenCL Tesla Km OpenCL Tesla Km OpenCL 3 Figure 3: Execution timing as a function of the number of analysed hits. (a) Total execution time for all devices; (b) Total execution time for GPU devices only; (c) M H filling time for all devices; (d) M H filling timing for GPU devices only; (e) local maxima search timing for all devices; (f) local maxima search timing for GPU devices only; (g) device-to-host transfer time (GPUS) and I/O time (CPU). 5 Conclusions A pattern recognition algorithm based on the Hough Transform has been successfully implemented on CUDA and OpenCL, also using multiple devices. The results presented in this paper show that the employment of GPUs in situations where time is critical for HEP, like triggering at hadron colliders, can lead to significant and encouraging speed-up. Indeed the problem by itself offers wide room for a parallel approach to computation: this is reflected in the results shown where the speed-up is around 5 times better than what achieved with a normal CPU. There are still many handles for optimising the performance, also taking into account the GPU architecture and board specifications. Next steps of this work go towards an interface to actual experimental frameworks, including the management of the experimental data structures and testing with more graphics accelerators and coprocessor. References [] NVidia Corporation URL http://www.nvidia.com/ NVidia Corporation URL http://www.nvidia.com/object/gpu.html NVidia Corporation URL http://www.nvidia.com/object/cuda home new.html [] P. Hough; 959 Proc. Int. Conf. High Energy Accelerators and Instrumentation C5994 554558 GPUHEP4 5

6 55 (a) Total Execution (b) Hough Matrix Filling 45 35 Tesla Km Single Tesla Km Double Tesla Km Single Tesla Km Double 3 3 3 9 8 (c) Local Maxima Search 45 (d) Data Transfer 7 6 5 35 4 3 3 3 5 3 Figure 4: Execution timing as a function of the number of the hits for multi-gpu configuration. (a) Total execution time; (b) M H filling timing; (c) local maxima search timing; (d) device-tohost transfer time. [3] V Halyo et al; 4 JINST 9 P5 [4] M R Buckley, V Halyo, P Lujan; 4 http://arxiv.org/abs/5.8 P. Hough, 96, United States Patent 369654. [5] S Amerio et al; PoS (TIPP4) 8 [6] OpenCL, Khronos Group URL https://www.khronos.org/opencl/ 6 GPUHEP4