Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts. Abdeslem DJAOUI PPD Rutherford Appleton Laboratory

Background to Workshop
Organised by CERN OpenLab and held in the IT Auditorium, 7-8 July 2011; ~30 registered participants, mainly from heavy-ion experiments. Follow-up to a first workshop held at GSI in June 2010.
Aim: how to exploit new processors (CPUs, co-processors) to reduce cost and complexity as well as to speed up triggers and analysis software?
Two tracks:
First day: Algorithms - reconstruction methods (track finding/fitting), ...
Second day: Parallel hardware and software - tools for parallelization/vectorization, the new Intel MIC architecture, ...

First Workshop and Context
Targeted experiments: CBM at GSI (Compressed Baryonic Matter), STAR at BNL (at the Relativistic Heavy Ion Collider), PANDA (Darmstadt) and ALICE (CERN).
Consolidated effort involving:
GSI: algorithms
Uni Frankfurt: vector classes
HEPHY (Vienna): Kalman Filter, track and vertex fit
Intel: Intel Ct (C/C++ for throughput computing)
CERN OpenLab: many-core optimization, benchmarking
Topics: vertexing, track fit, track finding; parallelization on CPUs and GPUs.

The next slides summarize each talk.
Thursday: Algorithms
1) Summary of progress since the 1st workshop
2) GPUs for triggering in the NA62 experiment
3) Algorithm challenges in LHC data analysis
4) Random number generation for MC simulations using GPUs
5) New approaches in track finding/fitting
6) KFParticle and a vertexing toolkit
7) Experience using GPUs for the ATLAS Z-finder
Friday morning: Hardware/Software
8) ArBB: software for scalable programming
9) OpenCL: programming at the right or wrong level
10) Intel MIC accelerator
11) NUMA experience on large many-core servers
12) How to improve (in the HEP community)

1) Progress Since First Workshop
Vertexing: 3D model under development. KFParticle package extended with some routine functionality (reconstruction of decayed-particle parameters) based on the Cellular Automaton (CA) track finder and the Kalman Filter (KF) track fitter (more details in talks 5 and 6).
Track fit: derived from the cbmroot framework; improved stability and speed of track fitting; single-precision Runge-Kutta track propagation implemented.
Track finding: efficiency and speed improved for the CBM and STAR experiments.
Track propagation in magnetic field: 40x speedup on GPUs demonstrated.

2) TDAQ in the NA62 Experiment
NA62: probes decays of charged kaons produced using 400 GeV/c protons.
L0: input rate ~10 MHz, latency ~1 ms (~2 microseconds for ATLAS).
L1: input rate ~1 MHz, latency a few seconds (a few ms for ATLAS).

GPUs in NA62 Triggers (1/2)
Using GPUs at the software levels (L1/L2) is straightforward: just put the video card in the PC! No particular changes to the hardware are needed.
The main advantage is to reduce the number of PCs in the L1/L2 farms.
Various parallel algorithms for ring searches were described and various GPU cards tested (see graph in the slides).
Ring finding in the RICH of NA62: ~1.5 µs per event at L1 using a new (double-ring) algorithm.

GPUs in NA62 Triggers (2/2)
Using GPUs at L0 is more challenging:
Need a very fast algorithm (high rate).
Need event buffering in order to hide the latency of the GPU and to optimize data transfers to the video card; limited by the size of the L0 buffer.
Ring finding in the RICH of NA62: ~50 ns per ring at L0 using a parallel algebraic fit (single ring).
As a further application, the group is investigating the benefit of GPUs for online track fitting at L1, using a cellular automaton for parallelization.
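The per-event work at L0 has to be tiny and uniform so that thousands of GPU threads can run it in lock-step. As a stand-in illustration (not the actual NA62 algorithm, which uses an algebraic least-squares fit), the sketch below shows the kind of per-thread geometry kernel a triplet-based ring finder uses: the circle through three RICH hits.

```cpp
// Hedged illustration of the kind of tiny, branch-light geometry kernel a
// GPU ring finder can run per thread: the circle through three RICH hits.
// A stand-in for illustration, not the NA62 algebraic fit itself.
#include <cmath>

struct Ring { float cx, cy, r; bool ok; };

Ring circle_through_three_hits(float x1, float y1, float x2, float y2,
                               float x3, float y3) {
    // Two linear equations for the centre (cx, cy), from equating distances
    // |C-P1| = |C-P2| and |C-P1| = |C-P3|.
    const float a1 = 2.0f * (x2 - x1), b1 = 2.0f * (y2 - y1);
    const float a2 = 2.0f * (x3 - x1), b2 = 2.0f * (y3 - y1);
    const float c1 = x2 * x2 + y2 * y2 - x1 * x1 - y1 * y1;
    const float c2 = x3 * x3 + y3 * y3 - x1 * x1 - y1 * y1;

    const float det = a1 * b2 - a2 * b1;        // zero if the hits are collinear
    if (std::fabs(det) < 1e-6f) return {0.0f, 0.0f, 0.0f, false};

    const float cx = (c1 * b2 - c2 * b1) / det;
    const float cy = (a1 * c2 - a2 * c1) / det;
    const float r  = std::sqrt((cx - x1) * (cx - x1) + (cy - y1) * (cy - y1));
    return {cx, cy, r, true};
}
```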

3) Algorithm Challenges in LHC Data Analysis
High-level general discussion of HEP data analysis.
Parallelization of RooFit using OpenMP required no drastic changes to the existing code.
RooFit: library for modelling the expected distribution of events.
OpenMP: API for shared-memory multi-process/multi-threaded programming.
This work targets multicore laptops and desktops: ~2.5x speedup without vectorization, ~5x using the Intel compiler for vectorization.
Currently using double precision, but would like to use single precision: up to 2x further speedup from vectorization, though numerical accuracy might be an issue.
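To make the "no drastic changes" point concrete, here is a minimal sketch of parallelizing a negative log-likelihood sum over independent events with an OpenMP reduction; the pdf function and data layout are illustrative, not RooFit's API.

```cpp
// Hedged sketch of the OpenMP change: the per-event likelihood terms are
// independent, so the event loop becomes a parallel reduction.
#include <cmath>
#include <vector>

double evaluate_pdf(double x, double mu, double sigma) {
    // Toy Gaussian pdf; a real fit would evaluate the configured model.
    const double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * 3.14159265358979323846));
}

double negative_log_likelihood(const std::vector<double>& events,
                               double mu, double sigma) {
    double nll = 0.0;
    // Each thread accumulates a partial sum; OpenMP combines them at the end.
    #pragma omp parallel for reduction(+:nll)
    for (long i = 0; i < static_cast<long>(events.size()); ++i) {
        nll -= std::log(evaluate_pdf(events[i], mu, sigma));
    }
    return nll;
}
```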

4) Pseudo-Random Number Generation for MC Simulation Using GPUs
The Mersenne Twister, used in standard ROOT, has very good statistical properties and a period of 2^19937 - 1, but it is not suitable for GPUs: it has a large state that must be updated serially.
A hybrid approach is more suitable: combined Tausworthe and LCG (linear congruential generator).
Test example: multivariate alias sampling.
Device: NVIDIA GeForce GTX 480.
Kernel alone ~1000x faster than the CPU; overall speedup only ~10x once device-host data transfers are included.
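A minimal sketch of such a combined generator follows (the constants are the standard ones for a three-component Tausworthe combined with a common LCG, as popularised for GPUs in GPU Gems 3). On a GPU each thread keeps its own small state and calls this from a device function; it is shown here as plain C++ for clarity and is not the code from the talk.

```cpp
// Hedged sketch of a combined Tausworthe + LCG ("hybrid") generator.
#include <cstdint>

struct RngState {
    // Seeds for the three Tausworthe components must be >= 2, 8 and 16
    // respectively; the LCG component accepts any seed.
    uint32_t z1, z2, z3, z4;
};

static inline uint32_t taus_step(uint32_t& z, int s1, int s2, int s3, uint32_t m) {
    const uint32_t b = ((z << s1) ^ z) >> s2;
    z = ((z & m) << s3) ^ b;
    return z;
}

static inline uint32_t lcg_step(uint32_t& z) {
    z = 1664525u * z + 1013904223u;   // classic 32-bit LCG
    return z;
}

// Returns a uniform float in [0, 1); combined period is roughly 2^121.
static inline float hybrid_taus(RngState& s) {
    const uint32_t x = taus_step(s.z1, 13, 19, 12, 0xFFFFFFFEu)
                     ^ taus_step(s.z2,  2, 25,  4, 0xFFFFFFF8u)
                     ^ taus_step(s.z3,  3, 11, 17, 0xFFFFFFF0u)
                     ^ lcg_step(s.z4);
    return 2.3283064365386963e-10f * x;   // x * 2^-32
}
```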

5) New Approaches in Track Finding/Fitting
Challenges in CBM: 10^7 Au+Au collisions/s, ~1000 charged particles/collision; track reconstruction and vertex search at L1.
Cellular Automaton (CA) as track finder: ideal for many-core CPUs and GPUs.
Benchmarks on the opladev35 node: 4 x AMD Opteron 6164 HE CPUs, 12 cores/CPU.
Larger groups of events per thread use the CPU more efficiently.

Kalman Filter Based Track Fit
Discussed a number of approaches and track propagation methods.
Similar many-core scalability to the CA track finder; for 1000 tracks/thread, achieved 22 ns/track/node on opladev35.
The use of Intel ArBB for reconstruction algorithms in the STAR experiment is under investigation (ArBB: Array Building Blocks library, talk 8).
4D reconstruction (x, y, z, t) for the CBM experiment has started.
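For orientation, the sketch below shows the predict/update cycle the Kalman filter repeats at every detector plane, reduced to one dimension; real track fits propagate a five-parameter state and its covariance (e.g. with the single-precision Runge-Kutta propagation mentioned in talk 1), and the throughput numbers above come from fitting many such tracks per thread.

```cpp
// Hedged, one-dimensional illustration of the Kalman filter predict/update
// cycle.  Real track fitters use a 5-component state vector and a 5x5
// covariance matrix rather than scalars.
struct KalmanState {
    float x;   // estimated parameter (e.g. a track coordinate)
    float P;   // variance of the estimate
};

// Propagate the state to the next measurement plane.
// In a real fit this is the extrapolation through the magnetic field.
void predict(KalmanState& s, float process_noise) {
    // Trivial straight-line model: the estimate is unchanged,
    // but its uncertainty grows by the process noise.
    s.P += process_noise;
}

// Combine the prediction with a measurement m of variance R.
void update(KalmanState& s, float m, float R) {
    const float K = s.P / (s.P + R);   // Kalman gain
    s.x += K * (m - s.x);              // pull the estimate towards the data
    s.P *= (1.0f - K);                 // uncertainty shrinks after the update
}
```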

6) KFParticle and a Vertexing Toolkit
KFParticle was developed based on the ALICE and CBM experiments.
Rewritten using the SIMD model (vectorization): vector arguments to some functions plus for loops; 5x speedup for CBM on multicore CPUs.
Online particle finder for CBM, based on SIMD KFParticle and the SIMD KF track fitter; comparison with the offline model gives practically the same results.
Future plans also discussed, e.g. using KFParticle for 4D tracking development.
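The "vector arguments plus for loops" rewrite amounts to processing a small packet of tracks in lock-step from structure-of-arrays data. A minimal sketch of the pattern follows (hypothetical field names, not KFParticle code); an explicit SIMD vector class, as developed at Uni Frankfurt, is another way to express the same thing.

```cpp
// Hedged sketch of the SIMD rewrite pattern: instead of one track at a time,
// functions take fixed-size packets of tracks laid out structure-of-arrays,
// so the compiler can map the inner loop onto SSE/AVX lanes.
#include <cstddef>

constexpr std::size_t kVecWidth = 4;   // e.g. 4 floats per SSE register

struct TrackPacket {                   // hypothetical layout
    float x[kVecWidth];
    float tx[kVecWidth];               // slope dx/dz
};

// Scalar original: propagate one track by dz along a straight line.
inline void propagate_scalar(float& x, float tx, float dz) {
    x += tx * dz;
}

// Vectorized version: the same body inside a short, vectorizable loop.
inline void propagate_packet(TrackPacket& p, float dz) {
    #pragma omp simd                   // or rely on auto-vectorization
    for (std::size_t i = 0; i < kVecWidth; ++i) {
        p.x[i] += p.tx[i] * dz;
    }
}
```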

7) Experience Using GPUs for the ATLAS Z-Finder Algorithm
Introduction to GPUs, CUDA and GPU projects at Edinburgh.
Implementation of the Z-finder algorithm on GPUs; Kalman Filter for CUDA in ATLAS; Edinburgh feasibility studies (3 slides on Dmitry's work).
Test-case studies of particle tracking in a magnetic field on GPUs: preliminary results show a speedup of ~10 compared to the CPU.
Particle tracking and simulation, in addition to the trigger, are areas where GPUs can be used to improve performance.

8) ArBB: Software for Scalable Programming
ArBB (Array Building Blocks): a C++ library for data-parallel programming on many-core CPUs, SIMD units and Intel MIC.
Containers and parallel operations: vectors and matrices (regular, irregular, sparse).
ArBB programs cannot create thread deadlocks or data races.
Intel compiler vectorization switches were discussed.
ArBB is part of Intel Parallel Building Blocks: Intel Cilk Plus, Intel Threading Building Blocks, Intel ArBB.


9) OpenCL: Programming at the Right or the Wrong Level
Experience with a RooFit implementation.
OpenCL is a superset of C (no C++); host code cannot be called from OpenCL kernels.
Neither the Intel nor the AMD OpenCL implementation offers auto-vectorization on CPUs (as of 1/7/11).
Plan: use C++ for the CPU and OpenCL for the GPU.
OpenCL leads to a larger serial fraction than OpenMP: it needs explicit kernel calls, giving lower parallel efficiency.
OpenCL can be painful to use in legacy C++ code; the main advantage is code reuse between CPU and GPU.
Need to use double precision. Peak performance? Dream on!
Think before using OpenCL on the CPU: the Intel compiler (or GCC) could be a better way to go.
Now investigating a hybrid solution with OpenMP and OpenCL.
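The lower parallel efficiency is essentially Amdahl's law: the extra host-side kernel management enlarges the serial fraction. A small numerical illustration with assumed fractions (not measurements from the talk):

```cpp
// Hedged illustration of Amdahl's law: why a larger serial fraction
// (e.g. from explicit kernel setup and launches) caps the achievable speedup.
// The serial fractions below are assumed for illustration only.
#include <cstdio>

double amdahl_speedup(double serial_fraction, int workers) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers);
}

int main() {
    const int cores = 16;
    // Parallelization with little extra serial work ...
    std::printf("serial  2%% on %d cores: %.1fx\n", cores, amdahl_speedup(0.02, cores));
    // ... versus the same code with a larger serial fraction from host-side
    // kernel management, as described for the OpenCL port.
    std::printf("serial 10%% on %d cores: %.1fx\n", cores, amdahl_speedup(0.10, cores));
    return 0;
}
```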

10) Intel MIC Architecture
MIC: Many Integrated Core; x86-compatible multiprocessors programmed with existing (CPU) tools, eliminating the need for dual programming models.
Has all the advantages of other accelerators for parallel workloads.
Each core is smaller and has a lower power budget, but executes its own instruction stream, allowing complex code including branches, recursion, ...
Prototype (Knights Ferry) available through 2011: up to 32 cores at 1.2 GHz, up to 128 threads (4 threads/core), up to 8 MB shared coherent cache, up to 2 GB of high-bandwidth GDDR5 memory; only single precision supported for now.
Commercial product (Knights Corner) expected late 2012-2013, with more than 50 cores per chip.

11) NUMA Experience on Large Many-core Servers from AMD/Intel
NUMA: Non-Uniform Memory Access; multiple nodes can access each other's memory, but accessing local memory is faster.
Memory latency and bandwidth measurements; OpenMP on NUMA systems using the STREAM bandwidth benchmark.
Compared the thread-affinity switches of the Intel (ICC) compilers:
Compact: assign thread n+1 as near as possible to thread n.
Scatter: distribute the threads as evenly as possible through the system.
Scatter can help performance; it is useful to extract the topology of the system when choosing the switch.
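A minimal STREAM-triad-style loop (a sketch, not the actual benchmark code) shows where affinity and first-touch placement matter on such machines; the KMP_AFFINITY setting selects between the compact and scatter layouts described above.

```cpp
// Hedged sketch of a STREAM-triad-style bandwidth loop on a NUMA server.
// Try e.g.  KMP_AFFINITY=compact ./triad  versus  KMP_AFFINITY=scatter ./triad
#include <cstdio>
#include <memory>
#include <omp.h>

int main() {
    const long n = 1L << 25;   // ~33M doubles per array (~256 MB each)
    std::unique_ptr<double[]> a(new double[n]), b(new double[n]), c(new double[n]);

    // First-touch initialization: each memory page lands on the NUMA node of
    // the thread that writes it first, so use the same schedule as the kernel.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    const double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) a[i] = b[i] + 3.0 * c[i];   // triad kernel
    const double dt = omp_get_wtime() - t0;

    const double bytes = 3.0 * sizeof(double) * n;            // 2 reads + 1 write
    std::printf("triad bandwidth: %.1f GB/s\n", bytes / dt / 1e9);
    return 0;
}
```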

12) How to Improve
HEP computing - expectations from people:
Control numerical accuracy through the compiler and the code: 10^-6 accuracy is fine, 10^-18 may be impossible. In the exascale era you may not get the same result from two different runs using the same input on the same hardware, depending on the order in which floating-point operations are interleaved.
Exploit vector capabilities: every computer is now a vector computer; SIMD widths of up to 16 single-precision lanes on the MIC architecture (512-bit registers); too much performance to ignore.
Which compiler to use: look at the parallelization/vectorization and accuracy-control flags.
Use parallelism with good control of memory: it is rather incomprehensible that we use 1-10 GB of memory to compute an event that is 1 MB in size.
Need to move all the good prototypes back into the mainstream.
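The reproducibility remark comes from the non-associativity of floating-point addition: a parallel reduction that interleaves the same operations in a different order can give a different answer. A tiny self-contained example (illustrative values, not from the talk):

```cpp
// Hedged illustration of why reduction order matters: floating-point addition
// is not associative, so reductions that group operations differently can
// produce different results from the same input.
#include <cstdio>

int main() {
    const double big = 1.0e16, small = 1.0;

    // Left-to-right: each 1.0 is lost against 1e16 (at or below half an ulp).
    const double left_to_right = ((big + small) + small) - big;   // prints 0
    // Grouping the small terms first lets them survive the addition.
    const double small_first   = (big + (small + small)) - big;   // prints 2

    std::printf("((big + 1) + 1) - big = %g\n", left_to_right);
    std::printf("(big + (1 + 1)) - big = %g\n", small_first);
    return 0;
}
```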

Final Thoughts
Short afternoon discussion: many topics tabled, but limited debate.
Action on making the KFParticle package available to users.
Renewed activity in HEP analysis parallelization work.
Vectorization seems to have gained a new lease of life at this workshop: speedups can be achieved with minimal code changes, using standard programming models and compilers.
The Intel MIC architecture will be another boost for vectorization and shared-memory programming (OpenMP).
GPUs have the edge in multi-level triggers (CA, KF), but which GPU and which programming model give performance while avoiding vendor lock-in?
For more details see the full slides: http://indico.cern.ch/conferencedisplay.py?ovw=true&confid=130322