GPU applications within HEP

Size: px

Start display at page:

Download "GPU applications within HEP"

Oswald Curtis
5 years ago
Views:

1 GPU applications within HEP Liang Sun Wuhan University The 5th International Summer School on TeV Experimental Physics (istep), Wuhan

2 Outline Basic concepts GPU, CUDA, Thrust GooFit introduction Hydra introduction Other applications 2

3 CPUs and GPUs The CPU (central processing unit) carries out all the arithmetic and computing functions of a computer. Principal components of a CPU: arithmetic logic unit (ALU),registers and a control unit The GPU (graphics processing unit) is specialized processor designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer. Modern GPUs have a highly parallel structure and are more efficient than generalpurpose CPUs for algorithms where the processing of large blocks of data is done in parallel 3

4 CPUs and GPUs 4

5 Concurrency The ability to execute or solve different parts of a program, an algorithm or a problem in out-of-order or in partial order, without affecting the final outcome Concurrent routines can be executed in parallel Significant improvement in the overall performance of the execution in multi-processor, multi-core and multi-thread systems Design of concurrent programs and algorithms requires reliable techniques for coordinating instruction execution, data exchange, memory allocation and execution scheduling to minimize response time and maximize throughput Issues: race conditions, deadlocks, resource starvation etc... 5

6 Motivation for massively parallel platforms in HEP A large fraction of the software used in HEP is legacy. It consists of libraries of single threaded, Fortran and C++03 mono-platform routines HEP experiments keep collecting samples with unprecedented large statistics. Data analyses get more and more complex. Not rarely, a calculation spend days to reach a result, which often needs re-tuning Processors will not increase clock frequency any more. The current road-map to increase overall performance is to deploy concurrency 6

7 Geforce GTX 1080 Ti GTX TITAN Z 7

8 What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions to enable heterogeneous programming Straightforward APIs to manage devices, memory etc. 8

9 Heterogeneous Computng Terminology: Host The CPU and its memory (host memory) Device The GPU and its memory (device memory) Host Device NVIDIA 2013

$RADIUS 3 3 #define #define BLOCK_SIZE BLOCK_SIZE 16 16 global global void void stencil_1d(int stencil_1d(int *in, *in, int int *out) *out) {{ shared shared int int temp[block_size temp[block_size +$ +2 2 ** RADIUS]; RADIUS]; int int gindex gindex = = threadidx.x threadidx.x + + blockidx.x blockidx.x ** blockdim.x; blockdim.x; int lindex = threadidx.x int lindex = threadidx.

+2 2 ** RADIUS]; RADIUS]; int int gindex gindex = = threadidx.x threadidx.x + + blockidx.x blockidx.x ** blockdim.x; blockdim.x; int lindex = threadidx.x int lindex = threadidx.

10 Heterogeneous Computng #include #include <iostream> <iostream> #include <algorithm> #include <algorithm> using using namespace namespace std; std; #define 1024 #define N N 1024 #define RADIUS #define RADIUS 3 3 #define #define BLOCK_SIZE BLOCK_SIZE global global void void stencil_1d(int stencil_1d(int *in, *in, int int *out) *out) {{ shared shared int int temp[block_size temp[block_size ** RADIUS]; RADIUS]; int int gindex gindex = = threadidx.x threadidx.x + + blockidx.x blockidx.x ** blockdim.x; blockdim.x; int lindex = threadidx.x int lindex = threadidx.x + + RADIUS; RADIUS; RADIUS]; RADIUS]; BLOCK_SIZE]; BLOCK_SIZE]; // // Read Read input input elements elements into into shared shared memory memory temp[lindex] = temp[lindex] = in[gindex]; in[gindex]; if (threadidx.x (threadidx.x < < RADIUS) RADIUS) {{ if temp[lindex RADIUS] temp[lindex - RADIUS] = = in[gindex in[gindex -temp[lindex temp[lindex + + BLOCK_SIZE] BLOCK_SIZE] = = in[gindex in[gindex + + }} // // Synchronize Synchronize (ensure (ensure all all the the data data is is available) available) syncthreads(); syncthreads(); // // Apply Apply the the stencil stencil int int result result = = 0; 0; for for (int (int offset offset = = -RADIUS -RADIUS ;; offset offset <= <= RADIUS RADIUS ;; offset++) offset++) result += += temp[lindex temp[lindex + + offset]; offset]; result }} parallel fn // // Store Store the the result result out[gindex] out[gindex] = = result; result; void void fill_ints(int fill_ints(int *x, *x, int int n) n) {{ fill_n(x, fill_n(x, n, n, 1); 1); }} int int main(void) main(void) {{ int // int *in, *in, *out; *out; // host host copies copies of of a, a, b, b, cc int *d_in, // int *d_in, *d_out; *d_out; // device device copies copies of of a, a, b, b, cc int size size = = (N (N + + 2*RADIUS) 2*RADIUS) ** sizeof(int); int sizeof(int); // // Alloc Alloc space space for for host host copies copies and and setup setup values values in = in = (int (int *)malloc(size); *)malloc(size); fill_ints(in, fill_ints(in, N N+ + 2*RADIUS); 2*RADIUS); out = = (int (int *)malloc(size); *)malloc(size); fill_ints(out, fill_ints(out, N N+ + 2*RADIUS); 2*RADIUS); out // // Alloc Alloc space space for for device device copies copies cudamalloc((void cudamalloc((void **)&d_in, **)&d_in, size); size); cudamalloc((void cudamalloc((void **)&d_out, **)&d_out, size); size); // // Copy Copy to to device device cudamemcpy(d_in, cudamemcpy(d_in, in, in, size, size, cudamemcpyhosttodevice); cudamemcpyhosttodevice); cudamemcpy(d_out, cudamemcpy(d_out, out, out, size, size, cudamemcpyhosttodevice); cudamemcpyhosttodevice); d_out d_out + + RADIUS); RADIUS); // // Launch Launch stencil_1d() stencil_1d() kernel kernel on on GPU GPU stencil_1d<<<n/block_size,block_size>>>(d_in + stencil_1d<<<n/block_size,block_size>>>(d_in + RADIUS, RADIUS, serial code // // Copy Copy result result back back to to host host cudamemcpy(out, d_out, cudamemcpy(out, d_out, size, size, cudamemcpydevicetohost); cudamemcpydevicetohost); }} // // Cleanup Cleanup free(in); free(in); free(out); free(out); cudafree(d_in); cudafree(d_out); cudafree(d_out); cudafree(d_in); return 0; 0; return parallel code serial code NVIDIA 2013

11 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory NVIDIA 2013

12 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance NVIDIA 2013

13 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance 3. Copy results from GPU memory to CPU memory NVIDIA 2013

14 What is Thrust? High-Level Parallel Algorithms Library Parallel Analog of the C++ Standard Template Library (STL) Performance-Portable Abstracton Layer Productve way to program CUDA 14

15 Example #include #include #include #include <thrust/host_vector.h> <thrust/device_vector.h> <thrust/sort.h> <cstdlib> int main(void) { // generate 32M random numbers on the host thrust::host_vector<int> h_vec(32 << 20); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to the device thrust::device_vector<int> d_vec = h_vec; // sort data on the device thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); return 0; }

16 Another example Containers host_vector device_vector Memory Management Allocaton Transfers // allocate host vector with two elements thrust::host_vector<int> h_vec(2); // copy host data to device memory thrust::device_vector<int> d_vec = h_vec; // write device values from the host d_vec[0] = 27; d_vec[1] = 13; // read device values from the host int sum = d_vec[0] + d_vec[1]; // invoke algorithm on device thrust::sort(d_vec.begin(), d_vec.end()); Algorithm Selecton Locaton is implicit // memory automatically released

17 Easy to Use Distributed with CUDA Toolkit Header-only library Architecture agnostc Just compile and run! $ nvcc -O2 -arch=sm_35 program.cu -o program

18 Portability Support for CUDA, TBB and OpenMP Just recompile! nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_HOST_SYSTEM_OMP GeForceNVIDA GTX 280 GeForce GTX 580 Core2 QuadIntel Q6600 Core i7 2600K $ time./monte_carlo pi is approximately $ time./monte_carlo pi is approximately real 0m6.190s user 0m6.052s sys 0m0.116s real 1m26.217s user 11m28.383s sys 0m0.020s

allows a PDF (GooPDF object) to be evaluated in parallel Architecture: Program flow: 19

19 GooFit (v1) introduction GooFit: an open-source project originally developed by R. Andreassen and funded by NSF FitManager object as the interface between MINUIT and a GPU which allows a PDF (GooPDF object) to be evaluated in parallel Architecture: Program flow: 19 "Implementation of a Thread-Parallel, GPU-Friendly Function Evaluation Library," R. Andreassen et al., IEEE Access, v.2, 2014.

20 Analogy with RooFit Code structure similar to RooFit framework, the overall fit set-up and running are familiar for RooFit users 20

21 GooFit PDFs Simple PDFs: ARGUS, correlated Gaussian, Crystal Ball, exponential, Gaussian, Johnson SU, polynomial, relativistic Breit-Wigner, scaled Gaussian, smoothed histogram, step function, Voigtian Composites: You can write your own PDFs based on the example PDF code with relative ease Specialized mixing PDFs: Coherent amplitude sum, incoherent sum, truth resolution, three-gaussian resolution, Dalitz-plot region veto, threshold damping function TddpPdf (DalitzPlotPdf) as the main engine for time-dependent (-integrated) Dalitz-plot (DP) fits, with a list of ResonancePdf objects as input to describe different (non-)resonance amplitudes 21

22 Gaussian PDF as an example Side note: fptype is just a typedef for double - this allows quick switching between double and float precision 22

Time-dependent amplitude analysis 0 0 of D πππ First physics analysis using GooFit Measurement on D0 mixing parameters x and y using an unbinned maximum likelihood fit Signal PDF

23 Time-dependent amplitude analysis 0 0 of D πππ First physics analysis using GooFit Measurement on D0 mixing parameters x and y using an unbinned maximum likelihood fit Signal PDF (TddpPdf): Timedependent mixing function involving coherent sums of BreitWigner amplitudes (ResonancePdf) A total of 138k data events from BABAR experiment Final results: PRD 93, (2016) 23

PRD 93, 112014 (2016) Fit projections Data DP Fit pulls The data fit takes ~1 min to

24 PRD 93, (2016) Fit projections Data DP Fit pulls The data fit takes ~1 min to complete with Nvidia Tesla K40c, a speed-up of ~300 over the original CPU version 24

25 25

26 Recent GooFit developments 26

27 How GooFit v2 works? 27

28 Code examples 28

29 29

30 GooFit (python) documentation 30

31 What is Hydra? 31

32 Hydra features 32

33 Hydra example I: Gaussian + Argus 33

34 Hydra example I: Gaussian + Argus 34

35 Hydra example I: Gaussian + Argus 35

36 Hydra example II: D+ K-π+π+ amplitudes 36

37 Hydra example II: D+ K-π+π+ amplitudes 37

38 Hydra example II: D+ K-π+π+ amplitudes Now the fit model: 38

39 Hydra example II: D+ K-π+π+ amplitudes 39

40 Hydra example II: D+ K-π+π+ amplitudes Fit projections: 40

41 Hydra example II: D+ K-π+π+ amplitudes Performance: CPU with CUDA 41

42 Hydra example II: D+ K-π+π+ amplitudes Performance: CPU with OpenMP 42

43 Hydra example II: D+ K-π+π+ amplitudes Performance: CPU with TBB 43

44 Summary & resources GooFit and Hydra are two example tools which can make your fitting hundreds of times faster A number of physics analyses have benefited from GPU acceleration Growing interests in GPUs within HEP community Useful resources: CUDA Thrust: GooFit: Hydra: 44

Massively Parallel Algorithms

Massively Parallel Algorithms Introduction to CUDA & Many Fundamental Concepts of Parallel Programming G. Zachmann University of Bremen, Germany cgvr.cs.uni-bremen.de Hybrid/Heterogeneous Computation/Architecture