Maximum Likelihood Fits on GPUs S. Jarp, A. Lazzaro, J. Leduc, A. Nowak, F. Pantaleo CERN openlab


Maximum Likelihood Fits on GPUs
S. Jarp, A. Lazzaro, J. Leduc, A. Nowak, F. Pantaleo (CERN openlab)
International Conference on Computing in High Energy and Nuclear Physics 2010 (CHEP2010), October 21st, 2010, Academia Sinica, Taipei
Presented by Alfio Lazzaro

Maximum Likelihood Fits
We have a sample of N events belonging to s different species (signals, backgrounds), and we want to extract the number of events for each species together with other parameters. We use the maximum likelihood fit technique to estimate the values of the free parameters, minimizing the negative log-likelihood (NLL) function

NLL = sum_j n_j - sum_{i=1}^{N} ln( sum_j n_j P_j(x_i; theta_j) )

where:
- j runs over the species (signals, backgrounds)
- n_j is the number of events of species j
- P_j is its probability density function (PDF)
- theta_j are the free parameters of the PDFs
Alfio Lazzaro (alfio.lazzaro@cern.ch)
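As a minimal sketch of this definition (not the RooFit implementation; the function name and the two-species values in the usage note are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Extended negative log-likelihood for s species:
//   NLL = sum_j n_j - sum_{i=1}^{N} ln( sum_j n_j * P_j(x_i) )
// nEvents[j] is the yield n_j; pdfVals[j][i] is the normalized
// PDF value P_j(x_i) of species j evaluated at event i.
double nll(const std::vector<double>& nEvents,
           const std::vector<std::vector<double>>& pdfVals) {
    double result = 0.0;
    for (double nj : nEvents) result += nj;            // sum_j n_j
    const std::size_t nEvt = pdfVals[0].size();
    for (std::size_t i = 0; i < nEvt; ++i) {
        double sum = 0.0;                              // sum_j n_j * P_j(x_i)
        for (std::size_t j = 0; j < nEvents.size(); ++j)
            sum += nEvents[j] * pdfVals[j][i];
        result -= std::log(sum);
    }
    return result;
}
```

For example, with two species of unit yield and P_1 = P_2 = 0.5 for a single event, NLL = 2 - ln(1) = 2.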

MINUIT
Numerical minimization of the NLL using MINUIT (F. James, "MINUIT: Function Minimization and Error Analysis", CERN long write-up D506, 1970). MINUIT uses the gradient of the function to find a local minimum (MIGRAD), requiring:
- the calculation of the gradient of the function for each free parameter, naively 2 function calls per parameter;
- the calculation of the covariance matrix of the free parameters (i.e. the second-order derivatives).
The minimization proceeds in several steps along the Newton direction; each step requires the calculation of the gradient, hence several calls to the NLL.
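The "naively 2 function calls per parameter" corresponds to a central finite-difference gradient; a generic sketch of that idea (MINUIT's actual implementation adds step-size and error heuristics on top):

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Central-difference gradient: exactly 2 function calls per free
// parameter, g_k = (f(p + h e_k) - f(p - h e_k)) / (2h).
std::vector<double> gradient(
        const std::function<double(const std::vector<double>&)>& f,
        std::vector<double> pars, double h = 1e-6) {
    std::vector<double> g(pars.size());
    for (std::size_t k = 0; k < pars.size(); ++k) {
        const double p0 = pars[k];
        pars[k] = p0 + h; const double fp = f(pars);  // first call
        pars[k] = p0 - h; const double fm = f(pars);  // second call
        g[k] = (fp - fm) / (2.0 * h);
        pars[k] = p0;                                 // restore parameter
    }
    return g;
}
```

In a fit, `f` would be the NLL as a function of the free parameters, so each gradient evaluation costs 2 x (#parameters) NLL calls.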

Building models: RooFit
RooFit is a maximum likelihood fitting package (W. Verkerke and D. Kirkby) used for the NLL calculation:
- included in ROOT (details at http://root.cern.ch/drupal/content/roofit);
- allows building complex models and declaring the likelihood function;
- mathematical concepts are represented as C++ objects.
On top of RooFit, another package has been developed for advanced data analysis techniques, RooStats:
- limits and intervals on the Higgs mass and New Physics effects.

Likelihood function calculation in RooFit
1. Read the values of the variables for each event.
2. Calculate the PDFs for each event:
- each PDF has a common interface declared in the class RooAbsPdf, with a virtual method evaluate() that defines the function; each PDF implements this method;
- automatic calculation of the normalization integrals for each PDF;
- calculation of composite PDFs: sums, products, extended PDFs.
3. Loop over all events and calculate the NLL:
- parallel execution over the events (as already implemented).
(Figure: the data as a table of events 0 ... N-1, each with variables var_1 ... var_n.)
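The evaluate()/normalization pattern can be sketched with a stripped-down, hypothetical AbsPdf base class (an illustration of the interface style, not the real RooAbsPdf API):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Simplified illustration: each concrete PDF overrides a virtual
// evaluate(); getVal() applies the normalization; the NLL loop
// calls getVal() once per event (step 3 above).
struct AbsPdf {
    virtual ~AbsPdf() = default;
    virtual double evaluate(double x) const = 0;  // unnormalized shape
    virtual double norm() const = 0;              // normalization integral
    double getVal(double x) const { return evaluate(x) / norm(); }
};

struct Uniform : AbsPdf {  // flat PDF on [lo, hi]
    double lo, hi;
    Uniform(double l, double h) : lo(l), hi(h) {}
    double evaluate(double) const override { return 1.0; }
    double norm() const override { return hi - lo; }
};

// Step 3: loop over the events, summing -ln P for each one.
double nllLoop(const AbsPdf& pdf, const std::vector<double>& events) {
    double nll = 0.0;
    for (double x : events) nll -= std::log(pdf.getVal(x));
    return nll;
}
```

Note the virtual call per event per PDF: this is exactly what the PDF-event-based algorithm below removes.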

Algorithms
Two algorithms implemented:
1. RooFit event-based algorithm (CPU implementation), described above:
- parallelization at the event level, using fork, so no shared resources.
2. PDF-event-based algorithm (NEW):
- GPU implementation (CUDA);
- CPU implementation (OpenMP).
Note: everything is done in double precision.

PDF-event-based algorithm
New approach to the NLL calculation:
1. Read all events and store them in arrays in memory.
2. For each PDF, perform the calculation on all events:
- a corresponding array of results is produced for each PDF;
- the function is evaluated inside the local PDF, i.e. no virtual function call is needed (drawback: more memory is required to store temporary results, 1 double per event per PDF);
- apply the normalization.
3. Combine the arrays of results (composite PDFs).
4. Calculate the NLL.
Parallelization splits the calculation of each PDF over the events:
- particularly suitable for thread parallelism on the GPU, with one thread per PDF/event pair;
- possible benefit from vectorization on the CPU.
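The four steps above can be sketched as follows (an illustrative C++ version with two hard-coded, already-normalized PDFs combined with equal fractions; not the actual implementation):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// PDF-event-based scheme (sketch): one result array per PDF, filled
// in flat loops with no virtual calls; the arrays are then combined
// and reduced into the NLL. Each flat loop maps naturally to one GPU
// thread per PDF/event pair, or to OpenMP/vectorized CPU code.
double nllPdfEventBased(const std::vector<double>& x) {  // step 1: events in an array
    const double pi = std::acos(-1.0);
    const std::size_t n = x.size();
    std::vector<double> resA(n), resB(n), combined(n);

    // Step 2: evaluate each PDF on all events (1 double per event per PDF).
    for (std::size_t i = 0; i < n; ++i)
        resA[i] = std::exp(-0.5 * x[i] * x[i]) / std::sqrt(2.0 * pi);  // unit Gaussian
    for (std::size_t i = 0; i < n; ++i)
        resB[i] = 0.5;                                                 // flat PDF on [-1, 1]

    // Step 3: combine the result arrays (equal-fraction sum of the two PDFs).
    for (std::size_t i = 0; i < n; ++i)
        combined[i] = 0.5 * resA[i] + 0.5 * resB[i];

    // Step 4: reduce the combined array into the NLL.
    double nll = 0.0;
    for (std::size_t i = 0; i < n; ++i) nll -= std::log(combined[i]);
    return nll;
}
```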

Test environment
CPU: Nehalem @ 3.2 GHz, 4 cores (8 hardware threads); OS: SLC5 64-bit; GCC 4.3.4; ROOT trunk (October 11th, 2010).
GPU: ASUS NVIDIA GTX470 (PCIe 2.0), a commodity card (for gamers):
- architecture: GF100 (Fermi)
- memory: 1280 MB GDDR5
- core/memory clock: 607 MHz / 837 MHz
- maximum number of threads per block: 1024
- number of SMs: 14
- CUDA Toolkit 3.1 (06/2010), developer driver 256.40
- power consumption: 200 W; price ~$340

PDFs implemented
1D PDFs commonly used in HEP:
- symmetric and asymmetric Gaussian
- Breit-Wigner
- Crystal Ball function
- Argus
- generic polynomial
- chi-square
Composition of PDFs:
- sum of two or more PDFs
- product of two or more PDFs
- multivariate PDFs
It is very easy to build complex models (via composition) and to add new PDFs.
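Two of these shapes in their standard textbook form (illustrative free functions, not the exact RooFit code):

```cpp
#include <cassert>
#include <cmath>

// Normalized Gaussian PDF.
double gaussian(double x, double mean, double sigma) {
    const double pi = std::acos(-1.0);
    const double t = (x - mean) / sigma;
    return std::exp(-0.5 * t * t) / (sigma * std::sqrt(2.0 * pi));
}

// Normalized (non-relativistic) Breit-Wigner, i.e. a Cauchy PDF
// with full width at half maximum `width`.
double breitWigner(double x, double mean, double width) {
    const double pi = std::acos(-1.0);
    const double hw = 0.5 * width;
    return (hw / pi) / ((x - mean) * (x - mean) + hw * hw);
}
```

Both have an analytical normalization, which matters below: PDFs with numerical integrals are harder to parallelize and are listed as future work in the conclusions.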

GPU implementation
- Data are copied to the GPU once.
- Results for each PDF are resident only on the GPU: the arrays of results are allocated in global memory once and deallocated at the end of the fitting procedure.
- CPU-GPU communication is minimized: only the final results are copied back to the CPU for the final sum that computes the NLL.
Device algorithm performance with a linear polynomial PDF and 1,000,000 events: 45 GFLOPS, with 3.5 GB/s of CPU-GPU data transfer.

1D PDF tests
1,000,000 events and 1000 iterations. The CPU algorithm is the event-based one (RooFit), run sequentially. The GPU time includes the data transfer time (data and results), which is a significant portion of the total and limits the scalability: the more complex the PDF, the larger the portion of time spent in evaluation versus data transfers.

Complex model test
n_a [f_{1,a} G_{1,a}(x) + (1 - f_{1,a}) G_{2,a}(x)] AG_{1,a}(y) AG_{2,a}(z)
+ n_b G_{1,b}(x) BW_{1,b}(y) G_{2,b}(z)
+ n_c AR_{1,c}(x) P_{1,c}(y) P_{2,c}(z)
+ n_d P_{1,d}(x) G_{1,d}(y) AG_{1,d}(z)
17 PDFs in total, 3 variables, 4 components, 35 parameters.
G: Gaussian; AG: asymmetric Gaussian; BW: Breit-Wigner; AR: Argus; P: polynomial.
Note: all PDFs have an analytical normalization integral.

Event-based vs PDF-event-based performance
Driven by the GPU implementation, we implemented a corresponding optimized CPU implementation, which benefits from code optimizations (due to the migration from C++ to C):
- no virtual functions;
- inlining of the evaluate function;
- data organized in C arrays, perfect for vectorization;
- easily parallelized using OpenMP.
The execution time increases linearly with the number of events (as expected). We obtain a speed-up of 34% (almost flat over the number of events) just by optimizing the algorithm, without any parallelization.

PDF-event-based scalability with OpenMP
Test done on a Westmere-EP @ 2.93 GHz (12 cores / 24 threads) with 100,000 events.
- 98.8% of the sequential execution can be parallelized (the remaining 1.2% is required for the initialization of the data and result arrays and for the normalization integral calculations).
- Negligible increase in memory (the arrays are shared).
- Scalability is as expected.
- Using SMT (hardware threading) with 24 threads, we reach 110% efficiency with respect to 12 threads (+32% in the case of ideal speed-up).
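Given a parallelizable fraction p, Amdahl's law S(n) = 1 / ((1 - p) + p/n) gives the ideal speed-up on n cores; with the slide's p = 0.988 it predicts roughly 10.6x on 12 cores. A quick check (the formula is standard; the predicted number is derived here, not quoted from the slide):

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: expected speed-up on n cores when a fraction p of
// the sequential execution can be parallelized.
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

This is the sense in which the measured scaling is "as expected": the 1.2% serial fraction caps the 12-core speed-up well below the ideal 12x.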

PDF-event-based: GPU vs OpenMP
A fair comparison:
- same algorithm;
- the algorithm on the CPU is optimized and parallelized (4 threads);
- the CPU does the final sum of the NLL and the normalization integral calculations;
- we check that the results are compatible: asymmetry less than 10^-12.
Time breakdowns from the two measured configurations: 36% GPU kernels / 60% CPU time / 4% transfers, and 68% GPU kernels / 21% CPU time / 11% transfers.
The speed-up increases with the size of the sample, benefiting from the data streaming on the GPU and from the integral calculation being done only on the CPU: ~3x for small samples, up to ~7x for large samples.

Conclusion
- Implementation of a CUDA algorithm to calculate the NLL on the GPU, as part of the RooFit package; it requires only modest changes to the existing RooFit code.
- New design of the algorithm for PDF-event parallelism.
- The CUDA implementation led us to develop an OpenMP implementation of the same PDF-event algorithm on the CPU:
  - with 1 thread, +34% better performance with respect to the RooFit implementation;
  - in our tests, the GPU implementation gives a >3x speed-up (~7x for large samples) with respect to OpenMP with 4 threads.
- Note that our target is running fits at the user level on the GPU of small systems (laptops), i.e. with a small number of CPU cores.
- This is preliminary work (done mainly by our summer student, Felice). There is still a lot to do; some examples:
  - simultaneous fits with index variables;
  - more complex tests;
  - parallelization of PDFs with numerical integrals;
  - further optimization on the GPU (better treatment of the memory);
  - last but not least: inserting the code into the official RooFit/ROOT release.

Backup Slides

CUDA inside RooFit: CPU and GPU code

CUDA inside RooFit: GPU code

- RooMinimizer: interface to MINUIT; calculates the gradient of the NLL (2 x #parameters iterations).
- RooNLLVar: performs the loop over the N events (i = 1 ... N); for each event it calculates the P's (N iterations).
- RooAbsData: reads the variables of event i.
- RooAbsPdf: calculates the log term (getLogVal method); evaluates the function (public virtual getVal method); propagates the evaluation to all P's (through the specialized public getVal method of each P); the function itself is calculated with the protected virtual evaluate method, defined for each P; then the normalization is applied.
The event loop repeats until i = N; the outer loop repeats until the gradient is done.