Parallel Maximum Likelihood Fitting Using MPI


Parallel Maximum Likelihood Fitting Using MPI
Brian Meadows, U. Cincinnati, and David Aston, SLAC
RooFit Workshop, SLAC, Dec 6th, 2007

What is MPI?
- Message Passing Interface: a standard defined for passing messages between processors (CPUs).
- Provides a communications interface for Fortran, C, or C++ (and possibly other languages).
- Definitions apply across different platforms (Unix, Mac, etc. can be mixed).
- Parallelization of the code is explicit: it is recognized and defined by the user.
- Memory can be shared between CPUs, distributed among CPUs, or a hybrid of the two.
- The number of CPUs is not pre-defined by the standard, but it is fixed in any one application: the user specifies the required number at job startup, and it does not undergo runtime optimization.

How Efficient is MPI?
- The best you can do is speed up a job by a factor equal to the number of physical CPUs involved.
- Factors limiting this:
  - poor synchronization between CPUs due to unbalanced loads
  - sections of code that cannot be vectorized
  - signalling delays.
- NOTE: it is possible to request more CPUs than physically exist, but this produces some extra processing overhead.
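
As a rough numerical illustration of the load-imbalance point above (the numbers are invented for this example, not taken from the talk): with 4 CPUs, if the most heavily loaded CPU receives 32.5% of the work instead of the balanced 25%, every other CPU waits for it at the gather step, so

\[
\text{speedup} \;=\; \frac{T_{1\,\text{CPU}}}{T_{4\,\text{CPUs}}} \;=\; \frac{1}{0.325} \;\approx\; 3.1
\]

rather than the ideal factor of 4.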

Running MPI
- Run the program with mpirun -np N <job>, which submits N identical jobs to the system. (You can also specify IP addresses for distributed CPUs.)
- The OS in each machine allocates physical CPUs dynamically, as usual.
- Each job is given an ID (0 ... N-1) which it can access; each job needs to run in an environment identical to the others'.
- Users can use this ID to label a main job ("JOB0", for example) and the remaining satellite jobs, as in the sketch below.
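
A minimal sketch of how the job ID (rank) can be used to separate JOB0 from the satellites; the subroutine names RUN_MAIN_JOB and RUN_SATELLITE are hypothetical placeholders, not part of the original program:

      Program RANK_DEMO
      Implicit none
      include 'mpif.h'
      Integer MPIerr, MPIrank, MPIprocs

      call MPI_INIT(MPIerr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank, MPIerr)   ! Which one am I?
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)  ! How many jobs in total?

      If (MPIrank .eq. 0) Then
C        JOB0: drives the fit and farms work out to the others
         call RUN_MAIN_JOB(MPIprocs)
      Else
C        Satellite job: waits for work from JOB0
         call RUN_SATELLITE(MPIrank)
      End If

      call MPI_FINALIZE(MPIerr)
      End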

Fitting with MPI
- For a fit, each job should be structured so that it can run the parts it is required to do:
  - any set-up (reading in events, etc.)
  - the parts that are vectorized (e.g. its group of events or parameters).
- One job needs to be identified as the main one ("JOB0") and must do everything, farming out groups of events or parameters to the others.
- Each satellite job must send its results ("signals") back to JOB0 when done with its group, then await the return signal from JOB0 telling it to start again. A sketch of this loop follows.
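
A minimal sketch of the JOB0/satellite hand-shake described above. PARTIAL_NLL is a hypothetical routine that sums -2 ln W over this job's own group of events, and MPIrank, MPIerr, NPAR and X are assumed to have been set up as on the "Initialization of MPI" slide below; this shows the structure only, it is not the original code:

C     Each job (JOB0 and the satellites) executes this loop.
      Integer NPAR, MPIerr, MPIrank
      Double precision X(200), FPART, FTOT

 10   Continue
C     JOB0 broadcasts the current parameters to every job
      call MPI_BCAST(X, NPAR, MPI_DOUBLE_PRECISION, 0,
     A               MPI_COMM_WORLD, MPIerr)
C     Each job sums -2 ln W over its own group of events only
      call PARTIAL_NLL(X, NPAR, FPART)
C     The partial sums are added; the total lands on JOB0
      call MPI_REDUCE(FPART, FTOT, 1, MPI_DOUBLE_PRECISION,
     A                MPI_SUM, 0, MPI_COMM_WORLD, MPIerr)
C     Satellites loop back and wait for the next parameter set
      If (MPIrank .ne. 0) GO TO 10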

How MPI Runs
[Diagram: scatter-gather running. mpirun starts all jobs; CPU 0 runs throughout, while CPU 1, CPU 2, ... alternate between waiting and working as CPU 0 scatters work to them and gathers the results back (Start, Scatter, Gather).]

Ways to Implement MPI in Maximum Likelihood Fitting
Two main alternatives:
  A. Vectorize FCN, which evaluates f(x) = -2 Σ ln W
  B. Vectorize MINUIT, which finds the best parameters
- Alternative A has been used in previous BaBar analyses, e.g. the mixing analysis of D0 → K+π−.
- Alternative B is reported here (done by DYAEB and tested by BTM).
- An advantage of B over A is that the vectorization is implemented outside the user's code.
- Vectorizing FCN may not be efficient if an integral is computed on each call, unless the integral evaluation is also vectorized.

Vectorize FCN
- The log-likelihood always includes a sum f = -2 Σ ln W over i = 1, ..., n, where n = number of events or bins.
- Vectorize the computation of this sum in 2 steps ("Scatter-Gather"):
  - Scatter: divide the events (or bins) among the CPUs; each CPU computes the partial sum over its own subset.
  - Gather: re-combine the partial sums from the N CPUs into the full sum (written out below).
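
Written out in standard notation (a reconstruction, with W_i the likelihood weight of event or bin i and E_j the set of events assigned to CPU j):

\[
f \;=\; -2\sum_{i=1}^{n}\ln W_i ,
\qquad
f_j \;=\; -2\sum_{i\in E_j}\ln W_i \quad\text{(scatter: computed on CPU } j\text{)},
\qquad
f \;=\; \sum_{j=1}^{N} f_j \quad\text{(gather)} .
\]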

Vectorize FCN (continued)
- The computation of the integral also needs to be vectorized. This is usually a sum (over bins), so it can be done in a similar way (see the sketch below).
- Main advantage of this method: assuming the function evaluation dominates the CPU cycles, the gain coefficient is close to 1.0, independent of the number of CPUs or parameters.
- Main disadvantage: it requires that the user code each application appropriately.
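
A hedged sketch of the corresponding decomposition of the integral, assuming it is approximated by a sum over bins b with weights w_b, with B_j the set of bins handled by CPU j (the symbols are illustrative, not from the slides):

\[
I \;\approx\; \sum_{b} w_b\,W(x_b)
\;=\; \sum_{j=1}^{N} I_j ,
\qquad
I_j \;=\; \sum_{b\in B_j} w_b\,W(x_b) ,
\]

so each CPU evaluates only its own bins, and the partial integrals I_j are added in the gather step.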

Vectorize MINUIT
Several algorithms are available in MINUIT:
  A. MIGRAD (variable-metric algorithm): finds a local minimum and the error matrix at that point.
  B. SIMPLEX (Nelder-Mead method): a direct-search minimization method (not the linear-programming simplex).
  C. SEEK (MC method): random search, virtually obsolete.
- The most often used is MIGRAD, so focus on that.
- It is easily vectorized, but the result may not be at the highest efficiency.

One Iteration in MIGRAD
- Compute the function and gradient at the current position.
- Use the current curvature metric to compute the step direction.
- Take a (large) step along that direction.
- Compute the function and gradient there, then interpolate (cubically) back to the local minimum along the step (this may need to be iterated).
- If satisfactory, improve the curvature metric.
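
A hedged outline of the steps above in standard variable-metric notation (g is the gradient, V the current estimate of the inverse Hessian, i.e. the curvature metric); the update shown is the generic Davidon-Fletcher-Powell rank-two formula, not a transcription of MIGRAD's internal code:

\[
\delta = -\,V\,g(x_k), \qquad
x_{k+1} = x_k + \alpha\,\delta \quad (\alpha\ \text{from a cubic interpolation of}\ F\ \text{along}\ \delta),
\]
\[
V \;\leftarrow\; V + \frac{\Delta x\,\Delta x^{T}}{\Delta x^{T}\Delta g}
\;-\; \frac{V\,\Delta g\,\Delta g^{T}\,V}{\Delta g^{T}V\,\Delta g},
\qquad \Delta x = x_{k+1}-x_k,\quad \Delta g = g(x_{k+1})-g(x_k).
\]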

One Iteration in MIGRAD (continued)
- Most of the time is spent in computing the gradient: numerical evaluation of the gradient requires 2 FCN calls per parameter.
- Vectorize this computation in two steps ("Scatter-Gather"):
  - Scatter: divide the parameters x_i among the CPUs; each CPU computes its own subset of the gradient components.
  - Gather: re-combine the results from the N CPUs (see the sketch below).
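
The two FCN calls per parameter correspond to a central difference, and the scatter/gather splits the parameter index range; this is the decomposition coded in MNDERI on the last slide (d_i denotes the step size used for parameter i):

\[
g_i \;\approx\; \frac{F(x + d_i e_i) \;-\; F(x - d_i e_i)}{2\,d_i},
\qquad i = 1,\dots,\mathrm{NPAR},
\]

with CPU j evaluating only the components i in its own block of consecutive indices, after which the blocks are gathered back into the full gradient vector on CPU 0.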

Vectorize MIGRAD
- This is less efficient the smaller the number of parameters; it works well if NPAR is comparable to the number of CPUs:
    Gain ~ NCPU * (NPAR + 2) / (NPAR + 2*NCPU),   Max. Gain = NCPU
- [Plot: Gain / Max. Gain (%) vs. number of CPUs (1 to 8), one curve each for NPAR = 150, 100, 50, 10 and 5.]
- For 105 parameters, a factor 3.7 was gained with 4 CPUs.
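
As a check of the quoted numbers against the formula above:

\[
\text{Gain} \;\approx\; \frac{\mathrm{NCPU}\,(\mathrm{NPAR}+2)}{\mathrm{NPAR}+2\,\mathrm{NCPU}}
\;=\; \frac{4\times(105+2)}{105+2\times 4}
\;=\; \frac{428}{113} \;\approx\; 3.8 ,
\]

consistent with the measured factor of 3.7 for 105 parameters on 4 CPUs.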

Initialization of MPI

      Program FIT_Kpipi
C
C-    Maximum likelihood fit of D -> Kpipi Dalitz plot.
C
      Implicit none
      Save
      External fcn
      include 'mpif.h'
      Integer MPIerr, MPIrank, MPIprocs, MPIflag

      MPIerr  = 0
      MPIrank = 0
      MPIprocs= 1
      MPIflag = 1

      call MPI_INIT(MPIerr)                                 ! Initialize MPI
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank, MPIerr)   ! Which one am I?
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)  ! Get number of CPUs

C     ... call MINUIT, etc. ...

      call MPI_FINALIZE(MPIerr)
      End

Use of the Scatter-Gather Mechanism in MNDERI (Fortran)

C     Distribute the parameters from proc 0 to everyone
 33   call MPI_BCAST(X, NPAR+1, MPI_DOUBLE_PRECISION, 0,
     A               MPI_COMM_WORLD, MPIerr)
C
C     Use the scatter-gather mechanism to compute a subset of the
C     derivatives in each process:
      nperproc = (NPAR-1)/MPIprocs + 1
      iproc1   = 1 + nperproc*MPIrank
      iproc2   = MIN(NPAR, iproc1+nperproc-1)
      call MPI_SCATTER(GRD, nperproc, MPI_DOUBLE_PRECISION,
     A                 GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                 0, MPI_COMM_WORLD, MPIerr)
C
C     Loop over this process's variable parameters
      DO 60 i = iproc1, iproc2
C        ... compute G(i) ...
 60   Continue
C
C     Wait until everyone is done:
      call MPI_GATHER(GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                GRD, nperproc, MPI_DOUBLE_PRECISION,
     A                0, MPI_COMM_WORLD, MPIerr)
C
C     Everyone but proc 0 goes back to await the next set of parameters
      If (MPIrank .ne. 0) GO TO 33
C
C     Continue the computation (CPU 0 only)