Parallel Maximum Likelihood Fitting Using MPI
Brian Meadows, U. Cincinnati and David Aston, SLAC
Roofit Workshop, SLAC, Dec 6th, 2007
What is MPI?
- Message Passing Interface: a standard for passing messages between processors (CPUs).
- Provides a communications interface for Fortran, C or C++ (and possibly other languages).
- The definitions apply across different platforms (Unix, Mac, etc. can be mixed).
- Parallelization of the code is explicit: it is recognized and defined by the user.
- Memory can be shared between CPUs, distributed among CPUs, OR a hybrid of these.
- The number of CPUs is not pre-defined, but it is fixed in any one application: the user specifies it at job startup and it is not optimized at runtime.
How Efficient is MPI?
- At best, a job is sped up by a factor equal to the number of physical CPUs involved.
- Factors limiting this:
  - Poor synchronization between CPUs due to unbalanced loads.
  - Sections of code that cannot be vectorized (quantified in the note below).
  - Signalling delays.
- NOTE: it is possible to request more CPUs than physically exist, but this produces some processing overhead!
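One way to make the second limitation quantitative (not stated on the slide, but the standard Amdahl-type argument): if a fraction s of the work cannot be parallelized, then with N CPUs the best possible gain is

    \mathrm{Gain}(N) \;\le\; \frac{1}{\,s + (1-s)/N\,} \;\xrightarrow[N\to\infty]{}\; \frac{1}{s} .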
Running MPI
- Run the program with "mpirun <job> -np N", which submits N identical jobs to the system. (You can also specify IP addresses for distributed CPUs.)
- The OS in each machine allocates physical CPUs dynamically as usual.
- Each job is given an ID (0 ... N-1) which it can access; each job needs to be in an identical environment to the others.
- Users can use this ID to label a main job (JOB0, for example) and the remaining satellite jobs, as in the sketch below.
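A minimal sketch of this labeling, in the same Fortran MPI style as the code later in these slides (the program name and the branch structure are only illustrative):

      Program Label_Jobs
      Implicit none
      include 'mpif.h'
      Integer MPIerr, MPIrank, MPIprocs
C     Every copy launched by "mpirun <job> -np N" executes this
      call MPI_INIT(MPIerr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank,  MPIerr)   ! My ID (0 ... N-1)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)   ! Total number of jobs N
      If (MPIrank .eq. 0) Then
C        JOB0: the main job - does everything, farming work out to the others
      Else
C        Satellite job: does only the pieces assigned to it by JOB0
      End If
      call MPI_FINALIZE(MPIerr)
      End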
Fitting with MPI
- For a fit, each job should be structured so that it can run the parts it is required to do:
  - Any set-up (reading in events, etc.)
  - The parts that are vectorized (e.g. its group of events or parameters).
- One job needs to be identified as the main one, JOB0, and must do everything, farming out groups of events or parameters to the others.
- Each satellite job must send results ("signals") back to JOB0 when done with its group, then await the return signal from JOB0 telling it when to start again. (A sketch of this control flow follows.)
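A minimal sketch of that control flow, assuming JOB0 re-broadcasts the current parameters before each pass and the partial results are summed back with MPI_REDUCE (the variable names and the "more" flag are illustrative, not from the original code):

C     Skeleton executed by every job (JOB0 and satellites alike):
  10  Continue
C     JOB0 decides whether another pass is needed and broadcasts that decision
      call MPI_BCAST(more, 1, MPI_LOGICAL, 0, MPI_COMM_WORLD, MPIerr)
      If (.not. more) GO TO 20
C     JOB0 broadcasts the current parameter values to all satellites
      call MPI_BCAST(X, NPAR, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, MPIerr)
C     Each job computes a partial result over its own group of events
      partial = 0.0d0
C     ... accumulate e.g. -2*ln W for this job's events into "partial" ...
C     The partial results are summed on JOB0 (the "signal" back to the main job)
      call MPI_REDUCE(partial, total, 1, MPI_DOUBLE_PRECISION,
     A                MPI_SUM, 0, MPI_COMM_WORLD, MPIerr)
      GO TO 10
  20  Continue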
How MPI Runs
[Diagram: Scatter-Gather running. mpirun starts the jobs; CPU 0 runs throughout, scattering work to CPU 1, CPU 2, ... which compute and then wait while CPU 0 gathers the results; the Start - Scatter - Gather cycle then repeats.]
Ways to Implement MPI in Maximum Likelihood Fitting
Two main alternatives:
  A. Vectorize FCN, which evaluates f(x) = -2 Σ ln W.
  B. Vectorize MINUIT (which finds the best parameters).
- Alternative A has been used in previous BaBar analyses, e.g. the mixing analysis of D0 -> K+ π-.
- Alternative B is reported here (done by DYAEB and tested by BTM).
- An advantage of B over A is that the vectorization is implemented outside a user's code.
- Vectorizing FCN may not be efficient if an integral is computed on each call, unless the integral evaluation is also vectorized.
Vectorize FCN
The log-likelihood always includes a sum

    f(x) = -2 \sum_{i=1}^{n} \ln W_i ,

where n = number of events (or bins).
Vectorize the computation of the sum in 2 steps ("Scatter-Gather"), as sketched below:
- Scatter: divide up the events (or bins) among the N CPUs. Each CPU k computes its partial sum

    S_k = -2 \sum_{i \in \mathrm{CPU}\,k} \ln W_i .

- Gather: re-combine the N CPUs:

    f(x) = \sum_{k=1}^{N} S_k .
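A minimal sketch of this scatter-gather inside FCN, assuming the events are simply divided into contiguous blocks by rank and the per-event weights are available in an array W (all names here are illustrative, not from the original code):

C     Each CPU sums -2 ln W over its own block of events
      nper   = (NEVT - 1)/MPIprocs + 1
      ifirst = 1 + nper*MPIrank
      ilast  = MIN(NEVT, ifirst + nper - 1)
      partial = 0.0d0
      Do i = ifirst, ilast
         partial = partial - 2.0d0*LOG(W(i))
      End Do
C     Gather: sum the partial results from all CPUs into FVAL on CPU 0
      call MPI_REDUCE(partial, FVAL, 1, MPI_DOUBLE_PRECISION,
     A                MPI_SUM, 0, MPI_COMM_WORLD, MPIerr)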
Vectorize FCN
The computation of the normalization integral, \int W(x)\,dx, also needs to be vectorized. This is usually a sum (over bins), so it can be done in a similar way (a sketch follows).
Main advantage of this method: assuming the function evaluation dominates the CPU cycles, the gain coefficient is close to 1.0, independent of the number of CPUs or parameters.
Main disadvantage: it requires that the user code each application appropriately.
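A minimal sketch for the integral, assuming it is approximated by a sum over NBIN bins of an array WBIN and that every CPU needs the result (hence MPI_ALLREDUCE rather than MPI_REDUCE; again, all names are illustrative):

C     Each CPU sums its own block of bins of the normalization integral
      nper   = (NBIN - 1)/MPIprocs + 1
      ifirst = 1 + nper*MPIrank
      ilast  = MIN(NBIN, ifirst + nper - 1)
      part   = 0.0d0
      Do j = ifirst, ilast
         part = part + WBIN(j)
      End Do
C     Combine so that every CPU holds the full integral
      call MPI_ALLREDUCE(part, total, 1, MPI_DOUBLE_PRECISION,
     A                   MPI_SUM, MPI_COMM_WORLD, MPIerr)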
Vectorize MINUIT
Several algorithms are available in MINUIT:
  A. MIGRAD (variable-metric algorithm): finds the local minimum and the error matrix at that point.
  B. SIMPLEX (Nelder-Mead method): a linear-programming-style method.
  C. SEEK (MC method): random search, virtually obsolete.
The most often used is MIGRAD, so we focus on that. It is easily vectorized, but the result may not be at the highest efficiency.
One iteration in MIGRAD
- Compute the function and gradient g at the current position x.
- Use the current curvature metric V (the estimated inverse Hessian) to compute a step:  δ = -V g .
- Take the (large) step:  x' = x + δ .
- Compute the function and gradient there, then (cubic) interpolate back to the local minimum along the step (this may need to be iterated).
- If satisfactory, improve the curvature metric V (see the sketch below).
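For reference, a sketch of the curvature-metric improvement in the standard variable-metric (Davidon-Fletcher-Powell) form; the symbols δ, γ, V and g are as defined above, and MIGRAD's exact bookkeeping may differ in detail:

    \delta = x' - x, \qquad \gamma = g(x') - g(x), \qquad
    V' = V + \frac{\delta\,\delta^{T}}{\delta^{T}\gamma}
           - \frac{V\gamma\,\gamma^{T}V}{\gamma^{T}V\gamma} .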
One iteration in MIGRAD
Most of the time is spent in computing the gradient. Numerical evaluation of the gradient requires 2 FCN calls per parameter:

    G_i = \frac{\partial f}{\partial x_i} \approx \frac{f(x + d_i \hat e_i) - f(x - d_i \hat e_i)}{2 d_i} .

Vectorize this computation in two steps ("Scatter-Gather"):
- Scatter: divide up the parameters x_i among the CPUs. Each CPU computes G_i for its own subset of the parameters.
- Gather: re-combine them from the N CPUs.
Vectorize MIGRAD
- This is less efficient the smaller the number of parameters; it works well if NPAR is comparable to the number of CPUs.
- Gain ≈ NCPU*(NPAR + 2) / (NPAR + 2*NCPU), with maximum gain = NCPU.
[Plot: Gain / Max. Gain vs. number of CPUs (1-8), for NPAR = 5, 10, 50, 100 and 150.]
- For 105 parameters, a factor 3.7 was gained with 4 CPUs (the formula above gives 4*107/113 ≈ 3.8).
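The form of that gain can be understood with a simple cost model (my reading of the formula, not spelled out on the slide): per MIGRAD iteration roughly NPAR + 2 equal units of FCN work are needed, of which the NPAR gradient units are shared among the CPUs while about 2 units (the extra function evaluations of the step and the interpolation) remain serial, so

    \mathrm{Gain} \;=\; \frac{N_{\mathrm{PAR}} + 2}{\,N_{\mathrm{PAR}}/N_{\mathrm{CPU}} + 2\,}
    \;=\; \frac{N_{\mathrm{CPU}}\,(N_{\mathrm{PAR}} + 2)}{N_{\mathrm{PAR}} + 2N_{\mathrm{CPU}}}
    \;\xrightarrow[\;N_{\mathrm{PAR}}\,\gg\, N_{\mathrm{CPU}}\;]{}\; N_{\mathrm{CPU}} .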
Initialization of MPI

      Program FIT_Kpipi
C
C-    Maximum likelihood fit of D -> Kpipi Dalitz plot.
C
      Implicit none
      Save
      external fcn
      include 'mpif.h'

      MPIerr  = 0
      MPIrank = 0
      MPIprocs= 1
      MPIflag = 1

      call MPI_INIT(MPIerr)                                  ! Initialize MPI
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank,  MPIerr)   ! Which one am I?
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)   ! Get number of CPUs

C     ... call MINUIT, etc. ...

      call MPI_FINALIZE(MPIerr)
      End
Use of Scatter-Gather Mechanism in MNDERI (Fortran)

C     Distribute the parameters from proc 0 to everyone:
  33  call MPI_BCAST(X, NPAR+1, MPI_DOUBLE_PRECISION, 0,
     A                MPI_COMM_WORLD, MPIerr)
C
C     Use scatter-gather mechanism to compute a subset of the
C     derivatives in each process:
      nperproc = (NPAR-1)/MPIprocs + 1
      iproc1   = 1 + nperproc*MPIrank
      iproc2   = MIN(NPAR, iproc1+nperproc-1)
      call MPI_SCATTER(GRD,         nperproc, MPI_DOUBLE_PRECISION,
     A                 GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                 0, MPI_COMM_WORLD, MPIerr)
C
C     Loop over this process's variable parameters:
      DO 60 i = iproc1, iproc2
C        ... compute G(i) (2 FCN calls) ...
  60  Continue
C
C     Wait until everyone is done, re-combining the pieces on proc 0:
      call MPI_GATHER(GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                GRD,         nperproc, MPI_DOUBLE_PRECISION,
     A                0, MPI_COMM_WORLD, MPIerr)
C
C     Everyone but proc 0 goes back to await the next set of parameters:
      If (MPIrank .ne. 0) GO TO 33
C
C     Continue computation (CPU 0 only) ...