O.I. Streltsova, D.V. Podgainy, M.V. Bashashin, M.I.Zuev


High Performance Computing Technologies
Lecture, Practical training 9
Parallel Computing with MPI: parallel algorithm for linear algebra

O.I. Streltsova, D.V. Podgainy, M.V. Bashashin, M.I. Zuev
Heterogeneous Computations Team, HybriLIT
Laboratory of Information Technologies, Joint Institute for Nuclear Research
Dubna University
12 April 2018

Serial computing

Traditionally, software has been written for serial computation:
- It runs on a single computer with a single Central Processing Unit (CPU).
- A problem is broken into a discrete series of instructions.
- Instructions are executed one after another.
- Only one instruction may execute at any moment in time.

Parallel computing

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
- It runs on multiple CPUs.
- A problem is broken into discrete parts that can be solved concurrently.
- Each part is further broken down into a series of instructions.
- Instructions from each part execute simultaneously on different CPUs.

Types of parallel machines

- Distributed memory: each processor has its own memory address space. Examples: clusters, Blue Gene/L.
- Shared memory: a single address space for all processors. Examples: IBM-series, multi-core PC. Shared-memory machines use either Uniform Memory Access (CPU0, CPU1 and CPU2 work with one common memory) or Non-Uniform Memory Access (CPU0, CPU1 and CPU2 each have a local memory Mem0, Mem1, Mem2).

MPI is a library standard for programming distributed memory systems.

[Figure: UMA and NUMA shared-memory layouts, and a distributed-memory system built from CPU/memory nodes.]

MPI. Main concept

MPI: Message Passing Interface

MPI is based on data transfer operations. Among the MPI functions there are:
- point-to-point operations between two processes;
- collective communication operations for simultaneous interaction of several processes.

Literature:
1. Grama, Gupta, Karypis, Kumar. Introduction to Parallel Computing, Second Edition.
2. Snir, Otto, Huss-Lederman, Walker, Dongarra. MPI: The Complete Reference, Volume 1, The MPI Core, Second edition. 1998.

MPI: Concepts

- Processes
- Group of processes
- Communicator

MPI_COMM_WORLD is the name of the default MPI communicator:

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

The communicator MPI_COMM_WORLD contains Process 0, Process 1, Process 2, ..., Process size-1. Each process in a communicator is identified by its rank.

The 6 most important MPI commands

- MPI_Init(): initiate an MPI computation
- MPI_Comm_size(): how many processes participate in a given MPI communicator?
- MPI_Comm_rank(): which one am I? (a number between 0 and size-1)
- MPI_Send(): send a message
- MPI_Recv(): receive a message
- MPI_Finalize(): terminate the MPI computation and clean up

MPI. Basic Datatypes

When a message-passing operation is executed, the type of the data being sent or received must be specified. MPI provides a large set of basic datatypes that correspond to the datatypes of the C/C++ and Fortran programming languages.

    MPI datatype    C/C++ type
    MPI_CHAR        signed char
    MPI_DOUBLE      double
    MPI_FLOAT       float
    MPI_INT         int

All available MPI datatypes are listed in the file mpi.h (mpif.h for Fortran). MPI also allows the creation of new derived datatypes for a more precise and compact description of the messages being sent.

Basic needs in parallel programming

In order to do parallel programming, we need basic functionality:
- Start processes
- Send messages
- Receive messages
- Synchronize processes

Language notation

Fortran:

    include "mpif.h"
    call MPI_FUNCTION(parameter, ..., ierr)

Compilation:

    $ mpif77 prog_name.f -o prog_exec

C/C++:

    #include <mpi.h>
    MPI_Function(parameter, ...);

Compilation:

    $ mpicc/mpic++ prog_name.c -o prog_exec

Example MPI program

mpi_hello.c

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv){
        int size, rank;
        MPI_Init(&argc, &argv);                 // Initiate an MPI computation
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // How many processes?
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // The rank value is between 0 and size-1
        printf("hello world! I am process number: %d from %d\n", rank, size);
        MPI_Finalize();                         // Terminate the MPI computation and clean up
        return 0;
    }

Compilation, execution on HybriLIT

Add module:

    $ module add openmpi/1.8.1

Compilation:

    $ mpicc -std=c99 mpi_hello.c -o exec_mpi

Batch script script_mpi:

    #!/bin/sh
    #SBATCH -p tut
    #SBATCH -n 8
    #SBATCH -t 60
    mpiexec ./exec_mpi

Run in batch mode:

    $ sbatch script_mpi

Output, parallel execution

Output:

    Numprocs is 6; Hello from 0
    Numprocs is 6; Hello from 1
    Numprocs is 6; Hello from 5
    Numprocs is 6; Hello from 3
    Numprocs is 6; Hello from 4
    Numprocs is 6; Hello from 2

The processes run concurrently, so the order of the lines is not fixed and can change from run to run.

MPI point-to-point communications

    int MPI_Send(void *buf,              // initial address of send buffer
                 int count,              // number of elements in send buffer
                 MPI_Datatype datatype,  // datatype of each send buffer element
                 int dest,               // rank of destination
                 int tag,                // message tag
                 MPI_Comm comm);         // communicator

    int MPI_Recv(void *buf,              // initial address of receive buffer
                 int count,              // maximum number of elements in receive buffer
                 MPI_Datatype datatype,  // datatype of each receive buffer element
                 int source,             // rank of source
                 int tag,                // message tag
                 MPI_Comm comm,          // communicator
                 MPI_Status *status);    // status object

MPI point-to-point communications. Example

mpi_send-recv.c

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv){
        int size, rank, tag, rc, i;
        char mess[30];
        MPI_Status status;

        rc = MPI_Init(&argc, &argv);
        if (rc != MPI_SUCCESS) {
            printf("error starting MPI program. Terminate.\n");
            MPI_Abort(MPI_COMM_WORLD, rc);
        }

        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        tag = 8;
        if (rank == 0) {
            strcpy(mess, "Hello from process 0!");
            // Send 22 chars: the 21 characters of the text plus the terminating '\0',
            // so that the receiver can print the buffer as a C string
            for (i = 1; i < size; i++)
                MPI_Send(mess, 22, MPI_CHAR, i, tag, MPI_COMM_WORLD);
        } else {
            MPI_Recv(mess, 22, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
            printf("process number: %d message= %s \n", rank, mess);
        }
        MPI_Finalize();
        return 0;
    }

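Jacobi method in solution of SLAE

System of linear algebraic equations:

$$A x = f, \quad A \in \mathbb{R}^{M \times M}$$

Condition of diagonal dominance of the matrix:

$$|a_{ii}| > \sum_{j=1,\, j \ne i}^{M} |a_{ij}|, \quad i = 1, \dots, M$$

Computation of the next approximation (k is the number of the iteration):

$$x_i^{k+1} = \frac{f_i - \sum_{j=1,\, j \ne i}^{M} a_{ij} x_j^{k}}{a_{ii}}, \quad i = 1, \dots, M; \quad k = 0, 1, 2, \dots$$

Criterion for ending the iteration process:

$$\left\| x^{k+1} - x^{k} \right\|_2 = \sqrt{\sum_{i=1}^{M} \left( x_i^{k+1} - x_i^{k} \right)^2} < \varepsilon$$

S.A. Lupin, M.A. Posypkin. Parallel programming. M.: ID FORUM, INFRA-M, p.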

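Used functions

Initialization of the matrix A, the right-hand side f and the initial approximation x of the system $A x = f$, $A \in \mathbb{R}^{M \times M}$:

    void init(double* a, double* f, double* x){
        for(int i = 0; i < M; i++){
            f[i] = 3.0*M - 1.0;
            for(int j = 0; j < M; j++){
                if(i == j) a[i*M+i] = 2.0*M;   // diagonal element
                else a[i*M+j] = 1.0;           // off-diagonal element
            }
        }
        for(int i = 0; i < M; i++) x[i] = 0.0;
        x[0] = 1.0;
    }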
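As a quick check of this test problem (a sketch derived only from the values that init assigns, not from any additional material in the lecture), substituting the vector of ones into the system shows that it is the exact solution, and that the matrix is strictly diagonally dominant:

```latex
% Row i of A x for x = (1, ..., 1)^T, with a_{ii} = 2M and a_{ij} = 1 for j \ne i:
\sum_{j=1}^{M} a_{ij} \cdot 1 = 2M + (M - 1) = 3M - 1 = f_i, \qquad i = 1, \dots, M

% Strict diagonal dominance of A:
|a_{ii}| = 2M > M - 1 = \sum_{j=1,\, j \ne i}^{M} |a_{ij}|
```

This is consistent with the serial code, which fills xexact with 1.0 and measures the error against it.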

Computation of the criterion for ending the iteration process, $\left\| x^{k+1} - x^{k} \right\|_2 = \sqrt{\sum_{i=1}^{M} \left( x_i^{k+1} - x_i^{k} \right)^2}$:

    // Requires #include <math.h> for sqrt
    double evaldiff(double* u, double* v, int m){
        double d = 0.0;
        for(int i = 0; i < m; i++){
            double b;
            b = v[i] - u[i];
            d += b*b;
        }
        return sqrt(d);
    }

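Computation of the next approximation, $x_i^{k+1} = \left( f_i - \sum_{j=1,\, j \ne i}^{M} a_{ij} x_j^{k} \right) / a_{ii}$, $i = 1, \dots, M$; $k = 0, 1, 2, \dots$:

    // a is stored row by row: m rows with n elements each
    void matvec(double* a, double* f, double* xold, double* x, int m, int n){
        for(int i = 0; i < m; i++){   // start at row 0 so that every component is updated
            double sum = 0.0;
            for(int j = 0; j < n; j++){
                if(i != j) sum += a[i*n+j]*xold[j];
            }
            x[i] = (f[i]-sum)/a[i*n+i];
        }
    }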

Serial code

jacobi_01_serial.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <math.h>

    #define MAX_ITERS 200   // The maximum number of iterations
    #define M               // The dimension of the matrix (the value is missing in the source)
    #define EPS 1e-5        // The accuracy of calculations

    int main(int argc, char* argv[]){
        // I       count of iterations
        int I = 0;
        // xold    previous approximation x(k)
        // x       next approximation x(k+1)
        // xexact  exact solution
        // diff    norm of the difference
        // t       program runtime
        double *xold, *x, *xexact, *a, *f, diff, t;

        // Allocate memory for the matrix and vectors
        xold = (double*)malloc(M*sizeof(double));
        x = (double*)malloc(M*sizeof(double));
        xexact = (double*)malloc(M*sizeof(double));
        a = (double*)malloc(M*M*sizeof(double));
        f = (double*)malloc(M*sizeof(double));
        // The initial approximation goes into xold, which the first iteration reads
        init(a, f, xold);

        for(int i = 0; i < M; i++) xexact[i] = 1.0;
        printf(" Size of the matrix= %d, Accuracy= %5.3e\n", M, EPS);
        t = time(NULL);
        // Main loop
        do{
            matvec(a, f, xold, x, M, M);
            diff = evaldiff(xold, x, M);
            I++;
            printf("diff= %8.5f\n", diff);

            // Copy the next approximation to the array of the previous
            memcpy(xold, x, M*sizeof(double));
        } while((diff >= EPS) && (I <= MAX_ITERS));
        // Check for reaching a maximum number of iterations
        if(I >= MAX_ITERS){
            printf("reached a maximum number of iterations\n");
        }

        double err1 = 0.0;   // maximum absolute error
        double err2 = 0.0;   // sum of squared errors
        for(int i = 0; i < M; i++){
            double tmp = fabs(x[i]-xexact[i]);
            err2 += (x[i]-xexact[i])*(x[i]-xexact[i]);
            if(tmp > err1) err1 = tmp;
        }
        double err2sqrt = sqrt(err2);
        // Calculating program runtime
        t = time(NULL) - t;

        printf("error= %12.9f\n", err1);
        printf("norma= %12.9f\n", err2sqrt);
        // Output the number of iterations and program runtime
        printf("%d iterations consumed %lf sec\n", I, t);
        // Free memory
        free(xold); free(x); free(xexact); free(a); free(f);
        return 0;
    }

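MPI collective communication

MPI_Bcast sends the contents of buffer from the root process to all other processes of the communicator:

    int MPI_Bcast(void *buffer,
                  int count,
                  MPI_Datatype datatype,
                  int root,
                  MPI_Comm comm);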

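MPI_Scatter splits sendbuf on the root process into equal parts of sendcount elements and sends the i-th part to the process with rank i:

    int MPI_Scatter(const void *sendbuf,
                    int sendcount,
                    MPI_Datatype sendtype,
                    void *recvbuf,
                    int recvcount,
                    MPI_Datatype recvtype,
                    int root,
                    MPI_Comm comm);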

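MPI_Gather is the inverse operation: the root process collects sendcount elements from every process and stores them in recvbuf in rank order:

    int MPI_Gather(const void *sendbuf,
                   int sendcount,
                   MPI_Datatype sendtype,
                   void *recvbuf,
                   int recvcount,
                   MPI_Datatype recvtype,
                   int root,
                   MPI_Comm comm);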

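Update of the used function

Computation of the next approximation. Each process stores only its own block of rows, so the local row i corresponds to the global row i+rk*sp, where rk is the rank of the process and sp is the number of rows per process:

    // a holds m local rows with n elements each
    void matvec(double* a, double* f, double* xold, double* x, int rk, int sp, int m, int n){
        for(int i = 0; i < m; i++){   // start at row 0 so that every local component is updated
            double sum = 0.0;
            for(int j = 0; j < n; j++){
                if(j != i+rk*sp) sum += a[i*n+j]*xold[j];
            }
            x[i] = (f[i]-sum)/a[i*n+i+rk*sp];
        }
    }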

MPI code. Version 1

jacobi_02_mpi.c

    #include...
    #include <mpi.h>   // Include header for support MPI library
    #define...

    int main(int argc, char* argv[]){
        // size   total number of processes
        // rank   rank of the process
        // chunk  the number of rows processed by each process
        int I = 0, size, rank, chunk;
        // aloc   array to store the local part of the matrix a
        // floc   array to store the local part of the vector f
        // xloc   array to store the local part of the vector x
        // a is allocated on the root process only, so it starts as NULL elsewhere
        double *a = NULL, *aloc, *f, *floc, *xold, *x, *xloc, diff, t;

        // Initialization of MPI
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if(rank == 0){   // Work only root process
            printf(" Simple MPI realization\n");
            printf(" Number of processes= %d\n Size of the matrix= %d, Accuracy= %5.3e\n", size, M, EPS);
            chunk = M/size;
            a = (double*)malloc(M*M*sizeof(double));
        }

        // Broadcasting value of chunk to all processes
        MPI_Bcast(&chunk, 1, MPI_INT, 0, MPI_COMM_WORLD);
        // Allocation memory for arrays and vectors
        aloc = (double*)malloc(chunk*M*sizeof(double));
        f = (double*)malloc(M*sizeof(double));
        floc = (double*)malloc(chunk*sizeof(double));
        xold = (double*)malloc(M*sizeof(double));
        x = (double*)malloc(M*sizeof(double));
        xloc = (double*)malloc(chunk*sizeof(double));

        if(rank == 0){
            init(a, f, xold);
            t = MPI_Wtime();
        }
        // Send array A and the vector f to processes
        MPI_Scatter(a, M*chunk, MPI_DOUBLE, aloc, M*chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Scatter(f, chunk, MPI_DOUBLE, floc, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);


        // Main loop
        do{
            // Broadcasting value of the previous approximation to all processes
            MPI_Bcast(xold, M, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            matvec(aloc, floc, xold, xloc, rank, chunk, chunk, M);
            // Gather of vector x at root process
            MPI_Gather(xloc, chunk, MPI_DOUBLE, x, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

            if(rank == 0){
                diff = evaldiff(x, xold, M);
                printf("diff= %10.7f\n", diff);
                memcpy(xold, x, M*sizeof(double));
            }
            // All processes need diff to evaluate the loop condition
            MPI_Bcast(&diff, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            I++;
        } while((diff >= EPS) && (I <= MAX_ITERS));


        // Output the number of iterations and program runtime
        if(rank == 0){
            t = MPI_Wtime() - t;
            printf("%d iterations consumed %lf sec\n", I, t);
        }
        // Free memory
        if(rank == 0) free(a);
        free(aloc); free(f); free(floc); free(xold); free(x); free(xloc);
        // Finalize of all MPI processes
        MPI_Finalize();
        return 0;
    }

MPI code. Version 1. Optimization

In this variant init fills only the vectors; the matrix is not built on the root process.

    void init(double* f, double* x){
        // f  right hand side vector
        // x  vector of initial approximation to the solution
        for(int i = 0; i < M; i++){
            f[i] = 3.0*M - 1.0;
        }
        for(int i = 0; i < M; i++) x[i] = 0.0;
        x[0] = 1.0;
    }

jacobi_02_mpi_alloc.c

        // Init of matrix aloc on each process
        for(int i = 0; i < chunk; i++){
            for(int j = 0; j < M; j++){
                if(j == i+rank*chunk) aloc[i*M+j] = 2.0*M;   // diagonal element of the global matrix
                else aloc[i*M+j] = 1.0;
            }
        }

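MPI collective communication

MPI_Scatterv is the variant of MPI_Scatter in which the part sent to process i has its own size sendcounts[i] and its own offset displs[i] in sendbuf:

    int MPI_Scatterv(const void *sendbuf,
                     const int *sendcounts,
                     const int *displs,
                     MPI_Datatype sendtype,
                     void *recvbuf,
                     int recvcount,
                     MPI_Datatype recvtype,
                     int root,
                     MPI_Comm comm);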

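MPI_Gatherv is the variant of MPI_Gather in which the part received from process i has its own size recvcounts[i] and is placed at the offset displs[i] in recvbuf:

    int MPI_Gatherv(const void *sendbuf,
                    int sendcount,
                    MPI_Datatype sendtype,
                    void *recvbuf,
                    const int *recvcounts,
                    const int *displs,
                    MPI_Datatype recvtype,
                    int root,
                    MPI_Comm comm);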

MPI code. Version 2

jacobi_03_scatterv_gatherv.c

    #include...
    #define...

    int main(int argc, char* argv[]){
        // chunks  an array, each element of which contains the number of elements in the respective data block
        // disps   an array containing the offsets of the blocks, measured in elements
        int I = 0, size, rank, chunk, *chunks, *disps;
        // a is allocated on the root process only, so it starts as NULL elsewhere
        double *a = NULL, *aloc, *f, *floc, *xold, *x, *xloc, *xexact, *xtemp, diff, t;

        // Initialization of MPI
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if(rank == 0){
            printf(" MPI realization with Scatterv() and Gatherv()\n");
            printf(" Number of processes= %d\n Size of the matrix= %d, Accuracy= %5.3e\n", size, M, EPS);
            chunk = M/size;
            a = (double*)malloc(M*M*sizeof(double));
        }
        // Broadcasting value of "chunk" to all processes
        MPI_Bcast(&chunk, 1, MPI_INT, 0, MPI_COMM_WORLD);

        // Allocation memory for arrays and vectors
        chunks = (int*)malloc(size*sizeof(int));
        disps = (int*)malloc(size*sizeof(int));
        f = (double*)malloc(M*sizeof(double));
        x = (double*)malloc(M*sizeof(double));
        xold = (double*)malloc(M*sizeof(double));
        if(rank == 0){
            init(a, f, xold);
            t = MPI_Wtime();
        }

        // Sizes and offsets of the blocks of the matrix A, in elements
        for(int i = 0; i < size; i++){
            disps[i] = i*chunk*M;
            if(i == (size-1))
                chunks[i] = (M-(size-1)*chunk)*M;   // the last process also takes the remaining rows
            else
                chunks[i] = chunk*M;
        }
        aloc = (double*)malloc(chunks[rank]*sizeof(double));
        MPI_Scatterv(a, chunks, disps, MPI_DOUBLE, aloc, chunks[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // Sizes and offsets of the blocks of the vectors, in elements
        for(int i = 0; i < size; i++){
            disps[i] = i*chunk;
            if(i == (size-1))
                chunks[i] = M-(size-1)*chunk;   // the last process also takes the remaining elements
            else
                chunks[i] = chunk;
        }
        floc = (double*)malloc(chunks[rank]*sizeof(double));
        MPI_Scatterv(f, chunks, disps, MPI_DOUBLE, floc, chunks[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

        xloc = (double*)malloc(chunks[rank]*sizeof(double));
        // Main loop
        do{
            MPI_Bcast(xold, M, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            matvec(aloc, floc, xold, xloc, rank, chunk, chunks[rank], M);
            MPI_Gatherv(xloc, chunks[rank], MPI_DOUBLE, x, chunks, disps, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            if(rank == 0){
                diff = evaldiff(x, xold, M);
                printf("diff= %10.7f\n", diff);
                memcpy(xold, x, M*sizeof(double));
            }

            MPI_Bcast(&diff, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            I++;
        } while((diff >= EPS) && (I <= MAX_ITERS));
        // Output the number of iterations and program runtime
        if(rank == 0){
            t = MPI_Wtime() - t;
            printf("%d iterations consumed %lf sec\n", I, t);
        }
        // Free memory
        if(rank == 0) free(a);
        free(chunks); free(disps); free(f); free(x); free(xold);
        free(aloc); free(floc); free(xloc);
        // Finalize of all MPI processes
        MPI_Finalize();
        return 0;
    }

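MPI collective communication

MPI_Allgatherv works like MPI_Gatherv, but the gathered result is delivered to every process, so there is no root argument:

    int MPI_Allgatherv(const void *sendbuf,
                       int sendcount,
                       MPI_Datatype sendtype,
                       void *recvbuf,
                       const int *recvcounts,
                       const int *displs,
                       MPI_Datatype recvtype,
                       MPI_Comm comm);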

MPI code. Version 3

jacobi_04_allgatherv.c

The main loop will look like the following. The remaining part of the code will not change.

        do{
            matvec(aloc, floc, xold, xloc, rank, chunk, chunks[rank], M);
            MPI_Allgatherv(xloc, chunks[rank], MPI_DOUBLE, x, chunks, disps, MPI_DOUBLE, MPI_COMM_WORLD);
            diff = evaldiff(x, xold, M);
            memcpy(xold, x, M*sizeof(double));
            I++;
        } while((diff >= EPS) && (I <= MAX_ITERS));

Every process now computes diff itself. Since the loop no longer broadcasts xold, every process must hold the initial approximation in xold before the first iteration.

Practical Tasks

- Run the computations with different numbers of MPI processes and plot the speedup and efficiency graphs.
- Implement a broadcast from each process by means of point-to-point messaging.
