Parallel Programming with MPI. Jianfeng Yang, Internet and Information Technology Lab, Wuhan University


1 Parallel Programming with MPI Jianfeng Yang, Internet and Information Technology Lab, Wuhan University

2 Agenda Part Ⅰ: Seeking Parallelism/Concurrency Part Ⅱ: Parallel Algorithm Design Part Ⅲ: Message-Passing Programming 2

3 Part Ⅰ Seeking Parallelism/Concurrency

4 Outline 1 Introduction 2 Seeking Parallelism 4

5 1 Introduction(1/6) Well done is quickly done. (Caesar Augustus) Fast, fast, fast is not fast enough. How to get higher performance? Parallel computing. 5

6 1 Introduction(2/6) What is parallel computing? is the use of a parallel computer to reduce the time needed to solve a single computational problem. is now considered a standard way for computational scientists and engineers to solve problems in areas as diverse as galactic evolution, climate modeling, aircraft design, molecular dynamics and economic analysis. 6

7 Parallel Computing A problem is broken down into tasks, performed by separate workers or processes. Processes interact by exchanging information. What do we basically need? The ability to start the tasks, and a way for them to communicate. 7

8 1 Introduction(3/6) What's a parallel computer? It is a multi-processor computer system supporting parallel programming. Multi-computer: a parallel computer constructed out of multiple computers and an interconnection network; the processors on different computers interact by passing messages to each other. Centralized multiprocessor (SMP: symmetrical multiprocessor): a more highly integrated system in which all CPUs share access to a single global memory; the shared memory supports communication and synchronization among processors. 8

9 1 Introduction(4/6) Multi-core platform Integrates two, four, or more cores in one processor; each core has its own registers and Level 1 cache, while all cores share the Level 2 cache, which supports communication and synchronization among cores. All cores share access to a global memory. 9

10 1 Introduction(5/6) What's parallel programming? It is programming in a language that allows you to explicitly indicate how different portions of the computation may be executed in parallel/concurrently by different processors/cores. Do I really need parallel programming? YES, because: although a lot of research has been invested and many experimental parallelizing compilers have been developed, there is still no commercial system thus far. The alternative is to write your own parallel programs. 10

11 1 Introduction(6/6) Why should I program using MPI and OpenMP? MPI (Message Passing Interface) is a standard specification for message-passing libraries; it is available on virtually every parallel computer system, and it is free. If you develop programs using MPI, you will be able to reuse them when you get access to a newer, faster parallel computer. On a multi-core platform or SMP, the cores/CPUs have a shared memory space. While MPI is a perfectly satisfactory way for cores/processors to communicate with each other, OpenMP is a better way for cores within a single processor/SMP to interact. A hybrid MPI/OpenMP program can achieve even higher performance. 11

12 2 Seeking Parallel(1/7) In order to take advantage of multi-core/multiple processors, programmers must be able to identify operations that may be performed in parallel. Several ways: Data Dependence Graphs Data Parallelism Functional Parallelism Pipelining 12

13 2 Seeking Parallel(2/7) Data Dependence Graphs A directed graph. Each vertex represents a task to be completed. An edge from vertex u to vertex v means: task u must be completed before task v begins; task v is dependent on task u. If there is no path from u to v, then the tasks are independent and may be performed in parallel. 13

14 2 Seeking Parallel(3/7) Data Dependence Graphs 14

15 2 Seeking Parallel(4/7) Data Parallelism Independent tasks applying the same operation to different elements of a data set. e.g. 15

16 2 Seeking Parallel(5/7) Functional Parallelism Independent tasks applying different operations to different data elements of a data set. 16

17 2 Seeking Parallel(6/7) Pipelining A data dependence graph forming a simple path/chain admits no parallelism if only a single problem instance must be processed. If multiple problem instances are to be processed, and the computation can be divided into several stages with the same time consumption, then pipelining can provide parallelism. E.g., an assembly line. 17

18 2 Seeking Parallel(7/7) Pipelining p[0] = a[0]; p[1] = p[0]+a[1]; p[2] = p[1]+a[2]; p[3] = p[2]+a[3]; 18
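Anticipating the MPI routines introduced in Part Ⅲ, here is a minimal sketch of this pipelined partial-sum chain, assuming each process holds one element (the values are illustrative) and forwards the running sum to its successor:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, a, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    a = rank + 1;                     /* illustrative data: a[i] = i + 1 */
    if (rank == 0) {
        p = a;                        /* p[0] = a[0] */
    } else {
        /* receive p[rank-1] from the previous stage, then add a[rank] */
        MPI_Recv(&p, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        p = p + a;
    }
    if (rank < size - 1)              /* forward the partial sum to the next stage */
        MPI_Send(&p, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    printf("process %d: p = %d\n", rank, p);
    MPI_Finalize();
    return 0;
}

With a single problem instance the stages run one after another; the pipeline pays off when many instances stream through the chain, as in the assembly-line analogy above.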

19 For Example: Landscape maintenance; preparing for dinner; data clustering. 19

20 Homework Given a task that can be divided into m subtasks, each requiring one unit of time, how much time is needed for an m-stage pipeline to process n tasks? Consider the data dependence graph in the figure below: identify all sources of data parallelism; identify all sources of functional parallelism. [Figure: data dependence graph with vertices I, A, A, A, B, C, D, A, A, A, O] 20

21 Part Ⅱ Parallel Algorithm Design

22 Outline 1. Introduction 2. The Task/Channel Model 3. Foster's Design Methodology 22

23 1.Introduction Foster, Ian. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Reading, MA: Addison-Wesley, 1995. Describes the Task/Channel Model and a few simple problems. 23

24 2.The Task/Channel Model The model represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels. Task: a program, its local memory, and a collection of I/O ports. Local memory: instructions and private data. 24

25 2.The Task/Channel Model Channel: via channels, a task can send local data to other tasks through output ports and receive data values from other tasks through input ports. A channel is a message queue: it connects one task's output port with another task's input port. Data values appear at the input port in the same order in which they were placed in the output port at the other end of the channel. Receiving data can block: synchronous. Sending data never blocks: asynchronous. Access to local memory is faster than nonlocal data access. 25

26 3.Foster's Design Methodology Four-step process: Partitioning, Communication, Agglomeration, Mapping. [Figure: the problem passes through Partitioning, Communication, Agglomeration, and Mapping.] 26

27 3.Foster's Design Methodology Partitioning Is the process of dividing the computation and the data into pieces; more, smaller pieces are better. How? By a data-centric approach (domain decomposition) or a function-centric approach (functional decomposition). Domain Decomposition: first, divide the data into pieces; then, determine how to associate computations with the data. Focus on the largest and/or most frequently accessed data structure in the program. Functional Decomposition: see the following slides. 27

28 3.Foster's Design Methodology Domain Decomposition [Figure: 1-D, 2-D, and 3-D decompositions of a data set into primitive tasks; the 3-D decomposition yields more primitive tasks and is better.] 28

29 3.Foster's Design Methodology Functional Decomposition Yields collections of tasks that achieve parallelism through pipelining. E.g., a system supporting interactive image-guided surgery. 29

30 3.Foster's Design Methodology The quality of the partition (evaluation): There are at least an order of magnitude more primitive tasks than processors in the target parallel computer; otherwise, later design options may be too constrained. Redundant computations and redundant data structure storage are minimized; otherwise, the design may not work well when the size of the problem increases. Primitive tasks are roughly the same size; otherwise, it may be hard to balance work among the processors/cores. The number of tasks is an increasing function of the problem size; otherwise, it may be impossible to use more processors/cores to solve larger problems. 30

31 3.Foster's Design Methodology Communication After identifying the primitive tasks, the type of communication among those primitive tasks should be determined. Two kinds of communication: local and global. 31

32 3.Foster's Design Methodology Communication Local: a task needs values from a small number of other tasks in order to perform a computation; a channel is created from the tasks supplying the data to the task consuming the data. Global: a significant number of the primitive tasks must contribute data in order to perform a computation. E.g., computing the sum of the values held by the primitive processes. 32

33 3.Foster's Design Methodology Communication Evaluate the communication structure of the designed parallel algorithm: the communication operations are balanced among the tasks; each task communicates with only a small number of neighbors; tasks can perform their communications in parallel/concurrently; tasks can perform their computations in parallel/concurrently. 33

34 3.Foster's Design Methodology Agglomeration Why do we need agglomeration? If the number of tasks exceeds the number of processors/cores by several orders of magnitude, simply creating these tasks would be a source of significant overhead. So, combine primitive tasks into larger tasks and map them onto physical processors/cores to reduce the amount of parallel overhead. What's agglomeration? It is the process of grouping tasks into larger tasks in order to improve performance or simplify programming. When developing MPI programs, ONE task per core/processor is usually best. 34

35 3.Foster's Design Methodology Agglomeration Goal 1: lower communication overhead. Eliminate communication among tasks; increase the locality of parallelism; combine groups of sending and receiving tasks. 35

36 3.Foster's Design Methodology Agglomeration Goal 2: maintain the scalability of the parallel design. Ensure that we have not combined so many tasks that we will be unable to port our program at some point in the future to a computer with more processors/cores. E.g., a 3-D matrix operation of size 8*128*258. 36

37 3.Foster's Design Methodology Agglomeration Goal 3: reduce software engineering costs. Make greater use of the existing sequential code, reducing both time and expense. 37

38 3.Foster's Design Methodology Agglomeration evaluation: The agglomeration has increased the locality of the parallel algorithm. Replicated computations take less time than the communications they replace. The amount of replicated data is small enough to allow the algorithm to scale. Agglomerated tasks have similar computational and communication costs. The number of tasks is an increasing function of the problem size. The number of tasks is as small as possible, yet at least as great as the number of cores/processors in the target computer. The trade-off between the chosen agglomeration and the cost of modifications to existing sequential code is reasonable. 38

39 3.Foster's Design Methodology Mapping Is the process of assigning tasks to processors/cores, with two goals: increasing processor utilization and minimizing inter-processor communication. 39

40 Part Ⅲ Message-Passing Programming

41 Preface 41


43 [Figure: processes 0, 1, and 2 each Load and Process data; results are then Gathered and Stored.] 43

44 Hello World!
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Output with four processes:
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
44
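As a usage note (a sketch assuming an MPICH-style installation whose wrapper compiler and launcher are named mpicc and mpirun; the source file name and process count are illustrative), such a program is typically built and launched with:

mpicc hello.c -o hello
mpirun -np 4 ./hello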

45 Outline Introduction The Message-Passing Model The Message-Passing Interface (MPI) Communication Mode Circuit satisfiability Point-to-point Communication Collective Communication Benchmarking parallel performance 45

46 Introduction MPI: Message Passing Interface Is a library, not a parallel language (C & MPI, Fortran & MPI). Is a standard, not an implementation; actual implementations include MPICH, Intel MPI, MSMPI, and LAM/MPI. Is a message-passing model. 46

47 Introduction The history of MPI: Draft: 1992; MPI-1: 1994; MPI-2: 1997. 47

48 Introduction MPICH: unix.mcs.anl.gov/mpi/mpich1/download.html; unix.mcs.anl.gov/mpi/mpich2/index.htm#download Main features: open source; tracks the MPI standard; supports MPMD (Multiple Program Multiple Data) and heterogeneous clusters; supports C/C++, Fortran 77 and Fortran 90; supports Unix and Windows NT platforms; supports multi-core, SMP, cluster, and large-scale parallel computer systems. 48

49 Introduction Intel MPI Conforms to the MPI-2 standard. Latest version: 3.1. DAPL (Direct Access Programming Library) support. 49

50 Introduction - Intel MPI The Intel MPI Library supports multiple hardware fabrics. 50

51 Introduction - Intel MPI Features: a multi-fabric message passing library; implements the Message Passing Interface, v2 (MPI-2) specification; provides a standard library across Intel platforms that focuses on making applications perform best on IA-based clusters, enables adoption of MPI-2 functions as customer needs dictate, and delivers best-in-class performance for enterprise, divisional, departmental and workgroup high performance computing. 51

52 Introduction - Intel MPI Why the Intel MPI Library? High-performance MPI-2 implementation; Linux and Windows CCS support; interconnect independence; smart fabric selection; easy installation; free runtime environment; close integration with Intel and 3rd-party development tools; Internet-based licensing and technical support. 52

53 Introduction - Intel MPI Standards based: built on Argonne National Laboratory's MPICH-2 implementation. Integration: can be easily integrated with Platform LSF 6.1 and higher; Altair PBS Pro* 7.1 and higher; OpenPBS* 2.3; Torque* and higher; Parallelnavi* NQS* for Linux V2.0L10 and higher; Parallelnavi for Linux Advanced Edition V1.0L10A and higher; NetBatch* 6.x and higher. 53

54 Introduction - Intel MPI System requirements (host and target systems hardware): IA-32, Intel 64, or IA-64 architecture using Intel Pentium 4, Intel Xeon, or Intel Itanium processor family and compatible platforms; 1 GB of RAM (4 GB recommended); minimum 100 MB of free hard disk space (10 GB recommended). 54

55 Introduction - Intel MPI Operating system requirements: Microsoft Windows* Compute Cluster Server 2003 (Intel 64 architecture only); Red Hat Enterprise Linux* 3.0, 4.0, or 5.0; SUSE* Linux Enterprise Server 9 or 10; SUSE Linux 9.0 through 10.0 (all except Intel 64 architecture, which starts at 9.1); HaanSoft Linux 2006 Server*; Miracle Linux* 4.0; Red Flag* DC Server 5.0; Asianux* Linux 2.0; Fedora Core 4, 5, or 6 (IA-32 and Intel 64 architectures only); TurboLinux* 10 (IA-32 and Intel 64 architecture); Mandriva/Mandrake* 10.1 (IA-32 architecture only); SGI* ProPack 4.0 (IA-64 architecture only) or 5.0 (IA-64 and Intel 64 architectures). 55

56 The Message-Passing Model [Figure: eight processor/memory pairs connected by an interconnection network.] 56

57 The Message-Passing Model A task in the task/channel model becomes a process in the Message-Passing Model. The number of processes is specified by the user, is specified when the program begins, and is constant throughout the execution of the program. Each process has a unique ID number. [Figure: processor/memory pairs connected by an interconnection network.] 57

58 The Message-Passing Model Goals of the Message-Passing Model: processes communicate with each other; processes synchronize with each other. 58

59 The Message-Passing Interface (MPI) Advantages: runs well on a wide variety of MPMD architectures; easier to debug; thread safe. 59

60 What is in MPI Point-to-point message passing Collective communication Support for process groups Support for communication contexts Support for application topologies Environmental inquiry routines Profiling interface 60

61 Introduction to Groups & Communicator Process model and groups Communication scope Communicators 61

62 Process model and groups The fundamental computational unit is the process. Each process has an independent thread of control and a separate address space. MPI processes execute in MIMD style, but: there is no mechanism for loading code onto processors or assigning processes to processors, and no mechanism for creating or destroying processes. MPI supports dynamic process groups: process groups can be created and destroyed; membership is static; groups may overlap. There is no explicit support for multithreading, but MPI is designed to be thread-safe. 62

63 Communication scope In MPI, a process is specified by: a group, and a rank relative to the group. A message label is specified by: a message context, and a message tag relative to the context. Groups are used to partition process space. Contexts are used to partition "message label space". Groups and contexts are bound together to form a communicator object. Contexts are not visible at the application level. A communicator defines the scope of a communication operation. 63

64 Communicators Communicators are used to create independent "message universes". Communicators are used to disambiguate message selection when an application calls a library routine that performs message passing. Nondeterminacy may arise if processes enter the library routine asynchronously, or if processes enter the library routine synchronously but there are outstanding communication operations. A communicator binds together groups and contexts, defines the scope of a communication operation, and is represented by an opaque object. 64

65 A communicator handle defines which processes a particular command will apply to All MPI communication calls take a communicator handle as a parameter, which is effectively the context in which the communication will take place MPI_INIT defines a communicator called MPI_COMM_WORLD for each process that calls it 65

66 Every communicator contains a group, which is a list of processes. The processes are ordered and numbered consecutively from 0; the number of each process is known as its rank. The rank identifies each process within the communicator. The group of MPI_COMM_WORLD is the set of all MPI processes. 66
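As a small illustration of how a communicator, its group, and ranks relate, here is a hedged sketch (not taken from the slides) that extracts the group behind MPI_COMM_WORLD and queries the rank and size through both the communicator and the group; all calls are standard MPI:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int comm_rank, group_rank, group_size;
    MPI_Group world_group;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);     /* rank within the communicator */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);  /* the group bound to the communicator */
    MPI_Group_rank(world_group, &group_rank);      /* same rank, queried via the group */
    MPI_Group_size(world_group, &group_size);
    printf("rank %d (group rank %d) of %d processes\n",
           comm_rank, group_rank, group_size);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}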

67 Skeleton MPI Program
#include <mpi.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    /* main part of the program */
    MPI_Finalize();
    return 0;
}
67

68 Circuit satisfiability What combinations of input values will make the circuit output the value 1? [Figure: combinational circuit with 16 inputs labeled a through p.] 68

69 Circuit satisfiability Analysis: 16 inputs, a-p, each taking a value of 0 or 1; 2^16 = 65536 combinations. Design a parallel algorithm. Partitioning: functional decomposition; no channels between tasks; tasks are independent; well suited to parallelism. (Design steps: Partitioning, Communication, Agglomeration, Mapping.) 69

70 Circuit satisfiability Communication: Tasks are independent; 70

71 Circuit satisfiability Agglomeration and Mapping Fixed number of tasks; the time for each task to complete is variable. WHY? How do we balance the computation load? Map tasks to processes in a cyclic fashion. (Design steps: Partitioning, Communication, Agglomeration, Mapping.) 71

72 Circuit satisfiability Each process will examine a combination of inputs in turn. 72

73 Circuit satisfiability
#define EXTRACT_BIT(n,i) ((n&(1<<i))?1:0)
void check_circuit(int id, int z) {
    int v[16];
    int i;
    for (i = 0; i < 16; i++) v[i] = EXTRACT_BIT(z,i);
    if ((v[0] || v[1]) && (!v[1] || !v[3]) && (v[2] || v[3])
        && (!v[3] || !v[4]) && (v[4] || !v[5]) && (v[5] || !v[6])
        && (v[5] || v[6]) && (v[6] || !v[15]) && (v[7] || !v[8])
        && (!v[7] || !v[13]) && (v[8] || v[9]) && (v[9] || v[11])
        && (v[10] || v[11]) && (v[12] || v[13]) && (v[13] || !v[14])
        && (v[14] || v[15])) {
        printf("%d) %d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d\n", id,
               v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7],
               v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
        fflush(stdout);
    }
}
73
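A sketch of a possible main() driver for check_circuit, assuming the cyclic mapping described on the preceding slides; this driver is illustrative rather than the original program from the slides:

#include <stdio.h>
#include <mpi.h>

void check_circuit(int id, int z);   /* defined as above */

int main(int argc, char *argv[]) {
    int id, p, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    /* cyclic allocation: process id checks combinations id, id+p, id+2p, ... */
    for (i = id; i < 65536; i += p)
        check_circuit(id, i);
    printf("Process %d is done\n", id);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}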

74 Point-to-point Communication Overview Blocking Behaviors Non-Blocking Behaviors 74

75 overview A message is sent from a sender to a receiver There are several variations on how the sending of a message can interact with the program 75

76 Synchronous does not complete until the message has been received A FAX or registered mail 76

77 Asynchronous completes as soon as the message is on the way. A post card or e-mail. 77

78 communication modes The mode is selected with the send routine: synchronous mode ("safest"); ready mode (lowest system overhead); buffered mode (decouples sender from receiver); standard mode (compromise). Calls are also blocking or non-blocking: blocking stops the program until the message buffer is safe to use; non-blocking separates communication from computation. 78

79 Blocking Behavior int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) buf is the beginning of the buffer containing the data to be sent. For Fortran, this is often the name of an array in your program. For C, it is an address. count is the number of elements to be sent (not bytes) datatype is the type of data dest is the rank of the process which is the destination for the message tag is an arbitrary number which can be used to distinguish among messages comm is the communicator 79

80 Temporary Knowledge Message Msg body: buf, count, datatype. Msg envelope: dest, tag, comm. Tag: why? The tag lets the receiver distinguish among messages arriving from the same source. 80


82 When using standard-mode send It is up to MPI to decide whether outgoing messages will be buffered. Completes once the message has been sent, which may or may not imply that the message has arrived at its destination. Can be started whether or not a matching receive has been posted; it may complete before a matching receive is posted. Has non-local completion semantics, since successful completion of the send operation may depend on the occurrence of a matching receive. 82

83 Blocking Standard Send 83

84 MPI_Recv int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) buf is the beginning of the buffer where the incoming data are to be stored. For Fortran, this is often the name of an array in your program. For C, it is an address. count is the number of elements (not bytes) in your receive buffer. datatype is the type of data. source is the rank of the process from which data will be accepted (this can be a wildcard, by specifying the parameter MPI_ANY_SOURCE). tag is an arbitrary number which can be used to distinguish among messages (this can be a wildcard, by specifying the parameter MPI_ANY_TAG). comm is the communicator. status is an array or structure of information that is returned. For example, if you specify a wildcard for source or tag, status will tell you the actual rank or tag for the message received. 84


87 Blocking Synchronous Send 87

88 Cont. can be started whether or not a matching receive was posted will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication. has non-local completion semantics. 88

89 Blocking Ready Send 89

90 Completes immediately. May be started only if the matching receive has already been posted. Has the same semantics as a standard-mode send. Saves on overhead by avoiding handshaking and buffering. 90

91 Blocking Buffered Send 91

92 Can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. Has local completion semantics: its completion does not depend on the occurrence of a matching receive. In order to complete the operation, it may be necessary to buffer the outgoing message locally. For that purpose, buffer space is provided by the application. 92

93 Non-Blocking Behavior MPI_Isend(buf, count, dtype, dest, tag, comm, request) MPI_Wait(request, status): request matches the request returned by Isend or Irecv; status returns a status equivalent to the status from Recv when complete. For a send, MPI_Wait blocks until the message is buffered or sent, so the message variable is free to reuse; for a receive, it blocks until the message has been received and is ready. 93

94 Non-blocking Synchronous Send int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) IN = provided by programmer, OUT = set by routine. buf: starting address of message buffer (IN); count: number of elements in message (IN); datatype: type of elements in message (IN); dest: rank of destination task in communicator comm (IN); tag: message tag (IN); comm: communicator (IN); request: identifies a communication event (OUT). 94

95 Non-blocking Ready Send int MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) 95

96 Non-blocking Buffered Send int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) 96

97 Non-blocking Standard Send int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) 97

98 Non-blocking Receive IN = provided by programmer, OUT = set by routine. buf: starting address of message buffer (OUT - buffer contents written); count: number of elements in message (IN); datatype: type of elements in message (IN); source: rank of source task in communicator comm (IN); tag: message tag (IN); comm: communicator (IN); request: identifies a communication event (OUT). 98

99 int MPI_Irecv (void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) 99

100 request: identifies a communication event (INOUT); status: status of communication event (OUT); count: number of communication events (IN); index: index in array of requests of completed event (OUT); incount: number of communication events (IN); outcount: number of completed events (OUT). 100

101 int MPI_Wait (MPI_Request *request, MPI_Status *status) int MPI_Waitall (int count, MPI_Request *array_of_requests, MPI_Status *array_of_statuses) int MPI_Waitany (int count, MPI_Request *array_of_requests, int *index, MPI_Status *status) int MPI_Waitsome (int incount, MPI_Request *array_of_requests, int *outcount, int* array_of_indices, MPI_Status *array_of_statuses) 101
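A short sketch of the non-blocking pattern these routines support, assuming exactly two processes that exchange one integer each; posting the receives and sends first and waiting afterwards lets communication overlap computation and avoids the deadlock that two blocking sends could cause:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, other, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank < 2) {                   /* only ranks 0 and 1 take part in the exchange */
        other = 1 - rank;
        sendval = rank * 100;         /* illustrative payload */
        MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
        /* computation could overlap with communication here */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("process %d received %d\n", rank, recvval);
    }
    MPI_Finalize();
    return 0;
}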

102 Communication Mode / Blocking Routine / Non-Blocking Routine
Synchronous: MPI_SSEND / MPI_ISSEND
Ready: MPI_RSEND / MPI_IRSEND
Buffered: MPI_BSEND / MPI_IBSEND
Standard: MPI_SEND / MPI_ISEND
Receive (all modes): MPI_RECV / MPI_IRECV
102

103 Mode / Advantages / Disadvantages
Synchronous: safest, and therefore most portable; SEND/RECV order not critical; amount of buffer space irrelevant. / Can incur substantial synchronization overhead.
Ready: lowest total overhead; SEND/RECV handshake not required. / RECV must precede SEND.
Buffered: decouples SEND from RECV; no sync overhead on SEND; order of SEND/RECV irrelevant; programmer can control size of buffer space. / Additional system overhead incurred by copy to buffer.
Standard: good for many cases. / Your program may not be suitable.
103

104 MPI Quick Start MPI_Init MPI_BCast MPI_Wtime MPI_Comm_rank MPI_Scatter MPI_Wtick MPI_Comm_size MPI_Gather MPI_Barrier MPI_Send MPI_Reduce MPI_Recv MPI_Finalize MPI_Xxxxx 104

105 MPI Routines MPI_Init To initialize the MPI execution environment. argc: pointer to the number of arguments; argv: pointer to the argument vector. The first MPI function call; allows the system to do any setup needed to handle further calls to the MPI library. Defines a communicator called MPI_COMM_WORLD for each process that calls it. MPI_Init must be called before any other MPI function. Exception: MPI_Initialized, which checks whether MPI has been initialized and may be called before MPI_Init. 105

106 MPI Routines MPI_Comm_rank To determine a process's ID number. Returns the process's ID (its rank). Communicator (MPI_Comm): MPI_COMM_WORLD includes all processes created when MPI is initialized. 106

107 MPI Routines MPI_Comm_size To find the number of processes -- size 107

108 MPI Routines MPI_Send The source process send the data in buffer to destination process. buf count The starting address of the data to be transmitted. The number of data items. datatype The type of data items.(all of the data items must be in the same type) dest tag comm The rank of the process to receive the data. An integer label for the message, allowing messages serving different purpose to be identified. Indicates the communicator in which this message is being sent. 108

109 MPI Routines MPI_Send Blocks until the message buffer is once again available. MPI provides constants for C data types (e.g., MPI_INT, MPI_DOUBLE, MPI_CHAR). 109

110 MPI Routines MPI_Recv buf count The starting address where the received data is to be stored. The maximum number of data items the receiving process is willing to receive. datatype The type of data items source tag comm status The rank of the process sending this message. The desired tag value for the message Indicates the communicator in which this message is being passed. MPI data structure. Return the status. 110

111 MPI Routines MPI_Recv Receives a message from the source process. The data type and tag of the message received must agree with the data type and tag defined in the MPI_Recv function. The count of data items received must be no more than the count defined in this function; otherwise, an overflow error condition occurs. If the count equals zero, the message is empty. Blocks until the message has been received, or until an error condition causes the function to return. 111

112 MPI Routines MPI_Recv status->MPI_SOURCE: the rank of the process sending the message. status->MPI_TAG: the message's tag value. status->MPI_ERROR: the error condition. int MPI_Abort(MPI_Comm comm, int errorcode) 112

113 MPI Routines MPI_Finalize Allows the system to free up resources, such as memory, that have been allocated to MPI. Without MPI_Finalize, the result of the program is undefined. 113

114 summary 114

115 Collective communication A communication operation in which a group of processes works together to distribute or gather a set of one or more values. 115

116 Collective communication MPI_Bcast A root process broadcasts one or more data items of the same type to all other processes in a communicator. 116

117 Collective communication MPI_Bcast
int MPI_Bcast(
    void* buffer,          // address of the first broadcast element
    int count,             // number of elements to be broadcast
    MPI_Datatype datatype, // type of the elements to be broadcast
    int root,              // ID of the process doing the broadcast
    MPI_Comm comm)         // communicator
117
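A minimal usage sketch (illustrative, not from the slides): the root process determines a value, and MPI_Bcast makes it available to every process in the communicator.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        n = 1000;                          /* illustrative: root determines the value */
    /* after the broadcast every process holds the same n */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d sees n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}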

118 Collective communication MPI_Scatter The root process sends a different part of a data buffer to each of the other processes. 118

119 Collective communication MPI_Scatter 119
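Since the slide's figure is not reproduced in the transcription, here is a hedged sketch of MPI_Scatter in use, assuming the root distributes CHUNK consecutive integers to each process (CHUNK and the buffer sizes are illustrative):

#include <stdio.h>
#include <mpi.h>

#define CHUNK 4                               /* illustrative: elements per process */

int main(int argc, char *argv[]) {
    int rank, size, i;
    int sendbuf[64];                          /* large enough for up to 16 processes */
    int recvbuf[CHUNK];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)                            /* only the root's send buffer matters */
        for (i = 0; i < size * CHUNK; i++)
            sendbuf[i] = i;
    /* each process receives its own CHUNK consecutive elements */
    MPI_Scatter(sendbuf, CHUNK, MPI_INT, recvbuf, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d got elements starting at %d\n", rank, recvbuf[0]);
    MPI_Finalize();
    return 0;
}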

120 Collective communication MPI_Gather Each process sends the data in its buffer to the root process.

121 Collective communication MPI_Gather 121
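Correspondingly, a hedged MPI_Gather sketch (illustrative, not from the slides): every process contributes one integer, and the root receives them in rank order.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, i, squared;
    int results[64];                      /* room for up to 64 processes; used only at root */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    squared = rank * rank;                /* illustrative per-process contribution */
    /* root collects one int from every process, in rank order */
    MPI_Gather(&squared, 1, MPI_INT, results, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("rank %d contributed %d\n", i, results[i]);
    MPI_Finalize();
    return 0;
}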

122 Collective communication MPI_Reduce After a process has completed its share of the work, it is ready to participate in the reduction operation. MPI_Reduce performs one or more reduction operations on values submitted by all the processes in a communicator. 122

123 Collective communication MPI_Reduce 123
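The standard prototype is int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm). A hedged usage sketch, in which each process contributes one partial value and rank 0 receives the global sum (the values are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    double local_sum, global_sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    local_sum = (double)(rank + 1);       /* illustrative partial result: 1 + 2 + ... + size */
    /* combine all partial sums with MPI_SUM; only rank 0 receives the result */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f (expected %f)\n", global_sum, size * (size + 1) / 2.0);
    MPI_Finalize();
    return 0;
}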

124 Collective communication MPI_Reduce MPI's built-in reduction operators:
MPI_BAND - bitwise and
MPI_BOR - bitwise or
MPI_BXOR - bitwise exclusive or
MPI_LAND - logical and
MPI_LOR - logical or
MPI_LXOR - logical exclusive or
MPI_MAX - maximum
MPI_MAXLOC - maximum and location of maximum
MPI_MIN - minimum
MPI_MINLOC - minimum and location of minimum
MPI_PROD - product
MPI_SUM - sum
124

125 summary 125


129 Benchmarking parallel performance Measure the performance of a parallel application. How? By measuring the number of seconds that elapse from the time we initiate execution until the program terminates. double MPI_Wtime(void): returns the number of seconds that have elapsed since some point of time in the past. double MPI_Wtick(void): returns the precision of the result returned by MPI_Wtime. 129

130 Benchmarking parallel performance MPI_Barrier int MPI_Barrier(MPI_Comm comm) comm: indicates in which communicator the processes participate in the barrier synchronization. MPI_Barrier blocks until every process in the communicator has called it; it is typically used to synchronize processes before a timed section. 130
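A common benchmarking pattern built from these routines, shown as a hedged sketch: synchronize with a barrier, read MPI_Wtime before and after the timed work, and report the elapsed time from one process.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    double start, elapsed;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);          /* make sure everyone starts together */
    start = MPI_Wtime();
    /* ... the parallel work being benchmarked goes here ... */
    MPI_Barrier(MPI_COMM_WORLD);          /* wait until the slowest process finishes */
    elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("elapsed time: %f s (clock resolution %g s)\n", elapsed, MPI_Wtick());
    MPI_Finalize();
    return 0;
}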

131 For example Send and receive operation 131
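The slide's listing is not reproduced in the transcription; as a substitute, a minimal sketch of a blocking send/receive exchange, assuming process 0 sends one integer to process 1 (the tag and payload are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                                          /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %d from process %d\n", value, status.MPI_SOURCE);
    }
    MPI_Finalize();
    return 0;
}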

132 For example Compute pi
f(x) = 4 / (1 + x^2)
∫[0,1] 1/(1 + x^2) dx = arctan(x) |[0,1] = arctan(1) - arctan(0) = arctan(1) = π/4
∫[0,1] f(x) dx = π
132

133 For example
π ≈ (1/N) Σ_{i=1..N} f((i - 0.5)/N)
133

134 For example Compute pi 134
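The slide's code is likewise not reproduced; the following is a hedged reconstruction of the usual MPI pi program based on the midpoint-rule formula above (N and the cyclic distribution of intervals are illustrative choices):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, i, N = 1000000;          /* illustrative number of intervals */
    double h, x, local_sum = 0.0, pi = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* root could read N at run time */
    h = 1.0 / N;
    /* cyclic distribution of the N midpoints over the processes */
    for (i = rank; i < N; i += size) {
        x = (i + 0.5) * h;
        local_sum += 4.0 / (1.0 + x * x);
    }
    local_sum *= h;
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);
    MPI_Finalize();
    return 0;
}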

135 For example Matrix Multiplication
MPI_Scatter(&iaA[0][0], N, MPI_INT, &iaA[iRank][0], N, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&iaB[0][0], N*N, MPI_INT, 0, MPI_COMM_WORLD);
for (i = 0; i < N; i++) {
    temp = 0;
    for (j = 0; j < N; j++) {
        temp = temp + iaA[iRank][j] * iaB[j][i];
    }
    iaC[iRank][i] = temp;
}
MPI_Gather(&iaC[iRank][0], N, MPI_INT, &iaC[0][0], N, MPI_INT, 0, MPI_COMM_WORLD);
135


137 C_{i,j} = Σ_{k=0..l-1} a_{i,k} * b_{k,j}, where A is an n x l matrix and B is an l x m matrix. 137


142 Summary MPI is a library. The six foundational functions of MPI. Collective communication. The MPI communication model. 142

143 Thanks! Feel free to contact me with any questions or suggestions. And welcome to Wuhan University!


Introduction to MPI. Ricardo Fonseca. https://sites.google.com/view/rafonseca2017/

Introduction to MPI. Ricardo Fonseca. https://sites.google.com/view/rafonseca2017/ Introduction to MPI Ricardo Fonseca https://sites.google.com/view/rafonseca2017/ Outline Distributed Memory Programming (MPI) Message Passing Model Initializing and terminating programs Point to point

More information

Cluster Computing MPI. Industrial Standard Message Passing

Cluster Computing MPI. Industrial Standard Message Passing MPI Industrial Standard Message Passing MPI Features Industrial Standard Highly portable Widely available SPMD programming model Synchronous execution MPI Outer scope int MPI_Init( int *argc, char ** argv)

More information

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI and comparison of models Lecture 23, cs262a Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers,

More information

Introduction to Parallel Programming

Introduction to Parallel Programming University of Nizhni Novgorod Faculty of Computational Mathematics & Cybernetics Section 4. Part 1. Introduction to Parallel Programming Parallel Programming with MPI Gergel V.P., Professor, D.Sc., Software

More information

Distributed Memory Programming with Message-Passing

Distributed Memory Programming with Message-Passing Distributed Memory Programming with Message-Passing Pacheco s book Chapter 3 T. Yang, CS240A Part of slides from the text book and B. Gropp Outline An overview of MPI programming Six MPI functions and

More information

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs 1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx

More information

Parallel Computing. Distributed memory model MPI. Leopold Grinberg T. J. Watson IBM Research Center, USA. Instructor: Leopold Grinberg

Parallel Computing. Distributed memory model MPI. Leopold Grinberg T. J. Watson IBM Research Center, USA. Instructor: Leopold Grinberg Parallel Computing Distributed memory model MPI Leopold Grinberg T. J. Watson IBM Research Center, USA Why do we need to compute in parallel large problem size - memory constraints computation on a single

More information

Department of Informatics V. HPC-Lab. Session 4: MPI, CG M. Bader, A. Breuer. Alex Breuer

Department of Informatics V. HPC-Lab. Session 4: MPI, CG M. Bader, A. Breuer. Alex Breuer HPC-Lab Session 4: MPI, CG M. Bader, A. Breuer Meetings Date Schedule 10/13/14 Kickoff 10/20/14 Q&A 10/27/14 Presentation 1 11/03/14 H. Bast, Intel 11/10/14 Presentation 2 12/01/14 Presentation 3 12/08/14

More information

Framework of an MPI Program

Framework of an MPI Program MPI Charles Bacon Framework of an MPI Program Initialize the MPI environment MPI_Init( ) Run computation / message passing Finalize the MPI environment MPI_Finalize() Hello World fragment #include

More information

What s in this talk? Quick Introduction. Programming in Parallel

What s in this talk? Quick Introduction. Programming in Parallel What s in this talk? Parallel programming methodologies - why MPI? Where can I use MPI? MPI in action Getting MPI to work at Warwick Examples MPI: Parallel Programming for Extreme Machines Si Hammond,

More information

Basic MPI Communications. Basic MPI Communications (cont d)

Basic MPI Communications. Basic MPI Communications (cont d) Basic MPI Communications MPI provides two non-blocking routines: MPI_Isend(buf,cnt,type,dst,tag,comm,reqHandle) buf: source of data to be sent cnt: number of data elements to be sent type: type of each

More information

Practical Introduction to Message-Passing Interface (MPI)

Practical Introduction to Message-Passing Interface (MPI) 1 Practical Introduction to Message-Passing Interface (MPI) October 1st, 2015 By: Pier-Luc St-Onge Partners and Sponsors 2 Setup for the workshop 1. Get a user ID and password paper (provided in class):

More information

CS4961 Parallel Programming. Lecture 19: Message Passing, cont. 11/5/10. Programming Assignment #3: Simple CUDA Due Thursday, November 18, 11:59 PM

CS4961 Parallel Programming. Lecture 19: Message Passing, cont. 11/5/10. Programming Assignment #3: Simple CUDA Due Thursday, November 18, 11:59 PM Parallel Programming Lecture 19: Message Passing, cont. Mary Hall November 4, 2010 Programming Assignment #3: Simple CUDA Due Thursday, November 18, 11:59 PM Today we will cover Successive Over Relaxation.

More information

int sum;... sum = sum + c?

int sum;... sum = sum + c? int sum;... sum = sum + c? Version Cores Time (secs) Speedup manycore Message Passing Interface mpiexec int main( ) { int ; char ; } MPI_Init( ); MPI_Comm_size(, &N); MPI_Comm_rank(, &R); gethostname(

More information

Parallel Computing and the MPI environment

Parallel Computing and the MPI environment Parallel Computing and the MPI environment Claudio Chiaruttini Dipartimento di Matematica e Informatica Centro Interdipartimentale per le Scienze Computazionali (CISC) Università di Trieste http://www.dmi.units.it/~chiarutt/didattica/parallela

More information

COSC 6374 Parallel Computation. Message Passing Interface (MPI ) I Introduction. Distributed memory machines

COSC 6374 Parallel Computation. Message Passing Interface (MPI ) I Introduction. Distributed memory machines Network card Network card 1 COSC 6374 Parallel Computation Message Passing Interface (MPI ) I Introduction Edgar Gabriel Fall 015 Distributed memory machines Each compute node represents an independent

More information

Programming Using the Message Passing Paradigm

Programming Using the Message Passing Paradigm Programming Using the Message Passing Paradigm Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview

More information

More about MPI programming. More about MPI programming p. 1

More about MPI programming. More about MPI programming p. 1 More about MPI programming More about MPI programming p. 1 Some recaps (1) One way of categorizing parallel computers is by looking at the memory configuration: In shared-memory systems, the CPUs share

More information

CSE 160 Lecture 15. Message Passing

CSE 160 Lecture 15. Message Passing CSE 160 Lecture 15 Message Passing Announcements 2013 Scott B. Baden / CSE 160 / Fall 2013 2 Message passing Today s lecture The Message Passing Interface - MPI A first MPI Application The Trapezoidal

More information

CSE. Parallel Algorithms on a cluster of PCs. Ian Bush. Daresbury Laboratory (With thanks to Lorna Smith and Mark Bull at EPCC)

CSE. Parallel Algorithms on a cluster of PCs. Ian Bush. Daresbury Laboratory (With thanks to Lorna Smith and Mark Bull at EPCC) Parallel Algorithms on a cluster of PCs Ian Bush Daresbury Laboratory I.J.Bush@dl.ac.uk (With thanks to Lorna Smith and Mark Bull at EPCC) Overview This lecture will cover General Message passing concepts

More information