COSC 6374 Parallel Computation. Performance Modeling and 2 nd Homework. Edgar Gabriel. Spring Motivation

Size: px

Start display at page:

Download "COSC 6374 Parallel Computation. Performance Modeling and 2 nd Homework. Edgar Gabriel. Spring Motivation"

Andra Howard
6 years ago
Views:

1 COSC 6374 Parallel Computation Performance Modeling and 2 nd Homework Spring 2007 Motivation Can we estimate the costs for a parallel code in order to Evaluate quantitative and qualitative differences between different implementation alternatives Understand the parameters effecting the performance of the application Understanding relevant hardware characteristics Restrictions: Any analytical model can not replace real measurements since parallel systems are too complex and unpredictable 2 1

2 3-D Finite Difference Code Modeling the CPU: CPU peak performance: P peak = f CPU * n FPUs Does not take any caches, micro-instructions, processor stalls etc into account Modeling the network: Eg Hockney s model: t(s) = l + s/b Does not take pipelining effects, or network congestion into account (1) (2) 3 3-D Finite Difference Code (II) Modeling the computation in the application: N 3 : No of grid points per process n: no of floating point operations per grid point ε: efficiency of this application relative to the CPU peak performance T comp = n*n 3 *ε*p peak (3) 4 2

3 3-D Finite Difference Code (III) Modeling the communication costs: A process has max 6 neighbors in a 3-D mesh Amount of data sent to each neighbor: N 2 doubles Length of a double: k bytes For non-duplex networks: T comm = 12 * ( l + (N 2 *k)/b) For duplex networks: T comm = 6 * ( l + (N 2 *k)/b) (4a) (4b) 5 3-D Finite Difference Code (IV) Total execution time for duplex networks: T total = T comp + T comm = n*n 3 *ε*p peak + 6 * ( l + (N 2 *k)/b) O(N 3 ) O(N 2 ) 6 3

4 Depdence on the problem size Dependence on N Time [sec] Problem size Tcomp Tcomm Ttotal 7 Dependence on the no of flops per grid point Dependence on n Time [sec] Tcomp Tcomm Ttotal no of FLOP/Grid point 8 4

5 Dependence on CPU peak performance Dependence on CPU peak perfomance Exec Time [sec] E E E E E E+08 Perfomance in [FLOPS] 100E E E E E E E E E+10 Tcomp Tcomm Ttotal Dependence on the efficiency of the application Dependence on eps Exec Time [sec] Tcomp Tcomm Ttotal Efficiency of App [%] 10 5

6 Dependence on network (I) 350E E E E+01 Exec time [sec] 150E E E E+00 4x InfiniBand Gigabit Ethernet bandwidth [MB/sec] Fast Ethernet latency [ms] 11 Dependence on network (II) 900E E E E-03 Exec time [sec] 500E E E E E E latency [ms] bandwidth [MB/sec] 12 6

7 How to model collective operations? Eg MPI_Bcast: strongly depending on the algorithm used to implement the operation One process (root process) distributes the same data items to all members within a process group (communicator) Linear Algorithm: the root process sends one message to each process in the communicator if (rank == root ) { for (i=0; i<size; i++ ) if ( i!= root ) MPI_Send (buf, cnt, dat, i, TAG, comm); } else MPI_Recv (buf, cnt, dat, root, TAG, comm, &stat); 13 Linear Algorithm (I) 1 Hockney s Model: t( s) = l + s b s: message size l: latency b: bandwidth Estimate of the execution time according to Hockney s model for p processes: 1 ( s, p) = ( l + s)*( p 1) b t (4:1) 14 7

8 15 Linear Algorithm (II) Using non-blocking operations: if (rank == root ) { for (i=0; i<size; i++ ) MPI_Isend (buf, cnt, dat, i, TAG, comm, &req[i]); } MPI_Recv (buf, cnt, dat, root, TAG, comm, &stat); if (rank == root ) { MPI_Waitall ( size, req, statuses); } Formula (4:1) is now arbitrarily wrong Several communications simultaneously ongoing Maximum (optimal) number of messages depending on message size and network parameters How does communication really work (I) Two protocols usually used internally: Eager protocol: message is sent immediately to the receiver, without waiting for the according receive to be posted Usually used for short messages (eg 1 KB in Open MPI) Rendezvous protocol: Send a header to receiver Wait for an acknowledgment receive has started Send message data Avoids having to buffer large messages on the receiver process ( unexpected messages) 16 8

9 How communication really works (II) Three levels of buffering Application level (eg MPI_Bsend) MPI library level unexpected message queues System buffering System buffering works similarly to file systems eg for sockets: data is copied into socket buffer before sending MPI_Send returns as soon as data is in the socket buffer! No way to alternate this data anymore, so it is safe to return control to the application 17 How communication really works (III) For a short message ( < socket buffer size (=sbsize) ) Data copied into socket buffer write operation on the according socket called MPI_Send returns control to the application in a time which is shorter than the network latency! For a long message Large message is split into chunks of size sbsize A chunk of the data is copied into socket buffer and sent As soon as the receiving process acknowledges the receipt of the data chunk, the next chunk is copied into socket buffer etc 18 9

10 How communication really works (IV) So transfer of a large message looks like Sending a small chunk Wait Sending a small chunk Wait This behavior is not modeled by Hockney, but eg by the LogGP model l: latency o: overhead per Send/Recv (eg for copying data) g: gap minimum time before Sender can Send another msg G: gap per byte takes message length into account 19 1 t( s) = l + 2o + ( s 1) G LogGP LogGP estimates sending several messages from a single process to another process more accurately gap parameter allows modeling a pipelining effect LogGP can not model network congestion Based on LogGP, one should split a large message into smaller chunks and send them simultaneously Hide the gap by using a different channel 20 10

11 Multi-segmented linear algorithm nmgs = cnt/scnt; if (rank == root ) { for (j=0; j<nmsgs; j++ ) { tbuf = buf + (j*scnt ); for (i=0; i<size; i++ ) MPI_Isend (tbuf, scnt, dat, i, TAG, comm,&req[2*j+i]); } for (j=0; j<nmsgs; j++ ) { tbuf = buf + (j*scnt); MPI_Irecv (tbuf, scnt, dat, root, TAG, comm, &rreq[j]); } if (rank == root ) { MPI_Waitall ( size*nmsgs, req, statuses); } MPI_Waitall ( nmsgs, rreq, rstatuses ); 21 Performance Effect Data courtesy of Jelena Pjesivac-Grbovic, University of Tennessee 22 11

12 Binary and Binomial Trees Number of messages increase with every iteration network saturated starting from a certain number of messages message segmenting can improve the performance as well 23 Chain Algorithms Segment a message and pass them from one process to another Performs very well for very large messages n n n n n 24 12

13 k-chain Algorithm eg k= nd Homework Rules Each student should deliver Source code (c files) Documentation (pdf, doc, tex or txt file) explanations to the code answers to questions Deliver electronically to gabriel@csuhedu Expected by Monday, April 23rd, 1159pm In case of questions: ask, ask, ask! Max number of points which can be achieved: 27pts 26 13

14 Part A: Modeling and implementing Broadcast communication As discussed in the lecture of April 4th, a broadcast operation can be implemented using multiple algorithms, eg linear, binary tree, chain 1 Implement each of these three algorithms, call the routines (my_bcast_linear, my_bcast_bin, my_bcast_chain) ( 6 pts) The prototype of each of these algorithms should be identical to the prototype of MPI_Bcast Your implementation can assume, that the process with the rank 0 is always the root process It should work for arbitrary number of processes, arbitrary cnt and datatype arguments, and arbitrary communicators All implementation should not segment messages 27 Part A 2 Write a three small benchmark programs to measure the performance of each of those functions ( 2 pts ) The benchmark shall Measure the performance of your bcast implementation for a message of 1 integer to 256,000 integers, doubling the message length with every iteration Each message length should be executed N=500 times Print for each message length the average execution time for this algorithm 3 Plot for each algorithm the execution time over the no of ( 9pts) integers Execute the measurements for 4, 8 and 16 processes -> A total of 9 plots will be required! 28 14

15 Part A 4 Using Hockney s model, give a formula for each of the implementations estimating the according execution time 5 Using your formulas, plot the estimated execution time for each algorithm and no of processes Assume that latency=45us and Bandwidth=700MB/s Compare it with the execution times achieved in the previous bullet 6 Describe how the binary tree algorithm would have to be extended to work for an arbitrary root process Note: you do not have to provide a working implementation for that! (6 pts) (3 pts) (2 pts) 29 Notes on the usage of shark The front-end node has changed! You can not login to salmoncsuhedu anymore, you have to login to sharkcsuhedu All your files has been moved to this machine Advantage of the new front-end node: You can now compile your executable on the frontend node, since it is the same architecture as the compute nodes itself Please check that your partition has been removed after your done using squeue Please start immediately with the homework Shark will be overloaded on the last days again! 30 15

COSC 4397 Parallel Computation. Performance Modeling. Edgar Gabriel. Spring Motivation

COSC 4397 Parallel Computation. Performance Modeling. Edgar Gabriel. Spring Motivation COSC 4397 Perfrmance Mdeling Spring 2010 Mtivatin Can we estimate the csts fr a parallel cde in rder t Evaluate quantitative and qualitative differences between different implementatin alternatives Understand