CSE 160 Lecture 23: Matrix Multiplication Continued, Managing Communicators, Gather and Scatter (Collectives)


CSE 160 Lecture 23: Matrix Multiplication Continued, Managing Communicators, Gather and Scatter (Collectives)
Scott B. Baden, Winter 2013

Today's lecture
All to all communication
Application to Parallel Sorting
Blocking for cache

Global to local mapping
In some applications, we need to compute a local-to-global mapping of array indices.
In the 4th assignment, we want to set all values to zero in certain regions of the problem.
(Figure: the global domain with corners (0,0) and (m_global, n_global); the label (0, n_global/2) marks where the region begins.)
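As an illustration only (the decomposition and all names below are assumptions, not the assignment's actual code), here is a minimal sketch of how a process can translate its local row indices into global ones under a 1-D block-row decomposition and zero the entries that fall inside a region given in global coordinates:

```c
/* Hypothetical sketch: each process owns m_local consecutive rows of an
 * m_global x n_global array (1-D block-row decomposition).  To zero a region
 * specified in global coordinates, map each local row to its global index
 * and test it against the region bounds [r0, r1) x [c0, c1). */
void zero_region(double *u, int m_local, int n_global, int rank,
                 int r0, int r1, int c0, int c1)
{
    int row_offset = rank * m_local;            /* global index of local row 0 */
    for (int i = 0; i < m_local; i++) {
        int gi = row_offset + i;                /* local -> global row mapping */
        if (gi < r0 || gi >= r1)
            continue;                           /* row lies outside the region */
        for (int j = c0; j < c1 && j < n_global; j++)
            u[i * n_global + j] = 0.0;          /* columns are not partitioned */
    }
}
```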

All to all
Also called total exchange or personalized communication: a transpose.
Each process sends a different chunk of data to each of the other processes.
Used in sorting and the Fast Fourier Transform.

Exchange algorithm
n elements per processor (n total elements); a p − 1 step algorithm.
Each processor exchanges n/p elements with each of the others.
In step i, process k exchanges with processes k ± i:
for i = 1 to p-1
    src  = (rank - i + p) mod p
    dest = (rank + i) mod p
    sendrecv(send to dest, receive from src)
end for
Good algorithm for long messages.
Running time: (p − 1)α + (p − 1)(n/p)β ≈ nβ
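A minimal MPI sketch of this exchange (assuming each process's outgoing data sits contiguously in sendbuf, `chunk` elements per destination; the function name and buffer layout are illustrative):

```c
#include <mpi.h>

/* Sketch of the (p-1)-step pairwise exchange.  sendbuf holds p chunks of
 * `chunk` ints each, chunk k destined for rank k; recvbuf uses the same
 * layout for the incoming chunks. */
void pairwise_alltoall(const int *sendbuf, int *recvbuf, int chunk, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Keep our own chunk. */
    for (int j = 0; j < chunk; j++)
        recvbuf[rank * chunk + j] = sendbuf[rank * chunk + j];

    /* In step i, send to rank+i and receive from rank-i (mod p). */
    for (int i = 1; i < p; i++) {
        int dest = (rank + i) % p;
        int src  = (rank - i + p) % p;
        MPI_Sendrecv(sendbuf + dest * chunk, chunk, MPI_INT, dest, 0,
                     recvbuf + src  * chunk, chunk, MPI_INT, src,  0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```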

Recursive doubling for short messages
In each of the log p phases, all nodes exchange half their accumulated data with the others.
Only P/2 messages are sent at any one time.
D = 1
while (D < p)
    exchange & accumulate data with partner rank XOR D
    left shift D by 1
end while
Optimal running time for short messages: ⌈lg P⌉α + nPβ ≈ ⌈lg P⌉α

Flow of information
(Three figures illustrating how the data accumulates across the recursive-doubling phases.)

Summarizing all to all
Short messages: ⌈lg P⌉α
Long messages: ((P − 1)/P) nβ ≈ nβ

Vector All to All
Generalizes all-to-all, gather, scatter, etc.: processes supply varying-length data.
Gather/scatter vectors of different lengths.
Vector all-to-all is used in sample sort (coming):
MPI_Alltoallv(void *sendbuf, int sendcounts[], int sdispls[], MPI_Datatype sendtype,
              void *recvbuf, int recvcnts[], int rdispls[], MPI_Datatype recvtype,
              MPI_Comm comm)
Following diagrams courtesy of Lori Pollock (U. Delaware): www.eecis.udel.edu/~pollock/372/f06/

(Diagram sequence: a three-process MPI_Alltoallv example. Each process's send buffer holds seven letters (proc 0: A-G, proc 1: H-N, proc 2: O-U) together with per-destination send counts and displacements; each receive buffer has its own counts and displacements. The frames show the receive buffers filling in step by step: proc 0 keeps A and B, sends C, D, E to proc 1 and F, G to proc 2, and the other processes distribute their letters analogously.)
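A compact, self-contained usage sketch in the same spirit (the counts here are made up for illustration and are not the buffers from the diagram):

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank sends a different number of ints to every other rank:
 * rank r sends (r + d + 1) ints to destination d, so rank r expects
 * (s + r + 1) ints from each sender s.  Counts and displacements on the
 * send and receive sides are built independently, then MPI_Alltoallv
 * moves everything in one call. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *scount = malloc(p * sizeof(int)), *sdispl = malloc(p * sizeof(int));
    int *rcount = malloc(p * sizeof(int)), *rdispl = malloc(p * sizeof(int));
    int stotal = 0, rtotal = 0;
    for (int d = 0; d < p; d++) {
        scount[d] = rank + d + 1;  sdispl[d] = stotal;  stotal += scount[d];
        rcount[d] = d + rank + 1;  rdispl[d] = rtotal;  rtotal += rcount[d];
    }
    int *sbuf = malloc(stotal * sizeof(int));
    int *rbuf = malloc(rtotal * sizeof(int));
    for (int i = 0; i < stotal; i++)
        sbuf[i] = rank;                        /* tag outgoing data with sender */

    MPI_Alltoallv(sbuf, scount, sdispl, MPI_INT,
                  rbuf, rcount, rdispl, MPI_INT, MPI_COMM_WORLD);

    free(scount); free(sdispl); free(rcount); free(rdispl);
    free(sbuf);   free(rbuf);
    MPI_Finalize();
    return 0;
}
```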

Today's lecture
All to all communication
Application to Parallel Sorting
Blocking for cache

Recall sample sort
Uses a heuristic to estimate the distribution of the global key range over the p threads.
Each processor gets about the same number of keys.
Sample the keys to determine a set of p − 1 splitters that partition the key space into p disjoint regions (buckets).

Alltoallv used in sample sort
(Figure from Introduction to Parallel Computing, 2nd ed., A. Grama, A. Gupta, G. Karypis, and V. Kumar, Addison-Wesley, 2003.)

The collective calls
Processes transmit varying amounts of information to the other processes. In sample sort this is an
MPI_Alltoallv(SKeys, send_counts, send_displace, MPI_INT,
              RKeys, recv_counts, recv_displace, MPI_INT, MPI_COMM_WORLD)
Prior to making this call, all processes must cooperate to determine how much information they will exchange.
The send list describes the number of keys to send to each process k, and the offset in the local array.
The receive list describes the number of incoming keys from each process k, and the offset into the local array.

Determining send & receive lists
After sorting, each process scans its local keys from left to right, marking where the splitters divide the keys; this yields the send counts.
Perform an all-to-all to transpose these send counts into receive counts:
MPI_Alltoall(send_counts, 1, MPI_INT, recv_counts, 1, MPI_INT, MPI_COMM_WORLD)
A simple loop determines the displacements (with s_displ[0] = r_displ[0] = 0):
for (p = 1; p < nodes; p++) {
    s_displ[p] = s_displ[p-1] + send_counts[p-1];
    r_displ[p] = r_displ[p-1] + recv_counts[p-1];
}
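Putting the last two slides together, a sketch of the whole redistribution step might look as follows (names follow the slides; the splitter scan that fills send_counts, error handling, and the final local merge are omitted):

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of sample sort's key redistribution.  Assumes SKeys is locally
 * sorted and send_counts[k] already holds the number of keys destined for
 * process k. */
int *redistribute_keys(int *SKeys, int *send_counts, int nodes,
                       int *total_recvd /* out: number of keys received */)
{
    int *recv_counts = malloc(nodes * sizeof(int));
    int *s_displ = calloc(nodes, sizeof(int));
    int *r_displ = calloc(nodes, sizeof(int));

    /* Transpose the send counts into receive counts. */
    MPI_Alltoall(send_counts, 1, MPI_INT,
                 recv_counts, 1, MPI_INT, MPI_COMM_WORLD);

    /* Displacements are running sums of the counts. */
    for (int p = 1; p < nodes; p++) {
        s_displ[p] = s_displ[p-1] + send_counts[p-1];
        r_displ[p] = r_displ[p-1] + recv_counts[p-1];
    }
    *total_recvd = r_displ[nodes-1] + recv_counts[nodes-1];

    int *RKeys = malloc(*total_recvd * sizeof(int));
    MPI_Alltoallv(SKeys, send_counts, s_displ, MPI_INT,
                  RKeys, recv_counts, r_displ, MPI_INT, MPI_COMM_WORLD);

    free(recv_counts); free(s_displ); free(r_displ);
    return RKeys;   /* caller merges/sorts the received keys */
}
```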

Today's lecture
All to all communication
Application to Parallel Sorting
Blocking for cache: matrix multiplication

Matrix Multiplication
Given two conforming matrices A and B, form the matrix product A · B, where A is m × n and B is n × p.
Operation count: O(n³) multiply-adds for an n × n square matrix.
Discussion follows Demmel: www.cs.berkeley.edu/~demmel/cs267_spr99/lectures/lect02.html

Unblocked Matrix Multiplication
for i := 0 to n-1
    for j := 0 to n-1
        for k := 0 to n-1
            C[i,j] += A[i,k] * B[k,j]
(Figure: C[i,j] += A[i,:] * B[:,j])
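The same triple loop written as plain C, assuming row-major n × n matrices stored as 1-D arrays (a sketch, not a tuned kernel):

```c
/* Unblocked matrix multiply: C += A * B for row-major n x n matrices. */
void matmul_unblocked(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```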

Analysis of performance
for i = 0 to n-1
    // for each iteration i, load all of B into cache
    for j = 0 to n-1
        // for each iteration (i,j), load A[i,:] into cache
        // for each iteration (i,j), load and store C[i,j]
        for k = 0 to n-1
            C[i,j] += A[i,k] * B[k,j]
(Figure: C[i,j] += A[i,:] * B[:,j])

Analysis of performance
for i = 0 to n-1
    // B[:,:] : n · n²/L loads = n³/L   (L = cache line size)
    for j = 0 to n-1
        // A[i,:] : n²/L loads
        // C[i,j] : n²/L loads + n²/L stores = 2n²/L
        for k = 0 to n-1
            C[i,j] += A[i,k] * B[k,j]
Total: (n³ + 3n²)/L
(Figure: C[i,j] += A[i,:] * B[:,j])

Flops to memory ratio
Let q = # flops / main memory reference
q = 2n³ / (n³ + 3n²) → 2 as n → ∞

Blocked Matrix Multiply
Divide A, B, C into N × N sub-blocks.
Assume we have a good quality library to perform matrix multiplication on the sub-blocks.
Each sub-block is b × b, where b = n/N is called the block size.
How do we establish b?
(Figure: C[i,j] = C[i,j] + A[i,k] * B[k,j])

Blocked Matrix Multiplication
// b = n/N = block size
for i = 0 to N-1
    for j = 0 to N-1
        // load each block C[i,j] into cache, once : n²
        for k = 0 to N-1
            // load each block A[i,k] and B[k,j] N³ times
            //   = 2N³ (n/N)² = 2Nn²
            C[i,j] += A[i,k] * B[k,j]   // do the block matrix multiply
        // write each block C[i,j] once : n²
Total: (2N + 2) n²
(Figure: C[i,j] = C[i,j] + A[i,k] * B[k,j])
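A direct C rendering of this blocked loop nest (a sketch: assumes n is a multiple of the block size b, row-major storage, and uses a plain triple loop in place of the tuned b × b library multiply):

```c
/* Blocked matrix multiply: C += A * B, processing b x b sub-blocks so the
 * working set (three b x b blocks) fits in cache. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += b)
        for (int j = 0; j < n; j += b)
            for (int k = 0; k < n; k += b)
                /* C[i:i+b, j:j+b] += A[i:i+b, k:k+b] * B[k:k+b, j:j+b] */
                for (int ii = i; ii < i + b; ii++)
                    for (int kk = k; kk < k + b; kk++) {
                        double a = A[ii*n + kk];
                        for (int jj = j; jj < j + b; jj++)
                            C[ii*n + jj] += a * B[kk*n + jj];
                    }
}
```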

The results

  N, B       Unblocked time   Blocked time
  256, 64        0.6              0.002
  512, 128      15                0.24

Amortize memory accesses by increasing memory reuse.

Flops to memory ratio
Since data motion has become increasingly expensive, the ratio of floating point work to data motion is a factor in determining performance.
Let q = # flops / main memory reference
q = 2n³ / ((2N + 2) n²) = n/(N + 1) ≈ n/N = b as n → ∞

More on blocked algorithms
Data in the sub-blocks are contiguous within rows only, so we may incur conflict cache misses.
Idea: since re-use is so high, copy the sub-blocks into contiguous memory before passing them to our matrix multiply routine (a sketch follows).
The Cache Performance and Optimizations of Blocked Algorithms, M. Lam et al., ASPLOS IV, 1991.
http://www-suif.stanford.edu/papers/lam91.ps
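An illustration of the copy idea (not Lam et al.'s code; assumes row-major storage and that the b × b block lies entirely inside the matrix):

```c
#include <string.h>

/* Copy the b x b sub-block whose top-left corner is (i, j) in the row-major
 * n x n matrix A into a contiguous b x b buffer Ablk, so the block multiply
 * works on unit-stride data and avoids conflict misses. */
void copy_block(int n, int b, const double *A, int i, int j, double *Ablk)
{
    for (int r = 0; r < b; r++)                 /* one row of the block at a time */
        memcpy(Ablk + r * b,                    /* contiguous destination row     */
               A + (i + r) * n + j,             /* strided source row             */
               b * sizeof(double));
}
```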

Revisiting broadcast (from the last lecture, but not posted in the initial version of those slides)

Revisiting Broadcast
P may not be a power of 2.
MPICH uses a binomial tree algorithm for short messages.
(We can use the hypercube algorithm to illustrate the special case of P = 2^k.)
We use a different algorithm for long messages.

Strategy for long messages
Based on van de Geijn's strategy:
Scatter the data: divide the data to be broadcast into pieces, and fill the machine with the pieces.
Do an Allgather: now that everyone has a part of the entire result, collect it on all processors.
Faster than the MST algorithm for long messages:
2 ((p − 1)/p) nβ ≪ ⌈lg p⌉ nβ
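A minimal sketch of this scatter + allgather broadcast (an illustration, not MPICH's implementation; assumes count is a multiple of p and that every rank passes the same full-sized buffer, as with MPI_Bcast):

```c
#include <mpi.h>

/* Long-message broadcast in the van de Geijn style: the root scatters pieces
 * of buf across the machine, then an allgather reassembles the full buffer
 * on every rank. */
void longmsg_bcast(double *buf, int count, int root, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int piece = count / p;                  /* assumes p divides count */

    /* Step 1: scatter.  Rank r receives piece r into buf + r*piece;
     * the root keeps its own piece in place. */
    MPI_Scatter(buf, piece, MPI_DOUBLE,
                rank == root ? MPI_IN_PLACE : (void *)(buf + rank * piece),
                piece, MPI_DOUBLE, root, comm);

    /* Step 2: allgather.  With MPI_IN_PLACE, each rank contributes the piece
     * already sitting at buf + rank*piece and receives everyone else's. */
    MPI_Allgather(MPI_IN_PLACE, piece, MPI_DOUBLE,
                  buf, piece, MPI_DOUBLE, comm);
}
```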

Algorithm for long messages: the scatter step
(Figure: the root scatters pieces of the message to P0, P1, ..., Pp-1.)

Algorithm for long messages: the AllGather step
(Figure: P0, P1, ..., Pp-1 exchange pieces so that every process ends up with the entire message.)