ECE7660 Parallel Computer Architecture. Perspective on Parallel Programming


Outline
- Motivating problems (application case studies)
- Process of creating a parallel program
- What a simple parallel program looks like in three major programming models
- What primitives must a system support?

Simulating Ocean Currents
[Figure: (a) cross sections; (b) spatial discretization of a cross section]
- Model as two-dimensional grids
- Discretize in space and time
  - finer spatial and temporal resolution => greater accuracy
- Many different computations per time step
  - set up and solve equations
- Concurrency across and within grid computations
- Static and regular

Simulating Galaxy Evolution
- Simulate the interactions of many stars evolving over time
- Computing forces is expensive
  - O(n^2) brute-force approach
- Hierarchical methods (Barnes-Hut algorithm) take advantage of the force law F = G*m1*m2 / r^2
[Figure: star on which forces are being computed; a large group far enough away is approximated by its center of mass; a star too close to approximate; a small group far enough away to approximate]
- Many time-steps, plenty of concurrency across stars within each step

Rendering Scenes by Ray Tracing
- Shoot rays into the scene through pixels in the image plane
- Follow their paths
  - they bounce around as they strike objects
  - they generate new rays: a ray tree per input ray
- Result is color and opacity for that pixel
- Parallelism across rays
How much concurrency in these examples?

Creating a Parallel Program
Pieces of the job:
- Identify work that can be done in parallel
  - work includes computation, data access and I/O
- Partition work and perhaps data among processes
- Manage data access, communication and synchronization

Definitions
- Task: arbitrary piece of work in a parallel computation
  - executed sequentially; concurrency is only across tasks
  - e.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
  - fine-grained versus coarse-grained tasks
- Process (thread): abstract entity that performs the tasks assigned to it
  - processes communicate and synchronize to perform their tasks
- Processor: physical engine on which a process executes
  - processes virtualize the machine to the programmer: write the program in terms of processes, then map processes to processors

4 Steps in Creating a Parallel Program
[Figure: sequential computation -> decomposition -> tasks -> assignment -> processes -> orchestration -> parallel program -> mapping -> processors; decomposition plus assignment form the partitioning]
- Decomposition of the computation into tasks
- Assignment of tasks to processes
- Orchestration of data access, communication, synchronization
- Mapping of processes to processors

Decomposition
- Identify concurrency and decide the level at which to exploit it
- Break up the computation into tasks to be divided among processes
  - tasks may become available dynamically
  - the number of available tasks may vary with time
- Goal: enough tasks to keep processes busy, but not too many
  - the number of tasks available at a time is an upper bound on achievable speedup

Limited Concurrency: Amdahl's Law
- Most fundamental limitation on parallel speedup
- Amdahl's law: if a fraction s of the sequential execution is inherently serial, speedup <= 1/s
- Example: 2-phase calculation
  - sweep over an n-by-n grid and do some independent computation
  - sweep again and add each value into a global sum
- Time for first phase = n^2/p
- Second phase serialized at the global variable, so time = n^2
- Speedup <= 2n^2 / (n^2/p + n^2), or at most 2
- Trick: divide the second phase into two
  - accumulate into a private sum during the sweep
  - add the per-process private sums into the global sum
- Parallel time is n^2/p + n^2/p + p, and speedup is at best 2n^2*p / (2n^2 + p^2)
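Written out in LaTeX (restating the numbers on this slide; the sequential time is the two full sweeps, 2n^2):

\[ \text{Speedup} \;\le\; \frac{2n^2}{\,n^2/p + n^2\,} \;=\; \frac{2p}{p+1} \;<\; 2 \]

and after the trick of private partial sums followed by a p-term global accumulation:

\[ \text{Speedup} \;\le\; \frac{2n^2}{\,2n^2/p + p\,} \;=\; \frac{2n^2 p}{\,2n^2 + p^2\,} \;\approx\; p \quad (n \gg p) \]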

Understanding Amdahl's Law
[Figure: work done concurrently (1 to p) vs. time for the 2-phase example: (a) the serial view with segments n^2 and n^2; (b) first phase parallelized, n^2/p followed by a serial n^2; (c) both phases parallelized with private sums, n^2/p + n^2/p plus a serial accumulation of length p]

Concurrency Profiles
[Figure: concurrency profile of a sample parallel program; y-axis: concurrency (0 to 1,400), x-axis: clock cycle number]
- Area under the curve is the total work done, or the time with 1 processor
- Horizontal extent is a lower bound on time (with infinite processors)
- Speedup is the ratio

  Speedup(p) <= (Σ_{k>=1} f_k·k) / (Σ_{k>=1} f_k·⌈k/p⌉),

  where f_k is the amount of time spent with concurrency k; in the base case with serial fraction s this reduces to 1 / (s + (1-s)/p)
- Amdahl's law applies to any overhead, not just limited concurrency
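A small C illustration of the bound above, computed from a made-up concurrency profile (cycles[k] = number of clock cycles during which exactly k tasks are runnable; the profile values are hypothetical, not read off the lecture's figure):

#include <stdio.h>

static double speedup_bound(const long cycles[], int max_k, int p) {
    double work = 0.0, time_p = 0.0;
    for (int k = 1; k <= max_k; k++) {
        work   += (double)cycles[k] * k;                 /* area under the curve   */
        time_p += (double)cycles[k] * ((k + p - 1) / p); /* ceil(k/p) cycles on p  */
    }
    return work / time_p;                                /* never exceeds p        */
}

int main(void) {
    long cycles[5] = {0, 400, 0, 0, 600};   /* 400 cycles serial, 600 at concurrency 4 */
    for (int p = 1; p <= 8; p *= 2)
        printf("p=%d  speedup <= %.2f\n", p, speedup_bound(cycles, 4, p));
    return 0;
}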

Assignment
- Specify the mechanism to divide work up among processes
  - e.g. which process computes forces on which stars, or which rays
- Balance workload, reduce communication and management cost
- Structured approaches usually work well
  - code inspection (parallel loops) or understanding of the application
  - well-known heuristics
  - static versus dynamic assignment
- As programmers, we worry about partitioning first
  - usually independent of architecture or programming model
  - but cost and complexity of using primitives may affect decisions

Orchestration
- Naming data
- Structuring communication
- Synchronization
- Organizing data structures and scheduling tasks temporally
Goals:
- Reduce cost of communication and synchronization
- Preserve locality of data reference
- Schedule tasks to satisfy dependences early
- Reduce overhead of parallelism management
Choices depend on the programming model, the communication abstraction, and the efficiency of primitives
- Architects should provide appropriate primitives efficiently

Mapping
Two aspects:
- Which process runs on which particular processor?
  - mapping to a network topology
- Will multiple processes run on the same processor?
Space-sharing:
- Machine divided into subsets, only one application at a time in a subset
- Processes can be pinned to processors, or left to the OS
Real world:
- User specifies desires for some aspects, the system handles the rest
Usually adopt the view: process <-> processor

Parallelizing Computation vs. Data
- The computation is decomposed and assigned (partitioned)
- Partitioning data is often a natural view too
  - computation follows data: owner computes
- Distinction between computation and data is stronger in many applications
  - Barnes-Hut, Raytrace

Architect's Perspective
- What can be addressed by better hardware design?
- What is fundamentally a programming issue?

High-level Goals
Table 2.1: Steps in the parallelization process and their goals

  Step           Architecture-dependent?  Major performance goals
  Decomposition  Mostly no                Expose enough concurrency, but not too much
  Assignment     Mostly no                Balance workload; reduce communication volume
  Orchestration  Yes                      Reduce noninherent communication via data locality;
                                          reduce communication and synchronization cost as seen
                                          by the processor; reduce serialization at shared
                                          resources; schedule tasks to satisfy dependences early
  Mapping        Yes                      Put related processes on the same processor if
                                          necessary; exploit locality in network topology

- High performance (speedup over the sequential program)
- But low resource usage and development effort
- Implications for algorithm designers and architects?

What Parallel Programs Look Like

Example: Iterative Equation Solver
- Simplified version of a piece of the Ocean simulation
- Illustrates the program in a low-level parallel language
  - C-like pseudocode with simple extensions for parallelism
  - exposes basic communication and synchronization primitives
  - the state of most real parallel programming today
Expression for updating each interior point:
  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

Grid Solver
Expression for updating each interior point:
  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
- Gauss-Seidel (near-neighbor) sweeps to convergence
  - interior n-by-n points of an (n+2)-by-(n+2) grid updated in each sweep
  - updates done in place in the grid
  - difference from the previous value computed
  - partial diffs accumulated into a global diff at the end of every sweep
  - check if it has converged, to within a tolerance parameter

Sequential Version

1.  int n;                     /*size of matrix: (n+2)-by-(n+2) elements*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n);                 /*read input parameter: matrix size*/
6.    A ← malloc(a 2-d array of size (n+2) by (n+2) doubles);
7.    initialize(A);           /*initialize the matrix A somehow*/
8.    Solve(A);                /*call the routine to solve equation*/
9.  end main

10. procedure Solve(A)         /*solve the equation system*/
11.   float **A;               /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do         /*outermost loop over sweeps*/
16.     diff = 0;              /*initialize maximum difference to 0*/
17.     for i ← 1 to n do      /*sweep over nonborder points of grid*/
18.       for j ← 1 to n do
19.         temp = A[i,j];     /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure
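For concreteness, a plain-C rendering of the same sequential solver (a sketch: the tolerance value, grid size, and boundary initialization below are placeholder choices, not from the lecture):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TOL 1e-3f

static float **alloc_grid(int n) {               /* (n+2)-by-(n+2) grid, zeroed */
    float **A = malloc((n + 2) * sizeof *A);
    for (int i = 0; i < n + 2; i++)
        A[i] = calloc(n + 2, sizeof **A);
    return A;
}

static void solve(float **A, int n) {
    int done = 0;
    while (!done) {
        float diff = 0.0f;
        for (int i = 1; i <= n; i++)             /* sweep over interior points */
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);   /* accumulate convergence metric */
            }
        if (diff / (n * n) < TOL) done = 1;
    }
}

int main(void) {
    int n = 64;
    float **A = alloc_grid(n);
    for (int j = 0; j < n + 2; j++) A[0][j] = 1.0f;  /* simple boundary condition */
    solve(A, n);
    printf("A[1][1] = %f\n", A[1][1]);
    return 0;
}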

Decomposition
- A simple way to identify concurrency is to look at loop iterations
  - dependence analysis; if not enough concurrency, look further
- Not much concurrency here at this level (all loops sequential)
- Examine fundamental dependences
  - concurrency O(n) along anti-diagonals, serialization O(n) along the diagonal
- Retain loop structure, use point-to-point synch; problem: too many synch operations
- Restructure loops, use global synch; problem: imbalance and too much synchronization

Exploit Application Knowledge
- Reorder the grid traversal: red-black ordering
[Figure: checkerboard grid of red and black points]
- Different ordering of updates: may converge quicker or slower
- Red sweep and black sweep are each fully parallel
  - global synch between them (conservative but convenient)
- Ocean uses red-black; we use a simpler, asynchronous one to illustrate
  - no red-black, simply ignore dependences within a sweep
  - the parallel program is nondeterministic
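A sketch of one red-black iteration, assuming OpenMP is available (the color labeling and helper name are mine, not the lecture's): points of one color depend only on points of the other color, so each half-sweep is a fully parallel loop, and the implicit barrier at the end of each parallel for is the global synch between the two half-sweeps.

#include <math.h>

float redblack_sweep(float **A, int n) {
    float diff = 0.0f;
    for (int color = 0; color < 2; color++) {          /* two half-sweeps */
        #pragma omp parallel for reduction(+:diff) schedule(static)
        for (int i = 1; i <= n; i++) {
            for (int j = 1 + ((i + color) & 1); j <= n; j += 2) {   /* one parity only */
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        }
    }
    return diff;    /* caller tests diff/(n*n) < TOL as before */
}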

Decomposition

15.   while (!done) do                    /*a sequential loop*/
16.     diff = 0;
17.     for_all i ← 1 to n do             /*a parallel loop nest*/
18.       for_all j ← 1 to n do
19.         temp = A[i,j];
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);
22.         diff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while

- Decomposition into elements: degree of concurrency n^2
- Decompose into rows? Degree?
- for_all assignment??

Assignment
[Figure: contiguous blocks of rows assigned to processes P0, P1, P2, ...]
- Static assignment: decomposition into rows
  - block assignment of rows: row i is assigned to process ⌊i/(n/p)⌋
  - cyclic assignment of rows: process i is assigned rows i, i+p, ...
- Dynamic assignment
  - get a row index, work on the row, get a new row, ...
- What is the mechanism? Concurrency? Volume of communication?
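A small illustration of the two static formulas in C (assuming rows 1..n, processes 0..nprocs-1, and n divisible by nprocs; function names are mine):

/* Block: process pid owns a contiguous strip of n/nprocs rows. */
void block_range(int pid, int n, int nprocs, int *mymin, int *mymax) {
    *mymin = 1 + pid * (n / nprocs);
    *mymax = *mymin + (n / nprocs) - 1;
}

/* Cyclic: process pid owns rows pid+1, pid+1+nprocs, pid+1+2*nprocs, ... */
void cyclic_sweep(int pid, int n, int nprocs) {
    for (int i = pid + 1; i <= n; i += nprocs) {
        /* update row i here */
        (void)i;
    }
}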

Data Parallel Solver

1.  int n, nprocs;          /*grid size (n+2)-by-(n+2) and number of processes*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n); read(nprocs);    /*read input grid size and number of processes*/
6.    A ← G_MALLOC(a 2-d array of size (n+2) by (n+2) doubles);
7.    initialize(A);            /*initialize the matrix A somehow*/
8.    Solve(A);                 /*call the routine to solve equation*/
9.  end main

10. procedure Solve(A)          /*solve the equation system*/
11.   float **A;                /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.  DECOMP A[BLOCK,*, nprocs];
15.   while (!done) do          /*outermost loop over sweeps*/
16.     mydiff = 0;             /*initialize maximum difference to 0*/
17.     for_all i ← 1 to n do   /*sweep over non-border points of grid*/
18.       for_all j ← 1 to n do
19.         temp = A[i,j];      /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         mydiff[i,j] = abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.    REDUCE(mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

Shared Address Space Solver
- Single Program Multiple Data (SPMD)
[Figure: processes each execute Solve, sweep their portion of the grid, then test for convergence together]
- Assignment controlled by values of variables used as loop bounds

Generating Threads

1.  int n, nprocs;        /*matrix dimension and number of processors to be used*/
2a. float **A, diff;      /*A is global (shared) array representing the grid*/
                          /*diff is global (shared) maximum difference in current sweep*/
2b. LOCKDEC(diff_lock);   /*declaration of lock to enforce mutual exclusion*/
2c. BARDEC(bar1);         /*barrier declaration for global synchronization between sweeps*/
3.  main()
4.  begin
5.    read(n); read(nprocs);       /*read input matrix size and number of processes*/
6.    A ← G_MALLOC(a two-dimensional array of size (n+2) by (n+2) doubles);
7.    initialize(A);               /*initialize A in an unspecified way*/
8a.   CREATE(nprocs-1, Solve, A);
8.    Solve(A);                    /*main process becomes a worker too*/
8b.   WAIT_FOR_END(nprocs-1);      /*wait for all child processes created to terminate*/
9.  end main

10. procedure Solve(A)
11.   float **A;          /*A is the entire (n+2)-by-(n+2) shared array, as in the sequential program*/
12. begin
13.   ----
27. end procedure

Assignment Mechanism

10. procedure Solve(A)
11.   float **A;                      /*A is the entire (n+2)-by-(n+2) shared array, as in the sequential program*/
12. begin
13.   int i, j, pid, done = 0;
14.   float temp, mydiff = 0;         /*private variables*/
14a.  int mymin = 1 + (pid * n/nprocs);   /*assume that n is exactly divisible by*/
14b.  int mymax = mymin + n/nprocs - 1;   /*nprocs for simplicity here*/
15.   while (!done) do                /*outer loop over sweeps*/
16.     mydiff = diff = 0;            /*set global diff to 0 (okay for all to do it)*/
16a.    BARRIER(bar1, nprocs);        /*ensure all reach here before anyone modifies diff*/
17.     for i ← mymin to mymax do     /*for each of my rows*/
18.       for j ← 1 to n do           /*for all nonborder elements in that row*/
19.         temp = A[i,j];
20.         A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);
22.         mydiff += abs(A[i,j] - temp);
23.       endfor
24.     endfor
25a.    LOCK(diff_lock);              /*update global diff if necessary*/
25b.    diff += mydiff;
25c.    UNLOCK(diff_lock);
25d.    BARRIER(bar1, nprocs);        /*ensure all reach here before checking if done*/
25e.    if (diff/(n*n) < TOL) then done = 1;   /*check convergence; all get same answer*/
25f.    BARRIER(bar1, nprocs);
26.   endwhile
27. end procedure

SAS Program
- SPMD: not lockstep; not necessarily the same instructions
- Assignment controlled by values of variables used as loop bounds
  - unique pid per process, used to control assignment
- done condition evaluated redundantly by all
- Code that does the update is identical to the sequential program
  - each process has a private mydiff variable
- The most interesting special operations are for synchronization
  - accumulations into the shared diff have to be mutually exclusive
  - why the need for all the barriers? Can use a global reduction.
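A hedged pthreads rendering of the same SPMD structure: LOCKDEC/BARDEC/CREATE map onto pthread_mutex_t, pthread_barrier_t, and pthread_create; n, nprocs, TOL, and the omitted grid allocation are illustrative, not the lecture's code.

#include <pthread.h>
#include <math.h>

#define TOL 1e-3f
static int n = 256, nprocs = 4;
static float **A;                        /* shared (n+2)-by-(n+2) grid           */
static float diff;                       /* shared accumulated difference        */
static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t bar1;           /* pthread_barrier_init(&bar1, NULL, nprocs) in main */

static void *solve(void *arg) {
    int pid = (int)(long)arg;            /* unique pid controls the assignment   */
    int mymin = 1 + pid * (n / nprocs);  /* block of rows; n divisible by nprocs */
    int mymax = mymin + (n / nprocs) - 1;
    int done = 0;                        /* private; evaluated redundantly by all */

    while (!done) {
        float mydiff = 0.0f;             /* private partial accumulation         */
        diff = 0.0f;                     /* okay for every process to reset it   */
        pthread_barrier_wait(&bar1);     /* all see diff == 0 before updating    */

        for (int i = mymin; i <= mymax; i++)
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                mydiff += fabsf(A[i][j] - temp);
            }

        pthread_mutex_lock(&diff_lock);  /* mutually exclusive update of diff    */
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);

        pthread_barrier_wait(&bar1);     /* all contributions in before testing  */
        if (diff / (n * n) < TOL) done = 1;
        pthread_barrier_wait(&bar1);     /* keep diff stable until all have tested */
    }
    return NULL;
}

In main, one would pthread_create nprocs-1 workers, call solve((void *)0) itself, and pthread_join the rest, mirroring CREATE and WAIT_FOR_END.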

Mutual Exclusion
- Why is it needed?
- Provided by LOCK-UNLOCK around a critical section
  - the set of operations we want to execute atomically
  - the implementation of LOCK/UNLOCK must guarantee mutual exclusion
- Serialization? Contention? Non-local accesses in the critical section?
  - use the private mydiff for partial accumulation!

Global Event Synchronization
- BARRIER(nprocs): wait here until nprocs processes get here
  - built using lower-level primitives
- Global sum example: wait for all to accumulate before using the sum
- Often used to separate phases of computation:

  Process P_1           Process P_2           ...  Process P_nprocs
  set up eqn system     set up eqn system          set up eqn system
  Barrier(name, nprocs) Barrier(name, nprocs)      Barrier(name, nprocs)
  solve eqn system      solve eqn system           solve eqn system
  Barrier(name, nprocs) Barrier(name, nprocs)      Barrier(name, nprocs)
  apply results         apply results              apply results
  Barrier(name, nprocs) Barrier(name, nprocs)      Barrier(name, nprocs)

- A conservative form of preserving dependences, but easy to use
- WAIT_FOR_END(nprocs-1)
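To illustrate "built using lower-level primitives", here is a minimal centralized barrier made from a mutex and a condition variable (an illustrative sketch, not the lecture's BARRIER implementation; type and function names are mine):

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int count;        /* processes that have arrived in this episode */
    int nprocs;
    int generation;   /* distinguishes consecutive barrier episodes  */
} barrier_t;

void barrier_init(barrier_t *b, int nprocs) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->count = 0;
    b->nprocs = nprocs;
    b->generation = 0;
}

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int my_gen = b->generation;
    if (++b->count == b->nprocs) {        /* last arriver releases everyone     */
        b->count = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (my_gen == b->generation)   /* guards against spurious wakeups    */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}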

Pt-to-pt Event Synch (Not Used Here)
- One process notifies another of an event so it can proceed
- Common example: producer-consumer (bounded buffer)
- Concurrent programming on a uniprocessor: semaphores
- Shared address space parallel programs: semaphores, or ordinary variables used as flags

  P1                  P2
     A = 1;        a: while (flag is 0) do nothing;
  b: flag = 1;        print A;

- Busy-waiting or spinning
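The same flag idiom written with C11 atomics (a sketch: with plain variables the compiler or hardware may reorder the two accesses, so the release/acquire pair is what makes the write to A visible before the flag):

#include <stdatomic.h>
#include <stdio.h>

int A;                          /* ordinary shared data                    */
atomic_int flag = 0;            /* flag used for point-to-point signaling  */

void producer(void) {           /* P1 */
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);   /* b: flag = 1 */
}

void consumer(void) {           /* P2 */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                       /* a: busy-wait (spin) until flag is set    */
    printf("%d\n", A);          /* guaranteed to observe A == 1             */
}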

Group Event Synchronization
- Subset of processes involved
- Can use flags or barriers (involving only the subset)
- Concept of producers and consumers
- Major types:
  - single-producer, multiple-consumer
  - multiple-producer, single-consumer

Message Passing Grid Solver
- Cannot declare A to be a global shared array
  - compose it logically from per-process private arrays
  - usually allocated in accordance with the assignment of work: a process assigned a set of rows allocates them locally
- Transfers of entire rows between traversals
- Structurally similar to the SPMD SAS program
- Orchestration is different
  - data structures and data access/naming
  - communication
  - synchronization
- Ghost rows

Data Layout and Orchestration
[Figure: per-process data partitions P0..P4, each with ghost rows above and below its block]
- Data partition allocated per processor
- Add ghost rows to hold boundary data
- Send edge rows to neighbors
- Receive into ghost rows
- Compute as in the sequential program

10. procedure Solve()
11. begin
13.   int i, j, pid, n' = n/nprocs, done = 0;
14.   float temp, tempdiff, mydiff = 0;     /*private variables*/
6.    myA ← malloc(a 2-d array of size [n/nprocs + 2] by n+2);
                                            /*initialize my rows of A, in an unspecified way*/
15.   while (!done) do
16.     mydiff = 0;                         /*set local diff to 0*/
        /*exchange border rows of neighbors into myA[0,*] and myA[n'+1,*]*/
16a.    if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.    if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.    if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.    if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
17.     for i ← 1 to n' do                  /*for each of my (nonghost) rows*/
18.       for j ← 1 to n do                 /*for all nonborder elements in that row*/
19.         temp = myA[i,j];
20.         myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                    myA[i,j+1] + myA[i+1,j]);
22.         mydiff += abs(myA[i,j] - temp);
23.       endfor
24.     endfor
        /*communicate local diff values and determine if done; can be replaced by reduction and broadcast*/
25a.    if (pid != 0) then                  /*process 0 holds global total diff*/
25b.      SEND(mydiff, sizeof(float), 0, DIFF);
25c.      RECEIVE(done, sizeof(int), 0, DONE);
25d.    else                                /*pid 0 does this*/
25e.      for i ← 1 to nprocs-1 do          /*for each other process*/
25f.        RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.        mydiff += tempdiff;             /*accumulate into total*/
25h.      endfor
25i.      if (mydiff/(n*n) < TOL) then done = 1;
25j.      for i ← 1 to nprocs-1 do          /*for each other process*/
25k.        SEND(done, sizeof(int), i, DONE);
25l.      endfor
25m.    endif
26.   endwhile
27. end procedure
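A hedged MPI sketch of the ghost-row exchange in lines 16a-16d (the function name and the flattened row-major layout of myA, nlocal+2 rows of n+2 floats, are assumptions). MPI_Sendrecv pairs each send with its matching receive, which also sidesteps the deadlock discussed later for blocking synchronous sends.

#include <mpi.h>

void exchange_ghost_rows(float *myA, int nlocal, int n, int pid, int nprocs) {
    int up   = (pid == 0)          ? MPI_PROC_NULL : pid - 1;
    int down = (pid == nprocs - 1) ? MPI_PROC_NULL : pid + 1;
    int row  = n + 2;                              /* floats per row, incl. border columns */

    /* send my first real row up, receive the lower neighbor's edge into my bottom ghost row */
    MPI_Sendrecv(&myA[1 * row],            row, MPI_FLOAT, up,   0,
                 &myA[(nlocal + 1) * row], row, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* send my last real row down, receive the upper neighbor's edge into my top ghost row */
    MPI_Sendrecv(&myA[nlocal * row],       row, MPI_FLOAT, down, 1,
                 &myA[0 * row],            row, MPI_FLOAT, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}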

Notes on Message Passing Program
- Use of ghost rows
- Receive does not transfer data, send does
  - unlike SAS, which is usually receiver-initiated (a load fetches the data)
- Communication done at the beginning of each iteration, so no asynchrony
- Communication in whole rows, not one element at a time
- Core similar, but indices/bounds are in local rather than global space
- Synchronization through sends and receives
  - update of global diff and event synch for the done condition
  - could implement locks and barriers with messages
- Can use REDUCE and BROADCAST library calls to simplify code:

        /*communicate local diff values and determine if done, using reduction and broadcast*/
  25b.  REDUCE(0, mydiff, sizeof(float), ADD);
  25c.  if (pid == 0) then
  25i.    if (mydiff/(n*n) < TOL) then done = 1;
  25k.  endif
  25m.  BROADCAST(0, done, sizeof(int), DONE);
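In MPI, the REDUCE plus BROADCAST pair can collapse into a single collective: MPI_Allreduce leaves the global sum on every rank, so each process evaluates the convergence test redundantly, just as in the SAS version (a sketch; the function name and tol parameter are mine):

#include <mpi.h>

int converged(float mydiff, int n, float tol) {
    float diff;
    MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return (diff / ((float)n * n)) < tol;   /* same answer on all ranks */
}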

Send and Receive Alternatives
- Can extend functionality: stride, scatter-gather, groups
- Semantic flavors: based on when control is returned
  - affects when data structures or buffers can be reused at either end

  Send/Receive
    Synchronous
    Asynchronous
      Blocking asynchronous
      Nonblocking asynchronous

- Affects event synchronization (mutual exclusion: only one process touches the data at a time)
- Affects ease of programming and performance
- Synchronous messages provide built-in synchronization through the match
- Separate event synchronization is needed with asynchronous messages
- With synchronous messages, our code is deadlocked. Fix?
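One standard fix, sketched with MPI's synchronous-mode send (this is an illustration of the parity idea, not the lecture's answer): break the symmetry so that every blocking synchronous send finds its matching receive already posted. Even ranks send first and then receive; odd ranks do the opposite.

#include <mpi.h>

/* Upward exchange: everyone sends its first real row up and receives the
 * lower neighbor's edge row into its bottom ghost row. The downward exchange
 * is done the same way with the roles of up/down and the buffers swapped. */
void exchange_up_sync(float *edge_row, float *ghost_row, int count,
                      int pid, int nprocs) {
    int up = pid - 1, down = pid + 1;
    if (pid % 2 == 0) {                              /* even ranks: send, then receive */
        if (pid != 0)
            MPI_Ssend(edge_row, count, MPI_FLOAT, up, 0, MPI_COMM_WORLD);
        if (pid != nprocs - 1)
            MPI_Recv(ghost_row, count, MPI_FLOAT, down, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else {                                         /* odd ranks: receive, then send  */
        if (pid != nprocs - 1)
            MPI_Recv(ghost_row, count, MPI_FLOAT, down, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        if (pid != 0)
            MPI_Ssend(edge_row, count, MPI_FLOAT, up, 0, MPI_COMM_WORLD);
    }
}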

Orchestration: Summary
Shared address space:
- Shared and private data explicitly separate
- Communication implicit in access patterns
- No correctness need for data distribution
- Synchronization via atomic operations on shared data
- Synchronization explicit and distinct from data communication
Message passing:
- Data distribution among local address spaces needed
- No explicit shared structures (implicit in communication patterns)
- Communication is explicit
- Synchronization implicit in communication (at least in the synchronous case)
  - mutual exclusion

Correctness in Grid Solver Program
- Decomposition and assignment are similar in SAS and message passing
- Orchestration is different
  - data structures, data access/naming, communication, synchronization
- Performance?

                                        SAS        Msg-Passing
  Explicit global data structure?       Yes        No
  Assignment indept of data layout?     Yes        No
  Communication                         Implicit   Explicit
  Synchronization                       Explicit   Implicit
  Explicit replication of border rows?  No         Yes