Simulating Ocean Currents. Simulating Galaxy Evolution

Simulating Ocean Currents

[Figure: (a) cross sections; (b) spatial discretization of a cross section]

Model as two-dimensional grids
Discretize in space and time: finer spatial and temporal resolution => greater accuracy
Many different computations per time step: set up and solve equations
Concurrency across and within grid computations

Simulating Galaxy Evolution

Simulate the interactions of many stars evolving over time
Computing forces is expensive: O(n^2) brute-force approach
Hierarchical methods take advantage of the force law G*m1*m2 / r^2

[Figure: star on which forces are being computed; a large group far enough away to approximate by its center of mass; a star too close to approximate; a small group far enough away to approximate]

Many time-steps, plenty of concurrency across stars within one
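To make the O(n^2) cost concrete, here is a minimal brute-force force computation in C. This is an illustrative sketch, not code from the slides; the Body struct and the softening constant EPS are assumptions:

    #include <math.h>

    typedef struct { double x, y, m, fx, fy; } Body;  /* position, mass, accumulated force */

    #define G   6.674e-11
    #define EPS 1e-9   /* softening term to avoid division by zero */

    /* All-pairs force computation: O(n^2) interactions. Hierarchical methods
       (e.g. Barnes-Hut) replace distant groups by their center of mass. */
    void compute_forces(Body *b, int n) {
        for (int i = 0; i < n; i++) { b[i].fx = 0.0; b[i].fy = 0.0; }
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
                double r2 = dx*dx + dy*dy + EPS;
                double f  = G * b[i].m * b[j].m / r2;   /* force law: G m1 m2 / r^2 */
                double r  = sqrt(r2);
                b[i].fx += f * dx / r;                  /* resolve along x and y */
                b[i].fy += f * dy / r;
            }
    }

The concurrency the slide mentions is across iterations of the outer loop: the force on each star can be computed independently within one time-step.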

Rendering Scenes by Ray Tracing

Shoot rays into scene through pixels in image plane
Follow their paths: they bounce around as they strike objects; they generate new rays: a ray tree per input ray
Result is color and opacity for that pixel
Parallelism across rays

All case studies have abundant concurrency

Creating a Parallel Program

Assumption: sequential algorithm is given
  Sometimes need a very different algorithm, but that is beyond our scope
Pieces of the job:
  Identify work that can be done in parallel
  Partition work and perhaps data among processes
  Manage data access, communication and synchronization
  Note: work includes computation, data access and I/O
Main goal: speedup (plus low programming effort and resource needs)
For a fixed problem:
  Speedup(p) = Performance(p) / Performance(1) = Time(1) / Time(p)

Steps in Creating a Parallel Program

[Figure: sequential computation -> (decomposition) -> tasks -> (assignment) -> processes -> (orchestration) -> parallel program -> (mapping) -> processors P0..P3; "partitioning" covers the first two steps]

4 steps: decomposition, assignment, orchestration, mapping
Done by programmer or system software (compiler, runtime, ...)
Issues are the same, so assume the programmer does it all explicitly

Some Important Concepts

Task: arbitrary piece of undecomposed work in a parallel computation
  Executed sequentially; concurrency is only across tasks
  E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
  Fine-grained versus coarse-grained tasks
Process (thread): abstract entity that performs the tasks assigned to it
  Processes communicate and synchronize to perform their tasks
Processor: physical engine on which a process executes
  Processes virtualize the machine to the programmer: first write the program in terms of processes, then map to processors

Decomposition

Break up computation into tasks to be divided among processes
  Tasks may become available dynamically
  No. of available tasks may vary with time
i.e. identify concurrency and decide level at which to exploit it
Goal: enough tasks to keep processes busy, but not too many
  No. of tasks available at a time is an upper bound on achievable speedup

Limited Concurrency: Amdahl's Law

Most fundamental limitation on parallel speedup
If fraction s of sequential execution is inherently serial, speedup <= 1/s
Example: 2-phase calculation
  sweep over n-by-n grid and do some independent computation
  sweep again and add each value to a global sum
Time for first phase = n^2/p
Second phase serialized at the global variable, so time = n^2
  Speedup <= 2n^2 / (n^2 + n^2/p), or at most 2
Trick: divide second phase into two
  accumulate into a private sum during the sweep
  add each per-process private sum into the global sum
Parallel time is n^2/p + n^2/p + p, and speedup is at best
  2n^2 / (2n^2/p + p) = 2n^2 p / (2n^2 + p^2)
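A minimal pthreads sketch of the private-sum trick; the grid size N, process count P, and initialization are illustrative assumptions:

    #include <pthread.h>
    #include <stdio.h>

    #define N 512
    #define P 4

    static double grid[N][N];
    static double global_sum = 0.0;
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Phase 2 with the trick: accumulate privately, then add once under a lock,
       so serialization is O(p) instead of O(n^2). */
    static void *phase2(void *arg) {
        long id = (long)arg;
        double mysum = 0.0;
        for (int i = id * (N / P); i < (id + 1) * (N / P); i++)
            for (int j = 0; j < N; j++)
                mysum += grid[i][j];        /* private accumulation: fully parallel */
        pthread_mutex_lock(&sum_lock);
        global_sum += mysum;                /* one serialized add per process */
        pthread_mutex_unlock(&sum_lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) grid[i][j] = 1.0;
        for (long id = 0; id < P; id++) pthread_create(&t[id], NULL, phase2, (void *)id);
        for (int id = 0; id < P; id++) pthread_join(t[id], NULL);
        printf("sum = %f\n", global_sum);   /* expect N*N */
        return 0;
    }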

Pictorial Depiction

[Figure: work done concurrently over time for the three versions — (a) both phases serial: n^2 + n^2; (b) first phase parallel: n^2/p + n^2; (c) both phases parallel with private sums: n^2/p + n^2/p + p]

Concurrency Profiles

Cannot usually divide a program into just a serial and a parallel part

[Figure: concurrency profile — concurrency (number of simultaneously active tasks) plotted against clock cycle number]

Area under the curve is total work done, or time with 1 processor
Horizontal extent is a lower bound on time (infinite processors)
Speedup is the ratio:
  Speedup(p) = (sum over k of f_k * k) / (sum over k of f_k * ceil(k/p))
where f_k is the time spent at concurrency level k; the base case with serial fraction s gives 1 / (s + (1-s)/p)
Amdahl's law applies to any overhead, not just limited concurrency
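The ratio is easy to evaluate numerically; a small sketch (the profile array f is a hypothetical example):

    #include <stdio.h>

    /* Speedup from a concurrency profile: f[k-1] is the (normalized) time spent
       at concurrency level k in the infinite-processor execution. */
    double profile_speedup(const double *f, int kmax, int p) {
        double work = 0.0, time = 0.0;
        for (int k = 1; k <= kmax; k++) {
            work += f[k - 1] * k;                 /* area under the curve */
            time += f[k - 1] * ((k + p - 1) / p); /* ceil(k/p) steps on p processors */
        }
        return work / time;
    }

    int main(void) {
        double f[4] = {0.1, 0.2, 0.3, 0.4};  /* hypothetical profile, concurrency 1..4 */
        printf("speedup on 2 processors: %f\n", profile_speedup(f, 4, 2));
        return 0;
    }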

Assignment

Specifying the mechanism to divide work up among processes
  E.g. which process computes forces on which stars, or which rays
  Together with decomposition, also called partitioning
  Balance workload, reduce communication and management cost
Structured approaches usually work well
  Code inspection (parallel loops) or understanding of application
  Well-known heuristics
  Static versus dynamic assignment
As programmers, we worry about partitioning first
  Usually independent of architecture or programming model
  But cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it

Orchestration

Naming data
Structuring communication
Synchronization
Organizing data structures and scheduling tasks temporally
Goals:
  Reduce cost of communication and synch. as seen by processors
  Preserve locality of data reference (incl. data structure organization)
  Schedule tasks to satisfy dependences early
  Reduce overhead of parallelism management
Closest to architecture (and programming model & language)
  Choices depend a lot on comm. abstraction, efficiency of primitives
  Architects should provide appropriate primitives efficiently

Mapping

After orchestration, we already have a parallel program
Two aspects of mapping:
  Which processes will run on the same processor, if necessary
  Which process runs on which particular processor
    mapping to a network topology
One extreme: space-sharing
  Machine divided into subsets, only one app at a time in a subset
  Processes can be pinned to processors, or left to the OS (see the sketch below)
Another extreme: complete resource management control to the OS
  OS uses the performance techniques we will discuss later
Real world is between the two
  User specifies desires in some aspects, system may ignore
Usually adopt the view: process <-> processor

Parallelizing Computation vs. Data

The view above is centered around computation
  Computation is decomposed and assigned (partitioned)
Partitioning data is often a natural view too
  Computation follows data: owner computes
  Grid example; data mining; High Performance Fortran (HPF)
But not general enough
  Distinction between computation and data stronger in many applications
    Barnes-Hut, Raytrace (later)
  Retain computation-centric view
  Data access and communication is part of orchestration
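On the pinning point under Mapping: Linux exposes explicit affinity control. A minimal sketch using the GNU-specific pthread_setaffinity_np (hence _GNU_SOURCE); the helper name pin_to_cpu is ours:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a given CPU; returns 0 on success.
       This realizes the process <-> processor view explicitly. */
    int pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }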

High-level Goals

High performance (speedup over sequential program)

Table 2.1 Steps in the Parallelization Process and Their Goals

Step           | Architecture-Dependent? | Major Performance Goals
Decomposition  | Mostly no               | Expose enough concurrency, but not too much
Assignment     | Mostly no               | Balance workload; reduce communication volume
Orchestration  | Yes                     | Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping        | Yes                     | Put related processes on the same processor if necessary; exploit locality in network topology

But also low resource usage and development effort
Implications for algorithm designers and architects:
  Algorithm designers: high performance, low resource needs
  Architects: high performance, low cost, reduced programming effort
    e.g. gradually improving performance with programming effort may be preferable to a sudden threshold after large programming effort

Parallelization of an Example Program

Motivating problems all lead to large, complex programs
Examine a simplified version of a piece of the Ocean simulation
  Iterative equation solver
Illustrate a parallel program in a low-level parallel language
  C-like pseudocode with simple extensions for parallelism
  Expose basic comm. and synch. primitives that must be supported
  State of most real parallel programming today

Grid Solver Example

Expression for updating each interior point:
  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

Simplified version of the solver in the Ocean simulation
  Gauss-Seidel (near-neighbor) sweeps to convergence
  interior n-by-n points of an (n+2)-by-(n+2) grid updated in each sweep
  updates done in place in the grid, and difference from previous value computed
  accumulate partial diffs into a global diff at the end of every sweep
  check if the error has converged (to within a tolerance parameter)
  if so, exit solver; if not, do another sweep

1.  int n;                     /*size of matrix: (n + 2)-by-(n + 2) elements*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n);                 /*read input parameter: matrix size*/
6.    A ← malloc(a 2-d array of size n + 2 by n + 2 doubles);
7.    initialize(A);           /*initialize the matrix A somehow*/
8.    Solve(A);                /*call the routine to solve equation*/
9.  end main

10. procedure Solve(A)         /*solve the equation system*/
11.   float **A;               /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do         /*outermost loop over sweeps*/
16.     diff = 0;              /*initialize maximum difference to 0*/
17.     for i ← 1 to n do      /*sweep over nonborder points of grid*/
18.       for j ← 1 to n do
19.         temp = A[i,j];     /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure
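A compilable C rendering of this pseudocode; TOL, the grid size, and the boundary initialization are assumptions chosen for illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define TOL 1e-3

    /* Gauss-Seidel sweeps over the n-by-n interior of an (n+2)-by-(n+2) grid,
       updating in place and accumulating the per-sweep difference. */
    void solve(float **A, int n) {
        int done = 0;
        while (!done) {
            float diff = 0.0f;
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++) {
                    float temp = A[i][j];
                    A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                      A[i][j+1] + A[i+1][j]);
                    diff += fabsf(A[i][j] - temp);
                }
            if (diff / (n * n) < TOL) done = 1;
        }
    }

    int main(void) {
        int n = 64;
        float **A = malloc((n + 2) * sizeof(float *));
        for (int i = 0; i < n + 2; i++) A[i] = calloc(n + 2, sizeof(float));
        for (int i = 0; i < n + 2; i++) A[i][0] = 1.0f;  /* arbitrary boundary condition */
        solve(A, n);
        printf("A[1][1] = %f\n", A[1][1]);
        return 0;
    }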

Decomposition

A simple way to identify concurrency is to look at loop iterations
  dependence analysis; if not enough concurrency, then look further
Not much concurrency here at this level (all loops sequential)
Examine fundamental dependences, ignoring loop structure
  Concurrency O(n) along anti-diagonals, serialization O(n) along the diagonal
  Retain loop structure, use point-to-point synch; problem: too many synch operations
  Restructure loops, use global synch; imbalance and too much synch

Exploit Application Knowledge

Reorder grid traversal: red-black ordering

[Figure: checkerboard grid of red and black points]

Different ordering of updates: may converge quicker or slower
Red sweep and black sweep are each fully parallel:
  Global synch between them (conservative but convenient)
Ocean uses red-black; we use a simpler, asynchronous one to illustrate
  no red-black, simply ignore dependences within a sweep
  sequential order same as original, parallel program nondeterministic
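A sketch of the red-black traversal: the parity of i+j picks the color, and each colored sweep is fully parallel:

    /* One red sweep followed by one black sweep over the interior points.
       Points of one color depend only on points of the other color, so each
       sweep can be parallelized freely; a global synch sits between them. */
    void redblack_sweep(float **A, int n) {
        for (int color = 0; color <= 1; color++)   /* 0 = red, 1 = black */
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    if ((i + j) % 2 == color)
                        A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                          A[i][j+1] + A[i+1][j]);
    }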

Decomposition Only

15.   while (!done) do         /*a sequential loop*/
16.     diff = 0;
17.     for_all i ← 1 to n do  /*a parallel loop nest*/
18.       for_all j ← 1 to n do
19.         temp = A[i,j];
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);
22.         diff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while

Decomposition into elements: degree of concurrency n^2
To decompose into rows, make the loop at line 18 sequential; degree of concurrency n
for_all leaves assignment to the system
  but implicit global synch. at the end of a for_all loop

Assignment

Static assignments (given decomposition into rows)
  block assignment of rows: row i is assigned to process floor(i / (n/p))
  cyclic assignment of rows: process i is assigned rows i, i+p, and so on

[Figure: block assignment of contiguous row bands to processes P0, P1, P2, P3]

Dynamic assignment
  get a row index, work on the row, get a new row, and so on
Static assignment into rows reduces concurrency (from n to p)
  block assignment reduces communication by keeping adjacent rows together
Let's dig into orchestration under three programming models
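The two static assignments reduce to simple index arithmetic; a sketch using the pid/nprocs naming of the later pseudocode, assuming n is divisible by nprocs:

    /* Block assignment: process pid gets rows [mymin, mymax]. */
    void block_rows(int pid, int nprocs, int n, int *mymin, int *mymax) {
        *mymin = 1 + pid * (n / nprocs);
        *mymax = *mymin + n / nprocs - 1;   /* adjacent rows stay together */
    }

    /* Cyclic assignment: process pid gets rows pid+1, pid+1+nprocs, ... */
    int cyclic_row(int pid, int nprocs, int k) {
        return 1 + pid + k * nprocs;        /* k-th row owned by pid, k = 0,1,2,... */
    }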

Data Parallel Solver

1.  int n, nprocs;             /*grid size (n + 2)-by-(n + 2) and number of processes*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n); read(nprocs);   /*read input grid size and number of processes*/
6.    A ← G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);           /*initialize the matrix A somehow*/
8.    Solve(A);                /*call the routine to solve equation*/
9.  end main

10. procedure Solve(A)         /*solve the equation system*/
11.   float **A;               /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.  DECOMP A[BLOCK,*,nprocs];
15.   while (!done) do         /*outermost loop over sweeps*/
16.     mydiff = 0;            /*initialize maximum difference to 0*/
17.     for_all i ← 1 to n do  /*sweep over non-border points of grid*/
18.       for_all j ← 1 to n do
19.         temp = A[i,j];     /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         mydiff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.    REDUCE(mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

Shared Address Space Solver

Single Program Multiple Data (SPMD)

[Figure: processes each call Solve and sweep their portion of the grid, then test for convergence]

Assignment controlled by values of variables used as loop bounds
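Before the explicit SAS code, note that the for_all + REDUCE pattern maps directly onto OpenMP on a shared address space; a minimal sketch (not the slides' data-parallel language) of one sweep of the asynchronous version:

    #include <math.h>

    /* One sweep with an implicit barrier at the end of the parallel loop and a
       built-in reduction, mirroring for_all + REDUCE(mydiff, diff, ADD).
       Dependences within the sweep are ignored, as in the slides' version. */
    float sweep_omp(float **A, int n) {
        float diff = 0.0f;
        #pragma omp parallel for reduction(+:diff) schedule(static)
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        return diff;
    }

schedule(static) gives each thread a contiguous block of rows, matching the BLOCK decomposition above.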

1.  int n, nprocs;             /*matrix dimension and number of processors to be used*/
2a. float **A, diff;           /*A is global (shared) array representing the grid*/
                               /*diff is global (shared) maximum difference in current sweep*/
2b. LOCKDEC(diff_lock);        /*declaration of lock to enforce mutual exclusion*/
2c. BARDEC(bar1);              /*barrier declaration for global synchronization between sweeps*/
3.  main()
4.  begin
5.    read(n); read(nprocs);   /*read input matrix size and number of processes*/
6.    A ← G_MALLOC(a two-dimensional array of size n+2 by n+2 doubles);
7.    initialize(A);           /*initialize A in an unspecified way*/
8a.   CREATE(nprocs-1, Solve, A);
8.    Solve(A);                /*main process becomes a worker too*/
8b.   WAIT_FOR_END(nprocs-1);  /*wait for all child processes created to terminate*/
9.  end main

10. procedure Solve(A)
11.   float **A;               /*A is entire n+2-by-n+2 shared array, as in the sequential program*/
12. begin
13.   int i, j, pid, done = 0;
14.   float temp, mydiff = 0;  /*private variables*/
14a.  int mymin = 1 + (pid * n/nprocs);   /*assume that n is exactly divisible by*/
14b.  int mymax = mymin + n/nprocs - 1;   /*nprocs for simplicity here*/
15.   while (!done) do         /*outermost loop over sweeps*/
16.     mydiff = diff = 0;     /*set global diff to 0 (okay for all to do it)*/
16a.    BARRIER(bar1, nprocs); /*ensure all reach here before anyone modifies diff*/
17.     for i ← mymin to mymax do   /*for each of my rows*/
18.       for j ← 1 to n do         /*for all nonborder elements in that row*/
19.         temp = A[i,j];
20.         A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);
22.         mydiff += abs(A[i,j] - temp);
23.       endfor
24.     endfor
25a.    LOCK(diff_lock);       /*update global diff if necessary*/
25b.    diff += mydiff;
25c.    UNLOCK(diff_lock);
25d.    BARRIER(bar1, nprocs); /*ensure all reach here before checking if done*/
25e.    if (diff/(n*n) < TOL) then done = 1;   /*check convergence; all get same answer*/
25f.    BARRIER(bar1, nprocs);
26.   endwhile
27. end procedure

Notes on SAS Program

SPMD: not lockstep or even necessarily the same instructions
Assignment controlled by values of variables used as loop bounds
  unique pid per process, used to control assignment
Done condition evaluated redundantly by all
Code that does the update is identical to the sequential program
  each process has a private mydiff variable
Most interesting special operations are for synchronization
  accumulations into shared diff have to be mutually exclusive
  why the need for all the barriers?
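The LOCK/UNLOCK and BARRIER primitives map directly onto pthreads. A condensed sketch of the worker under the same divisibility assumption (barrier initialization and thread creation happen in a main() not shown; done is a benign redundant write here, as in the pseudocode, but a production program would synchronize it properly):

    #include <pthread.h>
    #include <math.h>

    #define NPROCS 4
    static float **A;               /* shared grid, allocated before thread creation */
    static int n, done = 0;
    static float diff = 0.0f;
    static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t bar1;  /* pthread_barrier_init(&bar1, NULL, NPROCS) in main */

    static void *solve(void *arg) {
        int pid = (int)(long)arg;
        int mymin = 1 + pid * (n / NPROCS), mymax = mymin + n / NPROCS - 1;
        while (!done) {
            float mydiff = 0.0f;
            diff = 0.0f;                     /* okay for all to write 0 */
            pthread_barrier_wait(&bar1);     /* all see diff == 0 before updates */
            for (int i = mymin; i <= mymax; i++)
                for (int j = 1; j <= n; j++) {
                    float temp = A[i][j];
                    A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                      A[i][j+1] + A[i+1][j]);
                    mydiff += fabsf(A[i][j] - temp);
                }
            pthread_mutex_lock(&diff_lock);
            diff += mydiff;                  /* mutually exclusive accumulation */
            pthread_mutex_unlock(&diff_lock);
            pthread_barrier_wait(&bar1);     /* all contributions in before the test */
            if (diff / (float)(n * n) < 1e-3f) done = 1;  /* same answer everywhere */
            pthread_barrier_wait(&bar1);     /* don't reset diff while others test */
        }
        return NULL;
    }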

Message Passing Grid Solver

Cannot declare A to be a shared array any more
Need to compose it logically from per-process private arrays
  usually allocated in accordance with the assignment of work
  process assigned a set of rows allocates them locally
Transfers of entire rows between traversals
Structurally similar to SAS (e.g. SPMD), but orchestration different
  data structures and data access/naming
  communication
  synchronization

1.  int pid, n, nprocs;        /*process id, matrix dimension and number of processors to be used*/
2.  float **myA;
3.  main()
4.  begin
5.    read(n); read(nprocs);   /*read input matrix size and number of processes*/
8a.   CREATE(nprocs-1, Solve);
8b.   Solve();                 /*main process becomes a worker too*/
8c.   WAIT_FOR_END(nprocs-1);  /*wait for all child processes created to terminate*/
9.  end main

10. procedure Solve()
11. begin
13.   int i, j, pid, n' = n/nprocs, done = 0;
14.   float temp, tempdiff, mydiff = 0;   /*private variables*/
6.    myA ← malloc(a 2-d array of size [n/nprocs + 2] by n+2);   /*my assigned rows of A*/
7.    initialize(myA);         /*initialize my rows of A, in an unspecified way*/
15.   while (!done) do
16.     mydiff = 0;            /*set local diff to 0*/
16a.    if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.    if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.    if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.    if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
        /*border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*]*/
17.     for i ← 1 to n' do     /*for each of my (nonghost) rows*/
18.       for j ← 1 to n do    /*for all nonborder elements in that row*/
19.         temp = myA[i,j];
20.         myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                  myA[i,j+1] + myA[i+1,j]);
22.         mydiff += abs(myA[i,j] - temp);
23.       endfor
24.     endfor
        /*communicate local diff values and determine if done; can be replaced by reduction and broadcast*/
25a.    if (pid != 0) then     /*process 0 holds global total diff*/
25b.      SEND(mydiff, sizeof(float), 0, DIFF);
25c.      RECEIVE(done, sizeof(int), 0, DONE);
25d.    else                   /*pid 0 does this*/
25e.      for i ← 1 to nprocs-1 do   /*for each other process*/
25f.        RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.        mydiff += tempdiff;      /*accumulate into total*/
25h.      endfor
25i.      if (mydiff/(n*n) < TOL) then done = 1;
25j.      for i ← 1 to nprocs-1 do   /*for each other process*/
25k.        SEND(done, sizeof(int), i, DONE);
25l.      endfor
25m.    endif
26.   endwhile
27. end procedure
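With MPI, the ghost-row exchange and the diff accumulation look like this; a sketch assuming each process owns np interior rows, where whole rows of n+2 floats (including border columns) are exchanged for simplicity:

    #include <mpi.h>
    #include <math.h>

    /* One iteration: exchange border rows with neighbors, sweep local rows,
       then combine local diffs. MPI_Allreduce replaces the explicit
       SEND/RECEIVE funnel through process 0. Note: standard-mode MPI_Send may
       buffer small messages; the Sendrecv variant below avoids relying on that. */
    int iterate(int np, int n, float myA[np + 2][n + 2], int pid, int nprocs) {
        MPI_Status st;
        if (pid != 0)                /* send my first row up */
            MPI_Send(myA[1], n + 2, MPI_FLOAT, pid - 1, 0, MPI_COMM_WORLD);
        if (pid != nprocs - 1)       /* send my last row down */
            MPI_Send(myA[np], n + 2, MPI_FLOAT, pid + 1, 0, MPI_COMM_WORLD);
        if (pid != 0)                /* receive upper ghost row */
            MPI_Recv(myA[0], n + 2, MPI_FLOAT, pid - 1, 0, MPI_COMM_WORLD, &st);
        if (pid != nprocs - 1)       /* receive lower ghost row */
            MPI_Recv(myA[np + 1], n + 2, MPI_FLOAT, pid + 1, 0, MPI_COMM_WORLD, &st);

        float mydiff = 0.0f, diff;
        for (int i = 1; i <= np; i++)
            for (int j = 1; j <= n; j++) {
                float temp = myA[i][j];
                myA[i][j] = 0.2f * (myA[i][j] + myA[i][j-1] + myA[i-1][j] +
                                    myA[i][j+1] + myA[i+1][j]);
                mydiff += fabsf(myA[i][j] - temp);
            }
        MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        return diff / (float)(n * n) < 1e-3f;   /* done flag, same at every process */
    }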

Notes on Message Passing Program

Use of ghost rows
Receive does not transfer data, send does
  unlike SAS, which is usually receiver-initiated (a load fetches data)
Communication done at the beginning of an iteration, so no asynchrony
Communication in whole rows, not one element at a time
Core is similar, but indices/bounds are in local rather than global space
Synchronization through sends and receives
  Update of global diff and event synch for done condition
  Could implement locks and barriers with messages
Can use REDUCE and BROADCAST library calls to simplify code:

        /*communicate local diff values and determine if done, using reduction and broadcast*/
25b.    REDUCE(0, mydiff, sizeof(float), ADD);
25c.    if (pid == 0) then
25i.      if (mydiff/(n*n) < TOL) then done = 1;
25k.    endif
25m.    BROADCAST(0, done, sizeof(int), DONE);

Send and Receive Alternatives

Can extend functionality: stride, scatter-gather, groups
Semantic flavors: based on when control is returned
  Affect when data structures or buffers can be reused at either end

  Send/Receive
    Synchronous
    Asynchronous
      Blocking asynchronous
      Nonblocking asynchronous

Affect event synch (mutual exclusion by fiat: only one process touches the data)
Affect ease of programming and performance
Synchronous messages provide built-in synch. through the match
  Separate event synchronization needed with asynch. messages
With synch. messages, our code is deadlocked. Fix?
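One standard fix pairs each send with the matching receive; MPI_Sendrecv does exactly that, and MPI_PROC_NULL turns the boundary cases into no-ops. A sketch of the border exchange rewritten this way (an alternative fix has odd and even processes alternate their send/receive order):

    #include <mpi.h>

    /* Deadlock-free border exchange: each Sendrecv pairs a send with the
       matching receive, so no process blocks waiting for a partner that is
       itself blocked sending. */
    void exchange_borders(int np, int n, float myA[np + 2][n + 2],
                          int pid, int nprocs) {
        int up = (pid != 0) ? pid - 1 : MPI_PROC_NULL;
        int dn = (pid != nprocs - 1) ? pid + 1 : MPI_PROC_NULL;
        MPI_Sendrecv(myA[1],      n + 2, MPI_FLOAT, up, 0,   /* first row goes up */
                     myA[np + 1], n + 2, MPI_FLOAT, dn, 0,   /* lower ghost comes in */
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(myA[np],     n + 2, MPI_FLOAT, dn, 0,   /* last row goes down */
                     myA[0],      n + 2, MPI_FLOAT, up, 0,   /* upper ghost comes in */
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }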

Orchestration: Summary

Shared address space
  Shared and private data explicitly separate
  Communication implicit in access patterns
  No correctness need for data distribution
  Synchronization via atomic operations on shared data
  Synchronization explicit and distinct from data communication

Message passing
  Data distribution among local address spaces needed
  No explicit shared structures (implicit in comm. patterns)
  Communication is explicit
  Synchronization implicit in communication (at least in the synchronous case)
    mutual exclusion by fiat

Correctness in Grid Solver Program

Decomposition and assignment are similar in SAS and message passing
Orchestration is different
  Data structures, data access/naming, communication, synchronization

                                          SAS        Msg-Passing
Explicit global data structure?           Yes        No
Assignment independent of data layout?    Yes        No
Communication                             Implicit   Explicit
Synchronization                           Explicit   Implicit
Explicit replication of border rows?      No         Yes

Requirements for performance are another story...
