Parallel Programming
1 Parallel Programming 7. Data Parallelism Christoph von Praun 07-1
2 (1) Parallel algorithm structure design space
- Organization by Data: (1.1) Geometric Decomposition, (1.2) Recursive Data
- Organization by Tasks: (1.3) Task Parallelism, (1.4) Divide and Conquer
- Organization by Data Flow: (1.5) Pipeline
07-2
3 (1.1) Geometric decomposition Context: The application operates on a large data structure with many data items. The operation on each data item has regular access patterns and clear dependencies. The application is typically data-intensive, with little computation per data item. 07-3
4 Example: Heat transfer simulation. [Figure: 2D temperature field (normal vs. hot cells) evolving over n simulation steps] 07-4
5 Stencil (= schema according to which T is updated): [Figure: cell T with its four neighbors t1..t4; normal vs. hot temperature] T_new = (t1_old + t2_old + t3_old + t4_old) / 4. Iterate until |T_new - T_old| < ε. 07-5
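As a point of reference, a sequential sketch of one such update sweep in the X10 style used later in this deck; told/tnew, NROWS, NCOLS and the convergence bookkeeping are illustrative assumptions, not from the slides:

    // One Jacobi-style sweep over the interior of an NROWS x NCOLS grid
    // (indexed from 0): every interior cell becomes the average of its four
    // old neighbors; the returned delta drives the "until < eps" test.
    def sweep(told: Array[Double], tnew: Array[Double]): Double {
        var delta: Double = 0.0;
        for ([r,c] in (1..(NROWS-2)) * (1..(NCOLS-2))) {
            tnew(r,c) = (told(r-1,c) + told(r+1,c) + told(r,c-1) + told(r,c+1)) / 4.0;
            delta = Math.max(delta, Math.abs(tnew(r,c) - told(r,c)));
        }
        return delta; // caller swaps told/tnew and repeats while delta >= eps
    }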
6 Forces Data decomposition: different parts of the data structure are assigned to different activities. Which granularity and topology? A naive decomposition may not be ideal. Scheduling: coordination is required if the operation of one activity requires input from data belonging to another activity. 07-6
7 Solution (template) Partition the data into chunks, one chunk per activity. Activities must access their chunks plus inputs efficiently (this may require explicit communication if the data is distributed). Each activity updates only its own chunk. 07-7
8 Use data copies ("ghost cells") to reduce dependencies across different chunks (old/new schema):
- The red activity keeps a copy of t2_old.
- The red and blue activity exchange data in lockstep.
[Figure: stencil with T and its neighbors t1..t4 split across two chunks] T_new = (t1_old + t2_old + t3_old + t4_old) / 4. 07-8
9 For certain stencils: avoid dependencies by alternating updates of red and black elements. Activities operate in lockstep (all activities update red, then all activities update black, ...). 07-9
10 Computations that proceed in lockstep are best described following the SPMD (Single Program, Multiple Data) style:

    Clock clk = Clock.make();
    for (c in chunks) {
      // each chunk is processed by a separate activity
      async clocked (clk) {
        <update red in chunk c>
        clk.next();
        <update black in chunk c>
      }
    }
    clk.drop(); // spawning activity unregisters after creating the clocked activities

The same (single) program is run by different activities on multiple chunks of data. 07-10
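For illustration, one way the <update red in chunk c> placeholder could be realized for the four-point stencil from slide 5 — a sketch assuming each chunk covers grid rows lo..hi, with NCOLS as above:

    // Update all "red" cells (r+c even) of rows lo..hi in place; reading only
    // "black" neighbors makes the in-place update safe.
    // <update black in chunk c> is the same loop with odd parity.
    def updateRed(t: Array[Double], lo: Int, hi: Int) {
        for ([r,c] in (lo..hi) * (1..(NCOLS-2)))
            if ((r + c) % 2 == 0)
                t(r,c) = (t(r-1,c) + t(r+1,c) + t(r,c-1) + t(r,c+1)) / 4.0;
    }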
11 Consequences The data decomposition (which chunk is processed by which activity) must be chosen wisely, according to the caching behavior and in-memory layout of the data structures. See the patterns in Category (2). 07-11
12 Which decomposition is preferable?

    for (r in rows) async {
      for (c in columns)
        <update array at (r,c)>
    }

    for (c in columns) async {
      for (r in rows)
        <update array at (r,c)>
    }

07-12
13 ... it depends on the memory layout. A cache line holds variables at consecutive addresses.
- Row major: offset = row*ncols + col (programming languages: C, X10)
- Column major: offset = col*nrows + row (programming languages: Fortran, MATLAB)
07-13
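The two formulas, spelled out as code over a flat rank-1 array (a hedged sketch; flat and the helper names are illustrative):

    // Row major (C, X10): elements of one row are adjacent in memory.
    def atRowMajor(flat: Array[Double], r: Int, c: Int, ncols: Int): Double {
        return flat(r*ncols + c);
    }
    // Column major (Fortran, MATLAB): elements of one column are adjacent.
    def atColMajor(flat: Array[Double], r: Int, c: Int, nrows: Int): Double {
        return flat(c*nrows + r);
    }

Traversing along the inner dimension of the layout visits consecutive addresses, and hence few cache lines; that is exactly what the two decompositions on the previous slide differ in.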
14 Sequential traversal in X10 (row major):

    val region = (0..NROWS) * (0..NCOLS);
    val arr = new Array[Double](region, 0.0);

    for ([r,c] in region)  // r, c: Int
      <update arr(r,c)>

    for (p in region)      // p: Point{rank==2}
      <update arr(p)>

07-14
15 Parallel traversal in X10 (row major):

    val region = (0..NROWS) * (0..NCOLS);
    val arr = new Array[Double](region, 0.0);

    for ([r] in region.projection(0)) async
      for ([c] in region.projection(1))
        <update arr(r,c)>

07-15
16 Consequences (cont.) Explicit exchange of data at synchronization points:
- may require explicit communication on platforms without shared memory (e.g. MPI)
- is a copy operation in shared memory (from real cells to ghost cells)
07-16
17 Abstract model: mesh. Implementation: array with ghost cells. [Figure: local memories of the blue and red activity; at the boundary, each activity copies from its real cells to the other's ghost cells] 07-17
18 Why ghost cells?
- To facilitate data independence on shared memory machines.
- To aggregate communication in distributed memory systems and enable computations on local memory.
07-18
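A minimal sketch of the shared-memory copy operation, assuming a 1D decomposition where each chunk of width B stores one ghost cell on each side (indices 0 and B+1); all names are illustrative:

    // Copy the neighbors' boundary ("real") cells into this chunk's ghost cells.
    def updateGhosts(chunks: Array[Array[Double]], me: Int, nchunks: Int, B: Int) {
        if (me > 0)
            chunks(me)(0) = chunks(me-1)(B);    // left neighbor's last real cell
        if (me < nchunks - 1)
            chunks(me)(B+1) = chunks(me+1)(1);  // right neighbor's first real cell
    }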
19 Consequences (cont.) Activities operate in lockstep; performance depends on the (dynamic) frequency of synchronization. Frequent synchronization is a sign of frequent dependences and hence little parallelism. 07-19
20 Lockstep computation. Data exchange (ghost cell update) at synchronization points.

    Clock clk = Clock.make();
    for (c in chunks) {
      async clocked (clk) {
        <initialize ghost cells>
        clk.next();
        while (!done) {
          <read local data and ghost cells, update local data>
          clk.next();
          <update ghost cells>
          clk.next();
        }
      }
    }
    clk.drop(); // spawning activity unregisters from the clock

07-20
21 Lockstep computation: [Diagram: two activities side by side, each executing <init GC>, then repeatedly <local stencil computation> and <update GC>, synchronizing at each clk.next()] 07-21
22 Further examples Dense linear algebra computations, e.g. solvers for systems of linear equations (LINPACK, the measure of floating-point performance for the TOP 500 supercomputer list) and matrix multiplication. 07-22
23 Matrix multiplication C = A · B, with A of size n × m, B of size m × k, and C of size n × k. Naive parallel algorithm:
- Each element c(i,j) is computed by an activity.
- The activity reads row i of A and column j of B.
07-23
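A minimal sketch of this naive algorithm in the X10 style used above; finish/async is assumed available as in standard X10 (the slides themselves use async and clocks), and all names are illustrative:

    // One activity per element c(i,j); each reads row i of A and column j of B.
    def matmul(a: Array[Double], b: Array[Double], c: Array[Double],
               n: Int, m: Int, k: Int) {
        finish for ([i,j] in (0..(n-1)) * (0..(k-1))) async {
            var sum: Double = 0.0;
            for ([l] in 0..(m-1))
                sum += a(i,l) * b(l,j);
            c(i,j) = sum;
        }
    }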
24 [Figure: in row-major layout, a row of A lies on one cache line, while a column of B spans several (3 in the figure)] Challenge:
- Computing c(i,j) requires 2m read operations.
- These 2m variables typically fall on many different cache lines (row major).
- Reading a line from memory into the cache incurs a significant delay (cache miss).
07-24
25 Further examples (cont.) Finite element methods (structured grids), e.g. simulation of electromagnetism, fluid dynamics, heat transfer (PDE solvers). [Figure: simulation of airflow and temperature in a data center rack with different component layouts. Source: [2]] 07-25
26 Image processing, e.g. Gaussian image blurring: a per-pixel stencil computation; the value of a pixel is the weighted average of neighboring pixels. [Figure: original image vs. blurs with 5px and 20px radius] 07-26
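A sketch of one such per-pixel stencil pass, assuming a 3×3 weight matrix w (illustrative; a real Gaussian kernel is larger and normalized so the weights sum to 1) and image dimensions H × W:

    // Each output pixel is the weighted average of its 3x3 neighborhood.
    def blur(img: Array[Double], out: Array[Double], w: Array[Double]) {
        finish for ([r,c] in (1..(H-2)) * (1..(W-2))) async {
            var acc: Double = 0.0;
            for ([dr,dc] in (-1..1) * (-1..1))
                acc += w(dr+1, dc+1) * img(r+dr, c+dc);
            out(r,c) = acc;
        }
    }

In practice one would assign a block of pixels, not a single pixel, to each activity, per the geometric decomposition pattern.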
27 Limits on parallelism? Conceptually: most data-parallel algorithms are embarrassingly parallel:
- no dependency among tasks (e.g. matrix multiplication)
- no or few synchronization points
- lots of / arbitrary parallelism
- perfect scaling
In practice: limitations due to the implementation and the physics of the machine... Source: [3] 07-27
28 Scaling of data parallel problems Strong scaling: fix the overall problem/data size and vary the number of computational resources. Do additional computational resources shorten the solution of a fixed-size problem? Sometimes called scale-up. Very challenging: a decreasing amount of computation per activity, less opportunity for data reuse within an activity (caching), and a need for very efficient coordination and sharing between computational resources. Source: [3] 07-28
29 Scaling of data parallel problems Weak scaling: fix the problem size per computational unit and vary the number of computational units. Can a larger problem be computed in the same time with additional computational resources? Sometimes called scale-out. Less challenging than strong scaling. Examples: cluster computing, blade centers, many Google applications. Source: [3] 07-29
30 (1) Parallel algorithm structure design space
- Organization by Data: (1.1) Data Parallelism (Geometric Decomposition), (1.2) Recursive Data
- Organization by Tasks: (1.3) Task Parallelism, (1.4) Divide and Conquer
- Organization by Data Flow: (1.5) Pipeline
07-30
31 (1.2) Recursive data Context: like (1.1), but the data structure is recursive: lists, trees, graphs. Operations are sometimes recursive, sometimes seem inherently sequential. 07-31
32 Example: Reduction Data structure: List of values 3, 5, 17, 3, 6, 8, 12, 13 Operation: compute sum of values 07-32
33 Sequential algorithm: (((((((3+5)+17)+3)+6)+8)+12)+13) [Figure: linear chain of additions over time] 07-33
34 Sequential algorithm

    def sum(arr: Array[Int]{rank==1}): Int {
      var sum: Int = 0;
      for (i in arr)
        sum += arr(i);
      return sum;
    }

07-34
35 Parallel algorithm: pair-wise summation ((3+5)+(17+3))+((6+8)+(12+13)) [Figure: balanced addition tree over time, with root value 67] This just changes the evaluation order of the sequential program: a simple change of schedule enables/increases parallelism. 07-35
36 Parallel algorithm

    def sum(arr: Array[Int]{rank==1}): Int {
      return pairwise(arr, arr.region.min(0), arr.region.max(0));
    }

    def pairwise(arr: Array[Int]{rank==1}, lo: Int, hi: Int): Int {
      if (lo == hi)
        return arr(lo);
      else {
        val lsum = Future.make(() => pairwise(arr, lo, lo + (hi-lo)/2));
        val rsum = Future.make(() => pairwise(arr, lo + (hi-lo)/2 + 1, hi));
        return lsum.force() + rsum.force();
      }
    }

07-36
37 Semantics of the X10 future

    S1;
    val v1: Future[T] = Future.make(E1);
    S2;
    val v2: T = v1.force();

A feasible execution: 1) spawn the asynchronous evaluation of expression E1; 2) force the future and claim the result. [Diagram: happens-before order among s1, e1, s2, and v2 = force] 07-37
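A minimal usage example of this schema, built only from the Future operations shown above (expensive() and doOtherWork() are illustrative placeholders):

    val v1 = Future.make(() => expensive()); // 1) spawn async evaluation of expensive()
    doOtherWork();                           // runs concurrently with expensive()
    val v2 = v1.force();                     // 2) block and claim the result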
38 Parallel algorithm

    import x10.util.concurrent.Future;

    def sum(arr: Array[Int]{rank==1}): Int {
      return pairwise(arr, arr.region.min(0), arr.region.max(0));
    }

    def pairwise(arr: Array[Int]{rank==1}, lo: Int, hi: Int): Int {
      if (lo == hi)
        return arr(lo);
      else {
        // concurrent recursive descent
        val lsum = Future.make(() => pairwise(arr, lo, lo + (hi-lo)/2));
        val rsum = Future.make(() => pairwise(arr, lo + (hi-lo)/2 + 1, hi));
        // block until results are available
        return lsum.force() + rsum.force();
      }
    }

07-38
39 The algorithm follows the divide and conquer pattern:
- always possible for recursive operations
- a natural opportunity for parallelization
07-39
40 Consequences The problem and its solution must be cast into a recursive form. This sometimes incurs additional cost that must be traded off against the performance improvement due to parallelization. In the example: additional variables for temporary results, a recursive caller chain. A recursive formulation may not be intuitive to read. 07-40
41 The amount of computation in the recursive descent must be significant to offset the cost of communication and synchronization. Example: the sequential sum may be faster for arrays smaller than some threshold size (1000 elements in the code on the next slide); for larger arrays, take the recursive, parallel algorithm. 07-41
42 Parallel algorithm

    import x10.util.concurrent.Future;

    val THRESHOLD = 1000;

    def par_sum(arr: Array[Int]{rank==1}): Int {
      return pairwise(arr, arr.region.min(0), arr.region.max(0));
    }

    def pairwise(arr: Array[Int]{rank==1}, lo: Int, hi: Int): Int {
      if (hi-lo < THRESHOLD)
        return seq_sum(arr, lo, hi);
      else {
        val lsum = Future.make(() => pairwise(arr, lo, lo + (hi-lo)/2));
        val rsum = Future.make(() => pairwise(arr, lo + (hi-lo)/2 + 1, hi));
        return lsum.force() + rsum.force();
      }
    }

    def seq_sum(arr: Array[Int]{rank==1}, lo: Int, hi: Int): Int {
      var sum: Int = 0;
      for ((i): Point{rank==1} in [lo..hi])
        sum += arr(i);
      return sum;
    }

07-42
43 Another example: prefix sum (= scan) of a list. Data structure: list of values 3, 5, 17, 3, 6, 8, 12, 13. Operation: compute the partial sum of the first up to the current value in the list: 3, 8, 25, 28, 34, 42, 54, 67. 07-43
44 Sequential algorithm

    def prefix_sum(arr: Array[Int]{rank==1},
                   res: Array[Int]{rank==1 && self.region == arr.region}) {
      for ((i): Point{rank==1} in arr) {
        if (i == 0) res(i) = arr(i);
        else res(i) = res(i-1) + arr(i);
      }
    }

07-44
45 Prefix sum is more difficult to parallelize than sum, because all values res(i), i < k, of the sequential solution are required to compute res(k). 07-45
46-49 Parallel prefix sum [Diagrams over four slides: step-wise parallel scan of the eight values. Each node is labeled i:j and holds the sum of elements i through j; legend: copy, add, complete, temporary. Step 0 copies the input (nodes 0:0, 1:1, ..., 7:7); each following step adds pairs at doubling distance (producing 0:1, 1:2, ..., 6:7, then 0:2, 0:3, 1:4, ..., 4:7, then 0:4, ..., 0:7), until every node 0:k holds a prefix sum, e.g. 0:3 = 28 and 0:7 = 67] 07-46 to 07-49
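The diagrams show the classic step-wise scan: at distance d = 1, 2, 4, ..., every element adds in the value d positions to its left. A minimal sketch of this schedule in the X10 style used above, assuming indices 0..n-1, finish/async as in standard X10, and a temporary array for the "temporary" column of the diagrams (all names illustrative):

    def par_prefix_sum(arr: Array[Int]{rank==1}, res: Array[Int]{rank==1}, n: Int) {
      for ([i] in 0..(n-1)) res(i) = arr(i);        // copy step
      val tmp = new Array[Int](0..(n-1), 0);
      var d: Int = 1;
      while (d < n) {
        finish for ([i] in 0..(n-1)) async {        // one lockstep round
          if (i >= d) tmp(i) = res(i) + res(i-d);   // add: combine with element d to the left
          else        tmp(i) = res(i);              // prefix already complete
        }
        finish for ([i] in 0..(n-1)) async res(i) = tmp(i);
        d *= 2;
      }
    }

This schedule performs O(n log n) additions in log2(n) rounds, versus n-1 additions sequentially; the gain is that all additions within one round are independent.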
50 Sources
[1] Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill: Patterns for Parallel Programming. Addison-Wesley.
[2] Future Facilities:
[3] Maged Michael, Jose Moreira, Doron Shiloach, Robert Wisniewski: "Scale-up x Scale-out: A Case Study using Nutch/Lucene". Parallel and Distributed Processing Symposium (IPDPS), 2007.
07-50
51 This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License. You are free: to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work), under the following conditions: Attribution: you must attribute the work to The Art of Multiprocessor Programming (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike: if you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work; the best way to do this is with a link to the license. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights. 07-51