Patterns for Parallel Programming II



1 Lecture 4: Patterns for Parallel Programming II. John Cavazos, Dept of Computer & Information Sciences, University of Delaware

2 Task Decomposition: Also known as functional decomposition. Some programs decompose naturally into tasks. Divide the tasks among cores. Decide which data each core accesses. Example: event handler for a GUI.

3 Task Decomposition: [Figure: the functions f(), g(), h(), q(), r(), s()]

4-8 Task Decomposition: [Figure, five animation frames: the functions f(), g(), h(), q(), r(), s() are distributed across CPU 0, CPU 1, and CPU 2]

9 Task Decomposition Forces: Flexibility (in the number and size of tasks; not architecture-specific). Efficiency (tasks large enough and as independent as possible). Simplicity (easy to understand and debug).

10 Pipeline Decomposition: A special kind of task decomposition. Data flows through a sequence of tasks (assembly-line parallelism). Example: compression, with stages Read Block, Compress, Write Block.

11 Pipeline Decomposition: Read uncompressed block (Step 1). [Figure: stages Read Block, Compress, Write Block]

12 Pipeline Decomposition: Compress block (Step 2). [Figure: stages Read Block, Compress, Write Block]

13 Pipeline Decomposition: Write compressed block (Step 3). [Figure: stages Read Block, Compress, Write Block]

14-20 Pipeline Decomposition: Processing five data sets (Steps 1 to 7). [Figure, seven animation frames: data sets 0 to 4 advance through the pipeline stages on CPU 0, CPU 1, and CPU 2]
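The seven frames above are consistent with a three-stage pipeline (one stage per CPU) processing five data sets, since filling and draining the pipe takes 5 + 3 - 1 = 7 steps. A minimal sketch of that schedule in C follows. The function names are hypothetical, and the sketch assumes that stage k runs on CPU k and that every stage takes exactly one step per data set; the slides do not state either.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: which data set does stage `stage` (0-based) work on
 * at step `step` (0-based)? Returns -1 when the stage is idle.
 * Assumes every stage takes exactly one step per data set. */
int dataset_at(int step, int stage, int num_datasets)
{
    int d = step - stage;   /* data set d enters stage 0 at step d */
    return (d >= 0 && d < num_datasets) ? d : -1;
}

/* Total number of steps needed to push num_datasets items
 * through num_stages stages, including fill and drain. */
int pipeline_steps(int num_datasets, int num_stages)
{
    assert(num_datasets > 0 && num_stages > 0);
    return num_datasets + num_stages - 1;
}

/* Print the schedule: one row per step, one column per CPU (stage). */
void print_schedule(int num_datasets, int num_stages)
{
    int steps = pipeline_steps(num_datasets, num_stages);
    for (int t = 0; t < steps; t++) {
        printf("Step %d:", t + 1);
        for (int s = 0; s < num_stages; s++) {
            int d = dataset_at(t, s, num_datasets);
            if (d >= 0)
                printf("  CPU %d: data set %d", s, d);
            else
                printf("  CPU %d: idle", s);
        }
        printf("\n");
    }
}
```

Calling `print_schedule(5, 3)` prints seven rows. All three CPUs are busy only in steps 3 to 5; the remaining steps are fill and drain.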

21 Pipeline Decomposition Forces: Flexibility (deeper pipelines are better). Efficiency (no stage of the pipeline should become a bottleneck). Simplicity (manageable chunks of code).

22 Dependency Analysis: Control and data dependences. Dependence graph: Graph = (nodes, edges). Data dependency graph (nodes = variables). Control flow graph (nodes = basic blocks). Call graph (nodes = functions). An edge indicates a possible control or data dependency.

23-24 Dependency Analysis: for (i = 0; i < 3; i++) a[i] = b[i] / 2.0; [Figure: dependence graph; a[0] is computed from b[0] and 2, a[1] from b[1] and 2, a[2] from b[2] and 2] Domain decomposition possible.
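Because each iteration of this loop reads only b[i] and writes only a[i], the iterations can be divided among cores in any way. A minimal sketch in C, assuming an OpenMP-capable compiler (the slides do not name a programming model, and the function name `halve` is hypothetical):

```c
#include <assert.h>

/* Each iteration reads only b[i] and writes only a[i], so no iteration
 * depends on another. The pragma takes effect when OpenMP is enabled
 * (for example with -fopenmp); otherwise it is ignored and the loop
 * runs sequentially with the same result. */
void halve(double *a, const double *b, int n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] / 2.0;
}
```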

25-26 Dependency Analysis: for (i = 1; i < 4; i++) a[i] = a[i-1] * b[i]; [Figure: dependence graph; a[1] is computed from a[0] and b[1], a[2] from a[1] and b[2], a[3] from a[2] and b[3]] No domain decomposition.
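The chain a[0], a[1], a[2], a[3] is a loop-carried dependence: iteration i needs the value written by iteration i-1. One way to see why the iterations cannot simply be handed to different cores is to run them in a different order. In the sketch below the reverse loop stands in for a schedule in which a later iteration runs before the one it depends on; the function names are hypothetical.

```c
#include <assert.h>

/* Forward order: the order the loop on the slide specifies. */
void scan_forward(double *a, const double *b, int n)
{
    assert(n >= 1);
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] * b[i];
}

/* Reverse order: each iteration reads a[i-1] before the iteration that
 * should have produced it has run, so it sees a stale value. */
void scan_reverse(double *a, const double *b, int n)
{
    assert(n >= 1);
    for (int i = n - 1; i >= 1; i--)
        a[i] = a[i - 1] * b[i];
}
```

Starting from a = {1, 0, 0, 0} and b = {0, 2, 3, 4}, the forward loop yields a = {1, 2, 6, 24}, while the reverse loop yields a = {1, 2, 0, 0}.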

27-28 Dependency Analysis: a = f(x, y, z); b = g(w, x); t = a + b; c = h(z); s = t / c; [Figure: dependence graph; inputs w, x, y, z feed g, f, and h, which produce b, a, and c; t is computed from a and b; s is computed from t and c] Task decomposition with 3 CPUs.
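In this graph the calls to f, g, and h depend only on the inputs, so they can run as three concurrent tasks; t and s must wait for their results. A minimal sketch using OpenMP sections follows. OpenMP is an assumption here, and the bodies of f, g, and h are placeholders, since the slides do not define them.

```c
#include <assert.h>

/* Placeholder bodies: the slides do not define f, g, or h. */
static double f(double x, double y, double z) { return x + y + z; }
static double g(double w, double x)           { return w * x; }
static double h(double z)                     { return z + 1.0; }

double compute_s(double w, double x, double y, double z)
{
    double a = 0.0, b = 0.0, c = 0.0;

    /* f, g, and h read only the inputs and write distinct variables,
     * so the three calls are independent tasks. With OpenMP enabled,
     * each section may run on its own thread; without it, the pragmas
     * are ignored and the calls run one after another. */
    #pragma omp parallel sections
    {
        #pragma omp section
        a = f(x, y, z);
        #pragma omp section
        b = g(w, x);
        #pragma omp section
        c = h(z);
    }
    /* The parallel region ends with a barrier, so a, b, and c are ready. */
    assert(c != 0.0);
    double t = a + b;
    return t / c;
}
```

The tail of the computation (t, then s) stays sequential because each step consumes the result of the previous one.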

29 Evaluate Design: Is the design good enough? YES: move to the next phase. NO: re-evaluate previous patterns. Forces: Suitability to target platform (should not depend on the underlying architecture). Design quality (trade-offs between simplicity, flexibility, and efficiency). Preparation for the next phase (Algorithm Structure).

30 Algorithm Structure Patterns: Given a set of concurrent tasks, what's next? Important questions based on the target platform: How many cores will your algorithm support? How expensive is sharing? Is the design constrained by the hardware? Does the algorithm map well to the programming environment?

31 Organizing Concurrency: Major organizing principle implied by the concurrency. Organization by task: Task Parallelism, Divide and Conquer. Organization by data: Geometric Decomposition, Recursive Data. Organization by flow of data: Pipeline, Event-Based Coordination.

32 Organize by Tasks: [Figure: decision tree] Linear: Task Parallelism. Recursive: Divide and Conquer.

33 Task Parallelism: Tasks are linear (no structure or hierarchy). They can be completely independent (embarrassingly parallel) or can have some dependencies.

34 Task Parallelism (cont'd): Common factors: tasks are associated with loop iterations; all tasks are known at the beginning; all tasks must complete. However, there are exceptions to all of these.


35 Task Parallelism (Examples): Ray tracing: each ray is separate and independent. Molecular dynamics: vibrational, rotational, and nonbonded forces are independent for each atom. Branch-and-bound computations: repeatedly divide into smaller solution spaces until a solution is found; tasks are weakly dependent through a queue.

36 Task Parallelism, Three Key Elements: Task definition: is it adequate (number of tasks and their computation)? Schedule: load balancing. Dependencies: removable, or separable by replication.
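A common case of a dependency that is "separable by replication" is a shared accumulator: if every task adds into one variable, each addition depends on the previous one. Giving each task its own copy and combining the copies at the end removes the dependency. The sketch below is one possible illustration, not taken from the slides; `NUM_TASKS` and `sum_replicated` are hypothetical names, and OpenMP is assumed.

```c
#include <assert.h>

#define NUM_TASKS 4   /* illustrative task count, not from the slides */

/* Sum n values using one replicated accumulator per task. */
long sum_replicated(const int *v, int n)
{
    assert(n >= 0);
    long partial[NUM_TASKS] = {0};

    #pragma omp parallel for
    for (int t = 0; t < NUM_TASKS; t++) {
        /* Task t owns the contiguous chunk [lo, hi). */
        int lo = (int)((long)n * t / NUM_TASKS);
        int hi = (int)((long)n * (t + 1) / NUM_TASKS);
        long local = 0;               /* this task's private copy */
        for (int i = lo; i < hi; i++)
            local += v[i];
        partial[t] = local;           /* each task writes only its own slot */
    }

    long total = 0;
    for (int t = 0; t < NUM_TASKS; t++)   /* sequential combine step */
        total += partial[t];
    return total;
}
```

OpenMP's `reduction(+:total)` clause expresses the same idea more compactly; the explicit version is shown to make the replicated copies visible.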

37 Schedule: Not all schedules of tasks are equal in performance. Slide source: Introduction to Parallel Computing, Grama et al.

38 Divide and Conquer: Recursive program structure. Subproblems may not be uniform. Slide source: Dr. Rabbah, IBM, MIT Course IAP 2007
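The recursive structure can be sketched with a divide-and-conquer array sum in which the two halves are solved as independent tasks. This is an illustrative example rather than one from the slides; OpenMP tasks are assumed, and `CUTOFF`, `sum_range`, and `dc_sum` are hypothetical names.

```c
#include <assert.h>

#define CUTOFF 4   /* illustrative size below which we stop splitting */

/* Recursively sum v[lo..hi). The two halves are independent subproblems. */
static long sum_range(const int *v, int lo, int hi)
{
    if (hi - lo <= CUTOFF) {          /* base case: solve directly */
        long s = 0;
        for (int i = lo; i < hi; i++)
            s += v[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;     /* divide */
    long left = 0, right = 0;

    #pragma omp task shared(left)
    left = sum_range(v, lo, mid);
    #pragma omp task shared(right)
    right = sum_range(v, mid, hi);
    #pragma omp taskwait              /* wait for both halves */

    return left + right;              /* combine */
}

long dc_sum(const int *v, int n)
{
    assert(n >= 0);
    long result = 0;
    /* One thread starts the recursion; the others pick up spawned tasks. */
    #pragma omp parallel
    #pragma omp single
    result = sum_range(v, 0, n);
    return result;
}
```

The cutoff keeps tasks "large enough", in the sense of the efficiency force mentioned earlier: below it, the overhead of creating a task would likely outweigh the work it does.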

39 Organize by Data: [Figure: decision tree] Organize by data decomposition. Linear: Geometric Decomposition. Recursive: Recursive Data.

40 Organize by Data: Operations on a core data structure. Geometric Decomposition; Recursive Data.

41 Geometric Decomposition: Arrays and other linear structures. Example: matrix multiply.
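For matrix multiply, one simple geometric decomposition splits the output matrix by rows: each row of C is computed from one row of A and all of B, and no two rows of C overlap. The sketch below assumes square matrices stored in row-major order and an OpenMP-capable compiler; the slides do not specify a layout or a decomposition.

```c
#include <assert.h>

/* C = A * B for n x n matrices stored in row-major order.
 * The output is decomposed by rows: iteration i writes only row i of C,
 * so the rows can be assigned to different cores. */
void matmul(const double *A, const double *B, double *C, int n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

Splitting into two-dimensional blocks instead of rows is another common choice; it changes how much of A and B each core has to read.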

42 Recursive Data: Lists, trees, and graphs. It may seem that such a data structure can only be traversed sequentially, but there are ways to expose concurrency.

43 Recursive Data Example, Find the Root: Given a forest of directed trees, find the root of the tree containing each node. Parallel approach: for each node, find its successor's successor; repeat until no changes. O(log n) vs O(n). Slide source: Dr. Rabbah, IBM, MIT Course IAP 2007
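The "successor's successor" step is often called pointer jumping. A minimal sketch follows, assuming the forest is stored as an array in which `succ[i]` is node i's parent and a root points to itself; that representation and the name `find_roots` are assumptions, not from the slides. Each round reads the old array and writes a new one, so all nodes can be updated at once.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* On entry succ[i] is node i's parent (a root points to itself).
 * On return succ[i] is the root of the tree containing node i.
 * Returns the number of rounds that changed something, or -1 if
 * memory could not be allocated. */
int find_roots(int *succ, int n)
{
    assert(n >= 0);
    if (n == 0)
        return 0;
    int *next = malloc((size_t)n * sizeof *next);
    if (next == NULL)
        return -1;

    int rounds = 0;
    int changed = 1;
    while (changed) {
        changed = 0;
        /* All nodes jump at once: reads come from succ, writes go to next,
         * so the iterations of this loop are independent. */
        #pragma omp parallel for reduction(|:changed)
        for (int i = 0; i < n; i++) {
            next[i] = succ[succ[i]];
            if (next[i] != succ[i])
                changed = 1;
        }
        memcpy(succ, next, (size_t)n * sizeof *succ);
        if (changed)
            rounds++;
    }
    free(next);
    return rounds;
}
```

Each round doubles the distance a node's pointer covers, so a chain of depth d needs about log2(d) rounds rather than d sequential steps. The O(log n) figure counts rounds with enough cores to update every node at once; the total work is larger than in the sequential traversal.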

44 Organize by Flow of Data: [Figure: decision tree] Regular: Pipeline. Irregular: Event-Based Coordination.

45 Organize by Flow of Data: Computation can be viewed as a flow of data going through a sequence of stages. Pipeline: one-way, predictable communication. Event-Based Coordination: unrestricted, unpredictable communication.

46 Pipeline Performance: Concurrency is limited by pipeline depth. Stages should be equally computationally intensive. The time to fill and drain the pipe should be small.
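These three points can be made concrete with a rough estimate. The symbols are assumptions introduced here, not from the slides: a pipeline of s stages, n data items, and, in the ideal case, every stage taking the same time \tau per item.

```latex
\begin{align*}
T_{\text{seq}}  &= n \, s \, \tau \\
T_{\text{pipe}} &= \underbrace{(s - 1)\,\tau}_{\text{fill and drain}} + \, n\,\tau = (n + s - 1)\,\tau \\
\text{speedup}  &= \frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{n\,s}{n + s - 1}
                 \;\longrightarrow\; s \quad \text{as } n \to \infty
\end{align*}
```

The speedup can never exceed the depth s, and the fill-and-drain term only becomes negligible when n is much larger than s. For s = 3 and n = 5 the estimate gives 15/7, or about 2.1. If the stages are not equal, with times \tau_1, ..., \tau_s, the slowest stage sets the rate: T_pipe is approximately \sum_i \tau_i + (n - 1) \max_i \tau_i.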

47 Read for Next Time: "Reengineering for Parallelism: An Entry Point for PLPP (Pattern Language for Parallel Programming) for Legacy Applications". ParallelPatterns/plop2005.pdf
