Parallel Programming



1 Parallel Programming, 9. Pipeline Parallelism. Christoph von Praun

2 (1) Parallel algorithm structure design space
  Organization by Data: (1.1) Geometric Decomposition, (1.2) Recursive Data
  Organization by Tasks: (1.3) Task Parallelism, (1.4) Recursive Splitting
  Organization by Data Flow: (1.5) Pipeline

3 (1.5) Pipeline. Context:
  Data-intensive application with a large number of inputs and outputs.
  The computation on each input can be broken into several phases.
  On each input, the phases have to be computed in order.
  Computations on different inputs are independent.

4 [Figure: a two-stage pipeline. Inputs data1, data2, data3 pass through the phase-1 kernel and then, as xdata1, xdata2, xdata3, through the phase-2 kernel to the output.]

5 Example: OpenGL imaging pipeline (simplified)
  [Figure: data items (vector graphic, transformed vector graphic, vertices with attributes, raster image, frame buffer) connected by kernels (evaluator-kernel: move vertices; per-vertex operations; rasterization; per-fragment ops, i.e. pixel shading).]

6 Forces
  Limited memory bandwidth: avoid frequent data transfer between memory and processor.
  Limited memory capacity: input can be very large (streams) and cannot be held in memory; reduce temporary data.

7 Processing strategies: data parallelism (4 inputs, 2 kernels, 2 UEs, i.e. units of execution)
  1. UE-1+2 compute kernel-1 (in-1, in-2 -> tmp-a, tmp-b)
  2. UE-1+2 compute kernel-1 (in-3, in-4 -> tmp-c, tmp-d)
  3. UE-1+2 compute kernel-2 (tmp-a, tmp-b -> out-1, out-2)
  4. UE-1+2 compute kernel-2 (tmp-c, tmp-d -> out-3, out-4)

8 Processing strategies: data parallelism (4 inputs, 2 kernels, 2 UEs), cost of the schedule above
  memory capacity: 4 x temporary data
  memory bandwidth: 8 x read, 8 x write
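The cost figures for the data-parallel schedule can be checked with a small tally. The following is only a sketch: kernel_1 and kernel_2 are hypothetical stand-ins for the two kernels, the UEs are modeled by processing `width` items per step rather than by real threads, and every item transfer is counted as one read or one write.

```python
# Hypothetical kernels standing in for kernel-1 and kernel-2.
def kernel_1(x):
    return x + 1

def kernel_2(x):
    return x * 2

def data_parallel(inputs, width=2):
    """Run kernel-1 on all inputs, then kernel-2 on all temporaries.

    `width` models the number of UEs working side by side in one step.
    Returns the outputs and a tally of memory traffic.
    """
    stats = {"reads": 0, "writes": 0, "peak_tmp": 0}
    tmp = []
    # Steps 1-2: kernel-1 over the inputs, `width` items per step.
    for i in range(0, len(inputs), width):
        for x in inputs[i:i + width]:
            stats["reads"] += 1      # read one input
            tmp.append(kernel_1(x))
            stats["writes"] += 1     # write one temporary
        stats["peak_tmp"] = max(stats["peak_tmp"], len(tmp))
    out = []
    # Steps 3-4: kernel-2 over the temporaries, which all stay alive until here.
    for i in range(0, len(tmp), width):
        for t in tmp[i:i + width]:
            stats["reads"] += 1      # read one temporary
            out.append(kernel_2(t))
            stats["writes"] += 1     # write one output
    return out, stats

outputs, stats = data_parallel([1, 2, 3, 4])
print(outputs, stats)
```

For four inputs the tally reports 8 reads, 8 writes and 4 live temporaries, matching the figures on the slide.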

9 Processing strategies: pipeline parallelism (4 inputs, 2 kernels, 2 UEs)
  1. UE-1 computes kernel-1 (in-1 -> tmp-a); UE-2 is idle
  2. UE-1 computes kernel-1 (in-2 -> tmp-b); UE-2 computes kernel-2 (tmp-a -> out-1)
  3. UE-1 computes kernel-1 (in-3 -> tmp-a); UE-2 computes kernel-2 (tmp-b -> out-2)
  4. UE-1 computes kernel-1 (in-4 -> tmp-b); UE-2 computes kernel-2 (tmp-a -> out-3)
  5. UE-1 is idle; UE-2 computes kernel-2 (tmp-b -> out-4)

10 Processing strategies: pipeline parallelism (4 inputs, 2 kernels, 2 UEs), cost of the schedule above
  memory capacity: 2 x temporary data
  memory bandwidth: 8 x read, 8 x write
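The pipelined schedule can be replayed step by step to confirm these costs. This is a sketch under stated assumptions: kernel_1 and kernel_2 are hypothetical stand-ins, the two UEs are simulated inside one loop iteration instead of running as threads, and the two alternating buffers play the role of tmp-a and tmp-b.

```python
# Hypothetical kernels standing in for kernel-1 and kernel-2.
def kernel_1(x):
    return x + 1

def kernel_2(x):
    return x * 2

def pipelined(inputs):
    """Replay the two-stage pipeline schedule with two alternating buffers."""
    stats = {"steps": 0, "reads": 0, "writes": 0, "peak_tmp": 0}
    buffers = [None, None]            # tmp-a and tmp-b
    out = []
    n = len(inputs)
    for step in range(n + 1):         # one extra step drains the pipeline
        in_use = 0
        if step < n:                  # UE-1: kernel-1 fills one buffer
            stats["reads"] += 1
            buffers[step % 2] = kernel_1(inputs[step])
            stats["writes"] += 1
            in_use += 1
        if step >= 1:                 # UE-2: kernel-2 drains the other buffer
            stats["reads"] += 1
            out.append(kernel_2(buffers[(step - 1) % 2]))
            stats["writes"] += 1
            in_use += 1
        stats["steps"] += 1
        stats["peak_tmp"] = max(stats["peak_tmp"], in_use)
    return out, stats

outputs, stats = pipelined([1, 2, 3, 4])
print(outputs, stats)
```

Four inputs take 5 steps, with 8 reads, 8 writes and never more than 2 temporaries in use. In each step UE-1 writes one buffer while UE-2 reads the other, so the two never touch the same slot.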

11 Observation: the computation of an individual kernel can typically be split into smaller independent computations (data parallelism).
  [Figure: the per-vertex operations kernel, taking a transformed vector graphic to vertices with attributes.]

12 Processing strategies: pipeline parallelism (8 half-inputs, 2 kernels, 2 UEs)
  1.1 UE-1 computes kernel-1 (in-1.1 -> minitmp-a); UE-2 is idle
  1.2 UE-1 computes kernel-1 (in-1.2 -> minitmp-b); UE-2 computes kernel-2 (minitmp-a -> out-1.1)
  2.1 UE-1 computes kernel-1 (in-2.1 -> minitmp-a); UE-2 computes kernel-2 (minitmp-b -> out-1.2)
  2.2 UE-1 computes kernel-1 (in-2.2 -> minitmp-b); UE-2 computes kernel-2 (minitmp-a -> out-2.1)
  etc.

13 Processing strategies: pipeline parallelism (4 inputs as 8 half-inputs, 2 kernels, 2 UEs), cost of the schedule above
  memory capacity: buffers between stages can be made very small (mini-tmp); efficient storage, no large temporaries
  memory bandwidth: 2 x read, 2 x write

14 Solution: functional programming!
  Partition the computation into several phases: each phase is described as a function; be explicit about input and output.
  Arrange the computation of the phases in a pipeline: the output of one stage is the input of the next stage.
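A minimal sketch of this solution, assuming one thread per stage and bounded queues as the buffers between stages. The names stage, run_pipeline and buffer_size are illustrative only and not taken from the slides. In CPython, threads running pure Python code do not execute in parallel, so the sketch shows the structure rather than a speedup.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def stage(func, inbox, outbox):
    """Apply `func` to every item from `inbox` and forward the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)   # propagate shutdown to the next stage
            return
        outbox.put(func(item))

def run_pipeline(funcs, inputs, buffer_size=1):
    """Chain `funcs` so that the output of one stage feeds the next."""
    # One bounded queue in front of each stage keeps temporary data small.
    queues = [queue.Queue(maxsize=buffer_size) for _ in funcs]
    queues.append(queue.Queue())  # unbounded queue collecting the results
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)          # blocks while the first buffer is full
    queues[0].put(_DONE)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    return results

# Two hypothetical phases, each a plain function with explicit input and output.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3, 4]))
```

Each stage only sees its inbox and outbox, so a kernel can be replaced without touching the others. The blocking put and get calls provide the synchronization between neighboring stages.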

15 Consequences
  Software engineering aspects: kernels can be implemented in separate modules with explicit interfaces; kernels are pluggable; programming models and languages can specialize in stream processing (parallelism is not exposed to the programmer).
  Simple parallelization.

16 Consequences cont.
  Efficient in memory bandwidth and storage capacity.
  If all kernels were lumped together, memory efficiency could be better; this is not an option in practice (software maintenance, parallelization, etc. become much more difficult).

17 Consequences cont.
  Managing temporary data and communication between pipeline stages: stages are implicitly synchronized.
  Software support: efficient data structures, e.g. lock-free queue (Section 4), double buffering.
  Programming model support: e.g. OpenCL distinguishes private, local, global and constant memory.
  OS support: pipes, shared memory.
  Hardware support: local and global memory on NVIDIA graphics boards.

18 Consequences cont.
  Load balancing: computation in different pipeline stages should take about the same amount of time.
  This is a major issue for performance debugging of complex GPU codes.
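Why balance matters can be illustrated with a small timing model. The sketch below assumes that every item needs the same fixed time in a given stage, that each stage has one UE, and that a stage starts an item as soon as both the item and the stage are free. The function name and the example stage times are hypothetical.

```python
def pipeline_makespan(stage_times, n_items):
    """Finish time of the last item in a linear pipeline.

    stage_times[j] is the time every item spends in stage j.
    """
    finish = [0] * len(stage_times)   # when each stage last became free
    for _ in range(n_items):
        prev = 0                      # when this item left the previous stage
        for j, t in enumerate(stage_times):
            start = max(prev, finish[j])
            finish[j] = start + t
            prev = finish[j]
    return finish[-1]

# Same total work per item (4 time units), split evenly or unevenly.
balanced = pipeline_makespan([2, 2], n_items=4)
unbalanced = pipeline_makespan([1, 3], n_items=4)
print(balanced, unbalanced)
```

Under these assumptions the finish time equals sum(stage_times) + (n_items - 1) * max(stage_times), so the slowest stage sets the throughput: the balanced split finishes four items in 10 time units, the uneven one in 13.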

19 Consequences cont.
  Stream processing is supported by hardware: programmable rendering pipelines of GPUs; kernels are called shaders (geometry shaders, vertex shaders, pixel shaders); 10s to 100s of pipeline stages are possible.
  Programming environments and languages tailored to streaming applications, e.g. StreamIt [3].

20 Sources
  [1] Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill: Patterns for Parallel Programming, Addison Wesley
  [2] OpenGL - The Industry Standard for High Performance Graphics
  [3] William Thies, Michal Karczmarek, Saman Amarasinghe: StreamIt: A Language for Streaming Applications, International Conference on Compiler Construction (CC)

21 This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
  You are free: to Share (to copy, distribute and transmit the work); to Remix (to adapt the work).
  Under the following conditions:
  Attribution. You must attribute the work to The Art of Multiprocessor Programming (but not in any way that suggests that the authors endorse you or your use of the work).
  Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
  For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to ...
  Any of the above conditions can be waived if you get permission from the copyright holder.
  Nothing in this license impairs or restricts the author's moral rights.
