Introduction to Parallel Programming Models

1 Introduction to Parallel Programming Models
Tim Foley, Stanford University
Beyond Programmable Shading

2 Overview
Introduce three kinds of parallelism
  Used in visual computing
  Targeting throughput architectures
Goals
  Establish basic terminology for the course
  Recognize idioms in your workloads
  Evaluate and select tools

3 Scope
Games as representative application
  Demand high performance, visual quality
  Already using multi-core (MC), throughput and heterogeneous HW
  Visibility, illumination, physics, simulation
Not covering every possible approach
  Explicit threads, locks
  Message-passing/actors/CSP
  Transactions/REST

4 What goes into a game frame?

5 This.
Computation graph for Battlefield: Bad Company, provided by DICE

6 A modern game is a mix of
Data-parallel algorithms

7 A modern game is a mix of
Task-parallel algorithms and coordination

8 A modern game is a mix of
Standard and extended graphics pipelines
  Pipeline flow: Input Assembly -> Vertex Shading -> Primitive Setup -> Geometry Shading -> Rasterization -> Pixel Shading -> Output Merging

9 Data-Parallel, Task-Parallel, Pipeline-Parallel

10 Structure of this talk
For each of these approaches
  Key idea
  Mental model
  Applicability
Composition
  How these models combine in the real world

11 Caveats
Turing Tar Pit
  Just being able to express it doesn't make it fast!
Most general model is not always best
  Constraints are what enable optimizations
Not every model requires dedicated tools
  These patterns can be expressed in many languages

12 Data parallelism

13 Key Idea
Run a single kernel over many elements
  Per-element computations are independent
Can exploit throughput architecture well
  Amortize per-element cost with SIMD/SIMT
  Hide memory latency with lightweight threads

14 Mental Model
Execute N independent work items
  aka elements, fragments, strands, threads
All work items run the same program: the kernel
Work item i uses data determined by its index, 0 <= i < N
  [0, N) is the domain of computation
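
The mental model above can be sketched in a few lines of C++. This is only an illustration: `launch` and `add_into` are hypothetical names that do not come from the talk, and the loop runs the work items one after another, so it models the semantics of a launch over [0, N) rather than the parallel execution a throughput runtime would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the talk): invoke `kernel` once for each
// index i in the domain [0, n). The loop only models the semantics; a
// throughput runtime would map the work items onto SIMD lanes and cores.
template <typename Kernel>
void launch(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) {
        kernel(i);  // each work item sees only its own index
    }
}

// Example use: work item i adds a[i] into b[i]. No work item reads what
// another one writes, so the items are independent.
std::vector<int> add_into(const std::vector<int>& a, std::vector<int> b) {
    assert(a.size() == b.size());
    launch(b.size(), [&](std::size_t i) { b[i] += a[i]; });
    return b;
}
```

Because the work items are independent, the order in which `launch` visits the indices does not affect the result, which is what lets a real runtime execute them in any order or all at once.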

15 Domain of computation
Determines number and shape of work items
Often based on input/output data structure
  Not required: domain and data may be decoupled
Many domain shapes possible
  Regular
  Nested
  Irregular

16 Simple Data-Parallelism
Data structure: regular array (A, B)
Kernel program
  void k(int i) { B[i] += A[i]; }
Domain of computation: 1D interval
Computation: k(0) k(1) k(2) k(3) k(4) k(5)

17 Simple Data-Parallelism
Data structure: N-D array (A, B)
Kernel
  void k(int i, int j) { B[i][j] += A[i][j]; }
Domain of computation: N-D interval

18 Shapes need not match
Data structure: N-D array (A), 1D array (B)
Kernel
  void k(int i) { for(int j = 0; j < M; j++) B[i] += A[i][j]; }
Domain of computation: 1D interval (one work item per row of A)
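
A minimal C++ sketch of the row-sum kernel, assuming A is stored as a vector of rows with M columns each. The function name `row_sums` is made up for this example, and the sequential loop stands in for a parallel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The data A is 2D, but the domain of computation is 1D: one work item per
// row. Work item i reduces row i of A into B[i].
std::vector<float> row_sums(const std::vector<std::vector<float>>& a, std::size_t m) {
    std::vector<float> b(a.size(), 0.0f);
    auto k = [&](std::size_t i) {
        assert(a[i].size() == m);  // every row is assumed to have M columns
        for (std::size_t j = 0; j < m; ++j) {
            b[i] += a[i][j];
        }
    };
    // Stand-in for a data-parallel launch over the 1D domain [0, rows).
    for (std::size_t i = 0; i < a.size(); ++i) {
        k(i);
    }
    return b;
}
```

Each work item writes only its own B[i], so the items stay independent even though each one reads a whole row.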

19 Advanced data-parallelism
Hierarchical domains
  Allow work items to communicate
  Useful for sums, scans, sorts
Irregular domains
  Nested or ragged data structures

20 Flat domains
Kernel temporaries / scratch data are
  Private: inaccessible to other work items
  Transient: inaccessible after work item completes
Flat domain exposes work-item locality
  Optimization: put scratch in register file or caches

21 Communication
Need to communicate intermediate results
  Each work item computed a value, now want the sum
Write to main memory, launch a new kernel?
  Doesn't exploit locality or the rest of the memory hierarchy
Employ a hierarchy of domains

22 Hierarchical domains
A domain composed of smaller domains
Each level has its own scratch memory
  Often tied to memory hierarchy
  ex. Registers, L1$, L2$, DRAM
Work item can access
  Kernel parameters
  Own scratch memory
  Scratch memory of ancestors in hierarchy

23 Hierarchical domains
Communicate through parent item scratch
  ex. Each element computes value a
  Add local value into shared sum
Data races are now possible
  Atomic operations
  Synchronization barriers
Also possible for global memory
[Figure: work items each adding their value a into a shared sum]
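
One way to picture the shared-sum example is the two-level C++ sketch below. It rests on assumptions that are not in the talk: each child work item is modelled as a `std::thread`, the parent's scratch is a `std::atomic<long>`, joining the threads plays the role of a barrier, and the groups themselves run one after another. The name `hierarchical_sum` is hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Two-level domain: the input is split into groups of `group_size` work
// items. Each group owns a scratch sum that only its own items can touch.
long hierarchical_sum(const std::vector<int>& values, std::size_t group_size) {
    assert(group_size > 0);
    // Top-level sum. Groups run sequentially in this sketch; the atomic
    // mirrors the case where groups would also run concurrently.
    std::atomic<long> total{0};
    for (std::size_t base = 0; base < values.size(); base += group_size) {
        const std::size_t end = std::min(values.size(), base + group_size);
        std::atomic<long> group_sum{0};  // scratch shared by this group only
        std::vector<std::thread> items;
        for (std::size_t i = base; i < end; ++i) {
            items.emplace_back([&, i] {
                const long a = values[i];  // the per-item value "a"
                group_sum.fetch_add(a);    // atomic: siblings race on the scratch
            });
        }
        for (auto& item : items) {
            item.join();  // acts as the barrier before the scratch is read
        }
        total.fetch_add(group_sum.load());  // one update per group, not per item
    }
    return total.load();
}
```

The point of the hierarchy is visible in the last step: only one update per group reaches the top-level sum, so most of the traffic stays in the group's scratch.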

24 Irregular Domains
Ragged array data structure
  N-D array- / grid-of-lists
  {{A0,A1}, {B0,B1,B2}, {}, {D0,D1}, {E0}, {F0}}
Used for
  Bucketing: particles in a cell
  Collision: potential collidees
[Figure: the six buckets drawn as columns: A0 A1 | B0 B1 B2 | (empty) | D0 D1 | E0 | F0]

25 Irregular Domains
Must choose in-memory representation
  Pointer per bucket?
  {{A0,A1}, {B0,B1,B2}, {}, {D0,D1}, {E0}, {F0}}
Performance
Required operations
  Apply kernel to each bucket?
  Apply kernel to each element?
[Figure: the six buckets drawn as columns: A0 A1 | B0 B1 B2 | (empty) | D0 D1 | E0 | F0]

26 A simple representation
Logical: A0 A1 | B0 B1 B2 | (empty) | D0 D1 | E0 | F0
Physical:
  Count:   2 3 0 2 1 1
  Offset:  0 2 5 5 7 8
  Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0
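
A short sketch of how the Count, Offset and Storage arrays could be built from the logical grid-of-lists. The `Ragged` struct and the `flatten` function are hypothetical names used only for this example. Offset is the exclusive prefix sum of Count, which here equals the storage size at the moment each bucket is appended.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flat layout for a ragged array: per-bucket count and start
// offset, plus one contiguous storage array holding all elements.
struct Ragged {
    std::vector<int> count;
    std::vector<int> offset;
    std::vector<std::string> storage;
};

Ragged flatten(const std::vector<std::vector<std::string>>& buckets) {
    Ragged r;
    for (const auto& bucket : buckets) {
        r.count.push_back(static_cast<int>(bucket.size()));
        // Exclusive prefix sum of the counts so far.
        r.offset.push_back(static_cast<int>(r.storage.size()));
        r.storage.insert(r.storage.end(), bucket.begin(), bucket.end());
    }
    assert(r.count.size() == r.offset.size());
    return r;
}
```

Applied to the six buckets on the slide, this yields counts 2 3 0 2 1 1 and offsets 0 2 5 5 7 8. The empty bucket costs one count and one offset entry but no storage.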

27 Apply to each element
Count:   2 3 0 2 1 1
Offset:  0 2 5 5 7 8
Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0

28 Apply to each bin
Count:   2 3 0 2 1 1
Offset:  0 2 5 5 7 8
Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0
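
The two launch shapes can be sketched against the flat arrays as follows. Both function names are hypothetical, the element type is assumed to be `float`, and the loops stand in for data-parallel launches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Apply to each element: the domain is [0, storage.size()), so the bin
// structure is ignored and every element gets its own work item.
void scale_each_element(std::vector<float>& storage, float factor) {
    for (std::size_t i = 0; i < storage.size(); ++i) {
        storage[i] *= factor;
    }
}

// Apply to each bin: the domain is [0, count.size()). Work item b walks
// only its own slice [offset[b], offset[b] + count[b]) of the storage.
std::vector<float> sum_each_bin(const std::vector<int>& count,
                                const std::vector<int>& offset,
                                const std::vector<float>& storage) {
    assert(count.size() == offset.size());
    std::vector<float> sums(count.size(), 0.0f);
    for (std::size_t b = 0; b < count.size(); ++b) {
        for (int k = 0; k < count[b]; ++k) {
            sums[b] += storage[static_cast<std::size_t>(offset[b] + k)];
        }
    }
    return sums;
}
```

The per-element form is perfectly regular. The per-bin form does uneven work per item (0 to 3 elements in the slide's example), which is where the irregularity shows up as load imbalance.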

29 Irregular data parallelism
Key insight: represent irregular structure as flat index and storage arrays
  Many other representations possible
Allows efficient data-parallel implementation of some irregular algorithms
  Many examples in the literature

30 Pipeline parallelism

31 Key Idea
Algorithm is an ordered sequence of stages
  Each stage emits zero or more items
Increase throughput by running stages in parallel
Exploit producer-consumer locality
  On-chip FIFOs
  Efficient bus between cores

32 GPU Pipeline (DX10)
Pipeline of
  Fixed-function stages
  Programmable stages
    Data-parallel kernels
Stages run in parallel
  Even for unified cores
Queues between stages
  Often in HW
Stages: Input Assembly -> Vertex Shading -> Primitive Setup -> Geometry Shading -> Rasterization -> Pixel Shading -> Output Merging

33 Why pipelines?
Variable rate amplification
  Rasterizer: 1 tri in, 0-N fragments out
  Ray tracer: 1 hit in, 0-N secondary/shadow rays out
Load imbalance
[Figure: three rasterizer (Rast) units illustrating load imbalance]

34 Pipelines can cope with imbalance
Re-balance load between stages
  Buffer up results for next stage
Optimize for locality
  Specialized inter-stage FIFOs
  On-chip caches, busses or scratchpads
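
A software sketch of two stages coupled by a bounded FIFO, assuming ordinary C++ threads in place of hardware stages. `Fifo` and `run_pipeline` are hypothetical names, and the capacity of 4 is arbitrary. The producer plays the rasterizer and emits a variable number of fragments per triangle; the consumer plays the shader.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical bounded FIFO between two stages. push() blocks while the
// queue is full, which bounds intermediate storage. pop() returns false
// once the queue has been closed and drained.
class Fifo {
public:
    explicit Fifo(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }

    void push(int item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(item);
        not_empty_.notify_one();
    }

    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

    bool pop(int& item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;  // closed and fully drained
        item = items_.front();
        items_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    std::queue<int> items_;
    bool closed_ = false;
};

// Stage 1 emits fragments_per_triangle[t] fragments for triangle t (0..N
// amplification). Stage 2 consumes them concurrently and counts them.
long run_pipeline(const std::vector<int>& fragments_per_triangle) {
    Fifo fifo(4);
    long shaded = 0;
    std::thread shader([&] {
        int fragment;
        while (fifo.pop(fragment)) shaded += fragment;
    });
    for (int n : fragments_per_triangle) {
        for (int k = 0; k < n; ++k) fifo.push(1);  // one unit per fragment
    }
    fifo.close();
    shader.join();
    return shaded;
}
```

The bounded capacity is the design choice worth noting: when the consumer falls behind, the producer blocks instead of growing the buffer without limit, which is one simple answer to the storage-bounding challenge for user-defined pipelines.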

35 User-defined pipelines
Standard practice for console developers
  Custom Cell/RSX graphics pipelines on PS3
Pipeline-definition tools still research area
  GRAMPS [Sugerman et al. 2009]
Challenges
  Bounding intermediate storage
  Scheduling algorithms

36 Task parallelism

37 Key Idea
Achieve scalability for heterogeneous and irregular work
  by expressing dependencies directly
Lightweight cooperative scheduling

38 What is a Task?
Think of it as an asynchronous function call
  Do X at some point in the future
  Optionally after Y is done
Might be implemented in HW or SW
Almost always cooperative, not preemptive
[Figure: task Y() followed by dependent task X()]
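
The "do X after Y is done" relationship can be approximated with standard C++ futures, as in the sketch below. It models only the dependency: `std::async` with `std::launch::async` typically runs each call on a preemptively scheduled OS thread, so this is not the lightweight cooperative scheduling the slide describes. The functions `x`, `y` and `run_x_after_y` are placeholders for this example.

```cpp
#include <cassert>
#include <future>

int y() { return 20; }                        // placeholder work for task Y
int x(int y_result) { return y_result + 1; }  // placeholder work for task X

// Launch Y asynchronously, then launch X so that it waits on Y's result.
int run_x_after_y() {
    std::shared_future<int> y_done = std::async(std::launch::async, y).share();
    assert(y_done.valid());
    std::future<int> x_done =
        std::async(std::launch::async, [y_done] { return x(y_done.get()); });
    return x_done.get();  // the caller blocks only when it needs the result
}
```

A real task system would usually avoid blocking inside X and instead enqueue X only when Y completes, which keeps worker threads busy with other tasks in the meantime.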

39 Why tasks?
Start with sequential workload
[Figure: per-core timeline, Core 0]

40 Why tasks?
Identify data- and pipeline-parallel steps
[Figure: per-core timeline, Core 0]

41 Why tasks?
Identify data- and pipeline-parallel steps
Assume perfect scaling
[Figure: per-core timeline, Cores 0-3]

42 Why tasks?
Cost now dominated by sequential part
  The part not suited to data- or pipeline-parallelism
Oh yeah, that's just Amdahl's Law
[Figure: per-core timeline, Cores 0-3]
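
For reference, Amdahl's Law can be written out with a worked instance. The numbers are illustrative assumptions, not measurements from the talk: a fraction p = 0.75 of the frame scales perfectly and the machine has n = 4 cores, as in the timeline figure.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad
S(4)\Big|_{p = 0.75} = \frac{1}{0.25 + 0.1875} \approx 2.29, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 4
```

Under these assumptions four cores give well under a 4x speedup, and no number of cores gives more than 4x, because the sequential quarter of the frame never shrinks.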

43 Using tasks
If we know dependencies between the steps

44 Using tasks
If we know dependencies between the steps
We can distribute the work across cores
  Respecting the dependencies
[Figure: per-core timeline, Cores 0-3]

45 Finite # of cores
It looks more like this
  Multiple kinds of work fill in the cracks
[Figure: per-core timeline, Cores 0-3]

46 Task/job systems
Standard practice for PS3 games
  Gaining currency on other consoles, desktop
One worker thread per HW context
Cooperative scheduling
  Pull tasks from an incoming queue
  Load balance using work stealing [Cilk]
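
A deliberately small job system along these lines might look like the sketch below. It is an assumption-laden simplification: `JobSystem` and `run_jobs` are hypothetical names, there is a single shared queue rather than per-worker queues, and there is no work stealing. Each worker pulls a job and runs it to completion.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal job system: worker threads pull jobs from one
// shared queue. No work stealing, no dependencies between jobs.
class JobSystem {
public:
    explicit JobSystem(unsigned workers) {
        assert(workers > 0);
        for (unsigned w = 0; w < workers; ++w) {
            threads_.emplace_back([this] { worker_loop(); });
        }
    }

    // The destructor lets the workers drain the queue, then joins them.
    ~JobSystem() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutting down and drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            // The job system never interrupts a running job; the OS may
            // still preempt the worker thread itself.
            job();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::vector<std::thread> threads_;
};

// Example use: submit n trivial jobs and count how many ran.
int run_jobs(int n, unsigned workers) {
    std::atomic<int> counter{0};
    {
        JobSystem jobs(workers);
        for (int i = 0; i < n; ++i) {
            jobs.submit([&counter] { counter.fetch_add(1); });
        }
    }  // leaving the scope waits for every submitted job
    return counter.load();
}
```

To follow the "one worker thread per HW context" guideline, the worker count could be taken from `std::thread::hardware_concurrency()`, which may return 0 when the value is unknown and therefore needs a fallback.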

47 Task granularity
Coarse-grained tasks easy to identify
  Can schedule poorly
Coarse-grained dependencies
  Bubble waiting for predecessor to clear
[Figure: per-core timeline, Cores 0-3, of coarse AI, Physics and Graphics tasks]

48 Task granularity
Fine-grained tasks pack well
  More scheduling overhead
Tune task size to strike a balance
[Figure: per-core timeline of fine-grained AI (A), Physics (P) and Graphics (G) tasks]
  Core 3: A A A P P G G
  Core 2: A A A A P P G G G
  Core 1: A A P P G G G
  Core 0: P P P P P P P G G G G
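
The granularity trade-off can be made concrete with a grain-size parameter, as in this sketch. The names `chunked_sum` and `task_count` are hypothetical, and `std::async` is used only as a convenient stand-in for a task system's submit call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Number of tasks created when [0, n) is split into chunks of `grain`.
std::size_t task_count(std::size_t n, std::size_t grain) {
    assert(grain > 0);
    return (n + grain - 1) / grain;  // ceiling division
}

// Sum `data` using one task per chunk of at most `grain` elements.
// Smaller grain: more tasks that pack well, but more scheduling overhead.
// Larger grain: fewer tasks, less overhead, coarser load balancing.
long chunked_sum(const std::vector<int>& data, std::size_t grain) {
    assert(grain > 0);
    std::vector<std::future<long>> tasks;
    for (std::size_t begin = 0; begin < data.size(); begin += grain) {
        const std::size_t end = std::min(data.size(), begin + grain);
        tasks.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + static_cast<std::ptrdiff_t>(begin),
                                   data.begin() + static_cast<std::ptrdiff_t>(end), 0L);
        }));
    }
    long total = 0;
    for (auto& task : tasks) total += task.get();
    return total;
}
```

The result is the same for every grain size; only the number of tasks changes, which is the knob the slide suggests tuning.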

49 Tasks take-away
Can't write sequential app with parallel pieces
  Amdahl's Law will bite you every time
  Must involve parallelism from the top down
Task systems
  Handle the code that won't fit other models
    Heterogeneous, irregular
    Dynamically generated work, dependencies
  Provide scalability and load balancing

50 Composition

51 Picking the right tools
No one model is best for all apps
  Or even all parts of one app
Real-world parallel apps use combinations
Case in point: the graphics pipeline
  Pipeline-parallel buffering between stages
  Programmable stages run data-parallel
  Task-parallel sharing of unified shader cores

52 Data Parallelism
Strengths
  Easy to get high utilization of throughput architecture
  Implicit use of SIMD/SIMT
  Implicit memory latency hiding
Weaknesses
  Works best for large, homogeneous problems
  Work efficiency drops with irregularity
  Core resources divided amongst all elements

53 Pipeline Parallelism
Strengths
  Copes with variable data amplification
  Can exploit producer-consumer locality
Weaknesses
  Best scheduling strategy workload-dependent
  No general-purpose tools for current HW

54 Task Parallelism
Strengths
  Scales even with irregular/dynamic problems
  Viable parallelism approach for global app structure
Weaknesses
  No automatic support for latency-hiding
  Need to explicitly target SIMD width

55 Summary
Data-, pipeline- and task-parallelism
  Three proven approaches to scalability
  Applicable to many problems in visual computing
Look for these to surface as we discuss
  Architectures
  Tools
  Algorithms

56 Questions?

57 Backup

58 Many possible syntaxes
Kernel Language
  kernel void k(float* A, float* B, float* C) { C[id] = A[id] + B[id]; }
  k<N>(A, B, C);
Parallel Loop
  par_for(int i = 0; i < N; i++) C[i] = A[i] + B[i];
Array Operations
  Stream<float> A, B, C;
  C = A + B;
Parallel Functional Map
  fun k(a, b) = a + b
  C = par_map(k, A, B)

59 Example syntax
Kernel Language
  kernel void k( ) {
    level_2 float sum = 0;
    level_1 float a;
    a = ...;
    atomic_add(&sum, a);
  }
  ...
  k<N, M>(A, B, C);
Parallel Loop
  par_for(int i=0; i < N; i++) {
    float sum = 0;
    par_for(int j=0; j < M; j++) {
      float a;
      a = ...;
      atomic_add(&sum, a);
    }
  }

60 Host/GPU pipeline
Graphics command stream
  Host packs, GPU consumes in parallel
Distribute pack work across N host cores
  Common technique in console graphics
  Will eventually translate to desktop
Host: Prepare Frame N  | Prepare Frame N+1 | Prepare Frame N+2
GPU:  Render Frame N-1 | Render Frame N    | Render Frame N+1

61 Tasks and threads
Task looks a lot like an OS thread
  Created with function to execute
  Waits on a queue to be scheduled to a core
  May trigger event on completion
Differences
  Cooperative, not preemptive scheduling
  Lightweight create/destroy
  Join often restricted and lightweight
