Beyond Programmable Shading Keeping Many Cores Busy: Scheduling the Graphics Pipeline


1 Keeping Many Cores Busy: Scheduling the Graphics Pipeline
Jonathan Ragan-Kelley, MIT CSAIL
29 July 2010

2 This talk
How to think about scheduling GPU-style pipelines.
Four constraints which drive scheduling decisions.
Examples of these concepts in real GPU designs.
Goals:
Know why GPUs and APIs impose the constraints they do.
Develop intuition for what they can do well.
Understand key patterns for building your own pipelines.

3 First, a definition
Scheduling [n.]: Assigning computations and data to resources in space and time.

5 The workload: Direct3D
[Figure: the logical pipeline, with data flowing through the stages IA, PA, HS, Tess, DS, PA, GS, Rast. Legend: fixed-function stage, programmable stage.]

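10 The machine: a modern GPU
[Figure: the physical processor, with programmable cores, texture units (Tex), fixed-function logic (Input Assembler, Primitive Assembler, Rasterizer, Output), and fixed-function control (Task Distributor). Legend: logical pipeline (fixed-function stage, programmable stage); physical processor (fixed-function logic, programmable core, fixed-function control).]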

11 Scheduling a draw call as a series of tasks
[Figure, built up over several slides: a timeline with one row each for the Input Assembler, Primitive Assembler, Rasterizer, and Output. IA, PA, and Rast tasks are placed on the timeline one after another.]

18 An efficient schedule keeps hardware busy
[Figure: the same timeline filled with many IA, PA, and Rast tasks, so that the hardware units stay occupied.]
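As a rough illustration of why overlapping tasks matters, the sketch below compares a serial schedule with a pipelined one. The function names, the one-unit-per-stage model, the unbounded buffering between stages, and the stage times are all assumptions made for this example; none of them come from the talk.

```python
def makespan_serial(stage_times, n_batches):
    """Each batch runs through every stage before the next batch starts."""
    return n_batches * sum(stage_times)


def makespan_pipelined(stage_times, n_batches):
    """Each stage is its own unit. A batch enters a stage once it has left
    the previous stage and the stage has finished the previous batch."""
    free_at = [0] * len(stage_times)  # time at which each stage becomes free
    for _ in range(n_batches):
        ready = 0  # time at which the current batch leaves the previous stage
        for s, t in enumerate(stage_times):
            start = max(ready, free_at[s])
            free_at[s] = start + t
            ready = free_at[s]
    return free_at[-1]


def utilization(stage_times, n_batches, makespan):
    """Fraction of unit-time slots that are doing useful work."""
    busy = n_batches * sum(stage_times)
    return busy / (len(stage_times) * makespan)


stages = [1, 1, 1]  # hypothetical IA, PA, Rast task times
serial = makespan_serial(stages, 5)
pipelined = makespan_pipelined(stages, 5)
print(serial, utilization(stages, 5, serial))
print(pipelined, utilization(stages, 5, pipelined))
```

With three unit-time stages and five batches, the serial schedule takes 15 time units and keeps each unit busy a third of the time, while the pipelined schedule takes 7.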

19 Choosing which tasks to run when (and where)
Resource constraints: Tasks can only execute when there are sufficient resources for their computation and their data.
Coherence: Control coherence is essential to shader core efficiency. Data coherence is essential to memory and communication efficiency.
Load balance: Irregularity in execution time creates bubbles in the pipeline schedule.
Ordering: Graphics APIs define strict ordering semantics, which restrict possible schedules.

20 Resource constraints limit scheduling options
[Figure, built up over several slides: a timeline with one row each for the Input Assembler, Primitive Assembler, Rasterizer, and Output. After the IA task, the following slots are marked "???" and the schedule ends in deadlock.]
Key concept: Preallocation of resources helps guarantee forward progress.
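The key concept above can be made concrete with a toy model. In the sketch below, producer tasks hold a core until they have written all of their outputs into a bounded queue, and consuming an item also needs a free core. Everything here (the `simulate` function, the step-based model, the numbers) is a hypothetical construction for illustration and does not describe any real GPU.

```python
def simulate(n_cores, queue_cap, n_producers, outputs_each, preallocate,
             max_steps=1000):
    """Return "done" if all items are consumed, "deadlock" if nothing can move."""
    pending = n_producers  # producers not yet launched
    running = []           # items still owed by each producer holding a core
    queue = 0              # items currently in the bounded queue
    reserved = 0           # queue slots promised to running producers
    consumed = 0
    total = n_producers * outputs_each

    for _step in range(max_steps):
        if consumed == total:
            return "done"
        progressed = False

        # Launch producers onto free cores.
        while pending > 0 and len(running) < n_cores:
            if preallocate and queue + reserved + outputs_each > queue_cap:
                break  # not enough output space: do not launch yet
            if preallocate:
                reserved += outputs_each
            running.append(outputs_each)
            pending -= 1
            progressed = True

        # Each running producer emits one item if the queue has room.
        for i in range(len(running)):
            if queue < queue_cap:
                running[i] -= 1
                queue += 1
                if preallocate:
                    reserved -= 1
                progressed = True
        running = [r for r in running if r > 0]  # finished producers free a core

        # Cores not held by producers each consume one item.
        for _core in range(n_cores - len(running)):
            if queue > 0:
                queue -= 1
                consumed += 1
                progressed = True

        if not progressed:
            return "deadlock"
    return "timeout"


print(simulate(2, 4, 4, 3, preallocate=False))
print(simulate(2, 4, 4, 3, preallocate=True))
```

Without preallocation, both cores end up held by producers that cannot finish because the queue is full, and no core is left to drain it. With preallocation, a producer launches only when its whole output is guaranteed to fit, so every running task can always complete.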

26 Coherence is a balancing act
Intrinsic tension between:
Horizontal (control, fetch) coherence and vertical (producer-consumer) locality.
Locality and load balance.

27 Graphics workloads are irregular
[Figure, built up over several slides: the Rasterizer emits work of very different cost, labeled SuperExpensive( ) and Trivial( ).]
But: Shader cores are optimized for regular, self-similar work. Imbalanced work creates bubbles in the task schedule.
Solution: Dynamically generating and aggregating tasks isolates irregularity and recaptures coherence. Redistributing tasks restores load balance.
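To get a feel for what aggregation buys, the sketch below counts how many fixed-width batches are needed when each triangle's fragments are batched separately versus pooled. The batch width of 32 and the fragment counts are made-up numbers, and the model ignores details that real hardware must respect, such as ordering.

```python
import math


def batches_per_triangle(frag_counts, width):
    """Every triangle gets its own batches, padded up to the batch width."""
    return sum(math.ceil(c / width) for c in frag_counts)


def batches_aggregated(frag_counts, width):
    """Fragments from all triangles are pooled before forming batches."""
    return math.ceil(sum(frag_counts) / width)


def lane_utilization(frag_counts, width, n_batches):
    """Fraction of batch lanes that carry a real fragment."""
    return sum(frag_counts) / (n_batches * width)


frags = [1, 3, 200, 2, 50]  # hypothetical fragments per triangle
width = 32                  # hypothetical batch width

separate = batches_per_triangle(frags, width)
pooled = batches_aggregated(frags, width)
print(separate, lane_utilization(frags, width, separate))
print(pooled, lane_utilization(frags, width, pooled))
```

In this example the 256 fragments need 12 batches when batched per triangle (two thirds of the lanes do useful work) but only 8 fully occupied batches when aggregated.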

33 Redistribution after irregular amplification
[Figure, built up over several slides: a timeline with one row each for the Input Assembler, Primitive Assembler, Rasterizer, and Output, showing IA, PA, and Rast tasks.]
Key concept: Managing irregularity by dynamically generating, aggregating, and redistributing tasks.
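The effect of redistribution on load balance can be sketched with a small comparison between a fixed round-robin assignment and a dynamic one in which each new task goes to whichever core frees up first. The task costs and core count are hypothetical, and the greedy policy is just one simple choice for illustration, not the policy of any particular GPU.

```python
import heapq


def static_makespan(costs, n_cores):
    """Task i always runs on core i % n_cores, regardless of load."""
    loads = [0] * n_cores
    for i, cost in enumerate(costs):
        loads[i % n_cores] += cost
    return max(loads)


def dynamic_makespan(costs, n_cores):
    """Each task, in order, is handed to the core that becomes free first."""
    free_at = [0] * n_cores  # min-heap of times at which cores become free
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


costs = [8, 1, 1, 1, 8, 1, 1, 1]  # hypothetical: a few expensive tasks among cheap ones
print(static_makespan(costs, 4))
print(dynamic_makespan(costs, 4))
```

Here the static assignment puts both expensive tasks on the same core and finishes at time 16, while the dynamic assignment finishes at time 9.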

36 Ordering
Rule: All framebuffer updates must appear as though all triangles were drawn in strict sequential order.
Key concept: Carefully structuring task redistribution to maintain API ordering.

38 Building a real pipeline

39 Static tile scheduling
The simplest thing that could possibly work.
Multiple cores: 1 front-end, n back-end.
[Figure, built up over several slides: a single vertex front-end feeding the back-end cores.]
Exemplar: ARM Mali 400

45 Static tile scheduling
Locality: captured within tiles.
Resource constraints: static = simple.
Ordering: single front-end, sequential processing within each tile.
Exemplar: ARM Mali 400

46 Static tile scheduling
The problem: load imbalance.
Only one task creation point.
No dynamic task redistribution.
[Figure: one core is overloaded while the others sit idle.]
Exemplar: ARM Mali 400
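The imbalance described above is easy to reproduce in a toy model. The sketch below assigns screen tiles to cores with a fixed interleaved mapping and counts fragments per core. The mapping, the tile size, and the workloads are assumptions chosen for illustration; the talk does not say how the Mali 400 maps tiles to cores.

```python
def tile_owner(x, y, tile_size, tiles_per_row, n_cores):
    """Statically map the tile containing pixel (x, y) to a core (hypothetical)."""
    tx, ty = x // tile_size, y // tile_size
    return (ty * tiles_per_row + tx) % n_cores


def core_loads(fragments, tile_size, tiles_per_row, n_cores):
    """Count how many fragments land on each core under the static mapping."""
    loads = [0] * n_cores
    for x, y in fragments:
        loads[tile_owner(x, y, tile_size, tiles_per_row, n_cores)] += 1
    return loads


# Hypothetical 64x64 screen, 16x16 tiles, 4 back-end cores.
one_tile = [(x, y) for x in range(16) for y in range(16)]     # all work in one tile
full_screen = [(x, y) for x in range(64) for y in range(64)]  # evenly spread work

print(core_loads(one_tile, 16, 4, 4))
print(core_loads(full_screen, 16, 4, 4))
```

When the work is spread evenly, every core receives 1024 fragments. When it is concentrated in one tile, a single core receives all 256 fragments and the other three are idle, and with no redistribution mechanism nothing can move that work.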

49 Sort-last fragment shading
[Figure, built up over several slides: Vertex and Rasterizer stages feeding fragment shading, followed by Output.]
Redistribution restores fragment load balance. But how can we maintain order?
Preallocate outputs in FIFO order.
Complete shading asynchronously.
Output fragments in FIFO order.
Exemplars: NVIDIA G80, ATI RV770
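The three steps above (preallocate in FIFO order, complete asynchronously, output in FIFO order) resemble a reorder buffer. The following is a minimal sketch of that idea; the `ReorderBuffer` class and its method names are hypothetical, and it is not meant to describe how the G80 or RV770 actually implement it.

```python
from collections import deque


class ReorderBuffer:
    """Slots are allocated in submission order and drained in that same order."""

    def __init__(self):
        self.slots = deque()

    def allocate(self, tag):
        """Reserve an output slot before the work is handed to any core."""
        slot = {"tag": tag, "result": None, "done": False}
        self.slots.append(slot)
        return slot

    def complete(self, slot, result):
        """Called whenever shading finishes, in any order."""
        slot["result"] = result
        slot["done"] = True

    def drain(self):
        """Release results from the front only while the oldest slot is done."""
        out = []
        while self.slots and self.slots[0]["done"]:
            out.append(self.slots.popleft()["result"])
        return out


rob = ReorderBuffer()
a, b, c = rob.allocate("a"), rob.allocate("b"), rob.allocate("c")

rob.complete(c, "C")
print(rob.drain())  # nothing yet: "a" and "b" are still outstanding
rob.complete(a, "A")
print(rob.drain())
rob.complete(b, "B")
print(rob.drain())
```

Even though "c" finishes first, nothing leaves the buffer until "a" is done, so the output stage still sees results in submission order.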

54 Unified shaders
Solve load balance by time-multiplexing different stages onto shared processors according to load.
[Figure: shared cores with texture units (Tex), alongside the Input Assembler, Primitive Assembler, Rasterizer, Output, and Task Distributor.]
Exemplars: NVIDIA G80, ATI RV770

55 Unified shaders: time-multiplexing cores
[Figure, built up over two slides: a timeline with one row each for the Input Assembler, Primitive Assembler, Rasterizer, and Output, showing IA, PA, and Rast tasks.]
Exemplars: NVIDIA G80, ATI RV770

57 Prioritizing the logical pipeline
[Figure, built up over several slides: the logical pipeline (IA, PA, Rast) annotated with priorities numbered 5 down to 0 and with fixed-size queue storage.]

61 Scheduling the pipeline
[Figure, built up over several slides: a timeline of IA, PA, and Rast tasks. Annotations: "High priority, but stalled on output" and "Lower priority, but ready to run".]

66 Scheduling the pipeline
Queue sizes and backpressure provide a natural knob for balancing horizontal batch coherence and producer-consumer locality.
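One way to picture priority scheduling with backpressure is a selection rule that picks the highest-priority stage that has input waiting and room in its fixed-size output queue. The sketch below is a hypothetical rendering of that rule; the `Stage` fields, the priority values, and the queue occupancies are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    priority: int  # larger means more urgent (assumed convention)
    in_count: int  # items waiting in this stage's input queue
    out_free: int  # free slots left in its fixed-size output queue


def pick_stage(stages):
    """Return the highest-priority stage that can actually run, or None."""
    ready = [s for s in stages if s.in_count > 0 and s.out_free > 0]
    return max(ready, key=lambda s: s.priority, default=None)


stages = [
    Stage("A", priority=5, in_count=4, out_free=0),  # urgent, but output queue is full
    Stage("B", priority=3, in_count=2, out_free=1),  # less urgent, but ready
    Stage("C", priority=1, in_count=0, out_free=8),  # nothing to do
]
chosen = pick_stage(stages)
print(chosen.name if chosen else None)
```

Stage "A" has the highest priority but is stalled on output, so the scheduler runs "B" instead. In this model, making a queue larger lets its producer run longer before it stalls, and making it smaller hands control to the consumer sooner.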

67 Summary

68 Key concepts
Think of scheduling the pipeline as mapping tasks onto cores.
Preallocate resources before launching a task. Preallocation helps ensure forward progress and prevent deadlock.
Graphics is irregular. Dynamically generating, aggregating, and redistributing tasks at irregular amplification points regains coherence and load balance.
Order matters. Carefully structure task redistribution to maintain ordering.

69 Questions for the future
Can we relax the strict ordering requirements?
Can you build a generic scheduler for application-defined pipelines?
What application-specific information would a generic scheduler need to work well?

70 Starting points to learn more
The next step: parallel primitive processing
Eldridge et al. Pomegranate: A Fully Scalable Graphics Architecture. SIGGRAPH.
Tim Purcell. Fast Tessellated Rendering on Fermi GF100. Hot3D, HPG.
Scheduling cyclic graphs, in software, on current GPUs
Parker et al. OptiX: A General Purpose Ray Tracing Engine. SIGGRAPH.
Details of the ARM Mali design
Tom Olson. Mali-400 MP: A Scalable GPU for Mobile Devices. Hot3D, HPG 2010.

71 Thank you
Special thanks:
Tim Purcell, Steve Molnar, Henry Moreton, Steve Parker, Austin Robison - NVIDIA
Jeremy Sugerman - Stanford
Mike Houston - AMD
Mike Doggett - Lund University
Tom Olson - ARM

72 Some Lessons

73 Why don't we have dynamic resource allocation?
E.g. recursion, malloc() in shaders.
Static preallocation of resources guarantees forward progress.
Tasks which outgrow available resources can stall, causing deadlock.

74 Geometry shaders are slow because they allow dynamic amplification in shaders.
Pick your poison:
Always stream through DRAM. Exemplar: ATI R600. Smooth falloff for large amplification, but very slow for small amplification (DRAM latency).
Scale down parallelism to fit. Exemplar: NVIDIA G80. Fast for small amplification, poor shader throughput (no parallelism) for large amplification.
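The "scale down parallelism to fit" option can be illustrated with simple arithmetic: if every invocation must have space preallocated for its worst-case output, a fixed on-chip buffer limits how many invocations can run at once. The buffer size, vertex size, and thread limit below are invented numbers and do not describe the G80 or any other real part.

```python
def concurrent_invocations(buffer_bytes, max_output_vertices,
                           bytes_per_vertex, max_threads):
    """How many invocations fit if each reserves its worst-case output."""
    per_invocation = max_output_vertices * bytes_per_vertex
    return min(max_threads, buffer_bytes // per_invocation)


# Hypothetical 16 KiB output buffer, 32-byte vertices, at most 32 threads.
print(concurrent_invocations(16384, 4, 32, 32))    # small amplification
print(concurrent_invocations(16384, 256, 32, 32))  # large amplification
```

With a declared maximum of 4 output vertices, all 32 threads fit. With a maximum of 256, only 2 invocations fit, which matches the pattern on the slide: fast for small amplification, little parallelism for large amplification.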

75 Why isn't rasterization programmable?
(Yes, it is computationally intensive.)
It is highly irregular.
It must generate and aggregate regular output.
It must integrate with an order-preserving task redistribution mechanism.
