Introduction to Parallel Programming Models

Introduction to Parallel Programming Models
Tim Foley, Stanford University

Overview
Introduce three kinds of parallelism:
- Used in visual computing
- Targeting throughput architectures
Goals:
- Establish basic terminology for the course
- Recognize idioms in your workloads
- Evaluate and select tools

Scope
Games as the representative application:
- Demand high performance and visual quality
- Already use multi-core, throughput, and heterogeneous hardware
- Visibility, illumination, physics, simulation
Not covering every possible approach:
- Explicit threads and locks
- Message-passing / actors / CSP
- Transactions / STM

What goes into a game frame?

This: the computation graph for Battlefield: Bad Company, provided by DICE.

A modern game is a mix of:
- Data-parallel algorithms

A modern game is a mix of:
- Task-parallel algorithms and coordination

A modern game is a mix of:
- Standard and extended graphics pipelines
Pipeline flow: Input Assembly → Vertex Shading → Primitive Setup → Geometry Shading → Rasterization → Pixel Shading → Output Merging

Data-Parallel. Task-Parallel. Pipeline-Parallel.

Structure of this talk
For each of these approaches:
- Key idea
- Mental model
- Applicability
Composition:
- How these models combine in the real world

Caveats
The Turing tar pit:
- Just being able to express it doesn't make it fast!
- The most general model is not always best; constraints are what enable optimizations
Not every model requires dedicated tools:
- These patterns can be expressed in many languages

Data parallelism

Key Idea
Run a single kernel over many elements:
- Per-element computations are independent
Can exploit throughput architectures well:
- Amortize per-element cost with SIMD/SIMT
- Hide memory latency with lightweight threads

Mental Model
Execute N independent work items:
- a.k.a. elements, fragments, strands, threads
All work items run the same program: the kernel
Each work item i uses data determined by its index, 0 <= i < N:
- [0, N) is the domain of computation

Domain of computation
Determines the number and shape of work items:
- Often based on the input/output data structure
- Not required: domain and data may be decoupled
Many domain shapes are possible:
- Regular
- Nested
- Irregular

Simple Data-Parallelism
Data structure: regular arrays A and B
Kernel program: void k(int i) { B[i] += A[i]; }
Domain of computation: the 1D interval [0, 6)
Computation: k(0) k(1) k(2) k(3) k(4) k(5)
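To make the example concrete, here is a minimal runnable sketch of this slide (my addition, not the talk's): C++17 parallel algorithms stand in for a GPU runtime, which would map work items onto SIMD lanes and hardware threads.

    // The slide's kernel k, applied independently over the domain [0, N).
    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        const int N = 6;
        std::vector<float> A(N, 1.0f), B(N, 2.0f);
        std::vector<int> domain(N);                    // the 1D interval [0, N)
        std::iota(domain.begin(), domain.end(), 0);
        auto k = [&](int i) { B[i] += A[i]; };         // per-element kernel
        // Work items are independent, so any execution order (or full
        // parallelism) produces the same result.
        std::for_each(std::execution::par, domain.begin(), domain.end(), k);
    }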

Simple Data-Parallelism
Data structure: N-D arrays A and B
Kernel: void k(int i, int j) { B[i][j] += A[i][j]; }
Domain of computation: N-D interval

Shapes need not match
Data structure: N-D array A, 1D array B
Kernel: void k(int i) { for(int j = 0; j < M; j++) B[i] += A[i][j]; }
Domain of computation: 1D interval over i, even though A is N-D

Advanced data-parallelism
Hierarchical domains:
- Allow work items to communicate
- Useful for sums, scans, sorts
Irregular domains:
- Nested or ragged data structures

Flat domains
Kernel temporaries / scratch data are:
- Private: inaccessible to other work items
- Transient: inaccessible after the work item completes
A flat domain exposes work-item locality:
- Optimization: put scratch data in the register file or caches

Communication
Need to communicate intermediate results:
- Each work item computed a value; now we want the sum
Write to main memory and launch a new kernel?
- Doesn't exploit locality or the rest of the memory hierarchy
Instead, employ a hierarchy of domains

Hierarchical domains
A domain composed of smaller domains:
- Each level has its own scratch memory
- Often tied to the memory hierarchy, e.g. registers, L1$, L2$, DRAM
A work item can access:
- Kernel parameters
- Its own scratch memory
- The scratch memory of its ancestors in the hierarchy

Hierarchical domains
Communicate through the parent item's scratch memory:
- e.g. each element computes a value a, then adds its local value into a shared sum
Data races are now possible; coordinate with:
- Atomic operations
- Synchronization barriers
(Also possible for global memory.)
[Figure: work items, each holding a local value a, accumulating into one shared sum]
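A rough C++ analogue of this pattern (a sketch under assumptions of mine: one thread per work item, std::atomic standing in for the parent domain's shared scratch, and integer values so the atomic add stays portable):

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const int groups = 4, items_per_group = 8;
        std::vector<std::atomic<int>> group_sum(groups);   // per-group scratch
        for (auto& s : group_sum) s.store(0);

        std::vector<std::thread> items;
        for (int g = 0; g < groups; ++g)
            for (int i = 0; i < items_per_group; ++i)
                items.emplace_back([&, g, i] {
                    int a = g * 100 + i;         // each item computes a value a
                    group_sum[g].fetch_add(a);   // atomic add avoids the race
                });
        for (auto& t : items) t.join();

        int total = 0;                           // next level of the hierarchy
        for (auto& s : group_sum) total += s.load();
        std::printf("total = %d\n", total);
    }

A barrier would be needed if items also read the shared sum before the group finished; here, joining the threads plays that role.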

Irregular Domains
Ragged array data structure:
- An N-D array- / grid-of-lists: {{A0,A1}, {B0,B1,B2}, {}, {D0,D1}, {E0}, {F0}}
Used for:
- Bucketing: particles in a cell
- Collision: potential collidees

Irregular Domains
Must choose an in-memory representation: pointer per bucket? {{A0,A1}, {B0,B1,B2}, {}, {D0,D1}, {E0}, {F0}}
Considerations:
- Performance
- Required operations: apply a kernel to each bucket? Apply a kernel to each element?

A simple representation
Logical: {{A0,A1}, {B0,B1,B2}, {}, {D0,D1}, {E0}, {F0}}
Physical:
Count:   2  3  0  2  1  1
Offset:  0  2  5  5  7  8
Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0
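This count/offset/storage layout is essentially a compressed-sparse-row (CSR) encoding. A small C++ sketch of the slide's exact data (the Bins name is mine):

    #include <string>
    #include <vector>

    struct Bins {
        std::vector<int>         count;    // elements per bin
        std::vector<int>         offset;   // where each bin starts in storage
        std::vector<std::string> storage;  // all elements, packed bin by bin
    };

    // Bin b's elements live at storage[offset[b] .. offset[b] + count[b]).
    const Bins bins = {
        {2, 3, 0, 2, 1, 1},
        {0, 2, 5, 5, 7, 8},
        {"A0","A1","B0","B1","B2","D0","D1","E0","F0"},
    };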

Apply to each element
One work item per storage entry:
Count:   2  3  0  2  1  1
Offset:  0  2  5  5  7  8
Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0

Apply to each bin
One work item per bin, each walking its own span of storage:
Count:   2  3  0  2  1  1
Offset:  0  2  5  5  7  8
Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0
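Both strategies fall directly out of that layout. Continuing the hypothetical Bins sketch above (process stands in for real per-element work):

    void process(const std::string& /*elem*/) { /* per-element work */ }

    // Domain = elements: one work item per storage entry.
    void per_element_kernel(const Bins& b, int e) { process(b.storage[e]); }

    // Domain = bins: one work item per bin, looping over its own elements;
    // empty bins (count == 0) simply do nothing.
    void per_bin_kernel(const Bins& b, int bin) {
        for (int e = b.offset[bin]; e < b.offset[bin] + b.count[bin]; ++e)
            process(b.storage[e]);
    }

Per-element application gives uniform work items but must map an element back to its bin if the kernel needs that; per-bin application keeps each bucket's elements together but inherits the buckets' load imbalance.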

Irregular data parallelism
Key insight: represent the irregular structure as flat index and storage arrays
- Many other representations are possible
This allows efficient data-parallel implementation of some irregular algorithms
- Many examples in the literature

Pipeline parallelism

Key Idea
An algorithm is an ordered sequence of stages:
- Each stage emits zero or more items
Increase throughput by running stages in parallel
Exploit producer-consumer locality:
- On-chip FIFOs
- Efficient buses between cores
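As a software sketch of the idea (mine, not from the talk): two stages connected by a bounded FIFO, so producer and consumer run concurrently, back-pressure bounds intermediate storage, and variable amplification is absorbed by the queue.

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>
    #include <thread>

    template <typename T>
    class BoundedQueue {
        std::queue<T> q;
        std::mutex m;
        std::condition_variable not_full, not_empty;
        std::size_t cap;
    public:
        explicit BoundedQueue(std::size_t c) : cap(c) {}
        void push(T v) {
            std::unique_lock<std::mutex> lk(m);
            not_full.wait(lk, [&] { return q.size() < cap; });  // back-pressure
            q.push(std::move(v));
            not_empty.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m);
            not_empty.wait(lk, [&] { return !q.empty(); });
            T v = std::move(q.front());
            q.pop();
            not_full.notify_one();
            return v;
        }
    };

    int main() {
        BoundedQueue<int> fifo(8);      // bounded intermediate storage
        std::thread producer([&] {      // stage 1: variable amplification
            for (int tri = 0; tri < 100; ++tri)
                for (int frag = 0; frag < tri % 3; ++frag)
                    fifo.push(tri);
            fifo.push(-1);              // end-of-stream marker
        });
        std::thread consumer([&] {      // stage 2 runs concurrently
            for (int v = fifo.pop(); v != -1; v = fifo.pop()) { /* shade v */ }
        });
        producer.join();
        consumer.join();
    }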

GPU Pipeline (DX10)
A pipeline of:
- Fixed-function stages
- Programmable stages: data-parallel kernels
Stages run in parallel, even on unified cores
Queues between stages, often in HW
Stages: Input Assembly → Vertex Shading → Primitive Setup → Geometry Shading → Rasterization → Pixel Shading → Output Merging

Why pipelines?
Variable-rate amplification:
- Rasterizer: 1 triangle in, 0-N fragments out
- Ray tracer: 1 hit in, 0-N secondary/shadow rays out
Load imbalance
[Figure: parallel rasterizer instances under unequal load]

Pipelines can cope with imbalance
Re-balance load between stages:
- Buffer up results for the next stage
Optimize for locality:
- Specialized inter-stage FIFOs
- On-chip caches, buses, or scratchpads

User-defined pipelines
Standard practice for console developers:
- Custom Cell/RSX graphics pipelines on the PS3
Pipeline-definition tools are still a research area:
- GRAMPS [Sugerman et al. 2009]
Challenges:
- Bounding intermediate storage
- Scheduling algorithms

Task parallelism

Key Idea
Achieve scalability for heterogeneous and irregular work by expressing dependencies directly
Lightweight cooperative scheduling

What is a Task?
Think of it as an asynchronous function call:
- "Do X at some point in the future"
- "Optionally, after Y is done"
Might be implemented in HW or SW
Almost always cooperative, not preemptive
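A minimal sketch of "do X after Y is done" (my assumption: C++ futures as a stand-in; real engines use cooperative job systems rather than one OS thread per task):

    #include <cstdio>
    #include <future>

    int Y() { return 42; }                       // the predecessor task
    void X(int y_result) { std::printf("X after Y: %d\n", y_result); }

    int main() {
        // "Do Y at some point in the future."
        std::future<int> y = std::async(std::launch::async, Y);
        // "Do X ... optionally after Y is done": X waits on Y's result.
        auto x = std::async(std::launch::async,
                            [fy = std::move(y)]() mutable { X(fy.get()); });
        x.wait();
    }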

Why tasks?
Start with a sequential workload, all running on Core 0.

Why tasks?
Identify the data- and pipeline-parallel steps (still all on Core 0).

Why tasks?
Identify the data- and pipeline-parallel steps
Assume perfect scaling: the parallel steps spread across Cores 0-3.

Why tasks?
Cost is now dominated by the sequential part:
- The part not suited to data- or pipeline-parallelism
- Oh yeah... that's just Amdahl's Law

Using tasks
If we know the dependencies between the steps...

Using tasks
If we know the dependencies between the steps, we can distribute the work across cores while respecting those dependencies.

Finite # of cores
In practice it looks more like this:
- Multiple kinds of work fill in the cracks across Cores 0-3

Task/job systems
Standard practice for PS3 games:
- Gaining currency on other consoles and desktop
One worker thread per HW context
Cooperative scheduling:
- Pull tasks from an incoming queue
- Load balance using work stealing [Cilk]
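A bare-bones sketch of the worker-thread pattern (a single shared queue for brevity; a work-stealing system would give each worker its own deque and steal from the others when it runs dry):

    #include <algorithm>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    int main() {
        std::queue<std::function<void()>> tasks;
        std::mutex m;
        for (int i = 0; i < 100; ++i)
            tasks.push([i] { /* job i: AI, physics, graphics, ... */ });

        // One worker per hardware context.
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        for (unsigned w = 0; w < n; ++w)
            workers.emplace_back([&] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::lock_guard<std::mutex> lk(m);
                        if (tasks.empty()) return;
                        job = std::move(tasks.front());
                        tasks.pop();
                    }
                    job();  // cooperative: runs to completion, never preempted
                }
            });
        for (auto& t : workers) t.join();
    }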

Task granularity
Coarse-grained tasks are easy to identify, but can schedule poorly:
- Coarse-grained dependencies
- Bubbles while waiting for a predecessor to clear
[Schedule: Cores 0-3 running AI, Physics, and Graphics tasks with idle gaps]

Task granularity
Fine-grained tasks pack well, but incur more scheduling overhead:
- Tune task size to strike a balance
[Schedule: Cores 0-3 densely packed with small AI (A), Physics (P), and Graphics (G) tasks]

Tasks take-away
You can't write a sequential app with parallel pieces:
- Amdahl's Law will bite you every time
- Must involve parallelism from the top down
Task systems:
- Handle the code that won't fit other models: heterogeneous, irregular, dynamically generated work and dependencies
- Provide scalability and load balancing

Composition

Picking the right tools
No one model is best for all apps:
- Or even for all parts of one app
Real-world parallel apps use combinations
Case in point: the graphics pipeline:
- Pipeline-parallel buffering between stages
- Programmable stages run data-parallel
- Task-parallel sharing of unified shader cores

Data Parallelism
Strengths:
- Easy to get high utilization of throughput architectures
- Implicit use of SIMD/SIMT
- Implicit memory latency hiding
Weaknesses:
- Works best for large, homogeneous problems
- Work efficiency drops with irregularity
- Core resources are divided among all elements

Pipeline Parallelism
Strengths:
- Copes with variable data amplification
- Can exploit producer-consumer locality
Weaknesses:
- The best scheduling strategy is workload-dependent
- No general-purpose tools for current HW

Task Parallelism
Strengths:
- Scales even with irregular/dynamic problems
- A viable parallelism approach for global app structure
Weaknesses:
- No automatic support for latency hiding
- Need to explicitly target SIMD width

Summary
Data-, pipeline-, and task-parallelism:
- Three proven approaches to scalability
- Applicable to many problems in visual computing
Look for these to surface as we discuss:
- Architectures
- Tools
- Algorithms

Questions?

Backup

Many possible syntaxes

Kernel language:
    kernel void k(float* A, float* B, float* C) { C[id] = A[id] + B[id]; }
    k<N>(A, B, C);

Parallel loop:
    par_for(int i = 0; i < N; i++)
        C[i] = A[i] + B[i];

Array operations:
    Stream<float> A, B, C;
    C = A + B;

Parallel functional map:
    fun k(a, b) = a + b
    C = par_map(k, A, B)

Example syntax

Kernel language:
    kernel void k(...) {
        level_2 float sum = 0;
        level_1 float a;
        a = ...;
        atomic_add(&sum, a);
    }
    k<N, M>(A, B, C);

Parallel loop:
    par_for(int i = 0; i < N; i++) {
        float sum = 0;
        par_for(int j = 0; j < M; j++) {
            float a;
            a = ...;
            atomic_add(&sum, a);
        }
    }

Host/GPU pipeline
Graphics command stream:
- The host packs; the GPU consumes in parallel
- Distribute pack work across N host cores
Common technique in console graphics:
- Will eventually translate to desktop
Host: Prepare Frame N | Prepare Frame N+1 | Prepare Frame N+2
GPU:  Render Frame N-1 | Render Frame N | Render Frame N+1
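A sketch of the frame overlap with two command buffers (submit_to_gpu and the buffer layout are hypothetical stand-ins, not a real driver API):

    #include <functional>
    #include <thread>
    #include <vector>

    struct CommandBuffer { std::vector<int> commands; };

    CommandBuffer pack_frame(int frame) {          // host-side pack work
        CommandBuffer cb;
        for (int i = 0; i < 1000; ++i) cb.commands.push_back(frame);
        return cb;
    }
    void submit_to_gpu(const CommandBuffer&) { /* hypothetical driver call */ }

    int main() {
        CommandBuffer buffers[2];
        buffers[0] = pack_frame(0);
        for (int frame = 1; frame < 60; ++frame) {
            // "GPU" consumes frame N-1 while the host packs frame N.
            std::thread gpu(submit_to_gpu, std::cref(buffers[(frame - 1) % 2]));
            buffers[frame % 2] = pack_frame(frame);
            gpu.join();  // frame boundary: don't reuse a buffer still in flight
        }
        submit_to_gpu(buffers[59 % 2]);            // last packed frame
    }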

Tasks and threads
A task looks a lot like an OS thread:
- Created with a function to execute
- Waits on a queue to be scheduled to a core
- May trigger an event on completion
Differences:
- Cooperative, not preemptive, scheduling
- Lightweight create/destroy
- Join is often restricted, and lightweight