Introduction to Parallel Programming Models
|
|
- Lorena Benson
- 6 years ago
- Views:
Transcription
1 Introduction to Parallel Programming Models Tim Foley Stanford University Beyond Programmable Shading 1
2 Overview Introduce three kinds of parallelism Used in visual computing Targeting throughput architectures Goals Establish basic terminology for the course Recognize idioms in your workloads Evaluate and select tools Beyond Programmable Shading 2
3 Scope Games as representative application Demand high performance, visual quality Already using MC, throughput and heterogeneous HW Visibility, illumination, physics, simulation Not covering every possible approach Explicit threads, locks Message-passing/actors/CSP Transactions/REST Beyond Programmable Shading 3
4 What goes into a game frame? Beyond Programmable Shading 4
5 This. Computation graph for Battlefied: Bad Company provided by DICE Beyond Programmable Shading 5
6 A modern game is a mix of Data-parallel algorithms Beyond Programmable Shading 6
7 A modern game is a mix of Task-parallel algorithms and coordination Beyond Programmable Shading 7
8 A modern game is a mix of Standard and extended graphics pipelines Input Assembly Vertex Shading Primitive Setup Geometry Shading Pipeline Flow Rasterization Pixel Shading Output Merging Beyond Programmable Shading 8
9 Data-Parallel Task-Parallel Pipeline-Parallel Beyond Programmable Shading 9
10 Structure of this talk For each of these approaches Key idea Mental model Applicability Composition How these models combine in the real world Beyond Programmable Shading 10 10
11 Caveats Turing Tar Pit Just being able to express it doesn t make it fast! Most general model is not always best Constraints are what enable optimizations Not every model requires dedicated tools These patterns can be expressed in many languages Beyond Programmable Shading 11
12 Data parallelism Beyond Programmable Shading 12
13 Key Idea Run a single kernel over many elements Per-element computations are independent Can exploit throughput architecture well Amortize per-element cost with SIMD/SIMT Hide memory latency with lightweight threads Beyond Programmable Shading 13
14 Mental Model Execute N independent work items aka elements, fragments, strands, threads All work items run the same program: kernel Work item uses data determined by 0 <= i < N [0, N) is the domain of computation Beyond Programmable Shading 14
15 Domain of computation Determines number and shape of work items Often based on input/output data structure Not required domain and data may be decoupled Many domain shapes possible Regular Nested Irregular Beyond Programmable Shading 15
16 Simple Data-Parallelism Data structure Regular array Data A: B: Kernel Program void k(int i) { B[i] += A[i]; } Domain of computation 1D interval k(0) K(1) K(2) K(3) K(4) K(5) Computation Beyond Programmable Shading 16
17 Simple Data-Parallelism Data structure N-D array A: B: Kernel void k(int i, int j) { B[i][j] += A[i][j]; } Domain of computation N-D interval Beyond Programmable Shading 17
18 Shapes need not match Data structure N-D array 1D array A: B: Kernel Domain of computation N-D interval void k(int i) { for(int j = 0; j < M; j++) B[i] += A[i][j]; } Beyond Programmable Shading 18
19 Advanced data-parallelism Hierarchical domains Allow work items to communicate Useful for sums, scans, sorts Irregular domains Nested or ragged data structures Beyond Programmable Shading 19
20 Flat domains Kernel temporaries / scratch data are Private: inaccessible to other work items Transient: inaccessible after work item completes Flat domain exposes work-item locality Optimization: put scratch in register file or caches Beyond Programmable Shading 20
21 Communication Need to communicate intermediate results Each work item computed value, now want sum Write to main memory, launch a new kernel? Don t exploit locality, rest of memory hierarchy Employ a hierarchy of domains Beyond Programmable Shading 21
22 Hierarchical domains A domain composed of smaller domains Each level has its own scratch memory Often tied to memory hierarchy ex. Registers, L1$, L2$, DRAM Work item can access Kernel parameters Own scratch memory Scratch memory of ancestors in hierarchy Beyond Programmable Shading 22
23 Hierarchical domains Communicate through parent item scratch ex. Each element computes value a Add local value into shared sum Data races are now possible Atomic operations Synchronization barriers Also possible for global memory sum a a a a a a a a a Beyond Programmable Shading 23
24 Irregular Domains Ragged array data structure N-D array- / grid-of-lists {{A0,A1}, {B0,B1,B2}, {}, {D0,D1}, {E0}, {F0}} Used for Bucketing: particles in a cell Collision: potential collidees A0 B0 D0 E0 F0 A1 B1 D1 B2 Beyond Programmable Shading 24
25 Irregular Domains Must choose in-memory representation Pointer per bucket? {{A0,A1}, {B0,B1,B2}, {}, {D0,D1}, {E0}, {F0}} Performance Required operations Apply kernel to each bucket? Apply kernel to each element? A0 B0 D0 E0 F0 A1 B1 D1 B2 Beyond Programmable Shading 25
26 A simple representation A0 B0 D0 E0 F0 A1 B1 D1 B2 Logical Physical Count: Offset: Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0 Beyond Programmable Shading 26
27 Apply to each element Count: Offset: Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0 Beyond Programmable Shading 27
28 Apply to each bin Count: Offset: Storage: A0 A1 B0 B1 B2 D0 D1 E0 F0 Beyond Programmable Shading 28
29 Irregular data parallelism Key insight: represent irregular structure as flat index and storage arrays Many other representations possible Allows efficient data-parallel implementation of some irregular algorithms Many examples in the literature Beyond Programmable Shading 29
30 Pipeline parallelism Beyond Programmable Shading 30
31 Key Idea Algorithm is an ordered sequence of stages Each stage emits zero or more items Increase throughput by running stages in parallel Exploit producer-consumer locality On-chip FIFOs Efficient bus between cores Beyond Programmable Shading 31
32 GPU Pipeline (DX10) Pipeline of Fixed-function stages Programmable stages Data-parallel kernels Stages run in parallel Even for unified cores Input Assembly Vertex Shading Primitive Setup Geometry Shading Rasterization Pixel Shading Queues between stages Often in HW Output Merging Beyond Programmable Shading 32
33 Why pipelines? Variable rate amplification Rasterizer: 1 tri in, 0-N fragments out Ray tracer: 1 hit in, 0-N secondary/shadow rays out Load imbalance Rast Rast Rast Beyond Programmable Shading 33
34 Pipelines can cope with imbalance Re-balance load between stages Buffer up results for next stage Optimize for locality Specialized inter-stage FIFOs On-chip caches, busses or scratchpads Beyond Programmable Shading 34
35 User-defined pipelines Standard practice for console developers Custom Cell/RSX graphics pipelines on PS3 Pipeline-definition tools still research area GRAMPS [Sugerman et al. 2009] Challenges Bounding intermediate storage Scheduling algorithms Beyond Programmable Shading 35
36 Task parallelism Beyond Programmable Shading 36
37 Key Idea Achieve scalability for heterogeneous and irregular work by expressing dependencies directly Lightweight cooperative scheduling Beyond Programmable Shading 37
38 What is a Task? Think of it as an asynchronous function call Do X at some point in the future Optionally after Y is done Y() Might be implemented in HW or SW X() Almost always cooperative, not preemptive Beyond Programmable Shading 38
39 Why tasks? Start with sequential workload Core 0: Beyond Programmable Shading 39
40 Why tasks? Identify data- and pipeline-parallel steps Core 0: Beyond Programmable Shading 40
41 Why tasks? Identify data- and pipeline-parallel steps Assume perfect scaling Core 3: Core 2: Core 1: Core 0: Beyond Programmable Shading 41
42 Why tasks? Cost now dominated by sequential part The part not suited to data- or pipeline-parallelism Oh yeah that s just Amdahl s Law Core 3: Core 2: Core 1: Core 0: Beyond Programmable Shading 42
43 Using tasks If we know dependencies between the steps Beyond Programmable Shading 43
44 Using tasks If we know dependencies between the steps We can distribute the work across cores Respecting the dependencies Core 3: Core 2: Core 1: Core 0: Beyond Programmable Shading 44
45 Finite # of cores It looks more like this Multiple kinds of work fill in the cracks Core 3: Core 2: Core 1: Core 0: Beyond Programmable Shading 45
46 Task/job systems Standard practice for PS3 games Gaining currency on other consoles, desktop One worker thread per HW context Cooperative scheduling Pull tasks from an incoming queue Load balance using work stealing [Cilk] Beyond Programmable Shading 46
47 Task granularity Coarse-grained tasks easy to identify Can schedule poorly Coarse-grained dependencies Bubble waiting for predecessor to clear Core 3: Core 2: Core 1: Core 0: AI AI AI Physics Physics Graphics Graphics Graphics Graphics Beyond Programmable Shading 47
48 Task granularity Fine-grained tasks pack well More scheduling overhead Tune task size to strike a balance Core 3: A A A P P G G Core 2: A A A A P P G G G Core 1: A A P P G G G Core 0: P P P P P P P G G G G Beyond Programmable Shading 48
49 Tasks take-away Can t write sequential app with parallel pieces Amdahl s Law will bite you every time Must involve parallelism from the top down Task systems Handle the code that won t fit other models Heterogeneous, irregular Dynamically generated work, dependencies Provide scalability and load balancing Beyond Programmable Shading 49
50 Composition Beyond Programmable Shading 50
51 Picking the right tools No one model is best for all apps Or even all parts of one app Real-world parallel apps use combinations Case in point: the graphics pipeline Pipeline-parallel buffering between stages Programmable stages run data-parallel Task-parallel sharing of unified shader cores Beyond Programmable Shading 51
52 Data Parallelism Strengths Easy to get high utilization of throughput architecture Implicit use of SIMD/SIMT Implicit memory latency hiding Weaknesses Works best for large, homogeneous problems Work efficiency drops with irregularity Core resources divided amongst all elements Beyond Programmable Shading 52
53 Pipeline Parallelism Strengths Copes with variable data amplification Can exploit producer-consumer locality Weaknesses Best scheduling strategy workload-dependent No general-purpose tools for current HW Beyond Programmable Shading 53
54 Task Parallelism Strengths Scales even with irregular/dynamic problems Viable parallelism approach for global app structure Weaknesses No automatic support for latency-hiding Need to explicitly target SIMD width Beyond Programmable Shading 54
55 Summary Data-, pipeline- and task-parallelism Three proven approaches to scalability Applicable to many problems in visual computing Look for these to surface as we discuss Architectures Tools Algorithms Beyond Programmable Shading 55
56 Questions? Beyond Programmable Shading 56
57 Backup Beyond Programmable Shading 57
58 Many possible syntaxes Kernel Language kernel void k( float* A, float* B, float* C) { C[id] = A[id] + B[id]; } k<n>(a, B, C); Parallel Loop par_for(int i = 0; i < N; i++) C[i] = A[i] + B[i]; Array Operations Parallel Functional Map Stream<float> A, B, C; C = A + B; fun k(a, b) = a + b C = par_map(k, A, B) Beyond Programmable Shading 58
59 Example syntax Kernel Language Parallel Loop kernel void k( ) { level_2 float sum = 0; level_1 float a; par_for(int i=0; i < N; i++) { float sum = 0; } a =...; atomic_add(&sum, a); par_for(int j=0; j < M; j++) { float a;... k<n, M>(A, B, C); } } a =...; atomic_add(&sum, a); Beyond Programmable Shading 59
60 Host/GPU pipeline Graphics command stream Host packs, GPU consumes in parallel Distribute pack work across N host cores Common technique in console graphics Will eventually translate to desktop Host: Prepare Frame N Prepare Frame N+1 Prepare Frame N+2 GPU: Render Frame N-1 Render Frame N Render Frame N+1 Beyond Programmable Shading 60
61 Tasks and threads Task looks a lot like an OS thread Created with function to execute Waits on a queue to be scheduled to a core May trigger event on completion Differences Cooperative, not preemptive scheduling Lightweight create/destroy Join often restricted and lightweight Beyond Programmable Shading 61
Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU)
Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU) Aaron Lefohn, Intel / University of Washington Mike Houston, AMD / Stanford 1 What s In This Talk? Overview of parallel programming
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationParallel Programming for Graphics
Beyond Programmable Shading Course ACM SIGGRAPH 2010 Parallel Programming for Graphics Aaron Lefohn Advanced Rendering Technology (ART) Intel What s In This Talk? Overview of parallel programming models
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex
More informationBeyond Programmable Shading Keeping Many Cores Busy: Scheduling the Graphics Pipeline
Keeping Many s Busy: Scheduling the Graphics Pipeline Jonathan Ragan-Kelley, MIT CSAIL 29 July 2010 This talk How to think about scheduling GPU-style pipelines Four constraints which drive scheduling decisions
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationNext-Generation Graphics on Larrabee. Tim Foley Intel Corp
Next-Generation Graphics on Larrabee Tim Foley Intel Corp Motivation The killer app for GPGPU is graphics We ve seen Abstract models for parallel programming How those models map efficiently to Larrabee
More informationScheduling the Graphics Pipeline on a GPU
Lecture 20: Scheduling the Graphics Pipeline on a GPU Visual Computing Systems Today Real-time 3D graphics workload metrics Scheduling the graphics pipeline on a modern GPU Quick aside: tessellation Triangle
More informationGRAMPS Beyond Rendering. Jeremy Sugerman 11 December 2009 PPL Retreat
GRAMPS Beyond Rendering Jeremy Sugerman 11 December 2009 PPL Retreat The PPL Vision: GRAMPS Applications Scientific Engineering Virtual Worlds Personal Robotics Data informatics Domain Specific Languages
More informationThe Bifrost GPU architecture and the ARM Mali-G71 GPU
The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationGRAMPS: A Programming Model for Graphics Pipelines and Heterogeneous Parallelism Jeremy Sugerman March 5, 2009 EEC277
GRAMPS: A Programming Model for Graphics Pipelines and Heterogeneous Parallelism Jeremy Sugerman March 5, 2009 EEC277 History GRAMPS grew from, among other things, our GPGPU and Cell processor work, especially
More informationCompute-mode GPU Programming Interfaces
Lecture 8: Compute-mode GPU Programming Interfaces Visual Computing Systems What is a programming model? Programming models impose structure A programming model provides a set of primitives/abstractions
More informationFrom Shader Code to a Teraflop: How Shader Cores Work
From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs NVIDIA
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationFurther Developing GRAMPS. Jeremy Sugerman FLASHG January 27, 2009
Further Developing GRAMPS Jeremy Sugerman FLASHG January 27, 2009 Introduction Evaluation of what/where GRAMPS is today Planned next steps New graphs: MapReduce and Cloth Sim Speculative potpourri, outside
More informationCUDA C Programming Mark Harris NVIDIA Corporation
CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment
More informationPowerVR Hardware. Architecture Overview for Developers
Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationLecture 25: Board Notes: Threads and GPUs
Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel
More informationBeyond Programmable Shading. Scheduling the Graphics Pipeline
Beyond Programmable Shading Scheduling the Graphics Pipeline Jonathan Ragan-Kelley, MIT CSAIL 9 August 2011 Mike s just showed how shaders can use large, coherent batches of work to achieve high throughput.
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationBreaking Down Barriers: An Intro to GPU Synchronization. Matt Pettineo Lead Engine Programmer Ready At Dawn Studios
Breaking Down Barriers: An Intro to GPU Synchronization Matt Pettineo Lead Engine Programmer Ready At Dawn Studios Who am I? Ready At Dawn for 9 years Lead Engine Programmer for 5 I like GPUs and APIs!
More informationBuilding scalable 3D applications. Ville Miettinen Hybrid Graphics
Building scalable 3D applications Ville Miettinen Hybrid Graphics What s going to happen... (1/2) Mass market: 3D apps will become a huge success on low-end and mid-tier cell phones Retro-gaming New game
More informationFrom Brook to CUDA. GPU Technology Conference
From Brook to CUDA GPU Technology Conference A 50 Second Tutorial on GPU Programming by Ian Buck Adding two vectors in C is pretty easy for (i=0; i
More informationFrom Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)
From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real
More informationASYNCHRONOUS SHADERS WHITE PAPER 0
ASYNCHRONOUS SHADERS WHITE PAPER 0 INTRODUCTION GPU technology is constantly evolving to deliver more performance with lower cost and lower power consumption. Transistor scaling and Moore s Law have helped
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationOptimizing DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number
More informationRay Tracing with Multi-Core/Shared Memory Systems. Abe Stephens
Ray Tracing with Multi-Core/Shared Memory Systems Abe Stephens Real-time Interactive Massive Model Visualization Tutorial EuroGraphics 2006. Vienna Austria. Monday September 4, 2006 http://www.sci.utah.edu/~abe/massive06/
More informationExploring Parallelism At Different Levels
Exploring Parallelism At Different Levels Balanced composition and customization of optimizations 7/9/2014 DragonStar 2014 - Qing Yi 1 Exploring Parallelism Focus on Parallelism at different granularities
More informationGPU Architecture. Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)
GPU Architecture Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationPerformance Optimization Part II: Locality, Communication, and Contention
Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine
More informationGraphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics
Why GPU? Chapter 1 Graphics Hardware Graphics Processing Unit (GPU) is a Subsidiary hardware With massively multi-threaded many-core Dedicated to 2D and 3D graphics Special purpose low functionality, high
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationProgramming Larrabee: Beyond Data Parallelism. Dr. Larry Seiler Intel Corporation
Dr. Larry Seiler Intel Corporation Intel Corporation, 2009 Agenda: What do parallel applications need? How does Larrabee enable parallelism? How does Larrabee support data parallelism? How does Larrabee
More informationCMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationGeneral Purpose GPU Programming (1) Advanced Operating Systems Lecture 14
General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels
More informationEECS 487: Interactive Computer Graphics
EECS 487: Interactive Computer Graphics Lecture 21: Overview of Low-level Graphics API Metal, Direct3D 12, Vulkan Console Games Why do games look and perform so much better on consoles than on PCs with
More informationLecture 4: Geometry Processing. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)
Lecture 4: Processing Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today Key per-primitive operations (clipping, culling) Various slides credit John Owens, Kurt Akeley,
More informationDesigning Parallel Programs. This review was developed from Introduction to Parallel Computing
Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis
More informationParallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville
Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationPowerVR Series5. Architecture Guide for Developers
Public Imagination Technologies PowerVR Series5 Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.
More informationOptimisation. CS7GV3 Real-time Rendering
Optimisation CS7GV3 Real-time Rendering Introduction Talk about lower-level optimization Higher-level optimization is better algorithms Example: not using a spatial data structure vs. using one After that
More informationGraphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university
Graphics Architectures and OpenCL Michael Doggett Department of Computer Science Lund university Overview Parallelism Radeon 5870 Tiled Graphics Architectures Important when Memory and Bandwidth limited
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationWorkloads Programmierung Paralleler und Verteilter Systeme (PPV)
Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment
More informationVulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.
Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics www.imgtec.com Introduction Who am I? Kevin Sun Working at Imagination Technologies
More informationGeForce4. John Montrym Henry Moreton
GeForce4 John Montrym Henry Moreton 1 Architectural Drivers Programmability Parallelism Memory bandwidth 2 Recent History: GeForce 1&2 First integrated geometry engine & 4 pixels/clk Fixed-function transform,
More informationCS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside
CS230 : Computer Graphics Lecture 4 Tamar Shinar Computer Science & Engineering UC Riverside Shadows Shadows for each pixel do compute viewing ray if ( ray hits an object with t in [0, inf] ) then compute
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationEnhancing Traditional Rasterization Graphics with Ray Tracing. October 2015
Enhancing Traditional Rasterization Graphics with Ray Tracing October 2015 James Rumble Developer Technology Engineer, PowerVR Graphics Overview Ray Tracing Fundamentals PowerVR Ray Tracing Pipeline Using
More informationA brief introduction to OpenMP
A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism
More informationGPU Sparse Graph Traversal
GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationProgrammable Graphics Hardware (GPU) A Primer
Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism
More informationScalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009
Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation
More informationGPU Task-Parallelism: Primitives and Applications. Stanley Tzeng, Anjul Patney, John D. Owens University of California at Davis
GPU Task-Parallelism: Primitives and Applications Stanley Tzeng, Anjul Patney, John D. Owens University of California at Davis This talk Will introduce task-parallelism on GPUs What is it? Why is it important?
More informationRay Tracing. Computer Graphics CMU /15-662, Fall 2016
Ray Tracing Computer Graphics CMU 15-462/15-662, Fall 2016 Primitive-partitioning vs. space-partitioning acceleration structures Primitive partitioning (bounding volume hierarchy): partitions node s primitives
More informationFRUSTUM-TRACED RASTER SHADOWS: REVISITING IRREGULAR Z-BUFFERS
FRUSTUM-TRACED RASTER SHADOWS: REVISITING IRREGULAR Z-BUFFERS Chris Wyman, Rama Hoetzlein, Aaron Lefohn 2015 Symposium on Interactive 3D Graphics & Games CONTRIBUTIONS Full scene, fully dynamic alias-free
More informationComputer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015
18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April
More informationPerformance Issues in Parallelization. Saman Amarasinghe Fall 2010
Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationA SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization
A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization Jordi Roca Victor Moya Carlos Gonzalez Vicente Escandell Albert Murciego Agustin Fernandez, Computer Architecture
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationGPU Memory Model Overview
GPU Memory Model Overview John Owens University of California, Davis Department of Electrical and Computer Engineering Institute for Data Analysis and Visualization SciDAC Institute for Ultrascale Visualization
More informationGPUs and GPGPUs. Greg Blanton John T. Lubia
GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationΠαράλληλη Επεξεργασία
Παράλληλη Επεξεργασία Μέτρηση και σύγκριση Παράλληλης Απόδοσης Γιάννος Σαζεϊδης Εαρινό Εξάμηνο 2013 HW 1. Homework #3 due on cuda (summary of Tesla paper on web page) Slides based on Lin and Snyder textbook
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationCourse Recap + 3D Graphics on Mobile GPUs
Lecture 18: Course Recap + 3D Graphics on Mobile GPUs Interactive Computer Graphics Q. What is a big concern in mobile computing? A. Power Two reasons to save power Run at higher performance for a fixed
More informationHSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!
Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationRSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog
RSX Best Practices Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog RSX Best Practices About libgcm Using the SPUs with the RSX Brief overview of GCM Replay December 7 th, 2004
More informationCMSC 611: Advanced. Parallel Systems
CMSC 611: Advanced Computer Architecture Parallel Systems Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationAntonio R. Miele Marco D. Santambrogio
Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First
More informationHandout 3. HSAIL and A SIMT GPU Simulator
Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants
More informationEnabling immersive gaming experiences Intro to Ray Tracing
Enabling immersive gaming experiences Intro to Ray Tracing Overview What is Ray Tracing? Why Ray Tracing? PowerVR Wizard Architecture Example Content Unity Hybrid Rendering Demonstration 3 What is Ray
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part I Lecture 4, Jan 25, 2012 Majd F. Sakr and Mohammad Hammoud Today Last 3 sessions Administrivia and Introduction to Cloud Computing Introduction to Cloud
More information