A Data-Parallel Genealogy: The GPU Family Tree

John Owens

Department of Electrical and Computer Engineering
Institute for Data Analysis and Visualization
University of California, Davis

Outline
  - Moore's Law brings opportunity
    - Gains in performance and capabilities
    - What has 20+ years of development brought us?
  - How can we use those transistors?
    - Microprocessors?
    - Stream processors
  - Today's commodity stream processor: the GPU

The past: 1987
  - 20 MIPS CPU
  [Figure courtesy Anant Agarwal]

The future: 2007
  - 1 billion transistors
  [Figure courtesy Anant Agarwal]

Today's VLSI Capability
  1. Exploits ample computation!
  2. Requires efficient communication!
  [Figure courtesy Pat Hanrahan]

NVIDIA Historicals
  [Chart: fill rate and geometry rate (logarithmic axis, 10 to 10000) versus tapeout date, Jan 1996 to Jan 2004]

GPU History: Features
  [Timeline of feature eras: before 1982, 1982-87, 1987-92, 1992-2000, 2000 onward]

Characteristics of Graphics Apps
  - Lots of arithmetic
  - Lots of parallelism
  - Simple control
  - Multiple stages
  - Feed-forward pipelines
  - Latency-tolerant / deep pipelines
  - What other applications have these characteristics?
  [Pipeline diagram: Application -> Command -> Geometry -> Rasterization -> Fragment -> Composite -> Display]

Microprocessors: A Solution?
  - Microprocessors address a different application space
    - Scalar programming model with no native data parallelism
    - Excel at control-heavy tasks
    - Not so good at data-heavy, regular applications
  - Few arithmetic units, little area
  - Optimized for low latency, not high bandwidth
  - Maybe the scalar processing model isn't the best fit
  [Die photo: Pentium III, 28.1M transistors]

Stream Programming Abstraction
  - Let's think about our problem in a new way
  - Streams
    - Collection of data records
    - All data is expressed in streams
  - Kernels
    - Inputs/outputs are streams
    - Perform computation on streams
    - Can be chained together
  [Diagram: stream -> kernel -> stream]
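The stream/kernel abstraction on this slide can be sketched in a few lines of Python. This is only an illustration, not the notation of any real stream language: the helper names `kernel` and `chain`, and the two example kernels `scale` and `offset`, are hypothetical. A stream is modeled as any iterable of records, and a kernel as a function that takes a stream and returns a stream.

```python
def kernel(fn):
    """Lift a per-record function into a stream-to-stream kernel (hypothetical helper)."""
    def run(stream):
        # Each record is processed independently of the others,
        # which is what exposes data parallelism.
        return (fn(record) for record in stream)
    return run


def chain(*kernels):
    """Chain kernels so the output stream of one feeds the next (hypothetical helper)."""
    def run(stream):
        for k in kernels:
            stream = k(stream)
        return stream
    return run


# Two toy kernels, chosen only for illustration.
scale = kernel(lambda v: 2 * v)
offset = kernel(lambda v: v + 1)

# stream -> kernel -> stream -> kernel -> stream
pipeline = chain(scale, offset)
result = list(pipeline([1, 2, 3]))
print(result)  # [3, 5, 7]
```

Because each kernel only sees its input stream, the same chain could in principle be run one record at a time, in batches, or with each kernel on its own hardware unit without changing the program.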

Why Streams?
  - Ample computation by exposing parallelism
    - Streams expose data parallelism
      - Multiple stream elements can be processed in parallel
    - Pipeline (task) parallelism
      - Multiple tasks can be processed in parallel
    - Kernels yield high arithmetic intensity
  - Efficient communication
    - Producer-consumer locality
    - Predictable memory access pattern
    - Optimize for throughput of all elements, not latency of one
    - Processing many elements at once allows latency hiding
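A back-of-the-envelope model may help show why producer-consumer locality raises arithmetic intensity. The sketch below is a toy under stated assumptions, not a measurement: it assumes one memory word per record, the same operation count for every kernel, and that chained kernels keep intermediate streams entirely on chip. All function names and numbers are hypothetical.

```python
def traffic_unchained(n_records, n_kernels):
    """Memory words moved when every kernel reads its input from
    memory and writes its output back to memory."""
    return n_kernels * 2 * n_records


def traffic_chained(n_records, n_kernels):
    """Memory words moved when intermediate streams stay on chip:
    one read of the input stream, one write of the final stream.
    n_kernels is unused; it is kept so both functions share a signature."""
    return 2 * n_records


def arithmetic_intensity(ops_per_record, n_kernels, n_records, traffic):
    """Arithmetic operations performed per memory word moved."""
    return ops_per_record * n_kernels * n_records / traffic


# Hypothetical workload: 1000 records, 4 kernels, 10 operations per record per kernel.
N, K, OPS = 1000, 4, 10
unchained = arithmetic_intensity(OPS, K, N, traffic_unchained(N, K))
chained = arithmetic_intensity(OPS, K, N, traffic_chained(N, K))
print(unchained, chained)  # 5.0 20.0
```

In this model the arithmetic is identical in both cases; only the memory traffic changes, so intensity grows with the number of kernels that can be chained.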

Taxonomy of Streaming Processors
  In common:
  - Exploit parallelism for high computation rate
    - Each stage processes many items in parallel (data parallelism)
    - Several stages can run at the same time (task parallelism)
  - Efficient communication
    - Minimize memory traffic
    - Optimized memory system
  What's different?
  - Mapping of processing units to tasks in the graphics pipeline

Stream Processors
  - Fewer processing units than tasks
  - Time-multiplexed organization
  - Each stage fully programmable
  - Stanford Imagine: 32b stream processor for image, signal, and graphics processing (2001)
  - Stanford Merrimac: 64b stream processor for scientific computing (2004)
    - Core of Stanford Streaming Supercomputer
  - Challenge: efficiently mapping all tasks to one processing unit, with no specialization
  [Figures: Stanford Imagine (2001) and Stanford Merrimac (2004). Recoverable floorplan labels: 16 RDRAM interfaces, MIPS64 20Kc, forward ECC, cache banks, network, memory switch, address generators, reorder buffers, microcontroller; dimensions 12.5 mm x 10.2 mm]

MIT RAW: Tiled Processor Architecture
  - More processing units than tasks
    - MIT RAW, IBM Cell
  - Each tile is programmable
  - Streams connect tiles
  - Task-parallel organization
  - Lots of ALUs and registers
  - Short, programmable wires
  - Challenge: software support
  [Tile diagram labels: IMEM, DMEM, PC, REG, FPU, ALU, SMEM, PC, SWITCH. Courtesy Anant Agarwal]

GPU: Special-Purpose Graphics Hardware
  - Task-parallel organization
    - Assign each task to a processing unit
    - Hardwire each unit to a specific task: huge performance advantage!
  - Provides ample computation resources
  - Efficient communication patterns
  - Dominant graphics architecture
  [Die photo: ATI Flipper, 51M transistors]
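The difference between the time-multiplexed and task-parallel organizations is in how pipeline stages are mapped onto processing units. The toy scheduler below makes that mapping concrete. It is a simplification under stated assumptions: every stage takes one step per item, the time-multiplexed case is modeled with a single programmable unit, and the task-parallel case with exactly one unit per stage. The stage list and function names are hypothetical and make no performance claim about real hardware.

```python
STAGES = ["Geometry", "Rasterization", "Fragment", "Composite"]


def time_multiplexed(stages, items):
    """One programmable unit runs each stage in turn over the whole batch.
    Returns a list of (step, unit, stage, item) tuples."""
    schedule = []
    step = 0
    for stage in stages:
        for item in items:
            schedule.append((step, "unit0", stage, item))
            step += 1
    return schedule


def task_parallel(stages, items):
    """One unit per stage; item i reaches stage s at step i + s,
    so different units work on different items in the same step."""
    schedule = []
    for i, item in enumerate(items):
        for s, stage in enumerate(stages):
            schedule.append((i + s, f"unit{s}", stage, item))
    return sorted(schedule)


items = ["tri0", "tri1", "tri2"]
tm = time_multiplexed(STAGES, items)
tp = task_parallel(STAGES, items)

# Which stages does each unit ever execute?
tm_units = {unit: {st for _, u, st, _ in tm if u == unit} for _, unit, _, _ in tm}
tp_units = {unit: {st for _, u, st, _ in tp if u == unit} for _, unit, _, _ in tp}
```

In the time-multiplexed schedule, `unit0` runs all four stages, so it must be fully programmable. In the task-parallel schedule each unit only ever sees one stage, which is what allows a unit to be hardwired to its task.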

Today's Graphics Pipeline
  - Graphics is well suited to:
    - The stream programming model
    - Stream hardware organizations
  - GPUs are a commodity stream processor!
  - What if we could apply these techniques to more general-purpose problems?
  - GPUs should excel at tasks that require ample computation and efficient communication.
  - What's missing?
  [Pipeline diagram: Application -> Command -> Geometry -> Rasterization -> Fragment -> Composite -> Display]

The Programmable Pipeline
  [Pipeline diagram: Application -> Command -> Geometry -> Rasterization -> Fragment -> Composite -> Display]
  [Block diagram: GeForce 6800, courtesy NVIDIA. Labeled blocks: Z-Cull, Triangle Setup, Shader Instruction Dispatch, Fragment Crossbar, L2 Tex, and four Memory Partitions]

Conclusions
  - Adding programmability to GPUs is exciting!
    - Confluence: lots of computation; expertise in harnessing that computation; commodity production; programmability
  - GPUs have great performance
    - Computation & communication
  - Programmability allows them to address many interesting problems
  - Many challenges remain
    - Algorithms, programming models, architecture, languages, tools
  - Next speaker: Aaron Lefohn, The Programming Model