Scientific Computing on GPUs: GPU Architecture Overview


Scientific Computing on GPUs: GPU Architecture Overview
Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder
PPAM 2011 Tutorial, Toruń, Poland, September 11, 2011
http://gpgpu.org/ppam11

Overview
Goals of this talk:
- Introduce GPU hardware on an abstract level, as a mental model
- Establish intuition for which workloads may benefit from the GPU architecture, and to what extent
- Obtain the architectural understanding required to write and optimise code for GPUs
- Explicit non-goal: computer architecture, electrical engineering

Three major ideas that make GPUs fast:
- Simplify and multiply
- Share the instruction stream: wide SIMD
- Hide latencies by interleaving
Instantiations of these ideas: NVIDIA Fermi and AMD Cypress

Acknowledgements
Kayvon Fatahalian (Stanford / Carnegie Mellon University): most of this talk is based on, and even taken directly from, his classic talk From Shader Code to a Teraflop: How GPU Shader Cores Work, presented at various SIGGRAPH conferences since 2008. Used with kind permission. http://graphics.stanford.edu/~kayvonf/
See also:
- K. Fatahalian and M. Houston: A Closer Look at GPUs. Communications of the ACM, Vol. 51, No. 10 (October 2008)
- M. Garland and D. B. Kirk: Understanding Throughput-Oriented Architectures. Communications of the ACM, Vol. 53, No. 11 (November 2010)

Throughput Computing: three key design ideas

CPU-Style Cores

Idea 1: Simplification
Remove everything that makes a single instruction stream run fast:
- Caches (50% of die area in typical CPUs!)
- Hard-wired logic: out-of-order execution, branch prediction, memory pre-fetching

Consequence: Use Many Simple Cores in Parallel
Invest the saved transistors into more copies of the simple core.
More copies: more than a conventional multicore CPU could afford.


Instruction Stream Sharing
Observations:
- The same operation, e.g. some lighting effect or geometry transformation, is typically performed on many different data items
- GPGPU: many operations exhibit the same ample natural parallelism: vector addition, sparse matrix-vector multiply, ...
- This is called data parallelism
- Why run many identical instruction streams simultaneously?

Idea 2: SIMD Processing
- GPU design assumption: expect only data-parallel workloads and exploit this to the maximum extent in the chip design
- Amortise the cost/complexity of managing an instruction stream across many ALUs to reduce overhead (ALUs are very cheap!)
- Increase the ratio of peak useful flops vs. total transistors
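To make the data-parallelism assumption concrete, here is a minimal CUDA sketch (the kernel name and launch parameters are illustrative, not from the talk): every work item executes exactly the same instruction stream, and only the array index it works on differs.

// SAXPY, y = a*x + y: the same instructions for every work item,
// applied to different data elements.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard for n not a multiple of the block size
        y[i] = a * x[i] + y[i];
}

// Example launch, one work item per vector element:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);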

Instruction Stream Sharing
16 cores × 8-wide SIMD = 128 operations in parallel.
Still possible: 16 independent (but 8-wide each) instruction streams.

Problems with SIMD
- SIMD processing implies SIMD memory access (details in the talks by Tim later, Session 3)
- Branch divergence
  - Resolution: masking of ALUs = serialisation of conditionals
  - The implementation in hardware incurs a small overhead
  - Example (assuming identical workload in the if and else branches):
    Phase 1: 3 of 8 ALUs execute the if-branch, 5 are idle
    Phase 2: 5 ALUs execute the else-branch, 3 are idle
    Cost: 2x compared to no branch, plus the small HW overhead
  - Consequence: not all ALUs do useful work, 1/8 of peak performance in the worst case
  - But: it is common practice to force a few ALUs to idle, e.g. for non-multiple-of-8 workloads (empty else branch)
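A minimal CUDA sketch of the divergence cost just described (the kernel is illustrative, not from the talk): when work items of one SIMD group disagree on a predicate, the hardware masks ALUs and serialises the two sides, so the group pays for both.

__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // If threads of the same warp take different sides of this branch,
    // the hardware runs the if-side and the else-side one after the other
    // with part of the ALUs masked off: roughly 2x the cost of a
    // branch-free formulation, exactly as sketched above.
    if (in[i] > 0.0f)
        out[i] = sqrtf(in[i]);   // only the "true" threads are active here
    else
        out[i] = 0.0f;           // only the "false" threads are active here
}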

Branch Serialisation

SIMD Processing: SIMD Instructions
Option 1: Explicit vector instructions
- Conventional x86: SSE and AVX
- PowerPC: AltiVec
- Intel Knights Ferry/Knights Corner (Many Integrated Core, Larrabee)
- Typical SIMD width: 2-8

Option 2: Scalar instructions, implicit HW vectorisation
- More programmer-friendly, at the expense of more implicit knowledge
- HW determines instruction stream sharing across ALUs
- The amount of sharing is deliberately hidden from software
- Contemporary GPU designs (details a few slides ahead):
  NVIDIA GPUs: SIMT warps
  AMD GPUs: VLIW wavefronts
- Typical SIMD width: 16-64
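A hedged side-by-side sketch of the two options, written as one CUDA source file (the function names are made up for illustration): option 1 makes the SIMD width explicit in the source via x86 SSE intrinsics on the host, while option 2 writes a scalar per-work-item kernel and lets the GPU hardware group threads into warps/wavefronts behind the programmer's back.

#include <immintrin.h>   // SSE intrinsics for the host-side example

// Option 1: explicit 4-wide vector instructions; the width is visible in the code.
void add_explicit_simd(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)               // scalar tail
        c[i] = a[i] + b[i];
}

// Option 2: scalar instructions per work item; the hardware implicitly groups
// threads into SIMD batches (warps of 32 on NVIDIA, wavefronts of 64 on AMD),
// and the width never appears in the source.
__global__ void add_implicit_simd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}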

Improving Throughput
- Stalls: delays due to dependencies in the instruction stream. Example: the operands of an ADD depend on a not-yet-completed LOAD instruction
- Latency: accessing data from memory easily takes 1000+ cycles
- But simplification (idea #1) removed the fancy caches and logic that help to hide stalls. Bad idea?
- GPUs assume LOTS of independent work (independent SIMD groups)

Idea 3: Interleave processing of many work groups on a single core
- Switch to the instruction stream of another (non-stalled = ready) SIMD group whenever the currently active group stalls
- GPUs manage this in HW, overhead-free (!)
- Ideally, latency is fully hidden and throughput is maximised
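A back-of-envelope way to quantify how much independent work this needs is Little's law; the numbers below are illustrative assumptions, not figures from the talk:

    work in flight ≈ latency × issue rate
    e.g. 1000 cycles × 1 memory operation issued per cycle ≈ 1000 outstanding work items,
    i.e. roughly 30 SIMD groups of 32 that the core must be able to switch between.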

Latency Hiding

Storing Contexts
Each interleaved SIMD group needs its own context (registers, state); contexts live in the core's register file and shared context storage. The same fixed storage can be split in different ways, trading the number of resident groups against the size of each context:

- 18 small contexts: high latency-hiding ability
- 12 medium contexts
- 4 large contexts: low latency-hiding ability
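In CUDA this trade-off shows up as register pressure versus occupancy. A hedged sketch (the kernel and the bounds are arbitrary example values): asking the compiler to keep each context small lets more SIMD groups stay resident, at the price of possible register spills.

// Request that the per-thread context stays small enough for at least
// 8 thread blocks of up to 256 threads each to be resident per core:
// more, smaller contexts, hence better latency hiding.
__global__ void __launch_bounds__(256, 8)
small_context_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Compiling with "nvcc -Xptxas -v" reports the registers actually used;
// together with the register-file size this determines how many contexts
// fit on one core.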

Summary
Three key ideas to maximise compute density:
- Use many simplified cores that run in parallel
- Pack the cores full of ALUs, by sharing an instruction stream across groups of work items (SIMD)
- Avoid latency stalls by interleaving the execution of many groups, with an overhead-free HW context switch

Benefit: a scalable design for all price/power/performance regimes.

Exposing the HW to programmers: the OpenCL / CUDA programming model virtualises the concept of interleaved SIMD groups and introduces a synchronisation hierarchy among groups (see the sketch below).
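A minimal hedged CUDA sketch of that hierarchy (the kernel is illustrative, not from the talk): a thread block is the virtualised SIMD group that the hardware interleaves, __syncthreads() synchronises only within such a group, and different groups never synchronise with each other inside a kernel.

__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float buf[256];            // per-group scratch storage
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // barrier within this group only

    // Tree reduction inside the group; the result of each group is written
    // out separately because groups cannot synchronise with one another.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0];
}

// Example launch with 256 work items per group:
//   block_sum<<<(n + 255) / 256, 256>>>(d_in, d_partial, n);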

Real GPUs: NVIDIA Fermi and AMD Cypress

NVIDIA GTX 480 (Fermi)
NVIDIA speak: 480 stream processors (CUDA cores), SIMT execution.
Generic speak: 15 cores, with 2 groups of 16 SIMD functional units per core.
Details of one core (streaming multiprocessor):
- Contains 32 functional units (CUDA cores)
- Two groups of 32 work items (two warps) are selected each clock
- Up to 48 warps are interleaved, so up to 1536 contexts can be stored
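As a quick cross-check of the two vocabularies (simple arithmetic, not spelled out on the slide): 15 cores × 32 functional units per core = 480 CUDA cores, and 48 interleaved warps × 32 work items per warp = 1536 resident contexts per core.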

ATI Radeon HD 5870 (Cypress)
AMD speak: 1600 stream processors.
Generic speak: 20 cores, with 16 beefy SIMD units per core and 5-wide VLIW processing in each functional unit.
Details of one core (SIMD engine):
- Groups of 64 work items (a wavefront) share an instruction stream of 5-wide VLIW instructions
- Four clocks to execute an instruction for the entire group
- 20 of these SIMD engines on the full chip
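The counts multiply out the same way (again simple arithmetic as a cross-check): 20 cores × 16 SIMD units per core × 5 VLIW lanes per unit = 1600 stream processors, and a 64-wide wavefront on 16-wide SIMD hardware needs 64 / 16 = 4 clocks per instruction, matching the statement above.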