EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

Recap: the streaming model
1. Use many slimmed-down cores to run in parallel.
2. Pack cores full of ALUs by sharing an instruction stream across groups of fragments.
   - Option 1: Explicit SIMD vector instructions
   - Option 2: Implicit sharing managed by hardware
3. Avoid latency stalls by interleaving execution of many groups of fragments: when one group stalls, work on another group.
(Kayvon Fatahalian, 2008)

Make the Compute the Focus of the Architecture
- The future of GPUs is programmable processing: processors execute computing threads
- An alternative operating mode specifically for computing, so build the architecture around the processor
- Used to be only one kernel at a time
- The Thread Execution Manager issues and manages thread blocks
(G80 block diagram: Host, Input Assembler, Thread Execution Manager, Vtx/Geom/Pixel thread issue, Setup/Rstr/ZCull, texture and load/store units, L2 caches, frame buffers / global memory)
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Graphics Processing Units (GPUs)
- General-purpose many-core accelerators
- Use simple in-order shader cores in 10s/100s of numbers
- Shader core == Streaming Multiprocessor (SM) in NVIDIA GPUs
- Scalar frontend (fetch & decode) + parallel backend
  - Amortizes the cost of the frontend and control

CUDA exposes a hierarchy of data-parallel threads
- SPMD model: a single kernel is executed by all scalar threads
- Kernel / Thread-block: multiple thread-blocks (cooperative thread arrays, CTAs) compose a kernel

CUDA exposes a hierarchy of data-parallel threads (cont.)
- Kernel / Thread-block / Warp / Thread
  - Multiple warps compose a thread-block
  - Multiple threads (32) compose a warp
- A warp is scheduled as a batch of threads

SIMT for balanced programmability and HW efficiency
- SIMT: Single-Instruction, Multiple-Thread
  - Programmer writes code to be executed by scalar threads
  - Underlying hardware uses vector SIMD pipelines (lanes)
  - HW/SW groups scalar threads to execute in vector lanes
- Enhanced programmability using SIMT
  - Hardware/software supports conditional branches: each thread can follow its own control flow
  - Per-thread load/store instructions are also supported: each thread can reference arbitrary address regions

Fermi: older GPU, but the high-level design is the same
- Expand the performance sweet spot of the GPU: caching, concurrent kernels, FP64, 512 cores, GDDR5 memory
- Bring more users, more applications to the GPU: C++, Visual Studio integration, ECC
(Chip diagram: host interface, GigaThread engine, and six DRAM interfaces around a shared L2.)

Streaming Multiprocessor (SM)
- Objective: optimize for GPU computing
  - New ISA
  - Revamped issue / control flow
  - New CUDA core architecture
- 16 SMs per Fermi chip
- 32 cores per SM (512 total)
- 64 KB of configurable L1$ / shared memory
- Per-SM resources: instruction scheduler/dispatch units, register file, 16 load/store units, 4 special function units, interconnect network
- Ops/clk per SM: FP32 32, FP64 16, INT 32, SFU 4, LD/ST 16

SM Microarchitecture
- New IEEE 754-2008 arithmetic standard: fused multiply-add (FMA) for SP & DP
- New integer ALU optimized for 64-bit and extended-precision ops
- CUDA core pipeline: dispatch port, operand collector, FP unit, INT unit, result queue

Memory Hierarchy
- True cache hierarchy + on-chip shared RAM
  - On-chip shared memory: good fit for regular memory access (e.g., dense linear algebra, image processing)
  - Caches: good fit for irregular or unpredictable memory access (e.g., ray tracing, sparse matrix multiply, physics)
- L1: separate for each SM (16/48 KB); improves bandwidth and reduces latency
- L2: unified for all SMs (768 KB); fast, coherent data sharing across all cores in the GPU

GigaThread Hardware Thread Scheduler
- Hierarchically manages tens of thousands of simultaneously active threads
- 10x faster context switching on Fermi
- Overlapping kernel execution

GigaThread Streaming Data Transfer (SDT) Engine
- Dual DMA engines: simultaneous CPU-to-GPU and GPU-to-CPU data transfer
- Fully overlapped with CPU/GPU processing
- Timeline: for each kernel, the CPU work, the CPU-to-GPU transfer (SDT0), GPU execution, and the GPU-to-CPU transfer (SDT1) of successive kernels proceed concurrently

Thread Life Cycle in HW
- A grid is launched on the streaming processor array (SPA)
  - Kernels are launched as grids of thread blocks
- Blocks are serially distributed to all the SMs
  - Potentially >1 block per SM
  - At least 96 threads per block
- Each SM launches warps of threads
  - 2 levels of parallelism
- The SM schedules and executes warps that are ready to run
- As warps and blocks complete, resources are freed
  - The SPA can distribute more blocks
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign