Power-efficient Computing for Compute-intensive GPGPU Applications
Nam Sung Kim, w/ Syed Zohaib Gilani* and Michael J. Schulte*
University of Wisconsin-Madison; Advanced Micro Devices*


modern GPU architectures
- deeply pipelined for efficient resource sharing, w/ several buffering and collision-avoidance stages
- long read-after-write (RAW) latency hidden w/ extensive multi-threading
- no power-hungry data forwarding network (DFN)
modern GPGPU applications
- insufficient thread-level parallelism (TLP) to hide the long RAW latency (18-24 cycles): 40% of total stalls

overview of proposed techniques: 23% (SP/INT) and 33% (DP) higher performance
- most-recent result forwarding (MoRF): store the 3 most recent exe. results per thread, covering 80% of RAW stalls
- high-throughput FMA unit (HFMA): predict the exponent, reducing effective FP op latency to 1 cycle
- dual-VDD pipeline: reduce the VDD of the FPUs while keeping the frequency the same; the negative impact of the increased pipeline stages is countered w/ MoRF and HFMA, allowing more cores under the power constraint

outline
- motivation: insufficient TLP to hide long RAW latency
- proposed GPU architectural techniques: MoRF, HFMA, dual-VDD pipeline
- evaluation
- summary

baseline SM (Fermi-like arch.)
- 32-wide SIMT w/ 32K registers; up to 1536 threads (8 CTAs) per SM
- 24-cycle RAW latency (longer for SFU ops)
- hiding the RAW latency requires 24 active warps: 32 threads/warp x 24 warps = 768 active threads
[SM pipeline diagram: front-end -> register address unit -> bank arbitration unit -> RF banks+caches -> dispatcher -> operand collectors -> FP/INT exe. units -> result queue -> writeback; 24 cycles]
SM: streaming multiprocessor; CTA: cooperative thread array; SIMT: single-instruction, multiple-thread

occupancy-limiting factors (a minimal allocation sketch follows below)
- application characteristics: how is an application divided into kernels? how much TLP exists in different kernels? avoiding non-coalesced memory accesses & resource contention; synchronization and data sharing b/w threads
- register file size (10% of GPU power): GPUs allocate registers statically during work distribution; threads are grouped in units of CTAs; the # of registers per thread is determined by the compiler/driver; a CTA can be issued only if there are enough registers for all its threads
- shared memory size (8% of GPU power): the shared/constant memory requirement per CTA is determined by the compiler/driver; the SM must have enough shared memory to accommodate a complete CTA
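to make the allocation rule concrete, here is a minimal C++ sketch (not from the talk) of how static per-CTA resource requirements bound the number of resident CTAs and hence occupancy; the 32K registers and 1536-thread limit come from the slides, while the 48 KB shared memory, the 8-CTA cap, and the example kernel in main() are illustrative assumptions:

```cpp
// Minimal sketch of static CTA allocation on a Fermi-like SM (illustrative only).
#include <algorithm>
#include <cstdio>

struct SM {
    int regs        = 32 * 1024;   // 32K registers per SM (from the slides)
    int smem_bytes  = 48 * 1024;   // assumed shared memory per SM
    int max_threads = 1536;        // max resident threads per SM (from the slides)
    int max_ctas    = 8;           // assumed hardware CTA limit per SM
};

struct Kernel {
    int threads_per_cta;
    int regs_per_thread;
    int smem_per_cta;              // bytes
};

// A CTA is issued only if the SM still has enough registers, shared memory,
// and thread slots for the *whole* CTA.
int max_resident_ctas(const SM& sm, const Kernel& k) {
    int by_regs    = sm.regs / (k.regs_per_thread * k.threads_per_cta);
    int by_smem    = k.smem_per_cta ? sm.smem_bytes / k.smem_per_cta : sm.max_ctas;
    int by_threads = sm.max_threads / k.threads_per_cta;
    return std::min({by_regs, by_smem, by_threads, sm.max_ctas});
}

int main() {
    SM sm;
    Kernel k{256, 40, 0};          // made-up kernel: 256 threads/CTA, 40 regs/thread
    int ctas = max_resident_ctas(sm, k);
    printf("%d CTAs -> %d threads -> %.0f%% occupancy\n",
           ctas, ctas * k.threads_per_cta,
           100.0 * ctas * k.threads_per_cta / sm.max_threads);
}
```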

example low-occupancy kernels: performance constrained by long RAW latencies (worked numbers below)
- DCT: 8 CTAs w/ 64 threads/CTA = 512 active threads, 33% occupancy (limited by application partitioning, thread synchronization, data dependencies)
- NDL: 16 KB/CTA w/ 44 threads/CTA = 132 active threads, 9% occupancy (limited by shared memory per CTA)
- CFD: 47 registers/thread w/ 192 threads/CTA = 576 of 1536 threads, 37% occupancy (limited by registers per thread)
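as a worked check of the slide's numbers using the sketch above: CFD at 47 registers/thread and 192 threads/CTA needs 47 x 192 = 9024 registers per CTA, so a 32K-register SM fits only 3 CTAs = 576 of 1536 threads, i.e. the quoted ~37% occupancy; NDL at 16 KB of shared memory per CTA fits 3 CTAs (assuming 48 KB of shared memory per SM) of 44 threads each = 132 threads, ~9%.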

across applications ranging from 8% to 50% occupancy, RAW stalls due to low occupancy account for 40% of total stalled cycles; an ideal DFN would give an 18% IPC improvement

ideal DFN
- forwarding requirement: from 3 post-exe stages to 14 pre-exe stages
- high-complexity wiring: 86K wires + multiplexers (3 post-exe stages x 14 pre-exe stages x 32 lanes per SM x 32 bits per lane x 2 datapaths for INT/FP)
- high overhead of an ideal DFN: 16% of GPU power
[pipeline diagram: RF banks+caches -> dispatcher -> operand collectors (14 pre-exe stages) -> FP/INT exe. units -> result queue -> writeback (3 post-exe stages)]
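a quick arithmetic check of the 86K figure, using the per-term breakdown quoted on the slide (a throwaway C++ snippet, not part of the talk):

```cpp
#include <cstdio>

int main() {
    // Per-term breakdown taken from the slide.
    const int post_exe_stages = 3, pre_exe_stages = 14;
    const int lanes_per_sm = 32, bits_per_lane = 32, datapaths = 2;   // INT + FP
    const int wires = post_exe_stages * pre_exe_stages * lanes_per_sm
                    * bits_per_lane * datapaths;
    printf("%d forwarding wires\n", wires);   // prints 86016, i.e. the "86K wires"
}
```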

key observation: most dependences are short
  mul.wide.u16 $r3, %ctaid.y, 0x8;
  cvt.u32.u16  $r2, $r0.lo;
  mov.u32      $r0, s[0x0018];
  add.u32      $r6, $r3, $r2;
  mul.wide.u16 $r4, $r0.lo, $r6.hi;
- the snippet exhibits dependence distances of one, two, and three
- 80% of dynamic instructions have a dependence distance of 3 or less
- so: store the 3 most recent exe. results per thread in a forwarding buffer (FB)
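for illustration only, a hypothetical C++ sketch of how RAW dependence distance could be measured over an instruction trace; the mapping of the PTX registers above to small integers is mine:

```cpp
// Dependence distance = how many instructions back the producer of a source is.
#include <unordered_map>
#include <vector>
#include <cstdio>

struct Instr { int dst; std::vector<int> srcs; };

int main() {
    // The PTX snippet from the slide: $r3=3, $r2=2, $r0=0, $r6=6, $r4=4;
    // non-register sources are omitted.
    std::vector<Instr> trace = {
        {3, {}}, {2, {0}}, {0, {}}, {6, {3, 2}}, {4, {0, 6}},
    };
    std::unordered_map<int, int> last_writer;   // reg -> index of producing instr
    for (int i = 0; i < (int)trace.size(); ++i) {
        for (int s : trace[i].srcs)
            if (last_writer.count(s))
                printf("instr %d reads r%d at distance %d\n", i, s, i - last_writer[s]);
        last_writer[trace[i].dst] = i;
    }
}
```

run on this snippet it reports distances 3, 2, 2, and 1 for the register reads whose producers are in the window, consistent with the "one, two, three" annotation above.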

most-recent result forwarding (MoRF): forwarding buffer (FB)
- a small FB sits between the operand collectors and the FP/INT exe. units, organized as 3 banks (FB[0], FB[1], FB[2]) fed from the result queue through rd/wr ports
- each bank holds one 32-bit entry per warp (warp[0] ... warp[47]), indexed by warp id
- example (the PTX snippet above): as each instruction completes, its result is captured in the issuing warp's FB entries - $r3 into FB[0], $r2 into FB[1], $r0 into FB[2]
- when add.u32 $r6, $r3, $r2 issues, its source register ids are dynamically modified by the scoreboarding logic to read FB[0] and FB[1] instead of the register file; its result $r6 in turn overwrites the oldest entry (FB[0], which held $r3)
- dependent instructions within distance 3 thus receive their operands from the FB, bypassing the long RAW path through the register file
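a minimal functional model of the FB (a C++ sketch under my own assumptions: 48 warps, 3 entries per warp, and the 32-wide SIMD lanes collapsed to a single value per entry for brevity); this is illustrative, not the paper's implementation:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <optional>

constexpr int WARPS_PER_SM = 48;
constexpr int FB_ENTRIES   = 3;

struct ForwardingBuffer {
    struct Entry { int reg_id = -1; uint32_t value = 0; };
    // fb[w][i]: entry i of warp w; head[w]: next slot to overwrite (the oldest).
    std::array<std::array<Entry, FB_ENTRIES>, WARPS_PER_SM> fb{};
    std::array<int, WARPS_PER_SM> head{};

    // Writeback path: capture a just-computed result, evicting the oldest entry.
    void capture(int warp, int dst_reg, uint32_t value) {
        fb[warp][head[warp]] = {dst_reg, value};
        head[warp] = (head[warp] + 1) % FB_ENTRIES;
    }

    // Scoreboard path: if a source register was produced by one of the last 3
    // instructions of this warp, read it from the FB instead of the register
    // file (the "src. reg. id dynamically modified" step, modeled functionally).
    std::optional<uint32_t> forward(int warp, int src_reg) const {
        for (const Entry& e : fb[warp])
            if (e.reg_id == src_reg) return e.value;
        return std::nullopt;   // miss: fall back to the register file read
    }
};

int main() {
    ForwardingBuffer fb;
    const int warp = 2;                       // the warp shown in the slide example
    fb.capture(warp, /*$r3*/3, 0xAAAA);
    fb.capture(warp, /*$r2*/2, 0xBBBB);
    fb.capture(warp, /*$r0*/0, 0xCCCC);
    // add.u32 $r6, $r3, $r2 : both sources hit in the FB, no RF read needed.
    bool hit = fb.forward(warp, 3).has_value() && fb.forward(warp, 2).has_value();
    fb.capture(warp, /*$r6*/6, 0xDDDD);       // evicts the oldest entry ($r3)
    printf("operands forwarded: %s\n", hit ? "yes" : "no");
}
```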

high-throughput FMA unit (HFMA): early forwarding
- predict the exponent range of the accumulated result as [exp_blk - 13 : exp_blk + 13]
- store the intermediate (un-normalized) result within that range in a 48-entry accumulate register file (ARF); this requires a 75-bit adder instead of a 48-bit one
- forwarding the intermediate result before normalization and rounding breaks the 7-cycle dependence chain and allows an effective 1-cycle exe. latency for dependent ops
[datapath diagram: exponents of A, B, C (8 bits each) -> exponent comparison; significands of A, B (24 bits each) -> significand multiplier (48-bit product); C or the 75-bit ARF value -> alignment -> 75-bit adder -> accumulate RF, and -> LZC & normalizer -> exponent update -> rounder -> result exponent (8 bits) and result significand (24 bits)]
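a plausible reading of the 75-bit adder width (this derivation is an assumption on my part, not stated on the slide): the 24 x 24-bit significand product is 48 bits wide; if un-normalized accumulator values are only allowed to drift within [exp_blk - 13 : exp_blk + 13], alignment can shift by up to 13 bit positions in either direction, so the adder must cover roughly 48 + 13 + 13 + 1 = 75 bits.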

supporting double precision: SP HFMA w/ DP support (SPwDP)
- a dedicated DP FPU is power-hungry (54-bit significand multiplier)
- instead, reuse the SP HFMA for DP ops by shifting/accumulating partial products: A_HI x B_HI, A_HI x B_LO, A_LO x B_HI, A_LO x B_LO
- requires a 27-bit multiplier instead of a 24-bit one; 4 cycles per DP FMA
- replace 8 out of 16 DP FPUs w/ SPwDP units
pipeline comparison:
  SP HFMA : significand multiplier (24-bit x 24-bit), align (75-bit), add (75-bit), normalize (75-bit), round (24-bit)
  DP HFMA : significand multiplier (54-bit x 54-bit), align (163-bit), add (163-bit), normalize (163-bit), round (53-bit)
  SPwDP   : significand multiplier (27-bit x 27-bit), align (163-bit), add (163-bit), normalize (163-bit), round (53-bit)
power normalized to SP HFMA:
  DP HFMA : dynamic 4.6, leakage 3.7
  SPwDP   : dynamic 1.4, leakage 1.2
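an integer analogy for the SPwDP scheme (not the actual floating-point datapath): a double-width product formed by shifting and accumulating four half-width partial products, the way SPwDP forms the 54-bit significand product on a 27-bit multiplier over 4 cycles; here 16-bit halves of 32-bit operands stand in for the 27-bit halves:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t a = 0x89ABCDEF, b = 0x12345678;
    uint64_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint64_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    // One partial product per "cycle", shifted into place and accumulated,
    // mirroring A_HI*B_HI, A_HI*B_LO, A_LO*B_HI, A_LO*B_LO.
    uint64_t acc = 0;
    acc += a_lo * b_lo;
    acc += (a_lo * b_hi) << 16;
    acc += (a_hi * b_lo) << 16;
    acc += (a_hi * b_hi) << 32;

    printf("matches built-in multiply: %s\n",
           acc == (uint64_t)a * b ? "yes" : "no");
}
```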

SP/INT benchmarks (groups include memory-intensive and INT-only kernels): MoRF speedup 14%, MoRF+HFMA speedup 18%

DP benchmarks (the longer DP FP latency is replaced by the shorter HFMA latency; some kernels are memory-intensive): MoRF speedup 15%, MoRF+HFMA speedup 30%

dual-VDD pipeline
- FPU power consumption is 35% of total GPU power under TDP conditions!
- lower VDD for the exe. units, doubling exe. latency (8 cycles -> 16 cycles) to maintain the same frequency
- only a voltage-domain crossing, w/ no frequency-domain crossing
- MoRF + HFMA counter the negative impact of the increased exe. latency
- the lower GPU power allows increasing the # of SMs (i.e., higher perf.) for the same power budget
[pipeline diagram: RF banks+caches -> dispatcher -> operand collectors -> FP/INT exe. units (lower VDD) -> result queue -> writeback]
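a back-of-the-envelope illustration of why lowering VDD on the exe. units pays off, using the standard CMOS dynamic-power relation P_dyn ~ C * V^2 * f; the 1.00 V and 0.80 V supplies are assumed for illustration (not the paper's values) and leakage is ignored:

```cpp
#include <cstdio>

int main() {
    // Assumed supply voltages for illustration only.
    const double v_nom = 1.00, v_low = 0.80;
    const double exe_share = 0.35;              // FPUs ~35% of GPU power (from the slide)

    // At the same frequency, dynamic power scales with (V_low / V_nom)^2.
    double scale = (v_low / v_nom) * (v_low / v_nom);        // = 0.64
    printf("exe-unit dynamic power scales to %.0f%% of nominal\n", 100 * scale);
    printf("rough upper bound on total GPU power saved: ~%.0f%%\n",
           100 * exe_share * (1.0 - scale));                 // ~13%, ignoring leakage
}
```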

power summary
  DFN                : GPU peak power overhead 16%
  MoRF               : GPU peak power overhead 0.9%
  MoRF+HFMA          : SP FMA power increase[1] 20%, DP FMA power decrease[2] 50%, GPU peak power reduction 6.4%
  MoRF+HFMA+dual-VDD : exe. units power reduction[3] 35%, GPU peak power reduction 14%
[1] to support DP FMA ops. [2] for half as many DP FMA units. [3] including additional pipeline overhead.
- MoRF+HFMA: slightly higher SP FMA power due to additional logic and SRAM
- dual-VDD design: reduced SP and DP power consumption

SP/INT benchmarks: 23% speedup w/ MoRF+HFMA+dual-VDD+more SMs (5% additional improvement over MoRF+HFMA)

DP benchmarks: 33% speedup w/ MoRF+HFMA+dual-VDD+more SMs (3% additional improvement over MoRF+HFMA)

summary: 23% (SP/INT) and 33% (DP) higher performance
- insufficient TLP to hide the long RAW latency: 40% of total stalls, leading to an 18% perf. loss
- MoRF: stores the 3 most recent exe. results per thread
- HFMA: reduces effective FP op latency to 1 cycle
- dual-VDD pipeline: reduces the VDD of the FPUs while keeping the frequency the same, allowing more cores under the power constraint

Questions?