Intra-Warp Compaction Techniques
Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Goal
[Figure: a warp with active and idle threads, before and after compaction]
- Compact threads in a warp to coalesce (and eliminate) idle cycles → improve utilization

References
- V. Narasiman et al., "Improving GPU Performance via Large Warps and Two-Level Scheduling," MICRO 2011
- A. S. Vaidya et al., "SIMD Divergence Optimization Through Intra-Warp Compaction," ISCA 2013

Improving GPU Performance via Large Warps and Two-Level Scheduling
V. Narasiman et al., MICRO 2011
Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Goals
- Improve performance of divergent code via compaction of threads within a warp
- Integrate warp scheduling optimization with intra-warp compaction

Resource Underutilization
[Figure: utilization for 32 warps, 32 threads per warp, single SM]
- Due to control divergence
- Due to memory divergence
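To make the underutilization concrete, here is a minimal Python sketch of SIMD efficiency, the fraction of lane-slots occupied by active threads over the cycles a warp issues; the warp width and the if/else split are illustrative assumptions, not figures from the paper.

```python
# Minimal sketch of how control divergence shows up as underutilization:
# SIMD efficiency = active lane-slots / total lane-slots over the issued cycles.

def simd_efficiency(active_masks):
    """active_masks: one list of per-lane active bits per issued cycle."""
    slots = sum(len(m) for m in active_masks)
    used = sum(sum(m) for m in active_masks)
    return used / slots

# A 32-thread warp executing an if/else: 10 threads take the 'if' path and
# 22 take the 'else' path, so both sides execute with some lanes masked off.
if_mask   = [i < 10 for i in range(32)]
else_mask = [i >= 10 for i in range(32)]
print(simd_efficiency([if_mask, else_mask]))   # 0.5: half the lane-slots are idle
```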

Warp Scheduling and Locality
[Figure: warps (Warp0..WarpN) issuing loads through the SMX scheduler, L1/L2 caches, memory controller, and DRAM of a GPU attached to a host CPU]
- Opportunities for memory coalescing
- Potential for exposing memory stalls
- Overlaps memory accesses
- Degrades memory reference locality and row buffer locality

Key Ideas
- Conventional GPU round-robin (RR) fetch policy; conventional designs today set warp size = #SIMD lanes
- Use large warps → multi-cycle issue of sub-warps
  - Compact threads in a warp to form fully utilized sub-warps
- Two-level scheduler to spread memory accesses in time
  - Reduce memory-related stall cycles

Large Warps
- Typical operation: warp size = SIMD width (e.g., warp size = 4, SIMD width = 4)
- Proposed approach: warp size = 16 with SIMD width = 4 → multi-cycle issue
- Large warps are converted to a sequence of sub-warps
[Figure: register file and pipelines for the typical organization vs. the large-warp organization]

Sub-Warp Compaction
- Iteratively select one thread per column to create a packed sub-warp
- Dynamic generation of sub-warps
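The compaction step can be sketched as a simple software model; the column mapping (thread_id % SIMD width) and the function name below are assumptions for illustration, not the paper's hardware design.

```python
# Simplified model of dynamic sub-warp compaction: the large warp's active mask
# is viewed as one column per SIMD lane, and each issue cycle takes at most one
# pending thread from every column to form a packed sub-warp.

def compact_subwarps(active_mask, simd_width):
    """active_mask: list of booleans, one per thread of the large warp.
    Returns a list of sub-warps, each a list of thread ids (at most one per column)."""
    columns = [[] for _ in range(simd_width)]
    for tid, active in enumerate(active_mask):
        if active:
            columns[tid % simd_width].append(tid)

    subwarps = []
    while any(columns):
        # One issue cycle: pop the next pending thread from each non-empty column.
        subwarps.append([col.pop(0) for col in columns if col])
    return subwarps

# Large warp of 16 threads, SIMD width 4, with a divergent active mask.
mask = [True, False, True, True,
        False, True, False, True,
        True, True, True, False,
        False, False, True, True]
for cycle, sw in enumerate(compact_subwarps(mask, simd_width=4)):
    print(f"cycle {cycle}: issue threads {sw}")
# Without compaction this large warp always needs 4 issue cycles; with
# compaction it needs only as many cycles as the fullest column (here 3).
```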

Impact on the Register File
[Figure: baseline register file vs. the large-warp active mask organization and large-warp register file]
- Needs separate decoders per bank

Scheduling Constraints
- The next large warp cannot be scheduled until the first sub-warp completes execution
- Scoreboard checks for issue dependencies
  - A thread is not available for packing into a sub-warp unless its previous issue (sub-warp) has completed → single-bit status
  - Simple check
  - However, on a branch, all sub-warps must complete before the warp is eligible for instruction fetch scheduling
- Re-fetch policy for conditional branches
  - Must wait until the last sub-warp finishes
- Optimization for unconditional branch instructions
  - Don't create multiple sub-warps
  - Sub-warping always completes in a single cycle

Effect of Control Divergence
- Divergence is unknown until all sub-warps execute
  - Divergence management happens only on large-warp boundaries
  - Need to buffer sub-warp state, e.g., active masks
- The last warp effect
  - Cannot fetch the next instruction in a warp until all sub-warps issue
  - A trailing warp (warp divergence effect) can lead to many idle cycles
- Effect of the last thread
  - E.g., with data-dependent loop iteration counts across threads
  - The last thread can hold up reconvergence

Round-Robin Warp Scheduler
[Figure: warps (Warp0..WarpN) issuing loads through the SMX scheduler, caches, memory controller, and DRAM]
- Exploit inter-warp reference locality in the cache
- Exploit inter-warp reference locality in the DRAM row buffers
- However, need to maintain latency hiding

Two-Level Round-Robin Scheduler
[Figure: 16 large warps (LW0-LW15) partitioned into four fetch groups; round-robin within each fetch group and round-robin across fetch groups]
- Fetch group size: enough to keep the pipeline busy

Scheduler Behavior
- Set the fetch group size carefully → tune it to fill the pipeline
- Timeout on switching fetch groups to mitigate the last warp effect
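A minimal model of the two-level round-robin policy, assuming a warp is "not ready" while it waits on memory; the class and its interface are hypothetical and for illustration only.

```python
# Sketch of two-level round-robin warp scheduling: warps are partitioned into
# fetch groups; the scheduler round-robins within the current fetch group and
# only moves to the next group when no warp in the current group is ready
# (e.g., all are waiting on memory). This staggers long-latency memory
# accesses across groups instead of issuing them all at once.

from collections import deque

class TwoLevelRRScheduler:
    def __init__(self, num_warps, fetch_group_size):
        assert num_warps % fetch_group_size == 0
        self.groups = deque(
            deque(range(g, g + fetch_group_size))
            for g in range(0, num_warps, fetch_group_size)
        )

    def pick(self, ready):
        """ready: function warp_id -> bool. Returns a warp id to issue, or None."""
        for _ in range(len(self.groups)):          # outer level: rotate fetch groups
            group = self.groups[0]
            for _ in range(len(group)):            # inner level: RR within the group
                warp = group[0]
                group.rotate(-1)
                if ready(warp):
                    return warp
            self.groups.rotate(-1)                 # whole group stalled: switch groups
        return None                                # nothing ready this cycle

# Example: 8 warps, fetch groups of 4; warps 0-3 stalled on memory, 4-7 ready.
sched = TwoLevelRRScheduler(num_warps=8, fetch_group_size=4)
stalled = {0, 1, 2, 3}
print(sched.pick(lambda w: w not in stalled))   # falls through to group 1 -> warp 4
```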

Summary
- Intra-warp compaction is made feasible by multi-cycle warp execution
  - The mismatch between warp size and SIMD width enables flexible intra-warp compaction
- Do not make warps too big → the last thread effect begins to dominate

SIMD Divergence Optimization Through Intra-Warp Compaction
A. S. Vaidya et al., ISCA 2013
Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Goals
- Improve utilization in divergent code via intra-warp compaction
- Become familiar with the architecture of Intel's Gen integrated general-purpose GPU

Integrated GPUs: Intel HD Graphics
[Figure from "The Computer Architecture of the Intel Processor Graphics Gen9"]

Integrated GPUs: Intel HD Graphics
- Shared physical memory
- Gen graphics processor attached via a 32-byte bidirectional ring with dedicated coherence signals
- Coherent distributed cache, shared with the GPU; operates as a memory-side cache
[Figure from "The Computer Architecture of the Intel Processor Graphics Gen9"]

Inside the Gen9 EU Architecture
- Architecture Register File (ARF) plus per-thread register state
- Up to 7 threads per EU
- 128 registers of 256 bits per thread (8-way SIMD)
- Each thread executes a kernel
  - Threads may execute different kernels
  - Multi-instruction dispatch
[Figure from "The Computer Architecture of the Intel Processor Graphics Gen9"]
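A quick back-of-the-envelope check of the register budget implied by these numbers; this sketch uses only the figures quoted on the slide.

```python
# Register budget for one Gen9 EU, from the slide's numbers:
# 128 registers x 256 bits per thread, up to 7 threads per EU.
regs_per_thread = 128
bits_per_reg = 256
threads_per_eu = 7

bytes_per_thread = regs_per_thread * bits_per_reg // 8
print(bytes_per_thread)                     # 4096 bytes = 4 KB of registers per thread
print(bytes_per_thread * threads_per_eu)    # 28672 bytes = 28 KB per EU
```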

EU Operation (2)
- SIMD-32, SIMD-16, and SIMD-8 instructions are transformed into SIMD-4 operations
- Divergence/reconvergence management
- Supports both FP and integer operations
  - 4 x 32-bit FP, 8 x 16-bit FP, or 8 x 16-bit integer MAD operations each cycle
- 96 bytes/cycle read bandwidth, 32 bytes/cycle write bandwidth
- Dispatches 4 instructions from 4 threads, with constraints between issue slots
[Figure from "The Computer Architecture of the Intel Processor Graphics Gen9"]

EU Operation
- SIMD-32, SIMD-16, and SIMD-8 instructions are transformed into SIMD-4 operations
- SIMD instructions of various lengths (SIMD-4, SIMD-8, SIMD-16) can be intermixed
[Figure from "The Computer Architecture of the Intel Processor Graphics Gen9"]

Mapping the BSP Model
[Figure: a CUDA-style grid of thread blocks, with the threads of a block mapped onto SIMD instructions]
- SIMD-32, SIMD-16, and SIMD-8 instructions are transformed into SIMD-4 operations
- Multiple threads are mapped to a SIMD instance executed by an EU
- All threads in a TB (workgroup) are mapped to the same thread (for shared memory access)
[Figure from "The Computer Architecture of the Intel Processor Graphics Gen9"]

Subslice Organization
- #EUs x #threads/EU determines the width of the slice
- Flexible data interface
  - Scatter/gather support
  - Memory request coalescing across 64-byte cache lines
  - Shared memory access
[From "The Computer Architecture of the Intel Processor Graphics Gen9"]

Slice Organization
- Shared memory: 64 KB/slice, not coherent with other structures
- Flexible partitioning: shared memory, data cache, buffers for accelerators
[From "The Computer Architecture of the Intel Processor Graphics Gen9"]

Product Organization
- Load balancing: honor barrier and shared memory constraints
- Shared virtual memory
  - Share pointer-rich data structures between CPU and GPU
  - Coherent shared memory between CPU and GPU
  - Shared atomics implemented with the CPU
[From "The Computer Architecture of the Intel Processor Graphics Gen9"]

Coherent Memory Hierarchy
[Figure: the Gen memory hierarchy; some structures are not coherent]
[From "The Computer Architecture of the Intel Processor Graphics Gen9"]

Microarchitecture Operation
[Figure: pipeline with I-Fetch, Decode, I-Buffer, Issue, register files (PRF/RF), execution pipelines, D-Cache, and Writeback stages]
- Per-thread operation with a per-thread scoreboard check; arbitration and dual issue per 2 cycles
- Operand fetch/swizzle: the swizzle is encoded in the RF access
- Instruction execution happens in waves of 4-wide operations
- Note: variable-width SIMD instructions

Divergence Assessment
[Figure: SIMD efficiency for coherent applications vs. divergent applications]

Basic Cycle Compression (BCC)
[Figure: RF for a single operand; example execution]
- Skip execution cycles whose lanes are all idle; the actual operation depends on data types and execution cycles/op
- Note the power/energy savings
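A minimal sketch of the BCC idea under the 4-wide quad execution model described above; the helper names and example mask are illustrative assumptions, not the paper's implementation.

```python
# Sketch of Basic Cycle Compression (BCC), assuming a SIMD-N instruction
# executes over N/4 "quads" of 4 lanes per cycle: BCC skips any quad whose
# lanes are all inactive, saving that execution cycle.

def cycles_without_bcc(active_mask, quad_width=4):
    return len(active_mask) // quad_width

def cycles_with_bcc(active_mask, quad_width=4):
    quads = [active_mask[i:i + quad_width]
             for i in range(0, len(active_mask), quad_width)]
    # Only quads with at least one active lane consume an execution cycle.
    return sum(1 for q in quads if any(q))

# SIMD-16 instruction where only lanes 0-3 and lane 9 are active.
mask = [True]*4 + [False]*5 + [True] + [False]*6
print(cycles_without_bcc(mask), cycles_with_bcc(mask))   # 4 -> 2
```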

Swizzle Cycle Compression (SCC)

SCC Operation
[Figure: RF for a single operand; 4 lanes over cycles i..i+3, with active lanes packed into a 128-bit quad]
- Can compact active lanes across quads (pack lanes into a quad)
- Swizzle settings are overlapped with the RF access
- Note the increased area and power/energy cost
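A corresponding sketch of SCC under the same quad model: swizzling lets active lanes be packed across quad boundaries, so the cycle count depends only on the number of active lanes. The helper name and example mask are assumptions for illustration.

```python
# Sketch of Swizzle Cycle Compression: a lane swizzle packs the active lanes
# densely, so a SIMD-N instruction needs only ceil(#active / 4) execution
# cycles regardless of where the active lanes sit.

from math import ceil

def cycles_with_scc(active_mask, quad_width=4):
    num_active = sum(active_mask)
    return ceil(num_active / quad_width)

# SIMD-16 mask with exactly one active lane per quad (lanes 0, 5, 10, 15):
# BCC cannot skip any quad (4 cycles), but SCC packs the four active lanes
# into a single quad and finishes in 1 cycle.
mask = [lane % 5 == 0 for lane in range(16)]
print(cycles_with_scc(mask))   # 1
```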

Compaction Opportunities
[Figure: SIMD-8 and SIMD-16 examples with idle lanes, including cases where no further compaction is possible]
- For K active threads, what is the maximum cycle savings for a SIMD-N instruction?

Performance Savings
- There is a difference between saving cycles and saving time
  - When does reducing #cycles reduce time?
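A worked answer to the slide's question, assuming the 4-wide quad execution model: a SIMD-N instruction nominally takes N/4 cycles, ideal compaction reduces this to ceil(K/4), so the maximum saving is N/4 - ceil(K/4) cycles.

```python
# Maximum cycle savings for a SIMD-N instruction with K active threads,
# assuming 4-wide quad execution (a sketch, not the paper's exact model).
from math import ceil

def max_cycle_savings(K, N, quad_width=4):
    return N // quad_width - ceil(K / quad_width)

print(max_cycle_savings(K=4, N=16))   # 3 cycles saved (4 -> 1)
print(max_cycle_savings(K=13, N=16))  # 0: no further compaction possible
```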

Summary
- Multi-cycle warp/SIMD/work-group execution
- Optimize #cycles/warp by compressing idle cycles
  - Rearrange idle lanes via swizzling to create compression opportunities
- Sensitivity to memory interface speeds
  - Memory-bound applications may experience limited benefit

Intra-Warp Compaction
[Figure: control flow graph with basic blocks A-G]
- Scope is limited to within a warp
- Increasing the scope means increasing the warp size, either explicitly or implicitly (treating multiple warps as a single warp)