Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh

Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, Stephen W. Keckler

GPUs and Memory Bandwidth. [Figure: a GPU connected to multiple off-chip memory modules.] Many GPU applications are bottlenecked by off-chip memory bandwidth.

Opportunity: Near-Data Processing. [Figure: a GPU connected to 3D-stacked memory (memory stacks); each stack's logic layer contains an SM (Streaming Multiprocessor), a crossbar switch, and memory controllers.] Near-data processing (NDP) can significantly improve performance.

Near-Data Processing: Key Challenges. Which operations should we offload? How should we map data across multiple memory stacks?

Key Challenge 1: Which operations should be executed on the logic layer SMs? [Figure: a code fragment (T = D0; D0 = D0 + D2; D2 = T - D2; T = D1; D1 = D1 + D3; D3 = T - D3; T = D0; d_dst[i0] = D0 + D1; d_dst[i1] = T - D1;) with a question mark over whether it runs on the main GPU or on a memory stack's logic layer SM.]

Key Challenge 2: How should data be mapped across multiple 3D memory stacks? [Figure: for C = A + B, arrays A, B, and C may be placed in different memory stacks around the GPU.]

The Problem: Solving these two key challenges requires significant programmer effort. Challenge 1: Which operations to offload? Programmers need to identify the operations to offload and consider runtime behavior. Challenge 2: How to map data across multiple memory stacks? Programmers need to map all the operands of each offloaded operation to the same memory stack.

Our Goal: Enable near-data processing in GPUs transparently to the programmer.

Transparent Offloading and Mapping (TOM). Component 1 - Offloading: a new programmer-transparent mechanism to identify and decide which code portions to offload. The compiler identifies code portions to potentially offload based on their memory profile; the runtime system decides whether or not to offload each code portion based on runtime characteristics. Component 2 - Mapping: a new, simple, programmer-transparent data mapping mechanism to maximize data co-location in each memory stack.

Outline: Motivation and Our Approach; Transparent Offloading; Transparent Data Mapping; Implementation; Evaluation; Conclusion.

TOM: Transparent Offloading. Static compiler analysis: identifies code blocks as offloading candidate blocks. Dynamic offloading control: uses run-time information to make the final offloading decision for each code block.

Static Analysis: What to Offload? Goal: save off-chip memory bandwidth. [Figure: in a conventional system, loads and stores move addresses and data between the GPU and memory; with near-data processing, the GPU sends the offload request with its live-in registers and receives the live-out registers back.] The compiler performs a cost/benefit analysis using equations (in the paper). Offloading benefit: load & store instructions. Offloading cost: live-in & live-out registers.
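The compiler's actual equations appear in the paper (see the backup slide on bandwidth change equations); the following is only a minimal C++ sketch of the idea, in which the 128 B cache line size, the per-warp register size, and all identifiers are illustrative assumptions:

    // Illustrative sketch only: estimate whether offloading a candidate block
    // saves off-chip traffic. Constants and names are assumptions, not the
    // paper's exact formulation.
    struct CandidateBlock {
        int numLoads;       // load instructions in the block
        int numStores;      // store instructions in the block
        int numLiveInRegs;  // registers shipped to the memory stack
        int numLiveOutRegs; // registers shipped back to the main GPU
    };

    bool isOffloadingCandidate(const CandidateBlock& b,
                               int cacheLineBytes = 128,  // assumed GPU cache line size
                               int regBytes = 4 * 32)     // one 4 B register per thread of a 32-thread warp
    {
        // Benefit: each load/store would otherwise cross the off-chip link.
        int savedTraffic = (b.numLoads + b.numStores) * cacheLineBytes;
        // Cost: live-in registers travel to the stack, live-out registers travel back.
        int addedTraffic = (b.numLiveInRegs + b.numLiveOutRegs) * regBytes;
        return savedTraffic > addedTraffic;
    }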

Offloading Candidate Block Example (a code block in Fast Walsh Transform, FWT):

    float D0 = d_src[i0];
    float D1 = d_src[i1];
    float D2 = d_src[i2];
    float D3 = d_src[i3];
    float T;
    T = D0; D0 = D0 + D2; D2 = T - D2;
    T = D1; D1 = D1 + D3; D3 = T - D3;
    T = D0; d_dst[i0] = D0 + D1; d_dst[i1] = T - D1;
    T = D2; d_dst[i2] = D2 + D3; d_dst[i3] = T - D3;

Offloading Candidate Block Example (continued): for this FWT block, the offloading benefit outweighs the cost. Benefit: the load/store instructions (four loads from d_src and four stores to d_dst). Cost: the live-in registers (the indices and array pointers the block reads).
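As a rough check of this claim using the sketch above (the live-in register count of 6, covering i0-i3 and the two array pointers, and the absence of live-out registers are assumptions made for this example, not numbers from the paper):

    // Hypothetical instantiation of the cost/benefit sketch for the FWT block.
    CandidateBlock fwt{ /*numLoads=*/4, /*numStores=*/4,
                        /*numLiveInRegs=*/6, /*numLiveOutRegs=*/0 };
    // Benefit: 8 memory instructions * 128 B = 1024 B of off-chip traffic avoided per warp.
    // Cost:    6 live-in registers   * 128 B =  768 B of register traffic added per warp.
    bool offload = isOffloadingCandidate(fwt);  // true: benefit outweighs cost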

Conditional Offloading Candidate Block (a code block in LIBOR Monte Carlo, LIB). Benefit: load/store instructions. Cost: live-in registers.

    for (n = 0; n < Nmat; n++) {
        L_b[n] = v * delta / (1.0 + delta * L[n]);
    }

The cost of a loop is fixed, but the benefit of a loop is determined by the loop trip count. The compiler marks the loop as a conditional offloading candidate block and provides the offloading condition to hardware (e.g., loop trip count > N).
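A minimal sketch of how such a compiler-provided condition might be checked before offloading; the threshold name and interface are illustrative assumptions, not the paper's hardware design:

    // Illustrative only: the compiler attaches a condition to a conditional
    // candidate block; hardware offloads it only if the condition holds at run time.
    struct OffloadCondition {
        int minTripCount;  // offload only if the loop runs more than this many iterations
    };

    bool shouldOffloadLoop(const OffloadCondition& cond, int tripCount) {
        // The benefit grows with the trip count while the cost (shipping live-ins)
        // is fixed, so a short loop is left on the main GPU.
        return tripCount > cond.minTripCount;
    }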

TOM: Transparent Offloading. Static compiler analysis: identifies code blocks as offloading candidate blocks. Dynamic offloading control: uses run-time information to make the final offloading decision for each code block.

When Offloading Hurts: Bottleneck Channel. [Figure: offloaded blocks send register values over the transmit (TX) channel from the main GPU to the memory stack while data returns over the receive (RX) channel.] The transmit channel becomes full, leading to slowdown with offloading.

When Offloading Hurts: Memory Stack Computational Capacity. [Figure: too many warps are offloaded, filling the memory stack SM's capacity.] The memory stack SM becomes full, leading to slowdown with offloading.

Dynamic Offloading Control: When to Offload? Key idea: offload only when doing so is estimated to be beneficial. Mechanism: the hardware does not offload code blocks that would increase traffic on a bottlenecked channel, and when the computational capacity of a logic layer's SM is full, the hardware does not offload more blocks to that logic layer.
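A minimal sketch of this decision logic (illustrative only; the real mechanism lives in the offload controller and channel busy monitor described later in the talk, and all identifiers are assumptions):

    // Illustrative sketch of the dynamic offloading decision.
    struct ChannelState {
        bool txBottlenecked;  // transmit channel near saturation
        bool rxBottlenecked;  // receive channel near saturation
    };

    struct StackSmState {
        int activeWarps;      // warps currently running on the logic-layer SM
        int maxWarps;         // its computational capacity
    };

    // deltaTx/deltaRx: estimated traffic change on each channel if the block is
    // offloaded (negative means the offload reduces traffic on that channel).
    bool shouldOffload(const ChannelState& ch, const StackSmState& sm,
                       long deltaTx, long deltaRx) {
        if (sm.activeWarps >= sm.maxWarps) return false;     // stack SM is full
        if (ch.txBottlenecked && deltaTx > 0) return false;  // would worsen the TX bottleneck
        if (ch.rxBottlenecked && deltaRx > 0) return false;  // would worsen the RX bottleneck
        return true;                                         // otherwise, offload
    }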

Outline: Motivation and Our Approach; Transparent Offloading; Transparent Data Mapping; Implementation; Evaluation; Conclusion.

TOM: Transparent Data Mapping. Goal: maximize data co-location for offloaded operations in each memory stack. Key observation: many offloading candidate blocks exhibit a predictable memory access pattern: a fixed offset between their accesses.

Fixed Offset Access Patterns: Example (the same LIB loop as before):

    for (n = 0; n < Nmat; n++) {
        L_b[n] = v * delta / (1.0 + delta * L[n]);
    }

In each iteration, the two accesses go to addresses L_b_base + n and L_base + n, i.e., they differ by a fixed offset. Some address bits are therefore always the same across the accesses: use them to decide the memory stack mapping. 85% of offloading candidate blocks exhibit such fixed offset memory access patterns.

Transparent Data Mapping: Approach. Key idea: within the fixed offset bits, find the memory stack address mapping bits that maximize data co-location in each memory stack. Approach: execute a tiny fraction (e.g., 0.1%) of the offloading candidate blocks to find the best mapping among the most common consecutive bits. Problem: how do we avoid the overhead of data remapping after we find the best mapping?
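For concreteness, a minimal sketch of how a chosen pair of consecutive address bits selects the memory stack in an assumed 4-stack system (identifiers are illustrative):

    // Illustrative: with 4 memory stacks, 2 consecutive address bits pick the stack.
    // If those bits fall within the fixed offset bits of an offloaded block's
    // accesses, all of the block's operands land in the same stack.
    #include <cstdint>

    int stackIndex(uint64_t address, int mappingBitPos) {
        return static_cast<int>((address >> mappingBitPos) & 0x3);  // 2 bits -> 4 stacks
    }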

Conventional GPU Execution Model. [Figure: the CPU copies the GPU data from CPU memory to GPU memory and then launches the kernel on the GPU.]

Transparent Data Mapping: Mechanism. [Figure: the CPU delays the memory copy and kernel launch while the GPU learns the best mapping among the most common consecutive bits.] The memory copy happens only after the best mapping is found, so there is no remapping overhead.

Outline: Motivation and Our Approach; Transparent Offloading; Transparent Data Mapping; Implementation; Evaluation; Conclusion.

TOM: Putting It All Together. [Figure: SM pipeline (fetch / I-cache / decode, scoreboard, instruction buffer, issue, operand collector, ALU, MEM, shared memory, data cache, memory port / MSHR) extended with two new units.] Offload Controller: makes the offloading decision and sends the offloading request. Channel Busy Monitor: monitors TX/RX memory bandwidth.

Outline: Motivation and Our Approach; Transparent Offloading; Transparent Data Mapping; Implementation; Evaluation; Conclusion.

Evaluation Methodology. Simulator: GPGPU-Sim. Workloads: Rodinia, GPGPU-Sim workloads, CUDA SDK. System configuration: 68 SMs for the baseline, 64 + 4 SMs for the NDP system; 4 memory stacks. Core: 1.4 GHz, 48 warps/SM. Cache: 32 KB L1, 1 MB L2. Memory bandwidth: GPU-memory 80 GB/s per link (320 GB/s total); memory-memory 40 GB/s per link; memory stack internal 160 GB/s per stack (640 GB/s total).

Results: Performance Speedup. [Figure: speedup over the baseline for BP, BFS, KM, CFD, HW, LIB, RAY, FWT, SP, RD, and the average; average speedups of 1.20 and 1.30 are highlighted.] 30% average (76% maximum) performance improvement.

Results: Off-chip Memory Traffic. [Figure: normalized memory traffic, broken into GPU-memory RX, GPU-memory TX, and memory-memory, for each workload and the average.] 13% average (37% maximum) memory traffic reduction; 2.5X memory-memory traffic reduction.

More in the Paper. Other design considerations: cache coherence, virtual memory translation. Effect on energy consumption. Sensitivity studies: computational capacity of the logic layer SMs, internal and cross-stack bandwidth. Area estimation (0.018% of GPU area).

Conclusion. Near-data processing is a promising direction to alleviate the memory bandwidth bottleneck in GPUs. Problem: it requires significant programmer effort: which operations to offload, and how to map data across multiple memory stacks? Our approach, Transparent Offloading and Mapping (TOM): a new programmer-transparent mechanism to identify and decide which code portions to offload, and a programmer-transparent data mapping mechanism to maximize data co-location in each memory stack. Key result: 30% average (76% maximum) performance improvement in GPU workloads.

Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, Stephen W. Keckler

Observation on Access Pattern. [Figure: percent of offloading candidate blocks per workload, broken down by how many of their accesses have a fixed offset (all, 75%-99%, 50%-75%, 25%-50%, 0%-25%, none).] 85% of offloading candidate blocks exhibit a fixed offset pattern.

Bandwidth Change Equations:

$BW_{TX} = (REG_{TX} \cdot S_W) - (N_{LD} \cdot Coal_{LD} \cdot Miss_{LD} + N_{ST} \cdot (S_W + Coal_{ST}))$   (3)

$BW_{RX} = (REG_{RX} \cdot S_W) - (N_{LD} \cdot Coal_{LD} \cdot S_C \cdot Miss_{LD} + \tfrac{1}{4} \cdot N_{ST} \cdot Coal_{ST})$   (4)

In Equations (3) and (4), $S_W$ is the size of a warp (e.g., 32).
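Transcribed literally into code (the operator placement above is reconstructed from the garbled slide text, so treat this as an illustrative reading rather than the paper's exact formulation; all identifiers are assumptions):

    // Sketch of Equations (3) and (4) as written above.
    struct BlockStats {
        double regTx, regRx;    // live-in / live-out register traffic terms
        double nLd, nSt;        // number of load / store instructions
        double coalLd, coalSt;  // coalescing factors for loads / stores
        double missLd;          // load miss ratio
        double sC;              // coalesced access (cache line) size term
    };

    double bwChangeTx(const BlockStats& b, double sW = 32.0) {
        return (b.regTx * sW) - (b.nLd * b.coalLd * b.missLd + b.nSt * (sW + b.coalSt));
    }

    double bwChangeRx(const BlockStats& b, double sW = 32.0) {
        return (b.regRx * sW) - (b.nLd * b.coalLd * b.sC * b.missLd + 0.25 * b.nSt * b.coalSt);
    }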

Best Memory Mapping Search Space. We only need 2 bits to determine the memory stack in a system with 4 memory stacks. The sweep ranges from bit position 7 (the 128 B GPU cache line size) to bit position 16 (64 KB); based on our results, sweeping into higher bits does not make a noticeable difference. This search is done by a small hardware unit (the memory mapping analyzer), which calculates how many memory stacks would be accessed by each offloading candidate instance for all different potential memory stack mappings (e.g., using bits 7:8, 8:9, ..., 16:17 in a system with four memory stacks).
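A minimal software sketch of what the memory mapping analyzer computes (the actual hardware is far simpler; the identifiers and data structures here are illustrative assumptions):

    // Illustrative: for each candidate bit position, count how many offloading
    // candidate instances would keep all of their accesses within one memory
    // stack, and pick the bit position that co-locates the most instances.
    #include <cstdint>
    #include <set>
    #include <vector>

    int bestMappingBit(const std::vector<std::vector<uint64_t>>& instances,
                       int loBit = 7, int hiBit = 16) {
        int bestBit = loBit;
        long bestScore = -1;  // higher = more instances confined to a single stack
        for (int bit = loBit; bit <= hiBit; ++bit) {
            long score = 0;
            for (const auto& addrs : instances) {  // one offloading candidate instance
                std::set<int> stacks;
                for (uint64_t a : addrs)
                    stacks.insert(static_cast<int>((a >> bit) & 0x3));  // 2 bits -> 4 stacks
                if (stacks.size() == 1) ++score;   // all operands land in one stack
            }
            if (score > bestScore) { bestScore = score; bestBit = bit; }
        }
        return bestBit;
    }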

Best Mapping from Different Fractions of Offloading Candidate Blocks. [Figure: probability of accessing a single memory stack per offloading candidate instance, comparing the baseline mapping with the best mapping learned from the first 0.1%, 0.5%, and 1% of NDP blocks and from all NDP blocks, for each workload and the average.]

Energy Consumption Results. [Figure: normalized energy, broken into SMs, off-chip links, and DRAM devices, for the bmap and tmap configurations of each workload and the average.]

Sensitivity to the Computational Capacity of Memory Stack SMs. [Figure: speedup and normalized memory traffic for 1X-warp, 2X-warp, and 4X-warp memory stack SM capacities, for each workload and the average.]

Sensitivity to Internal Memory Bandwidth. [Figure: speedup with 1X and 2X internal memory bandwidth for each workload and the average.]