MRPB: Memory Request Prioritization for Massively Parallel Processors

Transcription:

MRPB: Memory Request Prioritization for Massively Parallel Processors. Wenhao Jia, Princeton University; Kelly A. Shaw, University of Richmond; Margaret Martonosi, Princeton University

Benefits of GPU Caches. Heterogeneous parallelism offers high performance per watt; a prominent example is CPU-GPU pairs such as AMD's Kaveri. The GPU memory hierarchy has evolved from SW-managed scratchpads to general-purpose caches. Why GPU caches? They reduce memory latency, especially for irregular accesses; they act as a memory bandwidth filter, reducing bandwidth demand; and they improve programmability, being easier to use than scratchpads.
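
To make "irregular accesses" concrete, here is a minimal, hypothetical CUDA kernel (not taken from the paper or its benchmarks): the gather below depends on run-time index data, so it cannot easily be coalesced or hand-tiled into the scratchpad, but a general-purpose L1 can still capture whatever reuse exists in the index stream.

```cuda
// Illustrative sketch only: an indirect gather whose access pattern is decided
// at run time by idx[], so it is hard to coalesce or stage in shared memory.
// A general-purpose L1 cache can still capture any reuse present in idx[].
__global__ void gather(const float* table, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[idx[i]];   // irregular, data-dependent read
}
```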

GPU Caches: Usage Issues. Unpredictable performance impact: a real-system characterization [ICS'12 Jia] found caches helpful for some kernels and ineffective for others, and occasionally harmful, with long cache lines causing excessive over-fetching. There is even some vendor uncertainty: for the NVIDIA Fermi L1, users are advised to experimentally determine the on/off and size configurations.

GPU Caches: Research Challenges. 1. Thrashing due to low per-thread capacity: CPU L1: ~8–16 kilobytes per thread; GPU L1: ~8–16 bytes per thread. 2. Resource conflict stalls due to bursts of concurrent requests from active threads: 100s of memory requests from 1000s of threads; GPU L1: 64-way & dozens of MSHRs.

Too Many Threads, Too Few Resources. [Diagram: a streaming multiprocessor (fetch/decode/issue, warp scheduler, register file, coalescer, ALU & SFU, L1T & L1C, L1D, shared memory) connected through the interconnect to L2 & DRAM; coalesced requests from warps 1-8 contend for the 4-way set-associative L1D's tags & data and MSHRs. Callouts: cache set full, stall! #threads >> #cache sets, thrashing! MSHRs full, stall!]

Cache Miss Categorization. Based on relationships between evicting & evicted threads: cache-insensitive (CI); intra-warp (IW), which mainly causes stalls; cross-warp (XW), which mainly causes thrashing. [Chart: MPKI per benchmark, ranging 0–140.]
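
As a rough illustration of this categorization (a hypothetical sketch in the style of a cache-simulator model, not the paper's methodology), one can tag each L1 line with the warp that fetched it and compare that owner against the warp causing an eviction:

```cuda
// Hypothetical sketch (host-side model code, not from the paper): classify
// contention by comparing the warp that fetched an evicted line with the warp
// whose request triggers the eviction.
#include <cstdint>

enum class Contention { IntraWarp, CrossWarp };

struct CacheLine {
    uint64_t tag   = 0;
    int      owner = -1;    // warp ID that originally fetched this line
    bool     valid = false;
};

Contention classify_eviction(const CacheLine& victim, int evicting_warp) {
    // Same warp evicting its own earlier data -> intra-warp (IW) contention,
    // which mostly shows up as resource/conflict stalls.
    if (victim.valid && victim.owner == evicting_warp)
        return Contention::IntraWarp;
    // Another warp's data is pushed out -> cross-warp (XW) thrashing.
    return Contention::CrossWarp;
}
```

Kernels whose MPKI stays low either way would fall into the cache-insensitive (CI) bucket before this per-eviction distinction matters.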

Related Work. Warp schedulers [Two-Level, CCWS]: indirectly manage caches and can't target IW contention; MRPB outperforms a state-of-the-art scheduler, CCWS. Software optimization [Throttling, Dymaxion, DL]: significant user effort and platform-dependent. Hardware techniques [RIPP, TAP, DRAM-Sched]: not directly applicable to GPU characteristics.

Research Challenges → Solutions. 1. Thrashing due to low per-thread capacity (CPU L1: ~8–16 kilobytes per thread; GPU L1: ~8–16 bytes per thread). Solution: reorder reference streams to group related requests. 2. Resource conflict stalls due to bursts of requests (100s of requests from 1000s of threads; GPU L1: 64-way & dozens of MSHRs). Solution: let requests bypass caches to avoid resource stalls. Our approach: request prioritization = reorder + bypass. Prioritized cache use → effective per-thread cache resources; 2.65x and 1.27x IPC improvements for PolyBench and Rodinia.

Memory Request Prioritization Buffer. [Diagram: the same streaming multiprocessor as before, with the MRPB inserted between the coalescer and the L1D, ahead of the interconnect to L2 & DRAM.] Goal: make reference streams cache-friendly. Key insight: delay some requests/warps to prioritize others → higher overall throughput.

Request Reordering: Reduce Thrashing. [Diagram: requests from earlier pipeline stages (color indicates source warps) pass through a queue selector into FIFO queues 1..N; a drain selector emits a cache-friendly request order to the L1 at the same rate and throughput.] Prioritization vs. fairness. Design options: signature, e.g. warp IDs (48 queues); drain policy, e.g. lowest-ID queue first.
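
A minimal sketch of this reordering buffer, under the assumptions the slide names (signature = warp ID, fixed number of queues, lowest-ID-queue-first drain); the class and method names are invented for illustration:

```cuda
// Illustrative host-side model of the reordering buffer (names are hypothetical):
// per-signature FIFO queues filled by a queue selector and emptied by a
// fixed-priority drain selector, so requests from the same warp reach the L1
// back to back.
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

struct MemRequest {
    int      warp_id;
    uint64_t address;
};

class ReorderBuffer {
public:
    explicit ReorderBuffer(std::size_t num_queues) : queues_(num_queues) {}

    // Queue selector: signature = warp ID (folded onto the available queues).
    void enqueue(const MemRequest& req) {
        queues_[req.warp_id % queues_.size()].push_back(req);
    }

    // Drain selector: fixed priority, lowest-ID non-empty queue first.
    std::optional<MemRequest> drain_one() {
        for (auto& q : queues_) {
            if (!q.empty()) {
                MemRequest r = q.front();
                q.pop_front();
                return r;
            }
        }
        return std::nullopt;   // nothing buffered this cycle
    }

private:
    std::vector<std::deque<MemRequest>> queues_;
};
```

The fixed-priority drain deliberately trades fairness for locality, which is the prioritization-vs.-fairness tension noted on the slide.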

Cache Bypassing: Reduce Stalls. [Diagram: requests from the reorder queues probe the L1's tags & data and MSHRs; hits are serviced locally, misses are forwarded to L2 & DRAM, and bypassed requests skip the L1 entirely.] On stalls, bypass the L1. The bypassing condition modulates aggressiveness: bypass-on-all-stalls vs. bypass-on-conflict-stalls-only. Exploit the GPU's weak consistency model for correctness.
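
A hedged sketch of that decision, with invented names for the L1 probe outcomes; it only illustrates how the two policies named on the slide could differ in aggressiveness:

```cuda
// Illustrative only (probe outcomes and policy names are assumptions, not the
// paper's definitions): decide whether a request should skip the L1 and go
// straight to L2/DRAM instead of waiting for a stalled resource.
enum class L1Probe {
    Hit,            // serviced by the L1
    Miss,           // L1 can allocate a line and an MSHR, forward to L2
    ConflictStall,  // target set has no allocatable way
    ResourceStall   // out of MSHRs or miss-queue entries
};

enum class BypassPolicy { OnAllStalls, OnConflictStallsOnly };

bool should_bypass(L1Probe probe, BypassPolicy policy) {
    switch (probe) {
    case L1Probe::Hit:
    case L1Probe::Miss:
        return false;                 // the cache can accept this request
    case L1Probe::ConflictStall:
        return true;                  // both policies bypass on set conflicts
    case L1Probe::ResourceStall:
        return policy == BypassPolicy::OnAllStalls;  // aggressive policy only
    }
    return false;
}
```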

MRPB Design Summary. Key observations: request reordering works on a longer timescale and reduces thrashing from cross-warp (XW) contention; cache bypassing handles more bursty behavior and reduces resource conflict stalls from intra-warp (IW) contention. A full design space exploration is in the paper: signature, drain policy, congestion/write flush, queue size and latency, bypassing policy.

Experimental Methodology. Simulated GPU: NVIDIA Tesla C2050. How does MRPB handle different cache sizes? Baseline-S: 4-way 16KB L1 vs. Baseline-L: 6-way 48KB L1. Benchmark suites: PolyBench & Rodinia. Different usage scenarios: PolyBench is cross-platform vs. Rodinia is GPU-centric. Different optimization levels: PolyBench relies on caches vs. Rodinia relies on scratchpads. Different purposes in our study: PolyBench for exploration vs. Rodinia for evaluation.

Reordering Benefits XW Applications. [Chart: normalized IPC (higher is better), 0–3, for cache-insensitive (CI), intra-warp (IW), and cross-warp (XW) applications.]

Bypassing Benefits IW Applications. [Chart: normalized IPC (higher is better), 0–14, for cache-insensitive (CI), intra-warp (IW), and cross-warp (XW) applications.]

Final Design: Reordering + Bypassing. [Chart: normalized IPC per benchmark, higher is better.] Speedups: PolyBench-S 2.65x, PolyBench-L 2.25x, Rodinia-S 1.27x, Rodinia-L 1.15x. MRPB doesn't harm any app's performance!

Improving Programmability with MRPB. Use caching + MRPB instead of shared memory. For 6 unshared (/U) Rodinia apps: 37% slower → 9% slower with MRPB. For SRAD-S, caching + MRPB outperforms the shared-memory version. Best of both worlds: better programmability & performance. [Chart: normalized IPC (higher is better) for SRAD-S and SRAD-S/U with MRPB off vs. MRPB on.]
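
To make the shared vs. unshared (/U) distinction concrete, here is a hypothetical, heavily simplified CUDA pair (not the actual Rodinia SRAD kernels): a 3-point stencil written once with an explicit scratchpad and once in the cache-reliant style that MRPB aims to make competitive. For simplicity the sketch assumes n is a multiple of the block size.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

// Scratchpad version: explicit staging, halo handling, and synchronization.
// (Assumes n is a multiple of BLOCK so no thread exits before __syncthreads.)
__global__ void stencil_shared(const float* in, float* out, int n) {
    __shared__ float s[BLOCK + 2];
    int i = blockIdx.x * BLOCK + threadIdx.x;
    s[threadIdx.x + 1] = in[i];
    if (threadIdx.x == 0)         s[0]         = (i > 0)     ? in[i - 1] : 0.0f;
    if (threadIdx.x == BLOCK - 1) s[BLOCK + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();
    out[i] = 0.25f * s[threadIdx.x] + 0.5f * s[threadIdx.x + 1]
           + 0.25f * s[threadIdx.x + 2];
}

// Unshared (/U) version: shorter and easier to write; the overlapping neighbor
// reads are left to the L1 cache (and, in hardware, to MRPB) rather than to a
// hand-managed scratchpad.
__global__ void stencil_unshared(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float left  = (i > 0)     ? in[i - 1] : 0.0f;
    float right = (i + 1 < n) ? in[i + 1] : 0.0f;
    out[i] = 0.25f * left + 0.5f * in[i] + 0.25f * right;
}
```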

Conclusion. Highlighted and characterized how high thread counts often lead to thrashing- and stall-prone GPU caches. MRPB is a simple HW unit for improving GPU caching: 2.65x/1.27x speedups for PolyBench/Rodinia with a 16KB L1, L1-to-L2 traffic reduced by 15.4–26.7%, and low hardware cost (0.04% of chip area). Future work and broader implications: rethink GPU caches' primary role (latency → throughput) and (re-)design GPU components with throughput as a goal.