MRPB: Memory Request Prioritization for Massively Parallel Processors

Transcription:

MRPB: Memory Request Prioritization for Massively Parallel Processors. Wenhao Jia, Princeton University; Kelly A. Shaw, University of Richmond; Margaret Martonosi, Princeton University

Benefits of GPU Caches. Heterogeneous parallelism offers high performance per watt; a prominent example is CPU-GPU pairs such as AMD's Kaveri. The GPU memory hierarchy has evolved from SW-managed scratchpads to general-purpose caches. Why GPU caches? They reduce memory latency, especially for irregular accesses; they act as a memory bandwidth filter, reducing bandwidth demand; and they improve programmability, being easier to use than scratchpads.
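
To make "irregular accesses" concrete, here is a minimal, hypothetical CUDA kernel (not taken from the paper or its benchmarks): the gather below depends on run-time index data, so it cannot easily be coalesced or hand-tiled into the scratchpad, but a general-purpose L1 can still capture whatever reuse exists in the index stream.

```cuda
// Illustrative sketch only: an indirect gather whose access pattern is decided
// at run time by idx[], so it is hard to coalesce or stage in shared memory.
// A general-purpose L1 cache can still capture any reuse present in idx[].
__global__ void gather(const float* table, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[idx[i]];   // irregular, data-dependent read
}
```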

GPU Caches: Usage Issues. Unpredictable performance impact: a real-system characterization [ICS'12 Jia] found caches helpful for some kernels and ineffective for others, and occasionally harmful, with long cache lines causing excessive over-fetching. There is even some vendor uncertainty: for the NVIDIA Fermi L1, users are advised to experimentally determine the on/off and size configurations.

GPU Caches: Research Challenges. 1. Thrashing due to low per-thread capacity: CPU L1: ~8–16 kilobytes per thread; GPU L1: ~8–16 bytes per thread. 2. Resource conflict stalls due to bursts of concurrent requests from active threads: 100s of memory requests from 1000s of threads; GPU L1: 64-way & dozens of MSHRs.

Too Many Threads, Too Few Resources. [Diagram: a streaming multiprocessor (fetch/decode/issue, warp scheduler, register file, coalescer, ALU & SFU, L1T & L1C, L1D, shared memory) connected through the interconnect to L2 & DRAM; coalesced requests from warps 1-8 contend for the 4-way set-associative L1D's tags & data and MSHRs. Callouts: cache set full, stall! #threads >> #cache sets, thrashing! MSHRs full, stall!]

Cache Miss Categorization. Based on relationships between evicting & evicted threads: cache-insensitive (CI); intra-warp (IW), which mainly causes stalls; cross-warp (XW), which mainly causes thrashing. [Chart: MPKI per benchmark, ranging 0–140.]
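
As a rough illustration of this categorization (a hypothetical sketch in the style of a cache-simulator model, not the paper's methodology), one can tag each L1 line with the warp that fetched it and compare that owner against the warp causing an eviction:

```cuda
// Hypothetical sketch (host-side model code, not from the paper): classify
// contention by comparing the warp that fetched an evicted line with the warp
// whose request triggers the eviction.
#include <cstdint>

enum class Contention { IntraWarp, CrossWarp };

struct CacheLine {
    uint64_t tag   = 0;
    int      owner = -1;    // warp ID that originally fetched this line
    bool     valid = false;
};

Contention classify_eviction(const CacheLine& victim, int evicting_warp) {
    // Same warp evicting its own earlier data -> intra-warp (IW) contention,
    // which mostly shows up as resource/conflict stalls.
    if (victim.valid && victim.owner == evicting_warp)
        return Contention::IntraWarp;
    // Another warp's data is pushed out -> cross-warp (XW) thrashing.
    return Contention::CrossWarp;
}
```

Kernels whose MPKI stays low either way would fall into the cache-insensitive (CI) bucket before this per-eviction distinction matters.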

Related Work. Warp schedulers [Two-Level, CCWS]: indirectly manage caches and can't target IW contention; MRPB outperforms a state-of-the-art scheduler, CCWS. Software optimization [Throttling, Dymaxion, DL]: significant user effort and platform-dependent. Hardware techniques [RIPP, TAP, DRAM-Sched]: not directly applicable to GPU characteristics.

Research Challenges → Solutions. 1. Thrashing due to low per-thread capacity (CPU L1: ~8–16 kilobytes per thread; GPU L1: ~8–16 bytes per thread). Solution: reorder reference streams to group related requests. 2. Resource conflict stalls due to bursts of requests (100s of requests from 1000s of threads; GPU L1: 64-way & dozens of MSHRs). Solution: let requests bypass caches to avoid resource stalls. Our approach: request prioritization = reorder + bypass. Prioritized cache use → effective per-thread cache resources; 2.65x and 1.27x IPC improvements for PolyBench and Rodinia.

Memory Request Prioritization Buffer. [Diagram: the same streaming multiprocessor as before, with the MRPB inserted between the coalescer and the L1D, ahead of the interconnect to L2 & DRAM.] Goal: make reference streams cache-friendly. Key insight: delay some requests/warps to prioritize others → higher overall throughput.

Request Reordering: Reduce Thrashing. [Diagram: requests from earlier pipeline stages (color indicates source warps) pass through a queue selector into FIFO queues 1..N; a drain selector emits a cache-friendly request order to the L1 at the same rate and throughput.] Prioritization vs. fairness. Design options: signature, e.g. warp IDs (48 queues); drain policy, e.g. lowest-ID queue first.
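
A minimal sketch of this reordering buffer, under the assumptions the slide names (signature = warp ID, fixed number of queues, lowest-ID-queue-first drain); the class and method names are invented for illustration:

```cuda
// Illustrative host-side model of the reordering buffer (names are hypothetical):
// per-signature FIFO queues filled by a queue selector and emptied by a
// fixed-priority drain selector, so requests from the same warp reach the L1
// back to back.
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

struct MemRequest {
    int      warp_id;
    uint64_t address;
};

class ReorderBuffer {
public:
    explicit ReorderBuffer(std::size_t num_queues) : queues_(num_queues) {}

    // Queue selector: signature = warp ID (folded onto the available queues).
    void enqueue(const MemRequest& req) {
        queues_[req.warp_id % queues_.size()].push_back(req);
    }

    // Drain selector: fixed priority, lowest-ID non-empty queue first.
    std::optional<MemRequest> drain_one() {
        for (auto& q : queues_) {
            if (!q.empty()) {
                MemRequest r = q.front();
                q.pop_front();
                return r;
            }
        }
        return std::nullopt;   // nothing buffered this cycle
    }

private:
    std::vector<std::deque<MemRequest>> queues_;
};
```

The fixed-priority drain deliberately trades fairness for locality, which is the prioritization-vs.-fairness tension noted on the slide.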

Cache Bypassing: Reduce Stalls. [Diagram: requests from the reorder queues probe the L1's tags & data and MSHRs; hits are serviced locally, misses are forwarded to L2 & DRAM, and bypassed requests skip the L1 entirely.] On stalls, bypass the L1. The bypassing condition modulates aggressiveness: bypass-on-all-stalls vs. bypass-on-conflict-stalls-only. Exploit the GPU's weak consistency model for correctness.
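
A hedged sketch of that decision, with invented names for the L1 probe outcomes; it only illustrates how the two policies named on the slide could differ in aggressiveness:

```cuda
// Illustrative only (probe outcomes and policy names are assumptions, not the
// paper's definitions): decide whether a request should skip the L1 and go
// straight to L2/DRAM instead of waiting for a stalled resource.
enum class L1Probe {
    Hit,            // serviced by the L1
    Miss,           // L1 can allocate a line and an MSHR, forward to L2
    ConflictStall,  // target set has no allocatable way
    ResourceStall   // out of MSHRs or miss-queue entries
};

enum class BypassPolicy { OnAllStalls, OnConflictStallsOnly };

bool should_bypass(L1Probe probe, BypassPolicy policy) {
    switch (probe) {
    case L1Probe::Hit:
    case L1Probe::Miss:
        return false;                 // the cache can accept this request
    case L1Probe::ConflictStall:
        return true;                  // both policies bypass on set conflicts
    case L1Probe::ResourceStall:
        return policy == BypassPolicy::OnAllStalls;  // aggressive policy only
    }
    return false;
}
```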

MRPB Design Summary. Key observations: request reordering works on a longer timescale and reduces thrashing from cross-warp (XW) contention; cache bypassing handles more bursty behavior and reduces resource conflict stalls from intra-warp (IW) contention. A full design space exploration is in the paper: signature, drain policy, congestion/write flush, queue size and latency, bypassing policy.

Experimental Methodology. Simulated GPU: NVIDIA Tesla C2050. How does MRPB handle different cache sizes? Baseline-S: 4-way 16KB L1 vs. Baseline-L: 6-way 48KB L1. Benchmark suites: PolyBench & Rodinia. Different usage scenarios: PolyBench is cross-platform vs. Rodinia is GPU-centric. Different optimization levels: PolyBench relies on caches vs. Rodinia relies on scratchpads. Different purposes in our study: PolyBench for exploration vs. Rodinia for evaluation.

Reordering Benefits XW Applications. [Chart: normalized IPC (higher is better), 0–3, for cache-insensitive (CI), intra-warp (IW), and cross-warp (XW) applications.]

Bypassing Benefits IW Applications. [Chart: normalized IPC (higher is better), 0–14, for cache-insensitive (CI), intra-warp (IW), and cross-warp (XW) applications.]

Final Design: Reordering + Bypassing. [Chart: normalized IPC per benchmark, higher is better.] Speedups: PolyBench-S 2.65x, PolyBench-L 2.25x, Rodinia-S 1.27x, Rodinia-L 1.15x. MRPB doesn't harm any app's performance!

Improving Programmability with MRPB. Use caching + MRPB instead of shared memory. For 6 unshared (/U) Rodinia apps: 37% slower → 9% slower with MRPB. For SRAD-S, caching + MRPB outperforms the shared-memory version. Best of both worlds: better programmability & performance. [Chart: normalized IPC (higher is better) for SRAD-S and SRAD-S/U with MRPB off vs. MRPB on.]
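
To make the shared vs. unshared (/U) distinction concrete, here is a hypothetical, heavily simplified CUDA pair (not the actual Rodinia SRAD kernels): a 3-point stencil written once with an explicit scratchpad and once in the cache-reliant style that MRPB aims to make competitive. For simplicity the sketch assumes n is a multiple of the block size.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

// Scratchpad version: explicit staging, halo handling, and synchronization.
// (Assumes n is a multiple of BLOCK so no thread exits before __syncthreads.)
__global__ void stencil_shared(const float* in, float* out, int n) {
    __shared__ float s[BLOCK + 2];
    int i = blockIdx.x * BLOCK + threadIdx.x;
    s[threadIdx.x + 1] = in[i];
    if (threadIdx.x == 0)         s[0]         = (i > 0)     ? in[i - 1] : 0.0f;
    if (threadIdx.x == BLOCK - 1) s[BLOCK + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();
    out[i] = 0.25f * s[threadIdx.x] + 0.5f * s[threadIdx.x + 1]
           + 0.25f * s[threadIdx.x + 2];
}

// Unshared (/U) version: shorter and easier to write; the overlapping neighbor
// reads are left to the L1 cache (and, in hardware, to MRPB) rather than to a
// hand-managed scratchpad.
__global__ void stencil_unshared(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float left  = (i > 0)     ? in[i - 1] : 0.0f;
    float right = (i + 1 < n) ? in[i + 1] : 0.0f;
    out[i] = 0.25f * left + 0.5f * in[i] + 0.25f * right;
}
```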

Conclusion. Highlighted and characterized how high thread counts often lead to thrashing- and stall-prone GPU caches. MRPB is a simple HW unit for improving GPU caching: 2.65x/1.27x speedups for PolyBench/Rodinia with a 16KB L1, L1-to-L2 traffic reduced by 15.4–26.7%, and low hardware cost (0.04% of chip area). Future work and broader implications: rethink GPU caches' primary role (latency → throughput) and (re-)design GPU components with throughput as a goal.