MRPB: Memory Request Prioritization for Massively Parallel Processors


1 MRPB: Memory Request Prioritization for Massively Parallel Processors. Wenhao Jia, Princeton University; Kelly A. Shaw, University of Richmond; Margaret Martonosi, Princeton University.


2 Benefits of GPU Caches. Heterogeneous parallelism: high performance per watt. Prominent example: CPU-GPU pairs (figure: AMD Kaveri, GPU and CPU). GPU memory hierarchy evolution: SW-managed scratchpads → general-purpose caches. Why GPU caches? Reduce memory latency, esp. for irregular accesses. Memory bandwidth filter: reduce bandwidth demand. Programmability: easier to use than scratchpads.

3 GPU Caches: Usage Issues. Unpredictable performance impact. Real-system characterization [ICS '12, Jia]: caches are helpful for some kernels, ineffective for others. Occasionally harmful: long cache lines cause excessive over-fetching. Even some vendor uncertainty: for the NVIDIA Fermi L1, users should experimentally determine on/off and size configurations.

4 GPU Caches: Research Challenges. 1. Thrashing due to low per-thread capacity. CPU L1: ~8-16 kilobytes per thread. GPU L1: ~8-16 bytes per thread. 2. Resource conflict stalls due to bursts of concurrent requests from active threads. 100s of memory requests from 1000s of threads. GPU L1: 64-way & dozens of MSHRs.
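The per-thread capacity gap on this slide can be reproduced with simple arithmetic. The sketch below is illustrative only: the 16 KB cache size matches the Baseline-S L1 used later in the talk, but the resident-thread counts (1024-2048 per GPU L1, 1-2 per CPU L1) are assumptions chosen to show where a range of "~8-16" could come from, not figures taken from the slides.

```python
KB = 1024

def bytes_per_thread(cache_bytes: int, resident_threads: int) -> float:
    """Average L1 capacity available to each thread sharing the cache."""
    return cache_bytes / resident_threads

# Assumption: a 16 KB GPU L1 shared by 1024 to 2048 resident threads.
gpu_low = bytes_per_thread(16 * KB, 2048)
gpu_high = bytes_per_thread(16 * KB, 1024)

# Assumption: a 16 KB CPU L1 shared by 1 or 2 hardware threads.
cpu_low = bytes_per_thread(16 * KB, 2)
cpu_high = bytes_per_thread(16 * KB, 1)

print(f"GPU L1: {gpu_low:.0f}-{gpu_high:.0f} bytes per thread")
print(f"CPU L1: {cpu_low / KB:.0f}-{cpu_high / KB:.0f} KB per thread")
```

Under these assumed thread counts the two cases differ by a factor of about 1000, which is the scale of the gap the slide points at.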


5 Too Many Threads, Too Few Resources. [Figure: a streaming multiprocessor (fetch, decode, issue; warp scheduler; register file; coalescer; ALU & SFU; L1T & L1C; L1D; shared memory) connected through the interconnect to L2 & DRAM. Coalesced requests from warps 1 through 8 contend for a 4-way set-associative L1D (tags & data, MSHRs). Annotations: "#threads >> #cache sets, thrashing!", "Cache set full, stall!", "MSHRs full, stall!"]

6 Cache Miss Categorization. Cache-insensitive (CI). Intra-warp (IW): mainly stalls. Cross-warp (XW): mainly thrashing. Categories are based on relationships between evicting & evicted threads. [Figure: MPKI by category.]
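One way to picture the evicting/evicted relationship is to tag each cached line with the warp that brought it in and compare that tag with the warp that forces the eviction. The following is a minimal sketch under assumptions that are not stated on the slide: a small LRU set-associative cache, line addresses mapped to sets by modulo, and the function name `classify_evictions` are all hypothetical choices made for illustration.

```python
from collections import Counter, OrderedDict

def classify_evictions(requests, num_sets=2, ways=2):
    """Tally evictions as intra-warp (IW) or cross-warp (XW).

    requests: iterable of (warp_id, line_address) pairs in arrival order.
    """
    # One OrderedDict per set: line_address -> owner warp, oldest entry first.
    sets = [OrderedDict() for _ in range(num_sets)]
    tally = Counter()
    for warp, line in requests:
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # hit: refresh LRU position
            continue
        if len(cache_set) == ways:
            # Miss in a full set: evict the least recently used line.
            _, owner = cache_set.popitem(last=False)
            tally["IW" if owner == warp else "XW"] += 1
        cache_set[line] = warp
    return tally

# Lines 0, 2, 4, 6 all map to set 0 of a 2-set, 2-way cache.
example = classify_evictions([(0, 0), (0, 2), (0, 4), (1, 6)])
print(dict(example))
```

In the example, warp 0 evicts its own line (IW) and warp 1 then evicts a line owned by warp 0 (XW).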

7 Related Work. Warp schedulers [Two-Level, CCWS]: indirectly manage caches, can't target IW contention. MRPB outperforms a state-of-the-art scheduler, CCWS. Software optimization [Throttling, Dymaxion, DL]: significant user effort, platform-dependent. Hardware techniques [RIPP, Tap, DRAM-Sched]: not directly applicable to GPU characteristics.

8 Research Challenges → Solutions. 1. Thrashing due to low per-thread capacity. CPU L1: ~8-16 kilobytes per thread. GPU L1: ~8-16 bytes per thread. Solution: reorder reference streams to group related requests. 2. Resource conflict stalls due to bursts of requests. 100s of requests from 1000s of threads. GPU L1: 64-way & dozens of MSHRs. Solution: let requests bypass caches to avoid resource stalls. Our approach: request prioritization = reorder + bypass. Prioritized cache use → effective per-thread cache resources. 2.65× and 1.27× IPC improvements for PolyBench and Rodinia.

9 Memory Request Prioritization Buffer. Key insight: delay some requests/warps to prioritize others → higher overall throughput. Goal: make reference streams cache-friendly. [Figure: the streaming multiprocessor from slide 5 with the MRPB placed between the coalescer and the L1D, so that memory requests pass through it before reaching the tags & data and MSHRs. The earlier annotations remain: "#threads >> #cache sets, thrashing!", "Cache set full, stall!", "MSHRs full, stall!"]


10 Request Reordering: Reduce Thrashing. Requests from earlier pipeline stages pass through a queue selector into FIFO queues 1 to N; a drain selector then sends them to the L1 in a cache-friendly order. Same rate, same throughput. Prioritization vs. fairness. Design options: Signature, e.g. warp IDs (48 queues). Drain policy, e.g. lowest-ID queue first. [Figure: interleaved requests from several warps (color indicates source warp) regrouped by warp after passing through the queues.]
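The reordering step can be sketched in a few lines, using the two example design options named on the slide: warp ID as the signature (48 queues) and lowest-ID-queue-first as the drain policy. This is a simplified software model, not the hardware design: the class and function names are hypothetical, and all requests are enqueued before any are drained, whereas real hardware would enqueue and drain concurrently.

```python
from collections import deque

class ReorderQueues:
    """FIFO queues selected by warp ID, drained lowest-ID queue first."""

    def __init__(self, num_queues=48):
        self.queues = [deque() for _ in range(num_queues)]

    def enqueue(self, warp_id, address):
        # Queue selector: the signature is the warp ID (assumed modulo mapping).
        self.queues[warp_id % len(self.queues)].append((warp_id, address))

    def drain(self):
        # Drain selector: pop from the lowest-numbered non-empty queue.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None

def reorder(requests, num_queues=48):
    """Return the requests in the order the drain selector would emit them."""
    buffer = ReorderQueues(num_queues)
    for warp_id, address in requests:
        buffer.enqueue(warp_id, address)
    ordered = []
    request = buffer.drain()
    while request is not None:
        ordered.append(request)
        request = buffer.drain()
    return ordered

interleaved = [(2, "U"), (1, "I"), (2, "V"), (0, "P"), (1, "D")]
grouped = reorder(interleaved)
print(grouped)
```

Requests from the same warp come out adjacent to one another and in their original relative order. A strict lowest-ID policy can starve high-ID queues under sustained load, which is the prioritization vs. fairness trade-off the slide mentions.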

11 Cache Bypassing: Reduce Stalls. Requests leaving the reorder queues probe the L1 tags & data and MSHRs: hits are served by the L1, misses go to L2 & DRAM. On stalls, bypass L1. The bypassing condition modulates aggressiveness: bypass-on-all-stalls vs. bypass-on-conflict-stalls-only. Exploit GPUs' weak consistency model for correctness. [Figure: bypassed requests routed around the L1 directly to L2 & DRAM.]
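The two bypassing conditions can be expressed as a small routing decision. The sketch below is an assumption-laden illustration: it models a "conflict stall" as a miss that finds no replaceable way in its cache set and an "MSHR stall" as a miss that finds no free MSHR, and the function name, parameters, and return labels are hypothetical rather than taken from the slides.

```python
def route_request(hit, free_mshrs, free_ways, policy="all"):
    """Decide where a memory request goes.

    policy: "all" for bypass-on-all-stalls,
            "conflict" for bypass-on-conflict-stalls-only.
    Returns "L1_HIT", "L1_MISS", "BYPASS", or "STALL".
    """
    if hit:
        return "L1_HIT"
    conflict_stall = free_ways == 0   # assumed: no replaceable line in the set
    mshr_stall = free_mshrs == 0      # assumed: no free miss-status register
    if not (conflict_stall or mshr_stall):
        return "L1_MISS"              # normal miss handled by the L1
    if policy == "all":
        return "BYPASS"               # any stall sends the request past the L1
    if policy == "conflict" and conflict_stall:
        return "BYPASS"
    return "STALL"

# An MSHR-full miss is bypassed only under the more aggressive policy.
print(route_request(False, free_mshrs=0, free_ways=2, policy="all"))
print(route_request(False, free_mshrs=0, free_ways=2, policy="conflict"))
```

The choice between the two policies trades stall avoidance against the L1 hits lost when requests skip the cache.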

12 MRPB Design Summary. Key observations: Request reordering: longer timescale, reduces thrashing from cross-warp (XW) contention. Cache bypassing: more bursty, reduces resource conflict stalls from intra-warp (IW) contention. Full design space exploration in the paper: signature, drain policy, congestion/write flush, queue size and latency, bypassing policy.

13 Experimental Methodology. Simulated GPU: NVIDIA Tesla C2050. How does MRPB handle different cache sizes? Baseline-S: 4-way 16KB L1 vs. Baseline-L: 6-way 48KB L1. Benchmark suites: PolyBench & Rodinia. Different usage scenarios: PolyBench is cross-platform, Rodinia is GPU-centric. Different optimization levels: PolyBench relies on caches, Rodinia relies on scratchpads. Different purposes in our study: PolyBench for exploration, Rodinia for evaluation.

14 Reordering Benefits XW Applications. [Figure: normalized IPC (higher is better) for cache-insensitive (CI), intra-warp (IW), and cross-warp (XW) applications.]

15 Bypassing Benefits IW Applications. [Figure: normalized IPC (higher is better) for cache-insensitive (CI), intra-warp (IW), and cross-warp (XW) applications.]

16 Final Design: Reordering + Bypassing. [Figure: normalized IPC per application, axis from 0.5 to 20, higher is better.] Averages: PolyBench-S 2.65, PolyBench-L 2.25, Rodinia-S 1.27, Rodinia-L 1.15. MRPB doesn't harm any app's performance!


17 Improving Programmability with MRPB. Use caching + MRPB instead of shared memory. 6 unshared (/U) Rodinia apps: 37% slower → 9% with MRPB. For SRAD-S, caching + MRPB outperforms the shared version. Best of both worlds: better programmability & performance. [Figure: normalized IPC (higher is better) for SRAD-S and SRAD-S/U, MRPB off vs. MRPB on.]

18 Conclusion. Highlighted and characterized how high thread counts often lead to thrashing- and stall-prone GPU caches. MRPB: a simple HW unit for improving GPU caching. 2.65×/1.27× for PolyBench/Rodinia for 16KB L1. L1-to-L2 traffic reduced by %. Low hardware cost: 0.04% of chip area. Future work and broader implications: Rethink GPU caches' primary role: latency → throughput. (Re-)design GPU components with throughput as a goal.
