CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads
1 2015 International Symposium on Computer Architecture (ISCA-42). CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads. Shin-Ying Lee, Akhil Arunkumar, Carole-Jean Wu, Arizona State University. 6/17/2015
2 GPU Execution Model. Nowadays GPUs are used to process parallel workloads: hundreds or thousands of threads run concurrently. [Figure: a thread block is partitioned into warps (also called wavefronts), shown as warp 0 through warp 3.]
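The thread-to-warp mapping on the slide can be sketched as follows; this is a minimal illustration assuming the usual NVIDIA warp size of 32 threads, not code from the talk.

```python
# Sketch: how a thread block is partitioned into warps on NVIDIA-style GPUs.
# Assumes the conventional warp size of 32 threads (an assumption, not stated
# on the slide).
WARP_SIZE = 32

def warp_id(thread_idx: int) -> int:
    """Warp that a thread (linear index within its block) belongs to."""
    return thread_idx // WARP_SIZE

# A 128-thread block yields warps 0..3, matching the slide's warp 0..warp 3.
warps = sorted({warp_id(t) for t in range(128)})
print(warps)  # [0, 1, 2, 3]
```

All 32 threads of a warp issue the same instruction in lockstep, which is why the following slides treat the warp, not the thread, as the unit of scheduling.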
3 Modern GPU Architecture. Modern GPUs consist of multiple streaming multiprocessors (SMs), which are similar to SIMD processors.
4 The Warp Criticality Problem. There is significant warp execution time disparity among warps in the same thread block. [Figure: execution time relative to the fastest warp, broken into computation and idle time, for warps w0-w15 sorted by criticality, across bfs, b+tree, heartwall, kmeans, needle, srad_1, strcltr_small, and the average.] *Lee and Wu, CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads, PACT 2014
5 Research Questions. What is the source of warp criticality? How can we effectively accelerate critical warp execution?
6 Outline. Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic, gCAWS: greedy Criticality-Aware Warp Scheduler, CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Conclusion.
7 Source of Warp Criticality. Workload imbalance; diverging branch behavior; memory contention and memory access latency; execution order of warp scheduling.
8 Workload Imbalance & Diverging Branch. Workload imbalance or diverging branch behavior gives warps different dynamic instruction counts. [Figure: execution time and instruction count, each normalized to the fastest warp, for warps w0-w15 sorted by criticality; execution time closely tracks instruction count.]
9 Memory Contention. Because warps experience different latencies to access memory, memory contention can induce warp criticality. [Figure: execution time normalized to the fastest warp, split into total memory system latency and other latency, for warps w0-w15 sorted by criticality.]
10 Warp Scheduling Order. The warp scheduler may introduce additional stall cycles for a ready warp, resulting in warp criticality. [Figure: execution time normalized to the fastest warp, split into scheduler latency and other latency, for warps w0-w15 sorted by criticality.]
11 Outline. Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic, gCAWS: greedy Criticality-Aware Warp Scheduler, CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Discussion and Conclusion.
12 CAWA: Criticality-Aware Warp Acceleration. A coordinated warp scheduling and cache prioritization design. Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime. greedy Criticality-Aware Warp Scheduler (gCAWS): prioritizing and accelerating critical warp execution. Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical warp reuse.
13 CAWA: Criticality-Aware Warp Acceleration, focusing first on the Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime.
14 CAWA CPL: Criticality Prediction Logic. CPL estimates the number of additional cycles a warp may still experience: Criticality = n_inst * CPI_avg + n_stall, where n_inst is the warp's estimated remaining instruction count (decremented whenever an instruction is executed), CPI_avg is the warp's average cycles per instruction, and n_stall is its accumulated stall cycles. The first term captures instruction count disparity and diverging branches; the second captures memory latency and scheduling latency.
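The CPL estimate above can be written as a one-line function; this is a sketch of the formula only, and the concrete warp numbers below are invented for illustration.

```python
# Hedged sketch of the CPL criticality estimate from this slide:
#   Criticality = n_inst * CPI_avg + n_stall
# n_inst  : estimated remaining dynamic instruction count of the warp
# cpi_avg : the warp's observed average cycles per instruction
# n_stall : stall cycles the warp has accumulated

def criticality(n_inst: int, cpi_avg: float, n_stall: int) -> float:
    return n_inst * cpi_avg + n_stall

# A warp with more remaining work and more stalls scores as more critical
# (example values are hypothetical, not from the paper).
slow = criticality(n_inst=400, cpi_avg=2.5, n_stall=300)  # 1300.0
fast = criticality(n_inst=100, cpi_avg=2.5, n_stall=50)   # 300.0
assert slow > fast
```

Because n_inst shrinks as the warp makes progress while n_stall grows as it waits, the score naturally decays for warps that are being accelerated and rises for warps that fall behind.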
15 CAWA: Criticality-Aware Warp Acceleration, next the greedy Criticality-Aware Warp Scheduler (gCAWS): prioritizing and accelerating critical warp execution, to reduce delay from the scheduler.
16 CAWA gCAWS: greedy Criticality-Aware Warp Scheduler. Prioritizes warps based on the criticality given by CPL and executes warps in a greedy* manner: select the most critical ready warp, and keep executing the selected warp until it stalls. Example with a warp pool of warps 0-3 ranked by criticality: a traditional scheduler (e.g. RR, 2L, GTO) issues W0 → W1 → W2 → W3, while gCAWS issues W1 → W3 → W0 → W2. *Rogers et al., Cache-Conscious Wavefront Scheduling, MICRO 2012
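The selection rule on this slide can be sketched in a few lines. The warp-pool data model below (dicts with `id`, `criticality`, `ready`) is an assumption for illustration, not the hardware structure.

```python
# Minimal sketch of gCAWS selection: among ready warps, greedily pick the
# one with the highest CPL criticality (and keep issuing from it until it
# stalls, which this sketch models by retiring the warp once selected).

def gcaws_select(warps):
    """warps: list of dicts with 'id', 'criticality', 'ready' (assumed shape)."""
    ready = [w for w in warps if w["ready"]]
    return max(ready, key=lambda w: w["criticality"])["id"] if ready else None

pool = [
    {"id": 0, "criticality": 5, "ready": True},
    {"id": 1, "criticality": 9, "ready": True},
    {"id": 2, "criticality": 1, "ready": True},
    {"id": 3, "criticality": 7, "ready": True},
]
# Repeatedly selecting and retiring reproduces the slide's order W1 → W3 → W0 → W2.
order = []
while pool:
    wid = gcaws_select(pool)
    order.append(wid)
    pool = [w for w in pool if w["id"] != wid]
print(order)  # [1, 3, 0, 2]
```

The greedy "run until stall" policy is borrowed from GTO-style scheduling; gCAWS changes only the ranking key, replacing warp age with CPL criticality.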
17 CAWA: Criticality-Aware Warp Acceleration, finally the Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical warp reuse, to reduce delay from the memory.
18 Critical Cache Block Reuse Behavior. [Figure: fraction of cache lines that are reused by critical warps, reused by non-critical warps, or never reused, for each benchmark under both a 16KB and a 256KB L1D$.] Goal: design a PC/MEM-based cache reuse predictor that captures cache lines that will be used by critical warps and cache lines that are likely to be reused.
19 CAWA CACP: Criticality-Aware Cache Prioritization. *Wu et al., SHiP: Signature-based Hit Predictor for High Performance Caching, MICRO 2011
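A PC-indexed reuse predictor in the spirit of CACP's PC/MEM-based design (and the SHiP predictor it builds on) can be sketched as below. The table size, counter width, hash, and threshold are all assumptions for illustration; the paper's exact structure differs.

```python
# Loose sketch of a signature-based reuse predictor: a table of saturating
# counters indexed by the PC of the load that filled a line. Trained on
# eviction, consulted on fill. Parameters here are invented, not the paper's.
TABLE_SIZE = 256
MAX_CTR = 3  # 2-bit saturating counters

table = [1] * TABLE_SIZE  # start weakly biased toward "no reuse"

def idx(pc: int) -> int:
    return pc % TABLE_SIZE  # hypothetical hash: low bits of the PC

def train(pc: int, reused: bool) -> None:
    """On eviction: bump the counter if the line was reused, decay otherwise."""
    i = idx(pc)
    table[i] = min(table[i] + 1, MAX_CTR) if reused else max(table[i] - 1, 0)

def predict_reuse(pc: int) -> bool:
    """On fill: predict reuse once the counter reaches its upper half."""
    return table[idx(pc)] >= 2

pc = 0x400123
train(pc, reused=True)
train(pc, reused=True)
print(predict_reuse(pc))  # True
```

CACP extends this idea with criticality: lines predicted to be reused by critical warps are placed in a protected partition, so the critical warp's working set survives interference from non-critical warps.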
20 CAWA CACP: Criticality-Aware Cache Prioritization. [Figure: handling of a cache request in CACP.]
21 Outline. Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic, gCAWS: greedy Criticality-Aware Warp Scheduler, CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Discussion and Conclusion.
22 Experimental Environment. GPU simulation infrastructure: GPGPU-Sim v3.2.0* with the NVIDIA nvcc toolchain v3.2 and gcc v4.4.7, modeling the NVIDIA GTX480 architecture: 15 SMs (maximum 48 warps / 8 blocks per SM), 16KB (8-set/16-way) L1D$ per SM, 768KB (64-set/16-way) shared L2$. Benchmarks: 12 applications from the Rodinia benchmark suite**, including 7 applications with a high degree of criticality. *Bakhoda et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS 2009. **Che et al., Rodinia: A Benchmark Suite for Heterogeneous Computing, ISPASS 2009
23 Overall Performance Improvement. [Figure: IPC and MPKI normalized to the baseline for the 2L, GTO, and CAWA schedulers; individual benchmarks reach 3.13x and 2.54x speedups.] CAWA aims to retain data for critical warps. Baselines: Narasiman et al., Improving GPU Performance via Large Warps and Two-Level Warp Scheduling, MICRO 2011; Rogers, O'Connor, and Aamodt, Cache-Conscious Wavefront Scheduling, MICRO 2012.
24 gCAWS Performance Improvement. [Figure: IPC of gCAWS and full CAWA normalized to the baseline RR scheduler; speedups reach up to 2.63x.] Takeaways: workloads with a large memory footprint remain challenging, and static cache partitioning needs improvement.
25 Performance Improvement with CAWA CACP. [Figure: MPKI of RR+CACP, 2L+CACP, GTO+CACP, and CAWA normalized to the baseline RR.] CACP can effectively reduce cache interference with any scheduler; schedulers need modification for robust and higher performance improvement with CACP.
26 Outline. Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic, gCAWS: greedy Criticality-Aware Warp Scheduler, CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Conclusion.
27 Research Questions. What is the source of warp criticality? Workload imbalance and/or diverging branches, different memory access latencies, and additional scheduling latency. How can we effectively accelerate critical warp execution? CAWA gCAWS allocates more time resources to critical warps; CAWA CACP retains (1) cache lines that will be reused by critical warps in the critical cache partition and (2) cache lines that will be reused, with higher priority.
28 Conclusion. This is the first work that proposes an effective solution for mitigating the degree of warp criticality by accelerating the execution of critical warps on GPGPUs. CAWA consists of: an accurate dynamic warp criticality prediction scheme to guide critical warp acceleration, and a coordinated management scheme to improve warp scheduling and cache prioritization for critical warp acceleration. Our solution improves GPGPU performance by an average of 23% (9% across all applications).
29 Thank you! CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads. Shin-Ying Lee, Akhil Arunkumar, Carole-Jean Wu, Arizona State University.
More informationADDRESS TRANSLATION FOR THROUGHPUT-ORIENTED ACCELERATORS
... ADDRESS TRANSLATION FOR THROUGHPUT-ORIENTED ACCELERATORS... WITH PROCESSOR VENDORS EMBRACING HARDWARE HETEROGENEITY, PROVIDING LOW- OVERHEAD HARDWARE AND SOFTWARE ABSTRACTIONS TO SUPPORT EASY-TO-USE
More informationOrchestrated Scheduling and Prefetching for GPGPUs
Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog Onur Kayiran Asit K. Mishra Mahmut T. Kandemir Onur Mutlu Ravishankar Iyer Chita R. Das The Pennsylvania State University Carnegie Mellon University
More informationDaCache: Memory Divergence-Aware GPU Cache Management
DaCache: Memory Divergence-Aware GPU Cache Management Bin Wang, Weikuan Yu, Xian-He Sun, Xinning Wang Department of Computer Science Department of Computer Science Auburn University Illinois Institute
More informationBenchmarking the Memory Hierarchy of Modern GPUs
1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong
More informationA Framework for Modeling GPUs Power Consumption
A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationResearch Article CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU
e Scientific World Journal Volume 215, Article ID 848416, 1 pages http://dx.doi.org/1.1155/215/848416 Research Article CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU Jianliang
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationThread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Motivation Memory is a shared resource Core Core Core Core
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationDynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationExtended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays
Extended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays Technical Report 13rp003-GRIS A. Schulz 1, D. Wodniok 2, S. Widmer 2 and
More information