Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs
1 The 34th IEEE International Conference on Computer Design. Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs. Shin-Ying Lee and Carole-Jean Wu, Arizona State University. October 3, 2016
2 Graphics Processing Unit (GPU). Runs a massive number of parallel threads to maximize throughput. Accelerates parallel computation: image processing, weather forecasting, scientific and engineering computation, data mining, machine learning. [Figure: a GPU with Cores 0-3]
3 Cache Contention and Thrashing. A massive number of concurrent threads (1000+) contend for the limited cache storage (16-64kB). A large fraction of memory requests result in cache misses, and a large number of cache lines are evicted too early.
4 GPU Cache Sensitivity. Caches at their current capacity do not effectively improve GPU performance, yet GPGPU applications can obtain high speedups with larger L1 data caches. [Chart: speedup over the 16kB L1D$ baseline with the L1D$ off and with a 64kB L1D$, across BO, PTH, HOT, BP, FWT, HTW, SR1, NW, SR2, SC, BT, DCT, WC, MIS, CLR, PF, PVR, BC, CSR, FLD, SS, BFS, STR, ELL, PRK, MM, KMN, and Avg; labeled peaks of 3.94x and 6.82x]
5 Research Question. How can we effectively mitigate the cache contention problem and improve GPU performance without increasing the cache capacity?
6 Outline. Introduction and Motivation; GPU Cache Access Behavior Characterization; Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing; Methodology and Evaluation Results; Conclusion
7 Per-Instruction Cache Line Reuse Distance. Bypassing memory requests generated by instructions that exhibit long reuse distances increases the effective cache capacity. [Chart: reuse-distance distribution (<16 vs. >16) for memory instructions PC_0 through PC_8 of the Rodinia* BFS application] * Che et al., Rodinia: A Benchmark Suite for Heterogeneous Computing, IISWC '09
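A trace-based profiler along these lines could attribute reuse distances to memory-instruction PCs as follows (a minimal sketch; the trace format and the helper name are our own, not from the slides):

```python
from collections import OrderedDict

def profile_reuse_distances(trace):
    """trace: list of (pc, line_addr) pairs. Returns {pc: [distances]},
    where each distance is the number of distinct lines touched between
    two consecutive accesses to the same line (inf for a cold access)."""
    stack = OrderedDict()          # LRU stack of cache-line addresses
    per_pc = {}
    for pc, line in trace:
        if line in stack:
            # Reuse distance = depth from the MRU end of the stack.
            dist = list(reversed(stack)).index(line)
            per_pc.setdefault(pc, []).append(dist)
            del stack[line]
        else:
            per_pc.setdefault(pc, []).append(float('inf'))
        stack[line] = True         # (re)insert at the MRU position
    return per_pc

per = profile_reuse_distances([(0, 'A'), (1, 'B'), (0, 'A')])
print(per[0])   # [inf, 1]: one distinct line ('B') touched between the reuses
```

An instruction whose distance histogram is dominated by large values (e.g. >16) is the kind of bypass candidate the slide identifies.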
8 GPU Cache Bypassing. [Charts: reuse-distance distribution (<16 vs. >16) and speedup over no bypassing, for per-instruction bypassing and complete bypassing of memory instructions PC_3, PC_4, and PC_6 of BFS]
9 Bypassing Aggressiveness. The optimal bypassing aggressiveness varies across applications and memory instructions. [Chart: speedup over no bypassing with AGG = 1, 3, 5, 7 and with complete bypassing, for PC_3/PC_4/PC_6 of BFS, PC_0 of KMN, and PC_1/PC_2/PC_5 of ELL; the labeled optima (AGG_opt = 3 or AGG_opt = 5) differ per instruction] Table: AGG 7 -> 99% bypassing probability; AGG 5 -> 96%; AGG 3 -> 75%; AGG 1 -> 50%.
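One plausible reading of how AGG maps to a bypassing probability is a 1-in-2^AGG insertion counter (this helper is our own reconstruction, not the paper's code; it yields 50% at AGG=1 and ~99% at AGG=7, bracketing the slide's table, though it need not reproduce every entry exactly):

```python
def should_bypass(state, agg):
    """Counter-based bypass: insert one request out of every 2**agg
    and bypass the rest. `state` holds the per-PC BYP counter."""
    state['byp'] += 1
    if state['byp'] >= 2 ** agg:
        state['byp'] = 0
        return False    # insert this request into the cache
    return True         # bypass the cache

state = {'byp': 0}
decisions = [should_bypass(state, 3) for _ in range(16)]
print(decisions.count(True) / len(decisions))   # 0.875: 7 of every 8 bypassed
```

Raising AGG by one roughly halves the insertion rate, which is what makes a small integer a convenient control knob for the feedback loop.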
10 Outline. Introduction and Motivation; GPU Cache Access Behavior Characterization; Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing; Methodology and Evaluation Results; Conclusion
11 Ctrl-C: Instruction-Aware Control Loop Based Cache Bypassing. Adjusting the bypassing aggressiveness per instruction to achieve the optimal cache hit rate by using feedback control loops. Capturing the per-instruction cache line reuse behavior by the unique PC signature of memory instructions [1][2][3]. Bypassing memory requests stochastically to alleviate the degree of cache thrashing [4]. [1] Wu et al., SHiP: Signature-based Hit Predictor for High Performance Caching, MICRO '12. [2] Tian et al., Adaptive GPU Cache Bypassing, GPGPU '15. [3] Lee et al., CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads, ISCA '15. [4] Qureshi et al., Adaptive Insertion Policies for High Performance Caching, ISCA '07
12 Ctrl-C Design Overview. Per-cache-line metadata: Tag, Valid, Data, Reuse, InsertionPC. Instruction-Reuse (ireuse) Table: an array indexed by the lower 7 bits of the instruction PC. Each ireuse entry holds: AGG (bypassing AGGressiveness), BYP (number of requests BYPassed), INSERT (number of cache lines INSERTed), ZERO (number of ZERO-reuse lines). (TH_L, TH_H): target thresholds on the fraction of zero-reuse lines.
13 Ctrl-C Feedback Control Loop. At eviction and at miss, the feedback loop reads the per-PC zero-reuse ratio k and adjusts the aggressiveness: if (k > TH_H) AGG++; if (k < TH_L) AGG--. For each memory request, the bypass decision is: if (BYP == 2^AGG) { insert; BYP = 0; } else { bypass; BYP++; }
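The aggressiveness update above can be sketched as follows (a sketch under our reading of the slide; the `entry` layout, the helper name, and the 0-7 clamping bounds are assumptions, with k taken as the fraction of inserted lines that saw zero reuse):

```python
TH_L, TH_H = 0.1, 0.4      # target thresholds on the zero-reuse fraction
AGG_MIN, AGG_MAX = 0, 7    # assumed clamp for the 3-bit AGG counter

def update_aggressiveness(entry):
    """entry: one ireuse-table entry with 'insert', 'zero', 'agg' fields.
    Called at eviction/miss time to steer AGG toward the target band."""
    if entry['insert'] == 0:
        return                       # nothing inserted yet for this PC
    k = entry['zero'] / entry['insert']
    if k > TH_H and entry['agg'] < AGG_MAX:
        entry['agg'] += 1            # too many dead lines: bypass more
    elif k < TH_L and entry['agg'] > AGG_MIN:
        entry['agg'] -= 1            # lines are being reused: bypass less

entry = {'insert': 10, 'zero': 5, 'agg': 3}
update_aggressiveness(entry)
print(entry['agg'])   # 4: k = 0.5 > TH_H, so the loop bypasses more
```

This is a classic bang-bang controller: AGG drifts up while zero-reuse insertions exceed TH_H and drifts down once reuse recovers, so each PC converges toward its own operating point.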
14 Outline. Introduction and Motivation; GPU Cache Access Behavior Characterization; Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing; Methodology and Evaluation Results; Conclusion
15 Methodology. GPGPU-Sim [1] is used to simulate an NVIDIA Fermi-based GTX 480 GPU: 15 streaming multiprocessors; a 16kB L1 data cache (4-way, 32-set) per SM with the Fermi-hashing [2] index; a 768kB L2 cache (8-way, 64-set, 12-partition). (TH_L, TH_H) = (0.1, 0.4). 27 benchmarks (including 13 highly cache-sensitive workloads) represent a wide range of GPU behavior. [1] Bakhoda et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS '09. [2] Nugteren et al., A Detailed GPU Cache Model Based on Reuse Distance Theory, HPCA '14
16 Ctrl-C Performance Improvement. [Chart: speedup over the 16kB L1D$ baseline for Ctrl-C and for a 32kB L1D$; labeled peaks of 2.39x and 2.38x]
17 Ctrl-C Performance Improvement. [Chart: speedup over the 16kB L1D$ baseline for Adaptive Bypass* and Ctrl-C; labeled points of 0.42x and 2.39x] * Tian et al., Adaptive GPU Cache Bypassing, GPGPU '15
18 Research Question. How can we effectively mitigate the cache contention problem and improve GPU performance without increasing the cache capacity? Proposed an instruction-aware algorithm to predict cache access behavior. Employed feedback control loops to adaptively bypass memory requests based on an instruction's reuse pattern.
19 Conclusion. This is the first work that designs a feedback control loop to determine the optimal bypassing setting at the instruction granularity. This paper offers detailed characterization results showing that the optimal cache bypassing aggressiveness varies across applications and memory instructions. We propose the instruction-aware Ctrl-C cache bypassing scheme to dynamically predict the best bypassing aggressiveness and improve performance by an average 1.42x speedup for cache-sensitive applications.
20 Thank you! Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs. Shin-Ying Lee and Carole-Jean Wu, Arizona State University. October 3, 2016
21 Backup
22 Cache Contention and Thrashing. The GPU cache capacity is too small to fit the active dataset of all concurrent threads. Working dataset: Thread_0: D[0]..D[N-1]; Thread_1: D[N]..D[2N-1]; Thread_2: D[2N]..D[3N-1]. Access sequence: 1. Thread_0: D[0]; 2. Thread_1: D[N]; 3. Thread_2: D[2N]; 4. Thread_0: D[1]; 5. Thread_1: D[N+1]; 6. Thread_2: D[2N+1]; 7. Thread_0: D[3]... [Figure: the contents of a 2-way cache over time, showing each line evicted by another thread before it can be reused]
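The interleaving above can be reproduced with a small LRU simulation (a sketch; the thread/array layout below is our own example, not taken verbatim from the slide):

```python
from collections import OrderedDict

def lru_hits(accesses, ways):
    """Simulate one fully-associative LRU set of `ways` lines; return hits."""
    lru = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in lru:
            hits += 1
            lru.move_to_end(addr)          # promote to the MRU position
        else:
            if len(lru) >= ways:
                lru.popitem(last=False)    # evict the LRU line
            lru[addr] = True
    return hits

# Three threads round-robin over disjoint N-line arrays, two passes each:
N = 4
one_pass = [t * N + i for i in range(N) for t in range(3)]
trace = one_pass * 2
print(lru_hits(trace, ways=2))    # 0: every line is evicted before its reuse
print(lru_hits(trace, ways=12))   # 12: the whole working set fits
```

With only 2 ways the second pass gets zero hits even though every line is reused, which is exactly the thrashing pattern the slide illustrates; enlarging the cache to fit the working set recovers all of the reuse.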
23 GPU Cache Line Reuse Distance. [Chart: reuse-distance distribution (<16 vs. >16) for CLR, PF, PVR, BC, CSR, FLD, SS, BFS, STR, ELL, PRK, MM, KMN]
24 Storage Overhead. Per-ireuse-entry counters: AGG (3-bit), BYP (7-bit), REF (10-bit), ZERO (10-bit). Cache metadata: 8-bit per cache line. Only 3.5% storage overhead is needed for a 16kB cache (32-set, 4-way) with a 128-entry ireuse table to gain a 1.41x speedup.
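A back-of-the-envelope check of this figure (our own arithmetic, which lands near but not exactly on the quoted 3.5%, likely because the paper counts the structures slightly differently):

```python
# 128-entry ireuse table: 3 + 7 + 10 + 10 bits per entry (AGG/BYP/REF/ZERO).
ireuse_bits = 128 * (3 + 7 + 10 + 10)
lines = 32 * 4                      # 32 sets x 4 ways of cache lines
meta_bits = lines * 8               # 8-bit metadata per cache line
cache_bits = 16 * 1024 * 8          # 16kB L1 data array

overhead = (ireuse_bits + meta_bits) / cache_bits
print(f"{overhead:.1%}")            # 3.7%, in the ballpark of the quoted 3.5%
```

Either way, the added state is a few hundred bytes against a 16kB data array, which is what makes the scheme cheap to deploy per SM.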
25 Storage Overhead Comparison. 16kB L1 data cache (32 sets, 4 ways); 768kB unified L2 cache (64 sets, 8 ways, 12 partitions). Storage overhead w.r.t. the L1D$: Ctrl-C 3.5% with a 1.42x speedup, vs. Adaptive Bypassing* 5.6% with a 1.23x speedup. * Tian et al., Adaptive GPU Cache Bypassing, GPGPU '15
26 Ctrl-C MPKI Reduction. [Chart: MPKI normalized to the 16kB L1D$ baseline for Ctrl-C across CLR, PF, PVR, BC, CSR, FLD, SS, BFS, STR, ELL, PRK, MM, KMN, Avg (CS), Avg (NS), and Avg (All)]
27 Ctrl-C Bus Traffic Reduction. [Chart: bus traffic normalized to the 16kB L1D$ baseline for Adaptive Bypass and Ctrl-C across CLR, PF, PVR, BC, CSR, FLD, SS, BFS, STR, ELL, PRK, MM, KMN, Avg (CS), Avg (NS), and Avg (All)]
More informationLecture 12: Large Cache Design. Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers
Lecture 12: Large ache Design Topics: Shared vs. private, centralized vs. decentralized, UA vs. NUA, recent papers 1 Shared Vs. rivate SHR: No replication of blocks SHR: Dynamic allocation of space among
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationParallelization Techniques for Implementing Trellis Algorithms on Graphics Processors
1 Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti +, Achilleas Anastasopoulos*, Scott Mahlke*, Trevor
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationA Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware
A Code Merging Optimization Technique for GPU Ryan Taylor Xiaoming Li University of Delaware FREE RIDE MAIN FINDING A GPU program can use the spare resources of another GPU program without hurting its
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationWALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems
: A Writeback-Aware LLC Management for PCM-based Main Memory Systems Bahareh Pourshirazi *, Majed Valad Beigi, Zhichun Zhu *, and Gokhan Memik * University of Illinois at Chicago Northwestern University
More informationLecture 10: Large Cache Design III
Lecture 10: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity Sign up for class mailing list Pseudo-LRU has a 9% higher miss rate than true LRU 1 Overview 2 Set
More informationWarp-Level Divergence in GPUs: Characterization, Impact, and Mitigation
Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation Dept. of Electrical and Computer Engineering North Carolina State University Raleigh, NC, USA {pxiang, hzhou}@ncsu.edu Abstract High
More informationMemory Access Pattern-Aware DRAM Performance Model for Multi-core Systems
Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems ISPASS 2011 Hyojin Choi *, Jongbok Lee +, and Wonyong Sung * hjchoi@dsp.snu.ac.kr, jblee@hansung.ac.kr, wysung@snu.ac.kr * Seoul
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More information