Exploiting Core Criticality for Enhanced GPU Performance


1 Exploiting Core Criticality for Enhanced GPU Performance. Adwait Jog, Onur Kayıran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das. SIGMETRICS '16

2 Era of Throughput Architectures
GPUs are scaling in both the number of CUDA cores and DRAM bandwidth, but the number of CUDA cores is scaling rapidly while memory bandwidth is scaling at a much slower pace.
2010: GTX 480 (Fermi), 448 cores (178 GB/sec)
2012: GTX 680 (Kepler), 1536 cores (192 GB/sec)
2016: GTX 1080 (Pascal), 2560 cores (320 GB/sec)

3 Current Trend
Modern schedulers (e.g., FR-FCFS) assume that all memory requests are equally critical to performance and aim to maximize memory data throughput. The inability of FR-FCFS to distinguish memory requests from different GPU cores leads to GPU cores experiencing significant variation in average memory access latencies, with some GPU cores becoming more critical than others.

4 Coefficient of Variation (COV) in Average Memory Access Latencies
To understand further, consider the COV (the ratio of standard deviation to arithmetic mean) in memory access latencies.
[Figure: average COV in memory access latencies under 1x, 2x, and 4x baseline memory bandwidth (1xB, 2xB, 4xB) for LUH, RED, SCAN, LPS, RAY, CONS, SCP, BLK, HS, CFD, GAUSS, and the average.]
Some GPU cores experience higher average memory latency than others; these cores are less latency tolerant ("critical"), and the latency variations correlate with IPC variation.
We need to take core criticality into account: prioritize requests from GPU cores with less latency tolerance. Contention is present in the entire memory hierarchy; in this work, we only consider main memory contention.
We propose CLAMS, a criticality-aware memory scheduling mechanism.

5 Outline
Introduction and Motivation
Core Criticality: Metrics and Analysis
Design of Criticality-Aware Memory Scheduler
Infrastructure Setup and Evaluation
Conclusions

6 Core Criticality: Metrics
We need to quantify core criticality; we use latency tolerance as the measure.
1. Classify warps into short- and long-latency warps. Short-latency: compute instructions, or data in the private cache. Long-latency: stalled due to pending memory requests.
2. Calculate the short-latency ratio: the ratio of short-latency warps to total issued warps.
3. Assign a criticality rank by quantizing the short-latency ratio into 8 discrete steps with a step size of 1/8. Rank-1 (short-latency ratio < 1/8) is most critical; rank-8 (short-latency ratio > 7/8) is least critical. A minimal sketch of this quantization follows the list.
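A minimal sketch of the rank assignment, assuming per-core hardware counters for short-latency and total issued warps (the counter names are illustrative, not from the paper):

```cpp
#include <algorithm>
#include <cstdint>

// Quantize the short-latency ratio into 8 steps of width 1/8.
// Rank 1 (ratio < 1/8) is most critical; rank 8 (ratio >= 7/8) is least critical.
int criticality_rank(uint64_t short_latency_warps, uint64_t total_issued_warps) {
    if (total_issued_warps == 0) return 8;  // no issued warps yet: treat as least critical (assumption)
    double ratio = static_cast<double>(short_latency_warps) / total_issued_warps;
    int rank = static_cast<int>(ratio * 8.0) + 1;
    return std::clamp(rank, 1, 8);  // ratio == 1.0 would otherwise give 9
}
```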

7 Core Criticality: Metrics
Th_CR: the Criticality-Rank-Threshold. A core is critical if its current rank <= Th_CR. Th_CR takes any integer value from 1 to 8.
PCC(Th_CR): the percentage of critical cores for a given Th_CR. PCC(Th_CR) of 100% means all cores are critical; PCC(Th_CR) of 0% means all cores are non-critical.
[Figure: a three-core example (Core-1, Core-2, Core-3) in which Th_CR = 7 gives PCC(7) = 100%, Th_CR = 4 gives PCC(4) = 33%, and Th_CR = 1 gives PCC(1) = 0%.]
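A minimal sketch of PCC, assuming the per-core ranks have been gathered globally; the example ranks below are hypothetical values consistent with the three-core figure:

```cpp
#include <vector>

// PCC(Th_CR): percentage of cores whose criticality rank is <= Th_CR.
double pcc(const std::vector<int>& core_ranks, int th_cr) {
    if (core_ranks.empty()) return 0.0;
    int critical = 0;
    for (int rank : core_ranks)
        if (rank <= th_cr) ++critical;
    return 100.0 * critical / core_ranks.size();
}

// Hypothetical ranks {5, 3, 6}: pcc(...,7) == 100%, pcc(...,4) ~ 33%, pcc(...,1) == 0%.
```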

8 Core Criticality: Metrics
[Figure: average PCC(1) under 1x, 2x, and 4x baseline memory bandwidth (1xB, 2xB, 4xB).]
1. PCC is dependent on the chosen criticality-rank threshold.
2. PCC varies within an application over time.
3. PCC varies across applications.
4. PCC reduces significantly as main memory bandwidth increases.

9 Core Criticality: Metrics and Analysis
If PCC(Th_CR) is low for any given Th_CR, then the GPU cores have similar latency tolerance, and the memory scheduler should preserve locality.
PCC(Th_CR) is calculated periodically and requires an exchange of global information across cores and MCs, plus the hardware overhead of the calculation. This is an expensive approach! We need a metric that can be calculated at the MCs.

10 Core Criticality: Metrics and Analysis
We use the Percentage of Critical Requests (PCR): tag memory requests with the issuing core's current rank, then calculate the percentage of critical memory requests present in the MC request buffer. This eliminates the exchange of global information between GPU cores and MCs. Similar to PCC, PCR needs to be calculated for a given Th_CR.
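A minimal sketch of PCR at a memory controller, assuming each queued request carries the rank tag applied at injection (the MemRequest struct is illustrative):

```cpp
#include <deque>

struct MemRequest {
    int core_rank;  // issuing core's criticality rank, tagged at injection
    // address, bank, row, etc. omitted
};

// PCR(Th_CR): percentage of requests in the MC buffer whose tag is <= Th_CR.
double pcr(const std::deque<MemRequest>& request_buffer, int th_cr) {
    if (request_buffer.empty()) return 0.0;
    int critical = 0;
    for (const auto& req : request_buffer)
        if (req.core_rank <= th_cr) ++critical;
    return 100.0 * critical / request_buffer.size();
}
```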

11 Core Criticality: Metrics and Analysis
PCR patterns for an application are similar to its PCC patterns.
[Figure: PCC(1), PCC(4), PCC(7) and PCR(1), PCR(4), PCR(7) over time for SCP and CONS.]
Similar observations hold for PCR; PCR considers the criticality of requests instead of their corresponding cores.

12 Core Criticality: Analysis
Scope of a criticality-aware memory scheduler: the distribution of criticality-rank differences across DRAM requests. The criticality-rank difference is the difference between the highest and lowest criticality rank of requests in the MC at the same time; diff-0 denotes the percentage of DRAM cycles in which all memory requests have the same rank.
[Figure: stacked percentage of DRAM cycles spent at each rank difference (diff-0 through diff-7), per application.]
There is a significant rank difference in LUH, RAY, and SCAN (high scope); the rank difference is 0 most of the time for CFD and GAUSS (low scope).
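A minimal sketch of the per-cycle rank-difference metric behind this figure, assuming we can enumerate the rank tags present in the MC buffer in a given cycle:

```cpp
#include <algorithm>
#include <vector>

// Difference between the highest and lowest criticality rank present in the
// MC this cycle; 0 (diff-0) means all queued requests share the same rank.
int rank_difference(const std::vector<int>& ranks_in_mc) {
    if (ranks_in_mc.empty()) return 0;
    auto [lo, hi] = std::minmax_element(ranks_in_mc.begin(), ranks_in_mc.end());
    return *hi - *lo;
}
```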

13 Outline
Introduction and Motivation
Core Criticality: Metrics and Analysis
Design of Criticality-Aware Memory Scheduler
Infrastructure Setup and Evaluation
Conclusions

14 Design of CLAMS: Two Major Challenges
1. Finding an appropriate value of Th_CR, given the co-existence of critical and non-critical requests: a low Th_CR yields too few critical cores, a high Th_CR too many.
2. Balancing DRAM locality and criticality by switching between schedulers optimized for criticality or locality: calculate PCR(Th_CR) periodically and compare it with the Switching-Mode-Threshold (Th_SM). If PCR(Th_CR) > Th_SM, use locality mode; if PCR(Th_CR) <= Th_SM, use criticality mode (a minimal sketch of this decision follows). We need to find an appropriate value of Th_SM.
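A minimal sketch of the periodic mode decision, assuming fixed Th_CR and Th_SM values (as in Static-CLAMS) and a PCR value computed as in the earlier pcr() sketch:

```cpp
enum class Mode { Locality, Criticality };

// If too many queued requests are critical, prioritizing them buys little,
// so fall back to a locality-optimized (FR-FCFS-like) schedule.
Mode choose_mode(double pcr_value, double th_sm) {
    return (pcr_value > th_sm) ? Mode::Locality : Mode::Criticality;
}
```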

15 Design of CLAMS: Three Different Approaches
Static-CLAMS: single, fixed values for Th_CR and Th_SM.
Semi-Dyn-CLAMS: dynamically calculates Th_CR, based on a fixed Th_SM and the PCR(k), k in {1...8}, information at the MC.
Dyn-CLAMS: calculates both Th_CR and Th_SM.
The working mode of CLAMS (locality mode or criticality mode) is decided based on each bank's memory requests.

16 Design of Static-CLAMS
[Figure: distribution of memory requests across criticality ranks (rank-1 through rank-8), per application.]
Rank-4 provides a mix of both critical and non-critical requests, so we choose Th_CR = 4. But many applications have a different distribution, such as SCP: assuming Th_SM = 20% puts the scheduler in locality mode most of the time, while assuming Th_SM = 80% puts it in criticality mode most of the time, potentially degrading DRAM row-buffer locality.
Take away: 1. We need to adapt Th_CR based on the application. 2. Th_CR and Th_SM should not be determined independently.

17 Design of Semi-Dyn-CLAMS
Computes Th_CR based on a fixed Th_SM value and the PCR(k), k in {1...8}, information at the MC. We find a value for Th_CR such that PCR(Th_CR) is <= Th_SM and as close to it as possible; this switches the scheduler into criticality mode. In case no such Th_CR can be found, the scheduler switches to locality mode (see the sketch below).
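A minimal sketch of this threshold search, assuming PCR(k) has been computed at the MC for every k in {1...8} and stored in an array where pcr_by_rank[k-1] holds PCR(k); since PCR(k) counts requests with rank <= k, it is non-decreasing in k:

```cpp
#include <array>
#include <optional>

// Returns the Th_CR with PCR(Th_CR) <= Th_SM and closest to it, i.e. the
// largest qualifying k; nullopt means no Th_CR qualifies (locality mode).
std::optional<int> pick_th_cr(const std::array<double, 8>& pcr_by_rank,
                              double th_sm) {
    std::optional<int> best;
    for (int k = 1; k <= 8; ++k)
        if (pcr_by_rank[k - 1] <= th_sm)
            best = k;  // PCR is non-decreasing, so the last hit is closest
    return best;
}
```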

18 Design of Semi-Dyn-CLAMS
Let's look at SCP. Assume Th_SM = 40%, and that Th_CR can take any value in {1, 4, 7}.
[Figure: PCR(1), PCR(4), PCR(7) over time for SCP, with markers 1 and 2.]
(1) Th_CR = 1 is chosen for the first half of execution, since PCR(1) <= 40%; the scheduler works in criticality mode.
(2) For the second half of execution, there is no Th_CR for which PCR(Th_CR) <= 40%; the scheduler switches to locality mode. Most of the requests are critical and cannot be prioritized.
Th_CR is obtained using the PCR(k) information at the MC (aware of all requests in the MC); the actual working mode of the scheduler is determined based on the requests to be issued to a particular bank (aware of the requests destined to each bank).
Take away: 1. Semi-Dyn-CLAMS aggressively uses criticality mode. 2. But it only goes to locality mode when too many critical requests are present in the MC buffer. 3. This can lead to a significant loss of locality and performance. 4. There is no feedback on Th_SM when the new value of Th_CR is calculated.

19 Design of Dyn-CLAMS
Even though Semi-Dyn-CLAMS facilitates criticality mode, the actual mode is based on the requests to each bank: a bank can be in locality mode even though PCR(Th_CR) <= Th_SM.
Key idea: 1. Gauge the negative effect of the loss in row locality on latency tolerance. 2. Mitigate the loss by lowering Th_SM while maintaining the value of Th_CR.
Dyn-CLAMS attempts to recover the loss in locality: it dynamically calculates Th_CR and updates Th_SM based on the new value of Th_CR.

20 Design of Dyn-CLAMS
Initialize Th_SM = 40% and Th_CR = 8.
[Figure: PCR(1), PCR(4), PCR(7) and Th_SM over time, with markers 1, 2, and 3.]
(1) Calculate Th_CR as in Semi-Dyn-CLAMS: Th_CR is updated to 1.
(2) Update Th_SM to the new value of PCR(Th_CR); the chances of a bank's PCR(Th_CR) > Th_SM increase.
(3) In the second half of execution, the scheduler goes into locality mode, similar to Semi-Dyn-CLAMS.
Take away: 1. Dynamically updating Th_CR allows the scheduler to aggressively work in criticality mode. 2. By reducing Th_SM, the scheduler starts to improve locality by using locality mode for the banks.
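A minimal, self-contained sketch of one periodic Dyn-CLAMS step under these assumptions; the Semi-Dyn-CLAMS threshold search is inlined, and the Th_SM update follows the rule on this slide:

```cpp
#include <array>
#include <optional>

struct ClamsState {
    int th_cr = 8;        // criticality-rank threshold, initialized as above
    double th_sm = 40.0;  // switching-mode threshold, in percent
};

// pcr_by_rank[k-1] holds PCR(k) for k in {1...8}, non-decreasing in k.
void dyn_clams_step(ClamsState& s, const std::array<double, 8>& pcr_by_rank) {
    std::optional<int> new_th_cr;
    for (int k = 1; k <= 8; ++k)            // Semi-Dyn-CLAMS step: largest k
        if (pcr_by_rank[k - 1] <= s.th_sm)  // with PCR(k) <= Th_SM
            new_th_cr = k;
    if (new_th_cr) {
        s.th_cr = *new_th_cr;
        // Feedback on Th_SM (the step Semi-Dyn-CLAMS lacks): lower Th_SM to
        // the observed PCR(Th_CR), raising the chance that a bank's
        // PCR(Th_CR) exceeds Th_SM and falls back to locality mode.
        s.th_sm = pcr_by_rank[s.th_cr - 1];
    }
    // If no Th_CR qualifies, the scheduler works in locality mode this period.
}
```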

21 Outline
Introduction and Motivation
Core Criticality: Metrics and Analysis
Design of Criticality-Aware Memory Scheduler
Simulation Setup and Evaluation
Conclusions

22 Simulation Setup
Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator. Baseline configuration similar to a scaled-up version of a GTX GPU: SMs with 32 SIMT lanes and 32 threads/warp; 16KB L1 (4-way, 128B cache blocks) plus 48KB shared memory per SM; 6 memory partitions/channels.
Applications classified into 2 groups. Group A: moderate to high scope. Group B: low scope.

23 Performance with CLAMS (Normalized to FR-FCFS)
[Figure: normalized IPC for Static-CLAMS, Semi-Dyn-CLAMS, Dyn-CLAMS, FR-FCFS-Cap-Best, and Static-CLAMS-Best over Group A and Group B applications.]
Performance improvement for Group A applications: Static-CLAMS = 4.6%, Semi-Dyn-CLAMS = 6.5%, Dyn-CLAMS = 8.4%, FR-FCFS-Cap-Best = 4%, Static-CLAMS-Best = 9.3%.
More results and a detailed description of Dyn-CLAMS' effects are in the paper.

24 Take-Away Messages
Variation in average memory access latencies across the GPU cores makes some GPU cores more critical than others. Memory schedulers that are agnostic to the criticality of GPU cores can lead to sub-optimal performance; we need to orchestrate the exploitation of core criticality and DRAM locality. Dyn-CLAMS provides an average performance improvement of 8.4%.

25 Exploiting Core Criticality for Enhanced GPU Performance. Adwait Jog, Onur Kayıran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das. SIGMETRICS '16
