A Comparison of Capacity Management Schemes for Shared CMP Caches

1 A Comparison of Capacity Management Schemes for Shared CMP Caches
Carole-Jean Wu and Margaret Martonosi, Princeton University
7th Annual WDDD, 6/22/2008

2 Motivation
[Figure: processors P0, P1, ..., Pn, each with a private L1, sharing a last-level on-chip cache]
Heterogeneous workloads: web servers, video streaming, graphics-intensive, scientific, data mining, security scan, file/database
- Core counts are scaling up
- The shared cache becomes highly contested
- LRU replacement is not enough: it makes no distinction between process priority and applications' memory needs

3 When there is no capacity management
[Figure: cycles per instruction (CPI) for instances of mcf and libquantum]

4 This paper
- Offers an extensive and detailed study of shared resource management schemes:
  - Way-partitioned management [D. Chiou, MIT PhD Thesis, '99]
  - Decay-based management [Petoumenos et al., IEEE Workload Characterization, '06]
- Demonstrates the potential benefits of each management scheme:
  - Cache space utilization
  - Performance
  - Flexibility and scalability

5 Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

6 Shared Cache Capacity Management
Apportioning shared cache resources among multiple processor cores:
- Way-Partitioned Management [D. Chiou, MIT PhD Thesis, '99]
- Decay-Based Management [Petoumenos et al., IEEE Workload Characterization, '06]

7 Way-Partitioned Management
Statically allocate a number of L2 cache ways to processes...
[Figure: a 4-way set-associative cache]
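The static allocation described on this slide can be sketched for a single cache set as follows. This is a minimal illustration, assuming each process owns a fixed subset of ways and that lookup and LRU replacement are both restricted to that subset. The class name `WayPartitionedSet`, its methods, and the process labels are hypothetical and do not come from the slides.

```python
class WayPartitionedSet:
    """One cache set whose ways are statically divided among processes (illustrative)."""

    def __init__(self, ways_per_process):
        # ways_per_process: dict mapping a process ID to the number of ways it owns
        self.owner = []                        # owner[w] = process that owns way w
        for pid, count in ways_per_process.items():
            self.owner.extend([pid] * count)
        self.tags = [None] * len(self.owner)   # tag stored in each way
        self.last_use = [0] * len(self.owner)  # timestamps for LRU inside a partition
        self.clock = 0

    def access(self, pid, tag):
        """Return True on a hit; on a miss, fill the line and return False."""
        self.clock += 1
        my_ways = [w for w, p in enumerate(self.owner) if p == pid]
        # Assumption: tag comparison only looks at the ways owned by this process.
        for w in my_ways:
            if self.tags[w] == tag:
                self.last_use[w] = self.clock
                return True
        # Miss: prefer an empty way in the partition, otherwise its LRU way.
        victim = min(my_ways,
                     key=lambda w: (self.tags[w] is not None, self.last_use[w]))
        self.tags[victim] = tag
        self.last_use[victim] = self.clock
        return False
```

With a 2/2 split of a 4-way set, a process that streams through many lines can only evict lines in its own two ways, so the other process keeps its working set.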

8 How do applications benefit from cache sizes and set-associativity?
[Figure: number of ways allocated (out of a 16-way set-associative, 4MB L2 cache) vs. L2 miss rate (%), for mcf and sjeng]
- Some applications are more sensitive to the number of cache ways (cache resource) allocated to them.
- Miss rates improve as the number of cache ways allocated to them increases.
- Used to achieve performance predictability.

9 Prior Work: Cache Decay for Leakage Power Management
[Figure: timeline of accesses (M: miss, H: hit) to 2 distinct memory addresses mapped to the same cache set: a miss on a new data access, multiple hits in a short time, then another miss]
- One timer per cache line
- If the cache line is accessed frequently, maintain power: reset the timer with every access
- If it is not accessed for a long time (timer = decay interval), switch off Vdd
- Re-power a decayed line on an access
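The per-line timer behavior listed on this slide can be sketched as below. It is a simplified model, assuming time advances in abstract ticks and that a powered-off line loses its contents. The names `DecayLine`, `tick`, and `access` are hypothetical.

```python
class DecayLine:
    """A cache line with a decay timer; time is measured in abstract ticks."""

    def __init__(self, decay_interval):
        self.decay_interval = decay_interval
        self.timer = 0          # ticks since the last access
        self.powered = False    # False models Vdd switched off (contents lost)
        self.tag = None

    def tick(self):
        """Advance time by one tick; switch the line off once it has idled long enough."""
        if self.powered:
            self.timer += 1
            if self.timer >= self.decay_interval:
                self.powered = False
                self.tag = None

    def access(self, tag):
        """Return True on a hit. Every access resets the timer and (re)powers the line."""
        hit = self.powered and self.tag == tag
        self.powered = True
        self.tag = tag
        self.timer = 0
        return hit
```

A line that is touched more often than once per decay interval stays powered. A line left idle for a full interval is switched off, and the next access to it is a miss that powers it back on.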

10 Decay for Capacity Management
- When the decay counter reaches 0, the cache line becomes an immediate candidate for replacement, even if it is NOT the LRU line
- Set decay counters on a per-process basis:
  - Long decay interval: high-priority process
  - Short decay interval: low-priority process, so its cache lines are evicted more frequently
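The replacement rule on this slide can be sketched for one cache set as follows. Several simplifications are assumptions rather than details from the slides: one tick equals one access to the set, a decayed line that is still resident can still hit (and its counter is then reset), and LRU is the fallback when no line has decayed. The names `DecaySet`, `run`, and the process labels are hypothetical.

```python
class DecaySet:
    """One cache set with per-process decay intervals (illustrative sketch)."""

    def __init__(self, num_ways, decay_interval):
        # decay_interval: dict process ID -> ticks until a line decays (None = no decay)
        self.decay_interval = decay_interval
        self.lines = [None] * num_ways   # each entry is None or a dict describing a line
        self.clock = 0

    def access(self, pid, tag):
        """Advance time by one tick, then look up `tag`. Return True on a hit."""
        self.clock += 1
        for line in self.lines:
            if line is not None and line["counter"] is not None:
                line["counter"] -= 1
        for line in self.lines:
            if line is not None and line["tag"] == tag:
                # Assumption: a hit resets the decay counter of the line.
                line["counter"] = self.decay_interval[line["pid"]]
                line["last_use"] = self.clock
                return True
        victim = self._choose_victim()
        self.lines[victim] = {"tag": tag, "pid": pid,
                              "counter": self.decay_interval[pid],
                              "last_use": self.clock}
        return False

    def _choose_victim(self):
        # Order of preference: an empty way, a decayed line (even if not LRU), plain LRU.
        for w, line in enumerate(self.lines):
            if line is None:
                return w
        decayed = [w for w, line in enumerate(self.lines)
                   if line["counter"] is not None and line["counter"] <= 0]
        candidates = decayed or range(len(self.lines))
        return min(candidates, key=lambda w: self.lines[w]["last_use"])


def run(cache, stream):
    """Replay (pid, tag) pairs and return the list of hit/miss outcomes."""
    return [cache.access(pid, tag) for pid, tag in stream]


# Hypothetical stream: five lines cycling through a 4-way set.
# A, C, E belong to a high-priority process; B, D to a low-priority one.
stream = [("hi", "A"), ("lo", "B"), ("hi", "C"), ("lo", "D"), ("hi", "E")] * 2
decay_hits = run(DecaySet(4, {"hi": None, "lo": 2}), stream)
lru_hits = run(DecaySet(4, {"hi": None, "lo": None}), stream)
```

With no decay for either process the set behaves as plain LRU and the cyclic stream never hits. With a short decay interval for the low-priority process, its lines become victims first and the high-priority lines start to hit. The stream and intervals here are made up for illustration and are not the example from the slides.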

11 Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

12 Experimental Setup
Simulation framework:
- GEMS: full-system simulator [Simics + Ruby]
- 16-core multiprocessor on the SPARC architecture running an unmodified Solaris 10 operating system
Workload:
- SPEC2006 CINT benchmark suite [program initialization is included]
[Figure: P0, P1, ..., P15, each with a private L1; shared L2 cache; off chip]
- Private L1: 32KB each; 4-way; 64B cache line
- Shared L2: 4MB; 16-way; 64B cache line
- L1 miss latency: 2 cycles
- L2 miss latency: 4 cycles
- MESI directory protocol between L1 and L2
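The cache parameters above fix the number of sets and the capacity of a single way, which is the unit that way-partitioning hands out. A small sketch of that arithmetic follows; the helper names `num_sets` and `way_capacity` are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    assert size_bytes % (ways * line_bytes) == 0
    return size_bytes // (ways * line_bytes)


def way_capacity(size_bytes, ways):
    """Bytes of capacity contributed by a single way."""
    return size_bytes // ways


l1_sets = num_sets(32 * KB, ways=4, line_bytes=64)   # private L1
l2_sets = num_sets(4 * MB, ways=16, line_bytes=64)   # shared L2
l2_way_bytes = way_capacity(4 * MB, ways=16)         # one L2 way
```

For these parameters the L1 has 128 sets, the L2 has 4096 sets, and each L2 way holds 256KB, so a 2-way partition corresponds to 512KB of the shared cache.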

13 Evaluation
Mechanisms:
- Baseline: no cache capacity management
- Way-partitioned management
- Decay-based management
Scenarios:
- High contention
- General workload 1: constraining one memory-intensive application
- General workload 2: protecting a high-priority application (refer to the paper)

14 High Contention Scenario
[Figure: cycles per instruction (CPI) for instances of mcf and libquantum under four configurations: Alone; No Management; Way-Partitioned Management (libquantum: 2 ways; others: 14 ways); Decay-Based Management (libquantum: no decay; others: 1m)]
- No management: the processes take turns evicting each other's cache lines
- Way-partitioning:
- Decay-based:

15 Cache Space Distribution
[Figure: cache occupancy (%) in three panels: No Management (memory interference); Way-Partitioned Management (libquantum: 2 ways; mcf+OS: 14 ways); Decay-Based Management (libquantum: no decay; mcf+OS: 1k)]

16 Constraining a Memory-Intensive Application
[Figure: cycles per instruction (CPI) under four configurations: Alone; No Management; Way-Partitioned Management (mcf: 4 ways; others: 12 ways); Decay-Based Management (mcf: 1m; others: no decay)]
- Way-partitioning's coarse-grained control trades off 5% performance for mcf against an average 1% performance improvement for the rest
- Decay-based management: only 2% for mcf and 3% for the others, because of its fine-grained control and improved ability to exploit temporal locality in the data

17 Cache Space Distribution
[Figure: cache occupancy (%) in three panels: No Management; Way-Partitioned Management (mcf: 4 ways; others: 12 ways); Decay-Based Management (mcf: 1m; others: no decay)]

18 [Figure panels: No Management; Way-Partitioning; Decay-Based]
- The cache space of mcf is constrained to 4 ways
- libquantum gets more of the shared L2 cache space
- gcc uses more cache space under way-partitioning and decay-based management

19 Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

20
- Priority classification and enforcement to achieve differentiable QoS [Iyer, ICS '04]
- Architectural support for optimizing the performance of a high-priority application with minimal performance degradation, based on QoS policies [Iyer et al., SIGMETRICS '07]
- Performance metrics, such as miss rates, bandwidth usage, IPC, and fairness, to assist resource allocation [Hsu et al., PACT '06]
- Resource allocation fairness in the virtual private cache, whose capacity manager implements way-partitioning [Nesbit et al., ISCA '07]
Further cache fairness policies can be incorporated into both capacity management mechanisms discussed in this work.

21 Related Work: Dynamic Cache Capacity Management
- The OS distributes an equal amount of cache space to all running processes, keeps statistics on the fly, and dynamically adjusts the cache space distribution [Suh et al., HPCA '02; Kim et al., PACT '04; Qureshi and Patt, MICRO '06]
- Adaptive set pinning to eliminate inter-process misses [Srikantaiah et al., ISCA '08]
- Statistical model to predict thread behaviors, and capacity management through decay [Petoumenos et al., IEEE Workload Characterization '06]
To the best of our knowledge, there has not been any prior work based on decay management taking full system effects into account.

22 Conclusion
Way-Partitioned Management
- Advantages: simple hardware; straightforward technique; great performance isolation
- Drawbacks: coarse granularity in space allocation, hence inefficient space utilization; preferably, the number of cache ways is at least the number of concurrent processes
Decay-Based Management
- Advantages: fine-grained control, hence more effective space utilization; data remaining in the cache: high priority and good temporal locality 5% 55%
- Drawbacks: more complex hardware

24 Hardware Overhead: Way-Partitioning
[Figure labels: index; set; process ID; MUX (cache ways for tag comparison); MUX (data); result of tag comparison]

25 What happens to the replaced lines?
[Figure: P0, P1, ..., P15, each with a private L1; shared L2 cache]
- Lines chosen for replacement in L2 are replaced without evicting L1's copy.
- This works because L1 and L2 cache blocks are the same size: 64 bytes.

26 Cache Space Distribution
[Figure: cache occupancy (%) in three panels: No Management; Way-Partitioned Management (lbm: 4 ways; others: 12 ways); Decay-Based Management (lbm: no decay; others: 1k)]
4 ways of the 16-way set-associative cache are allocated to lbm for its exclusive access.

27 Related Work: Iyer's QoS Shared Cache Capacity Management

28 Decay-Based Management
[Figure: a reference stream over addresses A, B, C, D, E in a 4-way set-associative cache, with a miss/hit marker per access, the points where D and B decay, and the memory controller]
- A, C, E come from the HIGH-PRIORITY process: NO DECAY
- B, D come from the LOW-PRIORITY process: DECAY
- 5 out of 9 accesses are hits, and all 5 hits belong to the high-priority process
- LRU: NO HITS at all
- LRU considers temporal behavior only; decay-based management considers process priority and temporal behavior
