A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches
Carole-Jean Wu and Margaret Martonosi, Princeton University
7th Annual WDDD, 6/22/2008

Motivation
[Diagram: cores P0, P1, ..., Pn, each with a private L1, sharing a last-level on-chip cache]
- Heterogeneous workloads: web servers, video streaming, graphics-intensive, scientific, data mining, security scans, file/database
- Core counts are scaling up, so the shared cache becomes highly contested
- LRU replacement is not enough: it makes no distinction between process priority and applications' memory needs

When there is no capacity management
[Two charts: Cycles per Instruction (CPI) for 7 instances of mcf, and for libquantum]

This paper
- Offers an extensive and detailed study of two shared-resource management schemes:
  - Way-partitioned management [D. Chiou, MIT PhD thesis '99]
  - Decay-based management [Petoumenos et al., IEEE Workload Characterization '06]
- Demonstrates the potential benefits of each management scheme: cache space utilization, performance, flexibility and scalability

Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

Shared Cache Capacity Management
Apportioning shared cache resources among multiple processor cores:
- Way-Partitioned Management [D. Chiou, MIT PhD thesis '99]
- Decay-Based Management [Petoumenos et al., IEEE Workload Characterization '06]

Way-Partitioned Management
Statically allocate a number of L2 cache ways to each process.
[Diagram: a 4-way set-associative cache with ways assigned to different processes]
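
A minimal sketch of how way-partitioned allocation can work, assuming a per-process way range and LRU within it; the interface and names are illustrative, not the authors' implementation:

```c
/* Way-partitioned victim selection: each process is granted a fixed range
 * of ways; on a miss, the victim is chosen only among that process's own
 * ways, so processes cannot evict each other's lines. */
#include <stdint.h>

#define NUM_WAYS 4              /* matches the 4-way example above */

typedef struct {
    uint64_t tag;
    uint64_t last_access;       /* timestamp for LRU within the partition */
    int      valid;
} Line;

typedef struct {
    int first_way;              /* start of this process's way range */
    int num_ways;               /* ways statically allocated to it */
} Partition;

/* Pick the LRU line, but only among the ways allocated to `proc`. */
int select_victim(const Line set[NUM_WAYS], const Partition *proc)
{
    int victim = proc->first_way;
    for (int w = proc->first_way; w < proc->first_way + proc->num_ways; w++) {
        if (!set[w].valid)
            return w;           /* prefer an empty way */
        if (set[w].last_access < set[victim].last_access)
            victim = w;         /* otherwise LRU within the partition */
    }
    return victim;
}
```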

How do applications benefit from cache size and set-associativity?
[Chart: Number of Ways Allocated (out of a 16-way set-associative, 4MB L2 cache) vs. L2 Miss Rate (%), for MCF and SJENG]
- Some applications are more sensitive than others to the number of cache ways (cache resource) allocated to them.
- Miss rates improve as the number of allocated cache ways increases.
- Way allocation can be used to achieve performance predictability.

Prior Work: Cache Decay for Leakage Power Management
[Diagram: two distinct memory addresses mapped to the same cache set; an access pattern over time, M H H H H H H H M, i.e., multiple accesses in a short time, where M = miss and H = hit]
- A timer per cache line: if the line is accessed frequently, maintain power, resetting the timer with every access.
- If the line is not accessed for a long time (timer reaches the decay interval), switch off its Vdd.
- A decayed line is re-powered on the next access to it.
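
A minimal sketch of the per-line decay timer described above; the field names and the tick granularity are assumptions, not the authors' design:

```c
/* Cache decay for leakage power: a coarse global tick advances every
 * line's timer; any access resets it. A line idle for a full decay
 * interval has its Vdd gated off to save leakage power; the next access
 * re-powers it, at the cost of refetching the data. */
#include <stdbool.h>
#include <stdint.h>

enum { DECAY_INTERVAL = 8 };    /* in coarse ticks; illustrative value */

typedef struct {
    uint16_t timer;             /* ticks since the last access */
    bool     powered;           /* is Vdd on for this line? */
    bool     valid;
} DecayLine;

void on_access(DecayLine *l)
{
    l->timer = 0;               /* frequent accesses keep the line powered */
    if (!l->powered) {          /* access to a decayed line: re-power it; */
        l->powered = true;      /* its contents were lost, so this is a miss */
        l->valid   = false;
    }
}

void on_global_tick(DecayLine *l)
{
    if (l->powered && ++l->timer >= DECAY_INTERVAL) {
        l->powered = false;     /* idle too long: switch off Vdd */
        l->valid   = false;     /* the stored data is gone */
    }
}
```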

Decay for Capacity Management
- When a line's decay counter reaches 0, the line becomes an immediate candidate for replacement, even if it is NOT the LRU line.
- Decay counters are set on a per-process basis:
  - Long decay interval: high-priority process.
  - Short decay interval: low-priority process, so its cache lines are evicted more frequently.
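
A minimal sketch of decay-driven victim selection under these rules, assuming a per-process interval table and a per-line decayed flag; this is illustrative structure, not the authors' hardware. Note that, unlike leakage decay, a "decayed" line here keeps its data and merely loses its claim on the cache:

```c
/* Decay for capacity management: a decayed line is replaced first, even if
 * it is not the LRU line. Decay intervals are per process: long (or none)
 * for high priority, short for low priority. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_PROCS 16

typedef struct {
    uint64_t tag, last_access;
    uint32_t timer;             /* ticks since this line was last touched */
    uint8_t  owner;             /* process that brought the line in */
    bool     valid, decayed;
} CLine;

/* Per-process decay interval, e.g. UINT32_MAX for "no decay" (high
 * priority) and a small value for a low-priority process. */
static uint32_t decay_interval[NUM_PROCS];

void decay_tick(CLine *l)
{
    if (l->valid && !l->decayed && ++l->timer >= decay_interval[l->owner])
        l->decayed = true;      /* immediate replacement candidate */
}

int select_victim_decay(const CLine set[], int num_ways)
{
    int lru = 0;
    for (int w = 0; w < num_ways; w++) {
        if (!set[w].valid || set[w].decayed)
            return w;           /* decayed line goes first, even if not LRU */
        if (set[w].last_access < set[lru].last_access)
            lru = w;
    }
    return lru;                 /* no decayed line: fall back to LRU */
}
```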

Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

Experimental Setup
Simulation framework:
- GEMS full-system simulator [Simics + Ruby]
- 16-core multiprocessor on the SPARC architecture running an unmodified Solaris 10 operating system
Workload:
- SPEC2006 CINT benchmark suite [program initialization is included]
[Diagram: cores P0, P1, ..., P15, each with a private L1, sharing the L2 cache; off-chip memory below]
Memory hierarchy:
- Private L1: 32KB each; 4-way; 64B cache line
- Shared L2: 4MB; 16-way; 64B cache line
- L1 miss latency: 20 cycles; L2 miss latency: 400 cycles
- MESI directory protocol between L1 and L2

Evaluation
Mechanisms:
- Baseline: no cache capacity management
- Way-partitioned management
- Decay-based management
Scenarios:
- High contention
- General Workload 1: constraining one memory-intensive application
- General Workload 2: protecting a high-priority application (refer to the paper)

High Contention Scenario
[Chart: Cycles per Instruction (CPI) for 7 instances of mcf plus libquantum; configurations: Alone; No Management; Way-Partitioned Management [libquantum: 2 ways; others: 14 ways]; Decay-Based Management [libquantum: no decay; others: 1M]. Bar values: 1.29, 1.0, 0.84, 0.85, 0.87, 0.74, 0.67, 0.55]
- No management: the applications take turns evicting each other's cache lines.

Cache Space Distribution
[Charts: Cache Occupancy (%) over 2.2x10^9 cycles under No Management (showing memory interference), Way-Partitioned Management (libquantum: 2 ways; mcf+OS: 14 ways), and Decay-Based Management (libquantum: no decay; mcf+OS: 1K)]

Constraining a Memory-Intensive Application
[Chart: Cycles per Instruction (CPI) per application; configurations: Alone; No Management; Way-Partitioned Management [mcf: 4 ways; others: 12 ways]; Decay-Based Management [mcf: 1M; others: no decay]]
- Way-partitioning's coarse-granularity control trades off 5% performance for mcf for an average 1% performance improvement for the rest.
- Decay-based management costs mcf only 2% and improves the others by 3%, because of its fine-grained control and improved ability to exploit the data's temporal locality.

Cache Space Distribution
[Charts: Cache Occupancy (%) over 2.2x10^9 cycles under No Management, Way-Partitioned Management (mcf: 4 ways; others: 12 ways), and Decay-Based Management (mcf: 1M; others: no decay)]

[Charts: per-application cache space under No Management, Way-Partitioning, and Decay-Based management]
- The cache space of mcf is constrained to 4 ways.
- libquantum gets more of the shared L2 cache space.
- More cache space is used by gcc under way-partitioning and decay-based management.

Outline
- Motivation
- Shared Cache Capacity Management
- Experimental Setup and Evaluation
- Related Work
- Conclusion

Related Work
- Priority classification and enforcement to achieve differentiated QoS [Iyer, ICS '04]
- Architectural support for optimizing the performance of a high-priority application with minimal performance degradation, based on QoS policies [Iyer et al., SIGMETRICS '07]
- Performance metrics, such as miss rates, bandwidth usage, IPC, and fairness, to assist resource allocation [Hsu et al., PACT '06]
- Resource allocation fairness mechanisms in virtual private caches, where the capacity manager implements way-partitioning [Nesbit et al., ISCA '07]
Further cache fairness policies can be incorporated into both capacity management schemes discussed in this work.

Related Work: Dynamic Cache Capacity Management
- The OS distributes an equal amount of cache space to all running processes, keeps statistics on the fly, and dynamically adjusts the cache space distribution [Suh et al., HPCA '02; Kim et al., PACT '04; Qureshi and Patt, MICRO '06]
- Adaptive set pinning to eliminate inter-process misses [Srikantaiah et al., ISCA '08]
- A statistical model to predict thread behaviors, and capacity management through decay [Petoumenos et al., IEEE Workload Characterization '06]
To the best of our knowledge, there has been no prior work based on decay management that takes full-system effects into account.

Conclusion
Way-Partitioned Management
- Advantages: simple hardware complexity; straightforward technique; great performance isolation
- Drawbacks: preferably, the number of cache ways >= the number of concurrent processes; coarse granularity in space allocation leads to inefficient space utilization
Decay-Based Management
- Advantages: fine-granularity control gives more effective space utilization; the data remaining in the cache is high priority and has good temporal locality
- Drawbacks: more complex hardware

Hardware Overhead: Way-Partitioning
[Diagram: the set index feeds a MUX; the process ID selects the cache ways for tag comparison; the result of the tag comparison drives the data MUX]
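
A minimal sketch of the lookup path the diagram suggests, with assumed semantics: the process ID indexes a small table of way masks, and only the enabled ways take part in tag comparison. Names and structure are illustrative, not the authors' RTL:

```c
/* Way-masked tag lookup: the process ID selects which ways are compared,
 * and the matching way (if any) drives the data MUX. */
#include <stdint.h>

#define NUM_WAYS  16
#define NUM_PROCS 16

typedef struct { uint64_t tag; int valid; } TagEntry;

static uint16_t way_mask[NUM_PROCS];   /* one bit per way, set by software */

/* Compare the tag only in the ways enabled for `pid`; -1 means miss. */
int tag_lookup(const TagEntry set[NUM_WAYS], int pid, uint64_t tag)
{
    uint16_t mask = way_mask[pid];
    for (int w = 0; w < NUM_WAYS; w++)
        if (((mask >> w) & 1) && set[w].valid && set[w].tag == tag)
            return w;                  /* selects the data MUX input */
    return -1;
}
```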

What happens to the replaced lines?
[Diagram: cores P0, P1, ..., P15 with private L1s above the shared L2 cache]
Lines replaced in the L2 are evicted without evicting the L1 copies. This works because L1 and L2 cache blocks are the same size: 64 bytes.

Cache Space Distribution
[Charts: Cache Occupancy (%) under No Management, Way-Partitioned Management (lbm: 4 ways; others: 12 ways), and Decay-Based Management (lbm: no decay; others: 1K)]
4 ways out of the 16-way set-associative cache are allocated to lbm for its exclusive access.

Related Work: Iyer's QoS Shared Cache Capacity Management

Decay-Based Management Example
[Diagram: a 4-way set-associative cache and memory controller replaying the reference stream E A B C D E A B C ...]
- A, C, and E come from the HIGH-PRIORITY process -> no decay.
- B and D come from the LOW-PRIORITY process -> decay (D decays, then B decays).
- With decay-based management, 5 out of 9 accesses are hits, and all 5 hits belong to the high-priority process.
- With LRU, there are NO HITS at all.
- LRU captures temporal behavior only; decay-based management captures both process priority and temporal behavior.
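
The full reference stream did not survive transcription, so the runnable sketch below is an assumption, not the authors' experiment: it replays a 9-access stream E A B C D E A B C through a single 4-way set, with an illustratively short decay interval for the low-priority blocks B and D. It reproduces the qualitative result on the slide: LRU thrashes and gets no hits, while decay-based replacement hits only on the high-priority blocks A, C, and E.

```c
/* One-set comparison of plain LRU vs. priority-decay replacement.
 * Blocks A, C, E belong to a high-priority process (never decay);
 * B, D belong to a low-priority process (decay quickly). */
#include <stdio.h>
#include <string.h>

#define WAYS 4
enum { HIGH, LOW };
enum { LOW_DECAY = 1 };   /* a low-priority line decays after one idle access */

typedef struct { char tag; int idle; int prio; int valid; } Line;

static int prio_of(char b) { return (b == 'B' || b == 'D') ? LOW : HIGH; }

/* Replay `stream` through one 4-way set; returns the number of hits. */
static int run(const char *stream, int use_decay)
{
    Line set[WAYS];
    memset(set, 0, sizeof set);
    int hits = 0;
    for (const char *p = stream; *p; p++) {
        int hit = -1, victim = -1, lru = 0;
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == *p) hit = w;
            if (!set[w].valid && victim < 0) victim = w;   /* empty way first */
            if (set[w].idle > set[lru].idle) lru = w;      /* most idle = LRU */
        }
        if (use_decay && victim < 0)                       /* a decayed low-   */
            for (int w = 0; w < WAYS; w++)                 /* priority line is */
                if (set[w].prio == LOW && set[w].idle >= LOW_DECAY)
                    { victim = w; break; }                 /* taken first,     */
        if (victim < 0) victim = lru;                      /* else plain LRU   */
        for (int w = 0; w < WAYS; w++) set[w].idle++;      /* age every line...  */
        if (hit >= 0) { hits++; set[hit].idle = 0; }       /* ...except the hit  */
        else set[victim] = (Line){ *p, 0, prio_of(*p), 1 };
    }
    return hits;
}

int main(void)
{
    const char *stream = "EABCDEABC";   /* A, C, E high priority; B, D low */
    printf("plain LRU : %d hits of %zu\n", run(stream, 0), strlen(stream));
    printf("with decay: %d hits of %zu\n", run(stream, 1), strlen(stream));
    return 0;
}
```

With this stream, LRU's five-block working set cycles through the four ways and never hits, while decay keeps A, C, and E resident so their reuses hit.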