Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Presented by Hadeel Alabandi

Introduction and Motivation
Cache partitioning and sharing is critical to the effective utilization of multicore processors.
Existing studies evaluated cache partitioning through simulation, which has several limitations:
  Excessive simulation time
  Absence of OS activities
  Proneness to simulation inaccuracy

Introduction and Motivation (cont.)
This paper takes a software approach instead.
It supports both static and dynamic cache partitioning through memory address mapping.
It emulates a hardware partitioning mechanism, so cache partitioning policies can be examined on real systems.
Three kinds of metrics are used in the evaluation, one per optimization objective:
  Performance
  Fairness
  QoS

Cache Partitioning for Multicore Processors
Cache partitioning has two interdependent parts:
  Mechanism
    Enforces the cache partitions
    Provides the inputs that the partitioning policy needs
  Policy
    Decides how much cache is allocated to each program, given an optimization objective

Adopted Evaluation Metrics in the Study
Performance metrics
  Throughput (IPCs): the absolute IPC numbers, added up over the co-scheduled programs
  Combined miss rates: the sum of the programs' miss rates
  Combined misses: the sum of the programs' cache misses
QoS metrics
  The study considers only cases in which QoS constraints are never violated
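
The three performance metrics above can be sketched in a few lines of Python. This is an illustration only: the `ProgramStats` record and its field names are hypothetical, and each metric is assumed to be a plain sum over the co-scheduled programs.

```python
from dataclasses import dataclass


@dataclass
class ProgramStats:
    """Counters for one co-scheduled program (hypothetical record layout)."""
    instructions: int
    cycles: int
    l2_accesses: int
    l2_misses: int

    @property
    def ipc(self) -> float:
        # Instructions per cycle for this program.
        return self.instructions / self.cycles

    @property
    def miss_rate(self) -> float:
        # Fraction of L2 accesses that missed.
        return self.l2_misses / self.l2_accesses


def throughput(stats):
    """Sum of the programs' IPCs."""
    return sum(p.ipc for p in stats)


def combined_miss_rate(stats):
    """Sum of the programs' miss rates."""
    return sum(p.miss_rate for p in stats)


def combined_misses(stats):
    """Sum of the programs' cache miss counts."""
    return sum(p.l2_misses for p in stats)
```

A higher throughput is better, while lower combined miss rates and combined misses are better, so a policy that optimizes one of them has to treat the direction accordingly.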

Adopted Evaluation Metrics in the Study (cont.)
Fairness metrics
  Miss rates
  The number of misses
  Ideally, the slowdown of every co-scheduled program is identical after cache partitioning
  In the study, the fairness metrics are defined relative to single-core execution with a dedicated L2 cache
  The data required for each policy metric and for the evaluation metric were acquired by running a workload under different cache partitionings
  The correlation between a policy metric and the evaluation metric lies in the range -1 to 1
  A value of 1 means the two metrics are perfectly correlated
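
The slide says only that the correlation falls between -1 and 1. A minimal sketch, assuming the Pearson correlation coefficient is meant, shows how a policy metric and the evaluation metric could be compared across several partitionings of one workload. The function name and the sample values are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# One value per cache partitioning of the same workload (made-up numbers).
policy_metric = [0.10, 0.20, 0.35, 0.50]
evaluation_metric = [0.12, 0.22, 0.30, 0.55]
print(pearson(policy_metric, evaluation_metric))
```

A result near 1 would suggest that optimizing the policy metric also improves the evaluation metric; a result near 0 or below would suggest the policy metric is a poor proxy.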

Static OS-based Cache Partitioning
A static cache partitioning policy predetermines the amount of cache allocated to each program at the beginning of its execution.
Page coloring serves as the partitioning mechanism:
  In the physical address, several bits are shared between the cache index and the physical page number
  These shared bits are used as the page color
  A physically addressed cache is divided into non-intersecting regions by page color
  Pages with the same color are mapped to the same cache region
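
The bit arithmetic behind page coloring can be sketched as follows. The cache size and associativity match the L2 cache described in the experimental platform; the 64-byte line size and 4 KB page size are assumptions made for this example, and the function names are illustrative.

```python
def num_page_colors(cache_bytes, ways, line_bytes, page_bytes):
    """Number of page colors a physically indexed cache offers."""
    sets = cache_bytes // (ways * line_bytes)
    # Address range covered by the line-offset bits plus the set-index bits.
    bytes_per_way = sets * line_bytes
    # Index bits above the page offset also belong to the page number:
    # those shared bits are the color bits.
    return max(1, bytes_per_way // page_bytes)


def page_color(phys_addr, colors, page_bytes):
    """Color of the physical page that contains phys_addr."""
    return (phys_addr // page_bytes) % colors


def cache_set(phys_addr, cache_bytes, ways, line_bytes):
    """Index of the cache set that phys_addr maps to."""
    sets = cache_bytes // (ways * line_bytes)
    return (phys_addr // line_bytes) % sets


CACHE = 4 * 1024 * 1024   # 4 MB shared L2
WAYS = 16                 # 16-way set associative
LINE = 64                 # assumed line size in bytes
PAGE = 4096               # assumed page size in bytes

colors = num_page_colors(CACHE, WAYS, LINE, PAGE)
print(colors, page_color(0x12345678, colors, PAGE))
```

Under these assumptions every color owns a contiguous block of cache sets, so giving a process only pages of certain colors confines it to the matching cache regions.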

Cache Partitioning: Page Coloring (figure)

Cache Partitioning: Page Coloring (figure, continued)

Dynamic OS-based Cache Partitioning
Cache quotas are adjusted among processes dynamically.
Page recoloring procedure, for example when increasing a process's cache resources (i.e., the number of colors used by the process):
  The kernel rearranges the virtual memory mapping of the process
  It allocates physical pages of the new color
  It copies the memory contents
  It frees the old pages
Remapping virtual pages causes performance overhead. Two ways to reduce it:
  Lower the frequency of cache allocation adjustment
  Migrate pages lazily, so that the content of a recolored page is moved only when it is accessed
The average overhead of dynamic partitioning is reduced to 2%; the highest migration overhead observed is 7%.
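
The lazy migration idea can be illustrated with a toy user-space model. This is not kernel code: the `LazyRecolorer` class, its method names, and the rule for picking a new color are all hypothetical. It shows only that a change of the allowed colors marks pages as pending, and that the copy happens on the first later access.

```python
class LazyRecolorer:
    """Toy model of lazy page migration for a single process."""

    def __init__(self, page_colors):
        # Maps each virtual page number to the color of its physical page.
        self.page_colors = dict(page_colors)
        self.allowed = set(self.page_colors.values())
        self.pending = set()   # pages whose color is no longer allowed
        self.copies = 0        # number of page copies performed so far

    def set_allowed_colors(self, colors):
        """Change the color quota; no page is copied at this point."""
        if not colors:
            raise ValueError("a process needs at least one color")
        self.allowed = set(colors)
        self.pending = {vp for vp, c in self.page_colors.items()
                        if c not in self.allowed}

    def access(self, vpage):
        """Touch a page; migrate it first if its color is no longer allowed."""
        if vpage in self.pending:
            # Stands in for: allocate a page of an allowed color,
            # copy the contents, free the old page.
            self.page_colors[vpage] = self._pick_color(vpage)
            self.copies += 1
            self.pending.discard(vpage)
        return self.page_colors[vpage]

    def _pick_color(self, vpage):
        # Hypothetical choice: spread pages round-robin over allowed colors.
        ordered = sorted(self.allowed)
        return ordered[vpage % len(ordered)]
```

Pages that are never touched again are never copied, which is where the saving over eager recoloring comes from.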

Page Recoloring (figure)

Dynamic Cache Partitioning Policies
The policies adjust the cache partitioning periodically, at the end of each epoch.
Dynamic cache partitioning policy for performance
  Adjusts the cache partitioning dynamically
  Metrics
    Throughput (IPCs)
    Combined miss rate
    Combined misses
    Fair speedup
Dynamic cache partitioning policy for fairness
  Two dynamic policies were implemented, based on FM0 and FM4
  FM0 is the evaluation metric (i.e., the ratio of the current cumulative IPC over the baseline IPC)
  FM4 is based on cache miss rates
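
The slide does not spell out how the partition is adjusted at each epoch boundary. As one possible reading, the sketch below uses a simple hill-climbing rule for two programs: at the end of an epoch it compares the current split with its two neighbors and keeps the best one. The rule, the function names, and the `measure` callback (a stand-in for running a trial and reading a metric where higher is better) are assumptions for illustration, not the policy from the paper.

```python
def next_partition(colors_a, total_colors, measure, step=1):
    """Pick program A's color count for the next epoch.

    measure(colors_a) is a hypothetical callback that returns the metric
    obtained when A holds colors_a colors and B holds the rest.
    """
    candidates = [colors_a]
    # Each program must keep at least one color.
    if colors_a - step >= 1:
        candidates.append(colors_a - step)
    if colors_a + step <= total_colors - 1:
        candidates.append(colors_a + step)
    # max() returns the first best candidate, so ties keep the current split.
    return max(candidates, key=measure)


def run_epochs(start_colors_a, total_colors, measure, epochs):
    """Apply the adjustment once per epoch and record each decision."""
    colors_a = start_colors_a
    history = [colors_a]
    for _ in range(epochs):
        colors_a = next_partition(colors_a, total_colors, measure)
        history.append(colors_a)
    return history


# Made-up metric that peaks when program A holds 11 of 16 colors.
print(run_epochs(8, 16, lambda a: -(a - 11) ** 2, 5))
```

Moving one color per epoch keeps the amount of page recoloring per adjustment small, at the cost of converging slowly when the best split is far from the starting point.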

Dynamic Cache Partitioning Policies (cont.)
Dynamic cache partitioning policy for QoS
  Considers a two-core workload of two programs
    The first is the target program
    The second is the partner program
  QoS guarantee
    Ensure that the target program's performance is greater than or equal to X% of a baseline: the execution of a homogeneous workload on a dual-core processor with half of the cache capacity allocated to each program
    While the guarantee holds, increase the performance of the partner program
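
One way to picture the QoS objective is a search for the smallest target allocation that still meets the X% threshold, with the remaining colors going to the partner. The sketch below is a hypothetical offline version of that idea: the function name, the per-allocation IPC table, and the assumption that the partner benefits from every extra color are made up for illustration.

```python
def qos_partition(target_ipc_by_colors, baseline_ipc, x_percent, total_colors):
    """Return (target_colors, partner_colors), or None if QoS cannot be met.

    target_ipc_by_colors maps a color count to the target program's IPC
    at that allocation (hypothetical measurements).
    """
    threshold = baseline_ipc * x_percent / 100.0
    # Scan from the smallest allocation upward; the partner keeps at least
    # one color, so the target can hold at most total_colors - 1.
    for colors in range(1, total_colors):
        if target_ipc_by_colors[colors] >= threshold:
            return colors, total_colors - colors
    return None


# Made-up IPC of the target program for 1 to 15 colors.
ipc = dict(zip(range(1, 16),
               [0.5, 0.6, 0.7, 0.8, 0.9] + [1.0] * 10))
print(qos_partition(ipc, baseline_ipc=1.0, x_percent=85, total_colors=16))
```

In a running system the table would not be known in advance, so a dynamic policy would have to approach this split epoch by epoch while checking the target's performance against the baseline.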

Experimental Methodology
Hardware and software platform
  Dell PowerEdge 1950
  Two dual-core 3.0 GHz Intel Xeon 5160 processors and 8 GB of Fully Buffered DIMM (FB-DIMM) main memory
  Shared 4 MB, 16-way set-associative L2 cache
  Each core has a private 32 KB instruction cache and a private 32 KB data cache
  Red Hat Enterprise Linux 4.0 with kernel linux-2.6.20.3
  Performance data collected using pfmon

Evaluation Results
The results show the improvement of the best static partitioning of each workload over a shared cache.

Performance: Static and Dynamic (figure)

Fairness: Correlation between Evaluation Metrics and Policy Metrics (figure)

QoS: Static and Dynamic (figure)

Related Work
  Cache partitioning for multicore processors
  Page coloring

Summary
An OS-based cache partitioning mechanism for multicore processors was designed and implemented.
It was used to study different cache partitioning policies.
Some findings of simulation-based studies were confirmed; the approach also reveals new insights that simulation had not shown.
Future work
  Reduce the cache partitioning overhead
  Add an easy-to-use interface
  Conduct partitioning research at the compiler level, for both multiprogrammed and multithreaded applications

Discussion
Did the OS-based approach provide new insights and observations that simulation could not show, or failed to show?

References
  Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems
  http://www.contrib.andrew.cmu.edu/~hyoseunk/pdf/ecrts13-hyos-slides.pdf
  http://ftp.cs.rochester.edu/~xiao/eurosys09/euro061-zhang.pdf