Replacement policies for shared caches on symmetric multicores: a programmer-centric point of view


Slide 1: Replacement policies for shared caches on symmetric multicores: a programmer-centric point of view
Pierre Michaud, INRIA
HiPEAC '11, January 26, 2011

Slide 2: Outline
- Self-performance contract
- A proposal for making the performance of sequential programs executing on symmetric multicores more deterministic
- Implications for the microarchitecture
- Focus on bus bandwidth and the shared cache

Slide 3: Context / assumptions
- General-purpose symmetric multicore, e.g., Intel Nehalem
- Multiprogrammed environment
- Sequential programming is still predominant; even parallelizable applications have sequential parts
- Some programs require a minimum performance level

Slide 4: The performance-concerned programmer
- Application performance depends on the platform, the programming language, the programmer's skill, ...
- (Figure: application performance vs. programming effort (~ time), showing a minimum acceptable performance and a maximum tolerable effort.)

Slide 5: Parallel vs. sequential
- The performance-concerned programmer may not necessarily want to write parallel programs
- (Figure: application performance vs. programming effort, with one curve for parallel and one for sequential programming.)

Slide 6: Example
- Sam invents a new video decompression method

Slide 7: Non-deterministic performance?

Slide 8: Programming a (soft) real-time application
- (Figure: application performance vs. programming effort, with best-case and worst-case performance curves; non-determinism requires more effort.)

Slide 9: Resource sharing in multicores
- Caches, bandwidth, power, temperature, ...
- Performance depends on which applications run simultaneously
- The shared cache is a major source of non-determinism: slowdowns of up to 2x have been observed on an Intel Nehalem
- The operating system cannot solve the problem, except by preventing applications from running simultaneously

Slide 10: The bandwidth paradox
- Higher bus bandwidth allows more blocks to be evicted from the last-level shared cache in a given time
- Increasing bandwidth may therefore decrease the performance of some applications
- Not a healthy situation!

Slide 11: Known solutions
- Do nothing, ignore the problem (current situation)
- Private caches; but bus bandwidth is still shared
- Programmable quotas (not implemented so far): the shared cache and bus bandwidth are partitioned on demand; the operating system is supposed to do the partitioning; the developer of an application has no guarantee that the application can get more than 1/N of a resource on an N-core machine

Slide 12: Proposal: the self-performance contract
- Symmetric run: copies of the application run simultaneously on all cores and use the same inputs
- Self-performance: the performance measured for one instance of the application under a symmetric run
- The OS provides a selfperf utility that the programmer can use to do symmetric runs and measure self-performance
- Self-performance contract: the microarchitect tries to keep the actual performance greater than or equal to the self-performance
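
The selfperf utility is not specified beyond this slide. Below is a minimal sketch of what a symmetric run could look like on Linux, assuming the application is an ordinary executable passed on the command line; the core pinning, the timing of the whole run as a proxy for one instance's performance, and all names are illustrative assumptions, not part of the slides.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    int n = (int)sysconf(_SC_NPROCESSORS_ONLN);   /* one instance per logical core */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) {
        if (fork() == 0) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(i, &set);                      /* pin instance i to core i */
            sched_setaffinity(0, sizeof(set), &set);
            execv(argv[1], &argv[1]);              /* every instance runs the same program and inputs */
            _exit(127);
        }
    }
    for (int i = 0; i < n; i++)
        wait(NULL);                                /* wait for all instances */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* Since all instances are identical, the run time of the symmetric run
       approximates the run time of one instance, i.e., its self-performance. */
    printf("symmetric run: %d instances of %s, %.2f s\n", n, argv[1], sec);
    return 0;
}
```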

Slide 13: Rationale
- Defines a performance target in isolation; no need to make assumptions about other applications
- With N cores, this gives 1/N of the shared resources to the application
- Simple for the programmer: no need to know internal microarchitecture details, just be aware of the self-performance contract
- Programmers who are not concerned can still measure performance as usual

Slide 14: Implications for the microarchitecture

Slide 15: Static partitioning?
- Static partitioning = private resources
- A possible way to implement the self-performance contract, but static partitioning of bus bandwidth is quite inefficient
- Shared resources provide higher average performance, especially when running fewer threads than there are cores

Slide 16: Dealing with shared resources
- How can shared resources be managed so as to implement the self-performance contract without harming throughput?
- Threads needing less than 1/N of a shared resource get what they need
- Threads needing more than 1/N of a shared resource get at least 1/N
- If some threads need less than their fair share, the surplus is allotted to the other threads
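
As a concrete reading of these three rules, here is a small max-min style allocation sketch: a thread needing less than the current fair share gets exactly what it needs, and the capacity it leaves over is redistributed among the remaining threads. The slides only state the desired outcome; the progressive-filling scheme and all names below are assumptions of this sketch.

```c
#include <stdio.h>

#define MAXT 16   /* assumed upper bound on the number of threads */

/* demand[i]: how much of the resource thread i would use if alone;
   alloc[i]:  resulting allocation; capacity: total amount of the shared resource. */
void fair_allocate(const double demand[], double alloc[], int n, double capacity)
{
    int done[MAXT] = {0};
    int remaining = n;
    while (remaining > 0) {
        double share = capacity / remaining;        /* current fair share */
        int progressed = 0;
        for (int i = 0; i < n; i++) {
            if (!done[i] && demand[i] <= share) {   /* needs less than its share */
                alloc[i] = demand[i];               /* ...so it gets what it needs */
                capacity -= demand[i];
                done[i] = 1;
                remaining--;
                progressed = 1;
            }
        }
        if (!progressed) {                          /* everyone left needs more than the share */
            for (int i = 0; i < n; i++)
                if (!done[i])
                    alloc[i] = share;               /* at least 1/N of the original capacity */
            break;
        }
    }
}

int main(void)
{
    double demand[] = {0.10, 0.40, 0.70, 0.90};     /* fractions of the resource */
    double alloc[4];
    fair_allocate(demand, alloc, 4, 1.0);
    for (int i = 0; i < 4; i++)
        printf("thread %d: demand %.2f -> allocation %.2f\n", i, demand[i], alloc[i]);
    return 0;
}
```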

Slide 17: Bus bandwidth
- Self-performance is generally higher than what would be obtained by a static partitioning of bus bandwidth, because of bursts of last-level cache misses
- (Figure: LLC misses per cycle over time for a symmetric run with threads slightly out of sync, compared with the available bus bandwidth.)

Slide 18: Thread arbitration policy
- Self-performance requires fair arbitration between threads that takes the burstiness of memory requests into account
- Fair policies have been proposed with programmable quotas; the self-performance contract allows simpler implementations
- A simple policy that works well for bus access: with N threads, keep N counters, one per thread (e.g., 4-bit counters); give priority to the thread with the smallest counter value; add N-1 to the selected thread's counter and subtract 1 from all the other counters
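
A minimal software model of the counter scheme above. The slide gives the update rule (priority to the smallest counter, +N-1 to the winner, -1 to all the others); restricting the choice to threads with a pending request and saturating the small counters are assumptions of this sketch.

```c
#define N    4      /* number of threads / cores (assumed) */
#define CMAX 15     /* 4-bit saturating counters (saturation handling is an assumption) */

static int cnt[N];  /* one counter per thread, initially 0 */

/* Grant the bus to one of the threads with a pending request (pending[i] != 0).
   Returns the selected thread, or -1 if no request is pending. */
int arbitrate(const int pending[N])
{
    int sel = -1;
    for (int i = 0; i < N; i++)
        if (pending[i] && (sel < 0 || cnt[i] < cnt[sel]))
            sel = i;                               /* priority to the smallest counter */
    if (sel < 0)
        return -1;
    for (int i = 0; i < N; i++) {
        if (i == sel)
            cnt[i] = (cnt[i] + N - 1 > CMAX) ? CMAX : cnt[i] + N - 1;  /* +N-1 to the winner */
        else
            cnt[i] = (cnt[i] > 0) ? cnt[i] - 1 : 0;                    /* -1 to all the others */
    }
    return sel;
}
```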

Slide 19: Shared cache
- The last-level cache is the shared resource most critical for the self-performance contract
- The application should ask for large pages, or the OS should implement superpages or do some page coloring, so that the mapping to cache sets is as deterministic as possible
- Thread-oblivious replacement policies (e.g., LRU) are incompatible with the self-performance contract
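
Why the set mapping is non-deterministic with small pages: the set index of a large LLC uses physical address bits above the 4 KB page offset, so it depends on which physical frames the OS happens to allocate. The cache geometry below (8 MB, 16-way, 64-byte lines, hence 8192 sets) is an illustrative assumption, not taken from the slides.

```c
#include <stdint.h>

/* Assumed LLC geometry: 8 MB, 16-way, 64-byte lines -> 8192 sets.
   Set index = physical address bits [6..18]. With 4 KB pages, only bits
   [6..11] come from the page offset; bits [12..18] come from the physical
   frame number, so the sets a page maps to depend on OS frame allocation
   unless superpages covering all index bits are used, or the OS colors
   pages on bits [12..18]. */
static inline unsigned llc_set(uint64_t paddr)
{
    return (unsigned)((paddr >> 6) & 0x1FFF);   /* 13 index bits, 8192 sets */
}
```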

Slide 20: Replacement policy
- The replacement policy should partition each cache set equally among the threads competing for that set
- Proposal: the SAR B2 policy; the underlying policy can be LRU, CLOCK, NRU, DIP, DRRIP, ...
- Upon a miss, pick a random block in the cache set; the thread that owns it is the random thread
- B2 rule: find which of the random thread and the missing thread has more blocks in the cache set
- Victim selection: victimize a block from that thread
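
A minimal software sketch of this victim selection. The per-block TID array is introduced on the next slide; breaking ties in favor of the random thread and using the underlying policy's ranking to choose among the selected thread's blocks are assumptions of this sketch, not details stated on the slide.

```c
#include <stdlib.h>

#define ASSOC 16        /* set associativity (assumed) */

/* tid[w]  : thread ID of the block in way w of the selected set
   order[w]: rank given by the underlying policy (LRU, NRU, DRRIP, ...);
             a higher rank means a better eviction candidate (assumption) */
int sar_b2_victim(const unsigned char tid[ASSOC], const int order[ASSOC],
                  unsigned missing_tid)
{
    int r = rand() % ASSOC;                 /* pick a random block in the set */
    unsigned random_tid = tid[r];           /* its owner is the random thread */

    /* B2 rule: which of the random thread and the missing thread
       has more blocks in this set? */
    int count_m = 0, count_r = 0;
    for (int w = 0; w < ASSOC; w++) {
        if (tid[w] == missing_tid) count_m++;
        if (tid[w] == random_tid)  count_r++;
    }
    unsigned victim_tid = (count_m > count_r) ? missing_tid : random_tid;

    /* Victim selection: evict that thread's best candidate
       according to the underlying replacement policy. */
    int victim = -1;
    for (int w = 0; w < ASSOC; w++)
        if (tid[w] == victim_tid && (victim < 0 || order[w] > order[victim]))
            victim = w;
    return victim;
}
```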

Slide 21: Thread ID
- A thread ID (TID) must be stored along with each cached block
- It is a purely microarchitectural TID: it affects performance only
- N logical cores require a log2(N)-bit TID
- Unused cores (i.e., cores not currently running a thread) mean that some TIDs are inactive
- Reclaim rule: if the randomly picked block's TID is inactive, victimize that block

Slide 22: Remarks
- Random block selection ensures convergence to a fair partitioning
- The fair partitioning of a cache set may change after an increase or decrease in the number of active threads
- If an active thread needs less than its fair share, the other active threads can share the surplus

Slide 23: Hardware cost
- Stored TIDs: for example, with 8 cores and 64-byte blocks, the storage overhead is below 0.6%
- The underlying replacement policy may require extra storage: CLOCK requires one clock hand per core and per set, so prefer NRU and NRU-based policies (e.g., DRRIP)
- Some logic for the B2 rule and victim selection: the last-level cache miss latency is hundreds of clock cycles, so some sequential logic can be used
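
A quick check of the < 0.6% figure, counting the TID bits against the 64-byte data block only (including tag and state bits in the denominator would only lower the ratio):

```latex
\lceil \log_2 8 \rceil = 3 \ \text{TID bits per block}, \qquad
\frac{3}{64 \times 8} = \frac{3}{512} \approx 0.59\,\% < 0.6\,\%
```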

Slide 24: B2 rule, sequential implementation
- Read the TIDs of the blocks in the set one at a time (e.g., through a shift register)
- Compare each TID with the missing block's TID, M: on a match, increment a counter
- Compare each TID with the randomly picked block's TID, R: on a match, decrement the counter
- The sign of the final counter value is used for victim selection
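
The same B2 comparison written as the single signed counter this slide describes; the loop below is just a software rendering of the serial datapath, with the function name and argument types as assumptions.

```c
/* One pass over the set's TIDs, as in the serial datapath of slide 24:
   +1 when a block belongs to the missing thread (TID == M),
   -1 when it belongs to the randomly picked thread (TID == R).
   A positive final value means the missing thread has more blocks in the set;
   otherwise the random thread has at least as many. */
int b2_sign(const unsigned char tid[], int assoc, unsigned M, unsigned R)
{
    int c = 0;
    for (int w = 0; w < assoc; w++) {
        if (tid[w] == M) c++;
        if (tid[w] == R) c--;
    }
    return c;   /* sign used for victim selection */
}
```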

Slide 25: Conclusion
- It is possible to make the performance of sequential applications much more deterministic
- This requires modifying the shared-cache replacement policy and the bus arbitration
- Reasonable hardware cost
- Parallel programs? No obvious solution so far; non-determinism is inherent to some parallel programs
