Stash Directory: A Scalable Directory for Many-Core Coherence. Socrates Demetriades and Sangyeun Cho


1 Stash Directory: A Scalable Directory for Many-Core Coherence. Socrates Demetriades and Sangyeun Cho. 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, February 2014

2 Directory. Chip multiprocessors have many cores, and coherence is needed across their private caches. A directory provides a scalable coherence solution. [Figure: cores 0 to N with private L2 caches, a shared LLC, and a directory.]

3 Directory: Energy vs. Area. Two common organizations: Duplicate-Tags [e.g., Piranha, Niagara T2] and Sparse [e.g., AMD Opteron]. [Figure: the directory sits above the private L2 caches of cores 0 to N.]

4 Directory: Energy vs. Area. Duplicate-Tags [e.g., Piranha, Niagara T2] are energy inefficient: the directory needs (L2 associativity × number of cores) ways. Sparse [e.g., AMD Opteron]. [Figure: the duplicate-tag directory replicates the L2 tags of cores 0 to N.]
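The associativity formula on this slide can be made concrete in a few lines. This is a minimal sketch, assuming an 8-way private L2 (one of the associativities listed in the evaluation); the function name and constant are hypothetical. It shows why duplicate tags scale poorly in energy: the number of tags compared per lookup grows linearly with core count, while a sparse directory matched to the L2 keeps the L2's associativity.

```python
def duplicate_tag_ways(l2_assoc: int, n_cores: int) -> int:
    # A duplicate-tag directory mirrors every core's L2 tag array, so one
    # lookup must compare against all ways of all copies at once.
    return l2_assoc * n_cores


L2_ASSOC = 8  # assumption: 8-way private L2

for cores in (16, 64, 256):
    print(f"{cores:>3} cores: duplicate-tag = {duplicate_tag_ways(L2_ASSOC, cores):>4} ways, "
          f"sparse (L2-matched) = {L2_ASSOC} ways")
```

At 16 cores this already means 128 tag comparisons per directory lookup, versus 8 for an L2-matched sparse directory.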

5 Directory: Energy vs. Area. Duplicate-Tags: energy inefficient. Sparse [e.g., AMD Opteron]: area inefficient. How big is enough? Typically 2x-4x over-provisioned relative to the L2 tags. [Figure: a sparse directory above the L2 tags of cores 0 to N.]
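Provisioning ratios are easiest to compare as entry counts. The sketch below treats 1x as one directory entry per block that all private L2s can hold together, and assumes 16 cores, a 256 KB private L2 per core, and 64-byte blocks; these sizes are hypothetical, since the slides do not give the L2 capacity.

```python
from fractions import Fraction


def sparse_entries(n_cores: int, l2_bytes: int, line_bytes: int, ratio: Fraction) -> int:
    """Directory entries for a given provisioning ratio.

    1x means one entry per block that all private L2s together can hold.
    """
    tracked_blocks = n_cores * (l2_bytes // line_bytes)
    return int(tracked_blocks * ratio)


# Hypothetical machine: 16 cores, 256 KB private L2 each, 64-byte blocks.
for ratio in (Fraction(4), Fraction(2), Fraction(1), Fraction(1, 2), Fraction(1, 4)):
    print(f"{str(ratio):>4}x: {sparse_entries(16, 256 * 1024, 64, ratio):>7} entries")
```

Under these assumptions, a 1/4x directory needs 16,384 entries, versus 131,072 to 262,144 for a conventionally over-provisioned 2x-4x sparse directory.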

6 Sparse-Based Directories: Area Efficiency. Conventional Sparse (2-4x) [Gupta:isca90, Conway:micro10]; clever hashing (1.5x) [Ferdman:hpca11, Sanchez:hpca12]; coarse-grain set indexing [Alisafaee:micro12]; disabling coherence for private pages [Cuesta:isca11]. How big is enough? Stash Directory.

7 Stash Directory. Stash is allowed not to track all cached tags: entries that track private blocks can be silently removed from the directory, and the LLC and the coherence protocol are involved to discover unregistered blocks. Contributions: power efficient (low associativity); space efficient (as small as 0.25x provisioning size without performance impact); transparent (no OS support, simple design); scalable (largely independent of core count).

8 Outline: Introduction, Directory-Induced Invalidations, Stash Directory, Evaluation, Conclusion

9 Directory-Induced Invalidations. [Figure: an L2 miss inserts a directory entry; the resulting directory eviction forces an invalidation in a core's private cache.]

10 Conflict Rate vs. Cache Miss Rate. [Figure: directory conflict rate and cache miss rate versus directory size; benchmark: fluidanimate.]

11 Forcing Invalidation on Private Blocks. [Figure: inserting a directory entry evicts an entry that tracks a hot private block.]

12 (1) Invalidation of Hot Blocks. [Figure: a directory eviction invalidates a block held near the MRU position of a core's L2.]

13 (2) Causing (Unnecessary) Additional Misses. [Figure: the core's next access to the invalidated block misses in its L2.]
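Effects (1) and (2) can be reproduced with a toy model of a single directory set. This is a hypothetical sketch: the class name, the single-set LRU model, and the use of Python sets for the L2 contents are all illustrative assumptions. The point it shows is that L2 hits never reach the directory, so a block that is hot in a core's L2 can still be the directory's LRU entry and be invalidated.

```python
from __future__ import annotations

from collections import OrderedDict


class SparseDirectorySet:
    """One set of a conventional sparse directory (hypothetical toy model).

    Each entry maps a block address to the set of cores caching it. When the
    set is full, a new entry evicts the LRU entry, and the evicted block is
    invalidated in every sharer's private cache (a directory-induced
    invalidation). Hits in a private L2 never reach this structure.
    """

    def __init__(self, ways: int):
        self.ways = ways
        self.entries = OrderedDict()  # block -> set of sharer core IDs, LRU first

    def access(self, block: int, core: int, l2s: dict[int, set[int]]) -> None:
        """Handle an L2 miss by `core` on `block`."""
        if block in self.entries:
            self.entries.move_to_end(block)  # directory MRU update
            self.entries[block].add(core)
        else:
            if len(self.entries) == self.ways:
                victim, sharers = self.entries.popitem(last=False)  # directory LRU
                for c in sharers:
                    l2s[c].discard(victim)  # forced invalidation
            self.entries[block] = {core}
        l2s[core].add(block)


l2s = {0: set(), 1: set()}  # core -> blocks in its private L2
d = SparseDirectorySet(ways=2)
d.access(0xA, core=0, l2s=l2s)  # core 0 misses on its private block 0xA
# ...core 0 now hits 0xA repeatedly in its L2; the directory never sees it...
d.access(0xB, core=1, l2s=l2s)
d.access(0xC, core=1, l2s=l2s)  # set full: 0xA is the directory LRU and is evicted
print(0xA in l2s[0])  # False: core 0's next access to 0xA is an extra miss
```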

14 (3) Polluting the Directory Set. [Figure: the reloaded block's L2 miss re-inserts an entry into the same directory set.]

15 Forcing Invalidation on Private Blocks (PARSEC and SPLASH-2 workloads, 1/4x directory size provisioning). On average, 72% of directory-induced invalidations target private blocks, and 80% of invalidated blocks are later reloaded, causing misses.

16 Outline: Introduction, Directory-Induced Invalidations, Stash Directory, Evaluation, Conclusion

17 Stash Directory: Overview. The directory knows whether an entry is tracking a private block. If an evicted entry is private, the directory does not enforce invalidation, so the private block stays cached but hidden from the directory. The LLC and the coherence protocol are involved to discover hidden blocks when necessary.

18 Stash Directory: Silent Eviction. [Figure: when a directory entry tracking a private (P) block is evicted, the directory does not enforce invalidation; the block is marked as stash-hidden in the shared LLC.]

19 Stash Directory: Handling False Misses. [Figure: an L2 miss finds no directory entry, a (false) directory miss, for a block marked stash-hidden in the shared LLC.]

20 Stash Directory: Handling False Misses. [Figure: the hidden copy is found in a private L2, and the block is unmarked in the shared LLC.]
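Slides 17 to 20 can be condensed into one toy model: a private victim is evicted silently and marked in the LLC, and a later access to a marked block is handled as a false miss. This is a hypothetical sketch, not the authors' implementation. Three details are assumptions: an entry counts as private when it has exactly one sharer; the LLC marks are modeled as a set of hidden block addresses; and the hidden copy is found by checking every core's private L2, standing in for whatever discovery step the protocol uses. Coherence-state changes such as downgrading the owner's copy are omitted.

```python
from __future__ import annotations

from collections import OrderedDict


class StashDirectorySet:
    """One set of a Stash-style directory (hypothetical toy model).

    Assumptions: an entry is private when it has exactly one sharer; the
    LLC's stash-hidden marks are modeled as a set of block addresses; a
    false miss finds the hidden copy by checking every core's private L2.
    """

    def __init__(self, ways: int, l2s: dict[int, set[int]], llc_hidden: set[int]):
        self.ways = ways
        self.entries = OrderedDict()  # block -> set of sharer core IDs, LRU first
        self.l2s = l2s                # core -> blocks in its private L2
        self.llc_hidden = llc_hidden  # blocks marked stash-hidden in the LLC

    def _insert(self, block: int, sharers: set[int]) -> None:
        if len(self.entries) == self.ways:
            victim, victim_sharers = self.entries.popitem(last=False)  # LRU
            if len(victim_sharers) == 1:
                self.llc_hidden.add(victim)  # silent eviction: copy stays cached
            else:
                for c in victim_sharers:
                    self.l2s[c].discard(victim)  # shared block: invalidate as usual
        self.entries[block] = sharers

    def access(self, block: int, core: int) -> str:
        """Handle an L2 miss by `core` on `block` and report what happened."""
        if block in self.entries:
            self.entries.move_to_end(block)
            self.entries[block].add(core)
            result = "directory hit"
        elif block in self.llc_hidden:
            # False directory miss: a hidden private copy may still exist.
            self.llc_hidden.discard(block)  # unmark in the LLC
            owners = {c for c, blocks in self.l2s.items() if block in blocks}
            self._insert(block, owners | {core})
            result = f"false miss, found at cores {sorted(owners)}"
        else:
            self._insert(block, {core})
            result = "true miss"
        self.l2s[core].add(block)
        return result


l2s = {0: set(), 1: set()}
hidden = set()
d = StashDirectorySet(ways=2, l2s=l2s, llc_hidden=hidden)
d.access(0xA, core=0)          # true miss: 0xA is private to core 0
d.access(0xB, core=1)
d.access(0xC, core=1)          # evicts 0xA silently and marks it hidden
print(0xA in l2s[0], 0xA in hidden)  # True True: no forced invalidation
print(d.access(0xA, core=1))   # false miss, found at cores [0]
```

Compared with a conventional sparse set under the same access sequence, core 0 keeps its copy of 0xA, and the extra cost moves to the (rare) false miss.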

21 Outline: Introduction, Directory-Induced Invalidations, Stash Directory, Evaluation, Conclusion

22 Evaluation Methodology. Workloads: multithreaded benchmarks from SPLASH-2 and PARSEC 2.1. Trace: x86 traces generated using Pin, fed into a cycle-detailed cache/NoC model with a 1-IPC in-order core model. Simulated machine configuration: 16-core tiled CMP; distributed shared 16MB LLC; private L1/L2 caches, inclusive, 4-way/8-way; 4x4 mesh NoC; distributed directory (same associativity as L2) of varying size.
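The slide does not say how addresses map to the distributed LLC and directory slices. A common choice, assumed here purely for illustration, is block interleaving across tiles. The sketch pairs that hypothetical mapping with hop counts under dimension-order (XY) routing on the 4x4 mesh, assuming row-major tile numbering and 64-byte blocks.

```python
MESH_DIM = 4      # 4x4 mesh, 16 tiles (from the simulated configuration)
LINE_BYTES = 64   # assumption: 64-byte cache blocks


def home_tile(addr: int) -> int:
    """Hypothetical block-interleaved mapping of an address to its home
    LLC/directory slice."""
    return (addr // LINE_BYTES) % (MESH_DIM * MESH_DIM)


def hops(src: int, dst: int) -> int:
    """Hop count between two row-major-numbered tiles with XY routing."""
    sx, sy = src % MESH_DIM, src // MESH_DIM
    dx, dy = dst % MESH_DIM, dst // MESH_DIM
    return abs(sx - dx) + abs(sy - dy)


addr = 0x12345
home = home_tile(addr)
print(f"block {addr:#x} -> home tile {home}, {hops(0, home)} hops from tile 0")
# prints "block 0x12345 -> home tile 13, 4 hops from tile 0"
```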

23 Comparison Schemes. 1. Sparse: conventional sparse directory. 2. PDC: deactivating coherence for private blocks [Cuesta:isca11]. Blocks are classified as private or shared at coarse (page) granularity. On a miss to a private block, the coherence protocol is not invoked, so private blocks are not tracked by the directory. A recovery mechanism handles pages that go from private to shared. PDC is an OS-supported technique. All schemes use the same sharer-vector encoding and the same associativity (same as L2).
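PDC's private/shared classification behaves like a small state machine per page. The sketch below models it with a first-touch policy and 4 KB pages; both are assumptions for illustration, the class name is hypothetical, and the dict stands in for the state that the OS-supported technique actually maintains.

```python
PAGE_BYTES = 4096  # assumption: 4 KB pages


class PageClassifier:
    """Hypothetical first-touch private/shared page classification, in the
    spirit of PDC. A dict stands in for OS-managed page state."""

    def __init__(self):
        self.owner = {}  # page number -> owning core, or None once shared

    def access(self, addr: int, core: int) -> str:
        page = addr // PAGE_BYTES
        if page not in self.owner:
            self.owner[page] = core  # first touch: page becomes private to `core`
            return "private (first touch): not tracked by the directory"
        owner = self.owner[page]
        if owner is None:
            return "shared: tracked by the directory"
        if owner == core:
            return "private: not tracked by the directory"
        self.owner[page] = None  # a second core touched the page
        return "private -> shared: recovery needed"


pc = PageClassifier()
for addr, core in ((0x1000, 0), (0x1040, 0), (0x1080, 1), (0x10C0, 0)):
    print(f"{addr:#x} core {core}: {pc.access(addr, core)}")
```

Because classification is per page, one shared block makes every block in that page shared, which is the coarse-grain limitation that distinguishes PDC from Stash's per-entry decision.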

24 Cache Size vs. Miss Rate (ocean). [Figure: % miss-rate change for Sparse, PDC, and Stash versus directory provisioning ratio, down to 1/16x.]

25 Cache Size vs. Miss Rate (bodytrack). [Figure: % miss-rate change for Sparse, PDC, and Stash versus directory provisioning ratio, down to 1/16x.]

26 Cache Size vs. Miss Rate (canneal). [Figure: % miss-rate change for Sparse, PDC, and Stash versus directory provisioning ratio, down to 1/16x.]

27 Cache Performance. [Figure: normalized miss rate for directory sizes 2x, 1x, 1/2x, and 1/4x.]

28 Cache Performance. [Figure: normalized miss rate for Sparse at 2x, 1x, 1/2x, and 1/4x, and for Stash at 1/4x.] With a 1/4x directory, Stash improves execution time by 16% on average. It performs similarly to Sparse-2x while being 8 times smaller. False misses are few (<6% of directory misses).
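The "8 times smaller" comparison is simply the ratio of the two provisioning levels, assuming both are measured against the same 1x baseline (aggregate private L2 tags):

```latex
\frac{\text{size}(\text{Sparse-}2\text{x})}{\text{size}(\text{Stash-}1/4\text{x})}
  = \frac{2}{1/4} = 8
```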

29 Scalability. Area: can Stash remain small (1/4x)? Bandwidth: can Stash remain bandwidth efficient? [Figure: normalized miss rate and normalized bandwidth versus core count for Sparse, PDC, and Stash.]

30 Conclusion. Stash inherits the power efficiency of sparse directories. It reduces directory size requirements significantly. It provides a transparent optimization, independent of system software, core type, and core count. It leverages a shared, on-chip last-level cache.

31 Thank you for your attention!

Stash Directory: A Scalable Directory for Many-Core Coherence

Socrates Demetriades and Sangyeun Cho
Computer Science Department, University of Pittsburgh; Memory Division, Samsung Electronics Co.
{socrates,cho}@cs.pitt.edu