Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches


Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
Nikos Hardavellas, Michael Ferdman, Babak Falsafi, Anastasia Ailamaki
Carnegie Mellon and EPFL

Data Placement in Distributed Caches
Each tile holds a cache slice; data placement determines performance.
Goal: place data on chip close to where they are used.

Prior Work
Several proposals for CMP cache management: ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA... but they suffer from shortcomings: complex, high-latency lookup/coherence; don't scale; lower effective cache capacity; optimize only for a subset of accesses.
We need: a simple, scalable mechanism for fast access to all data.

Our Proposal: Reactive NUCA
Cache accesses can be classified at run-time; each class is amenable to a different placement.
Per-class block placement: simple, scalable, transparent; no need for HW coherence mechanisms at the LLC.
Avg. speedup of 6% and 14% over shared and private; up to 32% speedup; -5% on avg. from an ideal cache organization.
Rotational interleaving: data replication and fast single-probe lookup.

Outline
Introduction; Access Classification and Block Placement; Reactive NUCA Mechanisms; Evaluation; Conclusion

Terminology: Data Types
Blocks are classified by which cores access them and how: a block read or written by a single core is Private; a block only read by multiple cores is Shared Read-Only; a block read and written by multiple cores is Shared Read-Write.

Conventional Multicore Caches
Shared: address-interleave blocks across slices; + high effective capacity; - slow access.
Private: each block cached locally; + fast access (local); - low capacity (replicas); coherence via indirection (distributed directory).
We want: high capacity (shared) + fast access (private).

Where to Place the Data?
Close to where they are used!
Accessed by a single core: migrate locally.
Accessed by many cores: replicate (?). If read-only, replication is OK; if read-write, coherence is a problem. Low reuse: evenly distribute across sharers.
[Figure: placement policy as a function of read-write behavior and number of sharers: read-only data migrate or replicate; read-write data are shared (address-interleaved).]
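The per-class placement policy above is simple enough to state as a direct lookup; a minimal sketch in C, where the enum and function names are illustrative rather than taken from the talk:

```c
#include <stdbool.h>

/* Illustrative per-class placement policy; names are hypothetical. */
typedef enum { PRIVATE_DATA, SHARED_READ_ONLY, SHARED_READ_WRITE } access_class;
typedef enum {
    PLACE_LOCAL_SLICE,     /* migrate to the requesting core's slice          */
    PLACE_REPLICATE,       /* replicate in a cluster of nearby slices         */
    PLACE_ADDR_INTERLEAVE  /* spread across all slices by address             */
} placement;

placement place(access_class c)
{
    switch (c) {
    case PRIVATE_DATA:      return PLACE_LOCAL_SLICE;     /* one core accesses it    */
    case SHARED_READ_ONLY:  return PLACE_REPLICATE;       /* e.g., instructions      */
    case SHARED_READ_WRITE: return PLACE_ADDR_INTERLEAVE; /* coherence-free sharing  */
    }
    return PLACE_ADDR_INTERLEAVE;
}
```

Because Reactive NUCA knows the class of each page at access time, this mapping needs no runtime monitoring or adaptation hardware.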

Methodology
Flexus: full-system cycle-accurate timing simulation.
Workloads -- OLTP: TPC-C on IBM DB2 v8 and Oracle 10g; DSS: TPC-H queries 6, 8, 13 on IBM DB2 v8; Web: SPECweb99 on Apache 2.0; Multiprogrammed: SPEC CPU2000; Scientific: em3d.
Model parameters -- Tiled CMP; server/scientific workloads: 16 cores, 1MB LLC slice per core; multiprogrammed workloads: 8 cores, 3MB per core; OoO cores, 2GHz, 96-entry ROB; folded 2D torus, 2-cycle router, 1-cycle link; 45ns memory.

Cache Access Classification Example
Each bubble: cache blocks shared by x cores. Bubble size is proportional to % of accesses; the y-axis shows the % of blocks in the bubble that are read-write.
[Figure: % read-write blocks per bubble vs. number of sharers, for instructions, private data, and shared data.]

Cache Access Clustering
[Figures: % read-write blocks per bubble vs. number of sharers for server apps and for scientific/multiprogrammed apps; private data cluster at "migrate locally", shared read-write data at "share (address-interleave)", and instructions at "replicate".]
Accesses naturally form 3 clusters.

Instruction Replication
The instruction working set is too large for one cache slice.
Distribute instructions within a cluster of neighbors, replicate across clusters.

Outline
Introduction; Access Classification and Block Placement; Reactive NUCA Mechanisms; Evaluation; Conclusion

Rotational Interleaving
Each slice has a log2(k)-bit rotational ID (RID). Size-4 clusters: local slice + 3 neighbors.
Fast access (nearest-neighbor, simple lookup).
Balance access latency with capacity constraints: equal capacity pressure at overlapped slices.

Rotational Interleaving: lookup
Example: PC 0xfa480. Destination = (Addr + RID + 1) & (n - 1).
Fast access (nearest-neighbor, simple lookup); balances access latency with capacity constraints; equal capacity pressure at overlapped slices.

Rotational Interleaving: capacity
Each slice caches the same blocks on behalf of any cluster it belongs to.
Fast access (nearest-neighbor, simple lookup); balances access latency with capacity constraints; equal capacity pressure at overlapped slices.
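With a power-of-two cluster size n, the destination computation on the slide is a single add-and-mask. A sketch in C, with assumed operand names (addr_bits = the log2(n) address bits used for interleaving, rid = the requesting slice's rotational ID):

```c
#include <stdint.h>

/* Rotational-interleaving lookup as written on the slide:
 *   Destination = (Addr + RID + 1) & (n - 1)
 * for a power-of-two cluster size n. The result selects one of the n slices
 * in the requesting core's cluster. Operand names are illustrative. */
static inline uint32_t ri_destination(uint32_t addr_bits, uint32_t rid, uint32_t n)
{
    return (addr_bits + rid + 1) & (n - 1);
}

/* Example: with size-4 clusters (n = 4), a block whose interleaving bits are 2,
 * requested from a slice with RID 1, maps to slice (2 + 1 + 1) & 3 = 0. */
```

The single probe follows directly: the requester computes the destination locally from its own RID and the address bits, so no directory or multi-slice search is needed.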

Classification Mechanisms
Instruction classification: all accesses from the L1-I.
Per-page classification for data: at TLB miss, utilizing the OS page table and TLB for bookkeeping.
Page classification is accurate (<0.5% error).
Example: on core i's 1st access to page A (TLB miss), the OS marks A as private to i; on a later access by core j (TLB miss), the OS re-marks A as shared.

Coherence: No Need for HW Mechanisms at the LLC
Reactive NUCA's placement guarantee: each read-write datum resides in a unique and known location (shared data: address-interleaved; private data: local slice).
Fast access, eliminates HW overhead.
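The OS-side bookkeeping in the example above amounts to a small state machine run at TLB-miss time. A hedged sketch in C (struct and function names are hypothetical; the TLB shootdown of the previous owner is only noted in a comment):

```c
#include <stdint.h>

typedef enum { CLASS_INVALID, CLASS_PRIVATE, CLASS_SHARED } page_class;

/* Hypothetical per-page classification record kept alongside the OS page table. */
typedef struct {
    page_class cls;
    int        owner;  /* core that first touched the page; valid while PRIVATE */
} page_info;

/* Invoked by the OS on a TLB miss by core `core` for page `p` (sketch). */
void classify_on_tlb_miss(page_info *p, int core)
{
    if (p->cls == CLASS_INVALID) {
        /* 1st access: the page becomes private to the accessing core. */
        p->cls   = CLASS_PRIVATE;
        p->owner = core;
    } else if (p->cls == CLASS_PRIVATE && p->owner != core) {
        /* Access by another core: re-classify as shared. The OS would also
         * invalidate the previous owner's TLB entry so that it picks up the
         * new classification (and placement) on its next access. */
        p->cls = CLASS_SHARED;
    }
    /* The resulting class (P/S) is delivered to the core in its TLB entry. */
}
```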

Evaluation
[Figure: speedup over private for ASR (A), Shared (S), R-NUCA (R), and Ideal (I) on OLTP DB2, Apache, DSS Qry6/Qry8/Qry13, em3d (private-averse workloads) and OLTP Oracle, MIX (shared-averse workloads).]
R-NUCA delivers robust performance across workloads.
vs. Shared: same for Web and DSS; 17% for OLTP and MIX.
vs. Private: 17% for OLTP, Web, and DSS; same for MIX.

Conclusions
Reactive NUCA: near-optimal block placement and replication in distributed caches.
Cache accesses can be classified at run-time; each class is amenable to a different placement.
Reactive NUCA places each class appropriately: simple, scalable, low-overhead, transparent; obviates HW coherence mechanisms for the LLC.
Rotational interleaving: replication + fast lookup (neighbors, single probe).
Robust performance across server workloads; near-optimal placement (-5% avg. from ideal).

Questions?
Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University
The Flexus full-system simulator is publicly available.

Backup Slides: Classification and Lookup

Classification Granularity: OS Page
Per-block classification: high area/latency overhead.
Per-page classification utilizes the OS page table; the core consults it for every access anyway (via the TLB).
Page classification is accurate (<0.5% error).
TLB entry: P/S (1 bit), vpage, ppage. Page table entry: P/S/I (2 bits), id (log2(n) bits), vpage, ppage.
Page granularity allows simple and practical HW.

Data Class Bookkeeping
Private data: place in the local slice. Page table entry: P, id, vpage, ppage; TLB entry: P, vpage, ppage.
Shared data: place in the aggregate cache (address-interleaved). Page table entry: S, id, vpage, ppage; TLB entry: S, vpage, ppage.
Physical address: tag, id, cache index, offset.
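Given the P/S bit delivered in the TLB entry, choosing the destination LLC slice is a two-way decision, as the bookkeeping above implies. A sketch in C (the bit positions and the 16-slice configuration are assumptions for illustration):

```c
#include <stdint.h>

#define LOG2_SLICES 4   /* 16 slices, matching the server configuration above */

/* Select the LLC slice for a request, using the P/S bit from the TLB entry.
 * Private pages go to the requester's local slice; shared pages use the "id"
 * field of the physical address (the bits just above cache index + offset),
 * i.e., address interleaving across the aggregate cache. Bit positions are
 * illustrative. */
static inline uint32_t llc_slice(uint64_t paddr, int is_shared,
                                 uint32_t local_slice, uint32_t id_shift)
{
    if (!is_shared)
        return local_slice;
    return (uint32_t)(paddr >> id_shift) & ((1u << LOG2_SLICES) - 1);
}
```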

Data Classification and Lookup
Core i loads A and misses in the TLB; the OS allocates the page entry and marks it P (private to i) with its vpage/ppage mapping.
When core j later loads A and misses in the TLB, the OS re-classifies the page as S (shared), invalidates core i's TLB entry, evicts A from core i's slice, and replies to core j with the new mapping.

Data Classification and Lookup (cont.)
Once the page is marked S, all cores use the shared mapping: fast and simple lookup for data.

Misclassifications at Page Granularity
[Figure: fraction of total accesses from pages holding a single class, instructions+data, or private+shared data, and fraction of accesses misclassified (e.g., private data treated as shared), for OLTP DB2, OLTP Oracle, Apache, DSS Qry6/Qry8/Qry13, em3d, MIX.]
A page may service multiple access types, but one type always dominates its accesses.
Classification at page granularity is accurate.

Backup Slides: Placement

Private Data Placement
[Figure: CDF of total accesses vs. private-data footprint (KB, up to 1,048,576) for each workload.]
Spill to neighbors if the working set is too large? No: each core runs similar threads.
Store private data in the local slice (as in a private cache).

Shared Data Placement
[Figures: CDF of total accesses vs. shared-data footprint (KB), and breakdown of accesses by reuse (1st, 2nd, 3rd-4th, 5th-8th, 9+ access) per workload.]
Shared data are read-write, have a large working set, and exhibit low reuse: a block is unlikely to remain in the local slice long enough for reuse, and the next sharer is random [WMPI'04].
Address-interleave shared data across the aggregate cache (as in a shared cache).

Instruction Placement
[Figures: CDF of total accesses vs. instruction footprint (KB), and breakdown of accesses by reuse per workload.]
The instruction working set is too large for one slice, and slices store private and shared data too; 4 slices provide sufficient capacity.
Share instructions in clusters of neighbors, replicate across clusters.

Backup Slides: Detailed Evaluation

Cache Accesses Breakdown
Instructions and shared read-write data dominate the server workloads; private data dominate the scientific and multiprogrammed workloads.

CPI Breakdown

Impact of Eliminating Coherence

Impact of Private Allocation

Impact of Instruction Replication

Instruction Clustering

Off-chip Misses
[Figure: normalized CPI due to off-chip atomics, loads, and instruction fetches for Private (P), ASR (A), Shared (S), and R-NUCA (R) on OLTP DB2, Apache, DSS Qry6/Qry8/Qry13, em3d (private-averse workloads) and OLTP Oracle, MIX (shared-averse workloads).]

Evaluation: Speedup over Ideal
[Figure: speedup over ideal for Private (P), ASR (A), Shared (S), and R-NUCA (R) on OLTP DB2, Apache, DSS Qry6/Qry8/Qry13, em3d (private-averse workloads) and OLTP Oracle, MIX (shared-averse workloads).]
Near-optimal placement: -5% on avg. from ideal.
