EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE-STACKED DRAM CACHES


EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE-STACKED DRAM CACHES
MICRO 2011 @ Porto Alegre, Brazil, December 2011
Gabriel H. Loh [1] and Mark D. Hill [2][1]
[1] AMD Research  [2] University of Wisconsin-Madison
Hill's work largely performed while on sabbatical at [1].

EXECUTIVE SUMMARY
Good use of stacked DRAM is as a cache, but:
- Tags in stacked DRAM believed too slow
- On-chip tags too large (e.g., 96 MB for a 1 GB stacked DRAM cache)
Solution: put tags in stacked DRAM, but:
- Faster hits: schedule the tag & data stacked-DRAM accesses together
- Faster misses: on-chip MissMap bypasses stacked DRAM on misses
Result (e.g., 1 GB stacked DRAM cache w/ 2 MB on-chip MissMap):
- 29-67% faster than naïve tag+data in stacked DRAM
- Within 88-97% of a stacked DRAM cache w/ impractical on-chip tags

OUTLINE
- Motivation
- Fast Hits via Compound Access Scheduling
- Fast Misses via MissMap
- Experimental Results
- Related Work and Summary

CHIP STACKING IS HERE
[Figure: vertical stacking (DRAM layers atop the cores) vs. horizontal stacking (DRAM layers and cores side by side on a silicon interposer)]
256 MB, Samsung @ ISSCC '11: "A 1.2V 12.8Gb/s 2Gb Mobile Wide-I/O DRAM with 4x128 I/Os Using TSV-Based Stacking"

HOW TO USE STACKED MEMORY?
- Complete main memory: a few GB is too small for all but some embedded systems
- OS-managed NUMA memory: page-size fragmentation is an issue; requires OS-HW cooperation (across companies)
- Cache w/ conventional block (line) size, e.g., 64B (TAKE 1): but on-chip tags for a 1 GB cache are impractical: 96 MB!
- Sector/subblock cache: tag w/ 2KB block (sector) + state bits w/ each 64B subblock; tags+state fit on-chip, but fragmentation issues (see paper)

TAG+DATA IN DRAM (CONVENTIONAL BLOCKS TAKE 2)
- Use 2KB stacked-DRAM pages, but replace the 32 64B blocks with 29 tags (48b each) + 29 blocks
[Figure: DRAM bank with row decoder and sense amps; the 2KB row buffer (32 x 64-byte cachelines = 2048 bytes) holds the tags plus 29 ways of data]
- But previously dismissed as too slow
[Timing: DRAM tag lookup (one DRAM latency), data returned after a second DRAM latency, then tag updated; request latency spans the first two, total bank occupancy all three]
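
A quick back-of-the-envelope check (our arithmetic, not from the slide) that 29 tags plus 29 data blocks indeed fit in one 2KB row:

```python
# Sketch: verify that "29 tags (48b) + 29 blocks" fits in one 2KB row.
ROW_BYTES = 2048          # 32 x 64B cachelines per stacked-DRAM row
BLOCK_BYTES = 64
TAG_BITS = 48             # per-block tag/state bits, as on the slide

ways = 29
tag_bytes = ways * TAG_BITS // 8      # 29 * 6  = 174 bytes of tags
data_bytes = ways * BLOCK_BYTES       # 29 * 64 = 1856 bytes of data
used = tag_bytes + data_bytes         # 2030 bytes

print(f"tags={tag_bytes}B data={data_bytes}B used={used}B of {ROW_BYTES}B")
assert used <= ROW_BYTES              # fits, with 18 bytes to spare
```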

IMPRACTICAL IDEAL & OUR RESULT FORECAST
[Chart: performance of Tags in DRAM, Compound Access Scheduling, CAS + MissMap, and Ideal SRAM Tags; Compound Access Scheduling + MissMap approximates the impractical on-chip SRAM tags]
Methods later; average of Web-Index, SPECjbb05, TPC-C, & SPECweb05


FASTER HITS (CONVENTIONAL BLOCKS TAKE 3)
[Timing diagrams, not to scale:
- CPU-side SRAM tags: SRAM tag lookup, then one DRAM latency until data returned
- Tags in DRAM: DRAM tag lookup (one DRAM latency), data returned after a second DRAM latency, then tag updated
- Compound Access Scheduling: ACT, RD (tags), data xfer, tag check, then RD (data) at row-buffer-hit latency and WR (tag update); governed by tRCD, tCAS, and tRAS before PRE]

COMPOUND ACCESS SCHEDULING
- Reserve the bank for the data access; guarantee a row-buffer hit
- Approximately trading an SRAM lookup for a row-buffer hit:
  - SRAM tags: [SRAM lookup] ACT, RD (data)
  - Tags in DRAM: ACT, RD (tags), RD (data) from the already-open row
- On a miss, unnecessarily holds the bank open for the tag-check latency, preventing a tag lookup on another row in the same bank
- Effective penalty is minimal, since tRAS must elapse before closing this row, so the bank would be unavailable anyway
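
As a rough illustration of why the extra tag read is cheap once the row is open, here is a minimal latency sketch; the timing constants and helper names are our own illustrative assumptions, not the paper's simulator:

```python
# Illustrative model of compound access scheduling: one ACT opens the row,
# the tag read and the data read are scheduled back to back, and the data
# read is guaranteed to be a row-buffer hit because the bank stays reserved.
tRCD, tCAS, tBURST = 11, 11, 4   # cycles; assumed DDR-like values

def tags_in_dram_naive():
    # Two full, independently scheduled DRAM accesses (tag, then data).
    one_access = tRCD + tCAS + tBURST
    return 2 * one_access

def compound_access():
    # ACT + tag read, then the data read is just a row-buffer hit
    # (column access + burst only, no second ACT).
    tag_part = tRCD + tCAS + tBURST
    data_part = tCAS + tBURST
    return tag_part + data_part

print("naive tags-in-DRAM:", tags_in_dram_naive(), "cycles")
print("compound access   :", compound_access(), "cycles")
```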


FASTER MISSES (CONVENTIONAL BLOCKS TAKE 4)
- Want to avoid the delay & power of a stacked DRAM access on a miss
- Impractical on-chip tags answer both:
  - Q1 (Present): Is the block in the stacked DRAM cache?
  - Q2 (Where): Where in the stacked DRAM cache (set/way)?
- New on-chip MissMap: approximates impractical tags at practical cost
  - Answers Q1 (Present), but NOT Q2 (Where)

MISSMAP
- On-chip structure to answer Q1: Is the block in the stacked DRAM cache?
- Access flow: look up the address in the MissMap; on a MissMap hit, look up the tag in the DRAM cache and (on a hit) get the data from the DRAM cache; on a MissMap miss, go directly to main memory
- MissMap requirements:
  - Add a block on a miss (fill); remove a block on victimization
  - No false negatives: if it says "not present", the block must not be present
  - False positives allowed: if it says "present", the access may (rarely) still miss
- Sounds like a Bloom filter? But our implementation is precise, with no false negatives or positives: extreme subblocking with over-provisioning
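
A minimal software sketch of this lookup flow, assuming hypothetical missmap / dram_cache / main_memory objects (the real MissMap is a hardware structure):

```python
# Sketch of the MissMap lookup flow (hypothetical object names).
def access(addr, missmap, dram_cache, main_memory):
    if not missmap.maybe_present(addr):
        # MissMap says "not present": guaranteed DRAM-cache miss
        # (no false negatives), so bypass the stacked DRAM entirely.
        return main_memory.read(addr)
    # MissMap says "present": probe the stacked DRAM cache
    # (tag + data scheduled together via compound access scheduling).
    hit, data = dram_cache.lookup(addr)
    if hit:
        return data
    # With the paper's precise MissMap this path should never be taken;
    # it is the fallback a false-positive-tolerant design would need.
    return main_memory.read(addr)
```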

MISSMAP IMPLEMENTATION
- MissMap entry = tag + bit vector; e.g., a tag plus 16 bits tracks a 1KB memory segment (one bit per 64B line)
- Installing a line in the DRAM $: set the corresponding bit in its segment's vector (e.g., bit 7 for line X[7])
- Evicting a line from the DRAM $: clear the corresponding bit (e.g., bit 3 for line Y[3])
- Key 1: Extreme subblocking
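
A small software model of the per-segment bookkeeping the slide describes, using the 1KB-segment / 16-bit-vector example; the class and method names are ours:

```python
# Software model of MissMap entries: one tag per memory segment plus a
# presence bit per 64B line in that segment (1KB segment = 16 bits here).
LINE_BYTES = 64
SEGMENT_BYTES = 1024
LINES_PER_SEGMENT = SEGMENT_BYTES // LINE_BYTES   # 16 bits per entry

class MissMap:
    def __init__(self):
        self.entries = {}    # segment tag -> presence bit vector

    def _split(self, addr):
        seg = addr // SEGMENT_BYTES
        bit = (addr % SEGMENT_BYTES) // LINE_BYTES
        return seg, bit

    def install(self, addr):          # line filled into the DRAM cache
        seg, bit = self._split(addr)
        self.entries.setdefault(seg, 0)
        self.entries[seg] |= (1 << bit)

    def evict(self, addr):            # line victimized from the DRAM cache
        seg, bit = self._split(addr)
        if seg in self.entries:
            self.entries[seg] &= ~(1 << bit)
            if self.entries[seg] == 0:
                del self.entries[seg] # no lines left: drop the entry

    def maybe_present(self, addr):    # Q1 (Present), but not Q2 (Where)
        seg, bit = self._split(addr)
        return bool(self.entries.get(seg, 0) & (1 << bit))
```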

MISSMAP IMPLEMENTATION (CONT.)
- Vs. a subblocked cache: tags only for cached (large) blocks, poor cache efficiency, data not cached due to fragmentation
- Key 2: Over-provisioning; few bits of each vector are likely set
- Key 3: Answer Q1 (Present), NOT Q2 (Where): an entry is 36b tag + 64b vector = 100b, NOT 36b tag + 5x64b vector = 356b (3.6x)
- Example: 2MB MissMap with 4KB segments; each entry is ~12.5 bytes (36b tag, 64b vector); ~167,000 entries total; best case, tracks ~640MB
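
The sizing example works out roughly as follows (simple arithmetic; the slide quotes ~167,000 entries and ~640MB of reach):

```python
# Reproduce the 2MB MissMap sizing example: 36b tag + 64b vector per 4KB
# segment, i.e. ~12.5 bytes per entry. The raw arithmetic gives ~655 MiB of
# best-case reach; the slide rounds this to "~640MB", presumably allowing
# for some structural overhead.
MISSMAP_BYTES = 2 * 1024 * 1024
ENTRY_BITS = 36 + 64                  # tag + presence vector
entry_bytes = ENTRY_BITS / 8          # 12.5 bytes

entries = MISSMAP_BYTES / entry_bytes # ~167,772 entries
reach_bytes = entries * 4096          # each entry covers one 4KB segment

print(f"entries ~= {entries:,.0f}")
print(f"best-case reach ~= {reach_bytes / 2**20:,.0f} MiB")
```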


METHODOLOGY (SEE PAPER FOR DETAILS)
- Workloads (footprint): Web-Index (2.98 GB), SPECjbb05 (1.20 GB), TPC-C (1.03 GB), SPECweb05 (1.02 GB)
- Base target system:
  - 8 cores @ 3.2 GHz, 1 IPC peak, each w/ 2-cycle 2-way 32KB I$ + D$
  - 10-cycle 8-way 2MB L2 per 2 cores + 24-cycle 16-way 8MB shared L3
  - Off-chip DRAM: DDR3-1600, 2 channels
- Enhanced target system:
  - 12-way 6MB shared L3 + 2MB MissMap
  - Stacked DRAM: 4 channels, 2x frequency (~1/2 latency), 2x bus width
- gem5 simulation infrastructure (= Wisconsin GEMS + Michigan M5)
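
For reference, the two simulated configurations collected as plain data (field names are ours; this is not a gem5 configuration script):

```python
# Slide's simulated-system parameters as plain data (assumed field names).
workloads_gb = {"Web-Index": 2.98, "SPECjbb05": 1.20,
                "TPC-C": 1.03, "SPECweb05": 1.02}

base_system = {
    "cores": 8, "freq_ghz": 3.2, "peak_ipc": 1,
    "l1": {"size_kb": 32, "ways": 2, "cycles": 2, "split": "I + D"},
    "l2": {"size_mb": 2, "ways": 8, "cycles": 10, "cores_per_l2": 2},
    "l3": {"size_mb": 8, "ways": 16, "cycles": 24, "shared": True},
    "offchip_dram": {"type": "DDR3-1600", "channels": 2},
}

# Enhanced system: same cores/L1/L2, repartitioned L3 plus stacked DRAM.
enhanced_overrides = {
    "l3": {"size_mb": 6, "ways": 12, "shared": True},
    "missmap_mb": 2,
    "stacked_dram": {"channels": 4, "frequency": "2x",
                     "latency": "~1/2", "bus_width": "2x"},
}
enhanced_system = {**base_system, **enhanced_overrides}
```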

KEY RESULT: COMPOUND SCHEDULING + MISSMAP WORK
[Chart: performance of Tags in DRAM, Compound Access Scheduling, CAS + MissMap, and Ideal SRAM Tags; Compound Access Scheduling + MissMap approximates the impractical on-chip SRAM tags]

2ND KEY RESULT: OFF-CHIP CONTENTION REDUCED
[Chart: off-chip DRAM behavior for Base vs. 128MB, 256MB, 512MB, and 1024MB DRAM caches]
- For requests that miss, main memory is more responsive
- Fewer requests -> lower queuing delay
- Fewer requests -> more row-buffer hits -> lower DRAM latency

OTHER RESULTS IN PAPER
- Impact on all off-chip DRAM traffic (activate, read, write, precharge)
- Dynamic active memory footprint of the DRAM cache
- Additional traffic due to MissMap evictions
- Cacheline vs. MissMap lifetimes
- Sensitivity to how the L3 is divided between data and the MissMap
- Sensitivity to MissMap segment size
- Performance against sub-blocked caches


RELATED WORK
- Stacked DRAM as main memory: mostly assumes all of main memory can be stacked [Kgil+ ASPLOS'06, Liu+ IEEE D&T'05, Loh ISCA'08, Woo+ HPCA'10]
- As a large cache: mostly assumes tag-in-DRAM latency is too costly [Dong+ SC'10, Ghosh+ MICRO'07, Jiang+ HPCA'10, Loh MICRO'09, Zhao+ ICCD'07]
- Other stacked approaches (NVRAM, hybrid technologies, etc.) [Madan+ HPCA'09, Zhang/Li PACT'09]
- MissMap related:
  - Subblocking [Liptay IBMSysJ'68, Hill/Smith ISCA'84, Seznec ISCA'94, Rothman/Smith ICS'99]
  - Density Vector for prefetch suppression [Lin+ ICCD'01]
  - Coherence optimization [Moshovos+ HPCA'01, Cantin+ ISCA'05]

EXECUTIVE SUMMARY
Good use of stacked DRAM is as a cache, but:
- Tags in stacked DRAM believed too slow
- On-chip tags too large (e.g., 96 MB for a 1 GB stacked DRAM cache)
Solution: put tags in stacked DRAM, but:
- Faster hits: schedule the tag & data stacked-DRAM accesses together
- Faster misses: on-chip MissMap bypasses stacked DRAM on misses
Result (e.g., 1 GB stacked DRAM cache w/ 2 MB on-chip MissMap):
- 29-67% faster than naïve tag+data in stacked DRAM
- Within 88-97% of a stacked DRAM cache w/ impractical on-chip tags

Trademark Attribution: AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. © 2011 Advanced Micro Devices, Inc. All rights reserved.

BACKUP SLIDES

UNIQUE PAGES IN L4 VS. MISSMAP REACH
- Example: 70% of the time, a 256MB cache held ~90,000 or fewer unique pages

IMPACT ON OFF-CHIP DRAM ACTIVITY

MISSMAP EVICTION TRAFFIC
- Many MissMap evictions correspond to clean pages (i.e., no writeback traffic from the L4)
- By the time a MissMap entry is evicted, most of its cachelines are long since dead/evicted

SENSITIVITY TO MISSMAP VS. DATA ALLOCATION OF L3
- 2MB MissMap + 6MB data provides good performance
- 3MB MissMap + 5MB data is slightly better, but can hurt server workloads that are more sensitive to L3 capacity

SENSITIVITY TO MISSMAP SEGMENT SIZE
- 4KB segment size works best
- Our simulations use physical addresses, so consecutive virtual pages can map to arbitrary physical pages

COMPARISON TO SUB-BLOCKED CACHE
- Beyond 128MB, the sub-blocked cache's overhead is greater than the MissMap's
- At the largest sizes (512MB, 1GB), a sub-blocked cache delivers similar performance to our approach, but at substantially higher cost

BENCHMARK FOOTPRINTS
- TPC-C: ~80% of accesses served by the hottest 128MB worth of pages
- SPECweb05: ~80% of accesses served by 256MB
- SPECjbb05: ~80% of accesses served by 512MB
- Web-Index: huge active footprint