A Comprehensive Analytical Performance Model of DRAM Caches

A Comprehensive Analytical Performance Model of DRAM Caches
Authors: Nagendra Gulur*, Mahesh Mehendale*, and R. Govindarajan+
(*Texas Instruments, +Indian Institute of Science)
Presented by: Sreepathi Pai (University of Texas, Austin)
6th ACM/SPEC International Conference on Performance Engineering (ICPE), 2015

Talk Outline
- Introduction to stacked DRAM caches
- Background (an overview of ANATOMY)
- ANATOMY-Cache: modeling stacked DRAM cache organizations
- Evaluation
- Insights
- Conclusions
ANATOMY: An Analytical Model of System Performance (published at the 2014 ACM International Conference on Measurement and Modeling of Computer Systems)

Stacked DRAM
- DRAM vertically stacked over the processor die.
- Stacked DRAMs offer high bandwidth, high capacity, and moderately low latency.
- Several proposals organize this large DRAM as a last-level cache.
(Picture courtesy Bryan Black, from the MICRO 2013 keynote.)

Processor Organization with a DRAM Cache
[Figure: N cores, each with L1I/L1D caches, share an L2 (LLSC). Tag metadata resides either on SRAM or on the DRAM itself, consulted through a tag predictor. Hits are served by the vertically stacked DRAM cache; misses go to the off-chip main memory controller.]

Talk Outline (recap). Next: Background (an overview of ANATOMY)

Overview of a DRAM-based Memory
[Figure: the memory controller drives control, address, and data signals to a DIMM. Each DIMM holds ranks of DRAM devices; each device contains multiple banks with bank logic, and each bank is organized as rows and columns with a row buffer servicing read and write operations.]

Basic DRAM Operations
- ACTIVATE: bring data from the DRAM core into the row buffer.
- READ/WRITE: perform read/write operations on the contents of the row buffer.
- PRECHARGE: store data back into the DRAM core (ACTIVATE discharges the capacitors) and return the cells to a neutral voltage.
[Figure: a request stream of misses (M) and hits (H) mapped onto PRE/ACT/RD command sequences across banks, illustrating bank-level parallelism (BLP).]
- Row-buffer hits (RBH) are faster and consume less power.
- Parallelism (BLP) improves performance.
- Some switching delays hurt performance.

ANATOMY: An Analytical Model of System Performance
Two components:
1) A queuing model of the memory system's organizational and technological characteristics, taking workload characteristics as input.
2) Workload characterization: the locality and parallelism in the workload's memory accesses.

Analytical Model for System Performance
The memory system is modeled as a network of M/D/1 queues: an address-bus server, multiple bank servers (Bank Server 1..N), and a data-bus server. For an M/D/1 queue with arrival rate λ, service rate µ, and utilization ρ = λ/µ, the queuing delay is Q = ρ/(2µ(1-ρ)).
Service times:
- Address bus (1/µ_addr): (RBH*1 + (1-RBH)*3) * BUS_CYCLE_TIME
- Bank (1/µ_bank): t_CL * RBH + (t_CL + t_PRE + t_RCD) * (1-RBH)
- Data bus (1/µ_data): Burst_Length * BUS_CYCLE_TIME
Latency = Q_addr + Q_bank + Q_data + 1/µ_addr + 1/µ_bank + 1/µ_data
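To make the pieces concrete, here is a minimal Python sketch of this queuing network. The parameter values in the example are illustrative assumptions, not the paper's configuration, and spreading arrivals evenly across banks is a simplification of ANATOMY's BLP treatment.

```python
def md1_wait(lam, mu):
    """Queuing delay of an M/D/1 server: Q = rho / (2*mu*(1 - rho))."""
    rho = lam / mu
    assert rho < 1, "server must not be saturated"
    return rho / (2 * mu * (1 - rho))

def memory_latency(lam, rbh, bus_cycle, t_cl, t_pre, t_rcd, burst_len, n_banks):
    # Deterministic service times of the three server types (from the slide)
    s_addr = (rbh * 1 + (1 - rbh) * 3) * bus_cycle
    s_bank = t_cl * rbh + (t_cl + t_pre + t_rcd) * (1 - rbh)
    s_data = burst_len * bus_cycle
    # Assumption: requests spread evenly, so each bank sees lam / n_banks
    q_addr = md1_wait(lam, 1 / s_addr)
    q_bank = md1_wait(lam / n_banks, 1 / s_bank)
    q_data = md1_wait(lam, 1 / s_data)
    # Latency = queuing delays + service times of all three servers
    return q_addr + q_bank + q_data + s_addr + s_bank + s_data

# Example with assumed values: timings in bus cycles, arrivals per cycle
print(memory_latency(lam=0.02, rbh=0.6, bus_cycle=1, t_cl=11,
                     t_pre=11, t_rcd=11, burst_len=4, n_banks=8))
```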

Validation: Model Accuracy
[Figure: average % error (within ±12.5%) in latency, RBH, and BLP estimates across workloads E1 through E15, plus the average.]
Low errors in RBH, BLP, and latency estimation: average errors of 3.9%, 4.2%, and 4%, respectively. ANATOMY predicts trends accurately.

Talk Outline (recap). Next: ANATOMY-Cache: modeling stacked DRAM cache organizations

ANATOMY-Cache Model
[Figure: processor with stacked DRAM; the L2 (LLSC) consults a tag predictor, hits are served by the vertically stacked DRAM cache, and misses go to the off-chip main memory controller.]
Key parameters that govern performance:
- Arrival rate
- Tag access time
- Cache hit rate
- Cache RBH
- Cache miss penalty

Extending ANATOMY to DRAM Caches
- Two ANATOMY instances: one for the DRAM cache (ANATOMY-Cache) and one for main memory (ANATOMY-Mem).
- The models are fed by the output of the tag server and by each other's outputs:
  - Predicted hits and unpredicted requests go to ANATOMY-Cache.
  - Predicted misses go directly to ANATOMY-Mem.
  - Line fills and writeback requests involving main memory also load the cache; the cache in turn sends its misses, line fills, and writebacks to main memory.
- We compute the latencies at the cache (L_Cache) and at memory (L_Mem) using ANATOMY.

Obtaining the Average LLSC Miss Penalty
L_Cache and L_Mem are combined to estimate the average LLSC miss penalty. But first we discuss the estimation of the key parameters that govern L_Cache and L_Mem.

Estimating Key Parameters
[Figure: processor with stacked DRAM, as before.]
- Arrival rate
- Tag access time
- Cache hit rate
- Cache RBH
- Cache miss penalty

Estimating the Cache Arrival Rate
The arrival rate at the cache is the sum of several streams of accesses:
- Predicted hits
- Requests with no prediction
- Line fills and writebacks

Summarizing the Cache Arrival Rate

Request stream   | Rate                 | Notes
Predicted hits   | λ*h_pred*h_cache     |
No predictions   | λ*(1-h_pred)         | Sent to the cache for tag look-up
Line fills       | λ*(1-h_cache)*B_s    | B_s is the cache block size
Writebacks       | λ*(1-h_cache)*w      | w is the fraction of misses that cause writebacks

λ_cache = λ*h_pred*h_cache + λ*(1-h_pred) + λ*(1-h_cache)*B_s + λ*(1-h_cache)*w
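As a sanity check, the composition above translates directly into code. This is a minimal sketch; the symbols follow the slide, and reading B_s as the number of cache accesses generated per line fill is my interpretation rather than something the slide spells out.

```python
def cache_arrival_rate(lam, h_pred, h_cache, B_s, w):
    predicted_hits = lam * h_pred * h_cache  # predicted hits go to the cache
    unpredicted = lam * (1 - h_pred)         # sent to the cache for tag look-up
    line_fills = lam * (1 - h_cache) * B_s   # fills triggered by cache misses
    writebacks = lam * (1 - h_cache) * w     # fraction w of misses write back
    return predicted_hits + unpredicted + line_fills + writebacks

# Example with assumed values
print(cache_arrival_rate(lam=0.02, h_pred=0.95, h_cache=0.80, B_s=1, w=0.3))
```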

Estimating Tag Predictor Hit Rate and Access Time
- Tags-on-SRAM: all tags on SRAM; hit rate = 100%.
- Tags-on-DRAM: a small set-associative tag cache; hit rate determined by running an access trace through the cache model.
- Predictor access time depends on its size; an estimate is obtained using CACTI.

Estimating Cache Hit Rate
Cache hit rate depends on three key parameters: cache size, set associativity, and block size. This is a well-studied problem, addressed with a trace-based model and reuse-distance analysis. We use a trace of accesses (misses) from the LLSC.
[Figure: cache size, associativity, and the LLSC miss trace feed a reuse-distance analysis that outputs the hit rate.]
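To give a flavor of how reuse-distance analysis yields a hit rate, here is a small sketch for a fully associative LRU cache. It is illustrative only: the paper's analysis also accounts for associativity and block size, which this sketch ignores.

```python
from collections import OrderedDict

def hit_rate_from_trace(trace, num_blocks):
    """Fraction of accesses whose LRU stack distance is < num_blocks."""
    stack = OrderedDict()  # most-recently-used block is last
    hits = 0
    for block in trace:
        if block in stack:
            # Stack distance = distinct blocks touched since the last use
            dist = len(stack) - list(stack).index(block) - 1
            if dist < num_blocks:
                hits += 1
            stack.move_to_end(block)
        else:
            stack[block] = True  # cold miss: first reference to this block
    return hits / len(trace)

# Example: a toy LLSC miss trace of block addresses
print(hit_rate_from_trace([1, 2, 3, 1, 2, 4, 1], num_blocks=3))
```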

Estimating Cache Hit Rate with Block Size
Larger block sizes can capture spatial locality. Bandwidth-neutral model: the cache miss rate halves with each doubling of the cache block size.
- If this holds, measuring the hit rate at the smallest block size via trace-based analysis is sufficient.
- For larger block sizes, estimate via the implied scaling MR(B) = MR(B_min) * B_min / B.
[Figure: workload Q5 is bandwidth-neutral.]
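Under the bandwidth-neutral assumption the scaling is a one-liner; the formula below is inferred from the slide's halving statement rather than quoted from the paper.

```python
def bandwidth_neutral_miss_rate(mr_base, b_base, b):
    """Scale a miss rate measured at block size b_base up to block size b."""
    return mr_base * b_base / b  # halves each time the block size doubles

print(bandwidth_neutral_miss_rate(mr_base=0.20, b_base=64, b=256))  # 0.05
```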

Not All Workloads Are Bandwidth-Neutral
For such workloads, the bandwidth-neutral model predicts a lower miss rate than actually occurs. We use trace-based cache simulations in such cases.
[Figure: workload Q22 is NOT bandwidth-neutral.]

Estimating DRAM Cache Row-Buffer Hit Rate
The row-buffer hit rate (RBH) of the DRAM cache depends on the access pattern and on the data organization in the DRAM. We estimate RBH using the reuse-distance framework, similar to ANATOMY. Details are in the paper.

Putting Them Together
LLSC miss penalty from L_Cache and L_Mem:

Event                        | Latency contribution
Predicted cache hit          | A = h_pred * h_cache * L_Cache
Predicted cache miss         | B = h_pred * (1-h_cache) * L_Mem
No prediction and cache hit  | C = (1-h_pred) * h_cache * L_Cache
No prediction and cache miss | D = (1-h_pred) * (1-h_cache) * (L_Cache + L_Mem)

L_Avg = A + B + C + D
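The table translates directly into a small helper; tag-access time is omitted here, as it is on the slide.

```python
def avg_llsc_miss_penalty(h_pred, h_cache, L_cache, L_mem):
    A = h_pred * h_cache * L_cache                        # predicted hit
    B = h_pred * (1 - h_cache) * L_mem                    # predicted miss: straight to memory
    C = (1 - h_pred) * h_cache * L_cache                  # unpredicted, hits in cache
    D = (1 - h_pred) * (1 - h_cache) * (L_cache + L_mem)  # unpredicted, misses in cache
    return A + B + C + D

# Example with assumed latencies in cycles
print(avg_llsc_miss_penalty(h_pred=0.95, h_cache=0.80, L_cache=60, L_mem=200))
```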

Talk Outline (recap). Next: Evaluation

Experimental Evaluation
- Validation against GEM5 simulation (with a detailed model).
- Multiprogrammed workloads comprising SPEC2000/SPEC2006 benchmarks.
- Architecture configurations:
  - 4-core and 8-core systems.
  - 128MB (4-core) and 256MB (8-core) DRAM caches.
  - DRAM cache: 1.6GHz DRAM, 2KB page, 128-bit bus.
  - Main DRAM: 3.2GHz DRAM, 64-bit bus.
- Tags-on-DRAM: direct-mapped, 64B block size, tags and data on the same DRAM rows; tag predictor is a 2-way set-associative tag cache.
- Tags-on-SRAM: 1024B block size, 2-way set-associative.

Validation of the Tags-on-DRAM Model
Low errors in the estimation of the average LLSC miss penalty (10.9% for 4-core and 9.3% for 8-core workloads).

Validation of the Tags-on-SRAM Model
Low errors in the estimation of the average LLSC miss penalty (10.5% for 4-core and 8.2% for 8-core workloads).

Talk Outline (recap). Next: Insights

Insight 1: It Is Hard to Outperform Tags-on-SRAM Designs
Tag access time: a Tags-on-DRAM design requires a high predictor hit rate to beat the tag-lookup latency of Tags-on-SRAM.

Insight 2: Motivation
The DRAM cache achieves a very high hit rate, so main memory remains mostly idle: the cache is congested while memory is free. We therefore consider whether bypassing some cache hits to main memory would yield an overall latency benefit. We extend the ANATOMY-Cache model to account for a fraction of requests that bypass the cache (details in the paper).
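The paper defers the details, but the shape of the trade-off can be sketched: offloading a fraction beta of predicted hits shortens the cache's queue while lengthening memory's, so the average latency can have an interior minimum. The model below is my highly simplified illustration, not the paper's actual extension.

```python
def mdl_latency(lam, s):
    """M/D/1 service time plus queuing delay for deterministic service time s."""
    rho = lam * s
    return s + s * rho / (2 * (1 - rho))

def avg_latency_with_bypass(beta, lam, h_cache, s_cache, s_mem):
    lam_cache = lam * h_cache * (1 - beta)                # hits kept in the cache
    lam_mem = lam * (1 - h_cache) + lam * h_cache * beta  # misses + offloaded hits
    L_cache = mdl_latency(lam_cache, s_cache)
    L_mem = mdl_latency(lam_mem, s_mem)
    # Average over where each request is actually served
    return (lam_cache * L_cache + lam_mem * L_mem) / lam

# Sweep beta to find the sweet spot (assumed service times in cycles)
best = min((avg_latency_with_bypass(b / 10, 0.02, 0.9, 40, 45), b / 10)
           for b in range(10))
print(best)  # (minimum average latency, the beta that achieves it)
```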

Insight 2: Cache Bypass/Offload Helps!
In a congested workload, misses are expensive, yet there is a sweet spot at which the average LLSC miss penalty is minimized.

Talk Outline (recap). Next: Conclusions

ANATOMY-Cache
- First analytical model of stacked DRAM caches.
- Covers both Tags-on-DRAM and Tags-on-SRAM organizations.
- We investigated two insights with the help of the model.

Thank You!