A Comprehensive Analytical Performance Model of DRAM Caches
Authors: Nagendra Gulur*, Mahesh Mehendale*, and R. Govindarajan+
(* Texas Instruments, + Indian Institute of Science)
Presented by: Sreepathi Pai (University of Texas, Austin)
6th ACM/SPEC International Conference on Performance Engineering, 2015
Talk Outline
- Introduction to Stacked DRAM Caches
- Background (an overview of ANATOMY)
- ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
- Evaluation
- Insights
- Conclusions

ANATOMY: An Analytical Model of System Performance (published at the 2014 ACM International Conference on Measurement and Modeling of Computer Systems)
Stacked DRAM
DRAM is vertically stacked over the processor die. Stacked DRAMs offer:
- High bandwidth
- High capacity
- Moderately low latency
There are several proposals to organize this large DRAM as a last-level cache.
(Picture courtesy Bryan Black, MICRO 2013 keynote)
Processor Organization with a DRAM Cache
[Diagram: cores 0..N, each with L1I/L1D, share an L2 (LLSC). A tag predictor on SRAM directs requests; metadata (tags) reside on SRAM or on the DRAM itself. Hits are served by the vertically stacked DRAM cache; misses go through the controller to off-chip main memory.]
Talk Outline
- Introduction to Stacked DRAM Caches
- Background (an overview of ANATOMY)
- ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
- Evaluation
- Insights
- Conclusions
Overview of a DRAM-Based Memory Controller
[Diagram: the controller drives control, address, and data signals to a DIMM organized as ranks of devices. Each device contains multiple banks; each bank is a grid of rows and columns with a row buffer, through which bank logic services read and write operations.]
Basic DRAM Operations
- ACTIVATE: bring data from the DRAM core into the row buffer
- READ/WRITE: perform read/write operations on the contents of the row buffer
- PRECHARGE: store data back to the DRAM core (ACTIVATE discharges the capacitors) and return the cells to a neutral voltage

Example request stream (M = row-buffer miss, H = row-buffer hit): M H M issues PRE ACT RD, RD, PRE ACT RD.
- Row-buffer hits (RBH) are faster and consume less power
- Bank-Level Parallelism (BLP) improves performance, though some switching delays hurt performance
Overview of ANATOMY
An analytical model of memory system performance with two components:
1) A queuing model of the DRAM's organizational and technological characteristics
2) Workload characteristics (locality and parallelism in the workload's memory accesses) used as input
Analytical Model for System Performance
The DRAM is modeled as a network of M/D/1 queues: one address bus server, N parallel bank servers, and one data bus server. For an M/D/1 queue, the mean queuing delay is Q = ρ/(2µ(1−ρ)), where ρ = λ/µ is the utilization.
- Address bus server: service time 1/µ_addr = (RBH·1 + (1−RBH)·3) · BUS_CYCLE_TIME
- Bank servers 1..N: service time 1/µ_bank = t_CL·RBH + (t_CL + t_PRE + t_RCD)·(1−RBH)
- Data bus server: service time 1/µ_data = Burst_Length · BUS_CYCLE_TIME
Latency = Q_addr + Q_bank + Q_data + 1/µ_addr + 1/µ_bank + 1/µ_data
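The latency estimate above can be sketched in a few lines. This is a minimal illustration of the M/D/1 network, not the paper's full model; the parameter values in the usage note are illustrative, and the even split of bank traffic across N banks is an assumption.

```python
def md1_wait(lam, mu):
    """Mean queuing delay in an M/D/1 queue: Q = rho / (2*mu*(1 - rho))."""
    rho = lam / mu
    assert rho < 1, "queue is unstable when utilization >= 1"
    return rho / (2 * mu * (1 - rho))

def dram_latency(lam, rbh, bus_cycle, t_cl, t_pre, t_rcd, burst_len, n_banks):
    # Service times of the three server types (address bus, bank, data bus).
    t_addr = (rbh * 1 + (1 - rbh) * 3) * bus_cycle
    t_bank = t_cl * rbh + (t_cl + t_pre + t_rcd) * (1 - rbh)
    t_data = burst_len * bus_cycle
    # Assumption: requests spread evenly over the n_banks parallel bank servers.
    q_addr = md1_wait(lam, 1 / t_addr)
    q_bank = md1_wait(lam / n_banks, 1 / t_bank)
    q_data = md1_wait(lam, 1 / t_data)
    # Total latency = queuing delays + service times.
    return q_addr + q_bank + q_data + t_addr + t_bank + t_data
```

For example, with illustrative timings (in ns), raising RBH from 0.3 to 0.9 shrinks both the bank service time and the queuing delays, so the predicted latency falls.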
Validation: Model Accuracy
[Chart: average % error in Latency, RBH, and BLP across workloads E1-E15, all within ±12.5%.]
Low errors in RBH, BLP, and latency estimation: average errors of 3.9%, 4.2%, and 4%. ANATOMY predicts trends accurately.
Talk Outline
- Introduction to Stacked DRAM Caches
- Background (an overview of ANATOMY)
- ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
- Evaluation
- Insights
- Conclusions
ANATOMY-Cache Model
[Diagram: L2 (LLSC) requests go through the tag predictor; hits are served by the vertically stacked DRAM cache, misses go through the controller to off-chip main memory.]
Key parameters that govern performance:
- Arrival rate
- Tag access time
- Cache hit rate
- Cache RBH
- Cache miss penalty
Extending ANATOMY to DRAM Caches
Two ANATOMY instances: one for the DRAM cache and one for main memory. The models are fed by the output of the tag server and by each other's outputs:
- Predicted hits and unpredicted requests go to the ANATOMY-Cache instance
- Predicted misses go directly to the ANATOMY-Mem instance
- Misses, line fills, and writebacks flow from the cache to main memory
- Line fills and writeback requests from main memory feed back into the cache
We compute the latencies at the cache (L_Cache) and at memory (L_Mem) using ANATOMY.
Obtaining the Average LLSC Miss Penalty
L_Cache and L_Mem are combined to estimate the average LLSC miss penalty. But first we discuss the estimation of the key parameters that govern L_Cache and L_Mem.
Estimating Key Parameters
- Arrival rate
- Tag access time
- Cache hit rate
- Cache RBH
- Cache miss penalty
Estimating the Cache Arrival Rate
The arrival rate at the cache is the sum of several streams of accesses: predicted hits, unpredicted requests (sent to the cache for tag look-up), and line fills and writebacks.
Summarizing the Cache Arrival Rate

Request Stream   | Rate                 | Notes
Predicted hits   | λ·h_pred·h_cache     |
No predictions   | λ·(1−h_pred)         | Sent to the cache for tag look-up
Line fills       | λ·(1−h_cache)·B_s    | B_s is the cache block size
Writebacks       | λ·(1−h_cache)·w      | w is the fraction of misses that cause writebacks

λ_cache = λ·h_pred·h_cache + λ·(1−h_pred) + λ·(1−h_cache)·B_s + λ·(1−h_cache)·w
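The summation in the table above is direct to encode. A minimal sketch (parameter names follow the table; the example values in the usage note are made up):

```python
def cache_arrival_rate(lam, h_pred, h_cache, b_s, w):
    """Sum the request streams arriving at the DRAM cache.

    lam: LLSC miss rate (requests/unit time); h_pred: tag-predictor hit rate;
    h_cache: cache hit rate; b_s: cache block size (in transfer units);
    w: fraction of misses that cause writebacks.
    """
    predicted_hits = lam * h_pred * h_cache
    unpredicted = lam * (1 - h_pred)        # sent to the cache for tag look-up
    line_fills = lam * (1 - h_cache) * b_s
    writebacks = lam * (1 - h_cache) * w
    return predicted_hits + unpredicted + line_fills + writebacks
```

For instance, with λ = 1, h_pred = 0.9, h_cache = 0.8, B_s = 2, and w = 0.3, the cache sees an arrival rate of 0.72 + 0.1 + 0.4 + 0.06 = 1.28.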
Estimating Tag Predictor Hit Rate and Access Time
- Tags-on-SRAM: all tags on SRAM; hit rate = 100%
- Tags-on-DRAM: a small set-associative tag cache; hit rate determined by running an access trace through the cache model
Predictor access time depends on its size; an estimate is obtained using CACTI.
Estimating Cache Hit Rate
Cache hit rate depends on three key parameters: cache size, set associativity, and block size. This is a well-studied problem; we use a trace-based model and reuse-distance analysis over a trace of accesses from the LLSC.
[Flow: LLSC miss trace + cache size and associativity → reuse-distance analysis → hit rate]
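As a simplified stand-in for the flow above: under fully-associative LRU, an access hits in a cache of C blocks exactly when its stack (reuse) distance is less than C, so a single pass over the trace yields the hit rate for any capacity. This toy version ignores set associativity and is not the paper's exact model.

```python
def hit_rate(trace, capacity_blocks):
    """Hit rate of a fully-associative LRU cache of capacity_blocks blocks,
    computed via stack distances over an access trace of block addresses."""
    stack, hits = [], 0
    for block in trace:
        if block in stack:
            d = stack.index(block)      # stack distance of this reuse
            if d < capacity_blocks:
                hits += 1
            stack.remove(block)
        stack.insert(0, block)          # move block to the MRU position
    return hits / len(trace)
```

On the trace [1, 2, 1, 2, 3, 1], a 2-block cache hits the two short reuses but misses the reuse of block 1 after block 3 evicts it (hit rate 2/6); a 3-block cache also catches that reuse (hit rate 3/6).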
Estimating Cache Hit Rate with Block Size
Larger block sizes can capture spatial locality. The bandwidth-neutral model assumes the cache miss rate halves with each doubling of the cache block size. If this holds, measuring the hit rate at the smallest block size via trace-based analysis is sufficient; for larger block sizes, estimate via m(B) = m(B_min)·(B_min/B).
[Chart: workload Q5 is bandwidth-neutral.]
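The bandwidth-neutral scaling rule amounts to one line of code. A sketch, assuming the block size is a power-of-two multiple of the smallest measured block size:

```python
def scaled_miss_rate(m_base, b_min, b):
    """Bandwidth-neutral estimate: miss rate halves per doubling of block
    size, i.e. m(B) = m(B_min) * (B_min / B)."""
    assert b >= b_min and b % b_min == 0, "B must be a multiple of B_min"
    return m_base * (b_min / b)
```

For example, a miss rate of 0.2 measured at 64B blocks scales to an estimated 0.05 at 256B blocks, since the block size doubled twice.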
Not All Workloads Are Bandwidth-Neutral
For such workloads, the bandwidth-neutral model under-predicts the miss rate; we use trace-based cache simulations in these cases.
[Chart: workload Q22 is NOT bandwidth-neutral.]
Estimating DRAM Cache Row-Buffer Hit Rate
The row-buffer hit rate (RBH) of the DRAM cache depends on the access pattern and the data organization on the DRAM. We estimate RBH using the reuse-distance framework, similar to ANATOMY. Details are in the paper.
Putting Them Together
The LLSC miss penalty is obtained from L_Cache and L_Mem:

Event                     | Latency contribution
Predicted cache hit       | A: h_pred·h_cache·L_Cache
Predicted cache miss      | B: h_pred·(1−h_cache)·L_Mem
No prediction, cache hit  | C: (1−h_pred)·h_cache·L_Cache
No prediction, cache miss | D: (1−h_pred)·(1−h_cache)·(L_Cache + L_Mem)

L_Avg = A + B + C + D
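The event table above combines into the average penalty as follows (a direct transcription; the example latencies in the usage note are made up):

```python
def avg_llsc_miss_penalty(h_pred, h_cache, l_cache, l_mem):
    """Average LLSC miss penalty from the four prediction/hit events."""
    a = h_pred * h_cache * l_cache                        # predicted hit
    b = h_pred * (1 - h_cache) * l_mem                    # predicted miss
    c = (1 - h_pred) * h_cache * l_cache                  # no prediction, hit
    d = (1 - h_pred) * (1 - h_cache) * (l_cache + l_mem)  # no prediction, miss
    return a + b + c + d
```

As a sanity check, a perfect predictor (h_pred = 1) reduces this to h_cache·L_Cache + (1−h_cache)·L_Mem, while with no predictions (h_pred = 0) every miss pays L_Cache + L_Mem because the cache is probed first.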
Talk Outline
- Introduction to Stacked DRAM Caches
- Background (an overview of ANATOMY)
- ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
- Evaluation
- Insights
- Conclusions
Experimental Evaluation
- Validation against GEM5 simulation (detailed model)
- Multiprogrammed workloads comprising SPEC2000/SPEC2006 benchmarks
- Architecture configurations: 4-core and 8-core, with 128MB (4-core) and 256MB (8-core) DRAM caches
- DRAM cache: 1.6GHz DRAM, 2KB page, 128-bit bus; main DRAM: 3.2GHz DRAM, 64-bit bus
- Tags-on-DRAM: direct-mapped, 64B block size, tags and data on the same DRAM rows; tag predictor is a 2-way set-associative tag cache
- Tags-on-SRAM: 1024B block size, 2-way set associative
Validation of the Tags-on-DRAM Model
Low errors in the estimation of average LLSC miss penalty: 10.9% in 4-core and 9.3% in 8-core workloads.
Validation of the Tags-on-SRAM Model
Low errors in the estimation of average LLSC miss penalty: 10.5% in 4-core and 8.2% in 8-core workloads.
Talk Outline
- Introduction to Stacked DRAM Caches
- Background (an overview of ANATOMY)
- ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
- Evaluation
- Insights
- Conclusions
Insight 1: It Is Hard to Out-Perform Tags-on-SRAM Designs
Tag access time matters: a very high predictor hit rate is required for Tags-on-DRAM to beat the Tags-on-SRAM latency for tag lookup.
Insight 2: Motivation
The DRAM cache achieves a very high hit rate, so main memory remains mostly idle: the cache is congested while memory is free. We therefore consider whether bypassing some cache hits to main memory would yield an overall latency benefit. We extend the ANATOMY-Cache model to account for a fraction of requests that bypass the cache (details in the paper).
Insight 2: Cache Bypass/Offload Helps!
[Chart: for a congested workload, misses are expensive; there is a sweet spot at which the average LLSC miss penalty is minimized.]
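The sweet-spot behavior can be illustrated with a toy sweep: divert a fraction f of requests from a congested cache to an idle (but slower) memory and pick the f that minimizes average latency. This uses a simplified single-server M/D/1 congestion model with made-up service times, not the paper's extended formulation.

```python
def md1_latency(lam, service):
    """Mean M/D/1 latency (service + queuing); infinite if the server saturates."""
    rho = lam * service
    if rho >= 1:
        return float('inf')
    return service + rho * service / (2 * (1 - rho))

def best_bypass_fraction(lam, t_cache, t_mem, steps=100):
    """Sweep the bypass fraction f and return the (f, latency) minimizing
    average latency when f of the traffic is offloaded to memory."""
    best_f, best_lat = 0.0, float('inf')
    for i in range(steps + 1):
        f = i / steps
        lat = ((1 - f) * md1_latency(lam * (1 - f), t_cache)
               + f * md1_latency(lam * f, t_mem))
        if lat < best_lat:
            best_f, best_lat = f, lat
    return best_f, best_lat
```

With the cache near saturation (e.g. λ·t_cache = 0.9) and memory only modestly slower, the minimum falls strictly between f = 0 and f = 1: offloading relieves cache queuing faster than it adds memory latency, up to a point.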
Talk Outline
- Introduction to Stacked DRAM Caches
- Background (an overview of ANATOMY)
- ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
- Evaluation
- Insights
- Conclusions
ANATOMY-Cache
- The first analytical model of stacked DRAM caches
- Covers both Tags-on-DRAM and Tags-on-SRAM organizations
- We investigated two insights with the help of the model
Thank You!