Hybrid Cache Architecture (HCA) with Disparate Memory Technologies

Size: px

Start display at page:

Download "Hybrid Cache Architecture (HCA) with Disparate Memory Technologies"

Lorraine Austin
6 years ago
Views:

1 Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement: Elmootazbellah (Mootaz) Elnozahy, Hung Le, Balaram Sinharoy, William (Bill) J. Starke, and Chung-Lung Kevin Shum 1

2 Outline Motivation and Introduction Methodology Level based Hybrid Cache Architecture Region based Hybrid Cache Architecture 3D Hybrid Cache Stacking Conclusion 2

Introduction Traditional SRAM-based Cache Architecture - Limited

levels: design overhead, coherence - Non-Uniform Cache

Emerging Memory Technologies, under the same chip area/footprint

3 Introduction Traditional SRAM-based Cache Architecture - Limited size with CMP: cache-core balance - Leakage power - More cache levels: design overhead, coherence - Non-Uniform Cache Architecture (wire delay) Improve cache power-performance with Emerging Memory Technologies, under the same chip area/footprint - Embedded DRAM - Magnetic RAM - Phase Change RAM - Three-dimensional space 3

4 Different Memory Technologies SRAM 6T structure DRAM 1T1C structure MTJ (Magnetic Tunnel Junction) Magnetic RAM 1T1J structure Phase Change RAM 1T1J structure 4

5 Comparisons SRAM edram MRAM PRAM Density (ratio) Low (1) High (4) High (4) High(16) Dynamic Power Low Medium Low for read; High for write Reduce Cache miss rate Increase hit latency Medium for read; High for write Leakage Power High Medium Low Low Speed Very Fast Fast Low leakage power High dynamic power Fast for read; Slow for write Slow for read; Very slow for write Non-volatility No No Yes Yes Scalability Yes Yes Yes Yes Endurance >

6 Motivation 1.4 1M-SRAM 4M-DRAM 4M-MRAM 16M-PRAM No 1 single memory technology has astar bzip2 gcc gobmk h264 hmmer-sp libquantum mcf omnetpp perl sjeng blast bt cg clustalw hmmer lu mg sp ua specjbb dedup fluidanimate freqmine streamcluster Geomean Normalized IPC the best power-performance Static Dynamic Hybrid Cache may outperform its counterpart of single technology astar bzip2 gcc gobmk h264 hmmer-sp libquantum mcf omnetpp perl sjeng blast bt cg clustalw hmmer lu mg sp ua specjbb dedup fluidanimate freqmine streamcluster Geomean Normalized Power 6

7 Outline Introduction and Motivation Methodology Level based Hybrid Cache Architecture Region based Hybrid Cache Architecture 3D Hybrid Cache Stacking Conclusions 7

8 Evaluation Methodology Fast Slow (edram/ MRAM) (PRAM) Baseline: a 2D 8-core CMP (3-level SRAM Caches) Fast A LHCA Fast Fast Middle (edram/ MRAM) Slow (PRAM) (edram/ MRAM/ PRAM) (edram/ MRAM/ PRAM) Flattening and with hybrid cache 3DHCA 3DHCA 3DHCA Slow E(eDRAM/ DMiddle C MRAM) (PRAM) (edram/ MRAM) Slow (PRAM) L1s Fast Fast B Slow (edram/ RHCA MRAM/ Slow PRAM) (edram/ MRAM/ 3D design PRAM) scenario w/ L1s 3D Layer 1 (edram/ MRAM) (edram/ MRAM) L4 (PRAM) (PRAM) 3D Layer 2 Flattening and L4 with hybrid cache Flattening, and L4 with hybrid cache A cache design scenario with 3D chip integration 8

9 Evaluation Setup Cache Density Latency (cycles) Dyn. eng (nj) Static power (W) SRAM(1MB) edram(4mb) MRAM(4MB) 4 Read:20, write:60 PRAM(16MB) 16 Read:40 write:200 Read:0.4 write:2.3 Read:0.8 write: Processor L1 //L4 Memory 8-way issue out-of-order, 8-core, 4GHz 32KB DL1, 32KB IL1, 128B, 4-way, 1 R/W port See corresponding design cases 400 cycles latency Benchmarks: SpecInt06, Specjbb, NAS, Bioperf, Parsec Simulator: SystemSim full system simulator 9

10 Outline Introduction and Motivation Methodology Level based Hybrid Cache Architecture Intra-Level Hybrid Cache Architecture 3D Hybrid Cache Stacking Conclusions 10

11 LHCA: Performance and Power SRAM edram-lhca MRAM-LHCA PRAM-LHCA 11 astar bzip2 gcc gobmk h264 hmmer-sp libquantum mcf omnetpp perl sjeng blast bt cg clustalw hmmer lu mg sp ua specjbb dedup fluidanimate freqmine streamcluster Geomean A (edram/ MRAM/ PRAM) Simple LHCA can provide LHCA Performance and Static Power Dynamic benefits Processor 8-way issure out-of-order, 8-core, 4GHz astar bzip2 gcc gobmk h264 hmmer-sp libquantum mcf omnetpp perl sjeng blast bt cg clustalw hmmer lu mg sp ua specjbb dedup fluidanimate freqmine streamcluster Geomean L1 32KB DL1, 32KB IL1, 128B, 4-way, 1 R/W port 256KB, 128B, 4-way, 1 R/W port Parameters from the methodology 0.59 Normalized Power Normalized IPC

12 Outline Introduction and Motivation Methodology Level based Hybrid Cache Architecture Region based Hybrid Cache Architecture 3D Hybrid Cache Stacking Conclusions 12

13 RHCA Features Fast and slow regions in one cache level Mutually Exclusive Regions Parallel search Unified LRU Intra-cache data movement policy: Move frequently used data to the fast region RHCA Drowsy RHCA: Keep slow region in drowsy mode, wake up latency (details in paper) 13

14 RHCA Policy Cache line migration policy Access cache line Allocate the new line and replace an LRU line if needed, across all regions Reset 2-bit saturation counter; set sticky bit N Hit? No action Y N Provide data; increment saturation counter MSB set (hot) in slow region? Y Select an LRU cache line of the same set in the fast region Clear sticky bit Y Sticky bit set? N Cancel swap Swap; reset saturation counter; set sticky bit 14

15 RHCA Hardware Support Tag & Status Array Decoder Data Array Fast Region Fast Region Sticky bit and Saturating counter(s) valid tag Index Tag Offset Offset Slow Region Saturating counter(s) =? valid tag Index Slow Region swap buffer Tag =? Hardware support Saturating counter in slow and sticky bit in fast, swap buffer Minimum hardware support: 1-bit sticky bit in fast region 15

16 RHCA Configuration RHCA (fast+slow) Fast region total size (latency) SRAM+eDRAM 256KB (6 cycles) 4MB (24 cycles) SRAM+MRAM w/ 256KB L1s (6 cycles) 4MB (r:20, w:60) Fast SRAM+PRAM 256KB (6 cycles) B 16MB (r:40, w:200) RHCA Slow region: (edram/ 256KB/bank, 1 r/w port, block size 128B, associativity:16, MRAM/ 16, 64 RHCA is 256KB less size than corresponding LHCA Slow PRAM) Avoid odd-sized cache DNUCA policy: more fine grained, move a line to a closer bank on each hit, bank-based, same size 16

17 RHCA Result astar bzip2 gcc gobmk h264 hmmer-sp h264 libquantum mcf omnetpp perl sjeng blast bt blast cg bt clustalw cg hmmer clustalw lu bt hmmer mg cg lu sp clustalw mg ua hmmer sp specjbb lu ua dedup mg specjbb fluidanimate sp dedup freqmine ua fluidanimate streamcluster specjbb freqmine Geomean dedup streamcluster fluidanimate Geomean freqmine streamcluster Geomean SRAM-eDRAM RHCA RHCA achieves better performance level SRAM Best LHCA DNUCA RHCA than SRAM 3-level SRAM baseline Best LHCA DNUCA and RHCA LHCA level SRAM Best LHCA DNUCA RHCA SRAM-MRAM RHCA SRAM-PRAM RHCA astar bzip2 gcc gobmk astar bzip2 gcc gobmk h264 hmmer-sp hmmer-sp libquantum mcf omnetpp perl sjeng libquantum mcf omnetpp perl sjeng blast PRAM is not promising for Cache Normalized IPCNormalized IPC 17

18 Outline Introduction and Motivation Methodology Level based Hybrid Cache Architecture Region based Hybrid Cache Architecture 3D Hybrid Cache Stacking Conclusions 18

19 3DHCA-configuration 3D Layer 1 edram/ Fast Middle edram/ 3DHCA 3DHCA 3DHCA C D E Slow edram/ Fast L4 (PRAM) 3D Layer 2 Slow (PRAM) (PRAM) 3DHCA-C (3D LHCA): 256KB SRAM, 4M edram, 32M L4 PRAM 3DHCA-D: 32M fast, middle, slow region (3D RHCA) Data in slow region can be moved to fast and middle regions 3DHCA-E: 4M fast+slow region, 32M PRAM (LHCA+RHCA) 19

20 DHCA-result 3-level SRAM Best LHCA RHCA 3DHCA-C 3DHCA-D 3DHCA-E astar bzip2 gcc gobmk h264 hmmer-sp libquantum mcf omnetpp perl sjeng blast bt cg clustalw hmmer lu mg sp ua specjbb dedup fluidanimate freqmine streamcluster Geomean Static Normal_dyn Swap_dyn gobmk h264 hmmer-sp libquantum mcf omnetpp perl sjeng blast bt cg clustalw hmmer lu mg sp ua specjbb dedup fluidanimate freqmine streamcluster Geomean 20 astar bzip2 gcc Normalized Power Normalized IPC

21 Conclusion Hybrid cache architecture is promising to improve cache power-performance under same chip area/footprint RHCA and LHCA achieve better power-performance than SRAM-based design RHCA outperforms LHCA with minimal hardware support 3DHCA achieves better performance than LHCA and RHCA, while still maintains lower power than 2D SRAM baseline 21

Emerging NVM Memory Technologies

Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement