Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies

1 Exploring Latency-Power Tradeoffs in Deep Nonvolatile Memory Hierarchies
Doe Hyun Yoon, Tobin Gonzalez, Parthasarathy Ranganathan, and Robert S. Schreiber
Intelligent Infrastructure Lab (IIL), Hewlett-Packard Labs
Computing Frontiers May

2 Deeper and Deeper Memory Hierarchy
- Cache memory for reducing average memory latency
  - Initially, a small L1 cache
  - Then L2, L3, a DRAM cache for NVM main memory, and SLC / MLC NVM memory
- Is a DEEP hierarchy power efficient?
  - We develop a model to explore latency-power tradeoffs
  - Flattened hierarchies with a large NVM cache are power efficient
[Figure: memory hierarchy diagram; labels: Proc, SRAM L1, SRAM L2, SRAM L3, DRAM main memory, DRAM cache, SLC NVM, MLC NVM, NVM main memory]

3 1. Latency-Power Model
- Estimate power and performance of an application
- Exhaustive search to find the Pareto-optimal frontier
[Figure: model flow. Inputs: configuration (# levels, size of each level, array type: SRAM, DRAM, ...), tech parameters, and app characteristics. These feed the performance & power model; the optimizer changes the config for exhaustive search and produces the Pareto-optimal frontier, plotted as Power vs. AMAT]

4 Application Characteristics
- Profile MPKI vs. cache sizes
- Binary instrumentation: the program's load/store instructions drive an exclusive hierarchy of caches (32kB, 32kB, 64kB, ...)
  - Misses from the first 32kB cache: traffic below a 32kB cache
  - Misses from the second 32kB cache: traffic below a combined 64kB cache
  - Misses from the 64kB cache: traffic below a combined 128kB cache
[Figure: MPKI (axis ticks 1 to 1000) vs. cache size (Proc, 128kB, 1MB, 8MB, 64MB, 512MB, 4GB)]
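To make the idea of an MPKI-versus-size profile concrete, here is a minimal Python sketch. It is a simplification and not the PIN-based exclusive-hierarchy profiler from the slides: it assumes a fully associative LRU cache simulated once per size, and the names `lru_misses`, `mpki`, the synthetic trace, and the instruction count are all hypothetical.

```python
from collections import OrderedDict


def lru_misses(line_addresses, capacity_lines):
    """Count misses of a fully associative LRU cache (simplifying assumption)."""
    cache = OrderedDict()
    misses = 0
    for addr in line_addresses:
        if addr in cache:
            cache.move_to_end(addr)        # hit: mark as most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def mpki(misses, instructions):
    """Misses per kilo-instruction."""
    return 1000.0 * misses / instructions


# Synthetic trace (an assumption, not a profiled program): 100 sweeps over
# 8 cache lines, assumed to come from 4000 executed instructions.
trace = [line for _ in range(100) for line in range(8)]
instructions = 4000
profile = {cap: mpki(lru_misses(trace, cap), instructions) for cap in (2, 4, 8, 16)}
```

In this toy trace the MPKI stays flat while the cache is smaller than the swept footprint and drops sharply once the footprint fits, which is the kind of shape an MPKI-vs-size curve can expose.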

5 Example) SPEC CPU
[Figure: MPKI (log scale) vs. cache size (Proc, 128kB, 1MB, 8MB, 64MB, 512MB, 4GB) for mcf, hmmer, astar, omnetpp, bzip2, lbm, milc]

6 Example) MiniFE, MiniMD, and Graph
[Figure: MPKI (log scale) vs. cache size (Proc, 128kB, 1MB, 8MB, 64MB, 512MB, 4GB) for MiniFE, MiniMD, Graph500]

7 Technology Parameters
- Latency, static power, and energy per access of different caches
  - SRAM and DRAM: CACTI 6
  - Phase-Change RAM (PCRAM): NVSim

8 Performance Model
- Use AMAT (Average Memory Access Time)

\[ \mathrm{AMAT} = L(1) + \sum_{i=1}^{N} \frac{M(i)}{M(0)} \, L(i+1) \]

- Total N cache levels
- L(i): latency of cache level i (tech parameter)
- M(i): MPKI of cache level i (app characteristic); M(0) is # loads / KI
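As a rough illustration of the AMAT formula, the following Python sketch evaluates it for a hypothetical three-level hierarchy. The function name `amat`, the list layout, and all numbers are illustrative assumptions, not values from the slides. The sum is assumed to run over all N cache levels, with the last latency entry standing for main memory.

```python
def amat(latency, mpki):
    """Average memory access time, following the slide's formula.

    latency[k] holds L(k+1): cache levels 1..N, then main memory (assumed).
    mpki[i] holds M(i): mpki[0] is memory accesses per kilo-instruction,
    mpki[i] is the misses per kilo-instruction of cache level i.
    """
    if len(latency) != len(mpki):
        raise ValueError("need N+1 latencies and N+1 MPKI values")
    total = latency[0]  # L(1): every access pays the first-level latency
    for i in range(1, len(mpki)):
        # a fraction M(i)/M(0) of accesses misses level i and pays L(i+1)
        total += mpki[i] / mpki[0] * latency[i]
    return total


# Hypothetical numbers (not from the slides): 3 cache levels + main memory.
latency_ns = [1.0, 4.0, 12.0, 100.0]  # L(1), L(2), L(3), L(4)
mpki_profile = [300.0, 30.0, 6.0, 1.5]  # M(0), M(1), M(2), M(3)
example_amat = amat(latency_ns, mpki_profile)
```

With these made-up inputs the terms are 1 + 0.1·4 + 0.02·12 + 0.005·100, so main memory still contributes a visible share of AMAT even though only 0.5% of accesses reach it.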

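9 Power Model

\[ P = P_{\mathrm{static}} + P_{\mathrm{dyn}} \]
\[ P_{\mathrm{static}} = \sum_{i} P_s(i) \]
\[ E_{\mathrm{dyn}} = M(0)\,E(1) + \sum_{i=1}^{N} M(i)\,\bigl(E_r(i+1) + E_w(i)\bigr) \]
\[ P_{\mathrm{dyn}} = \frac{E_{\mathrm{dyn}}}{(1000 - M(0))\,T_{\mathrm{cyc}} + \mathrm{AMAT}\cdot M(0)} \]

- E(i): energy per access (tech parameter)
- T_cyc: cycle time (1ns in our study)
- The denominator of P_dyn is the time to execute 1000 instrs
[Figure: a miss at level i causes a read (Rd) from level i+1 and a write (Wr) into level i]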
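The power model can be sketched in Python along the same lines. This is only an illustration under stated assumptions: static power is summed over every entry passed in, energies are in nJ and times in ns (so the result is in W), and the function names, list layout, and all numbers are hypothetical rather than taken from the slides.

```python
def dynamic_energy(mpki, e_access_l1, e_read, e_write):
    """E_dyn in nJ per 1000 instructions.

    mpki[i] = M(i) for i = 0..N. e_read[i] = E_r(i) and e_write[i] = E_w(i)
    are indexed by level; index 0 is unused padding.
    """
    n = len(mpki) - 1
    e_dyn = mpki[0] * e_access_l1  # every memory access touches level 1
    for i in range(1, n + 1):
        # each miss at level i reads level i+1 and writes (fills) level i
        e_dyn += mpki[i] * (e_read[i + 1] + e_write[i])
    return e_dyn


def total_power(p_static, mpki, e_access_l1, e_read, e_write, amat_ns, t_cyc_ns=1.0):
    """P = P_static + P_dyn in W, with energies in nJ and times in ns."""
    # time for 1000 instructions: non-memory ones take one cycle, memory ones AMAT
    time_ns = (1000 - mpki[0]) * t_cyc_ns + amat_ns * mpki[0]
    p_dyn = dynamic_energy(mpki, e_access_l1, e_read, e_write) / time_ns
    return sum(p_static) + p_dyn


# Hypothetical inputs (not from the slides): 3 cache levels + main memory.
mpki_profile = [300.0, 30.0, 6.0, 1.5]   # M(0)..M(3)
e_read = [0.0, 0.1, 0.3, 1.0, 10.0]      # E_r(1)..E_r(4), index 0 unused
e_write = [0.0, 0.1, 0.3, 1.0]           # E_w(1)..E_w(3), index 0 unused
p_static = [0.05, 0.1, 0.5, 1.0]         # W per level, main memory included (assumed)
example_e_dyn = dynamic_energy(mpki_profile, 0.1, e_read, e_write)
example_power = total_power(p_static, mpki_profile, 0.1, e_read, e_write, amat_ns=2.14)
```

Because a slower hierarchy stretches the time to execute 1000 instructions, a higher AMAT lowers dynamic power in this model even when the dynamic energy is unchanged; static power is what penalizes large arrays.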

10 The Model, Again
[Figure: model flow. Technology parameters (CACTI, NVSim), application characteristics (PIN cache model; MPKI vs. cache size from Proc to 4GB), and configuration (# levels, size of each level, array type: SRAM, DRAM, ...) feed the performance & power model; the optimizer changes the config for exhaustive search and produces the Pareto-optimal frontier, plotted as Power vs. AMAT]

11 Exhaustive Search Example
[Figure: Power (mW, log scale) vs. AMAT (ns) for all searched configurations]

12 Pareto-Optimal Frontier
[Figure: Power (mW, log scale) vs. AMAT (ns), with the Pareto-optimal frontier highlighted]
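Extracting a Pareto-optimal frontier from a set of (AMAT, power) points can be sketched as follows. The slides do not show how their optimizer does this; the function name `pareto_frontier` and the sample points are hypothetical, and the sort-and-sweep approach is just one common way to do it.

```python
def pareto_frontier(points):
    """Return the (amat, power) points that no other point dominates.

    A point dominates another if it is no worse in both metrics and
    strictly better in at least one (lower is better for both).
    """
    frontier = []
    best_power = float("inf")
    # Ascending AMAT (ties broken by ascending power): a point is on the
    # frontier only if it has lower power than every faster point seen so far.
    for amat, power in sorted(points):
        if power < best_power:
            frontier.append((amat, power))
            best_power = power
    return frontier


# Hypothetical (AMAT in ns, power in mW) results of five configurations.
configs = [(2.0, 900.0), (3.0, 400.0), (2.5, 1000.0), (5.0, 350.0), (4.0, 450.0)]
frontier = pareto_frontier(configs)
```

In this made-up set, the first frontier entry is the minimum-latency point and the last is the minimum-power point; everything between them trades one metric for the other.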

13 Optimum Configurations
[Figure: Power (mW, log scale) vs. AMAT (ns); annotations along the Pareto-optimal frontier: minimum latency configuration, power-efficient configuration, minimum power configuration]

14 2. Depth of a Cache Hierarchy
- Use the latency-power model
- SRAM caches, ranging from 32kB to 2GB
- 2- to 6-level hierarchies
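The search space for this experiment could be enumerated roughly as below. The slides do not state the size granularity or ordering constraints, so this sketch assumes power-of-two sizes and strictly increasing capacity from one level to the next; `enumerate_hierarchies` and `SIZES` are hypothetical names.

```python
from itertools import combinations

KB = 1024
# Assumed candidate sizes: powers of two from 32kB up to 2GB.
SIZES = [32 * KB * 2**k for k in range(17)]


def enumerate_hierarchies(sizes, min_levels, max_levels):
    """Yield every hierarchy as a tuple of cache sizes, smallest level first.

    combinations() of a sorted list yields strictly increasing tuples, which
    encodes the assumption that each level is larger than the one above it.
    """
    for depth in range(min_levels, max_levels + 1):
        for config in combinations(sizes, depth):
            yield config


all_configs = list(enumerate_hierarchies(SIZES, 2, 6))
```

Each tuple would then be scored with the performance and power model, and the resulting (AMAT, power) points filtered down to the Pareto-optimal frontier.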

15 Graph
[Figure: two panels, Power (mW, log scale) vs. AMAT (ns) and Power (mW) vs. AMAT (ns), each comparing 2-level, 3-level, 4-level, 5-level, and 6-level hierarchies; annotation in each panel: "level is better"]

16 MiniFE
- AMAT reduces only when the cache is bigger than 1.5GB
- Deep hierarchies are not effective
[Figure: Power (mW, log scale) vs. AMAT (ns) for 2-level, 3-level, 4-level, 5-level, and 6-level hierarchies]

17 Lessons Learned
- Cache hierarchies deeper than 3 levels
  - Increase AMAT
  - Use more power
- Large SRAM caches
  - Increase static power a lot
  - Only a small improvement in AMAT

18 3. Cache Hierarchies with NVM
- Use the latency-power model
- Replace DRAM main memory with NVM main memory
- Caches with heterogeneous technologies
  - SRAM caches, ranging from 32kB to 32MB
  - DRAM caches, ranging from 4MB to 64MB
  - PCRAM caches, ranging from 16MB to 1GB

19 Graph
- Penalty due to slow NVM main memory
- Mitigating NVM penalty with DRAM caches
- Further reducing power with low-leakage NVM caches
[Figure: Power (mW, log scale) vs. AMAT (ns) for four configurations: SRAM cache + DRAM main memory; SRAM cache + PCRAM main memory; SRAM/DRAM cache + PCRAM main memory; SRAM/DRAM/PCRAM cache + PCRAM main memory]

20 MCF
[Figure: Power (mW, log scale) vs. AMAT (ns) for four configurations: SRAM cache + DRAM main memory; SRAM cache + PCRAM main memory; SRAM/DRAM cache + PCRAM main memory; SRAM/DRAM/PCRAM cache + PCRAM main memory; annotations: NVM main memory, DRAM cache, NVM cache]

21 4. Streaming Patterns
- Working set is simply larger than the cache
- LRU policy thrashes data
- Large caches can't improve AMAT but waste power unless the cache is larger than the working set

22 MiniFE
- Even the largest NVM cache can't hold the working set
[Figure: Power (mW, log scale) vs. AMAT (ns) for four configurations: SRAM cache + DRAM main memory; SRAM cache + PCRAM main memory; SRAM/DRAM cache + PCRAM main memory; SRAM/DRAM/PCRAM cache + PCRAM main memory; annotations: NVM main memory, DRAM cache, NVM cache]

23 Alternative Replacement Policy
- Inserted lines are never replaced
- Mimicking more recent techniques such as DIP [Qureshi+ ISCA'07] or RRIP [Jaleel+ ISCA'10]
- Can preserve a fraction of the working set in the cache
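A small Python sketch can show why such a policy helps on a streaming pattern. It reads "inserted lines are never replaced" as: fill the cache while it has free space, then let further misses bypass it. That reading, the function name `misses_never_replace`, and the synthetic trace are assumptions, and the sketch is not an implementation of DIP or RRIP.

```python
def misses_never_replace(line_addresses, capacity_lines):
    """Count misses when lines, once inserted, are never replaced.

    Assumption: after the cache fills up, missing lines bypass it.
    """
    resident = set()
    misses = 0
    for addr in line_addresses:
        if addr in resident:
            continue  # hit on a preserved line
        misses += 1
        if len(resident) < capacity_lines:
            resident.add(addr)  # insert only while there is free space
    return misses


# Synthetic streaming trace: 100 sweeps over a working set of 8 lines,
# run against a cache that holds only 4 lines.
working_set_lines = 8
sweeps = 100
trace = [line for _ in range(sweeps) for line in range(working_set_lines)]
misses = misses_never_replace(trace, capacity_lines=4)
miss_ratio = misses / len(trace)
```

After the first sweep, half of the working set stays resident and hits on every later sweep, so the miss ratio approaches 1/2. An LRU cache of the same size would miss on every access of this cyclic trace, because each line is evicted before it is reused.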

24 MiniFE: MPKI vs. Cache Size
- Caches smaller than 2GB can reduce MPKI
[Figure: MPKI vs. cache size (Proc, 128kB, 1MB, 8MB, 64MB, 512MB, 4GB) for LRU and the alternative replacement policy]

25 MiniFE with the Alternative Policy
- GB NVM cache can reduce power and AMAT
[Figure: Power (mW, log scale) vs. AMAT (ns) for four configurations: SRAM cache + DRAM main memory; SRAM cache + PCRAM main memory; SRAM/DRAM cache + PCRAM main memory; SRAM/DRAM/PCRAM cache + PCRAM main memory]

26 Conclusions
- Developed a latency-power model
- Deep hierarchies are less power efficient than flat hierarchies
- Large NVM caches can be power efficient
- They should be combined with an advanced insertion/replacement policy to handle streaming patterns

27 Limitations
- Assumed a blocking in-order core
- Ignored memory level parallelism, prefetching, ...
- Write endurance in NVM caches
- Aggressive power control in SRAM caches

28 Acknowledgement & Disclaimer
- This material is based upon work supported by the Department of Energy under Award Number DE- SC
- See for additional information
