Respin: Rethinking Near- Threshold Multiprocessor Design with Non-Volatile Memory

Size: px

Start display at page:

Download "Respin: Rethinking Near- Threshold Multiprocessor Design with Non-Volatile Memory"

Sandra Murphy
5 years ago
Views:

1 Respin: Rethinking Near- Threshold Multiprocessor Design with Non-Volatile Memory Computer Architecture Research Lab h"p://arch.cse.ohio-state.edu

2 Universal Demand for Low Power Mobility Ba"ery life Performance Power constraints Energy cost Environment 2

3 Near-Threshold Operation Vth NT Vdd Nominal Vdd 0 ~1/ Voltage (V) ~1/ Frequency 1 3

4 Challenges in Near-Threshold Performance degradanon Amplified process varianon Leakage power dominates The ininal idea of Respin Build caches in NT-CMP with leakage-free nonvolanle memories to reduce power consumpnon Probability distribution Frac=on of Total Chip Power Nominal Vdd Relative frequency Near-Threshold Vdd 900mV, Vth σ/µ= 9% 900mV, Vth σ/µ=12% 400mV, Vth σ/µ= 9% 400mV, Vth σ/µ=12% Dynamic and Leakage Power Breakdown for a 64-Core CMP ~39% Total Leakage ~73% Total Leakage ~35% Cache Leakage dynamic-core dynamic-cache leakage-core leakage-cache 4

5 Outline Basics of NVM Respin Architecture Methodology and EvaluaNon Conclusion 5

6 Non-Volatile Memory Basics Non-VolaNlity Resistance as data representanon (e.g. PCRAM, STT-RAM, ReRAM, etc.) Near-Zero Leakage Power Good fit for future power-constrained compunng High Density Great design candidate in the big data era Good Performance Feasible for on-chip storage replacement 6

7 STT-RAM Unique features of STT-RAM: fast read speed, low read energy, unlimited write endurance, and good companbility with CMOS technology Shortcomings: long write latency and high write energy MTJ (MagneNc Tunnel JuncNon) STT-RAM Cell (1T-1MTJ) AnN-parallel state: highresistance, represennng a 1 Parallel state: low-resistance, represennng a 0 7

8 STT-RAM is Good Fit for NT-CMP NT-CMP High leakage power Low operanng frequency Large numbers of cores requiring high cache capacity Unreliable funcnonal units STT-RAM Near-zero leakage power Long latency writes High density (~4x denser than SRAM) Robust from soh-errors 8

9 Outline Basics of NVM Respin Architecture Methodology and EvaluaNon Conclusion 9

Core 3 Core 8 Core 9 Core 10 Core11 Core 12 Core 13 Core 14 Core 15 Core 3 Core 8 Core 9 Core 10 Core11 Core 12 Core 13 Core 14 Core 15 Core 4 Core 4 Core 5 Core 5 Core 6 Core 6 Core 0 Core 1 Core 2

10 Core 3 Core 8 Core 9 Core 10 Core11 Core 12 Core 13 Core 14 Core 15 Core 3 Core 8 Core 9 Core 10 Core11 Core 12 Core 13 Core 14 Core 15 Core 4 Core 4 Core 5 Core 5 Core 6 Core 6 Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core11 Core 12 Core 13 Core 14 Core 15 Core 3 Core 8 Core 9 Core 10 Core11 Core 12 Core 13 Core 14 Core 15 Core 4 Core 5 Core 6 Respin Architecture NT Vdd High Vdd NT Vdd High Vdd Core 0 Core 1 Core 2 Core 7 Cluster 0 Cluster 1 Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 L3 Cache L2 Cache L1D L1I L2 Cache Core 0 Core 1 Core 2 Core 7 Core 0 Core 1 Core 2 Core 7 Cluster 2 Cluster 3 Core 8 Core 9 Core 10 Core11 Core 12 Core 13 Core 14 Core 15 (a) Cores operate at NT-Vdd rail with low frequencies Caches are built with STT-RAM and operate at high-vdd rail making read speed extremely fast Clustered-CMP with fast STT-RAM read enables within-cluster shared L1 cache design, removing coherence costs (b) 10

11 Shared Cache Controller Cache CLK (0.4ns) Core0 (1.6ns) Core1 (2.4ns) Core2 (1.6ns) Core3 (1.6ns) Core4 (2.0ns) cycle 0 cycle 1 cycle 2 cycle 3 cycle 4 Level-shifting overhead Level-shifting overhead service C0 service C2 half-miss C3 cannot service C3 in time service C3 cycle 5 cycle 6 service C4 service C1 Request serviced (a) Cache CLK (0.4ns) Core0 (1.6ns) Core1 (2.4ns) Core2 (1.6ns) Core3 (1.6ns) Core4 (2.0ns) cycle 0 cycle 1 cycle 2 cycle 3 cycle 4 cycle 5 cycle 6 service C0 service C2 service C3 service C4 service C1 half-miss C half-miss (b) 11

12 Dynamic Core Consolidation High process varianon and leakage in NT-CMP lead to fast cores more energy-efficient than slow ones Dynamically consolidate threads onto more efficient cores with greedy search at runnme can further save energy VC 15 VC 14 VC Virtual Core Monitor Filter 0 1 System Firmware (ACPI) Virtual Cores VC 4 OS VC 3 Core Remapper Energy Table Phy. ID Weight Status OFF OFF 0 1 VC 2 ID Map Virt. ID Phy. ID 3 2 VC 1 VC 0 Retired Inst. Energy Power Control Mechanism implemented in system firmware, energy-perinstrucnon used as greedy selecnon metric, and instrucnon VC 14 VC 13 VC C 15 C 14 C Physical Cores VC 3 C 4 VC 0 VC 4 C 3 VC 1 VC 2 C 2 OFF C 1 OFF C 0 count used as evaluanon interval 12

13 Greedy Selection A greedy approach is used to search for opnmal system configuranons at applicanon runnme Energy-Per-InstrucNon (EPI) as evaluanon metric InstrucNon count as evaluanon interval 13

14 Outline Basics of NVM Respin Architecture Methodology and Evalua=on Conclusion 14

15 Methodology Level Size (mm²) Block Size Associa=vity Read/Write Ports L1I (Private/Shared within Cluster) L1D (Private/Shared within Cluster) L2 (Shared within Cluster) L3 (Shared within Chip) 16KB (Private)/256KB (Shared within Cluster) 8MB (Small)/16MB (Medium)/32MB (Large) 32B 64B 24MB (Small)/48MB (Medium)/ 128B 96MB (Large) Table 1. Summary of Cache Parameters. Vdd Rail Area (mm²) Read/Write Latency (ns) 2-way 4-way 8-way 16-way Read/Write Energy (pj) 1/1 Leakage Power (mw) SRAM (16KB 16) Low (0.65V) SRAM (256KB) STT-RAM (256KB) High (1.0V) / / Table 2. Comparison of SRAM vs. STT-RAM Technology Parameters. 15

16 Methodology SimulaNon Framework: SESC for architectural simulanon CACTI, McPAT, and NVSim for Benchmarks: latency, power, energy, and area simulanon SPLASH2 and PARSEC Main Evaluated ConfiguraNon: 64-core CMP with four 16-core clusters Medium size L2 and L3 caches 0.4ns shared L1 cache read latency Cores CMP Architecture 64 out-of-order Fetch/Issue/Commit Width 2/2/2 Register File Size InstrucNon Window Size Reorder Buffer Size Load/Store Queue Size NoC Interconnect Coherence Protocol Consistency Model Technology NT-Vdd 76 int, 56 fp 56 int, 24 fp 80 entries 38 entries 2D Torus MESI Release Consistency 22nm 0.4V (Core), 0.65V (Cache) Nominal-Vdd 1.0V Core Frequency Range Median Core Frequency Vth std. dev./mean (σ/μ) VariaNon Parameters 375MHz 725MHz 500MHz 12% (Chip), 10% (Cluster) Table 3. Summary of Experimental Parameters. 16

17 Normalized Total Chip Power Small Cache Power and Performance Power Reduc=on of Proposed Design with Three L2/L3 Cache Sizes Medium Cache ~3% ~14% Large Cache ~23% leakage Respin achieved 14% power reducnon and 11% performance improvement with medium sized cache dynamic Rela=ve Execu=on Time for Medium Cache Size Normalized Execu=on Time GEOMEAN PR-SRAM-NT HP-SRAM-CMP SH-SRAM-Nom SH-STT 17

18 Energy Consumption For medium sized cache, Respin achieved 22% energy savings with the basic shared STT-RAM cache design plus addinonal 10% with core consolidanon enabled 1.2 Energy Consump=on for Three L2/L3 Cache Sizes 1.4 Energy Consump=on for Different Core Consolida=on 1.39 Configura=ons Normalized Energy Consump=on Small Cache Medium Cache Large Cache PR-SRAM-NT SH-SRAM-Nom SH-STT Normalized Energy Consump=on GEOMEAN PR-SRAM-NT HP-SRAM-CMP SH-SRAM-Nom SH-STT SH-STT-CC SH-STT-CC-Oracle PR-STT-CC SH-STT-CC-OS 18

19 Core Consolidation Analysis In most cases our greedy algorithm matches well with the oracle while in very few cases subopnmal selecnon becomes the barrier to slow down the pace of our greedy mechanism SH-STT-CC SH-STT-CC-Oracle radix (SPLASH2): greedy achieved 48% energy savings while oracle achieved 50% Number of OFF Cores Number of Million Instructions 16 SH-STT-CC SH-STT-CC-Oracle 14 lu (SPLASH2): greedy achieved 29% energy savings while oracle achieved 38% Number of OFF Cores Sub-op=mal Number of Million Instructions 19

20 Shared Cache Access Load More than 95% of the incoming requests can be serviced in one processor core cycle Percentage of Total Cache Cycles Shared DL1 Cache U=liza=on Rate in One NT-CMP Cluster >= 4 Num of Requests Arriving at DL1 Cache blackscholes cholesky radix raytrace streamcluster AVERAGE Percentage of DL1 Read Hit Requests Frac=on of Read Hit Requests Serviced by the Shared DL1 in 1, 2, or more cycles blackscholes cholesky radix raytrace streamcluster AVERAGE 1 2 >= 3 Num of Core Cycles Serviced 20

21 Sensitivity Study on Cluster Size 16-core cluster provides the best trade-off between data sharing among cores and shared cache access load, giving ~11% performance gain Cluster Size (#cores) Shared L1 (I/D) Size (KB) Performance Gain (%)

22 Number of Active Cores in Dynamic Core Consolidation Strong phased behavior can be observed across all applicanons, making our dynamic core consolidanon mechanism very useful Average Number of Active Cores barnes cholesky fft lu SH-STT-CC SH-STT-CC-Oracle ocean radiosity radix raytracewater-nsquared AVG swaptions streamcluster bodytrack blackscholes 22

23 Outline Basics of NVM Respin Architecture Methodology and EvaluaNon Conclusion 23

24 Conclusion The first work to explore the use of non-volanle caches in near-threshold chip muln-processors A novel architecture designed to enhance NT-CMP performance and reduce energy consumpnon by sharing L1 caches and implemennng dynamic core consolidanon mechanism Achieved energy reducnon by 33% and improved performance by 11% 24

25 Questions? Thank you! 25

Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors" ASP-DAC 2014

Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors" ASP-DAC 2014 Shinobu Fujita, Kumiko Nomura, Hiroki Noguchi, Susumu Takeda, Keiko Abe Toshiba Corporation, R&D Center Advanced