Centip3De: A 64-Core, 3D Stacked, Near-Threshold System

Size: px

Start display at page:

Download "Centip3De: A 64-Core, 3D Stacked, Near-Threshold System"

Horace Golden
6 years ago
Views:

Satpathy, Yoonmyung Lee, Daeyeon Kim, Nurrachman

1 1 1 1 Centip3De: A 64-Core, 3D Stacked, Near-Threshold System Ronald G. Dreslinski David Fick, Bharan Giridhar, Gyouho Kim, Sangwon Seo, Matthew Fojtik, Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim, Nurrachman Liu, Michael Wieckowski, Gregory Chen, Trevor Mudge, Dennis Sylvester, David Blaauw

2 The Problem of Power Power does not decrease at the same rate that transistor count increases, resulting in increased energy density Circuit supply voltages are no longer scaling Dynamic dominates U CV 2 dd A I V leak dd Af A = gate area scaling 1/s 2 C = capacitance scaling < 1/s The emerging dilemma: More and more gates can fit on a die, but cooling constraints are restricting their use 2 2

3 Today: Super-V th, High Performance, Power Constrained Energy / Operation Log (Delay) Super-Vth Normalized CPU Metrics Large gate overdrive favors performance with unsustainable power density Must design within fixed TDP 0 Vth Vnom Supply Voltage Goal: maintain performance, improved Energy/Operation 3 3

4 Subthreshold Design Energy / Operation Sub-Vth Super-Vth 12-16X Log (Delay) X Operating in sub-threshold yields large power gains at the expense of performance. 0 Vth Vnom Supply Voltage Applications: sensors, medical 4 4

Subthreshold Design Energy / Operation Sub-Vth Super-Vth 12-16X Log (Delay) 500 1000X Phoenix 2 Processor, ISSCC 10 Operating in

5 Subthreshold Design Energy / Operation Sub-Vth Super-Vth 12-16X Log (Delay) X Phoenix 2 Processor, ISSCC 10 Operating in sub-threshold yields large power gains at the expense of performance. 0 Vth Vnom Supply Voltage Applications: sensors, medical 5 5

Near-Threshold Computing (NTC): >60X power reduction 6-8X

6 Near-Threshold Computing (NTC) Energy / Operation Log (Delay) Sub-Vth NTC Super-V th ~6-8X ~2X ~50-100X ~10X Near-Threshold Computing (NTC): >60X power reduction 6-8X energy reduction Enables 3D integration 0 Vth Vnom Supply Voltage 6 6

7 Architectural Impact of NTC L2 L1 Vt Core L2 L1 Core Caches have higher Vopt and operating frequency Smaller activity rate when compared to core logic Leakage larger proportion of total power in caches New Architectures Possible 8 8

Proposed NTC Architecture SRAM is run at a higher V DD Caches operate faster than core Can introduce clustered architecture Multiple cores share L1

Drawbacks: Core conflicts evicting L1 data Not dominant in simulation Longer interconnect 3D addressable Next Level Memory BUS / Switched Network

8 Proposed NTC Architecture SRAM is run at a higher V DD Caches operate faster than core Can introduce clustered architecture Multiple cores share L1 Cores see private L1 L1 still provides single-cycle latency Advantages: Less coherence/snoop traffic Larger cache for processes that need it Drawbacks: Core conflicts evicting L1 data Not dominant in simulation Longer interconnect 3D addressable Next Level Memory BUS / Switched Network L1 L1 L1 L1 L1 Core Core Core Core Core Next Level Memory BUS / Switched Network Cluster Cluster Cluster Cluster L1 Shared L1 Cache Core Core Core Core 9 9

Proposed Boosting Approach Measured results for 130nm LP design 10MHz becomes ~110MHz in 32nm simulation 140 FO4 delay core Core Cluster L1 Core Core

Single Thread Performance Turn some cores off, speed up the rest Cache de-pipelined Faster response time, same throughput Core sees larger cache Faster

9 Proposed Boosting Approach Measured results for 130nm LP design 10MHz becomes ~110MHz in 32nm simulation 140 FO4 delay core Core Cluster L1 Core Core Core Baseline Cache runs 4x core frequency Pipelined cache Core 4x Cluster L1 4 10MHz (650mV) 40MHz (800mV) Core Core Core 8x Better Single Thread Performance Turn some cores off, speed up the rest Cache de-pipelined Faster response time, same throughput Core sees larger cache Faster cores needs larger caches 1 40MHz (850mV) 80MHz (1.15 V) Core Cluster L1 Core Core Core 1 80MHz (1.15V) 160MHz (1.65V) 10 10

10 Cache Timing NTC Mode (3/4 Cores) Low power Tag arrays read first 0-1 data arrays accessed Boost Mode (1/2) Low latency Data and tags read in parallel 4 data arrays accessed Tag Array = Data Array Tag Array = Data Array 11 11

11 Cache Timing IF/DE Stage EX Stage Cache Access MEM Stage Other Access Edge A Tag Read Edge B Tag Comp Edge C Data Read Edge D Idle Other Access Tag Array = Data Array NTC Mode (3/4 Cores) Low power Tag arrays read first 0-1 data arrays accessed 12 12

12 Cache Timing IF/DE Stage EX Stage Cache Access MEM Stage Other Access Tag & Data Read Tag Compare & Mem Access Other Access Edge A Edge B Boost Mode (1/2) Low latency Data and tags read in parallel 4 data arrays accessed Tag Array = Data Array 13 13

13 Centip3De System Overview 14 14

14 Centip3De System Overview 7-Layer NTC system 2-Layer system completed fabrication with measured results Measured Full 7-layer system expected End of

15 Centip3De System Overview Cluster architecture 4 Cores/cluster 1kB I$, 8kB D$ Local clock controller operates cores 90 Out-of-phase 1591 F2F connections per cluster Cluster x32 Organized into layer pairs (cache core) Minimizes routing Up to two pairs 16 clusters per pair Cores have only vertical interconnections 16 16

16 Centip3De System Overview Bus interconnect architecture Up to 500 MHz 9-11 cycle latency 1-3 core cycles 8 lanes, each 128b One per interface Each cluster connects to all eight 1024b total Vertically connected through all four layers Flipping interface enables 128-core system Bus System 17 17

17 Centip3De System Overview 3D-Stacked Tezzaron Octopus 1 control layer 130nm CMOS 1 Gb bitcell layers Up to two layers process 8x 128b DDR2 interfaces Operated at bus frequency (up to 500 MHz) System 18 18

18 Centip3De System Overview F2F 3024 B2B F2F 3624 B2B 19 19

19 Centip3De System Overview 130nm process 12.66x5mm per layer 28.4M device core layer 18.0M device cache layer 20 20

20 2-Layer Stacking Process Evaluated Aluminum wirebonding pads connected to perimeter TSVs like for 7-layer Wirebonds F2F Core Layer Cache Layer N N P P For the measured 2-layer system, aluminum wirebond pads were used instead 22 22

21 Cache 3D Connections SRAM SRAM Cache Sea of Gates SRAM SRAM SRAM SRAM SRAM SRAM SRAM SRAM SRAM SRAM SRAM SRAM SRAM SRAM Bus 23 23

22 Core 3D Connections Core 3 Core 2 Sea of Gates Core 0 Core

23 Cluster 3D Connections 1591 F2F Connections Each saved ~ um in routing Prevented wiring congestion around SRAMS 25 25

24 Silicon Results Top Core Layer Cortex [064] Cortex [065] Cortex [066] Cortex [067] Cortex [092] Cortex [093] Cortex [094] Cortex [095] Bus Hub Disabled Due To Redundancy Bus Hub Cortex [096] Cortex [097] Cortex [098] Cortex [099] Cortex [124] Cortex [125] Cortex [126] Cortex [127] Top Cache Layer I$/D$ [16] I$/D$ [23] Cache Bus Hub 8x8 Crossbar Cache Bus Hub 8x8 Crossbar I$/D$ [24] I$/D$ [31] Flipping Flipping Flipping Flipping Bottom Cache Layer I$/D$ [00] I$/D$ [07] Cache Bus Hub 8x8 Crossbar Cache Bus Hub 8x8 Crossbar I$/D$ [08] I$/D$ [15] Bottom Core Layer Cortex [000] Cortex [001] Cortex [002] Cortex [003] Cortex [028] Cortex [029] Cortex [030] Cortex [031] Bus Hub 8x4 Crossbar Bus Hub 8x4 Crossbar Cortex [032] Cortex [033] Cortex [034] Cortex [035] Cortex [060] Cortex [061] Cortex [062] Cortex [063] Tezzaron Octopus Control Layer Bitcell Layer Bitcell Layer 26 26

25 Die Shot Looking through back of core-layer / Bus Hub 4-Core Cluster Aluminum wirebond pads 130nm process 12.66x5mm per layer 28.4M device core layer 18.0M device cache layer 27 27

26 System Configurations Cache Bus Hub 4 Core Mode Cache Bus Hub 2 Core Mode 160 MHz 1.15 Volts I$/D$ Div 4x 40 MHz 0.80 Volts 160 MHz 1.15 Volts I$/D$ Div 2x 80 MHz 1.15 Volts 0 Core Boosted 0 Cores Gated Div 4x 10 MHz 0.65 Volts 2 Core Boosted 2 Cores Gated Div 2x 40 MHz 0.85 Volts Cache Bus Hub 3 Core Mode Cache Bus Hub 1 Core Mode 160 MHz 1.15 Volts I$/D$ Div 2x 80 MHz 1.15 Volts 320 MHz 1.6 Volts I$/D$ Div 2x 160 MHz 1.65 Volts 3 Cores Boosted 1 Core Gated Div 4x 20 MHz 0.75 Volts 1 Core Boosted 3 Cores Gated Div 2x 80 MHz 1.15 Volts 28 28

27 Measured Results Boosting a single cluster to 1-core mode requires disabling, or down-boosting other clusters 1-core cluster: = 15x 4-core clusters = 6x 3-core clusters = 4.5x 2-core clusters Baseline configuration depends on TDP and processing needs Power (mw) Core Power Cache Power Memory System Power Core 3-Core 2-Core 1-Core System Configuration

28 Measured Results Single-Threaded Performance (DMIPS) Core 3-Core 2-Core 1-Core System Configuration Power (mw) Core Power Cache Power Memory System Power Core 3-Core 2-Core 1-Core System Configuration 30 30

29 Measured Results Efficiency (DMIPS/Watt) Measured Results: Centip3De 3,930 (130nm) Industry Comparison: ARM A9 8,000 (40nm) [1] Estimated Results: Centip3De 18,500 (45nm) 0 4-Core 3-Core 2-Core 1-Core System Configuration [1] ARM Ltd,

30 Conclusion Near threshold computing (NTC) Need low power solutions to maintain TDP Achieves 10x energy efficiency => 10x more computation to give TDP Offers optimum balance between performance and energy Allows boosting for single threaded performance (Amdahl's law) Large scale 3D CMP demonstrated 64 cores currently 128 cores + in the future 3D design shown to be feasible This work was funded and organized with the help of DARPA, Tezzaron, ARM, and the National Science Foundation 32 32

CENTIP3DE: A 64-CORE, 3DSTACKED NEAR-THRESHOLD SYSTEM

... CENTIP3DE: A 64-CORE, 3DSTACKED NEAR-THRESHOLD SYSTEM... Ronald G. Dreslinski David Fick Bharan Giridhar Gyouho Kim Sangwon Seo Matthew Fojtik Sudhir Satpathy Yoonmyung Lee Daeyeon Kim Nurrachman Liu