for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami

1 3D-Implemented SRAM/DRAM Hybrid Cache Architecture for High Performance and Low Power Consumption. Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami, Kyushu University

2 Outline: Why 3D? Will 3D always work well? Support Adaptive Execution! Conclusions

3 Outline: Why 3D? Will 3D always work well? Support Adaptive Execution! Conclusions

4 From 2D to 3D! Stack multiple dies and connect the dies with through-silicon vias (TSVs). Stacking options: wire bonding (WB) 3D stacking (System in Package, or SiP); Package on Package (PoP) 3D stacking; TSV-based multi-level 3D IC. [Figure: multi-level 3D IC with layers labeled Sensor, IO, RF, Analog, DRAM, Processor.] Source: Yuan Xie, 3D IC Design/Architecture, Coolchips Special Session, 2009

5 Chip Implementation Examples from ISSCC '09. Categories: image sensors; SRAM for SoCs; DRAM; multi-core + SRAM connected with wireless TSVs. [Figure: chip photos of the image sensor (MIT), SRAM for SoCs (NEC), 8Gb 3D DRAM (Samsung), and SRAM + multicore (Keio Univ.).] References: U. Kang et al., 8Gb DDR3 DRAM Using Through-Silicon-Via Technology, ISSCC '09. H. Saito et al., A Chip-Stacked Memory for On-Chip SRAM-Rich SoCs and Processors, ISSCC '09. V. Suntharalingam et al., A 4-Side Tileable Back-Illuminated 3D-Integrated Mpixel CMOS Image Sensor, ISSCC '09. K. Niitsu et al., An Inductive-Coupling Link for 3D Integration of a 90nm CMOS Processor and a 65nm CMOS SRAM, ISSCC '09.

6 Why 3D? (1/3) Wire Length Reduction: replace long, high-capacitance wires by TSVs. Low latency, low energy. Small footprint.

7 Why 3D? (2/3) Integration: from off-chip to on-chip. Improved communication: low latency, high bandwidth, and low energy. Heterogeneous integration, e.g. emerging devices.

8 Why 3D? (3/3) [Figure: performance improvement (times) and power consumption versus process node (nm); series labeled Stacking and Fine Process.] Source: N. Miyakawa, 3D Stacking Technology for Improvement of System Performance, International Trade Partners Conference, Nov

9 Outline: Why 3D? Will 3D always work well? Support Adaptive Execution! Conclusions

10 Importance of On-Chip Caches. Memory wall problem: memory bandwidth does not scale with the number of cores, and the speed gap between processor cores and DRAMs keeps growing, so the problem becomes more serious. Let's increase on-chip cache capacity, but this requires a large chip area. [Figure: die photos of a Pentium 4 (core, bus, level 2 cache; MB L2 Cache) and a Core 2 Duo (4MB L2 cache). Source: architect.com/]

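11 Will 3D always work well? Stacking a DRAM cache. Terms: L1 hit time ($HT_{L1}$), L1 miss rate ($MR_{L1}$), L2 hit time ($HT_{L2}$), L2 miss rate ($MR_{L2}$), main memory access time ($MMAT$), average memory access time ($AMAT$). Impact of DRAM stacking (32MB DRAM cache): $AMAT = HT_{L1} + MR_{L1} \times (HT_{L2} + MR_{L2} \times MMAT)$. DRAM stacking affects the $HT_{L2}$ and $MR_{L2}$ terms.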
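
The average memory access time on this slide can be evaluated numerically. The following is a minimal sketch, not the authors' evaluation code: the function name `amat` and every number in the example (hit times, miss rates, main memory access time) are hypothetical values chosen only to show how a larger but slower L2 can either help or hurt.

```python
def amat(ht_l1, mr_l1, ht_l2, mr_l2, mmat):
    """Average memory access time in clock cycles.

    ht_* are hit times in cycles, mr_* are miss rates as fractions (0..1),
    and mmat is the main memory access time in cycles.
    """
    return ht_l1 + mr_l1 * (ht_l2 + mr_l2 * mmat)


# Hypothetical program whose L2 miss rate drops sharply with a larger cache.
sensitive_2d = amat(ht_l1=2, mr_l1=0.05, ht_l2=6, mr_l2=0.40, mmat=200)
sensitive_3d = amat(ht_l1=2, mr_l1=0.05, ht_l2=28, mr_l2=0.10, mmat=200)

# Hypothetical program whose L2 miss rate barely changes with cache size.
insensitive_2d = amat(ht_l1=2, mr_l1=0.05, ht_l2=6, mr_l2=0.40, mmat=200)
insensitive_3d = amat(ht_l1=2, mr_l1=0.05, ht_l2=28, mr_l2=0.38, mmat=200)

print("sensitive:   2D", sensitive_2d, "3D", sensitive_3d)
print("insensitive: 2D", insensitive_2d, "3D", insensitive_3d)
```

With these assumed numbers the stacked DRAM cache lowers AMAT for the first program and raises it for the second, which is the situation the question "Will 3D always work well?" points at.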

12 Cache Size Sensitivity Varies among Programs! [Figure: L2 miss rates [%] versus L2 size (2MB, 4MB, 8MB, 16MB, 32MB, 64MB, 128MB) for Ocean, LU, Cholesky, FFT, FMM, Barnes, WaterSpatial, and Raytrace; some programs are marked "Sensitive!" and others "Insensitive!".]

13 2D vs. 3D: 2D 2MB SRAM cache versus 3D 32MB DRAM cache. $Profit = MR_{L2\_REDUCTION} - \frac{HT_{L2\_OVERHEAD}}{MMAT}$. [Figure: $MR_{L2\_REDUCTION}$ [points] versus $HT_{L2\_OVERHEAD}$ [cc], with a region labeled "Better 3D"; benchmarks plotted: 172.mgrid, LU, 171.swim, FMM, Ocean, 181.mcf, 256.bzip2, WaterSpatial, Cholesky, 188.ammp, Barnes, FFT, 179.art, 300.twolf, 301.apsi.]
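
The profit quantity on this slide weighs the L2 miss-rate reduction against the L2 hit-time overhead, scaled by the main memory access time. The sketch below assumes the form "miss-rate reduction minus hit-time overhead divided by MMAT", which follows from comparing the two AMAT expressions; the function name `profit`, the use of fractional miss rates rather than percentage points, and all numbers are assumptions for illustration.

```python
def profit(mr_l2_reduction, ht_l2_overhead, mmat):
    """Positive when the larger, slower L2 is expected to reduce AMAT.

    mr_l2_reduction: L2 miss-rate reduction as a fraction (e.g. 0.30).
    ht_l2_overhead:  extra L2 hit time in clock cycles.
    mmat:            main memory access time in clock cycles.
    """
    return mr_l2_reduction - ht_l2_overhead / mmat


# Hypothetical cache-size-sensitive program: large miss-rate reduction.
print(profit(mr_l2_reduction=0.30, ht_l2_overhead=22, mmat=200))

# Hypothetical cache-size-insensitive program: the overhead dominates.
print(profit(mr_l2_reduction=0.02, ht_l2_overhead=22, mmat=200))
```

A positive value corresponds to the "Better 3D" side of the plot; a negative value means the 2D SRAM cache is the better choice under these assumptions.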

14 Appropriate Cache Size Varies within Programs! [Figure: L1 miss penalty [cc] per time interval (100K L2 accesses per interval) for Ocean, comparing 2MB(2cc) and 32MB(60cc) L2 caches; the lower, the better.]

15 Outline: Why 3D? Will 3D always work well? Adaptive Execution! Conclusions

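16 Will 3D always work well? Stacking a DRAM cache. Terms: L1 hit time ($HT_{L1}$), L1 miss rate ($MR_{L1}$), L2 hit time ($HT_{L2}$), L2 miss rate ($MR_{L2}$), main memory access time ($MMAT$), average memory access time ($AMAT$). Impact of DRAM stacking (32MB DRAM cache): $AMAT = HT_{L1} + MR_{L1} \times (HT_{L2} + MR_{L2} \times MMAT)$. DRAM stacking affects the $HT_{L2}$ and $MR_{L2}$ terms.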

17 SRAM/DRAM Hybrid Cache Architecture. Supports two operation modes: high-speed, small-cache mode (or SRAM cache mode) and low-speed, large-cache mode (or DRAM cache mode). Adapts to variation of application behavior. [Figure: in DRAM cache mode the 32MB DRAM cache is active; in SRAM cache mode the 32MB DRAM cache is power gated.]

18 Microarchitecture (1/2) [Figure: a 2-way set-associative SRAM cache (tag, way 0, way 1) stacked with a 32MB DRAM cache organized as a 2-way set-associative DRAM cache (tag, way 0, way 1).]

19 Microarchitecture (2/2) SRAM (size: $C_S$, block: $L_S$, associativity: $W_S$). DRAM (size: $C_D$, block: $L_D$, associativity: $W_D$). Assume $L_D = L_S = 64$B. The 64b physical address is split into tag, index, and offset fields. Index widths: $I_S = \lg \frac{C_S}{L_S \times W_S}$ and $I_D = \lg \frac{C_D}{L_D \times W_D}$. Tag field widths: $58 - I_S$ (SRAM) and $58 - I_D$ (DRAM). [Figure: the index selects a set in each cache; tag comparators (=) and MUXes produce Hit/Miss (SRAM), Data (SRAM), Hit/Miss (DRAM), and Data (DRAM).]
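
The field widths on this slide follow from the cache geometry: with 64B blocks the offset takes 6 bits of the 64b address, leaving 58 bits to share between index and tag. The sketch below is an illustration only; the helper names `field_widths` and `split_address` are hypothetical, and the 2MB and 32MB, 64B-block, 8-way geometries are taken from the experimental set-up slide as an example configuration.

```python
ADDRESS_BITS = 64
MB = 1024 * 1024


def field_widths(cache_bytes, block_bytes, ways):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache.

    Assumes cache_bytes, block_bytes and the number of sets are powers of two.
    """
    sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1  # lg(block size)
    index_bits = sets.bit_length() - 1          # lg(C / (L * W))
    tag_bits = ADDRESS_BITS - offset_bits - index_bits
    return tag_bits, index_bits, offset_bits


def split_address(addr, cache_bytes, block_bytes, ways):
    """Split a physical address into (tag, index, offset)."""
    _, index_bits, offset_bits = field_widths(cache_bytes, block_bytes, ways)
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


print("SRAM fields:", field_widths(2 * MB, 64, 8))
print("DRAM fields:", field_widths(32 * MB, 64, 8))
print("SRAM split: ", split_address(0x12345678, 2 * MB, 64, 8))
```

Because the DRAM cache has more sets, its index is wider and its tag is narrower than the SRAM cache's, which is why the two arrays need separate tag comparisons.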

20 How to Adapt. Static approach: optimizes at program level; does not change the mode during execution; needs a static analysis. Dynamic approach: optimizes at interval level (or phase level); needs run-time profiling. [Figures: L2 cache miss rates [%] versus L2 cache size (2MB, 4MB, 8MB, 16MB, 32MB, 64MB, 128MB) for Barnes, FFT, and FMM; L1 miss penalty [cc] per interval for Ocean with 2MB(2cc) and 32MB(60cc).]

21 Experimental Set-Up. Processor: in-order. Benchmarks: SPEC CPU 2000, Splash2. The operation mode is set at the beginning of the program execution (and is maintained until the end). Assume an appropriate operation mode is known for each benchmark. Configurations: 2D BASE (core @ 3GHz, L1 I/D, 2D SRAM L2 cache, main memory); 3D CONV (core @ 3GHz, L1 I/D, 3D DRAM L2 cache, main memory); 3D HYBRID. Parameters: L1D, L1I caches: 32KB, access lat.: 2 clock cycles. L2 SRAM cache: 2MB, 64B block, 8-way, lat.: 6 clock cycles. 3D DRAM cache: 32MB, 64B block, 8-way, lat.: 28 clock cycles. Lat.: 8 clock cycles.

22 Evaluation Results [Figure: normalized memory performance and normalized memory energy of 2D BASE, 3D CONV, and 3D HYBRID across the benchmark programs.]

23 How to Adapt. Static approach: optimizes at program level; does not change the mode during execution; needs a static analysis. Dynamic approach: optimizes at interval level (or phase level); needs run-time profiling. [Figures: L2 cache miss rates [%] versus L2 cache size (2MB, 4MB, 8MB, 16MB, 32MB, 64MB, 128MB) for Barnes, FFT, and FMM; L1 miss penalty [cc] per interval for Ocean with 2MB(2cc) and 32MB(60cc).]

24 Run-Time Mode Selection. Divide program execution into epochs, e.g. 200K L2 misses. Predict an appropriate operation mode for the next epoch. In SRAM mode, a small tag RAM which stores sampled tags is used to predict DRAM-mode miss rates (hardware support for measurement). If $MR_{L2SRAM} - \frac{HT_{L2DRAM} - HT_{L2SRAM} + AveOverhead}{MMAT} > MR_{L2DRAM}$ then transit from SRAM mode to DRAM mode! [Figure: operation mode over epochs N-2, N-1, N, N+1; the 32MB DRAM cache is power gated (SRAM cache mode) in the earlier epochs and active (DRAM cache mode) in the last one.]
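
The epoch-based selection on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' hardware: the switch condition is the one implied by comparing AMAT in the two modes, the names `Mode`, `should_switch_to_dram`, and `select_modes` are hypothetical, the per-epoch miss rates are invented, and the reverse transition (DRAM mode back to SRAM mode) is not modeled because the slide does not describe it.

```python
from enum import Enum


class Mode(Enum):
    SRAM = "SRAM cache mode"
    DRAM = "DRAM cache mode"


def should_switch_to_dram(mr_l2_sram, mr_l2_dram_pred,
                          ht_l2_sram, ht_l2_dram, ave_overhead, mmat):
    """True when the predicted miss-rate gain outweighs the extra hit time.

    Miss rates are fractions; hit times, overhead and mmat are clock cycles.
    mr_l2_dram_pred stands for the DRAM-mode miss rate predicted from the
    sampled tag RAM.
    """
    penalty = (ht_l2_dram - ht_l2_sram + ave_overhead) / mmat
    return mr_l2_sram - penalty > mr_l2_dram_pred


def select_modes(epochs, ht_l2_sram, ht_l2_dram, ave_overhead, mmat):
    """Return the mode used in each epoch, starting in SRAM mode.

    epochs is a list of (measured SRAM miss rate, predicted DRAM miss rate).
    The decision made at the end of one epoch applies to the next epoch.
    """
    mode = Mode.SRAM
    used = []
    for mr_sram, mr_dram_pred in epochs:
        used.append(mode)
        if mode is Mode.SRAM and should_switch_to_dram(
                mr_sram, mr_dram_pred, ht_l2_sram, ht_l2_dram,
                ave_overhead, mmat):
            mode = Mode.DRAM
    return used


# Hypothetical measurements for four consecutive epochs.
epochs = [(0.10, 0.09), (0.12, 0.10), (0.45, 0.10), (0.45, 0.10)]
modes = select_modes(epochs, ht_l2_sram=6, ht_l2_dram=28,
                     ave_overhead=2, mmat=200)
print([m.name for m in modes])
```

With these assumed numbers the cache stays in SRAM mode for three epochs and enters DRAM mode in the fourth, mirroring the timeline in the slide's figure.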

25 Results [Figure: normalized AMAT of 2D SRAM, DRAM STACK, HYBRID, and IDEAL HYBRID for the benchmark programs ammp, art, bzip2, mcf, mgrid, swim, twolf, Cholesky, FFT, FMM, LU, and Ocean.]

26 Results [Figure: normalized AMAT of 2D SRAM, DRAM STACK, HYBRID, and IDEAL HYBRID, together with the accuracy of mode selection, for the benchmark programs ammp, art, bzip2, mcf, mgrid, swim, twolf, Cholesky, FFT, FMM, LU, and Ocean; mgrid, swim, and twolf are highlighted.]

27 Results [Figure: normalized AMAT of 2D SRAM, DRAM STACK, HYBRID, and IDEAL HYBRID, together with the accuracy of mode selection, for the benchmark programs ammp, art, bzip2, mcf, mgrid, swim, twolf, Cholesky, FFT, FMM, LU, and Ocean; bzip2 and Cholesky are highlighted.]

28 Outline: Why 3D? Will 3D always work well? Adaptive Execution! Conclusions

29 Conclusions. The 3D solution is one of the most promising ways to achieve high performance and low energy. It does not ALWAYS work well! Run-time adaptive execution by considering memory-access behavior.

30 Acknowledgement. This research was supported in part by the New Energy and Industrial Technology Development Organization.
