
3D Implemented SRAM/DRAM Hybrid Cache Architecture for High Performance and Low Power Consumption
Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami
Kyushu University

Outline
- Why 3D?
- Will 3D always work well?
- Support Adaptive Execution!
- Conclusions


From 2D to 3D!
- Stack multiple dies
- Connect dies with through-silicon vias (TSVs)
[Figure: wire bonding (WB) 3D stacking (System in Package, SiP); Package on Package (POP) 3D stacking; TSV-based multi-level 3D IC with RF, analog, DRAM, processor, sensor, and IO layers]
Source: Yuan Xie, 3D IC Design/Architecture, Coolchips Special Session, 2009

Chip Implementation Examples from ISSCC 09
- Image sensors
- SRAM for SoCs
- DRAM
- Multi-core + SRAM connected with wireless TSVs
[Figure: Image Sensor (MIT), SRAM for SoCs (NEC), 8Gb 3D DRAM (Samsung), SRAM + Multicore (Keio Univ.)]
References:
- U. Kang et al., 8Gb DDR3 DRAM Using Through Silicon Via Technology, ISSCC 09.
- H. Saito et al., A Chip Stacked Memory for On Chip SRAM Rich SoCs and Processors, ISSCC 09.
- V. Suntharalingam et al., A 4 Side Tileable Back Illuminated 3D Integrated Mpixel CMOS Image Sensor, ISSCC 09.
- K. Niitsu et al., An Inductive Coupling Link for 3D Integration of a 90nm CMOS Processor and a 65nm CMOS SRAM, ISSCC 09.

Why 3D? (1/3)
Wire length reduction
- Replace long, high-capacitance wires with TSVs
- Low latency, low energy
- Small footprint

Why 3D? (2/3)
Integration
- From off-chip to on-chip
- Improved communication: low latency, high bandwidth, and low energy
- Heterogeneous integration, e.g. emerging devices

Why 3D? (3/3)
[Figure: performance improvement (times) versus process node (nm); labels: Stacking, Fine Process, Power Consumption]
Source: N. Miyakawa, 3D Stacking Technology for Improvement of System Performance, International Trade Partners Conference, Nov. 2008


Importance of On-Chip Caches
- Memory wall problem
  - Memory bandwidth does not scale with the number of cores
  - Growing speed gap between processor cores and DRAMs
- So the problem becomes more serious
- Let's increase on-chip cache capacity, but this requires a large chip area
[Figure: die photos of Pentium 4 (1MB L2 cache) and Core 2 Duo (4MB L2 cache), with core, bus, and level 2 cache regions labeled]
Sources: http://www.atmarkit.co.jp/, http://www.chip-architect.com/

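Will 3D always work well?
- Stacking a DRAM cache changes the average memory access time (AMAT):
  $AMAT = HT_{L1} + MR_{L1} \times (HT_{L2} + MR_{L2} \times MMAT)$
- $HT_{L1}$: L1 hit time; $MR_{L1}$: L1 miss rate; $HT_{L2}$: L2 hit time; $MR_{L2}$: L2 miss rate; $MMAT$: main memory access time
- Impact of stacking a 32MB DRAM cache: $MR_{L2}$ decreases, but $HT_{L2}$ increases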
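The AMAT trade-off can be tried out numerically. The sketch below is only an illustration: the L2 hit times (6 and 28 cycles) and the L1 latency (2 cycles) are the values printed on the Experimental Set Up slide, while the miss rates and the main memory access time are hypothetical numbers chosen for the example, not measurements from this work.

```python
def amat(ht_l1, mr_l1, ht_l2, mr_l2, mmat):
    """Average memory access time (clock cycles) for a two-level cache hierarchy."""
    return ht_l1 + mr_l1 * (ht_l2 + mr_l2 * mmat)

# Hypothetical workload parameters (assumptions, not data from the slides).
HT_L1 = 2      # L1 hit time [cc]
MR_L1 = 0.05   # L1 miss rate
MMAT = 200     # main memory access time [cc]

# A cache-size-sensitive program: the large DRAM cache removes most L2 misses.
sensitive_2d = amat(HT_L1, MR_L1, ht_l2=6, mr_l2=0.40, mmat=MMAT)
sensitive_3d = amat(HT_L1, MR_L1, ht_l2=28, mr_l2=0.10, mmat=MMAT)

# A cache-size-insensitive program: the miss rate barely changes.
insensitive_2d = amat(HT_L1, MR_L1, ht_l2=6, mr_l2=0.40, mmat=MMAT)
insensitive_3d = amat(HT_L1, MR_L1, ht_l2=28, mr_l2=0.38, mmat=MMAT)

print(f"sensitive:   2D={sensitive_2d:.2f} cc, 3D={sensitive_3d:.2f} cc")
print(f"insensitive: 2D={insensitive_2d:.2f} cc, 3D={insensitive_3d:.2f} cc")
```

With these assumed numbers the stacked DRAM cache helps the sensitive program and hurts the insensitive one, which is the situation the slide title asks about.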

Cache Size Sensitivity Varies among Programs!
[Figure: L2 miss rates [%] versus L2 size (2MB, 4MB, 8MB, 16MB, 32MB, 64MB, 128MB) for Ocean, LU, Cholesky, FFT, FMM, Barnes, WaterSpatial, and Raytrace; some curves are marked "Sensitive!" and others "Insensitive!"]

2D vs. 3D
- 2D: 2MB SRAM cache; 3D: 32MB DRAM cache
- $Profit = MR_{L2\_REDUCTION} - \frac{HT_{L2\_OVERHEAD}}{MMAT}$; the higher the Profit, the better 3D performs
[Figure: Profit as a function of MR_L2_REDUCTION [points] and HT_L2_OVERHEAD [cc], with points for 172.mgrid, LU, 171.swim, FMM, Ocean, 181.mcf, 256.bzip2, WaterSpatial, Cholesky, 188.ammp, Barnes, FFT, 179.art, 300.twolf, and 301.apsi]
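One reading of the Profit expression that is consistent with the AMAT formula is Profit = MR_L2_REDUCTION - HT_L2_OVERHEAD / MMAT, where a positive value means the 3D DRAM cache lowers AMAT. The sketch below assumes that reading; miss rates are given as fractions rather than percentage points, and all input values are hypothetical.

```python
def profit(mr_l2_sram, mr_l2_dram, ht_l2_sram, ht_l2_dram, mmat):
    """Profit of moving from the 2D SRAM cache to the 3D DRAM cache.

    Assumed form: miss-rate reduction minus hit-time overhead
    normalized by the main memory access time.
    """
    mr_l2_reduction = mr_l2_sram - mr_l2_dram   # fraction of L2 accesses
    ht_l2_overhead = ht_l2_dram - ht_l2_sram    # clock cycles
    return mr_l2_reduction - ht_l2_overhead / mmat

# Hypothetical programs (assumptions for illustration only).
sensitive = profit(mr_l2_sram=0.40, mr_l2_dram=0.10,
                   ht_l2_sram=6, ht_l2_dram=28, mmat=200)
insensitive = profit(mr_l2_sram=0.05, mr_l2_dram=0.04,
                     ht_l2_sram=6, ht_l2_dram=28, mmat=200)

for name, value in [("sensitive", sensitive), ("insensitive", insensitive)]:
    choice = "3D DRAM cache" if value > 0 else "2D SRAM cache"
    print(f"{name}: profit={value:+.3f} -> prefer {choice}")
```

The sign of the result matches a direct AMAT comparison, because the L1 terms are identical in both configurations and cancel out.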

Appropriate Cache Size Varies within Programs!
[Figure: L1 miss penalty [cc] (the lower, the better) versus time interval (100K L2 accesses per interval) for Ocean, comparing 2MB (2cc) and 32MB (60cc) caches]


SRAM/DRAM Hybrid Cache Architecture
- Supports two operation modes
  - High-speed, small cache mode (SRAM cache mode)
  - Low-speed, large cache mode (DRAM cache mode)
- Adapts to variation in application behavior
[Figure: in DRAM cache mode the 32MB DRAM cache is active; in SRAM cache mode the 32MB DRAM cache is power gated]

Microarchitecture (1/2)
[Figure: a 2-way set-associative SRAM cache (tag, way 0, way 1) and a 2-way set-associative 32MB DRAM cache (tag, way 0, way 1)]

Microarchitecture (2/2)
- SRAM cache: size $C_S$, block size $L_S$, associativity $W_S$
- DRAM cache: size $C_D$, block size $L_D$, associativity $W_D$
- Assume $L_D = L_S = 64$B, so a 64-bit physical address has a 6-bit offset
- Index widths: $I_S = \lg \frac{C_S}{L_S \times W_S}$ and $I_D = \lg \frac{C_D}{L_D \times W_D}$
- Tag widths: $58 - I_S$ bits (SRAM) and $58 - I_D$ bits (DRAM)
[Figure: datapath with tag comparators and MUXes producing Hit/Miss (SRAM), Data (SRAM), Hit/Miss (DRAM), and Data (DRAM)]
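The field widths follow from the cache geometry. As a minimal sketch, the helper below (the names `address_fields` and `split_address` are made up for this example) computes the tag, index, and offset widths for a 64-bit physical address and splits an address accordingly. The 2MB and 32MB, 8-way, 64B-block configurations are the ones listed on the Experimental Set Up slide.

```python
ADDRESS_BITS = 64  # physical address width assumed on the slide

def lg(n):
    """Base-2 logarithm of a power of two, computed exactly with integers."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1

def address_fields(cache_bytes, block_bytes, ways):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache."""
    offset_bits = lg(block_bytes)
    index_bits = lg(cache_bytes // (block_bytes * ways))  # lg(C / (L * W))
    tag_bits = ADDRESS_BITS - offset_bits - index_bits
    return tag_bits, index_bits, offset_bits

def split_address(addr, cache_bytes, block_bytes, ways):
    """Split a physical address into (tag, index, offset)."""
    _, index_bits, offset_bits = address_fields(cache_bytes, block_bytes, ways)
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

MB = 2 ** 20
sram_fields = address_fields(2 * MB, 64, 8)    # SRAM cache mode
dram_fields = address_fields(32 * MB, 64, 8)   # DRAM cache mode
print("SRAM (tag, index, offset):", sram_fields)
print("DRAM (tag, index, offset):", dram_fields)
```

Because the DRAM cache has more sets, it uses more index bits and fewer tag bits than the SRAM cache for the same address, which is why the datapath needs separate comparators for the two modes.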

How to Adapt
- Static approach
  - Optimizes at program level
  - Does not change the mode during execution
  - Needs static analysis
- Dynamic approach
  - Optimizes at interval level (or phase level)
  - Needs run-time profiling
[Figure: L2 cache miss rates [%] versus L2 cache size (2MB to 128MB) for Barnes, FFT, and FMM; L1 miss penalty [cc] versus interval for Ocean with 2MB (2cc) and 32MB (60cc) caches]

Experimental Set Up
- Processor: in-order
- Benchmarks: SPEC CPU 2000, Splash2
- The operation mode is set at the beginning of program execution and is maintained until the end
- Assume an appropriate operation mode is known for each benchmark
Configurations:
- 2D BASE: core @ 3GHz, L1 I/D, 2D SRAM L2 cache, main memory
- 3D CONV: core @ 3GHz, L1 I/D, 3D DRAM L2 cache, main memory
- 3D HYBRID: L2 SRAM cache plus 3D DRAM cache
Parameters:
- L1 D and L1 I caches: 32KB, access latency 2 clock cycles
- L2 SRAM cache: 2MB, 64B block, 8-way, latency 6 clock cycles
- 3D DRAM cache: 32MB, 64B block, 8-way, latency 28 clock cycles
- Lat.: 8 clock cycles

Evaluation Results
[Figure: normalized memory performance and normalized memory energy of 2D BASE, 3D CONV, and 3D HYBRID across the benchmark programs]

Run-Time Mode Selection
- Divide program execution into epochs, e.g. 200K L2 misses
- Predict an appropriate operation mode for the next epoch
  - In SRAM mode, a small tag RAM that stores sampled tags is used to predict DRAM-mode miss rates
- Hardware support for measurement
- If $MR_{L2SRAM} - \frac{HT_{L2DRAM} - HT_{L2SRAM} + AveOverhead}{MMAT} > MR_{L2DRAM}$, then transition from SRAM mode to DRAM mode
[Figure: epochs N-2, N-1, N, and N+1; the 32MB DRAM cache is power gated (SRAM cache mode) in epochs N-2 to N and active (DRAM cache mode) in epoch N+1]
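The per-epoch decision can be sketched in software as follows. This is only an illustration under stated assumptions: the SRAM-to-DRAM test is one reading of the condition on the slide (miss-rate gain must exceed the hit-time and switching overhead normalized by MMAT), the rule for returning to SRAM mode is not given in the slides and is an assumption here, and `EpochStats`, `next_mode`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EpochStats:
    mr_l2_sram: float  # L2 miss rate in SRAM mode (measured or estimated)
    mr_l2_dram: float  # L2 miss rate in DRAM mode (e.g. estimated from sampled tags)

def next_mode(current_mode, stats, ht_sram, ht_dram, ave_overhead, mmat):
    """Choose the operation mode for the next epoch."""
    gain = stats.mr_l2_sram - stats.mr_l2_dram
    if current_mode == "SRAM":
        # Condition read from the slide: switch only if the miss-rate gain
        # pays for the slower hit time plus the average switching overhead.
        cost = (ht_dram - ht_sram + ave_overhead) / mmat
        return "DRAM" if gain > cost else "SRAM"
    # Assumption (not in the slides): stay in DRAM mode while the gain still
    # covers the hit-time penalty; otherwise fall back to SRAM mode.
    cost = (ht_dram - ht_sram) / mmat
    return "DRAM" if gain > cost else "SRAM"

# Hypothetical per-epoch statistics and latencies.
epochs = [EpochStats(0.30, 0.28), EpochStats(0.40, 0.10),
          EpochStats(0.35, 0.12), EpochStats(0.06, 0.05)]

mode = "SRAM"
predicted = []
for stats in epochs:
    mode = next_mode(mode, stats, ht_sram=6, ht_dram=28, ave_overhead=10, mmat=200)
    predicted.append(mode)

print(predicted)
```

Charging the switching overhead only on the SRAM-to-DRAM transition is a simplification; a real controller might also account for the cost of leaving DRAM mode.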

Results
[Figure: normalized AMAT of 2D SRAM, DRAM STACK, HYBRID, and IDEAL HYBRID for ammp, art, bzip2, mcf, mgrid, swim, twolf, Cholesky, FFT, FMM, LU, and Ocean]

Results
[Figure: normalized AMAT of 2D SRAM, DRAM STACK, HYBRID, and IDEAL HYBRID, together with the accuracy of mode selection, per benchmark program; mgrid, swim, and twolf are highlighted]

Results
[Figure: normalized AMAT of 2D SRAM, DRAM STACK, HYBRID, and IDEAL HYBRID, together with the accuracy of mode selection, per benchmark program; bzip2 and Cholesky are highlighted]


Conclusions
- The 3D solution is one of the most promising ways to achieve
  - High performance
  - Low energy
- It does not ALWAYS work well!
- Run-time adaptive execution by considering memory access behavior

Acknowledgement
This research was supported in part by the New Energy and Industrial Technology Development Organization.