3D Memory Architecture. Kyushu University

Size: px

Start display at page:

Download "3D Memory Architecture. Kyushu University"

Kevin Shelton
6 years ago
Views:

1 3D Memory Architecture Koji Inoue Kyushu University 1

2 Outline Why 3D? Will 3D always work well? Support Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 2

3 Outline Why 3D? Will 3D always work well? Support Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 3

4 From 2D to 3D! (not only TV) Stack Multiple Dies Connect Dies with Through Silicon Vias RF Analog DRAM Processor Multi Level 3D IC Sensor IO 4

5 Chip Implementation Examples from ISSCC 09 Image Sensors SRAM for SoCs DRAM Multi core + SRAM connected with wireless TSVs U. Kang et al., 8Gb DDR3 DRAM Using Through Silicon Via Technology, ISSCC 09. H. Saito et al., A Chip Stacked Memory for On Chip SRAM Rich SoCs and Processors, ISSCC 09. V. Suntharalingam et al., A 4 Side Tileable Back Illuminated 3D Integrated Mpixel CMOS Image Sensor, ISSCC 09. K. Niitsu et al., An Inductive Coupling Link for 3D Integration of a 90nm CMOS Processor and a 65nm CMOS SRAM, ISSCC 09. 5

6 Why 3D? (1/2) Wire Length Reduction Replace long, high capacitance wires by TSVs Low Latency, Low Energy Small llfootprint t 6

7 Why 3D? (2/2) Integration From Off Chip to On Chip Improved Communication Low Latency, High Bandwidth, and Low Energy Heterogeneous Integration E.g. Emerging Devices 7

8 Outline Why 3D? Will 3D always work well? Support Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 8

9 Importance of On Chip Caches Memory Wall Problem Memory bandwidth does not scale with the # of cores Growing speed gap between processor cores and DRAMs So, Becomes more serious Let s increase on chip cache capacity, but Requires large chip area 9

10 Will 3D always work well? Stacking a DRAM Cache L1 Hit Time L1 Miss Rate L2 Hit Time L2 Miss Rate Main Memory Access Time AMAT Ave. MemoryAcc Acc. Time Impact of DRAM Stacking L1 L1 L 2 L 2 HT MR ( HT MR? MMAT ) 32MB DRAM Cache 10

11 Cache Size Sensitivity Varies among Programs! L2 Miss Rat tes [%] Ocean LU Sensitive! FFT Sensitive! Sensitive! Insensitive! 20 Cholesky FMM 10 Barnes WaterSpatial Raytrace 0 2MB 4MB 8MB 16MB 32MB 64MB 128MB L2 Size Insensitive! Insensitive! 11

12 Profit Better 3D 32MB DRAM Cache 2D vs. 3D 172.mgrid LU 171.swim FMM Ocean 181.mcf 256.bzip2 WaterSpatial Cholesky ammp 0 0 2D 2MB SRAM Cache Barnes MRL2_REDUTION[points] 200 FFT 179.art 300.twolf 301.apsi HTL2_OVERHEAD[cc] Profit MR L2 _ REDUCTION HT L2_ OVERHEAD MMAT

13 Appropriate Cache Size Varies within Programs! The lower, the better 350 2MB(12cc) 32MB(60cc) Ocean 300 Miss Penalt ty [cc] L Time Interval (100K L2 Accesses / Interval) 13

14 Outline Why 3D? Will 3D always work well? Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 14

15 Will 3D always work well? Stacking a DRAM Cache L1 Hit Time L1 Miss Rate L2 Hit Time L2 Miss Rate Main Memory Access Time AMAT Ave. MemoryAcc Acc. Time Impact of DRAM Stacking L1 L1 L 2 L 2 HT MR ( HT MR? MMAT ) 32MB DRAM Cache 15

SRAM/DRAM Hybrid Cache Architecture Support Two

application behavior 32MB DRAM Cache 32MB DRAM

16 SRAM/DRAM Hybrid Cache Architecture Support Two Operation Modes High Speed, Small Cache Mode (or SRAM Cache Mode) Low Speed, Large Cache Mode (or DRAM Cache Mode) Adapt to variation of application behavior 32MB DRAM Cache 32MB DRAM Cache (Power Gated) DRAM Cache Mode SRAM Cache Mode 16

17 Microarchitecture (1/2) Tag Way 0 Way 1 Tag Way 0 Way 1 2way set associative SRAM Cache 32MB DRAM Cache 2way set associative DRAM Cache 17

18 Microarchitecture (2/2) SARM(Size : Cs, Block : Ls, Asso. Ws) Tag field 58 - ID 64b physical address 58 - IS I IS Offset Index Assume Ld==Ls==64B ID DARM(Size : d Block : Ld, Asso. Wd) IS LS LS 58 - I S C D LS W lg CS L D W S D ID 58 - IS 58 - IS I I D S MUX L LS CS lg L S W C D lg L D W S D Data (SRAM) = = MUX Hit/Miss (SRAM) 58 - ID 58 - ID = = ID Hit/Miss (DRAM) L D L D MUX Data (DRAM) L D 18

19 How to Adapt 50 Static Approach 40 Optimizes at program level Does not change it during 10 execution 0 Needs a static analysis Dynamic Approach Optimizes at interval level (or phase level) Needs a run time profiling [%] L2 Cach he Miss Rates L1 Miss Pena alty [cc] FFT FMM Barnes 2MB 4MB 8MB 16MB 32MB 64MB 128MB L2 Cache Size 2MB(12cc) 32MB(60cc) Ocean Interval 19

20 if Run Time Mode Selection Divide Program Execution into epochs, e.g. 200K L2 Misses Predict an Appropriate Operation Mode for Next Epoch On SRAM mode, a small tag RAM which stores sampled tags is used to predict DRAM mode miss rates Hardware Support for Measurement MR L2SRAM HT L 2 DRAM HT L 2 SRAM AveOverhead MRL2DRAM MMAT then transit from SRAM mode to DRAM mode! epoch N 1 N N+1 N+2 Operation Mode 32MB DRAM Cache (Power Gated) 32MB DRAM Cache (Power Gated) 32MB DRAM Cache (Power Gated) 32MB DRAM Cache SRAM Cache Mode DRAM Cache Mode 20

21 Experimental Set Up Processor: In Order Benchmarks: SPEC CPU 2000, Splash2 SRAM Cache Mode DRAM Cache Mode Core L1 Cache Size:32KB Access Time : 2clock cycles Core L1 Cache L2 Cache Size : 2MB Access Time : 6clock cycles Size : 32MB Access Time : 28clock cycles L2 Cache Main Memory Access Time:181clock cycles Main Memory 21

22 Results D SRAM DRAM STACK HYBRID IDEALIDEAL HYBRID ed AMAT Normaliz ammp art bzip2 mcf mgrid swim twolf Cholesky FFT FMM LU Ocean Benchmark Program 22

23 Results ed AMAT Normaliz HYBRID 2D SRAM DRAM STACK HYBRID IDEAL 1 IDEAL HYBRID mgrid swim twolf ammp art bzip2 mcf mgrid swim twolf Cholesky FFT FMM LU Ocean Accu uracy of Mo ode Selecti ion Benchmark Program 23

24 Results 2 ed AMAT Normaliz HYBRID 2D SRAM DRAM STACK HYBRID IDEALIDEAL HYBRID bzip2 Cholesky ammp art bzip2 mcf mgrid swim twolf Cholesky FFT FMM LU Ocean Benchmark Program on Accurac cy of Mod de Selecti 24

25 Conclusions The 3D solution is one of the most promising ways to achieve High performance Low energy It does not ALWAYS work well! Run time adaptive execution by considering memoryaccess behavior 25

26 Acknowledgement This research was supported in part by New Energy and Industrial Technology Development Organization 26

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami

3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University