3D Implemented SRAM/DRAM Hybrid Cache Architecture for High Performance and Low Power Consumption
Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami


Transcription:

3D Implemented SRAM/DRAM Hybrid Cache Architecture for High Performance and Low Power Consumption
Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami
Kyushu University

Outline: Why 3D? / Will 3D always work well? / Support Adaptive Execution! / Conclusions

Outline: Why 3D? / Will 3D always work well? / Support Adaptive Execution! / Conclusions

From 2D to 3D! Stack Multiple Dies
- Connect dies with Through-Silicon Vias (TSVs)
- Packaging approaches shown: wire bonding (WB), System in Package (SiP), Package on Package (PoP), and TSV-based 3D stacking
- Figure: a multi-level 3D IC stacking RF, analog, DRAM, processor, sensor, and I/O dies
- Source: Yuan Xie, 3D IC Design/Architecture, Coolchips Special Session, 2009

Chip Implementation Examples from ISSCC 09
- Image sensor (MIT): V. Suntharalingam et al., A 4-Side Tileable Back-Illuminated 3D-Integrated Mpixel CMOS Image Sensor, ISSCC 09
- SRAM for SoCs (NEC): H. Saito et al., A Chip-Stacked Memory for On-Chip SRAM-Rich SoCs and Processors, ISSCC 09
- 8Gb 3D DRAM (Samsung): U. Kang et al., 8Gb DDR3 DRAM Using Through-Silicon-Via Technology, ISSCC 09
- Multi-core + SRAM connected with wireless (inductive-coupling) links (Keio Univ.): K. Niitsu et al., An Inductive-Coupling Link for 3D Integration of a 90nm CMOS Processor and a 65nm CMOS SRAM, ISSCC 09

Why 3D? (1/3): Wire Length Reduction
- Replace long, high-capacitance wires with TSVs
- Low latency, low energy
- Small footprint

Why 3D? (2/3): Integration
- From off-chip to on-chip: improved communication with low latency, high bandwidth, and low energy
- Heterogeneous integration, e.g., of emerging devices

Why 3D? (3/3)
- Chart: performance improvement (times) versus process node (nm, from 180 down to 22 and below), comparing the gain from 3D stacking with the gain from finer process technology; power consumption is also shown
- Source: N. Miyakawa, 3D Stacking Technology for Improvement of System Performance, International Trade Partners Conference, Nov. 2008

Outline: Why 3D? / Will 3D always work well? / Support Adaptive Execution! / Conclusions

Importance of On-Chip Caches
- Memory wall problem: memory bandwidth does not scale with the number of cores, and the speed gap between processor cores and DRAM keeps growing, so the problem becomes more serious
- Let's increase on-chip cache capacity, but that requires a large chip area
- Die photos: Pentium 4 (core, bus, 1MB level-2 cache) and Core 2 Duo (4MB L2 cache); sources: http://www.atmarkit.co.jp/, http://www.chip-architect.com/

Will 3D Always Work Well? Stacking a DRAM Cache
- Model parameters: L1 hit time (HT_L1), L1 miss rate (MR_L1), L2 hit time (HT_L2), L2 miss rate (MR_L2), main memory access time (MMAT)
- Average memory access time: AMAT = HT_L1 + MR_L1 x (HT_L2 + MR_L2 x MMAT)
- Impact of DRAM stacking: a 32MB stacked DRAM cache lowers MR_L2 but raises HT_L2
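
A minimal sketch of the AMAT model above; the latencies and miss rates are illustrative assumptions, not figures taken from the slides.

    def amat(ht_l1, mr_l1, ht_l2, mr_l2, mmat):
        # AMAT = HT_L1 + MR_L1 * (HT_L2 + MR_L2 * MMAT)
        return ht_l1 + mr_l1 * (ht_l2 + mr_l2 * mmat)

    # Hypothetical configurations (cycle counts chosen for illustration only)
    two_d   = dict(ht_l1=2, mr_l1=0.05, ht_l2=12, mmat=200)  # 2MB SRAM L2
    three_d = dict(ht_l1=2, mr_l1=0.05, ht_l2=60, mmat=200)  # 32MB stacked DRAM L2

    # Cache-size-sensitive program: the DRAM cache cuts the L2 miss rate sharply
    print(amat(mr_l2=0.50, **two_d), amat(mr_l2=0.10, **three_d))  # 7.6 vs 6.0 -> 3D wins

    # Cache-size-insensitive program: the miss rate barely changes,
    # so the slower DRAM hit time dominates
    print(amat(mr_l2=0.10, **two_d), amat(mr_l2=0.08, **three_d))  # 3.6 vs 5.8 -> 2D wins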

Cache Size Sensitivity Varies among Programs!
- Chart: L2 miss rate [%] versus L2 size (2MB, 4MB, 8MB, 16MB, 32MB, 64MB, 128MB)
- Sensitive programs (miss rate drops sharply with a larger L2): Ocean, LU, Cholesky, FFT
- Insensitive programs: FMM, Barnes, WaterSpatial, Raytrace

2D vs. 3D: Profit of a 32MB Stacked DRAM Cache over a 2D 2MB SRAM Cache
- Scatter plot: MR_L2_REDUCTION [points] (0 to 100) versus HT_L2_OVERHEAD [cc] (0 to 200), with profit contours from 0 to 5 (the higher, the better for 3D)
- Benchmarks plotted: 172.mgrid, 171.swim, 181.mcf, 256.bzip2, 188.ammp, 179.art, 300.twolf, 301.apsi, LU, FMM, Ocean, WaterSpatial, Cholesky, Barnes, FFT
- Profit = (MR_L2_REDUCTION / HT_L2_OVERHEAD) x MMAT
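
A small sketch of the profit metric as reconstructed above; the inputs are hypothetical points, and reading the formula as an MMAT-scaled ratio of miss-rate reduction to hit-time overhead is an assumption.

    def profit(mr_l2_reduction_points, ht_l2_overhead_cc, mmat_cc):
        # Profit = (MR_L2_REDUCTION / HT_L2_OVERHEAD) * MMAT,
        # with the miss-rate reduction given in percentage points.
        return (mr_l2_reduction_points / 100.0) / ht_l2_overhead_cc * mmat_cc

    # Hypothetical points from the scatter plot (not actual measured values)
    print(profit(60, 50, 200))   # 2.4  -> 3D (32MB DRAM cache) wins, Profit > 1
    print(profit(10, 150, 200))  # ~0.13 -> 2D (2MB SRAM cache) wins, Profit < 1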

Appropriate Cache Size Varies within Programs!
- Chart (Ocean): L1 miss penalty [cc] per time interval (100K L2 accesses per interval), comparing the 2MB L2 configuration with the 32MB L2 configuration; the lower, the better
- Neither configuration is best across all intervals

Outline: Why 3D? / Will 3D always work well? / Adaptive Execution! / Conclusions

Will 3D Always Work Well? Stacking a DRAM Cache
- Model parameters: L1 hit time (HT_L1), L1 miss rate (MR_L1), L2 hit time (HT_L2), L2 miss rate (MR_L2), main memory access time (MMAT)
- Average memory access time: AMAT = HT_L1 + MR_L1 x (HT_L2 + MR_L2 x MMAT)
- Impact of DRAM stacking: a 32MB stacked DRAM cache lowers MR_L2 but raises HT_L2

SRAM/DRAM Hybrid Cache Architecture
- Supports two operation modes: a high-speed, small-cache mode (SRAM cache mode) and a low-speed, large-cache mode (DRAM cache mode)
- Adapts to variation in application behavior
- Figure: in SRAM cache mode the 32MB stacked DRAM cache is power-gated; in DRAM cache mode it is active
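
A minimal sketch of the two operation modes: entering SRAM mode power-gates the stacked DRAM array and shrinks the effective L2, entering DRAM mode wakes it up. The capacities come from the slides; the hit times are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class HybridL2:
        mode: str = "SRAM"           # "SRAM" (fast, small) or "DRAM" (slow, large)
        dram_power_gated: bool = True

        def enter_sram_mode(self):
            self.mode, self.dram_power_gated = "SRAM", True   # gate the 32MB DRAM

        def enter_dram_mode(self):
            self.mode, self.dram_power_gated = "DRAM", False  # wake the 32MB DRAM

        @property
        def capacity_mb(self):
            return 2 if self.mode == "SRAM" else 32           # capacities from the slides

        @property
        def hit_time_cc(self):
            return 12 if self.mode == "SRAM" else 60          # illustrative only

    l2 = HybridL2()
    l2.enter_dram_mode()
    print(l2.capacity_mb, l2.hit_time_cc, l2.dram_power_gated)  # 32 60 False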

Microarchitecture (1/2)
- Figure: a 2-way set-associative SRAM cache (tag, Way 0, Way 1) stacked with a 2-way set-associative 32MB DRAM cache (tag, Way 0, Way 1)

Microarchitecture (2/2)
- SRAM array: size C_S, block size L_S, associativity W_S; DRAM array: size C_D, block size L_D, associativity W_D; assume L_D == L_S == 64B
- A 64-bit physical address is split into tag, index, and offset fields; the index widths are I_S = lg(C_S / (L_S x W_S)) for the SRAM array and I_D = lg(C_D / (L_D x W_D)) for the DRAM array, so the corresponding tag fields are (58 - I_S) and (58 - I_D) bits wide
- Figure: both tag arrays are looked up, and comparators (=) with MUXes produce Hit/Miss (SRAM), Data (SRAM), Hit/Miss (DRAM), and Data (DRAM)
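
A short sketch of the address-field calculation above for a 64-bit physical address and 64B blocks; the 2-way associativity follows the microarchitecture figure, and the helper itself is only an illustration.

    import math

    def fields(cache_bytes, block_bytes, ways, addr_bits=64):
        index = int(math.log2(cache_bytes // (block_bytes * ways)))  # I = lg(C / (L * W))
        offset = int(math.log2(block_bytes))                         # 6 bits for 64B blocks
        tag = addr_bits - offset - index                             # 58 - I for 64B blocks
        return {"tag": tag, "index": index, "offset": offset}

    print(fields(2 * 2**20, 64, 2))   # SRAM array (2MB):  {'tag': 44, 'index': 14, 'offset': 6}
    print(fields(32 * 2**20, 64, 2))  # DRAM array (32MB): {'tag': 40, 'index': 18, 'offset': 6}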

How to Adapt?
- Static approach: optimizes at the program level, does not change the mode during execution, needs a static analysis
- Dynamic approach: optimizes at the interval (or phase) level, needs run-time profiling
- Charts: L2 cache miss rate [%] versus L2 cache size for Barnes, FFT, and FMM, and per-interval L1 miss penalty [cc] for Ocean with the 2MB and 32MB L2 configurations

Experimental Setup
- Processor: in-order core @ 3GHz
- Benchmarks: SPEC CPU 2000, Splash-2
- The operation mode is set at the beginning of program execution and is maintained until the end; an appropriate operation mode is assumed to be known for each benchmark
- L1 I/D caches: 32KB, access latency 2 clock cycles
- L2 SRAM cache: 2MB, 64B blocks, 8-way, latency 6 clock cycles
- 3D DRAM cache: 32MB, 64B blocks, 8-way, latency 28 clock cycles
- Configurations: 2D BASE (core, L1 I/D, 2D SRAM L2 cache, main memory), 3D CONV (core, L1 I/D, 3D DRAM L2 cache, main memory), and 3D HYBRID

Evaluation Results
- Charts: normalized memory performance and normalized memory energy for the benchmark programs, comparing 2D BASE, 3D CONV, and 3D HYBRID

How to Adapt?
- Static approach: optimizes at the program level, does not change the mode during execution, needs a static analysis
- Dynamic approach: optimizes at the interval (or phase) level, needs run-time profiling
- Charts: L2 cache miss rate [%] versus L2 cache size for Barnes, FFT, and FMM, and per-interval L1 miss penalty [cc] for Ocean with the 2MB and 32MB L2 configurations

Run-Time Mode Selection
- Divide program execution into epochs, e.g., 200K L2 misses each, and predict an appropriate operation mode for the next epoch
- In SRAM mode, a small tag RAM that stores sampled tags is used to predict the DRAM-mode miss rate; hardware support is added for these measurements
- Mode-switch condition: if (MR_L2SRAM - MR_L2DRAM) x MMAT > HT_L2DRAM - HT_L2SRAM, i.e., the miss-rate benefit exceeds the average hit-time overhead (AveOverhead), then transit from SRAM mode to DRAM mode
- Figure: across epochs N-2, N-1, and N the 32MB DRAM cache stays power-gated (SRAM cache mode); at epoch N+1 it is activated (DRAM cache mode)
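
A compact sketch of the epoch-based policy: at each epoch boundary, compare the miss-rate benefit of DRAM mode (estimated via the sampled-tag RAM while in SRAM mode) against the hit-time overhead. The threshold follows the condition reconstructed above; the default latencies and MMAT are illustrative assumptions.

    def next_mode(current_mode, mr_l2_sram, mr_l2_dram_est,
                  ht_l2_sram=12, ht_l2_dram=60, mmat=200):
        """Pick the operation mode for the next epoch (e.g., every 200K L2 misses)."""
        benefit = (mr_l2_sram - mr_l2_dram_est) * mmat   # cycles saved per L2 access
        overhead = ht_l2_dram - ht_l2_sram               # extra hit time in DRAM mode
        if current_mode == "SRAM" and benefit > overhead:
            return "DRAM"    # wake the stacked DRAM cache
        if current_mode == "DRAM" and benefit < overhead:
            return "SRAM"    # power-gate the stacked DRAM cache
        return current_mode

    # In SRAM mode, mr_l2_dram_est comes from the small sampled-tag RAM;
    # in DRAM mode it is measured directly.
    print(next_mode("SRAM", mr_l2_sram=0.50, mr_l2_dram_est=0.10))  # -> "DRAM"
    print(next_mode("DRAM", mr_l2_sram=0.12, mr_l2_dram_est=0.10))  # -> "SRAM"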

Results
- Chart: normalized AMAT for 2D SRAM, DRAM STACK, HYBRID, and IDEAL HYBRID across ammp, art, bzip2, mcf, mgrid, swim, twolf, Cholesky, FFT, FMM, LU, and Ocean

Results
- Charts: normalized AMAT (as above) together with the accuracy of mode selection (0 to 1) per benchmark; mgrid, swim, and twolf are highlighted

Results
- Charts: normalized AMAT together with the accuracy of mode selection per benchmark; bzip2 and Cholesky are highlighted

Outline: Why 3D? / Will 3D always work well? / Adaptive Execution! / Conclusions

Conclusions
- The 3D solution is one of the most promising ways to achieve high performance and low energy
- However, it does not ALWAYS work well!
- Solution: run-time adaptive execution that considers memory-access behavior

Acknowledgement
- This research was supported in part by the New Energy and Industrial Technology Development Organization.