Memory System Design Part II. Bharadwaj Amrutur, ECE Dept., IISc Bangalore.

Outline
References: Computer Architecture: A Quantitative Approach, Hennessy & Patterson.
Topics: Memory hierarchy, Cache, Multi-core considerations, Main Memory, Disk, Virtual memory, Power considerations.

View from the Processor
The processor issues memory operations over a simple interface: Clk, MemOp, Address, and WriteData go to the memory; ReadData comes back.
Memory operations (MemOp) in DLX: Load, Store.
Other RISC processors add: Prefetch, Load/Store coprocessor, Cache Flush, Synchronization.
Addresses are 32 or 64 bits (modern processors). The data bus width is 64 bits; accesses can be bytes, 32 bits, or 64 bits.

The Gap
[Figure: log-scale performance versus year, 1980-2000. CPU (µproc) performance grows ~60%/year ("Moore's Law") while DRAM performance grows only ~7%/year ("Less' Law?"); the processor-memory performance gap grows ~50% per year. From Kubiatowicz/UCB.]

Memory Hierarchy Characteristics

Level               Integration    Distance      Capacity          Latency            Bandwidth
Registers           Chip           <1 mm         16-128 64-bit     ~1/2 cycle         ~1000 Gb/s
L1 cache            Chip           few mm        4 KB - 32 KB      1 cycle            ~400 Gb/s
L2 cache            Chip/Package   few cm        1 MB - 8 MB       5-10 cycles        ~200 Gb/s
Main memory (DRAM)  PCB            few inches    128 MB - 4 GB     40-100 cycles      ~50 Gb/s
Disk                Box            many inches   80 GB - few TB    1000s of cycles    ~1 Gb/s

Exercise: Memory Hierarchy
Find Power/Mbps/bit for each layer of the memory hierarchy.
Plot Power/Mbps versus Bits, and also versus Bits^0.5. Which is better?

Cache Concept
A cache is small, fast storage that exploits spatial and temporal locality. The idea is found in other places too: file caches, name caches, etc.
Consider main memory as a sequence of blocks, also known as lines; a block can contain multiple bytes. The cache allows storage of a subset of the blocks from main memory.
The cache is searched first to satisfy a memory access request: a hit returns quickly; a miss incurs a penalty.
[Diagram: main memory blocks 0-15; a 4-block cache temporarily holds copies of some main memory blocks.]

Average Memory Access Time
Program execution time is given as:
    CPUtime = IC × (ALUops/Instr × CPI_ALUops + MemAccess/Instr × AMAT) × Cycletime
Average Memory Access Time (AMAT) is given as:
    AMAT = HitTime + MissRate × MissPenalty
HitTime and MissPenalty are in clock cycles; IC is the instruction count of the program.
To reduce AMAT, reduce HitTime, MissRate, and MissPenalty.
HitTime is usually the lowest possible, 1 cycle.
MissPenalty is a function of the next levels of the memory hierarchy.
MissRate is a function of cache size and associativity, which also impact Cycletime: hence an optimization problem.
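As a quick worked example, a minimal sketch of the AMAT calculation in Python; the parameter values are illustrative assumptions, not numbers from the lecture:

```python
# Illustrative AMAT calculation; the parameter values are assumptions,
# not figures from the lecture.
hit_time = 1         # cycles to service an L1 hit
miss_rate = 0.05     # fraction of accesses that miss
miss_penalty = 40    # cycles to fetch the block from the next level

# AMAT = HitTime + MissRate * MissPenalty
amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat} cycles")  # 1 + 0.05 * 40 = 3.0 cycles
```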

Exercise
Write the corresponding equation for the energy consumed by a program.

Energy
Program execution energy is given as:
    CPUenergy = IC × (ALUops/Instr × EPI_ALUops + MemAccess/Instr × AMAE) + CPUtime × LeakagePower
Average Memory Access Energy (AMAE) is given as:
    AMAE = HitEnergy + MissRate × MissEnergy
HitEnergy and MissEnergy are in joules and are averages that account for the activity factor of the data/address bits; IC is the instruction count of the program.
To reduce AMAE, reduce HitEnergy, MissRate, and MissEnergy.
MissEnergy is a function of the next levels of the memory hierarchy.
MissRate is a function of cache size and associativity, which also impact CPUtime: hence an optimization problem.
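A parallel sketch for AMAE in Python; the energy values are assumed for illustration:

```python
# Illustrative AMAE calculation; the energies are assumed values in picojoules.
hit_energy = 10      # pJ per access that hits
miss_rate = 0.05     # fraction of accesses that miss
miss_energy = 500    # pJ to service a miss from the next level

# AMAE = HitEnergy + MissRate * MissEnergy
amae = hit_energy + miss_rate * miss_energy
print(f"AMAE = {amae} pJ per access")  # 10 + 0.05 * 500 = 35.0 pJ
```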

Cache Issues
Where should a block be placed in the cache?
How is a block searched for in the cache?
Which block should be replaced on a cache miss?
What to do on a write?

Block Placement: Direct Mapped
[Diagram: main memory blocks 0-15 map into a 4-block cache.]
The main memory blocks that map to each cache block:
Cache block 0: memory blocks 0, 4, 8, 12
Cache block 1: memory blocks 1, 5, 9, 13
Cache block 2: memory blocks 2, 6, 10, 14
Cache block 3: memory blocks 3, 7, 11, 15
The formula: CacheBlock = BlockAddress mod CacheSize (CacheSize is in blocks).
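A minimal sketch of the placement rule in Python, reproducing the slide's 4-block example:

```python
CACHE_SIZE_BLOCKS = 4  # cache size in blocks, as in the slide's example

def direct_mapped_index(block_address: int) -> int:
    """Cache block a main-memory block maps to: BlockAddress mod CacheSize."""
    return block_address % CACHE_SIZE_BLOCKS

# Reproduces the slide's mapping: blocks 0, 4, 8, 12 -> cache block 0, etc.
for block in range(16):
    print(block, "->", direct_mapped_index(block))
```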

Direct Mapped: Search
[Diagram: the address (bits 31..0) is split into Tag, CacheIndex, and ByteSel fields. CacheIndex drives a decoder that selects one cache entry; the stored Tag is compared (=) against the address Tag and, together with the Valid bit, produces Hit/Miss; the Data array supplies the block.]
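A software sketch of the search path in Python; the 16-byte block size and 4-entry geometry are assumptions for illustration, not from the slide:

```python
# Assumed geometry for illustration: 16-byte blocks, 4 cache entries.
BYTE_SEL_BITS = 4    # log2(16-byte block)
INDEX_BITS = 2       # log2(4 entries)

def split_address(addr: int):
    """Split a 32-bit address into (tag, cache_index, byte_sel)."""
    byte_sel = addr & ((1 << BYTE_SEL_BITS) - 1)
    index = (addr >> BYTE_SEL_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BYTE_SEL_BITS + INDEX_BITS)
    return tag, index, byte_sel

# Each entry holds a valid bit, a tag, and the block data.
cache = [{"valid": False, "tag": 0, "data": None} for _ in range(1 << INDEX_BITS)]

def lookup(addr: int) -> bool:
    tag, index, _ = split_address(addr)
    entry = cache[index]
    return entry["valid"] and entry["tag"] == tag  # the Hit/Miss signal
```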

Block Placement: 2-way Set Associative
[Diagram: main memory blocks 0-15 map into a cache with two sets (Set 0, Set 1), each holding two blocks.]
The main memory blocks that map to each set:
Set 0: memory blocks 0, 2, 4, 6, 8, 10, 12, 14
Set 1: memory blocks 1, 3, 5, 7, 9, 11, 13, 15
Within each set, a block can be placed in either of the two locations.
The formula: SetNumber = BlockAddress mod NumSets.
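The set-mapping formula in Python, using the slide's 2-set example:

```python
NUM_SETS = 2  # 2-way associative cache with 4 blocks total, as in the slide

def set_number(block_address: int) -> int:
    """Set a main-memory block maps to: BlockAddress mod NumSets."""
    return block_address % NUM_SETS

# Set 0 receives the even block addresses, Set 1 the odd ones.
print([b for b in range(16) if set_number(b) == 0])  # [0, 2, 4, ..., 14]
print([b for b in range(16) if set_number(b) == 1])  # [1, 3, 5, ..., 15]
```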

2-Way Associative: Search
[Diagram: the address is split into Tag, CacheIndex, and ByteSel. Each way has its own Valid, Tag, and Data arrays with a decoder; each way compares its stored tag against the address tag (=) to produce Hit/Miss_Set0 and Hit/Miss_Set1, and tristate drivers select the data from the hitting way.]
Exercises (a software model of (a) and (b) is sketched below):
a) Complete the wiring.
b) How do you generate the final Hit/Miss signal?
c) Extend the design to a fully associative cache.
d) What happens to MissRate with associativity?
e) What happens to MissRate with size?
f) What happens to cycle time with associativity and size?
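A software model of the 2-way lookup, one possible answer to exercises (a) and (b): the final Hit/Miss is the OR of the per-way hit signals. The geometry (2 sets, 16-byte blocks) is assumed for illustration:

```python
# Model of a 2-way set-associative lookup; geometry is assumed.
NUM_SETS = 2
INDEX_BITS = 1       # log2(NUM_SETS)
BYTE_SEL_BITS = 4    # log2(16-byte block)

# cache[set][way] -> per-way entry with valid bit, tag, and data
cache = [[{"valid": False, "tag": 0, "data": None} for _ in range(2)]
         for _ in range(NUM_SETS)]

def lookup_2way(addr: int):
    index = (addr >> BYTE_SEL_BITS) & (NUM_SETS - 1)
    tag = addr >> (BYTE_SEL_BITS + INDEX_BITS)
    # Both ways are checked "in parallel", like the two tag comparators.
    hits = [w["valid"] and w["tag"] == tag for w in cache[index]]
    hit = hits[0] or hits[1]   # final Hit/Miss: OR of per-way hits (exercise b)
    data = next((w["data"] for w, h in zip(cache[index], hits) if h), None)
    return hit, data
```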

Replacement
Random: randomly select a cache block to replace.
LRU (Least Recently Used): select the cache block whose last access is furthest in the past.
MRU (Most Recently Used): avoid replacing cache blocks that were accessed recently.
LFU (Least Frequently Used): choose the cache block that has been used least.

LRU Replacement
Example access sequence: 1, 4, 3, 1, 2, 3. Maintain a stack ordered by recency, most recently used on top:
after 1: [1]
after 4: [4, 1]
after 3: [3, 4, 1]
after 1: [1, 3, 4]
after 2 (miss): [2, 1, 3, 4]
after 3: [3, 2, 1, 4]
The least recently used block, 4, sits at the bottom: on the next miss, replace 4.
Stack data structure: push the most recently accessed cache block number on top; replace from the bottom of the stack.
How to implement in hardware?
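A minimal software sketch of the LRU stack in Python, replaying the example's access sequence (the 4-entry capacity matches the example):

```python
from collections import OrderedDict

class LRUStack:
    """LRU bookkeeping: most recently used at one end, the victim at the other."""
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.order = OrderedDict()   # keys kept in recency order, newest last

    def access(self, block: int):
        if block in self.order:              # hit: move to most-recent position
            self.order.move_to_end(block)
        else:                                # miss: evict the LRU block if full
            if len(self.order) >= self.capacity:
                victim, _ = self.order.popitem(last=False)
                print(f"miss on {block}: replace {victim}")
            self.order[block] = True

stack = LRUStack()
for b in [1, 4, 3, 1, 2, 3]:   # the slide's access sequence
    stack.access(b)
print(list(stack.order))        # [4, 1, 2, 3]: block 4 is the next victim
```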

LRU Hardware Implementation
Implement the stack directly: complicated for large associativities, especially removal from the middle and insertion at the top.
Counter based: keep a time stamp counter of a small number of bits; update the time stamp of each accessed block with the current time stamp; replace the block with the smallest time stamp. Periodically clear the time stamps (a background process).
Example: a 1-bit time stamp gives MRU.
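A sketch of the 1-bit time stamp (MRU bit) approximation in Python; the reset rule (clear all bits once every way is marked) is an assumed policy for illustration:

```python
class MRUBitSet:
    """Per-set MRU bits: 1 = recently used. Victims come from the 0 bits."""
    def __init__(self, ways: int = 4):
        self.bits = [0] * ways

    def touch(self, way: int):
        self.bits[way] = 1
        if all(self.bits):            # all marked: clear the rest (assumed policy)
            self.bits = [0] * len(self.bits)
            self.bits[way] = 1

    def victim(self) -> int:
        return self.bits.index(0)     # any way whose bit is 0 is a candidate
```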

LRU/MRU Implementation
[Diagram: the direct-mapped search structure extended with an MRU bit per entry alongside Valid, Tag, and Data; the address splits into Tag, CacheIndex, and ByteSel, and the tag compare produces Hit/Miss.]

Write Policy: Write Through
On a hit, update the cache block as well as the block in main memory.
Every write incurs traffic to main memory, and the processor has to wait for main memory to be updated before continuing (write stall), unless a write buffer is used.
With a write buffer, stores to main memory are held in the buffer and the processor continues operation. The buffer needs to accommodate a burst of stores. What if the store buffer gets full?
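A hedged sketch of write through with a write buffer in Python; the buffer depth and the stall-on-full behavior are illustrative assumptions:

```python
from collections import deque

WRITE_BUFFER_DEPTH = 4   # assumed depth; real designs vary

write_buffer = deque()
cache_data = {}          # block address -> data, standing in for the cache array

def drain_one_to_memory():
    block, data = write_buffer.popleft()
    # ... perform the (slow) main-memory write here ...

def store(block: int, data):
    cache_data[block] = data               # update the cache block on a hit
    if len(write_buffer) >= WRITE_BUFFER_DEPTH:
        drain_one_to_memory()              # buffer full: the processor stalls here
    write_buffer.append((block, data))     # memory update completes in background

for i in range(6):
    store(i, i)   # the 5th and 6th stores find the buffer full and drain first
```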

Write Back
Writes are done only to cache blocks; multiple writes to the same block don't incur main memory traffic.
On a cache block eviction, check whether the block needs to be written back to main memory.
This needs an extra dirty bit per cache line.
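A minimal write-back sketch in Python, showing the dirty bit being set on writes and checked on eviction:

```python
def write_back_to_memory(tag, data):
    pass  # stand-in for the main-memory write

class Line:
    def __init__(self):
        self.valid = False
        self.dirty = False   # set on writes; checked at eviction
        self.tag = 0
        self.data = None

def write(line: Line, tag: int, data):
    line.valid, line.tag, line.data = True, tag, data
    line.dirty = True                       # the write stays local to the cache

def evict(line: Line):
    if line.valid and line.dirty:
        write_back_to_memory(line.tag, line.data)  # only dirty lines go out
    line.valid = line.dirty = False
```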

Write Back Structure
[Diagram: each cache entry now carries a Dirty bit in addition to the MRU and Valid bits, Tag, and Data; the address splits into Tag, CacheIndex, and ByteSel, and the tag compare produces Hit/Miss.]

Dealing with a Write Miss
Write allocate: load the block into the cache on a write miss, similar to a read miss. Typically used with a write-back policy.
No-write allocate: the block is modified at the lower level and not brought into this level. Typically used with a write-through policy.
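A short sketch contrasting the two policies on a write miss (Python; the helper functions are hypothetical stand-ins, not a real API):

```python
class Line:
    def __init__(self, data=None):
        self.data, self.dirty = data, False

WRITE_ALLOCATE = True  # flip to False for no-write allocate

def load_block_from_memory(block: int) -> Line:
    return Line(data=bytearray(16))    # stand-in fetch of a 16-byte block

def write_to_next_level(block: int, data) -> None:
    pass                               # stand-in for the lower-level write

def handle_write_miss(block: int, data):
    if WRITE_ALLOCATE:                 # write allocate: behave like a read miss
        line = load_block_from_memory(block)
        line.data = data
        line.dirty = True              # pairs naturally with write back
        return line
    else:                              # no-write allocate
        write_to_next_level(block, data)
```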

Exercise
Design a cache management unit for a write-back cache with associativity of 4 and an LRU approximation.

Multi-core Processors
[Diagram: processors P0 through P3, each with a private L1 cache, all sharing a common L2 cache.]

Cache Coherency
The same memory block can be present in multiple L1 caches at once. If one processor updates its local copy, how do we make all the copies consistent? This requires cache coherence protocols.