Memory. Lecture 22 CS301


Administrative
- Daily Review of today's lecture: due tomorrow (11/13) at 8am
- HW #8: due today at 5pm
- Program #2: due Friday, 11/16 at 11:59pm
- Test #2: Wednesday

Pipelined Machine
[datapath diagram: Fetch, Decode, Execute, Memory, and Writeback stages - PC and instruction memory feed decode; the register file supplies src1/src2 data; a 16-bit immediate is sign-extended to 32 bits; ALU; data memory with address, in-data, and out-data ports; pipeline registers between stages]

The Challenge
Be able to randomly access gigabytes (or more) of data at processor speeds

How Do We Access Data?

Program Characteristics
Temporal Locality
- If you use one item, you are likely to use it again soon
Spatial Locality
- If you use one item, you are likely to use its neighbors soon

Examples of Each Type of Locality?
Temporal locality: Good? Bad?
Spatial locality: Good? Bad?

Locality
Programs tend to exhibit spatial & temporal locality. Just a fact of life.
How can we use this knowledge of program behavior to solve our problem?
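To make the two kinds of locality concrete, here is a minimal C sketch (the array size and function names are illustrative, not from the lecture). Both functions compute the same sum, but the row-wise loop visits consecutive addresses, so each block fetched on a miss supplies the next several elements, while the column-wise loop strides a full row between accesses and wastes most of each block.

#include <stdio.h>

#define N 1024

static double a[N][N];

double sum_rowwise(void) {
    double sum = 0.0;                  /* sum, i, j reused constantly: temporal locality */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];            /* consecutive addresses: good spatial locality */
    return sum;
}

double sum_colwise(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];            /* stride of N doubles: poor spatial locality */
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_rowwise(), sum_colwise());
    return 0;
}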

Predicting Data Accesses
Can we predict what data we will use?
- Instead of predicting branch direction, predict the next memory address request
- Like branch prediction, use previous behavior
Keep a prediction for every load?
- Fetch stage for a load is *TOO LATE*
Keep a prediction per memory address?
- Given an address, guess the next likely address
- Too many choices: the prediction table becomes too large to fit

Memory Hierarchy

Tech               Level     Speed     Size      Cost/bit
SRAM (logic)       CPU/L1    Fastest   Smallest  Highest
SRAM (logic)       L2 cache  ...       ...       ...
DRAM (capacitors)  DRAM      Slowest   Largest   Lowest

Using Caches To Improve Performance
Caches make the large gap between processor speed and memory speed appear much smaller
Caches give the appearance of having lots and lots of quickly accessible memory
- Achieved by exploiting spatial and temporal locality

SRAM: Static Random Access Memory
- Volatile memory array, 4-6 transistors per bit
- Fast accesses: 0.5-2.5 ns
Dimensions:
- Height: # of addressable locations
- Width: # of bits per addressable unit (usually 1 or 4)
Example: a 2M x 16 SRAM has height 2M, width 16

SRAM: Selection Logic
Need to choose which addressable unit goes to the output lines
- A 2M-input multiplexor is infeasible
- Single shared output line: the bit line
- Tri-state buffers allow multiple sources to drive the bit line

Tri-State Buffer (X = high impedance):
A  Ctrl  Z
0  0     X
0  1     0
1  0     X
1  1     1

[diagram: four data inputs, each through a tri-state buffer with its own select/enable, all sharing one output line]

SRAM: Using Bit Lines
[diagram: SRAM array with input lines across the top, address (word) lines selecting rows, and bit (output) lines running down the columns]

SRAM: For Large Arrays
Large arrays (e.g., a 4M x 8 SRAM) would require HUGE decoders and word lines
Instead, 2-stage decoding:
- First stage selects addresses in eight 4K x 1024 arrays
- Multiplexors then select 1 bit from each 1024-bit-wide array

DRAM: Dynamic RAM
Value stored as charge in a capacitor (1 transistor per bit)
- Must be refreshed
- Refresh by reading and writing back cells
- Refresh uses only 1-2% of active DRAM cycles
2-level decoder:
- Row access: row access strobe (RAS)
- Column access: column access strobe (CAS)
Access times: 50-70 ns (typical)
Example: a 4M x 1 DRAM is a 2048 x 2048 array

Memory Hierarchy (revisited): the same SRAM (L1, L2) / DRAM hierarchy table as above.

What Do We Need to Think About?
1. Design a cache that takes advantage of spatial & temporal locality
2. When you program, place data together that is used together, to increase spatial & temporal locality
   - Java: difficult to do
   - C: more control over data placement
Note: Caches exploit locality. Programs have varying degrees of locality. Caches do not have locality!

Cache Design
Temporal Locality
- When we obtain the data, store it in the cache
Spatial Locality
- Transfer a large block of contiguous data to get the item's neighbors
- Block (Line): the amount of data transferred for a single miss (the data plus its neighbors)

Where do we put data?
Searching the whole cache takes time & power
Direct-mapped:
- Limit each piece of data to one possible position
- Search is quick and simple

Direct-Mapped
[diagram: memory blocks at addresses 000000, 000100, 010000, 010100, 100000, 100100, 110000, 110100 each map to one of the cache indices 00, 01, 10, 11; each arrow carries one block (line)]

Direct-Mapped Cache
Block (Line) size = 2 words (8 B)
Byte address: 0b100100100 (= 292)
- Where do we look in the cache? BlockAddress mod #slots, which for a power-of-two slot count equals BlockAddress & (#slots - 1)
- Where is it within the block? The block offset tells us
- How do we know if it is there? We need a tag & a valid bit
After the block is loaded, index 00 holds: valid = 1, tag = 1001, data = M[292-295] M[288-291]

Splitting the Address
[ Tag | Index | Block Offset | Byte Offset ]
[diagram: a 4-entry direct-mapped cache with valid, tag (e.g., 0b1010001), and data columns; all entries initially invalid]
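A small C sketch of this split for the lecture's toy cache (4 slots, 2-word blocks, 4-byte words; the field widths are assumptions matching the slides). The AND trick works only because the slot count is a power of two: x mod 2^k == x & (2^k - 1).

#include <stdio.h>
#include <stdint.h>

#define BYTE_OFFSET_BITS  2   /* 4-byte words      */
#define BLOCK_OFFSET_BITS 1   /* 2 words per block */
#define INDEX_BITS        2   /* 4 slots           */

int main(void) {
    uint32_t addr = 0x124;                       /* 0b100100100 = 292 */
    uint32_t byte_off  = addr & 0x3;
    uint32_t block_off = (addr >> BYTE_OFFSET_BITS) & 0x1;
    uint32_t index     = (addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS))
                         & ((1u << INDEX_BITS) - 1);
    uint32_t tag       = addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS + INDEX_BITS);
    printf("tag=%u index=%u block_off=%u byte_off=%u\n",
           tag, index, block_off, byte_off);     /* tag=9 (0b1001), index=0 */
    return 0;
}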

Definitions
- Byte Offset: which byte within the word
- Block Offset: which word within the block
- Set: group of blocks checked on each access
- Index: which set within the cache
- Tag: is this the right one?

Definitions
- Block (Line): unit of data transfer, bytes/words
- Hit: data found in this cache
- Miss: data not found in this cache; send the request to the lower level
- Hit time / Access time: time to access this cache
- Miss Penalty: time to receive the block from the lower level (not always constant)

Example 1: Direct-Mapped, Block size = 2 words
Address fields: [ Tag | Index | Block Offset | Byte Offset ]
The cache starts empty (all valid bits 0).

Reference stream (byte addresses), worked one access at a time:
0b1001000 (72): index 01, tag 10 - Miss; load M[76-79] M[72-75]
0b0010100 (20): index 10, tag 00 - Miss; load M[20-23] M[16-19]
0b0111000 (56): index 11, tag 01 - Miss; load M[60-63] M[56-59]
0b0010000 (16): index 10, tag 00 - Hit
0b0010100 (20): index 10, tag 00 - Hit
0b0100100 (36): index 00, tag 01 - Miss; load M[36-39] M[32-35]

Final cache state:
Index  Valid  Tag  Data
00     1      01   M[36-39] M[32-35]
01     1      10   M[76-79] M[72-75]
10     1      00   M[20-23] M[16-19]
11     1      01   M[60-63] M[56-59]

Miss Rate: 4 / 6 = 67%    Hit Rate: 2 / 6 = 33%

Implementation
[diagram: byte address 0b100100100 split into Tag | Index | Block Offset | Byte Offset; the index selects one of rows 00-11, a comparator (=) checks the stored tag against the address tag along with the valid bit to produce Hit?, and a MUX uses the block offset to select the requested word of Data]

Example 2
You are implementing a 64-KByte cache; 32-bit address
The block size (line size) is 16 bytes; each word is 4 bytes
- How many bits is the block offset? 16 / 4 = 4 words -> 2 bits
- How many bits is the index? 64*1024 / 16 = 4096 blocks -> 12 bits
- How many bits is the tag? 32 - (2 + 12 + 2) = 16 bits
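The same arithmetic, generalized in a short C sketch (the function names are mine, not the lecture's). ways = 1 gives a direct-mapped cache, and the same formulas handle set-associative caches such as the 1-MByte 4-way example later in the lecture. Power-of-two sizes and a 32-bit address are assumed.

#include <stdio.h>

static int log2i(unsigned x) {             /* x assumed a power of two */
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static void cache_bits(unsigned cache_bytes, unsigned block_bytes,
                       unsigned word_bytes, unsigned ways) {
    int byte_off  = log2i(word_bytes);
    int block_off = log2i(block_bytes / word_bytes);       /* words per block */
    int index     = log2i(cache_bytes / block_bytes / ways);/* number of sets */
    int tag       = 32 - byte_off - block_off - index;
    printf("block offset=%d bits, index=%d bits, tag=%d bits\n",
           block_off, index, tag);
}

int main(void) {
    cache_bits(64 * 1024, 16, 4, 1);   /* Example 2: prints 2, 12, 16 */
    /* try cache_bits(1024 * 1024, 256, 4, 4) for the later 4-way example */
    return 0;
}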

Example
Direct-mapped $
- Block size = 2 words
- Total size = 16 words (8 blocks)
Word addresses: 0, 16, 1, 17, 32, 16, 36, 45
What is the hit rate? (See the simulation sketch after the 2-way example below.)
[worksheet: cache indices 0-7; references 0, 16, 1, 17, 32, 16, 36, 45 placed as they arrive]

Reducing Cache Conflicts
Problem:
- Lines that map to the same cache index conflict
- Lines conflict even if other cache lines are unused
Solution:
- Have multiple cache lines for each mapping

Cache Set Associativity
Set: group of cache lines an address can map to
- Direct-mapped: 1 location for a block
- n-way set associative: n locations for a block
- Fully-associative: a block maps to any location
[diagram: 8 lines organized as a direct-mapped cache (sets 0-7), as a 2-way set associative cache (sets 0-3), and as a fully-associative cache (one set, 0)]

Cache Set Associativity
Decreases conflicts => increases hit rate
On a cache request, must check every cache line in the set
- Increases hit time
Number of sets is smaller than in a direct-mapped cache, so fewer index bits
- log2(number of sets), where #sets < #cache lines
- Tag bits increase

Example
2-way set associative $
- Block size = 2 words
- Total size = 16 words (4 sets of 2 lines)
Word addresses: 0, 16, 1, 17, 32, 16, 36, 45
What is the hit rate? (See the simulation sketch below.)
[worksheet: sets 0-3, two lines each; references 0, 32, 16, 36, 45 placed in their sets]
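One way to check both of these traces is a tiny trace-driven simulator, sketched below in C under the slides' parameters (16-word cache, 2-word blocks, word addresses; LRU replacement within a set is an assumption). With WAYS set to 1 it replays the direct-mapped example, where the first six references ping-pong in index 0 and every access misses (0/8); with WAYS = 2 the conflicting blocks coexist and three of the eight references hit.

#include <stdio.h>

#define BLOCK_WORDS 2
#define TOTAL_WORDS 16
#define WAYS        2          /* set to 1 for the direct-mapped example */
#define SETS        (TOTAL_WORDS / BLOCK_WORDS / WAYS)

int main(void) {
    int tags[SETS][WAYS], valid[SETS][WAYS], lru[SETS][WAYS];
    int trace[] = {0, 16, 1, 17, 32, 16, 36, 45};    /* word addresses */
    int n = sizeof trace / sizeof trace[0], hits = 0, clock = 0;

    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++)
            valid[s][w] = lru[s][w] = 0;

    for (int i = 0; i < n; i++) {
        int block = trace[i] / BLOCK_WORDS;          /* block address   */
        int set   = block % SETS;                    /* which set       */
        int tag   = block / SETS;                    /* rest is the tag */
        int way   = -1;

        for (int w = 0; w < WAYS; w++)               /* check every line in the set */
            if (valid[set][w] && tags[set][w] == tag) way = w;

        if (way >= 0) {
            hits++;
            printf("addr %2d: hit\n", trace[i]);
        } else {                                     /* miss: fill least recently used way */
            way = 0;
            for (int w = 1; w < WAYS; w++)
                if (lru[set][w] < lru[set][way]) way = w;
            valid[set][way] = 1;
            tags[set][way]  = tag;
            printf("addr %2d: miss\n", trace[i]);
        }
        lru[set][way] = ++clock;                     /* mark most recently used */
    }
    printf("hit rate: %d/%d\n", hits, n);            /* 0/8 direct-mapped, 3/8 2-way */
    return 0;
}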

Implementation
[diagram: 2-way set associative lookup - byte address 0b100100100 split into Tag | Index | Block Offset | Byte Offset; the index selects one set, each way's valid bit and tag are checked by a parallel comparator (=), Hit? is the OR of the matches, and MUXes select the matching way and then the requested word of Data]

Example
You are implementing a 1-MByte, 4-way set associative cache; 32-bit address
The block size (line size) is 256 bytes
- How many bits is the block offset?
- How many bits is the index?
- How many bits is the tag?

What Happens on a Cache Miss?
Detect that the desired block is not there:
- Valid bit is 0, OR
- Tag is not the one we're looking for
If the valid bit is set but the tag is not the one we're looking for, evict the current block
Request the line from the lower level
Upon receipt of the data from the lower level, set the tag and valid bits and store the data
Pass the data up to the requestor
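In C, that sequence might look like the following sketch for a direct-mapped cache with one-word blocks. All names here (cache_read, lower_level_read) are hypothetical, and lower_level_read is a stub standing in for a real L2 or DRAM access.

#include <stdio.h>
#include <stdint.h>

#define NLINES 256

struct line { int valid; uint32_t tag; uint32_t data; };
static struct line cache[NLINES];

/* Stub standing in for "request line from lower level" (L2 or DRAM). */
static uint32_t lower_level_read(uint32_t addr) { return addr * 2; }

uint32_t cache_read(uint32_t addr) {
    uint32_t block = addr >> 2;            /* 4-byte words */
    uint32_t index = block % NLINES;
    uint32_t tag   = block / NLINES;
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag)         /* hit: pass the data up */
        return l->data;

    /* Miss: valid bit 0, or tag mismatch (the old block is simply
       overwritten, i.e., evicted). Request the line from the lower
       level, set tag and valid, store the data, and pass it up. */
    l->data  = lower_level_read(addr);
    l->tag   = tag;
    l->valid = 1;
    return l->data;
}

int main(void) {
    printf("%u %u\n", cache_read(64), cache_read(64));   /* miss, then hit */
    return 0;
}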

How Caches Work
Classic abstraction: each level of the hierarchy has no knowledge of the configuration of the lower levels
- From the L1 cache's perspective: "Me" = L1; "Memory" = L2 cache + DRAM
- From the L2 cache's perspective: "Me" = L2 cache; "Memory" = DRAM

Memory Operation at Any Level
1. Cache receives the request (an address)
2. Look for the item in the cache
3. Hit: return the data
   Miss: request it from memory (the lower level)
4. Receive the data and update the cache
5. Return the data

Performance
Hit: latency = access time
Miss: latency = access time + miss penalty
Goal: minimize misses!!!

Performance
How does the memory system affect CPI?
Penalty on cache hit:
- hit time; frequently only 1 cycle is needed to access the cache on a hit
Penalty on cache miss:
- miss time: time to get the data from the lower level of memory
CPI = 1 + memory stalls/instruction = 1 + (% miss) * (cache miss penalty)

L1 cache's perspective
[diagram: Me = L1; Memory = L2 cache + DRAM]
L1's miss penalty contains the access of L2, and possibly the access of DRAM!!!

Multi-level Caches
Base CPI 1.0, 500 MHz clock
Main memory: 100 cycles; L2: 10 cycles
L1 miss rate per instruction: 5%
With L2: 2% of instructions go to DRAM
What is the speedup with the L2 cache?

Multi-level Caches
CPI = 1 + memory stalls / instruction
CPI_old = 1 + 5% misses/instr * 100 cycles/miss = 1 + 5 = 6 cycles/instr
CPI_new = 1 + L1miss% * L2 penalty + Mem% * Mem penalty
        = 1 + 5% * 10 + 2% * 100 = 3.5 cycles/instr
Speedup = 6 / 3.5 = 1.7
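The same arithmetic in a short C sketch (the variable names are mine); the rates are per instruction, as on the slide.

#include <stdio.h>

int main(void) {
    double base_cpi   = 1.0;
    double l1_miss    = 0.05;   /* 5% of instructions miss in L1        */
    double dram_rate  = 0.02;   /* with L2, 2% go all the way to DRAM   */
    double l2_penalty = 10.0, dram_penalty = 100.0;

    double cpi_old = base_cpi + l1_miss * dram_penalty;
    double cpi_new = base_cpi + l1_miss * l2_penalty + dram_rate * dram_penalty;

    printf("CPI old = %.1f, CPI new = %.1f, speedup = %.2f\n",
           cpi_old, cpi_new, cpi_old / cpi_new);   /* 6.0, 3.5, 1.71 */
    return 0;
}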

Average Memory Access Time
AMAT = L1 access time + L1 miss rate * L1 miss penalty
L1 miss penalty = L2 access time + L2 miss rate * L2 miss penalty
L2 miss penalty = memory access time + memory miss rate * memory miss penalty

Calculate AMAT
Organization:
- L1 cache: access time 1 cycle, hit rate 90%
- L2 cache: access time 10 cycles, hit rate 95%
- Memory: access time 100 cycles, hit rate 100%
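Working the exercise with the recursive formula, in a minimal C sketch: the miss rate at each level is 1 - hit rate, and the recursion bottoms out at memory, which always hits. It evaluates AMAT = 1 + 0.10 * (10 + 0.05 * 100) = 2.5 cycles.

#include <stdio.h>

int main(void) {
    double l1_time = 1.0,   l1_miss = 0.10;   /* 90% hit rate  */
    double l2_time = 10.0,  l2_miss = 0.05;   /* 95% hit rate  */
    double mem_time = 100.0;                  /* 100% hit rate */

    double l2_miss_penalty = mem_time;                        /* memory always hits */
    double l1_miss_penalty = l2_time + l2_miss * l2_miss_penalty;
    double amat = l1_time + l1_miss * l1_miss_penalty;

    printf("AMAT = %.2f cycles\n", amat);                     /* 2.50 */
    return 0;
}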

Ways To Improve Cache Performance
Make the cache bigger
- Pro: more stuff fits in the cache, so data doesn't have to get thrown out as often
- Con: a larger memory takes longer to access
Reduce the number of conflicts in the cache by increasing associativity
- Pro: multiple memory lines that map to the same cache set can reside in the cache simultaneously
- Con: more time is needed to determine whether there is a hit, because multiple cache blocks must be checked

Ways To Improve Cache Performance
Use multiple levels of cache
- Access time of a non-primary cache is not as important; it is more important for it to have a lower miss rate
- Pro: reduces the (average) miss penalty if there is a hit in the lower level of cache
- Con: takes up space and increases (worst-case) latency if the access misses in this level of cache
Make the block size larger to exploit spatial locality
- Pro: fewer misses for sequential accesses
- Pro: fewer bits dedicated to tags
- Con: fewer blocks in the cache for a given cache size
- Con: the miss penalty may be larger because larger blocks must be retrieved from the lower level of the hierarchy

Example
2-way set associative $
- Block size = 4 words
- Total size = 32 words
Word addresses: 2, 35, 63, 110, 210, 77, 3, 97, 170
What is the hit rate?

Cache Writes
There are multiple copies of the data lying around:
- L1 cache, L2 cache, DRAM
Do we write to all of them?
Do we wait for the write to complete before the processor can proceed?

Do we write to all of them?
Write-through: write to all levels of the hierarchy
Write-back: write to the lower level only when the cache line gets evicted from the cache
- Creates inconsistent data: different values for the same item in the cache and in DRAM
- The inconsistent copy in the highest cache level is referred to as dirty

Write-Through vs Write-Back
[diagrams: for sw $3, 0($5), write-through sends the store down through L1, L2, and DRAM; write-back updates only L1]

Write-through vs Write-back
Which performs the write faster?
- Write-back: it only writes the L1 cache
Which has faster evictions from a cache?
- Write-through: no write involved, just overwrite the tag
Which causes more bus traffic?
- Write-through: DRAM is written on every store; write-back only writes on eviction
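A compact C sketch contrasting the two store policies for a one-level, direct-mapped cache over a DRAM array (all sizes and names here are illustrative): write-through updates both copies on every store, while write-back only marks the line dirty and defers the DRAM write until the line is evicted.

#include <stdio.h>
#include <stdint.h>

#define NLINES 4

static uint32_t dram[1024];                    /* word-addressed "DRAM" */
static struct { int valid, dirty; uint32_t tag, data; } cache[NLINES];

void store_write_through(uint32_t addr, uint32_t value) {
    uint32_t idx = addr % NLINES;
    cache[idx].valid = 1;
    cache[idx].tag   = addr / NLINES;
    cache[idx].data  = value;
    dram[addr] = value;                        /* write all levels, every store */
}

void store_write_back(uint32_t addr, uint32_t value) {
    uint32_t idx = addr % NLINES, tag = addr / NLINES;
    if (cache[idx].valid && cache[idx].dirty && cache[idx].tag != tag)
        dram[cache[idx].tag * NLINES + idx] = cache[idx].data;  /* eviction: write back */
    cache[idx].valid = 1;
    cache[idx].dirty = 1;                      /* DRAM copy is now stale ("dirty") */
    cache[idx].tag   = tag;
    cache[idx].data  = value;
}

int main(void) {
    store_write_back(3, 42);                   /* only the cache is updated */
    printf("dram[3] before eviction: %u\n", dram[3]);   /* still 0 */
    store_write_back(7, 9);                    /* maps to the same line: evicts, writes back */
    printf("dram[3] after eviction:  %u\n", dram[3]);   /* now 42  */
    return 0;
}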

Beyond The Cache: Memory

Memory System Design Challenges
DRAM is designed for density, not speed
DRAM is slower than the bus
We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow
Widening anything increases the cost by quite a bit

Narrow Configuration
[CPU - Cache - Bus - DRAM, all one word wide]
Given:
- 1 clock cycle to send the request
- 15 cycles/word DRAM latency
- 1 cycle/word bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/word * 8 words + 1 cycle/word * 8 words = 129 cycles

Wide Configuration
[CPU - Cache - Bus - DRAM, two words wide]
Given:
- 1 clock cycle to send the request
- 15 cycles/2 words DRAM latency
- 1 cycle/2 words bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/2 words * 8 words + 1 cycle/2 words * 8 words = 65 cycles

Interleaved Configuration
[CPU - Cache - Bus - two DRAM banks; byte 0 in DRAM 0, byte 1 in DRAM 1, byte 2 in DRAM 0, ...]
Given:
- 1 clock cycle to send the request
- 15 cycles/word DRAM latency (the two banks access in parallel)
- 1 cycle/word bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/2 words * 8 words + 1 cycle/word * 8 words = 69 cycles
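All three calculations follow the same pattern, captured in this C sketch (the model and parameter names are mine, not the lecture's): one request cycle, DRAM accesses sped up by both the width and the number of parallel banks, and bus transfers sped up only by the width.

#include <stdio.h>

static int penalty(int block_words, int dram_cyc, int bus_cyc,
                   int width, int banks) {
    int dram = dram_cyc * block_words / (width * banks);  /* banks overlap accesses */
    int bus  = bus_cyc  * block_words / width;            /* bus moves width words/cycle */
    return 1 + dram + bus;                                /* 1 cycle for the request */
}

int main(void) {
    printf("narrow:      %d cycles\n", penalty(8, 15, 1, 1, 1));  /* 129 */
    printf("wide:        %d cycles\n", penalty(8, 15, 1, 2, 1));  /* 65  */
    printf("interleaved: %d cycles\n", penalty(8, 15, 1, 1, 2));  /* 69  */
    return 0;
}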

DRAM Optimizations
Fast page mode
- Allow repeated accesses to the row buffer without another row access
Synchronous DRAM (SDRAM)
- Add a clock signal to the DRAM interface to make it synchronous
- A programmable register holds the number of bytes to transfer over many cycles
Double Data Rate (DDR)
- Transfer data on both the rising and falling clock edges instead of just one
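To see why the DDR change matters, a quick back-of-the-envelope in C; the bus width and clock rate here are assumed for illustration, not taken from the lecture.

#include <stdio.h>

int main(void) {
    double clock_mhz = 100.0, bus_bytes = 8.0;   /* assumed: 64-bit bus at 100 MHz */
    double sdr = bus_bytes * clock_mhz;          /* MB/s, one transfer per cycle   */
    double ddr = bus_bytes * clock_mhz * 2.0;    /* MB/s, both clock edges         */
    printf("single rate: %.0f MB/s, DDR: %.0f MB/s\n", sdr, ddr);
    return 0;
}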

DRAM Optimizations
Make the DRAM chip act like a memory system: each chip has interleaved memory and a high-speed interface
RDRAM
- Switch the RAS/CAS lines to a bus that allows multiple accesses to be in flight simultaneously
- You don't have to wait for one DRAM request to finish before sending another request
Direct RDRAM
- Don't multiplex over one bus; have 3 buses: data, row, column