Cache Memory
COE 403: Computer Architecture
Prof. Muhamed Mudawar
Computer Engineering Department, King Fahd University of Petroleum and Minerals

Presentation Outline
- The Need for Cache Memory
- The Basics of Caches
- Cache Performance
- Multilevel Caches

Processor-Memory Performance Gap
- CPU performance grew about 55% per year, slowing down after 2004
- DRAM performance grew only about 7% per year, creating a widening performance gap
- 1980: no caches in microprocessors
- 1995: two-level caches in microprocessors

The Need for Cache Memory
- Widening speed gap between CPU and main memory
  - A processor operation takes less than 0.5 ns
  - A main memory access requires more than 50 ns
- Each instruction involves at least one memory access
  - One memory access to fetch the instruction
  - A second memory access for load and store instructions
- Memory bandwidth limits the instruction execution rate
- Cache memory can help bridge the CPU-memory gap
- Cache memory is small in size but fast

Typical Memory Hierarchy
- Register file is at the top of the memory hierarchy
  - Typical size < 1 KB, access time < 0.3 ns
- Level 1 Cache (~64 KB): access time ~1 ns
- L2 Cache (megabytes): access time ~5 ns
- Main Memory (gigabytes): access time ~50 ns
- Disk Storage (terabytes): access time ~5 ms
- [Figure: hierarchy from the register file through the L1 and L2 caches, the memory controller and memory bus to main memory, and over the I/O bus to magnetic disk or flash; levels get faster toward the processor and bigger toward the bottom]

Principle of Locality of Reference
- Programs access a small portion of their address space
  - At any time, only a small set of instructions & data is needed
- Temporal locality (locality in time)
  - If an item is accessed, it will probably be accessed again soon
  - The same loop instructions are fetched each iteration
  - The same procedure may be called and executed many times
- Spatial locality (locality in space)
  - Tendency to access contiguous instructions/data in memory
  - Sequential execution of instructions
  - Traversing arrays element by element

Typical Memory Reference Patterns
- [Figure: address vs. time plot showing instruction fetches repeating over n loop iterations, stack accesses around subroutine call, argument access, and subroutine return, and scalar data accesses]

What is a Cache Memory?
- Small and fast (SRAM) memory technology
- Stores the subset of instructions & data currently being accessed
- Used to reduce the average access time to memory
- Caches exploit temporal locality by keeping recently accessed data closer to the processor
- Caches exploit spatial locality by moving blocks consisting of multiple contiguous words
- Goal: achieve the fast speed of cache memory access while balancing the cost of the memory system

Almost Everything is a Cache!
- In computer architecture, almost everything is a cache!
- Registers: a cache on variables, managed by software
- First-level cache: a cache on memory
- Second-level cache: a cache on memory
- Memory: a cache on hard disk
  - Stores recent programs and their data
  - The hard disk can be viewed as an extension of main memory
- Branch target and prediction buffers: caches on branch target and prediction information

Next...
- The Need for Cache Memory
- The Basics of Caches
- Cache Performance
- Multilevel Caches

Inside a Cache Memory
- [Figure: processor connected to the cache by address and data lines, and the cache connected to main memory; the cache holds N cache blocks, each with an address tag: Tag 0 with Cache Block 0 through Tag N-1 with Cache Block N-1]
- Tags identify the blocks stored in the cache
- Cache block (or cache line): the unit of data transfer between main memory and the cache
- Large block size: less tag overhead + burst transfer from DRAM
- Typically, cache block size = 64 bytes in recent caches

Four Basic Questions on Caches
- Q1: Where can a block be placed in a cache? (Block placement)
  - Direct mapped, set associative, fully associative
- Q2: How is a block found in a cache? (Block identification)
  - Block address, tag, index
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random, FIFO, LRU, NRU
- Q4: What happens on a write? (Write strategy)
  - Write back or write through (with a write buffer)

Block Placement
- [Figure: 32 main-memory blocks with 5-bit block addresses (00000 through 11111) mapped into an 8-entry direct-mapped cache (index 000 to 111), a 2-way set-associative cache with 4 sets (set index 00 to 11), and a fully associative cache with no indexing]
- Direct mapped: a memory block is mapped to one cache index
  - Cache index = lower 3 bits of the block address
- 2-way set associative: a memory block can be placed anywhere inside a set
  - Set index = lower 2 bits of the block address
- Fully associative: a memory block can be placed anywhere inside the cache

Block Identification
- A memory address is divided into:
  - Block address: identifies the block in memory
  - Block offset: to access bytes within a block
- The block address is further divided into:
  - Index: used for direct cache access
  - Tag: upper bits of the block address
- The tag must be stored inside the cache for block identification
- A valid bit is also required, indicating whether the cache block is valid or not
- [Figure: address split into Tag | Index | Offset; the indexed entry's valid bit and stored tag are compared against the address tag to produce the Hit signal and select the data]

Direct Mapped Cache
- [Figure: the block address is split into Tag (t bits), Index (k bits), and Offset (b bits); a decoder selects one of the 2^k entries (Tag 0 / Data Block 0 through Tag 2^k-1 / Block 2^k-1, each with a valid bit); the stored tag is compared against the address tag and gated by the valid bit to produce the Hit signal; aligners on the data-in and data-out paths use the word select and byte offset]
- Write access to the data block is enabled only on a cache hit
- The aligner shifts data within a word according to the byte offset

Direct Mapped Cache
- The index is decoded to access one cache entry
- The address tag is compared against the cache entry tag
- If the tags are equal and the entry is valid, then: cache hit; otherwise: cache miss
- The cache index field is k bits long: number of cache blocks = 2^k
- The block offset field is b bits long (word offset ǁ byte offset): number of bytes in a block = 2^b
- The tag field is t bits long: memory address = t + k + b bits
- Cache data size = 2^(k+b) bytes
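
The field decomposition above is easy to check in code. Below is a minimal sketch, not from the slides, assuming 64-byte blocks (b = 6) and 2^9 = 512 blocks (k = 9), so a 32 KB direct-mapped cache; all names and parameters are illustrative.

```python
# Illustrative direct-mapped cache: split an address into tag/index/offset,
# then hit-check against one (valid, tag) entry per index.

B = 6   # block offset bits -> 2^6 = 64 bytes per block (assumed)
K = 9   # index bits        -> 2^9 = 512 cache blocks   (assumed)

def decompose(address: int):
    """Split a byte address into (tag, index, offset) fields."""
    offset = address & ((1 << B) - 1)          # lowest b bits
    index = (address >> B) & ((1 << K) - 1)    # next k bits
    tag = address >> (B + K)                   # remaining upper bits
    return tag, index, offset

valid = [False] * (1 << K)                     # one entry per index
tags = [0] * (1 << K)

def access(address: int) -> bool:
    """Return True on a cache hit; on a miss, fill the mapped entry."""
    tag, index, _ = decompose(address)
    if valid[index] and tags[index] == tag:    # tags equal and entry valid
        return True
    valid[index], tags[index] = True, tag      # miss: replace the mapped block
    return False
```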

Fully Associative Cache
- [Figure: the address is split into Tag (t bits) and Offset (b bits), with no index field; each of the m entries (Tag 0 through Tag m-1) holds a valid bit, tag, and data block; m comparators match the address tag against all stored tags in parallel to produce the Hit signal and enable the matching block for read or write; aligners handle the word select and byte offset; in general, m-way associative]

Fully Associative Cache
- m blocks exist inside the cache
- A block can be placed anywhere inside the cache (no indexing)
- m comparators are used to match the address tag against all cache tags in parallel
- If a block is found with a matching tag, then: cache hit
  - Select the data block with the matching tag, or
  - Enable writing of the data block with the matching tag on a cache write
- Block offset field = b bits = word offset ǁ byte offset
  - The word offset selects a word (column) within the data block
  - The byte offset is used when reading/writing less than a word
- Cache data size = m × 2^b bytes

2-Way Set Associative Cache
- [Figure: the address is split into Tag (t bits), Index (k bits), and Offset (b bits); the decoder selects one of the 2^k sets, each containing two entries (way 0 and way 1), each with a valid bit, tag, and data block; two comparators check the address tag against both stored tags in parallel to produce the Hit signal and select the data through the aligners]
- In general: m-way set associative

Set-Associative Cache
- A set is a group of blocks that can be indexed
- One set = m blocks: m-way set associative
- The set index field is k bits long: number of sets = 2^k
- The set index is decoded and only one set is examined
  - m tags are checked in parallel using m comparators
- If the address tag matches a stored tag within the set, then: cache hit; otherwise: cache miss
- Cache data size = m × 2^(k+b) bytes (2^b bytes per block)
- A direct-mapped cache has one block per set (m = 1)
- A fully-associative cache has only one set (k = 0)
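
As a quick check of the sizing formula, a minimal sketch with assumed example parameters (a 4-way cache with 128 sets and 64-byte blocks; not from the slides):

```python
# Cache sizing from the slide's formula: data size = m * 2^(k+b) bytes.
m, k, b = 4, 7, 6                 # ways, index bits, offset bits (assumed)
sets = 1 << k                     # 2^k = 128 sets
block_bytes = 1 << b              # 2^b = 64 bytes per block
data_size = m * sets * block_bytes
print(data_size)                  # 4 * 128 * 64 = 32768 bytes = 32 KB
```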

Handling a Cache Read
- The processor issues an address
- The address tag is compared against all tags stored in the indexed set
- Cache hit: return the data
- Cache miss:
  - Send the address to the lower-level cache or main memory
  - Select a victim block and send a miss signal to the processor
  - Transfer the block from lower-level memory, replace the victim block with the new block, and return the data
- Miss penalty: clock cycles to process a cache miss

Replacement Policy
- Which block should be replaced on a cache miss?
  - No choice for a direct-mapped cache
  - m choices for an m-way set-associative cache
- Random replacement
  - The candidate block is selected randomly
  - One counter for all sets (0 to m-1), incremented on every cycle
  - On a cache miss, replace the block specified by the counter
- First In First Out (FIFO) replacement
  - Replace the oldest block in the set (round-robin)
  - One counter per set (0 to m-1) specifies the oldest block to replace
  - The counter is incremented on a cache miss

Replacement Policy (cont'd)
- Least Recently Used (LRU)
  - Replace the block that has been unused for the longest time
  - LRU state must be updated on every cache hit
  - If m = 2, there are only 2 permutations (a single bit is needed)
  - LRU approximations are used in practice for m = 4 and 8 ways
- Not Recently Used (NRU)
  - One reference bit (R-bit) per block: m reference bits per set
  - The R-bit is set to 1 when the block is referenced
  - Block replacement favors blocks with R-bit = 0 in the set
  - Pick any victim block with R-bit = 0 (random or leftmost)
  - If all blocks have R-bit = 1, then replace any block in the set
  - Block replacement clears the R-bits of all blocks in the set
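
A minimal sketch (illustrative, not from the slides) of victim selection within one m-way set under LRU and NRU; a set is modeled as a list of tags, kept ordered from least to most recently used for LRU.

```python
# Illustrative LRU and NRU behavior for a single set.

def lru_access(set_tags: list, tag, m: int) -> bool:
    """LRU: on a hit, move the tag to the MRU end; on a miss, evict the LRU tag."""
    if tag in set_tags:
        set_tags.remove(tag)
        set_tags.append(tag)          # update LRU state on every hit
        return True
    if len(set_tags) == m:
        set_tags.pop(0)               # evict the least recently used block
    set_tags.append(tag)
    return False

def nru_victim(r_bits: list) -> int:
    """NRU: pick the leftmost block with R-bit = 0; if all are 1, clear and pick."""
    for i, r in enumerate(r_bits):
        if r == 0:
            return i                  # favor blocks with R-bit = 0
    for i in range(len(r_bits)):      # all R-bits set: clear them all
        r_bits[i] = 0
    return 0                          # then replace any block (leftmost here)
```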

Comparing Random, FIFO, and LRU
- Data cache misses per 1000 instructions: 10 SPEC2000 benchmarks on an Alpha processor, block size of 64 bytes
- LRU and FIFO outperform Random for a small cache
- Little difference between LRU and Random for a large cache
- LRU is expensive for large associativity (when m is large)
- Random is the simplest to implement in hardware

            2-way                  4-way                  8-way
  Size      LRU    Rand   FIFO     LRU    Rand   FIFO     LRU    Rand   FIFO
  16 KB     114.1  117.3  115.5    111.7  115.1  113.3    109.0  111.8  110.4
  64 KB     103.4  104.3  103.9    102.4  102.3  103.1     99.7  100.5  100.3
  256 KB     92.2   92.1   92.5     92.1   92.1   92.5     92.1   92.1   92.5

Write Policy
- Write Through
  - Writes update both the cache and lower-level memory
  - Cache control bit: only a valid bit is needed
  - Memory always has the latest data, which simplifies data coherency
  - Cached data can always be discarded when a block is replaced
- Write Back
  - Writes update the cache only
  - Cache control bits: valid and modified bits are required
  - Modified cached data is written back to memory when replaced
  - Multiple writes to a cache block require only one write to memory
  - Uses less memory bandwidth than write-through, and less power
  - However, more complex to implement than write-through

Write Miss Policy
- What happens on a write miss?
- Write Allocate
  - Allocate a new block in the cache
  - A write miss acts like a read miss: the block is fetched, then updated
- No Write Allocate
  - Send the data to lower-level memory; the cache is not modified
- All write-back caches use write-allocate on a write miss
  - Subsequent writes will be captured in the cache
- Write-through caches might choose no-write-allocate
  - Reasoning: writes must still go to lower-level memory
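
A minimal sketch (illustrative, not from the slides) contrasting the two common pairings above for a single block-sized write; `cache` and `memory` are stand-in dictionaries, not a real cache structure.

```python
# Illustrative write handling: write-through + no-write-allocate vs.
# write-back + write-allocate, at block granularity.

cache = {}    # maps block address -> {"data": ..., "modified": bool}
memory = {}   # stands in for lower-level memory

def write_through_no_allocate(block_addr, data):
    """Write-through: memory is always updated; misses do not fill the cache."""
    if block_addr in cache:
        cache[block_addr]["data"] = data
    memory[block_addr] = data    # every write goes to lower-level memory

def write_back_allocate(block_addr, data):
    """Write-back: only the cache is updated; a write miss fetches the block."""
    if block_addr not in cache:  # write miss acts like a read miss
        cache[block_addr] = {"data": memory.get(block_addr), "modified": False}
    cache[block_addr]["data"] = data
    cache[block_addr]["modified"] = True   # written back only on eviction
```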

Write Buffer
- A write buffer is a queue that holds: address + write data
- Write-through: all writes are sent to lower-level memory
  - The buffer decouples the write from the memory-bus write
  - Writes occur without stalling the processor, until the buffer is full
- Problem: the write buffer may hold the data needed by a read miss
  - If the address is found in the write buffer, return its data value
  - Otherwise, transfer the block from lower-level memory and update the D-cache
- [Figure: the processor sends the address and write data to a write-through D-cache; the write buffer queues block addresses and write data on their way to lower-level memory]

Victim Buffer
- Holds modified evicted blocks and their addresses
  - When a modified block is replaced (evicted) in the D-cache
  - Prepares the modified block to be written back to lower-level memory
- The victim buffer decouples the write-back to lower-level memory
  - Gives priority to read misses over write-backs, to reduce the miss penalty
- Problem: the victim buffer may hold the block needed by a cache miss
  - Solution: transfer the modified block from the victim buffer into the data cache
- [Figure: the processor reads and writes a write-back D-cache; evicted modified blocks and their addresses pass through the victim buffer to lower-level memory, and a buffered block can be transferred back into the D-cache]

Write Through
- Flowchart for write-through with no write allocate:
  - The processor issues address + wdata
  - Compare the address tag against all tags stored in the indexed set
  - Queue address + wdata into the write buffer (stall the processor if the write buffer is full)
  - Write hit: write the D-cache
  - Write miss: do nothing

Write Back
- Flowchart for write-back with write allocate:
  - The processor issues address + wdata
  - Compare the address tag against all tags stored in the indexed set
  - Write hit: write the D-cache and set the M-bit
  - Write miss:
    - Send the block address to lower-level memory and look up the victim buffer
    - Select a block to replace; if modified, move it into the victim buffer
    - Transfer the missing block into the D-cache
    - Write the D-cache and set the M-bit

Next...
- The Need for Cache Memory
- The Basics of Caches
- Cache Performance
- Multilevel Caches

Hit Rate and Miss Rate
- Hit Rate = Hits / (Hits + Misses)
- Miss Rate = Misses / (Hits + Misses)
- I-Cache miss rate = miss rate in the instruction cache
- D-Cache miss rate = miss rate in the data cache
- Example: out of 1000 instructions fetched, 60 missed in the I-Cache; 25% of the instructions are load-store, of which 50 missed in the D-Cache. What are the I-cache and D-cache miss rates?
  - I-Cache Miss Rate = 60 / 1000 = 6%
  - D-Cache Miss Rate = 50 / (25% × 1000) = 50 / 250 = 20%

Memory Stall Cycles
- The processor stalls on a cache miss:
  - When fetching an instruction from the I-Cache and the block is not present
  - When loading/storing data from the D-Cache and the block is not present
- Miss penalty: clock cycles to process a cache miss
  - The miss penalty is assumed equal for the I-cache & D-cache
  - The miss penalty is assumed equal for load and store
- Memory Stall Cycles = I-Cache Misses × Miss Penalty + D-Cache Read Misses × Miss Penalty + D-Cache Write Misses × Miss Penalty

Combined Misses
- Combined Misses = I-Cache Misses + D-Cache Read Misses + D-Cache Write Misses
  - I-Cache Misses = I-Count × I-Cache Miss Rate
  - Read Misses = Load Count × D-Cache Read Miss Rate
  - Write Misses = Store Count × D-Cache Write Miss Rate
- Combined misses are often reported per 1000 instructions
- Memory Stall Cycles = Combined Misses × Miss Penalty

Memory Stall Cycles Per Instruction
- Memory Stall Cycles Per Instruction = Combined Misses Per Instruction × Miss Penalty
  - The miss penalty is assumed equal for the I-cache & D-cache
  - The miss penalty is assumed equal for load and store
- Combined Misses Per Instruction = I-Cache Miss Rate + Load Frequency × D-Cache Read Miss Rate + Store Frequency × D-Cache Write Miss Rate

Example on Memory Stall Cycles
- Consider a program with the given characteristics:
  - 20% of instructions are loads and 10% are stores
  - I-cache miss rate is 2%
  - D-cache miss rate is 5% for loads and 1% for stores
  - Miss penalty is 20 clock cycles for all cases (I-Cache & D-Cache)
  - Compute the combined misses and stall cycles per instruction
- Combined misses per instruction in the I-Cache and D-Cache:
  - 2% + 20% × 5% + 10% × 1% = 0.031 combined misses per instruction
  - Equal to an average of 31 misses per 1000 instructions
- Memory stall cycles per instruction:
  - 0.031 × 20 (miss penalty) = 0.62 memory stall cycles per instruction
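
As a quick check, the example's numbers recomputed in Python (variable names are ours, values from the slide):

```python
# Combined misses and stall cycles per instruction (slide example).
i_miss_rate = 0.02                   # I-cache miss rate
load_freq, store_freq = 0.20, 0.10   # instruction mix
load_miss, store_miss = 0.05, 0.01   # D-cache miss rates
miss_penalty = 20                    # cycles

combined = i_miss_rate + load_freq * load_miss + store_freq * store_miss
print(combined)                  # 0.031 misses per instruction (31 per 1000)
print(combined * miss_penalty)   # 0.62 memory stall cycles per instruction
```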

CPU Time with Memory Stall Cycles
- CPU Time = (CPU clock cycles + Memory Stall Cycles) × Clock Cycle
- CPU Time = I-Count × CPI × Clock Cycle
- CPI = CPI_execution + Memory Stall Cycles per Instruction
  - CPI = CPI in the presence of cache misses
  - CPI_execution = CPI for an ideal cache (no cache misses)
- Memory stall cycles per instruction increase the CPI

Example on CPI with Memory Stalls
- A processor has CPI_execution of 1.5 (perfect cache)
  - I-Cache miss rate is 2%; D-cache miss rate is 5% for load & store
  - 20% of instructions are loads and stores
  - Cache miss penalty is 100 clock cycles for the I-cache and D-cache
- What is the impact on the CPI?
- Answer:
  - Memory stall cycles per instruction = 0.02 × 100 (I-Cache) + 0.2 × 0.05 × 100 (D-Cache) = 3
  - CPI = 1.5 + 3 = 4.5 cycles per instruction
  - CPI / CPI_execution = 4.5 / 1.5 = 3
  - The processor is 3 times slower due to memory stall cycles
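
The same CPI arithmetic as a tiny sketch (our variable names, slide values):

```python
# CPI impact of memory stalls (slide example).
cpi_exec = 1.5
stalls = 0.02 * 100 + 0.20 * 0.05 * 100   # I-cache + D-cache stall cycles
cpi = cpi_exec + stalls
print(stalls, cpi, cpi / cpi_exec)        # 3.0, 4.5, 3.0x slowdown
```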

Average Memory Access Time
- Average Memory Access Time (AMAT): AMAT = Hit Time + Miss Rate × Miss Penalty
  - Hit Time = time to hit in the cache
- Two parts: instruction access + data access
- Overall AMAT = %I-Access × AMAT_I-Cache + %D-Access × AMAT_D-Cache
  - AMAT_I-Cache = average memory access time in the I-Cache
  - AMAT_D-Cache = average memory access time in the D-Cache
  - %I-Access = instruction accesses / total memory accesses
  - %D-Access = data accesses / total memory accesses

AMAT Example
- Compute the overall average memory access time:
  - Hit time = 1 clock cycle in both the I-Cache and D-Cache
  - Miss penalty = 50 clock cycles for the I-Cache and D-Cache
  - I-Cache misses = 3.8 misses per 1000 instructions
  - D-Cache misses = 41 misses per 1000 instructions
  - Load+store frequency = 25%
- Solution:
  - I-Cache miss rate = 3.8 / 1000 = 0.0038
  - D-Cache miss rate = 41 / (1000 × 0.25) = 0.164 per access
  - %I-Access = 100 / (100 + 25) = 0.8 (80% of total memory accesses)
  - %D-Access = 25 / (100 + 25) = 0.2 (20% of total memory accesses)
  - Overall AMAT = 0.8 × (1 + 0.0038 × 50) + 0.2 × (1 + 0.164 × 50) = 2.79 cycles
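
A quick recomputation of the overall AMAT (our variable names, slide values):

```python
# Overall AMAT across instruction and data accesses (slide example).
hit_time, miss_penalty = 1, 50
i_mr = 3.8 / 1000                        # I-cache miss rate per fetch
d_mr = 41 / (1000 * 0.25)                # D-cache miss rate per data access
i_frac, d_frac = 1 / 1.25, 0.25 / 1.25   # 80% instruction, 20% data accesses

amat = i_frac * (hit_time + i_mr * miss_penalty) \
     + d_frac * (hit_time + d_mr * miss_penalty)
print(round(amat, 2))                    # 2.79 cycles
```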

Next...
- The Need for Cache Memory
- The Basics of Caches
- Cache Performance
- Multilevel Caches

Multilevel Caches
- The top-level cache is kept small to:
  - Reduce hit time
  - Reduce energy per access
- Another cache level is added to:
  - Reduce the memory gap
  - Reduce memory-bus loading
- Multilevel caches can help:
  - Reduce the miss penalty
  - Reduce the average memory access time (AMAT)
- A large L2 cache can capture many misses in the L1 caches, reducing the global miss rate
- [Figure: processor core with split I-Cache and D-Cache, backed by a unified L2 cache, backed by main memory]

Multilevel Inclusion
- L1 cache blocks are always present in the L2 cache
  - Wastes L2 cache space, but L2 has space for additional blocks
- A miss in L1 but a hit in L2 copies the block from L2 to L1
- A miss in both L1 and L2 brings the block into both L1 and L2
- A write in L1 causes data to be written in both L1 and L2
  - A write-through policy is used from L1 to L2
  - A write-back policy is used from L2 to lower-level memory, to reduce traffic on the memory bus
- A replacement (or invalidation) in L2 must be seen in L1
  - A block which is evicted from L2 must also be evicted from L1

Multilevel Exclusion
- L1 cache blocks are never found in the L2 cache
  - Prevents wasting space
- A cache miss in L1 but a hit in L2 results in a swap of blocks
- A cache miss in both L1 and L2 brings the block into L1 only
  - The block replaced in L1 is moved into L2
  - The L2 cache stores blocks evicted from L1, in case they are needed later in L1
- Write-back policy from L1 to L2, and write-back from L2 to lower-level memory
- Block sizes may be the same or different in L1 and L2
  - Pentium 4 had 64-byte blocks in L1 but 128-byte blocks in L2
  - Core i7 uses 64-byte blocks at all cache levels (simpler)

Local and Global Miss Rates
- Local miss rate: number of cache misses / memory accesses to this cache
  - Miss Rate_L1 for the L1 cache
  - Miss Rate_L2 for the L2 cache
- Global miss rate: number of cache misses / memory accesses generated by the processor
  - Miss Rate_L1 for the L1 cache (same as local)
  - Miss Rate_L1 × Miss Rate_L2 for the L2 cache (different from local)
- The global miss rate is a better measure for the L2 cache
  - It is the fraction of total memory accesses that miss in both L1 and L2

Example of Local & Global Miss Rates
- Problem: suppose that out of 1000 instructions:
  - 5 instructions missed in the I-Cache and 39 missed in the D-Cache
  - 15 accesses missed in the L2 cache
  - Load + store instruction frequency = 30%
  - Compute: L1 miss rate, L2 local and global miss rates
- Solution:
  - L1 misses per instruction = (5 + 39) / 1000 = 0.044
  - Memory accesses per instruction = 1 + 0.3 = 1.3
  - L1 Miss Rate = 0.044 / 1.3 = 0.0338 per access
  - L2 Miss Rate = 15 / 44 = 0.341 per L2 access
  - L2 Global Miss Rate = 0.0338 × 0.341 = 0.0115 (equal to 15 / 1000 / 1.3)
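
The local/global distinction recomputed as a sketch (our variable names, slide values):

```python
# Local vs. global miss rates for a two-level hierarchy (slide example).
l1_misses, l2_misses, instructions = 5 + 39, 15, 1000
accesses_per_instr = 1 + 0.30               # one fetch + 30% load/store

l1_rate = l1_misses / instructions / accesses_per_instr
l2_local = l2_misses / l1_misses            # L2 is accessed only on L1 misses
l2_global = l1_rate * l2_local
print(round(l1_rate, 4), round(l2_local, 3), round(l2_global, 4))
# 0.0338 per access, 0.341 per L2 access, 0.0115 global
```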

Average Memory Access Time
- AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  - Hit Time_L1 = Hit Time_I-Cache = Hit Time_D-Cache
  - Miss Penalty_L1 = Miss Penalty_I-Cache = Miss Penalty_D-Cache
- Miss penalty in the presence of an L2 cache:
  - Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
- AMAT in the presence of an L2 cache:
  - AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

L1 Miss Rate
- For split I-Cache and D-Cache:
  - Miss Rate_L1 = %Iacc × Miss Rate_I-Cache + %Dacc × Miss Rate_D-Cache
  - %Iacc = percent of instruction accesses = 1 / (1 + %LS)
  - %Dacc = percent of data accesses = %LS / (1 + %LS)
  - %LS = frequency of load and store instructions
- L1 misses per instruction:
  - Misses per Instruction_L1 = Miss Rate_L1 × (1 + %LS)
  - Misses per Instruction_L1 = Miss Rate_I-Cache + %LS × Miss Rate_D-Cache

AMAT Example with L2 Cache
- Problem: compute the average memory access time
  - I-Cache miss rate = 1%, D-Cache miss rate = 10%
  - L2 Cache miss rate = 40%
  - L1 hit time = 1 clock cycle (identical for I-Cache and D-Cache)
  - L2 hit time = 8 cycles, L2 miss penalty = 100 cycles
  - Load + store instruction frequency = 25%
- Solution:
  - Memory accesses per instruction = 1 + 25% = 1.25
  - Misses per Instruction_L1 = 1% + 25% × 10% = 0.035
  - Miss Rate_L1 = 0.035 / 1.25 = 0.028 (2.8%)
  - Miss Penalty_L1 = 8 + 0.4 × 100 = 48 cycles
  - AMAT = 1 + 0.028 × 48 = 2.344 cycles
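
The same two-level AMAT as a sketch (our variable names, slide values):

```python
# AMAT with an L2 cache behind a split L1 (slide example).
hit_l1, hit_l2, penalty_l2 = 1, 8, 100
mr_i, mr_d, mr_l2, ls = 0.01, 0.10, 0.40, 0.25

miss_rate_l1 = (mr_i + ls * mr_d) / (1 + ls)    # 0.035 / 1.25 = 0.028
penalty_l1 = hit_l2 + mr_l2 * penalty_l2        # 8 + 40 = 48 cycles
print(hit_l1 + miss_rate_l1 * penalty_l1)       # 1 + 0.028 * 48 = 2.344
```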

Memory Stall Cycles Per Instruction
- Memory stall cycles per instruction
  = Memory Accesses per Instruction × Miss Rate_L1 × Miss Penalty_L1
  = (1 + %LS) × Miss Rate_L1 × Miss Penalty_L1
  = (1 + %LS) × Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
- Equivalently, memory stall cycles per instruction
  = Misses per Instruction_L1 × Hit Time_L2 + Misses per Instruction_L2 × Miss Penalty_L2
  - Misses per Instruction_L1 = (1 + %LS) × Miss Rate_L1
  - Misses per Instruction_L2 = (1 + %LS) × Miss Rate_L1 × Miss Rate_L2

Two-Level Cache Performance
- Problem: out of 1000 addresses generated by a program:
  - I-Cache misses = 5, D-Cache misses = 35, L2 Cache misses = 8
  - L1 hit = 1 cycle, L2 hit = 8 cycles, L2 miss penalty = 80 cycles
  - Load + store frequency = 25%, CPI_execution = 1.1
  - Compute the memory stall cycles per instruction and the effective CPI
  - If the L2 cache is removed, what will be the effective CPI?
- Solution:
  - L1 Miss Rate = (5 + 35) / 1000 = 0.04 (or 4% per access)
  - L1 misses per instruction = 0.04 × (1 + 0.25) = 0.05
  - L2 misses per instruction = (8 / 1000) × 1.25 = 0.01
  - Memory stall cycles per instruction = 0.05 × 8 + 0.01 × 80 = 1.2
  - CPI_L1+L2 = 1.1 + 1.2 = 2.3; CPI / CPI_execution = 2.3 / 1.1 = 2.1x slower
  - CPI_L1only = 1.1 + 0.05 × 80 = 5.1 (worse)
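
A final sketch recomputing the two-level example (our variable names, slide values):

```python
# Effective CPI with and without an L2 cache (slide example).
accesses_per_instr = 1.25
l1_mr = (5 + 35) / 1000                     # L1 miss rate per memory access
l1_mpi = l1_mr * accesses_per_instr         # 0.05 L1 misses per instruction
l2_mpi = (8 / 1000) * accesses_per_instr    # 0.01 L2 misses per instruction

stalls_with_l2 = l1_mpi * 8 + l2_mpi * 80   # L2 hit time, L2 miss penalty
print(1.1 + stalls_with_l2)                 # CPI with L1+L2 = 2.3
print(1.1 + l1_mpi * 80)                    # CPI with L1 only = 5.1 (worse)
```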

Relative Execution Time by L2 Size
- [Figure: relative execution time as a function of L2 cache size; the reference is an 8 MB L2 cache with an L2 hit time of 1]
- Copyright 2012, Elsevier Inc. All rights reserved.

Power 7 On-Chip Caches (IBM 2010)
- 32 KB I-Cache per core: 4-way set-associative
- 32 KB D-Cache per core: 8-way set-associative, 3-cycle latency
- 256 KB unified L2 cache per core: 8-way set-associative, 8-cycle latency
- 32 MB unified shared L3 cache: embedded DRAM, 8 regions of 4 MB, 25-cycle latency to the local region