CPE 631 Lecture 04: CPU Caches

Similar documents
Aleksandar Milenkovich 1

Modern Computer Architecture

COSC 6385 Computer Architecture. - Memory Hierarchies (I)

COSC 6385 Computer Architecture - Memory Hierarchies (I)

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

Computer Architecture Spring 2016

CS152 Computer Architecture and Engineering Lecture 17: Cache System

COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory

Advanced Computer Architecture

ארכיטקטורת יחידת עיבוד מרכזי ת

Handout 4 Memory Hierarchy

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

EE 4683/5683: COMPUTER ARCHITECTURE

Page 1. Memory Hierarchies (Part 2)

ECE ECE4680

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

Performance! (1/latency)! 1000! 100! 10! Capacity Access Time Cost. CPU Registers 100s Bytes <10s ns. Cache K Bytes ns 1-0.

CISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work?

Course Administration

Topics. Digital Systems Architecture EECE EECE Need More Cache?

CS3350B Computer Architecture

ECE7995 (6) Improving Cache Performance. [Adapted from Mary Jane Irwin s slides (PSU)]

Let!s go back to a course goal... Let!s go back to a course goal... Question? Lecture 22 Introduction to Memory Hierarchies

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality

Memory Hierarchy Motivation, Definitions, Four Questions about Memory Hierarchy

Memory Hierarchy Review

CS152: Computer Architecture and Engineering Caches and Virtual Memory. October 31, 1997 Dave Patterson (http.cs.berkeley.

Time. Recap: Who Cares About the Memory Hierarchy? Performance. Processor-DRAM Memory Gap (latency)

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now?

Memory Hierarchy. ENG3380 Computer Organization and Architecture Cache Memory Part II. Topics. References. Memory Hierarchy

CSE 502 Graduate Computer Architecture. Lec 6-7 Memory Hierarchy Review

14:332:331. Week 13 Basics of Cache

Lecture 11. Virtual Memory Review: Memory Hierarchy

14:332:331. Week 13 Basics of Cache

CENG 3420 Computer Organization and Design. Lecture 08: Cache Review. Bei Yu

Time. Who Cares About the Memory Hierarchy? Performance. Where Have We Been?

Memory hierarchy review. ECE 154B Dmitri Strukov

LECTURE 11. Memory Hierarchy

EECS150 - Digital Design Lecture 11 SRAM (II), Caches. Announcements

CS152 Computer Architecture and Engineering Lecture 18: Virtual Memory

Administrivia. CMSC 411 Computer Systems Architecture Lecture 8 Basic Pipelining, cont., & Memory Hierarchy. SPEC92 benchmarks

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CPE 631 Lecture 06: Cache Design

Aleksandar Milenkovich 1

Lecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

Memory Hierarchy: The motivation

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568/668

3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition

ECE468 Computer Organization and Architecture. Virtual Memory

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

ECE4680 Computer Organization and Architecture. Virtual Memory

CS161 Design and Architecture of Computer Systems. Cache $$$$$

Caching Basics. Memory Hierarchies

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

CS3350B Computer Architecture

Caches Part 1. Instructor: Sören Schwertfeger. School of Information Science and Technology SIST

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568/668

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2

CS162 Operating Systems and Systems Programming Lecture 10 Caches and TLBs"

Q3: Block Replacement. Replacement Algorithms. ECE473 Computer Architecture and Organization. Memory Hierarchy: Set Associative Cache

Memory Hierarchy. Caching Chapter 7. Locality. Program Characteristics. What does that mean?!? Exploiting Spatial & Temporal Locality

Paging! 2/22! Anthony D. Joseph and Ion Stoica CS162 UCB Fall 2012! " (0xE0)" " " " (0x70)" " (0x50)"

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

Page 1. Review: Address Segmentation " Review: Address Segmentation " Review: Address Segmentation "

Memory Hierarchy Technology. The Big Picture: Where are We Now? The Five Classic Components of a Computer

CSE 502 Graduate Computer Architecture. Lec 5-6 Memory Hierarchy Review

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Introduction to cache memories

Memory. Lecture 22 CS301

CSC Memory System. A. A Hierarchy and Driving Forces

Memory Hierarchies 2009 DAT105

ECE331: Hardware Organization and Design

Memory Hierarchy: Motivation

Review: Performance Latency vs. Throughput. Time (seconds/program) is performance measure Instructions Clock cycles Seconds.

Question?! Processor comparison!

Memory Technologies. Technology Trends

Recap: The Big Picture: Where are We Now? The Five Classic Components of a Computer. CS152 Computer Architecture and Engineering Lecture 20.

Why memory hierarchy

ECE468 Computer Organization and Architecture. Memory Hierarchy

CSE 502 Graduate Computer Architecture. Lec 7-10 App B: Memory Hierarchy Review

Lecture 17 Introduction to Memory Hierarchies" Why it s important " Fundamental lesson(s)" Suggested reading:" (HP Chapter

Chapter 7 Large and Fast: Exploiting Memory Hierarchy. Memory Hierarchy. Locality. Memories: Review

NOW Handout Page 1. Review: Who Cares About the Memory Hierarchy? EECS 252 Graduate Computer Architecture. Lec 12 - Caches

The Memory Hierarchy & Cache

Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site:

LECTURE 10: Improving Memory Access: Direct and Spatial caches

CMPT 300 Introduction to Operating Systems

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3

Memory System Design Part II. Bharadwaj Amrutur ECE Dept. IISc Bangalore.

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

ECE331: Hardware Organization and Design

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN

Introduction. The Big Picture: Where are We Now? The Five Classic Components of a Computer

Transcription:

Lecture 04 CPU Caches Electrical and Computer Engineering University of Alabama in Huntsville Outline Memory Hierarchy Four Questions for Memory Hierarchy Cache Performance 26/01/2004 UAH- 2 1

Processor-DR Latency Gap 1000 Processor 2x/1.5 year CPU Performance 100 10 Processor-Memory Performance Gap grows 50% / year Memory 2x/10 years DR 1 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 Time 26/01/2004 UAH- 3 User sees as much memory as is available in cheapest technology and access it at the speed offered by the fastest technology Solution The Memory Hierarchy (MH) Levels in Memory Hierarchy Upper Lower Processor Control Datapath Speed Capacity Cost/bit Fastest Smallest Highest Slowest Biggest Lowest 26/01/2004 UAH- 4 2

Generations of Microprocessors Time of a full cache miss in instructions executed 1st Alpha 340 ns/5.0 ns = 68 clks x 2 or 136 2nd Alpha 266 ns/3.3 ns = 80 clks x 4 or 320 3rd Alpha 180 ns/1.7 ns =108 clks x 6 or 648 1/2X latency x 3X clock rate x 3X Instr/clock 5X 26/01/2004 UAH- 5 Why hierarchy works? Principal of locality Probability of reference Address space Rule of thumb Programs spend 90% of their execution time in only 10% of code Temporal locality recently accessed items are likely to be accessed in the near future Keep them close to the processor Spatial locality items whose addresses are near one another tend to be referenced close together in time Move blocks consisted of contiguous words to the upper level 26/01/2004 UAH- 6 3

Cache Measures To Processor From Processor Upper Level Memory Bl. X Lower Level Memory Bl. Y Hit time << Miss Penalty Hit data appears in some block in the upper level (Bl. X) Hit Rate the fraction of memory access found in the upper level Hit Time time to access the upper level (R access time + Time to determine hit/miss) Miss data needs to be retrieved from the lower level (Bl. Y) Miss rate 1 - (Hit Rate) Miss penalty time to replace a block in the upper level + time to retrieve the block from the lower level Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks) 26/01/2004 UAH- 7 Levels of the Memory Hierarchy Capacity Access Time Cost CPU Registers 100s Bytes <1s ns Cache 10s-100s K Bytes 1-10 ns $10/ MByte Main Memory M Bytes 100ns- 300ns $1/ MByte Disk 10s G Bytes, 10 ms (10,000,000 ns) $0.0031/ MByte Registers Cache Memory Disk Instr. Operands Blocks Pages Files Staging Xfer Unit prog./compiler 1-8 bytes cache cntl 8-128 bytes OS 512-4K bytes user/operator Mbytes Upper Level faster Tape infinite Larger sec-min Tape $0.0014/ 26/01/2004 MByte Lower Level UAH- 8 4

Four Questions for Memory Heir. Q#1 Where can a block be placed in the upper level? Block placement direct-mapped, fully associative, set-associative Q#2 How is a block found if it is in the upper level? Block identification Q#3 Which block should be replaced on a miss? Block replacement Random, LRU (Least Recently Used) Q#4 What happens on a write? Write strategy Write-through vs. write-back Write allocate vs. No-write allocate 26/01/2004 UAH- 9 Direct-Mapped Cache In a direct-mapped cache, each memory address is associated with one possible block within the cache Therefore, we only need to look in a single location in the cache for the data if it exists in the cache Block is the unit of transfer between cache and memory 26/01/2004 UAH- 10 5

Q1 Where can a block be placed in the upper level? Block 12 placed in 8 block cache Fully associative, direct mapped, 2-way set associative S.A. Mapping = Block Number Modulo Number Sets Full Mapped Direct Mapped (12 mod 8) = 4 2-Way Assoc (12 mod 4) = 0 01234567 01234567 00112233 Cache 1111111111222222222233 01234567890123456789012345678901 Memory 26/01/2004 UAH- 11 Direct-Mapped Cache (cont d) Memory Address 0 1 2 3 4 5 6 7 8 9 A B C D E F Memory Cache Index 0 1 2 3 Cache (4 byte) 26/01/2004 UAH- 12 6

Direct-Mapped Cache (cont d) Since multiple memory addresses map to same cache index, how do we tell which one is in there? What if we have a block size > 1 byte? Result divide memory address into three fields Block Address tttttttttttttttttt iiiiiiiiii oooo TAG to check if have the correct block INDEX to select block OFFSET to select byte within the block 26/01/2004 UAH- 13 Direct-Mapped Cache Terminology INDEX specifies the cache index (which row of the cache we should look in) OFFSET once we have found correct block, specifies which byte within the block we want TAG the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location BLOCK ADDRESS TAG + INDEX 26/01/2004 UAH- 14 7

Direct-Mapped Cache Example Conditions 32-bit architecture (word=32bits), address unit is byte 8KB direct-mapped cache with 4 words blocks Determine the size of the Tag, Index, and Offset fields OFFSET (specifies correct byte within block) cache block contains 4 words = 16 (2 4 ) bytes 4 bits INDEX (specifies correct row in the cache) cache size is 8KB=2 13 bytes, cache block is 2 4 bytes #Rows in cache (1 block = 1 row) 2 13 /2 4 = 2 9 9 bits TAG Memory address length - offset - index = 32-4 - 9 = 19 tag is leftmost 19 bits 26/01/2004 UAH- 15 1 KB Direct Mapped Cache, 32B blocks For a 2 ** N byte cache The uppermost (32 - N) bits are always the Cache Tag The lowest M bits are the Byte Select (Block Size = 2 ** M) 31 Cache Tag Example 0x50 Stored as part of the cache state 9 Cache Index Ex 0x01 4 Byte Select Ex 0x00 0 Valid Bit Cache Tag 0x50 Cache Data Byte 31 Byte 63 Byte 1 Byte 33 Byte 0 Byte 32 0 1 2 3 Byte 1023 Byte 992 31 26/01/2004 UAH- 16 8

Two-way Set Associative Cache N-way set associative N entries for each Cache Index N direct mapped caches operates in parallel (N typically 2 to 4) Example Two-way set associative cache Cache Index selects a set from the cache The two tags in the set are compared in parallel Data is selected based on the tag result Valid Cache Tag Cache Data Cache Block 0 Cache Index Cache Data Cache Block 0 Cache Tag Valid Adr Tag Compare 1 Sel1 Mux 0 Sel0 Compare OR 26/01/2004 Hit UAH- Cache Block 17 Disadvantage of Set Associative Cache N-way Set Associative Cache v. Direct Mapped Cache N comparators vs. 1 Extra MUX delay for the data Data comes AFTER Hit/Miss In a direct mapped cache, Cache Block is available BEFORE Hit/Miss Possible to assume a hit and continue. Recover later if miss. Valid Cache Tag Cache Data Cache Block 0 Cache Index Cache Data Cache Block 0 Cache Tag Valid Adr Tag Compare 1 Sel1 Mux 0 Sel0 Compare OR 26/01/2004 Hit UAH- Cache Block 18 9

Q2 How is a block found if it is in the upper level? Tag on each block No need to check index or block offset Increasing associativity shrinks index, expands tag Block Address Tag Index Block Offset 26/01/2004 UAH- 19 Q3 Which block should be replaced on a miss? Easy for Direct Mapped Set Associative or Fully Associative Random LRU (Least Recently Used) Assoc 2-way 4-way 8-way Size LRU Ran LRU Ran LRU Ran 16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0% 64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5% 256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12% 26/01/2004 UAH- 20 10

Q4 What happens on a write? Write through The information is written to both the block in the cache and to the block in the lower-level memory. Write back The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. is block clean or dirty? Pros and Cons of each? WT read misses cannot result in writes WB no repeated writes to same location WT always combined with write buffers so that don t wait for lower level memory 26/01/2004 UAH- 21 Write stall in write through caches When the CPU must wait for writes to complete during write through, the CPU is said to write stall Common optimization => Write buffer which allows the processor to continue as soon as the data is written to the buffer, thereby overlapping processor execution with memory updating However, write stalls can occur even with write buffer (when buffer is full) 26/01/2004 UAH- 22 11

Write Buffer for Write Through Processor Cache DR Write Buffer A Write Buffer is needed between the Cache and Memory Processor writes data into the cache and the write buffer Memory controller write contents of the buffer to memory Write buffer is just a FIFO Typical number of entries 4 Works fine if Store frequency (w.r.t. time) << 1 / DR write cycle Memory system designer s nightmare Store frequency (w.r.t. time) -> 1 / DR write cycle Write buffer saturation 26/01/2004 UAH- 23 What to do on a write-miss? Write allocate (or fetch on write) The block is loaded on a write-miss, followed by the write-hit actions No-write allocate (or write around) The block is modified in the memory and not loaded into the cache Although either write-miss policy can be used with write through or write back, write back caches generally use write allocate and write through often use no-write allocate 26/01/2004 UAH- 24 12

An Example The Alpha 21264 Data Cache (64KB, 64-byte blocks, 2w) Offset <29> <9> <6> Tag Index Valid<1> Tag<29> <44> - physical address Data<512> CPU Address Data in Data out...... =? 81 Mux =? 81 Mux 21 M U X Write buffer...... Lower level memory 26/01/2004 UAH- 25 Cache Performance Hit Time = time to find and retrieve data from current level cache Miss Penalty = average time to retrieve data on a current level miss (includes the possibility of misses on successive levels of memory hierarchy) Hit Rate = % of requests that are found in current level cache Miss Rate = 1 - Hit Rate 26/01/2004 UAH- 26 13

Cache Performance (cont d) Average memory access time (AT) AT = Hit time + Miss Rate Miss Penalty = % instructions ( Hit time + % data ( Hit time Data Inst + Miss + Miss Rate Data Rate inst Miss Miss Penalty Penalty Data ) Inst ) 26/01/2004 UAH- 27 An Example vs. Separate I&D Proc Cache-L1 Cache-L2 I-Cache-L1 Proc Cache-L2 D-Cache-L1 Compare 2 design alternatives (ignore L2 caches)? 16KB I&D Inst misses=3.82 /1K, Data miss rate=40.9 /1K 32KB unified misses = 43.3 misses/1k Assumptions ld/st frequency is 36% 74% accesses from instructions (1.0/1.36) hit time = 1clock cycle, miss penalty = 100 clock cycles Data hit has 1 stall for unified cache (only one port) 26/01/2004 UAH- 28 14

vs. Separate I&D (cont d) Proc Cache-L1 Cache-L2 I-Cache-L1 Proc Cache-L2 D-Cache-L1 Miss rate (L1I) = (# L1I misses) / (IC) #L1I misses = (L1I misses per 1k) * (IC /1000) Miss rate (L1I) = 3.82/1000 = 0.0038 Miss rate (L1D) = (# L1D misses) / (# Mem. Refs) #L1D misses = (L1D misses per 1k) * (IC /1000) Miss rate (L1D) = 40.9 * (IC/1000) / (0.36*IC) = 0.1136 Miss rate (L1U) = (# L1U misses) / (IC + Mem. Refs) #L1U misses = (L1U misses per 1k) * (IC /1000) Miss rate (L1U) = 43.3*(IC / 1000) / (1.36 * IC) = 0.0318 26/01/2004 UAH- 29 vs. Separate I&D (cont d) Proc Cache-L1 Cache-L2 I-Cache-L1 Proc Cache-L2 D-Cache-L1 AT (split) = (% instr.) * (hit time + L1I miss rate * Miss Pen.) + (% data) * (hit time + L1D miss rate * Miss Pen.) =.74(1 +.0038*100) +.26(1+.1136*100) = 4.2348 clock cycles AT (unif.) = (% instr.) * (hit time + L1Umiss rate * Miss Pen.) + (% data) * (hit time + L1U miss rate * Miss Pen.) =.74(1 +.0318*100) +.26(1 + 1 +.0318*100) = 4.44 clock cycles 26/01/2004 UAH- 30 15

AT and Processor Performance Miss-oriented Approach to Memory Access CPI Exec includes ALU and Memory instructions IC CPI CPU time = Ł Exec MemAccess + MissRate MissPenalty Inst ł Clock rate CPU time = IC CPI Ł Exec MemMisses + MissPenalty Inst ł Clock rate 26/01/2004 UAH- 31 AT and Processor Performance (cont d) Separating out Memory component entirely AT = Average Memory Access Time CPI ALUOps does not include memory instructions ALUops MemAccess IC CPI ALUops + AT Inst Inst CPU time = Ł ł Clock rate AT = Hit time + Miss Rate Miss Penalty = % instructions ( Hit time + % data ( Hit time Data Inst + Miss Rate + Miss Rate Data Data 26/01/2004 UAH- 32 inst Miss Penalty Miss Penalty ) Inst ) 16

Summary Caches The Principle of Locality Program access a relatively small portion of the address space at any instant of time. Temporal Locality Locality in Time Spatial Locality Locality in Space Three Major Categories of Cache Misses Compulsory Misses sad facts of life. Example cold start misses. Capacity Misses increase cache size Conflict Misses increase cache size and/or associativity Write Policy Write Through needs a write buffer. Write Back control can be complex Today CPU time is a function of (ops, cache misses) vs. just f(ops) What does this mean to Compilers, Data structures, Algorithms? 26/01/2004 UAH- 33 Summary The Cache Design Space Several interacting dimensions cache size block size associativity replacement policy write-through vs write-back The optimal choice is a compromise depends on access characteristics workload use (I-cache, D-cache, TLB) depends on technology / cost Simplicity often wins Cache Size Bad Good Factor A Less Associativity Block Size Factor B More 26/01/2004 UAH- 34 17