COSC3330 Computer Architecture Lecture 19. Cache


COSC3330 Computer Architecture, Lecture 19: Cache
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Cache Topics


Cache Hardware Cost

How many total bits are required for a direct-mapped data cache with 16KB of data and 4-word blocks, assuming a 32-bit address?

[Figure: the address from the CPU splits into a tag (18 bits, address bits 31-14), an index (10 bits, address bits 13-4), a block offset (2 bits) and a byte offset (2 bits); the index selects one of the 1024 cache blocks, each with valid (V) and dirty (D) bits]

#bits = #blocks x (block size + tag size + valid size + dirty size)
      = 1024 x (16B + 18 bits + 1 bit + 1 bit)
      = 1024 x 148 bits = 148 Kbits = 18.5 KB

That is about 15.6% larger than the storage for the data alone.

Cache Hardware Cost

The number of bits in a cache includes both the storage for data and the storage for tags. For a direct-mapped cache with 2^n blocks, n bits are used for the index. For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block, and 2 bits are used to address the byte within the word. The total number of bits in a direct-mapped cache is:

    I$: 2^n x (block size + tag size + valid size)
    D$: 2^n x (block size + tag size + valid size + dirty size)

where, for a 32-bit address, the tag size is 32 - (n + m + 2) bits.
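As a sanity check on these formulas, here is a minimal C sketch (the function and parameter names are mine, not from the slides) that reproduces the 16KB, 4-word-block example:

    #include <stdio.h>

    /* Total bits in a direct-mapped cache with 2^n blocks of 2^m 32-bit
     * words each, following the slide's formula. is_dcache adds the
     * dirty bit. */
    static unsigned long cache_bits(unsigned n, unsigned m, int is_dcache)
    {
        unsigned long blocks     = 1UL << n;
        unsigned long block_bits = (1UL << m) * 32;      /* 32-bit words   */
        unsigned      tag_bits   = 32 - (n + m + 2);     /* 32-bit address */
        unsigned      state_bits = 1 + (is_dcache ? 1 : 0); /* V (+ D)     */
        return blocks * (block_bits + tag_bits + state_bits);
    }

    int main(void)
    {
        /* 16KB of data, 4-word blocks -> 1024 blocks (n=10), m=2 */
        printf("%lu bits\n", cache_bits(10, 2, 1)); /* 151552 = 148 Kbits */
        return 0;
    }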

Measuring Cache Performance

Execution time is composed of CPU execution cycles and memory stall cycles. Assuming that a cache hit does not require any extra CPU cycle for execution (that is, the MA stage takes 1 cycle on a hit):

    CPU time (cycles) = #insts x CPI
                      = CPU clock cycles
                      = CPU execution clock cycles + Memory-stall clock cycles

Measuring Cache Performance

Memory stall cycles come from cache misses:

    Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
    Read-stall cycles  = (reads/program) x (read miss rate) x (read miss penalty)
    Write-stall cycles = (writes/program) x (write miss rate) x (write miss penalty)

For simplicity, assume that a read miss and a write miss incur the same miss penalty:

    Memory-stall clock cycles = (memory accesses/program) x (miss rate) x (miss penalty)

Impacts of Cache Performance

A processor with:
I-cache miss rate = 2%
D-cache miss rate = 4%
Loads & stores are 36% of instructions
Miss penalty = 100 cycles
Base CPI with an ideal cache = 2

What is the execution cycle count on this processor?

    CPU time (cycles) = CPU clock cycles
                      = CPU execution clock cycles + Memory-stall clock cycles
    Memory-stall clock cycles = (memory accesses/program) x (miss rate) x (miss penalty)

Assume that the number of instructions executed is I:

    CPU execution clock cycles = 2I
    Memory-stall clock cycles:
      I$: I x 0.02 x 100 = 2I
      D$: (I x 0.36) x 0.04 x 100 = 1.44I
    Total memory stall cycles = 2I + 1.44I = 3.44I
    Total execution cycles = 2I + 3.44I = 5.44I
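A minimal C sketch of the arithmetic above (variable names are mine, for illustration only):

    #include <stdio.h>

    int main(void)
    {
        /* Per-instruction cycle counts from the slide's example,
         * normalized to I = 1 instruction. */
        double base_cpi     = 2.0;
        double icache_stall = 0.02 * 100;         /* 2% miss rate x 100 cycles   */
        double dcache_stall = 0.36 * 0.04 * 100;  /* 36% mem insts, 4% miss rate */
        double total_cpi    = base_cpi + icache_stall + dcache_stall;

        printf("stall CPI = %.2f, total CPI = %.2f\n",
               icache_stall + dcache_stall, total_cpi);  /* 3.44 and 5.44 */
        return 0;
    }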

Impacts of Cache Performance (Cont)

How much faster would a processor with a perfect cache be?

    Execution cycles of the CPU with a perfect cache = 2I

Thus, the CPU with the perfect cache is 2.72x (= 5.44I / 2I) faster!

If you design a new computer with CPI = 1, how much faster is the system with the perfect cache?

    Total execution cycles = 1I + 3.44I = 4.44I

Therefore, the CPU with the perfect cache is 4.44x faster! The fraction of execution time spent on memory stalls has risen from 63% (= 3.44/5.44) to 77% (= 3.44/4.44).

Impacts of Cache Performance (Cont)

Thoughts:
A 2% instruction-cache miss rate with a 100-cycle miss penalty increases CPI by 2! Data cache misses also increase CPI dramatically.
As the CPU is made faster, the time spent on memory stalls will take up an increasing fraction of the execution time.

Average Memory Access Time (AMAT)

Hit time is also important for performance, and it is not always less than 1 CPU cycle; for example, the hit latency of the Core 2 Duo's L1 cache is 3 cycles! To capture the fact that the time to access data on both hits and misses affects performance, designers sometimes use average memory access time (AMAT), the average time to access memory considering both hits and misses:

    AMAT = Hit time + Miss rate x Miss penalty

Average Memory Access Time (AMAT)

    AMAT = Hit time + Miss rate x Miss penalty

Example:
CPU with 1ns clock (1GHz)
Hit time = 1 cycle
Miss penalty = 20 cycles
Cache miss rate = 5%

    AMAT = 1 + 0.05 x 20 = 2 cycles = 2ns per memory access
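In code form, a trivial sketch (the function name is mine):

    #include <stdio.h>

    /* Average memory access time, exactly as the slide defines it. */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* 1-cycle hit, 5% misses, 20-cycle penalty -> 2 cycles (2ns at 1GHz) */
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 20.0));
        return 0;
    }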

Improving Cache Performance

How can we increase cache performance?

Increase the cache size? A larger cache will have a longer access time, and an increase in hit time will likely add another stage to the pipeline. At some point, the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance.

Instead, we'll explore two different techniques, guided by:

    AMAT = Hit time + Miss rate x Miss penalty

Improving Cache Performance

We'll explore two different techniques:

    AMAT = Hit time + Miss rate x Miss penalty

The first technique reduces the miss rate by reducing the probability that different memory blocks will contend for the same cache location.
The second technique reduces the miss penalty by adding additional cache levels to the memory hierarchy.

Reducing Cache Miss Rates (#1)

Alternatives for more flexible block placement:
A direct-mapped cache maps a memory block to exactly one cache block.
A fully associative cache allows a memory block to be mapped to any cache block.
An n-way set-associative cache divides the cache into sets, each of which consists of n ways. A memory block is mapped to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices).

Associative Cache Example

[Figure: block placement in a 2-way set-associative cache (4 sets) versus a fully associative cache (only 1 set)]

Spectrum of Associativity

[Figure: the spectrum of associativity for a cache with 8 entries, from direct mapped to fully associative]

4-Way Set Associative Cache

16KB cache, 4-way set associative, 4B (one 32-bit word) per block. How many sets are there?

16KB / 4B = 4096 blocks; 4096 blocks / 4 ways = 1024 sets. The index is therefore 10 bits (address bits 11-2) and the tag is 20 bits (address bits 31-12), with a 2-bit byte offset.

[Figure: 4-way set-associative cache with 1024 sets; the 10-bit index selects a set, the 20-bit tag is compared against all four ways in parallel, and a 4-to-1 mux selects the 32-bit data from the hitting way]
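A quick sketch of that arithmetic (my own helper, not from the slides):

    #include <stdio.h>

    int main(void)
    {
        unsigned cache_bytes = 16 * 1024;  /* 16KB                       */
        unsigned block_bytes = 4;          /* one 32-bit word per block  */
        unsigned ways        = 4;

        unsigned blocks     = cache_bytes / block_bytes;  /* 4096 */
        unsigned sets       = blocks / ways;              /* 1024 */
        unsigned index_bits = 0;
        for (unsigned s = sets; s > 1; s >>= 1)
            index_bits++;                                 /* log2(1024) = 10 */
        unsigned tag_bits   = 32 - index_bits - 2;        /* 2-bit byte offset -> 20 */

        printf("sets=%u index=%u tag=%u\n", sets, index_bits, tag_bits);
        return 0;
    }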

Associative Caches

Fully associative:
Allows a given block to go in any cache entry
Requires all entries to be searched at once
One comparator per entry (expensive)

n-way set associative:
Each set contains n entries
The block address determines which set: (block number) modulo (#sets in cache)
All entries in a given set are searched at once
n comparators (less expensive)

Range of Set Associative Caches

For a fixed-size cache, each factor-of-2 increase in associativity doubles the number of blocks per set (the number of ways) and halves the number of sets. Example: increasing associativity from 4 ways to 8 ways decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.

The block address fields:
Tag: used for tag comparison
Index: selects the set
Block offset: selects the word in the block
Byte offset: selects the byte in the word

Decreasing associativity grows the index and shrinks the tag; a direct-mapped cache (only one way) has the smallest tags and needs only a single comparator. Increasing associativity does the opposite; a fully associative cache (only one set) has no index at all: the tag is all of the address bits except the block offset and byte offset.
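A sketch of how those fields fall out of an address (the function and field widths are mine, chosen to match one-word blocks; vary index_bits to watch the tag grow as associativity increases):

    #include <stdio.h>
    #include <stdint.h>

    /* Split a 32-bit address using the breakdown above:
     * tag | index | block offset | byte offset. */
    static void split(uint32_t addr, int index_bits, int block_off_bits)
    {
        uint32_t byte_off  = addr & 0x3;                             /* 2 bits */
        uint32_t block_off = (addr >> 2) & ((1u << block_off_bits) - 1);
        uint32_t index     = (addr >> (2 + block_off_bits))
                             & ((1u << index_bits) - 1);
        uint32_t tag       = addr >> (2 + block_off_bits + index_bits);
        printf("tag=0x%x index=%u block_off=%u byte_off=%u\n",
               tag, index, block_off, byte_off);
    }

    int main(void)
    {
        split(0x12345678, 10, 0); /* e.g. 4-way, 1024 sets, one-word blocks  */
        split(0x12345678, 9, 0);  /* 8-way: one less index bit, one more tag */
        return 0;
    }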

Associativity Example

Assume three different small caches, each with four one-word blocks, and a CPU that generates 8-bit addresses:
Direct mapped
2-way set associative
Fully associative

Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8. We assume the sequence consists entirely of memory reads.

Associativity Example

Direct mapped cache: the 8-bit address splits into a 4-bit tag (bits 7-4), a 2-bit index (bits 3-2), and a 2-bit byte offset (bits 1-0).

Block address sequence: 0, 8, 0, 6, 8

    Address              Block address      Cache index   Hit or miss?
    0b0000 0000 (0x00)   0b00 0000 (0x00)   0             miss
    0b0010 0000 (0x20)   0b00 1000 (0x08)   0             miss
    0b0000 0000 (0x00)   0b00 0000 (0x00)   0             miss
    0b0001 1000 (0x18)   0b00 0110 (0x06)   2             miss
    0b0010 0000 (0x20)   0b00 1000 (0x08)   0             miss

5 misses: blocks 0 and 8 both map to index 0 and keep evicting each other. The final contents are Mem[8] in index 0 and Mem[6] in index 2.

Associativity Example (Cont)

2-way set associative: two sets, so the 8-bit address splits into a 5-bit tag, a 1-bit index, and a 2-bit byte offset.

Block address sequence: 0, 8, 0, 6, 8

    Address              Block address      Cache index   Hit or miss?
    0b0000 0000 (0x00)   0b00 0000 (0x00)   0             miss
    0b0010 0000 (0x20)   0b00 1000 (0x08)   0             miss
    0b0000 0000 (0x00)   0b00 0000 (0x00)   0             hit
    0b0001 1000 (0x18)   0b00 0110 (0x06)   0             miss
    0b0010 0000 (0x20)   0b00 1000 (0x08)   0             miss

4 misses: all of these blocks map to set 0. With LRU replacement, block 6 evicts block 8, and block 8 then evicts block 0, leaving Mem[6] and Mem[8] in set 0.

Associativity Example (Cont)

Fully associative: a single set of four ways, so the 8-bit address splits into a 6-bit tag and a 2-bit byte offset (no index).

Block address sequence: 0, 8, 0, 6, 8

    Address              Block address      Cache index   Hit or miss?
    0b0000 0000 (0x00)   0b00 0000 (0x00)   0             miss
    0b0010 0000 (0x20)   0b00 1000 (0x08)   0             miss
    0b0000 0000 (0x00)   0b00 0000 (0x00)   0             hit
    0b0001 1000 (0x18)   0b00 0110 (0x06)   0             miss
    0b0010 0000 (0x20)   0b00 1000 (0x08)   0             hit

3 misses: any block can go in any of the four ways, so nothing is evicted. The final contents are Mem[0], Mem[8], and Mem[6].
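All three traces can be reproduced with a small LRU cache simulator. This sketch is mine, not code from the lecture; it models the four one-word blocks as 4x1 (direct mapped), 2x2 (2-way), and 1x4 (fully associative):

    #include <stdio.h>

    #define BLOCKS 4

    /* Count misses for a block-address trace on a cache with `sets` sets
     * of `ways` ways each (sets * ways == BLOCKS), LRU replacement. */
    static int count_misses(const int *trace, int n, int sets, int ways)
    {
        int tag[BLOCKS], age[BLOCKS], misses = 0;
        for (int i = 0; i < BLOCKS; i++) { tag[i] = -1; age[i] = 0; }

        for (int t = 0; t < n; t++) {
            int set = trace[t] % sets;
            int *wtag = &tag[set * ways], *wage = &age[set * ways];
            int hit = -1, victim = 0;
            for (int w = 0; w < ways; w++) {
                if (wtag[w] == trace[t]) hit = w;
                /* prefer an invalid way; otherwise take the oldest */
                if (wtag[w] == -1 ||
                    (wtag[victim] != -1 && wage[w] < wage[victim]))
                    victim = w;
            }
            if (hit < 0) { misses++; hit = victim; wtag[hit] = trace[t]; }
            wage[hit] = t;  /* mark most recently used */
        }
        return misses;
    }

    int main(void)
    {
        int trace[] = {0, 8, 0, 6, 8};
        printf("direct mapped:     %d misses\n", count_misses(trace, 5, 4, 1)); /* 5 */
        printf("2-way set assoc:   %d misses\n", count_misses(trace, 5, 2, 2)); /* 4 */
        printf("fully associative: %d misses\n", count_misses(trace, 5, 1, 4)); /* 3 */
        return 0;
    }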

Benefits of Set Associative Caches

The choice between direct mapped and set associative depends on the cost of a miss versus the cost of implementation.

[Figure: miss rate (%) versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB; from Hennessy & Patterson, Computer Architecture, 2003]

The largest gains are achieved when going from direct mapped to 2-way (more than a 20% reduction in miss rate).

Replacement Policy

Upon a miss, which way's block should be picked for replacement?
A direct-mapped cache has no choice.
A set-associative cache prefers a non-valid entry if there is one; otherwise, it chooses among the entries in the set.

Least-recently used (LRU): choose the block unused for the longest time. Hardware must keep track of when each way's block was used relative to the other blocks in the set. This is simple for 2-way (it takes one bit per set) and manageable for 4-way, but too hard beyond that.

Random: gives approximately the same performance as LRU for high-associativity caches.
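For the 2-way case, the one-bit-per-set scheme is tiny. A sketch with my own names (the set count of 1024 is an assumed example):

    #include <stdio.h>

    /* One LRU bit per set: records the way accessed most recently. */
    static unsigned char mru_bit[1024];

    static void touch(int set, int way) { mru_bit[set] = (unsigned char)way; }
    static int  victim(int set)         { return 1 - mru_bit[set]; } /* evict the other way */

    int main(void)
    {
        touch(5, 0);                          /* way 0 of set 5 used last */
        printf("evict way %d\n", victim(5));  /* way 1 */
        return 0;
    }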

Example Cache Structure with LRU

[Figure: a 2-way set-associative data cache, sets 0 through n-1; each set holds one LRU bit plus valid (V) and dirty (D) bits for each of the two ways]

Example Cache Structure with LRU

2 ways: one bit per set marks the way accessed most recently; evict the way the bit does not point to.

K-way set-associative LRU requires a full ordering of way accesses: hold a log2(k)-bit counter per way in each set. When way j is accessed:

    x = counter[j];
    counter[j] = k - 1;
    for (i = 0; i < k; i++)
        if (i != j && counter[i] > x)
            counter[i]--;

When replacement is needed, evict the way whose counter == 0.
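A runnable sketch of that counter scheme (the surrounding framing is mine; the update loop is the slide's):

    #include <stdio.h>

    #define K 4  /* 4-way set */

    static int counter[K] = {0, 1, 2, 3};  /* any permutation of 0..K-1 works */

    /* The slide's update rule: the accessed way becomes most recent (K-1);
     * every way that was more recent than it steps down by one. */
    static void access_way(int j)
    {
        int x = counter[j];
        counter[j] = K - 1;
        for (int i = 0; i < K; i++)
            if (i != j && counter[i] > x)
                counter[i]--;
    }

    static int lru_victim(void)
    {
        for (int i = 0; i < K; i++)
            if (counter[i] == 0)
                return i;   /* counter == 0 means least recently used */
        return -1;          /* unreachable: counters stay a permutation */
    }

    int main(void)
    {
        access_way(0); access_way(2); access_way(0); access_way(3);
        printf("evict way %d\n", lru_victim());  /* way 1: never accessed */
        return 0;
    }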