Slide Set 5 for ENCM 501 in Winter Term, 2017
Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary
Winter Term, 2017

ENCM 501 W17 Lectures: Slide Set 5 slide 2/37 Contents: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 3/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 4/37 Review: Example computer with only one level of cache and no virtual memory. [Block diagram: CORE, L1 I-CACHE, L1 D-CACHE, DRAM CONTROLLER, DRAM MODULES.] We're looking at this simple system because it helps us to think about cache design and performance issues while avoiding the complexity of real systems like the Intel i7 shown in textbook Figure 2.21.

ENCM 501 W17 Lectures: Slide Set 5 slide 5/37 Review: Example data cache organization. [Diagram: an 8-bit index feeds an 8-to-256 decoder that selects one of 256 sets (set 0 through set 255); there are two ways, way 0 and way 1; block status: 1 valid bit and 1 dirty bit per block; tag: one 18-bit stored tag per block; data: one 64-byte (512-bit) data block; set status: 1 LRU bit per set.] This could fit within our simple example hierarchy, but is also not much different from some current L1 D-cache designs.
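
As a quick sanity check on those numbers, here is a small C sketch, not taken from the slides, that splits an address into tag, index, and block offset for this organization. It assumes 32-bit physical addresses, which is consistent with 18 tag bits + 8 index bits + 6 offset bits; the example address is the one used in the index-conflict scenario later in this slide set.

#include <stdio.h>

int main(void) {
    unsigned addr   = 0x0804fff0u;            /* example address from a later slide */
    unsigned offset =  addr        & 0x3fu;   /* bits  5..0:  6-bit block offset    */
    unsigned index  = (addr >> 6)  & 0xffu;   /* bits 13..6:  8-bit index           */
    unsigned tag    =  addr >> 14;            /* bits 31..14: 18-bit tag            */
    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}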

ENCM 501 W17 Lectures: Slide Set 5 slide 6/37 Valid bits in caches going 1 → 0. I hope it's obvious why V bits for blocks go 0 → 1. But why might V bits go 1 → 0? In other words, why does it sometimes make sense to invalidate one or more cache blocks? Here are two big reasons. (There are likely some other good reasons.) DMA: direct memory access. Instruction writes by O/S kernels and by programs that write their own instructions. Let's make some notes about each of these reasons.

ENCM 501 W17 Lectures: Slide Set 5 slide 7/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 8/37 Storage cells in the example cache. Data blocks are implemented with SRAM cells. The tradeoffs in cell design relate to speed, chip area, and energy use per read or write. Tags and status bits might be SRAM cells or might be CAM ("content addressable memory") cells. A CMOS CAM cell uses the same 6-transistor structure as an SRAM cell for reads, writes, and holding a stored bit for long periods of time during which there are neither reads nor writes. A CAM cell also has 3 or 4 extra transistors that help in determining whether the bit pattern in a group of CAM cells (e.g., a stored tag) matches some other bit pattern (e.g., a search tag).

ENCM 501 W17 Lectures: Slide Set 5 slide 9/37 CAM cells organized to make a J-bit stored tag. [Diagram: a row of J CAM cells sharing wordline WL_i and matchline MATCH_i, with complementary bitline pairs BL_{J-1}, BL_{J-2}, ..., BL_0.] For reads or writes, the wordline and bitlines play the same roles they do in a row of an SRAM array. To check for a match, the wordline is held LOW, and the search tag is applied to the bitlines. If every search tag bit matches the corresponding stored tag bit, the matchline stays HIGH; if there is even a single mismatch, the matchline goes LOW.

ENCM 501 W17 Lectures: Slide Set 5 slide 10/37 CAM cells versus SRAM cells for stored tags With CAM cells, tag comparisons can be done in place. With SRAM cells, stored tags would have to be read via bitlines to comparator circuits outside the tag array, which is a slower process. (Schematics of caches in the textbook tend to show tag comparison done outside of tag arrays, but that is likely done to show that a comparison is needed, not to indicate physical design.) CAM cells are larger than SRAM cells. But the total area needed for CAM-cell tag arrays will still be much smaller than the total area needed for SRAM data blocks. Would it make sense to use CAM cells for V (valid) bits?

ENCM 501 W17 Lectures: Slide Set 5 slide 11/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 12/37 Cache line is a synonym for cache block. Hennessy and Patterson are fairly consistent in their use of the term "cache block", and a lot of other literature uses that term as well. However, the term "cache line", which means the same thing, is also in wide use. So in other literature, you may read things like... In a 4-way set-associative cache, an index finds a set containing 4 cache lines. In a direct-mapped cache, there is one cache line per index. A cache miss, even if it is for access to a single byte, will result in the transfer of an entire cache line.

ENCM 501 W17 Lectures: Slide Set 5 slide 13/37 Direct-mapped caches. A direct-mapped cache can be thought of as a special case of a set-associative cache, in which there is only one way. For a given cache capacity, a direct-mapped cache is easier to build than an N-way set-associative cache with N ≥ 2: no logic is required to find the correct way for data transfer after a hit is detected; no logic is needed to decide which block in a set to replace in handling a miss. Direct-mapped caches may also be faster and more energy-efficient. However, direct-mapped caches are vulnerable to index conflicts (sometimes called index collisions).

ENCM 501 W17 Lectures: Slide Set 5 slide 14/37 Let's think about changing the design of our example 32 KB 2-way set-associative data cache to a direct-mapped organization. If the capacity stays at 32 KB and the block size remains at 64 bytes, how does the change in organization affect the number of sets, the widths of indexes and tags, and the number of memory cells needed?
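
One way to work through the set and width arithmetic, as a rough sketch rather than an official answer, is shown below. It assumes 32-bit addresses, which matches the 18-bit tags of the 2-way design (18 + 8 + 6 = 32). Changing ways back to 2 recovers the original numbers (256 sets, 8-bit index, 18-bit tag); the memory-cell count is left as the exercise asks.

#include <stdio.h>

/* x is assumed to be a power of 2 */
static unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned capacity = 32 * 1024, block_size = 64, ways = 1, addr_bits = 32;
    unsigned sets        = capacity / (block_size * ways);       /* 512     */
    unsigned offset_bits = log2u(block_size);                    /* 6 bits  */
    unsigned index_bits  = log2u(sets);                          /* 9 bits  */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits; /* 17 bits */
    printf("sets = %u, index = %u bits, tag = %u bits\n",
           sets, index_bits, tag_bits);
    return 0;
}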

ENCM 501 W17 Lectures: Slide Set 5 slide 15/37 Data cache index conflict example. Consider this sketch of a C function:

int g1, g2, g3, g4;

void func(int *x, int n) {
    int loc[10], k;
    while ( condition ) {
        /* make accesses to g1, g2, g3, g4 and loc */
    }
}

What will happen in the following scenario? The addresses of g1 to g4 are 0x0804 fff0 to 0x0804 fffc; the address of loc[0] is 0xbfbd fff0.
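
To see why this scenario is painful, here is a rough check (not the slide's worked answer) of which set each address maps to in a 32 KB direct-mapped cache with 64-byte blocks, so 512 sets and a 9-bit index taken from bits 14..6 of the address:

#include <stdio.h>

static unsigned dm_index(unsigned addr) {
    return (addr >> 6) & 0x1ffu;   /* 9-bit index: 512 sets of 64-byte blocks */
}

int main(void) {
    unsigned g1   = 0x0804fff0u;   /* g1; g2..g4 sit in the same 64-byte block */
    unsigned loc0 = 0xbfbdfff0u;   /* loc[0]; loc[0..3] share this block       */
    printf("index for the g1..g4 block:    %u\n", dm_index(g1));
    printf("index for the loc[0..3] block: %u\n", dm_index(loc0));
    /* Both print 511, so in a direct-mapped cache each access to one of these
       blocks evicts the other, and the loop can miss on nearly every access to
       these variables. In the 2-way example cache, both blocks map to set 255
       but can live in the two ways at the same time. */
    return 0;
}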

ENCM 501 W17 Lectures: Slide Set 5 slide 16/37 Instruction cache index conflict example. Suppose a program spends much of its time in a loop within function f...

void f(double *x, double *y, int n) {
    int k;
    for (k = 0; k < n; k++)
        y[k] = g(x[k]) + h(x[k]);
}

Suppose that g and h are small, simple functions that don't call other functions. What kind of bad luck could cause huge numbers of misses in a direct-mapped instruction cache?

ENCM 501 W17 Lectures: Slide Set 5 slide 17/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 18/37 Motivation for set-associative caches (1) Qualitatively: In our example of data cache index conflicts, the conflicts go away if the cache is changed from direct-mapped to 2-way set-associative. In our example of instruction cache index conflicts, in the worst case, the conflicts go away if the cache is changed from direct-mapped to 4-way set-associative. Quantitatively, see Figure B.8 on page B-24 of the textbook. Conflict misses are a big problem in direct-mapped caches. Moving from direct-mapped to 2-way to 4-way to 8-way reduces the conflict miss rate at each step.

ENCM 501 W17 Lectures: Slide Set 5 slide 19/37 Motivation for set-associative caches (2) Detailed studies show that 4-way set-associativity is good enough to eliminate almost all conflict misses. But many practical cache designs are 8- or even 16-way set-associative. There must be reasons for this other than the desire to avoid conflict misses. We'll come back to this question later.

ENCM 501 W17 Lectures: Slide Set 5 slide 20/37 Replacement strategies in set-associative caches. Let N be the number of ways. With N = 2, LRU replacement is easy to implement: a single bit in each set can track which block should be replaced on a miss in that set. Exact LRU replacement is harder to implement with N > 2. LRU status bits would have to somehow encode a list of least-to-most-recent accesses within a set. However, choice of replacement strategy among various reasonable options seems to have very little effect on miss rate; see Figure B.4 on page B-10 of the textbook. So we're not going to study cache block replacement strategy in detail in ENCM 501.
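
For concreteness, here is a small software illustration (not a description of any real circuit) of how a single LRU bit per set might be maintained in a 2-way set-associative cache. The 256-set size matches the example cache from earlier in this slide set; everything else is an assumption made for the sketch.

#include <stdio.h>
#include <stdint.h>

#define NUM_SETS 256

static uint8_t lru_bit[NUM_SETS];   /* per set: which way to replace on a miss */

/* On a hit to 'way' in 'set', the other way becomes the least recently used. */
static void update_on_hit(unsigned set, unsigned way) {
    lru_bit[set] = (uint8_t)(way ^ 1u);
}

/* On a miss in 'set', replace the LRU way; the refilled way is then MRU. */
static unsigned choose_victim(unsigned set) {
    unsigned victim = lru_bit[set];
    lru_bit[set] = (uint8_t)(victim ^ 1u);
    return victim;
}

int main(void) {
    update_on_hit(0, 0);   /* hit in way 0 of set 0, so way 1 is now LRU */
    printf("victim on next miss in set 0: way %u\n", choose_victim(0));   /* 1 */
    return 0;
}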

ENCM 501 W17 Lectures: Slide Set 5 slide 21/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 22/37 Fully-associative caches A fully-associative cache can be thought of as an N-way set-associative cache in which N is equal to the number of blocks. In this way of thinking, how many sets are there in a fully-associative cache? What is the width of an index? For energy use, how would hit detection in a fully-associative cache compare with hit detection in a direct-mapped or (small N) set-associative cache with the same capacity?

ENCM 501 W17 Lectures: Slide Set 5 slide 23/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 24/37 Different options for handling writes. See "Q4: What Happens on a Write?" on pages B-10 to B-12 of the textbook for details. There is not much I can put into lecture slides to improve on the clarity of that material. Note that the cache access examples presented at the end of Slide Set 4 assume a write-back policy, using write allocate in the case of a write miss.

ENCM 501 W17 Lectures: Slide Set 5 slide 25/37 Write buffers: Location and purpose. A question that could arise from recent lecture material: In a system with one level of cache, is a write buffer part of the L1 D-cache or part of the DRAM controller? Partial answer: Neither the textbook nor Web search results are perfectly clear about this. But it's probably simpler to think of the write buffer as part of the L1 D-cache. Regardless, it's more important to understand the purpose of a write buffer: to securely hold pending main-memory updates while minimizing the need for the processor core to stall.

ENCM 501 W17 Lectures: Slide Set 5 slide 26/37 Write buffers are hardware queues. In general, a write buffer can be thought of as a collection of write buffer entries, in which each entry is a triple: (address, data size, data). The key idea is that a write buffer holds multiple pending writes, in case there is a flurry of cache misses, each causing replacement of a dirty block in a write-back cache; a simple flurry of writes to a write-through cache; or some other situation that generates writes to main memory faster than main memory can receive them. A processor should only stall on a write if a write buffer is full (has as many pending writes as it can hold), which should be an unusual event.
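
As an illustration of the queue idea, here is a rough software model of a write buffer holding (address, data size, data) entries. It is not from the slides; the 8-entry depth and 64-bit data field are arbitrary choices for the sketch, and a real entry would more likely hold a whole block.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 8   /* illustration value, not from the slides */

struct wb_entry {
    uint64_t address;
    unsigned size;       /* bytes: e.g. 1, 2, 4, 8, or a whole 64-byte block */
    uint64_t data;
};

struct write_buffer {
    struct wb_entry entries[WB_DEPTH];
    unsigned head, tail, count;   /* simple ring-buffer bookkeeping */
};

/* Core side: queue a pending write; returns false (core must stall) only
   when the buffer already holds as many pending writes as it can. */
static bool wb_push(struct write_buffer *wb, struct wb_entry e) {
    if (wb->count == WB_DEPTH) return false;
    wb->entries[wb->tail] = e;
    wb->tail = (wb->tail + 1) % WB_DEPTH;
    wb->count++;
    return true;
}

/* Memory side: the DRAM controller drains the oldest pending write. */
static bool wb_pop(struct write_buffer *wb, struct wb_entry *out) {
    if (wb->count == 0) return false;
    *out = wb->entries[wb->head];
    wb->head = (wb->head + 1) % WB_DEPTH;
    wb->count--;
    return true;
}

int main(void) {
    struct write_buffer wb = {0};
    struct wb_entry e = { 0x0804fff0u, 4, 42 }, out;
    wb_push(&wb, e);
    if (wb_pop(&wb, &out))
        printf("drained a %u-byte write to 0x%llx\n",
               out.size, (unsigned long long)out.address);
    return 0;
}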

ENCM 501 W17 Lectures: Slide Set 5 slide 27/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 28/37 Multi-level caches A common configuration: A core has an L1 I-cache and an L1 D-cache, each with 32 KB capacity, and a unified L2 cache with 256 KB capacity. So there is a total of 320 KB of cache capacity for this core. Why is it not better to use the chip area for a 160 KB L1 I-cache and a 160 KB L1 D-cache? Give two reasons, one related to SRAM array design issues, and another related to memory use patterns of software programs. The same ideas apply to systems with 3 levels of cache. Other complicated design concerns come up when multiple cores share access to a single L2 or L3 cache.

ENCM 501 W17 Lectures: Slide Set 5 slide 29/37 AMAT for a 2-level cache system. A formula from the textbook: AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty). What is AMAT if there are no cache misses at all? What is AMAT if there are some L1 misses but no L2 misses? What are design criteria for the L1 and L2 caches, given that the goal is to minimize AMAT?

ENCM 501 W17 Lectures: Slide Set 5 slide 30/37 This definition (not from the textbook!) is incorrect: For a system with two levels of caches, the L2 hit rate of a program is the number of L2 hits divided by the total number of memory accesses. What is a correct definition for L2 hit rate, compatible with the formula for AMAT?

ENCM 501 W17 Lectures: Slide Set 5 slide 31/37 L2 cache design tradeoffs An L1 cache must keep up with a processor core. That is a challenge to a circuit design team but keeps the problem simple: If a design is too slow, it fails. For an L2 cache, the tradeoffs are more complex: Increasing capacity improves L2 miss rate but makes L2 hit time, chip area and (probably) energy use worse. Decreasing capacity improves L2 hit time, chip area, and (probably) energy use, but makes L2 miss rate worse. Suppose L1 hit time = 1 cycle, L1 miss rate = 0.020, L2 miss penalty = 100 cycles. Which is better, considering AMAT only, not chip area or energy? (a) L2 hit time = 10 cycles, L2 miss rate = 0.50 (b) L2 hit time = 12 cycles, L2 miss rate = 0.40
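
As a sketch of how the comparison can be checked, the snippet below simply plugs the numbers above into the AMAT formula; the arithmetic is a check of the exercise, not an official answer from the slides.

#include <stdio.h>

static double amat(double l1_hit, double l1_miss_rate,
                   double l2_hit, double l2_miss_rate, double l2_miss_penalty) {
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_miss_penalty);
}

int main(void) {
    double a = amat(1.0, 0.020, 10.0, 0.50, 100.0);   /* option (a) */
    double b = amat(1.0, 0.020, 12.0, 0.40, 100.0);   /* option (b) */
    printf("(a) AMAT = %.2f cycles\n", a);   /* 1 + 0.020 * (10 + 50) = 2.20 */
    printf("(b) AMAT = %.2f cycles\n", b);   /* 1 + 0.020 * (12 + 40) = 2.04 */
    return 0;
}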

ENCM 501 W17 Lectures: Slide Set 5 slide 32/37 3 C's of cache misses: compulsory, capacity, conflict. It's useful to think about the causes of cache misses. Compulsory misses (sometimes called "cold misses") happen on access to instructions or data that have never been in a cache. Examples include: instruction fetches in a program that has just been copied from disk to main memory; data reads of information that has just been copied from a disk controller or network interface to main memory. Compulsory misses would happen even if a cache had the same capacity as the main memory the cache was supposed to mirror.

ENCM 501 W17 Lectures: Slide Set 5 slide 33/37 Capacity misses This kind of miss arises because a cache is not big enough to contain all the instructions and/or data a program accesses while it runs. Capacity misses for a program can be counted by simulating a program run with a fully-associative cache of some fixed capacity. Since instruction and data blocks can be placed anywhere within a fully-associative cache, it s reasonable to assume that any miss on access to a previously accessed instruction or data item in a fully-associative cache occurs because the cache is not big enough. Why is this a good but not perfect approximation?
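
Here is a rough sketch, not from the slides or the textbook, of what such a simulation might look like: a tiny trace-driven model of a fully-associative cache with LRU replacement, counting misses for a sequence of block addresses. The 4-block capacity and the trace are made-up illustration values.

#include <stdio.h>

#define CAPACITY_BLOCKS 4   /* made-up capacity: a 4-block fully-associative cache */

static unsigned long blk_addr[CAPACITY_BLOCKS];
static unsigned long last_use[CAPACITY_BLOCKS];
static int valid[CAPACITY_BLOCKS];

/* Returns 1 on a hit, 0 on a miss; on a miss, an invalid or LRU block is replaced. */
static int access_block(unsigned long blk, unsigned long now) {
    for (int i = 0; i < CAPACITY_BLOCKS; i++) {
        if (valid[i] && blk_addr[i] == blk) {
            last_use[i] = now;
            return 1;
        }
    }
    int victim = 0;
    for (int i = 0; i < CAPACITY_BLOCKS; i++) {
        if (!valid[i]) { victim = i; break; }
        if (last_use[i] < last_use[victim]) victim = i;
    }
    blk_addr[victim] = blk;
    last_use[victim] = now;
    valid[victim] = 1;
    return 0;
}

int main(void) {
    /* Made-up trace of block numbers (e.g. address >> 6 for 64-byte blocks). */
    unsigned long trace[] = { 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 };
    unsigned long n = sizeof trace / sizeof trace[0], misses = 0;
    for (unsigned long t = 0; t < n; t++)
        misses += access_block(trace[t], t) ? 0 : 1;
    printf("%lu misses in %lu accesses\n", misses, n);
    return 0;
}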

ENCM 501 W17 Lectures: Slide Set 5 slide 34/37 Conflict misses. Conflict misses (also called "collision misses") occur in direct-mapped and N-way set-associative caches because too many accesses to memory generate a common index. In the absence of a 4th kind of miss (coherence misses, which can happen when multiple processors share access to a memory system) we can write: conflict misses = total misses - compulsory misses - capacity misses. The main idea behind increasing set-associativity is to reduce conflict misses without the time and energy problems of a fully-associative cache.

ENCM 501 W17 Lectures: Slide Set 5 slide 35/37 3 C's: Data from experiments. Textbook Figure B.8 has a lot of data; it's unreasonable to try to jam all of that data into a few lecture slides. So here's a subset of the data, for 8 KB capacity. N is the degree of associativity, and miss rates are in misses per thousand accesses.

  N   compulsory   capacity   conflict
  1      0.1          44         24
  2      0.1          44          5
  4      0.1          44       < 0.5
  8      0.1          44       < 0.5

This is real data from practical applications. It is worthwhile to study the table to see what general patterns emerge.

ENCM 501 W17 Lectures: Slide Set 5 slide 36/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 37/37 Caches and Virtual Memory Both are essential systems to support applications running on modern operating systems. As mentioned in Slide Set 4, it really helps to keep in mind what problems are solved by caches and what very different problems are solved by virtual memory. Caches are an impressive engineering workaround for difficult facts about relative latencies of memory arrays. Virtual memory (VM) is a concept, a great design idea, that solves a wide range of problems for computer systems in which multiple applications are sharing resources. Slide Set 6 will be about virtual memory and interactions between cache systems and virtual memory systems.