CS429: Computer Organization and Architecture


CS429 Slideset 19
Dr. Bill Young
Department of Computer Sciences
University of Texas at Austin
Last updated: April 5, 2018 at 13:55

Cache Vocabulary

Much of the cache organization described in these slidesets applies to the L1 and L2 caches, not to other caches such as the TLB, browser caches, etc.

Key terms: capacity, cache block/line, associativity, cache set, set index, tag, block offset, hit rate, miss rate, placement policy, replacement policy.

Organization of Cache Memory

A cache is an array of S = 2^s sets. Each set contains E >= 1 lines. Each line holds a block of data containing B = 2^b bytes, along with 1 valid bit and t tag bits. Cache size: C = B * E * S data bytes.

[Figure: Set 0 through Set S-1, each with E lines; each line has 1 valid bit, t tag bits, and a cache block of B = 2^b bytes.]
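To make these relationships concrete, here is a minimal sketch (the parameters below are illustrative assumptions, not from the slides) that derives the field widths and total capacity from S, E, and B:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Illustrative parameters: 64 sets, 2 lines/set, 32-byte blocks. */
        int S = 64, E = 2, B = 32;
        int m = 32;                 /* assume 32-bit addresses */

        int s = (int)log2(S);      /* set index bits */
        int b = (int)log2(B);      /* block offset bits */
        int t = m - s - b;         /* the remaining bits form the tag */

        printf("cache size C = B*E*S = %d bytes\n", B * E * S);
        printf("s = %d, b = %d, t = %d\n", s, b, t);
        return 0;
    }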

Addressing Caches

An m-bit address A is divided into three fields: t tag bits (high order), s set index bits, and b block offset bits (low order).

The word at address A is in the cache if the tag bits of A match the tag of one of the valid lines in set <set index>. The word's contents begin at offset <block offset> from the beginning of that line's block.

[Figure: address A split as [tag (t bits) | set index (s bits) | block offset (b bits)], bit m-1 down to bit 0, with the set index selecting one of Set 0 through Set S-1.]
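A simulator recovers these fields with shifts and masks. Here is a minimal sketch; the field widths (s = 6, b = 5) are illustrative assumptions:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed widths: 6 set-index bits, 5 block-offset bits. */
    #define S_BITS 6
    #define B_BITS 5

    int main(void) {
        uint32_t addr = 0x12345678;

        uint32_t offset = addr & ((1u << B_BITS) - 1);
        uint32_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1);
        uint32_t tag    = addr >> (B_BITS + S_BITS);

        printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n",
               addr, tag, set, offset);
        return 0;
    }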

Direct-Mapped Cache

This is the simplest kind of cache, characterized by exactly one line per set (i.e., E = 1).

[Figure: Set 0 through Set S-1, one line each.]

Accessing Direct-Mapped Caches

Set selection: use the set index bits of the address to determine the set of interest.

[Figure: an address with set index 00001 selects Set 1 from among Set 0 through Set S-1.]

Direct-Mapped Caches: Matching and Selection

Line matching: find a valid line in the selected set with a matching tag. Word selection: extract the word using the block offset.

1. The valid bit must be set.
2. The tag bits in the cache line must match the tag bits in the address.
3. If (1) and (2) hold, we have a cache hit, and the block offset selects the starting byte of the word.

[Figure: set i holds a valid line with tag 0110; bytes 4-7 of its 8-byte block hold the word w0 w1 w2 w3. The address has tag 0110, set index i, and block offset 100 (= 4), so the match succeeds and byte 4 starts the selected word.]
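These checks translate almost directly into code. The sketch below is illustrative only; the names (line_t, lookup) and the 8-byte block size are assumptions, not part of the slides:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical representation of one cache line. */
    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  block[8];          /* B = 8 bytes per block */
    } line_t;

    /* Direct-mapped: the selected set holds exactly one line. Returns a
       pointer to the requested byte on a hit, NULL on a miss. */
    uint8_t *lookup(line_t *line, uint32_t tag, uint32_t offset) {
        if (line->valid && line->tag == tag)    /* checks (1) and (2) */
            return &line->block[offset];        /* word selection via offset */
        return NULL;
    }

    int main(void) {
        line_t ln = { .valid = true, .tag = 0x6,
                      .block = {0, 1, 2, 3, 4, 5, 6, 7} };
        uint8_t *p = lookup(&ln, 0x6, 4);       /* tag 0110, offset 100 */
        printf(p ? "hit: byte value %d\n" : "miss\n", p ? *p : 0);
        return 0;
    }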

Bytes in the Line

Refer to the figure on the previous slide, and suppose that the set index is 4 bits. What are the characteristics of this cache?

1. How big are addresses on this machine?
2. How many sets are there in the cache?
3. How many lines per set?
4. How many bytes per line?
5. How big is the block offset?
6. Suppose i = 0101. What range of bytes is in this line?

Direct-Mapped Cache Simulation

Suppose M = 16 byte addresses (i.e., 4-bit addresses), B = 2 bytes/block, S = 4 sets, and E = 1 line/set, so an address splits as [tag (1 bit) | set index (2 bits) | block offset (1 bit)].

Address trace (reads):
1. 0  = [0000]  miss: set 0 loads M[0-1] with tag 0
2. 1  = [0001]  hit in set 0
3. 13 = [1101]  miss: set 2 loads M[12-13] with tag 1
4. 8  = [1000]  miss: set 0 evicts M[0-1] and loads M[8-9] with tag 1
5. 0  = [0000]  miss: set 0 evicts M[8-9] and reloads M[0-1] with tag 0
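The whole trace can be replayed in a few lines of C. This sketch hard-codes the configuration above (4-bit addresses, s = 2, b = 1); running it reproduces the slide, with only the second access hitting:

    #include <stdio.h>

    int main(void) {
        int valid[4] = {0}, tag[4] = {0};   /* S = 4 sets, E = 1 line/set */
        int trace[] = {0, 1, 13, 8, 0};     /* the read trace from the slide */

        for (int i = 0; i < 5; i++) {
            int a   = trace[i];
            int set = (a >> 1) & 0x3;       /* b = 1 offset bit, s = 2 set bits */
            int t   = a >> 3;               /* the remaining high bit is the tag */
            if (valid[set] && tag[set] == t) {
                printf("addr %2d [set %d]: hit\n", a, set);
            } else {
                printf("addr %2d [set %d]: miss\n", a, set);
                valid[set] = 1;             /* load the block on a miss */
                tag[set]   = t;
            }
        }
        return 0;
    }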

Set Associative Caches

These are characterized by more than one line per set (E > 1).

[Figure: Set 0 through Set S-1, each with E lines; each line has 1 valid bit, t tag bits, and a cache block of B = 2^b bytes.]

Accessing Set Associative Caches

Set selection is identical to that for a direct-mapped cache: the s set index bits select one set.

[Figure: an address with set index 00001 selects Set 1; each set contains two lines, each with a valid bit, a tag, and a cache block.]

Accessing Set Associative Caches

For line matching and word selection, we must compare the tag in each valid line in the selected set.

1. The valid bit must be set.
2. The tag bits in one of the cache lines must match the tag bits in the address.
3. If (1) and (2) hold, we have a cache hit, and the block offset selects the starting byte.

[Figure: set i holds two lines with tags 1001 and 0110; the address tag 0110 matches the second line, and block offset 100 selects byte 4 of its block.]

Why Use Middle Bits as Index?

Consider a 4-line cache with line indices 00, 01, 10, 11.

High-order bit indexing: adjacent memory blocks map to the same cache entry, making poor use of spatial locality.

Middle-order bit indexing: consecutive memory blocks map to different cache lines, so the cache can hold a larger contiguous region of the address space at one time.

[Figure: the 16 memory blocks 0000 through 1111 mapped onto the 4 cache lines under high-order and middle-order bit indexing.]
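One way to see the difference is to compute both mappings for every block number. This illustrative sketch uses the 4-line cache from the figure; since these are block numbers (offset bits already stripped), the low two bits play the role of the middle bits of a full address:

    #include <stdio.h>

    int main(void) {
        /* 4-line cache; 16 memory blocks with 4-bit block numbers. */
        for (int blk = 0; blk < 16; blk++) {
            int high = (blk >> 2) & 0x3;    /* high-order two bits as index */
            int mid  = blk & 0x3;           /* middle-order indexing        */
            printf("block %2d -> high-order line %d, middle-order line %d\n",
                   blk, high, mid);
        }
        return 0;
    }

The output shows blocks 0-3 all colliding on line 0 under high-order indexing, while middle-order indexing spreads them across all four lines.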

Fully Associative Caches

A fully associative cache is one that has only one set, so there are no set index bits in the address. Otherwise, accessing is the same as for a set associative cache. Such a cache tends to require more hardware, since the associative search must compare tags across a larger number of lines.

Cache Performance Metrics

Miss rate: the fraction of memory references not found in the cache (misses / references). Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit time: the time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache. Typical numbers: 1-3 clock cycles for L1; 5-12 clock cycles for L2.

Miss penalty: the additional time required because of a miss. Typically 100-300 cycles for main memory.

Memory System Performance

Average memory access time (AMAT):

    t_access = (1 - p_miss) * t_hit + p_miss * t_miss,   where t_miss = t_hit + t_penalty

Assume a 1-level cache, a 90% hit rate, a 1 cycle hit time, and a 200 cycle miss time (hit time plus miss penalty):

    t_access = (1 - 0.1) * 1 + 0.1 * 200 = 0.9 + 20 ≈ 21

AMAT is about 21 cycles, even though 90% of accesses take only one cycle. This shows the importance of a high hit rate.
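The arithmetic is easy to sanity-check. A minimal sketch using the slide's numbers:

    #include <stdio.h>

    int main(void) {
        double p_miss = 0.10;               /* 90% hit rate */
        double t_hit  = 1.0;                /* cycles */
        double t_miss = 200.0;              /* cycles; includes the hit time */

        double t_access = (1 - p_miss) * t_hit + p_miss * t_miss;
        printf("AMAT = %.1f cycles\n", t_access);   /* prints 20.9, i.e., ~21 */
        return 0;
    }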

Memory System Performance II

How does AMAT affect overall performance? Recall the CPI equation (pipeline efficiency):

    CPI = 1.0 + lp + mp + rp

The load/use penalty (lp) previously assumed a memory access of 1 cycle; indeed, we assumed all load instructions took 1 cycle. With AMAT ≈ 21 cycles instead:

    Cause       Name   Instr. Freq.   Cond. Freq.   Stalls   Product
    Load        lp     0.30           0.7           21       4.41
    Load/Use    lp     0.30           0.3           21 + 1   1.98
    Mispredict  mp     0.20           0.4           2        0.16
    Return      rp     0.02           1.0           3        0.06
    Total                                                    6.61

That is 6.61 stall cycles per instruction, for a CPI of 1.0 + 6.61 = 7.61. A more realistic AMAT (20+ cycles) really hurts CPI and overall performance.
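A small sketch that recomputes the table's products and the resulting CPI:

    #include <stdio.h>

    int main(void) {
        /* {instruction frequency, condition frequency, stall cycles} */
        double cause[4][3] = {
            {0.30, 0.7, 21.0},   /* load: AMAT ~ 21 cycles        */
            {0.30, 0.3, 22.0},   /* load/use: one extra bubble    */
            {0.20, 0.4,  2.0},   /* mispredicted branch           */
            {0.02, 1.0,  3.0},   /* return                        */
        };
        double stalls = 0.0;
        for (int i = 0; i < 4; i++)
            stalls += cause[i][0] * cause[i][1] * cause[i][2];
        printf("stalls/instr = %.2f, CPI = %.2f\n", stalls, 1.0 + stalls);
        return 0;
    }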

Memory System Performance and the Pipeline

Suppose your pipelined Y86 machine had a 1-level cache with a 90% hit rate, a 1 cycle hit time, and a 200 cycle miss time:

    t_access = (1 - 0.1) * 1 + 0.1 * 200 = 0.9 + 20 ≈ 21

Recall that the clock speed of the pipeline is constrained by the slowest stage (M). What does this mean for the pipelined Y86 if the slowest stage has such huge variability? What could you do?

Memory System Performance and the Pipeline II

What does this mean for the pipelined Y86 if the slowest stage has such huge variability? What could you do?

1. Run the clock 200 times more slowly to accommodate the longest memory access? Obviously not.
2. Stall the pipeline when you have a cache miss? There's really no other alternative.

Why is there no alternative? Reads from memory really can't come from anywhere else, and forwarding won't help: the value is not in any pipeline register.

Improving Memory System Performance

    t_access = (1 - p_miss) * t_hit + p_miss * t_miss,   where t_miss = t_hit + t_penalty

How can we reduce AMAT?
- Reduce the miss rate.
- Reduce the miss penalty.
- Reduce the hit time.

There have been numerous inventions targeting each of these. Which would matter most?

Issues with Writes

If you write to an item in cache, the cached value becomes inconsistent with the values stored at lower levels of the memory hierarchy. There are two main approaches to dealing with this:

- Write-through: immediately write the cache block to the next lowest level.
- Write-back: only write to lower levels when the block is evicted from the cache.

Write-through requires updating multiple levels of the memory hierarchy (causing bus traffic) on every write. Write-back reduces bus traffic, but requires that each cache line have a dirty bit, indicating that the line has been modified.

Write Strategies

How should we deal with write misses?

- Write-allocate: load the line from the next level and update the cache block.
- No-write-allocate: bypass the cache and update directly in the lower level of the memory hierarchy.

Write-through caches are typically no-write-allocate. Write-back caches are typically write-allocate. Why does this make sense?
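To see how the hit and miss policies combine, here is a hedged sketch of a single-line write path. All names (cache_write, write_next_level, load_line) are hypothetical, and the stubs stand in for the next level of the hierarchy:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid, dirty;   /* the dirty bit is only needed for write-back */
        uint32_t tag;
        uint8_t  data[32];
    } line_t;

    /* Stubs standing in for the lower level of the memory hierarchy. */
    static void write_next_level(uint32_t addr, uint8_t byte) {
        (void)addr; (void)byte;
    }
    static void load_line(line_t *ln, uint32_t tag) {
        /* A real cache would write back the evicted line first if dirty,
           then fetch the B bytes of the new block. */
        ln->valid = true;
        ln->tag   = tag;
    }

    void cache_write(line_t *ln, uint32_t addr, uint32_t tag,
                     uint32_t offset, uint8_t byte, bool write_back) {
        if (ln->valid && ln->tag == tag) {            /* write hit */
            ln->data[offset] = byte;
            if (write_back) ln->dirty = true;         /* defer the update */
            else            write_next_level(addr, byte);  /* write-through */
        } else if (write_back) {                      /* miss: write-allocate */
            load_line(ln, tag);
            ln->data[offset] = byte;
            ln->dirty = true;
        } else {                                      /* miss: no-write-allocate */
            write_next_level(addr, byte);             /* bypass the cache */
        }
    }

    int main(void) { return 0; }

This pairs write-back with write-allocate and write-through with no-write-allocate, matching the typical combinations above.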

Writing Cache Friendly Code

You can write code to improve the miss rate. Repeated references to variables are good (temporal locality); stride-1 reference patterns are good (spatial locality).

Examples: assume a cold cache, 4-byte words, and 4-word cache blocks.

    int sumarrayrows(int a[M][N]) {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 1/4 = 25%: the row-major traversal touches consecutive words, so only the first access to each 4-word block misses.

    int sumarraycols(int a[M][N]) {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 100%: the column-major traversal strides through memory N words at a time, so (for large enough N) every access falls in a different block.

Some Questions to Consider

- What happens when there is a miss and the cache has no free lines? What should we evict?
- What happens on a write miss?
- What if we have a multicore chip where cores share the L2 cache but have private L1 caches? What bad things could happen?

Concluding Observations

A programmer can optimize for cache performance:
- how data structures are organized;
- how data are accessed (e.g., nested loop structure).

All systems favor cache friendly code. Getting absolute optimum performance is very platform specific (cache sizes, line sizes, associativities, etc.), but you can get most of the advantage with generic code:
- keep the working set reasonably small (temporal locality);
- use small strides (spatial locality).