CMPE110 Computer Architecture, Winter 2009, Andrea Di Blas

Lecture 13: Caches

Topics: cache, direct-mapped cache, reads and writes, cache associativity, cache and performance. Textbook readings: Third Edition, 7.1 to 7.3; Fourth Edition, 5.1, 5.2, 5.3.

Caches: Memory hierarchy

A memory access walks down the hierarchy until the data is found (relative speeds in parentheses):
- registers (1): data not in registers? look below
- on-chip cache (1-2): is it here? if not, go down
- off-chip cache (2-5): is it here? if not, go down
- main memory, the real address space, part of the virtual address space (10-20): is it here? if not, go down
- disk, the rest of the virtual address space, files, etc. (1000-100,000): long-term storage devices; get it from here

Caches: Memory location

[Figure: where each level physically sits: registers, L1 instruction cache, and L1 data cache inside the CPU; then the L2 cache, main memory, and disk.]

Caches: Basic concepts

- data locality: temporal locality and spatial locality
- block: amount of information transferred (in bytes or words)
- hit: the block is present
- miss: the block is not present
- hit rate: fraction of times a requested block is found
- hit time: time to fetch a block that is present
- miss rate: fraction of times a requested block is not present (miss rate = 100% - hit rate)
- miss penalty: time (in clock cycles) to fetch a block from the lower level

Caches: Cache mappings

Size of cache < size of main memory, so memory blocks must be mapped onto cache blocks:
- direct-mapped cache
- set-associative cache
- fully-associative cache

Direct-mapped caches

Each memory block is mapped to exactly one block in the cache.

[Figure: an 8-block cache with block addresses (cache indexes) 000 through 111, and memory block addresses mapping onto it.]

Direct-mapped caches: The cache index

Many different memory blocks map to a single cache block. Which block? Use the lower bits of the memory address to index the cache:

cache index = (memory block address) % (cache size in blocks)

Example 1: 32-block main memory, 8-block cache (we consider block addresses). The memory block address is 5 bits. To index the cache we need 3 bits: the lower 3 bits of the memory block address. For example, memory block 00110 maps to cache location 110, and memory block 10110 also maps to cache location 110.
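A minimal sketch of this index computation in C (the function name and the checks in main are mine, for illustration): with a power-of-two number of cache blocks, the modulo is just the lower bits of the address.

    #include <stdio.h>

    #define CACHE_BLOCKS 8   /* 8-block cache, as in Example 1 */

    /* cache index = (memory block address) % (cache size in blocks) */
    static unsigned cache_index(unsigned block_addr) {
        return block_addr % CACHE_BLOCKS;   /* same as block_addr & (CACHE_BLOCKS - 1) */
    }

    int main(void) {
        printf("%u\n", cache_index(0x06));  /* 00110 -> index 110 (6) */
        printf("%u\n", cache_index(0x16));  /* 10110 -> index 110 (6) too */
        return 0;
    }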

Direct-mapped caches

Example 2: 128-byte main memory, 8-block cache, 4-byte (= 1 word) cache block size (we consider byte addresses).
- 7-bit memory byte address (128 = 2^7)
- 3-bit cache (block) index (8 blocks = 2^3)
- 2 bits to address the byte within the block (4 bytes = 2^2)

The memory byte addresses 0000000, 0000001, 0000010, and 0000011 all map to the same cache block, 000.
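A sketch of the byte-address split for Example 2 (field widths from the slide; the helper names are mine):

    #include <stdio.h>

    #define OFFSET_BITS 2   /* 4-byte blocks */
    #define INDEX_BITS  3   /* 8 cache blocks */

    static unsigned byte_offset(unsigned a) { return a & ((1u << OFFSET_BITS) - 1); }
    static unsigned cache_idx(unsigned a)   { return (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    static unsigned tag_bits(unsigned a)    { return a >> (OFFSET_BITS + INDEX_BITS); }

    int main(void) {
        /* byte addresses 0000000..0000011 share tag 00 and index 000 */
        for (unsigned a = 0; a < 4; a++)
            printf("addr %u: tag %u, index %u, offset %u\n",
                   a, tag_bits(a), cache_idx(a), byte_offset(a));
        return 0;
    }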

Direct-mapped caches: The Tag field

Many different memory blocks map to a single cache block. How do we know which memory block is in the cache block? To each cache line we add a tag that contains the remaining part (upper bits) of the address.

Example 3: 32-block main memory, 8-block cache. Memory blocks 00000, 01000, 10000, and 11000 all map to the same cache block, 000; the 2-bit tag (here 00, 01, 10, or 11) tells them apart.

Direct-mapped caches: The Valid bit

The CPU performs many different tasks, and the memory contents change. How do we know if a cache block is "good"? To each cache line we add a valid bit to indicate whether the content of the block corresponds to what the CPU is actually looking for. For instance, after a reset all valid bits are cleared: no block contains useful information.

Direct-mapped caches: A 1-word-block, direct-mapped cache

[Figure: a 1024-line, 1-word-block, direct-mapped cache. The 32-bit memory byte address (bits 31..0) splits into a tag (bits 31..12), a 10-bit index (bits 11..2) selecting one of the 1024 lines, and a 2-bit byte offset (bits 1..0). Each line holds a valid bit (V), a TAG, and a DATA word; the stored tag is compared with the address tag and combined with the valid bit to produce the HIT signal.]

Worksheet: mem. address [b], cache line size [b], bits for index, cache data size [B], bits for tag, total cache size [b].
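A worked sketch of that worksheet for the 1024-line cache in the figure (my arithmetic, using the standard formulas):

    #include <stdio.h>

    int main(void) {
        int addr_bits   = 32;      /* memory address [b] */
        int lines       = 1024;
        int data_bits   = 32;      /* 1-word block */
        int offset_bits = 2;       /* byte within the word */
        int index_bits  = 10;      /* log2(1024) */
        int tag_bits    = addr_bits - index_bits - offset_bits;  /* 20 */
        int line_bits   = 1 + tag_bits + data_bits;              /* valid + tag + data = 53 */
        printf("cache line size: %d b\n", line_bits);
        printf("cache data size: %d B\n", lines * data_bits / 8); /* 4096 B */
        printf("total cache size: %d b\n", lines * line_bits);    /* 54272 b */
        return 0;
    }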

Direct-mapped caches: Cache trace with block address

32-block memory, 8-block cache; the memory address is a block address. Trace the sequence, marking each access Hit/Miss and updating INDEX, V, TAG, DATA for cache lines 000 through 111 (all initially invalid):

Address (dec): 22, 26, 22, 18, 26, 18, 26, 22
Address (bin): 10110, 11010, 10110, 10010, 11010, 10010, 11010, 10110
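A minimal direct-mapped trace simulator for this slide's parameters (a sketch; the hit/miss column it prints is computed by the code, not copied from the slide):

    #include <stdio.h>

    int main(void) {
        unsigned trace[] = {22, 26, 22, 18, 26, 18, 26, 22};
        int valid[8] = {0};
        unsigned tags[8] = {0};
        for (int i = 0; i < 8; i++) {
            unsigned a   = trace[i];
            unsigned idx = a & 7;    /* lower 3 bits of the block address */
            unsigned t   = a >> 3;   /* upper 2 bits */
            int hit = valid[idx] && tags[idx] == t;
            printf("%2u (index %u, tag %u): %s\n", a, idx, t, hit ? "hit" : "miss");
            valid[idx] = 1;          /* on a miss, the block is fetched into the line */
            tags[idx]  = t;
        }
        return 0;
    }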

Direct-mapped caches: Cache trace with byte address

256-byte memory, 32-byte cache, 4-byte cache block, memory byte addressing (8-bit byte address: 3-bit tag, 3-bit index, 2-bit byte offset). Trace the sequence, marking each access Hit/Miss and updating INDEX, V, TAG, DATA for cache lines 000 through 111:

Address (dec): 89, 232, 90, 8, 91, 92, 232, 7
Address (bin): 01011001, 11101000, 01011010, 00001000, 01011011, 01011100, 11101000, 00000111

Cache reads and writes

In our CPU, Instruction Memory and Data Memory are actually cache memories. On a memory access, hits are straightforward to handle. Misses are more complex:
- read misses
- write misses

Cache reads and writes: Read misses

For instructions:
- stall the CPU
- send the original PC to memory (current PC - 4) and wait
- write the cache entry (including tag and valid bit)
- restart the instruction

For data:
- stall the CPU
- send the address to memory and wait
- write the cache entry (including tag and valid bit)
- restart the instruction

Cache reads and writes: Write misses

What is a write miss? In a 1-word-block write-through cache, writes always hit: we do not need to know what was in the memory location, since the CPU is overwriting it anyway.

Problem: inconsistency. Solutions:
- write-through
- write-back

Cache reads and writes: Write-through

Every time, write both the cache and the memory: CPU -> CACHE -> MEMORY.
- simple
- slow (mitigated by a write buffer between cache and memory)

Cache reads and writes: Write-back

Write only the cache. Write the entire block back into the memory only when the block needs to be replaced (dirty bit).

[Figure: a read/write trace (R 22 hit; W 22; W 14; ...) against a write-back cache whose lines carry V (valid), T (tag), and D (dirty) bits. Writes update only the cached block and set its dirty bit; when a dirty block must be replaced, it is first flushed back to memory.]
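A sketch of the write-back bookkeeping for a 1-word-block, direct-mapped cache (a simplified model of the slide's trace, not its exact diagram; mem[] stands in for main memory):

    #include <stdio.h>

    #define LINES 8

    static unsigned mem[32];   /* 32-block main memory */
    static struct { int valid, dirty; unsigned tag, data; } cache[LINES];

    static void write_word(unsigned block_addr, unsigned value) {
        unsigned idx = block_addr % LINES, t = block_addr / LINES;
        if (cache[idx].valid && cache[idx].tag != t && cache[idx].dirty)
            mem[cache[idx].tag * LINES + idx] = cache[idx].data;  /* flush the dirty victim */
        cache[idx].valid = 1;
        cache[idx].tag   = t;
        cache[idx].data  = value;   /* write only the cache... */
        cache[idx].dirty = 1;       /* ...and mark it newer than memory */
    }

    int main(void) {
        write_word(22, 111);        /* W 22: miss, allocate, set dirty */
        write_word(14, 222);        /* W 14: same index (110), so block 22 is flushed first */
        printf("mem[22] = %u\n", mem[22]);  /* 111, written back on replacement */
        return 0;
    }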

Multi-word caches

Using cache blocks larger than one word takes advantage of spatial locality.

[Figure: 4-GB memory, 64-KB direct-mapped cache with 4-word (16-byte) data blocks. The memory byte address splits into a tag, an index, a word offset, and a byte offset; the index selects a line (valid bit, tag, 4-word cache data block), the tag comparison produces the hit signal, and the word offset selects the data word.]

Multi-word caches: Exercise

What is the total size in bytes of the cache in the previous slide?
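A worked answer under the figure's parameters (my arithmetic, not the slide's): 64 KB of data in 16-byte blocks is 4096 lines, so the index is 12 bits; with a 2-bit byte offset and a 2-bit word offset, the tag is 32 - 12 - 2 - 2 = 16 bits. Each line holds 1 (valid) + 16 (tag) + 128 (data) = 145 bits, so the total is 4096 x 145 = 593,920 bits = 74,240 bytes, about 72.5 KB.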

Multi-word caches: Hits/misses in a multi-word cache

Read: just like the read misses on a single-word cache, except that the entire block is fetched.

Write: we cannot just write the word, tag, and valid bit without verifying whether the block is the actual block we want to write to, since more than one memory block maps to the same cache block. We need to compare the tag for writes too:
- the tags match: we can write the word
- the tags do not match: we need to read the block from memory and then write the word
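A sketch of that write path for a 4-word-block line (the names are mine, and fetch_block() is only a stand-in for the read from memory):

    #include <stdio.h>
    #include <string.h>

    #define WORDS_PER_BLOCK 4
    #define LINES 4096

    struct line { int valid; unsigned tag; unsigned data[WORDS_PER_BLOCK]; };
    static struct line cache[LINES];

    static void fetch_block(unsigned tag, unsigned idx, unsigned dst[]) {
        (void)tag; (void)idx;                       /* stand-in for a memory read */
        memset(dst, 0, WORDS_PER_BLOCK * sizeof dst[0]);
    }

    static void write_word(unsigned idx, unsigned tag, unsigned off, unsigned value) {
        struct line *l = &cache[idx];
        if (!(l->valid && l->tag == tag)) {         /* tags do not match (or line invalid): */
            fetch_block(tag, idx, l->data);         /* read the block from memory first */
            l->tag   = tag;
            l->valid = 1;
        }
        l->data[off] = value;                       /* then write the word */
    }

    int main(void) {
        write_word(5, 3, 2, 42);                    /* miss: fetch block, then write */
        write_word(5, 3, 0, 7);                     /* tags match: just write the word */
        printf("%u %u\n", cache[5].data[0], cache[5].data[2]);
        return 0;
    }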

Multi-word caches: Cache block size and miss rate

- up to a certain point, cache miss rate decreases with increasing block size
- after a certain point, cache miss rate increases with increasing block size (spatial locality decreases with block size)
- the miss penalty increases with block size

[Figure: miss rate vs. block size; copyright 1998 Morgan Kaufmann Publishers, Inc. All rights reserved.]

Multi-word caches: Miss penalty (= additional clock cycles)

Has three components:
a) sending the address to memory
b) latency to initiate the memory transfer
c) time for transferring each word

Example in the textbook: a) = 1 clock cycle, b) = 15 clock cycles, c) = 1 clock cycle. With a 4-word-block cache and a 1-word memory bus, what is the miss penalty on a standard DRAM? And on an SDRAM or with an interleaved memory organization?
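A worked answer with those numbers (my arithmetic): on a standard DRAM each of the 4 words pays the full access latency, so the miss penalty is 1 + 4 x (15 + 1) = 65 clock cycles; on an SDRAM or with an interleaved organization the latency is paid once and the words then stream out one per cycle, so it is 1 + 15 + 4 x 1 = 20 clock cycles.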

Multi-word caches: Static and Dynamic RAMs

Multi-word caches: DRAM diagram

[Figure: DRAM organization. An address buffer feeds a row decoder and a column decoder, strobed by RAS and CAS; the selected row of the cell array is read through sense amps and I/O, gated by WE.]

Multi-word caches: Synchronous DRAM timing

[Timing diagram: for a read (WE = READ), the row address RA is presented with RAS; after trcd, the column address CA is presented with CAS; tcl later, data words DO[CA], DO[CA+1], DO[CA+2], ... stream out on DQ on successive clock edges in BURST mode, all within one row access.]

Multi-word caches: Double Data Rate (DDR) DRAM timing

[Timing diagram: same command sequence as the SDRAM read (RAS, trcd, CAS, tcl), but in BURST mode the data D0, D1, D2, D3 is transferred on both edges of the clock, doubling the data rate.]

Multi-word caches: Commercial SDRAM parameters

- tcl: CAS Latency, time between the read command and data output valid
- trcd: RAS-to-CAS Delay, minimum time between RAS and CAS
- trp: RAS Precharge time, time the row decoder needs to precharge the row
- tras: Activate-to-Precharge time, minimum time before trying to change row
- CMDrate: Command Rate, minimum time between chip select and RAS (activate)

Typical values: tcl = 2, trcd = 2, trp = 2, tras = 5, CMDrate = T1. All the above numbers are expressed in clock cycles.
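A hedged worked reading of those values (my interpretation, assuming the bank is idle and that command rate T1 means one cycle): the first data word of a read to a closed row appears after CMDrate + trcd + tcl = 1 + 2 + 2 = 5 clock cycles; if a different row is currently open, its precharge (trp, issued no earlier than tras after its activation) must complete first.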

Multi-word caches: Commercial SDRAM parameters diagram

[Timing diagram: the command sequence ACTIVATE (set RAS), READ (set CAS), PRECHARGE, ACTIVATE, with trcd between ACTIVATE and READ, tcl between READ and data on DQ, tras between ACTIVATE and PRECHARGE, and trp between PRECHARGE and the next ACTIVATE.]

Multi-word caches: Memory bandwidth

If a single transfer to/from memory can transfer multiple words at a time, the miss penalty decreases.

[Figure: three CPU-cache-memory organizations with progressively wider buses, using MUX/DEMUX stages between cache and memory.]

Miss penalty for a 2-word-block cache with a 2-word memory bus?
Miss penalty for a 4-word-block cache with a 4-word memory bus?
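A worked answer, reusing the textbook components from the miss-penalty slide (my arithmetic): when the bus is as wide as the block, the whole block moves in a single transfer, so both cases cost 1 + 15 + 1 = 17 clock cycles.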

Cache associativity

What if the CPU keeps accessing two (or more) variables that map to the same location in a direct-mapped cache? More sophisticated strategy: n-way set-associative caches.
- direct-mapped ("1-way set associative")
- n-way set associative
- fully associative

Cache associativity: Two-way set-associative cache

[Figure: the eight cache blocks 000 through 111 arranged as four sets (cache SET indexes 00, 01, 10, 11) of two lines each; a memory block can go in either line of its set.]

Cache associativity: Four-way set-associative cache

[Figure: the eight cache blocks 000 through 111 arranged as two sets (cache SET indexes 0, 1) of four lines each.]

Cache associativity: Eight-way set-associative cache

With eight lines, an eight-way set-associative cache is a fully-associative cache: any block can go anywhere.

Cache associativity: Direct-mapped cache

[Figure: a direct-mapped cache is a 1-way set-associative cache: eight sets (cache line indexes 000 through 111) of one line each.]

Cache associativity: Pros and cons of increasing cache associativity

Advantages:
- reduces the miss rate

Disadvantages:
- requires more hardware
- requires a replacement policy

Block replacement policy: Least Recently Used (LRU) or random, implemented in hardware.

Cache associativity: Exercise 1

For an 8-line, write-through, 2-way set-associative cache with LRU replacement and 1-word data blocks, trace the following sequence of block addresses, marking each H/M:

Address (dec): 23, 18, 196, 63, 79, 18, 199, 165
Address (bin): 00010111, 00010010, 11000100, 00111111, 01001111, 00010010, 11000111, 10100101
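A sketch of this 2-way, LRU lookup in C (my model of the exercise's cache; the hit/miss answers it prints are computed by the code, not taken from the slide):

    #include <stdio.h>

    #define SETS 4    /* 8 lines, 2-way => 4 sets */
    #define WAYS 2

    struct way { int valid; unsigned tag; };
    static struct { struct way w[WAYS]; int lru; } cache[SETS];  /* lru = least recently used way */

    static int access_block(unsigned block_addr) {   /* returns 1 on hit, 0 on miss */
        unsigned idx = block_addr % SETS, tag = block_addr / SETS;
        for (int i = 0; i < WAYS; i++)
            if (cache[idx].w[i].valid && cache[idx].w[i].tag == tag) {
                cache[idx].lru = 1 - i;              /* the other way becomes LRU */
                return 1;
            }
        int victim = cache[idx].lru;                 /* on a miss, replace the LRU way */
        cache[idx].w[victim].valid = 1;
        cache[idx].w[victim].tag   = tag;
        cache[idx].lru = 1 - victim;
        return 0;
    }

    int main(void) {
        unsigned trace[] = {23, 18, 196, 63, 79, 18, 199, 165};
        for (int i = 0; i < 8; i++)
            printf("%3u: %s\n", trace[i], access_block(trace[i]) ? "hit" : "miss");
        return 0;
    }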

Cache associativity: Exercise 2

A computer system has 32-bit addresses and a 64-KB direct-mapped, write-back cache with 8-byte data block lines.
a) how many lines are in the cache?
b) how many bits total (including cache management bits) are in each line, minimum?
c) what is the total cache size in bits?
d) diagram a cache lookup

Cache associativity: Solution

a) # of lines: ...  b) # of bits per line: ...  c) total cache size: ...  d) cache lookup: [diagram over address bits 31..0]

Cache associativity: Exercise 3

Suppose the 64-KB cache in Exercise 2 was instead 2-way set-associative with 8-byte lines.
a) how many sets are in the cache?
b) how many lines are in the cache?
c) how many bits total (including cache management bits) are in each line, minimum?
d) what is the total cache size in bits?
e) diagram a cache lookup
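A sketch that grinds through the geometry for Exercises 2 and 3 (my arithmetic; it assumes one valid bit and one dirty bit per line since the cache is write-back, and no replacement-state bits, since random replacement would need none):

    #include <stdio.h>

    static void geometry(const char *name, int ways) {
        int addr_bits = 32, cache_bytes = 64 * 1024, block_bytes = 8;
        int lines = cache_bytes / block_bytes;            /* 8192 */
        int sets  = lines / ways;
        int offset_bits = 3;                              /* log2(8-byte block) */
        int index_bits  = 0;
        for (int s = sets; s > 1; s >>= 1) index_bits++;  /* log2(sets) */
        int tag_bits  = addr_bits - index_bits - offset_bits;
        int line_bits = 1 + 1 + tag_bits + 8 * block_bytes;  /* valid + dirty + tag + data */
        printf("%s: %d sets, %d lines, %d b/line, %d b total\n",
               name, sets, lines, line_bits, lines * line_bits);
    }

    int main(void) {
        geometry("Exercise 2 (direct-mapped)", 1);  /* 8192 sets, 82 b/line, 671744 b */
        geometry("Exercise 3 (2-way)", 2);          /* 4096 sets, 83 b/line, 679936 b */
        return 0;
    }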

Cache associativity: Solution

a) # of sets: ...  b) # of lines: ...  c) # of bits per line: ...  d) total cache size: ...  e) cache lookup: [diagram over address bits 31..0]

Caches and performance

Exercise 4: a computer has a CPI of 1.0 when there are no cache misses, and a 100 MHz clock. Each instruction has on average 0.4 data memory references. For each cache miss the instruction takes an additional 9 clock cycles to complete.
- what are the CPI_100% and the MIPS_100% rating with a cache and an (unrealistic) 100% hit rate?
- what are the CPI_NOCACHE and the MIPS_NOCACHE rating without a cache?
- what are the CPI_90/85 and the MIPS_90/85 rating with a cache, a 90% hit rate on instructions, and an 85% hit rate on data?
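A worked solution (my arithmetic; the no-cache case treats every instruction fetch and data reference as a miss): CPI_100% = 1.0, so MIPS_100% = 100 MHz / 1.0 = 100. Without a cache, CPI_NOCACHE = 1 + (1 + 0.4) x 9 = 13.6 and MIPS_NOCACHE = 100 / 13.6, about 7.4. With the cache, CPI_90/85 = 1 + 0.10 x 9 + 0.4 x 0.15 x 9 = 1 + 0.90 + 0.54 = 2.44 and MIPS_90/85 = 100 / 2.44, about 41.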

Caches and performance: Solution

CPI_100% = ...  MIPS_100% = ...  CPI_NOCACHE = ...  MIPS_NOCACHE = ...  CPI_90/85 = ...  MIPS_90/85 = ...

Homework: Recommended exercises

Third Edition: Ex 7.2-7.4, 7.6-7.8, 7.9, 7.12, 7.16, 7.17-7.19, 7.25-7.27, 7.32, 7.33, 7.35