TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science

2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spatial locality Hits and misses Direct-mapped, set associative, fully associative caches Addressing Handling writes Performance

3 Review What is control speculation? What is data speculation? What are the advantages of a superscalar vs a VLIW? What are the disadvantages of a superscalar vs a VLIW? When is a VLIW appropriate? When is a superscalar appropriate?

4 Datapath and control from Chapter 4

5 Memory technologies
Static RAM (SRAM): 0.5-2.5 ns, $2000-$5000 per GB
Dynamic RAM (DRAM): 50-70 ns, $20-$75 per GB
Magnetic disk: 5-20 ms, $0.20-$2 per GB
Ideal memory: access time of SRAM, capacity and cost/GB of disk
(Prices as of 2008)

6 Memory hierarchies - motivation Programmers want an unlimited amount of fast memory Fast memory is expensive Large memories are slow Compromise: a memory hierarchy is used to create the illusion of memory with the size of the largest and the speed of the fastest

7 Memory hierarchies Memory hierarchy: A structure that uses multiple levels of memories As the distance from the CPU increases, the size of the memories and the access time both increase The illusion of a large, fast memory is achieved by using the principles of locality

8 Principle of temporal locality If you read an address once, you are likely to touch it again (variables) If you execute an instruction once, you are likely to execute it again (loops) Temporal locality: addresses recently referenced will tend to be referenced again soon Caches exploit temporal locality!

9 Principle of spatial locality If you read an address once, you are likely to also read neighbouring addresses (arrays) If you execute an instruction once, you are likely to access neighbouring instructions Spatial locality: if you access address X, you are likely to access an address close to X Caches exploit spatial locality!
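A minimal code illustration (my own example, not from the slides): the loop below has temporal locality on sum, i and the loop instructions, and spatial locality on the array elements.

```c
/* Illustrative only: a sequential array traversal showing both
 * kinds of locality. */
#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = i;

    long sum = 0;
    for (int i = 0; i < 1024; i++) {
        sum += a[i];  /* spatial: a[i] shares a cache line with its
                         neighbours, so most accesses hit          */
                      /* temporal: sum, i and the loop body itself
                         are reused on every iteration             */
    }
    printf("%ld\n", sum);
    return 0;
}
```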

10 Levels of hierarchy Exploit the principle of locality by using the memory hierarchy Memory closer to the CPU is a subset of memory further away All data is stored at the lowest level Data copied between only two levels at a time Upper levels: caching Lower levels: virtual memory

11 Exploiting locality Memory hierarchy Store everything on disk Copy recently accessed (and nearby) items from disk to smaller DRAM memory (main memory) Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory (CPU cache)

12 Organization of data Data is transferred only between two levels at a time The data can be either present or not present in the upper level when needed The minimum unit of data that can be present or not present is called a block or a line The block is also the unit transferred from the lower level when needed

13 Hierarchy and the computer Concepts used to build memory systems affect many other aspects of a computer and its performance: How the operating system manages memory and I/O How compilers generate code How applications use the computer Since all programs spend much of their time accessing memory, the memory system is the major factor in determining performance Programmers should understand the memory hierarchy to achieve good performance

14 Hits and misses (1/2) A hit occurs when data referenced by the processor is available in a block in the upper level Otherwise, it is a miss On a miss the block containing the data must be transferred from the next level in the hierarchy The hit rate is the fraction of memory accesses found in the upper level The fraction not found is called the miss rate

15 Hits and misses (2/2) The hit time is the time needed to access data from the upper level Includes the time to determine if it is a hit or a miss The miss penalty is the time needed to access data that is not available in the upper level Includes the time to transfer the block from the lower level and to deliver the requested data The hit time is much smaller than the miss penalty
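Taken together, these definitions give the standard average memory access time formula (textbook material, not spelled out on this slide):

```latex
\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}
```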

16 Programs and locality Programs tend to reuse recently accessed data items (temporal locality) and reference data items that are close to recently accessed data (spatial locality) Memory hierarchies exploit temporal locality by keeping more recently accessed data items closer to the processor Memory hierarchies exploit spatial locality by moving blocks consisting of multiple contiguous words in memory to upper levels of the hierarchy Most systems use a true hierarchy: data present at level i is also present at level i + 1

17 Cache Cache: the levels in the hierarchy between main memory and the processor Simple (level 1) cache where a block is one single word The cache before and after a reference to X_n

18 Direct-mapped cache How do we know if the requested word is in the cache? How do we find it? Easy to find a word in the cache if a memory location is mapped to exactly one cache location If the address of the memory location determines the exact placement in the cache it is called a direct-mapped cache Typical mapping: (block address) mod (#cache blocks)

19 Direct-mapped 1-word block sized cache Mapping: (word address) mod (#cache words) 8-word cache for a 32-word memory: the 3 least significant bits of the address determine the cache position

20 Direct-mapped 1-word block sized cache How do we know which memory word is in a given cache word? We need to store the remaining upper address bits with the data The upper part of the address stored with the data block is called a tag For our 32-word memory and 8-word cache the tag is 2 bits What if there is invalid data in the cache word? We need a valid bit for each block For each block we have: valid bit, tag, data block
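A hedged C sketch of this exact organization (structure and names are mine, not from the lecture): an 8-entry direct-mapped cache of one-word blocks over a 32-word memory, with a 3-bit index, 2-bit tag and a valid bit per entry.

```c
/* Sketch: direct-mapped cache, 8 one-word entries, 32-word memory
 * (5-bit word addresses -> 3 index bits + 2 tag bits). */
#include <stdbool.h>
#include <stdint.h>

#define CACHE_WORDS 8

struct line {
    bool     valid;
    uint32_t tag;    /* upper 2 bits of the 5-bit word address */
    uint32_t data;
};

static struct line cache[CACHE_WORDS];

/* Returns true on a hit. On a miss the caller would fetch the word
 * from memory and fill cache[index] with {true, tag, data}. */
bool lookup(uint32_t word_addr, uint32_t *data_out) {
    uint32_t index = word_addr % CACHE_WORDS;  /* (word address) mod (#cache words) */
    uint32_t tag   = word_addr / CACHE_WORDS;  /* remaining upper bits */
    if (cache[index].valid && cache[index].tag == tag) {
        *data_out = cache[index].data;
        return true;
    }
    return false;
}
```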

21 Example: reads on the simple cache See pages 460-461

22 Larger blocks In order to take advantage of spatial locality, caches use blocks several words in size We only need one valid bit and one tag per block (less storage overhead) The block address is (byte address) / (#bytes in block) With 16 bytes per block, byte address 1200 has block address 1200/16 = 75

23 Larger blocks and hit rate Too large blocks decrease the hit rate Many words are not used before the block is kicked out Larger blocks also increase the miss penalty

24 Anatomy of an address Tag Index Byte offset Byte offset: which byte in the cache line are we reading? Index: which cache line are we reading? Tag: how we differentiate this address from other addresses with the same index and byte offset Consider a 64 KB, direct mapped cache with 64 B cache lines. Assuming a 32-bit address, how many bits are used for tag? Index? Byte offset? Index: 10 bits (1024 lines), byte offset: 6 bits (64 B lines), tag: 16 bits (32 - 10 - 6)
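A small check of those widths in C (the program and its constants are mine; it just implements the arithmetic above):

```c
/* Field extraction for the slide's example: 64 KB direct-mapped
 * cache, 64 B lines, 32-bit addresses. */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* log2(64 B per line)             */
#define INDEX_BITS 10   /* log2(64 KB / 64 B = 1024 lines) */

int main(void) {
    uint32_t addr   = 0x12345678;  /* arbitrary example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* 16 bits remain */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```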

25 Cache implementation

26 Caches and associativity fully associative Instead of direct mapping, a cache design can be fully associative In a fully associative cache a block can be put in any position in the cache regardless of its address This requires a full search of the tags to determine cache hit or miss, which increases hardware costs What is the size of the index field for a fully associative cache?

27 Caches and associativity set associative Direct mapping and fully associative are two ends of the spectrum. Set associative caches are somewhere in between In a set associative cache one address maps to a fixed number of locations in the cache A set associative cache with n locations for each block is called n-way set associative The index of the set in the cache is given by: (block address) mod (#sets in the cache) All tags in the set must be searched to determine hit or miss

28 Caches and associativity Direct mapped is the same as one-way set associative Fully associative is m-way set associative where m is the number of blocks in the cache

29 Associativity, hit time and hit rate Increased associativity can increase the hit time Tag search takes more time Increased associativity can decrease the miss rate Blocks are kept longer in the cache
Associativity  Data miss rate
1              10.3 %
2              8.6 %
4              8.3 %
8              8.1 %
(FastMATH processor running SPEC2000)

30 Block replacement With associativity one has to decide which block to remove when a set is full on a cache miss Two strategies: least recently used (LRU) and random LRU needs hardware to track accesses
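A sketch of LRU tracking in C (illustrative; real hardware uses cheaper approximations at higher associativity): keep a timestamp per way and evict the oldest valid entry.

```c
/* LRU victim selection for one n-way set, via access timestamps. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct way {
    bool     valid;
    uint32_t tag;
    uint64_t last_used;  /* updated on every access to this way */
};

/* Prefer an invalid way; otherwise evict the least recently used. */
int choose_victim(const struct way set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                       /* free slot first */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                     /* older access wins */
    }
    return victim;
}
```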

31 Example: associative caches 1-word blocks, four blocks 3 different cache implementations: direct-mapped, two-way set associative, fully associative Block addresses accessed in sequence: 0, 8, 0, 6, 8 (a simulation sketch follows the figure slides below)

32 Example: associative caches direct mapped

33 Example: associative caches 2-way set

34 Example: associative caches fully associative
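A hedged C sketch of slide 31's experiment (the simulator is mine, written just for this trace); with LRU replacement it reports 5 misses direct-mapped, 4 misses two-way set associative, and 3 misses fully associative.

```c
/* Simulate block addresses 0, 8, 0, 6, 8 on a 4-block cache of
 * 1-word blocks, for 1-, 2- and 4-way associativity (4-way == fully
 * associative here). LRU replacement. */
#include <stdbool.h>
#include <stdio.h>

#define BLOCKS 4

static int misses(int ways, const int *trace, int n) {
    int sets = BLOCKS / ways;
    int tag[BLOCKS], stamp[BLOCKS], now = 0, miss = 0;
    bool valid[BLOCKS] = {false};
    for (int t = 0; t < n; t++) {
        int base = (trace[t] % sets) * ways, hit = -1;
        for (int w = 0; w < ways; w++)           /* search the set */
            if (valid[base + w] && tag[base + w] == trace[t])
                hit = base + w;
        if (hit < 0) {                           /* miss: pick victim */
            miss++;
            hit = base;
            for (int w = 0; w < ways; w++) {
                if (!valid[base + w]) { hit = base + w; break; }
                if (stamp[base + w] < stamp[hit]) hit = base + w;
            }
            valid[hit] = true;
            tag[hit] = trace[t];
        }
        stamp[hit] = ++now;                      /* mark as most recent */
    }
    return miss;
}

int main(void) {
    const int trace[] = {0, 8, 0, 6, 8};
    for (int ways = 1; ways <= BLOCKS; ways *= 2)
        printf("%d-way: %d misses\n", ways, misses(ways, trace, 5));
    return 0;
}
```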

35 Associativity and tag bits Increasing associativity increases the number of comparators needed It also increases the size of the tag fields Assume a cache with 4K blocks, 16-byte block size, and 32-bit addresses:
Direct mapped: 16-bit tag, 64 Kbits total for tags
2-way: 2K sets, 17-bit tag, 68 Kbits total for tags
4-way: 1K sets, 18-bit tag, 72 Kbits total for tags
Fully associative: 1 set, 28-bit tag, 112 Kbits total for tags
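The pattern behind these numbers (standard material, implicit in the slide):

```latex
% Cache with B blocks, 2^b-byte blocks, n-way associative, 32-bit addresses:
\begin{align*}
\#\text{sets} &= B / n \\
\text{tag bits} &= 32 - \log_2(\#\text{sets}) - b \\
\text{tag storage} &= B \times \text{tag bits}
\end{align*}
% e.g. 4-way: sets = 4096/4 = 1024, tag = 32 - 10 - 4 = 18 bits,
% storage = 4096 x 18 = 72 Kbits, matching the slide.
```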

36 Implementation of direct mapped

37 Implementation of 4-way set associative

38 Handling of cache misses On a cache hit the processor proceeds as normal on the next clock cycle On a cache miss the processor must be stalled until the data is available in the cache This freezes the contents of the pipeline registers and the register file On an instruction miss the contents of the instruction register are invalid and the instruction must be re-fetched

39 Miss penalty elaborated A large block size increases the transfer time of the block We can hide some of the transfer time: Early restart Resume execution as soon as the requested word is available in the cache, possibly before the transfer is complete Requested word first (critical word first) Transfer the requested word of the block first and then the rest of the block consecutively, wrapping the address around at the top of the block

40 Memory writes Write-through can help with memory consistency On a write the block is read from the lower level (if not present in the cache) The new word is written both to the word in the cache and to the address in main memory. Poor performance Write buffer A buffer holds a queue of write accesses to main memory Write-back Writes go only to the cache block The block is written back when replaced (requires a dirty bit!)

41 Memory writes Write-through + no-write-allocate: on a write miss, write through to the next level without filling the cache Write-back + write-allocate: on a write miss, read the line from the next level, place it in the cache, and write to the cache; when the block is evicted, write the line back to memory
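A toy contrast of the two policies in C, reusing the simple direct-mapped cache of one-word blocks from earlier slides (all names and structure are mine, not from the lecture):

```c
/* Write-through + no-write-allocate vs. write-back + write-allocate
 * on a direct-mapped cache of 1-word blocks. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MEM_WORDS 32
#define CACHE_WORDS 8

static uint32_t mem[MEM_WORDS];
static struct { bool valid, dirty; uint32_t tag, data; } cache[CACHE_WORDS];

/* Write-through + no-write-allocate: the next level is always up to
 * date, and a write miss does not fill the cache. */
void write_wt(uint32_t a, uint32_t w) {
    uint32_t i = a % CACHE_WORDS, t = a / CACHE_WORDS;
    if (cache[i].valid && cache[i].tag == t)
        cache[i].data = w;      /* update the cached copy if present */
    mem[a] = w;                 /* always write the next level */
}

/* Write-back + write-allocate: writes go only to the cache; memory is
 * updated when a dirty block is evicted. */
void write_wb(uint32_t a, uint32_t w) {
    uint32_t i = a % CACHE_WORDS, t = a / CACHE_WORDS;
    if (!(cache[i].valid && cache[i].tag == t)) {        /* write miss */
        if (cache[i].valid && cache[i].dirty)            /* evict dirty block */
            mem[cache[i].tag * CACHE_WORDS + i] = cache[i].data;
        cache[i].valid = true;
        cache[i].dirty = false;
        cache[i].tag = t;
        cache[i].data = mem[a];                          /* allocate: fetch line */
    }
    cache[i].data = w;          /* write the cache only */
    cache[i].dirty = true;      /* remembered via the dirty bit */
}

int main(void) {
    write_wb(5, 42);            /* memory still stale for address 5 */
    write_wt(6, 7);             /* memory immediately updated */
    printf("mem[5]=%u mem[6]=%u\n", mem[5], mem[6]);  /* prints 0 and 7 */
    return 0;
}
```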

42 Types of cache misses Cache misses can be divided into three categories depending on the reason for the miss:
Compulsory misses: access to a block that has never been in the cache
Capacity misses: access to blocks that have been kicked out because of the cache size
Conflict misses: access to blocks that have been kicked out of a set associative or direct mapped cache, but would still have been available in a fully associative cache

43 Types of cache misses

44 Impact of miss rate on performance Example on page 477

45 16 KB cache in FastMATH (MIPS) 4K words, 16-word blocks Separate instruction and data caches OS decides between write-through and write-back Effective miss rate 3.2 % (SPEC2000 integer benchmarks): 11.4 % data, 0.4 % instruction Bits 5-2 of the address select the word within the block

46 Split vs. combined cache A combined cache with a total size equal to the sum of the two split caches gives a better hit rate FastMATH: split cache miss rate 3.24 %, combined cache miss rate 3.18 % A split cache doubles the cache bandwidth: the processor can access both the instruction cache and the data cache in the same clock cycle The increased bandwidth easily outweighs the disadvantage of the increased miss rate

47 Designing the memory system to support caches (1/3) The miss penalty can be reduced if the memory-to-cache bandwidth is increased (allows larger blocks while maintaining a low miss penalty) The bus clock rate is usually much slower than the processor clock (e.g., a factor of 10), affecting the miss penalty Assume 1 memory bus clock cycle to send the address, 15 memory bus clock cycles for each DRAM access initiated, and 1 memory bus clock cycle to send a word of data With a 4-word block and a one-word-wide bank, the miss penalty is 1 + 4 x 15 + 4 x 1 = 65 memory bus clock cycles Transferred bytes per bus clock cycle: (4 x 4)/65 = 0.25

48 Increasing bandwidth by widening the bus (2/3) Widen the memory and the buses between the processor and the memory Memory bandwidth increases proportionally Miss penalty improvement over the previous example with a main-memory width of two words: 1 + 2 x 15 + 2 x 1 = 33 memory bus clock cycles, down from 65. Main-memory width of 4 words: 17 cycles. Costs: a wider bus and a potential increase in access time due to the multiplexor and control logic between the processor and the cache

49 Increasing bandwidth by interleaving (3/3) Memory chips are organized in banks to read or write multiple words in one access time rather than reading or writing a single word each time. Sending an address to several banks permits them all to read simultaneously. This gives the advantage of incurring the full memory latency only once: 1 + 1 x 15 + 4 x 1 = 20 memory bus clock cycles

50 Bytes / clock cycle for a single miss
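The figure's throughput numbers follow from the three previous slides (16 bytes transferred per 4-word block miss); recomputing:

```latex
\begin{align*}
\text{one-word-wide bus}    &: 16/65 \approx 0.25 \text{ bytes/cycle} \\
\text{two-word-wide bus}    &: 16/33 \approx 0.48 \\
\text{four-word-wide bus}   &: 16/17 \approx 0.94 \\
\text{interleaved, 4 banks} &: 16/20 = 0.80
\end{align*}
```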

51 Cache performance
CPU time = (execution cycles + stall cycles) x cycle time
stall cycles = read stall cycles + write stall cycles
read stalls = reads/program x read miss rate x read miss penalty
write stalls = writes/program x write miss rate x write miss penalty + write buffer stalls (buffer stalls << write misses)
With a single miss rate and miss penalty:
memory stalls = memory accesses/program x miss rate x miss penalty = instructions/program x misses/instruction x miss penalty
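A worked instance with assumed numbers (chosen to resemble the textbook's examples; they are not from this lecture): base CPI 2, miss penalty 100 cycles, instruction cache miss rate 2 %, data cache miss rate 4 %, and 36 % of instructions are loads or stores.

```latex
\begin{align*}
\text{memory stall cycles per instruction} &= 0.02 \times 100 + 0.36 \times 0.04 \times 100 \\
                                           &= 2 + 1.44 = 3.44 \\
\text{effective CPI} &= 2 + 3.44 = 5.44
\end{align*}
```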

52 Multi-level cache Adding cache levels reduces the miss penalty The level 1 cache focuses on reducing hit time: smaller cache size, smaller block size The level 2 cache focuses on reducing the miss penalty: larger cache size, larger block size

53 Review Cache lines exploit which locality? What is the benefit of associativity? What is the cost of associativity? Why aren't L1 caches big? What is temporal locality? What is an example of code that has no temporal locality?