CS 61C: Great Ideas in Computer Architecture (Machine Structures)


CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.berkeley.edu/~cs61c/fa10
Fall 2010, Lecture #14

Agenda: Direct-Mapped Caches (continued); Cache-Memory Interface; Characteristics of the Memory Hierarchy.

Characteristics of the Memory Hierarchy
Increasing distance from the processor means increasing access time. The processor exchanges 4-8 bytes (a word) with the L1$; the L1$ exchanges 8-32 bytes (a block) with the L2$; the L2$ exchanges 1 to 4 blocks with main memory; and main memory exchanges 1,024+ bytes (a disk sector = page) with secondary memory. The hierarchy is inclusive: what is in the L1$ is a subset of what is in the L2$, which is a subset of what is in main memory, which in turn is a subset of what is in secondary memory. (The figure also shows the relative size of the memory at each level.)

Mapping the Memory Address
(Figure: the memory address split into three fields - which memory block is cached, which cache block it occupies, and the byte within the block. Note: $ = cache.)
In this example the block size is 4 bytes (1 word); it could be multi-word. Memory blocks and cache blocks are the same size, the unit of transfer between memory and cache, and there are many more memory blocks than cache blocks: 16 memory blocks / 16 words / 64 bytes, so 6 bits address every byte, while the cache has 4 blocks of 4 bytes (1 word) each, so 4 memory blocks map to each cache block.
Byte within block: the low-order two bits; ignore them (nothing smaller than a block is transferred).
Memory block to cache block, aka index: the middle two bits.
Which memory block is in a given cache block, aka tag: the top two bits.

Direct-Mapped Cache Example
One-word blocks, cache size = 1K words (or 4 KB). The 32-bit address splits into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; each cache entry holds a valid bit, a tag, and one 32-bit data word, and Hit is asserted when the indexed entry is valid and its tag matches the address tag. What kind of locality are we taking advantage of? (Temporal locality; with one-word blocks there is no spatial reuse within a block.)
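To make the field extraction concrete, here is a minimal C sketch (not from the original slides; the address value is arbitrary) that splits a 32-bit byte address into the tag, index, and byte-offset fields of the 1K-word, one-word-block direct-mapped cache above. The field widths follow directly from the cache geometry: 2^10 blocks need 10 index bits, a 4-byte block needs 2 offset bits, and the tag gets the remaining 20 bits.

#include <stdint.h>
#include <stdio.h>

/* Geometry of the example cache: 1K one-word (4-byte) blocks = 4 KB of data. */
#define OFFSET_BITS 2                                /* byte within the block  */
#define INDEX_BITS  10                               /* 2^10 = 1024 blocks     */
#define TAG_BITS    (32 - INDEX_BITS - OFFSET_BITS)  /* 20 tag bits            */

int main(void) {
    uint32_t addr = 0x12345678;                      /* any 32-bit byte address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* The cache compares 'tag' with the tag stored at entry 'index';
       a valid entry with a matching tag is a hit. */
    printf("addr 0x%08x -> tag 0x%05x, index %u, offset %u\n",
           addr, tag, index, offset);
    return 0;
}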

Multiword-Block Direct-Mapped Cache
Four words per block, cache size = 1K words. (Figure: the address now carries a tag, an index, a 2-bit block offset selecting the word within the block, and a 2-bit byte offset; each entry holds a valid bit, a tag, and four data words.)

Taking Advantage of Spatial Locality
Let the cache block hold more than one word. Start with an empty cache, all blocks initially marked as not valid, and step through a short word-address reference stream. What kind of locality are we taking advantage of? (Spatial locality, in addition to temporal.)

Miss Rate vs Block Size vs Cache Size
(Figure: miss rate (%) versus block size of 16, 32, 64, 128, and 256 bytes, for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.) Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same-size cache becomes smaller (increasing capacity misses).

Cache Field Sizes
The number of bits in a cache includes both the storage for data and for the tags. Assume a 32-bit byte address. For a direct-mapped cache with 2^n blocks, n bits are used for the index. For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word. What is the size of the tag field? (The remaining 32 - (n + m + 2) bits.) The total number of bits in a direct-mapped cache is then 2^n x (block size + tag field size + valid field size). How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address? (A worked sketch of this question appears after the Sources of Cache Misses slide below.)

Handling Cache Hits
Read hits (I$ and D$): hits are good, they help us go fast; misses are bad and slow us down.
Write hits (D$ only) must either keep the cache and memory consistent or manage their inconsistency:
Write-through: always write the data into both the cache block and the next level in the memory hierarchy. Writes then run at the speed of the next level of the hierarchy, which is slow, or we can use a write buffer and stall only when the write buffer is full.
Write-back: allow the cache and memory to be inconsistent by writing the data only into the cache block, and write that cache block back to the next level in the memory hierarchy when it is evicted. This needs a dirty bit on each data cache block to tell whether the block must be written back to memory when it is evicted; a write buffer can help buffer write-backs of dirty blocks.

Sources of Cache Misses
Compulsory (cold start or process migration; first reference): the first access to a block. A cold fact of life, not a whole lot you can do about it, and if you are going to run millions of instructions, compulsory misses are insignificant. Solution: increase the block size (which increases the miss penalty; very large blocks could increase the miss rate).
Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size (which may increase access time).
Conflict (collision): multiple memory locations map to the same cache location. Solution 1: increase the cache size. Solution 2: increase the associativity (stay tuned). Either may increase access time.
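As a check on the Cache Field Sizes question above, here is a small C sketch (not from the slides) that evaluates 2^n x (block size + tag + valid) for the 16 KB-of-data, 4-word-block, 32-bit-address case: 16 KB of data in 16-byte blocks gives 2^10 blocks, so n = 10, m = 2, and the tag is 32 - 10 - 2 - 2 = 18 bits.

#include <stdio.h>

int main(void) {
    int index_bits         = 10;        /* 16 KB / 16-byte blocks = 2^10 blocks */
    int word_in_block_bits = 2;         /* 2^2 = 4 words per block              */
    int byte_in_word_bits  = 2;
    int data_bits          = 4 * 32;    /* 128 data bits per block              */
    int tag_bits = 32 - index_bits - word_in_block_bits - byte_in_word_bits;

    long total_bits = (1L << index_bits) * (data_bits + tag_bits + 1 /* valid */);

    printf("tag = %d bits, total = %ld bits (%ld Kibits, %.1f KB)\n",
           tag_bits, total_bits, total_bits / 1024, total_bits / 8.0 / 1024.0);
    /* Prints: tag = 18 bits, total = 150528 bits (147 Kibits, 18.4 KB). */
    return 0;
}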

Handling Cache Misses (Single-Word Blocks)
Read misses (I$ and D$): stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, then let the pipeline resume.
Write misses (D$ only) have three options (a small simulation sketch of these policies follows the administrivia below):
1. Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word from the processor into the cache, then let the pipeline resume.
2. Write allocate: just write the word into the cache, updating both the tag and the data; no need to check for a cache hit, no need to stall.
3. No-write allocate: skip the cache write (but invalidate that cache block, since it would now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full.

Multiword-Block Considerations
Read misses (I$ and D$) are processed the same as for single-word blocks: a miss returns the entire block from memory, so the miss penalty grows as the block size grows. Early restart lets the processor resume execution as soon as the requested word of the block is returned; requested word first transfers the requested word from memory to the cache (and processor) first; a nonblocking cache allows the processor to continue to access the cache while the cache is handling an earlier miss.
Write misses (D$): if using write allocate, we must first fetch the block from memory and then write the word into the block, or we could end up with a garbled block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block).

Administrivia
Midterm! HW #3 is posted, due Sunday @ 23:59:59. Exam review on Monday in GPB, 6-8 PM. No lecture next Wednesday, 6 October. Exam 6-9 PM in Pimentel: closed book, closed notes, no calculator, one 8.5 x 11 crib sheet.

Agenda (recap): Direct-Mapped Caches; Cache-Memory Interface; Hits and Misses.
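Below is a minimal C sketch of the miss handling described above for a direct-mapped, write-back data cache with single-word blocks, using the first write-miss option (fetch the block, evict a dirty victim if needed, then write). The memory array, sizes, and names are illustrative, not from the lecture.

#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 1024              /* 1K one-word blocks (4 KB of data)        */
#define MEM_WORDS  (1 << 20)         /* toy main memory; addresses assumed to fit */

typedef struct {
    int      valid, dirty;
    uint32_t tag;
    uint32_t data;                   /* one-word block                            */
} Line;

static Line     cache_mem[NUM_BLOCKS];
static uint32_t main_mem[MEM_WORDS];

/* Access one word; is_write selects store vs load. Returns the word held in
   the cache after the access completes. */
static uint32_t cache_access(uint32_t addr, int is_write, uint32_t store_data)
{
    uint32_t word_addr = addr >> 2;                    /* drop the byte offset    */
    uint32_t index     = word_addr % NUM_BLOCKS;       /* middle bits             */
    uint32_t tag       = word_addr / NUM_BLOCKS;       /* top bits                */
    Line *line = &cache_mem[index];

    if (!(line->valid && line->tag == tag)) {          /* miss                    */
        if (line->valid && line->dirty)                /* write back dirty victim */
            main_mem[line->tag * NUM_BLOCKS + index] = line->data;
        line->data  = main_mem[word_addr];             /* fetch the new block     */
        line->tag   = tag;
        line->valid = 1;
        line->dirty = 0;
    }
    if (is_write) {                                    /* write hit, or write     */
        line->data  = store_data;                      /* after allocation        */
        line->dirty = 1;                               /* written back on evict   */
    }
    return line->data;
}

int main(void) {
    cache_access(0x100, 1, 42);                        /* store 42 at 0x100       */
    printf("%u\n", cache_access(0x100, 0, 0));         /* load hits, prints 42    */
    return 0;
}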

Memory Systems that Support Caches
The off-chip interconnect and memory architecture affect overall system performance in dramatic ways. Consider a one-word-wide organization: a bus carrying 32-bit data and a 32-bit address per cycle, with a one-word-wide memory. Assume 1 memory clock cycle to send the address; 15 memory clock cycles to get the 1st word of the block from DRAM (the row cycle time) and 5 memory clock cycles for each of the 2nd, 3rd, and 4th words (the subsequent column access time), note the effect of latency; and 1 memory clock cycle to return a word of data. Memory-bus-to-cache bandwidth is the number of bytes accessed from memory and transferred to the cache/CPU per memory clock cycle.

(DDR) SDRAM Operation
(Figure: an N-row by N-column array of M-bit planes with an N x M SRAM row buffer; RAS latches the row address, CAS the column address, and the 1st through 4th M-bit accesses stream out of the row buffer, one per cycle.) After a row is read into the SRAM register, CAS supplies the starting burst address along with a burst length; the DRAM then transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row, with the memory bus clock controlling the transfer of successive words in the burst.

One-Word-Wide Bus, One-Word Blocks
If the block size is one word, then on a cache miss the pipeline stalls for the number of cycles required to return one data word from memory: 1 memory clock cycle to send the address, 15 memory clock cycles to read DRAM, and 1 memory clock cycle to return the data, 17 cycles in all, so the bandwidth is 4 bytes / 17 cycles, about 0.24 bytes per memory clock cycle.

One-Word-Wide Bus, Four-Word Blocks
What if the block size is four words and each word is in a different DRAM row? 1 cycle to send the 1st address, 4 x 15 = 60 cycles to read DRAM, and 4 cycles to return the data: 65 cycles, or 16 bytes / 65 cycles, about 0.25 bytes per clock.
What if the block size is four words and all words are in the same DRAM row? 1 cycle to send the 1st address, 15 + 3 x 5 = 30 cycles to read DRAM, and 4 cycles to return the data: 35 cycles, or 16 bytes / 35 cycles, about 0.46 bytes per clock.

Interleaved Memory, One-Word-Wide Bus
With four DRAM banks (bank 0 through bank 3), a block of four words takes 1 cycle to send the 1st address, 15 cycles for the banks to read in parallel, and 4 x 1 cycles to return the data: 20 cycles. The number of bytes transferred per clock cycle (bandwidth) for a single miss is 16 / 20 = 0.8.
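To see where these numbers come from, here is a short C sketch (illustrative, not from the slides) that tabulates the miss penalty and bus-to-cache bandwidth for each memory organization using the latency assumptions stated above.

#include <stdio.h>

/* Latency assumptions from the slides (in memory bus clock cycles). */
#define SEND_ADDR   1     /* send the address to memory                   */
#define ROW_ACCESS  15    /* first word of a block (DRAM row cycle time)  */
#define COL_ACCESS  5     /* each subsequent word in the same row         */
#define RETURN_WORD 1     /* return one word of data over the bus         */

static void report(const char *name, int cycles, int bytes)
{
    printf("%-36s %3d cycles, %.2f bytes/cycle\n",
           name, cycles, (double)bytes / cycles);
}

int main(void)
{
    /* One-word blocks, one-word-wide bus and memory. */
    report("1-word block", SEND_ADDR + ROW_ACCESS + RETURN_WORD, 4);

    /* Four-word blocks, every word in a different DRAM row. */
    report("4-word block, different rows",
           SEND_ADDR + 4 * ROW_ACCESS + 4 * RETURN_WORD, 16);

    /* Four-word blocks, all words in the same DRAM row (burst). */
    report("4-word block, same row",
           SEND_ADDR + ROW_ACCESS + 3 * COL_ACCESS + 4 * RETURN_WORD, 16);

    /* Four-word blocks, four interleaved banks read in parallel. */
    report("4-word block, 4 interleaved banks",
           SEND_ADDR + ROW_ACCESS + 4 * RETURN_WORD, 16);

    return 0;   /* prints 17, 65, 35, and 20 cycles (0.24 to 0.80 bytes/cycle) */
}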

Memory System Design Observations
It is important to match the cache characteristics (caches access one block at a time, and a block is usually more than one word) with the DRAM characteristics (use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache) and with the memory-bus characteristics (make sure the memory bus can support the DRAM access rates and patterns), all with the goal of increasing the memory-bus-to-cache bandwidth.

Summary
Hits in caches hide the long access latencies of main memory. Misses suffer the full latency of going to main memory, so the processor may have to wait for memory in those cases unless other tricks are used (future parallelism discussions!). We mitigate the effect of that latency by exploiting memory bandwidth and parallelism to move a whole block from memory to cache, in the hope that locality will turn future accesses into hits.
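One way to quantify how hits hide memory latency is the standard average-memory-access-time relation used elsewhere in this course (it is not restated on these slides): AMAT = hit time + miss rate x miss penalty. A quick C check with purely illustrative numbers:

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: 1-cycle hit time, 5% miss rate, and the
       20-cycle miss penalty of the interleaved organization above.        */
    double hit_time = 1.0, miss_rate = 0.05, miss_penalty = 20.0;

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles\n", amat);   /* 2.00 cycles: on average,     */
    return 0;                               /* accesses run near cache speed */
}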