CS 61C: Great Ideas in Computer Architecture (Machine Structures): Caches. Instructor: Michael Greenbaum.


Review: Performance

- Latency vs. throughput. Time (seconds/program) is the performance measure:

      Seconds/Program = Instructions/Program x Clock cycles/Instruction x Seconds/Clock cycle

- Time measurement via clock cycles is machine specific.
- Power is of increasing concern, and is being added to benchmarks.
- Profiling tools (e.g., gprof) are a way to see where your program is spending its time.

New-School Machine Structures (It's a bit more complicated!)

Harness parallelism to achieve high performance:
- Parallel requests, assigned to a computer, e.g., search "Katz"
- Parallel threads, assigned to a core, e.g., lookup, ads
- Parallel instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
- Parallel data: >1 data item @ one time, e.g., add up 4 pairs of words
- Hardware descriptions: all gates @ one time

[Figure: hardware levels of a computer, from the warehouse scale computer and smart phone down through computer, core (with cache), memory, input/output, instruction unit(s), functional unit(s), and logic gates. Today's lecture: the cache.]

Storage in a Computer

- The processor holds data in the register file (~100 bytes); registers are accessed on a sub-nanosecond timescale.
- Memory (we'll call it "main memory") has more capacity than registers (~GBytes), but access time is ~50-100 ns. Hundreds of clock cycles per memory access?!

Historical Perspective

- 1989: first Intel CPU with cache on chip.
- 1998: Pentium III has two cache levels on chip.

[Figure: processor vs. memory performance, 1980-2000. CPU performance improves ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows ~50%/yr.]
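To see how quickly that gap compounds, here is a back-of-the-envelope calculation (not from the slides): a gap growing 50% per year for a decade widens by a factor of

    \[ 1.5^{10} \approx 57, \]

nearly 60x in ten years. This mismatch is what the rest of the lecture sets out to hide.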

Great Idea #3: Principle of Locality / Memory Hierarchy

Library Analogy

- You are writing a report on a specific topic, e.g., the works of J.D. Salinger.
- While at the library, you check out books and keep them on your desk. If you need more, you check them out and bring them to the desk, but you don't return earlier books, since you might need them again.
- Desk space is limited: which books do you keep?
- You hope this collection of ~10 books on your desk is enough to write the report, despite being only ~0.00001% of the books in the UC Berkeley libraries.

Locality

- Temporal locality (locality in time): you go back to the same book on the desk multiple times. If a memory location is referenced, it will tend to be referenced again soon.
- Spatial locality (locality in space): when you go to the shelf, you pick up multiple books on J.D. Salinger, since the library stores related books together. If a memory location is referenced, locations with nearby addresses will tend to be referenced soon.

Principle of Locality: programs access only a small portion of the address space at any instant of time. What program structures lead to temporal and spatial locality in code? In data? (A code sketch at the end of this section makes this concrete.)

How does hardware exploit the principle of locality? Offer a hierarchy of memories where:
- the level closest to the processor is fastest (and most expensive per bit, so smallest);
- the level furthest from the processor is largest (and least expensive per bit, so slowest).
The goal is to create the illusion of a memory almost as fast as the fastest memory and almost as large as the biggest memory in the hierarchy.

[Figure: memory hierarchy pyramid, from the processor through Level 1, Level 2, Level 3, ..., Level n. Increasing distance from the processor means decreasing speed and increasing size.]

As we move to deeper levels, the latency goes up and the price per bit goes down. Why?
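As promised above, loop ordering makes the locality question concrete. The sketch below is a minimal illustration, not from the slides (the array size N is arbitrary): both functions compute the same sum, but the row-major walk touches consecutive addresses and exploits spatial locality, while the column-major walk strides through memory and defeats it. On most machines the first runs noticeably faster for large N, even though both execute the same number of additions.

    #include <stdio.h>

    #define N 512

    static int a[N][N];

    /* Good spatial locality: C stores rows contiguously, so this walk
       touches consecutive addresses and uses every byte of each cache
       block that is brought in. */
    long sum_row_major(void) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Poor spatial locality: consecutive accesses are N*sizeof(int)
       bytes apart, so each access may land in a different cache block. */
    long sum_col_major(void) {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    int main(void) {
        printf("%ld %ld\n", sum_row_major(), sum_col_major());
        return 0;
    }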

Caches

- The processor/memory speed mismatch leads us to add a new level: a memory cache.
- It is implemented with the same integrated-circuit processing technology as the processor and integrated on-chip: faster but more expensive than DRAM memory.
- A cache is a copy of a subset of main memory.
- Modern processors have separate caches for instructions and data, as well as several levels of caches implemented in different sizes.
- As a pun, we often use $ ("cash") to abbreviate cache, e.g., D$ = data cache, I$ = instruction cache.

Memory Hierarchy Technologies

- Caches use SRAM (static RAM) for speed and technology compatibility:
  - Fast (typical access times of 0.5 to 2.5 ns).
  - Low density (6-transistor cells), higher power, expensive ($2000 to $4000 per GB in 2011).
  - Static: content lasts as long as power is on.
- Main memory uses DRAM (dynamic RAM) for size (density):
  - Slower (typical access times of 50 to 70 ns).
  - High density (1-transistor cells), lower power, cheaper ($20 to $40 per GB in 2011).
  - Dynamic: needs to be refreshed regularly (~every 8 ms), which consumes 1% to 2% of the active cycles of the DRAM.

Characteristics of the Memory Hierarchy

Access time increases with distance from the processor. A block is the unit of transfer between adjacent levels:

  Processor <-> L1$:              4-8 bytes (word)
  L1$ <-> L2$:                    8-32 bytes (block)
  L2$ <-> Main Memory:            16-128 bytes (block)
  Main Memory <-> Secondary:      4,096+ bytes (page)

The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

How is the Hierarchy Managed?

- Registers <-> memory: by the compiler (or assembly-level programmer).
- Cache <-> main memory: by the cache controller hardware.
- Main memory <-> disks (secondary storage): by the operating system (virtual memory, discussed later in the semester), with virtual-to-physical address mapping assisted by the hardware (TLB), and by the programmer (files).

Typical Memory Hierarchy

On-chip components: control, register file, instruction cache, and data cache; beyond them, a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk or flash).

  Level                        Speed (cycles)   Size (bytes)   Cost/bit
  Register file                1/2's            100's          highest
  L1 instruction/data cache    1's              10K's             .
  Second-level cache (SRAM)    10's             M's               .
  Main memory (DRAM)           100's            G's               .
  Secondary memory             1,000,000's      T's            lowest

The principle of locality plus the memory hierarchy presents the programmer with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.

Review So Far

- Wanted: the size of the largest memory available with the speed of the fastest memory available.
- Approach: a memory hierarchy, where successively lower levels contain the most-used data from the next higher level.
- Exploits temporal and spatial locality.

Administrivia

- Midterm: Friday 7/15, 9am-12pm, 2050 VLSB. How to study: studying in groups can help; take old exams for practice (link at the top of the main webpage); look at lectures, section notes, projects, homework, labs, etc.
- Midterm review session: TODAY, 4pm-6pm, Wozniak Lounge. It will cover material up through tomorrow's lecture.
- HW grades are up; check using glookup. Send questions about grading to your reader.
- Mid-session survey: a short survey to complete as part of Lab 7. Let us know how we're doing, and what we can do to improve!

Cache Management

- The cache is managed automatically by hardware. The operations available in hardware are limited, so the scheme needs to be relatively simple.
- Where in the cache do we put a block of data from memory?
- How do we find it when we need it?
- What overall organization of blocks do we impose on our cache?

Direct Mapped Caches

- Each memory block is mapped to exactly one block in the cache, so we only need to check that single location to see if the block is in the cache.
- The cache is smaller than memory, so multiple blocks in memory map to a single block in the cache! We need some way of determining the identity of the block.

Direct Mapped Caches

- Address mapping: (block address) modulo (# of blocks in the cache).
- The lower bits of the memory address determine which block in the cache the block is stored in.
- The upper bits of the memory address (the Tag) determine which block in memory the block came from.
- Memory address fields (for now): | Tag | Index |

[Figure: block mapping from memory into a 4-block cache (valid bit, tag, data per entry), using 4-bit memory block addresses and (block address) modulo (# of blocks in the cache).]

Full Address Breakdown

- The lowest bits of the address (the Offset) determine which byte within a block the address refers to.
- Full address format: | Tag | Index | Offset |
- An n-bit Offset means a block is how many bytes? An n-bit Index means the cache has how many blocks?

TIO Breakdown Summary

All fields are read as unsigned integers.
- Index: specifies the cache index (which "row"/block of the cache we should look in). I bits <=> 2^I blocks in the cache.
- Offset: once we've found the correct block, specifies which byte within the block we want (which "column" in the cache). O bits <=> 2^O bytes per block.
- Tag: the remaining bits after Offset and Index are determined; these distinguish between all the memory addresses that map to a given cache location.

Caching: A First Example

Main memory uses 6-bit addresses with one-word blocks; the two low-order bits define the byte within the block.
- Q: Is the memory block in the cache? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache.
- Q: Where in the cache is the memory block? Use the next 2 low-order memory address bits, the index, to determine which cache block, i.e., (block address) modulo (# of blocks in the cache).

[Figure: a 4-entry direct-mapped cache (valid bit, tag, data) against a 16-block main memory with addresses 0000xx through 1111xx.]

Multiword Block Direct Mapped Cache

[Figure: direct-mapped cache with four words per block and a total size of 1K words: a 20-bit tag, an 8-bit index selecting one of 256 entries, and a block offset selecting one of the four 32-bit data words; tag match plus valid bit produce the hit signal.]
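A minimal C sketch of the TIO breakdown (not from the slides), assuming the geometry of the multiword example above: a 4-bit offset (four 4-byte words per block) and an 8-bit index (256 blocks); the address value is arbitrary.

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 4   /* 2^4 = 16 bytes = four 4-byte words per block */
    #define INDEX_BITS  8   /* 2^8 = 256 blocks in the cache */

    int main(void) {
        uint32_t addr = 0x12345678;  /* arbitrary example address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);           /* low O bits */
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);         /* what's left */

        /* Keeping the low INDEX_BITS of the block address is exactly
           (block address) modulo (# of blocks in the cache), since the
           block count is a power of two. */
        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }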

Caching Terminology

When reading memory, 3 things can happen:
- Cache hit: the cache block is valid and contains the proper address, so read the desired word.
- Cache miss: nothing is in the cache in the appropriate block, so fetch from memory.
- Cache miss, block replacement: the wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory (the cache always holds a copy).

Direct Mapped Cache Example

Consider the sequence of memory address accesses 0 1 2 3 4 3 4 15, starting with an empty 4-block cache (all blocks initially marked as not valid):

  0 miss, 1 miss, 2 miss, 3 miss, 4 miss (replaces 0), 3 hit, 4 hit, 15 miss (replaces 3)

8 requests, 6 misses.

Taking Advantage of Spatial Locality

Let a cache block hold more than one word. For the same sequence 0 1 2 3 4 3 4 15, starting with an empty two-block, two-words-per-block cache (the simulator sketch below replays both configurations):

  0 miss (brings in 0 and 1), 1 hit, 2 miss (brings in 2 and 3), 3 hit, 4 miss (replaces 0 and 1), 3 hit, 4 hit, 15 miss (replaces 2 and 3)

8 requests, 4 misses.

Miss Rate vs Block Size vs Cache Size

[Figure: miss rate (%) vs block size (16 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.]

The miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same-size cache is smaller (increasing capacity misses).
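The two traces above can be checked mechanically. Below is a minimal direct-mapped cache simulator, not from the slides, that replays the access sequence and reproduces both miss counts; both configurations hold the same 4 words of data in total.

    #include <stdio.h>

    /* Simulate a direct-mapped cache of word addresses (up to 16 blocks)
       and count misses. A valid entry with a mismatched tag models a
       miss with block replacement. */
    static int count_misses(const int *seq, int n,
                            int words_per_block, int num_blocks) {
        int valid[16] = {0};
        int tag[16] = {0};
        int misses = 0;

        for (int i = 0; i < n; i++) {
            int block_addr = seq[i] / words_per_block;
            int index = block_addr % num_blocks;
            if (!valid[index] || tag[index] != block_addr) {
                misses++;              /* miss (possibly with replacement) */
                valid[index] = 1;
                tag[index] = block_addr;
            }
        }
        return misses;
    }

    int main(void) {
        int seq[] = {0, 1, 2, 3, 4, 3, 4, 15};
        int n = sizeof seq / sizeof seq[0];
        /* Same 4-word capacity in both configurations. */
        printf("4 one-word blocks: %d misses\n", count_misses(seq, n, 1, 4)); /* 6 */
        printf("2 two-word blocks: %d misses\n", count_misses(seq, n, 2, 2)); /* 4 */
        return 0;
    }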

Average Memory Access Time (AMAT)

Average Memory Access Time (AMAT) is the average time to access memory, considering both hits and misses:

  AMAT = Time for a hit + Miss rate x Miss penalty

What is the AMAT for a processor with a 200 ps clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache access time of 1 clock cycle?

  1 + 0.02 x 50 = 2 clock cycles, or 2 x 200 = 400 ps

Potential impact of a much larger cache on AMAT?
1) Lower miss rate.
2) Longer access time (hit time): smaller is faster. The increase in hit time will likely add another stage to the pipeline. At some point, the increase in hit time for a larger cache may overcome the improvement in hit rate, yielding a decrease in performance.

Measuring Cache Performance: Effect on CPI

Assuming cache hit costs are included as part of the normal CPU execution cycle:

  CPU time = IC x CPI x CC
           = IC x (CPI_ideal + average memory-stall cycles) x CC

where CPI_ideal plus the average memory-stall cycles is CPI_stall. A simple model for memory-stall cycles:

  Memory-stall cycles = accesses/instruction x miss rate x miss penalty

We will talk about writes and write misses next lecture, where it's a little more complicated.

Impacts of Cache Performance

- The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI), since memory speed is unlikely to improve as fast as processor cycle time. When calculating CPI_stall, the cache miss penalty is measured in the processor clock cycles needed to handle a miss.
- The lower the CPI_ideal, the more pronounced the impact of stalls.

Example: a processor with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates (a sketch reproducing these numbers appears after the conclusion):

  Memory-stall cycles = 2% x 100 + 36% x 4% x 100 = 3.44
  CPI_stall = 2 + 3.44 = 5.44

More than twice the CPI_ideal! What if the CPI_ideal is reduced to 1? What if the D$ miss rate went up by 1%?

And In Conclusion...

- Principle of Locality.
- A hierarchy of memories (speed/size/cost per bit) to exploit locality.
- Direct mapped cache: each block in memory maps to one block in the cache. The Index determines which cache block, the Offset determines which byte within the block, and the Tag determines whether it's the right block.
- AMAT to measure cache performance.
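As a postscript, a small C sketch (not from the slides) reproduces both worked examples above and answers the first follow-up question: with CPI_ideal reduced to 1, the same 3.44 stall cycles give CPI_stall = 4.44, so stalls grow from about 63% to about 77% of execution time.

    #include <stdio.h>

    int main(void) {
        /* AMAT example: 1-cycle hit time, 2% miss rate, 50-cycle miss
           penalty, 200 ps clock. */
        double amat = 1.0 + 0.02 * 50.0;
        printf("AMAT = %.0f cycles = %.0f ps\n", amat, amat * 200.0); /* 2, 400 */

        /* CPI_stall example: CPI_ideal = 2, 100-cycle miss penalty,
           36% load/store instructions, 2% I$ and 4% D$ miss rates. */
        double stalls = 1.00 * 0.02 * 100.0    /* every instruction accesses the I$ */
                      + 0.36 * 0.04 * 100.0;   /* 36% of instructions access the D$ */
        printf("CPI_stall = %.2f\n", 2.0 + stalls);           /* 5.44 */
        printf("with CPI_ideal = 1: %.2f\n", 1.0 + stalls);   /* 4.44 */
        return 0;
    }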