EEC 170 Computer Architecture, Fall 2005: Improving Cache Performance

Administrative

- Problem set #6 is posted: the last set of homework. You should be able to answer each problem in under 5 minutes.
- Quiz on Wednesday (12/7): Chapter 7 material (memory hierarchy, caches, virtual memory). The lowest quiz grade will be dropped.
- Last lecture (12/7): final review. I will email you a sample final by then. Problem solving: come prepared with questions to discuss.

Slides courtesy of Prof. Mary Jane Irwin (Penn State University).

Review: The Memory Hierarchy

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology. Moving away from the processor, the access time and the (relative) size of the memory at each level both increase:

- Processor <-> L1$: 4-8 bytes (word)
- L1$ <-> L2$: 8-32 bytes (block)
- L2$ <-> Main Memory: 1 to 4 blocks
- Main Memory <-> Secondary Memory: 1,024+ bytes (disk sector = page)

Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

Review: Principle of Locality

- Temporal locality: keep most recently accessed data items closer to the processor.
- Spatial locality: move blocks consisting of contiguous words to the upper levels.
- Hit Time << Miss Penalty.
- Hit: data appears in some block in the upper level (Blk X).
  - Hit Rate: the fraction of accesses found in the upper level.
  - Hit Time: RAM access time + time to determine hit/miss.
- Miss: data must be retrieved from a block in the lower level (Blk Y).
  - Miss Rate = 1 - Hit Rate.
  - Miss Penalty: time to replace a block in the upper level with a block from the lower level, plus the time to deliver that block's word to the processor.
  - Miss types: compulsory, conflict, capacity.

The Art of Memory System Design

A workload or benchmark program generates a stream of memory references <op,addr>, <op,addr>, <op,addr>, ... (op: i-fetch, read, write) from the processor to the cache ($) and memory. The goal is to optimize the memory system organization to minimize the average memory access time (see the AMAT sketch at the end of this section) for typical workloads.

Impact of the Memory Hierarchy on Algorithms

Today CPU time is a function of (ops, cache misses). What does this mean to compilers, data structures, and algorithms?

- Quicksort: the fastest comparison-based sorting algorithm when keys fit in memory; complexity O(n log n).
- Radix sort: also called linear-time sort; complexity O(n). For keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys.

"The Influence of Caches on the Performance of Sorting" by A. LaMarca and R. E. Ladner, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997, pp. 370-379. The measurements below are for an Alphastation 250 with 32-byte blocks and a direct-mapped 2 MB L2 cache, 8-byte keys, and job sizes from 4,000 to 4,000,000 keys.
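The hit and miss quantities reviewed above combine into the average memory access time (AMAT) that the rest of the lecture works to minimize. A minimal Python sketch of that calculation follows; it is my own illustration, and the numeric values in the example call are assumptions rather than numbers from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: every access pays the hit
    time, and a fraction miss_rate additionally pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative values only (not from the slides): 1-cycle hit,
# 5% miss rate, 100-cycle miss penalty.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))  # 6.0 cycles
```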

Quicksort vs. Radix Sort as the Number of Keys Varies

[Charts from LaMarca and Ladner: instructions per key, instructions and time (clocks per key), and cache misses per key, plotted against job size in keys.] Radix sort executes fewer instructions per key than quicksort for large jobs, but its cache misses per key grow with job size while quicksort's stay low, so quicksort ends up faster in clocks per key. What is the proper approach to fast algorithms? Account for the memory hierarchy, not just the instruction count.

Consider a Simplified System

A single on-chip cache (like your bookshelf), much smaller than main memory, sits between the processor (control, datapath, registers) and main memory (DRAM).

- How do we build it? (big/small?)
- How do we look things up in it? (search/index/etc.?)
- How do we manage it? (when we run out of space?)

Measuring Cache Performance

Assuming cache hit costs are included as part of the normal CPU execution cycle, then

  CPU time = IC × CPI × CC = IC × (CPI_ideal + Memory-stall cycles per instruction) × CC

where CPI_stall = CPI_ideal + memory-stall cycles per instruction. Memory-stall cycles come primarily from cache misses (a sum of read-stalls and write-stalls):

  Read-stall cycles = reads/program × read miss rate × read miss penalty
  Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls

For write-through caches, we can simplify this to

  Memory-stall cycles = miss rate × miss penalty

Impacts of Cache Performance

- The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI): the memory system is unlikely to improve as fast as the processor cycle time.
- When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss.
- The lower the CPI_ideal, the more pronounced the impact of stalls.

Example: a processor with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates:

  Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44

So CPI_stall = 2 + 3.44 = 5.44. What if the CPI_ideal is reduced to 1? What if the processor clock rate is doubled (doubling the miss penalty)? (The sketch below reproduces this arithmetic.)
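A short Python sketch reproducing the example above and its two what-if variants; this is my own restatement of the slide's arithmetic (the function name and default parameters are assumptions, not course code).

```python
def cpi_with_stalls(cpi_ideal, miss_penalty, ld_st_frac=0.36,
                    i_miss_rate=0.02, d_miss_rate=0.04):
    """CPI including memory-stall cycles for split I$/D$: every
    instruction is fetched (I$), and ld_st_frac of instructions
    also access the D$."""
    stalls = i_miss_rate * miss_penalty + ld_st_frac * d_miss_rate * miss_penalty
    return cpi_ideal + stalls

print(cpi_with_stalls(2, 100))   # 5.44  (base example)
print(cpi_with_stalls(1, 100))   # 4.44  (CPI_ideal reduced to 1)
print(cpi_with_stalls(2, 200))   # 8.88  (clock doubled -> penalty doubles in cycles)
```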

Reducing Cache Miss Rates, Approach #1: Allow more flexible block placement

- In a direct mapped cache, a memory block maps to exactly one cache block.
- At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache.
- A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative). A memory block maps to a unique set, specified by the index field, and can be placed in any way of that set (so there are n choices):

  set index = (block address) modulo (# sets in the cache)

Set Associative Cache Example

[Diagram: a two-way set associative cache with two sets (four one-word blocks in all), each entry holding a valid bit (V), a tag, and a data word, backed by a 16-word main memory.]

- Q2: How do we find it? The two low order address bits define the byte in the word (32-bit words, one-word blocks); the next low order memory address bit determines which cache set, i.e. (block address) modulo (number of sets in the cache).
- Q1: Is it there? Compare all the cache tags in the selected set to the high order 3 memory address bits to tell if the memory block is in the cache.

Another Reference String Mapping

Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid). Words 0 and 4 map to the same set but can occupy its two different ways:

  0 miss, 4 miss, 0 hit, 4 hit, 0 hit, 4 hit, 0 hit, 4 hit

After the two compulsory misses bring Mem(0) and Mem(4) into the set, every later reference hits. This solves the ping-pong effect seen in a direct mapped cache due to conflict misses, since now two memory locations that map into the same cache set can co-exist (the simulation sketch below counts the misses for both organizations).

Four-Way Set Associative Cache

[Diagram: a four-way set associative cache with 2^8 = 256 sets, each with four ways (each way holding one block). The 8-bit index selects a set, the four tags in that set are compared in parallel with the address tag, and a 4-to-1 multiplexor driven by the hit signals selects the matching way's data.]

Range of Set Associative Caches

For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets; it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
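To make the ping-pong contrast concrete, here is a small word-addressed cache simulation; it is my own sketch (not the course's simulator), assuming one-word blocks and LRU replacement within a set.

```python
def misses(refs, num_blocks, ways):
    """Count misses for a word-addressed cache with one-word blocks:
    num_blocks total blocks, 'ways' blocks per set, LRU within a set."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set: list of tags, MRU last
    miss_count = 0
    for addr in refs:
        s, tag = addr % num_sets, addr // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)            # hit: move to MRU position
        else:
            miss_count += 1
            if len(sets[s]) == ways:
                sets[s].pop(0)             # evict the LRU block
        sets[s].append(tag)
    return miss_count

refs = [0, 4, 0, 4, 0, 4, 0, 4]
print(misses(refs, num_blocks=4, ways=1))  # 8 misses: direct mapped ping-pongs
print(misses(refs, num_blocks=4, ways=2))  # 2 misses: 0 and 4 co-exist in set 0
```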

Range of Set Associative Caches (continued)

[Address-field diagram: Tag | Index | Block offset | Byte offset.]

- The tag is used for the tag compare; the index selects the set; the block offset selects the word in the block.
- Decreasing associativity moves toward direct mapped (only one way): smaller tags, larger index.
- Increasing associativity moves toward fully associative (only one set): no index at all, and the tag is all of the address bits except the block and byte offset.

Costs of Set Associative Caches

- When a miss occurs, which way's block do we pick for replacement? Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time.
  - Must have hardware to keep track of when each way's block was used relative to the other blocks in the set.
  - For 2-way set associative, this takes one bit per set: set the bit when a block is referenced (and reset the other way's bit).
- An N-way set associative cache costs:
  - N comparators (delay and area).
  - MUX delay (way selection) before data is available.
  - Data is available only after way selection (and the Hit/Miss decision). In a direct mapped cache the block is available before the Hit/Miss decision, so it is possible to assume a hit and continue, recovering later if it was a miss; with a set associative cache that is not possible.

Benefits of Set Associative Caches

The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.

[Chart: miss rate versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes of 4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, and 512KB; from Hennessy & Patterson, Computer Architecture, 2003.]

The largest gains come from going from direct mapped to 2-way (a 20%+ reduction in miss rate).

Reducing Cache Miss Rates, Approach #2: Use multiple levels of caches

With advancing technology we have more than enough room on the die for bigger L1 caches or for a second level of caches, normally a unified L2 cache (holding both instructions and data), and in some cases even a unified L3 cache.

For our example (CPI_ideal of 2, 100-cycle miss penalty to main memory, 36% load/stores, 2% L1 I$ and 4% L1 D$ miss rates), add a unified L2$ that has a 25-cycle miss penalty and a 0.5% miss rate:

  CPI_stall = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100 = 3.54

as compared to 5.44 with no L2$ (the sketch below reproduces this).

Multilevel Cache Design Considerations

- Design considerations for L1 and L2 caches are very different:
  - The primary cache should focus on minimizing hit time in support of a shorter clock cycle: smaller, with smaller block sizes.
  - The secondary cache(s) should focus on reducing the miss rate, to reduce the penalty of long main memory access times: larger, with larger block sizes.
- The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so the L1 can be smaller (i.e., faster) but have a higher miss rate.
- For the L2 cache, hit time is less important than miss rate:
  - The L2$ hit time determines the L1$'s miss penalty.
  - The L2$ local miss rate is much greater than the global miss rate.

Key Cache Design Parameters

                               L1 typical       L2 typical
  Total size (blocks)          250 to 2000      4000 to 250,000
  Total size (KB)              16 to 64         500 to 8000
  Block size (B)               32 to 64         32 to 128
  Miss penalty (clocks)        10 to 25         100 to 1000
  Miss rates (global for L2)   2% to 5%         0.1% to 2%
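Extending the earlier CPI sketch to two cache levels reproduces the 3.54 figure; the split between L1 and L2 stall terms follows the slide's formula, but the function itself is my own illustration.

```python
def cpi_two_level(cpi_ideal, l2_penalty, mem_penalty, ld_st_frac=0.36,
                  l1_i_miss=0.02, l1_d_miss=0.04, l2_miss=0.005):
    """CPI with split L1 I$/D$ and a unified L2: L1 misses pay the L2
    access penalty; references that also miss in L2 (global miss rate
    l2_miss) pay the main-memory penalty."""
    l1_stalls = (l1_i_miss + ld_st_frac * l1_d_miss) * l2_penalty
    l2_stalls = (1 + ld_st_frac) * l2_miss * mem_penalty
    return cpi_ideal + l1_stalls + l2_stalls

print(cpi_two_level(2, l2_penalty=25, mem_penalty=100))  # 3.54, vs. 5.44 with no L2
```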

Two Machines' Cache Parameters

                      Intel P4                                 AMD Opteron
  L1 organization     Split I$ and D$                          Split I$ and D$
  L1 cache size       8KB for D$, 96KB for trace cache (~I$)   64KB for each of I$ and D$
  L1 block size       64 bytes                                 64 bytes
  L1 associativity    4-way set assoc.                         2-way set assoc.
  L1 replacement      ~LRU                                     LRU
  L1 write policy     write-through                            write-back
  L2 organization     Unified                                  Unified
  L2 cache size       512KB                                    1024KB (1MB)
  L2 block size       128 bytes                                64 bytes
  L2 associativity    8-way set assoc.                         16-way set assoc.
  L2 replacement      ~LRU                                     ~LRU
  L2 write policy     write-back                               write-back

4 Questions for the Memory Hierarchy

- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

Q1 & Q2: Where can a block be placed/found?

                     Direct mapped           Set associative                          Fully associative
  # of sets          # of blocks in cache    (# of blocks in cache) / associativity   1
  Blocks per set     1                       Associativity (typically 2 to 16)        # of blocks in cache
  Location method    Index                   Index the set; compare the set's tags    Compare all blocks' tags
  # of comparisons   1                       Degree of associativity                  # of blocks

(The sketch at the end of this section computes the resulting set counts and field widths.)

Q3: Which block should be replaced on a miss?

- Easy for direct mapped: there is only one choice.
- Set associative or fully associative: Random or LRU (Least Recently Used).
- For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU.
- LRU is too costly to implement for high levels of associativity (> 4-way), since tracking the usage information is costly.

Q4: What happens on a write?

- Write-through: the information is written both to the block in the cache and to the block in the next lower level of the memory hierarchy. Write-through is always combined with a write buffer so that write waits to lower level memory can be eliminated (as long as the write buffer doesn't fill).
- Write-back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. Needs a dirty bit to keep track of whether the block is clean or dirty.
- Pros and cons of each?
  - Write-through: read misses don't result in writes (so it is simpler and cheaper).
  - Write-back: repeated writes require only one write to the lower level.

Improving Cache Performance

(1) Reduce the miss rate:
- bigger cache
- more flexible placement (increase associativity)
- larger blocks (16 to 64 bytes typical)
- victim cache: a small buffer holding the most recently discarded blocks

(2) Reduce the miss penalty:
- smaller blocks
- use a write buffer to hold dirty blocks being replaced, so a read miss doesn't have to wait for the write to complete
- check the write buffer on a read miss; you may get lucky
- for large blocks, fetch the critical word first
- use multiple cache levels; the L2 cache is not tied to the CPU clock rate
- faster backing store / improved memory bandwidth: wider buses, memory interleaving, page-mode DRAMs
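As a companion to the Q1 and Q2 table, this helper (my own sketch; the function name and the 32-bit address width are assumptions) computes the number of sets and the tag/index/offset field widths, showing how each doubling of associativity shrinks the index by one bit and grows the tag by one bit.

```python
import math

def address_breakdown(cache_bytes, block_bytes, ways, addr_bits=32):
    """Return (tag, index, offset) bit widths for an n-way set
    associative cache; ways=1 is direct mapped, ways=total blocks
    is fully associative."""
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(num_sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# Fixed 16 KB cache with 64-byte blocks: doubling the associativity
# halves the sets, shrinking the index by 1 bit and growing the tag by 1 bit.
for ways in (1, 2, 4, 8):
    print(ways, address_breakdown(16 * 1024, 64, ways))
# 1 (18, 8, 6)   2 (19, 7, 6)   4 (20, 6, 6)   8 (21, 5, 6)
```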

(3) Reduce the time to hit in the cache:
- smaller cache
- direct mapped cache
- smaller blocks
- for writes:
  - no write allocate: no "hit" on the cache, just write to the write buffer
  - write allocate: to avoid two cycles (first check for hit, then write), pipeline writes via a delayed write buffer to the cache

Summary: The Cache Design Space

- Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation.
- The optimal choice is a compromise: it depends on access characteristics (the workload, and the use as I-cache, D-cache, or TLB) and on technology and cost.
- Simplicity often wins.

[Figure: the cache design space sketched as a curve from Good to Bad over "Factor A (less to more)" vs. "Factor B", for dimensions such as cache size, associativity, and block size.]

Recap, Q1: Where can a block be placed in the upper level?

Example: memory block 12 placed in an 8-block cache (the sketch below checks these placements):
- Fully associative: block 12 can go anywhere (frames 0-7).
- Direct mapped: block 12 can go only into frame 4 (12 mod 8).
- 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4).

Set-associative mapping: set = (block number) modulo (number of sets).

Recap, Q2: How is a block found if it is in the upper level?

The block-frame address is split into Tag | Index | Block offset; the index selects the set and the block offset selects the data within the block. Lookup uses direct indexing (with the index and block offset), tag compares, or a combination. Increasing associativity shrinks the index and expands the tag.

Recap, Q3: Which block should be replaced on a miss?

Easy for direct mapped. For set associative or fully associative: Random or LRU (Least Recently Used). Measured data-cache miss rates:

                 2-way             4-way             8-way
  Size           LRU     Random    LRU     Random    LRU     Random
  16 KB          5.2%    5.7%      4.7%    5.3%      4.4%    5.0%
  64 KB          1.9%    2.0%      1.5%    1.7%      1.4%    1.5%
  256 KB         1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

Know This!

- Calculate runtime given cache statistics (miss rate, miss penalty, etc.).
- Calculate Average Memory Access Time (AMAT).
- Understand direct mapped, set-associative, and fully associative caches:
  - the comparison between them
  - the sources of cache misses
  - their architecture and addressing into them (tag, index, byte offset)
  - cache behavior with them
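A quick check of the Recap Q1 placements; this is my own sketch, and it assumes the textbook layout in which each set occupies consecutive cache frames.

```python
def candidate_frames(block_number, num_frames, ways):
    """Frames of a num_frames-block cache where a memory block may be
    placed, for a given associativity (ways blocks per set)."""
    num_sets = num_frames // ways
    s = block_number % num_sets
    return [s * ways + w for w in range(ways)]

# Block 12 in an 8-block cache:
print(candidate_frames(12, 8, 1))  # [4]     direct mapped: 12 mod 8 = 4
print(candidate_frames(12, 8, 2))  # [0, 1]  2-way: anywhere in set 0 (12 mod 4)
print(candidate_frames(12, 8, 8))  # [0..7]  fully associative: anywhere
```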

Sample Problem: Impact on Performance

Suppose a processor executes with:
- Clock rate = 2 GHz (0.5 ns per cycle)
- Base CPI = 1.1 (assuming 1-cycle cache hits)
- Instruction mix: 50% arith/logic, 30% ld/st, 20% control

Suppose that 10% of data memory operations (lw, sw) incur a 50-cycle miss penalty, and that 1% of instructions incur the same miss penalty. What is the CPI? What is the AMAT?

Sample Problem: Impact on Performance (solution)

  CPI = Base CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
        + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1

64.5% of the time the processor is stalled waiting for memory!

  AMAT = (1/1.3)×[1 + 0.01×50] + (0.3/1.3)×[1 + 0.10×50] = 2.54 cycles

(The script below reproduces these numbers.)

Sample Problem: Cache Performance

Processor: CPI = 2 (ideal); I-cache miss rate = 2% with a 100-cycle miss penalty; D-cache miss rate = 4% with a 100-cycle miss penalty. Using the SPECint load/store percentage of 36%:
- What is the speedup from this processor to one that never missed?
- What if CPI = 1?
- What if we doubled the clock rate of the computer without changing the memory speed (CPI = 2)?

Calculating Cache Performance, CPI = 2

  I$ miss cycles = I × 2% × 100 = 2.00 × I
  D$ miss cycles = I × 36% × 4% × 100 = 1.44 × I
  So memory stalls per instruction = 2.00 + 1.44 = 3.44

  CPU time with stalls / CPU time with no stalls
      = (I × CPI_stall × clock cycle) / (I × CPI_perf × clock cycle)
  CPI_stall = 2 + 3.44 = 5.44; CPI_perf = 2
  CPI_stall / CPI_perf = 5.44 / 2 = 2.72

Calculating Cache Performance, CPI = 1

  I$ miss cycles = I × 2% × 100 = 2.00 × I
  D$ miss cycles = I × 36% × 4% × 100 = 1.44 × I
  So memory stalls per instruction = 2.00 + 1.44 = 3.44

  CPI_stall = 1 + 3.44 = 4.44; CPI_perf = 1
  CPI_stall / CPI_perf = 4.44 / 1 = 4.44

Calculating Cache Performance, 2x Clock Rate

Doubling the clock without changing the memory doubles the miss penalty to 200 cycles:

  I$ miss cycles = I × 2% × 200 = 4.00 × I
  D$ miss cycles = I × 36% × 4% × 200 = 2.88 × I
  So memory stalls per instruction = 4.00 + 2.88 = 6.88

  CPU time (slow clk) / CPU time (fast clk)
      = (I × CPI_slow × clock cycle) / (I × CPI_fast × clock cycle / 2)
  CPI_fast = 2 + 6.88 = 8.88; CPI_slow = 5.44
  Speedup = 5.44 / (8.88 × 0.5) = 1.23

An ideal machine would be 2x faster; with the memory system unchanged, doubling the clock yields only a 1.23x speedup.
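A short script that reproduces the first sample problem's CPI, stall fraction, and AMAT; it is my own check of the arithmetic above, not course code.

```python
# Values from the slide: base CPI 1.1, 30% loads/stores, 10% data-miss
# rate, 1% instruction-miss rate, 50-cycle miss penalty.
base_cpi, ld_st, penalty = 1.1, 0.30, 50
d_miss, i_miss = 0.10, 0.01

stalls = ld_st * d_miss * penalty + 1.0 * i_miss * penalty   # 1.5 + 0.5
cpi = base_cpi + stalls                                       # 3.1
stall_fraction = stalls / cpi                                 # ~0.645

accesses = 1.0 + ld_st                                        # fetches + data refs per instr
amat = (1.0 / accesses) * (1 + i_miss * penalty) \
     + (ld_st / accesses) * (1 + d_miss * penalty)            # ~2.54 cycles

print(round(cpi, 2), round(stall_fraction, 3), round(amat, 2))  # 3.1 0.645 2.54
```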