CS152 Computer Architecture and Engineering, Lecture 19: Caches and TLBs. 11/03/99, UCB Fall 1999.

Transcription:

Recap: Who Cares About the Memory Hierarchy? CS152 Computer Architecture and Engineering, Lecture 19: Caches and TLBs. November 3, 1999. John Kubiatowicz (http.cs.berkeley.edu/~kubitron); lecture slides: http//www-inst.eecs.berkeley.edu/~cs152/

Processor-DRAM Memory Gap (latency): processor performance grows about 60% per year (Moore's Law, roughly 2X every 1.5 years) while DRAM latency improves only about 9% per year (2X every 10 years), so the processor-memory performance gap grows about 50% per year. (Figure: performance versus year, 1980-2000.)

Recap: Levels of the Memory Hierarchy. Going from the processor down through cache, main memory, disk, and tape, the units of transfer are instructions and operands, blocks, pages, and files; the upper levels are faster, the lower levels are larger.

Recap: exploit locality to achieve fast memory. Two different types of locality: Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon. By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology while providing access at the speed offered by the fastest technology. DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time.

Recap: memory performance equations. CPU Time = IC x CT x (ideal CPI + memory stalls per instruction), where memory stalls per instruction = average accesses per instruction x miss rate x miss penalty = (average IFETCH per instruction x instruction miss rate x instruction miss penalty) + (average data accesses per instruction x data miss rate x data miss penalty). This assumes that the ideal CPI already includes hit times. Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty).

The Big Picture: Where are We Now? The five classic components of a computer: processor (control and datapath), memory, input, and output. Today's topics: recap of the last lecture, simple caching techniques, the many ways to improve cache performance, and (time permitting) virtual memory.

The Art of Memory System Design. A workload or benchmark program running on the processor generates a reference stream <op,addr>, <op,addr>, <op,addr>, ... where op is i-fetch, read, or write; the job is to optimize the memory system organization (cache and main memory) to minimize the average memory access time for typical workloads.

Example: a 1 KB direct-mapped cache with 32 B blocks. For a 2^N-byte cache, the uppermost (32 - N) address bits are always the tag and the lowest M bits are the byte select (block size = 2^M). Here a 32-bit address splits into a cache tag (example: 0x50, stored as part of the cache state), a cache index (example: 0x01), and a byte select (example: 0x00); each cache entry holds a valid bit, the tag, and a 32-byte data block.
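
To make the field widths above concrete, here is a minimal C sketch of the address split for the 1 KB direct-mapped, 32 B block example; the example address is made up but is chosen so its fields come out to tag 0x50, index 1, byte 0.

```c
#include <stdint.h>
#include <stdio.h>

/* Address breakdown for a 1 KB direct-mapped cache with 32 B blocks:
 * 5 byte-select bits, 5 index bits (32 lines), 22 tag bits. */
#define BLOCK_BITS 5
#define INDEX_BITS 5

static uint32_t byte_select(uint32_t addr) { return addr & ((1u << BLOCK_BITS) - 1); }
static uint32_t cache_index(uint32_t addr) { return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t cache_tag(uint32_t addr)   { return addr >> (BLOCK_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0x00014020;   /* made-up example address */
    printf("tag=0x%x index=%u byte=%u\n",
           cache_tag(addr), cache_index(addr), byte_select(addr));
    return 0;
}
```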

Block Size Tradeoff. In general, a larger block size takes advantage of spatial locality, BUT a larger block size means a larger miss penalty (it takes longer to fill up the block), and if the block size is too big relative to the cache size the miss rate will go up (too few cache blocks). In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate. (Figures: miss rate first drops as larger blocks exploit spatial locality, then rises as fewer blocks compromise temporal locality; miss penalty grows with block size; average access time therefore has a minimum at an intermediate block size.)

Extreme Example: a single-line cache. Cache size = 4 bytes, block size = 4 bytes, only ONE entry in the cache. If an item is accessed, it is likely to be accessed again soon, but it is unlikely to be accessed again immediately, so the next access will likely be a miss again: the cache continually loads data but discards (forces out) it before it is used again. This is the worst nightmare of a cache designer, the Ping Pong Effect. Conflict misses are misses caused by different memory locations mapped to the same cache index. Solution 1: make the cache size bigger. Solution 2: provide multiple entries for the same cache index.

Another Extreme Example: Fully Associative. Forget about the index; compare the tags of all cache entries in parallel. Example: with 32 B blocks, we need N 27-bit comparators. By definition, conflict misses = 0 for a fully associative cache.

Set Associative Cache. N-way set associative means N entries for each cache index: N direct-mapped caches operating in parallel. Example: a two-way set-associative cache. The cache index selects a set, the two tags in the set are compared to the input address tag in parallel, and the data is selected based on the tag comparison result.

Disadvantage of Set-Associative Caches: N-way set associative versus direct mapped. N comparators instead of 1, and an extra MUX delay for the data; the data comes AFTER the hit/miss decision and set selection. In a direct-mapped cache the block is available BEFORE the hit/miss decision, so it is possible to assume a hit and continue, recovering later on a miss.

A Summary on Sources of Cache Misses. Compulsory (cold start, process migration, first reference): the first access to a block; a cold fact of life, not a whole lot you can do about it. Note: if you are going to run billions of instructions, compulsory misses are insignificant. Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity. Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size. Coherence (invalidation): another process (e.g., I/O) updates memory.

Sources of Misses Quiz and Answer (assume constant cost), comparing direct mapped, N-way set associative, and fully associative (choices: zero, low, medium, high, same). Cache size for constant cost: big for direct mapped, medium for set associative, small for fully associative. Compulsory misses: the same for all three (and insignificant if you run billions of instructions). Conflict misses: high for direct mapped, medium for set associative, zero for fully associative. Capacity misses: low, medium, and high respectively (the smaller the cache, the more capacity misses). Coherence misses: the same for all three.

Administrative Issues. Lab 6 breakdowns due by 5pm tonight! You should be reading Chapter 7 of your book. The second midterm is in 2 weeks (Wednesday, November 17th): pipelining (hazards, branches, forwarding, CPI calculations, and possibly something on dynamic scheduling), the memory hierarchy, possibly something on I/O (depending on where we get in lectures), and possibly something on power (Broderson lecture).

Computers in the news (last term): an IBM breakthrough, Tunneling Magnetic Junction RAM (TMJ-RAM): the speed of SRAM, the density of DRAM, and non-volatile (no refresh). It belongs to a new field called spintronics, the combination of quantum spin and electronics; the same technology is used in high-density disk drives. (Figure: structure of a tunneling magnetic junction.)

Recap: Four Questions for Caches and the Memory Hierarchy. Q1: Where can a block be placed in the upper level? (block placement) Q2: How is a block found if it is in the upper level? (block identification) Q3: Which block should be replaced on a miss? (block replacement) Q4: What happens on a write? (write strategy)

Q1: Where can a block be placed in the upper level? Example: block 12 placed in an 8-block cache that is fully associative, direct mapped, or 2-way set associative (set-associative mapping = block number modulo the number of sets). Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4). A small arithmetic sketch of this mapping follows below.
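
A minimal sketch of the block-placement arithmetic for the three organizations, using the slide's 8-block cache and memory block 12:

```c
#include <stdio.h>

/* set = block number mod number of sets; the number of sets depends on
 * the organization (direct mapped, N-way set associative, fully associative). */
static unsigned set_index(unsigned block_no, unsigned num_sets) {
    return block_no % num_sets;
}

int main(void) {
    unsigned blocks = 8, block_no = 12;
    printf("direct mapped  : set %u of %u\n", set_index(block_no, blocks), blocks);          /* 12 mod 8 = 4 */
    printf("2-way set assoc: set %u of %u\n", set_index(block_no, blocks / 2), blocks / 2);  /* 12 mod 4 = 0 */
    printf("fully assoc    : set %u of %u\n", set_index(block_no, 1), 1u);                   /* always set 0 */
    return 0;
}
```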

Q2: How is a block found if it is in the upper level? The block address is split into a tag and an index, followed by a block offset; lookup uses direct indexing (with the index and block offset), tag comparison, or a combination, plus data select. Increasing associativity shrinks the index and expands the tag.

Q3: Which block should be replaced on a miss? Easy for direct mapped; for set-associative or fully associative caches, random or LRU (least recently used). Measured data-cache miss rates (LRU vs. random): 16 KB: 5.2% vs. 5.7% (2-way), 4.7% vs. 5.3% (4-way), 4.4% vs. 5.0% (8-way); 64 KB: 1.9% vs. 2.0%, 1.5% vs. 1.7%, 1.4% vs. 1.5%; 256 KB: 1.15% vs. 1.17%, 1.13% vs. 1.13%, 1.12% vs. 1.12%.

Q4: What happens on a write? Write through: the information is written to both the block in the cache and the block in the lower-level memory. Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?). Pros and cons of each: with write through, read misses cannot result in writes; with write back, there are no repeated writes to the same location. Write through is always combined with write buffers so that the processor doesn't wait for the lower-level memory.

Write Buffer for Write Through. A write buffer is needed between the cache and memory: the processor writes data into the cache and the write buffer, and the memory controller writes the contents of the buffer to memory. The write buffer is just a FIFO, with a typical number of entries of 4. It works fine if the store frequency (with respect to time) is well below 1 / (DRAM write cycle); if the store frequency approaches 1 / (DRAM write cycle), the result is write buffer saturation, the memory system designer's nightmare.
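
As an aside on Q3, LRU is especially cheap for 2-way set-associative caches: one bit per set suffices. A minimal sketch, with a made-up set count:

```c
#include <stdio.h>

/* For a 2-way set-associative cache, the LRU state per set is one bit
 * naming the way that was NOT used most recently. NUM_SETS is illustrative. */
#define NUM_SETS 64
static unsigned char lru_way[NUM_SETS];   /* way to evict next, per set */

static void touch(unsigned set, unsigned way) { lru_way[set] = way ^ 1u; }
static unsigned victim(unsigned set)          { return lru_way[set]; }

int main(void) {
    touch(5, 0);                                      /* hit in way 0 of set 5 ... */
    printf("victim in set 5: way %u\n", victim(5));   /* ... so way 1 is now LRU */
    return 0;
}
```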

Write Buffer Saturation. If the store frequency (with respect to time) exceeds 1 / (DRAM write cycle) for a long period of time (the CPU cycle time is too short and/or there are too many store instructions in a row), the store buffer will overflow no matter how big you make it, and the CPU cycle time is effectively limited by the DRAM write cycle time. Solutions for write buffer saturation: use a write-back cache, or install a second-level (L2) cache between the write buffer and DRAM (does this always work?).

Write-miss Policy: Write Allocate versus Not Allocate. Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the block? Yes: write allocate. No: write not allocate.

Impact of the Memory Hierarchy on Algorithms. Today CPU time is a function of (ops, cache misses), not just f(ops): what does this mean to compilers, data structures, and algorithms? See "The Influence of Caches on the Performance of Sorting" by A. LaMarca and R.E. Ladner, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997, 370-379. Quicksort: the fastest comparison-based sorting algorithm when all keys fit in memory. Radix sort: also called linear-time sort, because for keys of fixed length and fixed radix a constant number of passes over the data is sufficient, independent of the number of keys. Measurements were made on an Alphastation 250 with 32-byte blocks, a direct-mapped 2 MB L2 cache, and 8-byte keys, for job sizes from thousands up to about 4 million keys.

Quicksort vs. radix sort as the number of keys varies: instructions. (Figure: instructions per key versus job size; radix sort executes fewer instructions per key than quicksort.)

Quicksort vs. radix sort as the number of keys varies: instructions and time. (Figure: instructions per key and clocks per key versus job size; although radix sort executes fewer instructions per key, its clocks per key grow sharply for large jobs while quicksort's stay nearly flat.)

Quicksort vs. radix sort as the number of keys varies: cache misses. (Figure: cache misses per key versus job size; radix sort's misses per key climb with job size while quicksort's remain low.) What is the proper approach to fast algorithms?

How Do You Design a Cache? The set of operations that must be supported: read (Data <= Mem[Physical Address]) and write (Mem[Physical Address] <= Data). Viewed as a black box with a physical address and read/write control in and data out; inside it has tag and data storage, muxes, comparators, and so on. The design approach: determine the internal register transfers, design the datapath, and design the cache controller (datapath control points and signals; a controller with read/write, active/wait, and miss/invalid states).

Impact on Cycle Time. The cache hit time is directly tied to the clock rate; it increases with cache size and with associativity. Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty, and CPU Time = IC x CT x (ideal CPI + memory stalls).

What happens on a cache miss? For an in-order pipeline there are two options. Option 1: freeze the pipeline in the Mem stage (popular early on: Sparc, R4000); the pipeline looks like IF ID EX Mem stall stall ... Mem Wr, with the following instruction stalled behind it. Option 2: use Full/Empty bits on the registers plus an MSHR queue, where MSHR = Miss Status/Handler Register (Kroft). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line: per cache line it keeps info about the memory address and, for each word, the register (if any) that is waiting for the result; it is also used to merge multiple requests to one memory line. A new load creates an MSHR entry and sets the destination register to Empty, and the load is released from the pipeline; an attempt to use the register before the result returns causes the instruction to block in the decode stage. This gives limited out-of-order execution with respect to loads and is popular with in-order superscalar architectures; out-of-order pipelines already have this functionality built in (load queues, etc.).

Improving Cache Performance: 3 general options. CPU Time = IC x CT x (ideal CPI + memory stalls), and Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty) = (Hit Rate x Hit Time) + (Miss Rate x Miss Time). 1. Reduce the miss rate, 2. reduce the miss penalty, or 3. reduce the time to hit in the cache. A worked example of these equations appears below.

3Cs Absolute Miss Rate (SPEC92). (Figure: miss rate versus cache size from 1 KB to 128 KB for 1-way, 2-way, 4-way, and 8-way associativity, broken into conflict, capacity, and compulsory components; compulsory misses are vanishingly small.)

2:1 Cache Rule: the miss rate of a 1-way (direct-mapped) cache of size X is about the same as the miss rate of a 2-way set-associative cache of size X/2. (Figure: the same miss-rate breakdown, illustrating the rule.)
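
A minimal sketch of the AMAT and memory-stall equations above; all of the input numbers are made up for illustration.

```c
#include <stdio.h>

/* AMAT = HitTime + MissRate x MissPenalty;
 * CPI  = ideal CPI + accesses/inst x MissRate x MissPenalty. */
int main(void) {
    double hit_time          = 1.0;   /* cycles */
    double miss_rate         = 0.05;  /* 5% of accesses miss */
    double miss_penalty      = 50.0;  /* cycles to service a miss */
    double accesses_per_inst = 1.3;   /* ifetch + loads/stores per instruction */
    double ideal_cpi         = 1.1;   /* already includes hit time */

    double amat   = hit_time + miss_rate * miss_penalty;
    double stalls = accesses_per_inst * miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles, CPI = %.2f\n", amat, ideal_cpi + stalls);
    return 0;
}
```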

3Cs Relative Miss Rate. (Figure: the same data normalized to 100%, showing the relative contributions of conflict, capacity, and compulsory misses as cache size and associativity vary.) Flaws: this is for a fixed block size. Still, good insight => invention.

1. Reduce Misses via Larger Block Size. (Figure: miss rate versus block size from 16 to 256 bytes for cache sizes of 1 KB, 4 KB, 16 KB, 64 KB, and 256 KB; larger blocks first reduce the miss rate, then increase it for small caches once there are too few blocks.)

2. Reduce Misses via Higher Associativity. The 2:1 cache rule: the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2. Beware: execution time is the only final measure! Will the clock cycle time increase? Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and about +2% for an internal cache.

Example: average memory access time vs. miss rate. Assume the clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to a direct-mapped cache. AMAT (in clock cycles) by cache size and associativity; entries where more associativity does not improve AMAT were shown in red on the original slide (from 16 KB up, the direct-mapped cache has the lowest AMAT):
  1 KB:   2.33 (1-way), 2.15 (2-way), 2.07 (4-way), 2.01 (8-way)
  2 KB:   1.98, 1.86, 1.76, 1.68
  4 KB:   1.72, 1.67, 1.61, 1.53
  8 KB:   1.46, 1.48, 1.47, 1.43
  16 KB:  1.29, 1.32, 1.32, 1.32
  32 KB:  1.20, 1.24, 1.25, 1.27
  64 KB:  1.14, 1.20, 1.21, 1.23
  128 KB: 1.10, 1.17, 1.18, 1.20
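
A minimal sketch of the trade-off this example is making: higher associativity lowers the miss rate but stretches the clock cycle. The miss rates below are placeholders, not the SPEC92 data from the table.

```c
#include <stdio.h>

int main(void) {
    double cct[4]       = {1.00, 1.10, 1.12, 1.14};      /* relative clock cycle time */
    double miss_rate[4] = {0.098, 0.076, 0.071, 0.068};  /* hypothetical, 1..8-way */
    double miss_penalty = 25.0;                          /* in 1-way clock cycles */
    const char *name[4] = {"1-way", "2-way", "4-way", "8-way"};

    for (int i = 0; i < 4; i++) {
        /* hit time of 1 cycle, scaled by the relative cycle time */
        double amat = cct[i] * 1.0 + miss_rate[i] * miss_penalty;
        printf("%s: AMAT = %.2f cycles\n", name[i], amat);
    }
    return 0;
}
```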

3. Reducing Misses via a Victim Cache. How do we combine the fast hit time of a direct-mapped cache and still avoid conflict misses? Add a small buffer that holds data discarded from the cache. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache. Used in Alpha and HP machines. (Figure: a small fully associative victim cache of tag/comparator pairs and data lines sitting between the cache and the next lower level of the hierarchy.)

4. Reducing Misses by Hardware Prefetching of Instructions and Data. E.g., instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss, placing the extra block in a stream buffer; on a miss it checks the stream buffer. This works with data blocks too: Jouppi [1990] found that a 1-stream data buffer caught 25% of the misses from a 4 KB cache, and 4 streams caught 43%; Palacharla & Kessler [1994] found that, for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB 4-way set-associative caches. Prefetching relies on having extra memory bandwidth that can be used without penalty.

5. Reducing Misses by Software Prefetching of Data. Data prefetch: load data into a register (HP PA-RISC loads) or prefetch into the cache (MIPS IV, PowerPC, SPARC v9); special prefetching instructions cannot cause faults, a form of speculative execution. Issuing prefetch instructions takes time: is the cost of the prefetch issues less than the savings in reduced misses? Higher superscalar issue width reduces the difficulty of issue bandwidth.

6. Reducing Misses by Compiler Optimizations. McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software. Instructions: reorder procedures in memory so as to reduce conflict misses, using profiling to look at conflicts (with tools they developed). Data: merging arrays (improve spatial locality by using a single array of compound elements instead of 2 arrays), loop interchange (change the nesting of loops to access data in the order it is stored in memory), loop fusion (combine 2 independent loops that have the same looping and some variables in common), and blocking (improve temporal locality by accessing blocks of data repeatedly instead of going down whole columns or rows). A loop-interchange sketch follows below.
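
A minimal sketch of loop interchange in C (row-major layout); the array size is illustrative.

```c
/* Loop interchange: touch x[][] in the order it is laid out in memory,
 * turning stride-N accesses into unit-stride accesses. */
#define N 1000
static double x[N][N];

void before(void) {           /* stride-N accesses: poor spatial locality */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

void after(void) {            /* unit-stride accesses: good spatial locality */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both versions do the same arithmetic; only the order of memory references changes, which is exactly why the compiler is free to make this transformation.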

Review: Improving Cache Performance. 1. Reduce the miss rate, 2. reduce the miss penalty, or 3. reduce the time to hit in the cache.

Reducing Miss Penalty: Faster DRAM / Interface. New DRAM technologies: RAMBUS (the same initial latency, but much higher bandwidth), synchronous DRAM, TMJ-RAM from IBM??, and merged DRAM/logic (the IRAM project here at Berkeley). Better bus interfaces. The CRAY technique: only use SRAM.

1. Reducing Miss Penalty: Read Priority over Write on Miss. Write through with write buffers can create RAW conflicts between main memory reads on cache misses and pending writes; simply waiting for the write buffer to empty can increase the read miss penalty (by 50% on an old MIPS 1000). Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue. What about write back? On a read miss that replaces a dirty block, the normal approach is to write the dirty block to memory and then do the read; instead, copy the dirty block to a write buffer, then do the read, and then do the write. The CPU stalls less since it can restart as soon as the read is done.

2. Reducing Miss Penalty: Early Restart and Critical Word First. Don't wait for the full block to be loaded before restarting the CPU. Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution. Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives, letting the CPU continue execution while filling the rest of the words in the block; also called wrapped fetch or requested word first. Generally useful only for large blocks. Spatial locality is a problem: the program tends to want the next sequential word soon, so it is not clear how much is gained by early restart.

3. Reducing Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses. A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss; it requires Full/Empty bits on registers or out-of-order execution, and it requires multi-bank memories. "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests. "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses, but it significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses, and it requires multiple memory banks (otherwise it cannot be supported). The Pentium Pro allows 4 outstanding memory misses.

Value of Hit Under Miss for SPEC. (Figure: normalized average memory access time for the SPEC92 benchmarks under hit-under-1-miss, hit-under-2-misses, and hit-under-64-misses, for an 8 KB direct-mapped data cache with 32 B blocks and a 16-cycle miss penalty. FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26; integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19.)

4. Reducing Miss Penalty: Second-Level Cache. L2 equations: AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1, where Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2, so AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2). Definitions: the local miss rate is the misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2); the global miss rate is the misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2). The global miss rate is what matters. A small numerical sketch follows below.

Which miss-reduction techniques apply to L2? 1. Reduce misses via larger block size; 2. reduce conflict misses via higher associativity; 3. reduce conflict misses via a victim cache; 4. reduce misses by hardware prefetching of instructions and data; 5. reduce misses by software prefetching of data; 6. reduce capacity/conflict misses by compiler optimizations.
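
A minimal sketch of the L2 equations above; the latencies and miss rates are made-up inputs.

```c
#include <stdio.h>

/* AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2);
 * global L2 miss rate = MissRate_L1 x local MissRate_L2. */
int main(void) {
    double hit_l1 = 1.0,  miss_rate_l1 = 0.04;        /* per CPU access   */
    double hit_l2 = 10.0, local_miss_rate_l2 = 0.25;  /* per L2 access    */
    double miss_penalty_l2 = 100.0;                   /* DRAM, in cycles  */

    double miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;
    double global_miss_rate_l2 = miss_rate_l1 * local_miss_rate_l2;

    printf("AMAT = %.2f cycles, global L2 miss rate = %.3f\n",
           amat, global_miss_rate_l2);
    return 0;
}
```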

L2 cache block size and A.M.A.T. (Figure: relative CPU execution time versus L2 block size of 16, 32, 64, 128, 256, and 512 bytes: 1.36, 1.28, 1.27, 1.34, 1.54, and 1.95 respectively, for a 32 KB L1 cache and an 8-byte path to memory; 64-byte L2 blocks are best in this experiment.)

Reducing Miss Penalty Summary. Five techniques: faster main memory, read priority over write on miss, early restart and critical word first on miss, non-blocking caches (hit under miss, miss under miss), and a second-level cache. These can be applied recursively to multilevel caches. The danger is that the time to DRAM will grow with multiple levels in between, and first attempts at L2 caches can make things worse, since the increased worst case is worse.

Recall: Levels of the Memory Hierarchy. CPU registers: 100s of bytes, under 10s of ns. Cache: K bytes, 10-100 ns, roughly $0.01-0.001 per bit. Main memory: M bytes, 100 ns-1 us, roughly $0.01-0.001. Disk: G bytes, ms access times, 10^-3 to 10^-4 cents. Tape: effectively infinite, sec-min access times, about 10^-6 cents. Staging and transfer units: instructions and operands (program/compiler, 1-8 bytes), blocks (cache controller, 8-128 bytes), pages (OS, 512 bytes-4 KB), files (user/operator, Mbytes). Upper levels are faster; lower levels are larger.

Basic Issues in Virtual Memory System Design: the size of the information blocks that are transferred from secondary to main storage (M); when a block is brought into M and M is full, some region of M must be released to make room for the new block (the replacement policy); which region of M is to hold the new block (the placement policy); missing items are fetched from secondary memory only on the occurrence of a fault (the demand load policy). Paging organization: the virtual and physical address spaces are partitioned into blocks of equal size: page frames in physical memory and pages in the virtual address space.

Address Map. V = {0, 1, ..., n - 1} is the virtual address space and M = {0, 1, ..., m - 1} the physical address space, with n > m. MAP: V --> M U {0} is the address mapping function: MAP(a) = a' if the data at virtual address a is present at physical address a' in M, and MAP(a) = 0 if the data at virtual address a is not present in M. The processor presents an address a to the name space V; the address translation mechanism either produces a physical address a' or raises a missing-item fault, and the fault handler has the OS transfer the item from secondary storage to main memory.

Paging Organization (example with 1 KB pages). Physical memory is divided into 1 KB frames at physical addresses 0, 1024, ..., 7168, and the virtual address space into 1 KB pages at virtual addresses 0, 1024, ..., 31744. Address mapping: the virtual address is split into a virtual page number and a displacement; the page number (together with the page table base register) indexes the page table, located in physical memory, whose entries hold a valid bit, access rights, and a physical frame number; the frame number combined with the displacement (actually, concatenation is more likely than addition) forms the physical memory address. The page is the unit of mapping and also the unit of transfer from virtual to physical memory.
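
A minimal sketch of this one-level translation with 1 KB pages; the page-table contents and the example address are made up.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 10                 /* 1 KB pages */
#define NUM_PAGES 32                 /* 32 KB virtual address space */

typedef struct { unsigned valid : 1; unsigned frame : 8; } pte_t;
static pte_t page_table[NUM_PAGES];

/* Returns 1 and fills *pa on success, 0 on a missing-item (page) fault. */
static int translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn  = va >> PAGE_BITS;
    uint32_t disp = va & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return 0;                                   /* fault: OS must fetch the page */
    *pa = ((uint32_t)page_table[vpn].frame << PAGE_BITS) | disp;  /* concatenation */
    return 1;
}

int main(void) {
    page_table[3].valid = 1;
    page_table[3].frame = 7;          /* virtual page 3 -> physical frame 7 */
    uint32_t pa;
    if (translate(3 * 1024 + 40, &pa))
        printf("PA = %u\n", pa);      /* 7*1024 + 40 = 7208 */
    return 0;
}
```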

Reducing Translation Time Machines with TLBs go one step further to reduce # cycles/cache access They overlap the cache access with the TLB access Works because high order bits of the VA are used to look in the TLB while low order bits are used as index into cache Making address translation practical TLB Virtual memory => memory acts like a cache for the disk Page table maps virtual page numbers to physical frames Translation Look-aside Buffer (TLB) is a cache of recent translations Virtual Address Space Physical Space Page Table 2 virtual address page off 3 Translation Lookaside Buffer frame page 2 2 5 Lec9.6 Lec9.62 TLBs R3 TLB & CP (MMU) A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB Virtual Address Physical Address Dirty Ref Valid Access 2 6 6 VPN ASID 2 8 PFN N D V G global (ignore ASID) valid, dirty, non-cacheable TLB access time comparable to cache access time (much less than main memory access time) 63 8 7 Entry Hi Safe entries Entry Lo loaded when VA presented for translation Index index of probe and fail flag random random index for replacement Lec9.63 Lec9.64

Optimal Page Size Mimimize wasted storage small page minimizes internal fragmentation small page increase size of page table Minmize transfer time large pages (multiple disk sectors) amortize access cost sometimes transfer unnecessary info sometimes prefetch useful data sometimes discards useless data early General trend toward larger pages because big cheap RAM increasing mem / disk performance gap larger address spaces Lec9.65 32 Overlapped TLB & Access So far TLB access is serial with cache access Hit/ Miss can we do it in parallel? only if we are careful in the cache organization! TLB FN assoc lookup 2 page # What if cache size is increased to 8KB? = index 2 disp 4 bytes K FN Data Hit/ Miss Lec9.66 Problems With Overlapped TLB Access Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache Example suppose everything the same except that the cache is increased to 8 K bytes instead of 4 K 2 cache index 2 2 This bit is changed by VA translation, but is needed for cache virt page # disp lookup Solutions go to 8K byte page sizes; go to 2 way set associative cache; or SW guarantee VA[3]=PA[3] 4 4 K 2 way set assoc cache Lec9.67 Page Fault What happens when you miss? Not talking about TLB miss TLB is HWs attempt to make page table lookup fast (on average) Page fault means that page is not resident in memory Hardware must detect situation Hardware cannot remedy the situation Therefore, hardware must trap to the operating system so that it can remedy the situation pick a page to discard (possibly writing it to disk) load the page in from disk update the page table resume to program so HW will retry and succeed! What is in the page fault handler? see CS62 What can HW do to help it do a good job? Lec9.68

Page Replacement Not Recently Used (-bit LRU, Clock) Large Address Spaces Associated with each page is a reference flag such that ref flag = if the page has been referenced in recent past = otherwise -- if replacement is necessary, choose any page frame such that its reference bit is. This is a page that has not been referenced in the recent past dirty used page fault handler page table entry page table entry Or search for the a page that is both not recently referenced AND not dirty. last replaced pointer (lrp) if replacement is to take place, advance lrp to next entry (mod table size) until one with a bit is found; this is the target for replacement; As a side effect, all examined PTE s have their reference bits set to zero. Two-level Page Tables 32-bit address 2 P index P2 index page offest 2 GB virtual address space 4 MB of PTE2 paged, holes 4 KB of PTE K PTEs 4 bytes 4KB Architecture part support dirty and used bits in the page table => may need to update PTE on any instruction fetch, load, store How does TLB affect this design problem? Software TLB miss? What about a 48-64 bit address space? 4 bytes Lec9.69 Lec9.7 Inverted Page Tables IBM System 38 (AS4) implements 64-bit addresses. 48 bits translated start of object contains a 2-bit tag Virtual Page hash = V.Page P. Frame => TLBs or virtually addressed caches are critical Lec9.7 Survey R4 32 bit virtual, 36 bit physical variable page size (4KB to 6 MB) 48 entries mapping page pairs (28 bit) MPC6 (32 bit implementation of 64 bit PowerPC arch) 52 bit virtual, 32 bit physical, 6 segment registers 4KB page, 256MB segment 4 entry instruction TLB 256 entry, 2-way TLB (and variable sized block xlate) overlapped lookup into 8-way 32KB L cache hardware table search through hashed page tables Alpha 264 arch is 64 bit virtual, implementation subset 43, 47,5,55 bit 8,6,32, or 64KB pages (3 level page table) 2 entry ITLB, 32 entry DTLB 43 bit virtual, 28 bit physical octword address 4 28 24 Lec9.72

Why virtual memory? Generality ability to run programs larger than size of physical memory Storage management allocation/deallocation of variable sized blocks is costly and leads to (external) fragmentation Protection regions of the address space can be R/O, Ex,... Flexibility portions of a program can be placed anywhere, without relocation Storage efficiency retain only most important portions of the program in memory Concurrent I/O execute other processes while loading/dumping page Expandability can leave room in virtual address space for objects to grow. Performance Observe impact of multiprogramming, impact of higher level languages Lec9.73 Summary #/ 4 The Principle of Locality Program likely to access a relatively small portion of the address space at any instant of time. - Temporal Locality Locality in Time - Spatial Locality Locality in Space Three (+) Major Categories of Misses Compulsory Misses sad facts of life. Example cold start misses. Conflict Misses increase cache size and/or associativity. Nightmare Scenario ping pong effect! Capacity Misses increase cache size Coherence Misses Caused by external processors or I/O devices Design Space total size, block size, associativity replacement policy write-hit policy (write-through, write-back) write-miss policy Lec9.74 Summary #2 / 4 The Design Space Several interacting dimensions cache size block size associativity replacement policy write-through vs write-back write allocation The optimal choice is a compromise depends on access characteristics - workload - use (I-cache, D-cache, TLB) depends on technology / cost Simplicity often wins Size Associativity Block Size Bad Good Factor A Factor B Less More miss rate Summary #3 / 4 Miss Optimization Lots of techniques people use to improve the miss rate of caches Technique MR MP HT Complexity Larger Block Size + Higher Associativity + Victim s + 2 Pseudo-Associative s + 2 HW Prefetching of Instr/Data + 2 Compiler Controlled Prefetching + 3 Compiler Reduce Misses + Lec9.75 Lec9.76

Summary #4 / 4 TLB, Virtual s, TLBs, Virtual all understood by examining how they deal with 4 questions ) Where can block be placed? 2) How is block found? 3) What block is replaced on miss? 4) How are writes handled? Page tables map virtual address to physical address TLBs are important for fast translation TLB misses are significant in processor performance (funny times, as most systems can t access all of 2nd level cache without TLB misses!) Lec9.77