Memory Hierarchy Design. Chapter 5


Memory Hierarchy Design Chapter 5 1

Outline Review of the ABCs of Caches (5.2) Cache Performance Reducing Cache Miss Penalty 2

Overview Problem: CPU vs. memory performance imbalance Solution: memory hierarchies, driven by temporal and spatial locality Fast L1, L2, L3 caches Larger but slower main memory Even larger but even slower secondary storage Keep most of the action in the higher levels 3

Review: What is a cache? Small, fast storage used to improve average access time to slow memory. Exploits spatial and temporal locality In computer architecture, almost everything is a cache! Registers: a cache on variables First-level cache: a cache on second-level cache Second-level cache: a cache on memory Memory: a cache on disk (virtual memory) TLB: a cache on page table Branch prediction: a cache on prediction information? Figure: the hierarchy from Proc/Regs through L1-Cache, L2-Cache, Memory, and Disk/Tape, growing bigger (and slower) toward the bottom and faster toward the top. 4

Review: Terminology Hit: data appears in some block in the upper level (example: Block X) Hit Rate: the fraction of memory accesses found in the upper level Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss Miss: data must be retrieved from a block in the lower level (Block Y) Miss Rate = 1 - (Hit Rate) Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor Hit Time << Miss Penalty (500 instructions on the 21264!) Figure: block X moves between the processor and the upper-level memory; block Y sits in the lower-level memory. 5

Why it works Exploit the statistical properties of programs: locality of reference, temporal and spatial (figure: probability of access P(access, t) vs. address) Average Memory Access Time: AMAT = Hit time + Miss rate x Miss penalty = (Hit time_Inst + Miss rate_Inst x Miss penalty_Inst) + (Hit time_Data + Miss rate_Data x Miss penalty_Data) A simple hardware structure that observes program behavior and reacts to improve future performance Is the cache visible in the ISA? 6

Locality of Reference Temporal and Spatial Sequential access to memory Unit-stride loop (cache lines = 256 bits):
for (i = 1; i < 100000; i++)
    sum = sum + a[i];
Non-unit stride loop (cache lines = 256 bits):
for (i = 0; i <= 100000; i = i+8)
    sum = sum + a[i];
7

Locality Temporal locality: we are likely to need this word again in the near future. Spatial locality: there is a high probability that the other data in the block will be needed soon. 8

Cache Systems Figure: a 400 MHz CPU connected over a 66 MHz bus to 10 MHz main memory, without and with a cache; data objects transfer between the CPU and the cache, blocks transfer between the cache and main memory. 9

Example: Two-level Hierarchy Figure: average access time falls from T1 + T2 (hit ratio 0) to T1 (hit ratio 1) as the hit ratio increases. 10
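A minimal sketch of the curve in that figure, assuming a hit ratio h, an upper-level access time T1, and a lower-level access time T2 (the numeric values are hypothetical):

#include <stdio.h>

/* Average access time of a two-level hierarchy: hits cost T1,
 * misses cost T1 + T2 (check the upper level, then go below). */
static double avg_access_time(double hit_ratio, double t1, double t2) {
    return t1 + (1.0 - hit_ratio) * t2;
}

int main(void) {
    double t1 = 1.0, t2 = 50.0;            /* assumed times, in ns */
    for (double h = 0.0; h <= 1.0; h += 0.25)
        printf("hit ratio %.2f -> %.2f ns\n", h, avg_access_time(h, t1, t2));
    return 0;
}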

Basic Cache Read Operation CPU requests contents of memory location Check cache for this data If present, get from cache (fast) If not present, read required block from main memory to cache Then deliver from cache to CPU Cache includes tags to identify which block of main memory is in each cache slot 11

Elements of Cache Design Cache size Line (block) size Number of caches Block placement Block identification Replacement algorithm Write strategy 12

Cache Size Cache size << main memory size Small enough Minimize cost Speed up access (fewer gates to address the cache) Keep cache on chip Large enough Minimize average access time Optimum size depends on the workload Practical size? 13

Line Size Optimum size depends on the workload Small blocks do not exploit the locality-of-reference principle Larger blocks reduce the number of blocks in the cache Replacement overhead Practical sizes? Figure: cache lines with tags mapping to blocks of main memory. 14

Number of Caches Increased logic density => on-chip cache Internal cache: level 1 (L1) External cache: level 2 (L2) Unified cache Balances the load between instruction and data fetches Only one cache needs to be designed / implemented Split caches (data and instruction) Pipelined, parallel architectures 15

Block Placement Q1: Where can a block be placed in the upper level? Fully Associative, Set Associative, Direct Mapped 16

1 KB Direct Mapped Cache, 32B blocks For a 2**N byte cache: The uppermost (32 - N) bits are always the Cache Tag The lowest M bits are the Byte Select (Block Size = 2**M) The bits in between are the Cache Index Example: Cache Tag 0x50 (stored as part of the cache state), Cache Index 0x01, Byte Select 0x00 Figure: a 32-entry cache with valid bit, tag, and 32 bytes of data per entry; entry 0 holds bytes 0-31, entry 1 holds bytes 32-63 (tag 0x50 in the example), entry 31 holds bytes 992-1023. 17
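A small sketch of that address breakdown for this 1 KB, direct-mapped, 32-byte-block cache; the field widths follow from the slide, but the example address is an assumed value:

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE 1024          /* 2**10 bytes           */
#define BLOCK_SIZE 32            /* 2**5  bytes per block */
#define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)   /* 32 entries */

int main(void) {
    uint32_t addr = 0x0000A020;  /* assumed example address */

    uint32_t byte_select = addr & (BLOCK_SIZE - 1);           /* lowest 5 bits     */
    uint32_t index = (addr / BLOCK_SIZE) & (NUM_BLOCKS - 1);  /* next 5 bits       */
    uint32_t tag = addr / CACHE_SIZE;                         /* uppermost 22 bits */

    printf("tag=0x%x index=%u byte=%u\n", tag, index, byte_select);
    return 0;
}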

Review: Set Associative Cache N-way set associative: N entries for each Cache Index N direct mapped caches operating in parallel How big is the tag? Example: two-way set associative cache The Cache Index selects a set from the cache The two tags in the set are compared to the input in parallel Data is selected based on the tag comparison result Figure: two banks of valid/tag/data entries indexed by the Cache Index; two comparators and a mux select the hitting cache block. 18

n-way Set Associative There are n blocks in a set. Direct mapped is simply one-way set associative. A fully associative cache with m blocks could be called m-way set associative. Direct mapped can be thought of as having m sets, and fully associative as having one set. 19

Q2: How is a block found if it is in the upper level? The index identifies the set of possibilities Tag on each block No need to check index or block offset Increasing associativity shrinks the index, expands the tag Block Address = Tag | Index | Block Offset Cache size = Associativity x 2^index_size x 2^offset_size 20

Direct Mapping 21

Associative Mapping 22

K-Way Set Associative Mapping 23

Replacement Algorithm Simple for direct-mapped: no choice Random Simple to build in hardware LRU Miss rates, LRU vs. Random:
             Two-way            Four-way           Eight-way
Size         LRU     Random     LRU     Random     LRU     Random
16KB         5.18%   5.69%      4.67%   5.29%      4.39%   4.96%
64KB         1.88%   2.01%      1.54%   1.66%      1.39%   1.53%
256KB        1.15%   1.17%      1.13%   1.13%      1.12%   1.12%
24
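A minimal sketch of LRU bookkeeping for one set of an n-way cache; the explicit recency stack is an assumption for illustration, since real hardware usually approximates LRU:

#include <stdio.h>

#define WAYS 4

/* ways[0] is the most recently used block, ways[WAYS-1] the LRU victim. */
static void touch(int ways[WAYS], int block) {
    int i, pos = WAYS - 1;                 /* if absent, evict the LRU entry */
    for (i = 0; i < WAYS; i++)
        if (ways[i] == block) { pos = i; break; }
    for (i = pos; i > 0; i--)              /* shift the others down one slot */
        ways[i] = ways[i - 1];
    ways[0] = block;                       /* referenced block becomes MRU   */
}

int main(void) {
    int set[WAYS] = { -1, -1, -1, -1 };    /* empty set */
    int refs[] = { 7, 3, 7, 9, 12, 3 };
    for (int i = 0; i < 6; i++) touch(set, refs[i]);
    printf("MRU..LRU: %d %d %d %d\n", set[0], set[1], set[2], set[3]);
    return 0;
}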

Q4: What happens on a write? Write through The information is written both to the block in the cache and to the block in the lower-level memory. Write back The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. Is the block clean or dirty? Pros and cons of each? WT: read misses cannot result in writes WB: no repeated writes to the same location WT is always combined with write buffers so the CPU doesn't wait for the lower-level memory What about on a miss? Write-no-allocate vs. write-allocate 25

Write Buffer for Write Through Figure: Processor -> Cache, with a Write Buffer between the Cache and DRAM A Write Buffer is needed between the Cache and Memory Processor: writes data into the cache and the write buffer Memory controller: writes contents of the buffer to memory Write buffer is just a FIFO: Typical number of entries: 4 Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle 26

Write Policy Write is more complex than read Write and tag comparison cannot proceed simultaneously Only a portion of the line has to be updated Write policies Write through: write to the cache and memory Write back: write only to the cache (dirty bit) Write miss: Write allocate: load the block on a write miss No-write allocate: update directly in memory 27

Alpha AXP 21064 Cache Figure: the address is split into a 21-bit tag, 8-bit index, and 5-bit offset; each entry holds a valid bit, a tag, and 256 bits of data; a comparator checks the tag for a hit, and a write buffer sits between the cache and lower-level memory. 28

Outline Review of the ABCs of Caches Cache Performance (5.3) Reducing Cache Miss Penalty 29

DECstation 5000 Miss Rates Figure: miss rates (%) of the instruction cache, data cache, and a unified cache as cache size grows from 1 KB to 128 KB Direct-mapped cache with 32-byte blocks Percentage of instruction references is 75% 30

Cache Performance Measures Hit rate: fraction found in that level So high that we usually talk about the miss rate Miss rate fallacy: miss rate can be as misleading a measure of cache performance as MIPS is of CPU performance Average memory-access time = Hit time + Miss rate x Miss penalty (ns) Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU Access time to lower level = f(latency to lower level) Transfer time: time to transfer the block = f(bandwidth) 31

Cache Performance Improvements Average memory-access time = Hit time + Miss rate x Miss penalty Cache optimizations Reducing the miss rate Reducing the miss penalty Reducing the hit time 32

Example: Harvard Architecture Unified vs Separate I&D (Harvard) Proc Unified Cache-1 Unified Cache-2 I-Cache-1 Proc Unified Cache-2 D-Cache-1 Statistics (given in H&P): 16KB I&D: Inst miss rate= 0.64%, Data miss rate= 6.47% 32KB unified: Aggregate miss rate= 1.99% Which is better (ignore L2 cache)? Assume 33% data ops 75% accesses from instructions (1.0/1.33) hit time=1, miss time=50 Note that data hit has 1 stall for unified cache (only one port) AMAT Harvard =75%x(1+0.64%x50)+25%x(1+6.47%x50) = 2.05 AMAT Unified =75%x(1+1.99%x50)+25%x(1+1+1.99%x50)= 2.24 33
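A quick check of those two numbers, using the miss rates and assumptions taken directly from the slide:

#include <stdio.h>

int main(void) {
    double hit = 1.0, miss_penalty = 50.0;
    double f_inst = 0.75, f_data = 0.25;   /* 75% instruction, 25% data accesses */

    /* Split (Harvard) caches: separate instruction and data miss rates. */
    double amat_split = f_inst * (hit + 0.0064 * miss_penalty)
                      + f_data * (hit + 0.0647 * miss_penalty);

    /* Unified cache: one miss rate, plus one extra stall cycle on data
     * accesses because the single port is shared with instruction fetches. */
    double amat_unified = f_inst * (hit + 0.0199 * miss_penalty)
                        + f_data * (hit + 1.0 + 0.0199 * miss_penalty);

    /* Prints 2.049 and 2.245, which the slide reports as 2.05 and 2.24. */
    printf("Harvard: %.3f  Unified: %.3f\n", amat_split, amat_unified);
    return 0;
}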

Cache Performance Equations CPU time = (CPU execution cycles + Mem stall cycles) * Cycle time Mem stall cycles = Mem accesses * Miss rate * Miss penalty CPU time = IC * (CPI execution + Mem accesses per instr * Miss rate * Miss penalty) * Cycle time Misses per instr = Mem accesses per instr * Miss rate CPU time = IC * (CPI execution + Misses per instr * Miss penalty) * Cycle time 34
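A short sketch that plugs assumed numbers (not from the slide) into those equations:

#include <stdio.h>

int main(void) {
    /* Assumed example values. */
    double ic = 1e9;              /* instruction count           */
    double cpi_exec = 1.0;        /* CPI ignoring memory stalls  */
    double accesses_per_instr = 1.5;
    double miss_rate = 0.02;
    double miss_penalty = 100.0;  /* cycles                      */
    double cycle_time = 1e-9;     /* seconds                     */

    double misses_per_instr = accesses_per_instr * miss_rate;
    double cpu_time = ic * (cpi_exec + misses_per_instr * miss_penalty) * cycle_time;

    printf("misses/instr = %.3f, CPU time = %.3f s\n", misses_per_instr, cpu_time);
    return 0;
}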

The Cache Design Space Several interacting dimensions cache size block size associativity replacement policy write-through vs. write-back The optimal choice is a compromise depends on access characteristics workload use (I-cache, D-cache, TLB) depends on technology / cost Simplicity often wins Figure: a design-space sketch over cache size, associativity, and block size, plus a generic good/bad trade-off curve between two factors. 35

Outline Review of the ABCs of Caches Cache Performance Reducing Cache Miss Penalty (5.4) Reducing Miss Rate (5.5) Reducing Miss Penalty and Miss Rate via Parallelism (5.6) 36

Reducing Miss Penalty Multi-level Caches Critical Word First and Early Restart Priority to Read Misses over Writes Merging Write Buffers Victim Caches Sub-block placement 37

Second-Level Caches L2 Equations: AMAT = Hit Time L1 + Miss Rate L1 x Miss Penalty L1 Miss Penalty L1 = Hit Time L2 + Miss Rate L2 x Miss Penalty L2 AMAT = Hit Time L1 + Miss Rate L1 x (Hit Time L2 + Miss Rate L2 x Miss Penalty L2) Definitions: Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate L2) Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate L1 x Miss Rate L2) Global Miss Rate is what matters 38
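A small sketch of those equations with assumed example values (none of the numbers come from the slide):

#include <stdio.h>

int main(void) {
    /* Assumed two-level hierarchy. */
    double hit_l1 = 1.0, miss_rate_l1 = 0.05;          /* L1: 1 cycle, 5% miss      */
    double hit_l2 = 10.0, local_miss_rate_l2 = 0.40;   /* L2: 10 cycles, 40% local  */
    double miss_penalty_l2 = 100.0;                    /* main memory access        */

    double miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;
    double global_miss_rate_l2 = miss_rate_l1 * local_miss_rate_l2;

    printf("L1 miss penalty   = %.1f cycles\n", miss_penalty_l1);   /* 50.0  */
    printf("AMAT              = %.1f cycles\n", amat);              /* 3.5   */
    printf("global L2 miss    = %.3f\n", global_miss_rate_l2);      /* 0.020 */
    return 0;
}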

Performance of Multi-Level Caches 32 KByte L1 cache; the global miss rate is close to the single-level cache miss rate provided L2 >> L1 Local miss rate: do not use it to measure overall impact; use it in the miss-penalty equation! L2 is not tied to the CPU clock cycle Target: miss reduction 39

Early Restart and CWF Don't wait for the full block to be loaded Early restart: as soon as the requested word arrives, send it to the CPU and let the CPU continue execution Critical Word First: request the missed word first and send it to the CPU as soon as it arrives; then fill in the rest of the words in the block Generally useful only for large blocks Extremely good spatial locality can reduce the benefit: back-to-back reads of the two halves of a cache block do not save you much (see example in book) Need to schedule instructions! 40

Giving Priority to Read Misses Write buffers complicate memory access RAW hazard in main memory on cache misses SW 512(R0), R3 (cache index 0) LW R1, 1024(R0) (cache index 0) LW R2, 512(R0) (cache index 0) Wait for write buffer to empty? Might increase read miss penalty Check write buffer contents before read If no conflicts, let the memory access continue Write Back: Read miss replacing dirty block Normal: Write dirty block to memory, then do the read Optimized: copy dirty block to write buffer, then do the read More optimization: write merging 41

Victim Caches A small, fully associative buffer that holds blocks recently evicted from the cache; it is checked on a miss before going to the next level. 42

Write Merging Figure: without merging, a write buffer uses four entries, each with only its first word valid, for addresses 100, 104, 108, and 112; with write merging the four words are combined into a single entry at address 100 with all four valid bits set, leaving the other three entries free. 43
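A minimal sketch of the idea, assuming a 4-entry buffer where each entry covers the four words starting at the address of its first write; real hardware does this with comparators, not a loop:

#include <stdio.h>
#include <stdint.h>

#define ENTRIES 4
#define WORDS_PER_ENTRY 4

struct wb_entry {
    uint32_t base;                  /* address of the entry's first word */
    int valid[WORDS_PER_ENTRY];     /* one valid bit per word            */
    uint32_t data[WORDS_PER_ENTRY];
    int used;
};

static struct wb_entry buf[ENTRIES];

/* Insert a word write, merging into an existing entry when it falls
 * inside that entry's 4-word window. */
static int write_buffer_put(uint32_t addr, uint32_t value) {
    for (int i = 0; i < ENTRIES; i++) {
        if (buf[i].used && addr >= buf[i].base &&
            addr < buf[i].base + 4 * WORDS_PER_ENTRY) {       /* merge     */
            int word = (addr - buf[i].base) / 4;
            buf[i].valid[word] = 1; buf[i].data[word] = value; return i;
        }
    }
    for (int i = 0; i < ENTRIES; i++) {
        if (!buf[i].used) {                                    /* allocate  */
            buf[i].used = 1; buf[i].base = addr;
            buf[i].valid[0] = 1; buf[i].data[0] = value; return i;
        }
    }
    return -1;   /* buffer full: the CPU would stall here */
}

int main(void) {
    for (uint32_t a = 100; a <= 112; a += 4)       /* all merge into entry 0 */
        printf("write %u -> entry %d\n", a, write_buffer_put(a, a));
    return 0;
}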

Outline Review of the ABCs of Caches Cache Performance Reducing Cache Miss Penalty (5.4) Reducing Miss Rate (5.5) Reducing Miss Penalty and Miss Rate via Parallelism (5.6) 44

Reducing Miss Rates: Types of Cache Misses Compulsory: first-reference or cold-start misses Capacity: the working set is too big for the cache; these misses occur even in a fully associative cache Conflict (collision): many blocks map to the same block frame (line); affects set-associative and direct-mapped caches 45

Miss Rates: Absolute and Distribution 46

Reducing the Miss Rates 1. Larger block size 2. Larger Caches 3. Higher associativity 4. Pseudo-associative caches 5. Compiler optimizations 47

1. Larger Block Size Effects of larger block sizes Reduction of compulsory misses Spatial locality Increase of miss penalty (transfer time) Reduction of number of blocks Potential increase of conflict misses Latency and bandwidth of lower-level memory High latency and bandwidth => large block size Small increase in miss penalty 48

Example 49

2. Larger Caches More blocks Higher probability of getting the data Longer hit time and higher cost Primarily used in 2nd-level caches 50

3. Higher Associativity Eight-way set associative is good enough 2:1 Cache Rule: Miss Rate of direct mapped cache size N = Miss Rate 2-way cache size N/2 Higher Associativity can increase Clock cycle time Hit time for 2-way vs. 1-way external cache +10%, internal + 2% 51

4. Pseudo-Associative Caches Fast hit time of direct-mapped and lower conflict misses of a 2-way set-associative cache? Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, it is a pseudo-hit (slow hit) Access time: hit time < pseudo-hit time < miss penalty Drawback: CPU pipeline design is hard if a hit takes 1 or 2 cycles Better for caches not tied directly to the processor (L2) Used in the MIPS R10000 L2 cache; similar in UltraSPARC 52

Pseudo-Associative Cache Figure: the CPU address first indexes one half of the cache and checks its tag (1); on a mismatch the other half is checked (2); only if both miss does the access go through the write buffer to lower-level memory (3). 53

5. Compiler Optimizations Avoid hardware changes Instructions Profiling to look at conflicts between groups of instructions Data Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays Loop Interchange: change nesting of loops to access data in order stored in memory Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap Blocking: Improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows 54

Merging Arrays
/* Before: 2 sequential arrays */
int key[size];
int val[size];
/* After: 1 array of structures */
struct merge {
    int key;
    int val;
};
struct merge merged_array[size];
Reduces conflicts between val & key; improved spatial locality 55

Loop Interchange
/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];
/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality Same number of executed instructions 56

Loop Fusion
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
Two misses per access to a & c vs. one miss per access; improved temporal locality 57

Blocking (1/2)
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    };
Two inner loops: Read all NxN elements of z[] Read N elements of 1 row of y[] repeatedly Write N elements of 1 row of x[] Capacity misses are a function of N and cache size: if 3 x N x N x 4 bytes fit, there are no capacity misses Idea: compute on a BxB submatrix that fits in the cache 58

Blocking (2/2)
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            };
B is called the Blocking Factor 59
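A self-contained, runnable version of the blocked loop nest above for anyone who wants to experiment with it; the matrix size, blocking factor, and test data are assumed values, not from the slide:

#include <stdio.h>

#define N 64        /* assumed matrix size     */
#define B 16        /* assumed blocking factor */

static double x[N][N], y[N][N], z[N][N];

static int min(int a, int b) { return a < b ? a : b; }

int main(void) {
    /* Fill y and z with something recognizable; x starts at zero. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            y[i][j] = (i == j);          /* identity matrix */
            z[i][j] = i + j;
        }

    /* Blocked matrix multiply x = y * z, as on the slide. */
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++) {
                    double r = 0;
                    for (int k = kk; k < min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }

    /* Since y is the identity, x should equal z. */
    printf("x[3][5] = %.0f (expected %d)\n", x[3][5], 3 + 5);
    return 0;
}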

Compiler Optimization Performance Figure: bar chart of performance improvement (roughly 1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on vpenta, gmty, btrix, mxm, and cholesky (nasa7), tomcatv, spice, and compress. 60

Reducing Cache Miss Penalty or Miss Rate via Parallelism 1. Nonblocking Caches 2. Hardware Prefetching 3. Compiler controlled Prefetching 61

1. Nonblocking Cache Out-of-order execution proceeds with later fetches while waiting for data to arrive A non-blocking cache continues to supply cache hits during a miss; requires an out-of-order CPU Hit under miss reduces the effective miss penalty by working during the miss instead of ignoring CPU requests Hit under multiple misses may further lower the effective miss penalty by overlapping multiple misses Significantly increases the complexity of the cache controller Requires multiple memory banks (otherwise multiple outstanding misses cannot be serviced) Pentium Pro allows 4 outstanding memory misses 62

Hit Under Miss Figure: average memory access time for 18 SPEC92 benchmarks (eqntott, espresso, xlisp, compress, mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora) as the number of misses a hit may proceed under grows: base, 0->1, 1->2, 2->64 FP: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26 Int: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19 8 KB Data Cache, Direct Mapped, 32B block, 16-cycle penalty 63

2. Hardware Prefetching Instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a stream buffer; on a miss, check the stream buffer Works with data blocks too: 1 data stream buffer catches 25% of misses from a 4KB direct-mapped cache; 4 stream buffers catch 43% For scientific programs: 8 stream buffers caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches Prefetching relies on having extra memory bandwidth that can be used without penalty 64

3. Compiler-Controlled Prefetching Compiler inserts data prefetch instructions Load data into a register (HP PA-RISC loads) Cache prefetch: load into the cache (MIPS IV, PowerPC) Special prefetching instructions cannot cause faults; a form of speculative execution Nonblocking cache: overlap execution with prefetch Issuing prefetch instructions takes time Is the cost of prefetch issues < the savings in reduced misses? Wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches 65
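The slide names HP PA-RISC loads and MIPS IV / PowerPC cache-prefetch instructions; as a portable stand-in, here is a sketch using the GCC/Clang __builtin_prefetch intrinsic, with an arbitrary assumed prefetch distance of 16 elements:

#include <stdio.h>

#define N 100000
#define PREFETCH_DIST 16   /* assumed distance; tune to the miss latency */

static double a[N];

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        /* Hint to bring a[i + PREFETCH_DIST] into the cache for a future
         * iteration (read access, low temporal locality); like the ISA
         * prefetches on the slide, it cannot cause a fault. */
        if (i + PREFETCH_DIST < N)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
        sum += a[i];
    }
    printf("%f\n", sum);
    return 0;
}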

Reducing Hit Time 1. Small and Simple Caches 2. Avoiding address Translation during Indexing of the Cache 66

1. Small and Simple Caches Small hardware is faster Fits on the same chip as the processor Alpha 21164 has 8KB Instruction and 8KB data cache + 96KB second level cache? Small data cache and fast clock rate Direct Mapped, on chip Overlap tag check with data transmission For L2 keep tag check on chip, data off chip fast tag check, large capacity associated with separate memory chip 67

Small and Simple Caches 68

2. Avoiding Address Translation Virtually Addressed Cache (vs. Physical Cache) Send the virtual address to the cache Every time the process is switched the cache must be flushed; cost: time to flush + compulsory misses from an empty cache Dealing with aliases (two different virtual addresses map to the same physical address) I/O must interact with the cache, so it needs virtual addresses Solutions to aliases: HW guarantees that every cache block has a unique PA SW guarantee (page coloring): the lower n bits of virtual and physical addresses must be identical; as long as they cover the index field and the cache is direct mapped, blocks are unique Solution to cache flushes: a PID tag that identifies the process as well as the address within the process 69

Virtually Addressed Caches Figure: three organizations. Conventional: CPU -> TLB (VA to PA) -> physically addressed cache -> memory. Virtually addressed cache: CPU -> virtual-tagged cache, translating only on a miss; raises the synonym problem. Overlapped: cache access proceeds in parallel with VA translation (virtual index, physical tags, physically addressed L2); requires the cache index to remain invariant across translation. 70

TLB and Cache Operation Figure: the virtual address (page # + offset) probes the TLB; on a TLB hit the real address is formed, on a TLB miss the page table in main memory is consulted; the cache then uses the tag and the remainder of the address to return the value on a hit or fetch it from main memory on a miss. 71

Process ID Impact Figure: miss rate (%) vs. cache size (2KB to 1024KB) for three cases: purging the cache on a context switch, tagging entries with PIDs, and a single process (uniprocess); PID tagging falls between purging and the uniprocess case. 72

Index with Physical Portion of Address If the index lies in the physical part of the address (the page offset), tag access can start in parallel with translation, so the stored tag can be compared against the physical tag Address layout: page address (address tag) | page offset (index | block offset) This limits the cache to the page size: what if we want bigger caches using the same trick? Larger page sizes Higher associativity (Index size = log2(Cache size / [Block size x Associativity])) Page coloring 73
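A small sketch of that constraint, with assumed page and cache parameters (4KB pages, 32KB cache, 64B blocks):

#include <stdio.h>

static int log2i(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    unsigned page_size = 4096;                        /* assumed 4KB pages        */
    unsigned cache_size = 32 * 1024, block = 64;      /* assumed cache parameters */

    for (unsigned assoc = 1; assoc <= 8; assoc *= 2) {
        int index_bits = log2i(cache_size / (block * assoc));
        int offset_bits = log2i(block);
        int page_offset_bits = log2i(page_size);
        printf("%u-way: index+offset = %d bits, page offset = %d bits -> %s\n",
               assoc, index_bits + offset_bits, page_offset_bits,
               (index_bits + offset_bits <= page_offset_bits)
                   ? "can index in parallel with translation"
                   : "needs larger pages, more associativity, or page coloring");
    }
    return 0;
}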

3. Pipelined Writes Figure: the tag check for write W1 completes in one stage while its data waits in a delayed write buffer; that data is written into the cache array during the tag check of the following access (R1/W1, then W2), with a mux selecting between cache data and buffered data, so writes complete at a rate of one per cycle. 74

Cache Performance Summary Important summary table (Fig. 5.26) Understand the underlying tradeoffs E.g., victim caches benefit both miss penalty and miss rate E.g., small caches improve hit time but increase miss rate 75

Main Memory Background Performance of Main Memory: Latency: Cache Miss Penalty Access Time: time between request and word arrival Cycle Time: time between requests Bandwidth: I/O & Large Block Miss Penalty (L2) Main Memory is DRAM: Dynamic Random Access Memory Dynamic since it needs to be refreshed periodically Addresses divided into 2 halves (memory as a 2D matrix): RAS or Row Access Strobe CAS or Column Access Strobe Cache uses SRAM: Static Random Access Memory No refresh (6 transistors/bit vs. 1 transistor/bit, area is 10X) Address not divided: full address Size: DRAM/SRAM 4-8 Cost & Cycle time: SRAM/DRAM 8-16 76

Main Memory Organizations Figure: three organizations. Simple: CPU - cache - 32/64-bit memory bus - memory. Wide: CPU - multiplexor - cache - 256/512-bit bus - memory. Interleaved: CPU - cache - bus - memory banks 0 through 3. 77

Performance Timing model (word size is 32 bits): 1 cycle to send the address, 6 for the access time, 1 to send the data Cache block is 4 words Simple: miss penalty = 4 x (1+6+1) = 32 Wide: miss penalty = 1 + 6 + 1 = 8 Interleaved: miss penalty = 1 + 6 + 4x1 = 11 Four-way interleaved memory: bank 0 holds words 0, 4, 8, 12; bank 1 holds 1, 5, 9, 13; bank 2 holds 2, 6, 10, 14; bank 3 holds 3, 7, 11, 15 78

Independent Memory Banks Memory banks for independent accesses Multiprocessor I/O CPU with Hit under n Misses, Non-blocking Cache Superbank: all memory active on one block transfer (or Bank) Bank: portion within a superbank that is word interleaved (or Subbank)... Superbank number Superbank offset Bank number Bank offset 79

Number of Banks How many banks? Number of banks >= number of clocks to access a word in a bank, for sequential accesses; otherwise we return to the original bank before it has the next word ready (as in the vector case) Increasing DRAM density => fewer chips => harder to have many banks 64MB main memory: 512 memory chips of 1Mx1 bit (16 banks of 32 chips), or 8 64Mx1-bit chips (at most one bank) Wider paths (16Mx4 bits or 8Mx8 bits) 80

Avoiding Bank Conflicts Lots of banks
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
Even with 128 banks (512 mod 128 = 0), the accesses down a column conflict on the same bank SW: loop interchange, or make the array size not a power of 2 ("array padding") HW: prime number of banks bank number = address mod number of banks address within bank = address / number of banks modulo & divide on every memory access with a prime number of banks? Let the number of banks be prime, = 2^K - 1 address within bank = address mod number of words in a bank easy if there are 2^N words per bank, by the Chinese Remainder Theorem 81
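A sketch of the software fix; the padding amount of one extra column is a common choice, not something the slide specifies:

/* Padded declaration: 513 columns instead of 512, so consecutive elements
 * of a column no longer map to the same bank (or cache set), because the
 * row stride is no longer a power of two. */
int x[256][513];

void scale(void) {
    for (int j = 0; j < 512; j = j + 1)      /* still use only 512 columns */
        for (int i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];
}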

Fast Bank Number Chinese Remainder Theorem: as long as two sets of integers a_i and b_i follow these rules: b_i = x mod a_i, 0 <= b_i < a_i, 0 <= x < a_0 x a_1 x a_2, and a_i and a_j are co-prime if i != j, then the integer x has only one solution (unambiguous mapping): bank number = b_0, number of banks = a_0 (3 in the example) address within bank = b_1, number of words in a bank = a_1 (8 in the example) N-word address 0 to N-1, prime number of banks, words per bank a power of 2 Example (3 banks, 8 words per bank):
Address within bank   Seq. interleaved (banks 0 1 2)   Modulo interleaved (banks 0 1 2)
0                     0   1   2                        0   16   8
1                     3   4   5                        9    1  17
2                     6   7   8                        18  10   2
3                     9  10  11                        3   19  11
4                     12 13  14                        12   4  20
5                     15 16  17                        21  13   5
6                     18 19  20                        6   22  14
7                     21 22  23                        15   7  23
82
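A short sketch that regenerates the modulo-interleaved column of that table, using the example's 3 banks and 8 words per bank:

#include <stdio.h>

#define BANKS 3            /* prime, and 2^2 - 1 */
#define WORDS_PER_BANK 8   /* power of two       */

int main(void) {
    int n = BANKS * WORDS_PER_BANK;   /* 24 word addresses */
    /* table[row][bank] = the address x with x mod BANKS == bank and
     * x mod WORDS_PER_BANK == row; by the CRT each slot gets exactly one x. */
    int table[WORDS_PER_BANK][BANKS];

    for (int x = 0; x < n; x++)
        table[x % WORDS_PER_BANK][x % BANKS] = x;

    for (int row = 0; row < WORDS_PER_BANK; row++)
        printf("%d: %2d %2d %2d\n",
               row, table[row][0], table[row][1], table[row][2]);
    return 0;
}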

Virtual Memory Overcoming main memory size limitation Sharing of main memory among processes Virtual memory model Decoupling of Addresses used by the program (virtual) Memory addresses (physical) Physical memory allocation Pages Segments Process relocation Demand paging 83

Virtual/Physical Memory Mapping Figure: the 1KB virtual pages of several processes (pages 0-3 of one process, pages 0-4 of process n) are mapped by the MMU onto page frames scattered through a larger physical memory (page frames 0-6, physical addresses 0 to 7167). 84

Caches vs. Virtual Memory Quantitative differences Block (page) size Hit time Miss (page fault) penalty Miss (page fault) rate Size Replacement control Cache: hardware Virtual memory: OS Size of virtual address space = f(address size) Disks are also used for the file system 85

Design Elements Minimize page faults Block size Block placement Fully associative Block identification Page table Replacement Algorithm LRU Write Policy Write back 86

Page Tables Each process has one or more page tables Size of a page table (31-bit address, 4KB pages => 2^19 entries; at 4 bytes each, about 2MB) Two-level approach: 2 virtual-to-physical translations Inverted page tables Figure: a virtual address (page # + offset) indexes a page table whose entries hold a present bit, a disk address, and a page frame number; the frame number is concatenated with the offset to form the physical address. 87
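The arithmetic behind that size estimate, as a sketch; the 4-byte entry size is an assumption:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int addr_bits = 31;
    uint64_t page_size = 4096;              /* 4KB pages               */
    uint64_t entry_size = 4;                /* assumed bytes per entry */

    uint64_t num_pages = (1ULL << addr_bits) / page_size;   /* 2^19 = 524288 */
    uint64_t table_bytes = num_pages * entry_size;          /* 2 MB          */

    printf("%llu entries, %llu KB per process page table\n",
           (unsigned long long)num_pages,
           (unsigned long long)(table_bytes / 1024));
    return 0;
}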

Segmentation Visible to the programmer Multiple address spaces of variable size Segment table: start address and size Segment registers (x86) Advantages: Simplifies handling of growing data structures Independent code segments Figure: the VA (segment, offset) indexes the segment table; the offset is compared against the segment size (fault if too large) and added to the segment's start address to form the PA. 88

Paging vs. Segmentation
                       Page        Segment
Address                One word    Two words
Programmer visible?    No          Maybe
Block replacement      Trivial     Hard
Fragmentation          Internal    External
Disk traffic           Efficient   Not efficient
Hybrids: Paged segments Multiple page sizes 89

Translation Buffer Fast address translation Principle of locality Cache for the page table Tag: portion of the virtual address Data: page frame number, protection field, valid, use, and dirty bit Virtual cache index and physical tags Address translation on the critical path Small TLB Pipelined TLB TLB misses 90

TLB and Cache Operation Figure (repeated): the virtual address (page # + offset) probes the TLB; on a TLB hit the real address is formed, on a TLB miss the page table in main memory is consulted; the cache then uses the tag and the remainder of the address to return the value on a hit or fetch it from main memory on a miss. 91

Page Size Large size: Smaller page tables Faster cache hit times Efficient page transfer Fewer TLB misses Small size: Less internal fragmentation Faster process start-up 92

Memory Protection Multiprogramming Protection and sharing Virtual memory Context switching Base and bound registers (Base + Address) <= Bound Hardware support Two execution modes: user and kernel Protect CPU state: base/bound registers, user/kernel mode bits, and the exception enable/disable bits System call mechanism 93

Protection and Virtual Memory During the virtual to physical mapping Check for errors or protection Add permission flags to each page/segment Read/write protection User/kernel protection Protection models Two-level model: user/kernel Protection rings Capabilities 94

Memory Hierarchy Design Issues Superscalar CPU and number of ports to the cache Multiple-issue processors Non-blocking caches Speculative execution and conditional instructions Can generate invalid addresses (exceptions) and cache misses Memory system must identify speculative instructions and suppress the exceptions and cache stalls on a miss Compilers: ILP versus reducing cache misses
for (i = 0; i < 512; i = i + 1)
    for (j = 1; j < 512; j = j + 1)
        x[i][j] = 2 * x[i][j-1];
I/O and cache coherency 95

Coherency Figure: the cache-coherence problem with I/O, in three snapshots of locations A and B: first, cache and memory are coherent; then, after a CPU write to A in a write-back cache, an I/O output of A reads a stale value from memory; finally, after an I/O input overwrites a location in memory, the cache holds a stale copy. 96