Memory Hierarchy Design

Outline: Introduction; Cache Basics; Cache Performance; Reducing Cache Miss Penalty; Reducing Cache Miss Rate; Reducing Miss Penalty/Rate via Parallelism; Reducing Hit Time; Main Memory and Organizations; Memory Technology; Virtual Memory; Conclusion

Many Levels in Memory Hierarchy. From fastest to slowest: pipeline registers and the register file (invisible only to high-level-language programmers); the 1st-level cache (on chip), the 2nd-level cache (on the same MCM as the CPU), possibly a 3rd or more cache levels (usually mounted on the same board as the CPU), and physical memory, all usually made invisible to the programmer (even assembly programmers) and the focus of Chapter 5; virtual memory (on hard disk, often in the same enclosure as the CPU); disk files (on hard disk, often in the same enclosure as the CPU); network-accessible disk files (often in the same building as the CPU); the tape backup/archive system (often in the same building as the CPU); and the data warehouse: a robotically accessed room full of shelves of tapes (usually on the same planet as the CPU).

Simple Hierarchy Example. Note the many orders of magnitude of change in characteristics (capacity and access time) between levels. (Figure with example sizes and latencies not reproduced.)

CPU vs. Memory Performance Trends (figure). Relative performance (vs. 1980) as a function of year: processor performance grew roughly +35%/year through the mid-1980s and about +55%/year afterward, while memory (DRAM) performance improved only about +7%/year, opening a growing processor-memory gap.

Cache Basics. A cache is a (hardware-managed) storage, intermediate in size, speed, and cost-per-bit between the programmer-visible registers and main physical memory. The cache itself may be SRAM or fast DRAM, and there may be more than one level of cache. The basis for caches to work is the Principle of Locality: when a location is accessed, it and nearby locations are likely to be accessed again soon. Temporal locality: the same location is likely to be accessed again soon. Spatial locality: nearby locations are likely to be accessed soon.

Four Basic Questions. Consider any two adjacent levels in a memory hierarchy, with the block as the unit of data transfer between them (chosen to exploit the Principle of Locality). The design of a level is described by four behaviors. Block placement: where could a new block be placed in the level? Block identification: how is a block found if it is in the level? Block replacement: which existing block should be replaced if necessary? Write strategy: how are writes to the block handled?

Block Placement Schemes (figure).

Direct-Mapped Placement. A block can only go into one frame in the cache, determined by the block's address (in memory space). The frame number is usually given by some low-order bits of the block address. This can also be expressed as: (Frame number) = (Block address) mod (Number of frames (sets) in cache). Note that in a direct-mapped cache, block placement and replacement are both completely determined by the address of the new block that is to be accessed.

Direct-Mapped Identification (figure). The address is split into a tag, a frame number, and an offset. The frame number is decoded to select one row of the tag and block-frame arrays; the stored tag of that one selected frame is compared with the address tag to decide a hit, and a mux uses the offset to select the data word.

Fully-Associative Placement. One alternative to direct-mapped is to allow a block to fill any empty frame in the cache. How do we then locate the block later? We can associate each stored block with a tag that identifies the block's location in the cache. When the block is needed, treat the cache as an associative memory, using the tag to match all frames in parallel and pull out the appropriate block. Another alternative to direct-mapped is placement under full program control: a register file can be viewed as a small programmer-controlled cache (with 1-word blocks).

Fully-Associative Identification (figure). The address is split into a block address and an offset, and the block address is compared against the tags of all block frames in parallel; on a hit, a mux uses the offset to select the data word. Note that, compared to direct-mapped, more address bits have to be stored with each block frame, and a comparator is needed for each frame to do the parallel associative lookup.

Set-Associative Placement. The block address determines not a single frame, but a frame set (several frames grouped together): (Frame set number) = (Block address) mod (Number of frame sets). The block can be placed associatively anywhere within that frame set. If there are n frames in each frame set, the scheme is called n-way set-associative. Direct-mapped is 1-way set-associative; fully associative means there is only one frame set.

Set-Associative Identification (figure, with 4 separate sets). The address is split into a tag, a set number, and an offset; the set number selects one set, and the tags of the frames within that set are compared in parallel. This is intermediate between direct-mapped and fully-associative in the number of tag bits that need to be associated with cache frames. A comparator is still needed for each frame, but only those in the selected set need be activated.

Cache Size Equation. A simple equation for the size of a cache: (Cache size) = (Block size) × (Number of sets) × (Set associativity). This can be related to the sizes of the various address fields: (Block size) = 2^(number of offset bits); (Number of sets) = 2^(number of index bits); (Number of tag bits) = (number of memory address bits) − (number of index bits) − (number of offset bits).
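To make these relations concrete, here is a short C sketch (my own illustration, using an assumed 64 KB, 2-way, 64-byte-block configuration and 44-bit physical addresses like the Alpha example later in the deck) that derives the field widths and splits an address into tag, set index, and offset:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed configuration: 64 KB cache, 64-byte blocks, 2-way set-associative. */
#define CACHE_SIZE    (64 * 1024)
#define BLOCK_SIZE    64
#define ASSOCIATIVITY 2
#define NUM_SETS      (CACHE_SIZE / (BLOCK_SIZE * ASSOCIATIVITY))   /* 512 sets */

/* log2 of a power of two */
static unsigned log2u(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    unsigned offset_bits = log2u(BLOCK_SIZE);   /* block size = 2^offset_bits      */
    unsigned index_bits  = log2u(NUM_SETS);     /* number of sets = 2^index_bits   */
    unsigned addr_bits   = 44;                  /* physical address width (assumed) */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;

    uint64_t addr   = 0x123456789aULL;          /* an arbitrary example address */
    uint64_t offset = addr & ((1ULL << offset_bits) - 1);
    uint64_t index  = (addr >> offset_bits) & ((1ULL << index_bits) - 1);
    uint64_t tag    = addr >> (offset_bits + index_bits);

    printf("offset bits=%u, index bits=%u, tag bits=%u\n", offset_bits, index_bits, tag_bits);
    printf("addr=%#llx -> tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)index, (unsigned long long)offset);
    return 0;
}
```

With associativity 1 this reduces to the direct-mapped rule above: the set index is simply (block address) mod (number of frames).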

Replacement Strategies. Which block do we replace when a new block comes in (on a cache miss)? Direct-mapped: there's only one choice. Associative (fully- or set-): if any frame in the set is empty, pick one of those; otherwise, there are many possible strategies. Random: simple, fast, and fairly effective. Least-Recently Used (LRU), and approximations thereof: requires bits to record replacement information, e.g. a 4-way set has 4! = 24 orderings, so 5 bits are needed to encode the MRU-to-LRU positions. FIFO: replace the oldest block.
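For illustration, a minimal sketch of exact LRU bookkeeping for one 4-way set, using per-way age counters rather than the 5-bit permutation encoding mentioned above (an assumed structure, not from the slides):

```c
#define WAYS 4

/* ages[w] == 0 means most recently used; WAYS-1 means least recently used. */
typedef struct { unsigned ages[WAYS]; } lru_state;

/* Start with an arbitrary valid ordering 0..WAYS-1. */
static void lru_init(lru_state *s) {
    for (int w = 0; w < WAYS; w++) s->ages[w] = (unsigned)w;
}

/* Record a hit (or a fill) in way w: every way younger than w ages by one,
   and w becomes the youngest. */
static void lru_touch(lru_state *s, int w) {
    unsigned old = s->ages[w];
    for (int i = 0; i < WAYS; i++)
        if (s->ages[i] < old) s->ages[i]++;
    s->ages[w] = 0;
}

/* Pick the victim on a miss: the way whose age is WAYS-1. */
static int lru_victim(const lru_state *s) {
    for (int w = 0; w < WAYS; w++)
        if (s->ages[w] == WAYS - 1) return w;
    return 0;   /* unreachable while ages remain a permutation of 0..WAYS-1 */
}
```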

Write Strategies. Most accesses are reads, not writes, especially if instruction reads are included, so optimize for reads: read performance matters most. A direct-mapped cache can even return the value before the valid/tag check completes. Writes are more difficult: we can't write to the cache until we know we have the right block, and the object written may have various sizes (1-8 bytes). When should the cache be synchronized with memory? Write-through: write to both the cache and memory; prone to stalls due to high bandwidth requirements. Write-back: write to memory only upon replacement; memory may be out of date.

Another Write Strategy. Maintain a FIFO queue (write buffer) of cache frames (e.g. using a doubly-linked list). Meanwhile, take items from the head of the queue and write them to memory as fast as the bus can handle; reads might take priority, or use a separate bus. Advantages: write stalls are minimized, while keeping memory as up to date as possible.

Write Miss Strategies. What do we do on a write to a block that's not in the cache? There are two main strategies, and neither stops the processor. Write-allocate (fetch on write): cache the block. No-write-allocate (write around): just write to memory. Write-back caches tend to use write-allocate; write-through caches tend to use no-write-allocate. A dirty bit is used to indicate that a write-back is needed in the write-back strategy.

Example: Alpha 21264 (figure). 64 KB, 2-way set-associative, 64-byte blocks, 512 sets; 44 physical address bits.

Instruction vs. Data Caches. Instructions and data have different patterns of temporal and spatial locality, and instructions are generally read-only, so we can have separate instruction and data caches. Advantages: doubles the bandwidth between the CPU and the memory hierarchy, and each cache can be optimized for its pattern of locality. Disadvantages: slightly more complex design, and the cache space taken up by instructions vs. data cannot be adjusted dynamically.

I/D Split and Unified Caches. Misses per 1000 accesses:

Size     I-Cache   D-Cache   Unified Cache
8KB      8.16      44.0      63.0
16KB     3.82      40.9      51.0
32KB     1.36      38.4      43.3
64KB     0.61      36.9      39.4
128KB    0.30      35.3      36.2
256KB    0.02      32.6      32.9

The instruction miss rate is much lower than the data miss rate.

Cache Performance Equations. Memory stall cycles per program (blocking cache): Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty, or equivalently Memory stall cycles = IC × (Misses / Instruction) × Miss penalty. CPU time formula: CPU time = IC × (CPI_execution + Memory stall cycles / Instruction) × Cycle time. More on cache performance will be given later.

Cache Performance Example. Ideal CPI = 2.0, memory references per instruction = 1.5, cache size = 64 KB, miss penalty = 75 ns, hit time = 1 clock cycle. Compare the performance of two caches: direct-mapped (1-way): cycle time = 1 ns, miss rate = 1.4%; 2-way: cycle time = 1.25 ns, miss rate = 1.0%. Miss penalty (1-way) = 75 ns / 1 ns = 75 cycles; miss penalty (2-way) = 75 ns / 1.25 ns = 60 cycles. CPU time (1-way) = IC × (2.0 + 1.4% × 1.5 × 75) × 1 ns = 3.575 × IC; CPU time (2-way) = IC × (2.0 + 1.0% × 1.5 × 60) × 1.25 ns = 3.625 × IC.
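The arithmetic can be checked directly; this small program of mine simply mirrors the numbers in the example above:

```c
#include <stdio.h>

int main(void) {
    double cpi_exec = 2.0, refs_per_inst = 1.5, penalty_ns = 75.0;

    /* Direct-mapped: 1 ns cycle, 1.4% miss rate */
    double cyc1 = 1.0,  mr1 = 0.014, pen1 = penalty_ns / cyc1;     /* 75 cycles */
    double t1   = (cpi_exec + mr1 * refs_per_inst * pen1) * cyc1;  /* ns per instruction */

    /* 2-way: 1.25 ns cycle, 1.0% miss rate */
    double cyc2 = 1.25, mr2 = 0.010, pen2 = penalty_ns / cyc2;     /* 60 cycles */
    double t2   = (cpi_exec + mr2 * refs_per_inst * pen2) * cyc2;

    printf("1-way: %.3f ns/inst, 2-way: %.3f ns/inst\n", t1, t2);  /* 3.575 vs 3.625 */
    return 0;
}
```

Despite its higher miss rate, the direct-mapped design wins here because of its shorter cycle time.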

Out-Of-Order Processors. Define a new miss penalty that accounts for overlap: Memory stall cycles / Instruction = (Misses / Instruction) × (Total miss latency − Overlapped miss latency). We must compute both the memory latency and the overlapped latency. Example (from the previous slide): assume 30% of the 75 ns penalty can be overlapped, but with the longer (1.25 ns) cycle on the 1-way design due to OOO. Miss penalty (1-way, OOO) = 75 ns × 70% / 1.25 ns = 42 cycles; CPU time (1-way, OOO) = IC × (2.0 + 1.4% × 1.5 × 42) × 1.25 ns = 3.60 × IC.

Cache Performance Improvement. Consider the cache performance equation: (Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty); the second term is the amortized miss penalty. It follows that there are four basic ways to improve cache performance: reducing miss penalty (5.4), reducing miss rate (5.5), reducing miss penalty or miss rate via parallelism (5.6), and reducing hit time (5.7). Note that by Amdahl's Law, there will be diminishing returns from reducing only the hit time or only the amortized miss penalty by itself, instead of both together.

Cache Performance Improvement. Reduce miss penalty: multilevel caches; critical word first and early restart; priority to read misses; merging write buffer; victim cache. Reduce miss rate: larger block size; larger cache size; higher associativity; way prediction and pseudo-associative caches; compiler optimizations. Reduce miss penalty/rate via parallelism: non-blocking caches; hardware prefetching; compiler-controlled prefetching. Reduce hit time: small and simple caches; avoiding address translation when indexing the cache; pipelined cache access; trace caches.

Multi-Level Caches. What is more important: faster caches or larger caches? Average memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1), and Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2). Plugging the second equation into the first: Average memory access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2)).

Multi-level Cache Terminology. Local miss rate: the miss rate of one hierarchy level by itself, i.e. the number of misses at that level divided by the number of accesses to that level; e.g. Miss rate(L1), Miss rate(L2). Global miss rate: the miss rate of a whole group of hierarchy levels, i.e. the number of accesses leaving that group (to lower levels) divided by the number of accesses to the group. Generally this is the product of the miss rates at each level in the group: Global L2 miss rate = Miss rate(L1) × Local miss rate(L2).
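A small sketch of how these definitions combine for a two-level hierarchy (the miss rates and latencies below are hypothetical, chosen only for illustration):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical two-level hierarchy */
    double hit_l1 = 1.0,  miss_l1 = 0.04;   /* local L1 miss rate (= global, since L1 sees all accesses) */
    double hit_l2 = 10.0, miss_l2 = 0.25;   /* local L2 miss rate */
    double mem_penalty = 100.0;             /* cycles to main memory */

    double global_l2  = miss_l1 * miss_l2;  /* fraction of all accesses that reach main memory */
    double penalty_l1 = hit_l2 + miss_l2 * mem_penalty;
    double amat       = hit_l1 + miss_l1 * penalty_l1;

    printf("global L2 miss rate = %.3f\n", global_l2);   /* 0.010 */
    printf("AMAT = %.2f cycles\n", amat);                /* 1 + 0.04 * (10 + 0.25*100) = 2.40 */
    return 0;
}
```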

Effect of 2-level Caching. The L2 size is usually much bigger than L1, which provides a reasonable hit rate; L2 decreases the miss penalty of the 1st-level cache, but may increase the L2 miss penalty. Multiple-level cache inclusion property: an inclusive cache (L1 is a subset of L2) simplifies the cache coherence mechanism, but the effective cache size is just L2; an exclusive cache (L1 and L2 are disjoint) increases the effective cache size to L1 + L2. Enforcing the inclusion property requires backward invalidation of L1 on an L2 replacement.

L2 Cache Performance. (1) The global L2 miss rate is similar to the miss rate of a single cache of the same size. (2) The local miss rate is not a good measure of secondary caches.

Early Restart, Critical Word First. Early restart: don't wait for the entire block to fill; resume the CPU as soon as the requested word is fetched. Critical word first (wrapped fetch, requested word first): fetch the requested word from memory first, resume the CPU, then transfer the rest of the cache block. Most beneficial if the block size is large; commonly used in current processors.

Read Misses Take Priority. The processor must wait on a read, not on a write, so the miss penalty is higher for reads to begin with and there is more benefit from reducing the read miss penalty. A write buffer can queue values to be written until the memory bus is not busy with reads; be careful about memory consistency. What if we want to read a block that is in the write buffer? We could wait for the write and then read the block from memory; better, read the block out of the write buffer. Dirty-block replacement when reading: writing the old block and then reading the new block delays the read; moving the old block to the buffer, reading the new block, then writing the old one is better.

Sub-block Placement. Larger blocks have smaller tags (which match faster); smaller blocks have a lower miss penalty. A compromise solution: use a large block size for tagging purposes and a small block size for transfer purposes. How? Valid bits are associated with the sub-blocks of each tagged block.

Merging Write Buffer. A mechanism to help reduce write stalls: on a write to memory, the block address and the data to be written are placed in a write buffer, and the CPU can continue immediately unless the write buffer is full. Write merging: if the same block is written again before it has been flushed to memory, the old contents are replaced with the new contents in place. Care must be taken not to violate memory consistency and proper write ordering.
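A minimal sketch of the merging idea (illustrative only; the entry count, block size, and per-byte valid mask are assumptions, and a real buffer would also handle draining, ordering, and writes that cross a block boundary):

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define ENTRIES    4
#define BLOCK_SIZE 32   /* bytes per buffer entry (assumed) */

typedef struct {
    bool     valid;
    uint64_t block_addr;               /* block-aligned address */
    uint8_t  data[BLOCK_SIZE];
    uint32_t byte_valid;               /* one bit per byte written */
} wb_entry;

static wb_entry buf[ENTRIES];

/* Returns true if the write was absorbed (merged or placed in a free entry),
   false if the buffer is full and the CPU would have to stall.
   Assumes the write does not cross a block boundary. */
bool write_buffer_put(uint64_t addr, const void *src, unsigned len) {
    uint64_t block  = addr & ~(uint64_t)(BLOCK_SIZE - 1);
    unsigned off    = (unsigned)(addr & (BLOCK_SIZE - 1));
    wb_entry *free_e = NULL;

    for (int i = 0; i < ENTRIES; i++) {
        if (buf[i].valid && buf[i].block_addr == block) {    /* write merging */
            memcpy(buf[i].data + off, src, len);
            buf[i].byte_valid |= ((1u << len) - 1) << off;
            return true;
        }
        if (!buf[i].valid && !free_e) free_e = &buf[i];
    }
    if (!free_e) return false;                               /* buffer full: stall */
    free_e->valid      = true;
    free_e->block_addr = block;
    memcpy(free_e->data + off, src, len);
    free_e->byte_valid = ((1u << len) - 1) << off;
    return true;
}
```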

Write Merging Example (figure).

Victim Cache. A small extra cache that holds blocks overflowing from the occasional over-full frame set. Very effective for reducing conflict misses, and it can be checked in parallel with the main cache, so the increase in hit time is insignificant.

Three Types of Misses. Compulsory: during a program, the very first access to a block will miss, since the block cannot yet be in the cache (unless it is pre-fetched). Capacity: the working set of blocks accessed by the program is too large to fit in the cache. Conflict: unless the cache is fully associative, blocks may sometimes be evicted too early because too many frequently accessed blocks map to the same limited set of frames.

Misses by Type (figures: absolute miss rates by type, and misses by type as a fraction of total misses). Conflict misses are significant in a direct-mapped cache. Going from direct-mapped to 2-way helps about as much as doubling the cache size; going from direct-mapped to 4-way is better than doubling the cache size.

Larger Block Size (keeping cache size and associativity constant). Reduces compulsory misses, due to spatial locality: more accesses hit in an already-fetched block. Increases capacity misses: more unused locations are pulled into the cache. May increase conflict misses slightly: fewer sets mean more blocks competing for each set, depending on the pattern of addresses accessed. Also increases the miss penalty, because block transfers take longer.

Block Size Effect (figure). The miss rate actually goes up if the block is too large relative to the cache size.

Larger Caches (keeping block size, set size, etc. constant). No effect on compulsory misses: the block still won't be there on its first access. Reduces capacity misses: more capacity. Generally reduces conflict misses: the working blocks are spread over more frame sets, fewer blocks map to a set on average, and there is less chance that the number of active blocks mapping to a given set exceeds the set size. But it increases hit time (and cost).

Higher Associativity (keeping cache size and block size constant, i.e. decreasing the number of sets). No effect on compulsory misses. No effect on capacity misses: by definition, these are the misses that would happen anyway in a fully-associative cache. Decreases conflict misses: blocks in an actively used set are less likely to be evicted early just because the set size is smaller than the capacity. Can increase hit time slightly: direct-mapped is fastest, and an n-way associative lookup gets a bit slower for larger n.

Performance Comparison. Assume: Hit time(1-way) = Cycle time(1-way); Miss penalty = 25 × Cycle time(1-way); Hit time(2-way) = 1.36 × Cycle time(1-way); Hit time(4-way) = 1.44 × Cycle time(1-way); Hit time(8-way) = 1.52 × Cycle time(1-way). For a 4 KB cache, the 1-way miss rate is 9.8% and the 4-way miss rate is 7.1%. Average memory access time(1-way) = 1.00 + Miss rate × 25 = 1.00 + 0.098 × 25 = 3.45; Average memory access time(4-way) = 1.44 + Miss rate × 25 = 1.44 + 0.071 × 25 = 3.215 (in units of the 1-way cycle time).

Higher Set-Associativity. Average memory access time vs. cache size and associativity:

Cache Size   1-way   2-way   4-way   8-way
4KB          3.44    3.25    3.22    3.28
8KB          2.69    2.58    2.55    2.62
16KB         2.23    2.40    2.46    2.53
32KB         2.06    2.30    2.37    2.45
64KB         1.92    2.14    2.18    2.25
128KB        1.52    1.84    1.92    2.00
256KB        1.32    1.66    1.74    1.82
512KB        1.20    1.55    1.59    1.66

Higher associativity increases the cycle time; once that is accounted for, as the table shows, 1-way is better in most cases.

Way Prediction. Keep way-prediction information in each set to predict which block in the set will be accessed next. Only one tag is matched in the first cycle; if it misses, the other blocks are examined. Beneficial in two ways: fast data access, since the data can be accessed without waiting for the full tag comparison results, and low power, since only a single tag is matched, provided the majority of predictions are correct. Different systems use variations of this concept.

Pseudo-Associative Caches. Essentially 2-way set-associative, but with sequential (rather than parallel) lookups: a fast hit time if the first frame checked is right, and an occasional slow hit if an earlier conflict had moved the block to its backup location.

Pseudo-Associative Caches. Placement: place block b in frame (b mod n). Identification: look for block b first in frame (b mod n), then in its secondary location ((b + n/2) mod n), i.e. with the most-significant index bit flipped; if it is found there, the primary and secondary blocks are swapped. An MRU bit may be maintained to reduce the search and for better replacement. Replacement: the block in frame (b mod n) is moved to the secondary location ((b + n/2) mod n), and the block there is flushed. Write strategy: any desired write strategy can be used.
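A small sketch of the primary/secondary frame computation described above, assuming the number of frames n is a power of two so that adding n/2 flips the most-significant index bit:

```c
#include <stdint.h>

#define N_FRAMES 1024   /* number of frames; assumed to be a power of two */

/* Primary location: the ordinary direct-mapped frame. */
static inline unsigned primary_frame(uint64_t block_addr) {
    return (unsigned)(block_addr % N_FRAMES);
}

/* Secondary location: (b + n/2) mod n, equivalent to flipping the
   most-significant index bit of the primary frame number. */
static inline unsigned secondary_frame(uint64_t block_addr) {
    return (unsigned)((block_addr + N_FRAMES / 2) % N_FRAMES);
}
```

A lookup would probe primary_frame first and, only on a tag mismatch, probe secondary_frame in the next cycle, swapping the two blocks if the second probe hits.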

Compiler Optimizations. Reorganize code to improve its locality properties: the hardware designer's favorite solution, since it requires no new hardware. Various cache-aware source-to-source transformation techniques exist: merging arrays, loop interchange, loop fusion, and blocking (for multidimensional arrays), among others.

Loop Blocking: Matrix Multiply, before and after blocking.
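The before/after code did not survive the transcription, so the following is a standard sketch of the transformation (matrix size N and blocking factor B are assumptions; B is chosen so that the tiles touched in the inner loops fit in the cache):

```c
#define N 512
#define B 32    /* blocking factor (tile size), tuned to the cache size */

/* Before: straightforward i-j-k matrix multiply. Between reuses, x touches
   whole rows of y and whole columns of z, so large matrices thrash the cache. */
void matmul(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* After: blocked (tiled) version. Each (jj, kk) pair works on B-by-B tiles
   that stay resident in the cache, greatly improving temporal locality.
   x must be zero-initialized before calling this version, since it accumulates. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```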

Effect of Compiler Optimizations (figure).

Non-blocking Caches. Also known as lockup-free caches, or hit under miss: while a miss is being processed, other cache lookups are allowed to continue. Useful in dynamically scheduled CPUs, where other instructions may be waiting in the load queue; reduces the effective miss penalty because useful CPU work fills the miss-penalty delay slot. Hit under multiple miss and miss under miss extend the technique to allow multiple misses to be queued up while still processing new hits.

Non-blocking Caches (figure).

Hardware Prefetching. When memory is idle, speculatively fetch some blocks before the CPU first asks for them. A simple heuristic: fetch one or more blocks consecutive to the last one(s) fetched. Often the extra blocks are placed in a special stream buffer so as not to conflict with actually active blocks in the cache; otherwise the prefetch may pollute the cache. Prefetching can reduce misses considerably. Speculative fetches should be low priority, using only otherwise-unused memory bandwidth, and they are energy-inefficient (like all speculation).

Compiler-Controlled Prefetching. Insert special instructions to fetch data from memory well before it is needed. Design choices: prefetch into a register vs. into the cache, and faulting vs. nonfaulting prefetches; prefetches should be semantically invisible and require a nonblocking cache. Can considerably reduce misses, but can also cause extra conflict misses (by replacing a block before it is completely used) and can delay valid accesses by tying up the bus, so prefetches are low priority and can be pre-empted by real accesses.
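As an illustration, GCC and Clang expose nonfaulting cache prefetches through the __builtin_prefetch extension; the prefetch distance of 16 elements below is an assumption that would need tuning for a real machine:

```c
/* Sum an array while prefetching data that will be needed soon.
   __builtin_prefetch(addr, rw, locality) is a GCC/Clang extension that emits a
   nonfaulting prefetch (or nothing) and never changes program semantics. */
double sum_with_prefetch(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}
```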

Small and Simple Caches. Make the cache smaller to improve hit time, or (probably better) add a new, smaller L0 cache between the existing L1 cache and the CPU. Keep the L1 cache on the same chip as the CPU, physically close to the functional units that access it. Keep the L1 design simple, e.g. direct-mapped: this avoids multiple tag comparisons, and the tag can be compared after the data is fetched from the cache, reducing the effective hit time.

Access Time in a CMOS Cache (figure).

Avoid Address Translation. In systems with virtual address spaces, virtual addresses must be mapped to physical addresses. If cache blocks are indexed/tagged with physical addresses, we must do this translation before we can do the cache lookup: a long hit time. Solution: access the cache using the virtual address (a virtual cache). Drawback: the cache must be flushed on a context switch; this can be fixed by tagging blocks with process IDs (PIDs). Another problem is aliasing, i.e. two virtual addresses mapped to the same physical address; this is fixed with anti-aliasing hardware or page coloring.

Benefit of PID Tags in a Virtual Cache (figure). Miss rate compared for three cases: purging on context switches without PIDs, using PID tags, and no context switching.

Pipelined Cache Access. Pipeline cache access so that the effective latency of a first-level cache hit can be multiple clock cycles: a fast cycle time but slow hits. Hit times: 1 cycle for the Pentium, 2 for the Pentium III, and 4 for the Pentium 4. This increases the number of pipeline stages, giving a higher penalty on mispredicted branches and more cycles from the issue of a load to the use of its data. In reality this increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit.

Trace Caches. Goal: supply enough independent instructions per cycle to find ILP beyond 4 instructions per cycle. Don't limit the instructions in a static cache block to spatial locality; instead cache dynamic sequences of instructions, including taken branches. NetBurst (Pentium 4) uses a trace cache. Addresses are no longer aligned, and the same instruction may be stored more than once if it is part of multiple traces.

Summary of Cache Optimizations (table not reproduced).

Wider Main Memory (figure).

Simple Interleaved Memory. Adjacent words are placed in different memory banks, so the banks can be accessed in parallel, overlapping the latencies of accessing each word. A narrow bus can still be used, returning the accessed words sequentially; this fits well with sequential access, e.g. of the words in a cache block.
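A sketch of the low-order (word) interleaved mapping, with the word size and bank count assumed:

```c
#include <stdint.h>

#define WORD_BYTES 8   /* assumed word size */
#define NUM_BANKS  4   /* assumed number of banks */

/* With low-order interleaving, consecutive words go to consecutive banks,
   so a sequential cache-block fill keeps all banks busy in parallel
   (words 0,1,2,3 land in banks 0,1,2,3, then the pattern repeats). */
static inline unsigned bank_of(uint64_t addr)        { return (unsigned)((addr / WORD_BYTES) % NUM_BANKS); }
static inline uint64_t offset_in_bank(uint64_t addr) { return (addr / WORD_BYTES) / NUM_BANKS; }
```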

Independent Memory Banks. The original motivation for memory banks was higher bandwidth by interleaving sequential accesses. Banks also allow multiple independent accesses, though each bank then requires separate address/data lines. Non-blocking caches allow the CPU to proceed beyond a cache miss, which permits multiple simultaneous cache misses; that is possible only with memory banks.

Main Memory. Bandwidth: bytes read or written per unit time. Latency, described by: access time, the delay between the initiation and completion of an access (for reads: from presenting the address until the result is ready), and cycle time, the minimum interval between separate requests to memory. Address lines: a separate bus from the CPU to memory carries addresses. RAS (Row Access Strobe): the first half of the address, sent first. CAS (Column Access Strobe): the second half of the address, sent second.

RAS vs. CAS. In a DRAM bit-cell array: (1) RAS selects a row; (2) all of the row's data is read out in parallel; (3) CAS selects a column to read; (4) the selected bit is written to the memory bus.

Typical DRAM Organization (figure, 256 Mbit part). The 28-bit cell address is split in half: the high 14 bits select the row (sent with RAS) and the low 14 bits select the column (sent with CAS).
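A sketch of how a controller would split such an address into the row and column halves sent with RAS and CAS (the 14/14 split matches the figure above; everything else is assumed):

```c
#include <stdint.h>

#define ROW_BITS 14
#define COL_BITS 14

/* The row address (high bits) is sent first with RAS;
   the column address (low bits) follows with CAS. */
static inline unsigned dram_row(uint32_t cell_addr) { return (cell_addr >> COL_BITS) & ((1u << ROW_BITS) - 1); }
static inline unsigned dram_col(uint32_t cell_addr) { return  cell_addr & ((1u << COL_BITS) - 1); }
```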

Types of Memory. DRAM (Dynamic Random Access Memory): the cell design needs only 1 transistor per bit stored; cell charges leak away and may dynamically (over time) drift from their initial levels, so periodic refreshing is required to correct the drift (e.g. every 8 ms), with the time spent refreshing kept to <5% of bandwidth. SRAM (Static Random Access Memory): cell voltages are statically (unchangingly) tied to power supply references, so there is no drift and no refresh, but each bit needs 4-6 transistors. Compared with SRAM, DRAM is 4-8x denser, 8-16x slower, and 8-16x cheaper per bit.

Amdahl/Case Rule. Memory size (and I/O bandwidth) should grow linearly with CPU speed; typically 1 MB of main memory and 1 Mbps of I/O bandwidth per 1 MIPS of CPU performance. It then takes a fairly constant ~8 seconds to scan the entire memory (1 MB at 1 Mbit/s is about 8 s, assuming memory bandwidth = I/O bandwidth, 4 bytes per load, 1 load per 4 instructions, and no latency problem). Moore's Law: DRAM size doubles every 18 months (up 60%/year), which tracks processor speed improvements. Unfortunately, DRAM latency has only decreased about 7% per year, and latency is a big deal.

Some DRAM Trend Data. Since 1998, the rate of increase in chip capacity has slowed to 2x per 2 years: 128 Mb in 1998, 256 Mb in 2000, 512 Mb in 2002.

ROM and Flash. ROM (Read-Only Memory): nonvolatile and provides protection; used in embedded processors. Flash: a nonvolatile NVRAM that requires no power to maintain its state; reading flash is near DRAM speed, but writing is 10-100x slower than DRAM. Frequently used for upgradeable embedded software.

DRAM Variations. SDRAM (Synchronous DRAM): DRAM whose internal operation is synchronized by a clock signal provided on the memory bus; Double Data Rate (DDR) versions use both clock edges. RDRAM (RAMBUS Inc. DRAM): a proprietary DRAM interface technology with on-chip interleaving / multi-bank organization and a high-speed packet-switched (split-transaction) bus; the interface is byte-wide, synchronous, and dual-rate; licensed to many chip and CPU makers; higher bandwidth, but more costly than generic SDRAM. DRDRAM (Direct RDRAM, the 2nd-edition spec): separate row and column address/command buses and higher bandwidth (18-bit data, more banks, faster clock).

Virtual Memory. The addition of the virtual memory mechanism complicates cache access.

Paging vs. Segmentation. A paged segment holds an integral number of pages, allowing easy replacement while still treating each segment as a unit.

Four Important Questions. Where can a block be placed in main memory? The operating system takes care of it; because a miss (page fault) takes very long, placement is fully associative. How is a block found in main memory? A page table is used; the offset is concatenated to the frame number when paging is used, and added to the base when segmentation is used. Which block is replaced when needed? LRU (or an approximation) is used to minimize page faults. What happens on a write? Magnetic disks take millions of cycles to access, so always write back (using a dirty bit).

Addressing Virtual Memories (figure).

Fast Address Calculation. Page tables are very large and are kept in main memory, so a naive translation requires two memory accesses for one read or write. Instead, remember the last translation and reuse it if the access is to the same page; more generally, exploit the principle of locality: if accesses have locality, the address translations also have locality, so keep them in a cache, the translation lookaside buffer (TLB). The tag portion of a TLB entry holds the virtual page number and the data portion holds the physical page (frame) number.
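A minimal fully-associative TLB lookup sketch (the entry count and page size are assumptions, not the Alpha's parameters):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64
#define PAGE_BITS   13                 /* 8 KB pages (assumed) */

typedef struct {
    bool     valid;
    uint64_t vpn;                      /* virtual page number (the tag)          */
    uint64_t pfn;                      /* physical page frame number (the data)  */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Translate a virtual address; returns false on a TLB miss
   (the page table would then be walked and the TLB refilled). */
bool tlb_translate(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn    = vaddr >> PAGE_BITS;
    uint64_t offset = vaddr & ((1ULL << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {          /* associative search of all entries */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].pfn << PAGE_BITS) | offset;   /* offset is concatenated, not added */
            return true;
        }
    }
    return false;
}
```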

TLB Example: Alpha 21264 (figure). The address-space number (ASN) field plays the same role as a PID tag.

A Memory Hierarchy Example (figure).

Protection of Virtual Memory. Maintain two registers, base and bound, and for each access check base <= address <= bound. Provide two modes: user and OS (kernel, supervisor, or executive mode).

Alpha 21264 Virtual Address Mapping (figure). Supports both segmentation and paging.

Design of Memory Hierarchies. Remaining issues: superscalar CPUs and the number of ports to the cache; speculative execution and the memory system; combining the instruction cache with fetch and decode; caches in embedded systems (real-time vs. power constraints); I/O and the consistency of cached data (the cache coherence problem).

The Cache Coherency Problem (figure).

Alpha 21264 Memory Hierarchy (figure).