Chapter 5. Topics in Memory Hierarchy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Topics in Memory Hierarchy

- Memory hierarchy
  - features: temporal and spatial locality
  - common rule: faster -> more expensive -> smaller
- Cache
  - organization
  - fetch algorithm
  - replacement
  - write policy
  - unified / split caches
  - multi-level caches
- Virtual memory
  - paged memory, segmented memory
  - paged segments, segmented pages
  - TLB
  - protection
  - virtual address space
- Main memory
  - interleaved memory
  - access modes
- Multiprocessor caches
  - coherence protocols
  - consistency models

Four Questions about Caches

- Q1: Where can a block be placed in the upper level? (Block placement)
  - fully associative, direct mapped, or set associative
  - index = (block number) mod (# of sets)
- Q2: How is a block found if it is in the upper level? (Block identification)
  - address = tag + index + block offset
- Q3: Which block should be replaced on a miss? (Block replacement)
  - random (for large associativities)
  - LRU (for smaller associativities)
- Q4: What happens on a write? (Write strategy)
  - write through: the data is written to both the cache and memory
  - write back: the data is written only to the cache; the modified cache block is written to main memory only when it is replaced

Terms in Memory Hierarchy

- Block (page, line): the fixed-length unit transferred between levels
- Hit / miss
- Miss ratio: m = (# of misses) / (# of hits + # of misses)
- Hit time: the time to access the upper level
- Miss penalty: access time of the lower level + replace time + transfer time
- Average (effective) access time: t_avg = t_hit + miss ratio x miss penalty
- CPI_with-cache = CPI_ideal + (memory references per instruction) x miss ratio x miss-penalty cycles
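
The last two formulas are easy to misread in transcription; here is a minimal worked sketch in C, with all parameter values invented for illustration:

    #include <stdio.h>

    int main(void) {
        /* Illustrative parameters -- assumptions, not from the slides */
        double t_hit      = 1.0;   /* hit time, cycles                  */
        double miss_ratio = 0.05;  /* misses per memory reference       */
        double t_miss     = 40.0;  /* miss penalty, cycles              */
        double cpi_ideal  = 1.2;   /* CPI with a perfect cache          */
        double refs       = 1.3;   /* memory references per instruction */

        /* t_avg = t_hit + miss ratio x miss penalty */
        double t_avg = t_hit + miss_ratio * t_miss;

        /* CPI_with-cache = CPI_ideal + refs/instr x miss ratio x penalty */
        double cpi = cpi_ideal + refs * miss_ratio * t_miss;

        printf("average access time = %.2f cycles\n", t_avg); /* 3.00 */
        printf("effective CPI       = %.2f\n", cpi);          /* 3.80 */
        return 0;
    }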

Decisions on Caches

- Organization (placement)
  - direct-mapped
  - fully-associative
  - set-associative
  - also: sector caches, multi-level caches
- Fetch algorithm
  - fetch on miss
  - non-blocking
  - also: fetch bypassing, prefetch
- Replacement algorithm
  - LRU, random, FIFO
- Write policy
  - write-back
  - write-through
- Cache parameters: cache size (C), block (line) size (b), number of sets (s), and associativity (a), related by C = s x a x b (see the sketch after this slide)

Strategies for Line Replacement

- Fetching a line
  - access and fill: the fetch starts on the line boundary
  - fetch bypass (wrap-around load): access the faulted word first and load the remainder of the line afterwards
  - non-blocking cache: the processor continues execution during the miss, as long as later instructions do not depend on the missing data
  - prefetching
- Line replacement
  - LRU (least-recently used): optimizes for temporal locality; a counter associated with each line is updated on each read/write
  - FIFO: the line resident longest is replaced
  - random: select a victim at random
- Victim cache: a small fully-associative buffer holding recently replaced lines
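
How C = s x a x b splits an address into tag, index, and offset is worth seeing concretely; a minimal sketch with made-up geometry (a 32 KB, 2-way cache with 16-byte blocks):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: C = s * a * b must hold */
    enum { BLOCK = 16, ASSOC = 2, SETS = 1024 };
    enum { CACHE_SIZE = SETS * ASSOC * BLOCK };   /* 32 KB */

    int main(void) {
        uint32_t addr = 0x12345678;

        uint32_t offset    = addr % BLOCK;        /* block offset    */
        uint32_t block_num = addr / BLOCK;        /* block number    */
        uint32_t index     = block_num % SETS;    /* set index       */
        uint32_t tag       = block_num / SETS;    /* the rest is tag */

        printf("C = %d; addr 0x%08x -> tag 0x%x, set %u, offset %u\n",
               CACHE_SIZE, addr, tag, index, offset);
        return 0;
    }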

Write Policies

- Write back (copy back)
  - a cached block is written to memory only when it is replaced
  - dirty bits avoid writing back clean blocks
  - less traffic for larger caches
- Write through
  - a block is written to both the cache and memory
  - advantage: the cache and memory always hold consistent copies
  - disadvantage: high memory bandwidth required; traffic per instruction = fraction of references that are writes
- On a write miss
  - write allocate
  - no write allocate
- Write buffers: let reads (and the CPU) proceed without stalling on writes, which drain from the buffer to memory (processor -> cache -> write buffer -> memory)

Other Types of Caches

- Split data / instruction caches versus a unified cache
  - separate: 1. increased bandwidth; 2. the instruction cache exploits the spatial locality of instruction fetch; 3. no interlock on simultaneous instruction and data requests
  - unified: 1. less costly; 2. lower overall miss ratio for the same total size; 3. self-modifying code works naturally
- Cache performance
  - CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
  - Misses per instruction = Memory accesses per instruction x Miss rate
  - CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
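
The dirty-bit argument is easy to demonstrate with a toy model; a minimal sketch in C (one cache line and an invented six-reference trace) counting the words written to memory under each policy:

    #include <stdio.h>

    struct line { int valid, dirty, tag; };

    int main(void) {
        /* invented trace: (block address, is_write) pairs */
        int addr[]  = {0, 0, 0, 8, 0, 0};
        int is_wr[] = {0, 1, 1, 1, 0, 1};
        int n = 6;

        struct line wb = {0, 0, 0};
        int wb_traffic = 0, wt_traffic = 0;   /* words written to memory */

        for (int i = 0; i < n; i++) {
            /* write through: every write goes to memory */
            if (is_wr[i]) wt_traffic++;

            /* write back: memory is written only when a dirty
             * line is evicted; the dirty bit spares clean lines */
            if (!wb.valid || wb.tag != addr[i]) {          /* miss  */
                if (wb.valid && wb.dirty) wb_traffic++;    /* evict */
                wb.valid = 1; wb.dirty = 0; wb.tag = addr[i];
            }
            if (is_wr[i]) wb.dirty = 1;
        }
        if (wb.valid && wb.dirty) wb_traffic++;            /* final flush */

        printf("write-through traffic: %d words\n", wt_traffic); /* 4 */
        printf("write-back traffic:    %d words\n", wb_traffic); /* 3 */
        return 0;
    }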

Improving Cache Performance

- Cache size = (# of sets) x set associativity x block size
- Effects of parameters
  - cache size: a larger cache exploits temporal locality better, but has a longer access time
  - block size: a larger block exploits spatial locality better, but transfers unused data and causes more replacement
  - associativity: higher associativity gives a lower miss ratio, at higher cost
- Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
- Improve performance by: 1. reducing the miss rate, 2. reducing the miss penalty, or 3. reducing the time to hit in the cache

Sources of Cache Misses

- The 3C (or 4C) classification
  - compulsory: the first access to a block (cold-start miss, first-reference miss)
  - capacity: blocks discarded because the cache cannot hold the working set; limited by cache size
  - conflict: blocks discarded because too many blocks map to the same set
  - coherence: blocks invalidated by references from other processors
- Miss rate reduction
  - I-1: larger block size
    - takes advantage of spatial locality
    - but increases the miss penalty
  - I-2: higher associativity (at the cost of increased hit time)
  - I-3: victim cache (see the sketch after this slide)
    - a small fully associative cache
    - holds blocks recently replaced on a miss
    - most useful for small, direct-mapped caches
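
A victim cache is simple enough to sketch; below is a minimal tags-only model in C (all sizes invented) in which two blocks that conflict in a direct-mapped cache end up hitting alternately via swaps:

    #include <stdint.h>
    #include <string.h>

    enum { SETS = 64, VICTIMS = 4 };

    static int32_t cache[SETS];       /* one tag per set, -1 = empty */
    static int32_t victim[VICTIMS];   /* fully associative tags      */
    static int vnext;                 /* FIFO replacement pointer    */

    int access(int32_t block) {
        int set = block % SETS;

        if (cache[set] == block) return 1;      /* primary hit */

        for (int i = 0; i < VICTIMS; i++)
            if (victim[i] == block) {           /* victim hit: */
                victim[i] = cache[set];         /* swap blocks */
                cache[set] = block;
                return 1;
            }

        victim[vnext] = cache[set];             /* miss: evicted block  */
        vnext = (vnext + 1) % VICTIMS;          /* goes into the victim */
        cache[set] = block;                     /* cache, FIFO order    */
        return 0;
    }

    int main(void) {
        memset(cache, -1, sizeof cache);
        memset(victim, -1, sizeof victim);
        /* blocks 0 and 64 conflict in set 0 of the direct-mapped
         * cache, but after the first two misses they hit via swaps */
        access(0); access(64); access(0); access(64);
        return 0;
    }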

Miss Rate Reduction (cont.)

- I-4: Pseudo-associative caches
  - proceed like a direct-mapped cache
  - on a miss, check another entry in the set before going to the next level
  - gives one fast hit time and one slow hit time
  - can reduce the miss rate
- I-5: Prefetching
  - prefetch instructions
  - hardware-based stream buffers
  - compiler-controlled prefetch (see the sketch after this slide)
    - register prefetch
    - cache prefetch
  - issues
    - faulting vs. non-faulting (nonbinding) prefetches
    - blocking vs. nonblocking (lockup-free) caches
- Prefetching relies on extra memory bandwidth that can be used without penalty

More Miss Rate Reduction (Prefetching)

- Hardware prefetching of instructions
  - the Alpha 21064 fetches 2 blocks on a miss
  - the extra block is placed in a stream buffer
  - on a miss, the stream buffer is checked first
  - works for data blocks too
- Data prefetching
  - preload data into a register (HP PA-RISC loads)
  - cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC V9)
  - special prefetching instructions cannot cause faults: a form of speculative execution
  - issuing prefetch instructions takes time: is the cost of issuing prefetches less than the savings in reduced misses?
- Prefetching policies
  - OBL ("one block lookahead")
  - always prefetch
  - prefetch on miss
  - tagged prefetch
  - stride prefetching
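
On today's compilers, a compiler-controlled cache prefetch can be written with the GCC/Clang __builtin_prefetch intrinsic; a minimal sketch (the prefetch distance is an invented tuning value):

    /* Non-faulting software prefetch: a bad address is simply ignored */
    enum { AHEAD = 16 };   /* prefetch distance in elements, a tuning knob */

    double sum(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            /* arguments: address, read (0), moderate temporal locality (2) */
            __builtin_prefetch(&a[i + AHEAD], 0, 2);
            s += a[i];
        }
        return s;
    }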

More Miss Rate Reduction (Compiler Optimizations)

- I-6: Instructions
  - reorder procedures in memory so as to reduce misses
  - use profiling to look for conflicts
  - McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks
- I-7: Compiler optimizations on data
  - merging arrays: improve spatial locality with a single array of compound elements instead of two separate arrays (see the sketch after the loop interchange example)
  - loop interchange: change the nesting of loops to access data in the order it is stored in memory
  - loop fusion: combine two independent loops that have the same loop bounds and some overlapping variables
  - blocking: improve temporal locality by accessing blocks of data repeatedly instead of going down whole columns or rows

Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
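
The merging-arrays transformation has no example on these slides; the usual textbook illustration looks like this (SIZE and the field names are illustrative):

    #define SIZE 1024

    /* Before: key[i] and val[i] live in different arrays, far apart */
    int val[SIZE];
    int key[SIZE];

    /* After: one array of records, so code that touches key[i] and
     * then val[i] finds both in the same cache block */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];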

Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

Two misses per access to a and c become one miss per access; the fused loop reuses each element while it is still in the cache, improving temporal locality.

Blocking Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k]*z[k][j];
            x[i][j] = r;
        }

- The two inner loops:
  - read all N x N elements of z[]
  - read N elements of one row of y[] repeatedly
  - write N elements of one row of x[]
- Capacity misses are a function of N and the cache size:
  - roughly 2N^3 + N^2 words accessed (assuming no conflict misses; otherwise even more)
- Idea: compute on a B x B submatrix that fits in the cache

Blocking Example (cont.)

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B, N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B, N); k = k+1)
                        r = r + y[i][k]*z[k][j];
                    x[i][j] = x[i][j] + r;
                }

- B is called the blocking factor (one way to choose it is sketched after this slide)
- capacity misses drop from about 2N^3 + N^2 to about N^3/B + 2N^2
- conflict misses matter too: a poorly chosen B can still thrash the cache

Reducing Cache Miss Penalty

- II-1: Give read misses priority over writes
  - the cost of writes is absorbed by a write buffer
- II-2: Sub-block placement (do not load the full block on a miss)
  - access unit: a sub-block; tag unit: a full line
  - + reduces tag overhead
  - + allows wrapped fetch
- II-3: Do not wait for the full block to load before restarting the CPU
  - early restart
  - critical word first (wrapped fetch / requested word first)
  - the benefit grows with block size
- II-4: Nonblocking caches (lockup-free caches)
  - allow the cache to continue supplying data during a miss
  - may cause out-of-order memory accesses
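
One common rule of thumb (an assumption here, not stated on the slides) is to pick B so that roughly three B x B sub-arrays fit in the cache at once:

    #include <math.h>
    #include <stdio.h>

    /* Choose B so that ~3 * B^2 doubles fit in a cache of the given size */
    int blocking_factor(long cache_bytes) {
        long elems = cache_bytes / (long)sizeof(double);
        return (int)sqrt((double)elems / 3.0);
    }

    int main(void) {
        printf("B for a 32 KB cache: %d\n", blocking_factor(32 * 1024)); /* 36 */
        return 0;
    }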

Multi-level Caches

- Why multi-level caches? Larger caches are needed to reduce the frequency of the most costly misses; multiple levels reduce the cost of a miss at each point in the hierarchy.
- II-5: Two-level caches
  - motivation: reduce the average access time through
    - larger cache size and block size
    - cache associativity
    - hierarchical sharing
    - shielding processors from coherence traffic
  - multi-level inclusion property: L2 always contains a superset of the data in L1
    - example: C1 = 16 KB, direct-mapped (A1 = 1), B1 = 16; C2 = 256 KB = 16 x C1; if B2 = B1 then A2 = 16, and if B2 = 4 x B1 then A2 = 64
    - condition relating associativity (A), block size (B), and set count (S) across levels: A(i+1) >= A(i) x max(B(i+1)/B(i), S(i)/S(i+1))

Reducing Hit Time

- III-1: Fast hit times via small and simple caches
  - direct mapped, on chip
- III-2: Virtually addressed caches
  - avoid address translation while indexing the cache
- III-3: Pipelined writes
  - pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update
  - only writes use this pipeline, and it empties during a miss
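
For a two-level hierarchy the average access time composes per level; a minimal sketch in C with invented numbers:

    #include <stdio.h>

    int main(void) {
        /* Illustrative parameters -- assumptions, not from the slides */
        double hit_l1 = 1.0, miss_l1 = 0.05;  /* L1 hit time, local miss rate */
        double hit_l2 = 8.0, miss_l2 = 0.20;  /* L2 hit time, local miss rate */
        double mem    = 100.0;                /* main-memory penalty, cycles  */

        /* AMAT = t_hit(L1) + m(L1) x (t_hit(L2) + m(L2) x t_mem) */
        double amat = hit_l1 + miss_l1 * (hit_l2 + miss_l2 * mem);
        printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05*(8+20) = 2.40 */
        return 0;
    }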

Virtual-Addressed Caches

- Motivation: reduce the cache hit time
- Approach
  - translate the address partially in parallel with the cache access (limited by the page size), or
  - use virtual caches (processor -> virtual cache -> memory, with translation only on a miss)
- Problems
  - address translation is still required on a miss
  - synonyms (aliases): two virtual addresses for the same physical address; solutions include an inverted page table (IBM 801), a reverse translation table, or a two-level cache
  - large address spaces (64-bit)
  - context switching: add a PID to the tag, or flush the cache
  - I/O devices and multiprocessor cache coherence work with physical addresses

Fast Hits by Avoiding Address Translation

[Figure: three organizations side by side. (1) Conventional: the CPU's virtual address goes through the TLB, and the physical address indexes the cache and memory. (2) Virtually addressed cache: the virtual address indexes the cache directly and is translated only on a miss; this raises the synonym problem. (3) Overlapped: the cache is accessed with the virtual index while the TLB translates in parallel, with physical tags and a physically addressed L2.]

- a virtually addressed cache translates only on a miss, but must handle synonyms
- overlapping the cache access with translation requires the cache index to remain invariant across translation
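
The invariant-index requirement has a compact arithmetic form: the set index must fall entirely within the page offset, i.e. cache size / associativity <= page size. A quick check with assumed parameters:

    #include <stdio.h>

    /* The index stays invariant across translation iff it is drawn
     * entirely from the page-offset bits: C / A <= page size */
    int index_is_untranslated(long cache_bytes, int assoc, long page_bytes) {
        return cache_bytes / assoc <= page_bytes;
    }

    int main(void) {
        printf("32 KB, 1-way, 4 KB pages: %d\n",
               index_is_untranslated(32768, 1, 4096));   /* 0: too big */
        printf("32 KB, 8-way, 4 KB pages: %d\n",
               index_is_untranslated(32768, 8, 4096));   /* 1: OK      */
        return 0;
    }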

Cache Optimization Summary

    Technique                             MR   MP   HT   HW complexity
    Larger block size                     +    -         0
    Higher associativity                  +         -    1
    Victim caches                         +              2
    Pseudo-associative caches             +              2
    HW prefetching of instr/data          +              2
    Compiler-controlled prefetching       +              3
    Compiler techniques to reduce misses  +              0
    Priority to read misses                    +         1
    Subblock placement                    -    +         1
    Early restart & critical word first        +         2
    Non-blocking caches                        +         3
    Second-level caches                        +         2
    Small & simple caches                 -         +    0
    Avoiding address translation                    +    2
    Pipelining writes                               +    1

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, - hurts.)

Using Caches vs. a Large Register File

- Large register file, for:
  - holds all local scalar variables
  - the compiler can assign global variables to registers
  - register addressing
  - may allow parallel access
  - small, on chip, and fast
- Large register file, against:
  - significant overhead on context switches
  - a hard limit on the number of local variables
  - only a small fraction is actively used
  - relies on the compiler to maximize usage
- Caches, for:
  - hold recently used local scalars, blocks of memory, and recently used globals
  - memory addressing (associative access)
- Caches, against:
  - non-uniform access latency
  - slower access time
  - more hardware complexity
  - replacement overhead

Main Memory

- Performance of main memory
  - latency, which sets the cache miss penalty
    - access time: the time between a request and the word arriving
    - cycle time: the minimum time between requests
  - bandwidth, which matters for I/O and for the miss penalty of large blocks
- Main memory is DRAM (dynamic random access memory)
  - dynamic: it must be refreshed periodically (roughly every 8 ms)
  - addresses are divided into two halves (the array is a 2D matrix)
    - RAS: row access strobe
    - CAS: column access strobe
- Caches use SRAM (static random access memory)
  - no refresh (6 transistors/bit vs. 1 transistor/bit)
  - the address is not divided

Improving Main Memory Bandwidth

(a) One-word-wide: CPU, cache, bus, and memory are all one word wide.
(b) Wide memory: cache, bus, and memory are N words wide, with a multiplexer between the cache and the CPU.
(c) Interleaved memory: CPU, cache, and bus are one word wide, but memory is organized as N modules (e.g., 4 modules).
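
Low-order interleaving spreads consecutive words across the modules so a block fetch can overlap their access times; a minimal address-mapping sketch (bank count and word size invented):

    #include <stdint.h>
    #include <stdio.h>

    enum { BANKS = 4, WORD = 4 };   /* 4 modules, 4-byte words */

    int main(void) {
        for (uint32_t addr = 0; addr < 8 * WORD; addr += WORD) {
            uint32_t w      = addr / WORD;
            uint32_t bank   = w % BANKS;   /* which module       */
            uint32_t offset = w / BANKS;   /* word within module */
            printf("addr %2u -> bank %u, offset %u\n", addr, bank, offset);
        }
        return 0;
    }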

Increasing Memory Bandwidth (cont.)

- Independent memory banks
  - allow multiple independent accesses
  - multiple memory controllers let the banks operate independently
- Avoiding bank conflicts
  - the problem:

        int x[256][512];
        for (j = 0; j < 512; j = j+1)
            for (i = 0; i < 256; i = i+1)
                x[i][j] = 2 * x[i][j];

  - even with 128 banks there is a conflict: the inner loop strides by 512 words, and 512 is a multiple of 128, so every access hits the same bank
  - solutions (see the sketch after this slide):
    - SW: loop interchange, or declare the array with a non-power-of-2 dimension
    - HW: use a prime number of banks

DRAM-Specific Interleaving

- Multiple column accesses per RAS
  - nibble mode: the DRAM supplies three extra bits per row access
  - page mode: keep changing the column address until the next access or refresh time
  - static column: do not toggle the column access strobe on each column address change
- New DRAMs to address the processor-memory gap
  - synchronous DRAM: a clock signal is provided to the DRAM, and transfers are synchronous with the system clock
  - RAMBUS: each chip is a module rather than a slice of memory
    - a short bus between the CPU and the chips
    - does its own refresh
    - returns a variable amount of data
    - 1 byte per 2 ns (500 MB/s per chip)
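
The non-power-of-2 declaration is just padding; a minimal sketch (assuming 128 banks and one word per array element):

    /* One extra column makes the row stride 513 words; 513 is not a
     * multiple of 128, so the column walk below moves to a different
     * bank on every access instead of pounding a single bank */
    int x[256][513];   /* pad: 513 columns instead of 512 */

    void scale_columns(void) {
        for (int j = 0; j < 512; j = j + 1)   /* only 512 real columns */
            for (int i = 0; i < 256; i = i + 1)
                x[i][j] = 2 * x[i][j];
    }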