Memory Hierarchy Chapter 2. Abdullah Muzahid
17. 2-Way Set Associative Example
Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy. How big is the cache? (2 ways x 64 sets x 16 bytes/line = 2048 bytes = 2 KB of data.)
[Worksheet: for each reference (address, R/W), fill in the binary address, tag, set, offset, whether the block is found (hit/miss), and which way is updated. The reference addresses and table cells were lost in transcription.]
17. 2-Way Set Associative Example (solution)
[Solution table: the early references miss on a cold cache; later references hit when the tag matches a valid block in the indexed set, and LRU picks the way to update on each miss. The numeric entries were lost in transcription.]
18. Write-Through, No Write Allocate Example
Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy.
[Worksheet: as before, plus a memory-refs column counting main-memory reads and writes per reference. The reference addresses were lost in transcription.]
How many main memory reads? How many main memory writes?
18. Write-Through, No Write Allocate Example (solution)
[Solution table lost in transcription. Key behavior: every write goes to main memory (write-through); a write miss does not allocate a block ("None" in the update column); a read miss loads the block, costing one memory read.]
Main memory reads: 4. Main memory writes: 3.
19. Write-Back, Write Allocate Example
Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy.
[Worksheet: as before, plus a dirty column; put * in the updated way if that way is (still) dirty.]
How many reads and writes, and why?
19. Write-Back, Write Allocate Example (solution)
[Solution table lost in transcription. Key behavior: a write miss allocates, costing one memory read and marking the way dirty; a write hit stays in the cache without touching memory.]
⇒ The last reference evicts dirty block 448, causing a read of block 34 and a write-back of 448!
Main memory reads: 5. Main memory writes: 1.
20. L1 Cache of AMD Opteron (Comp Arch, Henn & Patt, Fig B.5, pg B-13)
[Figure: 2-way associative L1 cache, 512 blocks per way; valid & dirty bit per block, one LRU bit per set; block address split into a <25>-bit tag and <9>-bit index, plus a <6>-bit block offset; two tag comparators feed a 2:1 mux; an 8-block victim buffer sits between the cache and lower-level memory.]
1. Address split into 3 parts: 64-byte block size → 6-bit offset; 512-entry index → 9 bits; 25-bit tag (40-bit physical address)
2. Index determines the proper set
3. Check if the tag matches & the valid bit is set
4. Mux selects which way to pass out: on hit, output → CPU; on miss, output → victim buffer
21. Improving Cache Performance
Assuming the main and virtual memory implementations are fixed, how can we improve our cache performance?
- Reduce the cache miss penalty
- Reduce the cache miss rate
- Use parallelism to overlap operations, improving one or both of the above (e.g., doing hardware prefetch in parallel with normal memory traffic can reduce miss rate)
- Reduce the cache hit time
⇒ Will discuss each of these in turn
22. Ideas for Reducing Cache Miss Penalty
- Use early restart: allow the CPU to continue as soon as the required bytes are in the cache, rather than waiting for the entire block to load
- Critical word first: load the accessed bytes of the block first, then load the remaining words in wrap-around order; status bits are needed to indicate how much of the block has arrived; particularly good for caches with large block sizes
- Give memory reads priority over writes
- Merging write buffer
- Victim caches
- Use multilevel caches
23. Giving Memory Reads Priority Over Writes (Reducing Cache Miss Penalty)
Assume the common case of having a write buffer so that the CPU does not stall on writes (as it must for reads). The CPU can check the write buffer on a read miss:
- If the data is presently in the write buffer, load from there
- If not in the write buffer, load from memory before the prior writes
Advantages: since a read stalls the CPU and a write does not, we minimize stalls; we may avoid a memory load if the data is in the write buffer.
Write buffers can make write-back more efficient as well:
1. On a dirty-block eviction, copy the dirty block from the cache to the write buffer
2. Load the evicting block from memory to the cache (CPU unstalled)
3. Write the dirty block from the buffer to memory
24. Merging Write Buffer (Reducing Cache Miss Penalty)
Due to latency, multiword writes are more efficient than writing words separately. Multiple words in the write buffer may be associated with the same block address; valid bits indicate which words to write. This reduces the number of memory accesses and the number of write-buffer stalls for a given buffer size.
[Comp Arch, Henn & Patt, Fig 2.7: without merging, sequential 8-byte writes to Mem[100], Mem[108], Mem[116], Mem[124] occupy four buffer entries; with merging, one entry holds all four words with their valid bits set.]
Notes: valid bits are not needed in a write-back cache; assume a 32-byte block for the cache; for sequential accesses, a 4-fold reduction in the number of writes and buffer entries. In practice, must handle 1-, 2-, and 4-byte as well as 8-byte words. ⇒ The larger the block size, the more this helps.
25. Victim Caches (Reducing Cache Miss Penalty)
Victim cache: a small (e.g., 1-8 blocks), fully-associative cache that contains recently evicted blocks from a primary cache.
- Checked in parallel with the primary cache
- Available on the following cycle if the item is in the victim cache
- Victim block swapped with the block in the cache
- Of great benefit to a direct-mapped L1 cache
(Comp Arch 3rd ed, Henn & Patt, Fig 5.13, pg 422) Less popular today!
26. Multi-Level Caches (Reducing Cache Miss Penalty)
Becomes more popular as the miss penalty for the primary cache grows. Further caches may be off-chip, but are still made of SRAM. Almost all general-purpose machines have at least 2 levels of cache; most have 2 on-chip caches. Further caches typically have larger blocks and cache size.
Equations:
local miss rate = misses in this cache / accesses to this cache
global miss rate = misses in this cache / accesses to L1 cache
avg acc time = L1 hit time + L1 miss rate * L1 miss penalty
L1 miss penalty = L2 hit time + L2 miss rate * L2 miss penalty
L2 miss penalty = main memory access time
The L1 miss penalty is the average access time for L2.
Local miss rate: % of this cache's refs that go to the next level. Global miss rate: % of all memory refs that go to the next level.
26. Cache Miss Equation Examples
Assume nref = 1000, nL1miss = 40, nL2miss = 20, L2 miss penalty = 100 cycles, L2 hit time = 10, L1 hit time = 1.
1. What is the local and global miss rate for each cache?
→ L1 has the same local and global miss rate, since all memory refs go to L1:
L1 miss rate = 40 / 1000 = 0.04 = 4%
L2 local miss rate = 20 / 40 = 0.5 = 50%
L2 global miss rate = 20 / 1000 = 0.02 = 2%
2. What is the average access time?
avg acc time = L1 hit time + L1 miss rate * L1 miss penalty
L1 miss penalty = avg L2 acc time = L2 hit time + L2 miss rate * L2 miss penalty = 10 + 0.5 * 100 = 60
avg acc time = 1 + 0.04 * 60 = 3.4 cycles
27. Reducing Cache Miss Rate
We just discussed techniques for reducing the cost of a cache miss; now we want to investigate ways to increase our chances of hitting in the cache.
Cache Miss Categories:
- Compulsory: the first access to a block must always miss. (Count the total # of distinct blocks accessed by the program.)
- Capacity: blocks that are replaced and reloaded because the cache cannot contain all the blocks needed during execution. (Simulate a fully-associative cache; subtract compulsory misses from total misses.)
- Conflict: occurs when too many blocks map to the same cache set. (Simulate the desired cache; subtract compulsory & capacity misses from total misses.)
28. Miss Rate vs. Cache Size
[Comp Arch, Henn & Patt, Fig 2.2: miss rate per type vs. cache size (KB) for 1-way, 2-way, 4-way, and 8-way associativity, with capacity and compulsory components.]
- The top figure shows total miss rate
- Compulsory misses (tiny first band) stay constant; the only way to decrease them is to increase block size, which may increase miss penalty
- Capacity misses (large band) go down with cache size
- Conflict misses decrease with size: since the # of conflicts goes down with size, associativity pays off less for large caches
- The bottom figure shows the distribution of misses: the % of compulsory misses increases with size, since the other miss types decrease with size
29. Reducing Miss Rate with Larger Blocks
Advantages: exploits spatial locality; reduces compulsory misses.
Disadvantages: increases miss penalty; can increase conflicts; may waste bandwidth.
SPEC92 block-size analysis: if line size is large compared to cache size, conflicts rise, increasing miss rate; a 64-byte line size is reasonable across the studied cache sizes. (Comp Arch, Henn & Patt, Fig B.10, pg B-27; cells lost in transcription are marked —.)

blksz | 4K    | 16K   | 64K   | 256K
16    |  —    | 3.94% | 2.04% | 1.09%
32    |  —    | 2.87% | 1.35% | 0.70%
64    | 7.00% | 2.64% | 1.06% | 0.51%
128   |  —    | 2.77% | 1.02% | 0.49%
256   |  —    | 3.29% | 1.15% | 0.49%
30. Reducing Miss Rate with Larger Caches & Higher Associativity
Larger Caches
- Advantages: reduces capacity & conflict misses
- Disadvantages: uses more space; may increase hit time; higher cost ($, power, die)
Higher Associativity
- Advantages: reduces conflict misses
- Disadvantages: may increase hit time (tag check done before data can be sent); requires more space & power (more logic for comparators, more bits for tags, other status bits such as LRU)
31. Reducing Miss Rate with Way Prediction & Pseudo-Associativity
Hit time as fast as direct-mapped, requiring only 1 comparator; reduces misses like a set-associative cache; will have fast hits and slow hits.
Way Prediction: each set has bits indicating which block to check on the next access; a miss requires checking the other blocks in the set on subsequent cycles.
Pseudo-Associativity: accesses the cache as in a direct-mapped cache with 1 less index bit; on a miss, checks the sister block in the cache (e.g., by inverting the most significant index bit); may swap the two blocks on an initial cache miss with a pseudo-way hit.
32. Reducing Cache Miss Penalty and/or Miss Rate via Parallelism
Nonblocking caches: allow cache hit accesses while a cache miss is being serviced; some allow hits under multiple misses (requires a queue of outstanding misses); could use a status bit for each block to indicate the block is currently being filled.
Hardware prefetching & software prefetching: the idea is that predicted memory blocks are fetched while doing computations on the present blocks. Requires nonblocking caches; most prefetches do not raise exceptions.
- If the guess is right, the data is in-cache for use; if wrong, we wasted some bandwidth we weren't using anyway
- Helps with latency by exploiting unused bandwidth; if the bus is saturated, prefetch won't help, and most architectures ignore it
- Can help with throughput if usage is sporadic
- Could expand conflict/capacity misses if the prefetch is wrong
33. Pipelined Cache Access
Pipelining cache access increases hit latency, but gives a fast clock cycle and high bandwidth. Most modern processors do this.
34. Reducing Cache Hit Time
Small & simple caches: small caches → less propagation delay; direct mapped → overlap tag check & data sending; some designs have tags on-chip, data off-chip.
Avoiding address translation: virtual caches avoid the virtual-to-physical translation step, but are problematic in practice. Virtually indexed, physically tagged: index the cache by page offset, but tag with the physical address; can get data from the cache earlier.
Pipelined cache access: allows a fast clock speed, but results in greater branch misprediction penalty & load latency.
35. Increasing Cache Bandwidth with Multibanked Caches
Increase bandwidth by sending an address to b banks simultaneously; the b banks look up the address & write to the bus at the same time, increasing bandwidth by b in the best case. Usually use sequential interleaving: block address mod b selects the bank. (Comp Arch, Henn & Patt, Fig 5.6, pg 299 shows b = 4: block addresses 0, 4, 8, ... go to bank 0; 1, 5, 9, ... to bank 1; and so on.)
Compiler Opt: Loop Interchange
Improve spatial/temporal locality of data:

for (j=0; j<100; j=j+1) {
  for (i=0; i<5000; i=i+1) {
    x[i][j] = 2 * x[i][j];
  }
}

becomes

for (i=0; i<5000; i=i+1) {
  for (j=0; j<100; j=j+1) {
    x[i][j] = 2 * x[i][j];
  }
}

Copyright Josep Torrellas 1999, 2001, 2002
Hardware Prefetching of Instructions & Data
Prefetch: access items before they are needed and deposit them into caches or external buffers.
- Instruction prefetching: e.g., fetch the next block on a miss or on access. The prefetched block goes to a stream buffer (or cache).
- Data prefetching: same idea; could have several stream buffers to capture several localities.
- Careful about bandwidth use.
Compiler-Controlled Prefetching
The compiler inserts prefetch instructions:
- Register prefetch: into a register (+ cache)
- Cache prefetch: into the cache
Can be faulting (causes an exception on a protection violation) or non-faulting (turns into a no-op if it would cause an exception). Needs a non-blocking (lockup-free) cache: the cache can be accessed while there is a prefetch / miss pending.
Example
8 KB direct-mapped cache with 16 B blocks. Each element of a and b is 8 bytes long; a is 3 rows x 100 cols, b is 101 rows x 3 cols.

for (i=0; i<3; i=i+1)
  for (j=0; j<100; j=j+1)
    a[i][j] = b[j][0] * b[j+1][0];

a: even j values miss, odd j values hit (spatial locality) → 150 misses
b: no spatial locality, only temporal locality; supposing no conflicts, misses 101 times
TOTAL = 251 misses
Prefetching
- Usually works in loops
- Can be combined with loop unrolling & software pipelining
- Problem: overhead
Simplifications: 1) don't worry about the first few misses, 2) assume a non-faulting prefetch. Split so that the first loop prefetches both a & b, and the second loop prefetches only a. Assume a long miss latency: prefetch 7 iterations ahead.

for (j=0; j<100; j=j+1) {
  prefetch(b[j+8][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0]*b[j+1][0];
}
for (i=1; i<3; i=i+1) {
  for (j=0; j<100; j=j+1) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0]*b[j+1][0];
  }
}
We are prefetching a[0][7]–a[0][99], a[1][7]–a[1][99], a[2][7]–a[2][99], and b[8][0]–b[100][0]. We are only left with:
- 8 misses for b: b[0][0] ... b[7][0]
- 12 misses for a: a[0][0], a[0][2], a[0][4], a[0][6]; a[1][0], a[1][2], a[1][4], a[1][6]; a[2][0], a[2][2], a[2][4], a[2][6]
So we execute 400 prefetch instructions to avoid 231 misses.
36. Summary of Cache Optimizations
(+ improves the metric, − hurts it, = no effect; cells lost in transcription are reconstructed from the slides above.)

Technique                       | Miss pen | Miss rate | Hit time | BW | HW cmplx       | Comment
Larger cache size               |    =     |     +     |    −     | =  | 1              | widely used for L2, L3
Larger block size               |    −     |     +     |    =     | =  | 0              | P4 L2 uses 128 bytes
Higher associativity            |    =     |     +     |    −     | =  | 1              | widely used
Multilevel caches               |    +     |     =     |    =     | =  | 2              | costly hardware, esp. if L1 blksz ≠ L2 blksz; widely used
Cache index w/o translation     |    =     |     =     |    +     | =  | 1              | trivial if small cache; US-III/21264
Read priority over writes       |    +     |     =     |    =     | =  | 1              | easy for uniprocessor; widely used
Crit word first & early restart |    +     |     =     |    =     | =  | 2              | widely used
Merging write buffer            |    +     |     =     |    =     | =  | 1              | widely used
Victim caches                   |    +     |     +     |    =     | =  | 2              | Athlon had 8-entry
Way prediction                  |    =     |     =     |    +     | =  | 1              | I-cache of US-III / D-cache of R4300
Pseudoassociativity             |    =     |     =     |    +     | =  | 1              | L2 of R10K
Compiler optimizations          |    =     |     +     |    =     | =  | 0              | hard; varies by compiler
Hardware prefetch               |    +     |     +     |    =     | =  | 2 instr, 3 data | widely used
Software prefetch               |    +     |     +     |    =     | =  | 3              | widely used
Small & simple caches           |    =     |     −     |    +     | =  | 0              | widely used for L1
Nonblocking caches              |    +     |     =     |    =     | +  | 3              | all out-of-order CPUs
Pipelined cache access          |    =     |     =     |    −     | +  | 1              | widely used
Multibanked caches              |    =     |     =     |    =     | +  | 1              | L2 of Opteron & Niagara
More informationClassification Steady-State Cache Misses: Techniques To Improve Cache Performance:
#1 Lec # 9 Winter 2003 1-21-2004 Classification Steady-State Cache Misses: The Three C s of cache Misses: Compulsory Misses Capacity Misses Conflict Misses Techniques To Improve Cache Performance: Reduce
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures)
CS 6C: Great Ideas in Computer Architecture (Machine Structures) Instructors: Randy H Katz David A PaHerson hhp://insteecsberkeleyedu/~cs6c/fa Direct Mapped (contnued) - Interface CharacterisTcs of the
More informationCSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)
CSE 4201 Memory Hierarchy Design Ch. 5 (Hennessy and Patterson) Memory Hierarchy We need huge amount of cheap and fast memory Memory is either fast or cheap; never both. Do as politicians do: fake it Give
More informationMemory Hierarchy. Advanced Optimizations. Slides contents from:
Memory Hierarchy Advanced Optimizations Slides contents from: Hennessy & Patterson, 5ed. Appendix B and Chapter 2. David Wentzlaff, ELE 475 Computer Architecture. MJT, High Performance Computing, NPTEL.
More informationPage 1. Memory Hierarchies (Part 2)
Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy
More informationCache performance Outline
Cache performance 1 Outline Metrics Performance characterization Cache optimization techniques 2 Page 1 Cache Performance metrics (1) Miss rate: Neglects cycle time implications Average memory access time
More informationLecture 11. Virtual Memory Review: Memory Hierarchy
Lecture 11 Virtual Memory Review: Memory Hierarchy 1 Administration Homework 4 -Due 12/21 HW 4 Use your favorite language to write a cache simulator. Input: address trace, cache size, block size, associativity
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationAdvanced Caching Techniques
Advanced Caching Approaches to improving memory system performance eliminate memory accesses/operations decrease the number of misses decrease the miss penalty decrease the cache/memory access times hide
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationCOSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University
COSC4201 Chapter 5 Memory Hierarchy Design Prof. Mokhtar Aboelaze York University 1 Memory Hierarchy The gap between CPU performance and main memory has been widening with higher performance CPUs creating
More informationCPU issues address (and data for write) Memory returns data (or acknowledgment for write)
The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives
More informationL2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary
HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40
More informationCS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationHandout 4 Memory Hierarchy
Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction
More informationCaches Concepts Review
Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on
More information10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache
Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More informationPortland State University ECE 587/687. Caches and Memory-Level Parallelism
Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each
More informationMemory Hierarchy 3 Cs and 6 Ways to Reduce Misses
Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Soner Onder Michigan Technological University Randy Katz & David A. Patterson University of California, Berkeley Four Questions for Memory Hierarchy Designers
More informationregisters data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.
Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3
More informationLECTURE 11. Memory Hierarchy
LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationComputer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic
More informationארכיטקטורת יחידת עיבוד מרכזי ת
ארכיטקטורת יחידת עיבוד מרכזי ת (36113741) תשס"ג סמסטר א' July 2, 2008 Hugo Guterman (hugo@ee.bgu.ac.il) Arch. CPU L8 Cache Intr. 1/77 Memory Hierarchy Arch. CPU L8 Cache Intr. 2/77 Why hierarchy works
More informationLecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,
More informationMemory Hierarchy Design
Memory Hierarchy Design Outline Introduction Cache Basics Cache Performance Reducing Cache Miss Penalty Reducing Cache Miss Rate Reducing Hit Time Main Memory and Organizations Memory Technology Virtual
More informationCS 136: Advanced Architecture. Review of Caches
1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you
More informationImproving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion
Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationLECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY
LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal
More information