[ 5.4] What cache line size is performs best? Which protocol is best to use?
|
|
- Drusilla Pearson
- 5 years ago
- Views:
Transcription
1 Performance results [ 5.4] What cache line size is performs best? Which protocol is best to use? Questions like these can be answered by simulation. However, getting the answer write is part art and part science. Parameters need to be chosen for the simulator. The authors selected a single-level 4-way set-associative 1 MB cache with 64-byte lines. The simulation assumes an idealized memory model, which assumes that references take constant time. Why is this not realistic? The simulated workload consists of 6 parallel programs from the SPLASH-2 suite and one multiprogrammed workload, consisting of mainly serial programs. Effect of coherence protocol [ 5.4.3] Three coherence protocols were compared: The Illinois MESI protocol ( Ill, left bar). The three-state invalidation protocol (3St) with bus upgrade for S M transitions. (This means that instead of rereading data from main memory when a block moves to the M state, we just issue a bus transaction invalidating the other copies.) The three-state invalidation protocol without bus upgrade (3St-BusRdX). (This means that when a block moves to the M state, we reread it from main memory.) Lecture 16 Architecture of Parallel Computers 1
2 Address bus Data bus Traffic (MB/s) Barnes/III Barnes/3St Barnes/3St-RdEx x LU/III d LU/3St l LU/3St-RdEx t Ocean/III x Ocean/3S Ill Ocean/3St-RdEx t Radiosity/III Ex Radiosity/3St Radiosity/3St-RdEx Radix/III Radix/3St Radix/3St-RdEx Raytrace/III Raytrace/3St Raytrace/3St-RdEx In our parallel programs, which protocol seems to be best? Somewhat surprisingly, the result turns out to be the same for the multiprocessor workload. The reason for this? The advantage of the four-state protocol is that no bus traffic is generated on E M transitions. But E M transitions are very rare (less than 1 per 1K references) Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring
3 Effect of cache line size [ 5.4.4] Recall from Lecture 11 that cache misses can be classified into four categories: Cold misses (called compulsory misses in the previous discussion) occur the first time that a block is referenced. Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement. Capacity misses occur when the cache size is not sufficient to hold data between references. Coherence misses are misses caused by the coherence protocol. Coherence misses can be divided into those caused by true sharing and those caused by false sharing. False-sharing misses are those caused by having a line size larger than one word. Can you explain? True-sharing misses, on the other hand, occur when a processor writes some words into a cache block, invalidating the block in another processors cache, after which the other processor reads one of the modified words. How could we attack each of the four kinds of misses? To reduce capacity misses, we could To reduce conflict misses, we could To reduce cold misses, we could To reduce coherence misses, we could If we increase the line size, the number of coherence misses might go up or down. Why? Lecture 16 Architecture of Parallel Computers 3
4 Increasing the line size has other disadvantages. It increases conflict misses. Why? It increases bus traffic. Why? So it is not clear which line size will work best. 0.6 Upgrade False sharing True sharing Capacity Cold Barnes/8 Barnes/16 Barnes/32 8 Barnes/64 Miss rate (%) Barnes/128 Barnes/256 Lu/8 Lu/16 Lu/32 Lu/64 Lu/128 Lu/256 Radiosity/8 Radiosity/16 Radiosity/32 Radiosity/64 Radiosity/128 Radiosity/256 Results for the first three applications seem to show that which line size is best? For the second set of applications, Radix shows a greatly increasing number of false-sharing misses with increasing block size Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring
5 12 Upgrade 10 8 False sharing True sharing Capacity Cold Miss rate (%) Ocean/8 Ocean/ Ocean/32 Ocean/64 Ocean/128 Ocean/256 Radix/8 Radix/16 Radix/32 Radix/64 Radix/128 Radix/256 Raytrace/8 Raytrace/16 Raytrace/32 Raytrace/64 Raytrace/128 Raytrace/256 However, this is not the whole story. Larger line sizes also create more bus traffic. Traffic (bytes/instructions) Address bus Data bus 0 Barnes/8 Barnes/ Barnes/32 Barnes/64 Barnes/ Barnes/256 Radiosity/8 Radiosity/16 Radiosity/32 Radiosity/64 Radiosity/128 Radiosity/256 Raytrace/8 Raytrace/16 Raytrace/32 Raytrace/64 Raytrace/128 Raytrace/256 Lecture 16 Architecture of Parallel Computers 5
6 With this in mind, which line size would you say is best? Invalidate vs. update [ 5.4.5] Which is better, an update or an invalidation protocol? At first glance, it might seem that update schemes would always be superior to write-invalidate schemes. Why might this be true? Why might this not be true? When there are not many external rereads, When there is a high degree of sharing, For example, in a producer-consumer pattern, Update and invalidation schemes can be combined (see 5.4.5). Let s look at real programs Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring
7 0.60 False sharing True sharing Capacity 2.00 Miss rate (%) Cold Miss rate (%) LU/inv LU/upd Ocean/inv Ocean/mix Ocean/upd Raytrace/inv Raytrace/upd Radix/inv Radix/mix Radix/upd Where there are many coherence misses, If there were many capacity misses, So let s look at bus traffic Lecture 16 Architecture of Parallel Computers 7
8 Note that in two of the applications, updates in an update protocol are much more prevalent than upgrades in an invalidation protocol. LU/inv LU/upd 0.00 Upgrade/update rate (%) Each of these operations produces bus traffic; therefore, the update protocol causes more traffic. Ocean/inv Ocean/mix Ocean/upd The main problem is that one processor tends to write a block multiple times before another processor reads it. Raytrace/inv Raytrace/upd This causes several bus transactions instead of one, as there would be in an invalidation protocol. Radix/inv Upgrade/update rate (%) In addition, updates cause problems in nonbus-based multiprocessors. Radix/mix Radix/upd Extending Cache Coherence [ 6.6] There are several ways in which cache coherence can be extended. These ways interact with other aspects of the architecture. Considering these issues yields a good understanding of where caches fit in the overall hardware/software architecture Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring
9 Shared-cache designs [ 6.6.1] Why not let one cache serve multiple processors? This could be a 1stlevel cache, a 2nd-level cache, etc. P 1... Switch P n What would be the advantages? It eliminates the need for cache coherence. $ (interleaved) Main memory (interleaved) It may decrease the miss ratio. Why? It may reduce the need for a large cache. Why? It allows more effective use of long cache lines. Examples of such designs include the Alliant FX-8 (ca. 1985), which shared a cache among 8 processors, and the Encore Multimax (ca. 1987), which shared a cache among two processors. Today, shared caches are being considered for single-chip multiprocessors, with 4 8 processors sharing an on-chip first-level cache. What are some disadvantages of shared caches? Latency. Lecture 16 Architecture of Parallel Computers 9
10 Bandwidth. Speed of cache. Destructive interference the converse of overlapping working sets. Using shared second-level caches moderates both the advantages and disadvantages of shared first-level caches. Virtually indexed caches and cache coherence [ 6.6.2] Why not just store information in the cache by virtual address rather than physical address? What would be the advantage of this? [A TLB is like a cache that translates a virtual address of the form (page #, displacement) to a physical address of the form (frame #, displacement). ] Page Number Frame Number With large caches, it becomes increasingly difficult to do address translation in parallel with cache access. Why? 2002 Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring
11 Recall this diagram from Lecture 10. What happens to the set number as the cache increases in size? Page number k Set number d j Byte within block Rather than slow down every instruction by serializing address translation and cache access, it makes sense just to use a virtually addressed cache. But virtually addressed caches have their own problems. One problem occurs when memory is shared. Two processes may have different virtual names for the same block. Process 1 s page table Process 2 s page table Page What s wrong with this? Why are duplicate copies of information bad? Think about what happens on a process switch. Assume The processor has been running Process 1. It then switches to Process 2. Later it switches back to Process 1. This is called the synonym problem. Lecture 16 Architecture of Parallel Computers 11
12 Solution: For each cache block, keep both a virtual tag and a physical tag. This allows a virtual address to be found, given a physical address, or vice versa. When a cache miss occurs, look up the physical address to find its virtual address (if any) within the cache. If the block is already in the cache, rename it to a new virtual address rather than reading in the block from memory. Let s see how this works in practice. 1. Look up referenced virtual address in cache. 2. Is virtual address found in cache? No 4. Use translated physical address to look up physical tags. 5. Is physical address found in cache? No 6. Cache miss; fetch data from lower level in memory hierarchy, writing back block(s) if necessary. Yes Yes 3. Cache hit. Proceed with reference. 7. Possibility of synonym in cache 8. Move cache block B from synonym virtual index to replace block C at current virtual index, writing back block C if necessary. Notes: Step 4: We are looking for another copy of the same block being used by another process, with a different virtual address. Step 6: Block at virtual address needs to be written back if it has been modified. Its physical tag may be the same as another block in cache; if so, that block needs to be written back too. Step 7: In one cache organization, there s not just a 2002 Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring
13 possibility of a synonym; it s a certainty. What organization is this? Step 8: If another process q was using this block and tries to access it again, it will get a miss on virtual address, but will find it by physical address. What will it therefore do with this block? Note that only physical addresses are put on the bus. Therefore, they can be snooped, and blocks can be invalidated (or updated) just like in coherence protocols for physically addressed caches. Figure 6.24 in CS&G, which I didn t have time to redraw, illustrates the organization of a virtually addressed cache. Consider first a virtual address. ASID means address-space identifier, which is often called a process identifier (PID) in other books. The diagram shows the V-index overlapping the virtual page number. What is the significance of this? Is the V-index actually stored in the cache? What is the state field in the tag memories? Now consider the physical address. Lecture 16 Architecture of Parallel Computers 13
14 What is the role of the P-index? How long is the P-index, relative to the V-index? Note that each P-tag entry points to a V-tag entry and vice versa; this makes it possible to find a physical address given a virtual address, or vice versa. In multilevel caches, the L 1 cache can be virtually tagged for rapid access, while the L 2 cache is physically tagged to avoid synonyms across processors. Example: Suppose P 1 s page 7 (addresses to 7FFF 16 ) is stored in page frame Suppose also that P 2 is using the same page, but calls it page 2. P 1 is the first to access block 0 on this page. Assuming cache lines are 16 bytes long, addresses 7000 to 700F are now cached. Let s assume the cache is direct mapped, and has 1024 (= ) lines. Which line will this block be placed in? Now suppose that P 2 accesses this block. What will happen? What number given above haven t we used yet? Where does this number get used or stored? TLB coherence [ 6.6.3] A TLB is effectively just a cache of recently used page-table entries. Each processor has its own TLB Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring
15 How can two TLBs come to contain entries for the same page? Name some times when TLB entries need to be changed. Which of the above lead to coherence problems? Solution 1: Virtually addressed caches When a cache is virtually addressed, when is address translation needed? Given that accesses to page mappings are rare in virtually addressed caches, what table might we be able to dispense with? If we don t have a, we don t have a TLB coherence problem. Well, we still have the problem of keeping the swap-out and protection information coherent. Assuming no TLB, where will this information be found? This problem will be handled by the cache-coherence hardware when the OS updates the page tables. Solution 2: TLB shootdown With the TLB-shootdown approach, a processor that changes a TLB entry tells other processors that may be caching the same TLB entries to invalidate their copies. Step 1. A processor p that wants to modify a page table disables interprocessor interrupts and locks the page table. It also clears its active flag, which indicates whether it is actively using any page table. Lecture 16 Architecture of Parallel Computers 15
16 Step 2. Processor p sends an interrupt to other processors that might be using the page table, describing the TLB actions to be performed. Step 3. Each processor that receives the interrupt clears its active flag. Step 4. Processor p busy-waits till the active flags of all interrupted processors are clear, then modifies the page table. Processor p then releases the page-table lock, sets its active flag, and resumes execution. Step 5. Each interrupted processor busy-waits until none of the page tables it is using are locked. After executing the required TLB actions and setting its active flag, it resumes execution. This solution has an overhead that scales linearly with the number of processors. Why? Solution 3: Invalidate instructions Some processors, notably the PowerPC, have a TLB_invalidate_entry instruction. This instruction broadcasts the page address on the bus so that the snooping hardware on other processors can automatically invalidate the corresponding TLB entries without interrupting the processor. A processor simply issues a TLB_invalidate_entry instruction immediately after changing a page-table entry. This works well on a shared-bus system; if two processors change the same entry at the same time, only one change can be broadcast first on the bus Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring
A three-state update protocol
A three-state update protocol Whenever a bus update is generated, suppose that main memory as well as the caches updates its contents. Then which state don t we need? What s the advantage, then, of having
More informationSnooping coherence protocols (cont.)
Snooping coherence protocols (cont.) A four-state update protocol [ 5.3.3] When there is a high degree of sharing, invalidation-based protocols perform poorly. Blocks are often invalidated, and then have
More informationPerformance of coherence protocols
Performance of coherence protocols Cache misses have traditionally been classified into four categories: Cold misses (or compulsory misses ) occur the first time that a block is referenced. Conflict misses
More informationECE PP used in class for assessing cache coherence protocols
ECE 5315 PP used in class for assessing cache coherence protocols Assessing Protocol Design The benchmark programs are executed on a multiprocessor simulator The state transitions observed determine the
More informationCache Coherence: Part 1
Cache Coherence: art 1 Todd C. Mowry CS 74 October 5, Topics The Cache Coherence roblem Snoopy rotocols The Cache Coherence roblem 1 3 u =? u =? $ 4 $ 5 $ u:5 u:5 1 I/O devices u:5 u = 7 3 Memory A Coherent
More informationL7 Shared Memory Multiprocessors. Shared Memory Multiprocessors
L7 Shared Memory Multiprocessors Logical design and software interactions 1 Shared Memory Multiprocessors Symmetric Multiprocessors (SMPs) Symmetric access to all of main memory from any processor Dominate
More informationShared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network
Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache
More informationBasic Architecture of SMP. Shared Memory Multiprocessors. Cache Coherency -- The Problem. Cache Coherency, The Goal.
Shared emory ultiprocessors Basic Architecture of SP Buses are good news and bad news The (memory) bus is a point all processors can see and thus be informed of what is happening A bus is serially used,
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationLecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment
More informationLecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It
More informationA cache is a small, fast memory which is transparent to the processor. The cache duplicates information that is in main memory.
Cache memories A cache is a small, fast memory which is transparent to the processor. The cache duplicates information that is in main memory. With each data block in the cache, there is associated an
More informationLecture 7: Implementing Cache Coherence. Topics: implementation details
Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,
More informationPerformance metrics for caches
Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:
More informationShared Memory Multiprocessors
Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O
More informationThe MESI State Transition Graph
Small-scale shared memory multiprocessors Semantics of the shared address space model (Ch. 5.3-5.5) Design of the M(O)ESI snoopy protocol Design of the Dragon snoopy protocol Performance issues Synchronization
More informationCS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5)
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate
More informationScalable Cache Coherence
Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient
More informationComputer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James
Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions
More informationScalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels
More informationScalable Cache Coherence
arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy
More informationCS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence
CS252 Spring 2017 Graduate Computer Architecture Lecture 12: Cache Coherence Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time in Lecture 11 Memory Systems DRAM
More informationMultiprocessor Cache Coherency. What is Cache Coherence?
Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by
More informationECE/CS 757: Homework 1
ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)
More informationSuggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!
1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and
More informationSnooping coherence protocols (cont.)
Snooping coherence protocols (cont.) A four-state update protocol [ 5.3.3] When there is a high degree of sharing, invalidation-based protocols perform poorly. Blocks are often invalidated, and then have
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationLecture 24: Board Notes: Cache Coherency
Lecture 24: Board Notes: Cache Coherency Part A: What makes a memory system coherent? Generally, 3 qualities that must be preserved (SUGGESTIONS?) (1) Preserve program order: - A read of A by P 1 will
More informationLecture 9 Outline. Lower-Level Protocol Choices. MESI (4-state) Invalidation Protocol. MESI: Processor-Initiated Transactions
Outline protocol Dragon updatebased protocol mpact of protocol optimizations LowerLevel Protocol Choices observed in state: what transition to make? Change to : assume ll read again soon good for mostly
More informationLogical Diagram of a Set-associative Cache Accessing a Cache
Introduction Memory Hierarchy Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year Levels of memory with different sizes & speeds close to
More informationLecture 24: Virtual Memory, Multiprocessors
Lecture 24: Virtual Memory, Multiprocessors Today s topics: Virtual memory Multiprocessors, cache coherence 1 Virtual Memory Processes deal with virtual memory they have the illusion that a very large
More informationMEMORY HIERARCHY BASICS. B649 Parallel Architectures and Programming
MEMORY HIERARCHY BASICS B649 Parallel Architectures and Programming BASICS Why Do We Need Caches? 3 Overview 4 Terminology cache virtual memory memory stall cycles direct mapped valid bit block address
More informationLecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations
Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 18: Virtual Memory Lecture Outline Review of Main Memory Virtual Memory Simple Interleaving Cycle
More informationA More Sophisticated Snooping-Based Multi-Processor
Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014 Tunes The Projects Handsome Boy Modeling School (So... How
More informationCSC 631: High-Performance Computer Architecture
CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 10: Memory Part II CSC 631: High-Performance Computer Architecture 1 Two predictable properties of memory references: Temporal Locality:
More informationLecture 20: Multi-Cache Designs. Spring 2018 Jason Tang
Lecture 20: Multi-Cache Designs Spring 2018 Jason Tang 1 Topics Split caches Multi-level caches Multiprocessor caches 2 3 Cs of Memory Behaviors Classify all cache misses as: Compulsory Miss (also cold-start
More information[ 7.2.5] Certain challenges arise in realizing SAS or messagepassing programming models. Two of these are input-buffer overflow and fetch deadlock.
Buffering roblems [ 7.2.5] Certain challenges arise in realizing SAS or messagepassing programming models. Two of these are input-buffer overflow and fetch deadlock. Input-buffer overflow Suppose a large
More informationCOSC 6385 Computer Architecture. - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available
More informationModern Computer Architecture
Modern Computer Architecture Lecture3 Review of Memory Hierarchy Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Performance 1000 Recap: Who Cares About the Memory Hierarchy? Processor-DRAM Memory Gap
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationComputer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic
More informationCS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck
Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find
More informationLecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections
Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System
More informationAdapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]
Lecture 17 Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] SRAM / / Flash / RRAM / HDD SRAM / / Flash / RRAM/ HDD SRAM
More informationChapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST
Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial
More informationIntroduction. Memory Hierarchy
Introduction Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year 1 Memory Hierarchy Levels of memory with different sizes & speeds close to
More informationPage 1. Multilevel Memories (Improving performance using a little cash )
Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency
More informationNOW Handout Page 1. Recap. Protocol Design Space of Snooping Cache Coherent Multiprocessors. Sequential Consistency.
Recap Protocol Design Space of Snooping Cache Coherent ultiprocessors CS 28, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Snooping cache coherence solve difficult problem by applying
More informationThe Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)
The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address
More information12 Cache-Organization 1
12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty
More informationChapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 2)
Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 2) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Feng-Chia Unive ersity Outline 5.4 Virtual
More informationCache Coherence: Part II Scalable Approaches
ache oherence: art II Scalable pproaches Hierarchical ache oherence Todd. Mowry S 74 October 27, 2 (a) 1 2 1 2 (b) 1 Topics Hierarchies Directory rotocols Hierarchies arise in different ways: (a) processor
More informationLecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Miss Rates for Snooping Protocol 4th C: Coherency Misses More processors:
More informationEITF20: Computer Architecture Part 5.1.1: Virtual Memory
EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache
More informationSnooping-Based Cache Coherence
Lecture 10: Snooping-Based Cache Coherence Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Elle King Ex s & Oh s (Love Stuff) Once word about my code profiling skills
More informationShow Me the $... Performance And Caches
Show Me the $... Performance And Caches 1 CPU-Cache Interaction (5-stage pipeline) PCen 0x4 Add bubble PC addr inst hit? Primary Instruction Cache IR D To Memory Control Decode, Register Fetch E A B MD1
More informationSnoop-Based Multiprocessor Design III: Case Studies
Snoop-Based Multiprocessor Design III: Case Studies Todd C. Mowry CS 41 March, Case Studies of Bus-based Machines SGI Challenge, with Powerpath SUN Enterprise, with Gigaplane Take very different positions
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More information[ 6.1] A cache is a small, fast memory which is transparent to the processor. The cache duplicates information that is in main memory.
Cache memories [ 6.1] A cache is a small, fast memory which is transparent to the processor. The cache duplicates information that is in main memory. With each data block in the cache, there is associated
More informationEastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy.
Eastern Mediterranean University School of Computing and Technology ITEC255 Computer Organization & Architecture CACHE MEMORY Introduction Computer memory is organized into a hierarchy. At the highest
More informationHandout 4 Memory Hierarchy
Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced
More informationCIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018
CIS 3207 - Operating Systems Memory Management Cache and Demand Paging Professor Qiang Zeng Spring 2018 Process switch Upon process switch what is updated in order to assist address translation? Contiguous
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationLECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY
LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal
More informationA Basic Snooping-Based Multi-Processor Implementation
Lecture 15: A Basic Snooping-Based Multi-Processor Implementation Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Pushing On (Oliver $ & Jimi Jules) Time for the second
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance
More informationLecture 25: Multiprocessors
Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed
More informationEN1640: Design of Computing Systems Topic 06: Memory System
EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring
More informationCache Optimisation. sometime he thought that there must be a better way
Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance
More informationBackground. Memory Hierarchies. Register File. Background. Forecast Memory (B5) Motivation for memory hierarchy Cache ECC Virtual memory.
Memory Hierarchies Forecast Memory (B5) Motivation for memory hierarchy Cache ECC Virtual memory Mem Element Background Size Speed Price Register small 1-5ns high?? SRAM medium 5-25ns $100-250 DRAM large
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to
More informationChapter 6: Demand Paging
ADRIAN PERRIG & TORSTEN HOEFLER ( 5-006-00 ) Networks and Operating Systems Chapter 6: Demand Paging Source: http://redmine.replicant.us/projects/replicant/wiki/samsunggalaxybackdoor If you miss a key
More informationCS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017
CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 08: Caches III Shuai Wang Department of Computer Science and Technology Nanjing University Improve Cache Performance Average memory access time (AMAT): AMAT =
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMo Money, No Problems: Caches #2...
Mo Money, No Problems: Caches #2... 1 Reminder: Cache Terms... Cache: A small and fast memory used to increase the performance of accessing a big and slow memory Uses temporal locality: The tendency to
More informationScalable Multiprocessors
Scalable Multiprocessors [ 11.1] scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. s the size of the system
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationCS3350B Computer Architecture
CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &
More informationCHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN
CHAPTER 4 TYPICAL MEMORY HIERARCHY MEMORY HIERARCHIES MEMORY HIERARCHIES CACHE DESIGN TECHNIQUES TO IMPROVE CACHE PERFORMANCE VIRTUAL MEMORY SUPPORT PRINCIPLE OF LOCALITY: A PROGRAM ACCESSES A RELATIVELY
More informationOutline. 1 Reiteration. 2 Cache performance optimization. 3 Bandwidth increase. 4 Reduce hit time. 5 Reduce miss penalty. 6 Reduce miss rate
Outline Lecture 7: EITF20 Computer Architecture Anders Ardö EIT Electrical and Information Technology, Lund University November 21, 2012 A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21,
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationPage 1. Review: Address Segmentation " Review: Address Segmentation " Review: Address Segmentation "
Review Address Segmentation " CS162 Operating Systems and Systems Programming Lecture 10 Caches and TLBs" February 23, 2011! Ion Stoica! http//inst.eecs.berkeley.edu/~cs162! 1111 0000" 1110 000" Seg #"
More informationCS162 Operating Systems and Systems Programming Lecture 10 Caches and TLBs"
CS162 Operating Systems and Systems Programming Lecture 10 Caches and TLBs" October 1, 2012! Prashanth Mohan!! Slides from Anthony Joseph and Ion Stoica! http://inst.eecs.berkeley.edu/~cs162! Caching!
More informationCS 433 Homework 5. Assigned on 11/7/2017 Due in class on 11/30/2017
CS 433 Homework 5 Assigned on 11/7/2017 Due in class on 11/30/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationCMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3
MS 411 omputer Systems rchitecture Lecture 21 Multiprocessors 3 Outline Review oherence Write onsistency dministrivia Snooping Building Blocks Snooping protocols and examples oherence traffic and performance
More information