Performance study example (§5.3)
- Laurence Craig
- 5 years ago
Transcription
Performance study example (§5.3)

Coherence misses:
- True sharing misses
  - Write to a shared block
  - Read an invalid block
- False sharing misses
  - Read an unmodified word in an invalidated block

[Figure: CPI for commercial benchmarks]
[Figure: a block holding words a, b, c, d. One processor writes a, so its copy is Modified; the other processor's copy is Invalid when it reads b.]

Performance study example

How do you handle coherence if you do not have a shared bus?
Sample Machines

Intel Pentium Pro Quad
- Coherent
- 4 processors
- Each P-Pro module: CPU, interrupt controller, bus interface, 256-KB L2 cache
- P-Pro bus (64-bit data, 36-bit address, 66 MHz)
- PCI bridges to PCI buses carrying PCI I/O cards
- Memory controller and MIU; 1-, 2-, or 4-way interleaved DRAM

Sun Enterprise server
- Coherent
- Up to 16 processor and/or memory-I/O cards
- CPU/mem cards: memory controller, bus interface/switch
- Gigaplane bus (256 data, 41 address, 83 MHz)
- I/O cards: bus interface, SBUS slots, 100bT, SCSI, 2 FiberChannel

Directory-based Coherence (§5.4)

Idea: implement a directory that keeps track of where each copy of a block is cached and its state in each cache (note that with snooping, the state of a block was kept only in the cache).
- Processors must consult the directory before caching blocks from memory.
- If a block is exclusive, then its owner should provide the most up-to-date copy.
- When a block in memory is updated (written), the directory is consulted to either update or invalidate other cached copies.
- Eliminates the overhead of broadcasting/snooping (bus bandwidth).
- Hence, it scales to numbers of processors that would saturate a single bus.
- Slower in terms of latency?

[Figure: processors 1, 2, ..., n connected by a network/bus to a shared space (memory, ...)]
Directory-based Coherence

The memory and the directory can be centralized:
[Figure: nodes 0, 1, ..., n connected by a network to a shared memory made of Mem/Dir modules]

Or distributed:
[Figure: nodes 0, 1, ..., n, each with its own Mem and Dir, connected by a network; together the memories form the shared memory]

Alternatively, the memory may be distributed while the directory is centralized. Or the memory may be centralized while the directory is distributed (as we will discuss in the case of CMPs with private caches).

Distributed directory-based coherence

- The location (home) of each memory block is determined by its address.
- The controller decides if an access is local or remote.
- As in snooping caches, the state of every block in every cache is tracked in that cache (exclusive/dirty, shared/clean, invalid) to avoid the need for write-through and unnecessary write-backs.
- In addition, with each block in memory, a directory entry keeps track of where the block is cached. Accordingly, a block can be in one of the following states:
  - Uncached: no processor has it (not valid in any cache)
  - Shared/clean: cached in one or more processors and memory is up-to-date
  - Exclusive/modified/dirty: one processor (owner) has the data; memory is out-of-date
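The two ideas above, a home node chosen by address and a per-block directory entry with three states, can be sketched in a few lines. This is only an illustration: the block size, the node count, and block-interleaved placement are assumptions, and the names `DirState`, `DirEntry`, `home_node` and `is_local` are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass, field
from enum import Enum

BLOCK_SIZE = 64  # bytes per block (assumed for illustration)
NUM_NODES = 4    # nodes in the system (assumed for illustration)


class DirState(Enum):
    UNCACHED = "uncached"    # no cache holds the block
    SHARED = "shared"        # one or more clean copies; memory is up to date
    EXCLUSIVE = "exclusive"  # one owner holds the data; memory is out of date


@dataclass
class DirEntry:
    state: DirState = DirState.UNCACHED
    sharers: set = field(default_factory=set)  # ids of nodes caching the block


def home_node(addr: int) -> int:
    """Map an address to its home node, assuming block-interleaved placement."""
    return (addr // BLOCK_SIZE) % NUM_NODES


def is_local(addr: int, requester: int) -> bool:
    """True when the requester is itself the home of the block."""
    return home_node(addr) == requester
```

Other address-to-home mappings (for example, using the high-order address bits) would serve equally well; the only requirement is that every node computes the same home for a given address.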
Enforcing coherence

Coherence is enforced by exchanging messages between nodes. Three types of nodes may be involved:
- Local requestor node (L): the node that reads or writes the cache block
- Home node (H): the node that stores the block (and its directory entry) in its memory; may be the same as L
- Remote nodes (R): other nodes that have a cached copy of the requested block

When L encounters a read hit, it just reads the data.

When L encounters a read miss, it sends a message to the home node, H, of the requested block. Three cases may arise:
- The directory indicates that the block is not cached
- The directory indicates that the block is shared/clean
- The directory indicates that the block is exclusive/modified

What happens on a read miss? (when the block is invalid in the local cache)

(a) Read miss (if the block is shared or uncached)
-- L sends a request to H
-- H sends the block to L
-- the state of the block is shared in the directory
-- the state of the block is shared in L
[Figure: 1: request to home node; 2: return data]

(b) Read miss (if the block is exclusive in another cache)
-- L sends a request to H
-- H informs L about the block owner, R
-- L requests the block from R
-- R sends the block to L
-- L and R set the state of the block to shared
-- R informs H that it should change the state of the block to shared
[Figure: 1: request to home node; 2: return owner; 3: request to owner; 4: return data; 4: revise entry]
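The two read-miss cases can be traced with a small simulation. This is a minimal sketch, not the slides' implementation: the dictionary layout, the message strings and the function name `read_miss` are hypothetical, and the write-back of dirty data to memory is left out because the steps above do not show it.

```python
def read_miss(directory, caches, block, local):
    """Handle a read miss by node `local`; return the messages exchanged."""
    entry = directory.setdefault(block, {"state": "uncached", "sharers": set()})
    msgs = [f"{local}->H: read request"]
    if entry["state"] in ("uncached", "shared"):
        # Case (a): memory is up to date, so the home node supplies the block.
        msgs.append(f"H->{local}: data")
    else:
        # Case (b): exactly one owner holds the block in the exclusive state.
        (owner,) = entry["sharers"]
        msgs += [
            f"H->{local}: owner is {owner}",
            f"{local}->{owner}: read request",
            f"{owner}->{local}: data",
            f"{owner}->H: revise entry to shared",
        ]
        caches[owner][block] = "shared"
    # In both cases the block ends up shared in the directory and in L.
    entry["state"] = "shared"
    entry["sharers"].add(local)
    caches[local][block] = "shared"
    return msgs


# Usage: j reads an uncached block X, then k reads a block Y owned by j.
directory = {"Y": {"state": "exclusive", "sharers": {"j"}}}
caches = {"j": {"Y": "exclusive"}, "k": {}}
for message in read_miss(directory, caches, "X", "j"):
    print(message)
for message in read_miss(directory, caches, "Y", "k"):
    print(message)
```

Case (a) costs two messages and case (b) five, which is why the owner indirection is a natural target for latency optimization.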
What happens on a write miss? (when the block is invalid in the local cache)

(a) Write miss to an uncached block
-- similar to a read miss to an uncached block, except that the state of the block is set to exclusive

(b) Write miss to a block that is exclusive in another cache
-- similar to a read miss to an exclusive block, except that the state of the block is set to exclusive in H and L and to invalid in R

(c) Write miss to a shared block
-- L sends a request to H
-- H sets the state to exclusive
-- H sends the block to L
-- H sends to L the list of other sharers
-- L sets the block's state to exclusive
-- L sends invalidating messages to each sharer (R)
-- each R sets the block's state to invalid
[Figure: 1: request to home node; 2: return sharers and data; 3: invalidate; 4: ack; 5: revise entry]

What happens on a write hit? (when the block is shared or exclusive in the local cache)

(a) If the block is exclusive in L, just write the data

(b) If the block is shared in L
-- L sends a request to H to have the block as exclusive
-- H sets the state to exclusive
-- H informs L of the block's other sharers
-- L sets the block's state to exclusive
-- L sends invalidating messages to each sharer (R)
-- each R sets the block's state to invalid
[Figure: 1: request to home node; 2: return sharers and data; 3: invalidate; 4: ack; 5: revise entry]

A degree of complexity that we will ignore: we need a busy state to handle simultaneous requests to the same block. For example, two writes to the same block have to be serialized. Reason: the order of events depends on the order of messages, which is non-deterministic.
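The write cases can be traced the same way. The sketch below is an illustration under assumptions: the dictionary layout, the message strings and the function name `write` are hypothetical, the final "revise entry" message from the figure is folded into the directory update, and the busy state needed to serialize simultaneous requests is ignored, as in the slides.

```python
def write(directory, caches, block, local):
    """Handle a write by node `local`; return the messages exchanged."""
    entry = directory.setdefault(block, {"state": "uncached", "sharers": set()})
    local_state = caches[local].get(block, "invalid")
    if local_state == "exclusive":
        return []  # write hit (a): no messages are needed
    msgs = [f"{local}->H: write request"]
    others = entry["sharers"] - {local}
    if entry["state"] == "exclusive":
        # Write miss (b): fetch the block from its owner, which invalidates it.
        (owner,) = others
        msgs += [
            f"H->{local}: owner is {owner}",
            f"{local}->{owner}: request block",
            f"{owner}->{local}: data",
            f"{owner}->H: revise entry",
        ]
        caches[owner][block] = "invalid"
    else:
        # Write hit (b) or write miss (a)/(c): H returns the sharer list
        # (empty if the block was uncached), plus the data on a miss.
        reply = "sharer list" if local_state == "shared" else "data and sharer list"
        msgs.append(f"H->{local}: {reply}")
        for remote in sorted(others):
            msgs += [f"{local}->{remote}: invalidate", f"{remote}->{local}: ack"]
            caches[remote][block] = "invalid"
    entry["state"] = "exclusive"
    entry["sharers"] = {local}
    caches[local][block] = "exclusive"
    return msgs


# Usage: X is shared by j and k; k writes it (hit), then i writes it (miss).
directory = {"X": {"state": "shared", "sharers": {"j", "k"}}}
caches = {"i": {}, "j": {"X": "shared"}, "k": {"X": "shared"}}
for message in write(directory, caches, "X", "k"):
    print(message)
for message in write(directory, caches, "X", "i"):
    print(message)
```

The invalidations are issued one after another here for readability; in hardware they can be sent in parallel, and L only needs to wait until all acknowledgements have arrived.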
The coherence protocol at a node's cache controller
[Figure: state-transition diagram]

The coherence protocol (directory response to a coherence message)
[Figure: state-transition diagram]
MSI Directory-based coherence - example

Case 1: X is in the uncached (U) state in the home directory
[Figure: nodes i, j, k; i is the home of X; directory state U; the state of cached blocks is shown where X is cached]
Possible scenario: j reads X, then j writes to X.

MSI Directory-based coherence - example

Case 2: X is exclusive (E) in the home directory and owned by j (dirty, d, in j)
[Figure: nodes i, j, k; i is the home of X; directory state E{j}; X is d in j]
Trace the state of X if: k reads X
MSI Directory-based coherence - example

Case 3: X is exclusive (E) in the home directory and owned by j (dirty, d, in j)
[Figure: nodes i, j, k; i is the home of X; directory state E{j}; X is d in j]
Trace the state of X if: k writes to X

MSI Directory-based coherence - example

Case 4: X is shared (S) in the home directory and clean (c) in j and k
[Figure: nodes i, j, k; i is the home of X; directory state S{j,k}; X is c in j and in k]
Trace the state of X if: j reads X, then k writes into X
The MESI protocol

As described earlier, in MSI a cache block can be in one of three states:
- Invalid (uncached): not in the cache (not valid in any cache)
- Shared/clean: cached in one or more processors and memory is up-to-date
- Modified/dirty/exclusive: one processor (owner) has the data; memory is out-of-date

The MESI protocol splits the exclusive state in two, giving four states:
- Invalid (uncached): same as in MSI
- Shared: cached in more than one processor and memory is up-to-date
- Exclusive: one processor (owner) has the data and it is clean
- Modified: one processor (owner) has the data, but it is dirty

If MESI is implemented using a directory, then the information kept for each block in the directory is the same as in the three-state protocol:
- Shared in MESI = shared/clean, with more than one sharer
- Exclusive in MESI = shared/clean, with only one sharer
- Modified in MESI = exclusive/modified/dirty
However, at each cached copy, a distinction is made between shared, exclusive and modified (rather than only shared and modified).

The MESI protocol

On a read miss (local block is invalid), load the block and change its state to
- exclusive if it was uncached in memory
- shared if it was already shared, modified or exclusive
  - if it was modified, the owner will send you a clean copy
  - if it was modified or exclusive, the previous owner will change the state of the block to shared in its cache

On a write miss: same as a read miss, except
- set the state to modified
- copies in other caches (if any) are invalidated

On a write hit to a modified block, do nothing.

On a write hit to an exclusive block, change the block to modified.
- No need for invalidation.
- This is the main advantage of MESI over MSI.

On a write hit to a shared block, change the block to modified and invalidate the other cached copies.

When a modified block is evicted, write it back.

In snooping bus implementations of MESI, on a read miss, we need to determine whether the block is in some other cache(s) in order to set its state correctly to shared or exclusive.

To take full advantage of MESI, a cache should know when a block is to be changed from shared to exclusive.
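The transitions listed above can be collected into one function that acts on the local copy of a block. This is a minimal sketch: the function name `mesi_next`, the event names, the action strings and the `others_have_copy` flag are hypothetical labels for illustration, not part of any real protocol specification.

```python
def mesi_next(state, event, others_have_copy=False):
    """Return (new_state, action) for the local copy of a block.

    state is one of "M", "E", "S", "I"; event is "read", "write",
    "remote_read" (another node reads the block) or "evict".
    """
    if event == "read":
        if state == "I":
            # Read miss: exclusive only if no other cache holds the block.
            return ("S" if others_have_copy else "E", "load block")
        return (state, "none")  # read hit in M, E or S
    if event == "write":
        if state == "I":
            return ("M", "load block, invalidate other copies (if any)")
        if state == "S":
            return ("M", "invalidate other copies")
        return ("M", "none")  # E becomes M silently; M stays M
    if event == "remote_read":
        if state == "M":
            return ("S", "supply clean copy")
        # An exclusive copy drops to shared; S and I are unaffected.
        return ("S" if state == "E" else state, "none")
    if event == "evict":
        return ("I", "write back" if state == "M" else "none")
    raise ValueError(f"unknown event: {event}")


# Usage: a read of an uncached block followed by a write to it.
state, action = mesi_next("I", "read")
print(state, action)
state, action = mesi_next(state, "write")
print(state, action)
```

The read-then-write sequence in the usage lines is the case where MESI saves an invalidation relative to MSI: the write to the exclusive copy needs no action.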
The MESI protocol

If MESI is implemented as a snooping protocol, then the main advantage over the three-state protocol arises when a read of an uncached block is followed by a write to that block.
- After the uncached block is read, it is marked exclusive.
- Note that, when writing to a shared block, the transaction has to be posted on the bus so that other sharers invalidate their copies.
- But when writing to an exclusive block, there is no need to post the transaction on the bus.
- Hence, by distinguishing between the shared and exclusive states, we can avoid bus transactions when writing to an exclusive block.
- However, a cache that holds an exclusive block now has to monitor the bus for any read of that block. Such a read will change the state to shared.

This advantage disappears in a directory protocol, since after a write to an exclusive block the directory has to be notified to change the state to modified.

Latency optimization

1) Forwarding requests
[Figure: three message-flow diagrams among L, H and the owner. Labels: (i) 1: req, 2: reply, 3: req, 4: reply, 4: revise; (ii) 1: req, 2: forward, 3: respond, 4: reply; (iii) 1: req, 2: forward, 3: reply, 3: revise]

2) Use SRAM for directories (hardware optimization)

3) Overlap activities on the critical path
- parallel multiple invalidations
- parallel lookup of directory and memory at the home node
Storage overhead

In the simplest representation of a directory entry, a full bit vector is used for each entry (one bit indicates presence in each node).
- Storage overhead doesn't scale well with the number of nodes.
- Larger blocks (cache lines) mean lower overhead.

For a very large number of nodes, a list of sharers may be used instead of a bit vector.
- Lower overhead if there are only a few sharers.
- Example: for 1024 processors, overhead is reduced if there are fewer than 100 sharers.

Overhead may be reduced further by keeping directory entries only for the blocks that are cached (uncached blocks do not need an entry).
- The directory entries for the cached blocks can be kept in a hash table (an associative cache structure).
- Cached copies should be invalidated when the directory entry is removed (evicted) from the hash table.

Cache-based Directory Schemes

[Figure: several caches each holding a copy of block x, linked together, and memory holding x]

- Keep the information about the sharers of a cached block in the caches, by linking the replicated cached entries in a linked list, rather than storing a list of sharers with the block in main memory.
- When a processor caches a block, it inserts itself at the front of the linked list.
- To invalidate a cache block in the other caches, follow the linked list (easier if it is a doubly linked list).
- Scalable Coherent Interface (SCI), IEEE Standard
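The storage-overhead claims above (larger blocks lower the overhead; for 1024 processors a sharer list wins below roughly 100 sharers) can be checked with a short calculation. The sketch assumes that each sharer pointer costs ceil(log2 N) bits and ignores the few state bits per entry; the function names are hypothetical.

```python
def bit_vector_bits(num_nodes: int) -> int:
    """A full bit vector needs one presence bit per node."""
    return num_nodes


def sharer_list_bits(num_nodes: int, num_sharers: int) -> int:
    """A sharer list needs one node id (ceil(log2 N) bits) per sharer."""
    pointer_bits = (num_nodes - 1).bit_length()
    return num_sharers * pointer_bits


def overhead(entry_bits: int, block_bytes: int) -> float:
    """Directory bits per bit of data in the block."""
    return entry_bits / (block_bytes * 8)


# Usage: 1024 nodes, comparing the two representations.
print(bit_vector_bits(1024))                # bits per entry, full bit vector
print(sharer_list_bits(1024, 100))          # bits per entry, 100 sharers
print(overhead(bit_vector_bits(1024), 64))  # overhead with 64-byte blocks
print(overhead(bit_vector_bits(1024), 128)) # overhead with 128-byte blocks
```

With 1024 nodes each pointer is 10 bits, so a list of k sharers costs 10k bits and beats the 1024-bit vector as long as k stays at or below 102, which matches the "fewer than 100 sharers" rule of thumb.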
Hierarchical approaches to coherence

Multi-level schemes are especially useful for multi-node systems in which each node is a multiprocessor (example: multiple SMPs).

Examples of two-level systems:
- Snooping-directory
- Directory-directory
- Directory-snooping
[Figure: two-level organizations: buses (B1) with main memory and directory joined by a network; first-level networks (Network1) joined through adapters to Network 2; a bus at the second level]

Cache organization in multicore systems

Shared systems
[Figure: memory controller, system interconnect, memory system]
- Example: Intel Core Duo Pentium
- Uses the MESI (Modified, Exclusive, Shared, Invalid) cache coherence protocol to keep the data coherent

Private systems
[Figure: memory controller, system interconnect, memory system]
- Example: AMD Dual Core Opteron
- Uses the MOESI (M + Owned + ESI) cache coherence protocol to keep the data coherent ( is inclusive to )
Example of distributed directories in CMPs

[Figure: cores 0, 1, ..., n, each with a module of the distributed shared cache and a directory (Dir), connected by an on-chip network to off-chip memory (or on-chip L3)]

Directories are used to keep track of the state of shared entities that are cached in multiple private caches.

If the modules form a shared cache space, then the directories play a role very similar to their role in distributed shared memory systems.
- They preserve coherence in the private caches.
- There is one directory entry for each entry in the shared cache.
- The location of a cache line in the shared cache is determined by the address of the cache entry.

Example of distributed directories in CMPs

[Figure: two organizations: tiles with per-tile directories (dir) on an on-chip network in front of shared memory, and tiles on an on-chip network with a centralized directory in front of shared memory]

If each module is private to the corresponding core, then on-chip directories may be used as a replacement for a centralized directory.
- Each cache block is associated with a directory entry.
- Only cache blocks that are on chip need to have directory entries.
- How do you organize and distribute the directory entries among tiles? The location of a directory entry (called its home) is determined by the address.
The Tilera TILE-Gx36 Architecture

[Figure: chip block diagram: MiCA; UART x2, USB x2, JTAG, I2C, SPI; three PCIe interfaces; flexible I/O; two memory controllers (DDR3); mPIPE; four XAUI ports]

- 36 processor cores
- 866M, 1.2GHz, 1.5GHz clk
- 12 MBytes total cache
- 40 Gbps total packet I/O
  - 4 ports 10GbE (XAUI)
  - 16 ports 1GbE
- 48 Gbps PCIe I/O
- 2 16Gbps Stream IO ports
- Wire-speed packet engine: 60 Mpps
- MiCA engine: 20 Gbps crypto; compress & decompress

TILE-Gx100: Complete System-on-a-Chip with bit cores

[Figure: chip block diagram: two MiCA engines; UART x2, USB x2, JTAG, I2C, SPI; three PCIe interfaces; flexible I/O; four memory controllers (DDR3); mPIPE; two Interlaken interfaces; eight XAUI ports]

- 1.2GHz, 1.5GHz
- 32 MBytes total cache
- 546 Gbps peak mem BW
- 200 Tbps iMesh BW
- Packet I/O
  - 8 ports XAUI / 2 XAUI
  - 2 40Gb Interlaken
  - 32 ports 1GbE
- 80 Gbps PCIe I/O
- 3 StreamIO ports (20Gb)
- Wire-speed packet engine: 120 Mpps
- MiCA engines: 40 Gbps crypto; compress & decompress
The Tilera core

Processor
- Each core is a complete computer
- 3-way VLIW CPU
- Protection and interrupts

Memory
- Caches
- Virtual and physical address space
- Instruction and data TLBs
- Cache-integrated 2D DMA engine

Software
- Runs SMP Linux
- Runs off-the-shelf C/C++ programs
- Signal processing and general apps

[Figure: core block diagram: register file; three execution pipelines; 16K L1-I cache with I-TLB; 8K L1-D cache with D-TLB; 2D DMA; 64K cache; terabit switch]

Tilera Tile64 x5
Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy
More informationECE 669 Parallel Computer Architecture
ECE 669 arallel Computer Architecture Lecture 2 Architectural erspective Overview Increasingly attractive Economics, technology, architecture, application demand Increasingly central and mainstream arallelism
More informationLecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols