Embedded Systems Architecture. Multiprocessor and Multicomputer Systems


1 Embedded Systems Architecture Multiprocessor and Multicomputer Systems M. Eng. Mariusz Rudnicki 1/57

2 Multiprocessor Systems 2/57

3 Multiprocessor Systems 3/57

4 Multiprocessor Systems 4/57

5 UMA SMP Architectures Uniform Memory Access: all processors share a single centralized primary memory, and each CPU has the same memory access time. These systems are also called Symmetric Shared-memory Multiprocessors (SMP) (Hennessy-Patterson, Fig. 6.1) 5/57

6 UMA Bus-Based SMP Architectures The simplest multiprocessors use a single bus: two or more CPUs and one or more memory modules use the same bus for communication; when the bus is busy and a CPU wants to access memory, it must wait; increasing the number of CPUs increases the waiting time; this can be reduced by adding cache support. 6/57

7 UMA Bus-Based SMP Architectures Multicore processors are small UMA multiprocessor systems in which the first shared cache (L2 or L3) is the communication channel. Shared memory can become a bottleneck for system performance, because all processors must synchronize on the single bus and on memory access. Caches local to each CPU alleviate the problem; furthermore, each processor can be equipped with a private memory to store data of computations that need not be shared with other processors. Traffic to/from shared memory can thus be reduced considerably (Tanenbaum, Fig. 8.24) 7/57

8 SINGLE BUS TOPOLOGY 8/57

9 SINGLE BUS TOPOLOGY 9/57

10 SINGLE BUS TOPOLOGY Local caches pose a fundamental issue: each processor sees memory through its own cache, so two processors can see different values for the same memory location.

Time  Event                 Cache A  Cache B  RAM location X
1     CPU A reads X         1        -        1
2     CPU B reads X         1        1        1
3     CPU A stores 0 in X   0        1        0

After step 3, cache B still holds the stale value 1 for X. 10/57

11 CACHE COHERENCY This issue is cache coherency; if not solved, it prevents the use of caches in the processors, with heavy consequences on performance. Many cache coherency protocols have been proposed, all of them designed to prevent different versions of the same cache block from being present in two or more caches. All solutions are at the hardware level: each cache controller is capable of monitoring all memory requests on the bus coming from other CPUs and, if necessary, activating the coherency protocol. Cache controllers implement bus snooping. 11/57

12 SNOOPING CACHES WRITE THROUGH is a simple cache coherency protocol. Let us look at the events that occur between a processor accessing data and its cache: read miss: the CPU cache controller fetches the missing block from RAM and loads it into the cache, so subsequent reads of the same data are read hits; write miss: the modified data are written directly to RAM; the block containing the data is not loaded into the local cache first; write hit: the cache block is updated and the update is propagated to RAM. All write operations are propagated to RAM, whose content is therefore always up to date. 12/57

13 SNOOPING CACHES Now we consider the operations on the side of the snooper in another CPU. Cache X generates read/write operations, cache Y is a snooping cache (Tanenbaum, Fig. 8.25): read miss: cache Y sees cache X fetch a block from memory but does nothing (on a read hit it sees nothing at all); write miss/hit: cache Y checks whether it holds a copy of the modified data: if not, it takes no action; if it does hold a copy, the block containing it is flagged as invalid in cache Y.

Action      Local request              Remote request
Read miss   Fetch data from memory     (no action)
Read hit    Use data from local cache  (no action)
Write miss  Update data in memory      (no action)
Write hit   Update cache and memory    Invalidate cache entry

13/57
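The local and snooper-side rules of the write-through protocol above can be sketched as a small simulation. This is a minimal model, not the slides' hardware; class and method names are illustrative assumptions.

```python
# Minimal model of the write-through snooping protocol: local reads/writes
# on one side, invalidation by remote snoopers on the other.

class WriteThroughCache:
    def __init__(self, ram):
        self.ram = ram          # shared memory: dict addr -> value
        self.lines = {}         # local cache: addr -> value

    # --- local requests ---
    def read(self, addr):
        if addr not in self.lines:            # read miss: fetch the block from RAM
            self.lines[addr] = self.ram[addr]
        return self.lines[addr]               # read hit: use the local copy

    def write(self, addr, value, others):
        if addr in self.lines:                # write hit: update the cache block...
            self.lines[addr] = value
        self.ram[addr] = value                # ...and always propagate to RAM
        for c in others:                      # bus broadcast seen by remote snoopers
            c.snoop_write(addr)

    # --- remote request seen by the snooper ---
    def snoop_write(self, addr):
        self.lines.pop(addr, None)            # invalidate our copy, if we hold one

ram = {0x10: 1}
a, b = WriteThroughCache(ram), WriteThroughCache(ram)
a.read(0x10); b.read(0x10)                    # both caches now hold the line
a.write(0x10, 0, others=[b])                  # A writes: RAM updated, B invalidated
assert b.read(0x10) == 0                      # B's re-read misses and sees the new value
```

Note how the snooper never forwards data; it only invalidates, exactly as in the table above.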

14 SNOOPING CACHES Since all caches snoop on all memory actions from other caches, when a cache modifies a data item, the update is carried out in the cache itself and in memory; the old block is removed from all other caches (flagged as invalid). With this protocol, caches hold no inconsistent data. Variations of the basic protocol exist: for example, old blocks could be updated with the new value instead (replicating writes). This version requires more work, but prevents future cache misses. 14/57

15 SNOOPING CACHES The advantage of this cache coherency protocol is its simplicity. The basic disadvantage of write-through-based protocols is inefficiency: the communication bus is the bottleneck. To alleviate the problem, in a write-back variant not all write operations are immediately propagated to RAM: a bit is set in the cache block to signal that the block is up to date while memory is stale. The modified block will be written to RAM eventually, possibly after several more updates (not after each of them). 15/57

16 UMA Multiprocessors MESI Protocol Even with the MESI protocol, a single bus interfacing all processors with memory limits the size of UMA multiprocessor systems to about 32 CPUs. The MESI states: 1. Invalid The cache entry does not contain valid data. 2. Shared Multiple caches may hold the line; memory is up to date. 3. Exclusive No other cache holds the line; memory is up to date. 4. Modified The entry is valid; memory is invalid; no copies exist. 16/57

17 UMA Multiprocessors MESI Protocol 17/57

18 UMA Multiprocessors MESI Protocol After the CPU is initially booted, all cache entries are marked invalid. On the first memory read, the referenced line is fetched into the cache of CPU 1, the CPU reading memory. It is marked as being in the E (exclusive) state: this is the only copy in a cache. Then CPU 2 reads the same memory block. CPU 1's snooper sees that it is no longer alone and announces on the bus that it also has a copy. Both copies are marked as being in the S (shared) state. Next, CPU 2 writes data to the cache line. It puts out an invalidate signal on the bus so that the other CPUs discard their copies. The cached copy goes to the M (modified) state. Then CPU 3 wants to read the block. CPU 2, which is the owner, asserts a signal on the bus telling CPU 3 to please wait while it writes its line back to memory. Afterwards CPU 3 fetches a copy, and both copies are marked as shared. 18/57

19 UMA Multiprocessors MESI Protocol After that, CPU 2 writes the line again, which invalidates the copy in CPU 3's cache. In the end, CPU 1 writes to a word in the line. CPU 2 sees that and asserts a bus signal telling CPU 1 to wait while it writes its line back to memory. When it finishes, it marks its own copy as invalid, since it knows that another CPU is about to modify the line. Finally, consider the situation in which a CPU writes to an uncached line. If the write-allocate policy is in use, the line will be loaded into the cache and marked modified (M state). If the write-allocate policy is not in use, the write goes directly to memory and the line is not cached anywhere. 19/57
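The MESI scenario walked through on the last two slides can be replayed with a toy state machine. This is a sketch that models only the state transitions of one cache line (not data movement or write-back timing); all names are illustrative.

```python
# Toy MESI state machine for a single cache line, replaying the scenario
# from the slides. States are 'M', 'E', 'S', 'I'.

class Line:
    def __init__(self):
        self.state = 'I'                      # all entries invalid after boot

def read(cpu, caches):
    others = [c for c in caches if c is not cpu and c.state != 'I']
    for c in others:
        if c.state == 'M':
            c.state = 'S'                     # owner writes back, then shares
        elif c.state == 'E':
            c.state = 'S'                     # snooper announces "I also have a copy"
    cpu.state = 'S' if others else 'E'        # exclusive only if no other copy exists

def write(cpu, caches):
    for c in caches:
        if c is not cpu:
            c.state = 'I'                     # invalidate signal on the bus
    cpu.state = 'M'

cpu1, cpu2, cpu3 = Line(), Line(), Line()
caches = [cpu1, cpu2, cpu3]

read(cpu1, caches);  assert cpu1.state == 'E'                      # first read: exclusive
read(cpu2, caches);  assert (cpu1.state, cpu2.state) == ('S', 'S')
write(cpu2, caches); assert (cpu1.state, cpu2.state) == ('I', 'M')
read(cpu3, caches);  assert (cpu2.state, cpu3.state) == ('S', 'S') # owner writes back
write(cpu2, caches); assert cpu3.state == 'I'                      # CPU 2 writes again
write(cpu1, caches); assert (cpu1.state, cpu2.state) == ('M', 'I') # CPU 2 self-invalidates
```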

20 UMA Multiprocessors - Crossbar Switches To overcome this limitation, a different kind of interconnection network is needed. The simplest solution for connecting n CPUs to m memory modules is the crossbar switch. Crossbar switches have long been used in telecommunications switches. At each intersection is a crosspoint: a switch that can be opened or closed. The crossbar is a non-blocking network. The number of switches, however, grows as n x m, which limits the size of the system. 20/57
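The switch-count arithmetic can be made concrete. A crossbar needs one crosspoint per (CPU, memory module) pair, so the cost grows quadratically as the system grows; the helper name below is illustrative.

```python
# A crossbar connecting n CPUs to m memory modules needs one crosspoint
# switch per (CPU, memory) pair, i.e. n * m switches. This quadratic
# growth is what limits crossbars to modest system sizes.

def crosspoints(n_cpus, n_mems):
    return n_cpus * n_mems

assert crosspoints(8, 8) == 64
assert crosspoints(64, 64) == 4096      # 8x the CPUs, 64x the switches
```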

21 UMA Multiprocessors - Crossbar Switches 21/57

22 UMA Multiprocessors - Crossbar Switches 22/57

23 NUMA Multiprocessors NUMA Non Uniform Memory Access: these systems use a shared logical address space, but physical memory is distributed among the CPUs, so data access time depends on the position of the data, in local or in remote memory. These systems are also called Distributed Shared Memory (DSM) architectures (Hennessy-Patterson, Fig. 6.2). 23/57

24 NUMA Multiprocessors In NUMA systems all CPUs share the same address space, but each processor has a local memory, visible to all other processors. Access to local memory blocks is quicker than access to remote memory blocks. All NUMA systems have a single logical address space shared by all CPUs, but physical memory is distributed among the processors, so there are two kinds of memory: local and remote. Even remote memory is accessed by each CPU with ordinary LOAD and STORE instructions. NUMA systems come in two types: Non-Caching NUMA (NC-NUMA); Cache-Coherent NUMA (CC-NUMA). 24/57

25 NC-NUMA Multiprocessors In an NC-NUMA system the CPUs have no local cache. Each memory access is managed by a modified MMU, which checks whether the request is for a local or for a remote block; in the latter case, the request is forwarded to the node containing the requested data. Programs using remote data will run much slower than they would if the data were stored in local memory (Tanenbaum, Fig. 8.30). 25/57

26 NC-NUMA Multiprocessors In NC-NUMA systems there is no cache coherency problem because there is no caching at all: each memory item is in a single location. Remote memory access, however, is very inefficient. Because of this, NC-NUMA systems can resort to special software that relocates memory pages from one node to another to maximize performance: a page scanner daemon activates every few seconds, examines statistics on memory usage, and moves pages from one node to another to increase performance. In NC-NUMA systems, each processor can also have a private memory and a cache, but only private data (those allocated in the private local memory) can be cached. This solution increases the performance of each processor and is adopted in the Cray T3D/E, where remote data access still costs about 400 processor clock cycles, against 2 for retrieving data from the local cache. 26/57

27 CC-NUMA Multiprocessors Caching can mitigate the problem of remote data access, but it brings back the cache coherency issue. Bus snooping is one method to enforce coherency, but this technique is too expensive beyond a certain number of processors, and it is much too difficult to implement in systems that do not rely on bus-based interconnections. The DIRECTORY-BASED PROTOCOL is a common approach to enforce cache coherency in CC-NUMA systems containing many processors. The main idea is to associate each node in the system with a directory for its RAM blocks: a database stating in which cache each block is located, and what its state is. When a block of memory is addressed, the directory in the node where the block resides is queried, to learn whether the block is in any cache and, if so, whether it has been changed with respect to the copy in RAM. 27/57

28 CC-NUMA Multiprocessors A directory is queried at each memory access to the corresponding memory block, so it must be implemented in very fast hardware, for instance with an associated cache, or at least with SRAM. Let us consider a 256-node system, each node equipped with a CPU and 16 MB of local RAM. Total system RAM is 2^32 bytes = 4 GB; each node holds 2^18 blocks of 64 bytes (2^18 * 2^6 = 2^24 bytes = 16 MB). The address space is shared: node 0 contains memory addresses 0-16 MB, node 1 contains 16-32 MB, and so on. A physical address consists of 32 bits. The 8 most significant bits specify the number of the node holding the RAM block containing the addressed data. 28/57

29 CC-NUMA Multiprocessors The following 18 bits identify the block within the 16 MB memory bank. The 6 least significant bits address the byte within the block (Tanenbaum, Fig. 8.31b).

Node    Block    Offset
8 bits  18 bits  6 bits

Each node has a directory holding 2^18 entries, one per block of the associated local memory. Each directory entry records whether the block is stored in any cache and, if so, in which node. 29/57
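The 8 + 18 + 6 bit field layout above can be checked with a few shifts and masks. A minimal sketch; the function name is an illustrative choice, not part of the slides.

```python
# Splitting the 32-bit physical address of the example system into its
# node / block / offset fields (8 + 18 + 6 bits).

def split_address(addr):
    offset = addr & 0x3F                 # low 6 bits: byte within the 64-byte block
    block  = (addr >> 6) & 0x3FFFF       # next 18 bits: block within the node
    node   = (addr >> 24) & 0xFF         # top 8 bits: node number
    return node, block, offset

# Build the address for node 34, block 4, offset 8 and split it back.
addr = (34 << 24) | (4 << 6) | 8
assert split_address(addr) == (34, 4, 8)
```

This is the decomposition the MMU performs before routing a request to the home node.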

30 CC-NUMA Multiprocessors At first we assume that each 64-byte block is stored in at most a single cache. What happens when CPU 15 executes a LOAD specifying a RAM address? CPU 15 forwards the address to the local MMU, which translates the LOAD into a physical address, e.g. one with node 34, block 4, offset 8. The MMU sees that the addressed data belongs to node 34, and sends the request through the network to that node, asking whether block 4 is in a cache and, if so, in which one. 30/57

31 CC-NUMA Multiprocessors Node 34 forwards the request to its own directory, which checks and discovers that the block is in no remote node's cache. The block is fetched from local RAM and sent to node 15, and the node 34 directory is updated to record that block 4 is now cached at node 15. 31/57

32 CC-NUMA Multiprocessors Now we consider a request for block 2 of node 34. Node 34's directory discovers that block 2 is cached in node 82. The directory updates the block 2 entry to reflect that the block is now at node 15, and sends node 82 a message requesting that block 2 be sent to node 15 and that the corresponding entry in node 82 be invalidated. When are blocks updated in RAM? Only when they are modified. The simplest solution: when a CPU executes a STORE, the update is propagated to the RAM holding the addressed block. This type of architecture has a lot of messages flowing through the interconnection network. 32/57

33 CC-NUMA Multiprocessors The overhead can be easily tolerated. Each node has 16 MB of RAM and 2^18 9-bit directory entries to keep track of the status of its blocks: the overhead is about 1.76%. With 32-byte blocks the overhead increases to about 4%, while it decreases with 128-byte blocks. In real systems, directory-based architectures are more complicated: here a block can be in at most one cache, and system efficiency can be increased by allowing blocks to be in several caches (nodes) at the same time; by keeping track of the status of the block (modified, untouched), communication between CPU and memory can be minimized. For instance, if a cache block has not been modified, the original block in RAM is valid, and a read from a remote CPU for that block can be answered by the RAM itself, without fetching the block from the cache that holds a copy. 33/57
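The overhead figures above follow directly from the entry size. A quick check, assuming (as the example suggests) a 9-bit entry per block, i.e. an 8-bit node number plus a valid bit:

```python
# Reproducing the directory-overhead arithmetic: one 9-bit entry per block
# of a node's 16 MB RAM, compared against the RAM itself (in bits).

def directory_overhead(block_size, entry_bits=9, ram=16 * 2**20):
    n_blocks = ram // block_size
    return (n_blocks * entry_bits) / (ram * 8)   # directory bits per RAM bit

assert round(directory_overhead(64) * 100, 2) == 1.76    # 64-byte blocks
assert round(directory_overhead(32) * 100, 2) == 3.52    # the slides' "about 4%"
assert directory_overhead(128) < directory_overhead(64)  # bigger blocks, less overhead
```

Per block the ratio is simply 9 bits of directory per 8 x block_size bits of RAM, which is why halving the block size doubles the overhead.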

34 CC-NUMA Multiprocessors Process synchronization Mono-processor systems synchronize using system calls or constructs of the programming language: semaphores, conditional critical regions, monitors. These methods are based on specific hardware synchronization primitives: an uninterruptible machine instruction capable of fetching and modifying a value, or of exchanging the contents of a register and of a memory word. Multiprocessor systems require similar primitives: processes share a single address space, and synchronization must use this address space, rather than resorting to message-exchange mechanisms. 34/57

35 CC-NUMA Multiprocessors Process synchronization Example: the classical solution to the critical section problem in mono-processor systems is based on an atomic exchange operation. Using this method we can build high-level synchronization primitives such as semaphores. Atomicity of the synchronization instruction alone, however, is not sufficient. On one processor an atomic instruction is executed without interrupts, but what about the other processors? Would it be correct to disable all memory accesses from the moment a synchronization primitive is launched until the associated variable has been modified? It is possible, but it would slow down all memory operations not involved in synchronization (and so far we are ignoring any cache effects). 35/57

36 CC-NUMA Multiprocessors Process synchronization Many processors use a pair of instructions, executed in sequence. The first instruction tries to bring to the CPU the shared variable used by all processors for synchronization. The second one tries to modify the shared variable, and returns a value that tells whether the pair was executed in an atomic fashion. In a multiprocessor this means: no other process has modified the variable used for synchronization before the pair completed execution, and no context switch occurred in the processor between the two instructions. 36/57

37 CC-NUMA Multiprocessors Process synchronization Let [0(R1)] be the content of the memory word addressed by 0(R1), used as the shared synchronization variable. 1) LL Rx, 0(R1) linked load 2) SC Ry, 0(R1) store conditional The outcome of the two instructions with respect to [0(R1)] depends on what happens in between: if [0(R1)] is modified (by another process) before SC executes, SC fails: [0(R1)] is not modified by SC and 0 is written into Ry; if SC does not fail, Ry is copied into [0(R1)] and 1 is written into Ry. SC also fails (with the same effects) if a context switch occurs in the CPU between the execution of the two instructions. 37/57

38 CC-NUMA Multiprocessors Process synchronization LL and SC are special instructions that use an invisible register, the link register; LL stores the address of the memory reference in the link register. The link register is cleared if: the cache block it refers to is invalidated; a context switch is executed. SC checks that the memory reference and the link register match; if so, the LL/SC pair behaves as an atomic memory reference. Inserting other instructions between LL and SC must be done with care; only register-register instructions are safe. 38/57
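The LL/SC semantics just described can be modelled in a few lines. This is a toy software model of the link-register behaviour (not real hardware atomicity); class and method names are illustrative.

```python
# Toy model of LL/SC: SC succeeds only if the link register still matches,
# i.e. no intervening write to the watched word and no context switch.

class CPU:
    def __init__(self, mem):
        self.mem = mem
        self.link = None                  # the invisible link register

    def ll(self, addr):
        self.link = addr                  # remember the watched address
        return self.mem[addr]

    def sc(self, addr, value):
        if self.link != addr:             # link cleared or mismatched: SC fails
            return 0                      # ...and memory is left untouched
        self.mem[addr] = value
        self.link = None
        return 1                          # success flag, as written into Ry

    def snoop_invalidate(self, addr):     # another CPU wrote the block...
        if self.link == addr:
            self.link = None              # ...so the link register is cleared

mem = {0: 0}
p, q = CPU(mem), CPU(mem)

p.ll(0)
assert p.sc(0, 1) == 1 and mem[0] == 1      # undisturbed LL/SC pair succeeds

p.ll(0)
q.ll(0); q.sc(0, 2); p.snoop_invalidate(0)  # q modifies the word in between
assert p.sc(0, 3) == 0 and mem[0] == 2      # p's SC fails, memory untouched
```

A context switch would clear `link` the same way `snoop_invalidate` does.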

39 CC-NUMA Multiprocessors Process synchronization Example of atomic exchange between R4 and [0(R1)] in a shared memory multiprocessor system:

retry_: OR   Ry, R4, R0    ; copy R4 into Ry
        LL   Rx, 0(R1)     ; linked load of the old value
        SC   Ry, 0(R1)     ; store conditional of the new value
        BEQZ Ry, retry_    ; SC failed: retry the whole pair
        MOV  R4, Rx        ; old value into R4

When MOV is executed, R4 and [0(R1)] have been exchanged atomically: we are guaranteed that [0(R1)] has not been changed by other processes before the completion of the exchange (EXCH). 39/57

40 CC-NUMA Multiprocessors Process synchronization Using EXCH it is possible to implement spin locks: a processor gains access to a critical section by cycling on a lock variable, giving mutually exclusive access. The lock variable tells whether the critical section is free or occupied by another process. Busy waiting to implement a critical section is the right solution only for very short critical sections. Very short critical sections can in turn be used to implement high-level synchronization and mutual-exclusion mechanisms, such as semaphores. Busy waiting is less of a problem in multiprocessors. Why? 40/57

41 CC-NUMA Multiprocessors Process synchronization If there were no caches (and no coherency), the lock variable could be left in memory; a process tries to get the lock with an atomic exchange, and checks whether the lock is free. If cache coherency is in place, the lock variable can be kept in the caches of all the CPUs. This makes spin locks more efficient (processors spin on their caches). 41/57

42 CC-NUMA Multiprocessors Process synchronization

lockit: LD   R2, 0(R1)     // load of lock
        BNEZ R2, lockit    // lock not available, spin again
        ADD  R2, R0, #1    // prepare value for locking (R2 := 1)
        EXCH R2, 0(R1)     // atomic exchange
        BNEZ R2, lockit    // spin if lock was not 0

The following example shows a case with three CPUs working according to MESI. Once CPU 0 sets the lock to 0, the entries in the two other caches are invalidated and the new value must be fetched from the cache of CPU 0. One of the two gets the value 0 first and succeeds in the exchange; the other processor finds the lock variable set to 1 and starts spinning again. 42/57
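The spin-lock loop above can be transcribed into Python to make its two-phase structure explicit. The `exch()` helper stands in for the atomic EXCH instruction (on real hardware it is one atomic bus transaction); all names are illustrative.

```python
# Test-and-test-and-set spin lock, mirroring the assembly loop:
# spin on an ordinary load first, then attempt the atomic exchange.

def exch(mem, addr, value):
    # Stand-in for the EXCH instruction: atomically swap and return the old value.
    old = mem[addr]
    mem[addr] = value
    return old

def acquire(mem, addr):
    while True:
        while mem[addr] != 0:        # spin reading the (cached) copy: no bus traffic
            pass
        if exch(mem, addr, 1) == 0:  # lock looked free: try the atomic exchange
            return                   # got 0 back, so the lock is now ours

def release(mem, addr):
    mem[addr] = 0                    # an ordinary store frees the lock

mem = {0: 0}
acquire(mem, 0)                      # lock free: acquire returns immediately
assert exch(mem, 0, 1) == 1          # lock held: a rival's exchange returns 1, so it spins
release(mem, 0)
assert mem[0] == 0
```

Spinning on the plain load before trying `exch` is what keeps the waiting processors in their caches instead of hammering the bus.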

43 CC-NUMA Multiprocessors Process synchronization 43/57

44 CC-NUMA Multiprocessors Memory consistency models STRICT consistency: any read from a memory location X always returns the value most recently written to that location. SEQUENTIAL consistency: all CPUs see the same ordering of memory operations. 44/57

45 COMA Multiprocessors NUMA and CC-NUMA machines have the disadvantage that references to remote memory are much slower than references to local memory. In CC-NUMA, this performance difference is hidden to some extent by the caching; but if the remote data in use greatly exceed the cache capacity, cache misses will occur constantly and system performance will be poor. An alternative kind of multiprocessor tries to use each CPU's main memory as a cache. COMA Cache Only Memory Access: in these systems data have no specific permanent location (no specific home memory address) where they stay and from which they can be read (copied into local caches) and/or modified. In COMA systems the physical address space is split into cache lines, which migrate around the system on demand; blocks do not have home machines. A memory that just attracts lines as needed is called an attraction memory. Using the main RAM as a big cache greatly increases the hit rate, and hence the performance. 45/57

46 COMA Multiprocessors COMA systems introduce two new problems: 1. How are cache lines located? 2. When a line is purged from memory, what happens if it is the last copy? The first question relates to the fact that, once the MMU has translated a virtual address to a physical address, if the line is not in the true hardware cache there is no easy way to tell whether it is in main memory at all. Several solutions have been proposed. To check whether a cache line is in main memory, new hardware could be added to keep track of the tag of each cache line; the MMU could then compare the tag of the needed line to the tags of all the cache lines in memory to look for a hit. 46/57

47 COMA Multiprocessors In simple COMA systems a different solution was implemented: map entire pages in, but do not require that all their cache lines be present. This solution needs a bitmap per page, with one bit per cache line indicating the presence or absence of the line. If a cache line is present, it must be in the right position in its page. If it is not present, any attempt to use it causes a trap, and software must find the line and bring it in. Another solution is to give each page a home machine in terms of where its directory entry is, but not where the data are; then a message can be sent to the home machine to at least locate the cache line. Other schemes involve organizing memory as a tree and searching upward until the line is found. 47/57
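The bitmap-per-page scheme can be sketched in a few lines: the page is mapped, but each line carries a presence bit, and touching an absent line traps to software. A minimal model; the class, the constant, and the use of an exception for the trap are illustrative assumptions.

```python
# Sketch of the COMA bitmap-per-page scheme: one presence bit per cache
# line of a mapped page; accessing an absent line traps to software.

LINES_PER_PAGE = 64                      # e.g. a 4 KB page of 64-byte lines

class ComaPage:
    def __init__(self):
        self.present = [False] * LINES_PER_PAGE   # the per-page bitmap
        self.lines = {}                           # only resident lines hold data

    def access(self, line_no):
        if not self.present[line_no]:
            # The trap: software must now locate the line elsewhere in the system.
            raise LookupError("line absent: trap to software")
        return self.lines[line_no]

    def install(self, line_no, data):    # the line was found and brought in
        self.present[line_no] = True
        self.lines[line_no] = data

page = ComaPage()                        # whole page mapped, no lines present yet
try:
    page.access(3)
    trapped = False
except LookupError:                      # absent line -> trap
    trapped = True
assert trapped
page.install(3, b"data")                 # software fetched the line
assert page.access(3) == b"data"         # subsequent accesses hit
```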

48 COMA Multiprocessors The second problem relates to not purging the last copy. What happens if the line chosen for eviction happens to be the last copy? In this case, it cannot be thrown out. One solution is to go back to the directory and check whether there are other copies: if so, the line can be safely thrown out; otherwise it has to be migrated somewhere else. Another solution is to label one copy of each cache line as the master copy and never throw it out; this avoids having to check with the directory. Examples: KSR-1, Data Diffusion Machine, SDAARC. 48/57

49 Message Passing Multi-computers In multiprocessor systems shared memory can be implemented in many ways, including snooping buses, data crossbars, multistage switching networks, and various directory based schemes. Programs written for a multiprocessor can just access any location in memory without knowing anything about the internal topology or implementation scheme. This illusion is what makes multiprocessors so attractive and why programmers like this programming model. On the other hand, multiprocessors also have their limitations, which is why multi-computers are important, too. First and foremost, multiprocessors do not scale to large sizes. 49/57

50 Message Passing Multi-computers Consider the enormous amount of hardware Sun had to use to get the E25K to scale to 72 CPUs. In contrast, a multicomputer with 65,536 CPUs is nothing special. It will be years before anyone builds a commercial 65,536-node multiprocessor, and by then million-node multi-computers will be in use. 50/57

51 Message Passing Multi-computers (A.S. Tanenbaum) 51/57

52 Multi-computers Interconnection Network There are many different topologies of the interconnection networks. A star A complete interconnect 52/57

53 Multi-computers Interconnection Network A tree A ring 53/57

54 Multi-computers Interconnection Network A grid A double torus 54/57

55 Multi-computers Interconnection Network A cube A 4D hypercube 55/57

56 Message Passing Multi-computers There are two general styles of multi-computers: MPPs and clusters. The first category consists of the MPPs (Massively Parallel Processors), which are huge supercomputers. These are used in science, in engineering, and in industry for very large calculations, for handling very large numbers of transactions per second, or for data warehousing (storing and managing immense databases). Initially, MPPs were primarily used as scientific supercomputers, but now most of them are used in commercial environments. Another point that characterizes MPPs is their enormous I/O capacity. Problems big enough to warrant using MPPs invariably have massive amounts of data to be processed, often terabytes. These data must be distributed among many disks and need to be moved around the machine at great speed. Examples: the IBM BlueGene system and the Red Storm machine at Sandia National Laboratories. 56/57

57 References 1. Andrew S. Tanenbaum, Structured Computer Organization, 5th Edition, Prentice Hall. 2. Adve and Gharachorloo, Shared Memory Consistency Models: A Tutorial. 3. Dongarra et al., The Sourcebook of Parallel Computing. 4. Hwang and Xu, Scalable Parallel Computing. 5. Gregory Pfister, In Search of Clusters, 2nd ed., Prentice Hall. 6. James K. Archibald, The Cache Coherency Problem in Shared-Memory Multiprocessors, University of Washington. 7. /cachecoherence.pdf 8. clstr-basics-cso.html 57/57


Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message pas Multiple processor systems 1 Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message passing multiprocessor, access

More information

Operating Systems, Fall Lecture 9, Tiina Niklander 1

Operating Systems, Fall Lecture 9, Tiina Niklander 1 Multiprocessor Systems Multiple processor systems Ch 8.1 8.3 1 Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message passing multiprocessor,

More information

Lecture 24: Virtual Memory, Multiprocessors

Lecture 24: Virtual Memory, Multiprocessors Lecture 24: Virtual Memory, Multiprocessors Today s topics: Virtual memory Multiprocessors, cache coherence 1 Virtual Memory Processes deal with virtual memory they have the illusion that a very large

More information

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it

More information

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]

More information

DISTRIBUTED SHARED MEMORY

DISTRIBUTED SHARED MEMORY DISTRIBUTED SHARED MEMORY COMP 512 Spring 2018 Slide material adapted from Distributed Systems (Couloris, et. al), and Distr Op Systems and Algs (Chow and Johnson) 1 Outline What is DSM DSM Design and

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Advanced OpenMP. Lecture 3: Cache Coherency

Advanced OpenMP. Lecture 3: Cache Coherency Advanced OpenMP Lecture 3: Cache Coherency Cache coherency Main difficulty in building multiprocessor systems is the cache coherency problem. The shared memory programming model assumes that a shared variable

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

Shared Symmetric Memory Systems

Shared Symmetric Memory Systems Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections Lecture 18: Coherence and Synchronization Topics: directory-based coherence protocols, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory)

More information

Parallel Architecture. Hwansoo Han

Parallel Architecture. Hwansoo Han Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range

More information

Shared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16

Shared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16 Shared memory Caches, Cache coherence and Memory consistency models Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Shared memory Caches, Cache

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Suggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!

Suggested Readings! What makes a memory system coherent?! Lecture 27 Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality! 1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and

More information

Distributed Shared Memory and Memory Consistency Models

Distributed Shared Memory and Memory Consistency Models Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single

More information

Computer Organization. Chapter 16

Computer Organization. Chapter 16 William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data

More information

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All

More information

Shared Memory Architecture Part One

Shared Memory Architecture Part One Babylon University College of Information Technology Software Department Shared Memory Architecture Part One By Classification Of Shared Memory Systems The simplest shared memory system consists of one

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains: The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

Organisasi Sistem Komputer

Organisasi Sistem Komputer LOGO Organisasi Sistem Komputer OSK 14 Parallel Processing Pendidikan Teknik Elektronika FT UNY Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology Review: Multiprocessor CPE 631 Session 21: Multiprocessors (Part 2) Department of Electrical and Computer Engineering University of Alabama in Huntsville Basic issues and terminology Communication: share

More information

9. Distributed Shared Memory

9. Distributed Shared Memory 9. Distributed Shared Memory Provide the usual programming model of shared memory in a generally loosely coupled distributed environment. Shared Memory Easy to program Difficult to build Tight coupling

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

Cache Coherence in Bus-Based Shared Memory Multiprocessors

Cache Coherence in Bus-Based Shared Memory Multiprocessors Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Parallel Architecture. Sathish Vadhiyar

Parallel Architecture. Sathish Vadhiyar Parallel Architecture Sathish Vadhiyar Motivations of Parallel Computing Faster execution times From days or months to hours or seconds E.g., climate modelling, bioinformatics Large amount of data dictate

More information

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

M4 Parallelism. Implementation of Locks Cache Coherence

M4 Parallelism. Implementation of Locks Cache Coherence M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Characteristics of Mult l ip i ro r ce c ssors r

Characteristics of Mult l ip i ro r ce c ssors r Characteristics of Multiprocessors A multiprocessor system is an interconnection of two or more CPUs with memory and input output equipment. The term processor in multiprocessor can mean either a central

More information

Distributed Shared Memory

Distributed Shared Memory Distributed Shared Memory History, fundamentals and a few examples Coming up The Purpose of DSM Research Distributed Shared Memory Models Distributed Shared Memory Timeline Three example DSM Systems The

More information

Lecture 25: Multiprocessors

Lecture 25: Multiprocessors Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

EE382 Processor Design. Processor Issues for MP

EE382 Processor Design. Processor Issues for MP EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency

More information

Cache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014

Cache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014 Cache Coherence Introduction to High Performance Computing Systems (CS1645) Esteban Meneses Spring, 2014 Supercomputer Galore Starting around 1983, the number of companies building supercomputers exploded:

More information

Shared Memory. SMP Architectures and Programming

Shared Memory. SMP Architectures and Programming Shared Memory SMP Architectures and Programming 1 Why work with shared memory parallel programming? Speed Ease of use CLUMPS Good starting point 2 Shared Memory Processes or threads share memory No explicit

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Lecture 9: MIMD Architecture

Lecture 9: MIMD Architecture Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is

More information

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O 6.823, L21--1 Cache Coherence Protocols: Implementation Issues on SMP s Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Coherence Issue in I/O 6.823, L21--2 Processor Processor

More information

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly

More information

Page 1. Cache Coherence

Page 1. Cache Coherence Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale

More information

Multiprocessor Synchronization

Multiprocessor Synchronization Multiprocessor Synchronization Material in this lecture in Henessey and Patterson, Chapter 8 pgs. 694-708 Some material from David Patterson s slides for CS 252 at Berkeley 1 Multiprogramming and Multiprocessing

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

CS4961 Parallel Programming. Lecture 4: Memory Systems and Interconnects 9/1/11. Administrative. Mary Hall September 1, Homework 2, cont.

CS4961 Parallel Programming. Lecture 4: Memory Systems and Interconnects 9/1/11. Administrative. Mary Hall September 1, Homework 2, cont. CS4961 Parallel Programming Lecture 4: Memory Systems and Interconnects Administrative Nikhil office hours: - Monday, 2-3PM - Lab hours on Tuesday afternoons during programming assignments First homework

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology CS252 Graduate Computer Architecture Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency Review: Multiprocessor Basic issues and terminology Communication:

More information

Intro to Multiprocessors

Intro to Multiprocessors The Big Picture: Where are We Now? Intro to Multiprocessors Output Output Datapath Input Input Datapath [dapted from Computer Organization and Design, Patterson & Hennessy, 2005] Multiprocessor multiple

More information

Multiprocessors 1. Outline

Multiprocessors 1. Outline Multiprocessors 1 Outline Multiprocessing Coherence Write Consistency Snooping Building Blocks Snooping protocols and examples Coherence traffic and performance on MP Directory-based protocols and examples

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

Parallel Architectures

Parallel Architectures Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s

More information

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Multi-Processor / Parallel Processing

Multi-Processor / Parallel Processing Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

Dr e v prasad Dt

Dr e v prasad Dt Dr e v prasad Dt. 12.10.17 Contents Characteristics of Multiprocessors Interconnection Structures Inter Processor Arbitration Inter Processor communication and synchronization Cache Coherence Introduction

More information

Limitations of parallel processing

Limitations of parallel processing Your professor du jour: Steve Gribble gribble@cs.washington.edu 323B Sieg Hall all material in this lecture in Henessey and Patterson, Chapter 8 635-640 645, 646 654-665 11/8/00 CSE 471 Multiprocessors

More information

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System

More information

Lecture 13. Shared memory: Architecture and programming

Lecture 13. Shared memory: Architecture and programming Lecture 13 Shared memory: Architecture and programming Announcements Special guest lecture on Parallel Programming Language Uniform Parallel C Thursday 11/2, 2:00 to 3:20 PM EBU3B 1202 See www.cse.ucsd.edu/classes/fa06/cse260/lectures/lec13

More information

Portland State University ECE 588/688. Cache-Only Memory Architectures

Portland State University ECE 588/688. Cache-Only Memory Architectures Portland State University ECE 588/688 Cache-Only Memory Architectures Copyright by Alaa Alameldeen and Haitham Akkary 2018 Non-Uniform Memory Access (NUMA) Architectures Physical address space is statically

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information