Embedded Systems Architecture. Multiprocessor and Multicomputer Systems
1 Embedded Systems Architecture Multiprocessor and Multicomputer Systems M. Eng. Mariusz Rudnicki 1/57
2-4 Multiprocessor Systems (figure slides; no recoverable text)
5 UMA SMP Architectures Uniform Memory Access: all processors share a single centralized primary memory, and each CPU has the same memory access time. These systems are also called Symmetric shared-memory Multiprocessors (SMP) (Hennessy-Patterson, Fig. 6.1).
6 UMA Bus-Based SMP Architectures The simplest multiprocessors use a single bus: two or more CPUs and one or more memory modules use the same bus for communication; when the bus is busy and a CPU wants to access memory, it must wait; increasing the number of CPUs increases the waiting time; this can be reduced by adding cache support.
7 UMA Bus-Based SMP Architectures Multicore processors are small UMA multiprocessor systems; the first shared cache (L2 or L3) is the communication channel. Shared memory can become a bottleneck for system performance, because all processors must synchronize on the single bus for memory access. Caches local to each CPU alleviate the problem; furthermore, each processor can be equipped with a private memory for data that need not be shared with other processors. Traffic to and from shared memory can thus be reduced considerably (Tanenbaum, Fig. 8.24).
8-9 SINGLE BUS TOPOLOGY (figure slides)
10 SINGLE BUS TOPOLOGY Local caches pose a fundamental issue: each processor sees memory through its own cache, so two processors can see different values for the same memory location. The standard example (values are illustrative, assuming write-through):

Time | Event                | Cache A | Cache B | RAM location X
1    | CPU A reads X        | 1       | -       | 1
2    | CPU B reads X        | 1       | 1       | 1
3    | CPU A stores 0 in X  | 0       | 1 (stale) | 0
11 CACHE COHERENCY This issue is cache coherency; if it is not solved, it prevents the use of caches in the processors, with heavy consequences on performance. Many cache coherency protocols have been proposed, all designed to prevent different, inconsistent versions of the same cache block from being present in two or more caches at once. All solutions are at the hardware level: each cache controller monitors all memory requests on the bus coming from other CPUs and, if necessary, activates the coherency protocol. Cache controllers implement bus snooping.
12 SNOOPING CACHES WRITE THROUGH is a simple cache coherency protocol. Consider the events between a processor accessing data and its own cache: read miss: the CPU cache controller fetches the missing block from RAM and loads it into the cache, so subsequent reads of the same data are served from the cache (read hit); write miss: the modified data are written directly to RAM; the block containing the data is not first loaded into the local cache; write hit: the cache block is updated and the update is propagated to RAM. All write operations are propagated to RAM, whose content is therefore always up to date.
13 SNOOPING CACHES Now consider the operations on the side of the snooper in another CPU. Cache X generates read/write operations; cache Y is a snooping cache (Tanenbaum, Fig. 8.25): read miss: cache Y sees cache X fetch a block from memory but does nothing (on a read hit it sees nothing at all); write miss/hit: cache Y checks whether it holds a copy of the modified data: if not, it takes no action; if it does, the block containing the data is flagged as invalid in cache Y.

Action     | Local request             | Remote request
Read miss  | Fetch data from memory    | (none)
Read hit   | Use data from local cache | (none)
Write miss | Update data in memory     | Invalidate cache entry
Write hit  | Update cache and memory   | Invalidate cache entry
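The protocol above can be sketched in a few lines. This is an illustrative model, not from the slides: writes always go through to RAM, and every other cache snooping the bus drops its copy of the written block.

```python
# Minimal sketch of write-through bus snooping: the Bus broadcasts every
# write, and remote snoopers invalidate their copy of the written block.

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}              # block address -> value (valid copies only)

class Bus:
    def __init__(self, memory):
        self.memory = memory         # shared RAM: block address -> value
        self.caches = []

    def read(self, cache, addr):
        if addr in cache.lines:      # read hit: use the local copy
            return cache.lines[addr]
        value = self.memory[addr]    # read miss: fetch from RAM
        cache.lines[addr] = value
        return value

    def write(self, cache, addr, value):
        self.memory[addr] = value    # write through: RAM is always updated
        if addr in cache.lines:      # write hit: update the local copy too
            cache.lines[addr] = value
        for other in self.caches:    # remote snoopers invalidate their copies
            if other is not cache:
                other.lines.pop(addr, None)

bus = Bus(memory={0x40: 7})
a, b = Cache("A"), Cache("B")
bus.caches = [a, b]

print(bus.read(a, 0x40))     # A misses and fetches 7
print(bus.read(b, 0x40))     # B misses and fetches 7
bus.write(a, 0x40, 0)        # A writes: RAM updated, B's copy invalidated
print(0x40 in b.lines)       # False: B must re-fetch
print(bus.read(b, 0x40))     # B re-fetches the new value, 0
```

The final read shows why the protocol works: the stale copy in B was dropped at write time, so B's next access is forced back to the up-to-date RAM.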
14 SNOOPING CACHES Since all caches snoop on all memory actions of the other caches, when a cache modifies a data item the update is carried out both in the cache itself and in memory, and the old block is removed from all other caches (flagged as invalid). Under this protocol, caches hold no inconsistent data. Variations on the basic protocol exist: for example, old blocks could be updated with the new value instead of being invalidated (replicating writes). This version requires more work, but prevents future cache misses.
15 SNOOPING CACHES The strength of this cache coherency protocol is its simplicity. The basic disadvantage of write-through based protocols is inefficiency: the communication bus is the bottleneck. To alleviate the problem, a write-back variant does not propagate every write operation immediately to RAM: a bit is set in the cache block to signal that the block is up to date while memory is stale. The modified block is written back to RAM later, possibly after several more updates (not after each of them).
16 UMA Multiprocessors MESI Protocol The MESI protocol gives each cache line one of four states: 1. Invalid: the cache entry does not contain valid data. 2. Shared: multiple caches may hold the line; memory is up to date. 3. Exclusive: no other cache holds the line; memory is up to date. 4. Modified: the entry is valid; memory is stale; no other copies exist. Even with the MESI protocol, a single bus interfacing all processors with memory limits the size of UMA multiprocessor systems to roughly 32 CPUs.
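The four states can be captured as a small per-line transition function. This is a simplified sketch (the write-back a Modified owner performs on a remote read is noted but not modeled), tracing one cache's view of a line:

```python
# Simplified MESI transitions for a single cache line, driven by local
# operations and by events snooped from the bus.
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def next_state(state, event, others_have_copy=False):
    if event == "local_read":
        if state == I:                         # read miss
            return S if others_have_copy else E
        return state                           # read hit: no change
    if event == "local_write":
        return M                               # take ownership; line is dirty
    if event == "remote_read":
        # A Modified owner would first write the line back (not modeled here).
        return S if state in (M, E, S) else I
    if event == "remote_write":
        return I                               # another cache invalidates us
    raise ValueError(event)

# Trace the slide's scenario from CPU 1's point of view:
s = next_state(I, "local_read")    # first read, no other copies -> Exclusive
s = next_state(s, "remote_read")   # CPU 2 reads the same block   -> Shared
s = next_state(s, "remote_write")  # CPU 2 writes                 -> Invalid
print(s)                           # Invalid
```

The trace matches the walkthrough on the following slides: E on the first lonely read, S once a second reader appears, I as soon as another CPU writes the line.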
17 UMA Multiprocessors MESI Protocol (figure: MESI state transitions over time)
18 UMA Multiprocessors MESI Protocol After the CPU is booted, all cache entries are marked invalid. On the first memory read, the referenced line is fetched into the cache of CPU 1 (the CPU reading memory) and marked E (exclusive), since it is the only copy in any cache. CPU 2 then reads the same memory block; CPU 1's snooper sees that it is no longer alone and announces on the bus that it also holds a copy. Both copies are marked S (shared). Next, CPU 2 writes to the cached line: it puts an invalidate signal on the bus so the other CPUs discard their copies, and its own cached copy goes to the M (modified) state. Then CPU 3 wants to read the block. CPU 2, which owns the line, asserts a signal on the bus telling CPU 3 to wait while it writes the line back to memory. Afterwards CPU 3 fetches a copy, and both copies are marked shared.
19 UMA Multiprocessors MESI Protocol After that, CPU 2 writes the line again, which invalidates the copy in CPU 3's cache. Finally CPU 1 writes to a word in the line. CPU 2 sees this and asserts a bus signal telling CPU 1 to wait while it writes the line back to memory. When it finishes, it marks its own copy invalid, since it knows another CPU is about to modify the line. There remains the case of a CPU writing to an uncached line. If the write-allocate policy is in use, the line is loaded into the cache and marked modified (M state). If write-allocate is not in use, the write goes directly to memory and the line is not cached anywhere.
20 UMA Multiprocessors - Crossbar Switches To overcome this limitation, a different kind of interconnection network is needed. The simplest solution for connecting n CPUs to m memory modules is the crossbar switch. Crossbar switches have long been used in telecommunications switches. At each intersection is a cross point: a switch that can be opened or closed. The crossbar is a non-blocking network. The number of cross-point switches, however, is n x m, which grows quadratically when n = m.
21-22 UMA Multiprocessors - Crossbar Switches (figure slides: crossbar connecting CPUs to memory modules)
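The cost that limits crossbars is just this product of ports. A two-line check of how the cross-point count scales when CPUs and memories grow together:

```python
# Cross-point count of an n x m crossbar: one switch per intersection.
def crosspoints(n_cpus, m_memories):
    return n_cpus * m_memories

for n in (8, 16, 32, 64):
    print(n, crosspoints(n, n))   # 8 64, 16 256, 32 1024, 64 4096
```

Doubling the number of CPUs quadruples the hardware, which is why crossbars stop being attractive well before multistage networks do.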
23 NUMA Multiprocessors NUMA (Non-Uniform Memory Access) systems use a shared logical address space, but physical memory is distributed among the CPUs; data access time depends on where the data reside, in local or in remote memory. These systems are also called Distributed Shared Memory (DSM) architectures (Hennessy-Patterson, Fig. 6.2).
24 NUMA Multiprocessors In NUMA systems all CPUs share the same address space, but each processor has a local memory that is visible to all other processors. Access to local memory blocks is quicker than access to remote memory blocks. All NUMA systems have a single logical address space shared by all CPUs, while physical memory is distributed among the processors, so there are two kinds of memory: local and remote. Even remote memory is accessed by each CPU with ordinary LOAD and STORE instructions. NUMA systems come in two types: Non-Caching NUMA (NC-NUMA); Cache-Coherent NUMA (CC-NUMA).
25 NC-NUMA Multiprocessors In an NC-NUMA system the CPUs have no local caches. Each memory access is handled by a modified MMU, which checks whether the request is for a local or a remote block; in the latter case, the request is forwarded to the node containing the requested data. Programs using remote data will run much slower than they would if the data were stored in local memory (Tanenbaum, Fig. 8.30).
26 NC-NUMA Multiprocessors In NC-NUMA systems there is no cache coherency problem because there is no caching at all: each memory item lives in a single location. Remote memory access, however, is very inefficient. For this reason, NC-NUMA systems can resort to special software that relocates memory pages from one node to another to maximize performance: a page-scanner daemon runs every few seconds, examines memory-usage statistics, and migrates pages between nodes to improve performance. In NC-NUMA systems each processor can also have a private memory and a cache, but only private data (those allocated in the private local memory) can be cached. This solution increases the performance of each processor and is adopted in the Cray T3D/E, where remote data access still costs about 400 processor clock cycles, against 2 cycles for retrieving data from the local cache.
27 CC-NUMA Multiprocessors Caching can mitigate the cost of remote data access, but it brings back the cache coherency issue. Bus snooping is one way to enforce coherency, but this technique becomes too expensive beyond a certain number of processors, and it is much too difficult to implement in systems that do not rely on bus-based interconnections. The DIRECTORY-BASED PROTOCOL is the common approach to enforcing cache coherency in CC-NUMA systems containing many processors. The main idea is to associate each node in the system with a directory for its RAM blocks: a database stating in which cache each block is located and what its state is. When a block of memory is addressed, the directory in the node where the block resides is queried to learn whether the block is in any cache and, if so, whether it has been changed with respect to the copy in RAM.
28 CC-NUMA Multiprocessors Since the directory is queried on every instruction that accesses the corresponding memory block, it must be implemented in very fast hardware, for instance as an associative memory, or at least in SRAM. Consider a 256-node system, each node equipped with a CPU and 16 MB of local RAM. Total system RAM is 2^32 bytes = 4 GB, and each node holds 2^18 blocks of 64 bytes (2^18 * 2^6 = 2^24 bytes = 16 MB). The address space is shared: node 0 holds memory addresses 0-16 MB, node 1 holds 16-32 MB, and so on. A physical address consists of 32 bits; the 8 most significant bits specify the number of the node holding the RAM block that contains the addressed data.
29 CC-NUMA Multiprocessors The following 18 bits identify the block within the node's 16 MB memory bank, and the 6 least significant bits address the byte within the block (Tanenbaum, Fig. 8.31b):

Node (8 bits) | Block (18 bits) | Offset (6 bits)

Each node has a directory with 2^18 entries, one per block of its local memory. Each entry records whether the block is stored in any cache and, if so, in which node.
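The 8/18/6 split above is just shift-and-mask work. A small sketch; the example address is hypothetical, built so its fields decode to node 34, block 4, offset 8 (the values used on the next slide):

```python
# Decode a 32-bit physical address into (node, block, offset) using the
# 8/18/6-bit layout of the 256-node example system.
NODE_BITS, BLOCK_BITS, OFFSET_BITS = 8, 18, 6

def decode(paddr):
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    block = (paddr >> OFFSET_BITS) & ((1 << BLOCK_BITS) - 1)
    node = paddr >> (OFFSET_BITS + BLOCK_BITS)
    return node, block, offset

addr = (34 << 24) | (4 << 6) | 8    # hypothetical address: 0x22000108
print(decode(addr))                  # (34, 4, 8)
```

Hardware does exactly this wiring for free: the MMU routes the request by the top 8 bits and indexes the remote directory by the middle 18.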
30 CC-NUMA Multiprocessors Assume at first that each 64-byte block is stored in at most one cache. What happens when CPU 15 executes a LOAD from a remote RAM address? CPU 15 passes the address to its local MMU, which translates it into a physical address decoding to, say, Node 34, Block 4, Offset 8. The MMU sees that the addressed data belong to node 34 and sends a request through the network to that node, asking whether block 4 is in a cache and, if so, in which one.
31 CC-NUMA Multiprocessors Node 34 forwards the request to its own directory, which checks and discovers that the block is in no remote node's cache. The block is fetched from local RAM and sent to node 15, and the directory is updated to record that block 4 is now cached at node 15 (figure: node 34 directory).
32 CC-NUMA Multiprocessors Now consider a request for block 2 of node 34. Node 34's directory discovers that block 2 is cached in node 82. The directory updates the entry for block 2 to reflect that the block is now at node 15, then sends node 82 a message requesting that block 2 be forwarded to node 15 and that the corresponding entry in node 82 be invalidated. When are blocks updated in RAM? Only when they are modified. The simplest solution: when a CPU executes a STORE, the update is propagated to the RAM holding the addressed block. This type of architecture has a lot of messages flowing through the interconnection network.
33 CC-NUMA Multiprocessors The overhead can easily be tolerated: each node has 16 MB of RAM and 2^18 nine-bit directory entries to track the status of its blocks, so the overhead is 1.76%. With 32-byte blocks the overhead increases to about 4%, while it decreases with 128-byte blocks. In real systems the directory-based architecture is more complicated. Here a block can be in at most one cache, whereas system efficiency can be increased by allowing blocks to be in several caches (nodes) at the same time. By keeping track of the status of each block (modified or untouched), communication between CPU and memory can be minimized: for instance, if a cache block has not been modified, the original block in RAM is valid, and a read from a remote CPU for that block can be answered by the RAM itself, without fetching the block from the cache that holds a copy.
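The overhead figures above can be checked directly: one 9-bit entry per local block, divided by the RAM they describe. A quick sketch of that arithmetic:

```python
# Directory overhead = (entries * entry_bits) / (RAM size in bits),
# with one entry per local block.
def directory_overhead(ram_bytes, block_bytes, entry_bits=9):
    entries = ram_bytes // block_bytes
    return entries * entry_bits / (ram_bytes * 8)

ram = 16 * 2**20                                     # 16 MB per node
print(round(directory_overhead(ram, 64) * 100, 2))   # 1.76 (%), as quoted
print(round(directory_overhead(ram, 32) * 100, 2))   # 3.52, roughly the "4%" quoted
print(round(directory_overhead(ram, 128) * 100, 2))  # 0.88
```

Note the 32-byte case comes out at 3.52% with fixed 9-bit entries; the slide's rounder "4%" presumably allows for a slightly wider entry or bookkeeping bits.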
34 CC-NUMA Multiprocessors Process synchronization Mono-processor systems synchronize using system calls or constructs of the programming language: semaphores, conditional critical regions, monitors. These methods rest on specific hardware synchronization primitives: an uninterruptible machine instruction capable of fetching and modifying a value, or of exchanging the contents of a register and of a memory word. Multiprocessor systems require similar primitives: processes share a single address space, and synchronization must go through this address space rather than resorting to message-exchange mechanisms.
35 CC-NUMA Multiprocessors Process synchronization Example: the classical solution to the critical-section problem in mono-processor systems is based on an atomic exchange operation, from which high-level synchronization primitives such as semaphores can be built. Atomicity of the synchronization instruction alone is not sufficient: on one processor the atomic instruction executes without interrupts, but what about the other processors? Would it be correct to block all memory accesses from the moment a synchronization primitive is launched until the associated variable has been modified? It is possible, but it would slow down all memory operations not involved in synchronization (and so far we are ignoring any cache effects).
36 CC-NUMA Multiprocessors Process synchronization Many processors use a pair of instructions executed in sequence. The first instruction brings into the CPU the shared variable used by all processors for synchronization. The second one tries to modify the shared variable and returns a value that tells whether the pair executed atomically. In a multiprocessor this means: no other process modified the synchronization variable before the pair completed, and no context switch occurred on the processor between the two instructions.
37 CC-NUMA Multiprocessors Process synchronization Let [0(R1)] be the content of the memory word addressed by 0(R1), used as the shared synchronization variable:
1) LL Rx, 0(R1) load linked
2) SC Ry, 0(R1) store conditional
The effect of the two instructions on [0(R1)] depends on what happens in between: if [0(R1)] is modified (by another process) before SC executes, SC fails: [0(R1)] is not modified by SC and 0 is written into Ry; if SC does not fail, Ry is copied into [0(R1)] and 1 is written into Ry. SC also fails (with the same effects) if a context switch occurs on the CPU between the execution of the two instructions.
38 CC-NUMA Multiprocessors Process synchronization LL and SC are special instructions that use an invisible register, the link register: LL stores the address of the memory reference in the link register. The link register is cleared if the cache block it refers to is invalidated, or if a context switch is executed. SC checks that the memory reference and the link register match; if so, the LL/SC pair behaves as an atomic memory reference. Inserting other instructions between LL and SC must be done with care; only register-register instructions are safe.
39 CC-NUMA Multiprocessors Process synchronization Example of an atomic exchange between R4 and [0(R1)] in a shared-memory multiprocessor system:

retry_: OR Ry, R4, R0     ; copy the new value into Ry
        LL Rx, 0(R1)      ; load linked
        SC Ry, 0(R1)      ; store conditional
        BEQZ Ry, retry_   ; SC failed: retry
        MOV R4, Rx        ; return the old value in R4

When MOV executes, R4 and [0(R1)] have been exchanged atomically: we are guaranteed that [0(R1)] was not changed by other processes before the completion of the exchange (EXCH).
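The LL/SC semantics above can be modeled in software. This is a hypothetical single-link sketch (real hardware tracks the link per CPU and at cache-block granularity): the link "register" remembers the watched address, and any intervening store to it makes the SC fail, exactly as in the retry loop above.

```python
# Toy model of LL/SC and of the EXCH retry loop built on top of it.
class LLSCMemory:
    def __init__(self):
        self.mem = {}
        self.link = None                  # address being watched, or None

    def load_linked(self, addr):
        self.link = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        """A plain store by another CPU: breaks a matching link."""
        self.mem[addr] = value
        if self.link == addr:
            self.link = None

    def store_conditional(self, addr, value):
        if self.link != addr:
            return 0                      # fail: 0 written to the register
        self.mem[addr] = value
        self.link = None
        return 1                          # success: 1 written to the register

def atomic_exchange(m, addr, new):
    """EXCH as in the slide's assembly: retry until the SC succeeds."""
    while True:
        old = m.load_linked(addr)
        if m.store_conditional(addr, new):
            return old

m = LLSCMemory()
m.store(0x10, 5)
print(atomic_exchange(m, 0x10, 9))   # 5 (the old value is returned)
print(m.mem[0x10])                   # 9
```

To see the failure path, interpose a store between the two halves: after `load_linked(0x10)`, a `store(0x10, 7)` by "another CPU" clears the link, and the following `store_conditional` returns 0, leaving memory untouched by the SC.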
40 CC-NUMA Multiprocessors Process synchronization Using EXCH it is possible to implement spin locks: a processor cycles on a lock variable to obtain mutually exclusive access to a critical section. The lock variable tells whether the critical section is free or occupied by another process. Busy waiting is the right way to implement a critical section only when the section is very short. Such very short critical sections can in turn be used to implement higher-level synchronization and mutual-exclusion mechanisms, such as semaphores. Busy waiting is less of a problem in multiprocessors. Why?
41 CC-NUMA Multiprocessors Process synchronization If there were no caches (and no coherency), the lock variable would have to stay in memory; a process would try to get the lock with an atomic exchange and check whether the lock is free. With cache coherency in place, the lock variable can be kept in the caches of all the CPUs. This makes spin locks more efficient (processors spin on their caches).
42 CC-NUMA Multiprocessors Process synchronization

lockit: LD R2, 0(R1)      // load the lock
        BNEZ R2, lockit   // lock not available, spin again
        ADDI R2, R0, #1   // prepare the value for locking
        EXCH R2, 0(R1)    // atomic exchange
        BNEZ R2, lockit   // spin if the lock was not 0

The following example shows three CPUs working according to MESI. Once CPU 0 sets the lock back to 0, the entries in the two other caches are invalidated and the new value must be fetched from the cache of CPU 0. One of the two CPUs gets the value 0 first and succeeds in the exchange; the other finds the lock variable set to 1 and starts spinning again.
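The listing above is a test-and-test-and-set lock: spin on an ordinary load first, and only attempt the expensive atomic exchange when the lock looks free. A sketch of the same structure, with the atomic EXCH simulated by a small mutex (a real CPU would use EXCH or LL/SC, not a library lock):

```python
# Test-and-test-and-set spin lock; threading.Lock stands in for the
# hardware-atomic exchange instruction.
import threading

class SpinLock:
    def __init__(self):
        self._lock_var = 0
        self._atomic = threading.Lock()     # simulates the atomic EXCH

    def _exchange(self, new):
        with self._atomic:
            old = self._lock_var
            self._lock_var = new
            return old

    def acquire(self):
        while True:
            while self._lock_var != 0:      # spin on the plain (cached) read
                pass
            if self._exchange(1) == 0:      # lock looked free: try the exchange
                return                      # old value 0 -> we hold the lock

    def release(self):
        self._lock_var = 0                  # like CPU 0 setting the lock to 0

counter = 0
lock = SpinLock()

def worker():
    global counter
    for _ in range(5_000):
        lock.acquire()
        counter += 1                        # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 10000: no increments lost
```

The inner read-only loop is what makes the idiom cache-friendly: while the lock is held, spinners hit only their own (coherent) cached copy, and bus traffic occurs only when the holder releases and invalidates those copies.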
43 CC-NUMA Multiprocessors Process synchronization (figure slide)
44 CC-NUMA Multiprocessors Memory consistency models STRICT consistency: any read from a memory location X always returns the most recent value written to that location. SEQUENTIAL consistency: all CPUs see all memory references in the same order.
45 COMA Multiprocessors NUMA and CC-NUMA machines have the disadvantage that references to remote memory are much slower than references to local memory. In CC-NUMA this performance difference is hidden to some extent by caching, but if the remote data in use greatly exceed the cache capacity, cache misses occur constantly and system performance is poor. An alternative kind of multiprocessor tries to use each CPU's main memory as a cache. In COMA (Cache Only Memory Access) systems, data have no permanent home location (no fixed memory address) where they stay and from which they are read (copied into local caches) or modified. Instead, the physical address space is split into cache lines, which migrate around the system on demand; blocks have no home machines. A memory that simply attracts lines as needed is called an attraction memory. Using the main RAM as a big cache greatly increases the hit rate, and hence the performance.
46 COMA Multiprocessors COMA systems introduce two new problems: 1. How are cache lines located? 2. When a line is purged from memory, what happens if it was the last copy? The first question arises because, once the MMU has translated a virtual address to a physical address and the line is not in the true hardware cache, there is no easy way to tell whether it is in main memory at all. Several solutions have been proposed. To check whether a cache line is in main memory, new hardware could be added to keep track of the tag of each cache line; the MMU would then compare the tag of the needed line with the tags of all the cache lines in memory, looking for a hit.
47 COMA Multiprocessors Simple COMA systems implement a different solution: map entire pages in, but without requiring that all their cache lines be present. This needs a bitmap per page, with one bit per cache line indicating the presence or absence of that line. If a cache line is present, it must be in the right position within its page; if it is not present, any attempt to use it causes a trap, and the software must find the line and bring it in. Another solution is to give each page a home machine in terms of where its directory entry is, though not where the data are; a message can then be sent to the home machine at least to locate the cache line. Other schemes involve organizing memory as a tree and searching upward until the line is found.
48 COMA Multiprocessors The second problem is not purging the last copy: what happens if the line chosen for eviction is the last copy? In that case it cannot be thrown out. One solution is to go back to the directory and check whether other copies exist; if so, the line can safely be discarded, otherwise it has to be migrated somewhere else. Another solution is to label one copy of each cache line as the master copy and never throw it out, which avoids having to check with the directory. Examples: KSR-1, the Data Diffusion Machine, SDAARC.
49 Message Passing Multi-computers In multiprocessor systems shared memory can be implemented in many ways, including snooping buses, data crossbars, multistage switching networks, and various directory based schemes. Programs written for a multiprocessor can just access any location in memory without knowing anything about the internal topology or implementation scheme. This illusion is what makes multiprocessors so attractive and why programmers like this programming model. On the other hand, multiprocessors also have their limitations, which is why multi-computers are important, too. First and foremost, multiprocessors do not scale to large sizes. 49/57
50 Message Passing Multi-computers Recall the enormous amount of hardware Sun had to use to get the E25K to scale to 72 CPUs. In contrast, a multicomputer with 65,536 CPUs is nothing special. It will be years before anyone builds a commercial 65,536-node multiprocessor, and by then million-node multicomputers will be in use.
51 Message Passing Multi-computers (figure, A.S. Tanenbaum)
52 Multi-computers Interconnection Network There are many different interconnection-network topologies (figures): a star; a complete interconnect
53 Multi-computers Interconnection Network (figures): a tree; a ring
54 Multi-computers Interconnection Network (figures): a grid; a double torus
55 Multi-computers Interconnection Network (figures): a cube; a 4D hypercube
56 Message Passing Multi-computers There are two general styles of multicomputer: MPPs and clusters. The first category consists of the MPPs (Massively Parallel Processors), which are huge supercomputers. These are used in science, in engineering, and in industry for very large calculations, for handling very large numbers of transactions per second, or for data warehousing (storing and managing immense databases). Initially MPPs were used primarily as scientific supercomputers, but now most of them run in commercial environments. Another point that characterizes MPPs is their enormous I/O capacity: problems big enough to warrant using MPPs invariably have massive amounts of data to process, often terabytes, which must be distributed among many disks and moved around the machine at great speed. Examples: the IBM BlueGene system and the Red Storm machine at Sandia National Laboratories.
57 References
1. Andrew S. Tanenbaum, Structured Computer Organization, 5th Edition, Prentice Hall
2. Adve and Gharachorloo, Shared Memory Consistency Models: A Tutorial
3. Dongarra et al., The Sourcebook of Parallel Computing
4. Hwang and Xu, Scalable Parallel Computing
5. Gregory Pfister, In Search of Clusters, 2nd ed., Prentice Hall
6. James K. Archibald, The Cache Coherency Problem in Shared-Memory Multiprocessors, University of Washington
7. /cachecoherence.pdf (URL truncated in source)
8. clstr-basics-cso.html (URL truncated in source)
More informationAdvanced OpenMP. Lecture 3: Cache Coherency
Advanced OpenMP Lecture 3: Cache Coherency Cache coherency Main difficulty in building multiprocessor systems is the cache coherency problem. The shared memory programming model assumes that a shared variable
More informationIntroduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization
Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency
More informationHandout 3 Multiprocessor and thread level parallelism
Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationLecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections
Lecture 18: Coherence and Synchronization Topics: directory-based coherence protocols, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory)
More informationParallel Architecture. Hwansoo Han
Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range
More informationShared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16
Shared memory Caches, Cache coherence and Memory consistency models Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Shared memory Caches, Cache
More informationParallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?
Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing
More informationSuggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!
1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and
More informationDistributed Shared Memory and Memory Consistency Models
Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single
More informationComputer Organization. Chapter 16
William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data
More informationPortland State University ECE 588/688. Directory-Based Cache Coherence Protocols
Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All
More informationShared Memory Architecture Part One
Babylon University College of Information Technology Software Department Shared Memory Architecture Part One By Classification Of Shared Memory Systems The simplest shared memory system consists of one
More informationNon-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:
The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationLecture 24: Multiprocessing Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this
More informationCSCI 4717 Computer Architecture
CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel
More informationChapter Seven Morgan Kaufmann Publishers
Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be
More informationOrganisasi Sistem Komputer
LOGO Organisasi Sistem Komputer OSK 14 Parallel Processing Pendidikan Teknik Elektronika FT UNY Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationMultiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.
Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than
More informationReview: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology
Review: Multiprocessor CPE 631 Session 21: Multiprocessors (Part 2) Department of Electrical and Computer Engineering University of Alabama in Huntsville Basic issues and terminology Communication: share
More information9. Distributed Shared Memory
9. Distributed Shared Memory Provide the usual programming model of shared memory in a generally loosely coupled distributed environment. Shared Memory Easy to program Difficult to build Tight coupling
More informationMultiprocessor Systems. Chapter 8, 8.1
Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor
More informationCache Coherence in Bus-Based Shared Memory Multiprocessors
Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition
More informationMultiprocessor Systems. COMP s1
Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve
More informationMultiprocessor Cache Coherency. What is Cache Coherence?
Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by
More informationCache Coherence and Atomic Operations in Hardware
Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationParallel Architecture. Sathish Vadhiyar
Parallel Architecture Sathish Vadhiyar Motivations of Parallel Computing Faster execution times From days or months to hours or seconds E.g., climate modelling, bioinformatics Large amount of data dictate
More informationEN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors
EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationM4 Parallelism. Implementation of Locks Cache Coherence
M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationCharacteristics of Mult l ip i ro r ce c ssors r
Characteristics of Multiprocessors A multiprocessor system is an interconnection of two or more CPUs with memory and input output equipment. The term processor in multiprocessor can mean either a central
More informationDistributed Shared Memory
Distributed Shared Memory History, fundamentals and a few examples Coming up The Purpose of DSM Research Distributed Shared Memory Models Distributed Shared Memory Timeline Three example DSM Systems The
More informationLecture 25: Multiprocessors
Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationEE382 Processor Design. Processor Issues for MP
EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency
More informationCache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014
Cache Coherence Introduction to High Performance Computing Systems (CS1645) Esteban Meneses Spring, 2014 Supercomputer Galore Starting around 1983, the number of companies building supercomputers exploded:
More informationShared Memory. SMP Architectures and Programming
Shared Memory SMP Architectures and Programming 1 Why work with shared memory parallel programming? Speed Ease of use CLUMPS Good starting point 2 Shared Memory Processes or threads share memory No explicit
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationCache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O
6.823, L21--1 Cache Coherence Protocols: Implementation Issues on SMP s Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Coherence Issue in I/O 6.823, L21--2 Processor Processor
More informationParallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence
Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly
More informationPage 1. Cache Coherence
Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale
More informationMultiprocessor Synchronization
Multiprocessor Synchronization Material in this lecture in Henessey and Patterson, Chapter 8 pgs. 694-708 Some material from David Patterson s slides for CS 252 at Berkeley 1 Multiprogramming and Multiprocessing
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationCS4961 Parallel Programming. Lecture 4: Memory Systems and Interconnects 9/1/11. Administrative. Mary Hall September 1, Homework 2, cont.
CS4961 Parallel Programming Lecture 4: Memory Systems and Interconnects Administrative Nikhil office hours: - Monday, 2-3PM - Lab hours on Tuesday afternoons during programming assignments First homework
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationPage 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology
CS252 Graduate Computer Architecture Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency Review: Multiprocessor Basic issues and terminology Communication:
More informationIntro to Multiprocessors
The Big Picture: Where are We Now? Intro to Multiprocessors Output Output Datapath Input Input Datapath [dapted from Computer Organization and Design, Patterson & Hennessy, 2005] Multiprocessor multiple
More informationMultiprocessors 1. Outline
Multiprocessors 1 Outline Multiprocessing Coherence Write Consistency Snooping Building Blocks Snooping protocols and examples Coherence traffic and performance on MP Directory-based protocols and examples
More informationThe Cache Write Problem
Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar
More informationParallel Architectures
Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationMulti-Processor / Parallel Processing
Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms
More information4. Networks. in parallel computers. Advances in Computer Architecture
4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors
More informationDr e v prasad Dt
Dr e v prasad Dt. 12.10.17 Contents Characteristics of Multiprocessors Interconnection Structures Inter Processor Arbitration Inter Processor communication and synchronization Cache Coherence Introduction
More informationLimitations of parallel processing
Your professor du jour: Steve Gribble gribble@cs.washington.edu 323B Sieg Hall all material in this lecture in Henessey and Patterson, Chapter 8 635-640 645, 646 654-665 11/8/00 CSE 471 Multiprocessors
More informationLecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections
Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System
More informationLecture 13. Shared memory: Architecture and programming
Lecture 13 Shared memory: Architecture and programming Announcements Special guest lecture on Parallel Programming Language Uniform Parallel C Thursday 11/2, 2:00 to 3:20 PM EBU3B 1202 See www.cse.ucsd.edu/classes/fa06/cse260/lectures/lec13
More informationPortland State University ECE 588/688. Cache-Only Memory Architectures
Portland State University ECE 588/688 Cache-Only Memory Architectures Copyright by Alaa Alameldeen and Haitham Akkary 2018 Non-Uniform Memory Access (NUMA) Architectures Physical address space is statically
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More information