Embedded Systems Architecture. Multiprocessor and Multicomputer Systems
1 Embedded Systems Architecture Multiprocessor and Multicomputer Systems M. Eng. Mariusz Rudnicki 1/57
2-4 Multiprocessor Systems (figure slides; no recoverable text)
5 UMA SMP Architectures Uniform Memory Access: all processors share a single centralized primary memory, and each CPU has the same memory access time. These systems are also called Symmetric shared-memory Multiprocessors (SMP) (Hennessy-Patterson, Fig. 6.1).
6 UMA Bus-Based SMP Architectures The simplest multiprocessors use a single bus: two or more CPUs and one or more memory modules use the same bus for communication; when the bus is busy and a CPU wants to access memory, it must wait; increasing the number of CPUs increases the waiting time; this can be reduced by adding cache support.
7 UMA Bus-Based SMP Architectures Multicore processors are small UMA multiprocessor systems; the first shared cache (L2 or L3) is the communication channel. Shared memory can become a bottleneck for system performance, because all processors must synchronize on the single bus for memory access. Caches local to each CPU alleviate the problem; furthermore, each processor can be equipped with a private memory for data that need not be shared with other processors. Traffic to and from shared memory can thus be reduced considerably (Tanenbaum, Fig. 8.24).
8-9 SINGLE BUS TOPOLOGY (figure slides)
10 SINGLE BUS TOPOLOGY Local caches pose a fundamental issue: each processor sees memory through its own cache, so two processors can see different values for the same memory location. The standard example (values are illustrative, assuming write-through):

Time | Event                | Cache A | Cache B | RAM location X
1    | CPU A reads X        | 1       | -       | 1
2    | CPU B reads X        | 1       | 1       | 1
3    | CPU A stores 0 in X  | 0       | 1 (stale) | 0
11 CACHE COHERENCY This issue is cache coherency; if it is not solved, it prevents the use of caches in the processors, with heavy consequences on performance. Many cache coherency protocols have been proposed, all designed to prevent different, inconsistent versions of the same cache block from being present in two or more caches at once. All solutions are at the hardware level: each cache controller monitors all memory requests on the bus coming from other CPUs and, if necessary, activates the coherency protocol. Cache controllers implement bus snooping.
12 SNOOPING CACHES WRITE THROUGH is a simple cache coherency protocol. Consider the events between a processor accessing data and its own cache: read miss: the CPU cache controller fetches the missing block from RAM and loads it into the cache, so subsequent reads of the same data are served from the cache (read hit); write miss: the modified data are written directly to RAM; the block containing the data is not first loaded into the local cache; write hit: the cache block is updated and the update is propagated to RAM. All write operations are propagated to RAM, whose content is therefore always up to date.
13 SNOOPING CACHES Now consider the operations on the side of the snooper in another CPU. Cache X generates read/write operations; cache Y is a snooping cache (Tanenbaum, Fig. 8.25): read miss: cache Y sees cache X fetch a block from memory but does nothing (on a read hit it sees nothing at all); write miss/hit: cache Y checks whether it holds a copy of the modified data: if not, it takes no action; if it does, the block containing the data is flagged as invalid in cache Y.

Action     | Local request             | Remote request
Read miss  | Fetch data from memory    | (none)
Read hit   | Use data from local cache | (none)
Write miss | Update data in memory     | Invalidate cache entry
Write hit  | Update cache and memory   | Invalidate cache entry
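The protocol above can be sketched in a few lines. This is an illustrative model, not from the slides: writes always go through to RAM, and every other cache snooping the bus drops its copy of the written block.

```python
# Minimal sketch of write-through bus snooping: the Bus broadcasts every
# write, and remote snoopers invalidate their copy of the written block.

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}              # block address -> value (valid copies only)

class Bus:
    def __init__(self, memory):
        self.memory = memory         # shared RAM: block address -> value
        self.caches = []

    def read(self, cache, addr):
        if addr in cache.lines:      # read hit: use the local copy
            return cache.lines[addr]
        value = self.memory[addr]    # read miss: fetch from RAM
        cache.lines[addr] = value
        return value

    def write(self, cache, addr, value):
        self.memory[addr] = value    # write through: RAM is always updated
        if addr in cache.lines:      # write hit: update the local copy too
            cache.lines[addr] = value
        for other in self.caches:    # remote snoopers invalidate their copies
            if other is not cache:
                other.lines.pop(addr, None)

bus = Bus(memory={0x40: 7})
a, b = Cache("A"), Cache("B")
bus.caches = [a, b]

print(bus.read(a, 0x40))     # A misses and fetches 7
print(bus.read(b, 0x40))     # B misses and fetches 7
bus.write(a, 0x40, 0)        # A writes: RAM updated, B's copy invalidated
print(0x40 in b.lines)       # False: B must re-fetch
print(bus.read(b, 0x40))     # B re-fetches the new value, 0
```

The final read shows why the protocol works: the stale copy in B was dropped at write time, so B's next access is forced back to the up-to-date RAM.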
14 SNOOPING CACHES Since all caches snoop on all memory actions of the other caches, when a cache modifies a data item the update is carried out both in the cache itself and in memory, and the old block is removed from all other caches (flagged as invalid). Under this protocol, caches hold no inconsistent data. Variations on the basic protocol exist: for example, old blocks could be updated with the new value instead of being invalidated (replicating writes). This version requires more work, but prevents future cache misses.
15 SNOOPING CACHES The strength of this cache coherency protocol is its simplicity. The basic disadvantage of write-through based protocols is inefficiency: the communication bus is the bottleneck. To alleviate the problem, a write-back variant does not propagate every write operation immediately to RAM: a bit is set in the cache block to signal that the block is up to date while memory is stale. The modified block is written back to RAM later, possibly after several more updates (not after each of them).
16 UMA Multiprocessors MESI Protocol The MESI protocol gives each cache line one of four states: 1. Invalid: the cache entry does not contain valid data. 2. Shared: multiple caches may hold the line; memory is up to date. 3. Exclusive: no other cache holds the line; memory is up to date. 4. Modified: the entry is valid; memory is stale; no other copies exist. Even with the MESI protocol, a single bus interfacing all processors with memory limits the size of UMA multiprocessor systems to roughly 32 CPUs.
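The four states can be captured as a small per-line transition function. This is a simplified sketch (the write-back a Modified owner performs on a remote read is noted but not modeled), tracing one cache's view of a line:

```python
# Simplified MESI transitions for a single cache line, driven by local
# operations and by events snooped from the bus.
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def next_state(state, event, others_have_copy=False):
    if event == "local_read":
        if state == I:                         # read miss
            return S if others_have_copy else E
        return state                           # read hit: no change
    if event == "local_write":
        return M                               # take ownership; line is dirty
    if event == "remote_read":
        # A Modified owner would first write the line back (not modeled here).
        return S if state in (M, E, S) else I
    if event == "remote_write":
        return I                               # another cache invalidates us
    raise ValueError(event)

# Trace the slide's scenario from CPU 1's point of view:
s = next_state(I, "local_read")    # first read, no other copies -> Exclusive
s = next_state(s, "remote_read")   # CPU 2 reads the same block   -> Shared
s = next_state(s, "remote_write")  # CPU 2 writes                 -> Invalid
print(s)                           # Invalid
```

The trace matches the walkthrough on the following slides: E on the first lonely read, S once a second reader appears, I as soon as another CPU writes the line.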
17 UMA Multiprocessors MESI Protocol (figure: MESI state transitions over time)
18 UMA Multiprocessors MESI Protocol After the CPU is booted, all cache entries are marked invalid. On the first memory read, the referenced line is fetched into the cache of CPU 1 (the CPU reading memory) and marked E (exclusive), since it is the only copy in any cache. CPU 2 then reads the same memory block; CPU 1's snooper sees that it is no longer alone and announces on the bus that it also holds a copy. Both copies are marked S (shared). Next, CPU 2 writes to the cached line: it puts an invalidate signal on the bus so the other CPUs discard their copies, and its own cached copy goes to the M (modified) state. Then CPU 3 wants to read the block. CPU 2, which owns the line, asserts a signal on the bus telling CPU 3 to wait while it writes the line back to memory. Afterwards CPU 3 fetches a copy, and both copies are marked shared.
19 UMA Multiprocessors MESI Protocol After that, CPU 2 writes the line again, which invalidates the copy in CPU 3's cache. Finally CPU 1 writes to a word in the line. CPU 2 sees this and asserts a bus signal telling CPU 1 to wait while it writes the line back to memory. When it finishes, it marks its own copy invalid, since it knows another CPU is about to modify the line. There remains the case of a CPU writing to an uncached line. If the write-allocate policy is in use, the line is loaded into the cache and marked modified (M state). If write-allocate is not in use, the write goes directly to memory and the line is not cached anywhere.
20 UMA Multiprocessors - Crossbar Switches To overcome this limitation, a different kind of interconnection network is needed. The simplest solution for connecting n CPUs to m memory modules is the crossbar switch. Crossbar switches have long been used in telecommunications switches. At each intersection is a cross point: a switch that can be opened or closed. The crossbar is a non-blocking network. The number of cross-point switches, however, is n x m, which grows quadratically when n = m.
21-22 UMA Multiprocessors - Crossbar Switches (figure slides: crossbar connecting CPUs to memory modules)
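The cost that limits crossbars is just this product of ports. A two-line check of how the cross-point count scales when CPUs and memories grow together:

```python
# Cross-point count of an n x m crossbar: one switch per intersection.
def crosspoints(n_cpus, m_memories):
    return n_cpus * m_memories

for n in (8, 16, 32, 64):
    print(n, crosspoints(n, n))   # 8 64, 16 256, 32 1024, 64 4096
```

Doubling the number of CPUs quadruples the hardware, which is why crossbars stop being attractive well before multistage networks do.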
23 NUMA Multiprocessors NUMA (Non-Uniform Memory Access) systems use a shared logical address space, but physical memory is distributed among the CPUs; data access time depends on where the data reside, in local or in remote memory. These systems are also called Distributed Shared Memory (DSM) architectures (Hennessy-Patterson, Fig. 6.2).
24 NUMA Multiprocessors In NUMA systems all CPUs share the same address space, but each processor has a local memory that is visible to all other processors. Access to local memory blocks is quicker than access to remote memory blocks. All NUMA systems have a single logical address space shared by all CPUs, while physical memory is distributed among the processors, so there are two kinds of memory: local and remote. Even remote memory is accessed by each CPU with ordinary LOAD and STORE instructions. NUMA systems come in two types: Non-Caching NUMA (NC-NUMA); Cache-Coherent NUMA (CC-NUMA).
25 NC-NUMA Multiprocessors In an NC-NUMA system the CPUs have no local caches. Each memory access is handled by a modified MMU, which checks whether the request is for a local or a remote block; in the latter case, the request is forwarded to the node containing the requested data. Programs using remote data will run much slower than they would if the data were stored in local memory (Tanenbaum, Fig. 8.30).
26 NC-NUMA Multiprocessors In NC-NUMA systems there is no cache coherency problem because there is no caching at all: each memory item lives in a single location. Remote memory access, however, is very inefficient. For this reason, NC-NUMA systems can resort to special software that relocates memory pages from one node to another to maximize performance: a page-scanner daemon runs every few seconds, examines memory-usage statistics, and migrates pages between nodes to improve performance. In NC-NUMA systems each processor can also have a private memory and a cache, but only private data (those allocated in the private local memory) can be cached. This solution increases the performance of each processor and is adopted in the Cray T3D/E, where remote data access still costs about 400 processor clock cycles, against 2 cycles for retrieving data from the local cache.
27 CC-NUMA Multiprocessors Caching can mitigate the cost of remote data access, but it brings back the cache coherency issue. Bus snooping is one way to enforce coherency, but this technique becomes too expensive beyond a certain number of processors, and it is much too difficult to implement in systems that do not rely on bus-based interconnections. The DIRECTORY-BASED PROTOCOL is the common approach to enforcing cache coherency in CC-NUMA systems containing many processors. The main idea is to associate each node in the system with a directory for its RAM blocks: a database stating in which cache each block is located and what its state is. When a block of memory is addressed, the directory in the node where the block resides is queried to learn whether the block is in any cache and, if so, whether it has been changed with respect to the copy in RAM.
28 CC-NUMA Multiprocessors Since the directory is queried on every instruction that accesses the corresponding memory block, it must be implemented in very fast hardware, for instance as an associative memory, or at least in SRAM. Consider a 256-node system, each node equipped with a CPU and 16 MB of local RAM. Total system RAM is 2^32 bytes = 4 GB, and each node holds 2^18 blocks of 64 bytes (2^18 * 2^6 = 2^24 bytes = 16 MB). The address space is shared: node 0 holds memory addresses 0-16 MB, node 1 holds 16-32 MB, and so on. A physical address consists of 32 bits; the 8 most significant bits specify the number of the node holding the RAM block that contains the addressed data.
29 CC-NUMA Multiprocessors The following 18 bits identify the block within the node's 16 MB memory bank, and the 6 least significant bits address the byte within the block (Tanenbaum, Fig. 8.31b):

Node (8 bits) | Block (18 bits) | Offset (6 bits)

Each node has a directory with 2^18 entries, one per block of its local memory. Each entry records whether the block is stored in any cache and, if so, in which node.
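The 8/18/6 split above is just shift-and-mask work. A small sketch; the example address is hypothetical, built so its fields decode to node 34, block 4, offset 8 (the values used on the next slide):

```python
# Decode a 32-bit physical address into (node, block, offset) using the
# 8/18/6-bit layout of the 256-node example system.
NODE_BITS, BLOCK_BITS, OFFSET_BITS = 8, 18, 6

def decode(paddr):
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    block = (paddr >> OFFSET_BITS) & ((1 << BLOCK_BITS) - 1)
    node = paddr >> (OFFSET_BITS + BLOCK_BITS)
    return node, block, offset

addr = (34 << 24) | (4 << 6) | 8    # hypothetical address: 0x22000108
print(decode(addr))                  # (34, 4, 8)
```

Hardware does exactly this wiring for free: the MMU routes the request by the top 8 bits and indexes the remote directory by the middle 18.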
30 CC-NUMA Multiprocessors Assume at first that each 64-byte block is stored in at most one cache. What happens when CPU 15 executes a LOAD from a remote RAM address? CPU 15 passes the address to its local MMU, which translates it into a physical address decoding to, say, Node 34, Block 4, Offset 8. The MMU sees that the addressed data belong to node 34 and sends a request through the network to that node, asking whether block 4 is in a cache and, if so, in which one.
31 CC-NUMA Multiprocessors Node 34 forwards the request to its own directory, which checks and discovers that the block is in no remote node's cache. The block is fetched from local RAM and sent to node 15, and the directory is updated to record that block 4 is now cached at node 15 (figure: node 34 directory).
32 CC-NUMA Multiprocessors Now consider a request for block 2 of node 34. Node 34's directory discovers that block 2 is cached in node 82. The directory updates the entry for block 2 to reflect that the block is now at node 15, then sends node 82 a message requesting that block 2 be forwarded to node 15 and that the corresponding entry in node 82 be invalidated. When are blocks updated in RAM? Only when they are modified. The simplest solution: when a CPU executes a STORE, the update is propagated to the RAM holding the addressed block. This type of architecture has a lot of messages flowing through the interconnection network.
33 CC-NUMA Multiprocessors The overhead can easily be tolerated: each node has 16 MB of RAM and 2^18 nine-bit directory entries to track the status of its blocks, so the overhead is 1.76%. With 32-byte blocks the overhead increases to about 4%, while it decreases with 128-byte blocks. In real systems the directory-based architecture is more complicated. Here a block can be in at most one cache, whereas system efficiency can be increased by allowing blocks to be in several caches (nodes) at the same time. By keeping track of the status of each block (modified or untouched), communication between CPU and memory can be minimized: for instance, if a cache block has not been modified, the original block in RAM is valid, and a read from a remote CPU for that block can be answered by the RAM itself, without fetching the block from the cache that holds a copy.
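The overhead figures above can be checked directly: one 9-bit entry per local block, divided by the RAM they describe. A quick sketch of that arithmetic:

```python
# Directory overhead = (entries * entry_bits) / (RAM size in bits),
# with one entry per local block.
def directory_overhead(ram_bytes, block_bytes, entry_bits=9):
    entries = ram_bytes // block_bytes
    return entries * entry_bits / (ram_bytes * 8)

ram = 16 * 2**20                                     # 16 MB per node
print(round(directory_overhead(ram, 64) * 100, 2))   # 1.76 (%), as quoted
print(round(directory_overhead(ram, 32) * 100, 2))   # 3.52, roughly the "4%" quoted
print(round(directory_overhead(ram, 128) * 100, 2))  # 0.88
```

Note the 32-byte case comes out at 3.52% with fixed 9-bit entries; the slide's rounder "4%" presumably allows for a slightly wider entry or bookkeeping bits.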
34 CC-NUMA Multiprocessors Process synchronization Mono-processor systems synchronize using system calls or constructs of the programming language: semaphores, conditional critical regions, monitors. These methods rest on specific hardware synchronization primitives: an uninterruptible machine instruction capable of fetching and modifying a value, or of exchanging the contents of a register and of a memory word. Multiprocessor systems require similar primitives: processes share a single address space, and synchronization must go through this address space rather than resorting to message-exchange mechanisms.
35 CC-NUMA Multiprocessors Process synchronization Example: the classical solution to the critical-section problem in mono-processor systems is based on an atomic exchange operation, from which high-level synchronization primitives such as semaphores can be built. Atomicity of the synchronization instruction alone is not sufficient: on one processor the atomic instruction executes without interrupts, but what about the other processors? Would it be correct to block all memory accesses from the moment a synchronization primitive is launched until the associated variable has been modified? It is possible, but it would slow down all memory operations not involved in synchronization (and so far we are ignoring any cache effects).
36 CC-NUMA Multiprocessors Process synchronization Many processors use a pair of instructions executed in sequence. The first instruction brings into the CPU the shared variable used by all processors for synchronization. The second one tries to modify the shared variable and returns a value that tells whether the pair executed atomically. In a multiprocessor this means: no other process modified the synchronization variable before the pair completed, and no context switch occurred on the processor between the two instructions.
37 CC-NUMA Multiprocessors Process synchronization Let [0(R1)] be the content of the memory word addressed by 0(R1), used as the shared synchronization variable:
1) LL Rx, 0(R1) load linked
2) SC Ry, 0(R1) store conditional
The effect of the two instructions on [0(R1)] depends on what happens in between: if [0(R1)] is modified (by another process) before SC executes, SC fails: [0(R1)] is not modified by SC and 0 is written into Ry; if SC does not fail, Ry is copied into [0(R1)] and 1 is written into Ry. SC also fails (with the same effects) if a context switch occurs on the CPU between the execution of the two instructions.
38 CC-NUMA Multiprocessors Process synchronization LL and SC are special instructions that use an invisible register, the link register: LL stores the address of the memory reference in the link register. The link register is cleared if the cache block it refers to is invalidated, or if a context switch is executed. SC checks that the memory reference and the link register match; if so, the LL/SC pair behaves as an atomic memory reference. Inserting other instructions between LL and SC must be done with care; only register-register instructions are safe.
39 CC-NUMA Multiprocessors Process synchronization Example of an atomic exchange between R4 and [0(R1)] in a shared-memory multiprocessor system:

retry_: OR Ry, R4, R0     ; copy the new value into Ry
        LL Rx, 0(R1)      ; load linked
        SC Ry, 0(R1)      ; store conditional
        BEQZ Ry, retry_   ; SC failed: retry
        MOV R4, Rx        ; return the old value in R4

When MOV executes, R4 and [0(R1)] have been exchanged atomically: we are guaranteed that [0(R1)] was not changed by other processes before the completion of the exchange (EXCH).
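The LL/SC semantics above can be modeled in software. This is a hypothetical single-link sketch (real hardware tracks the link per CPU and at cache-block granularity): the link "register" remembers the watched address, and any intervening store to it makes the SC fail, exactly as in the retry loop above.

```python
# Toy model of LL/SC and of the EXCH retry loop built on top of it.
class LLSCMemory:
    def __init__(self):
        self.mem = {}
        self.link = None                  # address being watched, or None

    def load_linked(self, addr):
        self.link = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        """A plain store by another CPU: breaks a matching link."""
        self.mem[addr] = value
        if self.link == addr:
            self.link = None

    def store_conditional(self, addr, value):
        if self.link != addr:
            return 0                      # fail: 0 written to the register
        self.mem[addr] = value
        self.link = None
        return 1                          # success: 1 written to the register

def atomic_exchange(m, addr, new):
    """EXCH as in the slide's assembly: retry until the SC succeeds."""
    while True:
        old = m.load_linked(addr)
        if m.store_conditional(addr, new):
            return old

m = LLSCMemory()
m.store(0x10, 5)
print(atomic_exchange(m, 0x10, 9))   # 5 (the old value is returned)
print(m.mem[0x10])                   # 9
```

To see the failure path, interpose a store between the two halves: after `load_linked(0x10)`, a `store(0x10, 7)` by "another CPU" clears the link, and the following `store_conditional` returns 0, leaving memory untouched by the SC.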
40 CC-NUMA Multiprocessors Process synchronization Using EXCH it is possible to implement spin locks: a processor cycles on a lock variable to obtain mutually exclusive access to a critical section. The lock variable tells whether the critical section is free or occupied by another process. Busy waiting is the right way to implement a critical section only when the section is very short. Such very short critical sections can in turn be used to implement higher-level synchronization and mutual-exclusion mechanisms, such as semaphores. Busy waiting is less of a problem in multiprocessors. Why?
41 CC-NUMA Multiprocessors Process synchronization If there were no caches (and no coherency), the lock variable would have to stay in memory; a process would try to get the lock with an atomic exchange and check whether the lock is free. With cache coherency in place, the lock variable can be kept in the caches of all the CPUs. This makes spin locks more efficient (processors spin on their caches).
42 CC-NUMA Multiprocessors Process synchronization

lockit: LD R2, 0(R1)      // load the lock
        BNEZ R2, lockit   // lock not available, spin again
        ADDI R2, R0, #1   // prepare the value for locking
        EXCH R2, 0(R1)    // atomic exchange
        BNEZ R2, lockit   // spin if the lock was not 0

The following example shows three CPUs working according to MESI. Once CPU 0 sets the lock back to 0, the entries in the two other caches are invalidated and the new value must be fetched from the cache of CPU 0. One of the two CPUs gets the value 0 first and succeeds in the exchange; the other finds the lock variable set to 1 and starts spinning again.
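The listing above is a test-and-test-and-set lock: spin on an ordinary load first, and only attempt the expensive atomic exchange when the lock looks free. A sketch of the same structure, with the atomic EXCH simulated by a small mutex (a real CPU would use EXCH or LL/SC, not a library lock):

```python
# Test-and-test-and-set spin lock; threading.Lock stands in for the
# hardware-atomic exchange instruction.
import threading

class SpinLock:
    def __init__(self):
        self._lock_var = 0
        self._atomic = threading.Lock()     # simulates the atomic EXCH

    def _exchange(self, new):
        with self._atomic:
            old = self._lock_var
            self._lock_var = new
            return old

    def acquire(self):
        while True:
            while self._lock_var != 0:      # spin on the plain (cached) read
                pass
            if self._exchange(1) == 0:      # lock looked free: try the exchange
                return                      # old value 0 -> we hold the lock

    def release(self):
        self._lock_var = 0                  # like CPU 0 setting the lock to 0

counter = 0
lock = SpinLock()

def worker():
    global counter
    for _ in range(5_000):
        lock.acquire()
        counter += 1                        # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 10000: no increments lost
```

The inner read-only loop is what makes the idiom cache-friendly: while the lock is held, spinners hit only their own (coherent) cached copy, and bus traffic occurs only when the holder releases and invalidates those copies.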
43 CC-NUMA Multiprocessors Process synchronization (figure slide)
44 CC-NUMA Multiprocessors Memory consistency models STRICT consistency: any read from a memory location X always returns the most recent value written to that location. SEQUENTIAL consistency: all CPUs see all memory references in the same order.
45 COMA Multiprocessors NUMA and CC-NUMA machines have the disadvantage that references to remote memory are much slower than references to local memory. In CC-NUMA this performance difference is hidden to some extent by caching, but if the remote data in use greatly exceed the cache capacity, cache misses occur constantly and system performance is poor. An alternative kind of multiprocessor tries to use each CPU's main memory as a cache. In COMA (Cache Only Memory Access) systems, data have no permanent home location (no fixed memory address) where they stay and from which they are read (copied into local caches) or modified. Instead, the physical address space is split into cache lines, which migrate around the system on demand; blocks have no home machines. A memory that simply attracts lines as needed is called an attraction memory. Using the main RAM as a big cache greatly increases the hit rate, and hence the performance.
46 COMA Multiprocessors COMA systems introduce two new problems: 1. How are cache lines located? 2. When a line is purged from memory, what happens if it was the last copy? The first question arises because, once the MMU has translated a virtual address to a physical address and the line is not in the true hardware cache, there is no easy way to tell whether it is in main memory at all. Several solutions have been proposed. To check whether a cache line is in main memory, new hardware could be added to keep track of the tag of each cache line; the MMU would then compare the tag of the needed line with the tags of all the cache lines in memory, looking for a hit.
47 COMA Multiprocessors Simple COMA systems implement a different solution: map entire pages in, but without requiring that all their cache lines be present. This needs a bitmap per page, with one bit per cache line indicating the presence or absence of that line. If a cache line is present, it must be in the right position within its page; if it is not present, any attempt to use it causes a trap, and the software must find the line and bring it in. Another solution is to give each page a home machine in terms of where its directory entry is, though not where the data are; a message can then be sent to the home machine at least to locate the cache line. Other schemes involve organizing memory as a tree and searching upward until the line is found.
48 COMA Multiprocessors The second problem is not purging the last copy: what happens if the line chosen for eviction is the last copy? In that case it cannot be thrown out. One solution is to go back to the directory and check whether other copies exist; if so, the line can safely be discarded, otherwise it has to be migrated somewhere else. Another solution is to label one copy of each cache line as the master copy and never throw it out, which avoids having to check with the directory. Examples: KSR-1, the Data Diffusion Machine, SDAARC.
49 Message Passing Multi-computers In multiprocessor systems shared memory can be implemented in many ways, including snooping buses, data crossbars, multistage switching networks, and various directory based schemes. Programs written for a multiprocessor can just access any location in memory without knowing anything about the internal topology or implementation scheme. This illusion is what makes multiprocessors so attractive and why programmers like this programming model. On the other hand, multiprocessors also have their limitations, which is why multi-computers are important, too. First and foremost, multiprocessors do not scale to large sizes. 49/57
50 Message Passing Multi-computers Recall the enormous amount of hardware Sun had to use to get the E25K to scale to 72 CPUs. In contrast, a multicomputer with 65,536 CPUs is nothing special. It will be years before anyone builds a commercial 65,536-node multiprocessor, and by then million-node multicomputers will be in use.
51 Message Passing Multi-computers (figure, A.S. Tanenbaum)
52 Multi-computers Interconnection Network There are many different interconnection-network topologies (figures): a star; a complete interconnect
53 Multi-computers Interconnection Network (figures): a tree; a ring
54 Multi-computers Interconnection Network (figures): a grid; a double torus
55 Multi-computers Interconnection Network (figures): a cube; a 4D hypercube
56 Message Passing Multi-computers There are two general styles of multicomputer: MPPs and clusters. The first category consists of the MPPs (Massively Parallel Processors), which are huge supercomputers. These are used in science, in engineering, and in industry for very large calculations, for handling very large numbers of transactions per second, or for data warehousing (storing and managing immense databases). Initially MPPs were used primarily as scientific supercomputers, but now most of them run in commercial environments. Another point that characterizes MPPs is their enormous I/O capacity: problems big enough to warrant using MPPs invariably have massive amounts of data to process, often terabytes, which must be distributed among many disks and moved around the machine at great speed. Examples: the IBM BlueGene system and the Red Storm machine at Sandia National Laboratories.
57 References
1. Andrew S. Tanenbaum, Structured Computer Organization, 5th Edition, Prentice Hall
2. Adve and Gharachorloo, Shared Memory Consistency Models: A Tutorial
3. Dongarra et al., The Sourcebook of Parallel Computing
4. Hwang and Xu, Scalable Parallel Computing
5. Gregory Pfister, In Search of Clusters, 2nd ed., Prentice Hall
6. James K. Archibald, The Cache Coherency Problem in Shared-Memory Multiprocessors, University of Washington
7. /cachecoherence.pdf (URL truncated in source)
8. clstr-basics-cso.html (URL truncated in source)
More informationAdvanced OpenMP. Lecture 3: Cache Coherency
Advanced OpenMP Lecture 3: Cache Coherency Cache coherency Main difficulty in building multiprocessor systems is the cache coherency problem. The shared memory programming model assumes that a shared variable
More informationIntroduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization
Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency
More informationHandout 3 Multiprocessor and thread level parallelism
Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationLecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections
Lecture 18: Coherence and Synchronization Topics: directory-based coherence protocols, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory)
More informationParallel Architecture. Hwansoo Han
Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range
More informationShared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16
Shared memory Caches, Cache coherence and Memory consistency models Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Shared memory Caches, Cache
More informationParallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?
Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing
More informationSuggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!
1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and
More informationDistributed Shared Memory and Memory Consistency Models
Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single
More informationComputer Organization. Chapter 16
William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data
More informationPortland State University ECE 588/688. Directory-Based Cache Coherence Protocols
Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All
More informationShared Memory Architecture Part One
Babylon University College of Information Technology Software Department Shared Memory Architecture Part One By Classification Of Shared Memory Systems The simplest shared memory system consists of one
More informationNon-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:
The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationLecture 24: Multiprocessing Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this
More informationCSCI 4717 Computer Architecture
CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel
More informationChapter Seven Morgan Kaufmann Publishers
Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be
More informationOrganisasi Sistem Komputer
LOGO Organisasi Sistem Komputer OSK 14 Parallel Processing Pendidikan Teknik Elektronika FT UNY Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationMultiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.
Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than
More informationReview: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology
Review: Multiprocessor CPE 631 Session 21: Multiprocessors (Part 2) Department of Electrical and Computer Engineering University of Alabama in Huntsville Basic issues and terminology Communication: share
More information9. Distributed Shared Memory
9. Distributed Shared Memory Provide the usual programming model of shared memory in a generally loosely coupled distributed environment. Shared Memory Easy to program Difficult to build Tight coupling
More informationMultiprocessor Systems. Chapter 8, 8.1
Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor
More informationCache Coherence in Bus-Based Shared Memory Multiprocessors
Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition
More informationMultiprocessor Systems. COMP s1
Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve
More informationMultiprocessor Cache Coherency. What is Cache Coherence?
Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by
More informationCache Coherence and Atomic Operations in Hardware
Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationParallel Architecture. Sathish Vadhiyar
Parallel Architecture Sathish Vadhiyar Motivations of Parallel Computing Faster execution times From days or months to hours or seconds E.g., climate modelling, bioinformatics Large amount of data dictate
More informationEN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors
EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationM4 Parallelism. Implementation of Locks Cache Coherence
M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationCharacteristics of Mult l ip i ro r ce c ssors r
Characteristics of Multiprocessors A multiprocessor system is an interconnection of two or more CPUs with memory and input output equipment. The term processor in multiprocessor can mean either a central
More informationDistributed Shared Memory
Distributed Shared Memory History, fundamentals and a few examples Coming up The Purpose of DSM Research Distributed Shared Memory Models Distributed Shared Memory Timeline Three example DSM Systems The
More informationLecture 25: Multiprocessors
Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationEE382 Processor Design. Processor Issues for MP
EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency
More informationCache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014
Cache Coherence Introduction to High Performance Computing Systems (CS1645) Esteban Meneses Spring, 2014 Supercomputer Galore Starting around 1983, the number of companies building supercomputers exploded:
More informationShared Memory. SMP Architectures and Programming
Shared Memory SMP Architectures and Programming 1 Why work with shared memory parallel programming? Speed Ease of use CLUMPS Good starting point 2 Shared Memory Processes or threads share memory No explicit
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationCache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O
6.823, L21--1 Cache Coherence Protocols: Implementation Issues on SMP s Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Coherence Issue in I/O 6.823, L21--2 Processor Processor
More informationParallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence
Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly
More informationPage 1. Cache Coherence
Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale
More informationMultiprocessor Synchronization
Multiprocessor Synchronization Material in this lecture in Henessey and Patterson, Chapter 8 pgs. 694-708 Some material from David Patterson s slides for CS 252 at Berkeley 1 Multiprogramming and Multiprocessing
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationCS4961 Parallel Programming. Lecture 4: Memory Systems and Interconnects 9/1/11. Administrative. Mary Hall September 1, Homework 2, cont.
CS4961 Parallel Programming Lecture 4: Memory Systems and Interconnects Administrative Nikhil office hours: - Monday, 2-3PM - Lab hours on Tuesday afternoons during programming assignments First homework
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationPage 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology
CS252 Graduate Computer Architecture Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency Review: Multiprocessor Basic issues and terminology Communication:
More informationIntro to Multiprocessors
The Big Picture: Where are We Now? Intro to Multiprocessors Output Output Datapath Input Input Datapath [dapted from Computer Organization and Design, Patterson & Hennessy, 2005] Multiprocessor multiple
More informationMultiprocessors 1. Outline
Multiprocessors 1 Outline Multiprocessing Coherence Write Consistency Snooping Building Blocks Snooping protocols and examples Coherence traffic and performance on MP Directory-based protocols and examples
More informationThe Cache Write Problem
Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar
More informationParallel Architectures
Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationMulti-Processor / Parallel Processing
Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms
More information4. Networks. in parallel computers. Advances in Computer Architecture
4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors
More informationDr e v prasad Dt
Dr e v prasad Dt. 12.10.17 Contents Characteristics of Multiprocessors Interconnection Structures Inter Processor Arbitration Inter Processor communication and synchronization Cache Coherence Introduction
More informationLimitations of parallel processing
Your professor du jour: Steve Gribble gribble@cs.washington.edu 323B Sieg Hall all material in this lecture in Henessey and Patterson, Chapter 8 635-640 645, 646 654-665 11/8/00 CSE 471 Multiprocessors
More informationLecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections
Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System
More informationLecture 13. Shared memory: Architecture and programming
Lecture 13 Shared memory: Architecture and programming Announcements Special guest lecture on Parallel Programming Language Uniform Parallel C Thursday 11/2, 2:00 to 3:20 PM EBU3B 1202 See www.cse.ucsd.edu/classes/fa06/cse260/lectures/lec13
More informationPortland State University ECE 588/688. Cache-Only Memory Architectures
Portland State University ECE 588/688 Cache-Only Memory Architectures Copyright by Alaa Alameldeen and Haitham Akkary 2018 Non-Uniform Memory Access (NUMA) Architectures Physical address space is statically
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More information