CSE 513: Distributed Systems (Distributed Shared Memory)


CSE 513: Distributed Systems (Distributed Shared Memory) Guohong Cao Department of Computer & Engineering 310 Pond Lab gcao@cse.psu.

Distributed Shared Memory (DSM)

Traditionally, distributed computing is based on message passing. DSM provides a virtual address space that is shared among all nodes in a distributed system. With DSM, programs access data in the shared address space just as they access data in traditional virtual memory.

In DSM, each node can own data stored in the shared address space, and ownership can change when data moves from one node to another. When a process accesses data in the shared address space, the DSM software layer, implemented in the kernel or as a runtime library routine, maps the shared memory address to physical memory.

[Figure: The DSM]

Advantages of DSM

- DSM hides explicit message passing and provides a simpler abstraction for sharing data, one that programmers are already used to.
- DSM allows complex structures (e.g., pointers, arrays) to be passed by reference, which the message passing model does not support.
- By moving the entire block or page containing the referenced data, DSM takes advantage of locality of reference.
- DSM systems are cheaper to build than multiprocessor systems.
- DSM systems scale well compared to multiprocessor systems.
- Programs written for shared memory multiprocessors can run on DSM systems.

DSM Implementation Issues

- How to keep track of the location of remote data?
- How to overcome the communication delays and the high overhead of executing communication protocols when accessing remote data in a distributed system?
- How to make shared data concurrently accessible at several nodes in order to improve system performance?

DSM Implementation Algorithms

The client-server algorithm
- The data is maintained by the server.
- The server becomes a bottleneck.

The migration algorithm
- The data is shipped to the location of the data request, allowing subsequent accesses to be performed locally.
- Thrashing: pages migrate frequently between nodes while servicing only a few requests.
- Solution: use a tuning parameter that determines how long a node can hold a shared data item. This allows a node to make a number of accesses to the page before it is migrated to another node.

The read-replication algorithm
- Replicate the data blocks and allow either multiple nodes to have read access or one node to have read-write access.

The full-replication algorithm

Consistency Models

Allowing multiple copies eases the performance problem, but it introduces a new problem: how to keep all the copies consistent? Maintaining perfect consistency is especially painful when the copies are on different machines that can only communicate by sending messages over a slow network. In some DSM systems, the solution is to accept less than perfect consistency as the price for better performance.

A consistency model is essentially a contract between the software and the memory. It says that if the software agrees to obey certain rules, the memory promises to work correctly.

Strict Consistency

Any read to a memory location x returns the value stored by the most recent write operation to x.

[Figure, partially recovered: two executions. In the first, P2 performs R(x)1. In the second, P2 performs R(x)0 and then R(x)1.]
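The strict consistency rule can be checked mechanically on a trace that is already ordered by global time. The sketch below is only an illustration: the tuple layout, the function name `is_strictly_consistent`, the initial value 0, and the assumption that P1 issues W(x)1 before P2's reads are all hypothetical choices, not part of the slides.

```python
def is_strictly_consistent(trace, initial=0):
    """Check a trace of (process, op, var, value) tuples sorted by global time.

    Every read must return the value of the most recent write to the same
    variable, or `initial` if nothing has been written yet.
    """
    latest = {}
    for _process, op, var, value in trace:
        if op == "W":
            latest[var] = value
        elif latest.get(var, initial) != value:
            return False  # a read returned a stale value
    return True


# Assumed traces: P1 writes 1 to x, then P2 reads.
ok = [("P1", "W", "x", 1), ("P2", "R", "x", 1)]
stale = [("P1", "W", "x", 1), ("P2", "R", "x", 0), ("P2", "R", "x", 1)]

print(is_strictly_consistent(ok))     # True
print(is_strictly_consistent(stale))  # False
```

The checker relies on a single global clock to order the trace, which is exactly what is hard to provide in a distributed system and why weaker models are of interest.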

Sequential Consistency

The result of an execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. All processes must see the same sequence of memory references.

[Figure, partially recovered: two executions. In the first, P2 performs R(x)0 and then R(x)1. In the second, P2 performs R(x)1 and then R(x)0.]

Consider three processes running concurrently:

P1: a=1; print(b,c);
P2: b=1; print(a,c);
P3: c=1; print(a,b);

The following are possible results:

(a)               (b)               (c)
a=1;              a=1;              b=1;
print(b,c);       b=1;              c=1;
b=1;              print(a,c);       print(a,b);
print(a,c);       print(b,c);       print(a,c);
c=1;              c=1;              a=1;
print(a,b);       print(a,b);       print(b,c);
Prints: 001011    Prints: 101011    Prints: 010111

Implementation: ensure that no memory operation is started until all the previous ones have been completed. In a system with an efficient, totally-ordered reliable broadcast mechanism, all shared variables could be grouped together on one or more pages, and operations on the shared pages could be broadcast.

Causal Consistency

Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.

[Figure, partially recovered; the first row lost its label and possibly some operations:
      W(x)3
P2:   R(x)1 W(x)2
P3:   R(x)1 R(x)3 R(x)2
P4:   R(x)1 R(x)2 R(x)3]

Processor Consistency

Writes done by a single process are received by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.

The following is processor consistent, but not causally consistent:

[Figure, garbled in extraction: P2: R(x)1; P3: ; P4: ; W(x)2 R(x)1 R(x)2 R(x)2 R(x)1]
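The three-process example from the sequential consistency slide (P1: a=1; print(b,c), P2: b=1; print(a,c), P3: c=1; print(a,b)) can be explored by enumerating every interleaving that respects each process's program order. The following is a minimal sketch; the names `PROGRAMS`, `run`, and `all_outputs`, and the encoding of a schedule as a tuple of process indices, are assumptions made for illustration.

```python
from itertools import permutations

# Each process writes its own variable, then prints the other two.
PROGRAMS = [
    [("write", "a"), ("print", ("b", "c"))],  # P1
    [("write", "b"), ("print", ("a", "c"))],  # P2
    [("write", "c"), ("print", ("a", "b"))],  # P3
]


def run(schedule):
    """Execute one interleaving; schedule[i] is the process that takes step i.

    Returns the digits printed, in the order they were printed.
    """
    mem = {"a": 0, "b": 0, "c": 0}
    pc = [0, 0, 0]  # next statement index for each process
    printed = []
    for p in schedule:
        op, arg = PROGRAMS[p][pc[p]]
        pc[p] += 1
        if op == "write":
            mem[arg] = 1
        else:
            printed.append("".join(str(mem[v]) for v in arg))
    return "".join(printed)


def all_outputs():
    """Return (set of distinct schedules, set of distinct printed outputs)."""
    schedules = set(permutations([0, 0, 1, 1, 2, 2]))
    return schedules, {run(s) for s in schedules}


schedules, outputs = all_outputs()
print(len(schedules))          # 90
print(run((0, 0, 1, 1, 2, 2)))  # 001011
print("000000" in outputs)     # False
```

Each process contributes two steps, so there are 6!/(2!2!2!) = 90 legal interleavings. An output of 000000 never appears because the second process to print always sees the first printer's earlier write.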

Weak Consistency

Not all applications need to see every write, let alone see them in order, e.g., operations inside a critical section.

Synchronization variable: a variable used for synchronization. When a synchronization completes, all writes done on that machine are propagated outward and all writes done on other machines are brought in.

The properties of weak consistency:
1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable is allowed to be performed until all previous writes have been completed everywhere.
3. No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed.

[Figure, partially recovered; the first row lost its label and possibly some operations:
      W(x)2 S
P2:   R(x)1 R(x)2 S
P3:   R(x)2 R(x)1 S]

[Figure, partially recovered; the first row lost its label and possibly some operations:
      W(x)2 S
P2:   S R(x)1]

In the weak consistency model, when a synchronization variable is accessed, the memory does not know whether the process has finished writing the shared variables or is about to start reading them. It must take the actions required in both cases: making sure that all locally initiated writes have been completed, and gathering all writes from other machines.

Release Consistency

Two kinds of accesses:
- An acquire access tells the memory that a critical section (CS) is about to be entered.
- A release access says that a CS has just been exited.

It follows these rules:
- Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed.
- Before a release is allowed to be performed, all previous reads and writes done by the process must have completed.
- The acquire and release accesses must be processor consistent.

Granularity

A large page size for the shared memory unit takes advantage of locality of reference. Transferring large pages incurs less overhead from paging activity and from processing communication protocols.

False sharing occurs when two different data items, not shared but accessed by two different processes, are allocated to a single page. The larger the page size, the more false sharing occurs. Smart compilers may partially solve the problem. However, if two processes share the same array, nothing can be done about it. Another solution is to pre-fetch small pages.
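The effect of page size on false sharing can be seen with a little arithmetic: two items fall on the same page whenever their addresses divided by the page size give the same page number. The sketch below assumes a hypothetical layout of three items; the item names, addresses, and the helper `false_sharing_pairs` are illustrative only.

```python
from itertools import combinations


def false_sharing_pairs(items, page_size):
    """Find pairs of items that share a page but are used by different processes.

    items maps an item name to (address, accessing_process).
    """
    pairs = []
    for (n1, (a1, p1)), (n2, (a2, p2)) in combinations(sorted(items.items()), 2):
        same_page = a1 // page_size == a2 // page_size
        if same_page and p1 != p2:
            pairs.append((n1, n2))
    return pairs


# Hypothetical layout: each item is private to one process.
layout = {
    "counter": (100, "P1"),
    "flag": (3000, "P2"),
    "log": (9000, "P1"),
}

print(false_sharing_pairs(layout, 4096))  # [('counter', 'flag')]
print(false_sharing_pairs(layout, 1024))  # []
```

With 4096-byte pages, `counter` and `flag` land on page 0 and the page bounces between P1 and P2 even though neither item is shared. With 1024-byte pages they land on pages 0 and 2 and the conflict disappears, at the cost of more page transfers overall.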

Page Replacement

When there is no free space in memory, a page may need to be replaced. Traditionally, least recently used (LRU) replacement is used. In DSM, LRU may need to be modified, since data may be accessed in different modes such as shared, private, read-only, and writable.

- Private pages may be replaced before shared pages, as shared pages would have to be moved over the network from their owner.
- Read-only pages can simply be deleted, as their owners have a copy.

Once a page is selected for replacement, the DSM must ensure that the page is not lost forever. One option is to swap the page onto disk. Another option is to use reserved memory, wherein each node is responsible for certain portions of the global virtual space and reserves memory space for those portions.

IVY

IVY (Integrated Shared Virtual Memory at Yale) was implemented in the Apollo environment.

- The address space is divided into pages, with pages spread over all the processors in the system.
- When a processor references an address that is not local, a trap occurs, and the DSM software fetches the page containing the address and restarts the faulting instruction, which now completes successfully.
- Replication is used to improve performance, especially for reads.
- Sequential consistency is achieved by write invalidation.

[Figure: System Model for Page-Based DSM]

The Coherence Protocols

When Pi has a write fault on a page p:
- Pi finds the owner of page p.
- The owner of page p sends the page and its copyset to Pi and marks its page table entry for page p as null.
- The faulting Pi sends invalidation messages to all the processors contained in the copyset.

When Pi has a read fault on a page p:
- Pi finds the owner of p.
- The owner of p sends a copy of p to Pi and adds Pi to copyset(p). Pi has read-only access to p.
- The owner marks its page table entry for p as read-only.
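The read-fault and write-fault steps of the coherence protocol can be sketched as a small state machine for a single page. This is a simplified model under stated assumptions, not IVY's actual code: the class `PageState`, its field names, and the `invalidated` log are hypothetical, message passing is replaced by direct state updates, and locating the owner is assumed to have already happened.

```python
class PageState:
    """Tracks owner, copyset, and per-node access rights for one page."""

    def __init__(self, owner, nodes):
        self.owner = owner
        self.copyset = set()
        self.access = {n: None for n in nodes}  # None means no valid copy
        self.access[owner] = "write"
        self.invalidated = []  # log of nodes that received an invalidation

    def read_fault(self, node):
        # Owner sends a copy, records the reader, and downgrades itself.
        self.copyset.add(node)
        self.access[node] = "read"
        self.access[self.owner] = "read"

    def write_fault(self, node):
        # Owner hands over the page and copyset, then drops its own entry.
        old_owner, copyset = self.owner, self.copyset
        if old_owner != node:
            self.access[old_owner] = None
        # The faulting node invalidates every other copy in the copyset.
        for reader in sorted(copyset - {node}):
            self.access[reader] = None
            self.invalidated.append(reader)
        self.owner, self.copyset = node, set()
        self.access[node] = "write"


page = PageState(owner=0, nodes=range(3))
page.read_fault(1)
page.read_fault(2)
page.write_fault(1)
print(page.owner, page.access, page.invalidated)
```

After the two read faults, all three nodes hold read-only copies. The write fault by node 1 then moves ownership to node 1, invalidates node 2's copy, and leaves node 1 as the only node with access, which is what keeps the memory sequentially consistent.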

Invalidation Protocols

Central Manager Approach

How to locate owner(p) for a given p? Where to store copyset(p)?

Solution: a centralized manager scheme.
- A server called a manager stores the location of owner(p) and the set copyset(p).
- In case of a write fault, the previous owner also sends the page's copy set. The requesting process multicasts a request to the members of the copy set and makes sure acknowledgements are received.
- Two messages are needed to locate the owner.
- Major problems: the manager is a bottleneck and a single point of failure.

Other Schemes

Fixed distributed manager scheme
- Multiple managers are used, and pages are divided statically among them.
- Hashing techniques map different pages to different managers. For example, with eight page managers, all pages whose numbers end with 000 are handled by manager 0, and pages whose numbers end with 001 are handled by manager 1.

Multicast-based distributed management
- When a process faults, it multicasts its page request to all other processes. Only the process that owns the page replies.

Problems

Consider C1 and C2 using multicast to locate a page owned by O. Suppose O receives C1's request first and transfers ownership to it. Before the page arrives at C1, C2's request arrives at O and at C1. O will discard C2's request since it is no longer the owner. C1 also discards the request since it hasn't become the owner yet.

Solution: C1 defers processing C2's request until it becomes the owner.

New problem: C1's request is also queued at C2. After C1 gives C2 the page, C2 will receive and process C1's request, which is not necessary. Solution?
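The page-to-manager mapping in the fixed distributed manager scheme is a static hash. A minimal sketch, assuming page numbers are non-negative integers, that "ends with 000" refers to the low-order bits of the page number, and that the number of managers is a power of two (the function name `manager_for` is hypothetical):

```python
def manager_for(page_number, num_managers=8):
    """Map a page to its manager using the low-order bits of the page number."""
    if num_managers <= 0 or num_managers & (num_managers - 1):
        raise ValueError("num_managers must be a power of two")
    return page_number & (num_managers - 1)


print(manager_for(0b1011000))  # 0, page number ends with 000
print(manager_for(0b1011001))  # 1, page number ends with 001
```

Because the mapping is fixed, any node can compute the manager locally without a lookup message, but a page that is heavily contended still concentrates its traffic on one manager.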

Dynamic Distributed Manager Algorithm

The idea is to divide the overhead of locating pages among the computers that access them. Each process keeps, for each page p, a hint as to the page's current owner: the probable owner of p, or probowner(p).

Initially, each process knows the accurate page owner; later there may be a long chain of hints. The following rules are used to reduce the length of the chain:
- When a process transfers ownership of page p to another process, it updates probowner(p) to be the recipient.
- When a process handles an invalidation request for a page p, it updates probowner(p) to be the requester.
- When a process that has requested read access to a page p receives it, it updates probowner(p) to be the provider.
- When a process receives a request for a page p that it does not own, it forwards the request to probowner(p) and resets probowner(p) to be the requester.

Other Optimizations

Instead of obtaining a read copy from the owner, a client can obtain a copy from any process with a valid copy. Processes need to keep a record of the clients that have obtained a copy of the page from them. This forms a tree rooted at the owner. How about invalidation?

Double fault: a page that is not available locally is read and then written in succession, in which case the page is transferred twice. To avoid this, a sequence number is associated with each page. When a node needs read-write access to a page for which it has read-only access, it sends the sequence number along with its read-write access request to the owner. Based on the sequence number, the owner may not need to transfer the whole page.
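The probowner forwarding rule of the dynamic distributed manager algorithm can be sketched for a single page and write requests only. In this simplified model, `prob_owner[i]` is node i's hint, the owner is the node whose hint points to itself, and the function name `request_ownership` is an assumption for illustration; invalidations and read requests are left out.

```python
def request_ownership(prob_owner, requester):
    """Follow probowner hints from requester to the owner, then transfer the page.

    Every node that forwards the request resets its hint to the requester.
    Returns the number of forwards needed before the owner was reached.
    """
    forwards = 0
    node = prob_owner[requester]
    while prob_owner[node] != node:  # node is not the owner, so it forwards
        next_node = prob_owner[node]
        prob_owner[node] = requester
        node = next_node
        forwards += 1
    prob_owner[node] = requester       # old owner points at the recipient
    prob_owner[requester] = requester  # requester is now the owner
    return forwards


hints = [1, 2, 3, 3]  # chain 0 -> 1 -> 2 -> 3, node 3 owns the page
print(request_ownership(hints, 0), hints)  # 2 [0, 0, 0, 0]
print(request_ownership(hints, 2), hints)  # 0 [2, 0, 2, 0]
```

The first request walks the whole chain, but it also collapses the chain: every node it passed now points directly at the new owner, so the second request reaches the owner without any forwarding.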
