Queuing Spin Lock Algorithms to Support Timing Predictability


Travis S. Craig
Department of Computer Science and Engineering, FR-35
University of Washington, Seattle, WA

Abstract

We introduce practical new algorithms for FIFO and priority-ordered spin locks on shared-memory multiprocessors with an atomic swap instruction. Different versions of these queuing spin locks are designed for machines with coherent-cache and NUMA memory models. We discuss extensions to provide nested lock acquisition, conditional locking, timeout of lock requests, and preemption of waiters. These locks support predictable timing in the lowest kernel and user levels of time-line-scheduled and priority-scheduled real-time systems.

1 Introduction

Predictable timing is a key requirement for hard, critical real-time functions and a desirable trait for less-critical functions or softer deadlines. The predictability of a function depends on predictability being provided at all levels of its implementation. Predictability must begin with the hardware and be progressively exported through the various levels of software. In this paper, we examine low-level software mechanisms, specifically spin locks for allocating the use of critical sections on shared-memory multiprocessors.

Whether a system is scheduled by time lines or by priorities, it must control the order in which resource allocation requests are granted. The resource on which we focus here is the critical section. In shared-memory systems, the allocation of critical sections generally involves busy-waiting ("spinning"), at least at the lowest level of software. The traditional method of using an atomic instruction (e.g., test-and-set) to spin on a single location, however, has two serious drawbacks.

(This research was supported in part by the National Science Foundation under grant number CCR and by a National Defense Science and Engineering Graduate Fellowship.)
First, the typical implementation of such a lock neither controls service order nor bounds the time to obtain the lock. Second, the spinning processors can add a significant load to the machine's processor-memory interconnect, slowing the progress of other processors. A newer mechanism, the queuing spin lock, offers solutions to both of these problems. Papers by Anderson [1], Graunke and Thakkar [7], and Mellor-Crummey and Scott [12] discuss numerous architectural and software considerations and evaluate several spin lock schemes in the context of non-real-time parallel systems. Molesky et al. [13] and Markatos [11] discuss the considerations for real-time systems. Each of these sources also introduces one or more algorithms for queuing spin locks, in which each waiting process spins on a separate location and ownership of the lock is explicitly passed from process to process.

The time-line-driven systems considered in [13] benefit from bounding the time required to obtain a lock. Thus, for those systems, we are interested in schemes that grant locks in round-robin or FIFO order. The priority-driven systems considered in [11] benefit from reducing priority inversion [15]. For those systems, we are interested in schemes that grant locks in priority order. In either type of system, it is helpful to reduce the interconnect load generated by spinning processes. We include both coherent-cache and nonuniform memory access (NUMA) machines.

In this paper, we introduce four practical new algorithms for queuing spin locks on shared-memory multiprocessors with an atomic register-to-memory swap instruction:

- FIFO-granting for coherent cache
- Priority-granting for coherent cache
- FIFO-granting for NUMA
- Priority-granting for NUMA

Each algorithm requires only one spin to acquire a lock and none to release it. None of them spin over the processor-memory interconnect (thereby reducing load on the interconnection network) and none require any atomic instruction other than the swap. Due to space limitations, we describe in detail only the algorithm that grants requests in priority order on coherent-cache machines. All four algorithms are presented in detail in [5].

Beyond the factors already mentioned, we broaden the practical applicability of our locking schemes by suggesting extensions for the following features:

- Nesting - Allow a process to hold more than one lock at a time.
- Timeout - Allow a process waiting for a lock to rescind its request if it hasn't been granted by some time limit.
- Conditional Request - Allow a process to request a lock on the condition that the lock is available. If the lock is being held by another process, then the requester returns immediately with an indication that it did not receive the lock.
- Preemption of Waiters - Ensure that a lock is not granted to a waiter that is not running, but don't prevent the scheduler from preempting a process that is waiting for a lock.

Organization. The rest of this paper contains:

- Section 2 - Definitions and architectural model.
- Section 3 - Details of our basic priority-queuing spin lock for coherent-cache machines and outlines of our other locks.
- Section 4 - Extensions to our basic locks.
- Section 5 - State of our work and comparison of our results with other queuing spin locks.
- Section 6 - Summary and future work.

2 Definitions and model

When we speak of a process in this paper, we mean a schedulable entity. Our process could be a thread in many systems. We could also substitute processor for process, particularly when considering the allocation of kernel data structures. We often refer to gaining or holding a lock. The lock, in this case, is the mechanism by which we guarantee mutually exclusive access to a resource or critical section.
The critical section comprises the part of the process that uses the protected resource. We use the concepts of holding a lock or resource and occupying a critical section interchangeably.

[Figure 1: Memory models - (a) coherent-cache machine; (b) NUMA machine.]

The data structures we use generally include three record types: Locks, Processes, and Requests. There is one Lock record for each lock that can be requested. Our Lock records contain no locked indication, but only pointers to Request records. There is one Process record per process. To request a lock, a process places a Request record (and sometimes its own Process record) on the queue of requests for the lock. In all of the schemes, a lock is released by explicitly granting one of the requests on the list. The remaining requests on the list are pending.

We assume a loosely defined shared-memory multiprocessor architecture. The machine is byte-addressed, with pointers being one word long. For discussion purposes, we use a 32-bit word. The machine has one type of read-modify-write instruction, an atomic swap (fetch-and-store) of one-word values between a register and a location in memory. During a swap to a shared location, no other access to the same location intervenes between the halves of the swap.

To provide predictable timing, the FIFO locks require that the hardware bound a processor's access time for memory. A round-robin bus has the needed characteristics. A bounded access time also enables the priority locks to perform correctly, but a priority-oriented interconnect could contribute a further reduction in priority inversion.

We use two simple memory models. Our generic coherent-cache machine is shown in Figure 1(a). Each processor is connected through its own cache to a shared interconnect and then to the shared memory. The cache block size is some multiple of 2 bytes. Any private physical memory that a processor might have (not shown) is used only for non-shared locations. The basic access time is uniform for a processor that reads any shared location not in its cache. The caches maintain coherence of the shared memory.

Our generic NUMA machine is shown in Figure 1(b). Each processor is connected to a local memory and to a shared interconnect that connects it to other processors and/or memories. Any cache that a processor might have (not shown) is used only for non-shared locations. The shared memory of the machine is spread across the local memories of the processors. The times for a processor to read or write shared locations are highly non-uniform. The read or write of a shared location in a processor's own memory (local access) is much faster than that of a location in another processor's memory (remote access). In addition, the remote access uses the interconnect while the local access does not.

3 Queuing spin locks on machines with coherent caches

To overcome the drawbacks of test-and-set locks, Graunke and Thakkar [7] invented an algorithm, which we call G-T and describe here in terms of Lock and Request records. Our own algorithms are derived from Graunke and Thakkar's. A G-T Request record is aligned on a cache block boundary, so its address has zero in the low-order bit. Only one bit of the Request is used for data. The G-T Lock record has a 32-bit value. The high 31 bits are the high bits of a Request record address. The low bit tells whether a 1 or a 0 in that Request record means that the Request is pending.

Each process in the G-T scheme has its own Request record for each lock that it needs to acquire. To acquire a lock, a process, say P1, places the address and the current value of its Request record for that lock into a register and then swaps the register with the Lock record.
In the swap, P1 receives the address and the what-means-pending value of the Request left there by whichever process last requested the lock before P1 did. P1 then spins, watching for the value in the Request record to differ from the what-means-pending value. When the process just ahead of P1 has finished with the lock, it complements the bit in its own Request record, thereby granting the lock to P1. If several more swaps occur while P1 watches for the grant, then those processes are each spinning on Requests in separate cache blocks and are unaffected when the lock is granted to P1. Thus, processor-to-memory interconnect loading is low. Because each process watches the Request left by the process before it and each process grants the Request that it swapped into the Lock, the requests are granted in FIFO order. The lock is initialized by having the Lock record point to a Request that is already granted. The lock is left in that same state whenever the last requester releases it.

The what-means-pending mechanism is required for a process to be able to reuse a Request to acquire a lock again. After process P1 marks its Request as granted (by complementing its bit), it has no way to know when that grant has been seen by another process. Thus, P1 cannot safely change the value of its own Request in preparation for acquiring the lock again. It can only record the current value of its Request by the swap into the Lock record. Due to the FIFO nature of the queue, when P1 has been granted the lock again, it can be sure that the last grant it issued has been seen. A trivial example is when P1 re-acquires the lock with no intervening acquisitions by other processes.

Because we plan to grant locks in priority order, we require a more conservative rule for reuse of Request records. When P1 grants the lock in priority order, it generally does not grant the Request that P1 itself swapped into the Lock.
Thus, we cannot assume that the Request enqueued by P1 has even been marked granted by the time P1 needs to reuse it to request the lock again. At the time it holds the lock, however, P1 does know that the Request that it watched to obtain the lock is not being used by any other process. Thus, our solution is to have a process cede ownership of its Request record when enqueuing and automatically inherit ownership of the watched Request when receiving the lock grant. We use that method in all four of our algorithms. In fact, our FIFO algorithm for coherent-cache machines is just the G-T algorithm, with exchange of Request ownership added and the what-means-pending mechanism removed.

A side effect of transferring Request record ownership is to remove the requirement for a Request record per lock per process in the G-T scheme (O(L * P) Requests in a system with L locks and P processes). Our algorithms use just one Request per lock or process in the system (O(L + P) Requests). Besides saving space, it is simpler to merely allocate one Request record each time a lock or a process is created, rather than needing to know in advance how many different locks a process will use or to allocate Requests as the process runs.

3.1 Priority queue with coherent cache

The pseudo-code of our priority-granting lock for machines with coherent caches (coherent-priority lock) is given in Appendix A. Figure 2 and the following paragraphs illustrate and describe the operation of the algorithm.

[Figure 2: Coherent-priority lock structure in operation. Panel (a) labels the record fields (state, watcher, and myproc for a Request; myreq and watch for a Process; head for a Lock), with states G = granted, P = pending, X = don't care.]

Figure 2(a) labels our graphic representation of the records of this lock. Figure 2(b) shows a newly initialized lock and process. The init-lock routine initializes the Lock record and allocates and initializes its associated Request record when a new lock is created. The init-process routine performs the corresponding functions for a new process. The new-block-aligned routine allocates a Request record and assigns its address to a pointer variable, just as new does in Pascal, but new-block-aligned guarantees that the Request is aligned on a cache-block boundary. We thus prevent Requests from sharing cache blocks.

The request-lock routine is called to acquire a lock. In that routine, a process (say P1) marks a Request as pending and provides it to the lock for P1's successor to spin on (watch), in exchange for the Request left there by P1's predecessor. P1 then spins on the predecessor's Request record until the Request is marked granted, at which time P1 holds the lock and can proceed to use the protected resource. ("Predecessor" and "successor," here, refer to the order of requests, not of grants.)

To make a priority queue lock for a coherent-cache machine (coherent-priority lock), we include pointers that allow the lock holder to traverse the list of waiters from the oldest request to the newest one: a pointer from the Lock record to the oldest Request and a pointer from each Request to the Process that is watching it. We also include a back pointer from each Request to the Process that owns it (placed it on the queue), to simplify the algorithm.
Thus, a Process and the Request that it owns are doubly linked to each other. When a process enqueues its request (Figure 2(c)), it receives a pointer to the Request that it should watch, just as in the FIFO case. It then stores a back pointer to its own Process record in the watcher field of that Request. The list structure is stable (pointers are not written by any process other than the lock holder) from the head Request down to the Request owned by the process that most recently set its back pointer. Figure 2(d) shows a lock with four processes on its queue. Process P1 holds the lock and the list is stable all the way from R0 to R3, where P4 has yet to set the watcher pointer.

The stable portion is the part of the list on which the lock holder can operate safely. The stable portion always includes the current lock holder. Thus, the lock holder can remove itself and its new Request (the one it watched) from the list, which is its first step in releasing the lock. In Figure 2(e), P1 has removed itself and its new Request from the list. The next step is to find the highest-priority waiter, which is done simply by traversing the stable portion of the list. As noted by Markatos [11], we must first choose the highest-priority waiter and then grant it the lock. There is no atomic action to do both. Therefore, it is possible that, after the lock holder has traversed the stable portion of the list and chosen the highest-priority waiter, but before it has granted the lock, more processes might finish enqueuing themselves. In that case, they are considered the next time the lock is released. By starting at the head of the list, we give the newcomers the best chance to finish enqueuing themselves before we reach them. Note, however, that any processes with requests after P4's, even if they are fully enqueued, will not be visible until P4 has finished enqueuing itself.

Finally, the lock holder simply grants the request of the highest-priority waiter. In the situation pictured in Figure 2(e), P1 chooses P3 as the highest-priority waiter and passes it the lock by setting the state of R2 to granted.

Figure 2(f) shows one more type of operation on the data structure. As shown there, P3 is preparing to release the lock and has removed itself and its new Request record from the list. The pointers that link this pair are shown as dashed arcs to distinguish them from the arcs that form the remainder of the list. Also shown is that P4 has finally finished enqueuing itself and will be considered in P3's search for the highest-priority waiter.

3.2 Extending our scheme to a NUMA machine

The algorithm for a NUMA machine cannot merely have different processors watch different locations. Without the cache coherence mechanism, each processor must spin on a location in its local shared memory. With a process spinning on whatever Request the Lock record pointed to before the swap, however, the Request record might not be local to the processor doing the watching. To adapt our scheme to the NUMA environment, then, even the FIFO lock has to set up a pointer from the Request record back to the Process that is watching it. The waiter can then check the Request once and, if it is not granted, spin on a location in its own (locally allocated) Process record. The granter, in turn, marks the Request as granted and then checks for a pointer to a waiting Process. If it finds such a pointer, it follows the pointer and sets the state of the waiting Process to granted.
To prevent a possible race between the waiter and the granter, we use a protocol of swaps in which either the granted indication is passed to the waiter or the Process pointer is passed to the granter. To implement this scheme, we require Process pointers to be even numbers, so we can use the low-order bit of the Process pointer as the state of the Request. That way, the watcher and state fields of the Request record are combined in one 32-bit word and can be swapped as a unit. If the granter performs the swap before the waiter, then the granter receives a nil pointer and the waiter receives a state of granted. If the waiter performs its swap before the granter, then the waiter receives a state of pending and the granter receives the Process pointer, which it can then follow to set the waiter's Process record to a state of granted. In either case, each process knows the winner of the race and has all the information it needs immediately upon performing its own swap operation. This swap race is illustrated in [5], where our NUMA-FIFO and NUMA-priority locks are also described in detail.

4 Other features

4.1 Nested locks

With the other schemes that use O(L + P) space [11, 12], a lock-holding process that needs to acquire another (nested) lock must provide an additional Request record. Therefore, with nesting they require O(L + P * D) space, where D is the depth of the nesting. We can, however, adapt our schemes to allow nested lock acquisitions without requiring any additional Request records. For clarity, we have written our algorithm in this paper to have the lock holder remove itself from the list and set up the myreq and myproc pointers for its new Request as part of the grant-lock routine. In the nesting version of the lock, however, a process sets the myreq/myproc pointers at the beginning of the request-lock routine and removes itself from the list at the end of the same routine.
The init-process routine, in turn, initializes the myreq field instead of the watch field. In the nesting version of our FIFO lock, we can no longer keep track of just one Request to be granted. The myreq field of the Process record still points to the next Request to grant, but that Request is the top of a linked stack of Requests for the given process to grant as it unwinds from the nested acquisitions. Each Request contains a pointer to the next Request on the stack.

4.2 Timeout

A lock-request timeout feature can be useful in real-time systems to support detection of and recovery from faults. By a timeout feature, we mean that the request-lock routine either obtains the lock before a time limit or returns a status indicating a failure to do so. With a non-queuing spin lock, request-lock could simply include a time check in its spin loop and, after the timeout, stop spinning on the lock and return the failed status. With a queue lock, however, there are additional problems of keeping the request from being granted after the timeout and removing the timed-out Request from the queue. If the waiter simply stops spinning, its Request will eventually be granted without any process watching it, and no other requests for the lock will make progress. Also, with our queue locks, only the lock holder can safely remove Process and Request records from the queue.

We have seen, however, that a waiter may safely write to the Request that it is watching. To time out its request, a waiter (say P1) must leave behind both an indication that it no longer desires the lock and a pointer to the next Request in the queue. In setting up those indications, P1 is racing with a potential granter. If the lock is granted to P1 just before it marks its request as timed out, then P1 is holding the lock, and the request-lock routine must either return a success status or pass the lock to the next waiter and return a failed status. Either policy is supportable.

The data structure to support timeouts is very similar to the one shown in Appendix A except that, as for the NUMA locks, the watcher and state fields are combined into a word that contains 31 bits of pointer and 1 bit of state. In addition, the pointer can point to either the Process record of the process that is watching the Request or the next Request record in the queue. The swap instruction can again be used to arbitrate the winner of the race between the process that is trying to release the lock (by storing a pointer of nil and a state of granted) and the process that is trying to time out its request (by storing a pointer to the next Request record and a state of granted). When the holder sees a granted state and a non-nil pointer, it knows to remove the affected Process and Request records from the queue and to follow the pointer to the next Request record.
A cost of this timeout scheme is that a process that is timing out its request cannot take its Request record with it, because a lock holder must use the information in the Request first. Therefore, we need a Request record per process for each lock that it uses, similar to the G-T lock. When a process wants to request a lock, it can try to reactivate a previously timed-out Request by swapping back the appropriate pointer and a state of pending. The swap, once again, arbitrates a race, this time between the reactivation of the Request and its removal from the queue. If the requester loses the race to reactivate its Request, then the Request is off the queue and can be used for the normal request procedure, as described in Section 3.1 and Appendix A.

4.3 Conditional lock

A problem noted by Graunke and Thakkar [7] is that of allowing a conditional request for a lock, one that acquires the lock if it is immediately available and fails otherwise. We observe that a conditional request is very much like one with a zero timeout. After finding that the request it is watching has not been granted, the requester immediately executes the timeout action. Graunke has also developed a method to provide a conditional lock and remove the failed request from the queue [6].

4.4 Preemption

As addressed by Markatos [11] and Wisniewski et al. [16], another lock-related issue is that of preemption. Clearly, if the holder of a lock loses the CPU, use of the affected resource stops until the holder is scheduled again. With non-queuing locks, preempting a waiter doesn't block the use of the critical section, because the lock is not actively granted; it is actively acquired. Once preempted, the waiter is no longer waiting and is not at risk of acquiring the lock. With queuing locks, however, preempting a process does not prevent that process from being granted the lock, which would turn it into a preempted lock holder. We might, however, want to allow a scheduler to preempt a waiter.
To do so, we need a way for the scheduler to determine that a process is spinning for a lock and to deactivate the lock's Request for the duration of the preemption. Our discussion of timeout suggests ways to deactivate and reactivate a Request. Notice that preemption is similar to a timeout in that the goal is to cause a request not to be granted. The scheduler must detect that a process it is preempting is waiting for a lock (and which lock) and perform the timeout action for it. When rescheduling the process onto the CPU, it can attempt to reactivate the Request. (If the Request hasn't been removed, the reactivation attempt succeeds.) If reactivation fails, the scheduler can enqueue a new request. Alternatively, it can start the process back at the beginning of the request-lock routine. In any case, we must still develop the technique for detecting that a process is waiting for a lock. Recent work with non-queuing locks [2] offers some promising techniques.

5 Discussion

5.1 Testing

We have implemented nesting versions of all four of our algorithms and performed the minimal functional testing that we could do in a non-multiprogrammed environment (simple mutual exclusion, lock acquisition, and lock release). For the FIFO locks, we also tested nested lock acquisition and release. For the NUMA locks, we tested an older version of the protocol that handles the race between a waiter setting a pointer to itself and the lock holder granting its request. That protocol didn't involve a swap, but would not be reliable under some reasonable assumptions about read/write ordering in a NUMA machine.

We have implemented our coherent-FIFO lock on the Proteus multiprocessor simulator [3], along with a test-and-set lock and a G-T lock. One benefit of working on the simulator is that we can monitor the functional performance of the lock schemes. Our prototypes include detectors for violations of mutual exclusion; these have been triggered only by implementation bugs and purposeful testing. We have also observed that locks are, indeed, granted in FIFO order. So far, we have simulated between 1 and 18 processes (each on its own processor) contending for a single lock. We are still generating quantitative output from the simulation, but we've made three observations that provide some optimism. First, the cache hit rate is much less sensitive to contention with our queuing lock than it is with a simple test-and-set lock. Second, we get a very high hit rate when we have only one process repeatedly entering and leaving the critical section. Third, the maximum and standard deviation of spin times for lock acquisition appear to be much lower with the queuing locks than with the test-and-set lock.

5.2 Characterizing spin lock schemes

Mellor-Crummey and Scott [12] nicely identify four general traits by which to characterize and compare spin lock algorithms.
We generalize these traits and add a fifth one of our own (atomic instruction set) to create the following list of traits and our comments on them. The first two traits are environmental, involving architectural factors that might not be controllable by the system designer. The importance of the third trait depends on the application at hand. The last two are performance-oriented results of the type of lock used.

- Target machine memory system (coherent-cache or NUMA). A scheme should be optimized for the type of memory system on which it must run, but we found that designing an algorithm for a NUMA machine was more challenging than doing one for a coherent-cache machine. With a coherent-cache machine, there is one extra active element, the coherence mechanism, to ensure that the state of a Request is moved to a location local to the processor that is reading it. On the NUMA machine, we had to develop our swapping protocol (a special-purpose software-level coherence mechanism) to allow the releasing and requesting processes to cooperate in making the state of a request available locally for any spinning. In fact, any of the schemes that we've seen for NUMA machines should work reasonably well on coherent-cache machines, but not vice versa.

- Target machine atomic instruction set (compare-and-swap (C&S), swap, or test-and-set). Each of the locking schemes that we've mentioned or described requires a specific type of atomic instruction(s). Often only one (or none) of these three instructions is implemented on a particular machine. We haven't done a serious survey of available architectures, but we believe that the most generally applicable lock would use just test-and-set. Given a machine with an atomic swap instruction (e.g., multiprocessors that use the swap available in the Intel 80x86 [9] or Motorola 88x00 [14]), we also believe that our algorithms offer better characteristics than would a queuing spin lock that used the swap as a test-and-set.
The algorithms that use compare-and-swap offer some potential performance advantages over ours in NUMA machines, but they also require a swap instruction. Neither swap nor compare-and-swap may be trivially emulated by the other. The actual performance difference depends on the relative expense of the swap and compare-and-swap instructions on a given machine. Finally, we have not yet looked formally at our work in the context of other fetch-and-op instructions, load-linked/store-conditional [10], or transactional memory [8].

- Order in which the lock is granted to waiters (random, FIFO, or priority). Random order does not support predictable timing in real-time systems. A FIFO order is most useful in a time-line-driven system, but can also benefit a priority-driven one. A priority order is even better at supporting reduced priority inversion in a priority-driven system. Note also that being able to grant a lock in priority order automatically allows us to grant it in nearly FIFO order, by assigning time stamps to requests (depending on the availability, synchronization, and precision of a real-time clock). A scheme that provides FIFO order, however, does not generally provide priority order.

- Load imposed on the interconnect (remote spinning vs. no remote spinning). Much of our effort has been dedicated to minimizing this load. Our key goal here, which we (and most others) have accomplished, is to eliminate any spinning on remote locations. Next, we would like to minimize the number of atomic instructions executed over the interconnect. Finally, remote reads and writes should be minimized.

- Space required per lock (L * P, L + P * D, or L + P). The simplest spin lock typically requires just a byte per lock. All of the queuing spin lock schemes require more than that, and our goal is generally to minimize the space required. Ours is the only scheme to provide the lowest asymptotic space (O(L + P)) with nesting. The exact space taken, however, depends on the sizes of the individual records. The key advantage of our scheme is that we can statically allocate one Request record per process when creating the process and not worry about how many to allocate for each process.

Table 1 presents a number of queuing spin lock schemes in terms of these characteristics. For coherent-cache machines, it includes the locks of Graunke and Thakkar [7], Markatos [11], and ours. For NUMA machines, it includes the locks of Mellor-Crummey and Scott (MCS), with and without compare-and-swap [12], Markatos, and ours.

6 Summary and future work

Our main technical contributions are our techniques and algorithms that provide tight control over lock grant order, use only the atomic swap instruction, use at most one (local-only) spin for lock acquisition, use no spinning for lock release, and need only O(L + P) space on either a coherent-cache or NUMA machine. An additional contribution is our outline of techniques to extend the basic algorithms.
We expect to further refine our algorithms to optimize their performance, particularly when applying them to specific machines. We also need to:

• Experimentally and analytically compare the predictability and performance of our algorithms and those of others.

• Quantify the contribution of our algorithms to improving the schedulability results of various real-time scheduling systems.

• Complete our investigation of the extensions, producing working models of each idea for testing.

• Further consider the applicability of other atomic instructions to our work.

Acknowledgments

I want to thank Alan Shaw, Becky Callison, Larry Crum, Evangelos Markatos, Chang-Yun Park, Ricardo Pincheira, Sitaram Raju, and Bob Wisniewski for their reviews, comments, and encouragement. Special thanks to Becky for identifying a subtle but serious error in a time complexity claim in an earlier draft. Any remaining errors are my own.

References

[1] Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, January 1990.

[2] Brian N. Bershad, David D. Redell, and John R. Ellis. Fast Mutual Exclusion for Uniprocessors. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.

[3] Eric A. Brewer and Chrysanthos N. Dellarocas. PROTEUS User Documentation, Version 0.2. MIT, Cambridge, Massachusetts.

[4] J. E. Burns. Mutual Exclusion with Linear Waiting using Binary Shared Variables. SIGACT News, 10(2), Summer 1978.

[5] Travis S. Craig. Building FIFO and Priority-Queuing Spin Locks from Atomic Swap. Technical Report 93-02-02, University of Washington, Seattle, WA, February 1993.

[6] Gary Graunke. Personal communication, December.

Table 1. Queuing spin lock schemes compared by atomic instruction set (compare-and-swap, swap, test-and-set), lock grant order (FIFO, priority), interconnect load (remote spin vs. no remote spin), and space (L * P, L + P * D, or L + P, with and without nesting). Coherent-cache machines: Graunke-Thakkar, Markatos, and ours. NUMA machines: MCS with compare-and-swap, MCS with swap only, Markatos, and ours.

[7] Gary Graunke and Shreekant Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, 23(6):60-69, June 1990.

[8] Maurice Herlihy and J. Eliot B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. Technical Report CRL 92/07, Digital Equipment Corporation, Cambridge Research Lab, December 1992.

[9] Intel Corporation, Santa Clara, California. Microprocessors.

[10] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice-Hall, Englewood Cliffs, NJ, 1992.

[11] Evangelos P. Markatos. Multiprocessor Synchronization Primitives with Priorities. In Yann-Hang Lee and C. M. Krishna, editors, Readings in Real-Time Systems. IEEE Computer Society Press, Los Alamitos, CA, 1993.

[12] John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.

[13] Lory D. Molesky, Chia Shen, and Goran Zlokapa. Predictable Synchronization Mechanisms for Real-Time Systems. Real-Time Systems, 2(3), September 1990.

[14] Motorola, Inc., Phoenix, Arizona. MC88110: Second Generation RISC Microprocessor User's Manual.

[15] Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, 39(9), September 1990.

[16] Robert W. Wisniewski, Leonidas Kontothanassis, and Michael L. Scott. Scalable Spin Locks for Multiprogrammed Systems. Technical Report 454, University of Rochester, Rochester, NY, April 1993.

A Coherent-cache priority queue lock algorithm

type Lock = record
    tail : ^Request    // recently enqueued
    head : ^Request    // longest enqueued
end record

type Process = record
    pri   : Priority
    watch : ^Request
    myreq : ^Request
end record

type Request = record
    state   : (PENDING, GRANTED)
    watcher : ^Process    // pointers for
    myproc  : ^Process    // list traversal
end record

procedure init_lock (var L : Lock)
    new_blockaligned (L.tail)    // allocate a Request
    L.head := L.tail
    L.tail^.state := GRANTED
    L.tail^.watcher := NIL
    L.tail^.myproc := NIL
end procedure

procedure init_process (var P : Process)
    new_blockaligned (P.myreq)    // allocate a Request
    P.myreq^.myproc := addr(P)
    P.pri := priority of my process
end procedure

procedure request_lock (var L : Lock, var P : Process)
    // set up to enqueue my Request
    P.myreq^.state := PENDING
    P.myreq^.watcher := NIL
    // enqueue the Request and get one to watch
    P.watch := fetch_and_store (L.tail, P.myreq)
    P.watch^.watcher := addr(P)    // point request back to me
    loop until (P.watch^.state = GRANTED)
end procedure

procedure grant_lock (var L : Lock, var P : Process)
    highpri  : Priority
    highreq  : ^Request
    currproc : ^Process

    // remove my Process and the Request I watched from the list
    P.myreq^.myproc := P.watch^.myproc
    if (P.myreq^.myproc # NIL) then
        P.myreq^.myproc^.myreq := P.myreq
    else
        L.head := P.myreq
    end if

    // search the list for the highest-priority waiter
    highpri := LOWEST_PRIORITY - 1
    highreq := L.head
    currproc := L.head^.watcher
    while (currproc # NIL) do
        if (currproc^.pri > highpri) then
            highpri := currproc^.pri
            highreq := currproc^.watch
        end if
        currproc := currproc^.myreq^.watcher
    end while

    // pass the lock to the highest-priority waiter
    highreq^.state := GRANTED

    // take ownership of the Request that I watched
    P.myreq := P.watch
    P.myreq^.myproc := addr(P)
end procedure
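For experimentation, the appendix pseudocode translates almost line for line into C11. The following is an illustrative sketch, not a verified implementation: it assumes the coherent-cache model described above (the scan in grant_lock relies on the read races the paper treats as benign), replaces new_blockaligned with ordinary zeroed allocation, and invents a LOWEST_PRIORITY constant for concreteness.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { PENDING, GRANTED } state_t;
typedef struct request request_t;
typedef struct process process_t;

struct request {
    _Atomic state_t state;
    _Atomic(process_t *) watcher;   /* process spinning on this record   */
    _Atomic(process_t *) myproc;    /* back pointer for list traversal   */
};

struct process {
    int        pri;                 /* higher value = higher priority    */
    request_t *watch;               /* record this process spins on      */
    request_t *myreq;               /* record this process enqueued      */
};

typedef struct {
    _Atomic(request_t *) tail;
    request_t *head;                /* written only by the lock holder   */
} lock_t;

#define LOWEST_PRIORITY 0           /* assumed constant, for illustration */

void init_lock(lock_t *L) {
    request_t *r = calloc(1, sizeof *r);  /* stands in for new_blockaligned */
    atomic_store(&r->state, GRANTED);
    L->head = r;
    atomic_store(&L->tail, r);
}

void init_process(process_t *P, int pri) {
    P->myreq = calloc(1, sizeof *P->myreq);
    atomic_store(&P->myreq->myproc, P);
    P->pri = pri;
}

void request_lock(lock_t *L, process_t *P) {
    /* set up to enqueue my Request */
    atomic_store(&P->myreq->state, PENDING);
    atomic_store(&P->myreq->watcher, NULL);
    /* enqueue the Request and get one to watch */
    P->watch = atomic_exchange(&L->tail, P->myreq);
    atomic_store(&P->watch->watcher, P);      /* point request back to me */
    while (atomic_load(&P->watch->state) != GRANTED)
        ;                                     /* local spin only */
}

void grant_lock(lock_t *L, process_t *P) {
    /* remove my Process and the Request I watched from the list */
    process_t *pred = atomic_load(&P->watch->myproc);
    atomic_store(&P->myreq->myproc, pred);
    if (pred != NULL)
        pred->myreq = P->myreq;               /* splice watched record out */
    else
        L->head = P->myreq;

    /* search the list for the highest-priority waiter */
    int highpri = LOWEST_PRIORITY - 1;
    request_t *highreq = L->head;
    process_t *curr = atomic_load(&L->head->watcher);
    while (curr != NULL) {
        if (curr->pri > highpri) {
            highpri = curr->pri;
            highreq = curr->watch;
        }
        curr = atomic_load(&curr->myreq->watcher);
    }

    /* pass the lock to the highest-priority waiter */
    atomic_store(&highreq->state, GRANTED);

    /* take ownership of the Request that I watched */
    P->myreq = P->watch;
    atomic_store(&P->myreq->myproc, P);
}
```

A holder brackets its critical section with request_lock and grant_lock; grant_lock wakes only the highest-priority waiter by setting the state of that waiter's watched record, so every other waiter keeps spinning on its own (locally cached) record.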


More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap Håkan Sundell Philippas Tsigas OPODIS 2004: The 8th International Conference on Principles of Distributed Systems

More information

CS 167 Final Exam Solutions

CS 167 Final Exam Solutions CS 167 Final Exam Solutions Spring 2018 Do all questions. 1. [20%] This question concerns a system employing a single (single-core) processor running a Unix-like operating system, in which interrupts are

More information

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY> Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: This is a closed book, closed notes exam. 80 Minutes 19 pages Notes: Not all questions

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

University of Waterloo Midterm Examination Model Solution CS350 Operating Systems

University of Waterloo Midterm Examination Model Solution CS350 Operating Systems University of Waterloo Midterm Examination Model Solution CS350 Operating Systems Fall, 2003 1. (10 total marks) Suppose that two processes, a and b, are running in a uniprocessor system. a has three threads.

More information

Featherweight Monitors with Bacon Bits

Featherweight Monitors with Bacon Bits Featherweight Monitors with Bacon Bits David F. Bacon IBM T.J. Watson Research Center Contributors Chet Murthy Tamiya Onodera Mauricio Serrano Mark Wegman Rob Strom Kevin Stoodley Introduction It s the

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

EE458 - Embedded Systems Lecture 8 Semaphores

EE458 - Embedded Systems Lecture 8 Semaphores EE458 - Embedded Systems Lecture 8 Semaphores Outline Introduction to Semaphores Binary and Counting Semaphores Mutexes Typical Applications RTEMS Semaphores References RTC: Chapter 6 CUG: Chapter 9 1

More information

Chapter 5 Concurrency: Mutual Exclusion. and. Synchronization. Operating Systems: Internals. and. Design Principles

Chapter 5 Concurrency: Mutual Exclusion. and. Synchronization. Operating Systems: Internals. and. Design Principles Operating Systems: Internals and Design Principles Chapter 5 Concurrency: Mutual Exclusion and Synchronization Seventh Edition By William Stallings Designing correct routines for controlling concurrent

More information

CSC2/458 Parallel and Distributed Systems Scalable Synchronization

CSC2/458 Parallel and Distributed Systems Scalable Synchronization CSC2/458 Parallel and Distributed Systems Scalable Synchronization Sreepathi Pai February 20, 2018 URCS Outline Scalable Locking Barriers Outline Scalable Locking Barriers An alternative lock ticket lock

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information

Mutex Implementation

Mutex Implementation COS 318: Operating Systems Mutex Implementation Jaswinder Pal Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Revisit Mutual Exclusion (Mutex) u Critical

More information

Midterm Exam Amy Murphy 6 March 2002

Midterm Exam Amy Murphy 6 March 2002 University of Rochester Midterm Exam Amy Murphy 6 March 2002 Computer Systems (CSC2/456) Read before beginning: Please write clearly. Illegible answers cannot be graded. Be sure to identify all of your

More information

Chapter 6: Process [& Thread] Synchronization. CSCI [4 6] 730 Operating Systems. Why does cooperation require synchronization?

Chapter 6: Process [& Thread] Synchronization. CSCI [4 6] 730 Operating Systems. Why does cooperation require synchronization? Chapter 6: Process [& Thread] Synchronization CSCI [4 6] 730 Operating Systems Synchronization Part 1 : The Basics Why is synchronization needed? Synchronization Language/Definitions:» What are race conditions?»

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

Process & Thread Management II. Queues. Sleep() and Sleep Queues CIS 657

Process & Thread Management II. Queues. Sleep() and Sleep Queues CIS 657 Process & Thread Management II CIS 657 Queues Run queues: hold threads ready to execute Not a single ready queue; 64 queues All threads in same queue are treated as same priority Sleep queues: hold threads

More information

Process & Thread Management II CIS 657

Process & Thread Management II CIS 657 Process & Thread Management II CIS 657 Queues Run queues: hold threads ready to execute Not a single ready queue; 64 queues All threads in same queue are treated as same priority Sleep queues: hold threads

More information

Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors

Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 21 30 March 2017 Summary

More information