Queuing Spin Lock Algorithms to Support Timing Predictability


Travis S. Craig
Department of Computer Science and Engineering, FR-35
University of Washington, Seattle, WA

Abstract

We introduce practical new algorithms for FIFO and priority-ordered spin locks on shared-memory multiprocessors with an atomic swap instruction. Different versions of these queuing spin locks are designed for machines with coherent-cache and NUMA memory models. We discuss extensions to provide nested lock acquisition, conditional locking, timeout of lock requests, and preemption of waiters. These locks support predictable timing in the lowest kernel and user levels of time-line-scheduled and priority-scheduled real-time systems.

1 Introduction

Predictable timing is a key requirement for hard, critical real-time functions and a desirable trait for less-critical functions or softer deadlines. The predictability of a function depends on predictability being provided at all levels of its implementation. Predictability must begin with the hardware and be progressively exported through the various levels of software. In this paper, we examine low-level software mechanisms, specifically spin locks for allocating the use of critical sections on shared-memory multiprocessors.

Whether a system is scheduled by time lines or by priorities, it must control the order in which resource allocation requests are granted. The resource on which we focus here is the critical section. In shared-memory systems, the allocation of critical sections generally involves busy-waiting ("spinning"), at least at the lowest level of software. The traditional method of using an atomic instruction (e.g., test-and-set) to spin on a single location, however, has two serious drawbacks.

(This research was supported in part by the National Science Foundation under grant number CCR and by a National Defense Science and Engineering Graduate Fellowship.)
First, the typical implementation of such a lock neither controls service order nor bounds the time to obtain the lock. Second, the spinning processors can add a significant load to the machine's processor-memory interconnect, slowing the progress of other processors. A newer mechanism, the queuing spin lock, offers solutions to both of these problems. Papers by Anderson [1], Graunke and Thakkar [7], and Mellor-Crummey and Scott [12] discuss numerous architectural and software considerations and evaluate several spin lock schemes in the context of non-real-time parallel systems. Molesky et al. [13] and Markatos [11] discuss the considerations for real-time systems. Each of these sources also introduces one or more algorithms for queuing spin locks, in which each waiting process spins on a separate location and ownership of the lock is explicitly passed from process to process.

The time-line-driven systems considered in [13] benefit from bounding the time required to obtain a lock. Thus, for those systems, we are interested in schemes that grant locks in round-robin or FIFO order. The priority-driven systems considered in [11] benefit from reducing priority inversion [15]. For those systems, we are interested in schemes that grant locks in priority order. In either type of system, it is helpful to reduce the interconnect load generated by spinning processes. We include both coherent-cache and nonuniform memory access (NUMA) machines.

In this paper, we introduce four practical new algorithms for queuing spin locks on shared-memory multiprocessors with an atomic register-to-memory swap instruction:

- FIFO-granting for coherent cache
- Priority-granting for coherent cache
- FIFO-granting for NUMA
- Priority-granting for NUMA

Each algorithm requires only one spin to acquire a lock and none to release it. None of them spin over the processor-memory interconnect (thereby reducing load on the interconnection network) and none require any atomic instruction other than the swap. Due to space limitations, we describe in detail only the algorithm that grants requests in priority order on coherent-cache machines. All four algorithms are presented in detail in [5].

Beyond the factors already mentioned, we broaden the practical applicability of our locking schemes by suggesting extensions for the following features:

- Nesting - Allow a process to hold more than one lock at a time.
- Timeout - Allow a process waiting for a lock to rescind its request if it hasn't been granted by some time limit.
- Conditional Request - Allow a process to request a lock on the condition that the lock is available. If the lock is being held by another process, then the requester returns immediately with an indication that it did not receive the lock.
- Preemption of Waiters - Ensure that a lock is not granted to a waiter that is not running, but don't prevent the scheduler from preempting a process that is waiting for a lock.

Organization. The rest of this paper contains:

- Section 2 - Definitions and architectural model.
- Section 3 - Details of our basic priority-queuing spin lock for coherent-cache machines and outlines of our other locks.
- Section 4 - Extensions to our basic locks.
- Section 5 - State of our work and comparison of our results with other queuing spin locks.
- Section 6 - Summary and future work.

2 Definitions and model

When we speak of a process in this paper, we mean a schedulable entity. Our process could be a thread in many systems. We could also substitute processor for process, particularly when considering the allocation of kernel data structures. We often refer to gaining or holding a lock. The lock, in this case, is the mechanism by which we guarantee mutually exclusive access to a resource or critical section.
The critical section comprises the part of the process that uses the protected resource. We use the concepts of holding a lock or resource and occupying a critical section interchangeably.

[Figure 1: Memory models - (a) coherent-cache machine; (b) NUMA machine.]

The data structures we use generally include three record types: Locks, Processes, and Requests. There is one Lock record for each lock that can be requested. Our Lock records contain no locked indication, but only pointers to Request records. There is one Process record per process. To request a lock, a process places a Request record (and sometimes its own Process record) on the queue of requests for the lock. In all of the schemes, a lock is released by explicitly granting one of the requests on the list. The remaining requests on the list are pending.

We assume a loosely defined shared-memory multiprocessor architecture. The machine is byte-addressed, with pointers being one word long. For discussion purposes, we use a 32-bit word. The machine has one type of read-modify-write instruction, an atomic swap (fetch-and-store) of one-word values between a register and a location in memory. During a swap to a shared location, no other access to the same location intervenes between the halves of the swap.

To provide predictable timing, the FIFO locks require that the hardware bound a processor's access time for memory. A round-robin bus has the needed characteristics. A bounded access time also enables the priority locks to perform correctly, but a priority-oriented interconnect could contribute a further reduction in priority inversion.

We use two simple memory models. Our generic coherent-cache machine is shown in Figure 1(a). Each processor is connected through its own cache to a shared interconnect and then to the shared memory. The cache block size is some multiple of 2 bytes. Any private physical memory that a processor might have (not shown) is used only for non-shared locations. The basic access time is uniform for a processor that reads any shared location not in its cache. The caches maintain coherence of the shared memory.

Our generic NUMA machine is shown in Figure 1(b). Each processor is connected to a local memory and to a shared interconnect that connects it to other processors and/or memories. Any cache that a processor might have (not shown) is used only for non-shared locations. The shared memory of the machine is spread across the local memories of the processors. The times for a processor to read or write shared locations are highly non-uniform. The read or write of a shared location in a processor's own memory (local access) is much faster than that of a location in another processor's memory (remote access). In addition, the remote access uses the interconnect while the local access does not.

3 Queuing spin locks on machines with coherent caches

To overcome the drawbacks of test-and-set locks, Graunke and Thakkar [7] invented an algorithm, which we call G-T and describe here in terms of Lock and Request records. Our own algorithms are derived from Graunke and Thakkar's. A G-T Request record is aligned on a cache block boundary, so its address has zero in the low-order bit. Only one bit of the Request is used for data. The G-T Lock record has a 32-bit value. The high 31 bits are the high bits of a Request record address. The low bit tells whether a 1 or a 0 in that Request record means that the Request is pending.

Each process in the G-T scheme has its own Request record for each lock that it needs to acquire. To acquire a lock, a process, say P1, places the address and the current value of its Request record for that lock into a register and then swaps the register with the Lock record.
In the swap, P1 receives the address and the what-means-pending value of the Request left there by whichever process last requested the lock before P1 did. P1 then spins, watching for the value in the Request record to differ from the what-means-pending value. When the process just ahead of P1 has finished with the lock, it complements the bit in its own Request record, thereby granting the lock to P1. If several more swaps occur while P1 watches for the grant, then those processes are each spinning on Requests in separate cache blocks and are unaffected when the lock is granted to P1. Thus, processor-to-memory interconnect loading is low. Because each process watches the Request left by the process before it and each process grants the Request that it swapped into the Lock, the requests are granted in FIFO order. The lock is initialized by having the Lock record point to a Request that is already granted. The lock is left in that same state whenever the last requester releases it.

The what-means-pending mechanism is required for a process to be able to reuse a Request to acquire a lock again. After process P1 marks its Request as granted (by complementing its bit), it has no way to know when that grant has been seen by another process. Thus, P1 cannot safely change the value of its own Request in preparation for acquiring the lock again. It can only record the current value of its Request by the swap into the Lock record. Due to the FIFO nature of the queue, when P1 has been granted the lock again, it can be sure that the last grant it issued has been seen. A trivial example is when P1 re-acquires the lock with no intervening acquisitions by other processes.

Because we plan to grant locks in priority order, we require a more conservative rule for reuse of Request records. When P1 grants the lock in priority order, it generally does not grant the Request that P1 itself swapped into the Lock.
Thus, we cannot assume that the Request enqueued by P1 has even been marked granted by the time P1 needs to reuse it to request the lock again. At the time it holds the lock, however, P1 does know that the Request that it watched to obtain the lock is not being used by any other process. Thus, our solution is to have a process cede ownership of its Request record when enqueuing and automatically inherit ownership of the watched Request when receiving the lock grant. We use that method in all four of our algorithms. In fact, our FIFO algorithm for coherent-cache machines is just the G-T algorithm, with exchange of Request ownership added and the what-means-pending mechanism removed.

A side effect of transferring Request record ownership is to remove the requirement for a Request record per lock per process in the G-T scheme (O(L * P) Requests in a system with L locks and P processes). Our algorithms use just one Request per lock or process in the system (O(L + P) Requests). Besides saving space, it is simpler to merely allocate one Request record each time a lock or a process is created, rather than needing to know in advance how many different locks a process will use or to allocate Requests as the process runs.

3.1 Priority queue with coherent cache

The pseudo-code of our priority-granting lock for machines with coherent caches (coherent-priority lock) is given in Appendix A. Figure 2 and the following paragraphs illustrate and describe the operation of the algorithm.

[Figure 2: Coherent-priority lock structure in operation. Panel (a) labels the record fields (state, watcher, and myproc for a Request; myreq and watch for a Process; head for a Lock), with states G = granted, P = pending, X = don't care.]

Figure 2(a) labels our graphic representation of the records of this lock. Figure 2(b) shows a newly initialized lock and process. The init-lock routine initializes the Lock record and allocates and initializes its associated Request record when a new lock is created. The init-process routine performs the corresponding functions for a new process. The new-block-aligned routine allocates a Request record and assigns its address to a pointer variable, just as new does in Pascal, but new-block-aligned guarantees that the Request is aligned on a cache-block boundary. We thus prevent Requests from sharing cache blocks.

The request-lock routine is called to acquire a lock. In that routine, a process (say P1) marks a Request as pending and provides it to the lock for P1's successor to spin on (watch), in exchange for the Request left there by P1's predecessor. P1 then spins on the predecessor's Request record until the Request is marked granted, at which time P1 holds the lock and can proceed to use the protected resource. ("Predecessor" and "successor," here, refer to the order of requests, not of grants.)

To make a priority queue lock for a coherent-cache machine (coherent-priority lock), we include pointers that allow the lock holder to traverse the list of waiters from the oldest request to the newest one: a pointer from the Lock record to the oldest Request and a pointer from each Request to the Process that is watching it. We also include a back pointer from each Request to the Process that owns it (placed it on the queue), to simplify the algorithm.
Thus, a Process and the Request that it owns are doubly linked to each other. When a process enqueues its request (Figure 2(c)), it receives a pointer to the Request that it should watch, just as in the FIFO case. It then stores a back pointer to its own Process record in the watcher field of that Request. The list structure is stable (pointers are not written by any process other than the lock holder) from the head Request down to the Request owned by the process that most recently set its back pointer. Figure 2(d) shows a lock with four processes on its queue. Process P1 holds the lock and the list is stable all the way from R0 to R3, where P4 has yet to set the watcher pointer.

The stable portion is the part of the list on which the lock holder can operate safely. The stable portion always includes the current lock holder. Thus, the lock holder can remove itself and its new Request (the one it watched) from the list, which is its first step in releasing the lock. In Figure 2(e), P1 has removed itself and its new Request from the list. The next step is to find the highest-priority waiter, which is done simply by traversing the stable portion of the list. As noted by Markatos [11], we must first choose the highest-priority waiter and then grant it the lock. There is no atomic action to do both. Therefore, it is possible that, after the lock holder has traversed the stable portion of the list and chosen the highest-priority waiter, but before it has granted the lock, more processes might finish enqueuing themselves. In that case, they are considered the next time the lock is released. By starting at the head of the list, we give the newcomers the best chance to finish enqueuing themselves before we reach them. Note, however, that any processes with requests after P4's, even if they are fully enqueued, will not be visible until P4 has finished enqueuing itself.

Finally, the lock holder simply grants the request of the highest-priority waiter. In the situation pictured in Figure 2(e), P1 chooses P3 as the highest-priority waiter and passes it the lock by setting the state of R2 to granted.

Figure 2(f) shows one more type of operation on the data structure. As shown there, P3 is preparing to release the lock and has removed itself and its new Request record from the list. The pointers that link this pair are shown as dashed arcs to distinguish them from the arcs that form the remainder of the list. Also shown is that P4 has finally finished enqueuing itself and will be considered in P3's search for the highest-priority waiter.

3.2 Extending our scheme to a NUMA machine

The algorithm for a NUMA machine cannot merely have different processors watch different locations. Without the cache coherence mechanism, each processor must spin on a location in its local shared memory. With a process spinning on whatever Request the Lock record pointed to before the swap, however, the Request record might not be local to the processor doing the watching. To adapt our scheme to the NUMA environment, then, even the FIFO lock has to set up a pointer from the Request record back to the Process that is watching it. The waiter can then check the Request once and, if it is not granted, spin on a location in its own (locally allocated) Process record. The granter, in turn, marks the Request as granted and then checks for a pointer to a waiting Process. If it finds such a pointer, it follows the pointer and sets the state of the waiting Process to granted.
To prevent a possible race between the waiter and the granter, we use a protocol of swaps in which either the granted indication is passed to the waiter or the Process pointer is passed to the granter. To implement this scheme, we require Process pointers to be even numbers, so we can use the low-order bit of the Process pointer as the state of the Request. That way, the watcher and state fields of the Request record are combined in one 32-bit word and can be swapped as a unit. If the granter performs the swap before the waiter, then the granter receives a nil pointer and the waiter receives a state of granted. If the waiter performs its swap before the granter, then the waiter receives a state of pending and the granter receives the Process pointer, which it can then follow to set the waiter's Process record to a state of granted. In either case, each process knows the winner of the race and has all the information it needs immediately upon performing its own swap operation. This swap race is illustrated in [5], where our NUMA-FIFO and NUMA-priority locks are also described in detail.

4 Other features

4.1 Nested locks

With the other schemes that use O(L + P) space [11, 12], a lock-holding process that needs to acquire another (nested) lock must provide an additional Request record. Therefore, with nesting they require O(L + P * D) space, where D is the depth of the nesting. We can, however, adapt our schemes to allow nested lock acquisitions without requiring any additional Request records. For clarity, we have written our algorithm in this paper to have the lock holder remove itself from the list and set up the myreq and myproc pointers for its new Request as part of the grant-lock routine. In the nesting version of the lock, however, a process sets the myreq/myproc pointers at the beginning of the request-lock routine and removes itself from the list at the end of the same routine.
The init-process routine, in turn, initializes the myreq field instead of the watch field. In the nesting version of our FIFO lock, we can no longer keep track of just one Request to be granted. The myreq field of the Process record still points to the next Request to grant, but that Request is the top of a linked stack of Requests for the given process to grant as it unwinds from the nested acquisitions. Each Request contains a pointer to the next Request on the stack.

4.2 Timeout

A lock-request timeout feature can be useful in real-time systems to support detection of and recovery from faults. By a timeout feature, we mean that the request-lock routine either obtains the lock before a time limit or returns a status indicating a failure to do so. With a non-queuing spin lock, request-lock could simply include a time check in its spin loop and, after the timeout, stop spinning on the lock and return the failed status. With a queue lock, however, there are additional problems of keeping the request from being granted after the timeout and removing the timed-out Request from the queue. If the waiter simply stops spinning, its Request will eventually be granted without any process watching it, and no other requests for the lock will make progress. Also, with our queue locks, only the lock holder can safely remove Process and Request records from the queue.

We have seen, however, that a waiter may safely write to the Request that it is watching. To time out its request, a waiter (say P1) must leave behind both an indication that it no longer desires the lock and a pointer to the next Request in the queue. In setting up those indications, P1 is racing with a potential granter. If the lock is granted to P1 just before it marks its request as timed out, then P1 is holding the lock, and the request-lock routine must either return a success status or pass the lock to the next waiter and return a failed status. Either policy is supportable.

The data structure to support timeouts is very similar to the one shown in Appendix A except that, as for the NUMA locks, the watcher and state fields are combined into a word that contains 31 bits of pointer and 1 bit of state. In addition, the pointer can point to either the Process record of the process that is watching the Request or the next Request record in the queue. The swap instruction can again be used to arbitrate the winner of the race between the process that is trying to release the lock (by storing a pointer of nil and a state of granted) and the process that is trying to time out its request (by storing a pointer to the next Request record and a state of granted). When the holder sees a granted state and a non-nil pointer, it knows to remove the affected Process and Request records from the queue and to follow the pointer to the next Request record.
A cost of this timeout scheme is that a process that is timing out its request cannot take its Request record with it, because a lock holder must use the information in the Request first. Therefore, we need a Request record per process for each lock that it uses, similar to the G-T lock. When a process wants to request a lock, it can try to reactivate a previously timed-out Request by swapping back the appropriate pointer and a state of pending. The swap, once again, arbitrates a race, this time between the reactivation of the Request and its removal from the queue. If the requester loses the race to reactivate its Request, then the Request is off the queue and can be used for the normal request procedure, as described in Section 3.1 and Appendix A.

4.3 Conditional lock

A problem noted by Graunke and Thakkar [7] is that of allowing a conditional request for a lock, one that acquires the lock if it is immediately available and fails otherwise. We observe that a conditional request is very much like one with a zero timeout. After finding that the request it is watching has not been granted, the requester immediately executes the timeout action. Graunke has also developed a method to provide a conditional lock and remove the failed request from the queue [6].

4.4 Preemption

As addressed by Markatos [11] and Wisniewski et al. [16], another lock-related issue is that of preemption. Clearly, if the holder of a lock loses the CPU, use of the affected resource stops until the holder is scheduled again. With non-queuing locks, preempting a waiter doesn't block the use of the critical section, because the lock is not actively granted; it is actively acquired. Once preempted, the waiter is no longer waiting and is not at risk of acquiring the lock. With queuing locks, however, preempting a process does not prevent that process from being granted the lock, which would turn it into a preempted lock holder. We might, however, want to allow a scheduler to preempt a waiter.
To do so, we need a way for the scheduler to determine that a process is spinning for a lock and to deactivate the lock's Request for the duration of the preemption. Our discussion of timeout suggests ways to deactivate and reactivate a Request. Notice that preemption is similar to a timeout in that the goal is to cause a request not to be granted. The scheduler must detect that a process it is preempting is waiting for a lock (and which lock) and perform the timeout action for it. When rescheduling the process onto the CPU, it can attempt to reactivate the Request. (If the Request hasn't been removed, the reactivation attempt succeeds.) If reactivation fails, the scheduler can enqueue a new request. Alternatively, it can start the process back at the beginning of the request-lock routine. In any case, we must still develop the technique for detecting that a process is waiting for a lock. Recent work with non-queuing locks [2] offers some promising techniques.

5 Discussion

5.1 Testing

We have implemented nesting versions of all four of our algorithms and performed the minimal functional testing that we could do in a non-multiprogrammed environment (simple mutual exclusion, lock acquisition, and lock release). For the FIFO locks, we also tested nested lock acquisition and release. For the NUMA locks, we tested an older version of the protocol that handles the race between a waiter setting a pointer to itself and the lock holder granting its request. That protocol didn't involve a swap, but would not be reliable under some reasonable assumptions about read/write ordering in a NUMA machine.

We have implemented our coherent-FIFO lock on the Proteus multiprocessor simulator [3], along with a test-and-set lock and a G-T lock. One benefit of working on the simulator is that we can monitor the functional performance of the lock schemes. Our prototypes include detectors for violations of mutual exclusion; these have been triggered only by implementation bugs and purposeful testing. We have also observed that locks are, indeed, granted in FIFO order. So far, we have simulated between 1 and 18 processes (each on its own processor) contending for a single lock. We are still generating quantitative output from the simulation, but we've made three observations that provide some optimism. First, the cache hit rate is much less sensitive to contention with our queuing lock than it is with a simple test-and-set lock. Second, we get a very high hit rate when we have only one process repeatedly entering and leaving the critical section. Third, the maximum and standard deviation of spin times for lock acquisition appear to be much lower with the queuing locks than with the test-and-set lock.

5.2 Characterizing spin lock schemes

Mellor-Crummey and Scott [12] nicely identify four general traits by which to characterize and compare spin lock algorithms.
We generalize these traits and add a fifth one of our own (atomic instruction set) to create the following list of traits and our comments on them. The first two traits are environmental, involving architectural factors that might not be controllable by the system designer. The importance of the third trait depends on the application at hand. The last two are performance-oriented results of the type of lock used.

- Target machine memory system (coherent-cache or NUMA). A scheme should be optimized for the type of memory system on which it must run, but we found that designing an algorithm for a NUMA machine was more challenging than doing one for a coherent-cache machine. With a coherent-cache machine, there is one extra active element, the coherence mechanism, to ensure that the state of a Request is moved to a location local to the processor that is reading it. On the NUMA machine, we had to develop our swapping protocol (a special-purpose software-level coherence mechanism) to allow the releasing and requesting processes to cooperate in making the state of a request available locally for any spinning. In fact, any of the schemes that we've seen for NUMA machines should work reasonably well on coherent-cache machines, but not vice versa.

- Target machine atomic instruction set (compare-and-swap (C&S), swap, or test-and-set). Each of the locking schemes that we've mentioned or described requires a specific type of atomic instruction(s). Often only one (or none) of these three instructions is implemented on a particular machine. We haven't done a serious survey of available architectures, but we believe that the most generally applicable lock would use just test-and-set. Given a machine with an atomic swap instruction (e.g., multiprocessors that use the swap available in the Intel 80x86 [9] or Motorola 88x00 [14]), we also believe that our algorithms offer better characteristics than would a queuing spin lock that used the swap as a test-and-set.
The algorithms that use compare-and-swap offer some potential performance advantages over ours in NUMA machines, but they also require a swap instruction. Neither swap nor compare-and-swap may be trivially emulated by the other. The actual performance difference depends on the relative expense of the swap and compare-and-swap instructions on a given machine. Finally, we have not yet looked formally at our work in the context of other fetch-and-op instructions, load-linked/store-conditional [10], or transactional memory [8].

- Order in which the lock is granted to waiters (random, FIFO, or priority). Random order does not support predictable timing in real-time systems. A FIFO order is most useful in a time-line-driven system, but can also benefit a priority-driven one. A priority order is even better at supporting reduced priority inversion in a priority-driven system. Note also that being able to grant a lock in priority order automatically allows us to grant it in nearly FIFO order, by assigning time stamps to requests (depending on the availability, synchronization, and precision of a real-time clock). A scheme that provides FIFO order, however, does not generally provide priority order.

- Load imposed on the interconnect (remote spinning vs. no remote spinning). Much of our effort has been dedicated to minimizing this load. Our key goal here, which we (and most others) have accomplished, is to eliminate any spinning on remote locations. Next, we would like to minimize the number of atomic instructions executed over the interconnect. Finally, remote reads and writes should be minimized.

- Space required per lock (L * P, L + P * D, or L + P). The simplest spin lock typically requires just a byte per lock. All of the queuing spin lock schemes require more than that, and our goal is generally to minimize the space required. Ours is the only scheme to provide the lowest asymptotic space (O(L + P)) with nesting. The exact space taken, however, depends on the sizes of the individual records. The key advantage of our scheme is that we can statically allocate one Request record per process when creating the process and not worry about how many to allocate for each process.

Table 1 presents a number of queuing spin lock schemes in terms of these characteristics. For coherent-cache machines, it includes the locks of Graunke and Thakkar [7], Markatos [11], and ours. For NUMA machines, it includes the locks of Mellor-Crummey and Scott (MCS), with and without compare-and-swap [12], Markatos, and ours.

6 Summary and future work

Our main technical contributions are our techniques and algorithms that provide tight control over lock grant order, use only the atomic swap instruction, use at most one (local-only) spin for lock acquisition, use no spinning for lock release, and need only O(L + P) space on either a coherent-cache or NUMA machine. An additional contribution is our outline of techniques to extend the basic algorithms.
We expect to further refine our algorithms to optimize their performance, particularly when applying them to specific machines. We also need to:

• Experimentally and analytically compare the predictability and performance of our algorithms and those of others.

• Quantify the contribution of our algorithms to improving the schedulability results of various real-time scheduling systems.

• Complete our investigation of the extensions, producing working models of each idea for testing.

• Further consider the applicability of other atomic instructions to our work.

Acknowledgments

I want to thank Alan Shaw, Becky Callison, Larry Crum, Evangelos Markatos, Chang-Yun Park, Ricardo Pincheira, Sitaram Raju, and Bob Wisniewski for their reviews, comments, and encouragement. Special thanks to Becky for identifying a subtle but serious error in a time complexity claim in an earlier draft. Any remaining errors are my own.

References

[1] Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, January 1990.

[2] Brian N. Bershad, David D. Redell, and John R. Ellis. Fast Mutual Exclusion for Uniprocessors. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.

[3] Eric A. Brewer and Chrysanthos N. Dellarocas. PROTEUS User Documentation, Version 0.2. MIT, Cambridge, Massachusetts.

[4] J. E. Burns. Mutual Exclusion with Linear Waiting using Binary Shared Variables. SIGACT News, 10(2), Summer 1978.

[5] Travis S. Craig. Building FIFO and Priority-Queuing Spin Locks from Atomic Swap. Technical Report 93-02-02, University of Washington, Seattle, WA, February 1993.

[6] Gary Graunke. Personal communication, December.

Table 1. Queuing spin lock schemes compared by atomic instruction set (compare-and-swap, swap, test-and-set), lock grant order (FIFO, priority), interconnect load (remote spin vs. no remote spin), and space (L * P, L + P * D, or L + P, with and without nesting). Coherent-cache machines: Graunke-Thakkar, Markatos, and ours. NUMA machines: MCS with compare-and-swap, MCS with swap only, Markatos, and ours.

[7] Gary Graunke and Shreekant Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, 23(6):60-69, June 1990.

[8] Maurice Herlihy and J. Eliot B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. Technical Report CRL 92/07, Digital Equipment Corporation, Cambridge Research Lab, December 1992.

[9] Intel Corporation, Santa Clara, California. Microprocessors.

[10] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice-Hall, Englewood Cliffs, NJ, 1992.

[11] Evangelos P. Markatos. Multiprocessor Synchronization Primitives with Priorities. In Yann-Hang Lee and C. M. Krishna, editors, Readings in Real-Time Systems. IEEE Computer Society Press, Los Alamitos, CA, 1993.

[12] John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.

[13] Lory D. Molesky, Chia Shen, and Goran Zlokapa. Predictable Synchronization Mechanisms for Real-Time Systems. Real-Time Systems, 2(3), September 1990.

[14] Motorola, Inc., Phoenix, Arizona. MC88110: Second Generation RISC Microprocessor User's Manual.

[15] Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, 39(9), September 1990.

[16] Robert W. Wisniewski, Leonidas Kontothanassis, and Michael L. Scott. Scalable Spin Locks for Multiprogrammed Systems. Technical Report 454, University of Rochester, Rochester, NY, April 1993.

A Coherent-cache priority queue lock algorithm

type Lock = record
    tail : ^Request    // recently enqueued
    head : ^Request    // longest enqueued
end record

type Process = record
    pri   : Priority
    watch : ^Request
    myreq : ^Request
end record

type Request = record
    state   : (PENDING, GRANTED)
    watcher : ^Process    // pointers for
    myproc  : ^Process    // list traversal
end record

procedure init_lock (var L : Lock)
    new_blockaligned (L.tail)    // allocate a Request
    L.head := L.tail
    L.tail^.state := GRANTED
    L.tail^.watcher := NIL
    L.tail^.myproc := NIL
end procedure

procedure init_process (var P : Process)
    new_blockaligned (P.myreq)    // allocate a Request
    P.myreq^.myproc := addr(P)
    P.pri := priority of my process
end procedure

procedure request_lock (var L : Lock, var P : Process)
    // set up to enqueue my Request
    P.myreq^.state := PENDING
    P.myreq^.watcher := NIL
    // enqueue the Request and get one to watch
    P.watch := fetch_and_store (L.tail, P.myreq)
    P.watch^.watcher := addr(P)    // point request back to me
    loop until (P.watch^.state = GRANTED)
end procedure

procedure grant_lock (var L : Lock, var P : Process)
    highpri  : Priority
    highreq  : ^Request
    currproc : ^Process

    // remove my Process and the Request I watched from the list
    P.myreq^.myproc := P.watch^.myproc
    if (P.myreq^.myproc # NIL) then
        P.myreq^.myproc^.myreq := P.myreq
    else
        L.head := P.myreq
    end if

    // search the list for the highest-priority waiter
    highpri := LOWEST_PRIORITY - 1
    highreq := L.head
    currproc := L.head^.watcher
    while (currproc # NIL) do
        if (currproc^.pri > highpri) then
            highpri := currproc^.pri
            highreq := currproc^.watch
        end if
        currproc := currproc^.myreq^.watcher
    end while

    // pass the lock to the highest-priority waiter
    highreq^.state := GRANTED

    // take ownership of the Request that I watched
    P.myreq := P.watch
    P.myreq^.myproc := addr(P)
end procedure
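For experimentation, the appendix pseudocode translates almost line for line into C11. The following is an illustrative sketch, not a verified implementation: it assumes the coherent-cache model described above (the scan in grant_lock relies on the read races the paper treats as benign), replaces new_blockaligned with ordinary zeroed allocation, and invents a LOWEST_PRIORITY constant for concreteness.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { PENDING, GRANTED } state_t;
typedef struct request request_t;
typedef struct process process_t;

struct request {
    _Atomic state_t state;
    _Atomic(process_t *) watcher;   /* process spinning on this record   */
    _Atomic(process_t *) myproc;    /* back pointer for list traversal   */
};

struct process {
    int        pri;                 /* higher value = higher priority    */
    request_t *watch;               /* record this process spins on      */
    request_t *myreq;               /* record this process enqueued      */
};

typedef struct {
    _Atomic(request_t *) tail;
    request_t *head;                /* written only by the lock holder   */
} lock_t;

#define LOWEST_PRIORITY 0           /* assumed constant, for illustration */

void init_lock(lock_t *L) {
    request_t *r = calloc(1, sizeof *r);  /* stands in for new_blockaligned */
    atomic_store(&r->state, GRANTED);
    L->head = r;
    atomic_store(&L->tail, r);
}

void init_process(process_t *P, int pri) {
    P->myreq = calloc(1, sizeof *P->myreq);
    atomic_store(&P->myreq->myproc, P);
    P->pri = pri;
}

void request_lock(lock_t *L, process_t *P) {
    /* set up to enqueue my Request */
    atomic_store(&P->myreq->state, PENDING);
    atomic_store(&P->myreq->watcher, NULL);
    /* enqueue the Request and get one to watch */
    P->watch = atomic_exchange(&L->tail, P->myreq);
    atomic_store(&P->watch->watcher, P);      /* point request back to me */
    while (atomic_load(&P->watch->state) != GRANTED)
        ;                                     /* local spin only */
}

void grant_lock(lock_t *L, process_t *P) {
    /* remove my Process and the Request I watched from the list */
    process_t *pred = atomic_load(&P->watch->myproc);
    atomic_store(&P->myreq->myproc, pred);
    if (pred != NULL)
        pred->myreq = P->myreq;               /* splice watched record out */
    else
        L->head = P->myreq;

    /* search the list for the highest-priority waiter */
    int highpri = LOWEST_PRIORITY - 1;
    request_t *highreq = L->head;
    process_t *curr = atomic_load(&L->head->watcher);
    while (curr != NULL) {
        if (curr->pri > highpri) {
            highpri = curr->pri;
            highreq = curr->watch;
        }
        curr = atomic_load(&curr->myreq->watcher);
    }

    /* pass the lock to the highest-priority waiter */
    atomic_store(&highreq->state, GRANTED);

    /* take ownership of the Request that I watched */
    P->myreq = P->watch;
    atomic_store(&P->myreq->myproc, P);
}
```

A holder brackets its critical section with request_lock and grant_lock; grant_lock wakes only the highest-priority waiter by setting the state of that waiter's watched record, so every other waiter keeps spinning on its own (locally cached) record.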


More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap Håkan Sundell Philippas Tsigas OPODIS 2004: The 8th International Conference on Principles of Distributed Systems

More information

CS 167 Final Exam Solutions

CS 167 Final Exam Solutions CS 167 Final Exam Solutions Spring 2018 Do all questions. 1. [20%] This question concerns a system employing a single (single-core) processor running a Unix-like operating system, in which interrupts are

More information

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY> Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: This is a closed book, closed notes exam. 80 Minutes 19 pages Notes: Not all questions

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

University of Waterloo Midterm Examination Model Solution CS350 Operating Systems

University of Waterloo Midterm Examination Model Solution CS350 Operating Systems University of Waterloo Midterm Examination Model Solution CS350 Operating Systems Fall, 2003 1. (10 total marks) Suppose that two processes, a and b, are running in a uniprocessor system. a has three threads.

More information

Featherweight Monitors with Bacon Bits

Featherweight Monitors with Bacon Bits Featherweight Monitors with Bacon Bits David F. Bacon IBM T.J. Watson Research Center Contributors Chet Murthy Tamiya Onodera Mauricio Serrano Mark Wegman Rob Strom Kevin Stoodley Introduction It s the

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

EE458 - Embedded Systems Lecture 8 Semaphores

EE458 - Embedded Systems Lecture 8 Semaphores EE458 - Embedded Systems Lecture 8 Semaphores Outline Introduction to Semaphores Binary and Counting Semaphores Mutexes Typical Applications RTEMS Semaphores References RTC: Chapter 6 CUG: Chapter 9 1

More information

Chapter 5 Concurrency: Mutual Exclusion. and. Synchronization. Operating Systems: Internals. and. Design Principles

Chapter 5 Concurrency: Mutual Exclusion. and. Synchronization. Operating Systems: Internals. and. Design Principles Operating Systems: Internals and Design Principles Chapter 5 Concurrency: Mutual Exclusion and Synchronization Seventh Edition By William Stallings Designing correct routines for controlling concurrent

More information

CSC2/458 Parallel and Distributed Systems Scalable Synchronization

CSC2/458 Parallel and Distributed Systems Scalable Synchronization CSC2/458 Parallel and Distributed Systems Scalable Synchronization Sreepathi Pai February 20, 2018 URCS Outline Scalable Locking Barriers Outline Scalable Locking Barriers An alternative lock ticket lock

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information

Mutex Implementation

Mutex Implementation COS 318: Operating Systems Mutex Implementation Jaswinder Pal Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Revisit Mutual Exclusion (Mutex) u Critical

More information

Midterm Exam Amy Murphy 6 March 2002

Midterm Exam Amy Murphy 6 March 2002 University of Rochester Midterm Exam Amy Murphy 6 March 2002 Computer Systems (CSC2/456) Read before beginning: Please write clearly. Illegible answers cannot be graded. Be sure to identify all of your

More information

Chapter 6: Process [& Thread] Synchronization. CSCI [4 6] 730 Operating Systems. Why does cooperation require synchronization?

Chapter 6: Process [& Thread] Synchronization. CSCI [4 6] 730 Operating Systems. Why does cooperation require synchronization? Chapter 6: Process [& Thread] Synchronization CSCI [4 6] 730 Operating Systems Synchronization Part 1 : The Basics Why is synchronization needed? Synchronization Language/Definitions:» What are race conditions?»

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

Process & Thread Management II. Queues. Sleep() and Sleep Queues CIS 657

Process & Thread Management II. Queues. Sleep() and Sleep Queues CIS 657 Process & Thread Management II CIS 657 Queues Run queues: hold threads ready to execute Not a single ready queue; 64 queues All threads in same queue are treated as same priority Sleep queues: hold threads

More information

Process & Thread Management II CIS 657

Process & Thread Management II CIS 657 Process & Thread Management II CIS 657 Queues Run queues: hold threads ready to execute Not a single ready queue; 64 queues All threads in same queue are treated as same priority Sleep queues: hold threads

More information

Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors

Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 21 30 March 2017 Summary

More information