Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology and Performance


Sanjeev Kumar†  Dongming Jiang†  Rohit Chandra*  Jaswinder Pal Singh†
†Department of Computer Science, Princeton University, {skumar,dj,jps}@cs.princeton.edu
*Silicon Graphics Inc.

Abstract

Synchronization is an area that exhibits rich hardware-software interactions in multiprocessors. It was studied extensively using microbenchmarks a decade ago. However, its performance implications are not well understood on modern systems or on real applications. We study the impact of synchronization primitives and algorithms on a modern, 64-processor, hardware-coherent shared address space multiprocessor: the SGI Origin 2000. In addition to the actual results on a modern system, we examine the key methodological issues in studying synchronization, for both microbenchmarks and applications. We find that although the efficient hardware support (Fetch&Op) for synchronization provided on our machine usually helps lock and barrier microbenchmarks, it does not help in improving application performance when compared to good software algorithms that use the processor-provided LL-SC instructions. This is true even in applications that spend a significant amount of time in synchronization operations. More elaborate hardware support is unlikely to have a significant benefit either. From the applications' perspective, it is usually the waiting time due to load imbalance or serialization that dominates synchronization time, not the overhead of the synchronization operations themselves, even in apparently balanced cases where the overhead may be expected to be substantial.

1 Introduction

Hardware cache-coherent shared address space multiprocessors are becoming increasingly popular for running parallel applications. While communication is implicit via loads and stores in the shared address space programming model, synchronization is explicit via synchronization operations like locks, barriers and semaphores. Synchronization has a rich and diverse history of hardware-software tradeoffs on hardware-coherent machines. While multiprocessors have experimented with full hardware support for high-level synchronization operations such as locks and barriers (e.g. the SGI 4D-240 [3] had a separate synchronization bus, and the Cray T3D [2] had hardware barriers), the current trend is to provide simple atomic primitives and implement higher-level synchronization operations via software algorithms that use these primitives. The performance of synchronization operations like locks and barriers has been widely studied. Some studies have focused on developing better software algorithms, while others have proposed additional hardware support. Several evaluation studies were performed about ten years ago, including a comprehensive one [10] that examined lock and barrier microbenchmarks on a bus-based Sequent Symmetry multiprocessor and a BBN Butterfly distributed-memory, non-coherent shared address space machine.
Since then, other studies of synchronization on cache-coherent systems have been performed, but they have either been performed on simulators [8, 6] or have used microbenchmarks to evaluate synchronization performance [8, 4, 7, 5]. A lot has changed in the last decade since the classic study [10]. On the systems side, scalable, hardware-coherent machines with physically distributed memory, not examined in that study, have become very popular for moderate- to large-scale computing. The speeds of processors relative to memories and interconnects have changed. More importantly, new primitives called load-linked and store-conditional have been developed to implement atomic operations, and have replaced the atomic read-modify-write instructions examined in [10] in many processors' instruction sets. On the workload side, most of the early synchronization studies used microbenchmarks to evaluate algorithms and primitives. Microbenchmarks are useful since they enable easy isolation of performance issues, but the real goal of better synchronization methods is to improve the performance of real applications, which microbenchmarks may not represent well. A substantial number of realistic scalable applications now exist for this programming model. These developments make it important to reexamine synchronization. In particular, not only do library designers need to understand what algorithms work well with what primitives, but multiprocessor architects need to understand the benefits of providing additional hardware support for synchronization in the communication architecture beyond the simple primitives (like LL-SC) already provided in processor instruction sets. We examine these issues, using microbenchmarks and applications on a 64-processor SGI Origin 2000 multiprocessor. This machine is attractive for the study because it provides an aggressive communication architecture and support for both in-cache and at-memory synchronization primitives. Studying synchronization using both microbenchmarks and applications raises a number

of important methodological challenges that have not been adequately addressed before. For each type of workload, we raise these issues and develop methodologies to address them. These include new microbenchmarks, new versions of applications with different types and frequencies of synchronization, a methodology for dealing with the tricky and especially important issue of problem size in applications, and a way of examining whether additional hardware support beyond that already provided on the machine would be useful. The contributions of this work are both in the results obtained as well as in the methodologies described and used. The rest of the paper is organized as follows. Section 2 discusses the scope of the synchronization operations examined. Section 3 discusses the synchronization primitives supported by the Origin 2000 and the algorithms we used. Section 4 uses microbenchmarks to evaluate the synchronization algorithms, while Section 5 uses applications to do the same. Section 6 tries to evaluate the potential benefit of additional hardware support beyond that available in the machine. Section 7 presents the results when using smaller problem sizes. Finally, Section 8 summarizes the main conclusions.

2 Scope

Locks and barriers are the two most popular synchronization mechanisms under the shared memory model. Locks, which provide mutual exclusion (or atomicity) for operations on shared data, allow processes to maintain the integrity of shared resources by serializing their access to the resources in critical sections, the code protected by the lock. Locks do not enforce any particular order in which the processes execute their critical sections. Barriers, which perform event synchronization, allow a group of processes to synchronize with each other before any one of them can go forward. We focus on locks and barriers in this study. More specific point-to-point event synchronization between processes may be used as well, either by using shared variables as flags or with semaphores. However, these forms of synchronization are less common and are not considered here. Barriers may be performed among a subset of processes or globally among all; the global case is the most common and stresses synchronization performance the most, and hence we focus on it. Also, we do not consider multiprogrammed workloads in this study. From a process's point of view, every synchronization event involves two stages: a waiting stage in which it waits for the other processes involved in the operation to arrive at the synchronization point, and an overhead stage in which it does the necessary communication and bookkeeping operations required to go forward. Synchronization primitives and algorithms may differ with regard to the overhead stage and potentially also the waiting stage (depending on how spinning is done). Most methods target improving the overhead stage; however, the improvements can affect the waiting stage as well, for example by reducing the effective sizes of critical sections and hence resulting in less serialization and waiting time for processes that are further down the chain at contended locks.

3 Primitives and Algorithms

We have implemented a variety of spinlock and barrier algorithms on the Origin 2000. Before presenting a brief description of the various implementations, we give an overview of the synchronization primitives provided on the Origin 2000 for efficient synchronization.
3.1 Primitives on the Origin 2000

The SGI Origin 2000 [7, 5] is a scalable shared address space (CC-NUMA) multiprocessing architecture. The communication architecture is very tightly integrated, with the stated goal of treating a local memory reference as simply a small optimization of a general DSM memory reference (the ratio of local to remote access latency is 1:3). We used a 64-processor Origin 2000 for this study. Implementing synchronization operations requires atomic read-modify-write operations on memory locations. The Origin 2000 provides two separate primitives to implement these operations, with very different performance characteristics.

The first primitive (LL-SC) involves a pair of instructions that can be used to implement atomic operations on cacheable memory locations. The first instruction, load-linked (LL), loads a memory location into a register. This can be followed by an arbitrary sequence of instructions not involving a memory operation. Then a second special instruction, store-conditional (SC), to the same location is used. The SC will succeed only if no other processor has written to that location since the LL instruction was executed. Thus a successful SC indicates a successful read-modify-write operation on the memory location. If the SC fails, the entire operation must be retried. LL-SC is a flexible mechanism that can be used to implement a variety of atomic read-modify-write operations like test&set, Fetch&Op and compare&swap. On one hand, the performance of atomic operations degrades as the number of processors trying to update the same location increases, due to an increase in failed LL-SC sequences. On the other hand, when a processor is spinning on a location waiting for updates to it, it does this in the cache and no additional traffic is generated.

The second primitive (Fetch&Op) supports at-memory atomic read-modify-write operations on special uncached memory locations. Only a few atomic operations like increment, decrement, logical-and and logical-or are supported directly on this machine. Operations like swap that are needed to implement some synchronization algorithms are not supported. Since the Fetch&Op locations are uncached, and therefore do not involve cache coherence operations, the atomic updates always involve exactly one round-trip network transaction. However, even simple reads and writes to these locations always incur network transactions. Therefore, spinning on one of these memory locations while waiting for a synchronization event constantly generates network traffic.

The Origin 2000 does not provide more sophisticated hardware support for synchronization primitives, such as queue-on-lock-bit (QOLB) [6] or full hardware locks and barriers.

Footnote: An LL to a location causes the cache line to be fetched in a read-shared state. An SC to the same location requires the cache line to be upgraded to the write-exclusive state before the store can complete. One way to increase the probability of success of the LL-SC sequence is to do a store to a different location on the same line before the LL instruction. This results in the line being brought into the cache in the write-exclusive state at the start of the sequence, so that no network transactions are required between the LL and the SC instructions, thereby increasing their chances of success.
We use this technique in the implementation of fetch-and-increment operations but not in the implementation of test-and-set, because the SC is not always attempted in the latter case.
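For concreteness, the retry structure of such an LL-SC read-modify-write can be sketched in C. Portable C cannot express LL and SC directly, so this sketch (not the implementation used in the study) stands in a C11 compare-exchange-weak loop, which on MIPS-class processors typically compiles to exactly an LL ... SC ... branch-on-failure sequence.

    #include <stdatomic.h>

    /* Sketch: atomic fetch-and-increment as an LL-SC style retry loop.
     * atomic_compare_exchange_weak is the portable stand-in; when the
     * "SC" fails because another processor wrote the location, the
     * whole read-modify-write is retried. The write-exclusive prefetch
     * trick described in the footnote above (a store to another word
     * on the same line before the LL) is omitted here. */
    static int fetch_and_increment(atomic_int *loc)
    {
        int old = atomic_load_explicit(loc, memory_order_relaxed);  /* "LL" */
        while (!atomic_compare_exchange_weak(loc, &old, old + 1))   /* "SC" */
            ;   /* SC failed; old has been reloaded, retry */
        return old;
    }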

A recent simulation study concludes that hardware support for queue-on-lock-bit (QOLB) locks makes a substantial difference to the performance of applications [6]. However, that study uses small problem sizes, and it reports only percentage improvements and not baseline parallel performance or speedups. We cannot examine the benefits of QOLB directly on a real machine that does not provide QOLB support, but we discuss the issue of additional hardware support in Section 6.

We used the hardware cycle counter supported by the Origin 2000 to do all our timing measurements. Our measurements indicate that an access to the counter takes about 0.3 μs, while our CLOCK routine that uses this counter takes about 0.4 μs.

3.2 Spinlocks

The various spinlock algorithms implemented differ in three ways: processor overhead, memory usage and network traffic generated. We use three basic algorithms (Table 1), as described below. Other lock algorithms exist, but these three are good representatives.

simple. In a simple lock, a processor trying to acquire a lock atomically checks to see if it is available and, if available, marks it unavailable. This is an unfair lock in which the processors do not necessarily succeed in acquiring the lock in the requested order. When a lock is freed, every processor waiting for it tries to acquire it, resulting in poor performance under contention (Section 4). The problem can be alleviated by using exponential backoff between retries.

ticket. A ticket lock is a fair lock in which every processor wanting to acquire a lock increments a global counter to determine its position in the queue. All processors spin on a second global counter, which gets incremented every time a lock is released (like a sandwich line at a deli). The advantage is that when a lock is released, only the one processor whose number matches tries to acquire it. However, the coherence actions resulting from invalidating all the spinning processors due to the counter increment on a release, and having them all immediately check the counter value, result in performance degradation as the contention increases. The coherence actions for all processors except the first one in the queue can be delayed by using backoff proportional to each processor's position in the queue. When Fetch&Op is used to implement the ticket lock, even the spinning is not in-cache, and the backoff decreases the amount of network traffic generated.

MCS. An MCS lock [11] is also a fair lock; it uses a distributed linked list to maintain the queue of waiters. Each waiter spins on a separate node of the linked list. This allows the processor releasing the lock to selectively signal only the processor waiting at the head of the queue, thereby avoiding the unnecessary invalidations of the ticket lock. Also, a particular processor always spins on the same memory location while trying to acquire a given lock, so the algorithm can benefit from allocating the corresponding node close to the processor. However, MCS locks require space proportional to the number of waiting processors.

We implemented all these algorithms using the LL-SC primitives. We also implemented the ticket lock (with and without proportional backoff) using the Fetch&Op primitive. However, we did not implement MCS using only Fetch&Op, because MCS requires an atomic swap operation, which is not directly supported by the Fetch&Op primitives.
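As a rough illustration of the first two algorithms, the following sketch shows a simple (test&set) lock with exponential backoff and a ticket lock with proportional backoff, written with C11 atomics rather than raw LL-SC; BACKOFF_UNIT, the backoff cap and the idle spin loops are assumed tuning details, not values from this study.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define BACKOFF_UNIT 64          /* assumed tuning constant */

    /* Simple (test&set) lock with exponential backoff between retries. */
    typedef struct { atomic_bool held; } simple_lock;

    static void simple_acquire(simple_lock *l)
    {
        unsigned delay = 1;
        while (atomic_exchange(&l->held, true)) {   /* test&set */
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                   /* idle spin */
            if (delay < 1024)
                delay *= 2;                         /* exponential backoff */
        }
    }

    static void simple_release(simple_lock *l)
    {
        atomic_store(&l->held, false);
    }

    /* Ticket lock with proportional backoff. */
    typedef struct {
        atomic_uint next_ticket;   /* incremented by each arriving processor */
        atomic_uint now_serving;   /* incremented on each release */
    } ticket_lock;

    static void ticket_acquire(ticket_lock *l)
    {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1);
        unsigned ahead;
        while ((ahead = my - atomic_load(&l->now_serving)) != 0) {
            /* back off in proportion to our position in the queue, so
             * that waiters far down the line re-check less often */
            for (volatile unsigned i = 0; i < ahead * BACKOFF_UNIT; i++)
                ;
        }
    }

    static void ticket_release(ticket_lock *l)
    {
        atomic_fetch_add(&l->now_serving, 1);
    }

The proportional backoff uses the waiter's known queue position (my - now_serving) to delay each re-check, which is what defers the coherence actions described above for all but the first processor in the queue.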
We tried implementing a hybrid version of MCS that uses LL-SC for queuing and Fetch&Op for signaling during lock transfer. Such a lock would benefit from generating only local traffic (due to spinning on an uncached location) and would not create a hot spot in the system. However, it would generate more traffic than locks like llscTicketProp. Unfortunately, we could not use this lock, for several reasons. First, we could not allocate enough Fetch&Op locations for all the locks in the applications (each lock requires 64 locations). Second, since we did not pin the processes to processors, the processes are sometimes rescheduled by the OS to run on different processors, resulting in a lot of remote network traffic. Finally, since the memory hub caches only one 64-bit Fetch&Op location, a single 64-bit location would have to be used as two 32-bit locations (one for each of the two processors sharing the hub) to get good performance.

3.3 Barriers

Like spinlocks, the various barrier algorithms also differ in overhead, memory usage and network traffic generated. We implemented two barrier algorithms (Table 1), as described below. Although there are many more software barrier algorithms, the difference between them is very small on a cache-coherent machine.

Central. In a centralized barrier, every processor increments a global counter and then waits for all the processors to arrive at the barrier by spinning on a flag. The last processor to arrive at the barrier signals all the others by setting the flag. Since access to the counter is serialized, centralized barriers can yield bad performance.

Tournament. To avoid serialization at the counter, a tournament barrier [10] uses a binary tree. Every processor starts at a separate leaf. At every node, one of the child processors moves up the tree after both child processors have arrived. Thus, one processor reaches the root of the tree after all the processors have arrived at the barrier. The barrier completion event is then propagated down the tree.

We implemented both these algorithms using the LL-SC primitives. Since the Fetch&Op mechanism benefits the central algorithm, we also implemented two Fetch&Op versions of the central algorithm: one (called hybridCentral) that uses Fetch&Op for incrementing the counter and another (fopCentral) that uses it for both incrementing and waiting.
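A centralized barrier of the kind just described can be sketched as follows, again with C11 atomics standing in for LL-SC; the sense-reversing flag (each barrier episode flips its polarity) is an assumed standard detail that lets the same barrier object be reused across iterations.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Centralized sense-reversing barrier: each processor increments a
     * shared counter; the last arrival resets the counter and flips the
     * global flag that all earlier arrivals are spinning on. */
    typedef struct {
        atomic_uint count;
        atomic_bool sense;
        unsigned    nprocs;
    } central_barrier;

    /* local_sense is per-processor state, flipped on every episode. */
    static void barrier_wait(central_barrier *b, bool *local_sense)
    {
        *local_sense = !*local_sense;
        if (atomic_fetch_add(&b->count, 1) == b->nprocs - 1) {
            atomic_store(&b->count, 0);               /* last arrival resets */
            atomic_store(&b->sense, *local_sense);    /* ...and releases all */
        } else {
            while (atomic_load(&b->sense) != *local_sense)
                ;                                     /* spin, in-cache */
        }
    }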

  Spinlock:
    llscSimple       Simple lock using LL-SC
    llscSimpleExp    Simple lock with exponential backoff using LL-SC
    llscTicket       Ticket lock using LL-SC
    llscTicketProp   Ticket lock with proportional backoff using LL-SC
    fopTicket        Ticket lock using Fetch&Op
    fopTicketProp    Ticket lock with proportional backoff using Fetch&Op
    llscMcs          MCS lock using LL-SC

  Barrier:
    llscCentral      Central barrier using LL-SC
    fopCentral       Central barrier using Fetch&Op, spinning on an uncached location
    hybridCentral    Central barrier using Fetch&Op, spinning on a location in cache
    llscTournament   Tournament barrier using LL-SC

  Table 1: Synchronization algorithms implemented.

4 Microbenchmarks

In this section, we present the microbenchmarks we used in our study and the results obtained.

4.1 Spinlocks

The design of lock microbenchmarks raises some interesting methodological issues. We first discuss these, then describe the microbenchmarks we developed in response, and finally present results.

Methodology and Microbenchmarks

Most older studies (e.g. [10]) used a simple microbenchmark in which each of the participating processors would acquire and release a lock a certain number of times. When the number of participating processors was 1, this measured the latency of a lock acquire and release operation. When the number of processors was higher (above 4), it measured the time to transfer a lock from the processor releasing the lock to a waiting processor. When the number of processors was 2 or 3, it measured a combination of the two, because the higher latency to grab a lock over the network would favor the processor that released the lock reacquiring it locally. The growing difference between network latency and local cache access time has resulted in local reacquires succeeding more often, except when a fair lock algorithm is used, even when the number of processors is fairly large. So this microbenchmark on its own is inadequate. To ensure fairness, more recent studies [1, 8, 6] introduce delays between the lock operations. As before, each participating processor performs a fixed number of lock acquire and release operations. However, unlike before, after acquiring the lock, the processor waits for a certain amount of time in the critical section before releasing the lock. And after releasing the lock, it waits for a random period of time before trying to reacquire the lock, thus reducing unfairness due to local reacquires. However, due to the random delays used, the exact amount of contention (the number of processors waiting for a lock when it is released) cannot be easily determined. Also, the contention behavior of this microbenchmark changes across platforms with different processor speeds and network latencies.

We propose three microbenchmarks (lock-delay, lock-null and lock-traffic) to measure the different aspects of lock operations. The lock-delay microbenchmark is similar to the one described above in that it uses delays both inside as well as outside the critical section. However, it uses fixed rather than random delays in both places. The delay (D_o) outside the critical section has to be larger than the interprocessor lock transfer time (we used 8.31 μs, which corresponds to 200 iterations of an idle loop); this ensures that the local processor will not succeed in reacquiring the lock. The delay (D_i) inside the critical section should be large enough that the last processor that released the lock is already waiting to acquire the lock before the lock is released (we used 24.93 μs, which is 3 × D_o). This ensures that the amount of lock contention when using P processors is precisely known to be (P - 1). By using appropriate delays, this microbenchmark can be used across machines as well. For this microbenchmark, we compute

  TimePerLock = ExecutionTime / NoOfLockAcquires - D_i - D_o   when P = 1
  TimePerLock = ExecutionTime / NoOfLockAcquires - D_i         when P > 1
When P is 1, both the delays are in the critical path and are subtracted, so TimePerLock represents the time for an uncontended acquire and release by the same processor. When P is greater than 1, only the delay inside the critical section is in the critical path and is subtracted; TimePerLock then represents the time to transfer a lock from the processor releasing the lock to the next processor that acquires it, in the presence of P - 1 waiting processors.

The second microbenchmark, lock-null, does not use any delays at all: each of the processors does lock acquires and releases a fixed number of times. This is the same microbenchmark that was used in older studies [10], as described earlier, and that penalizes fair locks. It also exposes lock algorithms that use excessive backoff to reduce network contention. Here

  TimePerLock = ExecutionTime / NoOfLockAcquires

In this case, the exact amount of contention, and whether or not the lock was transferred, is not known.

Finally, the lock-traffic microbenchmark tries to measure the effect of the network traffic generated by the waiting processors on other processors that are doing useful work while the lock is being held. This microbenchmark is like the lock-delay benchmark, but instead of an idle loop for the delay inside the critical section, it generates network transactions by doing reads to uncached locations in the loop (we used 60 reads). This measures the impact on these network transactions of the network traffic generated while waiting on the locks. The delay outside still satisfies the lock-delay criterion. Since the read latency on a DSM machine depends on the relative locations of the processor and the memory module, we allocate an uncached memory location on each of the memory modules and read them in a round-robin fashion. We measure the time spent inside the critical sections and compute

  TimePerRead = TotalTimeInsideCriticalSection / (NoOfLockAcquires × NoOfReads)

In this case, TimePerRead represents the round-trip read latency in the presence of NoOfProcessors - 1 processors contending for the lock.
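Putting the lock-delay definitions together, each processor's measurement loop might look like the following sketch, reusing the ticket_lock sketch from Section 3.2; ITERS, now_usec() and idle_usec() are hypothetical helpers standing in for the iteration count, the cycle-counter CLOCK routine, and a calibrated idle loop.

    #define ITERS 10000          /* hypothetical iteration count */
    #define D_IN  24.93          /* D_i in usec (3 x D_o) */
    #define D_OUT  8.31          /* D_o in usec */

    extern double now_usec(void);        /* hypothetical CLOCK routine */
    extern void   idle_usec(double us);  /* hypothetical calibrated idle loop */

    /* Per-processor loop of the lock-delay microbenchmark. */
    double lock_delay(ticket_lock *l, int nprocs)
    {
        double start = now_usec();
        for (int i = 0; i < ITERS; i++) {
            ticket_acquire(l);
            idle_usec(D_IN);    /* hold long enough that the previous
                                   releaser is already waiting again */
            ticket_release(l);
            idle_usec(D_OUT);   /* prevent an unfair local reacquire */
        }
        double per = (now_usec() - start) / ITERS;
        /* subtract whatever delay sits on the critical path */
        return nprocs == 1 ? per - D_IN - D_OUT : per - D_IN;
    }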

[Figure 1: Microbenchmarks for spinlocks and barriers. The graphs plot TimePerLock, TimePerBarrier and TimePerRead (in μs) against the number of processors for the lock and barrier microbenchmarks.]

It is clear from the above discussion that, especially due to the presence of unfair locks, the delays used in designing microbenchmarks must be carefully chosen and can be machine-dependent. Simple microbenchmarks without delays can be useful but are limited, and others should be used in conjunction with them.

Results

The results from the various microbenchmarks (listed in Table 1) are presented in Figure 1. The results were obtained by taking the minimum times from 10 different runs. Each run involved a fixed total number of critical section accesses. The lock-delay graph shows that llscSimple and llscTicket perform poorly. llscSimpleExp does only slightly better. However, llscTicketProp performs significantly better because of the proportional backoff, although it is hurt when the contention is low. As expected, llscMcs performs the best among the LL-SC locks. It is worth noting that the extra network transactions that llscMcs incurs [11] when there is only one waiting processor can be observed. The performance of the two Fetch&Op based ticket locks is similar up to 16 processors, beyond which the reduced contention at memory due to proportional backoff pays off. Overall, the fopTicketProp lock performs the best under any contention, due to the faster at-memory lock transfers and the reduced network traffic due to backoff. Looking at the single-processor data, we see that the LL-SC locks can be up to 5 times faster than the Fetch&Op locks when the lock is available locally.

The lock-null graph shows that the unfair locks benefit significantly in the absence of delays. Also, the performance numbers for the fair locks are usually better than the corresponding numbers in the lock-delay graph. This is because, in the absence of delays, the amount of contention is often smaller than the number of participating processors. Finally, the lock-traffic graph shows that only fopTicket generates enough network traffic to affect the performance of single-processor reads in the critical section. Overall, fopTicketProp performs the best among all the lock implementations on the microbenchmarks, while llscMcs performs the best among the LL-SC locks.

4.2 Barriers

There are several aspects of a barrier that influence application performance. In this section, we present a set of microbenchmarks we developed and present their results.

Methodology and Microbenchmarks

We use three microbenchmarks (barrier-null, barrier-delay, and barrier-traffic). In the barrier-null microbenchmark, all the processors simply execute a fixed number of barriers in a loop. This microbenchmark has been used in other studies [10] and it measures worst-case performance. We compute

  TimePerBarrier = ExecutionTime / NoOfBarriers

The barrier-delay microbenchmark is like the barrier-null microbenchmark, but it adds delays between barriers to stagger arrivals at the barrier and hence simulate load imbalance. Every barrier has two phases: one where all the

processors arrive at the barrier and one where all the processors leave the barrier after the last processor has arrived. The load imbalance in an application can hide the entire overhead of the first phase, so this microbenchmark measures the overhead of the second phase. In each loop iteration, all but one processor use a smaller delay D_s (we used 410 μs, implemented as iterations of an idle loop) while the remaining processor uses the larger delay D_l (we used 820 μs, which is 2 × D_s). The smaller delay has to be large enough to ensure that there is no interference between the two barriers. The difference between D_l and D_s should be large enough that all the processors using the D_s delay finish the first phase of the barrier and are waiting for the last processor (using the D_l delay) before it arrives at the barrier. Since only the larger delay is in the critical path of the microbenchmark, we compute

  TimePerBarrier = ExecutionTime / NoOfBarriers - D_l

It is important to change the processor that executes the larger delay in each iteration (we use a round-robin approach). This is because the last processor arriving at the barrier triggers the second phase of the barrier and is, therefore, often the first processor to complete the barrier. So letting the same processor always use the larger delay in this microbenchmark often underestimates the cost of the barrier.

Finally, the barrier-traffic microbenchmark tries to measure the effect of the network traffic generated by the barrier on processors not participating in the barrier. In this microbenchmark, all but one processor repeatedly perform barrier operations without any delays. The remaining processor, which does not participate in the barrier, issues read operations to uncached locations in all the memory modules, much like the lock-traffic microbenchmark. When it is done, it signals all the other processors to stop doing the barriers by setting a flag. Here, we measure

  TimePerRead = TotalTimeSpentDoingReads / NoOfReads

Results

The results from the barrier microbenchmarks are shown in Figure 1. The results were obtained by taking the minimum times from 10 different runs. Each run involved a fixed total number of barriers. The barrier-null graph shows that llscCentral performs significantly worse than all the other barriers. This is due to the serialization while performing the fetch-and-increment on the barrier counter using LL-SC. The remaining barriers are quite similar in performance. The barrier-delay graph shows that llscCentral benefits substantially from the staggering while the others show only modest gains, resulting in similar performance for all the barriers. This suggests that, in applications with some load imbalance, llscCentral may perform on par with the other barriers. fopCentral performs best because of at-memory barrier signaling, while llscTournament performs worst because of the wakeup tree it uses. Finally, the barrier-traffic graph shows that llscCentral (because of unsuccessful LL-SC sequences) and fopCentral (because of spinning on an uncached location) generate the most traffic. Overall, hybridCentral performs the best, while llscTournament is the best LL-SC barrier.
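Analogously, each processor's loop in the barrier-delay microbenchmark might look like the following sketch, reusing the central_barrier sketch from Section 3.3; NBARRIERS, now_usec() and idle_usec() are again hypothetical helpers, and the late arrival rotates round-robin as described above.

    #define NBARRIERS 1000       /* hypothetical iteration count */
    #define D_SMALL   410.0      /* D_s in usec */
    #define D_LARGE   820.0      /* D_l = 2 x D_s */

    extern double now_usec(void);
    extern void   idle_usec(double us);

    /* Per-processor loop of the barrier-delay microbenchmark. The
     * processor taking the larger delay rotates round-robin so that a
     * different processor is last to arrive in each iteration. */
    double barrier_delay(central_barrier *b, bool *sense, int me, int nprocs)
    {
        double start = now_usec();
        for (int i = 0; i < NBARRIERS; i++) {
            idle_usec(i % nprocs == me ? D_LARGE : D_SMALL);
            barrier_wait(b, sense);
        }
        /* only the larger delay is on each iteration's critical path */
        return (now_usec() - start) / NBARRIERS - D_LARGE;
    }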
5 Applications

Evaluating the impact of synchronization performance on applications is important for several reasons. First, microbenchmarks rarely capture every aspect of primitive performance (like memory usage). Even when they do, it is hard to predict their impact on application performance. For example, a lock or barrier that generates a lot of additional network traffic might have little impact on applications. Second, although it might be possible to conjure up scenarios that benefit from a certain hardware feature in microbenchmarks, they might not occur in real applications, and the feature may not in fact be justified. Third, even in applications that spend significant time in synchronization operations, the synchronization time might be dominated by wait time due to load imbalance and lock serialization in the application, which better implementations of locks and barriers may not be helpful in reducing.

5.1 Methodology

Using applications to study synchronization raises several new methodological issues. Not handling them carefully can lead to misleading results. They include choosing the applications, choosing problem sizes, and choosing suitable metrics. Let us discuss each in turn.

Applications. We chose seven of the SPLASH-2 shared-address-space parallel applications [12] that display a wide range of synchronization behaviors. Although we will see that, for the base problem sizes we choose, they spend a substantial amount of time at synchronization points, the SPLASH-2 applications are quite optimized for parallel performance and usually perform synchronization only when really needed. It is reasonable to expect versions of the same or similar applications produced by non-expert programmers to have more synchronization. We modify some of these applications, without going overboard, by undoing some of the optimizations. In Ocean, for instance, the barriers we added result in code that a parallelizing compiler might generate. The modifications are described in Table 2. Interestingly, we found that these modifications had little impact on the overall speedup of the applications. However, we have used only the modified versions of the applications throughout the paper, because they involve more synchronization.

Problem Sizes. Problem size is a very important issue that even previous application studies have largely ignored. Generally, the larger the problem size, the lower the frequency of synchronization relative to computation. On one hand, using large problem sizes will therefore make synchronization operations seem less important. On the other hand, small problem sizes might result in very low speedup, making them uninteresting on a machine of this scale. We choose problem sizes as follows. For each application, we examine parallel speedups for a range of valid problem sizes using our best lock and barrier implementations. We choose a threshold of 40% parallel efficiency (speedup over the best sequential version divided by the number of processors), which we consider to be adequate speedup on a machine of this scale. (If there is no such problem size available for an application, e.g. for Radiosity with the available data sets, we choose the problem size at hand that achieves the best parallel efficiency, even though it is below 40%.) Among the problem sizes that achieve more than

40% speedup, we picked the one that shows the maximum change when the spinlock or the barrier implementation is varied.

  Application      Input             Modification to original SPLASH-2 version
  raytrace         car (128x128)     None
  barnes           512K particles    Lock on cells in tree partitioning and center-of-mass
                                     calculation, in addition to tree cell updates
  radiosity        room              None
  ocean            1026x1026 grid    Barrier after each grid computation, instead of
                                     barriers after each phase only
  volrend          512x512x512 head  Lock on task queue conditional updates, instead of
                                     locking on updates only
  water-spatial    512 molecules     None
  water-nsquared   4096 molecules    Lock on conditional updates of molecules in force
                                     calculation, instead of locking on updates only

  Table 2: Applications.

Table 2 shows the applications and the problem sizes we selected using this methodology. Since our choice of 40% is somewhat arbitrary, we examine the impact of choosing smaller problem sizes in Section 7.

Metrics. One natural metric for studying how much a synchronization algorithm or primitive matters is the fraction by which it improves execution time or parallel speedup on a multiprocessor. While this is useful, and it does matter, it can be misleading since it does not convey the actual parallel speedup before or after. For instance, a 20% performance improvement is not interesting if the actual speedup is 2 on a 64-processor machine, while it is when the actual speedup is 40. We therefore present both the base speedup as well as the change in speedup due to a synchronization algorithm or primitive.

5.2 Application Characteristics

We start by presenting the application characteristics with respect to synchronization operations. This allows us to understand, and even predict, the impact of the synchronization operations' performance on applications. First, we instrumented the code by inserting CLOCK calls around all lock and barrier operations (the llscMcs lock and llscTournament barrier were used). We took the best run from 10 different runs. We compared the total running time of the applications with and without instrumentation to determine the perturbation introduced. In all but one application, the difference was less than 10% (for radiosity, it was 22%). This instrumentation was not used in the experiments that measure the impact of synchronization primitives on applications.

Table 3 shows a few aspects of the application behavior with respect to synchronization. It shows the breakdown of the total running time into time spent in computation outside critical sections, in computation inside critical sections, in lock operations and in barrier operations. (The computation time includes time spent in memory accesses.) We can see that synchronization accounts for a significant portion of the execution time in most applications. It should be noted that some of the cost of the synchronization operations (like application slowdown due to the network traffic they generate) can result in an increase in the non-synchronization time.

To get detailed measurements of the lock contention in applications, we used an instrumented version of the llscTicketProp lock. At every lock acquire, the instrumentation code records the number of processors already waiting on the lock (the difference between the ticket being served and the ticket just assigned). We ran the applications using this instrumented lock but without the instrumentation code described in the previous paragraphs. As before, we took the best run from 10 different runs.
In every application, the perturbation introduced by these measurements was less than 10%. Figure 2 shows the number of processors waiting at a lock when a new request for it arrives. For 0 waiters, the bar is split into two cases whose costs can be very different: the dark bar (local) is for the case where the lock requester was the last releaser (so in an LL-SC implementation it will likely reacquire the lock in-cache), while the light bar (remote) is for the case where the lock has to be acquired from someone else but is found to be free. The Y-axis is on a log scale, which can make it difficult to read. For this reason, a separate line graph uses a simple calculation to estimate the contribution of each of the bars to the lock operation time. This line uses a linear scale on the Y-axis and estimates the percentage contribution (see Figure 2) of each of the bars to the total time spent in lock operations. Note that fairly short bars (even on the log scale) on the right side of the graph often make a significant contribution to the lock overhead.

Figure 2 shows that the characteristics of the lock synchronization and contention in our applications are indeed quite varied. For locks, raytrace, radiosity, ocean and water-spatial have small critical sections but high contention. Barnes, volrend and water-nsquared have fairly low contention. Although volrend has a fair amount of locking (to protect task queues for stealing), most of the time is spent manipulating a processor's own task queue, and the lock is available in the local cache.

Our measurements can be used to predict and explain the benefit of other proposed techniques to reduce synchronization overhead. For instance, using reactive synchronization [8] to combine the best performing lock in the absence of contention (the llscSimple lock) with the best lock in the presence of contention (like llscMcs) is unlikely to benefit any of these applications, because the uncontended lock acquires account for a small fraction of the total running time (Table 3 and Figure 2). Similarly, compiler prefetching of locks [6] can help barnes and water-nsquared by at most 4% and should have little effect on the other applications. This is because prefetching converts the remote uncontended lock acquires (the remote, 0-processor case in Figure 2) into local uncontended lock acquires. A negligible improvement from prefetching locks was indeed observed in [6], though it was speculated there that this may have to do with the prefetching not being aggressive enough.

For barriers, although a fair amount of time is spent in barrier operations, the bulk of this time is waiting time.

[Figure 2: Lock contention. For each application, bars show (on a log scale) the number of lock acquires that found a given number of processors waiting; for 0 waiters the bar is split into local and remote cases. A line graph (linear scale) estimates the percentage contribution of each bar to the total time spent in lock operations, computed as follows.]

  ContributionPerAcquire = LocalReacquireTime                    (local)
  ContributionPerAcquire = LockTransferTime
                           + (LockTransferTime + CriticalSection) × Contention

  %Contribution = (ContributionPerAcquire × NoOfAcquires)
                  / Σ (ContributionPerAcquire × NoOfAcquires) × 100
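Read as code, the estimate in Figure 2 amounts to the following small computation over the recorded contention histogram; the histogram layout and the per-event cost parameters are hypothetical names for the measured quantities.

    /* Sketch of the %Contribution estimate from Figure 2. acquires[w]
     * is a hypothetical histogram of lock acquires that found w
     * processors already waiting (0-waiter local reacquires counted
     * separately in local0); t_local, t_xfer and t_cs are the assumed
     * costs of a local reacquire, a lock transfer, and the critical
     * section. */
    void contribution_pct(long local0, const long acquires[], int max_w,
                          double t_local, double t_xfer, double t_cs,
                          double *local_pct, double pct[])
    {
        double local = t_local * local0, total = local;
        for (int w = 0; w <= max_w; w++) {
            /* remote acquire: one transfer, plus (transfer + critical
             * section) for each processor already queued ahead */
            pct[w] = (t_xfer + (t_xfer + t_cs) * w) * acquires[w];
            total += pct[w];
        }
        *local_pct = local * 100.0 / total;
        for (int w = 0; w <= max_w; w++)
            pct[w] = pct[w] * 100.0 / total;
    }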

         Raytrace  Barnes  Radiosity  Ocean   Volrend  Water-Spatial  Water-Nsquared
  (A)       -        -        -         -        -          -              -
  (B)     58.99%   78.27%   50.68%    76.01%   95.25%    75.84%         71.84%
  (C)      0.72%    5.53%    1.88%     0.02%    0.43%     0.15%         11.63%
  (D)     30.88%   10.53%   47.12%     0.84%    4.12%    10.29%          6.81%
  (E)      9.41%    5.67%    0.33%    23.13%    0.21%    13.72%          9.72%
  (F)       -        -        -         -        -          -              -
  (G)       -        -        -         -        -          -              -
  (H)      5.52    11.82     3.68      2.72     2.53      4.26             -
  (I)       -        -        -         -        -          -              -
  (J)       -        -       894       681      722       321             -
  (K)       -        -        -       2949       -       2344             -

  Table 3: Application Characteristics on 64 Processors: (A) total running time; (B) percentage of running time spent computing outside critical sections; (C) percentage of running time spent computing inside critical sections; (D) percentage of running time spent in lock operations; (E) percentage of running time spent in barrier operations; (F) number of critical sections; (G) average time spent in lock operations per critical section; (H) average size of a critical section; (I) number of barriers; (J) average time spent in barrier operations per barrier per processor; (K) average time between two barriers. All times are in μs.

We infer that the barrier time is mostly waiting time from the big difference between the barrier overhead measured by the microbenchmarks and the time spent in barrier operations in the applications. Overall, the speedups achieved (which make these problem sizes realistic for a machine of this scale), the varied synchronization characteristics, and the significant fraction of time spent at synchronization points make our applications and problem sizes meaningful for a study of synchronization methods.

5.3 Performance

Table 4 shows the performance of the various applications with different spinlocks and barriers. The results presented are the best application performance from 10 runs. In most cases, the results from several runs were fairly close to one another. We point out any inconsistent results in the discussion below. We chose the best performing LL-SC spinlock (llscMcs) and barrier (llscTournament) from our microbenchmarks as our base case. The first row in the table shows the parallel speedup achieved in the base case. All other rows show the relative parallel speedup with respect to the base case (i.e. greater than 1 represents a better speedup than the base case). The second set of rows shows the performance of the applications when the lock algorithm is varied while using the same barrier algorithm (llscTournament). The last set of rows shows the performance of the applications when the barrier algorithm is varied while keeping the lock algorithm fixed (llscMcs).

Volrend has the smallest amount of time spent in synchronization operations. Although most of the runs generate fairly consistent results, about one run in 20 happens to be up to 40% faster than the rest. As a result, about half of the measurements (we pick the minimum from 10 runs) shown in the table have a significantly higher speedup than the base case. In these cases, we looked at all the runs to draw our conclusions.

Looking at the second set of rows, obtained by varying the lock algorithm, we can see that the llscSimple, llscSimpleExp and llscTicket locks often perform significantly worse than llscMcs (which is used in the base case). llscTicketProp performance is similar to llscMcs in all but one application (radiosity). The worse performance in radiosity is probably due to the hot spot created by high contention on a single lock. fopTicketProp performs significantly better than llscMcs in just one application (raytrace). In the remaining applications, it performs significantly worse.
This is contrary to our microbenchmark measurements, which indicated that the proportional backoff succeeds in making the extra traffic generated by spinning on an uncached location insignificant. The worse performance may be due to one of two reasons. First, in the presence of the network traffic generated by other processors in the application, the extra traffic generated by spinning on an uncached location (even with proportional backoff) could become significant. Second, all the processors waiting on the same lock could create a hot spot in the system, thereby degrading performance.

Looking at the last set of rows, obtained by varying the barrier algorithm, we can see that fopCentral performs consistently worse than the llscTournament barrier (which is used in the base case). However, though the barrier-traffic benchmark shows that llscCentral generates as much traffic as fopCentral, most of its traffic is generated when all the processors arrive at the barrier at the same time. Since the applications have significant load imbalance at barriers, the performance of llscCentral is fairly similar to that of hybridCentral and llscTournament (the slightly different performance in the case of radiosity and volrend is due to noise).

Overall, we found that the LL-SC based algorithms performed as well as, and often better than, the implementations using Fetch&Op, for both locks and barriers. (Note that we have not performed an exhaustive study of algorithms that can be tailor-made to leverage the Fetch&Op mechanism.) This is contrary to the microbenchmark results, where the best performing algorithms used the Fetch&Op primitive. For locks, this is because the microbenchmarks did not capture the full impact of the network traffic generated by the Fetch&Op primitives. For barriers, this is because load imbalance dominated the time spent in barrier operations. As a result, improving the barrier algorithm did not make any significant impact on the applications.

6 Is Further Hardware Support Valuable?

We mentioned in the introduction that an important question for hardware designers is whether to provide hardware

support for synchronization in the communication architecture. The Origin 2000 provides hardware support for Fetch&Op. Further hardware support has been proposed, ranging from special synchronization buses or networks that don't interfere with data traffic (e.g. in the SGI 4D-240, the Cray T3D, and the CM-5) to more sophisticated hardware primitives like QOLB, which essentially implements the queue of lock waiters in hardware so that a lock transfer costs only one network transaction in the presence of contention.

  Synchronization   Raytrace  Barnes   Radiosity  Ocean    Volrend  Water-Spatial  Water-Nsquared
  Base              (22.49)   (45.64)  (8.02)     (34.57)  (21.18)  (34.69)        (45.20)
  llscSimple           -         -        -          -        -         -              -
  llscSimpleExp        -         -        -          -        -         -              -
  llscTicket           -         -        -          -        -         -              -
  llscTicketProp       -         -        -          -        -         -              -
  fopTicket            -         -        -          -        -         -              -
  fopTicketProp        -         -        -          -        -         -              -
  llscCentral          -         -        -          -        -         -              -
  fopCentral           -         -        -          -        -         -              -
  hybridCentral        -         -        -          -        -         -              -

  Table 4: Application performance. The first row shows the absolute speedups for the base case (which uses the llscMcs lock and llscTournament barrier). The other rows are speedups relative to the base case. The second set of rows presents the relative speedups while varying the lock alone. The final set of rows presents the relative speedups while varying the barrier alone.

6.1 Spinlocks

Methodology

Determining the effect of additional hardware support that does not exist on the machine being used is difficult. It can be done via simulation, which has its problems, as discussed earlier. We do something reasonable here, recognizing that any methodology for this issue on a real system will have its flaws. Since we cannot improve the primitives on a real machine, what we do instead is to make them worse by varying degrees, look at the impact of doing so, and see if the trend tells us what we might expect by extrapolating in the other direction. Such a methodology has been used before on real systems, in the context of determining the impact of communication parameters on a cluster [9].

Apart from reducing the overhead or occupancy needed for processing at the nodes involved in the synchronization, what hardware support really buys us is a reduction in the number of network transactions. For example, compared to the best performing lock in applications, the llscMcs lock, QOLB hardware support reduces the number of transactions needed to transfer a lock from 5 (7 when there is just a single waiter) to 1 [6]. Our technique for worsening synchronization behavior is therefore to add network transactions after the lock has been acquired, to simulate the lock transfer taking more network transactions. We use the llscMcs lock as the base case. However, when acquiring a lock that was last accessed locally, most lock algorithms will not generate network traffic. So the network transactions are added only in the case when the lock was last acquired remotely. Adding these transactions increases both the latency in the critical path of the lock and the traffic in the data network.

Results

We present results for this experiment for both microbenchmarks and applications. We ran the microbenchmarks on the worsened versions of the locks. To save space, we present just the 1-processor and 64-processor numbers from the lock-delay microbenchmark in Table 5. The base-delay version includes the extra code needed to generate varying amounts of local and remote delays, but uses no delays. The local-delay version adds a delay of about 4 μs to the base-delay version when the lock is available locally. The remote-delay_n version adds n round-trip network transactions to the base-delay version after each lock acquire. The network transactions are generated by doing reads to different uncached locations, much like in the lock-traffic and barrier-traffic microbenchmarks.
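The worsening wrapper can be sketched as follows, assuming a hypothetical mcs_acquire() that reports whether the lock was transferred from another processor, and a hypothetical array of uncached locations whose reads each cost one network round trip.

    #include <stdbool.h>

    typedef struct mcs_lock mcs_lock;            /* opaque; sketch only */
    extern bool mcs_acquire(mcs_lock *l);        /* hypothetical: returns true
                                                    if the lock came from a
                                                    remote processor */
    extern void mcs_release(mcs_lock *l);
    extern void uncached_read(volatile long *p); /* hypothetical: one network
                                                    round trip per call */
    extern volatile long uncached[];             /* distinct uncached locations */

    /* remote-delay_n: after a remote lock transfer, issue n extra
     * round-trip network transactions; local reacquires generate no
     * network traffic to begin with and are left alone. */
    static void remote_delay_acquire(mcs_lock *l, int n)
    {
        if (mcs_acquire(l))
            for (int i = 0; i < n; i++)
                uncached_read(&uncached[i]);
    }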
The microbenchmarks simply confirm that 4 μs is added in the local-delay version in the 1-processor case, while n round-trip network transactions are added in the remote-delay_n version.

Table 6 shows the performance of the different applications when using the delay versions of the llscMcs lock. As explained before, the fluctuation in the performance of volrend is not because of the delays added; looking at all the runs, we found that the delays (in both the local and the remote case) made little difference to volrend. The local-delay version did not significantly affect any of the applications. This agrees with our application measurements in Section 5.2, which suggested that the time spent acquiring uncontended locks that were last accessed locally accounts for a very small fraction of the total running time. The additional delays added in the remote-delay_n case significantly affect the performance of locks in only two cases: raytrace and radiosity. Extrapolating the results in the other direction (even linearly, which may be excessive) suggests that reducing the number of network transactions in the lock algorithm from 5 to 1 (recall that 2 network transactions are added to remote-delay_n each time) may improve the performance of radiosity and raytrace by about 15-20%. The change should not have any significant effect on the other applications.

Discussion

A reduction in lock transfer overhead has a significant impact only on applications that have relatively small critical sections and that spend a significant amount of time in synchronization operations. This is because a higher lock transfer overhead, in effect, just dilates the critical section; a larger critical section will therefore experience a proportionately smaller benefit from reducing the lock transfer time (see the formulas in Figure 2). However, for most applications with small critical sections where lock serialization is a problem, simple, well-


More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Types of Synchronization

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid Chapter 5 Thread-Level Parallelism Abdullah Muzahid 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors is saturating + Modern multiple issue processors are becoming very complex

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs Authors: Jos e L. Abell an, Juan Fern andez and Manuel E. Acacio Presenter: Guoliang Liu Outline Introduction Motivation Background

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012) Lecture 19: Synchronization CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 4 due tonight at 11:59 PM Synchronization primitives (that we have or will

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

SHARED-MEMORY COMMUNICATION

SHARED-MEMORY COMMUNICATION SHARED-MEMORY COMMUNICATION IMPLICITELY VIA MEMORY PROCESSORS SHARE SOME MEMORY COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES NEED TO SYNCHRONIZE NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES

More information

Programming as Successive Refinement. Partitioning for Performance

Programming as Successive Refinement. Partitioning for Performance Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing

More information

Midterm Exam Amy Murphy 6 March 2002

Midterm Exam Amy Murphy 6 March 2002 University of Rochester Midterm Exam Amy Murphy 6 March 2002 Computer Systems (CSC2/456) Read before beginning: Please write clearly. Illegible answers cannot be graded. Be sure to identify all of your

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

CS 326: Operating Systems. CPU Scheduling. Lecture 6

CS 326: Operating Systems. CPU Scheduling. Lecture 6 CS 326: Operating Systems CPU Scheduling Lecture 6 Today s Schedule Agenda? Context Switches and Interrupts Basic Scheduling Algorithms Scheduling with I/O Symmetric multiprocessing 2/7/18 CS 326: Operating

More information

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019 250P: Computer Systems Architecture Lecture 14: Synchronization Anton Burtsev March, 2019 Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 2 Constructing Locks Applications

More information

Chapter-4 Multiprocessors and Thread-Level Parallelism

Chapter-4 Multiprocessors and Thread-Level Parallelism Chapter-4 Multiprocessors and Thread-Level Parallelism We have seen the renewed interest in developing multiprocessors in early 2000: - The slowdown in uniprocessor performance due to the diminishing returns

More information

Mutex Implementation

Mutex Implementation COS 318: Operating Systems Mutex Implementation Jaswinder Pal Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Revisit Mutual Exclusion (Mutex) u Critical

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Lecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections )

Lecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections ) Lecture 19: Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 1 Caching Locks Spin lock: to acquire a lock, a process may enter an infinite loop that keeps attempting

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele Large Scale Multiprocessors and Scientific Applications By Pushkar Ratnalikar Namrata Lele Agenda Introduction Interprocessor Communication Characteristics of Scientific Applications Synchronization: Scaling

More information

LOCK PREDICTION TO REDUCE THE OVERHEAD OF SYNCHRONIZATION PRIMITIVES. A Thesis ANUSHA SHANKAR

LOCK PREDICTION TO REDUCE THE OVERHEAD OF SYNCHRONIZATION PRIMITIVES. A Thesis ANUSHA SHANKAR LOCK PREDICTION TO REDUCE THE OVERHEAD OF SYNCHRONIZATION PRIMITIVES A Thesis by ANUSHA SHANKAR Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

Determining the Number of CPUs for Query Processing

Determining the Number of CPUs for Query Processing Determining the Number of CPUs for Query Processing Fatemah Panahi Elizabeth Soechting CS747 Advanced Computer Systems Analysis Techniques The University of Wisconsin-Madison fatemeh@cs.wisc.edu, eas@cs.wisc.edu

More information

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University 740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

The Art and Science of Memory Allocation

The Art and Science of Memory Allocation Logical Diagram The Art and Science of Memory Allocation Don Porter CSE 506 Binary Formats RCU Memory Management Memory Allocators CPU Scheduler User System Calls Kernel Today s Lecture File System Networking

More information

The need for atomicity This code sequence illustrates the need for atomicity. Explain.

The need for atomicity This code sequence illustrates the need for atomicity. Explain. Lock Implementations [ 8.1] Recall the three kinds of synchronization from Lecture 6: Point-to-point Lock Performance metrics for lock implementations Uncontended latency Traffic o Time to acquire a lock

More information

1 Multiprocessors. 1.1 Kinds of Processes. COMP 242 Class Notes Section 9: Multiprocessor Operating Systems

1 Multiprocessors. 1.1 Kinds of Processes. COMP 242 Class Notes Section 9: Multiprocessor Operating Systems COMP 242 Class Notes Section 9: Multiprocessor Operating Systems 1 Multiprocessors As we saw earlier, a multiprocessor consists of several processors sharing a common memory. The memory is typically divided

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Advance Operating Systems (CS202) Locks Discussion

Advance Operating Systems (CS202) Locks Discussion Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1 Program Performance Metrics The parallel run time (Tpar) is the time from the moment when computation starts to the moment when the last processor finished his execution The speedup (S) is defined as the

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

6.852: Distributed Algorithms Fall, Class 15

6.852: Distributed Algorithms Fall, Class 15 6.852: Distributed Algorithms Fall, 2009 Class 15 Today s plan z z z z z Pragmatic issues for shared-memory multiprocessors Practical mutual exclusion algorithms Test-and-set locks Ticket locks Queue locks

More information

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters Amit Karwande, Xin Yuan Dept. of Computer Science Florida State University Tallahassee, FL 32306 {karwande,xyuan}@cs.fsu.edu

More information

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 9: Multiprocessor OSs & Synchronization CSC 469H1F Fall 2006 Angela Demke Brown The Problem Coordinated management of shared resources Resources may be accessed by multiple threads Need to control

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing This document consists of two parts. The first part introduces basic concepts and issues that apply generally in discussions of parallel computing. The second part consists

More information

Lecture #7: Implementing Mutual Exclusion

Lecture #7: Implementing Mutual Exclusion Lecture #7: Implementing Mutual Exclusion Review -- 1 min Solution #3 to too much milk works, but it is really unsatisfactory: 1) Really complicated even for this simple example, hard to convince yourself

More information

Memory Consistency and Multiprocessor Performance

Memory Consistency and Multiprocessor Performance Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2

z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os 1.2 introduced a new heuristic for determining whether it is more efficient in terms

More information

Reactive Synchronization Algorithms for Multiprocessors

Reactive Synchronization Algorithms for Multiprocessors Synchronization Algorithms for Multiprocessors Beng-Hong Lim and Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, MA 039 Abstract Synchronization algorithms

More information

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that

More information

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea Programming With Locks s Tricky Multicore processors are the way of the foreseeable future thread-level parallelism anointed as parallelism model of choice Just one problem Writing lock-based multi-threaded

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Welfare Navigation Using Genetic Algorithm

Welfare Navigation Using Genetic Algorithm Welfare Navigation Using Genetic Algorithm David Erukhimovich and Yoel Zeldes Hebrew University of Jerusalem AI course final project Abstract Using standard navigation algorithms and applications (such

More information

Remaining Contemplation Questions

Remaining Contemplation Questions Process Synchronisation Remaining Contemplation Questions 1. The first known correct software solution to the critical-section problem for two processes was developed by Dekker. The two processes, P0 and

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Parallel Computing Concepts. CSInParallel Project

Parallel Computing Concepts. CSInParallel Project Parallel Computing Concepts CSInParallel Project July 26, 2012 CONTENTS 1 Introduction 1 1.1 Motivation................................................ 1 1.2 Some pairs of terms...........................................

More information

Enhancing Linux Scheduler Scalability

Enhancing Linux Scheduler Scalability Enhancing Linux Scheduler Scalability Mike Kravetz IBM Linux Technology Center Hubertus Franke, Shailabh Nagar, Rajan Ravindran IBM Thomas J. Watson Research Center {mkravetz,frankeh,nagar,rajancr}@us.ibm.com

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Conventions. Barrier Methods. How it Works. Centralized Barrier. Enhancements. Pitfalls

Conventions. Barrier Methods. How it Works. Centralized Barrier. Enhancements. Pitfalls Conventions Barrier Methods Based on Algorithms for Scalable Synchronization on Shared Memory Multiprocessors, by John Mellor- Crummey and Michael Scott Presentation by Jonathan Pearson In code snippets,

More information

Application Layer. Protocol/Programming Model Layer. Communication Layer. Communication Library. Network

Application Layer. Protocol/Programming Model Layer. Communication Layer. Communication Library. Network Limits to the Performance of Software Shared Memory: A Layered Approach Jaswinder Pal Singh, Angelos Bilas, Dongming Jiang and Yuanyuan Zhou Department of Computer Science Princeton University Princeton,

More information

Example: CPU-bound process that would run for 100 quanta continuously 1, 2, 4, 8, 16, 32, 64 (only 37 required for last run) Needs only 7 swaps

Example: CPU-bound process that would run for 100 quanta continuously 1, 2, 4, 8, 16, 32, 64 (only 37 required for last run) Needs only 7 swaps Interactive Scheduling Algorithms Continued o Priority Scheduling Introduction Round-robin assumes all processes are equal often not the case Assign a priority to each process, and always choose the process

More information

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6)

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6) EEC 581 Computer rchitecture Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6) Chansu Yu Electrical and Computer Engineering Cleveland State University cknowledgement Part of class notes

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

IT 540 Operating Systems ECE519 Advanced Operating Systems

IT 540 Operating Systems ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (5 th Week) (Advanced) Operating Systems 5. Concurrency: Mutual Exclusion and Synchronization 5. Outline Principles

More information

Parallel Programming Interfaces

Parallel Programming Interfaces Parallel Programming Interfaces Background Different hardware architectures have led to fundamentally different ways parallel computers are programmed today. There are two basic architectures that general

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.

More information

NOW Handout Page 1. Recap. Protocol Design Space of Snooping Cache Coherent Multiprocessors. Sequential Consistency.

NOW Handout Page 1. Recap. Protocol Design Space of Snooping Cache Coherent Multiprocessors. Sequential Consistency. Recap Protocol Design Space of Snooping Cache Coherent ultiprocessors CS 28, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Snooping cache coherence solve difficult problem by applying

More information

Adaptive Backoff Synchronization Techniques

Adaptive Backoff Synchronization Techniques Adaptive Backoff Synchronization Techniques Anant Agarwal and Mathews Cherian Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, MA 02139 Abstract Shared-memory multiprocessors

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

VIRTUAL MEMORY READING: CHAPTER 9

VIRTUAL MEMORY READING: CHAPTER 9 VIRTUAL MEMORY READING: CHAPTER 9 9 MEMORY HIERARCHY Core! Processor! Core! Caching! Main! Memory! (DRAM)!! Caching!! Secondary Storage (SSD)!!!! Secondary Storage (Disk)! L cache exclusive to a single

More information