Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology and Performance


Sanjeev Kumar†  Dongming Jiang†  Rohit Chandra*  Jaswinder Pal Singh†
†Department of Computer Science, Princeton University, {skumar,dj,jps}@cs.princeton.edu
*Silicon Graphics Inc.

Abstract

Synchronization is an area that exhibits rich hardware-software interactions in multiprocessors. It was studied extensively using microbenchmarks a decade ago. However, its performance implications are not well understood on modern systems or on real applications. We study the impact of synchronization primitives and algorithms on a modern, 64-processor, hardware-coherent shared address space multiprocessor: the SGI Origin 2000. In addition to the actual results on a modern system, we examine the key methodological issues in studying synchronization, for both microbenchmarks and applications. We find that although the efficient hardware support (Fetch&Op) for synchronization provided on our machine usually helps lock and barrier microbenchmarks, it does not help in improving application performance when compared to good software algorithms that use the processor-provided LL-SC instructions. This is true even in applications that spend a significant amount of time in synchronization operations. More elaborate hardware support is unlikely to have a significant benefit either. From the applications' perspective, it is usually the waiting time due to load imbalance or serialization that dominates synchronization time, not the overhead of the synchronization operations themselves, even in apparently balanced cases where the overhead may be expected to be substantial.

1 Introduction

Hardware cache-coherent shared address space multiprocessors are becoming increasingly popular for running parallel applications. While communication is implicit via loads and stores in the shared address space programming model, synchronization is explicit via synchronization operations like locks, barriers and semaphores. Synchronization has a rich and diverse history of hardware-software tradeoffs on hardware-coherent machines. While multiprocessors have experimented with full hardware support for high-level synchronization operations such as locks and barriers (e.g. the SGI 4D-240 [3] had a separate synchronization bus, and the Cray T3D [2] had hardware barriers), the current trend is to provide simple atomic primitives and implement higher-level synchronization operations via software algorithms that use these primitives. The performance of synchronization operations like locks and barriers has been widely studied. Some studies have focused on developing better software algorithms, while others have proposed additional hardware support. Several evaluation studies were performed about ten years ago, including a comprehensive one [10] that examined lock and barrier microbenchmarks on a bus-based Sequent Symmetry multiprocessor and a BBN Butterfly distributed-memory, non-coherent shared address space machine.
Since then, other studies of synchronization on cache-coherent systems have been performed, but they have either been performed on simulators [8, 6] or have used microbenchmarks to evaluate synchronization performance [8, 4, 7, 5]. A lot has changed in the last decade since the classic study [10]. On the systems side, scalable, hardware-coherent machines with physically distributed memory, not examined in that study, have become very popular for moderate- to large-scale computing. The speeds of processors relative to memories and interconnects have changed. More importantly, new primitives called load-linked and store-conditional have been developed to implement atomic operations, and have replaced the atomic read-modify-write instructions examined in [10] in many processors' instruction sets. On the workload side, most of the early synchronization studies used microbenchmarks to evaluate algorithms and primitives. Microbenchmarks are useful since they enable easy isolation of performance issues, but the real goal of better synchronization methods is to improve the performance of real applications, which microbenchmarks may not represent well. A substantial number of realistic scalable applications now exist for this programming model. These developments make it important to reexamine synchronization. In particular, not only do library designers need to understand what algorithms work well with what primitives, but multiprocessor architects need to understand the benefits of providing additional hardware support for synchronization in the communication architecture beyond the simple primitives (like LL-SC) already provided in processor instruction sets. We examine these issues, using microbenchmarks and applications on a 64-processor SGI Origin 2000 multiprocessor. This machine is attractive for the study because it provides an aggressive communication architecture and support for both in-cache and at-memory synchronization primitives. Studying synchronization using both microbenchmarks and applications raises a number

of important methodological challenges that have not been adequately addressed before. For each type of workload, we raise these issues and develop methodologies to address them. These include new microbenchmarks, new versions of applications with different types and frequencies of synchronization, a methodology for dealing with the tricky and especially important issue of problem size in applications, and a way of examining whether additional hardware support beyond that already provided on the machine would be useful. The contributions of this work are both in the results obtained as well as in the methodologies described and used. The rest of the paper is organized as follows. Section 2 discusses the scope of the synchronization operations examined. Section 3 discusses the synchronization primitives supported by the Origin 2000 and the algorithms we used. Section 4 uses microbenchmarks to evaluate the synchronization algorithms, while Section 5 uses applications to do the same. Section 6 tries to evaluate the potential benefit of additional hardware support beyond that available in the machine. Section 7 presents the results when using smaller problem sizes. Finally, Section 8 summarizes the main conclusions.

2 Scope

Locks and barriers are the two most popular synchronization mechanisms under the shared memory model. Locks, which provide mutual exclusion (or atomicity) for operations on shared data, allow processes to maintain the integrity of shared resources by serializing their access to the resources in critical sections, the code protected by the lock. Locks do not enforce any particular order in which the processes execute their critical sections. Barriers, which perform event synchronization, allow a group of processes to synchronize with each other before any one of them can go forward. We focus on locks and barriers in this study. More specific point-to-point event synchronization between processes may be used as well, either by using shared variables as flags or with semaphores. However, these forms of synchronization are less common and are not considered here. Barriers may be performed among a subset of processes or globally among all; the global case is the most common and stresses synchronization performance the most, and hence we focus on it. Also, we do not consider multiprogrammed workloads in this study. From a process's point of view, every synchronization event involves two stages: a waiting stage in which it waits for the other processes involved in the operation to arrive at the synchronization point, and an overhead stage in which it does the necessary communication and bookkeeping operations required to go forward. Synchronization primitives and algorithms may differ with regard to the overhead stage and potentially also the waiting stage (depending on how spinning is done). Most methods target improving the overhead stage; however, the improvements can affect the waiting stage as well, for example by reducing the effective sizes of critical sections and hence resulting in less serialization and waiting time for processes that are further down the chain at contended locks.

3 Primitives and Algorithms

We have implemented a variety of spinlock and barrier algorithms on the Origin 2000. Before presenting a brief description of the various implementations, we give an overview of the synchronization primitives provided on the Origin 2000 for efficient synchronization.
3.1 Primitives on the Origin 2000

The SGI Origin 2000 [7, 5] is a scalable shared address space (CC-NUMA) multiprocessing architecture. The communication architecture is very tightly integrated, with the stated goal of treating a local memory reference as simply a small optimization of a general DSM memory reference (the ratio of local to remote access latency is 1:3). We used a 64-processor Origin 2000 for this study. Implementing synchronization operations requires atomic read-modify-write operations on memory locations. The Origin 2000 provides two separate primitives to implement these operations, with very different performance characteristics.

The first primitive (LL-SC) involves a pair of instructions that can be used to implement atomic operations on cacheable memory locations. The first instruction, load-linked (LL), loads a memory location into a register. This can be followed by an arbitrary sequence of instructions not involving a memory operation. Then a second special instruction, store-conditional (SC), to the same location is used. The SC will succeed only if no other processor has written to that location since the LL instruction was executed. Thus a successful SC indicates a successful read-modify-write operation on the memory location. If the SC fails, the entire operation must be retried. LL-SC is a flexible mechanism that can be used to implement a variety of atomic read-modify-write operations like test&set, Fetch&Op and compare&swap. On one hand, the performance of atomic operations degrades as the number of processors trying to update the same location increases, due to an increase in failed LL-SC sequences. On the other hand, when a processor is spinning on a location waiting for updates to it, it does this in the cache and no additional traffic is generated.

The second primitive (Fetch&Op) supports at-memory atomic read-modify-write operations on special uncached memory locations. Only a few atomic operations like increment, decrement, logical-and and logical-or are supported directly on this machine. Operations like swap that are needed to implement some synchronization algorithms are not supported. Since the Fetch&Op locations are uncached, and therefore do not involve cache coherence operations, the atomic updates always involve exactly one round-trip network transaction. However, even simple reads and writes to these locations always incur network transactions. Therefore, spinning on one of these memory locations while waiting for a synchronization event constantly generates network traffic.

The Origin 2000 does not provide more sophisticated hardware support for synchronization primitives, such as queue-on-lock-bit (QOLB) [6] or full hardware locks and barriers.

Footnote: An LL to a location causes the cache line to be fetched in a read-shared state. An SC to the same location requires the cache line to be upgraded to the write-exclusive state before the store can complete. One way to increase the probability of success of the LL-SC sequence is to do a store to a different location on the same line before the LL instruction. This results in the line being brought into the cache in the write-exclusive state at the start of the sequence, so that no network transactions are required between the LL and the SC instructions, thereby increasing their chances of success.
We use this technique in the implementation of fetch-and-increment operations but not in the implementation of test-and-set, because the SC is not always attempted in the latter case.
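For concreteness, the retry structure of such an LL-SC read-modify-write can be sketched in C. Portable C cannot express LL and SC directly, so this sketch (not the implementation used in the study) stands in a C11 compare-exchange-weak loop, which on MIPS-class processors typically compiles to exactly an LL ... SC ... branch-on-failure sequence.

    #include <stdatomic.h>

    /* Sketch: atomic fetch-and-increment as an LL-SC style retry loop.
     * atomic_compare_exchange_weak is the portable stand-in; when the
     * "SC" fails because another processor wrote the location, the
     * whole read-modify-write is retried. The write-exclusive prefetch
     * trick described in the footnote above (a store to another word
     * on the same line before the LL) is omitted here. */
    static int fetch_and_increment(atomic_int *loc)
    {
        int old = atomic_load_explicit(loc, memory_order_relaxed);  /* "LL" */
        while (!atomic_compare_exchange_weak(loc, &old, old + 1))   /* "SC" */
            ;   /* SC failed; old has been reloaded, retry */
        return old;
    }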

A recent simulation study concludes that hardware support for queue-on-lock-bit (QOLB) locks makes a substantial difference to the performance of applications [6]. However, that study uses small problem sizes, and it reports only percentage improvements and not baseline parallel performance or speedups. We cannot examine the benefits of QOLB directly on a real machine that does not provide QOLB support, but we discuss the issue of additional hardware support in Section 6.

We used the hardware cycle counter supported by the Origin 2000 to do all our timing measurements. Our measurements indicate that an access to the counter takes about 0.3 μs, while our CLOCK routine that uses this counter takes about 0.4 μs.

3.2 Spinlocks

The various spinlock algorithms implemented differ in three ways: processor overhead, memory usage and network traffic generated. We use three basic algorithms (Table 1), as described below. Other lock algorithms exist, but these three are good representatives.

simple. In a simple lock, a processor trying to acquire a lock atomically checks to see if it is available and, if available, marks it unavailable. This is an unfair lock in which the processors do not necessarily succeed in acquiring the lock in the requested order. When a lock is freed, every processor waiting for it tries to acquire it, resulting in poor performance under contention (Section 4). The problem can be alleviated by using exponential backoff between retries.

ticket. A ticket lock is a fair lock in which every processor wanting to acquire a lock increments a global counter to determine its position in the queue. All processors spin on a second global counter, which gets incremented every time a lock is released (like a sandwich line at a deli). The advantage is that when a lock is released, only the one processor whose number matches tries to acquire it. However, the coherence actions resulting from invalidating all the spinning processors due to the counter increment on a release, and having them all immediately check the counter value, result in performance degradation as the contention increases. The coherence actions for all processors except the first one in the queue can be delayed by using backoff proportional to each processor's position in the queue. When Fetch&Op is used to implement the ticket lock, even the spinning is not in-cache, and the backoff decreases the amount of network traffic generated.

MCS. An MCS lock [11] is also a fair lock; it uses a distributed linked list to maintain the queue of waiters. Each waiter spins on a separate node of the linked list. This allows the processor releasing the lock to selectively signal only the processor waiting at the head of the queue, thereby avoiding the unnecessary invalidations of the ticket lock. Also, a particular processor always spins on the same memory location while trying to acquire a given lock, so the algorithm can benefit from allocating the corresponding node close to the processor. However, MCS locks require space proportional to the number of waiting processors.

We implemented all these algorithms using the LL-SC primitives. We also implemented the ticket lock (with and without proportional backoff) using the Fetch&Op primitive. However, we did not implement MCS using only Fetch&Op, because MCS requires an atomic swap operation, which is not directly supported by the Fetch&Op primitives.
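As a rough illustration of the first two algorithms, the following sketch shows a simple (test&set) lock with exponential backoff and a ticket lock with proportional backoff, written with C11 atomics rather than raw LL-SC; BACKOFF_UNIT, the backoff cap and the idle spin loops are assumed tuning details, not values from this study.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define BACKOFF_UNIT 64          /* assumed tuning constant */

    /* Simple (test&set) lock with exponential backoff between retries. */
    typedef struct { atomic_bool held; } simple_lock;

    static void simple_acquire(simple_lock *l)
    {
        unsigned delay = 1;
        while (atomic_exchange(&l->held, true)) {   /* test&set */
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                   /* idle spin */
            if (delay < 1024)
                delay *= 2;                         /* exponential backoff */
        }
    }

    static void simple_release(simple_lock *l)
    {
        atomic_store(&l->held, false);
    }

    /* Ticket lock with proportional backoff. */
    typedef struct {
        atomic_uint next_ticket;   /* incremented by each arriving processor */
        atomic_uint now_serving;   /* incremented on each release */
    } ticket_lock;

    static void ticket_acquire(ticket_lock *l)
    {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1);
        unsigned ahead;
        while ((ahead = my - atomic_load(&l->now_serving)) != 0) {
            /* back off in proportion to our position in the queue, so
             * that waiters far down the line re-check less often */
            for (volatile unsigned i = 0; i < ahead * BACKOFF_UNIT; i++)
                ;
        }
    }

    static void ticket_release(ticket_lock *l)
    {
        atomic_fetch_add(&l->now_serving, 1);
    }

The proportional backoff uses the waiter's known queue position (my - now_serving) to delay each re-check, which is what defers the coherence actions described above for all but the first processor in the queue.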
We tried implementing a hybrid version of MCS that uses LL-SC for queuing and Fetch&Op for signaling during lock transfer. Such a lock would benefit from generating only local traffic (due to spinning on an uncached location) and would not create a hot spot in the system. However, it would generate more traffic than locks like llscTicketProp. Unfortunately, we could not use this lock, for several reasons. First, we could not allocate enough Fetch&Op locations for all the locks in the applications (each lock requires 64 locations). Second, since we did not pin the processes to processors, the processes are sometimes rescheduled by the OS to run on different processors, resulting in a lot of remote network traffic. Finally, since the memory hub caches only one 64-bit Fetch&Op location, a single 64-bit location would have to be used as two 32-bit locations (one for each of the two processors sharing the hub) to get good performance.

3.3 Barriers

Like spinlocks, the various barrier algorithms also differ in overhead, memory usage and network traffic generated. We implemented two barrier algorithms (Table 1), as described below. Although there are many more software barrier algorithms, the difference between them is very small on a cache-coherent machine.

Central. In a centralized barrier, every processor increments a global counter and then waits for all the processors to arrive at the barrier by spinning on a flag. The last processor to arrive at the barrier signals all the others by setting the flag. Since access to the counter is serialized, centralized barriers can yield bad performance.

Tournament. To avoid serialization at the counter, a tournament barrier [10] uses a binary tree. Every processor starts at a separate leaf. At every node, one of the child processors moves up the tree after both child processors have arrived. Thus, one processor reaches the root of the tree after all the processors have arrived at the barrier. The barrier completion event is then propagated down the tree.

We implemented both these algorithms using the LL-SC primitives. Since the Fetch&Op mechanism benefits the central algorithm, we also implemented two Fetch&Op versions of the central algorithm: one (called hybridCentral) that uses Fetch&Op for incrementing the counter and another (fopCentral) that uses it for both incrementing and waiting.
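A centralized barrier of the kind just described can be sketched as follows, again with C11 atomics standing in for LL-SC; the sense-reversing flag (each barrier episode flips its polarity) is an assumed standard detail that lets the same barrier object be reused across iterations.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Centralized sense-reversing barrier: each processor increments a
     * shared counter; the last arrival resets the counter and flips the
     * global flag that all earlier arrivals are spinning on. */
    typedef struct {
        atomic_uint count;
        atomic_bool sense;
        unsigned    nprocs;
    } central_barrier;

    /* local_sense is per-processor state, flipped on every episode. */
    static void barrier_wait(central_barrier *b, bool *local_sense)
    {
        *local_sense = !*local_sense;
        if (atomic_fetch_add(&b->count, 1) == b->nprocs - 1) {
            atomic_store(&b->count, 0);               /* last arrival resets */
            atomic_store(&b->sense, *local_sense);    /* ...and releases all */
        } else {
            while (atomic_load(&b->sense) != *local_sense)
                ;                                     /* spin, in-cache */
        }
    }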

  Spinlock:
    llscSimple       Simple lock using LL-SC
    llscSimpleExp    Simple lock with exponential backoff using LL-SC
    llscTicket       Ticket lock using LL-SC
    llscTicketProp   Ticket lock with proportional backoff using LL-SC
    fopTicket        Ticket lock using Fetch&Op
    fopTicketProp    Ticket lock with proportional backoff using Fetch&Op
    llscMcs          MCS lock using LL-SC

  Barrier:
    llscCentral      Central barrier using LL-SC
    fopCentral       Central barrier using Fetch&Op, spinning on an uncached location
    hybridCentral    Central barrier using Fetch&Op, spinning on a location in cache
    llscTournament   Tournament barrier using LL-SC

  Table 1: Synchronization algorithms implemented.

4 Microbenchmarks

In this section, we present the microbenchmarks we used in our study and the results obtained.

4.1 Spinlocks

The design of lock microbenchmarks raises some interesting methodological issues. We first discuss these, then describe the microbenchmarks we developed in response, and finally present results.

Methodology and Microbenchmarks

Most older studies (e.g. [10]) used a simple microbenchmark in which each of the participating processors would acquire and release a lock a certain number of times. When the number of participating processors was 1, this measured the latency of a lock acquire and release operation. When the number of processors was higher (above 4), it measured the time to transfer a lock from the processor releasing the lock to a waiting processor. When the number of processors was 2 or 3, it measured a combination of the two, because the higher latency to grab a lock over the network would favor the processor that released the lock reacquiring it locally. The growing difference between network latency and local cache access time has resulted in local reacquires succeeding more often, except when a fair lock algorithm is used, even when the number of processors is fairly large. So this microbenchmark on its own is inadequate. To ensure fairness, more recent studies [1, 8, 6] introduce delays between the lock operations. As before, each participating processor performs a fixed number of lock acquire and release operations. However, unlike before, after acquiring the lock, the processor waits for a certain amount of time in the critical section before releasing the lock. And after releasing the lock, it waits for a random period of time before trying to reacquire the lock, thus reducing unfairness due to local reacquires. However, due to the random delays used, the exact amount of contention (the number of processors waiting for a lock when it is released) cannot be easily determined. Also, the contention behavior of this microbenchmark changes across platforms with different processor speeds and network latencies.

We propose three microbenchmarks (lock-delay, lock-null and lock-traffic) to measure the different aspects of lock operations. The lock-delay microbenchmark is similar to the one described above in that it uses delays both inside as well as outside the critical section. However, it uses fixed rather than random delays in both places. The delay (D_o) outside the critical section has to be larger than the interprocessor lock transfer time (we used 8.31 μs, which corresponds to 200 iterations of an idle loop); this ensures that the local processor will not succeed in reacquiring the lock. The delay (D_i) inside the critical section should be large enough that the last processor that released the lock is already waiting to acquire the lock before the lock is released (we used 24.93 μs, which is 3 × D_o). This ensures that the amount of lock contention when using P processors is precisely known to be (P - 1). By using appropriate delays, this microbenchmark can be used across machines as well. For this microbenchmark, we compute

  TimePerLock = ExecutionTime / NoOfLockAcquires - D_i - D_o   when P = 1
  TimePerLock = ExecutionTime / NoOfLockAcquires - D_i         when P > 1
When P is 1, both the delays are in the critical path and are subtracted, so TimePerLock represents the time for an uncontended acquire and release by the same processor. When P is greater than 1, only the delay inside the critical section is in the critical path and is subtracted; TimePerLock then represents the time to transfer a lock from the processor releasing the lock to the next processor that acquires it, in the presence of P - 1 waiting processors.

The second microbenchmark, lock-null, does not use any delays at all: each of the processors does lock acquires and releases a fixed number of times. This is the same microbenchmark that was used in older studies [10], as described earlier, and that penalizes fair locks. It also exposes lock algorithms that use excessive backoff to reduce network contention. Here

  TimePerLock = ExecutionTime / NoOfLockAcquires

In this case, the exact amount of contention, and whether or not the lock was transferred, is not known.

Finally, the lock-traffic microbenchmark tries to measure the effect of the network traffic generated by the waiting processors on other processors that are doing useful work while the lock is being held. This microbenchmark is like the lock-delay benchmark, but instead of an idle loop for the delay inside the critical section, it generates network transactions by doing reads to uncached locations in the loop (we used 60 reads). This measures the impact on these network transactions of the network traffic generated while waiting on the locks. The delay outside still satisfies the lock-delay criterion. Since the read latency on a DSM machine depends on the relative locations of the processor and the memory module, we allocate an uncached memory location on each of the memory modules and read them in a round-robin fashion. We measure the time spent inside the critical sections and compute

  TimePerRead = TotalTimeInsideCriticalSection / (NoOfLockAcquires × NoOfReads)

In this case, TimePerRead represents the round-trip read latency in the presence of NoOfProcessors - 1 processors contending for the lock.
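Putting the lock-delay definitions together, each processor's measurement loop might look like the following sketch, reusing the ticket_lock sketch from Section 3.2; ITERS, now_usec() and idle_usec() are hypothetical helpers standing in for the iteration count, the cycle-counter CLOCK routine, and a calibrated idle loop.

    #define ITERS 10000          /* hypothetical iteration count */
    #define D_IN  24.93          /* D_i in usec (3 x D_o) */
    #define D_OUT  8.31          /* D_o in usec */

    extern double now_usec(void);        /* hypothetical CLOCK routine */
    extern void   idle_usec(double us);  /* hypothetical calibrated idle loop */

    /* Per-processor loop of the lock-delay microbenchmark. */
    double lock_delay(ticket_lock *l, int nprocs)
    {
        double start = now_usec();
        for (int i = 0; i < ITERS; i++) {
            ticket_acquire(l);
            idle_usec(D_IN);    /* hold long enough that the previous
                                   releaser is already waiting again */
            ticket_release(l);
            idle_usec(D_OUT);   /* prevent an unfair local reacquire */
        }
        double per = (now_usec() - start) / ITERS;
        /* subtract whatever delay sits on the critical path */
        return nprocs == 1 ? per - D_IN - D_OUT : per - D_IN;
    }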

[Figure 1: Microbenchmarks for spinlocks and barriers. The graphs plot TimePerLock, TimePerBarrier and TimePerRead (in μs) against the number of processors for the lock and barrier microbenchmarks.]

It is clear from the above discussion that, especially due to the presence of unfair locks, the delays used in designing microbenchmarks must be carefully chosen and can be machine-dependent. Simple microbenchmarks without delays can be useful but are limited, and others should be used in conjunction with them.

Results

The results from the various microbenchmarks (listed in Table 1) are presented in Figure 1. The results were obtained by taking the minimum times from 10 different runs. Each run involved a fixed total number of critical section accesses. The lock-delay graph shows that llscSimple and llscTicket perform poorly. llscSimpleExp does only slightly better. However, llscTicketProp performs significantly better because of the proportional backoff, although it is hurt when the contention is low. As expected, llscMcs performs the best among the LL-SC locks. It is worth noting that the extra network transactions that llscMcs incurs [11] when there is only one waiting processor can be observed. The performance of the two Fetch&Op based ticket locks is similar up to 16 processors, beyond which the reduced contention at memory due to proportional backoff pays off. Overall, the fopTicketProp lock performs the best under any contention, due to the faster at-memory lock transfers and the reduced network traffic due to backoff. Looking at the single-processor data, we see that the LL-SC locks can be up to 5 times faster than the Fetch&Op locks when the lock is available locally.

The lock-null graph shows that the unfair locks benefit significantly in the absence of delays. Also, the performance numbers for the fair locks are usually better than the corresponding numbers in the lock-delay graph. This is because, in the absence of delays, the amount of contention is often smaller than the number of participating processors. Finally, the lock-traffic graph shows that only fopTicket generates enough network traffic to affect the performance of single-processor reads in the critical section. Overall, fopTicketProp performs the best among all the lock implementations on the microbenchmarks, while llscMcs performs the best among the LL-SC locks.

4.2 Barriers

There are several aspects of a barrier that influence application performance. In this section, we present a set of microbenchmarks we developed and present their results.

Methodology and Microbenchmarks

We use three microbenchmarks (barrier-null, barrier-delay, and barrier-traffic). In the barrier-null microbenchmark, all the processors simply execute a fixed number of barriers in a loop. This microbenchmark has been used in other studies [10] and it measures worst-case performance. We compute

  TimePerBarrier = ExecutionTime / NoOfBarriers

The barrier-delay microbenchmark is like the barrier-null microbenchmark, but it adds delays between barriers to stagger arrivals at the barrier and hence simulate load imbalance. Every barrier has two phases: one where all the

processors arrive at the barrier and one where all the processors leave the barrier after the last processor has arrived. The load imbalance in an application can hide the entire overhead of the first phase, so this microbenchmark measures the overhead of the second phase. In each loop iteration, all but one processor use a smaller delay D_s (we used 410 μs, implemented as iterations of an idle loop) while the remaining processor uses the larger delay D_l (we used 820 μs, which is 2 × D_s). The smaller delay has to be large enough to ensure that there is no interference between the two barriers. The difference between D_l and D_s should be large enough that all the processors using the D_s delay finish the first phase of the barrier and are waiting for the last processor (using the D_l delay) before it arrives at the barrier. Since only the larger delay is in the critical path of the microbenchmark, we compute

  TimePerBarrier = ExecutionTime / NoOfBarriers - D_l

It is important to change the processor that executes the larger delay in each iteration (we use a round-robin approach). This is because the last processor arriving at the barrier triggers the second phase of the barrier and is, therefore, often the first processor to complete the barrier. So letting the same processor always use the larger delay in this microbenchmark often underestimates the cost of the barrier.

Finally, the barrier-traffic microbenchmark tries to measure the effect of the network traffic generated by the barrier on processors not participating in the barrier. In this microbenchmark, all but one processor repeatedly perform barrier operations without any delays. The remaining processor, which does not participate in the barrier, issues read operations to uncached locations in all the memory modules, much like the lock-traffic microbenchmark. When it is done, it signals all the other processors to stop doing the barriers by setting a flag. Here, we measure

  TimePerRead = TotalTimeSpentDoingReads / NoOfReads

Results

The results from the barrier microbenchmarks are shown in Figure 1. The results were obtained by taking the minimum times from 10 different runs. Each run involved a fixed total number of barriers. The barrier-null graph shows that llscCentral performs significantly worse than all the other barriers. This is due to the serialization while performing the fetch-and-increment on the barrier counter using LL-SC. The remaining barriers are quite similar in performance. The barrier-delay graph shows that llscCentral benefits substantially from the staggering while the others show only modest gains, resulting in similar performance for all the barriers. This suggests that, in applications with some load imbalance, llscCentral may perform on par with the other barriers. fopCentral performs best because of at-memory barrier signaling, while llscTournament performs worst because of the wakeup tree it uses. Finally, the barrier-traffic graph shows that llscCentral (because of unsuccessful LL-SC sequences) and fopCentral (because of spinning on an uncached location) generate the most traffic. Overall, hybridCentral performs the best, while llscTournament is the best LL-SC barrier.
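Analogously, each processor's loop in the barrier-delay microbenchmark might look like the following sketch, reusing the central_barrier sketch from Section 3.3; NBARRIERS, now_usec() and idle_usec() are again hypothetical helpers, and the late arrival rotates round-robin as described above.

    #define NBARRIERS 1000       /* hypothetical iteration count */
    #define D_SMALL   410.0      /* D_s in usec */
    #define D_LARGE   820.0      /* D_l = 2 x D_s */

    extern double now_usec(void);
    extern void   idle_usec(double us);

    /* Per-processor loop of the barrier-delay microbenchmark. The
     * processor taking the larger delay rotates round-robin so that a
     * different processor is last to arrive in each iteration. */
    double barrier_delay(central_barrier *b, bool *sense, int me, int nprocs)
    {
        double start = now_usec();
        for (int i = 0; i < NBARRIERS; i++) {
            idle_usec(i % nprocs == me ? D_LARGE : D_SMALL);
            barrier_wait(b, sense);
        }
        /* only the larger delay is on each iteration's critical path */
        return (now_usec() - start) / NBARRIERS - D_LARGE;
    }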
5 Applications

Evaluating the impact of synchronization performance on applications is important for several reasons. First, microbenchmarks rarely capture every aspect of primitive performance (like memory usage). Even when they do, it is hard to predict their impact on application performance. For example, a lock or barrier that generates a lot of additional network traffic might have little impact on applications. Second, although it might be possible to conjure up scenarios that benefit from a certain hardware feature in microbenchmarks, they might not occur in real applications, and the feature may not in fact be justified. Third, even in applications that spend significant time in synchronization operations, the synchronization time might be dominated by wait time due to load imbalance and lock serialization in the application, which better implementations of locks and barriers may not be helpful in reducing.

5.1 Methodology

Using applications to study synchronization raises several new methodological issues. Not handling them carefully can lead to misleading results. They include choosing the applications, choosing problem sizes, and choosing suitable metrics. Let us discuss each in turn.

Applications. We chose seven of the SPLASH-2 shared-address-space parallel applications [12] that display a wide range of synchronization behaviors. Although we will see that, for the base problem sizes we choose, they spend a substantial amount of time at synchronization points, the SPLASH-2 applications are quite optimized for parallel performance and usually perform synchronization only when really needed. It is reasonable to expect versions of the same or similar applications produced by non-expert programmers to have more synchronization. We modify some of these applications, without going overboard, by undoing some of the optimizations. In Ocean, for instance, the barriers we added result in code that a parallelizing compiler might generate. The modifications are described in Table 2. Interestingly, we found that these modifications had little impact on the overall speedup of the applications. However, we have used only the modified versions of the applications throughout the paper, because they involve more synchronization.

Problem Sizes. Problem size is a very important issue that even previous application studies have largely ignored. Generally, the larger the problem size, the lower the frequency of synchronization relative to computation. On one hand, using large problem sizes will therefore make synchronization operations seem less important. On the other hand, small problem sizes might result in very low speedup, making them uninteresting on a machine of this scale. We choose problem sizes as follows. For each application, we examine parallel speedups for a range of valid problem sizes using our best lock and barrier implementations. We choose a threshold of 40% parallel efficiency (speedup over the best sequential version divided by the number of processors), which we consider to be adequate speedup on a machine of this scale. (If there is no such problem size available for an application, e.g. for Radiosity with the available data sets, we choose the problem size at hand that achieves the best parallel efficiency, even though it is below 40%.) Among the problem sizes that achieve more than

40% speedup, we picked the one that shows the maximum change when the spinlock or the barrier implementation is varied.

  Application      Input             Modification to original SPLASH-2 version
  raytrace         car (128x128)     None
  barnes           512K particles    Lock on cells in tree partitioning and center-of-mass
                                     calculation, in addition to tree cell updates
  radiosity        room              None
  ocean            1026x1026 grid    Barrier after each grid computation, instead of
                                     barriers after each phase only
  volrend          512x512x512 head  Lock on task queue conditional updates, instead of
                                     locking on updates only
  water-spatial    512 molecules     None
  water-nsquared   4096 molecules    Lock on conditional updates of molecules in force
                                     calculation, instead of locking on updates only

  Table 2: Applications.

Table 2 shows the applications and the problem sizes we selected using this methodology. Since our choice of 40% is somewhat arbitrary, we examine the impact of choosing smaller problem sizes in Section 7.

Metrics. One natural metric for studying how much a synchronization algorithm or primitive matters is the fraction by which it improves execution time or parallel speedup on a multiprocessor. While this is useful, and it does matter, it can be misleading since it does not convey the actual parallel speedup before or after. For instance, a 20% performance improvement is not interesting if the actual speedup is 2 on a 64-processor machine, while it is when the actual speedup is 40. We therefore present both the base speedup as well as the change in speedup due to a synchronization algorithm or primitive.

5.2 Application Characteristics

We start by presenting the application characteristics with respect to synchronization operations. This allows us to understand, and even predict, the impact of the synchronization operations' performance on applications. First, we instrumented the code by inserting CLOCK calls around all lock and barrier operations (the llscMcs lock and llscTournament barrier were used). We took the best run from 10 different runs. We compared the total running time of the applications with and without instrumentation to determine the perturbation introduced. In all but one application, the difference was less than 10% (for radiosity, it was 22%). This instrumentation was not used in the experiments that measure the impact of synchronization primitives on applications.

Table 3 shows a few aspects of the application behavior with respect to synchronization. It shows the breakdown of the total running time into time spent in computation outside critical sections, in computation inside critical sections, in lock operations and in barrier operations. (The computation time includes time spent in memory accesses.) We can see that synchronization accounts for a significant portion of the execution time in most applications. It should be noted that some of the cost of the synchronization operations (like application slowdown due to the network traffic they generate) can result in an increase in the non-synchronization time.

To get detailed measurements of the lock contention in applications, we used an instrumented version of the llscTicketProp lock. At every lock acquire, the instrumentation code records the number of processors already waiting on the lock (the difference between the ticket being served and the ticket just assigned). We ran the applications using this instrumented lock but without the instrumentation code described in the previous paragraphs. As before, we took the best run from 10 different runs.
In every application, the perturbation introduced by these measurements was less than 10%. Figure 2 shows the number of processors waiting at a lock when a new request for it arrives. For 0 waiters, the bar is split into two cases whose costs can be very different: the dark bar (local) is for the case where the lock requester was the last releaser (so in an LL-SC implementation it will likely reacquire the lock in-cache), while the light bar (remote) is for the case where the lock has to be acquired from someone else but is found to be free. The Y-axis is on a log scale, which can make it difficult to read. For this reason, a separate line graph uses a simple calculation to estimate the contribution of each of the bars to the lock operation time. This line uses a linear scale on the Y-axis and estimates the percentage contribution (see Figure 2) of each of the bars to the total time spent in lock operations. Note that fairly short bars (even on the log scale) on the right side of the graph often make a significant contribution to the lock overhead.

Figure 2 shows that the characteristics of the lock synchronization and contention in our applications are indeed quite varied. For locks, raytrace, radiosity, ocean and water-spatial have small critical sections but high contention. Barnes, volrend and water-nsquared have fairly low contention. Although volrend has a fair amount of locking (to protect task queues for stealing), most of the time is spent manipulating a processor's own task queue, and the lock is available in the local cache.

Our measurements can be used to predict and explain the benefit of other proposed techniques to reduce synchronization overhead. For instance, using reactive synchronization [8] to combine the best performing lock in the absence of contention (the llscSimple lock) with the best lock in the presence of contention (like llscMcs) is unlikely to benefit any of these applications, because the uncontended lock acquires account for a small fraction of the total running time (Table 3 and Figure 2). Similarly, compiler prefetching of locks [6] can help barnes and water-nsquared by at most 4% and should have little effect on the other applications. This is because prefetching converts the remote uncontended lock acquires (the remote, 0-processor case in Figure 2) into local uncontended lock acquires. A negligible improvement from prefetching locks was indeed observed in [6], though it was speculated there that this may have to do with the prefetching not being aggressive enough.

For barriers, although a fair amount of time is spent in barrier operations, the bulk of this time is waiting time.

[Figure 2: Lock contention. For each application, bars show (on a log scale) the number of lock acquires that found a given number of processors waiting; for 0 waiters the bar is split into local and remote cases. A line graph (linear scale) estimates the percentage contribution of each bar to the total time spent in lock operations, computed as follows.]

  ContributionPerAcquire = LocalReacquireTime                    (local)
  ContributionPerAcquire = LockTransferTime
                           + (LockTransferTime + CriticalSection) × Contention

  %Contribution = (ContributionPerAcquire × NoOfAcquires)
                  / Σ (ContributionPerAcquire × NoOfAcquires) × 100
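Read as code, the estimate in Figure 2 amounts to the following small computation over the recorded contention histogram; the histogram layout and the per-event cost parameters are hypothetical names for the measured quantities.

    /* Sketch of the %Contribution estimate from Figure 2. acquires[w]
     * is a hypothetical histogram of lock acquires that found w
     * processors already waiting (0-waiter local reacquires counted
     * separately in local0); t_local, t_xfer and t_cs are the assumed
     * costs of a local reacquire, a lock transfer, and the critical
     * section. */
    void contribution_pct(long local0, const long acquires[], int max_w,
                          double t_local, double t_xfer, double t_cs,
                          double *local_pct, double pct[])
    {
        double local = t_local * local0, total = local;
        for (int w = 0; w <= max_w; w++) {
            /* remote acquire: one transfer, plus (transfer + critical
             * section) for each processor already queued ahead */
            pct[w] = (t_xfer + (t_xfer + t_cs) * w) * acquires[w];
            total += pct[w];
        }
        *local_pct = local * 100.0 / total;
        for (int w = 0; w <= max_w; w++)
            pct[w] = pct[w] * 100.0 / total;
    }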

         Raytrace  Barnes  Radiosity  Ocean   Volrend  Water-Spatial  Water-Nsquared
  (A)       -        -        -         -        -          -              -
  (B)     58.99%   78.27%   50.68%    76.01%   95.25%    75.84%         71.84%
  (C)      0.72%    5.53%    1.88%     0.02%    0.43%     0.15%         11.63%
  (D)     30.88%   10.53%   47.12%     0.84%    4.12%    10.29%          6.81%
  (E)      9.41%    5.67%    0.33%    23.13%    0.21%    13.72%          9.72%
  (F)       -        -        -         -        -          -              -
  (G)       -        -        -         -        -          -              -
  (H)      5.52    11.82     3.68      2.72     2.53      4.26             -
  (I)       -        -        -         -        -          -              -
  (J)       -        -       894       681      722       321             -
  (K)       -        -        -       2949       -       2344             -

  Table 3: Application Characteristics on 64 Processors: (A) total running time; (B) percentage of running time spent computing outside critical sections; (C) percentage of running time spent computing inside critical sections; (D) percentage of running time spent in lock operations; (E) percentage of running time spent in barrier operations; (F) number of critical sections; (G) average time spent in lock operations per critical section; (H) average size of a critical section; (I) number of barriers; (J) average time spent in barrier operations per barrier per processor; (K) average time between two barriers. All times are in μs.

We infer that the barrier time is mostly waiting time from the big difference between the barrier overhead measured by the microbenchmarks and the time spent in barrier operations in the applications. Overall, the speedups achieved (which make these problem sizes realistic for a machine of this scale), the varied synchronization characteristics, and the significant fraction of time spent at synchronization points make our applications and problem sizes meaningful for a study of synchronization methods.

5.3 Performance

Table 4 shows the performance of the various applications with different spinlocks and barriers. The results presented are the best application performance from 10 runs. In most cases, the results from several runs were fairly close to one another. We point out any inconsistent results in the discussion below. We chose the best performing LL-SC spinlock (llscMcs) and barrier (llscTournament) from our microbenchmarks as our base case. The first row in the table shows the parallel speedup achieved in the base case. All other rows show the relative parallel speedup with respect to the base case (i.e. greater than 1 represents a better speedup than the base case). The second set of rows shows the performance of the applications when the lock algorithm is varied while using the same barrier algorithm (llscTournament). The last set of rows shows the performance of the applications when the barrier algorithm is varied while keeping the lock algorithm fixed (llscMcs).

Volrend has the smallest amount of time spent in synchronization operations. Although most of the runs generate fairly consistent results, about one run in 20 happens to be up to 40% faster than the rest. As a result, about half of the measurements (we pick the minimum from 10 runs) shown in the table have a significantly higher speedup than the base case. In these cases, we looked at all the runs to draw our conclusions.

Looking at the second set of rows, obtained by varying the lock algorithm, we can see that the llscSimple, llscSimpleExp and llscTicket locks often perform significantly worse than llscMcs (which is used in the base case). llscTicketProp performance is similar to llscMcs in all but one application (radiosity). The worse performance in radiosity is probably due to the hot spot created by high contention on a single lock. fopTicketProp performs significantly better than llscMcs in just one application (raytrace). In the remaining applications, it performs significantly worse.
This is contrary to our microbenchmark measurements, which indicated that the proportional backoff succeeds in making the extra traffic generated by spinning on an uncached location insignificant. The worse performance may be due to one of two reasons. First, in the presence of the network traffic generated by other processors in the application, the extra traffic generated by spinning on an uncached location (even with proportional backoff) could become significant. Second, all the processors waiting on the same lock could create a hot spot in the system, thereby degrading performance.

Looking at the last set of rows, obtained by varying the barrier algorithm, we can see that fopCentral performs consistently worse than the llscTournament barrier (which is used in the base case). However, though the barrier-traffic benchmark shows that llscCentral generates as much traffic as fopCentral, most of its traffic is generated when all the processors arrive at the barrier at the same time. Since the applications have significant load imbalance at barriers, the performance of llscCentral is fairly similar to that of hybridCentral and llscTournament (the slightly different performance in the case of radiosity and volrend is due to noise).

Overall, we found that the LL-SC based algorithms performed as well as, and often better than, the implementations using Fetch&Op, for both locks and barriers. (Note that we have not performed an exhaustive study of algorithms that can be tailor-made to leverage the Fetch&Op mechanism.) This is contrary to the microbenchmark results, where the best performing algorithms used the Fetch&Op primitive. For locks, this is because the microbenchmarks did not capture the full impact of the network traffic generated by the Fetch&Op primitives. For barriers, this is because load imbalance dominated the time spent in barrier operations. As a result, improving the barrier algorithm did not make any significant impact on the applications.

6 Is Further Hardware Support Valuable?

We mentioned in the introduction that an important question for hardware designers is whether to provide hardware

support for synchronization in the communication architecture. The Origin 2000 provides hardware support for Fetch&Op. Further hardware support has been proposed, ranging from special synchronization buses or networks that don't interfere with data traffic (e.g. in the SGI 4D-240, the Cray T3D, and the CM-5) to more sophisticated hardware primitives like QOLB, which essentially implements the queue of lock waiters in hardware so that a lock transfer costs only one network transaction in the presence of contention.

  Synchronization   Raytrace  Barnes   Radiosity  Ocean    Volrend  Water-Spatial  Water-Nsquared
  Base              (22.49)   (45.64)  (8.02)     (34.57)  (21.18)  (34.69)        (45.20)
  llscSimple           -         -        -          -        -         -              -
  llscSimpleExp        -         -        -          -        -         -              -
  llscTicket           -         -        -          -        -         -              -
  llscTicketProp       -         -        -          -        -         -              -
  fopTicket            -         -        -          -        -         -              -
  fopTicketProp        -         -        -          -        -         -              -
  llscCentral          -         -        -          -        -         -              -
  fopCentral           -         -        -          -        -         -              -
  hybridCentral        -         -        -          -        -         -              -

  Table 4: Application performance. The first row shows the absolute speedups for the base case (which uses the llscMcs lock and llscTournament barrier). The other rows are speedups relative to the base case. The second set of rows presents the relative speedups while varying the lock alone. The final set of rows presents the relative speedups while varying the barrier alone.

6.1 Spinlocks

Methodology

Determining the effect of additional hardware support that does not exist on the machine being used is difficult. It can be done via simulation, which has its problems, as discussed earlier. We do something reasonable here, recognizing that any methodology for this issue on a real system will have its flaws. Since we cannot improve the primitives on a real machine, what we do instead is to make them worse by varying degrees, look at the impact of doing so, and see if the trend tells us what we might expect by extrapolating in the other direction. Such a methodology has been used before on real systems, in the context of determining the impact of communication parameters on a cluster [9].

Apart from reducing the overhead or occupancy needed for processing at the nodes involved in the synchronization, what hardware support really buys us is a reduction in the number of network transactions. For example, compared to the best performing lock in applications, the llscMcs lock, QOLB hardware support reduces the number of transactions needed to transfer a lock from 5 (7 when there is just a single waiter) to 1 [6]. Our technique for worsening synchronization behavior is therefore to add network transactions after the lock has been acquired, to simulate the lock transfer taking more network transactions. We use the llscMcs lock as the base case. However, when acquiring a lock that was last accessed locally, most lock algorithms will not generate network traffic. So the network transactions are added only in the case when the lock was last acquired remotely. Adding these transactions increases both the latency in the critical path of the lock and the traffic in the data network.

Results

We present results for this experiment for both microbenchmarks and applications. We ran the microbenchmarks on the worsened versions of the locks. To save space, we present just the 1-processor and 64-processor numbers from the lock-delay microbenchmark in Table 5. The base-delay version includes the extra code needed to generate varying amounts of local and remote delays, but uses no delays. The local-delay version adds a delay of about 4 μs to the base-delay version when the lock is available locally. The remote-delay_n version adds n round-trip network transactions to the base-delay version after each lock acquire. The network transactions are generated by doing reads to different uncached locations, much like in the lock-traffic and barrier-traffic microbenchmarks.
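The worsening wrapper can be sketched as follows, assuming a hypothetical mcs_acquire() that reports whether the lock was transferred from another processor, and a hypothetical array of uncached locations whose reads each cost one network round trip.

    #include <stdbool.h>

    typedef struct mcs_lock mcs_lock;            /* opaque; sketch only */
    extern bool mcs_acquire(mcs_lock *l);        /* hypothetical: returns true
                                                    if the lock came from a
                                                    remote processor */
    extern void mcs_release(mcs_lock *l);
    extern void uncached_read(volatile long *p); /* hypothetical: one network
                                                    round trip per call */
    extern volatile long uncached[];             /* distinct uncached locations */

    /* remote-delay_n: after a remote lock transfer, issue n extra
     * round-trip network transactions; local reacquires generate no
     * network traffic to begin with and are left alone. */
    static void remote_delay_acquire(mcs_lock *l, int n)
    {
        if (mcs_acquire(l))
            for (int i = 0; i < n; i++)
                uncached_read(&uncached[i]);
    }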
The microbenchmarks simply confirm that 4 μs is added in the local-delay version in the 1-processor case, while n round-trip network transactions are added in the remote-delay_n version.

Table 6 shows the performance of the different applications when using the delay versions of the llscMcs lock. As explained before, the fluctuation in the performance of volrend is not because of the delays added; looking at all the runs, we found that the delays (in both the local and the remote case) made little difference to volrend. The local-delay version did not significantly affect any of the applications. This agrees with our application measurements in Section 5.2, which suggested that the time spent acquiring uncontended locks that were last accessed locally accounts for a very small fraction of the total running time. The additional delays added in the remote-delay_n case significantly affect the performance of locks in only two cases: raytrace and radiosity. Extrapolating the results in the other direction (even linearly, which may be excessive) suggests that reducing the number of network transactions in the lock algorithm from 5 to 1 (recall that 2 network transactions are added to remote-delay_n each time) may improve the performance of radiosity and raytrace by about 15-20%. The change should not have any significant effect on the other applications.

Discussion

A reduction in lock transfer overhead has a significant impact only on applications that have relatively small critical sections and that spend a significant amount of time in synchronization operations. This is because a higher lock transfer overhead, in effect, just dilates the critical section; a larger critical section will therefore experience a proportionately smaller benefit from reducing the lock transfer time (see the formulas in Figure 2). However, for most applications with small critical sections where lock serialization is a problem, simple, well-


More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Types of Synchronization

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid Chapter 5 Thread-Level Parallelism Abdullah Muzahid 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors is saturating + Modern multiple issue processors are becoming very complex

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs Authors: Jos e L. Abell an, Juan Fern andez and Manuel E. Acacio Presenter: Guoliang Liu Outline Introduction Motivation Background

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012) Lecture 19: Synchronization CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 4 due tonight at 11:59 PM Synchronization primitives (that we have or will

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

SHARED-MEMORY COMMUNICATION

SHARED-MEMORY COMMUNICATION SHARED-MEMORY COMMUNICATION IMPLICITELY VIA MEMORY PROCESSORS SHARE SOME MEMORY COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES NEED TO SYNCHRONIZE NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES

More information

Programming as Successive Refinement. Partitioning for Performance

Programming as Successive Refinement. Partitioning for Performance Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing

More information

Midterm Exam Amy Murphy 6 March 2002

Midterm Exam Amy Murphy 6 March 2002 University of Rochester Midterm Exam Amy Murphy 6 March 2002 Computer Systems (CSC2/456) Read before beginning: Please write clearly. Illegible answers cannot be graded. Be sure to identify all of your

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

CS 326: Operating Systems. CPU Scheduling. Lecture 6

CS 326: Operating Systems. CPU Scheduling. Lecture 6 CS 326: Operating Systems CPU Scheduling Lecture 6 Today s Schedule Agenda? Context Switches and Interrupts Basic Scheduling Algorithms Scheduling with I/O Symmetric multiprocessing 2/7/18 CS 326: Operating

More information

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019 250P: Computer Systems Architecture Lecture 14: Synchronization Anton Burtsev March, 2019 Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 2 Constructing Locks Applications

More information

Chapter-4 Multiprocessors and Thread-Level Parallelism

Chapter-4 Multiprocessors and Thread-Level Parallelism Chapter-4 Multiprocessors and Thread-Level Parallelism We have seen the renewed interest in developing multiprocessors in early 2000: - The slowdown in uniprocessor performance due to the diminishing returns

More information

Mutex Implementation

Mutex Implementation COS 318: Operating Systems Mutex Implementation Jaswinder Pal Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Revisit Mutual Exclusion (Mutex) u Critical

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Lecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections )

Lecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections ) Lecture 19: Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 1 Caching Locks Spin lock: to acquire a lock, a process may enter an infinite loop that keeps attempting

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele Large Scale Multiprocessors and Scientific Applications By Pushkar Ratnalikar Namrata Lele Agenda Introduction Interprocessor Communication Characteristics of Scientific Applications Synchronization: Scaling

More information

LOCK PREDICTION TO REDUCE THE OVERHEAD OF SYNCHRONIZATION PRIMITIVES. A Thesis ANUSHA SHANKAR

LOCK PREDICTION TO REDUCE THE OVERHEAD OF SYNCHRONIZATION PRIMITIVES. A Thesis ANUSHA SHANKAR LOCK PREDICTION TO REDUCE THE OVERHEAD OF SYNCHRONIZATION PRIMITIVES A Thesis by ANUSHA SHANKAR Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

Determining the Number of CPUs for Query Processing

Determining the Number of CPUs for Query Processing Determining the Number of CPUs for Query Processing Fatemah Panahi Elizabeth Soechting CS747 Advanced Computer Systems Analysis Techniques The University of Wisconsin-Madison fatemeh@cs.wisc.edu, eas@cs.wisc.edu

More information

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University 740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

The Art and Science of Memory Allocation

The Art and Science of Memory Allocation Logical Diagram The Art and Science of Memory Allocation Don Porter CSE 506 Binary Formats RCU Memory Management Memory Allocators CPU Scheduler User System Calls Kernel Today s Lecture File System Networking

More information

The need for atomicity This code sequence illustrates the need for atomicity. Explain.

The need for atomicity This code sequence illustrates the need for atomicity. Explain. Lock Implementations [ 8.1] Recall the three kinds of synchronization from Lecture 6: Point-to-point Lock Performance metrics for lock implementations Uncontended latency Traffic o Time to acquire a lock

More information

1 Multiprocessors. 1.1 Kinds of Processes. COMP 242 Class Notes Section 9: Multiprocessor Operating Systems

1 Multiprocessors. 1.1 Kinds of Processes. COMP 242 Class Notes Section 9: Multiprocessor Operating Systems COMP 242 Class Notes Section 9: Multiprocessor Operating Systems 1 Multiprocessors As we saw earlier, a multiprocessor consists of several processors sharing a common memory. The memory is typically divided

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Advance Operating Systems (CS202) Locks Discussion

Advance Operating Systems (CS202) Locks Discussion Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1 Program Performance Metrics The parallel run time (Tpar) is the time from the moment when computation starts to the moment when the last processor finished his execution The speedup (S) is defined as the

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

6.852: Distributed Algorithms Fall, Class 15

6.852: Distributed Algorithms Fall, Class 15 6.852: Distributed Algorithms Fall, 2009 Class 15 Today s plan z z z z z Pragmatic issues for shared-memory multiprocessors Practical mutual exclusion algorithms Test-and-set locks Ticket locks Queue locks

More information

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters Amit Karwande, Xin Yuan Dept. of Computer Science Florida State University Tallahassee, FL 32306 {karwande,xyuan}@cs.fsu.edu

More information

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 9: Multiprocessor OSs & Synchronization CSC 469H1F Fall 2006 Angela Demke Brown The Problem Coordinated management of shared resources Resources may be accessed by multiple threads Need to control

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing This document consists of two parts. The first part introduces basic concepts and issues that apply generally in discussions of parallel computing. The second part consists

More information

Lecture #7: Implementing Mutual Exclusion

Lecture #7: Implementing Mutual Exclusion Lecture #7: Implementing Mutual Exclusion Review -- 1 min Solution #3 to too much milk works, but it is really unsatisfactory: 1) Really complicated even for this simple example, hard to convince yourself

More information

Memory Consistency and Multiprocessor Performance

Memory Consistency and Multiprocessor Performance Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2

z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os 1.2 introduced a new heuristic for determining whether it is more efficient in terms

More information

Reactive Synchronization Algorithms for Multiprocessors

Reactive Synchronization Algorithms for Multiprocessors Synchronization Algorithms for Multiprocessors Beng-Hong Lim and Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, MA 039 Abstract Synchronization algorithms

More information

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that

More information

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea Programming With Locks s Tricky Multicore processors are the way of the foreseeable future thread-level parallelism anointed as parallelism model of choice Just one problem Writing lock-based multi-threaded

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Welfare Navigation Using Genetic Algorithm

Welfare Navigation Using Genetic Algorithm Welfare Navigation Using Genetic Algorithm David Erukhimovich and Yoel Zeldes Hebrew University of Jerusalem AI course final project Abstract Using standard navigation algorithms and applications (such

More information

Remaining Contemplation Questions

Remaining Contemplation Questions Process Synchronisation Remaining Contemplation Questions 1. The first known correct software solution to the critical-section problem for two processes was developed by Dekker. The two processes, P0 and

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Parallel Computing Concepts. CSInParallel Project

Parallel Computing Concepts. CSInParallel Project Parallel Computing Concepts CSInParallel Project July 26, 2012 CONTENTS 1 Introduction 1 1.1 Motivation................................................ 1 1.2 Some pairs of terms...........................................

More information

Enhancing Linux Scheduler Scalability

Enhancing Linux Scheduler Scalability Enhancing Linux Scheduler Scalability Mike Kravetz IBM Linux Technology Center Hubertus Franke, Shailabh Nagar, Rajan Ravindran IBM Thomas J. Watson Research Center {mkravetz,frankeh,nagar,rajancr}@us.ibm.com

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Conventions. Barrier Methods. How it Works. Centralized Barrier. Enhancements. Pitfalls

Conventions. Barrier Methods. How it Works. Centralized Barrier. Enhancements. Pitfalls Conventions Barrier Methods Based on Algorithms for Scalable Synchronization on Shared Memory Multiprocessors, by John Mellor- Crummey and Michael Scott Presentation by Jonathan Pearson In code snippets,

More information

Application Layer. Protocol/Programming Model Layer. Communication Layer. Communication Library. Network

Application Layer. Protocol/Programming Model Layer. Communication Layer. Communication Library. Network Limits to the Performance of Software Shared Memory: A Layered Approach Jaswinder Pal Singh, Angelos Bilas, Dongming Jiang and Yuanyuan Zhou Department of Computer Science Princeton University Princeton,

More information

Example: CPU-bound process that would run for 100 quanta continuously 1, 2, 4, 8, 16, 32, 64 (only 37 required for last run) Needs only 7 swaps

Example: CPU-bound process that would run for 100 quanta continuously 1, 2, 4, 8, 16, 32, 64 (only 37 required for last run) Needs only 7 swaps Interactive Scheduling Algorithms Continued o Priority Scheduling Introduction Round-robin assumes all processes are equal often not the case Assign a priority to each process, and always choose the process

More information

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6)

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6) EEC 581 Computer rchitecture Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6) Chansu Yu Electrical and Computer Engineering Cleveland State University cknowledgement Part of class notes

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

IT 540 Operating Systems ECE519 Advanced Operating Systems

IT 540 Operating Systems ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (5 th Week) (Advanced) Operating Systems 5. Concurrency: Mutual Exclusion and Synchronization 5. Outline Principles

More information

Parallel Programming Interfaces

Parallel Programming Interfaces Parallel Programming Interfaces Background Different hardware architectures have led to fundamentally different ways parallel computers are programmed today. There are two basic architectures that general

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.

More information

NOW Handout Page 1. Recap. Protocol Design Space of Snooping Cache Coherent Multiprocessors. Sequential Consistency.

NOW Handout Page 1. Recap. Protocol Design Space of Snooping Cache Coherent Multiprocessors. Sequential Consistency. Recap Protocol Design Space of Snooping Cache Coherent ultiprocessors CS 28, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Snooping cache coherence solve difficult problem by applying

More information

Adaptive Backoff Synchronization Techniques

Adaptive Backoff Synchronization Techniques Adaptive Backoff Synchronization Techniques Anant Agarwal and Mathews Cherian Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, MA 02139 Abstract Shared-memory multiprocessors

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

VIRTUAL MEMORY READING: CHAPTER 9

VIRTUAL MEMORY READING: CHAPTER 9 VIRTUAL MEMORY READING: CHAPTER 9 9 MEMORY HIERARCHY Core! Processor! Core! Caching! Main! Memory! (DRAM)!! Caching!! Secondary Storage (SSD)!!!! Secondary Storage (Disk)! L cache exclusive to a single

More information