Chapter 5: Achieving Good Performance


Typically, it is fairly straightforward to reason about the performance of sequential computations. For most programs, it suffices simply to count the number of instructions that are executed. In some cases, we realize that memory system performance is the bottleneck, so we find ways to reduce memory usage or to improve memory locality. In general, programmers are encouraged to avoid premature optimization by remembering the 90/10 rule, which states that 90% of the time is spent in 10% of the code. Thus, a prudent strategy is to write a program in a clean manner, and if its performance needs improving, to identify the 10% of the code that dominates the execution time. This 10% can then be rewritten, perhaps even in some alternative language, such as assembly code or C.

Unfortunately, the situation is much more complex with parallel programs. As we will see, the factors that determine performance are not just instruction times, but also communication time, waiting time, dependences, and so on. Dynamic effects, such as contention, are time-dependent and vary from problem to problem and from machine to machine. Furthermore, controlling these costs is much more complicated. But before considering the complications, consider a fundamental principle of parallel computation.

Amdahl's Law

Amdahl's Law observes that if 1/S of a computation is inherently sequential, then the maximum performance improvement is limited to a factor of S. The reasoning is that the parallel execution time, T_P, of a computation with sequential execution time T_S will be the sum of the time for the sequential component and the time for the parallel component. For P processors we have

    T_P = T_S/S + (1 - 1/S) * T_S/P

Imagining a value for P so large that the parallel portion takes negligible time, the maximum performance improvement is a factor of S. That is, the proportion of sequential code in a computation determines its potential for improvement using parallelism.

Given Amdahl's Law, we can see that the 90/10 rule is not enough: even if the 90% of the execution time goes to 0, leaving the 10% of the code unchanged means our execution time is at best 1/10 of the original, and when we use many more than 10 processors, a 10x speedup is likely to be unsatisfactory.
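To make the formula concrete, the following small sketch (illustrative only, not taken from the text; the function name and the sample numbers are ours) evaluates the predicted parallel time and the resulting speedup for an assumed sequential fraction of 10%.

#include <stdio.h>

/* Predicted parallel time under Amdahl's Law: a fraction 1/S of the
 * sequential time t_s is inherently sequential; the remainder divides
 * perfectly among P processors. */
double amdahl_time(double t_s, double S, int P)
{
    return t_s / S + (1.0 - 1.0 / S) * t_s / P;
}

int main(void)
{
    double t_s = 100.0;   /* sequential execution time, arbitrary units */
    double S = 10.0;      /* 1/S = 10% of the work is inherently sequential */

    for (int P = 1; P <= 1024; P *= 4) {
        double t_p = amdahl_time(t_s, S, P);
        printf("P = %4d   T_P = %7.2f   speedup = %5.2f\n", P, t_p, t_s / t_p);
    }
    /* As P grows, the speedup approaches but never exceeds S = 10. */
    return 0;
}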

The situation is actually somewhat worse than Amdahl's Law implies. One obvious problem is that the parallelizable portion of the computation might not be improvable to an unlimited extent; that is, there is probably an upper limit on the number of processors that can be used and still improve the performance, so the parallel execution time is unlikely to vanish. Furthermore, a parallel implementation often executes more total instructions than the sequential solution, making the (1 - 1/S) * T_S term an underestimate.

Amdahl's Law was enunciated in a 1967 paper by Gene Amdahl, an IBM mainframe architect [Amdahl, G.M., Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings 30, AFIPS Press, 1967]. It is a law in the same sense that the Law of Supply and Demand is a law: it describes a relationship between two components of program execution time, as expressed by the equation given in the text. Both laws are powerful tools to explain the behavior of important phenomena, and both laws assume as constant other quantities that affect the behavior. Amdahl's Law applies to a program instance.

Many, including Amdahl, have interpreted the law as proof that applying large numbers of processors to a problem will have limited success, but this seems to contradict news reports in which huge parallel computers improve computations by huge factors. What gives? Amdahl's Law describes a key fact that applies to an instance of a computation: portions of a computation that are sequential will, as parallelism is applied, dominate the execution time. The law fixes an instance and considers the effect of increasing parallelism. Most parallel computations, such as those in the news, fix the parallelism and expand the instances. In such cases the proportion of sequential code diminishes relative to the overall problem as larger instances are considered. So, doubling the problem size may increase the sequential portion negligibly, making a greater fraction of the problem available for parallel execution. In summary, Amdahl's Law does not deny the value of parallel computing. Rather, it reminds us that to achieve parallel performance we must be concerned with the entire program.

Measuring Performance

As mentioned repeatedly, the main point of parallel computing is to run computations faster. Faster obviously means in less time, but we immediately wonder, how much less? To understand both what is possible and what we can expect to achieve, we use several metrics to measure parallel performance, each with its own strengths and weaknesses.

Execution Time

Perhaps the most intuitive metric is execution time. Most of us think of the so-called wall clock time as synonymous with execution time, and for programs that run for hours and hours, that equivalence is accurate enough. But the elapsed wall clock time includes operating system time for loading and initiating the program, I/O time for reading data, paging time for the compulsory page misses, check-pointing time, and so on.

For short computations, the kind that we often use when we are analyzing program behavior, these items can be significant contributors to execution time. One argument says that because they are not affected by the user programming, they should be factored out of performance analysis that is directed at understanding the behavior of a parallel solution; the other view says that some services provided by the OS are needed, and the time should be charged. It is a complicated matter that we take up again at the end of the chapter. In this book we use execution time to refer to the net execution time of a parallel program, exclusive of initial OS, I/O, and similar charges. The problem of compulsory page misses is usually handled by running the computation twice and measuring only the second one. When we intend to include all of the components contributing to execution time, we will refer to wall clock time.

Notice that execution times (and wall clock times, for that matter) cannot be compared if they come from different computers. And, in most cases, it is not possible to compare the execution times of programs running different inputs, even for the same computer.

FLOPS

Another common metric is FLOPS, short for floating point operations per second, which is often used in scientific computations that are dominated by floating point arithmetic. Because double precision floating point arithmetic is usually significantly more expensive than single precision, it is common when reporting FLOPS to state which type of arithmetic is being measured. An obvious downside to using FLOPS is that it ignores other costs such as integer computations, which may also be a significant component of computation time. Perhaps more significant is that FLOPS rates can often be affected by extremely low-level program modifications that allow the program to exploit a special feature of the hardware, e.g. a combined multiply/add operation. Such improvements typically have little generality, either to other computations or to other computers.

A limitation of both of the above metrics is that they distill all performance into a single number without providing an indication of the parallel behavior of the computation. Instead, we often wish to understand how the performance of the program scales as we change the amount of parallelism.

Speedup

Speedup is defined as the execution time of a sequential program divided by the execution time of a parallel program that computes the same result. In particular, Speedup = T_S / T_P, where T_S is the sequential time and T_P is the parallel time running on P processors. Speedup is often plotted on the y-axis and the number of processors on the x-axis, as shown in Figure 5.1.

Figure 5.1. A typical speedup graph showing performance for two programs. (The plot shows speedup, from 0 to 48, against the number of processors, from 0 to 64, with one curve for Program1 and one for Program2.)

The speedup graph shows a characteristic typical of many parallel programs, namely, that the speedup curves level off as we increase the number of processors. This feature is the result of keeping the problem size constant while increasing the number of processors, which causes the amount of work per processor to decrease; with less work per processor, costs such as overhead or sequential computation (as Amdahl predicted) become more significant, causing the total execution time not to scale so well.

Efficiency

Efficiency is a normalized measure of speedup: Efficiency = Speedup / P. Ideally, speedup should scale linearly with P, implying that efficiency should have a constant value of 1. Of course, because of various sources of performance loss, efficiency is more typically below 1, and it diminishes as we increase the number of processors. Efficiency greater than 1 represents superlinear speedup.
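As a quick illustration of these definitions, the following sketch (ours, with invented timings) converts measured execution times into speedup and efficiency and flags the superlinear case.

#include <stdio.h>

/* t_seq: best sequential time; t_par: parallel time on p processors. */
void report(double t_seq, double t_par, int p)
{
    double speedup = t_seq / t_par;
    double efficiency = speedup / p;
    printf("P = %3d   speedup = %6.2f   efficiency = %4.2f%s\n",
           p, speedup, efficiency,
           efficiency > 1.0 ? "   (superlinear)" : "");
}

int main(void)
{
    report(100.0, 27.0, 4);    /* hypothetical measurements */
    report(100.0, 15.5, 8);
    report(100.0, 1.4, 64);    /* efficiency above 1 indicates superlinear speedup */
    return 0;
}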

Superlinear Speedup

The upper curve in the Figure 5.1 graph indicates superlinear speedup, which occurs when speedup grows faster than the number of processors. How is this possible? Surely the sequential program, which is the basis for the speedup computation, could just simulate the P processes of the parallel program to achieve an execution time that is no more than P times the parallel execution time. Shouldn't superlinear speedup be impossible?

There are two reasons why superlinear speedup occurs. The most common reason is that the computation's working set, that is, the set of pages needed for the computationally intensive part of the program, does not fit in the cache when executed on a single processor, but it does fit into the caches of the multiple processors when the problem is divided amongst them for parallel execution. In such cases the superlinear speedup derives from improved execution time due to the more efficient memory system behavior of the multi-processor execution.

The second case of superlinear speedup occurs when performing a search that is terminated as soon as the desired element is found. When performed in parallel, the search is effectively performed in a different order, implying that the total amount of data searched can actually be less than in the sequential case. Thus, the parallel execution actually performs less work.

Issues with Speedup and Efficiency

Since speedup is a ratio of two execution times, it is a unitless metric that would seem to factor out technological details such as processor speed. Instead, such details insidiously affect speedup, so we must be careful in interpreting speedup figures. There are several concerns.

First, recognize that it is difficult to compare speedups from machines of different generations, even if they have the same architecture. The problem is that different components of a parallel machine are generally improved by different amounts, changing their relative importance. So, for example, processor performance has increased over time, but communication latency has not fallen proportionately. Thus, the time spent communicating will not have diminished as much as the time spent computing. As a result, speedup values have generally decreased over time. Stated another way, the parallel components of a computation have become relatively more expensive compared to the processing components.

The second issue concerns T_S, speedup's numerator, which should be the time for the fastest sequential solution for the given processor and problem size. If T_S is artificially inflated, speedup will be greater. A subtle way to increase T_S is to turn off scalar compiler optimizations for both the sequential and parallel programs, which might seem fair since it uses the same compiler for both programs. However, such a change effectively slows the processors and improves, relatively speaking, communication latency. When reporting speedup, the sequential program should be provided and the compiler optimization settings detailed.

Another common way to increase T_S is to measure the one-processor performance of the parallel program. Speedup computed on this basis is called relative speedup and should be reported as such. True speedup includes the likely possibility that the sequential algorithm is different from the parallel algorithm. Relative speedup, which simply compares different runs of the same algorithm, takes as the base case an algorithm optimized for concurrent execution but with no parallelism; it will likely run slower because of parallel overheads, causing the speedup to look better. Notice that it can happen that a well-written parallel program on one processor is faster than any known sequential program, making it the best sequential program. In such cases we have true speedup, not relative speedup. The situation should be explicitly identified.

Relative speedup cannot always be avoided. For example, for large computations it may be impossible to measure a sequential program on a given problem size, because the data structures do not fit in memory. In such cases relative speedup is all that can be reported. The base case will be a parallel computation on a small number of processors, and the y-axis of the speedup plot should be scaled by that amount. So, for example, if the smallest possible run has P=4, then dividing by the runtime for P=64 will show perfect speedup at y=16.

Another way to inadvertently affect T_S is the cold start problem. An easy way to accidentally get a large T_S value is to run the sequential program once and include all of the paging behavior and compulsory cache misses in its timing. As noted earlier, it is good practice to run a parallel computation a few times, measuring only the later runs. This allows the caches to warm up, so that compulsory cache miss times are not unnecessarily included in the performance measure, thereby complicating our understanding of the program's speedup. (Of course, if the program has conflict misses, they should and will be counted.) Properly, most analysts warm their programs. But the sequential program should be warmed, too, so that the paging and compulsory misses do not figure into its execution time. Though easily overlooked, cold starts are also easily corrected.

More worrisome are computations that involve considerable off-processor activity, e.g. disk I/O. One-time I/O bursts, say to read in problem data, are fine because timing measurements can bypass them; the problem is continual off-processor operations. Not only are they slow relative to the processors, but they greatly complicate the speedup analysis of a computation. For example, if both the sequential and parallel solutions have to perform the same off-processor operations from a single source, huge times for these operations can completely obscure the parallelism because they will dominate the measurements. In such cases it is not necessary to parallelize the program at all. If processors can independently perform the off-processor operations, then this parallelism alone dominates the speedup computation, which will likely look perfect. Any measurements of a computation involving off-processor charges must control their effects carefully.

Performance Trade-Offs

We know that communication time, idle time, wait time, and many other quantities can affect the performance of a parallel computation. The complicating factor is that attempts to lower one cost can increase others. In this section we consider such complications.

Communication vs. computation

Communication costs are a direct expense of using parallelism because they do not arise in sequential computing. Accordingly, it is almost always smart to attempt to reduce them.

Overlap Communication and Computation. One way to reduce communication costs is to overlap communication with computation. Because communication can be performed concurrently with computation, and because the computation must be performed anyway, a perfect overlap (that is, the data is available when it is needed) hides the communication cost perfectly. Partial overlap will diminish waiting time and give partial improvement. The key, of course, is to identify computation that is independent of the communication. From a performance perspective, overlapping is generally a win without costs.

From a programming perspective, however, overlapping communication and computation can complicate the program's structure.

Redundant Computation. Another way to reduce communication costs is to perform redundant computations. We observed in Chapter 2, for example, that the local generation of a random number, r, by all processes was superior to generating the value in one process and requiring all others to reference it. Unlike overlapping, redundant computation incurs a cost because there is no parallelism when all processors must execute the random number generator code. Stated another way, we have increased the total number of instructions to be executed in order to remove the communication cost. Whenever the cost of the redundant computation is less than the communication cost, redundant computation is a win.

Notice that redundant computation also removes a dependence from the original program between the generating process and the others that will need the value. It is useful to remove dependences even if the cost of the added computation exactly matches the communication cost. In the case of the random number generation, redundant computation removes the possibility that a client process will have to wait for the server process to produce it. If the client can generate its own random number, it does not have to wait. Such cases complicate assessing the trade-off.

Memory vs. parallelism

Memory usage and parallelism interact in many ways. Perhaps the most favorable is the cache effect that leads to superlinear parallel performance, noted above. With all processors having caches, there is more fast memory in a parallel computer. But there are other cases where memory and parallelism interact.

Privatization. For example, parallelism can be increased by using additional memory to break false dependences. One memorable example is the use of private_count variables in the Count 3s program, which removed the need for threads to interact each time they recorded the next 3. The effect was to increase the number of count variables from 1 to t, the number of threads. It is a tiny memory cost for a big savings in reduced dependences.

Batching. One way to reduce the number of dependences is to increase the granularity of interaction. Batching is a programming technique in which work or transmissions are performed as a group. For example, rather than transmitting elements of an array, transmit a whole row or column; rather than grabbing one task from the task queue, get several. Batching effectively raises the granularity (see below) of fine-grain interactions to reduce their frequency. The added memory is simply required to record the items of the batch, and like privatization, batching is almost always worth the memory costs.

Memoization. Memoization stores a computed value to avoid re-computing it later. An example is a stencil optimization: a value is computed from a combination of its neighbors' values, each scaled by a coefficient that depends on the neighbor's position in the stencil; elements such as the corner elements are multiplied by the scale factor four times as the stencil moves through the array, and memoizing this value can reduce the number of multiplies and memory references. [DETAILED EXAMPLE HERE.] It is a sensible program optimization that removes instruction executions that, strictly speaking, may not result in parallelism improvements. However, in many cases memoization will result in better parallelism, as when the computation is redundant or involves non-local data values.

Padding. Finally, we note that false sharing, in which references to independent variables become dependent because the variables are allocated to the same cache line, can be eliminated by padding data structures to push the values onto different cache lines, as the sketch below illustrates.
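The following sketch applies the padding idea to per-thread counters like those of the Count 3s program; it is our own example, and the 64-byte cache line size is an assumption rather than a value taken from the text.

#define CACHE_LINE 64   /* assumed cache line size, in bytes */
#define T 8             /* number of threads */

/* Without padding, adjacent counters pack into the same cache line, so
 * independent updates by different threads cause false sharing. The pad
 * field pushes each counter onto its own line. */
struct padded_count {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_count private_count[T];   /* thread i touches only private_count[i].count */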

Overhead vs. parallelism

Parallelism and overhead are sometimes at odds. At one extreme, all parallel overhead, such as lock contention, can be avoided by using just one process; as we increase the number of threads, the contention will likely increase. If the problem size remains fixed, each processor has less work to perform between synchronizations, causing synchronization to become a larger portion of the overall computation. And a smaller problem size implies that there is less computation available to overlap with communication, which will typically increase the wait times for data.

It is the overhead of parallelism that is usually the reason why P cannot increase without bound. Indeed, even computations that could conceptually be solved with a processor devoted to each data point will be buried by overhead before P = n. Thus, we find that most programs have an upper limit for each data size at which the marginal value of an additional processor is negative; that is, adding a processor causes the execution time to increase.

Parallelize Overhead. Recall that in Chapter 4, when lock contention became a serious concern, we adopted a combining tree to solve it. In essence, the threads split up the task of accumulating intermediate values into several independent parallel activities.

[THIS SECTION CONTINUES WITH THESE TOPICS]

Load balance vs. parallelism. Increased parallelism can also improve load balance, as it is often easier to distribute evenly a large number of fine-grained units of work than a smaller number of coarse-grained units of work.

Granularity tradeoffs. Many of the above tradeoffs are related to the granularity of parallelism. The best granularity often depends on both algorithmic characteristics, such as the amount of parallelism and the types of dependences, and hardware characteristics, such as the cache size, the cache line size, and the latency and bandwidth of the machine's communication substrate.

Latency vs. bandwidth. As discussed in Chapter 3, there are many instances where bandwidth can be used to reduce latency.

Scaled speedup vs. Fixed-Size speedup

Choosing a problem size can be difficult. What should we measure? The kernel or the entire program? Amdahl's Law says that everything is important!

Operating System Costs

Because operating systems are so integral to computation, it is complicated to assess their effects on performance.

Initialization. How is memory laid out in the parallel computer?

Summary

Exercises

Chapter 6: Programming with Threads

Recall in Chapter 1 that we used threads to implement the count 3's program. In this chapter we'll explore thread-based programming in more detail using the standard POSIX Threads interface. We'll first explain the basic concepts needed to create threads and to let them interact with one another. We'll then discuss issues of safety and performance before we step back and evaluate the overall approach.

Thread Creation and Destruction

Consider the following standard code:

 1  #include <pthread.h>
 2  int err;
 3
 4  void main ()
 5  {
 6      pthread_t tid[max];    /* An array of thread IDs, one for each */
 7                             /* thread that is created */
 8
 9      for (i=0; i<t; i++)
10      {
11          err = pthread_create (&tid[i], NULL, count3s_thread, i);
12      }
13
14      for (i=0; i<t; i++)
15      {
16          err = pthread_join (tid[i], &status[i]);
17      }
18  }

The above code shows a main() function, which creates and launches t threads in the first loop and then waits for the t threads to complete in the second loop. We often refer to the creating thread as the parent and the created threads as children.

The above code differs from the pseudocode in Chapter 1 in a few details. Line 1 includes the pthreads header file, which declares the various pthreads routines and datatypes. Each thread that is created needs its own thread ID, so these thread IDs are declared on line 6. To create a thread, we invoke the pthread_create() routine with four parameters. The first parameter is a pointer to a thread ID, which will point to a valid thread ID when this call successfully returns. The second argument provides the thread's attributes; in this case, the NULL value specifies default attributes. The third parameter is a pointer to the start function, which the thread will execute once it's created. The fourth argument is passed to the start routine; in this case, it is a unique integer between 0 and t-1 that is associated with each thread. The second loop then calls pthread_join() (line 16) to wait for each of the child threads to terminate.

If instead of waiting for the child threads to complete, the main() routine finishes and exits using pthread_exit(), the child threads will continue to execute. Otherwise, the child threads will automatically terminate when main() finishes, since the entire process will have terminated. See Code Specs 1 and 2.

pthread_create()

int pthread_create (                   // create a new thread
    pthread_t *tid,                    // thread ID
    const pthread_attr_t *attr,        // thread attributes
    void *(*start_routine) (void *),   // pointer to function to execute
    void *arg                          // argument to function
);

Arguments: The thread ID of the successfully created thread. The thread's attributes, explained below; the NULL value specifies default attributes. The function that the new thread will execute once it is created. An argument passed to start_routine(); in this case, it is a unique integer between 0 and t-1 that is associated with each thread.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Notes: Use a structure to pass multiple arguments to the start routine, as sketched after Code Spec 2.

Code Spec 1. pthread_create(). The POSIX Threads thread creation function.

pthread_join()

int pthread_join (       // wait for a thread to terminate
    pthread_t tid,       // thread ID to wait for
    void **status        // exit status
);

Arguments: The ID of the thread to wait for. The completion status of the exiting thread will be copied into *status unless status is NULL, in which case the completion status is not copied.

Return value: 0 for success. Error code from <errno.h> otherwise.

Notes: Once a thread is joined, the thread no longer exists, its thread ID is no longer valid, and it cannot be joined with any other thread.

Code Spec 2. pthread_join(). The POSIX Threads rendezvous function pthread_join().
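The note in Code Spec 1 recommends a structure for passing multiple arguments to the start routine. A minimal sketch of that idiom follows; the thread_args type, the worker() function, and the element counts are our own illustrative choices, not part of the text.

#include <pthread.h>
#include <stdio.h>

#define T 4

typedef struct {
    int id;        /* logical thread index */
    int *array;    /* shared input array */
    int length;    /* number of elements in the array */
} thread_args;

void *worker(void *arg)
{
    thread_args *a = (thread_args *) arg;   /* recover the argument structure */
    printf("thread %d works on %d elements\n", a->id, a->length / T);
    return NULL;
}

int main(void)
{
    pthread_t tid[T];
    thread_args args[T];
    static int data[1000];

    for (int i = 0; i < T; i++) {
        args[i].id = i;            /* one structure per thread, so no thread */
        args[i].array = data;      /* reads another thread's argument values */
        args[i].length = 1000;
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < T; i++)
        pthread_join(tid[i], NULL);
    return 0;
}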

Thread IDs

Each thread has a unique ID of type pthread_t. As with all pthreads data types, a thread ID should be treated as an opaque type, meaning that individual fields of the structure should never be accessed directly. Because child threads do not know their thread ID, two routines allow a thread to determine its own thread ID, pthread_self(), and to compare two thread IDs, pthread_equal(); see Code Specs 3 and 4.

pthread_self()

pthread_t pthread_self ();   // Get my thread ID

Return value: The ID of the thread that called this function.

Code Spec 3. pthread_self(). The POSIX Threads function to fetch a thread's ID.

pthread_equal()

int pthread_equal (    // Test for equality
    pthread_t t1,      // First operand thread ID
    pthread_t t2       // Second operand thread ID
);

Arguments: Two thread IDs.

Return value: Non-zero if the two thread IDs are the same (following the C convention); 0 if the two threads are different.

Code Spec 4. pthread_equal(). The POSIX Threads function to compare two thread IDs for equality.
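As a small illustration of these two routines (our sketch, not from the text), a start routine can compare its own ID against an ID recorded by the parent:

#include <pthread.h>
#include <stdio.h>

static pthread_t parent_tid;   /* recorded by the parent before creating the child */

void *start(void *arg)
{
    if (pthread_equal(pthread_self(), parent_tid))
        printf("running in the parent thread\n");   /* never taken in this program */
    else
        printf("running in a child thread\n");
    return NULL;
}

int main(void)
{
    pthread_t tid;

    parent_tid = pthread_self();
    pthread_create(&tid, NULL, start, NULL);
    pthread_join(tid, NULL);
    return 0;
}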

Destroying Threads

There are three ways that threads can terminate.
1. A thread can return from the start routine.
2. A thread can call pthread_exit().
3. A thread can be cancelled by another thread.
In each case, the thread is destroyed and its resources become unavailable.

pthread_exit()

void pthread_exit (    // terminate a thread
    void *status       // completion status
);

Arguments: The completion status of the thread that has exited. This pointer value is available to other threads.

Return value: None.

Notes: When a thread exits by simply returning from the start routine, the thread's completion status is set to the start routine's return value.

Code Spec 5. pthread_exit(). The POSIX Threads thread termination function pthread_exit().

Thread Attributes

Each thread maintains its own properties, known as attributes, which are stored in a structure of type pthread_attr_t. For example, threads can be either detached or joinable. Detached threads cannot be joined with other threads, so they have slightly lower overhead in some implementations of POSIX Threads. For parallel computing, we will rarely need detached threads. Threads can also be either bound or unbound. Bound threads are scheduled by the operating system, whereas unbound threads are scheduled by the Pthreads library. For parallel computing, we typically use bound threads so that each thread provides physical concurrency. POSIX Threads provides routines to initialize thread attributes, set their attributes, and destroy attributes, as shown in Code Spec 6.

Thread Attributes

pthread_attr_t attr;                     // Declare a thread attribute
pthread_t tid;

pthread_attr_init(&attr);                // Initialize a thread attribute
pthread_attr_setdetachstate(&attr,       // Set the thread attribute
                            PTHREAD_CREATE_JOINABLE);

pthread_create (&tid, &attr, start_func, NULL);   // Use the attribute
                                                  // to create a thread
pthread_join(tid, NULL);
pthread_attr_destroy(&attr);             // Destroy the thread attribute

Code Spec 6. pthread attributes. An example of how thread attributes are set in the POSIX Threads interface.

Example

The following example illustrates a potential pitfall that can occur because of the interaction between parent and child threads. The parent thread simply creates a child thread and waits for the child to exit. The child thread does some useful work and then exits, returning an error code. Do you see what's wrong with this code?

 1  #include <pthread.h>
 2
 3  void main ()
 4  {
 5      pthread_t tid;
 6      int *status;
 7
 8      pthread_create (&tid, NULL, start, NULL);
 9      pthread_join (tid, &status);
10  }
11
12  void start()
13  {
14      int errorcode;
15      /* do something useful... */
16
17      if (... )
18          errorcode = something;
19      pthread_exit(&errorcode);
20  }

The problem occurs in the call to pthread_exit() on line 19, where the child is attempting to return an error code to the parent. Unfortunately, because errorcode is declared to be local to the start() function, the memory for errorcode is allocated on the child thread's stack. When the child exits, its call stack is de-allocated, and the parent has a dangling pointer to errorcode. At some point in the future, when a new procedure is invoked, it will over-write the stack location where errorcode resides, and the value of errorcode will change.
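One way to repair the example, sketched below as our own suggestion rather than the text's, is to return a value whose lifetime outlives the child's stack, for instance heap memory that the parent frees after the join; the error value 42 is arbitrary.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *start(void *arg)
{
    /* Allocate the error code on the heap so it survives the thread's stack. */
    int *errorcode = malloc(sizeof(int));
    *errorcode = 42;             /* arbitrary error value for illustration */
    pthread_exit(errorcode);     /* equivalent to: return errorcode; */
}

int main(void)
{
    pthread_t tid;
    void *status;

    pthread_create(&tid, NULL, start, NULL);
    pthread_join(tid, &status);              /* status now points to the heap value */
    printf("child returned %d\n", *(int *) status);
    free(status);                            /* the parent owns the memory after the join */
    return 0;
}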

Mutual Exclusion

We can now create and destroy threads, but to allow threads to interact constructively, we need methods of coordinating their interaction. In particular, when two threads share access to memory, it is often useful to employ a lock, called a mutex, to provide mutual exclusion, or mutually exclusive access, to the variable. As we saw in Chapter 1, without mutual exclusion, race conditions can lead to unpredictable results, because when multiple threads execute the following code, the count variable, which is shared among all threads, will not be atomically updated.

    for (i=start; i<start+length_per_thread; i++)
    {
        if (array[i] == 3)
        {
            count++;
        }
    }

The solution, of course, is to protect the update of count using a mutex, as shown below:

 1  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 2
 3  void count3s_thread (int id)
 4  {
 5      /* Compute portion of array that this thread should work on */
 6      int length_per_thread = length/t;
 7      int start = id * length_per_thread;
 8
 9      for (i=start; i<start+length_per_thread; i++)
10      {
11          if (array[i] == 3)
12          {
13              pthread_mutex_lock(&lock);
14              count++;
15              pthread_mutex_unlock(&lock);
16          }
17      }
18  }

Line 1 shows how a mutex can be statically declared. Like threads, mutexes have attributes, and by initializing the mutex to PTHREAD_MUTEX_INITIALIZER, the mutex is assigned default attributes. To use this mutex, its address is passed to the lock and unlock routines on lines 13 and 15, respectively.

The appropriate discipline, of course, is to bracket all critical sections, that is, code that must be executed atomically by only one thread at a time, by the locking of a mutex upon entrance and the unlocking of that mutex upon exit. Only one thread can acquire the mutex at any one time, so a thread will block if it attempts to acquire a mutex that is already held by another thread. When a mutex is unlocked, or relinquished, one of the threads that was blocked attempting to acquire the lock will become unblocked and granted the mutex. The POSIX Threads standard defines no notion of fairness, so the order in which the locks are acquired is not guaranteed to match the order in which the threads attempted to acquire the locks.

It is an error to unlock a mutex that has not been locked, and it is an error to lock a mutex that is already held. The latter will lead to deadlock, in which the thread cannot make progress because it is blocked waiting for an event that cannot happen. We will discuss deadlock and techniques to avoid deadlock in more detail later in the chapter.

Acquiring and Releasing Mutexes

int pthread_mutex_lock(       // Lock a mutex
    pthread_mutex_t *mutex);

int pthread_mutex_unlock(     // Unlock a mutex
    pthread_mutex_t *mutex);

int pthread_mutex_trylock(    // Non-blocking lock
    pthread_mutex_t *mutex);

Arguments: Each function takes the address of a mutex variable.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Notes: The pthread_mutex_trylock() routine attempts to acquire a mutex but will not block. This routine returns EBUSY if the mutex is locked.

Code Spec 7. The POSIX Threads routines for acquiring and releasing mutexes.
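As one illustration of the non-blocking variant (our sketch; the loop bound and variable names are invented), a thread can use pthread_mutex_trylock() to do other useful work when the mutex is busy instead of blocking:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long shared_counter = 0;      /* protected by lock */

void *worker(void *arg)
{
    long deferred = 0;        /* work that needs no lock */

    for (int i = 0; i < 100000; i++) {
        if (pthread_mutex_trylock(&lock) == 0) {
            shared_counter++;             /* acquired the mutex: update shared state */
            pthread_mutex_unlock(&lock);
        } else {
            deferred++;                   /* mutex busy (EBUSY): do something else */
        }
    }
    printf("deferred iterations: %ld\n", deferred);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}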

Serializability

It's clear that our use of mutexes provides atomicity: the thread that acquires the mutex will execute the code in the critical section until it relinquishes the mutex. Thus, in our above example, the counter will be updated by only one thread at a time. Atomicity is important because it ensures serializability: a concurrent execution is serializable if the execution is guaranteed to execute in an order that corresponds to some serial execution of those threads.

Mutex Creation and Destruction

In our above example, we knew that only one mutex was needed, so we were able to statically allocate it. In cases where the number of required mutexes is not known a priori, we can instead allocate and deallocate mutexes dynamically. Code Spec 8 shows the relevant routines, and Code Spec 9 shows how such a mutex is dynamically allocated, initialized with default attributes, and destroyed.

Mutex Creation and Destruction

int pthread_mutex_init(            // Initialize a mutex
    pthread_mutex_t *mutex,
    pthread_mutexattr_t *attr);

int pthread_mutex_destroy (        // Destroy a mutex
    pthread_mutex_t *mutex);

int pthread_mutexattr_init(        // Initialize a mutex attribute
    pthread_mutexattr_t *attr);

int pthread_mutexattr_destroy (    // Destroy a mutex attribute
    pthread_mutexattr_t *attr);

Arguments: The pthread_mutex_init() routine takes two arguments, a pointer to a mutex and a pointer to a mutex attribute. The latter is presumed to have already been initialized. The pthread_mutexattr_init() and pthread_mutexattr_destroy() routines take a pointer to a mutex attribute as their argument.

Notes: If the second argument to pthread_mutex_init() is NULL, default attributes will be used.

Code Spec 8. The POSIX Threads routines for dynamically creating and destroying mutexes.

Dynamically Allocated Mutexes

pthread_mutex_t *lock;    // Declare a pointer to a lock

lock = (pthread_mutex_t *) malloc(sizeof(pthread_mutex_t));
pthread_mutex_init(lock, NULL);
/*
 * Code that uses this lock.
 */
pthread_mutex_destroy (lock);
free (lock);

Code Spec 9. An example of how dynamically allocated mutexes are used in the POSIX Threads interface.

Synchronization

Mutexes are sufficient to provide atomicity for critical sections, but in many situations we would like a thread to synchronize its behavior with that of some other thread. For example, consider a classic bounded buffer problem in which one or more threads put items into a circular buffer while other threads remove items from the same buffer.

As shown in Figure 1, we would like the producers to stop producing data and wait if the consumer is unable to keep up and the buffer becomes full, and we would like the consumers to wait if the buffer is empty.

Figure 1. A bounded buffer with producers and consumers. The Put and Get cursors indicate where the producers will insert the next item and where the consumers will remove the next item, respectively. When the buffer is empty, the consumers must wait. When the buffer is full, the producers must wait. (The figure shows a circular buffer in its empty and full states, with the Put and Get cursors.)

Such synchronization is supported by condition variables, which are a more general form of synchronization than joining threads. A condition variable allows threads to wait until some condition becomes true, at which point one of the waiting threads is nondeterministically chosen to stop waiting. We can think of the condition variable as a gate (see Figure 2). Threads wait at the gate until some condition is true. Other threads open the gate to signal that the condition has become true, at which point one of the waiters is allowed to enter the gate and resume execution. If a thread opens the gate when there are no threads waiting, the signal has no effect.

Figure 2. Condition variables act like a gate. Threads wait outside the gate by calling pthread_cond_wait(), and threads open the gate by calling pthread_cond_signal(). When the gate is opened, one waiter is allowed through. If there are no waiters when the gate is opened, the signal has no effect.

We can solve our bounded buffer problem with two condition variables, nonempty and nonfull, as shown below.

 1  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 2  pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
 3  pthread_cond_t nonfull = PTHREAD_COND_INITIALIZER;
 4  Item buffer[SIZE];
 5  int in = 0;    // Buffer index for next insertion
 6  int out = 0;   // Buffer index for next removal
 7
 8  void put (Item x)                            // Producer thread
 9  {
10      pthread_mutex_lock(&lock);
11      while ((in - out) == SIZE)               // While buffer is full
12          pthread_cond_wait(&nonfull, &lock);
13      buffer[in % SIZE] = x;
14      in++;
15      pthread_cond_signal(&nonempty);
16      pthread_mutex_unlock(&lock);
17  }
18
19  Item get()                                   // Consumer thread
20  {
21      Item x;
22      pthread_mutex_lock(&lock);
23      while (out == in)                        // While buffer is empty
24          pthread_cond_wait(&nonempty, &lock);
25      x = buffer[out % SIZE];
26      out++;
27      pthread_cond_signal(&nonfull);
28      pthread_mutex_unlock(&lock);
29      return x;
30  }

Of course, since multiple threads will be updating these condition variables, we need to protect their access with a mutex, so Line 1 declares a mutex. The remaining declarations define a buffer, buffer, and its two cursors, in and out, which indicate where to insert the next item and where to remove the next item. The two cursors wrap around when they exceed the bounds of buffer, yielding a circular buffer.

Given these data structures, the producer thread executes the put() routine, which first acquires the mutex to access the condition variables. (This code omits the actual creation of the producer and consumer threads, which are assumed to iteratively invoke the put() and get() routines, respectively; a sketch of that wrapper appears below.) If the buffer is full, the producer waits on the nonfull condition so that it will later be awakened when the buffer becomes non-full. If this thread blocks, the mutex that it holds must be relinquished to avoid deadlock. Because these two events, the releasing of the mutex and the blocking of this waiting thread, must occur atomically, they must be performed by pthread_cond_wait(), so the mutex is passed as a parameter to pthread_cond_wait(). When the producer resumes execution after returning from the wait on Line 12, the protecting mutex will have been re-acquired by the system on behalf of the producer.

In a moment we will explain the need for the while loop on Line 11, but for now assume that when the producer executes Line 13, the buffer is not full, so it is safe to insert a new item and to bump the in cursor by one. At this point, the buffer cannot be empty because the producer has just inserted an element, so the producer signals that the buffer is nonempty, waking one or more consumers that may be waiting on an empty buffer. If there are no waiting consumers, the signal is lost. Finally, the producer releases the mutex and exits the routine. The consumer thread executes the get() routine, which operates in a very similar manner.
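A minimal sketch of that omitted wrapper follows; it is our own illustration and assumes the declarations of Item, put(), and get() given above, while make_item() and consume() are hypothetical helpers.

/* Assumes the bounded buffer declarations, put(), and get() shown above,
 * plus #include <pthread.h>. */
void *producer(void *arg)
{
    for (int i = 0; i < 100; i++) {
        Item x = make_item(i);   /* hypothetical helper that builds an Item */
        put(x);                  /* blocks while the buffer is full */
    }
    return NULL;
}

void *consumer(void *arg)
{
    for (int i = 0; i < 100; i++) {
        Item x = get();          /* blocks while the buffer is empty */
        consume(x);              /* hypothetical helper that uses the Item */
    }
    return NULL;
}

int main(void)
{
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}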

pthread_cond_wait()

int pthread_cond_wait(
    pthread_cond_t *cond,      // Condition to wait on
    pthread_mutex_t *mutex);   // Protecting mutex

int pthread_cond_timedwait (
    pthread_cond_t *cond,
    pthread_mutex_t *mutex,
    const struct timespec *abstime);   // Time-out value

Arguments: A condition variable to wait on. A mutex that protects access to the condition variable. The mutex is released before the thread blocks, and these two actions occur atomically. When this thread is later unblocked, the mutex is reacquired on behalf of this thread.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Code Spec 10. pthread_cond_wait(). The POSIX Threads routines for waiting on condition variables.

pthread_cond_signal()

int pthread_cond_signal(       // Condition to signal
    pthread_cond_t *cond);

int pthread_cond_broadcast (   // Condition to signal
    pthread_cond_t *cond);

Arguments: A condition variable to signal.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Notes: These routines have no effect if there are no threads waiting on cond. In particular, there is no memory of the signal when a later call is made to pthread_cond_wait(). The pthread_cond_signal() routine may wake up more than one thread, but only one of these threads will hold the protecting mutex. The pthread_cond_broadcast() routine wakes up all waiting threads. Only one awakened thread will hold the protecting mutex.

Code Spec 11. pthread_cond_signal(). The POSIX Threads routines for signaling a condition variable.

Protecting Condition Variables

Let us now return to the while loop on Line 11 of the bounded buffer program. If our system has multiple producer threads, this loop is essential because pthread_cond_signal() can wake up multiple waiting threads [1], of which only one will hold the protecting mutex at any particular time. Thus, at the time of the signal, the buffer is not full, but when any particular thread acquires the mutex, the buffer may have become full again, in which case the thread should call pthread_cond_wait() again. When the producer thread executes Line 13, the buffer is necessarily not full, so it is safe to insert a new item and to bump the in cursor.

We see on Lines 15 and 27 that the call to pthread_cond_signal() is also protected by the lock. The following example shows that this protection is necessary.

    time    Signaling Thread                      Waiting Thread
      |                                           lock (mutex)
      |                                           while (out == in)
      |     insert(item);
      |     pthread_cond_signal(&nonempty);       // Signal is dropped
      v                                           pthread_cond_wait(&nonempty, lock);
                                                  // Will wait forever

Figure 3. Example of why a signaling thread needs to be protected by a mutex.

In this example, the waiting thread, in this case the consumer, acquires the protecting mutex and finds that the buffer is empty, so it executes pthread_cond_wait(). If the signaling thread, in this case the producer, does not protect the call to pthread_cond_signal() with a mutex, it could insert an item into the buffer immediately after the waiting thread found it empty. If the producer then signals that the buffer is non-empty before the waiting thread executes the call to pthread_cond_wait(), the signal will be dropped and the consumer thread will not realize that the buffer is actually not empty. In the case that the producer only inserts a single item, the waiting thread will needlessly wait forever.

The problem, of course, is that there is a race condition involving the manipulation of the buffer. The obvious solution is to protect the call to pthread_cond_signal() with the same mutex that protects the call to pthread_cond_wait(), as shown in the code for our bounded buffer solution. Because both the Put() and Get() routines are protected by the same mutex, we have three critical sections related to the nonempty buffer, as shown in Figure 4, and in no case can the signal be dropped while a waiting thread thinks that the buffer is empty.

[1] These semantics are due to implementation details. In some cases it can be expensive to ensure that exactly one waiter is unblocked by a signal.

Figure 4. Proper locking of the signaling code prevents race conditions. (The figure identifies three critical sections: A, in Put(), consisting of insert(item) and pthread_cond_signal(&nonempty); B, in Get(), consisting of lock(mutex) and the while (out == in) pthread_cond_wait(&nonempty, lock) loop; and C, in Get(), consisting of remove(item). It then shows the three possible orderings of these critical sections between the signaling and waiting threads: A, B, C; B, A, C; and B, C, A.) By identifying and protecting three critical sections pertaining to the nonempty buffer, we guarantee that each of A, B, and C will execute atomically, so our problem from Figure 3 is avoided: there is no way for the Put() routine's signal to be dropped while a thread executing the Get() routine thinks that the buffer is empty.

We have argued that the call to pthread_cond_signal() must be protected by the same mutex that protects the waiting code. However, notice that the race condition arises not from the signaling of the condition variable, but from the access to the shared buffer. Thus, we could instead simply protect any code that manipulates the shared buffer, which implies that the Put() code could release the mutex immediately after inserting an item into the buffer but before calling pthread_cond_signal(). This new code is not only legal, but it produces better performance because it reduces the size of the critical section, thereby allowing more concurrency.
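A sketch of that restructured put() follows (our rendering of the optimization just described; the rest of the bounded buffer code is unchanged). The only difference from the earlier listing is that the mutex is released before the signal.

void put (Item x)                        // Producer thread, signaling outside the lock
{
    pthread_mutex_lock(&lock);
    while ((in - out) == SIZE)           // While buffer is full
        pthread_cond_wait(&nonfull, &lock);
    buffer[in % SIZE] = x;
    in++;
    pthread_mutex_unlock(&lock);         // Release the mutex first...
    pthread_cond_signal(&nonempty);      // ...then signal; the buffer update was already protected
}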

Creating and Destroying Condition Variables

Like threads and mutexes, condition variables can be created and destroyed either statically or dynamically. In our bounded buffer example above, the static condition variables were both given default attributes by initializing them to PTHREAD_COND_INITIALIZER. Condition variables can be dynamically allocated as indicated in Code Spec 12.

Dynamically Allocated Condition Variables

int pthread_cond_init(
    pthread_cond_t *cond,              // Condition variable
    const pthread_condattr_t *attr);   // Condition attribute

int pthread_cond_destroy (
    pthread_cond_t *cond);             // Condition to destroy

Arguments: Default attributes are used if attr is NULL.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Code Spec 12. The POSIX Threads routines for dynamically creating and destroying condition variables.

Waiting on Multiple Condition Variables

In some cases a piece of code cannot execute unless multiple conditions are met. In these situations the waiting thread should test all conditions simultaneously, as shown below.

 1  EatJuicyFruit()
 2  {
 3      pthread_mutex_lock(&lock);
 4      while (apples==0 || oranges==0)
 5      {
 6          pthread_cond_wait(&more_apples, &lock);
 7          pthread_cond_wait(&more_oranges, &lock);
 8      }
 9      /* Eat both an apple and an orange */
10      pthread_mutex_unlock(&lock);
11  }

By contrast, the following code, which waits on each condition in turn, fails because there is no guarantee that both conditions will be true at the same time. That is, after returning from the first call to pthread_cond_wait() but before returning from the second call to pthread_cond_wait(), some other thread may have removed an apple, making the first condition false.

 1  EatJuicyFruit()
 2  {
 3      pthread_mutex_lock(&lock);
 4      while (apples==0)
 5          pthread_cond_wait(&more_apples, &lock);
 6      while (oranges==0)
 7          pthread_cond_wait(&more_oranges, &lock);
 8
 9      /* Eat both an apple and an orange */
10      pthread_mutex_unlock(&lock);
11  }

Thread-Specific Data

It is often useful for threads to maintain private data that is not shared. For example, we have seen examples where a thread index is passed to the start function so that the thread knows what portion of an array to work on. This index can be used to give each thread a different element of an array, as shown below:

    for (i=0; i<t; i++)
        err = pthread_create (&tid[i], NULL, start_function, i);

    void start_function(int index)
    {
        private_count[index] = 0;
        ...

A problem occurs, however, if the code that accesses index occurs in a function, foo(), which is buried deep within other code. In such situations, how does foo() get the value of index? One solution is to pass the index parameter to every procedure that calls foo(), including procedures that call foo() indirectly through other procedures. This solution is cumbersome, particularly for those procedures that require the parameter but do not directly use it. Instead, what we really want is a variable that is global in scope to all code but which can have different values for each thread. POSIX Threads supports such a notion in the form of thread-specific data, which uses a set of keys, shared by all threads in a process, that map to different pointer values for each thread. (See Figure 4.)

Figure: thread-specific data. Keys (key1, key2) are shared by Thread 0 and Thread 1, but each thread's key maps to its own value in memory.
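As a hedged illustration of the key mechanism just described (our own example, not taken from the text), pthread_key_create() creates a key shared by all threads, and each thread binds its own value to the key with pthread_setspecific() and retrieves it with pthread_getspecific():

#include <pthread.h>
#include <stdio.h>

static pthread_key_t index_key;   /* one key, shared by every thread in the process */

void foo(void)
{
    /* Deep in the call chain, recover this thread's private index without
     * threading the parameter through every intervening procedure. */
    int index = *(int *) pthread_getspecific(index_key);
    printf("foo() sees thread index %d\n", index);
}

void *start_function(void *arg)
{
    pthread_setspecific(index_key, arg);   /* bind the shared key to a per-thread value */
    foo();
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    int ids[4];

    pthread_key_create(&index_key, NULL);  /* NULL: no destructor for the values */
    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, start_function, &ids[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    pthread_key_delete(index_key);
    return 0;
}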


More information

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto Ricardo Rocha Department of Computer Science Faculty of Sciences University of Porto For more information please consult Advanced Programming in the UNIX Environment, 3rd Edition, W. Richard Stevens and

More information

CSci 4061 Introduction to Operating Systems. Synchronization Basics: Locks

CSci 4061 Introduction to Operating Systems. Synchronization Basics: Locks CSci 4061 Introduction to Operating Systems Synchronization Basics: Locks Synchronization Outline Basics Locks Condition Variables Semaphores Basics Race condition: threads + shared data Outcome (data

More information

Concurrent Programming

Concurrent Programming Concurrent Programming is Hard! Concurrent Programming Kai Shen The human mind tends to be sequential Thinking about all possible sequences of events in a computer system is at least error prone and frequently

More information

CS 3305 Intro to Threads. Lecture 6

CS 3305 Intro to Threads. Lecture 6 CS 3305 Intro to Threads Lecture 6 Introduction Multiple applications run concurrently! This means that there are multiple processes running on a computer Introduction Applications often need to perform

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University Threads Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics Why threads? Threading issues 2 Processes Heavy-weight A process includes

More information

Thread and Synchronization

Thread and Synchronization Thread and Synchronization pthread Programming (Module 19) Yann-Hang Lee Arizona State University yhlee@asu.edu (480) 727-7507 Summer 2014 Real-time Systems Lab, Computer Science and Engineering, ASU Pthread

More information

CSE 333 SECTION 9. Threads

CSE 333 SECTION 9. Threads CSE 333 SECTION 9 Threads HW4 How s HW4 going? Any Questions? Threads Sequential execution of a program. Contained within a process. Multiple threads can exist within the same process. Every process starts

More information

Concurrent Programming

Concurrent Programming Concurrent Programming CS 485G-006: Systems Programming Lectures 32 33: 18 20 Apr 2016 1 Concurrent Programming is Hard! The human mind tends to be sequential The notion of time is often misleading Thinking

More information

The mutual-exclusion problem involves making certain that two things don t happen at once. A non-computer example arose in the fighter aircraft of

The mutual-exclusion problem involves making certain that two things don t happen at once. A non-computer example arose in the fighter aircraft of The mutual-exclusion problem involves making certain that two things don t happen at once. A non-computer example arose in the fighter aircraft of World War I (pictured is a Sopwith Camel). Due to a number

More information

CS 261 Fall Mike Lam, Professor. Threads

CS 261 Fall Mike Lam, Professor. Threads CS 261 Fall 2017 Mike Lam, Professor Threads Parallel computing Goal: concurrent or parallel computing Take advantage of multiple hardware units to solve multiple problems simultaneously Motivations: Maintain

More information

CSCI4430 Data Communication and Computer Networks. Pthread Programming. ZHANG, Mi Jan. 26, 2017

CSCI4430 Data Communication and Computer Networks. Pthread Programming. ZHANG, Mi Jan. 26, 2017 CSCI4430 Data Communication and Computer Networks Pthread Programming ZHANG, Mi Jan. 26, 2017 Outline Introduction What is Multi-thread Programming Why to use Multi-thread Programming Basic Pthread Programming

More information

Interprocess Communication By: Kaushik Vaghani

Interprocess Communication By: Kaushik Vaghani Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the

More information

Multicore and Multiprocessor Systems: Part I

Multicore and Multiprocessor Systems: Part I Chapter 3 Multicore and Multiprocessor Systems: Part I Max Planck Institute Magdeburg Jens Saak, Scientific Computing II 44/337 Symmetric Multiprocessing Definition (Symmetric Multiprocessing (SMP)) The

More information

Introduction to pthreads

Introduction to pthreads CS 220: Introduction to Parallel Computing Introduction to pthreads Lecture 25 Threads In computing, a thread is the smallest schedulable unit of execution Your operating system has a scheduler that decides

More information

Processes Prof. James L. Frankel Harvard University. Version of 6:16 PM 10-Feb-2017 Copyright 2017, 2015 James L. Frankel. All rights reserved.

Processes Prof. James L. Frankel Harvard University. Version of 6:16 PM 10-Feb-2017 Copyright 2017, 2015 James L. Frankel. All rights reserved. Processes Prof. James L. Frankel Harvard University Version of 6:16 PM 10-Feb-2017 Copyright 2017, 2015 James L. Frankel. All rights reserved. Process Model Each process consists of a sequential program

More information

pthreads CS449 Fall 2017

pthreads CS449 Fall 2017 pthreads CS449 Fall 2017 POSIX Portable Operating System Interface Standard interface between OS and program UNIX-derived OSes mostly follow POSIX Linux, macos, Android, etc. Windows requires separate

More information

PThreads in a Nutshell

PThreads in a Nutshell PThreads in a Nutshell Chris Kauffman CS 499: Spring 2016 GMU Logistics Today POSIX Threads Briefly Reading Grama 7.1-9 (PThreads) POSIX Threads Programming Tutorial HW4 Upcoming Post over the weekend

More information

Concurrent Server Design Multiple- vs. Single-Thread

Concurrent Server Design Multiple- vs. Single-Thread Concurrent Server Design Multiple- vs. Single-Thread Chuan-Ming Liu Computer Science and Information Engineering National Taipei University of Technology Fall 2007, TAIWAN NTUT, TAIWAN 1 Examples Using

More information

CS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University

CS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University CS 333 Introduction to Operating Systems Class 3 Threads & Concurrency Jonathan Walpole Computer Science Portland State University 1 The Process Concept 2 The Process Concept Process a program in execution

More information

Chapter 4 Threads. Images from Silberschatz 03/12/18. CS460 Pacific University 1

Chapter 4 Threads. Images from Silberschatz 03/12/18. CS460 Pacific University 1 Chapter 4 Threads Images from Silberschatz Pacific University 1 Threads Multiple lines of control inside one process What is shared? How many PCBs? Pacific University 2 Typical Usages Word Processor Web

More information

Threads Tuesday, September 28, :37 AM

Threads Tuesday, September 28, :37 AM Threads_and_fabrics Page 1 Threads Tuesday, September 28, 2004 10:37 AM Threads A process includes an execution context containing Memory map PC and register values. Switching between memory maps can take

More information

Chap. 6 Part 1. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1

Chap. 6 Part 1. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1 Chap. 6 Part 1 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 Chap 6: specific programming techniques Languages and libraries Authors blur the distinction Languages: access parallel programming

More information

CSE 374 Programming Concepts & Tools

CSE 374 Programming Concepts & Tools CSE 374 Programming Concepts & Tools Hal Perkins Fall 2017 Lecture 22 Shared-Memory Concurrency 1 Administrivia HW7 due Thursday night, 11 pm (+ late days if you still have any & want to use them) Course

More information

CS533 Concepts of Operating Systems. Jonathan Walpole

CS533 Concepts of Operating Systems. Jonathan Walpole CS533 Concepts of Operating Systems Jonathan Walpole Introduction to Threads and Concurrency Why is Concurrency Important? Why study threads and concurrent programming in an OS class? What is a thread?

More information

CS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University

CS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University CS 333 Introduction to Operating Systems Class 3 Threads & Concurrency Jonathan Walpole Computer Science Portland State University 1 Process creation in UNIX All processes have a unique process id getpid(),

More information

Threads. studykorner.org

Threads. studykorner.org Threads Thread Subpart of a process Basic unit of CPU utilization Smallest set of programmed instructions, can be managed independently by OS No independent existence (process dependent) Light Weight Process

More information

Parallel Programming with Threads

Parallel Programming with Threads Thread Programming with Shared Memory Parallel Programming with Threads Program is a collection of threads of control. Can be created dynamically, mid-execution, in some languages Each thread has a set

More information

HPCSE - I. «Introduction to multithreading» Panos Hadjidoukas

HPCSE - I. «Introduction to multithreading» Panos Hadjidoukas HPCSE - I «Introduction to multithreading» Panos Hadjidoukas 1 Processes and Threads POSIX Threads API Outline Thread management Synchronization with mutexes Deadlock and thread safety 2 Terminology -

More information

Synchronization and Semaphores. Copyright : University of Illinois CS 241 Staff 1

Synchronization and Semaphores. Copyright : University of Illinois CS 241 Staff 1 Synchronization and Semaphores Copyright : University of Illinois CS 241 Staff 1 Synchronization Primatives Counting Semaphores Permit a limited number of threads to execute a section of the code Binary

More information

CS 326: Operating Systems. Process Execution. Lecture 5

CS 326: Operating Systems. Process Execution. Lecture 5 CS 326: Operating Systems Process Execution Lecture 5 Today s Schedule Process Creation Threads Limited Direct Execution Basic Scheduling 2/5/18 CS 326: Operating Systems 2 Today s Schedule Process Creation

More information

CS 153 Lab4 and 5. Kishore Kumar Pusukuri. Kishore Kumar Pusukuri CS 153 Lab4 and 5

CS 153 Lab4 and 5. Kishore Kumar Pusukuri. Kishore Kumar Pusukuri CS 153 Lab4 and 5 CS 153 Lab4 and 5 Kishore Kumar Pusukuri Outline Introduction A thread is a straightforward concept : a single sequential flow of control. In traditional operating systems, each process has an address

More information

COMP 3430 Robert Guderian

COMP 3430 Robert Guderian Operating Systems COMP 3430 Robert Guderian file:///users/robg/dropbox/teaching/3430-2018/slides/04_threads/index.html?print-pdf#/ 1/58 1 Threads Last week: Processes This week: Lesser processes! file:///users/robg/dropbox/teaching/3430-2018/slides/04_threads/index.html?print-pdf#/

More information

Concurrent Programming

Concurrent Programming Concurrent Programming Prof. Jinkyu Jeong( jinkyu@skku.edu) TA Jinhong Kim( jinhong.kim@csl.skku.edu) TA Seokha Shin(seokha.shin@csl.skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu

More information

A Brief Introduction to OS/2 Multithreading

A Brief Introduction to OS/2 Multithreading A Brief Introduction to OS/2 Multithreading Jaroslav Kaƒer jkacer@kiv.zcu.cz University of West Bohemia Faculty of Applied Sciences Department of Computer Science and Engineering Why Use Parallelism? Performance

More information

So far, we know: Wednesday, October 4, Thread_Programming Page 1

So far, we know: Wednesday, October 4, Thread_Programming Page 1 Thread_Programming Page 1 So far, we know: 11:50 AM How to create a thread via pthread_mutex_create How to end a thread via pthread_mutex_join How to lock inside a thread via pthread_mutex_lock and pthread_mutex_unlock.

More information

High Performance Computing Course Notes Shared Memory Parallel Programming

High Performance Computing Course Notes Shared Memory Parallel Programming High Performance Computing Course Notes 2009-2010 2010 Shared Memory Parallel Programming Techniques Multiprocessing User space multithreading Operating system-supported (or kernel) multithreading Distributed

More information

Programming with Shared Memory. Nguyễn Quang Hùng

Programming with Shared Memory. Nguyễn Quang Hùng Programming with Shared Memory Nguyễn Quang Hùng Outline Introduction Shared memory multiprocessors Constructs for specifying parallelism Creating concurrent processes Threads Sharing data Creating shared

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

CS 6400 Lecture 11 Name:

CS 6400 Lecture 11 Name: Readers and Writers Example - Granularity Issues. Multiple concurrent readers, but exclusive access for writers. Original Textbook code with ERRORS - What are they? Lecture 11 Page 1 Corrected Textbook

More information

POSIX threads CS 241. February 17, Copyright University of Illinois CS 241 Staff

POSIX threads CS 241. February 17, Copyright University of Illinois CS 241 Staff POSIX threads CS 241 February 17, 2012 Copyright University of Illinois CS 241 Staff 1 Recall: Why threads over processes? Creating a new process can be expensive Time A call into the operating system

More information

CSci 4061 Introduction to Operating Systems. (Threads-POSIX)

CSci 4061 Introduction to Operating Systems. (Threads-POSIX) CSci 4061 Introduction to Operating Systems (Threads-POSIX) How do I program them? General Thread Operations Create/Fork Allocate memory for stack, perform bookkeeping Parent thread creates child threads

More information

Threads. Threads (continued)

Threads. Threads (continued) Threads A thread is an alternative model of program execution A process creates a thread through a system call Thread operates within process context Use of threads effectively splits the process state

More information

CSE 333 Section 9 - pthreads

CSE 333 Section 9 - pthreads CSE 333 Section 9 - pthreads Welcome back to section! We re glad that you re here :) Process A process has a virtual address space. Each process is started with a single thread, but can create additional

More information

Concurrency: Threads. CSE 333 Autumn 2018

Concurrency: Threads. CSE 333 Autumn 2018 Concurrency: Threads CSE 333 Autumn 2018 Instructor: Hal Perkins Teaching Assistants: Tarkan Al-Kazily Renshu Gu Trais McGaha Harshita Neti Thai Pham Forrest Timour Soumya Vasisht Yifan Xu Administriia

More information

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Condition Variables. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Condition Variables. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Condition Variables Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3052: Introduction to Operating Systems, Fall 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Concurrent Programming is Hard! Concurrent Programming. Reminder: Iterative Echo Server. The human mind tends to be sequential

Concurrent Programming is Hard! Concurrent Programming. Reminder: Iterative Echo Server. The human mind tends to be sequential Concurrent Programming is Hard! Concurrent Programming 15 213 / 18 213: Introduction to Computer Systems 23 rd Lecture, April 11, 213 Instructors: Seth Copen Goldstein, Anthony Rowe, and Greg Kesden The

More information

Warm-up question (CS 261 review) What is the primary difference between processes and threads from a developer s perspective?

Warm-up question (CS 261 review) What is the primary difference between processes and threads from a developer s perspective? Warm-up question (CS 261 review) What is the primary difference between processes and threads from a developer s perspective? CS 470 Spring 2018 POSIX Mike Lam, Professor Multithreading & Pthreads MIMD

More information

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages CMSC 330: Organization of Programming Languages Multithreading Multiprocessors Description Multiple processing units (multiprocessor) From single microprocessor to large compute clusters Can perform multiple

More information

Multithreaded Programming

Multithreaded Programming Multithreaded Programming The slides do not contain all the information and cannot be treated as a study material for Operating System. Please refer the text book for exams. September 4, 2014 Topics Overview

More information

Operating systems and concurrency (B08)

Operating systems and concurrency (B08) Operating systems and concurrency (B08) David Kendall Northumbria University David Kendall (Northumbria University) Operating systems and concurrency (B08) 1 / 20 Introduction Semaphores provide an unstructured

More information

Systèmes d Exploitation Avancés

Systèmes d Exploitation Avancés Systèmes d Exploitation Avancés Instructor: Pablo Oliveira ISTY Instructor: Pablo Oliveira (ISTY) Systèmes d Exploitation Avancés 1 / 32 Review : Thread package API tid thread create (void (*fn) (void

More information

CS Lecture 3! Threads! George Mason University! Spring 2010!

CS Lecture 3! Threads! George Mason University! Spring 2010! CS 571 - Lecture 3! Threads! George Mason University! Spring 2010! Threads! Overview! Multithreading! Example Applications! User-level Threads! Kernel-level Threads! Hybrid Implementation! Observing Threads!

More information

Threads. Jo, Heeseung

Threads. Jo, Heeseung Threads Jo, Heeseung Multi-threaded program 빠른실행 프로세스를새로생성에드는비용을절약 데이터공유 파일, Heap, Static, Code 의많은부분을공유 CPU 를보다효율적으로활용 코어가여러개일경우코어에 thread 를할당하는방식 2 Multi-threaded program Pros. Cons. 대량의데이터처리에적합 - CPU

More information

Threads. What is a thread? Motivation. Single and Multithreaded Processes. Benefits

Threads. What is a thread? Motivation. Single and Multithreaded Processes. Benefits CS307 What is a thread? Threads A thread is a basic unit of CPU utilization contains a thread ID, a program counter, a register set, and a stack shares with other threads belonging to the same process

More information

Lecture 19: Shared Memory & Synchronization

Lecture 19: Shared Memory & Synchronization Lecture 19: Shared Memory & Synchronization COMP 524 Programming Language Concepts Stephen Olivier April 16, 2009 Based on notes by A. Block, N. Fisher, F. Hernandez-Campos, and D. Stotts Forking int pid;

More information

Process Synchronization

Process Synchronization Process Synchronization Part III, Modified by M.Rebaudengo - 2013 Silberschatz, Galvin and Gagne 2009 POSIX Synchronization POSIX.1b standard was adopted in 1993 Pthreads API is OS-independent It provides:

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing Shared Memory Programming with Pthreads (3) Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Last time

More information

Process Management And Synchronization

Process Management And Synchronization Process Management And Synchronization In a single processor multiprogramming system the processor switches between the various jobs until to finish the execution of all jobs. These jobs will share the

More information

More Shared Memory Programming

More Shared Memory Programming More Shared Memory Programming Shared data structures We want to make data structures that can be shared by threads. For example, our program to copy a file from one disk to another used a shared FIFO

More information

CS-345 Operating Systems. Tutorial 2: Grocer-Client Threads, Shared Memory, Synchronization

CS-345 Operating Systems. Tutorial 2: Grocer-Client Threads, Shared Memory, Synchronization CS-345 Operating Systems Tutorial 2: Grocer-Client Threads, Shared Memory, Synchronization Threads A thread is a lightweight process A thread exists within a process and uses the process resources. It

More information

Concurrent Programming Lecture 10

Concurrent Programming Lecture 10 Concurrent Programming Lecture 10 25th September 2003 Monitors & P/V Notion of a process being not runnable : implicit in much of what we have said about P/V and monitors is the notion that a process may

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University Concurrent Programming Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Echo Server Revisited int main (int argc, char *argv[]) {... listenfd = socket(af_inet, SOCK_STREAM, 0); bzero((char

More information

20-EECE-4029 Operating Systems Fall, 2015 John Franco

20-EECE-4029 Operating Systems Fall, 2015 John Franco 20-EECE-4029 Operating Systems Fall, 2015 John Franco Final Exam name: Question 1: Processes and Threads (12.5) long count = 0, result = 0; pthread_mutex_t mutex; pthread_cond_t cond; void *P1(void *t)

More information

Copyright 2013 Thomas W. Doeppner. IX 1

Copyright 2013 Thomas W. Doeppner. IX 1 Copyright 2013 Thomas W. Doeppner. IX 1 If we have only one thread, then, no matter how many processors we have, we can do only one thing at a time. Thus multiple threads allow us to multiplex the handling

More information

Pre-lab #2 tutorial. ECE 254 Operating Systems and Systems Programming. May 24, 2012

Pre-lab #2 tutorial. ECE 254 Operating Systems and Systems Programming. May 24, 2012 Pre-lab #2 tutorial ECE 254 Operating Systems and Systems Programming May 24, 2012 Content Concurrency Concurrent Programming Thread vs. Process POSIX Threads Synchronization and Critical Sections Mutexes

More information

CS510 Operating System Foundations. Jonathan Walpole

CS510 Operating System Foundations. Jonathan Walpole CS510 Operating System Foundations Jonathan Walpole The Process Concept 2 The Process Concept Process a program in execution Program - description of how to perform an activity instructions and static

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Paralleland Distributed Programming. Concurrency

Paralleland Distributed Programming. Concurrency Paralleland Distributed Programming Concurrency Concurrency problems race condition synchronization hardware (eg matrix PCs) software (barrier, critical section, atomic operations) mutual exclusion critical

More information