Chapter 5: Achieving Good Performance


Typically, it is fairly straightforward to reason about the performance of sequential computations. For most programs, it suffices simply to count the number of instructions that are executed. In some cases, we realize that memory system performance is the bottleneck, so we find ways to reduce memory usage or to improve memory locality. In general, programmers are encouraged to avoid premature optimization by remembering the 90/10 rule, which states that 90% of the time is spent in 10% of the code. Thus, a prudent strategy is to write a program in a clean manner, and if its performance needs improving, to identify the 10% of the code that dominates the execution time. This 10% can then be rewritten, perhaps even in some alternative language, such as assembly code or C.

Unfortunately, the situation is much more complex with parallel programs. As we will see, the factors that determine performance are not just instruction times, but also communication time, waiting time, dependences, and so on. Dynamic effects, such as contention, are time-dependent and vary from problem to problem and from machine to machine. Furthermore, controlling these costs is much more complicated. But before considering the complications, consider a fundamental principle of parallel computation.

Amdahl's Law

Amdahl's Law observes that if 1/S of a computation is inherently sequential, then the maximum performance improvement is limited to a factor of S. The reasoning is that the parallel execution time, T_P, of a computation with sequential execution time T_S will be the sum of the time for the sequential component and the time for the parallel component. For P processors we have

    T_P = T_S/S + (1 - 1/S) * T_S/P

Imagining a value for P so large that the parallel portion takes negligible time, the maximum performance improvement is a factor of S. That is, the proportion of sequential code in a computation determines its potential for improvement using parallelism.

Given Amdahl's Law, we can see that the 90/10 rule is not enough: even if the 90% of the execution time goes to 0, leaving the 10% of the code unchanged means our execution time is at best 1/10 of the original, and when we use many more than 10 processors, a 10x speedup is likely to be unsatisfactory.
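To make the formula concrete, the following small sketch (illustrative only, not taken from the text; the function name and the sample numbers are ours) evaluates the predicted parallel time and the resulting speedup for an assumed sequential fraction of 10%.

#include <stdio.h>

/* Predicted parallel time under Amdahl's Law: a fraction 1/S of the
 * sequential time t_s is inherently sequential; the remainder divides
 * perfectly among P processors. */
double amdahl_time(double t_s, double S, int P)
{
    return t_s / S + (1.0 - 1.0 / S) * t_s / P;
}

int main(void)
{
    double t_s = 100.0;   /* sequential execution time, arbitrary units */
    double S = 10.0;      /* 1/S = 10% of the work is inherently sequential */

    for (int P = 1; P <= 1024; P *= 4) {
        double t_p = amdahl_time(t_s, S, P);
        printf("P = %4d   T_P = %7.2f   speedup = %5.2f\n", P, t_p, t_s / t_p);
    }
    /* As P grows, the speedup approaches but never exceeds S = 10. */
    return 0;
}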

The situation is actually somewhat worse than Amdahl's Law implies. One obvious problem is that the parallelizable portion of the computation might not be improvable to an unlimited extent; that is, there is probably an upper limit on the number of processors that can be used and still improve the performance, so the parallel execution time is unlikely to vanish. Furthermore, a parallel implementation often executes more total instructions than the sequential solution, making the (1 - 1/S) * T_S term an underestimate.

Amdahl's Law was enunciated in a 1967 paper by Gene Amdahl, an IBM mainframe architect [Amdahl, G.M., Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings 30, AFIPS Press, 1967]. It is a law in the same sense that the Law of Supply and Demand is a law: it describes a relationship between two components of program execution time, as expressed by the equation given in the text. Both laws are powerful tools to explain the behavior of important phenomena, and both laws assume as constant other quantities that affect the behavior. Amdahl's Law applies to a program instance.

Many, including Amdahl, have interpreted the law as proof that applying large numbers of processors to a problem will have limited success, but this seems to contradict news reports in which huge parallel computers improve computations by huge factors. What gives? Amdahl's Law describes a key fact that applies to an instance of a computation: portions of a computation that are sequential will, as parallelism is applied, dominate the execution time. The law fixes an instance and considers the effect of increasing parallelism. Most parallel computations, such as those in the news, fix the parallelism and expand the instances. In such cases the proportion of sequential code diminishes relative to the overall problem as larger instances are considered. So, doubling the problem size may increase the sequential portion negligibly, making a greater fraction of the problem available for parallel execution. In summary, Amdahl's Law does not deny the value of parallel computing. Rather, it reminds us that to achieve parallel performance we must be concerned with the entire program.

Measuring Performance

As mentioned repeatedly, the main point of parallel computing is to run computations faster. Faster obviously means in less time, but we immediately wonder, how much less? To understand both what is possible and what we can expect to achieve, we use several metrics to measure parallel performance, each with its own strengths and weaknesses.

Execution Time

Perhaps the most intuitive metric is execution time. Most of us think of the so-called wall clock time as synonymous with execution time, and for programs that run for hours and hours, that equivalence is accurate enough. But the elapsed wall clock time includes operating system time for loading and initiating the program, I/O time for reading data, paging time for the compulsory page misses, check-pointing time, and so on.

For short computations, the kind that we often use when we are analyzing program behavior, these items can be significant contributors to execution time. One argument says that because they are not affected by the user programming, they should be factored out of performance analysis that is directed at understanding the behavior of a parallel solution; the other view says that some services provided by the OS are needed, and the time should be charged. It is a complicated matter that we take up again at the end of the chapter. In this book we use execution time to refer to the net execution time of a parallel program, exclusive of initial OS, I/O, and similar charges. The problem of compulsory page misses is usually handled by running the computation twice and measuring only the second one. When we intend to include all of the components contributing to execution time, we will refer to wall clock time.

Notice that execution times (and wall clock times, for that matter) cannot be compared if they come from different computers. And, in most cases, it is not possible to compare the execution times of programs running different inputs, even for the same computer.

FLOPS

Another common metric is FLOPS, short for floating point operations per second, which is often used in scientific computations that are dominated by floating point arithmetic. Because double precision floating point arithmetic is usually significantly more expensive than single precision, it is common when reporting FLOPS to state which type of arithmetic is being measured. An obvious downside to using FLOPS is that it ignores other costs such as integer computations, which may also be a significant component of computation time. Perhaps more significant is that FLOPS rates can often be affected by extremely low-level program modifications that allow the program to exploit a special feature of the hardware, e.g. a combined multiply/add operation. Such improvements typically have little generality, either to other computations or to other computers.

A limitation of both of the above metrics is that they distill all performance into a single number without providing an indication of the parallel behavior of the computation. Instead, we often wish to understand how the performance of the program scales as we change the amount of parallelism.

Speedup

Speedup is defined as the execution time of a sequential program divided by the execution time of a parallel program that computes the same result. In particular, Speedup = T_S / T_P, where T_S is the sequential time and T_P is the parallel time running on P processors. Speedup is often plotted on the y-axis and the number of processors on the x-axis, as shown in Figure 5.1.

Figure 5.1. A typical speedup graph showing performance for two programs. (The plot shows speedup, from 0 to 48, against the number of processors, from 0 to 64, with one curve for Program1 and one for Program2.)

The speedup graph shows a characteristic typical of many parallel programs, namely, that the speedup curves level off as we increase the number of processors. This feature is the result of keeping the problem size constant while increasing the number of processors, which causes the amount of work per processor to decrease; with less work per processor, costs such as overhead or sequential computation (as Amdahl predicted) become more significant, causing the total execution time not to scale so well.

Efficiency

Efficiency is a normalized measure of speedup: Efficiency = Speedup / P. Ideally, speedup should scale linearly with P, implying that efficiency should have a constant value of 1. Of course, because of various sources of performance loss, efficiency is more typically below 1, and it diminishes as we increase the number of processors. Efficiency greater than 1 represents superlinear speedup.
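As a quick illustration of these definitions, the following sketch (ours, with invented timings) converts measured execution times into speedup and efficiency and flags the superlinear case.

#include <stdio.h>

/* t_seq: best sequential time; t_par: parallel time on p processors. */
void report(double t_seq, double t_par, int p)
{
    double speedup = t_seq / t_par;
    double efficiency = speedup / p;
    printf("P = %3d   speedup = %6.2f   efficiency = %4.2f%s\n",
           p, speedup, efficiency,
           efficiency > 1.0 ? "   (superlinear)" : "");
}

int main(void)
{
    report(100.0, 27.0, 4);    /* hypothetical measurements */
    report(100.0, 15.5, 8);
    report(100.0, 1.4, 64);    /* efficiency above 1 indicates superlinear speedup */
    return 0;
}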

Superlinear Speedup

The upper curve in the Figure 5.1 graph indicates superlinear speedup, which occurs when speedup grows faster than the number of processors. How is this possible? Surely the sequential program, which is the basis for the speedup computation, could just simulate the P processes of the parallel program to achieve an execution time that is no more than P times the parallel execution time. Shouldn't superlinear speedup be impossible?

There are two reasons why superlinear speedup occurs. The most common reason is that the computation's working set, that is, the set of pages needed for the computationally intensive part of the program, does not fit in the cache when executed on a single processor, but it does fit into the caches of the multiple processors when the problem is divided amongst them for parallel execution. In such cases the superlinear speedup derives from improved execution time due to the more efficient memory system behavior of the multi-processor execution.

The second case of superlinear speedup occurs when performing a search that is terminated as soon as the desired element is found. When performed in parallel, the search is effectively performed in a different order, implying that the total amount of data searched can actually be less than in the sequential case. Thus, the parallel execution actually performs less work.

Issues with Speedup and Efficiency

Since speedup is a ratio of two execution times, it is a unitless metric that would seem to factor out technological details such as processor speed. Instead, such details insidiously affect speedup, so we must be careful in interpreting speedup figures. There are several concerns.

First, recognize that it is difficult to compare speedups from machines of different generations, even if they have the same architecture. The problem is that different components of a parallel machine are generally improved by different amounts, changing their relative importance. So, for example, processor performance has increased over time, but communication latency has not fallen proportionately. Thus, the time spent communicating will not have diminished as much as the time spent computing. As a result, speedup values have generally decreased over time. Stated another way, the parallel components of a computation have become relatively more expensive compared to the processing components.

The second issue concerns T_S, speedup's numerator, which should be the time for the fastest sequential solution for the given processor and problem size. If T_S is artificially inflated, speedup will be greater. A subtle way to increase T_S is to turn off scalar compiler optimizations for both the sequential and parallel programs, which might seem fair since it uses the same compiler for both programs. However, such a change effectively slows the processors and improves, relatively speaking, communication latency. When reporting speedup, the sequential program should be provided and the compiler optimization settings detailed.

Another common way to increase T_S is to measure the one-processor performance of the parallel program. Speedup computed on this basis is called relative speedup and should be reported as such. True speedup includes the likely possibility that the sequential algorithm is different from the parallel algorithm. Relative speedup, which simply compares different runs of the same algorithm, takes as the base case an algorithm optimized for concurrent execution but with no parallelism; it will likely run slower because of parallel overheads, causing the speedup to look better. Notice that it can happen that a well-written parallel program on one processor is faster than any known sequential program, making it the best sequential program. In such cases we have true speedup, not relative speedup. The situation should be explicitly identified.

Relative speedup cannot always be avoided. For example, for large computations it may be impossible to measure a sequential program on a given problem size, because the data structures do not fit in memory. In such cases relative speedup is all that can be reported. The base case will be a parallel computation on a small number of processors, and the y-axis of the speedup plot should be scaled by that amount. So, for example, if the smallest possible run has P=4, then dividing by the runtime for P=64 will show perfect speedup at y=16.

Another way to inadvertently affect T_S is the cold start problem. An easy way to accidentally get a large T_S value is to run the sequential program once and include all of the paging behavior and compulsory cache misses in its timing. As noted earlier, it is good practice to run a parallel computation a few times, measuring only the later runs. This allows the caches to warm up, so that compulsory cache miss times are not unnecessarily included in the performance measure, thereby complicating our understanding of the program's speedup. (Of course, if the program has conflict misses, they should and will be counted.) Properly, most analysts warm their programs. But the sequential program should be warmed, too, so that the paging and compulsory misses do not figure into its execution time. Though easily overlooked, cold starts are also easily corrected.

More worrisome are computations that involve considerable off-processor activity, e.g. disk I/O. One-time I/O bursts, say to read in problem data, are fine because timing measurements can bypass them; the problem is continual off-processor operations. Not only are they slow relative to the processors, but they greatly complicate the speedup analysis of a computation. For example, if both the sequential and parallel solutions have to perform the same off-processor operations from a single source, huge times for these operations can completely obscure the parallelism because they will dominate the measurements. In such cases it is not necessary to parallelize the program at all. If processors can independently perform the off-processor operations, then this parallelism alone dominates the speedup computation, which will likely look perfect. Any measurements of a computation involving off-processor charges must control their effects carefully.

Performance Trade-Offs

We know that communication time, idle time, wait time, and many other quantities can affect the performance of a parallel computation. The complicating factor is that attempts to lower one cost can increase others. In this section we consider such complications.

Communication vs. computation

Communication costs are a direct expense of using parallelism because they do not arise in sequential computing. Accordingly, it is almost always smart to attempt to reduce them.

Overlap Communication and Computation. One way to reduce communication costs is to overlap communication with computation. Because communication can be performed concurrently with computation, and because the computation must be performed anyway, a perfect overlap (that is, the data is available when it is needed) hides the communication cost perfectly. Partial overlap will diminish waiting time and give partial improvement. The key, of course, is to identify computation that is independent of the communication. From a performance perspective, overlapping is generally a win without costs.

From a programming perspective, however, overlapping communication and computation can complicate the program's structure.

Redundant Computation. Another way to reduce communication costs is to perform redundant computations. We observed in Chapter 2, for example, that the local generation of a random number, r, by all processes was superior to generating the value in one process and requiring all others to reference it. Unlike overlapping, redundant computation incurs a cost because there is no parallelism when all processors must execute the random number generator code. Stated another way, we have increased the total number of instructions to be executed in order to remove the communication cost. Whenever the cost of the redundant computation is less than the communication cost, redundant computation is a win.

Notice that redundant computation also removes a dependence from the original program between the generating process and the others that will need the value. It is useful to remove dependences even if the cost of the added computation exactly matches the communication cost. In the case of the random number generation, redundant computation removes the possibility that a client process will have to wait for the server process to produce it. If the client can generate its own random number, it does not have to wait. Such cases complicate assessing the trade-off.

Memory vs. parallelism

Memory usage and parallelism interact in many ways. Perhaps the most favorable is the cache effect that leads to superlinear parallel performance, noted above. With all processors having caches, there is more fast memory in a parallel computer. But there are other cases where memory and parallelism interact.

Privatization. For example, parallelism can be increased by using additional memory to break false dependences. One memorable example is the use of private_count variables in the Count 3s program, which removed the need for threads to interact each time they recorded the next 3. The effect was to increase the number of count variables from 1 to t, the number of threads. It is a tiny memory cost for a big savings in reduced dependences.

Batching. One way to reduce the number of dependences is to increase the granularity of interaction. Batching is a programming technique in which work or transmissions are performed as a group. For example, rather than transmitting elements of an array, transmit a whole row or column; rather than grabbing one task from the task queue, get several. Batching effectively raises the granularity (see below) of fine-grain interactions to reduce their frequency. The added memory is simply required to record the items of the batch, and like privatization, batching is almost always worth the memory costs.

Memoization. Memoization stores a computed value to avoid re-computing it later. An example is a stencil optimization: a value is computed from a combination of its neighbors' values, each scaled by a coefficient that depends on the neighbor's position in the stencil; elements such as the corner elements are multiplied by the scale factor four times as the stencil moves through the array, and memoizing this value can reduce the number of multiplies and memory references. [DETAILED EXAMPLE HERE.] It is a sensible program optimization that removes instruction executions that, strictly speaking, may not result in parallelism improvements. However, in many cases memoization will result in better parallelism, as when the computation is redundant or involves non-local data values.

Padding. Finally, we note that false sharing, in which references to independent variables become dependent because the variables are allocated to the same cache line, can be eliminated by padding data structures to push the values onto different cache lines, as the sketch below illustrates.
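The following sketch applies the padding idea to per-thread counters like those of the Count 3s program; it is our own example, and the 64-byte cache line size is an assumption rather than a value taken from the text.

#define CACHE_LINE 64   /* assumed cache line size, in bytes */
#define T 8             /* number of threads */

/* Without padding, adjacent counters pack into the same cache line, so
 * independent updates by different threads cause false sharing. The pad
 * field pushes each counter onto its own line. */
struct padded_count {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_count private_count[T];   /* thread i touches only private_count[i].count */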

Overhead vs. parallelism

Parallelism and overhead are sometimes at odds. At one extreme, all parallel overhead, such as lock contention, can be avoided by using just one process; as we increase the number of threads, the contention will likely increase. If the problem size remains fixed, each processor has less work to perform between synchronizations, causing synchronization to become a larger portion of the overall computation. And a smaller problem size implies that there is less computation available to overlap with communication, which will typically increase the wait times for data.

It is the overhead of parallelism that is usually the reason why P cannot increase without bound. Indeed, even computations that could conceptually be solved with a processor devoted to each data point will be buried by overhead before P = n. Thus, we find that most programs have an upper limit for each data size at which the marginal value of an additional processor is negative; that is, adding a processor causes the execution time to increase.

Parallelize Overhead. Recall that in Chapter 4, when lock contention became a serious concern, we adopted a combining tree to solve it. In essence, the threads split up the task of accumulating intermediate values into several independent parallel activities.

[THIS SECTION CONTINUES WITH THESE TOPICS]

Load balance vs. parallelism. Increased parallelism can also improve load balance, as it is often easier to distribute evenly a large number of fine-grained units of work than a smaller number of coarse-grained units of work.

Granularity tradeoffs. Many of the above tradeoffs are related to the granularity of parallelism. The best granularity often depends on both algorithmic characteristics, such as the amount of parallelism and the types of dependences, and hardware characteristics, such as the cache size, the cache line size, and the latency and bandwidth of the machine's communication substrate.

Latency vs. bandwidth. As discussed in Chapter 3, there are many instances where bandwidth can be used to reduce latency.

Scaled speedup vs. Fixed-Size speedup

Choosing a problem size can be difficult. What should we measure? The kernel or the entire program? Amdahl's Law says that everything is important!

Operating System Costs

Because operating systems are so integral to computation, it is complicated to assess their effects on performance.

Initialization. How is memory laid out in the parallel computer?

Summary

Exercises

Chapter 6: Programming with Threads

Recall in Chapter 1 that we used threads to implement the count 3's program. In this chapter we'll explore thread-based programming in more detail using the standard POSIX Threads interface. We'll first explain the basic concepts needed to create threads and to let them interact with one another. We'll then discuss issues of safety and performance before we step back and evaluate the overall approach.

Thread Creation and Destruction

Consider the following standard code:

 1  #include <pthread.h>
 2  int err;
 3
 4  void main ()
 5  {
 6      pthread_t tid[max];    /* An array of thread IDs, one for each */
 7                             /* thread that is created */
 8
 9      for (i=0; i<t; i++)
10      {
11          err = pthread_create (&tid[i], NULL, count3s_thread, i);
12      }
13
14      for (i=0; i<t; i++)
15      {
16          err = pthread_join (tid[i], &status[i]);
17      }
18  }

The above code shows a main() function, which creates and launches t threads in the first loop and then waits for the t threads to complete in the second loop. We often refer to the creating thread as the parent and the created threads as children.

The above code differs from the pseudocode in Chapter 1 in a few details. Line 1 includes the pthreads header file, which declares the various pthreads routines and datatypes. Each thread that is created needs its own thread ID, so these thread IDs are declared on line 6. To create a thread, we invoke the pthread_create() routine with four parameters. The first parameter is a pointer to a thread ID, which will point to a valid thread ID when this call successfully returns. The second argument provides the thread's attributes; in this case, the NULL value specifies default attributes. The third parameter is a pointer to the start function, which the thread will execute once it's created. The fourth argument is passed to the start routine; in this case, it is a unique integer between 0 and t-1 that is associated with each thread. The second loop then calls pthread_join() (line 16) to wait for each of the child threads to terminate.

If instead of waiting for the child threads to complete, the main() routine finishes and exits using pthread_exit(), the child threads will continue to execute. Otherwise, the child threads will automatically terminate when main() finishes, since the entire process will have terminated. See Code Specs 1 and 2.

pthread_create()

int pthread_create (                   // create a new thread
    pthread_t *tid,                    // thread ID
    const pthread_attr_t *attr,        // thread attributes
    void *(*start_routine) (void *),   // pointer to function to execute
    void *arg                          // argument to function
);

Arguments: The thread ID of the successfully created thread. The thread's attributes, explained below; the NULL value specifies default attributes. The function that the new thread will execute once it is created. An argument passed to start_routine(); in this case, it is a unique integer between 0 and t-1 that is associated with each thread.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Notes: Use a structure to pass multiple arguments to the start routine, as sketched after Code Spec 2.

Code Spec 1. pthread_create(). The POSIX Threads thread creation function.

pthread_join()

int pthread_join (       // wait for a thread to terminate
    pthread_t tid,       // thread ID to wait for
    void **status        // exit status
);

Arguments: The ID of the thread to wait for. The completion status of the exiting thread will be copied into *status unless status is NULL, in which case the completion status is not copied.

Return value: 0 for success. Error code from <errno.h> otherwise.

Notes: Once a thread is joined, the thread no longer exists, its thread ID is no longer valid, and it cannot be joined with any other thread.

Code Spec 2. pthread_join(). The POSIX Threads rendezvous function pthread_join().
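The note in Code Spec 1 recommends a structure for passing multiple arguments to the start routine. A minimal sketch of that idiom follows; the thread_args type, the worker() function, and the element counts are our own illustrative choices, not part of the text.

#include <pthread.h>
#include <stdio.h>

#define T 4

typedef struct {
    int id;        /* logical thread index */
    int *array;    /* shared input array */
    int length;    /* number of elements in the array */
} thread_args;

void *worker(void *arg)
{
    thread_args *a = (thread_args *) arg;   /* recover the argument structure */
    printf("thread %d works on %d elements\n", a->id, a->length / T);
    return NULL;
}

int main(void)
{
    pthread_t tid[T];
    thread_args args[T];
    static int data[1000];

    for (int i = 0; i < T; i++) {
        args[i].id = i;            /* one structure per thread, so no thread */
        args[i].array = data;      /* reads another thread's argument values */
        args[i].length = 1000;
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < T; i++)
        pthread_join(tid[i], NULL);
    return 0;
}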

Thread IDs

Each thread has a unique ID of type pthread_t. As with all pthreads data types, a thread ID should be treated as an opaque type, meaning that individual fields of the structure should never be accessed directly. Because child threads do not know their thread ID, two routines allow a thread to determine its own thread ID, pthread_self(), and to compare two thread IDs, pthread_equal(); see Code Specs 3 and 4.

pthread_self()

pthread_t pthread_self ();   // Get my thread ID

Return value: The ID of the thread that called this function.

Code Spec 3. pthread_self(). The POSIX Threads function to fetch a thread's ID.

pthread_equal()

int pthread_equal (    // Test for equality
    pthread_t t1,      // First operand thread ID
    pthread_t t2       // Second operand thread ID
);

Arguments: Two thread IDs.

Return value: Non-zero if the two thread IDs are the same (following the C convention); 0 if the two threads are different.

Code Spec 4. pthread_equal(). The POSIX Threads function to compare two thread IDs for equality.
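As a small illustration of these two routines (our sketch, not from the text), a start routine can compare its own ID against an ID recorded by the parent:

#include <pthread.h>
#include <stdio.h>

static pthread_t parent_tid;   /* recorded by the parent before creating the child */

void *start(void *arg)
{
    if (pthread_equal(pthread_self(), parent_tid))
        printf("running in the parent thread\n");   /* never taken in this program */
    else
        printf("running in a child thread\n");
    return NULL;
}

int main(void)
{
    pthread_t tid;

    parent_tid = pthread_self();
    pthread_create(&tid, NULL, start, NULL);
    pthread_join(tid, NULL);
    return 0;
}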

Destroying Threads

There are three ways that threads can terminate.
1. A thread can return from the start routine.
2. A thread can call pthread_exit().
3. A thread can be cancelled by another thread.
In each case, the thread is destroyed and its resources become unavailable.

pthread_exit()

void pthread_exit (    // terminate a thread
    void *status       // completion status
);

Arguments: The completion status of the thread that has exited. This pointer value is available to other threads.

Return value: None.

Notes: When a thread exits by simply returning from the start routine, the thread's completion status is set to the start routine's return value.

Code Spec 5. pthread_exit(). The POSIX Threads thread termination function pthread_exit().

Thread Attributes

Each thread maintains its own properties, known as attributes, which are stored in a structure of type pthread_attr_t. For example, threads can be either detached or joinable. Detached threads cannot be joined with other threads, so they have slightly lower overhead in some implementations of POSIX Threads. For parallel computing, we will rarely need detached threads. Threads can also be either bound or unbound. Bound threads are scheduled by the operating system, whereas unbound threads are scheduled by the Pthreads library. For parallel computing, we typically use bound threads so that each thread provides physical concurrency. POSIX Threads provides routines to initialize thread attributes, set their attributes, and destroy attributes, as shown in Code Spec 6.

Thread Attributes

pthread_attr_t attr;                     // Declare a thread attribute
pthread_t tid;

pthread_attr_init(&attr);                // Initialize a thread attribute
pthread_attr_setdetachstate(&attr,       // Set the thread attribute
                            PTHREAD_CREATE_JOINABLE);

pthread_create (&tid, &attr, start_func, NULL);   // Use the attribute
                                                  // to create a thread
pthread_join(tid, NULL);
pthread_attr_destroy(&attr);             // Destroy the thread attribute

Code Spec 6. pthread attributes. An example of how thread attributes are set in the POSIX Threads interface.

Example

The following example illustrates a potential pitfall that can occur because of the interaction between parent and child threads. The parent thread simply creates a child thread and waits for the child to exit. The child thread does some useful work and then exits, returning an error code. Do you see what's wrong with this code?

 1  #include <pthread.h>
 2
 3  void main ()
 4  {
 5      pthread_t tid;
 6      int *status;
 7
 8      pthread_create (&tid, NULL, start, NULL);
 9      pthread_join (tid, &status);
10  }
11
12  void start()
13  {
14      int errorcode;
15      /* do something useful... */
16
17      if (... )
18          errorcode = something;
19      pthread_exit(&errorcode);
20  }

The problem occurs in the call to pthread_exit() on line 19, where the child is attempting to return an error code to the parent. Unfortunately, because errorcode is declared to be local to the start() function, the memory for errorcode is allocated on the child thread's stack. When the child exits, its call stack is de-allocated, and the parent has a dangling pointer to errorcode. At some point in the future, when a new procedure is invoked, it will over-write the stack location where errorcode resides, and the value of errorcode will change.
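One way to repair the example, sketched below as our own suggestion rather than the text's, is to return a value whose lifetime outlives the child's stack, for instance heap memory that the parent frees after the join; the error value 42 is arbitrary.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *start(void *arg)
{
    /* Allocate the error code on the heap so it survives the thread's stack. */
    int *errorcode = malloc(sizeof(int));
    *errorcode = 42;             /* arbitrary error value for illustration */
    pthread_exit(errorcode);     /* equivalent to: return errorcode; */
}

int main(void)
{
    pthread_t tid;
    void *status;

    pthread_create(&tid, NULL, start, NULL);
    pthread_join(tid, &status);              /* status now points to the heap value */
    printf("child returned %d\n", *(int *) status);
    free(status);                            /* the parent owns the memory after the join */
    return 0;
}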

Mutual Exclusion

We can now create and destroy threads, but to allow threads to interact constructively, we need methods of coordinating their interaction. In particular, when two threads share access to memory, it is often useful to employ a lock, called a mutex, to provide mutual exclusion, or mutually exclusive access, to the variable. As we saw in Chapter 1, without mutual exclusion, race conditions can lead to unpredictable results, because when multiple threads execute the following code, the count variable, which is shared among all threads, will not be atomically updated.

    for (i=start; i<start+length_per_thread; i++)
    {
        if (array[i] == 3)
        {
            count++;
        }
    }

The solution, of course, is to protect the update of count using a mutex, as shown below:

 1  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 2
 3  void count3s_thread (int id)
 4  {
 5      /* Compute portion of array that this thread should work on */
 6      int length_per_thread = length/t;
 7      int start = id * length_per_thread;
 8
 9      for (i=start; i<start+length_per_thread; i++)
10      {
11          if (array[i] == 3)
12          {
13              pthread_mutex_lock(&lock);
14              count++;
15              pthread_mutex_unlock(&lock);
16          }
17      }
18  }

Line 1 shows how a mutex can be statically declared. Like threads, mutexes have attributes, and by initializing the mutex to PTHREAD_MUTEX_INITIALIZER, the mutex is assigned default attributes. To use this mutex, its address is passed to the lock and unlock routines on lines 13 and 15, respectively.

The appropriate discipline, of course, is to bracket all critical sections, that is, code that must be executed atomically by only one thread at a time, by the locking of a mutex upon entrance and the unlocking of that mutex upon exit. Only one thread can acquire the mutex at any one time, so a thread will block if it attempts to acquire a mutex that is already held by another thread. When a mutex is unlocked, or relinquished, one of the threads that was blocked attempting to acquire the lock will become unblocked and granted the mutex. The POSIX Threads standard defines no notion of fairness, so the order in which the locks are acquired is not guaranteed to match the order in which the threads attempted to acquire the locks.

It is an error to unlock a mutex that has not been locked, and it is an error to lock a mutex that is already held. The latter will lead to deadlock, in which the thread cannot make progress because it is blocked waiting for an event that cannot happen. We will discuss deadlock and techniques to avoid deadlock in more detail later in the chapter.

Acquiring and Releasing Mutexes

int pthread_mutex_lock(       // Lock a mutex
    pthread_mutex_t *mutex);

int pthread_mutex_unlock(     // Unlock a mutex
    pthread_mutex_t *mutex);

int pthread_mutex_trylock(    // Non-blocking lock
    pthread_mutex_t *mutex);

Arguments: Each function takes the address of a mutex variable.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Notes: The pthread_mutex_trylock() routine attempts to acquire a mutex but will not block. This routine returns EBUSY if the mutex is locked.

Code Spec 7. The POSIX Threads routines for acquiring and releasing mutexes.
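As one illustration of the non-blocking variant (our sketch; the loop bound and variable names are invented), a thread can use pthread_mutex_trylock() to do other useful work when the mutex is busy instead of blocking:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long shared_counter = 0;      /* protected by lock */

void *worker(void *arg)
{
    long deferred = 0;        /* work that needs no lock */

    for (int i = 0; i < 100000; i++) {
        if (pthread_mutex_trylock(&lock) == 0) {
            shared_counter++;             /* acquired the mutex: update shared state */
            pthread_mutex_unlock(&lock);
        } else {
            deferred++;                   /* mutex busy (EBUSY): do something else */
        }
    }
    printf("deferred iterations: %ld\n", deferred);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}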

Serializability

It's clear that our use of mutexes provides atomicity: the thread that acquires the mutex will execute the code in the critical section until it relinquishes the mutex. Thus, in our above example, the counter will be updated by only one thread at a time. Atomicity is important because it ensures serializability: a concurrent execution is serializable if the execution is guaranteed to execute in an order that corresponds to some serial execution of those threads.

Mutex Creation and Destruction

In our above example, we knew that only one mutex was needed, so we were able to statically allocate it. In cases where the number of required mutexes is not known a priori, we can instead allocate and deallocate mutexes dynamically. Code Spec 8 shows the relevant routines, and Code Spec 9 shows how such a mutex is dynamically allocated, initialized with default attributes, and destroyed.

Mutex Creation and Destruction

int pthread_mutex_init(            // Initialize a mutex
    pthread_mutex_t *mutex,
    pthread_mutexattr_t *attr);

int pthread_mutex_destroy (        // Destroy a mutex
    pthread_mutex_t *mutex);

int pthread_mutexattr_init(        // Initialize a mutex attribute
    pthread_mutexattr_t *attr);

int pthread_mutexattr_destroy (    // Destroy a mutex attribute
    pthread_mutexattr_t *attr);

Arguments: The pthread_mutex_init() routine takes two arguments, a pointer to a mutex and a pointer to a mutex attribute. The latter is presumed to have already been initialized. The pthread_mutexattr_init() and pthread_mutexattr_destroy() routines take a pointer to a mutex attribute as their argument.

Notes: If the second argument to pthread_mutex_init() is NULL, default attributes will be used.

Code Spec 8. The POSIX Threads routines for dynamically creating and destroying mutexes.

Dynamically Allocated Mutexes

pthread_mutex_t *lock;    // Declare a pointer to a lock

lock = (pthread_mutex_t *) malloc(sizeof(pthread_mutex_t));
pthread_mutex_init(lock, NULL);
/*
 * Code that uses this lock.
 */
pthread_mutex_destroy (lock);
free (lock);

Code Spec 9. An example of how dynamically allocated mutexes are used in the POSIX Threads interface.

Synchronization

Mutexes are sufficient to provide atomicity for critical sections, but in many situations we would like a thread to synchronize its behavior with that of some other thread. For example, consider a classic bounded buffer problem in which one or more threads put items into a circular buffer while other threads remove items from the same buffer.

As shown in Figure 1, we would like the producers to stop producing data and wait if the consumer is unable to keep up and the buffer becomes full, and we would like the consumers to wait if the buffer is empty.

Figure 1. A bounded buffer with producers and consumers. The Put and Get cursors indicate where the producers will insert the next item and where the consumers will remove the next item, respectively. When the buffer is empty, the consumers must wait. When the buffer is full, the producers must wait. (The figure shows a circular buffer in its empty and full states, with the Put and Get cursors.)

Such synchronization is supported by condition variables, which are a more general form of synchronization than joining threads. A condition variable allows threads to wait until some condition becomes true, at which point one of the waiting threads is nondeterministically chosen to stop waiting. We can think of the condition variable as a gate (see Figure 2). Threads wait at the gate until some condition is true. Other threads open the gate to signal that the condition has become true, at which point one of the waiters is allowed to enter the gate and resume execution. If a thread opens the gate when there are no threads waiting, the signal has no effect.

Figure 2. Condition variables act like a gate. Threads wait outside the gate by calling pthread_cond_wait(), and threads open the gate by calling pthread_cond_signal(). When the gate is opened, one waiter is allowed through. If there are no waiters when the gate is opened, the signal has no effect.

We can solve our bounded buffer problem with two condition variables, nonempty and nonfull, as shown below.

 1  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 2  pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
 3  pthread_cond_t nonfull = PTHREAD_COND_INITIALIZER;
 4  Item buffer[SIZE];
 5  int in = 0;    // Buffer index for next insertion
 6  int out = 0;   // Buffer index for next removal
 7
 8  void put (Item x)                            // Producer thread
 9  {
10      pthread_mutex_lock(&lock);
11      while ((in - out) == SIZE)               // While buffer is full
12          pthread_cond_wait(&nonfull, &lock);
13      buffer[in % SIZE] = x;
14      in++;
15      pthread_cond_signal(&nonempty);
16      pthread_mutex_unlock(&lock);
17  }
18
19  Item get()                                   // Consumer thread
20  {
21      Item x;
22      pthread_mutex_lock(&lock);
23      while (out == in)                        // While buffer is empty
24          pthread_cond_wait(&nonempty, &lock);
25      x = buffer[out % SIZE];
26      out++;
27      pthread_cond_signal(&nonfull);
28      pthread_mutex_unlock(&lock);
29      return x;
30  }

Of course, since multiple threads will be updating these condition variables, we need to protect their access with a mutex, so Line 1 declares a mutex. The remaining declarations define a buffer, buffer, and its two cursors, in and out, which indicate where to insert the next item and where to remove the next item. The two cursors wrap around when they exceed the bounds of buffer, yielding a circular buffer.

Given these data structures, the producer thread executes the put() routine, which first acquires the mutex to access the condition variables. (This code omits the actual creation of the producer and consumer threads, which are assumed to iteratively invoke the put() and get() routines, respectively; a sketch of that wrapper appears below.) If the buffer is full, the producer waits on the nonfull condition so that it will later be awakened when the buffer becomes non-full. If this thread blocks, the mutex that it holds must be relinquished to avoid deadlock. Because these two events, the releasing of the mutex and the blocking of this waiting thread, must occur atomically, they must be performed by pthread_cond_wait(), so the mutex is passed as a parameter to pthread_cond_wait(). When the producer resumes execution after returning from the wait on Line 12, the protecting mutex will have been re-acquired by the system on behalf of the producer.

In a moment we will explain the need for the while loop on Line 11, but for now assume that when the producer executes Line 13, the buffer is not full, so it is safe to insert a new item and to bump the in cursor by one. At this point, the buffer cannot be empty because the producer has just inserted an element, so the producer signals that the buffer is nonempty, waking one or more consumers that may be waiting on an empty buffer. If there are no waiting consumers, the signal is lost. Finally, the producer releases the mutex and exits the routine. The consumer thread executes the get() routine, which operates in a very similar manner.
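A minimal sketch of that omitted wrapper follows; it is our own illustration and assumes the declarations of Item, put(), and get() given above, while make_item() and consume() are hypothetical helpers.

/* Assumes the bounded buffer declarations, put(), and get() shown above,
 * plus #include <pthread.h>. */
void *producer(void *arg)
{
    for (int i = 0; i < 100; i++) {
        Item x = make_item(i);   /* hypothetical helper that builds an Item */
        put(x);                  /* blocks while the buffer is full */
    }
    return NULL;
}

void *consumer(void *arg)
{
    for (int i = 0; i < 100; i++) {
        Item x = get();          /* blocks while the buffer is empty */
        consume(x);              /* hypothetical helper that uses the Item */
    }
    return NULL;
}

int main(void)
{
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}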

pthread_cond_wait()

int pthread_cond_wait(
    pthread_cond_t *cond,      // Condition to wait on
    pthread_mutex_t *mutex);   // Protecting mutex

int pthread_cond_timedwait (
    pthread_cond_t *cond,
    pthread_mutex_t *mutex,
    const struct timespec *abstime);   // Time-out value

Arguments: A condition variable to wait on. A mutex that protects access to the condition variable. The mutex is released before the thread blocks, and these two actions occur atomically. When this thread is later unblocked, the mutex is reacquired on behalf of this thread.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Code Spec 10. pthread_cond_wait(). The POSIX Threads routines for waiting on condition variables.

pthread_cond_signal()

int pthread_cond_signal(       // Condition to signal
    pthread_cond_t *cond);

int pthread_cond_broadcast (   // Condition to signal
    pthread_cond_t *cond);

Arguments: A condition variable to signal.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Notes: These routines have no effect if there are no threads waiting on cond. In particular, there is no memory of the signal when a later call is made to pthread_cond_wait(). The pthread_cond_signal() routine may wake up more than one thread, but only one of these threads will hold the protecting mutex. The pthread_cond_broadcast() routine wakes up all waiting threads. Only one awakened thread will hold the protecting mutex.

Code Spec 11. pthread_cond_signal(). The POSIX Threads routines for signaling a condition variable.

Protecting Condition Variables

Let us now return to the while loop on Line 11 of the bounded buffer program. If our system has multiple producer threads, this loop is essential because pthread_cond_signal() can wake up multiple waiting threads [1], of which only one will hold the protecting mutex at any particular time. Thus, at the time of the signal, the buffer is not full, but when any particular thread acquires the mutex, the buffer may have become full again, in which case the thread should call pthread_cond_wait() again. When the producer thread executes Line 13, the buffer is necessarily not full, so it is safe to insert a new item and to bump the in cursor.

We see on Lines 15 and 27 that the call to pthread_cond_signal() is also protected by the lock. The following example shows that this protection is necessary.

    time    Signaling Thread                      Waiting Thread
      |                                           lock (mutex)
      |                                           while (out == in)
      |     insert(item);
      |     pthread_cond_signal(&nonempty);       // Signal is dropped
      v                                           pthread_cond_wait(&nonempty, lock);
                                                  // Will wait forever

Figure 3. Example of why a signaling thread needs to be protected by a mutex.

In this example, the waiting thread, in this case the consumer, acquires the protecting mutex and finds that the buffer is empty, so it executes pthread_cond_wait(). If the signaling thread, in this case the producer, does not protect the call to pthread_cond_signal() with a mutex, it could insert an item into the buffer immediately after the waiting thread found it empty. If the producer then signals that the buffer is non-empty before the waiting thread executes the call to pthread_cond_wait(), the signal will be dropped and the consumer thread will not realize that the buffer is actually not empty. In the case that the producer only inserts a single item, the waiting thread will needlessly wait forever.

The problem, of course, is that there is a race condition involving the manipulation of the buffer. The obvious solution is to protect the call to pthread_cond_signal() with the same mutex that protects the call to pthread_cond_wait(), as shown in the code for our bounded buffer solution. Because both the Put() and Get() routines are protected by the same mutex, we have three critical sections related to the nonempty buffer, as shown in Figure 4, and in no case can the signal be dropped while a waiting thread thinks that the buffer is empty.

[1] These semantics are due to implementation details. In some cases it can be expensive to ensure that exactly one waiter is unblocked by a signal.

Figure 4. Proper locking of the signaling code prevents race conditions. (The figure identifies three critical sections: A, in Put(), consisting of insert(item) and pthread_cond_signal(&nonempty); B, in Get(), consisting of lock(mutex) and the while (out == in) pthread_cond_wait(&nonempty, lock) loop; and C, in Get(), consisting of remove(item). It then shows the three possible orderings of these critical sections between the signaling and waiting threads: A, B, C; B, A, C; and B, C, A.) By identifying and protecting three critical sections pertaining to the nonempty buffer, we guarantee that each of A, B, and C will execute atomically, so our problem from Figure 3 is avoided: there is no way for the Put() routine's signal to be dropped while a thread executing the Get() routine thinks that the buffer is empty.

We have argued that the call to pthread_cond_signal() must be protected by the same mutex that protects the waiting code. However, notice that the race condition arises not from the signaling of the condition variable, but from the access to the shared buffer. Thus, we could instead simply protect any code that manipulates the shared buffer, which implies that the Put() code could release the mutex immediately after inserting an item into the buffer but before calling pthread_cond_signal(). This new code is not only legal, but it produces better performance because it reduces the size of the critical section, thereby allowing more concurrency.
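A sketch of that restructured put() follows (our rendering of the optimization just described; the rest of the bounded buffer code is unchanged). The only difference from the earlier listing is that the mutex is released before the signal.

void put (Item x)                        // Producer thread, signaling outside the lock
{
    pthread_mutex_lock(&lock);
    while ((in - out) == SIZE)           // While buffer is full
        pthread_cond_wait(&nonfull, &lock);
    buffer[in % SIZE] = x;
    in++;
    pthread_mutex_unlock(&lock);         // Release the mutex first...
    pthread_cond_signal(&nonempty);      // ...then signal; the buffer update was already protected
}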

Creating and Destroying Condition Variables

Like threads and mutexes, condition variables can be created and destroyed either statically or dynamically. In our bounded buffer example above, the static condition variables were both given default attributes by initializing them to PTHREAD_COND_INITIALIZER. Condition variables can be dynamically allocated as indicated in Code Spec 12.

Dynamically Allocated Condition Variables

int pthread_cond_init(
    pthread_cond_t *cond,              // Condition variable
    const pthread_condattr_t *attr);   // Condition attribute

int pthread_cond_destroy (
    pthread_cond_t *cond);             // Condition to destroy

Arguments: Default attributes are used if attr is NULL.

Return value: 0 if successful. Error code from <errno.h> otherwise.

Code Spec 12. The POSIX Threads routines for dynamically creating and destroying condition variables.

Waiting on Multiple Condition Variables

In some cases a piece of code cannot execute unless multiple conditions are met. In these situations the waiting thread should test all conditions simultaneously, as shown below.

 1  EatJuicyFruit()
 2  {
 3      pthread_mutex_lock(&lock);
 4      while (apples==0 || oranges==0)
 5      {
 6          pthread_cond_wait(&more_apples, &lock);
 7          pthread_cond_wait(&more_oranges, &lock);
 8      }
 9      /* Eat both an apple and an orange */
10      pthread_mutex_unlock(&lock);
11  }

By contrast, the following code, which waits on each condition in turn, fails because there is no guarantee that both conditions will be true at the same time. That is, after returning from the first call to pthread_cond_wait() but before returning from the second call to pthread_cond_wait(), some other thread may have removed an apple, making the first condition false.

 1  EatJuicyFruit()
 2  {
 3      pthread_mutex_lock(&lock);
 4      while (apples==0)
 5          pthread_cond_wait(&more_apples, &lock);
 6      while (oranges==0)
 7          pthread_cond_wait(&more_oranges, &lock);
 8
 9      /* Eat both an apple and an orange */
10      pthread_mutex_unlock(&lock);
11  }

Thread-Specific Data

It is often useful for threads to maintain private data that is not shared. For example, we have seen examples where a thread index is passed to the start function so that the thread knows what portion of an array to work on. This index can be used to give each thread a different element of an array, as shown below:

    for (i=0; i<t; i++)
        err = pthread_create (&tid[i], NULL, start_function, i);

    void start_function(int index)
    {
        private_count[index] = 0;
        ...

A problem occurs, however, if the code that accesses index occurs in a function, foo(), which is buried deep within other code. In such situations, how does foo() get the value of index? One solution is to pass the index parameter to every procedure that calls foo(), including procedures that call foo() indirectly through other procedures. This solution is cumbersome, particularly for those procedures that require the parameter but do not directly use it. Instead, what we really want is a variable that is global in scope to all code but which can have different values for each thread. POSIX Threads supports such a notion in the form of thread-specific data, which uses a set of keys, shared by all threads in a process, that map to different pointer values for each thread. (See Figure 4.)

Figure: thread-specific data. Keys (key1, key2) are shared by Thread 0 and Thread 1, but each thread's key maps to its own value in memory.
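As a hedged illustration of the key mechanism just described (our own example, not taken from the text), pthread_key_create() creates a key shared by all threads, and each thread binds its own value to the key with pthread_setspecific() and retrieves it with pthread_getspecific():

#include <pthread.h>
#include <stdio.h>

static pthread_key_t index_key;   /* one key, shared by every thread in the process */

void foo(void)
{
    /* Deep in the call chain, recover this thread's private index without
     * threading the parameter through every intervening procedure. */
    int index = *(int *) pthread_getspecific(index_key);
    printf("foo() sees thread index %d\n", index);
}

void *start_function(void *arg)
{
    pthread_setspecific(index_key, arg);   /* bind the shared key to a per-thread value */
    foo();
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    int ids[4];

    pthread_key_create(&index_key, NULL);  /* NULL: no destructor for the values */
    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, start_function, &ids[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    pthread_key_delete(index_key);
    return 0;
}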


More information

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto Ricardo Rocha Department of Computer Science Faculty of Sciences University of Porto For more information please consult Advanced Programming in the UNIX Environment, 3rd Edition, W. Richard Stevens and

More information

CSci 4061 Introduction to Operating Systems. Synchronization Basics: Locks

CSci 4061 Introduction to Operating Systems. Synchronization Basics: Locks CSci 4061 Introduction to Operating Systems Synchronization Basics: Locks Synchronization Outline Basics Locks Condition Variables Semaphores Basics Race condition: threads + shared data Outcome (data

More information

Concurrent Programming

Concurrent Programming Concurrent Programming is Hard! Concurrent Programming Kai Shen The human mind tends to be sequential Thinking about all possible sequences of events in a computer system is at least error prone and frequently

More information

CS 3305 Intro to Threads. Lecture 6

CS 3305 Intro to Threads. Lecture 6 CS 3305 Intro to Threads Lecture 6 Introduction Multiple applications run concurrently! This means that there are multiple processes running on a computer Introduction Applications often need to perform

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University Threads Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics Why threads? Threading issues 2 Processes Heavy-weight A process includes

More information

Thread and Synchronization

Thread and Synchronization Thread and Synchronization pthread Programming (Module 19) Yann-Hang Lee Arizona State University yhlee@asu.edu (480) 727-7507 Summer 2014 Real-time Systems Lab, Computer Science and Engineering, ASU Pthread

More information

CSE 333 SECTION 9. Threads

CSE 333 SECTION 9. Threads CSE 333 SECTION 9 Threads HW4 How s HW4 going? Any Questions? Threads Sequential execution of a program. Contained within a process. Multiple threads can exist within the same process. Every process starts

More information

Concurrent Programming

Concurrent Programming Concurrent Programming CS 485G-006: Systems Programming Lectures 32 33: 18 20 Apr 2016 1 Concurrent Programming is Hard! The human mind tends to be sequential The notion of time is often misleading Thinking

More information

The mutual-exclusion problem involves making certain that two things don t happen at once. A non-computer example arose in the fighter aircraft of

The mutual-exclusion problem involves making certain that two things don t happen at once. A non-computer example arose in the fighter aircraft of The mutual-exclusion problem involves making certain that two things don t happen at once. A non-computer example arose in the fighter aircraft of World War I (pictured is a Sopwith Camel). Due to a number

More information

CS 261 Fall Mike Lam, Professor. Threads

CS 261 Fall Mike Lam, Professor. Threads CS 261 Fall 2017 Mike Lam, Professor Threads Parallel computing Goal: concurrent or parallel computing Take advantage of multiple hardware units to solve multiple problems simultaneously Motivations: Maintain

More information

CSCI4430 Data Communication and Computer Networks. Pthread Programming. ZHANG, Mi Jan. 26, 2017

CSCI4430 Data Communication and Computer Networks. Pthread Programming. ZHANG, Mi Jan. 26, 2017 CSCI4430 Data Communication and Computer Networks Pthread Programming ZHANG, Mi Jan. 26, 2017 Outline Introduction What is Multi-thread Programming Why to use Multi-thread Programming Basic Pthread Programming

More information

Interprocess Communication By: Kaushik Vaghani

Interprocess Communication By: Kaushik Vaghani Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the

More information

Multicore and Multiprocessor Systems: Part I

Multicore and Multiprocessor Systems: Part I Chapter 3 Multicore and Multiprocessor Systems: Part I Max Planck Institute Magdeburg Jens Saak, Scientific Computing II 44/337 Symmetric Multiprocessing Definition (Symmetric Multiprocessing (SMP)) The

More information

Introduction to pthreads

Introduction to pthreads CS 220: Introduction to Parallel Computing Introduction to pthreads Lecture 25 Threads In computing, a thread is the smallest schedulable unit of execution Your operating system has a scheduler that decides

More information

Processes Prof. James L. Frankel Harvard University. Version of 6:16 PM 10-Feb-2017 Copyright 2017, 2015 James L. Frankel. All rights reserved.

Processes Prof. James L. Frankel Harvard University. Version of 6:16 PM 10-Feb-2017 Copyright 2017, 2015 James L. Frankel. All rights reserved. Processes Prof. James L. Frankel Harvard University Version of 6:16 PM 10-Feb-2017 Copyright 2017, 2015 James L. Frankel. All rights reserved. Process Model Each process consists of a sequential program

More information

pthreads CS449 Fall 2017

pthreads CS449 Fall 2017 pthreads CS449 Fall 2017 POSIX Portable Operating System Interface Standard interface between OS and program UNIX-derived OSes mostly follow POSIX Linux, macos, Android, etc. Windows requires separate

More information

PThreads in a Nutshell

PThreads in a Nutshell PThreads in a Nutshell Chris Kauffman CS 499: Spring 2016 GMU Logistics Today POSIX Threads Briefly Reading Grama 7.1-9 (PThreads) POSIX Threads Programming Tutorial HW4 Upcoming Post over the weekend

More information

Concurrent Server Design Multiple- vs. Single-Thread

Concurrent Server Design Multiple- vs. Single-Thread Concurrent Server Design Multiple- vs. Single-Thread Chuan-Ming Liu Computer Science and Information Engineering National Taipei University of Technology Fall 2007, TAIWAN NTUT, TAIWAN 1 Examples Using

More information

CS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University

CS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University CS 333 Introduction to Operating Systems Class 3 Threads & Concurrency Jonathan Walpole Computer Science Portland State University 1 The Process Concept 2 The Process Concept Process a program in execution

More information

Chapter 4 Threads. Images from Silberschatz 03/12/18. CS460 Pacific University 1

Chapter 4 Threads. Images from Silberschatz 03/12/18. CS460 Pacific University 1 Chapter 4 Threads Images from Silberschatz Pacific University 1 Threads Multiple lines of control inside one process What is shared? How many PCBs? Pacific University 2 Typical Usages Word Processor Web

More information

Threads Tuesday, September 28, :37 AM

Threads Tuesday, September 28, :37 AM Threads_and_fabrics Page 1 Threads Tuesday, September 28, 2004 10:37 AM Threads A process includes an execution context containing Memory map PC and register values. Switching between memory maps can take

More information

Chap. 6 Part 1. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1

Chap. 6 Part 1. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1 Chap. 6 Part 1 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 Chap 6: specific programming techniques Languages and libraries Authors blur the distinction Languages: access parallel programming

More information

CSE 374 Programming Concepts & Tools

CSE 374 Programming Concepts & Tools CSE 374 Programming Concepts & Tools Hal Perkins Fall 2017 Lecture 22 Shared-Memory Concurrency 1 Administrivia HW7 due Thursday night, 11 pm (+ late days if you still have any & want to use them) Course

More information

CS533 Concepts of Operating Systems. Jonathan Walpole

CS533 Concepts of Operating Systems. Jonathan Walpole CS533 Concepts of Operating Systems Jonathan Walpole Introduction to Threads and Concurrency Why is Concurrency Important? Why study threads and concurrent programming in an OS class? What is a thread?

More information

CS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University

CS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University CS 333 Introduction to Operating Systems Class 3 Threads & Concurrency Jonathan Walpole Computer Science Portland State University 1 Process creation in UNIX All processes have a unique process id getpid(),

More information

Threads. studykorner.org

Threads. studykorner.org Threads Thread Subpart of a process Basic unit of CPU utilization Smallest set of programmed instructions, can be managed independently by OS No independent existence (process dependent) Light Weight Process

More information

Parallel Programming with Threads

Parallel Programming with Threads Thread Programming with Shared Memory Parallel Programming with Threads Program is a collection of threads of control. Can be created dynamically, mid-execution, in some languages Each thread has a set

More information

HPCSE - I. «Introduction to multithreading» Panos Hadjidoukas

HPCSE - I. «Introduction to multithreading» Panos Hadjidoukas HPCSE - I «Introduction to multithreading» Panos Hadjidoukas 1 Processes and Threads POSIX Threads API Outline Thread management Synchronization with mutexes Deadlock and thread safety 2 Terminology -

More information

Synchronization and Semaphores. Copyright : University of Illinois CS 241 Staff 1

Synchronization and Semaphores. Copyright : University of Illinois CS 241 Staff 1 Synchronization and Semaphores Copyright : University of Illinois CS 241 Staff 1 Synchronization Primatives Counting Semaphores Permit a limited number of threads to execute a section of the code Binary

More information

CS 326: Operating Systems. Process Execution. Lecture 5

CS 326: Operating Systems. Process Execution. Lecture 5 CS 326: Operating Systems Process Execution Lecture 5 Today s Schedule Process Creation Threads Limited Direct Execution Basic Scheduling 2/5/18 CS 326: Operating Systems 2 Today s Schedule Process Creation

More information

CS 153 Lab4 and 5. Kishore Kumar Pusukuri. Kishore Kumar Pusukuri CS 153 Lab4 and 5

CS 153 Lab4 and 5. Kishore Kumar Pusukuri. Kishore Kumar Pusukuri CS 153 Lab4 and 5 CS 153 Lab4 and 5 Kishore Kumar Pusukuri Outline Introduction A thread is a straightforward concept : a single sequential flow of control. In traditional operating systems, each process has an address

More information

COMP 3430 Robert Guderian

COMP 3430 Robert Guderian Operating Systems COMP 3430 Robert Guderian file:///users/robg/dropbox/teaching/3430-2018/slides/04_threads/index.html?print-pdf#/ 1/58 1 Threads Last week: Processes This week: Lesser processes! file:///users/robg/dropbox/teaching/3430-2018/slides/04_threads/index.html?print-pdf#/

More information

Concurrent Programming

Concurrent Programming Concurrent Programming Prof. Jinkyu Jeong( jinkyu@skku.edu) TA Jinhong Kim( jinhong.kim@csl.skku.edu) TA Seokha Shin(seokha.shin@csl.skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu

More information

A Brief Introduction to OS/2 Multithreading

A Brief Introduction to OS/2 Multithreading A Brief Introduction to OS/2 Multithreading Jaroslav Kaƒer jkacer@kiv.zcu.cz University of West Bohemia Faculty of Applied Sciences Department of Computer Science and Engineering Why Use Parallelism? Performance

More information

So far, we know: Wednesday, October 4, Thread_Programming Page 1

So far, we know: Wednesday, October 4, Thread_Programming Page 1 Thread_Programming Page 1 So far, we know: 11:50 AM How to create a thread via pthread_mutex_create How to end a thread via pthread_mutex_join How to lock inside a thread via pthread_mutex_lock and pthread_mutex_unlock.

More information

High Performance Computing Course Notes Shared Memory Parallel Programming

High Performance Computing Course Notes Shared Memory Parallel Programming High Performance Computing Course Notes 2009-2010 2010 Shared Memory Parallel Programming Techniques Multiprocessing User space multithreading Operating system-supported (or kernel) multithreading Distributed

More information

Programming with Shared Memory. Nguyễn Quang Hùng

Programming with Shared Memory. Nguyễn Quang Hùng Programming with Shared Memory Nguyễn Quang Hùng Outline Introduction Shared memory multiprocessors Constructs for specifying parallelism Creating concurrent processes Threads Sharing data Creating shared

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

CS 6400 Lecture 11 Name:

CS 6400 Lecture 11 Name: Readers and Writers Example - Granularity Issues. Multiple concurrent readers, but exclusive access for writers. Original Textbook code with ERRORS - What are they? Lecture 11 Page 1 Corrected Textbook

More information

POSIX threads CS 241. February 17, Copyright University of Illinois CS 241 Staff

POSIX threads CS 241. February 17, Copyright University of Illinois CS 241 Staff POSIX threads CS 241 February 17, 2012 Copyright University of Illinois CS 241 Staff 1 Recall: Why threads over processes? Creating a new process can be expensive Time A call into the operating system

More information

CSci 4061 Introduction to Operating Systems. (Threads-POSIX)

CSci 4061 Introduction to Operating Systems. (Threads-POSIX) CSci 4061 Introduction to Operating Systems (Threads-POSIX) How do I program them? General Thread Operations Create/Fork Allocate memory for stack, perform bookkeeping Parent thread creates child threads

More information

Threads. Threads (continued)

Threads. Threads (continued) Threads A thread is an alternative model of program execution A process creates a thread through a system call Thread operates within process context Use of threads effectively splits the process state

More information

CSE 333 Section 9 - pthreads

CSE 333 Section 9 - pthreads CSE 333 Section 9 - pthreads Welcome back to section! We re glad that you re here :) Process A process has a virtual address space. Each process is started with a single thread, but can create additional

More information

Concurrency: Threads. CSE 333 Autumn 2018

Concurrency: Threads. CSE 333 Autumn 2018 Concurrency: Threads CSE 333 Autumn 2018 Instructor: Hal Perkins Teaching Assistants: Tarkan Al-Kazily Renshu Gu Trais McGaha Harshita Neti Thai Pham Forrest Timour Soumya Vasisht Yifan Xu Administriia

More information

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Condition Variables. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Condition Variables. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Condition Variables Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3052: Introduction to Operating Systems, Fall 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Concurrent Programming is Hard! Concurrent Programming. Reminder: Iterative Echo Server. The human mind tends to be sequential

Concurrent Programming is Hard! Concurrent Programming. Reminder: Iterative Echo Server. The human mind tends to be sequential Concurrent Programming is Hard! Concurrent Programming 15 213 / 18 213: Introduction to Computer Systems 23 rd Lecture, April 11, 213 Instructors: Seth Copen Goldstein, Anthony Rowe, and Greg Kesden The

More information

Warm-up question (CS 261 review) What is the primary difference between processes and threads from a developer s perspective?

Warm-up question (CS 261 review) What is the primary difference between processes and threads from a developer s perspective? Warm-up question (CS 261 review) What is the primary difference between processes and threads from a developer s perspective? CS 470 Spring 2018 POSIX Mike Lam, Professor Multithreading & Pthreads MIMD

More information

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages CMSC 330: Organization of Programming Languages Multithreading Multiprocessors Description Multiple processing units (multiprocessor) From single microprocessor to large compute clusters Can perform multiple

More information

Multithreaded Programming

Multithreaded Programming Multithreaded Programming The slides do not contain all the information and cannot be treated as a study material for Operating System. Please refer the text book for exams. September 4, 2014 Topics Overview

More information

Operating systems and concurrency (B08)

Operating systems and concurrency (B08) Operating systems and concurrency (B08) David Kendall Northumbria University David Kendall (Northumbria University) Operating systems and concurrency (B08) 1 / 20 Introduction Semaphores provide an unstructured

More information

Systèmes d Exploitation Avancés

Systèmes d Exploitation Avancés Systèmes d Exploitation Avancés Instructor: Pablo Oliveira ISTY Instructor: Pablo Oliveira (ISTY) Systèmes d Exploitation Avancés 1 / 32 Review : Thread package API tid thread create (void (*fn) (void

More information

CS Lecture 3! Threads! George Mason University! Spring 2010!

CS Lecture 3! Threads! George Mason University! Spring 2010! CS 571 - Lecture 3! Threads! George Mason University! Spring 2010! Threads! Overview! Multithreading! Example Applications! User-level Threads! Kernel-level Threads! Hybrid Implementation! Observing Threads!

More information

Threads. Jo, Heeseung

Threads. Jo, Heeseung Threads Jo, Heeseung Multi-threaded program 빠른실행 프로세스를새로생성에드는비용을절약 데이터공유 파일, Heap, Static, Code 의많은부분을공유 CPU 를보다효율적으로활용 코어가여러개일경우코어에 thread 를할당하는방식 2 Multi-threaded program Pros. Cons. 대량의데이터처리에적합 - CPU

More information

Threads. What is a thread? Motivation. Single and Multithreaded Processes. Benefits

Threads. What is a thread? Motivation. Single and Multithreaded Processes. Benefits CS307 What is a thread? Threads A thread is a basic unit of CPU utilization contains a thread ID, a program counter, a register set, and a stack shares with other threads belonging to the same process

More information

Lecture 19: Shared Memory & Synchronization

Lecture 19: Shared Memory & Synchronization Lecture 19: Shared Memory & Synchronization COMP 524 Programming Language Concepts Stephen Olivier April 16, 2009 Based on notes by A. Block, N. Fisher, F. Hernandez-Campos, and D. Stotts Forking int pid;

More information

Process Synchronization

Process Synchronization Process Synchronization Part III, Modified by M.Rebaudengo - 2013 Silberschatz, Galvin and Gagne 2009 POSIX Synchronization POSIX.1b standard was adopted in 1993 Pthreads API is OS-independent It provides:

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing Shared Memory Programming with Pthreads (3) Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Last time

More information

Process Management And Synchronization

Process Management And Synchronization Process Management And Synchronization In a single processor multiprogramming system the processor switches between the various jobs until to finish the execution of all jobs. These jobs will share the

More information

More Shared Memory Programming

More Shared Memory Programming More Shared Memory Programming Shared data structures We want to make data structures that can be shared by threads. For example, our program to copy a file from one disk to another used a shared FIFO

More information

CS-345 Operating Systems. Tutorial 2: Grocer-Client Threads, Shared Memory, Synchronization

CS-345 Operating Systems. Tutorial 2: Grocer-Client Threads, Shared Memory, Synchronization CS-345 Operating Systems Tutorial 2: Grocer-Client Threads, Shared Memory, Synchronization Threads A thread is a lightweight process A thread exists within a process and uses the process resources. It

More information

Concurrent Programming Lecture 10

Concurrent Programming Lecture 10 Concurrent Programming Lecture 10 25th September 2003 Monitors & P/V Notion of a process being not runnable : implicit in much of what we have said about P/V and monitors is the notion that a process may

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University Concurrent Programming Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Echo Server Revisited int main (int argc, char *argv[]) {... listenfd = socket(af_inet, SOCK_STREAM, 0); bzero((char

More information

20-EECE-4029 Operating Systems Fall, 2015 John Franco

20-EECE-4029 Operating Systems Fall, 2015 John Franco 20-EECE-4029 Operating Systems Fall, 2015 John Franco Final Exam name: Question 1: Processes and Threads (12.5) long count = 0, result = 0; pthread_mutex_t mutex; pthread_cond_t cond; void *P1(void *t)

More information

Copyright 2013 Thomas W. Doeppner. IX 1

Copyright 2013 Thomas W. Doeppner. IX 1 Copyright 2013 Thomas W. Doeppner. IX 1 If we have only one thread, then, no matter how many processors we have, we can do only one thing at a time. Thus multiple threads allow us to multiplex the handling

More information

Pre-lab #2 tutorial. ECE 254 Operating Systems and Systems Programming. May 24, 2012

Pre-lab #2 tutorial. ECE 254 Operating Systems and Systems Programming. May 24, 2012 Pre-lab #2 tutorial ECE 254 Operating Systems and Systems Programming May 24, 2012 Content Concurrency Concurrent Programming Thread vs. Process POSIX Threads Synchronization and Critical Sections Mutexes

More information

CS510 Operating System Foundations. Jonathan Walpole

CS510 Operating System Foundations. Jonathan Walpole CS510 Operating System Foundations Jonathan Walpole The Process Concept 2 The Process Concept Process a program in execution Program - description of how to perform an activity instructions and static

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Paralleland Distributed Programming. Concurrency

Paralleland Distributed Programming. Concurrency Paralleland Distributed Programming Concurrency Concurrency problems race condition synchronization hardware (eg matrix PCs) software (barrier, critical section, atomic operations) mutual exclusion critical

More information