
The University of Texas at Arlington
Lecture 10: Threading and Parallel Programming Constraints
CSE 5343/4342 Embedded Systems II

Lab 3: Windows Threads (Win32 threading API)
Objectives: Convert serial applications to a threaded version.
Lab Assignment: Use Windows threads to thread the serial code to compute Pi using 8 threads.

Numerical Integration Example
The integral of 4.0/(1 + x^2) over [0, 1] equals pi; the serial code below approximates it with the midpoint rule:

    #include <stdio.h>

    static long num_steps = 100000;
    double step, pi;

    int main()
    {
        int i;
        double x, sum = 0.0;

        step = 1.0 / (double) num_steps;
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        pi = step * sum;
        printf("Pi = %f\n", pi);
        return 0;
    }
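Since the lab asks for a threaded version, here is a minimal sketch of the same computation split across 8 threads. It uses C++ std::thread rather than the Win32 API the lab calls for, and the names (partial, workers) are illustrative, not from the lab code.

    #include <cstdio>
    #include <thread>
    #include <vector>

    static const long num_steps = 100000;
    static const int  num_threads = 8;

    int main()
    {
        double step = 1.0 / (double) num_steps;
        std::vector<double> partial(num_threads, 0.0);   // one slot per thread: no shared writes, no lock needed
        std::vector<std::thread> workers;

        for (int t = 0; t < num_threads; t++) {
            workers.emplace_back([t, step, &partial]() {
                double sum = 0.0;
                // Each thread handles an interleaved subset of the steps.
                for (long i = t; i < num_steps; i += num_threads) {
                    double x = (i + 0.5) * step;
                    sum += 4.0 / (1.0 + x * x);
                }
                partial[t] = sum;
            });
        }
        for (auto &w : workers) w.join();

        double pi = 0.0;
        for (double s : partial) pi += s * step;
        std::printf("Pi = %f\n", pi);
        return 0;
    }

Each thread writes only its own slot of partial, so the threads never touch the same memory location and no synchronization beyond join() is required.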

More Task Decomposition: Dependence Graph
Graph = {vertices, (directed) edges}
A vertex (node) for each:
    variable assignment (except index variables)
    constant
    operator or function call
Directed edges (arrows) indicate the use of variables and constants for:
    data flow
    control flow

Dependence Graph Example #1

    for (i = 0; i < 3; i++)
        a[i] = b[i] / 2.0;

The dependence graph has a division node per iteration: b[0] and the constant 2 feed a[0], b[1] and 2 feed a[1], b[2] and 2 feed a[2]. The three iterations share no data, so domain decomposition is possible.
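To make "domain decomposition possible" concrete, here is a minimal sketch (illustrative names and data, not from the slides) that splits the independent iterations of the loop above across two threads:

    #include <thread>

    const int N = 3;
    double a[N];
    double b[N] = {2.0, 4.0, 6.0};   // illustrative input values

    // Each thread processes its own contiguous chunk of iterations,
    // which is safe because no iteration reads another iteration's result.
    void process_range(int begin, int end)
    {
        for (int i = begin; i < end; i++)
            a[i] = b[i] / 2.0;
    }

    int main()
    {
        std::thread t1(process_range, 0, N / 2);
        std::thread t2(process_range, N / 2, N);
        t1.join();
        t2.join();
        return 0;
    }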

Dependence Graph Example #2

    for (i = 1; i < 4; i++)
        a[i] = a[i-1] * b[i];

Each multiplication needs the result of the previous one: a[0] and b[1] produce a[1], which with b[2] produces a[2], which with b[3] produces a[3]. Because the iterations form a chain, no domain decomposition is possible.

Dependence Graph Example #3

    a = f(x, y, z);
    b = g(w, x);
    t = a + b;
    c = h(z);
    s = t / c;

The calls f(x, y, z), g(w, x), and h(z) depend only on the inputs w, x, y, and z, so they can run concurrently; t = a + b must wait for f and g, and s = t / c must wait for t and c. Task decomposition with 3 CPUs is possible (one for each of f, g, and h), as sketched below.
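A minimal sketch of that task decomposition using C++ std::async (the bodies of f, g, and h are placeholders, since the slide does not define them):

    #include <cstdio>
    #include <future>

    // Placeholder implementations standing in for the f, g, h of the slide.
    double f(double x, double y, double z) { return x + y + z; }
    double g(double w, double x)           { return w * x; }
    double h(double z)                     { return z + 1.0; }

    int main()
    {
        double w = 1.0, x = 2.0, y = 3.0, z = 4.0;

        // f, g, and h have no dependences on each other, so each can run on its own CPU.
        auto fa = std::async(std::launch::async, f, x, y, z);
        auto fb = std::async(std::launch::async, g, w, x);
        auto fc = std::async(std::launch::async, h, z);

        double a = fa.get();
        double b = fb.get();
        double c = fc.get();
        double t = a + b;      // waits (via get) for f and g
        double s = t / c;      // waits for t and c

        std::printf("s = %f\n", s);
        return 0;
    }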

Multi-threading Concepts
Multi-threading concepts are needed in order to obtain maximum performance from multi-core microprocessors. These concepts include:
    Creating, terminating, suspending, and resuming threads
    Thread synchronization methods: semaphores, mutexes, locks, and critical sections

Using Threads
Benefits of using threads include:
    Increased performance
    Better resource utilization
    Efficient data sharing
However, there are risks in using threads:
    Data race conditions
    Deadlocks
    Code complexity
    Portability issues
    Testing and debugging difficulty

Waiting for Threads
Blocking versus non-blocking: looping on a condition is expensive, because the waiting thread is scheduled even when there is no work, stealing CPU time from threads that are performing work.
It is hard to find the right balance: locking is usually either too much or not enough, and Thread.Sleep is inflexible.
Better option: just wait for it, i.e., block until the work is actually done, as in the sketch below.
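A rough illustration of the difference (names are illustrative, not from the course code): the polling loop keeps waking up to check a flag, while the blocking join() sleeps until the worker is finished.

    #include <atomic>
    #include <chrono>
    #include <thread>

    std::atomic<bool> done{false};   // completion flag used only by the polling version

    void worker()
    {
        /* ... perform the actual work ... */
        done = true;
    }

    int main()
    {
        std::thread t(worker);

        // Non-blocking wait (polling): this thread keeps getting scheduled even
        // though it has no work, and it notices completion only at the next check.
        while (!done)
            std::this_thread::sleep_for(std::chrono::milliseconds(10));

        // Blocking wait: join() puts this thread to sleep and the scheduler wakes
        // it exactly when the worker finishes.
        t.join();
        return 0;
    }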

Synchronization
Synchronization controls the relative order of thread execution and resolves conflicts among threads.
Threads sometimes need to wait for other threads to be in a known state before continuing.
In shared-memory systems, constraints have to be imposed to obtain the proper order of execution or to avoid corrupted or locked data.
Two basic types of synchronization:
    1. mutual exclusion
    2. condition synchronization

Thread Synchronization
Two or more threads cooperate; one thread waits for another to be in a known state before continuing.
Lack of synchronization leads to data corruption and lockups.
Synchronization means using methods and constructs to enforce the required behavior.

Mutual Exclusion
Program logic used to ensure single-thread access to a critical region.
One thread holds a critical section of code containing shared data while one or more other threads wait for access; the other threads are blocked from entering the critical section until the first thread is done.
Proper synchronization techniques ensure that only one thread is allowed access to a critical section at any one instant.
The major challenge of threaded programming is to implement critical sections in such a way that multiple threads perform mutually exclusive operations on them and never execute a critical section simultaneously.

Mutual Exclusion done by a Critical Section
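A minimal sketch of mutual exclusion done by a critical section, using C++ std::mutex (counter and increment are illustrative names, not from the course code):

    #include <cstdio>
    #include <mutex>
    #include <thread>

    long counter = 0;          // shared data
    std::mutex counter_mutex;  // protects the critical section below

    void increment(int times)
    {
        for (int i = 0; i < times; i++) {
            std::lock_guard<std::mutex> guard(counter_mutex);  // acquire the lock
            counter++;                                          // critical section: one thread at a time
        }                                                       // lock released when guard goes out of scope
    }

    int main()
    {
        std::thread t1(increment, 100000);
        std::thread t2(increment, 100000);
        t1.join();
        t2.join();
        std::printf("counter = %ld\n", counter);   // always 200000, because access is mutually exclusive
        return 0;
    }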

Condition Synchronization
Condition synchronization allows a thread to wait until a specific condition is reached (e.g., using semaphores or condition variables).

Using the Mutex
The most common method of making sure that two threads take turns before accessing a given object is to use a shared lock. Since only one thread at a time can have the lock, other threads wait their turn.
Similar to a lock is the Mutex object. Only one thread can lock the Mutex at a time, and that same thread must then release it.
The key difference between a Mutex and a standard lock is that a Mutex can work across processes for more advanced scenarios.

Deadlocks
A deadlocked thread waits for a resource that will never become available.
Self-deadlock (recursive deadlock): a thread tries to acquire a resource it already holds.
Lock-ordering deadlock (more common): thread A locks resource R1 and then tries to lock resource R2, while thread B locks R2 and then tries to lock R1. In some schedules, A has acquired R1 and is waiting for R2 while B has acquired R2 and is waiting for R1.
As implied by the name, deadlocks are not good; they need to be avoided at all costs.

Deadlocks (cont'd)
Deadlocks occur when a thread waits for a condition that never occurs. They commonly result from competition between threads for system resources held by other threads.
The four necessary conditions for a deadlock are:
    Mutual exclusion condition
    Hold and wait condition
    No preemption condition
    Circular wait condition

Deadlock.cpp
This program illustrates the potential for deadlock in a bad locking hierarchy. It is possible for one thread to lock both critical sections and avoid deadlock; however, concurrent programs that rely on a particular order of execution without enforcing that order will eventually fail.
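Deadlock.cpp itself is not reproduced here; the following is a minimal sketch (illustrative names) of the lock-ordering deadlock described above. If thread A acquires R1 and thread B acquires R2 before either takes its second lock, both block forever.

    #include <mutex>
    #include <thread>

    std::mutex R1, R2;   // two shared resources (illustrative)

    void threadA()
    {
        std::lock_guard<std::mutex> first(R1);    // A holds R1 ...
        std::lock_guard<std::mutex> second(R2);   // ... then waits for R2
    }

    void threadB()
    {
        std::lock_guard<std::mutex> first(R2);    // B holds R2 ...
        std::lock_guard<std::mutex> second(R1);   // ... then waits for R1: possible deadlock
    }

    int main()
    {
        std::thread ta(threadA), tb(threadB);
        ta.join();
        tb.join();   // may never return if the bad interleaving occurs
        return 0;
    }

Acquiring the locks in the same global order in both threads (or acquiring both at once with std::scoped_lock) removes the circular wait and enforces the order instead of merely relying on it.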

Race Conditions
Race conditions are the most common errors in concurrent programs. They occur because the programmer assumes a particular order of execution but does not guarantee that order through synchronization.
A data race refers to a storage conflict: it occurs when two or more threads simultaneously access the same memory location while at least one thread is updating that location. This results in two possible conflicts: read/write conflicts and write/write conflicts.
Race conditions are usually not obvious; the errors occur unexpectedly and unpredictably. Locks are the key to avoidance.
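A minimal sketch of a read/write race (the bank-balance example is illustrative, not from the course code): both threads can pass the check before either performs the write, so the program acts on stale data.

    #include <cstdio>
    #include <thread>

    int balance = 100;   // shared data, accessed without synchronization

    void withdraw(int amount)
    {
        if (balance >= amount) {          // read: check the shared value
            // The other thread may run here and also pass the check.
            balance = balance - amount;   // write: act on what is now a stale value
        }
    }

    int main()
    {
        std::thread t1(withdraw, 80);
        std::thread t2(withdraw, 80);
        t1.join();
        t2.join();
        // Both checks can succeed before either write, leaving balance at -60
        // even though each thread "checked" before withdrawing.
        std::printf("balance = %d\n", balance);
        return 0;
    }

Holding a lock around the whole check-then-update sequence, as in the mutual exclusion sketch earlier, guarantees that the check and the write happen atomically.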

Using Synchronization
Synchronization is about making sure that threads take turns when they need to, typically to access some shared object. Depending on your specific application needs, you will find that different options make more sense than others.
Operating systems have to provide some support for atomic operations. Windows simplifies this process since it has built-in support for suspending a thread at the scheduler level when necessary. In this manner, one thread can be put to sleep until a certain condition occurs in another thread. By letting one thread sleep instead of repeatedly checking whether another thread is done, performance is dramatically improved.

Synchronization Primitives
Synchronization is typically performed by three types of primitives: semaphores, locks, and condition variables.
These primitives are implemented with atomic operations and use a memory fence or barrier, a processor-dependent operation that ensures threads see other threads' memory operations by maintaining a reasonable order.

Semaphores
Introduced by Edsger Dijkstra (1968).
A semaphore is a form of counter that allows multiple threads access to a resource by incrementing or decrementing the semaphore. The typical use is protecting a shared resource of which at most n instances are allowed to exist simultaneously.
Use P to acquire a resource and V to release it. Because a semaphore embodies a capacity, it can be represented by an integer. Semaphores are created with a specified capacity, and once that number of threads have acquired it (P, proberen), subsequent access is blocked until a slot opens up through a release (V, verhogen).

Semaphores (cont'd)
P and V need to be atomic to protect the semaphore variable. The P operation busy-waits (or perhaps sleeps) until a resource is available, whereupon it immediately claims one.
A semaphore with a capacity of one is a binary semaphore; it is essentially a Mutex, except that any thread can release it, not just the thread that acquired it.
Semaphores can be used across processes as well. Semaphores are not as frequently used anymore.
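A minimal sketch of how P and V can be made atomic by protecting the counter with a lock and a condition variable (a teaching sketch under those assumptions, not the implementation used by any particular OS):

    #include <condition_variable>
    #include <mutex>

    class Semaphore {
    public:
        explicit Semaphore(int capacity) : count(capacity) {}

        void P()                                          // proberen: acquire one slot
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return count > 0; });    // sleep until a slot is free
            --count;                                      // claim it; atomic because the lock is held
        }

        void V()                                          // verhogen: release one slot
        {
            std::lock_guard<std::mutex> lk(m);
            ++count;
            cv.notify_one();                              // wake one waiting thread
        }

    private:
        std::mutex m;                   // protects the semaphore variable
        std::condition_variable cv;
        int count;                      // the remaining capacity
    };

A Semaphore constructed with capacity 1 behaves as the binary semaphore described above.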

Semaphores (example)
Producer/consumer threads sharing a semaphore p_sem:

Producer:
    void producer() {
        while (1) {
            produce_data();
            p_sem->release();   // V operation
        }
    }

Consumer:
    void consumer() {
        while (1) {
            p_sem->wait();      // P operation
            consume_data();
        }
    }

Locks
Locks ensure that only a single thread can have access to a resource.
Coarse-grained locks have higher lock contention than finer-grained ones.
Locks can be realized by binary semaphores (plus an initialization entity):
    Acquire(): waits for the lock state to be unlocked and then sets the state to locked.
    Release(): changes the lock state from locked to unlocked.

Critical Section Implementation
To avoid deadlocks, locks should mostly be used inside critical sections that have a single entry point and a single exit point:

    <critical section start>
    <acquire lock A>
    (operate on shared memory protected by lock A)
    <release lock A>
    <critical section end>

Locks
Locking restricts access to an object to one thread.
    Minimize locking/synchronization whenever possible.
    Make objects thread-safe when appropriate.
    Acquire locks late, release them early; the shorter the duration, the better.
    Lock only when necessary.

Locking Example (C#)

    private object padlock = new object();

    public void CoordinateWork() {
        (new Thread(PerformWork)).Start();
        (new Thread(PerformWork)).Start();
    }

    private void PerformWork() {
        while (true) {
            lock (padlock) {
                /* GET NEXT ITEM */
            }   // the lock is released automatically at the end of the lock block
            /* DO WORK HERE */
        }
    }

Lock Types
Mutex: the simplest lock; can include a timer attribute for release, or a try-finally block to guarantee release.
Recursive lock: can be repeatedly acquired by the owning thread (used in recursive functions), which avoids recursive (self-)deadlocks.
Read-write locks: allow simultaneous read access for multiple threads but limit write access to only one thread. Use them when multiple threads need to read shared data but rarely need to write it. Granularity (how much is locked) matters.
Spin locks: waiting threads spin, or poll the state of the lock, rather than being blocked. Used mostly on multiprocessor systems, since the spinning processor is essentially blocked. Use them when the hold time of the lock is short (i.e., shorter than the cost of blocking and waking a thread).
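A minimal sketch of a read-write lock using C++17 std::shared_mutex (illustrative names): many readers may hold the lock at the same time, while a writer gets exclusive access.

    #include <shared_mutex>
    #include <vector>

    std::vector<int> data;        // shared data (illustrative)
    std::shared_mutex rw_lock;    // read-write lock protecting data

    int read_sum()
    {
        std::shared_lock<std::shared_mutex> lk(rw_lock);   // shared mode: many readers at once
        int sum = 0;
        for (int v : data)
            sum += v;
        return sum;
    }

    void append(int value)
    {
        std::unique_lock<std::shared_mutex> lk(rw_lock);   // exclusive mode: one writer only
        data.push_back(value);
    }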

Condition Variables
Condition variables are usually user-mode objects that cannot be shared across processes.
In general, a condition variable is a way to signal a specific condition that a thread is waiting on while holding a lock on a specific resource. To prevent deadlocks, the following atomic operations on a condition variable can be used: Wait(L), Signal(L), and Broadcast(L).
Condition variables enable threads to atomically release a lock and enter the sleeping state. They can be used with critical sections or slim reader/writer (SRW) locks. Condition variables support operations that "wake one" or "wake all" waiting threads. After a thread is woken, it re-acquires the lock it released when it entered the sleeping state.

Condition Variables
Suppose a thread holds a lock on a specific resource but cannot proceed until a particular condition occurs. In this case the thread can release the lock, but it will need the lock returned when the condition occurs.
wait() releases the lock and lets the next thread waiting on the resource use it. The condition the original thread was waiting on is associated with the condition variable and handed to the new lock holder.
When the new thread is finished with the resource, it checks the condition variable and returns the resource to the original holder by using the signal() or broadcast() operations; broadcast() enables all threads waiting on that resource to run.

Example: Condition Variable

    Condition C;
    Lock L;
    bool LC = false;    // true while unconsumed data is available

    void producer() {
        while (1) {
            L->acquire();           // start critical section
            while (LC == true) {
                C->wait(L);         // wait until the consumer has taken the data
            }
            // produce the next data
            LC = true;
            C->signal(L);           // wake a waiting consumer
            L->release();           // end critical section
        }
    }

    void consumer() {
        while (1) {
            L->acquire();           // start critical section
            while (LC == false) {
                C->wait(L);         // wait until the producer has made data
            }
            // consume the next data
            LC = false;
            C->signal(L);           // wake a waiting producer
            L->release();           // end critical section
        }
    }

Message Passing
A message is a special method of communication used to transfer information or a signal from one domain to another. In multi-threading environments, the domain is the boundary of a thread.
Message passing (as in MPI, used in distributed computing, parallel processing, etc.) is a method for communicating between threads or processes.

Messages
Thread communication within a process is known as intra-process communication; messages between threads in different processes use inter-process communication.
Semaphores, locks, and condition variables are used to synchronize the operation of threads; these synchronization primitives convey status and access information. To communicate data, thread messaging is used, as in the sketch below.
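A minimal sketch of intra-process thread messaging built from a lock and a condition variable (illustrative names, not a specific library's API):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    std::queue<std::string> mailbox;      // messages waiting to be received
    std::mutex mbox_lock;                 // protects the mailbox
    std::condition_variable mbox_cv;      // signals "a message has arrived"

    void send(const std::string &msg)
    {
        std::lock_guard<std::mutex> lk(mbox_lock);
        mailbox.push(msg);
        mbox_cv.notify_one();             // wake a receiver, if one is waiting
    }

    std::string receive()
    {
        std::unique_lock<std::mutex> lk(mbox_lock);
        mbox_cv.wait(lk, [] { return !mailbox.empty(); });   // block until a message arrives
        std::string msg = mailbox.front();
        mailbox.pop();
        return msg;
    }

    int main()
    {
        std::thread sender([] { send("work item 1"); });
        std::printf("received: %s\n", receive().c_str());
        sender.join();
        return 0;
    }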

Summary
For synchronization, an understanding of atomic operations will help avoid deadlocks and eliminate race conditions.
Use a proper synchronization-construct-based framework for threaded applications, and prefer higher-level synchronization constructs over primitive types (more OS support).
An application must not contain any possibility of a deadlock scenario.
Threads can perform message passing using different approaches: intra-process and inter-process.
It is important to understand how the threading features of third-party libraries are implemented; different implementations may cause applications to fail in unexpected ways.