Programming Shared Address Space Platforms (Chapter 7)


Programming Shared Address Space Platforms (Chapter 7)
Alexandre David, 1.2.05

This lecture is about Pthreads.

Comparison
- Directive based: OpenMP.
- Explicit parallel programming: Pthreads (shared memory, focus on synchronization); MPI (message passing, focus on communication).
- Both: specify tasks & interactions.

25-02-2008 Alexandre David, MVP'08

Notes: Programming paradigms for shared address space machines focus on constructs for expressing concurrency and synchronization. Communication in shared-memory programming is specified implicitly. We focus on minimizing data-sharing overheads (for MPI it is communication overheads).

Programming Models
Concurrency supported by:
- Processes: private data unless otherwise specified.
- Threads: shared memory, lightweight.
- Directive-based programming: concurrency specified as high-level compiler directives, e.g., OpenMP.
See the OS course.

Threads Basics
- All memory is globally accessible, but the stack is considered local.
- In practice there is both local (private) and global (shared) memory.
- Recall that memory is physically distributed and local accesses are faster.
- Applicable to SMP/multi-core machines.

Why Threads?
- Software portability: applications are developed once and run without modification on multi-processor machines.
- Latency hiding: recall chapter 2.
- Implicit scheduling and load balancing: specify many tasks and let the system map and schedule them.
- Ease of programming, widespread use.

The POSIX Thread API
- It is a standard API (like MPI), supported by most vendors.
- The general concepts are applicable to other thread APIs (Java threads, NT threads, etc.).
- Low-level functions: the API is missing high-level constructs, e.g., there is no collective communication like in MPI.

Thread Creation

```c
#include <pthread.h>   /* header */

int pthread_create(
    pthread_t *thread_handle,           /* identifier */
    const pthread_attr_t *attribute,    /* NULL for default */
    void *(*thread_function)(void *),   /* function to call... */
    void *arg);                         /* ...with its argument */
```

Invokes thread_function as a thread. Notes:
- The identifier thread_handle is written before the function returns.
- pthread_create returns in the main thread; thread_function runs in parallel in another thread. On uni-processor machines the new thread may preempt its creator thread.
- There is a returned result (success or not).
- Beware of race conditions: make sure to initialize everything before creating the thread (and not after).

Waiting for Termination

```c
int pthread_join(
    pthread_t thread,   /* thread to wait for */
    void **ptr);        /* the caller can read a (void *) at address ptr */
```

Threads call pthread_exit(void *). The creator process/thread calls this function to wait for its spawned threads. It returns success (0) or an error code.

Thread Creation & Termination

```c
#include <pthread.h>

int pthread_create(
    pthread_t *thread_handle,
    const pthread_attr_t *attribute,
    void *(*thread_function)(void *),
    void *arg);

int pthread_join(
    pthread_t thread,
    void **ptr);

void pthread_exit(void *);
```

Questions?

Example: Compute PI

```c
#include <pthread.h>
...
main() {
    pthread_t p_threads[max_threads];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    for (i = 0; i < num_threads; i++) {
        hits[i] = i;
        pthread_create(&p_threads[i], &attr, compute_pi,
                       (void *) &hits[i]);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join(p_threads[i], NULL);
        total_hits += hits[i];
    }
}
```

Notes: This is a lousy computation of pi, based on area ratios. Take a 1x1 square and put a circle inside: square area = 1, circle area = πr² = π/4 (r = ½). Choose many points randomly and the ratio hits/total will converge towards π/4.

Example: Compute PI

```c
void *compute_pi(void *s) {
    int seed, i;
    double rand_no_x, rand_no_y;
    int *hit_pointer = (int *) s;   /* used to pass the seed */
    seed = *hit_pointer;
    int local_hits = 0;
    for (i = 0; i < sample_points_per_thread; i++) {
        rand_no_x = (double)(rand_r(&seed)) / (double)((2 << 14) - 1);
        rand_no_y = (double)(rand_r(&seed)) / (double)((2 << 14) - 1);
        /* Count hits in the circle. */
        if (((rand_no_x - 0.5) * (rand_no_x - 0.5) +
             (rand_no_y - 0.5) * (rand_no_y - 0.5)) < 0.25)
            local_hits++;
        seed *= i;
    }
    *hit_pointer = local_hits;      /* to return the result */
    pthread_exit(0);
}
```

Notes: We call rand_r, which is worse than drand48 or rand, because we need a reentrant function.

Performance
What happens if we change to a global hit count and execute on a 4-processor machine? Performance degrades: this shows false sharing (chapter 2).

Notes: With per-thread counts, the speedup is 3.91 and the efficiency 0.98. Note that the threads do not synchronize with each other.

Race Condition
We need to synchronize if a shared variable is updated concurrently:

```c
if (my_cost < best_cost)
    best_cost = my_cost;
```

Race condition: this can give a wrong (inconsistent) result. We want this to be atomic but we can't, so it is a critical segment: it must be executed by only one thread at a time.

Notes: Race condition: the result depends on the order of the different statements in parallel, i.e., the interleaving. Inconsistent result: it does not correspond to any serialization of the threads (considering the test-and-update atomic).

Mutex-Locks
- Implement critical sections. Mutex-locks can be locked or unlocked; locking is atomic.
- Threads must acquire a lock to enter a critical section and release it when leaving.
- Locks represent serialization points: too many locks will decrease performance.

Notes: The book says "critical segment", but it is usually called a critical section. The lock call is blocking and returns only when the lock is acquired. Of course, all locks must be initialized to unlocked when starting programs. Also be careful about the granularity of what you lock: locking big portions of code is bad, since you are killing parallelism for the code you are locking.

Mutex-Lock

```c
int pthread_mutex_init(
    pthread_mutex_t *mutex_lock,
    const pthread_mutexattr_t *lock_attr);

int pthread_mutex_lock(pthread_mutex_t *mutex_lock);

int pthread_mutex_unlock(pthread_mutex_t *mutex_lock);
```

Example Revisited

```c
pthread_mutex_t minimum_value_lock;
...
main() {
    ...
    pthread_mutex_init(&minimum_value_lock, NULL);
    ...
}

void *find_min(void *list_ptr) {
    ...
    pthread_mutex_lock(&minimum_value_lock);
    if (my_min < minimum_value)
        minimum_value = my_min;
    pthread_mutex_unlock(&minimum_value_lock);
    ...
}
```

Notes: Be careful with the use of mutex locks. Don't use one mutex lock for all your critical sections if they are independent; use a different lock for different kinds of code segments that are not mutually exclusive. It may still be the case that you have two portions of code accessing the same data, in which case you need to use the same lock.

Producer-Consumer Example
- Shared buffer containing one task.
- No overwrite until cleared; no read until written.
- Pick one task at a time.
Note: This is better done with semaphores in this case.

Example

```c
pthread_mutex_t task_queue_lock;
int task_available;
...
main() {
    ...
    task_available = 0;
    pthread_mutex_init(&task_queue_lock, NULL);
    ...
}
```

Example (cont.)

```c
void *producer(void *producer_thread_data) {
    ...
    while (!done()) {
        inserted = 0;                   /* local */
        create_task(&my_task);
        while (inserted == 0) {
            pthread_mutex_lock(&task_queue_lock);
            /* critical section */
            if (task_available == 0) {
                insert_into_queue(my_task);
                task_available = 1;
                inserted = 1;
            }
            pthread_mutex_unlock(&task_queue_lock);
        }
    }
}
```

Why is this a bad example?

Example (cont.)

```c
void *consumer(void *consumer_thread_data) {
    while (!done()) {
        extracted = 0;
        while (extracted == 0) {
            pthread_mutex_lock(&task_queue_lock);
            if (task_available == 1) {
                extract_from_queue(&my_task);
                task_available = 0;
                extracted = 1;
            }
            pthread_mutex_unlock(&task_queue_lock);
        }
        process_task(my_task);
    }
}
```

Notes: Do it better with semaphores.

Producer-Consumers with Semaphores
Recall (semaphore S, semaphore M, shared data):

    Producer:                Consumer:
        do                       do
            P(S)                     P(M)
            write(&data)             if hasdata(&data)
            V(M)                         read(&data)
        loop                             V(M)
                                     else
                                         V(S)
                                 loop

Overhead of Locking
- Locks represent serialization points: keep critical sections small.
- Previous example: create & process tasks outside the critical section.
- Faster variant:

```c
int pthread_mutex_trylock(pthread_mutex_t *mutex_lock);
```

Does not block; returns EBUSY if it failed.

Notes: This variant is faster because there is no management of waiting queues and no waking up of blocked threads.

Example

```c
void *find_entries(void *start_pointer) {
    /* This is the thread function */
    struct database_record *next_record;
    int count;
    current_pointer = start_pointer;
    do {
        next_record = find_next_entry(current_pointer);
        count = output_record(next_record);
    } while (count < requested_number_of_records);
}
```

Notes: Find k matches in a list. The example is not fully correct.

Example (cont.)

```c
int output_record(struct database_record *record_ptr) {
    int count;
    pthread_mutex_lock(&output_count_lock);
    output_count++;
    count = output_count;
    pthread_mutex_unlock(&output_count_lock);
    if (count <= requested_number_of_records)
        print_record(record_ptr);
    return count;
}
```

Notes: Looks OK, but if the times of the search loop and of this section are comparable, then we have a terrible overhead.

Reducing Locking Overhead

```c
int output_record(struct database_record *record_ptr) {
    int count;
    int lock_status = pthread_mutex_trylock(&output_count_lock);
    if (lock_status == EBUSY) {
        insert_into_local_list(record_ptr);
        return 0;
    } else {
        count = output_count;
        output_count += number_on_local_list + 1;
        pthread_mutex_unlock(&output_count_lock);
        print_records(record_ptr, local_list,
                      requested_number_of_records - count);
        return count + number_on_local_list + 1;
    }
}
```

Notes: The example is in fact not completely correct (more entries are searched than asked for). Performance is better because the trylock call is much faster and the number of locked operations is reduced. Very important: the lock must be released, and only when it was acquired.

Try-lock
- Reduces idling overheads.
- Good if entering the critical section can be delayed.
- Cheaper call, although it is polling.

Condition Variables for Synchronization
- One condition variable per predicate.
- How to implement monitors with condition variables.

```c
int pthread_cond_wait(pthread_cond_t *cond,
                      pthread_mutex_t *mutex);
```

- A condition variable is always associated with a mutex: lock/unlock to test & wait, re-lock/unlock to re-test.
- Similar to the concept of monitors in Java, though implemented differently.

Notes: Associate one condition variable to one predicate only.

Condition Variables & Monitors
The monitor's "enter & test" corresponds to pthread_mutex_lock; on failure of the test, the thread calls pthread_cond_wait; when signaled (pthread_cond_signal), it re-enters and re-tests; on success, leaving the monitor corresponds to pthread_mutex_unlock.

Notes: In Java:

```java
synchronized void foo() {
    if (!condition) wait();
    ...
    notify();
}
```

Monitors with Pthread

```c
pthread_mutex_lock(&lock);
while (!condition) {
    pthread_cond_wait(&predicate, &lock);
}
/* critical section */
pthread_cond_signal(&predicate);
pthread_mutex_unlock(&lock);
```

Notes: Why do we have a loop on the condition variable? Because pthread_cond_wait may wake up spuriously, and another thread may invalidate the predicate before the woken thread re-acquires the mutex, so the predicate must be re-tested.

Monitors in Java

```java
synchronized void foo() {
    while (!condition)
        wait();
    // critical section
    notify();
}
```

Monitors in C#

```csharp
using System.Threading;

void foo() {
    Monitor.Enter(obj);
    while (!condition)
        Monitor.Wait(obj);
    // critical section
    Monitor.Pulse(obj);
    Monitor.Exit(obj);
}
```

Calls

```c
int pthread_cond_wait(pthread_cond_t *cond,
                      pthread_mutex_t *mutex);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);
int pthread_cond_init(pthread_cond_t *cond,
                      const pthread_condattr_t *attr);
int pthread_cond_destroy(pthread_cond_t *cond);
```

Notes: There is a variant, pthread_cond_timedwait, for a wait with a time-out.

Example: Producer-Consumer

```c
pthread_cond_t cond_queue_empty, cond_queue_full;
pthread_mutex_t task_queue_cond_lock;
int task_available;

main() {
    task_available = 0;
    pthread_init();
    pthread_cond_init(&cond_queue_empty, NULL);
    pthread_cond_init(&cond_queue_full, NULL);
    pthread_mutex_init(&task_queue_cond_lock, NULL);
    /* create and join producer and consumer threads */
}
```

Notes: The example is overkill and is here only for pedagogical purposes.

Example: Producer-Consumer
(task_available == 0: wait on cond_queue_empty; task_available == 1: wait on cond_queue_full)

```c
void *producer(void *producer_thread_data) {
    int inserted;
    while (!done()) {
        create_task();
        pthread_mutex_lock(&task_queue_cond_lock);
        while (!(task_available == 0)) {
            pthread_cond_wait(&cond_queue_empty,
                              &task_queue_cond_lock);
        }
        insert_into_queue();
        task_available = 1;
        pthread_cond_signal(&cond_queue_full);
        pthread_mutex_unlock(&task_queue_cond_lock);
    }
}
```

Example: Producer-Consumer
(task_available == 0: wait on cond_queue_empty; task_available == 1: wait on cond_queue_full)

```c
void *consumer(void *consumer_thread_data) {
    while (!done()) {
        pthread_mutex_lock(&task_queue_cond_lock);
        while (!(task_available == 1)) {
            pthread_cond_wait(&cond_queue_full,
                              &task_queue_cond_lock);
        }
        my_task = extract_from_queue();
        task_available = 0;
        pthread_cond_signal(&cond_queue_empty);
        pthread_mutex_unlock(&task_queue_cond_lock);
        process_task(my_task);
    }
}
```

Attribute Objects
To control threads and synchronization:
- Change the scheduling policy.
- Specify mutex types.
Types of mutexes:
- Normal: one lock per thread, or deadlock.
- Recursive: several locks by the same thread are OK.
- Error check: one lock per thread, or an error.

Thread Cancellation
Stop a thread in the middle of its work. The function may return before the thread is really stopped!

```c
int pthread_cancel(pthread_t thread);
```

Composite Synchronization Constructs
- The Pthread API offers (low-level) basic functions.
- Higher-level constructs are built with the basic functions:
  - Read-write locks.
  - Barriers.
- Well, in fact these two are part of the API.

Read-Write Locks
- Read often, write sometimes: multiple reads or a unique write.
- Priority of writers over readers.
- Use condition variables; count readers and writers.
- readers_proceed: pending_writers == 0 && writer == 0.
- writer_proceed: writer == 0 && readers == 0.

Read-Write Lock - RLocking

```c
void mylib_rwlock_rlock(mylib_rwlock_t *l) {
    pthread_mutex_lock(&(l->read_write_lock));
    while ((l->pending_writers > 0) || (l->writer > 0)) {
        pthread_cond_wait(&(l->readers_proceed),
                          &(l->read_write_lock));
    }
    l->readers++;
    pthread_mutex_unlock(&(l->read_write_lock));
}
```

Notes: Notice that there is no signal here.

Read-Write Lock - WLocking

```c
void mylib_rwlock_wlock(mylib_rwlock_t *l) {
    pthread_mutex_lock(&(l->read_write_lock));
    while ((l->writer > 0) || (l->readers > 0)) {
        l->pending_writers++;
        pthread_cond_wait(&(l->writer_proceed),
                          &(l->read_write_lock));
        l->pending_writers--;   /* typo in the book */
    }
    l->writer++;
    pthread_mutex_unlock(&(l->read_write_lock));
}
```

Notes: There is a mistake in the book in the while loop. Either you move the l->pending_writers-- inside the while loop, which is logical w.r.t. pending writers, or you move the l->pending_writers++ outside the loop. Keeping the increment inside and the decrement outside is utterly incorrect.

Read-Write Lock - Unlocking (?)

```c
void mylib_rwlock_unlock(mylib_rwlock_t *l) {
    pthread_mutex_lock(&(l->read_write_lock));
    if (l->writer > 0) {
        l->writer = 0;
    } else if (l->readers > 0) {
        l->readers--;
    }
    if ((l->readers == 0) && (l->pending_writers > 0)) {
        pthread_cond_signal(&(l->writer_proceed));
    } else if (l->readers > 0) {   /* bug */
        pthread_cond_broadcast(&(l->readers_proceed));
    }
    pthread_mutex_unlock(&(l->read_write_lock));
}
```

Another Bug?
- Example 7.7 has a bug: test & update is not atomic, as you can see.
- Fix: re-test after the write lock has been obtained.
- By the way, the read-lock is useless here.
- What you should do: acquire a write lock, test and update.

Barriers
- Encoded with a counter, a mutex, and a condition variable.
- Idea: count & block the threads, then signal them all.

Barriers

```c
void mylib_barrier(mylib_barrier_t *b, int num_threads) {
    pthread_mutex_lock(&(b->count_lock));
    b->count++;
    if (b->count == num_threads) {  /* last thread */
        b->count = 0;
        pthread_cond_broadcast(&(b->ok_to_proceed));
    } else {
        pthread_cond_wait(&(b->ok_to_proceed),
                          &(b->count_lock));
    }
    pthread_mutex_unlock(&(b->count_lock));
}
```

Notes: Performance bottleneck: the mutex serializes all the threads, so the execution time is O(n). It is possible to improve this by grouping threads by pairs.

O(n)?

Notes: Grouping threads by pairs gives a smaller constant because there is less contention.

Avoiding Incorrect Code
- Avoid relying on thread inertia: threads are asynchronous.
- Initialize data before starting threads.
- Never assume that a thread will wait for you. Never bet on a thread race: assume that at any point, any thread may go to sleep for any period of time.
- No ordering exists between threads unless you cause ordering.

Avoiding Incorrect Code
- Scheduling is not the same as synchronization: never use sleep to synchronize, and never try to tune with timing.
- Beware of deadlocks & priority inversion.
- One predicate, one condition variable.

Avoiding Performance Problems
- Beware of concurrent serialization.
- Use the right number of mutexes. Too much mutex contention, or too much locking without contention?
- Avoid false sharing.
- And don't forget to compile like this:

```shell
gcc -Wall -o hello hello.c -lpthread
```