COMP6771 Advanced C++ Programming


COMP6771 Advanced C++ Programming
Week 9: Multithreading
2016
www.cse.unsw.edu.au/~cs6771

Single Threaded Programs

All programs so far this semester have been single threaded.
They have a single call stack, which:
- increases when a function is called
- decreases when a function returns
[Diagram: snapshots of a single call stack over time as main() calls func1(), func1() returns, then main() calls func2()]
Only one command executes at any given time.

Multi-threaded Programs

Most modern CPUs are multi-threaded or have multiple cores.
This means they can execute multiple commands at the same time.
Dual and quad core CPUs are common.
GPUs (graphical processing units) can have thousands of cores.
Programs that use multiple threads can be many times faster than single-threaded programs.

Multi-threading in C++

Each program begins with one thread.
Can use the thread library in C++11 to create additional threads when required.
Each thread has its own call stack.
Consider a single-threaded call stack that calls two unrelated functions:
[Diagram: a single call stack over time - main() calls func1(), func1() returns, then main() calls func2()]
func2 can't be executed until after func1 has been executed.

Multi-threading in C++

Now consider a multi-threaded C++ program that creates a new thread for func1:
[Diagram: call stack 1 - main() calls func2(); call stack 2 - a new thread runs func1() and then terminates]
func1 and func2 run in parallel, speeding up the overall program execution time.

Common Types of Program Workloads

Sequential
  e.g., random number generation: the previous number(s) are used to calculate the next number.
Parallel
  e.g., compressing a file: the file can be split into chunks and a different processor core can independently compress each chunk (note: some sequential code may be required to split and join the chunks).
Embarrassingly parallel
  e.g., a multi-threaded webserver that serves files to each user using a different processor. No communication between threads is required.

Speedup
A metric to measure the increase in performance of parallel over sequential task execution. Super-linear speedup can occur in embarrassingly parallel problems.
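The slide does not give a formula for speedup; the usual definition (an assumption here, not stated on the slide) is the ratio of sequential to parallel execution time for n cores:

\[
\text{Speedup}(n) = \frac{T_{\text{sequential}}}{T_{\text{parallel}}(n)}
\]

Speedup(n) greater than n is what is meant by super-linear speedup above.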

Memory Models

Shared Memory
  All threads share the same memory. Need to ensure that one thread doesn't write to the same memory location that another thread is reading from at the same time.
Local Memory
  Each thread has its own exclusive memory that other threads cannot access.
Message Passing
  Each thread has its own exclusive memory and can share data with other threads by sending them data objects through messages.

Symmetric Multiprocessing (SMP)

Most consumer multi-core CPUs have this architecture.
https://en.wikipedia.org/wiki/Symmetric_multiprocessing

Atomic Operations

An operation that, generally, takes one CPU clock cycle. Most memory writes are atomic.
Consider the command: i++
When you compile it, the machine code may look like:
  1. Load value from RAM location 0x42 into register r1
  2. Increase value in register r1
  3. Store value from register r1 into RAM location 0x42
[Diagram: CPU registers connected to RAM via the bus]
Each machine instruction takes one clock cycle.
Only one thread can write to a specific memory location each cycle (but they may be able to write to different locations).

Race Conditions

Consider: int i = 1;
Thread 1 wants to increment i: i++;
Thread 2 wants to decrement i: i--;
The two commands may consist of the following machine instructions:

// Increment:
  1. Load value from RAM location 0x42 into register r1
  2. Increase value in register r1
  3. Store value from register r1 into RAM location 0x42

// Decrement:
  1. Load value from RAM location 0x42 into register r2
  2. Decrease value in register r2
  3. Store value from register r2 into RAM location 0x42

Three Possible Outcomes

Because the write into memory location 0x42 is atomic, there are three possible final values for i.

Case 1: the increment happens entirely before the decrement. Afterwards, i == 1.
// Thread 1 runs first:
Load 0x42 into register r1
Increase register r1
Store register r1 into 0x42
// Thread 2 runs second:
Load 0x42 into register r2
Decrease register r2
Store register r2 into 0x42

Case 2: the decrement happens entirely before the increment. Afterwards, i == 1.
// Thread 2 runs first:
Load 0x42 into register r2
Decrease register r2
Store register r2 into 0x42
// Thread 1 runs second:
Load 0x42 into register r1
Increase register r1
Store register r1 into 0x42

Case 3: the increment happens in parallel with the decrement. Afterwards, i == 0 (or 2).
// One possible interleaving - both threads load the original value before either stores:
Thread 1: Load 0x42 into register r1
Thread 2: Load 0x42 into register r2
Thread 1: Increase register r1
Thread 2: Decrease register r2
Thread 1: Store register r1 into 0x42
Thread 2: Store register r2 into 0x42
// The final value is whichever store happens last.

Problem! Incrementing and decrementing i should leave i == 1, not 0 (or 2).
Example: threadlamdarace.cpp

threadlamdarace.cpp

#include <iostream>
#include <thread>

// NOTE: do not compile this with -O2; it optimises out the ++
// call and prevents the race condition from happening.

int main() {
  int i = 1;
  const long numiterations = 1000000;
  std::thread t1{[&] {
    for (int j = 0; j < numiterations; ++j) {
      i++;
    }
  }};
  std::thread t2{[&] {
    for (int j = 0; j < numiterations; ++j) {
      i--;
    }
  }};
  t1.join();
  t2.join();
  std::cout << i << std::endl;
}

Mutual Exclusion: Critical Section

We need a way to ensure that these three machine instructions always happen as one block operation.
This block of machine instructions is called a critical section.
A Mutex allows us to only let a single thread at a time access a critical section.
A Mutex is an object that is shared between threads.
Generally you have a Mutex object for each critical section (or shared variable) in your code.

Mutual Exclusion: Locking

A Mutex can act as a lock on a critical section of code.
A thread locks a mutex when it wants to execute a section of code and unlocks it when it is finished with that section.
If a second thread wants to lock the mutex while it is already locked, the second thread has to wait for the first thread to unlock it.
Forgetting to unlock mutexes is a major source of bugs!
Additionally, you can write code that directly accesses a critical section without using a mutex, but this is very dumb.
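The slides do not show mutex code at this point; the following is a minimal sketch (not from the slides, names like iMutex are illustrative) of protecting the earlier increment/decrement race with std::mutex and std::lock_guard from the C++11 <mutex> header:

#include <iostream>
#include <mutex>
#include <thread>

int main() {
  int i = 1;
  std::mutex iMutex;  // one mutex guarding the shared variable i
  const long numiterations = 1000000;
  std::thread t1{[&] {
    for (int j = 0; j < numiterations; ++j) {
      std::lock_guard<std::mutex> lock{iMutex};  // locks here, unlocks at end of scope
      i++;
    }
  }};
  std::thread t2{[&] {
    for (int j = 0; j < numiterations; ++j) {
      std::lock_guard<std::mutex> lock{iMutex};
      i--;
    }
  }};
  t1.join();
  t2.join();
  std::cout << i << std::endl;  // always prints 1
}

std::lock_guard unlocks the mutex in its destructor, which avoids the "forgetting to unlock" bug mentioned above.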

Deadlocks

Consider the following:
You have two shared objects a and b.
You have two shared mutexes: mutexa and mutexb.
Thread 1 wants to call the function: std::swap(a, b)
Thread 2 wants to call the function: std::swap(b, a)
The code may look something like this:

// Thread 1:
mutexa.lock();
mutexb.lock();
std::swap(a, b);
mutexa.unlock();
mutexb.unlock();

// Thread 2:
mutexb.lock();
mutexa.lock();
std::swap(b, a);
mutexb.unlock();
mutexa.unlock();

What is the problem with this code?

Deadlocks

// Thread 1:
mutexa.lock();
mutexb.lock();
std::swap(a, b);
mutexa.unlock();
mutexb.unlock();

// Thread 2:
mutexb.lock();
mutexa.lock();
std::swap(b, a);
mutexb.unlock();
mutexa.unlock();

What is the problem with this code?
Thread 1 locks mutexa, then tries to lock mutexb, which is already locked by thread 2, so it waits for it to be unlocked.
Thread 2 locks mutexb, then tries to lock mutexa, which is already locked by thread 1, so it waits for it to be unlocked.
The code is deadlocked as both threads are waiting for each other.
Deadlocks are a major source of bugs and are not easy to test for.
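The slides do not show a fix here; one standard C++11 remedy (a sketch, not from the slides, with illustrative names such as safeSwap) is to acquire both mutexes with std::lock, which locks them together without deadlocking, and then adopt them into std::lock_guard objects:

#include <mutex>
#include <thread>
#include <utility>

std::mutex mutexa;
std::mutex mutexb;
int a = 1;
int b = 2;

void safeSwap() {
  // std::lock acquires both mutexes in a deadlock-free way,
  // regardless of the order in which the two threads run.
  std::lock(mutexa, mutexb);
  std::lock_guard<std::mutex> guarda{mutexa, std::adopt_lock};
  std::lock_guard<std::mutex> guardb{mutexb, std::adopt_lock};
  std::swap(a, b);
}  // both mutexes released as the lock_guards go out of scope

int main() {
  std::thread t1{safeSwap};
  std::thread t2{safeSwap};
  t1.join();
  t2.join();
}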

std::thread

Defined in the <thread> header file.
Need to link against pthread under g++ on Linux:
g++ -std=c++11 -Wall -Werror -O2 -pthread -o file file.cpp
New threads can be started on functions, function objects, lambdas, or member functions.

Threads with Function Pointer

#include <iostream>
#include <thread>

void counter(int numiterations) {
  for (int i = 0; i < numiterations; ++i) {
    std::cout << "Thread id: " << std::this_thread::get_id();
    std::cout << " value = " << i << std::endl;
  }
}

int main() {
  std::thread t1{counter, 6};  // start two threads
  std::thread t2{counter, 4};
  counter(2);  // call counter on the main thread too
  t1.join();   // wait for the two threads to end
  t2.join();
}

join()
The call to join() for each thread is required for this code to work. It blocks the main thread, which waits until the other threads have finished.

Threads with Function Object
(See: p. 750, Gregoire, M., Professional C++ (2014).)

#include <iostream>
#include <thread>

struct Counter {
  int numiterations;
  Counter(int n) : numiterations{n} {}  // constructor
  void operator()() {
    for (int i = 0; i < numiterations; ++i) {
      std::cout << "Thread id: " << std::this_thread::get_id();
      std::cout << " value = " << i << std::endl;
    }
  }
};

int main() {
  std::thread t1{Counter{10}};  // temporary object
  // named variable
  Counter c{5};
  std::thread t2{c};
  t1.join();
  t2.join();
}

Threads with Lambda Functions

#include <iostream>
#include <thread>

int main() {
  auto lamdafunc = [](int numiterations) {
    for (int i = 0; i < numiterations; ++i) {
      std::cout << "Thread id: " << std::this_thread::get_id();
      std::cout << " value = " << i << std::endl;
    }
  };

  std::thread t1{lamdafunc, 10};
  std::thread t2{lamdafunc, 5};
  t1.join();
  t2.join();
}

Threads with Lambda Functions 2

Be careful if you have lambda captures by reference.

#include <iostream>
#include <thread>

int main() {
  int numiterations = 10;

  auto lamdafunc = [&numiterations]() {
    for (int i = 0; i < numiterations; ++i) {
      std::cout << "Thread id: " << std::this_thread::get_id();
      std::cout << " value = " << i << std::endl;
    }
  };

  std::thread t1{lamdafunc};
  numiterations = 5;
  std::thread t2{lamdafunc};
  t1.join();
  t2.join();
}

How many times does thread t1 print?

Threads with Lambda Functions 3

Be careful if you have lambda captures by value.

#include <iostream>
#include <thread>

int main() {
  int numiterations = 10;

  auto lamdafunc = [numiterations]() {
    for (int i = 0; i < numiterations; ++i) {
      std::cout << "Thread id: " << std::this_thread::get_id();
      std::cout << " value = " << i << std::endl;
    }
  };

  std::thread t1{lamdafunc};
  numiterations = 5;
  std::thread t2{lamdafunc};
  t1.join();
  t2.join();
}

How many times does thread t1 print?

Threads with Member Function

#include <iostream>
#include <thread>

struct Iterator {
  void counter(int numiterations) {
    for (int i = 0; i < numiterations; ++i) {
      std::cout << "Thread id: " << std::this_thread::get_id();
      std::cout << " value = " << i << std::endl;
    }
  }
};

int main() {
  Iterator i;  // create an object
  // call the member function through a thread
  std::thread t1{&Iterator::counter, &i, 10};
  // call the same member function on the same object in parallel
  i.counter(5);
  t1.join();
}

Atomic Types

Atomic types automatically synchronise and prevent race conditions.
The C++11 standard provides atomic primitive types:

Typedef name    Full specialisation
atomic_bool     atomic<bool>
atomic_char     atomic<char>
atomic_int      atomic<int>
atomic_uint     atomic<unsigned int>
atomic_long     atomic<long>
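Each typedef names the same type as the corresponding specialisation, which (as a quick compile-time check, not from the slides) can be verified with static_assert:

#include <atomic>
#include <type_traits>

// std::atomic_int is just a typedef for std::atomic<int>, and so on.
static_assert(std::is_same<std::atomic_int, std::atomic<int>>::value,
              "atomic_int is the int specialisation");
static_assert(std::is_same<std::atomic_bool, std::atomic<bool>>::value,
              "atomic_bool is the bool specialisation");

int main() {}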

Example

Our problematic race condition fixed with an atomic int:

#include <iostream>
#include <thread>
#include <atomic>

int main() {
  std::atomic<int> i{1};
  const long numiterations = 1000000;
  std::thread t1([&i, numiterations] {
    for (int j = 0; j < numiterations; ++j) {
      i++;
    }
  });
  std::thread t2([&i, numiterations] {
    for (int j = 0; j < numiterations; ++j) {
      i--;
    }
  });
  t1.join();
  t2.join();
  std::cout << i << std::endl;
}

Is this a good approach?

Atomic Performance Problem

The purpose of multi-threaded code is to improve performance.
Atomic operations slow down our code as they need to synchronise through locking.
How you structure multi-threaded code has a large influence on performance.
In our example we lock every time we need to increment; how might we do this better?

Improved Performance Example

#include <iostream>
#include <thread>
#include <atomic>

int main() {
  std::atomic<int> i{1};
  const long numiterations = 1000000;
  std::thread t1([&i, numiterations] {
    int k = 0;  // use a local variable
    for (int j = 0; j < numiterations; ++j) {
      k++;  // generally more complex math
    }
    i += k;
  });
  std::thread t2([&i, numiterations] {
    int k = 0;
    for (int j = 0; j < numiterations; ++j) {
      k--;
    }
    i += k;  // k is negative, so this subtracts numiterations from i
  });
  t1.join();
  t2.join();
  std::cout << i << std::endl;
}

Only write to the atomic int once per thread!

The <atomic> Library

The standard defines a number of operations on atomic types. These operations will generally be faster than equivalent code you write yourself.
Example, fetch_add():

#include <iostream>
#include <atomic>
#include <thread>

int main() {
  std::atomic<int> i{10};
  std::cout << "Initial Value: " << i << std::endl;
  int fetchedvalue = i.fetch_add(5);
  std::cout << "Fetched Value: " << fetchedvalue << std::endl;
  std::cout << "Current Value: " << i << std::endl;
}

Output:
Initial Value: 10
Fetched Value: 10
Current Value: 15
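fetch_add() is only one such operation; as a further illustration (a sketch, not from the slides), exchange() and compare_exchange_strong() are two other member functions available on std::atomic<int>:

#include <atomic>
#include <iostream>

int main() {
  std::atomic<int> i{10};

  // exchange(): atomically store a new value and return the old one
  int old = i.exchange(42);
  std::cout << "old = " << old << ", current = " << i << std::endl;  // old = 10, current = 42

  // compare_exchange_strong(): store 100 only if i still holds the expected value
  int expected = 42;
  bool swapped = i.compare_exchange_strong(expected, 100);
  std::cout << "swapped = " << swapped << ", current = " << i << std::endl;  // swapped = 1, current = 100
}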