Threads: either under- or over-utilised


1 Threads: either under- or over-utilised
- Under-utilised: limited by creation speed of work
  - Cannot exploit all the CPUs even though there is more work
- Over-utilised: losing performance due to context switches
  - There is overhead when switching between OS threads
  - Each thread needs to warm up cache again
  - Increases memory pressure
- Worst case: continual slow-down
  - The cost of creating threads is partially borne by the kernel
  - User code may slow down more than kernel code under load
  - Number of workers slowly goes up; completion rate goes down
HPCE / dt10 / 2015 / 7.1

2 Solving under-utilisation

template<class TI, class TF>
void parallel_for(const TI &begin, const TI &end, const TF &f)
{
    if(begin+1 == end){
        f(begin);
    }else{
        TI mid=(begin+end)/2;
        // Spawn the left thread in parallel
        std::thread left( [&](){ parallel_for(begin, mid, f); } );
        // Perform the right segment on our thread
        parallel_for(mid, end, f);
        // Wait for the left to finish
        left.join();
    }
}

3 Creation of work using trees
- Tree starts on one thread
[diagram: the root task covers the range [0..4)]
(same parallel_for code as the previous slide)

4 Creation of work using trees
- Tree starts on one thread
- Create thread to branch
[diagram: [0..4) spawns [0..2) onto a new thread and continues as [2..4)]
(same parallel_for code as the previous slide)

5 Creation of work using trees
- Tree starts on one thread
- Create thread to branch
[diagram: [0..4) splits into [0..2) and [2..4), which split into the leaves [0..1), [1..2), [2..3) and [3..4)]
(same parallel_for code as the previous slide)

6 Creation of work using trees
- Tree starts on one thread
- Create thread to branch
- Execute function at leaves
[diagram: the leaves [0..1), [1..2), [2..3), [3..4) execute f(0), f(1), f(2), f(3)]
(same parallel_for code as the previous slide)

7 Creation of work using trees
- Tree starts on one thread
- Create thread to branch
- Execute function at leaves
- Join back up to the root
[diagram: after the leaves run f(0)..f(3), the pairs join into [0..2) and [2..4), which join into [0..4)]
(same parallel_for code as the previous slide)

8 Properties of fork / join trees
- Recursively creating trees of work is very efficient
  - We are not limited to one thread creating all tasks
  - Exponential rather than linear growth of threads with time
- Problem solved? Growth of threads is exponential with time
  - Can put significant pressure on the OS thread scheduler
  - Context switching 1000s of threads is very inefficient
- Each thread requires significant resources
  - Needs kernel handles, stack, thread-info block, ...
  - Can't allocate more than a few thousand threads per process

9 Re-examining the goals
- What we want is parallel_for:
  - Iterations may execute in parallel
- std::thread gives us something different:
  - The new thread will execute in parallel
- Our thread-based strategy is too eager to go parallel
  - We want to go just parallel enough, then stay serial

10 Tasks versus threads
- A task is a chunk of work that can be executed
  - A task may execute in parallel with other tasks
  - A task will eventually be executed, but there is no guarantee on when
- Tasks are scheduled and executed by a run-time (TBB)
  - Maintains a list of tasks which are ready to run
  - Has one thread per CPU for running tasks
  - If a thread is idle, assigns a task from the ready queue to it
  - No limit on the number of tasks which are ready to run
  - (The OS is still responsible for mapping threads to CPUs)
- TBB has a number of high-level ways to use tasks
  - But there is a single low-level underlying task primitive

11 Overview of task groups
- A task group collects together a number of child tasks
  - The task creating the group is called the parent
  - One or more child tasks are created and run() by the parent
  - Child tasks may execute in parallel
  - The parent task must wait() for all child tasks before returning

12 parallel_for using tbb::task_group

#include "tbb/task_group.h"

template<class TI, class TF>
void parallel_for(const TI &begin, const TI &end, const TF &f)
{
    if(begin+1 == end){
        f(begin);
    }else{
        auto left=[&](){ parallel_for(begin, (begin+end)/2, f); };
        auto right=[&](){ parallel_for((begin+end)/2, end, f); };
        // Spawn the two tasks in a group
        tbb::task_group group;
        group.run(left);
        group.run(right);
        group.wait(); // Wait for both to finish
    }
}

13 Overview of task groups
- A task group collects together a number of child tasks
  - The task creating the group is called the parent
  - One or more child tasks are created and run() by the parent
  - Child tasks may execute in parallel
  - The parent task must wait() for all child tasks before returning
- Some important differences between tasks and threads
  - Threads must execute in parallel
  - A thread may continue after its creator exits
  - Threads must be joined individually

14 More patterns: tbb::parallel_invoke

template<typename Func0, typename Func1>
void parallel_invoke(const Func0& f0, const Func1& f1);

template<typename Func0, typename Func1, typename Func2>
void parallel_invoke(const Func0& f0, const Func1& f1, const Func2& f2);

- Takes two or more functions and may run them in parallel
  - Overloaded for different numbers of arguments
  - No overload for 1 argument, for obvious reasons
- Interface is very clean, but also quite simple
  - The decision about the number of tasks is completely static
  - You can't add more tasks once execution starts
  - No choice about when to synchronise with tasks

15 parallel_invoke using task_group
- parallel_invoke can be implemented using task_group
  - task_group supports a super-set of the functionality

template<typename Fc0, typename Fc1, typename Fc2>
void parallel_invoke(const Fc0& f0, const Fc1& f1, const Fc2& f2)
{
    tbb::task_group group;
    group.run(f0);
    group.run(f1);
    group.run(f2);
    group.wait();
}

16 Can't do task_group using parallel_invoke
- task_group is intrinsically dynamic
  - Decide how much work to add at run-time
  - Can add work even while tasks are running in the group

void my_function(int n, float *x)
{
    tbb::task_group group;
    for(unsigned i=0; i<n; i++){
        if(x[i]==0){
            group.run([=](){ f(i); });
        }else{
            group.run([=](){ g(x[i]); });
        }
    }
    group.wait();
}

17 The underlying primitive: tbb::task
- TBB has a basic primitive called tbb::task
  - This is the raw unit of scheduling understood by the library
  - Other high-level wrappers create tasks internally
  - The TBB run-time takes tasks and schedules them to a CPU
- Tasks are very flexible, with a lot of power
  - Can express complicated dependency graphs
  - Build non-local synchronisation and barriers
- With power comes responsibility
  - They allow you to make mistakes
  - Possible (though not likely) to mess up the TBB run-time
- Better to create wrappers on top that hide tasks
  - parallel_for, parallel_invoke, task_group, parallel_reduce, ...

18 The same recursion with task_group and with raw tbb::task

// Identifier names were lost in the transcription; RunTasks and MyTask
// below are placeholders.

// With tbb::task_group:
void RunTasks(int start, int end)
{
    if(cond()){
        tbb::task_group group;
        group.run([=](){ RunTasks(start,(start+end)/2); });
        group.run([=](){ RunTasks((start+end)/2,end); });
        DoSomethingFirst();
        group.wait();
        DoSomethingElse();
    }
}

// With raw tbb::task:
class MyTask : public tbb::task
{
    int start, end;
public:
    MyTask(int _start, int _end)
    { start=_start; end=_end; }

    tbb::task *execute()
    {
        if(cond()){
            set_ref_count(3);
            MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
            spawn(t1);
            MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
            spawn(t2);
            DoSomethingFirst();
            wait_for_all();
            DoSomethingElse();
        }
        return NULL;
    }
};

void CreateTasks(int start, int end)
{
    MyTask &root=*new(tbb::task::allocate_root()) MyTask(start,end);
    tbb::task::spawn_root_and_wait(root);
}

19 Life-cycle of a task
- The life-cycle of a task is due to interaction between the task and the run-time
  - The individual task calls spawn, wait_for_all (sync), return
  - The TBB run-time keeps track of a task's children (dependencies)

20 Scheduling through reference counts
- Each task has a reference count and a successor task
- The reference count identifies whether a task is blocked
  - If the reference count is zero then the task could be run
  - But only if it has been given to the task scheduler
  - Legal to create a task and not give it to the scheduler
  - Note the difference: reference count vs C++ reference
- The successor task identifies the task blocked by this task
  - Generally the successor is the creator, or parent
  - When a task completes it decrements the count of its successor

21 tbb::task: root task Created

void CreateTasks(int start, int end)
{
    // "MyTask" is a placeholder: the class name was not preserved
    MyTask &root=*new(tbb::task::allocate_root()) MyTask(start,end);
    tbb::task::spawn_root_and_wait(root);
}

22 tbb::task: root task Running
(same CreateTasks code as the previous slide)

23 tbb::task: root task Running

tbb::task *MyTask::execute()   // "MyTask" is a placeholder name
{
    if(cond()){
        set_ref_count(3);
        MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
        MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
        spawn(t1);
        spawn(t2);
        DoSomethingFirst();
        wait_for_all();
        DoSomethingElse();
    }
    return NULL;
}

24 tbb::task: root task Running, ref count 3
(same execute() code as the previous slide)

25 tbb::task: root Running (ref count 3); first child Created
(same execute() code as above)

26 tbb::task: root Running (ref count 3); both children Created
(same execute() code as above)

27 tbb::task: root Running (ref count 3); both children Ready (spawned)
(same execute() code as above)

28 tbb::task: root Running (ref count 3); one child Running, one Ready
(same execute() code as above)

29 tbb::task: root Blocked in wait_for_all (ref count 2); one child Running, one Ready
(same execute() code as above)

30 tbb::task: root Blocked (ref count 2); one child Running, one Ready
(same execute() code as above)

31 tbb::task: root Blocked (ref count 2); both children Running
(same execute() code as above)

32 tbb::task: root Blocked (ref count 1); one child Running, one Finished
(same execute() code as above)

33 tbb::task: root Running again; one child Finished
(same execute() code as above)

34 tbb::task: root Running
(same execute() code as above)

35 tbb::task: root task Finished
(same CreateTasks code as slide 21)

36 Managing reference counts
- What happens if we get the reference count wrong?
- A finishing task calls decrement_ref_count on its successor
  - Automatically returns the task to the scheduler if the count becomes zero

37 tbb::task: root Running, ref count wrongly set to 1
(same execute() code as slide 23, but with set_ref_count(1) instead of set_ref_count(3))

38 tbb::task: root Running; one child Finished, one Ready; the count has already been exhausted
(same execute() code as slide 23, but with set_ref_count(1))

39 tbb::task: root Running; one child already Running; ref count not yet set
(same execute() code as slide 23, but with set_ref_count(3) moved after the two spawn() calls)

40 tbb::task: root Running; one child Finished, one Ready; ref count -1
(same execute() code as slide 23, with set_ref_count(3) after the spawns)

41 Some help is available
- The TBB library comes in two forms: debug and release
  - The release library does no error checking: all about speed
  - The debug library will check reference counts at many points
- Choose the library version at compilation and link stages
  - Debug: #define TBB_USE_DEBUG=1 when compiling
  - On Microsoft compilers it will automatically link the correct library
  - On other compilers use -ltbb vs -ltbb_debug
- Usually maintain different release and debug settings
  - Debug: /DTBB_USE_DEBUG=1 /MDd
  - Release: /DNDEBUG=1 /O2
  - Can be set up in Visual Studio or in a makefile
- Generally: try to avoid raw tasks if possible

42 Many design patterns are built on tasks
- Iteration in various forms
  - parallel_for, parallel_for_each
- Reduction and accumulation
  - parallel_reduce
- Data-dependent looping and queue processing
  - parallel_do
- Support for heterogeneous tasks
  - parallel_invoke, task_group
- Heterogeneous tasks and token-based data-flow
  - parallel_pipeline
- Goal: turn design patterns into concrete functions


A wishlist
- In-process shared-memory parallelism
  - Can directly pass pointers to data-structures between tasks
- Sophisticated design-patterns
  - Data-parallelism and pipeline-parallelism and task parallelism...


More information

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization

More information

CPSC 341 OS & Networks. Processes. Dr. Yingwu Zhu

CPSC 341 OS & Networks. Processes. Dr. Yingwu Zhu CPSC 341 OS & Networks Processes Dr. Yingwu Zhu Process Concept Process a program in execution What is not a process? -- program on a disk A process is an active object, but a program is just a file It

More information

Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing

Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Shigeki Akiyama, Kenjiro Taura The University of Tokyo June 17, 2015 HPDC 15 Lightweight Threads Lightweight threads enable

More information

Process. Heechul Yun. Disclaimer: some slides are adopted from the book authors slides with permission 1

Process. Heechul Yun. Disclaimer: some slides are adopted from the book authors slides with permission 1 Process Heechul Yun Disclaimer: some slides are adopted from the book authors slides with permission 1 Recap OS services Resource (CPU, memory) allocation, filesystem, communication, protection, security,

More information

Simulating Multi-Core RISC-V Systems in gem5

Simulating Multi-Core RISC-V Systems in gem5 Simulating Multi-Core RISC-V Systems in gem5 Tuan Ta, Lin Cheng, and Christopher Batten School of Electrical and Computer Engineering Cornell University 2nd Workshop on Computer Architecture Research with

More information

Optimize an Existing Program by Introducing Parallelism

Optimize an Existing Program by Introducing Parallelism Optimize an Existing Program by Introducing Parallelism 1 Introduction This guide will help you add parallelism to your application using Intel Parallel Studio. You will get hands-on experience with our

More information

CMPSC 311- Introduction to Systems Programming Module: Concurrency

CMPSC 311- Introduction to Systems Programming Module: Concurrency CMPSC 311- Introduction to Systems Programming Module: Concurrency Professor Patrick McDaniel Fall 2013 Sequential Programming Processing a network connection as it arrives and fulfilling the exchange

More information

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

by Marina Cholakyan, Hyduke Noshadi, Sepehr Sahba and Young Cha

by Marina Cholakyan, Hyduke Noshadi, Sepehr Sahba and Young Cha CS 111 Scribe Notes for 4/11/05 by Marina Cholakyan, Hyduke Noshadi, Sepehr Sahba and Young Cha Processes What is a process? A process is a running instance of a program. The Web browser you're using to

More information

Task-based Data Parallel Programming

Task-based Data Parallel Programming Task-based Data Parallel Programming Asaf Yaffe Developer Products Division May, 2009 Agenda Overview Data Parallel Algorithms Tasks and Scheduling Synchronization and Concurrent Containers Summary 2 1

More information

CMPSC 311- Introduction to Systems Programming Module: Concurrency

CMPSC 311- Introduction to Systems Programming Module: Concurrency CMPSC 311- Introduction to Systems Programming Module: Concurrency Professor Patrick McDaniel Fall 2016 Sequential Programming Processing a network connection as it arrives and fulfilling the exchange

More information

Chapter 4: Threads. Chapter 4: Threads

Chapter 4: Threads. Chapter 4: Threads Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

Start of Lecture on January 22, 2014

Start of Lecture on January 22, 2014 Start of Lecture on January 22, 2014 Chapter 3: 4: Processes Threads 1 Chapter 3: 4: Processes Threads 2 Reminders Likely you ve gotten some useful info for your assignment and are hopefully working on

More information

Chapter 3: Operating-System Structures

Chapter 3: Operating-System Structures Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System Design and Implementation System Generation 3.1

More information

the gamedesigninitiative at cornell university Lecture 7 C++ Overview

the gamedesigninitiative at cornell university Lecture 7 C++ Overview Lecture 7 Lecture 7 So You Think You Know C++ Most of you are experienced Java programmers Both in 2110 and several upper-level courses If you saw C++, was likely in a systems course Java was based on

More information

OpenMP 4.0. Mark Bull, EPCC

OpenMP 4.0. Mark Bull, EPCC OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!

More information

Roadmap. Tevfik Ko!ar. CSC Operating Systems Spring Lecture - III Processes. Louisiana State University. Virtual Machines Processes

Roadmap. Tevfik Ko!ar. CSC Operating Systems Spring Lecture - III Processes. Louisiana State University. Virtual Machines Processes CSC 4103 - Operating Systems Spring 2008 Lecture - III Processes Tevfik Ko!ar Louisiana State University January 22 nd, 2008 1 Roadmap Virtual Machines Processes Basic Concepts Context Switching Process

More information

LECTURE 11 TREE TRAVERSALS

LECTURE 11 TREE TRAVERSALS DATA STRUCTURES AND ALGORITHMS LECTURE 11 TREE TRAVERSALS IMRAN IHSAN ASSISTANT PROFESSOR AIR UNIVERSITY, ISLAMABAD BACKGROUND All the objects stored in an array or linked list can be accessed sequentially

More information

MID TERM MEGA FILE SOLVED BY VU HELPER Which one of the following statement is NOT correct.

MID TERM MEGA FILE SOLVED BY VU HELPER Which one of the following statement is NOT correct. MID TERM MEGA FILE SOLVED BY VU HELPER Which one of the following statement is NOT correct. In linked list the elements are necessarily to be contiguous In linked list the elements may locate at far positions

More information

Lecture Topics. Announcements. Today: Threads (Stallings, chapter , 4.6) Next: Concurrency (Stallings, chapter , 5.

Lecture Topics. Announcements. Today: Threads (Stallings, chapter , 4.6) Next: Concurrency (Stallings, chapter , 5. Lecture Topics Today: Threads (Stallings, chapter 4.1-4.3, 4.6) Next: Concurrency (Stallings, chapter 5.1-5.4, 5.7) 1 Announcements Make tutorial Self-Study Exercise #4 Project #2 (due 9/20) Project #3

More information

Computer Systems II. First Two Major Computer System Evolution Steps

Computer Systems II. First Two Major Computer System Evolution Steps Computer Systems II Introduction to Processes 1 First Two Major Computer System Evolution Steps Led to the idea of multiprogramming (multiple concurrent processes) 2 1 At First (1945 1955) In the beginning,

More information

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store

More information

Processes and Threads. Processes and Threads. Processes (2) Processes (1)

Processes and Threads. Processes and Threads. Processes (2) Processes (1) Processes and Threads (Topic 2-1) 2 홍성수 Processes and Threads Question: What is a process and why is it useful? Why? With many things happening at once in a system, need some way of separating them all

More information

What is a Process? Processes and Process Management Details for running a program

What is a Process? Processes and Process Management Details for running a program 1 What is a Process? Program to Process OS Structure, Processes & Process Management Don Porter Portions courtesy Emmett Witchel! A process is a program during execution. Ø Program = static file (image)

More information

MetaFork: A Metalanguage for Concurrency Platforms Targeting Multicores

MetaFork: A Metalanguage for Concurrency Platforms Targeting Multicores MetaFork: A Metalanguage for Concurrency Platforms Targeting Multicores Xiaohui Chen, Marc Moreno Maza & Sushek Shekar University of Western Ontario September 1, 2013 Document number: N1746 Date: 2013-09-01

More information

OPERATING SYSTEM. Chapter 4: Threads

OPERATING SYSTEM. Chapter 4: Threads OPERATING SYSTEM Chapter 4: Threads Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples Objectives To

More information

Operating Systems (2INC0) 2017/18

Operating Systems (2INC0) 2017/18 Operating Systems (2INC0) 2017/18 Memory Management (09) Dr. Courtesy of Dr. I. Radovanovic, Dr. R. Mak (figures from Bic & Shaw) System Architecture and Networking Group Agenda Reminder: OS & resources

More information

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016

Copyright Khronos Group, Page 1 SYCL. SG14, February 2016 Copyright Khronos Group, 2014 - Page 1 SYCL SG14, February 2016 BOARD OF PROMOTERS Over 100 members worldwide any company is welcome to join Copyright Khronos Group 2014 SYCL 1. What is SYCL for and what

More information

Chapter 4: Threads. Operating System Concepts 9 th Edit9on

Chapter 4: Threads. Operating System Concepts 9 th Edit9on Chapter 4: Threads Operating System Concepts 9 th Edit9on Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads 1. Overview 2. Multicore Programming 3. Multithreading Models 4. Thread Libraries 5. Implicit

More information

Chapter 5: Processes & Process Concept. Objectives. Process Concept Process Scheduling Operations on Processes. Communication in Client-Server Systems

Chapter 5: Processes & Process Concept. Objectives. Process Concept Process Scheduling Operations on Processes. Communication in Client-Server Systems Chapter 5: Processes Chapter 5: Processes & Threads Process Concept Process Scheduling Operations on Processes Interprocess Communication Communication in Client-Server Systems, Silberschatz, Galvin and

More information

CNIT 127: Exploit Development. Ch 3: Shellcode. Updated

CNIT 127: Exploit Development. Ch 3: Shellcode. Updated CNIT 127: Exploit Development Ch 3: Shellcode Updated 1-30-17 Topics Protection rings Syscalls Shellcode nasm Assembler ld GNU Linker objdump to see contents of object files strace System Call Tracer Removing

More information

Absolute C++ Walter Savitch

Absolute C++ Walter Savitch Absolute C++ sixth edition Walter Savitch Global edition This page intentionally left blank Absolute C++, Global Edition Cover Title Page Copyright Page Preface Acknowledgments Brief Contents Contents

More information

Parallel Processing with the SMP Framework in VTK. Berk Geveci Kitware, Inc.

Parallel Processing with the SMP Framework in VTK. Berk Geveci Kitware, Inc. Parallel Processing with the SMP Framework in VTK Berk Geveci Kitware, Inc. November 3, 2013 Introduction The main objective of the SMP (symmetric multiprocessing) framework is to provide an infrastructure

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems Processes CS 475, Spring 2018 Concurrent & Distributed Systems Review: Abstractions 2 Review: Concurrency & Parallelism 4 different things: T1 T2 T3 T4 Concurrency: (1 processor) Time T1 T2 T3 T4 T1 T1

More information

Concurrency, Thread. Dongkun Shin, SKKU

Concurrency, Thread. Dongkun Shin, SKKU Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point

More information

Binary Trees and Binary Search Trees

Binary Trees and Binary Search Trees Binary Trees and Binary Search Trees Learning Goals After this unit, you should be able to... Determine if a given tree is an instance of a particular type (e.g. binary, and later heap, etc.) Describe

More information

Compositional C++ Page 1 of 17

Compositional C++ Page 1 of 17 Compositional C++ Page 1 of 17 Compositional C++ is a small set of extensions to C++ for parallel programming. OVERVIEW OF C++ With a few exceptions, C++ is a pure extension of ANSI C. Its features: Strong

More information

Multiprocessors 2007/2008

Multiprocessors 2007/2008 Multiprocessors 2007/2008 Abstractions of parallel machines Johan Lukkien 1 Overview Problem context Abstraction Operating system support Language / middleware support 2 Parallel processing Scope: several

More information

Patterns of Parallel Programming with.net 4. Ade Miller Microsoft patterns & practices

Patterns of Parallel Programming with.net 4. Ade Miller Microsoft patterns & practices Patterns of Parallel Programming with.net 4 Ade Miller (adem@microsoft.com) Microsoft patterns & practices Introduction Why you should care? Where to start? Patterns walkthrough Conclusions (and a quiz)

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Chapter 3: Operating-System Structures

Chapter 3: Operating-System Structures 1 Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System Design and Implementation System Generation 3.1

More information

Tasks and Threads. What? When? Tasks and Threads. Use OpenMP Threading Building Blocks (TBB) Intel Math Kernel Library (MKL)

Tasks and Threads. What? When? Tasks and Threads. Use OpenMP Threading Building Blocks (TBB) Intel Math Kernel Library (MKL) CGT 581I - Parallel Graphics and Simulation Knights Landing Tasks and Threads Bedrich Benes, Ph.D. Professor Department of Computer Graphics Purdue University Tasks and Threads Use OpenMP Threading Building

More information

CS201 - Introduction to Programming Glossary By

CS201 - Introduction to Programming Glossary By CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with

More information

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Week 03 Lecture 12 Create, Execute, and Exit from a Process

More information