Programming with Shared Memory PART I. HPC Fall 2010 Prof. Robert van Engelen


Overview
- Shared memory machines
- Programming strategies for shared memory machines
- Allocating shared data for IPC
- Processes and threads
- MT-safety issues
- Coordinated access to shared data: locks, semaphores, condition variables, barriers
- Further reading

Shared Memory Machines
- Single address space
- Single bus (UMA)
- Interconnect with memory banks (NUMA)
- Cross-bar switch
- Distributed shared memory (DSM): logically shared, physically distributed

(Figures: shared memory UMA machine with a single bus; shared memory NUMA machine with memory banks; DSM; shared memory multiprocessor with cross-bar switch.)

Programming Strategies for Shared Memory Machines
- Use a specialized programming language for parallel computing, for example HPF or UPC
- Use compiler directives to supplement a sequential program with parallel directives, for example OpenMP
- Use libraries, for example ScaLAPACK (though ScaLAPACK is primarily designed for distributed memory)
- Use heavyweight processes and a shared memory API
- Use threads
- Use a parallelizing compiler to transform (part of) a sequential program into a parallel program

Heavyweight Processes
- The UNIX system call fork() creates a new process
  - fork() returns 0 to the child process
  - fork() returns the process ID (pid) of the child to the parent
- The system call exit(n) joins the child process with its parent and passes exit value n to it
- The parent executes wait(&n) to wait until one of its children joins, where n is set to the exit value
- System and user processes form a tree

(Figures: process tree; fork-join of processes.)

Fork-Join

Process 1 (SPMD program):

pid = fork();
if (pid == 0) {
    // code for child
    exit(0);
} else {
    // parent code continues
    wait(&n);  // join
}
// parent code continues

Fork-Join (continued)
After fork(), Process 2 runs the same SPMD program as Process 1: the child receives a copy of the program, data, and file descriptors (operations by the processes on open files are independent). When the child executes exit(0) it terminates, and the parent's wait(&n) returns, completing the join.

Creating Shared Data for IPC
Interprocess communication (IPC) via shared data: processes do not automatically share data.
- Use files to share data: slow, but portable
- Unix System V shmget(): allocates shared pages between two or more processes
- BSD Unix mmap(): uses file-memory mapping to create shared data in memory, based on the principle that files are shared between processes

API summary:
- shmget() returns the shared memory identifier for a given key (the key is used for naming and locking)
- shmat() attaches the segment identified by a shared memory identifier and returns the address of the memory segment
- shmctl() with the IPC_RMID argument deletes the segment
- mmap() returns the address of a mapped object described by the file descriptor returned by open()
- munmap() deletes the mapping for a given address

shmget vs mmap

System V shmget:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
size_t len;        // size of data we want
void *buf;         // to point to shared data
int shmid;
key_t key = 9876;  // or IPC_PRIVATE
shmid = shmget(key, len, IPC_CREAT | 0666);
if (shmid == -1)
    // error
buf = shmat(shmid, NULL, 0);
if (buf == (void*)-1)
    // error
fork();  // parent and child use buf
wait(&n);
shmctl(shmid, IPC_RMID, NULL);

BSD mmap:

#include <sys/types.h>
#include <sys/mman.h>
size_t len;  // size of data we want
void *buf;   // to point to shared data
buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
           MAP_SHARED | MAP_ANON,
           -1,  // fd = -1 for an unnamed mapping
           0);
if (buf == MAP_FAILED)
    // error
fork();  // parent and child use buf
wait(&n);
munmap(buf, len);

Tip: use the ipcs command to display the IPC shared memory status of a system.

Threads
- Threads of control operate in the same memory space, sharing code and data
- Data is implicitly shared; consider data on a thread's stack private
- Many OS-specific thread APIs: Windows threads, Linux threads, Java threads
- POSIX-compliant Pthreads:
  - pthread_create() starts a new thread
  - pthread_join() waits for a child thread to join
  - pthread_exit() stops the calling thread

(Figure: thread creation and join.)

Detached Threads
- Detached threads do not join
- Use pthread_detach(thread_id)
- Detached threads are more efficient
- Make sure that all detached threads terminate before the program terminates

Process vs Threads
(Figures: a process; a process with two threads.)
What happens when we fork a process that executes multiple threads? Does fork duplicate only the calling thread, or all threads?

Thread Pools
- Thread pooling (or process pooling) is an efficient mechanism
- One master thread dispatches jobs to worker threads
- Worker threads in the pool never terminate and keep accepting new jobs when the old job is done
- Jobs are communicated to workers via shared data and/or signals
- Best for irregular job loads

MT-Safety
- Routines must be multithread-safe (MT-safe) when invoked by more than one thread
- Non-MT-safe routines must be placed in a critical section, e.g. using a mutex lock (see later)
- Many C libraries are not MT-safe: use the libroutine_r() versions, which are reentrant
- When building your own MT-safe library, use #define _REENTRANT
- Always make your routines MT-safe for reuse in a threaded application; use locks when necessary (see next slides)

Use of a non-MT-safe routine (ctime returns a pointer to a shared static buffer):

time_t clk = time(NULL);
char *txt = ctime(&clk);
printf("Current time: %s\n", txt);

Use of the reentrant version of ctime:

time_t clk = time(NULL);
char txt[32];
ctime_r(&clk, txt);
printf("Current time: %s\n", txt);

Is this routine MT-safe? What can go wrong?

static int counter = 0;
int count_events() {
    return counter++;
}

Coordinated Access to Shared Data
- Reading and writing shared data by more than one thread or process requires coordination with locking
- Shared variables cannot be updated simultaneously by more than one thread

Without a lock, increments can be lost:

static int counter = 0;
int count_events() {
    return counter++;
}

Thread 1:                    Thread 2:
reg1 = M[counter] = 3        reg1 = M[counter] = 3
reg2 = reg1 + 1 = 4          reg2 = reg1 + 1 = 4
M[counter] = reg2 = 4        M[counter] = reg2 = 4
return reg1 = 3              return reg1 = 3

With a mutex lock, the updates are serialized:

static int counter = 0;
int count_events() {
    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);
    return counter - 1;
}

Thread 1:                    Thread 2:
acquire lock                 acquire lock
reg1 = M[counter] = 3        wait
reg2 = reg1 + 1 = 4          wait
M[counter] = reg2 = 4        wait
release lock                 reg1 = M[counter] = 4
                             reg2 = reg1 + 1 = 5
                             M[counter] = reg2 = 5
                             release lock

Spinlocks
- Spin locks use busy waiting until a condition is met
- Naïve implementations are almost always incorrect

// initially lock = 0

Acquire lock:
while (lock == 1)
    ;  // do nothing
lock = 1;

critical section

Release lock:
lock = 0;

Two or more threads want to enter the critical section: what can go wrong?

Spinlocks
This ordering works: Thread 1 passes the while loop while lock == 0 and sets lock = 1 before Thread 2 reads it. Thread 2 then spins in its while loop until Thread 1 executes lock = 0, after which Thread 2 can set lock = 1 and enter the critical section.

Spinlocks
This statement interleaving leads to failure: Thread 2 reads lock while it is still 0, because Thread 1's assignment lock = 1 is not set in time before the read. Both threads pass their while loops, and both end up executing the critical section!

Spinlocks
The compiler can also break the code: the assignment lock = 1 may be removed as a useless assignment, or moved across the loop by optimization. Atomic operations such as atomic test-and-set instructions must be used instead; these instructions are not reordered or removed by the compiler.

Spinlocks
- Advantage of spinlocks: the kernel is not involved, so performance is better when the acquisition waiting time is short
- Dangerous to use on a uniprocessor system, because of priority inversion
- No guarantee of fairness: a thread may wait indefinitely in the worst case, leading to starvation

Correct and efficient spinlock operations using atomic TestAndSet, assuming the hardware supports a cache coherence protocol:

void spinlock_lock(spinlock *s) {
    while (TestAndSet(s))
        while (*s == 1)
            ;  // spin on a cached read until the lock looks free
}

void spinlock_unlock(spinlock *s) {
    *s = 0;
}

Note: TestAndSet(int *n) sets n to 1 and returns the old value of n.

Semaphores
- A semaphore is an integer-valued counter
- The counter is incremented and decremented by two operations: signal (or post) and wait, respectively; traditionally called V and P (Dutch "verhogen" and "probeer te verlagen")
- When the counter is zero, the wait operation blocks until the counter becomes positive again

Conceptually:

sem_post(sem_t *s) {
    (*s)++;
}

sem_wait(sem_t *s) {
    while (*s <= 0)
        ;  // do nothing
    (*s)--;
}

Note: actual implementations of POSIX semaphores use atomic operations and a queue of waiting processes to ensure fairness.

Semaphores
- A two-valued (binary) semaphore provides a mechanism to implement mutual exclusion (mutex)
- POSIX semaphores are named and have permissions, allowing use across a set of processes

#include <semaphore.h>
sem_t *mutex = sem_open("lock371",  // unique name
                        O_CREAT,    // create if needed
                        0600,       // permissions
                        1);         // initial value
sem_wait(mutex);   // sem_trywait() to poll the state
critical section
sem_post(mutex);
sem_close(mutex);

Tip: use the ipcs command to display the IPC semaphore status of a system.

Pthread Mutex Locks
- POSIX mutex locks for thread synchronization: threads share user space, processes do not
- Pthreads is available for Unix/Linux, with Windows ports
- pthread_mutex_init() initializes a lock
- pthread_mutex_lock() locks
- pthread_mutex_unlock() unlocks
- pthread_mutex_trylock() checks if the lock can be acquired

pthread_mutex_t mylock;
pthread_mutex_init(&mylock, NULL);
pthread_mutex_lock(&mylock);
critical section
pthread_mutex_unlock(&mylock);
pthread_mutex_destroy(&mylock);

Using Mutex Locks
- Locks are used to synchronize shared data access from any part of a program, not just the same routine executed by multiple threads
- Multiple locks should be used, each for a set of shared data items that is disjoint from another set of shared data items (no single lock for everything)

Lock operations on array A:

pthread_mutex_lock(&array_a_lck);
A[i] = A[i] + 1;
pthread_mutex_unlock(&array_a_lck);

pthread_mutex_lock(&array_a_lck);
A[i] = A[i] + B[i];
pthread_mutex_unlock(&array_a_lck);

Lock operations on a queue:

pthread_mutex_lock(&queue_lck);
add element to shared queue
pthread_mutex_unlock(&queue_lck);

pthread_mutex_lock(&queue_lck);
remove element from shared queue
pthread_mutex_unlock(&queue_lck);

What if threads may or may not update some of the same elements of an array: should we use a lock for every array element?

Condition Variables
- Condition variables are associated with mutex locks
- They provide signal and wait operations within critical sections

Process 1:
lock(mutex)
if (cannot continue)
    wait(mutex, event)
unlock(mutex)

Process 2:
lock(mutex)
signal(mutex, event)
unlock(mutex)

We can't use semaphore wait and signal here: what can go wrong?

Condition Variables
- signal releases one waiting thread (if any)
- wait blocks until a signal is received; when blocked, it releases the mutex lock and reacquires the lock when the wait is over

Producer-Consumer Example
- A producer adds items to a shared container, when not full
- A consumer picks an item from a shared container, when not empty
- Condition variables notfull and notempty are associated with the mutex

A producer:
while (true) {
    lock(mutex)
    if (container is full)
        wait(mutex, notfull)
    add item to container
    signal(mutex, notempty)
    unlock(mutex)
}

A consumer:
while (true) {
    lock(mutex)
    if (container is empty)
        wait(mutex, notempty)
    get item from container
    signal(mutex, notfull)
    unlock(mutex)
}

Semaphores versus Condition Variables
Semaphores:
- Semaphores must have matching signal-wait pairs, that is, the semaphore counter must stay balanced
- One too many waits: one waiting thread is indefinitely blocked
- One too many signals: two threads may enter a critical section that is guarded by semaphore locks
Condition variables:
- A signal can be executed at any time
- When there is no waiter, signal does nothing
- If there are multiple threads waiting, signal will release one
Both provide:
- Fairness: waiting threads will be released with equal probability
- Absence of starvation: no thread will wait indefinitely

Pthreads Condition Variables
- Pthreads supports condition variables
- A condition variable is always used in combination with a lock, based on the principle of monitors

Declarations and initialization:

pthread_mutex_t mutex;
pthread_cond_t notempty, notfull;
pthread_mutex_init(&mutex, NULL);
pthread_cond_init(&notempty, NULL);
pthread_cond_init(&notfull, NULL);

A producer:

while (1) {
    pthread_mutex_lock(&mutex);
    while (container is full)
        pthread_cond_wait(&notfull, &mutex);
    add item to container
    pthread_cond_signal(&notempty);
    pthread_mutex_unlock(&mutex);
}

A consumer:

while (1) {
    pthread_mutex_lock(&mutex);
    while (container is empty)
        pthread_cond_wait(&notempty, &mutex);
    get item from container
    pthread_cond_signal(&notfull);
    pthread_mutex_unlock(&mutex);
}

Note: pthread_cond_wait() takes the condition variable first and the mutex second, and pthread_cond_signal() takes only the condition variable. The condition is rechecked in a while loop to guard against spurious wakeups.

Monitor with Condition Variables
- A monitor is a concept: it combines a set of shared variables and a set of routines that operate on the variables
- Only one process may be active in a monitor at a time
- All routines are synchronized by implicit locks (like an entry queue)
- Shared variables are safely modified under mutex
- Condition variables are used for signal and wait within the monitor routines (like a wait queue)

Example: only P1 executes a routine, P0 waits on a signal, and P2 and P3 are in the entry queue to execute next when P1 is done (or is moved to the wait queue).

Barriers
- A barrier synchronization statement in a program blocks processes until all processes have arrived at the barrier
- Frequently used in data-parallel programming (implicit or explicit) and an essential part of BSP

Each process produces part of shared data X
    barrier
Processes use shared data X

Two-Phase Barrier with Semaphores for P Processes

sem_t *mutex = sem_open("mutex-492", O_CREAT, 0600, 1);
sem_t *turnstile1 = sem_open("ts1-492", O_CREAT, 0600, 0);
sem_t *turnstile2 = sem_open("ts2-492", O_CREAT, 0600, 1);
int count = 0;

Barrier sequence:

// rendezvous
sem_wait(mutex);
if (++count == P) {
    sem_wait(turnstile2);
    sem_post(turnstile1);
}
sem_post(mutex);
sem_wait(turnstile1);
sem_post(turnstile1);

// critical point
sem_wait(mutex);
if (--count == 0) {
    sem_wait(turnstile1);
    sem_post(turnstile2);
}
sem_post(mutex);
sem_wait(turnstile2);
sem_post(turnstile2);

Pthread Barriers
- Barrier using POSIX Pthreads (advanced realtime threads)
- Specify the number of threads involved in barrier syncs at initialization
- pthread_barrier_init() initializes the barrier with a thread count
- pthread_barrier_wait() performs the barrier synchronization

pthread_barrier_t barrier;
pthread_barrier_init(&barrier,
                     NULL,    // attributes
                     count);  // number of threads
pthread_barrier_wait(&barrier);

Further Reading
- [PP2] pages 230-247
- [HPC] pages 191-218
- Optional: [HPC] pages 219-240
- The Little Book of Semaphores by Allen Downey: http://www.greenteapress.com/semaphores/downey05semaphores.pdf