CS 5220: Shared memory programming
David Bindel
2017-09-26


Message passing pain

Common message passing pattern:
- Logical global structure
- Local representation per processor
- Local data may have redundancy
  - Example: data in ghost cells
  - Example: replicated book-keeping data (pidx in our code)

Big pain point:
- Thinking about many partly-overlapping representations
- Maintaining a consistent picture across processes

Wouldn't it be nice to have just one representation?

Shared memory vs message passing

Implicit communication via memory vs explicit messages.

Still need a separate global vs local picture?
- No: one thread-safe data structure may be easier
- Yes: more sharing can hurt performance
  - Synchronization costs cycles even with no contention
  - Contention for locks reduces parallelism
  - Cache coherency can slow even non-contending access

Easy approach: add multi-threading to serial code.
Better performance: design like a message-passing code.

Let's dig a little deeper into the hardware side of this...

Memory model

- Single processor: return the last write
  - What about DMA and memory-mapped I/O?
- Simplest generalization: sequential consistency, as if
  - each process runs in program order,
  - instructions from different processes are interleaved, and
  - the interleaved instructions ran on one processor.

Sequential consistency

"A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." (Lamport, 1979)

Example: Spin lock

Initially, flag = 0 and sum = 0.

    /* Processor 1 */
    sum += p1;
    flag = 1;

    /* Processor 2 */
    while (!flag);
    sum += p2;

Without sequential consistency support, what if

1. Processor 2 caches flag?
2. The compiler optimizes away the loop?
3. The compiler reorders the assignments on P1?

Starts to look restrictive!

Sequential consistency: the good, the bad, the ugly

Program behavior is "intuitive":
- Nobody sees garbage values
- Time always moves forward

One issue is cache coherence:
- Coherence: different copies, same value
- Requires (nontrivial) hardware support
- Also an issue for the optimizing compiler!

There are cheaper relaxed consistency models.

Snoopy bus protocol

Basic idea:
- Broadcast operations on the memory bus
- Cache controllers snoop on all bus transactions
  - Memory writes induce a serial order
  - Controllers act to enforce coherence (invalidate, update, etc.)

Problems:
- Bus bandwidth limits scaling
- Contending writes are slow

There are other protocol options (e.g. directory-based), but they usually give up on full sequential consistency.

Weakening sequential consistency

Try to reduce to the true cost of sharing:
- volatile tells the compiler when to worry about sharing
- Memory fences tell when to force consistency
- Synchronization primitives (lock/unlock) include fences

A sketch of the fence idea appears below.
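
To make the fence idea concrete, here is a minimal sketch of the earlier spin-lock example using C11 atomics. This is an added illustration, not code from the slides; the p1_work/p2_work names are hypothetical.

    #include <stdatomic.h>

    atomic_int flag = 0;  /* shared flag from the spin-lock example */
    double sum = 0;       /* shared accumulator */

    /* Processor 1 */
    void p1_work(double p1)
    {
        sum += p1;
        /* Release store: the write to sum must become visible
           before any thread can observe flag == 1. */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    /* Processor 2 */
    void p2_work(double p2)
    {
        /* Acquire load: the compiler may not hoist it out of the
           loop, and the hardware enforces the ordering. */
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ; /* spin */
        sum += p2;
    }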

Sharing

True sharing:
- Frequent writes cause a bottleneck
- Idea: make independent copies (if possible)
- Example problem: malloc/free data structure

False sharing:
- Distinct variables on the same cache block
- Idea: make processor memory contiguous (if possible)
- Example problem: array of ints, one per processor

A padding sketch for the false-sharing case follows.
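
As an added illustration, one common fix for the array-of-counters case is to pad each thread's entry out to a full cache line. The 64-byte line size is an assumption that varies by machine.

    #include <omp.h>

    #define CACHE_LINE 64  /* assumed line size; machine-dependent */

    /* Without padding, adjacent counters share a cache line, so every
       increment invalidates the other threads' cached copies.  Padding
       gives each counter its own line. */
    struct padded_counter {
        long count;
        char pad[CACHE_LINE - sizeof(long)];
    };

    /* Caller allocates one padded_counter per thread. */
    void count_events(int n, struct padded_counter* counters)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            for (int i = 0; i < n; ++i)
                counters[tid].count++;  /* touches only this thread's line */
        }
    }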

Take-home message

Sequentially consistent shared memory is a useful idea...
- Natural analogue to the serial case
- Architects work hard to support it

... but implementation is costly!
- Makes life hard for optimizing compilers
- Coherence traffic slows things down
- Helps to limit sharing

Have to think about these things to get good performance.

Back to software...

Reminder: Shared memory programming model

Program consists of threads of control.
- Can be created dynamically
- Each has private variables (e.g. local)
- Each has shared variables (e.g. heap)
- Communication through shared variables
- Coordinate by synchronizing on variables

Examples: pthreads, OpenMP, Cilk, Java threads

Wait, what's a thread?

Processes have separate state. Threads share some of it:
- Instruction pointer (per thread)
- Register file (per thread)
- Call stack (per thread)
- Heap memory (shared)

Mechanisms for thread birth/death

- Statically allocate threads at start
- Fork/join (pthreads; a sketch follows this list)
- Fork detached threads (pthreads)
- Cobegin/coend (OpenMP?)
  - Like fork/join, but lexically scoped
- Futures
  - v = future(somefun(x))
  - Attempts to use v wait on the evaluation
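
For concreteness, a minimal added fork/join sketch with pthreads; the worker function and thread count are my own choices.

    #include <pthread.h>
    #include <stdio.h>

    /* Thread body: receives its id through the argument pointer. */
    void* worker(void* arg)
    {
        int id = *(int*) arg;
        printf("Hello from thread %d\n", id);
        return NULL;
    }

    int main(void)
    {
        enum { NTHREADS = 4 };
        pthread_t threads[NTHREADS];
        int ids[NTHREADS];

        for (int i = 0; i < NTHREADS; ++i) {   /* fork */
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; ++i)     /* join */
            pthread_join(threads[i], NULL);
        return 0;
    }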

Mechanisms for synchronization

- Locks/mutexes (enforce mutual exclusion; a sketch follows this list)
- Condition variables (notification)
- Monitors (like locks with lexical scoping)
- Barriers
- Atomic operations
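
For instance, a pthreads mutex enforcing mutual exclusion on a shared accumulator. This is an added illustration; total and add_contribution are hypothetical names.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static double total = 0;  /* shared accumulator */

    /* Only one thread at a time updates total. */
    void add_contribution(double x)
    {
        pthread_mutex_lock(&lock);
        total += x;
        pthread_mutex_unlock(&lock);
    }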

Getting more concrete...

OpenMP: Open spec for MultiProcessing

- Standard API for multi-threaded code
  - Only a spec: multiple implementations
  - Lightweight syntax
  - C or Fortran (with appropriate compiler support)
- High level:
  - Preprocessor/compiler directives (80%)
  - Library calls (19%)
  - Environment variables (1%)
- Basic syntax: #pragma omp construct [clause ...]
  - Usually affects a structured block (one way in, one way out)
  - OK to have exit() in such a block

A logistical note

A practical aside: compiler support.

Intel compilers have OpenMP support by default:

    icc -c -qopenmp foo.c
    icc -qopenmp -o mycode.x foo.o

GCC has had OpenMP support by default for several years:

    gcc -c -fopenmp foo.c
    gcc -fopenmp -o mycode.x foo.o

LLVM's support is more recent:

    clang -c -fopenmp foo.c
    clang -fopenmp -o mycode.x foo.o

Apple's LLVM does not support OpenMP, and gcc on OS X now aliases clang; you can use Homebrew to install an alternate compiler.

Parallel hello world

    #include <stdio.h>
    #include <omp.h>

    int main()
    {
        #pragma omp parallel
        printf("hello world from %d\n",
               omp_get_thread_num());

        return 0;
    }
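
To try it (assuming GCC and a file named hello.c; the output order varies from run to run):

    gcc -fopenmp -o hello.x hello.c
    OMP_NUM_THREADS=4 ./hello.x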

Shared and private

Annotations distinguish between different types of sharing:
- shared(x) (default): one x shared everywhere
- private(x): thread gets its own x (indep. of master)
- firstprivate(x): each thread gets its own x, initialized by the x from before the parallel region
- lastprivate(x): after the parallel region, the private x is set to the value last left by one of the threads
- reduction(op:x): does a reduction on all the threads' x on exit of the parallel region

A sketch combining several of these clauses follows.
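
Here is a small added sketch of firstprivate, lastprivate, and reduction together. Note that lastprivate belongs on worksharing loops (the parallel loop construct previewed at the end of this lecture), where it takes the value from the sequentially last iteration.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int x = 10;     /* initial value copied into each thread's x */
        int last = -1;  /* receives the value from iteration i == 7 */
        int sum = 0;    /* reduction target */

        #pragma omp parallel for firstprivate(x) lastprivate(last) reduction(+:sum)
        for (int i = 0; i < 8; ++i) {
            last = x + i;  /* every thread starts from x == 10 */
            sum += i;
        }

        printf("last = %d, sum = %d\n", last, sum);  /* 17 and 28 */
        return 0;
    }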

Parallel regions

[Figure: fork-join diagram. Four threads each set a private i (0, 1, 2, 3) inside a parallel region and fill a shared array s, taking it from s=[?,?,?,?] to s=[0,1,2,3].]

- Basic model: fork-join
- Each thread runs the same code block
- Annotations distinguish shared (s) and private (i) data
- Relaxed consistency for shared data

Parallel regions

The fork-join picture above, in code:

    double s[max_threads];
    int i;
    #pragma omp parallel shared(s) private(i)
    {
        i = omp_get_thread_num();
        s[i] = i;
    }
    // Implicit barrier here

Parallel regions

The same example, relying on default storage attributes:

    double s[max_threads]; // default shared
    #pragma omp parallel
    {
        int i = omp_get_thread_num(); // local, so private
        s[i] = i;
    }
    // Implicit barrier here

Parallel regions

Several ways to control the number of threads:
- Default: the system chooses (= number of processors?)
- Environment: export OMP_NUM_THREADS=4
- Function call: omp_set_num_threads(4)
- Clause: #pragma omp parallel num_threads(4)

Can also nest parallel regions. A sketch of how these interact follows.
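
An added sketch of the precedence: the num_threads clause overrides the omp_set_num_threads call, which in turn overrides the environment variable.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4);  /* overrides OMP_NUM_THREADS */

        #pragma omp parallel num_threads(2)  /* clause wins: 2 threads */
        {
            if (omp_get_thread_num() == 0)
                printf("region 1: %d threads\n", omp_get_num_threads());
        }

        #pragma omp parallel                 /* call wins: 4 threads */
        {
            if (omp_get_thread_num() == 0)
                printf("region 2: %d threads\n", omp_get_num_threads());
        }
        return 0;
    }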

Parallel regions

What to do with parallel regions alone? Maybe Monte Carlo (a filled-out version of this sketch appears below):

    double result = 0;
    #pragma omp parallel reduction(+:result)
    result = run_mc(trials) / omp_get_num_threads();
    printf("final result: %f\n", result);

Anything more interesting needs synchronization.
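
To make the sketch self-contained, here is an added Monte Carlo estimate of pi in the same style; run_mc is replaced by an inline loop, rand_r is POSIX (not standard C), and the seeds and trial count are arbitrary choices of mine.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Estimate pi by sampling points in the unit square and counting
       those that land inside the quarter circle. */
    int main(void)
    {
        long trials = 1000000;
        double result = 0;

        #pragma omp parallel reduction(+:result)
        {
            unsigned seed = 1234u + omp_get_thread_num(); /* private RNG state */
            long n = trials / omp_get_num_threads();
            long hits = 0;
            for (long i = 0; i < n; ++i) {
                double x = rand_r(&seed) / (double) RAND_MAX;
                double y = rand_r(&seed) / (double) RAND_MAX;
                if (x*x + y*y < 1)
                    ++hits;
            }
            /* Per-thread estimates averaged via the reduction. */
            result = 4.0 * hits / n / omp_get_num_threads();
        }
        printf("pi is approximately %f\n", result);
        return 0;
    }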

OpenMP synchronization

High-level synchronization:
- critical: critical sections
- atomic: atomic update
- barrier: barrier
- ordered: ordered access (later)

Low-level synchronization:
- flush
- Locks (simple and nested)

We will stay high-level.

Critical sections

[Figure: two threads taking turns through a critical section]

- Automatic lock/unlock at the ends of the critical section
- Automatic memory flushes for consistency
- Locks are still there if you really need them...

Critical sections

    #pragma omp parallel
    {
        ...
        #pragma omp critical(my_data_cs)
        {
            ... modify data structure here ...
        }
    }
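
When a lexically scoped critical section does not fit (say, one lock per data structure), OpenMP's explicit locks are available. A minimal added sketch; the update_shared name is hypothetical.

    #include <omp.h>

    omp_lock_t lock;  /* call omp_init_lock(&lock) once at startup */

    void update_shared(double* data, double x)
    {
        omp_set_lock(&lock);
        *data += x;          /* the critical work */
        omp_unset_lock(&lock);
    }

    /* ... and omp_destroy_lock(&lock) at shutdown. */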

Atomic updates

    #pragma omp parallel
    {
        ...
        double my_piece = foo();
        #pragma omp atomic
        x += my_piece;
    }

Only simple ops: increment/decrement, or x += expr and co.

Barriers

[Figure: threads advancing in lockstep through timesteps]

    #pragma omp parallel
    for (int i = 0; i < nsteps; ++i) {
        do_stuff();          // work for step i
        #pragma omp barrier  // everyone finishes step i before i+1
    }

Is this all?

Next time:
- Parallel loop constructs
- Task-based parallelism (OpenMP 3.0+)
- Other parallel work divisions

But for now, let's do some examples!