Programming with Shared Memory PART II. HPC Spring 2017 Prof. Robert van Engelen


1 Programming with Shared Memory PART II HPC Spring 2017 Prof. Robert van Engelen

2 Overview
  - Sequential consistency
  - Parallel programming constructs
  - Dependence analysis
  - OpenMP
  - Autoparallelization
  - Further reading

3 Sequential Consistency
Sequential consistency: the result of a parallel program is always the same as that of the sequential program, irrespective of the statement interleaving that results from parallel execution.
Parallel program (a, b, x, y, z are shared), statements in sequential order over time:
    a=5; x=1; y=x+3; z=x+y;
Any order of permitted statement interleavings gives the same result, for example:
    x=1; y=x+3; a=5; z=x+y;
Output: a, b, x, y, z

4 Data Flow: Implicitly Parallel
    x=1; a=5;
    y=x+3;
    z=x+y;
    output a, b, x, y, z
Flow dependences determine the parallel execution schedule: each operation waits until its operands are produced.

5 Explicit Parallel Programming Constructs
Declaring shared data, when private is implicit:
    shared int x;      (a shared variable)
    shared int *p;     (a private pointer to a shared value)
Declaring private data, when private is explicit:
    private int x;
    private int *p;
Would this make any sense?

6 Explicit Parallel Programming Constructs
The par construct:
    par { S1; S2; ... Sn; }
Statements in the body are executed concurrently.
Diagram: one thread forks into n threads that execute S1, S2, ..., Sn, then joins back into one thread.

7 Explicit Parallel Programming Constructs
The forall construct (also called parfor):
    forall (i=0; i<n; i++) { S1; S2; ... Sm; }
Statements in the body are executed in serial order by n threads i=0..n-1 in parallel:
    i=0:   S1; S2; ... Sm;
    i=1:   S1; S2; ... Sm;
    ...
    i=n-1: S1; S2; ... Sm;
Diagram: one thread forks into n threads, then joins back into one thread.

8 Explicit Parallel: Many Choices, Which is Safe?
    par { a=5; x=1; } y=x+3; z=x+y;
    x=1; par { a=5; y=x+3; } z=x+y;
    x=1; par { a=5; y=x+3; z=x+y; }
Think about data flow: each operation requires completion of its operands first.
Data dependences preserved => sequential consistency guaranteed.

9 Bernstein's Conditions
$I_i$ is the set of memory locations read by process $P_i$.
$O_j$ is the set of memory locations altered by process $P_j$.
Processes $P_1$ and $P_2$ can be executed concurrently if all of the following conditions are met:
    $I_1 \cap O_2 = \emptyset$
    $I_2 \cap O_1 = \emptyset$
    $O_1 \cap O_2 = \emptyset$

10 Bernstein's Conditions Verified by Dependence Analysis
Dependence analysis performed by a compiler determines that Bernstein's conditions are not violated when optimizing and/or parallelizing a program.
Independent:
    P1: A = x + y;   P2: B = x + z;
    $I_1 \cap O_2 = \emptyset$,  $I_2 \cap O_1 = \emptyset$,  $O_1 \cap O_2 = \emptyset$
RAW (read after write):
    P1: A = x + y;   P2: B = x + A;
    $I_1 \cap O_2 = \emptyset$,  $I_2 \cap O_1 = \{A\}$,  $O_1 \cap O_2 = \emptyset$
WAR (write after read):
    P1: A = x + B;   P2: B = x + z;
    $I_1 \cap O_2 = \{B\}$,  $I_2 \cap O_1 = \emptyset$,  $O_1 \cap O_2 = \emptyset$
WAW (write after write):
    P1: A = x + y;   P2: A = x + z;
    $I_1 \cap O_2 = \emptyset$,  $I_2 \cap O_1 = \emptyset$,  $O_1 \cap O_2 = \{A\}$

11 Bernstein's Conditions Verified by Dependence Analysis
Dependence analysis performed by a compiler determines that Bernstein's conditions are not violated when optimizing and/or parallelizing a program.
Independent:
    P1: A = x + y;   P2: B = x + z;
    $I_1 \cap O_2 = \emptyset$,  $I_2 \cap O_1 = \emptyset$,  $O_1 \cap O_2 = \emptyset$
Therefore the two statements can run concurrently:
    par { P1: A = x + y;  P2: B = x + z; }
Recall: instruction scheduling for instruction-level parallelism (ILP).
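The three conditions can be checked mechanically once read and write sets are known. The following is a minimal sketch in C, assuming a hypothetical encoding in which each variable is one bit of an unsigned mask; the names `Process`, `VAR_*` and `bernstein_independent` are illustrative and not part of the slides.

```c
#include <stdbool.h>

/* Hypothetical encoding: each variable is one bit in a set. */
enum { VAR_A = 1u << 0, VAR_B = 1u << 1, VAR_X = 1u << 2,
       VAR_Y = 1u << 3, VAR_Z = 1u << 4 };

/* Read set I and write set O of one process (here: one statement). */
typedef struct {
    unsigned in;   /* I: locations read    */
    unsigned out;  /* O: locations written */
} Process;

/* True when P1 and P2 satisfy all three of Bernstein's conditions. */
static bool bernstein_independent(Process p1, Process p2) {
    return (p1.in  & p2.out) == 0   /* I1 and O2 are disjoint */
        && (p2.in  & p1.out) == 0   /* I2 and O1 are disjoint */
        && (p1.out & p2.out) == 0;  /* O1 and O2 are disjoint */
}
```

For `A = x + y` and `B = x + z` the check succeeds; replacing the second statement with `B = x + A` (RAW) or `A = x + z` (WAW) makes it fail.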

12 Bernstein's Conditions in Loops
A parallel loop is valid when any ordering of its parallel body yields the same result.
    forall (I=4; I<7; I++)
      S1: A[I] = A[I-3]+B[I];
All six orderings of the three iterations:
    S1(4), S1(5), S1(6):  A[4] = A[1]+B[4];  A[5] = A[2]+B[5];  A[6] = A[3]+B[6];
    S1(4), S1(6), S1(5):  A[4] = A[1]+B[4];  A[6] = A[3]+B[6];  A[5] = A[2]+B[5];
    S1(5), S1(4), S1(6):  A[5] = A[2]+B[5];  A[4] = A[1]+B[4];  A[6] = A[3]+B[6];
    S1(6), S1(5), S1(4):  A[6] = A[3]+B[6];  A[5] = A[2]+B[5];  A[4] = A[1]+B[4];
    S1(5), S1(6), S1(4):  A[5] = A[2]+B[5];  A[6] = A[3]+B[6];  A[4] = A[1]+B[4];
    S1(6), S1(4), S1(5):  A[6] = A[3]+B[6];  A[4] = A[1]+B[4];  A[5] = A[2]+B[5];
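The claim that every ordering gives the same result can be tried out directly. The sketch below, with hypothetical helper names and made-up input values, runs the three iterations of S1 in a given order and compares the array with the sequential order 4, 5, 6. It works because the iterations read only A[1..3] and write only A[4..6].

```c
#include <string.h>

enum { N = 7 };

/* Run S1: A[I] = A[I-3] + B[I] for the three iterations listed in order[]. */
static void run_in_order(int A[N], const int B[N], const int order[3]) {
    for (int k = 0; k < 3; k++) {
        int I = order[k];
        A[I] = A[I - 3] + B[I];
    }
}

/* Returns 1 if executing the iterations in the given order produces the
   same array as the sequential order 4, 5, 6 (input values are made up). */
static int same_as_sequential(const int order[3]) {
    const int B[N] = {0, 0, 0, 0, 10, 20, 30};
    const int seq[3] = {4, 5, 6};
    int A1[N] = {0, 1, 2, 3, 0, 0, 0};
    int A2[N] = {0, 1, 2, 3, 0, 0, 0};
    run_in_order(A1, B, seq);
    run_in_order(A2, B, order);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```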

13 OpenMP
OpenMP is a portable implementation of common parallel constructs for shared memory machines.
OpenMP in C:
    #pragma omp directive_name
      statement_block
OpenMP in Fortran:
    !$OMP directive_name
      statement_block
    !$OMP END directive_name


15 OpenMP Parallel
The parallel construct (in OpenMP this is not the same as par):
    #pragma omp parallel
    { S1; S2; ... Sm; }
A team of threads all execute the body statements and join when done.
Diagram: one thread (master thread) forks into n threads; in the parallel region every thread executes S1; S2; ... Sm; then the team joins back into one thread (master thread).

16 OpenMP Parallel
The parallel construct:
    #pragma omp parallel default(none) shared(vars)
    { S1; S2; ... Sm; }
default(none) specifies that variables should not be assumed to be shared by default.
Diagram: one thread (master thread) forks into n threads; in the parallel region every thread executes S1; S2; ... Sm; then the team joins back into one thread (master thread).

17 OpenMP Parallel
The parallel construct:
    #pragma omp parallel private(n, i)
    {
      n = omp_get_num_threads();
      i = omp_get_thread_num();
      ...
    }
Use private to declare private data.
What happens if n and/or i are not declared private? Do Bernstein's conditions still hold?
omp_get_num_threads() returns the number of threads that are currently being used.
omp_get_thread_num() returns the thread id (0 to n-1).
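A minimal sketch of the private clause in use, assuming a hypothetical helper `record_thread_ids`. The serial fallbacks under `#else` are an assumption added so the sketch also builds when OpenMP is not enabled. Because n and i are private, each thread writes only its own copies, and each thread stores into a distinct array slot.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption): let the sketch build without OpenMP. */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* Each thread records its id in its own slot; returns the team size.
   seen[] must have room for at least max_threads entries. */
static int record_thread_ids(int *seen, int max_threads) {
    int n = 0, i = 0, team = 0;
    #pragma omp parallel private(n, i) shared(seen, team) num_threads(max_threads)
    {
        n = omp_get_num_threads();  /* private copy: no write conflict */
        i = omp_get_thread_num();   /* thread id in 0 .. n-1 */
        seen[i] = i;                /* distinct slot per thread */
        if (i == 0) team = n;       /* only the master thread writes team */
    }
    return team;
}
```

If n and i were shared instead, every thread would write the same locations, so the write sets would overlap and Bernstein's third condition would be violated.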

18 OpenMP Parallel with Reduction
The parallel construct with reduction clause; operation: +, *, -, &, ^, |, &&, ||
    #pragma omp parallel reduction(+:var)
    { var = expr; ... }
    ... = var;
Performs a global reduction operation over privatized variable(s) and assigns the final value to the master's private variable(s), or to the shared variable(s) when shared.
Diagram: each thread executes var = expr; the private values are summed into var in the single thread that continues.
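As a small illustration of the reduction clause on a parallel region, the sketch below (hypothetical function name) lets every thread set its private copy to 1, so the reduced value equals the number of threads in the team. Compiled without OpenMP the pragma is ignored and the function returns 1.

```c
/* Counts the threads of a parallel region with a + reduction. */
static int count_threads(void) {
    int count = 0;
    #pragma omp parallel reduction(+:count)
    {
        count = 1;  /* private copy, initialized to 0 for the + operation */
    }
    return count;   /* original value 0 plus one per thread */
}
```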

19 OpenMP Parallel
The parallel construct:
    #pragma omp parallel num_threads(n)
    { S1; S2; ... Sm; }
num_threads(n) gives the number of threads we want. Alternatively, use omp_set_num_threads() or set the environment variable OMP_NUM_THREADS.
Diagram: one thread forks into n threads; every thread executes S1; S2; ... Sm; then the team joins back into one thread.

20 OpenMP Parallel Sections
The sections construct is for work-sharing, where a current team of threads is used to execute different statements concurrently, similar to par.
    #pragma omp parallel
    #pragma omp sections
    {
      #pragma omp section
      S1;
      #pragma omp section
      S2;
      ...
      #pragma omp section
      Sm;
    }
Statements in the sections are executed concurrently.
Diagram: n threads executing m sections S1, S2, ..., Sm, followed by a barrier.

21 OpenMP Parallel Sections
The sections construct is for work-sharing, where a current team of threads is used to execute different statements concurrently, similar to par.
    #pragma omp parallel
    #pragma omp sections nowait
    {
      #pragma omp section
      S1;
      #pragma omp section
      S2;
      ...
      #pragma omp section
      Sm;
    }
Use nowait to remove the implicit barrier.
Diagram: n threads executing m sections S1, S2, ..., Sm, with no barrier at the end.

22 OpenMP Parallel Sections
The sections construct is for work-sharing, where a current team of threads is used to execute different statements concurrently, similar to par.
    #pragma omp parallel sections
    {
      #pragma omp section
      S1;
      #pragma omp section
      S2;
      ...
      #pragma omp section
      Sm;
    }
Use parallel sections to combine parallel with sections.
Diagram: one thread forks into n threads executing m sections S1, S2, ..., Sm.
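A minimal sketch of parallel sections with two independent tasks, using a hypothetical function `sum_and_max`. The two sections only read a[] and write different variables, so they satisfy Bernstein's conditions and may run concurrently.

```c
/* Computes the sum and the maximum of a[0..len-1] in two sections.
   Assumes len >= 1. */
static void sum_and_max(const int *a, int len, long *sum, int *max) {
    long s = 0;
    int m = a[0];
    #pragma omp parallel sections shared(s, m)
    {
        #pragma omp section
        { for (int i = 0; i < len; i++) s += a[i]; }              /* writes s only */
        #pragma omp section
        { for (int i = 1; i < len; i++) if (a[i] > m) m = a[i]; } /* writes m only */
    }
    *sum = s;
    *max = m;
}
```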

23 OpenMP For/Do
The for construct (do in Fortran) is for work-sharing, where a current team of threads is used to execute a loop concurrently.
    #pragma omp parallel
    #pragma omp for
    for (i=0; i<k; i++)
    { S1; S2; ... Sm; }
Loop iterations are executed concurrently by n threads. Use nowait to remove the implicit barrier.
Diagram: n threads executing k iterations (i=0: S1; S2; ... Sm; through i=k-1: S1; S2; ... Sm;), followed by a barrier.

24 OpenMP For/Do
The for construct is for work-sharing, where a current team of threads is used to execute a loop concurrently.
    #pragma omp parallel
    #pragma omp for schedule(dynamic)
    for (i=0; i<k; i++)
    { S1; S2; ... Sm; }
When k > n, each thread takes the next unassigned iteration as soon as it becomes free, so the mapping of iterations to threads is not fixed in advance; this continues until all iterations are completed.
Diagram: n threads executing k iterations (i=0: S1; S2; ... Sm; through i=k-1: S1; S2; ... Sm;), followed by a barrier.

25 OpenMP For/Do
The for construct is for work-sharing, where a current team of threads is used to execute a loop concurrently.
    #pragma omp parallel
    #pragma omp for schedule(static)
    for (i=0; i<4; i++)
    { S1; S2; ... Sm; }
When k > n, each thread is assigned one chunk of about ⌈k/n⌉ consecutive iterations of the iteration space.
Diagram: 2 threads executing 4 iterations (i=0 and i=1 on one thread, i=2 and i=3 on the other), followed by a barrier.

26 OpenMP For/Do
The for construct is for work-sharing, where a current team of threads is used to execute a loop concurrently.
    #pragma omp parallel
    #pragma omp for schedule(static, 2)
    for (i=0; i<8; i++)
    { S1; S2; ... Sm; }
2 threads executing 8 iterations using chunk size 2 in a round-robin fashion.
Diagram: iterations i=0..7, each executing S1; S2; ... Sm; in chunks of size 2, followed by a barrier.
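The round-robin assignment for schedule(static, chunk) can be written as a one-line formula. The helper below is a hypothetical illustration, assuming the loop runs from i=0 with unit stride; it is not an OpenMP API call.

```c
/* Thread that executes iteration i under schedule(static, chunk) with
   nthreads threads: chunks are dealt out round-robin by thread number. */
static int static_owner(int i, int chunk, int nthreads) {
    return (i / chunk) % nthreads;
}
```

For the 8 iterations, 2 threads and chunk size 2 of this slide, the owners are 0, 0, 1, 1, 0, 0, 1, 1.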

27 OpenMP For/Do
The for construct is for work-sharing, where a current team of threads is used to execute a loop concurrently.
    #pragma omp parallel
    #pragma omp for schedule(guided, 4)
    for (i=0; i<k; i++)
    { S1; S2; ... Sm; }
Exponentially decreasing chunk size, in this example: 4, 2, 1.

28 OpenMP For/Do Scheduling Comparison
[Figure: scheduling comparison; axis: loop iteration index, 0 to 27]

29 OpenMP For/Do Scheduling with Load Imbalances
[Figure: cost per iteration; axis: loop iteration index, 0 to 27]

30 OpenMP For/Do
The for construct is for work-sharing, where a current team of threads is used to execute a loop concurrently.
    #pragma omp parallel
    #pragma omp for schedule(runtime)
    for (i=0; i<k; i++)
    { S1; S2; ... Sm; }
The schedule is then controlled by the environment variable OMP_SCHEDULE:
    setenv OMP_SCHEDULE dynamic
    setenv OMP_SCHEDULE static
    setenv OMP_SCHEDULE static,2

31 OpenMP For/Do
The for construct is for work-sharing, where a current team of threads is used to execute a loop concurrently. Reduction operation: +, *, -, &, ^, |, &&, ||
    #pragma omp parallel
    #pragma omp for reduction(+:s)
    for (i=0; i<k; i++)
    { s += a[i]; }
Performs a global reduction operation over privatized variables and assigns the final value to the master's private variable(s), or to the shared variable(s) when shared.
Diagram: iterations i=0: s += a[0]; through i=k-1: s += a[k-1]; are summed into s.
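The loop reduction can be tried with a short function. This sketch uses the combined parallel for form for brevity; the function name `array_sum` is illustrative. Each thread accumulates into a private copy of s, and the copies are added together when the loop ends.

```c
/* Sums a[0..k-1] with a + reduction over the loop iterations. */
static long array_sum(const int *a, int k) {
    long s = 0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < k; i++) {
        s += a[i];  /* updates the thread's private copy of s */
    }
    return s;       /* combined value of all private copies */
}
```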

32 OpenMP For/Do
The for construct is for work-sharing, where a current team of threads is used to execute a loop concurrently.
    #pragma omp parallel for
    for (i=0; i<k; i++)
    { S1; S2; ... Sm; }
Use parallel for to combine parallel with for.
Diagram: one thread forks into n threads executing k iterations (i=0: S1; S2; ... Sm; through i=k-1: S1; S2; ... Sm;).

33 OpenMP Firstprivate and Lastprivate
The firstprivate clause on the parallel construct and the lastprivate clause on the work-sharing construct (lastprivate is accepted on for and sections, not on parallel itself):
    x = ...;
    #pragma omp parallel firstprivate(x)
    {
      x = x + ...;
      #pragma omp for lastprivate(y)
      for (i=0; i<k; i++)
      { ... y = i; ... }
    }
    ... = y;
Use firstprivate to declare private variables that are initialized with the main thread's value of the variables.
Likewise, use lastprivate to declare private variables whose values are copied back out to the main thread's variables by the thread that executes the last iteration of a parallel for loop, or the thread that executes the last parallel section.
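A runnable sketch of both clauses, assuming a hypothetical function name and a made-up initial value of 10 for x. The lastprivate clause is placed on the for construct. The reduction on total is added only to make the effect of firstprivate observable.

```c
/* Returns the sum of the private x values over all k iterations and
   stores the lastprivate value of y in *y_out. Assumes k >= 1. */
static int first_last_private(int k, int *y_out) {
    int x = 10;   /* hypothetical initial value */
    int y = -1;
    int total = 0;
    #pragma omp parallel firstprivate(x) reduction(+:total)
    {
        x = x + 1;                 /* each private x starts at 10 */
        #pragma omp for lastprivate(y)
        for (int i = 0; i < k; i++) {
            y = i;
            total += x;            /* x is 11 in every thread */
        }
    }
    *y_out = y;                    /* value from the last iteration, k-1 */
    return total;                  /* 11 * k */
}
```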

34 OpenMP Single
The single construct selects one thread of the current team of threads to execute the body.
    #pragma omp parallel
    ...
    #pragma omp single
    { S1; S2; ... Sm; }
One thread executes the body S1; S2; ... Sm; and a barrier follows.

35 OpenMP Master
The master construct selects the master thread of the current team of threads to execute the body.
    #pragma omp parallel
    ...
    #pragma omp master
    { S1; S2; ... Sm; }
The master thread executes the body S1; S2; ... Sm; no barrier is inserted.

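36 OpenMP Critical
The critical construct defines a critical section.
    #pragma omp parallel
    ...
    #pragma omp critical (name)
    { S1; S2; ... Sm; }
Mutual exclusion is enforced on the body using a named lock; the name is written in parentheses.
Diagram: one thread acquires the lock, executes S1; S2; ... Sm; and releases it; another thread waits, then acquires, executes the body, and releases.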

37 OpenMP Critical
The critical construct defines a critical section.
    #pragma omp parallel
    ...
    #pragma omp critical (qlock)
    { enqueue(job); }      (one thread is here)
    ...
    #pragma omp critical (qlock)
    { dequeue(job); }      (another thread is here)
Diagram: one thread acquires qlock, executes enqueue(job), and releases; the other thread waits, then acquires qlock, executes dequeue(job), and releases.

38 OpenMP Critical
The critical construct defines a critical section.
    #pragma omp parallel
    ...
    #pragma omp critical
    { S1; S2; ... Sm; }
Mutual exclusion is enforced on the body using an anonymous lock.
Diagram: one thread acquires the lock, executes S1; S2; ... Sm; and releases it; another thread waits, then acquires, executes the body, and releases.
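A minimal sketch of a named critical section protecting a shared counter. The function name and the lock name `counter_lock` are illustrative choices. Without the critical construct the read-modify-write of counter would be a data race.

```c
/* Increments a shared counter k times from a parallel loop. */
static int count_with_critical(int k) {
    int counter = 0;
    #pragma omp parallel for shared(counter)
    for (int i = 0; i < k; i++) {
        #pragma omp critical (counter_lock)
        {
            counter = counter + 1;  /* one thread at a time */
        }
    }
    return counter;
}
```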

39 OpenMP Barrier
The barrier construct synchronizes the current team of threads.
    #pragma omp parallel
    {
      ...
      #pragma omp barrier
      ...
    }
Diagram: all threads of the team wait at the barrier until the last one arrives.

40 OpenMP Atomic
The atomic construct executes an expression atomically (expressions are restricted to simple updates).
    #pragma omp parallel
    ...
    #pragma omp atomic
    expression;
Diagram: each thread's execution of expression; is indivisible.

41 OpenMP Atomic
The atomic construct executes an expression atomically (expressions are restricted to simple updates).
    #pragma omp parallel
    ...
    #pragma omp atomic
    n = n+1;      (one thread is here)
    ...
    #pragma omp atomic
    n = n-1;      (another thread is here)
Diagram: n = n+1; and n = n-1; are each indivisible.
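The increment and decrement pair can be exercised as follows. This is a sketch with a hypothetical function name; it writes the updates as `n += 1` and `n -= 1`, a compound-assignment form that atomic accepts, instead of the slide's `n = n+1`.

```c
/* Even iterations increment n atomically, odd iterations decrement it. */
static int balanced_updates(int k) {
    int n = 0;
    #pragma omp parallel for shared(n)
    for (int i = 0; i < k; i++) {
        if (i % 2 == 0) {
            #pragma omp atomic
            n += 1;
        } else {
            #pragma omp atomic
            n -= 1;
        }
    }
    return n;  /* 0 when k is even, 1 when k is odd */
}
```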

42 OpenMP Flush
The flush construct flushes shared variables from local storage (registers, cache) to shared memory. OpenMP adopts a relaxed consistency model of memory.
    #pragma omp parallel
    {
      ...
      #pragma omp flush(variables)
      ...
    }
Example: one thread executes b = 3; and another executes a = b;. Without a flush, b = 3 in the first thread, but there is no guarantee that a will be 3.

43 OpenMP Relaxed Consistency Memory Model
Relaxed consistency means that memory updates made by one CPU may not be immediately visible to another CPU:
  - Data can be in registers
  - Data can be in cache (cache coherence protocol is slow or non-existent)
Therefore, the updated value of a shared variable that was set by a thread may not be available to another.
An OpenMP flush is automatically performed at:
  - Entry and exit of parallel and critical
  - Exit of for
  - Exit of sections
  - Exit of single
  - Barriers

44 OpenMP Thread Scheduling
Controlled by the environment variable OMP_DYNAMIC:
  - When set to FALSE: the same number of threads is used for every parallel region
  - When set to TRUE: the number of threads is adjusted for each parallel region
omp_get_num_threads() returns the actual number of threads.
omp_get_max_threads() returns OMP_NUM_THREADS.
Diagram: the optimal number of threads is determined before each parallel region.

45 OpenMP Threadprivate
The threadprivate construct declares variables in a global scope private to a thread across multiple parallel regions. It must be used when variables should stay private, even outside of the current scope, e.g. across function calls.
    int counter;                          (global counter)
    #pragma omp threadprivate(counter)
    ...
    #pragma omp parallel
    { counter = 0; }                      (each thread has a local copy of counter)
    ...
    #pragma omp parallel
    { counter++; }                        (each thread has a local copy of counter)
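The counter example above can be turned into a small function. This is a sketch with a hypothetical function name. It inspects only the initial (master) thread's copy, which always keeps its value between regions; the other threads' copies persist only under additional conditions such as an unchanged team size.

```c
static int counter = 0;
#pragma omp threadprivate(counter)

/* Resets every thread's copy in one region and increments it in the
   next, then returns the initial (master) thread's copy. */
static int master_counter_after_two_regions(void) {
    #pragma omp parallel
    { counter = 0; }   /* each thread resets its own copy */
    #pragma omp parallel
    { counter++; }     /* each thread increments its own copy */
    return counter;    /* outside a region: the initial thread's copy */
}
```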

46 OpenMP Locks
Mutex locks, with additional nestable versions of locks. omp_lock_t is the lock type.
    omp_lock_t lck;
    omp_init_lock(&lck);
    omp_set_lock(&lck);
    ... critical section ...
    omp_unset_lock(&lck);
    omp_destroy_lock(&lck);
  - omp_init_lock() initialization
  - omp_set_lock() blocks until the lock is acquired
  - omp_unset_lock() releases the lock
  - omp_destroy_lock() deallocates the lock

47 Compiler Options for OpenMP
GOMP project for GCC 4.2 (C and Fortran).
Use #include <omp.h>
Note: the _OPENMP define is set when compiling with OpenMP.
Intel compiler:
    icc -openmp
    ifort -openmp
Sun compiler:
    suncc -xopenmp
    f95 -xopenmp
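The _OPENMP define makes it possible to write code that compiles with and without OpenMP. A minimal sketch with a hypothetical function name; with GCC, OpenMP is enabled by the -fopenmp option.

```c
/* Returns the value of _OPENMP (the yyyymm date of the supported
   OpenMP specification), or 0 when compiled without OpenMP. */
static int openmp_version_or_zero(void) {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}
```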

48 Autoparallelization
The compiler applies dependence analysis and parallelizes a loop (or an entire loop nest) automatically when possible.
Typically task-parallelizes outer loops (more parallel work), possibly after loop interchange, fusion, etc.
Similar to adding #pragma omp parallel for to the loop(s), with appropriate private and shared clauses.
Intel compiler:
    icc -parallel
    ifort -parallel
Sun compiler:
    suncc -xautopar
    f95 -xautopar

49 Further Reading
[PP2] pages
Optional: OpenMP tutorial at Lawrence Livermore
