Introduction Hybrid Programming. Sebastian von Alfthan 13/8/2008 CSC the Finnish IT Center for Science PRACE summer school 2008



2 Contents Introduction OpenMP in detail OpenMP programming models & performance Hybrid programming

3 The need for improved parallelism Top 500 trends show that in the coming years the #1 system will have a peak of 1 EF and the #500 system a peak of 1 PF General-purpose processors are not getting (very much) faster Massively parallel processors (MPP): a very large number of nodes connected with a fast interconnect, built from symmetric multiprocessor (SMP) nodes Hybrid architectures: Cell, GPGPU...

4 Hybrid programming - mixed mode Parallel programming model combining: Parallelization over one SMP node - shared memory parallelization: OpenMP (the de facto standard), POSIX threads (Pthreads, low level) Parallelization between nodes - distributed memory: MPI, PVM (obsolete) Here: MPI + OpenMP Is it faster? Sometimes - in most cases not... [Figure: a node with cores C1-C4 sharing the L3 cache and memory via OpenMP, with MPI between nodes over the SeaStar2 interconnect]

5 OpenMP: a brief introduction An API that can be used for multithreaded shared memory parallelization Fortran 77/9X and C/C++ are supported The current version implemented in compilers is 2.5 (covered in this talk); the OpenMP 3.0 specification has been released Enables one to parallelize one part of the program at a time Easy to do quick and dirty prototyping Efficient and well-scaling code still requires effort

6 OpenMP The OpenMP API has three components: 1) Compiler directives Express the shared memory parallelization Each is preceded by a sentinel, so a serial version still compiles 2) Runtime library routines A small number of library functions Examples: get the number of threads, get the rank of a thread... Can be discarded in the serial version via conditional compilation 3) Environment variables Bind threads to cores Specify the number of threads

7 A simple OpenMP program: F95
PROGRAM demo1
  USE omp_lib
  INTEGER :: omp_rank
!$omp parallel private(omp_rank)
  omp_rank = omp_get_thread_num()
  WRITE(*,*) "thread is ", omp_rank
!$omp end parallel
END PROGRAM demo1
> export OMP_NUM_THREADS=2
> ftn -mp demo1.f95
> aprun -n 1 ./a.out
thread is 0
thread is 1

8 A simple OpenMP program: C
#include <stdio.h>
#include "omp.h"
int main(int argc, char *argv[]) {
  int omp_rank;
  #pragma omp parallel private(omp_rank)
  {
    omp_rank = omp_get_thread_num();
    printf("thread is %d\n", omp_rank);
  }
}
> export OMP_NUM_THREADS=2
> cc -mp demo1.c
> aprun -n 1 ./a.out
thread is 0
thread is 1

9 OpenMP in detail

10 Directives Sentinels precede each OpenMP directive C/C++: #pragma omp Fortran free form: !$omp Fortran fixed form: !$omp, c$omp, *$omp A space in the sixth column begins a directive No space depicts a continuation line

11 Directives: parallel Starts a parallel region Prior to it there is only one thread, the master Creates a team of threads: master + slave threads At the end of the block there is a barrier and all shared data is synchronized Clauses: if(logical expression), private(list), shared(list), default(private/shared/none), firstprivate(list), reduction(operator:list), copyin(list), num_threads(integer)
!$omp parallel
!$omp end parallel

12 Directives: parallel clauses private(list) Comma-separated list of private variables Private variables are on the private stack of each thread Undefined initial value Undefined value after the parallel region firstprivate(list) Private variable whose initial value is the same as that of the original object lastprivate(list) Private variable; the thread that performs the last parallel iteration step or section copies its value to the original object

13 Directives: parallel clauses shared(list) Comma-separated list of shared variables All threads can write to, and read from, a shared variable Race condition if other threads access a variable while one writes to it Variables are shared by default (with some exceptions) default(private/shared/none) Sets the default for variables: shared, private or undefined In C/C++ default(private) is not allowed default(none) can be useful for debugging, since each variable then has to be declared manually

14 Directives: parallel clauses reduction(operator:list) Performs a reduction on the (scalar) variables in list A private reduction variable is created for each thread's partial result The private reduction variable is initialized to the operator's identity value (see table) After the parallel region the reduction operation is applied to the private variables and the result is aggregated into the shared variable

Operator (Fortran)  Initial value    Operator (C/C++)  Initial value
+                   0                +                 0
*                   1                *                 1
.AND.               .TRUE.           &&                1
.OR.                .FALSE.          ||                0
MAX                 smallest representable value
MIN                 largest representable value

15 Work-sharing directives: DO/for Directive instructing the compiler to share the work of a DO loop Fortran: !$OMP DO C/C++: #pragma omp for The directive is inside a parallel region, just prior to the DO loop Can also be combined with parallel: !$OMP PARALLEL DO The loop variable is private by default in Fortran Not in C/C++; there it has to be explicitly declared private
sumvar=0
!$omp parallel do reduction(+:sumvar)
do i=1,10
  sumvar=sumvar+1
end do
!$omp end parallel do
WRITE(*,*) "sum is ",sumvar
sum is 10

16 Work-sharing directives: DO/for clauses Clauses: schedule(type[,chunk]), ordered, nowait, private(list), firstprivate(list), lastprivate(list), shared(list), reduction(operator:list) schedule(type[,chunksize]) Defines how the iterations are divided over the threads: static, dynamic, guided ordered Iterations are performed in the same order as in the serial program nowait No barrier or synchronization at the end of the loop In Fortran it is declared in the !$omp end directive

17 Work-sharing directives: DO/for schedules schedule(static,[chunksize]) Iterations are divided into chunk-sized parts Chunks are statically assigned to threads The default chunk size is iterations/threads Low overhead Load balance can be problematic The static schedule is typically used by default if no schedule is defined (implementation dependent)

18 Work-sharing directives: DO/for schedules schedule(dynamic,[chunksize]) Iterations are divided into chunk-sized parts After a thread completes a chunk, it is dynamically assigned a new one The default chunksize is 1 Higher overhead Better load balance for unbalanced iterations schedule(guided,[chunksize]) Like dynamic, but the size of the chunks decreases exponentially The size of the first chunk is implementation dependent The size of the smallest chunk is chunksize The default chunksize is 1

19 Work-sharing directives: sections Defines sections that are executed by different threads
!$omp sections
!$omp section
write(*,*) "thread",omp_get_thread_num(),"section A"
!$omp section
write(*,*) "thread",omp_get_thread_num(),"section B"
!$omp end sections
thread 0 section A
thread 1 section B

20 Directives: Master !$OMP MASTER Specifies that the region should be executed only by the master thread
!$omp master
!code for master
!$omp end master
#pragma omp master
{
  //code for master
}

21 Directives: Single !$OMP SINGLE Specifies that the region should be executed by only one, arbitrary, thread Implicit barrier at the end directive (unless nowait is specified) Clauses: private(list), firstprivate(list), copyprivate(list), nowait
!$omp single
!code for any thread
!$omp end single
#pragma omp single
{
  //code for any thread
}

22 Directives: Critical !$OMP CRITICAL [name] A section that should be executed by only one thread at a time The optional name distinguishes different critical sections All unnamed critical sections are treated as the same section
!$omp critical
!code that is not thread-safe
!$omp end critical
#pragma omp critical
{
  //code that is not thread-safe
}

23 Directives: Atomic !$OMP ATOMIC Specifies that a memory location is to be updated atomically, by only one thread at a time Applies to only one statement Only certain kinds of expressions are allowed, in C/C++: a= a+= a-= a*= a/= a++ ++a a-- --a
!$omp atomic
var=...
#pragma omp atomic
var=...;

24 Directives: Barrier !$OMP BARRIER Synchronizes all threads at this point When a thread reaches a barrier it continues only after all threads have reached it
!$omp barrier
#pragma omp barrier

25 Directives: Flush !$OMP FLUSH [list] Synchronizes the memory of all threads Makes sure each thread has a consistent view of memory at this point Also required on cache-coherent systems, since changes to variables could still reside in registers Can also flush only the variables in list; if no list is given, all variables are flushed Implicit flush for several directives: BARRIER; PARALLEL (entry and exit); CRITICAL (entry and exit); DO (on exit); SECTIONS (on exit); SINGLE (on exit); ORDERED (entry and exit)

26 OpenMP: Run time library routines OMP_SET_NUM_THREADS OMP_GET_NUM_THREADS OMP_GET_MAX_THREADS OMP_GET_THREAD_NUM OMP_GET_NUM_PROCS OMP_IN_PARALLEL OMP_SET_DYNAMIC OMP_GET_DYNAMIC OMP_SET_NESTED OMP_GET_NESTED OMP_INIT_LOCK OMP_DESTROY_LOCK OMP_SET_LOCK OMP_UNSET_LOCK OMP_TEST_LOCK OMP_GET_WTIME OMP_GET_WTICK

27 OpenMP: Important environment variables OMP_NUM_THREADS Maximum number of threads OMP_NESTED TRUE or FALSE Enables or disables nested parallelism Not always supported Compiler specific flags for binding threads to cores PGI setenv MP_BIND yes Pathscale setenv PSC_OMP_AFFINITY TRUE setenv PSC_OMP_AFFINITY_GLOBAL TRUE GNU setenv GOMP_CPU_AFFINITY "0-3"

28 OpenMP: Compilation flags PGI -mp=nonuma Pathscale -mp GNU -fopenmp

29 OpenMP programming models & performance

30 OpenMP: Programming models Fine-grained: loop level, several local parallel regions PARALLEL DO Can be introduced in piecewise fashion Often simple to implement Performance benefits are limited Coarse-grained: parallel region extends over larger segments (or whole program) PARALLEL OMP_GET_NUM_THREADS OMP_GET_THREAD_NUM Divide work based on thread number Similar to MPI programming Often demands larger changes to program and algorithm Larger potential benefits

31 Case study: Matrix multiplication Naive serial matrix multiplication Slow algorithm - do not use in real code (use BLAS) n=m=p=1000 Execution time 1.92 s
t1=omp_get_wtime()
DO j=1,m
  DO i=1,n
    DO k=1,p
      c(i,j)=c(i,j)+a(i,k)*b(k,j)
    END DO
  END DO
END DO
t2=omp_get_wtime()
WRITE(*,*) "Execution time"&
  ,t2-t1

32 Case study: Matrix multiplication Fine-grained parallelization Static scheduling j,i,k private All other variables shared j-loop parallelized 4 threads (n=m=p=1000) Execution time 0.479 s Perfect speedup in this simple case
t1=omp_get_wtime()
!$omp PARALLEL DO
DO j=1,m
  DO i=1,n
    DO k=1,p
      c(i,j)=c(i,j)+a(i,k)*b(k,j)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
t2=omp_get_wtime()
WRITE(*,*) "Execution time"&
  ,t2-t1

33 Case study: Matrix multiplication Fine-grained parallelization Static scheduling j,i,k private All other variables shared k-loop parallelized 4 threads (n=m=p=1000) Execution time 1.8 s Thread overhead Synchronization
t1=omp_get_wtime()
DO j=1,m
  DO i=1,n
    csum=0
!$omp PARALLEL DO reduction(+:csum)
    DO k=1,p
      csum=csum+a(i,k)*b(k,j)
    END DO
!$OMP END PARALLEL DO
    c(i,j)=csum
  END DO
END DO
t2=omp_get_wtime()
WRITE(*,*) "Execution time"&
  ,t2-t1

34 Case study: Matrix multiplication Coarse-grained parallelization Simple domain decomposition 4 threads (n=m=p=1000) Execution time 0.49 s A few percent worse than the fine-grained version This is a best-case scenario for fine-grained parallelization
t1=omp_get_wtime()
!$omp PARALLEL &
!$OMP& PRIVATE(omp_rank,i,j,k) &
!$OMP& PRIVATE(imin,jmin,imax,jmax)
omp_rank=omp_get_thread_num()
imin=...
jmin=...
imax=...
jmax=...
DO j=jmin,jmax
  DO i=imin,imax
    DO k=1,p
      c(i,j)=c(i,j)+a(i,k)*b(k,j)
    END DO
  END DO
END DO
!$OMP END PARALLEL
t2=omp_get_wtime()
WRITE(*,*) "Execution time",t2-t1

35 OpenMP performance: False sharing Memory is read and written in whole cache lines 64 bytes on AMD Opteron False sharing: When a thread modifies part of a cache line, the whole cache line is marked as invalid When another thread attempts to read or modify another part of that cache line, it is forced to fetch a newer copy How to avoid: False sharing does not occur if all threads only read a variable The risk of false sharing is larger when threads work on data in small slices than when larger chunks are used

36 OpenMP performance: ccNUMA issues Cache-coherent non-uniform memory access (ccNUMA) Shared memory model Local memory closer to a processor is faster Caches are kept coherent Examples: some MPP nodes such as Cray XT5 nodes; large shared memory computers such as SGI Altix Uniform memory access (UMA) Only feasible for small systems Examples: Cray XT4 nodes, BlueGene/P nodes

37 OpenMP performance: ccNUMA issues In a NUMA node some of the memory is more expensive to access Can lead to severe performance problems OpenMP has no support for NUMA It does not specify where the data is stored It does not give tools to check where it is stored How to avoid: The system often uses a first-touch principle: the thread that first accesses the data will host it in its memory Use initialization loops that make sure each thread's data is local On some systems the low-level system call madvise can also be used

38 OpenMP performance: Overheads Amdahl's law If only parts of the program are parallelized (fine-grained), Amdahl's law limits performance In hybrid programs the number of threads is low, so this is less of a problem Thread management has a large overhead Avoid creating/destroying threads; use larger parallel regions Synchronization Avoid explicit and implicit barriers If you can use NOWAIT clauses, do it Avoid (if possible) BARRIER/CRITICAL/ORDERED/FLUSH Use named CRITICAL regions

39 OpenMP performance: DO/for directive PARALLEL DO can be more efficient than a DO directive inside a PARALLEL region (implementation dependent) If the iterations are well balanced use STATIC If there are load-balancing issues use GUIDED or DYNAMIC Small loops should not be parallelized For nested loops the inner loops should not be parallelized (see COLLAPSE in OpenMP 3.0) Use NOWAIT if possible

Overheads on a quad-core Cray XT4:
Schedule      2 threads  4 threads
PARALLEL      0.5 µs     1.0 µs
STATIC(1)     0.9 µs     1.3 µs
STATIC(64)    0.4 µs     0.7 µs
DYNAMIC(1)    34 µs      315 µs
DYNAMIC(64)   1.2 µs     2.7 µs
GUIDED(1)     15 µs      214 µs
GUIDED(64)    3.3 µs     6.2 µs

40 Hybrid programming

41 Hybrid programming Parallel programming model combining: OpenMP parallelization over one node MPI parallelization between nodes The hybrid model is closer to the hardware model of an SMP cluster - is it therefore always faster? No, there is a large body of work suggesting it is often slower There are a number of possible benefits and problems Analyze the program and the target platform to decide whether the benefits might yield improvements

42 Hybrid parallel programming models 1. No overlapping communication and computation: 1.1 MPI is called only outside parallel regions and by the master thread 1.2 MPI is called by several threads 2. Communication and computation overlap: while some of the threads communicate, the rest execute the application: 2.1 MPI is called only by the master thread 2.2 Communication is carried out by several threads 2.3 Each thread handles its own communication demands

43 MPI support for threading The MPI standard defines four levels of support: 0. MPI_THREAD_SINGLE Only one thread allowed 1. MPI_THREAD_FUNNELED Only the master thread is allowed to make MPI calls 2. MPI_THREAD_SERIALIZED All threads are allowed to make MPI calls, but not concurrently 3. MPI_THREAD_MULTIPLE No restrictions Some implementations support an additional model between 0 and 1: MPI calls are allowed only outside parallel regions; reported as MPI_THREAD_SINGLE

44 MPI support on Cray XT4/XT5
MPI_Init_thread(&argc,&argv,MPI_THREAD_MULTIPLE,&provided);
printf("supports level %d of %d %d %d %d\n",
       provided, MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED,
       MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE);
Cray XT4 (xt-mpt):
> Supports level 1 of

45 MPI support on Cray XT4/XT5 The MPI library supports MPI_THREAD_FUNNELED Overlapping communication/computation is still possible Non-blocking communication can be started in a MASTER block It completes while the parallel region computes One communicating thread is able to saturate the interconnect This might not be true on all architectures - a possible problem for the funneled model

46 First hybrid program
int main(int argc, char *argv[]) {
  int rank, omp_rank, mpisupport;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &mpisupport);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  #pragma omp parallel private(omp_rank)
  {
    omp_rank = omp_get_thread_num();
    printf("%d %d\n", rank, omp_rank);
  }
  MPI_Finalize();
}

47 Communication Communication inside a node is replaced by direct memory reads/writes Improved throughput and latency Decreased overhead from the MPI library Aggregated messages In many (data-parallel) algorithms the messages become larger as the number of MPI processes decreases Increased throughput for inter-node communication In some algorithms the number of messages is reduced, e.g. all-to-all Restrictions on calling MPI routines depend on the level of support: MPI only outside parallel regions - all other cores are idle MPI_THREAD_FUNNELED - other threads can compute during communication MPI_THREAD_MULTIPLE - best, but often not available

48 Case study: All-to-all on quad-core XT4 Collective operations are often performance bottlenecks Especially all-to-all operations A point-to-point implementation can be faster Hybrid implementation For all-to-all operations the (maximum) number of transfers decreases by a factor of #threads^2 The size of each message increases by a factor of #threads Allows overlapping communication and computation


51 Case study: All-to-all [Figures: all-to-all timings with 40 kB of data per node and with 400 kB of data per node]

52 Algorithmic issues The benefits of the hybrid approach are algorithm dependent, some examples: Limited parallelism in the MPI parallelization Additional levels of parallelism can be easier to implement with the hybrid approach E.g. a grid-based algorithm parallelized in only one dimension E.g. master-slave algorithms Embarrassingly parallel algorithms Can be used to speed up single tasks Can be used to increase the system size Domain decomposition (see the next case study)

53 Case study: Domain decomposition The number of atoms per cell is proportional to the number of threads The relative number of ghost particles is proportional to #threads^(-1/3) We can reduce communication by hybridizing the algorithm With four threads per process the number of ghost particles decreases by about 40%


55 Case study: Domain decomposition Fine-grained hybridization of an MD code A parallel region is entered each time the potential is evaluated The loop over atoms is parallelized with a static for Temporary array for the forces Shared Separate space for each thread Avoids the need for synchronization when Newton's third law is used The results are added to the real force array at the end of the parallel region
#pragma omp parallel
{
  ...
  zero(ptforce[thread][..][..]);
  ...
  #pragma omp for schedule(static,10)
  for (ii = 0; ii < atoms; ii++) {
    ...
    ptforce[thread][ii][..] += ...
    ptforce[thread][jj][..] += ...
  }
}
...
for (t = 0; t < threads; t++)
  force[..][..] += ptforce[t][..][..];
...

56 Case study: Domain decomposition

57 Load balance Good load balance is harder and harder to achieve as the number of MPI processes increases The hybrid approach decreases the number of processes One can dynamically change the number of threads per process Can improve load balance The hardware of SMP clusters restricts the usefulness E.g. a node with two quad-core processors (XT5): two MPI processes which can have 1-7 threads each

58 Case study: Master-slave algorithms Matrix multiplication as a demonstration of a master-slave algorithm Scaling is improved by going to a coarse-grained hybrid model It utilizes the following benefits: + Better load balancing due to fewer MPI processes + Message aggregation and reduced communication

59 Overlapping communication/computation If the level of support is at least MPI_THREAD_FUNNELED there are more options for overlapping Isend/Irecv are available as normal While the master thread communicates, the other threads can compute Can be difficult to utilize properly - load balancing is tricky With enough threads the master thread could be a dedicated communication thread

60 Memory issues The hybrid programming method can be used to decrease memory requirements Some algorithms have replicated data - hybridization can yield significant savings In domain decomposition algorithms there are fewer boundary data points Improved cache usage Many processors have shared caches L2 in Intel Core 2 L3 in AMD quad-core Shared data can reside in this cache - decreased cache pressure

61 Parallel I/O I/O is expensive and it is difficult to make it optimal Some approaches for parallel I/O: MPI-2 I/O Single writer with reduction Subset of writers/readers N writers/readers to N files [Figure: single writer, subset of writers, and N writers in a hybrid MPI setting]

62 Parallel I/O: a simple hybrid approach Every MPI process opens a file Good I/O bandwidth No communication needed Large filesystem stress, slow opens/closes Inconvenient as many files are created Hybridization: only one core per processor writes a shared array Achievable bandwidth is similar Decreases the number of files by a factor of #threads Easy to implement Allows overlapping of communication/computation

63 Summary The hybrid approach is difficult, but sometimes useful Performance of the hybrid approach is a tradeoff between greater overhead and decreased communication costs Direct benefits achieved without additional effort: All-to-all collective operations 2-5 times faster Parallel I/O with reduced filesystem stress in the N-writers case Message aggregation We expect the potential benefits to be even greater on the XT5

64 Questions!


More information

<Insert Picture Here> OpenMP on Solaris

<Insert Picture Here> OpenMP on Solaris 1 OpenMP on Solaris Wenlong Zhang Senior Sales Consultant Agenda What s OpenMP Why OpenMP OpenMP on Solaris 3 What s OpenMP Why OpenMP OpenMP on Solaris

More information

Introduction to OpenMP

Introduction to OpenMP Presentation Introduction to OpenMP Martin Cuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu September 9, 2004 http://www.chpc.utah.edu 4/13/2006 http://www.chpc.utah.edu

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah Introduction to OpenMP Martin Čuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu Overview Quick introduction. Parallel loops. Parallel loop directives. Parallel sections.

More information

Shared Memory Parallelism using OpenMP

Shared Memory Parallelism using OpenMP Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत SE 292: High Performance Computing [3:0][Aug:2014] Shared Memory Parallelism using OpenMP Yogesh Simmhan Adapted from: o

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Le Yan HPC Consultant User Services Goals Acquaint users with the concept of shared memory parallelism Acquaint users with the basics of programming with OpenMP Discuss briefly the

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Xiaoxu Guan High Performance Computing, LSU April 6, 2016 LSU HPC Training Series, Spring 2016 p. 1/44 Overview Overview of Parallel Computing LSU HPC Training Series, Spring 2016

More information

Department of Informatics V. HPC-Lab. Session 2: OpenMP M. Bader, A. Breuer. Alex Breuer

Department of Informatics V. HPC-Lab. Session 2: OpenMP M. Bader, A. Breuer. Alex Breuer HPC-Lab Session 2: OpenMP M. Bader, A. Breuer Meetings Date Schedule 10/13/14 Kickoff 10/20/14 Q&A 10/27/14 Presentation 1 11/03/14 H. Bast, Intel 11/10/14 Presentation 2 12/01/14 Presentation 3 12/08/14

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

EPL372 Lab Exercise 5: Introduction to OpenMP

EPL372 Lab Exercise 5: Introduction to OpenMP EPL372 Lab Exercise 5: Introduction to OpenMP References: https://computing.llnl.gov/tutorials/openmp/ http://openmp.org/wp/openmp-specifications/ http://openmp.org/mp-documents/openmp-4.0-c.pdf http://openmp.org/mp-documents/openmp4.0.0.examples.pdf

More information

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Barbara Chapman, Gabriele Jost, Ruud van der Pas Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology

More information

Mango DSP Top manufacturer of multiprocessing video & imaging solutions.

Mango DSP Top manufacturer of multiprocessing video & imaging solutions. 1 of 11 3/3/2005 10:50 AM Linux Magazine February 2004 C++ Parallel Increase application performance without changing your source code. Mango DSP Top manufacturer of multiprocessing video & imaging solutions.

More information

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono OpenMP Algoritmi e Calcolo Parallelo References Useful references Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost and Ruud van der Pas OpenMP.org http://openmp.org/

More information

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah Introduction to OpenMP Martin Čuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu Overview Quick introduction. Parallel loops. Parallel loop directives. Parallel sections.

More information

HPC Workshop University of Kentucky May 9, 2007 May 10, 2007

HPC Workshop University of Kentucky May 9, 2007 May 10, 2007 HPC Workshop University of Kentucky May 9, 2007 May 10, 2007 Part 3 Parallel Programming Parallel Programming Concepts Amdahl s Law Parallel Programming Models Tools Compiler (Intel) Math Libraries (Intel)

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

Introduction to Standard OpenMP 3.1

Introduction to Standard OpenMP 3.1 Introduction to Standard OpenMP 3.1 Massimiliano Culpo - m.culpo@cineca.it Gian Franco Marras - g.marras@cineca.it CINECA - SuperComputing Applications and Innovation Department 1 / 59 Outline 1 Introduction

More information

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

OpenMP Application Program Interface

OpenMP Application Program Interface OpenMP Application Program Interface DRAFT Version.1.0-00a THIS IS A DRAFT AND NOT FOR PUBLICATION Copyright 1-0 OpenMP Architecture Review Board. Permission to copy without fee all or part of this material

More information

Lab: Scientific Computing Tsunami-Simulation

Lab: Scientific Computing Tsunami-Simulation Lab: Scientific Computing Tsunami-Simulation Session 4: Optimization and OMP Sebastian Rettenberger, Michael Bader 23.11.15 Session 4: Optimization and OMP, 23.11.15 1 Department of Informatics V Linux-Cluster

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

MPI & OpenMP Mixed Hybrid Programming

MPI & OpenMP Mixed Hybrid Programming MPI & OpenMP Mixed Hybrid Programming Berk ONAT İTÜ Bilişim Enstitüsü 22 Haziran 2012 Outline Introduc/on Share & Distributed Memory Programming MPI & OpenMP Advantages/Disadvantages MPI vs. OpenMP Why

More information

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico. OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 16, 2011 CPD (DEI / IST) Parallel and Distributed Computing 18

More information

OpenMP Shared Memory Programming

OpenMP Shared Memory Programming OpenMP Shared Memory Programming John Burkardt, Information Technology Department, Virginia Tech.... Mathematics Department, Ajou University, Suwon, Korea, 13 May 2009.... http://people.sc.fsu.edu/ jburkardt/presentations/

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

Shared memory programming

Shared memory programming CME342- Parallel Methods in Numerical Analysis Shared memory programming May 14, 2014 Lectures 13-14 Motivation Popularity of shared memory systems is increasing: Early on, DSM computers (SGI Origin 3000

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Le Yan Objectives of Training Acquaint users with the concept of shared memory parallelism Acquaint users with the basics of programming with OpenMP Memory System: Shared Memory

More information

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah

Introduction to OpenMP. Martin Čuma Center for High Performance Computing University of Utah Introduction to OpenMP Martin Čuma Center for High Performance Computing University of Utah m.cuma@utah.edu Overview Quick introduction. Parallel loops. Parallel loop directives. Parallel sections. Some

More information

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico. OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 15, 2010 José Monteiro (DEI / IST) Parallel and Distributed Computing

More information

[Potentially] Your first parallel application

[Potentially] Your first parallel application [Potentially] Your first parallel application Compute the smallest element in an array as fast as possible small = array[0]; for( i = 0; i < N; i++) if( array[i] < small ) ) small = array[i] 64-bit Intel

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

Programming with Shared Memory PART II. HPC Fall 2012 Prof. Robert van Engelen

Programming with Shared Memory PART II. HPC Fall 2012 Prof. Robert van Engelen Programming with Shared Memory PART II HPC Fall 2012 Prof. Robert van Engelen Overview Sequential consistency Parallel programming constructs Dependence analysis OpenMP Autoparallelization Further reading

More information

CSL 860: Modern Parallel

CSL 860: Modern Parallel CSL 860: Modern Parallel Computation Hello OpenMP #pragma omp parallel { // I am now thread iof n switch(omp_get_thread_num()) { case 0 : blah1.. case 1: blah2.. // Back to normal Parallel Construct Extremely

More information

Session 4: Parallel Programming with OpenMP

Session 4: Parallel Programming with OpenMP Session 4: Parallel Programming with OpenMP Xavier Martorell Barcelona Supercomputing Center Agenda Agenda 10:00-11:00 OpenMP fundamentals, parallel regions 11:00-11:30 Worksharing constructs 11:30-12:00

More information

Parallel Programming using OpenMP

Parallel Programming using OpenMP 1 OpenMP Multithreaded Programming 2 Parallel Programming using OpenMP OpenMP stands for Open Multi-Processing OpenMP is a multi-vendor (see next page) standard to perform shared-memory multithreading

More information

Parallel Programming using OpenMP

Parallel Programming using OpenMP 1 Parallel Programming using OpenMP Mike Bailey mjb@cs.oregonstate.edu openmp.pptx OpenMP Multithreaded Programming 2 OpenMP stands for Open Multi-Processing OpenMP is a multi-vendor (see next page) standard

More information

OpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system

OpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives

More information

Using OpenMP. Rebecca Hartman-Baker Oak Ridge National Laboratory

Using OpenMP. Rebecca Hartman-Baker Oak Ridge National Laboratory Using OpenMP Rebecca Hartman-Baker Oak Ridge National Laboratory hartmanbakrj@ornl.gov 2004-2009 Rebecca Hartman-Baker. Reproduction permitted for non-commercial, educational use only. Outline I. About

More information

An Introduction to OpenMP

An Introduction to OpenMP Dipartimento di Ingegneria Industriale e dell'informazione University of Pavia December 4, 2017 Recap Parallel machines are everywhere Many architectures, many programming model. Among them: multithreading.

More information

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel

More information

Distributed Systems + Middleware Concurrent Programming with OpenMP

Distributed Systems + Middleware Concurrent Programming with OpenMP Distributed Systems + Middleware Concurrent Programming with OpenMP Gianpaolo Cugola Dipartimento di Elettronica e Informazione Politecnico, Italy cugola@elet.polimi.it http://home.dei.polimi.it/cugola

More information

Parallel Programming: OpenMP

Parallel Programming: OpenMP Parallel Programming: OpenMP Xianyi Zeng xzeng@utep.edu Department of Mathematical Sciences The University of Texas at El Paso. November 10, 2016. An Overview of OpenMP OpenMP: Open Multi-Processing An

More information

Amdahl s Law. AMath 483/583 Lecture 13 April 25, Amdahl s Law. Amdahl s Law. Today: Amdahl s law Speed up, strong and weak scaling OpenMP

Amdahl s Law. AMath 483/583 Lecture 13 April 25, Amdahl s Law. Amdahl s Law. Today: Amdahl s law Speed up, strong and weak scaling OpenMP AMath 483/583 Lecture 13 April 25, 2011 Amdahl s Law Today: Amdahl s law Speed up, strong and weak scaling OpenMP Typically only part of a computation can be parallelized. Suppose 50% of the computation

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

An Introduction to OpenMP

An Introduction to OpenMP An Introduction to OpenMP U N C L A S S I F I E D Slide 1 What Is OpenMP? OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism

More information

OpenMP. Application Program Interface. CINECA, 14 May 2012 OpenMP Marco Comparato

OpenMP. Application Program Interface. CINECA, 14 May 2012 OpenMP Marco Comparato OpenMP Application Program Interface Introduction Shared-memory parallelism in C, C++ and Fortran compiler directives library routines environment variables Directives single program multiple data (SPMD)

More information

Introduction to OpenMP. Rogelio Long CS 5334/4390 Spring 2014 February 25 Class

Introduction to OpenMP. Rogelio Long CS 5334/4390 Spring 2014 February 25 Class Introduction to OpenMP Rogelio Long CS 5334/4390 Spring 2014 February 25 Class Acknowledgment These slides are adapted from the Lawrence Livermore OpenMP Tutorial by Blaise Barney at https://computing.llnl.gov/tutorials/openmp/

More information

Parallel Programming with OpenMP. CS240A, T. Yang, 2013 Modified from Demmel/Yelick s and Mary Hall s Slides

Parallel Programming with OpenMP. CS240A, T. Yang, 2013 Modified from Demmel/Yelick s and Mary Hall s Slides Parallel Programming with OpenMP CS240A, T. Yang, 203 Modified from Demmel/Yelick s and Mary Hall s Slides Introduction to OpenMP What is OpenMP? Open specification for Multi-Processing Standard API for

More information

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1 Lecture 14: Mixed MPI-OpenMP programming Lecture 14: Mixed MPI-OpenMP programming p. 1 Overview Motivations for mixed MPI-OpenMP programming Advantages and disadvantages The example of the Jacobi method

More information

Programming with Shared Memory PART II. HPC Fall 2007 Prof. Robert van Engelen

Programming with Shared Memory PART II. HPC Fall 2007 Prof. Robert van Engelen Programming with Shared Memory PART II HPC Fall 2007 Prof. Robert van Engelen Overview Parallel programming constructs Dependence analysis OpenMP Autoparallelization Further reading HPC Fall 2007 2 Parallel

More information

OpenMP Overview. in 30 Minutes. Christian Terboven / Aachen, Germany Stand: Version 2.

OpenMP Overview. in 30 Minutes. Christian Terboven / Aachen, Germany Stand: Version 2. OpenMP Overview in 30 Minutes Christian Terboven 06.12.2010 / Aachen, Germany Stand: 03.12.2010 Version 2.3 Rechen- und Kommunikationszentrum (RZ) Agenda OpenMP: Parallel Regions,

More information

Topics. Introduction. Shared Memory Parallelization. Example. Lecture 11. OpenMP Execution Model Fork-Join model 5/15/2012. Introduction OpenMP

Topics. Introduction. Shared Memory Parallelization. Example. Lecture 11. OpenMP Execution Model Fork-Join model 5/15/2012. Introduction OpenMP Topics Lecture 11 Introduction OpenMP Some Examples Library functions Environment variables 1 2 Introduction Shared Memory Parallelization OpenMP is: a standard for parallel programming in C, C++, and

More information

COSC 6374 Parallel Computation. Introduction to OpenMP. Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel)

COSC 6374 Parallel Computation. Introduction to OpenMP. Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel) COSC 6374 Parallel Computation Introduction to OpenMP Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel) Edgar Gabriel Fall 2015 OpenMP Provides thread programming model at a

More information

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell. Specifications maintained by OpenMP Architecture Review Board (ARB)

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell. Specifications maintained by OpenMP Architecture Review Board (ARB) COMP4300/8300: The OpenMP Programming Model Alistair Rendell See: www.openmp.org Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Chapter 6 & 7 High Performance

More information

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell COMP4300/8300: The OpenMP Programming Model Alistair Rendell See: www.openmp.org Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Chapter 6 & 7 High Performance

More information

OpenMP 4. CSCI 4850/5850 High-Performance Computing Spring 2018

OpenMP 4. CSCI 4850/5850 High-Performance Computing Spring 2018 OpenMP 4 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer OpenMP examples Sergeev Efim Senior software engineer Singularis Lab, Ltd. OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.

More information

Programming Shared-memory Platforms with OpenMP. Xu Liu

Programming Shared-memory Platforms with OpenMP. Xu Liu Programming Shared-memory Platforms with OpenMP Xu Liu Introduction to OpenMP OpenMP directives concurrency directives parallel regions loops, sections, tasks Topics for Today synchronization directives

More information

OpenMP: Open Multiprocessing

OpenMP: Open Multiprocessing OpenMP: Open Multiprocessing Erik Schnetter May 20-22, 2013, IHPC 2013, Iowa City 2,500 BC: Military Invents Parallelism Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to

More information

COMP Parallel Computing. SMM (2) OpenMP Programming Model

COMP Parallel Computing. SMM (2) OpenMP Programming Model COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel

More information

Shared Memory Programming Model

Shared Memory Programming Model Shared Memory Programming Model Ahmed El-Mahdy and Waleed Lotfy What is a shared memory system? Activity! Consider the board as a shared memory Consider a sheet of paper in front of you as a local cache

More information

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space.

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Hybrid MPI/OpenMP parallelization Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Thread parallelism (such as OpenMP or Pthreads) can provide additional parallelism

More information

OpenMP Application Program Interface

OpenMP Application Program Interface OpenMP Application Program Interface Version.0 - RC - March 01 Public Review Release Candidate Copyright 1-01 OpenMP Architecture Review Board. Permission to copy without fee all or part of this material

More information

Review. 35a.cpp. 36a.cpp. Lecture 13 5/29/2012. Compiler Directives. Library Functions Environment Variables

Review. 35a.cpp. 36a.cpp. Lecture 13 5/29/2012. Compiler Directives. Library Functions Environment Variables Review Lecture 3 Compiler Directives Conditional compilation Parallel construct Work-sharing constructs for, section, single Work-tasking Synchronization Library Functions Environment Variables 2 35a.cpp

More information