Using OpenACC in IFS Physics Cloud Scheme (CLOUDSC)
Sami Saarinen, ECMWF
Basic GPU Training, Sept 16-17, 2015


Slide 2: Background

- Back in 2014: adaptation of the IFS physics cloud scheme (CLOUDSC) to new architectures as part of the ECMWF Scalability programme
  - Emphasis was on GPU migration by use of OpenACC directives
- CLOUDSC consumes about 10% of IFS forecast time
- Some 3500 lines of Fortran 2003 before OpenACC directives
- This presentation concentrates on comparing the performance of
  - Haswell: OpenMP version of CLOUDSC
  - NVIDIA GPU (K40): OpenACC version of CLOUDSC

Slide 3: Some earlier results

- Baseline results: down from 40 s to 0.24 s on K40 GPU
  - PGI 14.7 & CUDA 5.5 / 6.0 (runs performed ~ 3Q/2014)
  - The Cray CCE 8.4 OpenACC compiler was also tried
- OpenACC directives inserted automatically
  - By use of the acc_insert Perl script, followed by manual cleanup
  - Source code lines expanded from 3500 to 5000 in CLOUDSC!
- The code with OpenACC directives still sustains about the same performance as before on the Intel Xeon host side
- GPU computational performance was the same or better compared to Intel Haswell (model with 36 cores, 2.3 GHz)
- Data transfer added serious overheads
  - Strange DATA PRESENT testing & memory pinning slowdowns


Slide 4: The problem setup for this case study

- Given 160,000 grid point columns (NGPTOT)
  - Each with 137 levels (NLEV)
  - About 80,000 columns fit into one K40 GPU
- Grid point columns are independent of each other
  - So no horizontal dependencies here, but the level dependency prevents parallelization along the vertical dimension
- Arrays are organized in blocks of grid point columns
  - Instead of using ARRAY(NGPTOT, NLEV) we use ARRAY(NPROMA, NLEV, NBLKS)
  - NPROMA is a (runtime) fixed blocking factor
  - Arrays are OpenMP thread safe over NBLKS
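The blocking described above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not code from the presentation: the function name `nproma_blocks` is hypothetical, and the NPROMA value of 30,000 in the usage note is chosen only to show a short last block. The index formulas mirror the `IBL` and `ICEND` expressions used in the driver loops on the later slides.

```python
def nproma_blocks(ngptot, nproma):
    """Yield (ibl, jkglo, icend) per block, using 1-based Fortran-style indices.

    ibl   : block number, as in IBL = (JKGLO-1)/NPROMA + 1
    jkglo : global index of the first column in the block
    icend : number of columns in the block, as in MIN(NPROMA, NGPTOT-JKGLO+1)
    """
    for jkglo in range(1, ngptot + 1, nproma):
        ibl = (jkglo - 1) // nproma + 1
        icend = min(nproma, ngptot - jkglo + 1)
        yield ibl, jkglo, icend


if __name__ == "__main__":
    # NGPTOT = 160,000 columns with the GPU-sized block of 80,000 columns.
    for block in nproma_blocks(160_000, 80_000):
        print(block)
```

With NGPTOT = 160,000 and NPROMA = 80,000 this gives two full blocks. With an NPROMA that does not divide NGPTOT, for example 30,000, the last block is shorter than NPROMA, which is why the driver passes `ICEND` rather than `NPROMA` as the upper column bound.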

Slide 5: Hardware, compiler & NPROMAs used

- Haswell node: 2.5 GHz
- 2 x NVIDIA K40c GPUs on each Haswell node via PCIe
  - Each GPU equipped with 12 GB memory, with CUDA 7.0
- PGI Compiler 15.7 with OpenMP & OpenACC
  - -O4 -fast -mp=numa,allcores,bind -Mfprelaxed -tp haswell -Mvect=simd:256 [ -acc ]
- Environment variables
  - PGI_ACC_NOSHARED=1
  - PGI_ACC_BUFFERSIZE=4M
- Typical good NPROMA value for Haswell ~
- Per GPU, NPROMA up to 80,000 for max performance

Slide 6: Haswell: Driving CLOUDSC with OpenMP

    REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)

    !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
    !$OMP DO SCHEDULE(DYNAMIC,1)
    DO JKGLO=1,NGPTOT,NPROMA              ! So called NPROMA-loop
      IBL=(JKGLO-1)/NPROMA+1              ! Current block number
      ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)    ! Block length <= NPROMA
      CALL CLOUDSC ( 1, ICEND, NPROMA, KLEV, &
       &  array(1,1,ibl), &               ! ~ 65 arrays like this
       &  )
    END DO
    !$OMP END DO
    !$OMP END PARALLEL

Typical values for NPROMA in OpenMP implementation:


Slide 7: OpenMP scaling (Haswell, in GFlops/s)

[Figure: GFlops/s on Haswell; series labelled by NPROMA value, axis labelled OMP]

Slide 8: Development of OpenACC/GPU-version

- The driver code with the OpenMP loop was kept roughly unchanged
  - GPU to host data mapping (ACC DATA) added
  - Note that OpenACC can (in most cases) co-exist with OpenMP
  - Allows an elegant multi-GPU implementation
- CLOUDSC was pre-processed with the acc_insert Perl script
  - Allowed automatic creation of ACC KERNELS and ACC DATA PRESENT / CREATE clauses in CLOUDSC
  - In addition, some minimal manual source code clean-up
- CLOUDSC performance on GPU needs a very large NPROMA
  - Lack of multilevel parallelism (only across NPROMA, not NLEV)

Slide 9: Driving OpenACC CLOUDSC with OpenMP

    !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
    !$OMP& PRIVATE(tid, idgpu) num_threads(numgpus)
    tid = omp_get_thread_num()            ! OpenMP thread number
    idgpu = mod(tid, NumGPUs)             ! Effective GPU# for this thread
    CALL acc_set_device_num(idgpu, acc_get_device_type())
    !$OMP DO SCHEDULE(STATIC)
    DO JKGLO=1,NGPTOT,NPROMA              ! NPROMA-loop
      IBL=(JKGLO-1)/NPROMA+1              ! Current block number
      ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)    ! Block length <= NPROMA
      !$acc data copyout(array(:,:,ibl),...) &  ! ~22 : GPU to Host
      !$acc&     copyin(array(:,:,ibl))         ! ~43 : Host to GPU
      CALL CLOUDSC (... array(1,1,ibl) ...)     ! Runs on GPU#<idgpu>
      !$acc end data
    END DO
    !$OMP END DO
    !$OMP END PARALLEL

Typical values for NPROMA in OpenACC implementation: > 10,000

Slide 10: Sample OpenACC coding of CLOUDSC

    !$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ZTMP_Q,ZTMP)
    DO JK=1,KLEV
      DO JL=KIDIA,KFDIA
        ztmp_q = 0.0_JPRB
        ztmp = 0.0_JPRB
        !$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ZTMP_Q, +:ZTMP)
        DO JM=1,NCLV-1
          IF (ZQX(JL,JK,JM)<RLMIN) THEN
            ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM)+ZQX(JL,JK,JM)
            ZQADJ = ZQX(JL,JK,JM)*ZQTMST
            ztmp_q = ztmp_q + ZQADJ
            ztmp = ztmp + ZQX(JL,JK,JM)
            ZQX(JL,JK,JM) = 0.0_JPRB
          ENDIF
        ENDDO
        PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
        ZQX(JL,JK,NCLDQV) = ZQX(JL,JK,NCLDQV) + ztmp
      ENDDO
    ENDDO
    !$ACC END KERNELS ASYNC(IBL)

ASYNC removes CUDA-thread syncs
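To make the arithmetic of this sample loop easier to follow, here is a serial reference sketch in Python. It is not part of the presentation. It assumes nested lists indexed `[jl][jk][jm]` with 0-based indices, and it assumes that `ncldqv` is the index of the last species, which lies outside the inner loop range. The function name and the input values in the usage example are hypothetical.

```python
def clip_small_species(zqx, zlneg, pstate_q_loc, rlmin, zqtmst, ncldqv):
    """Serial reference for the sample loop, updating the arrays in place.

    zqx, zlneg   : nested lists indexed [jl][jk][jm] (0-based here)
    pstate_q_loc : nested list indexed [jl][jk]
    ncldqv       : 0-based index of the species that absorbs the clipped
                   amounts (assumed to be the last one)
    """
    for jl in range(len(zqx)):
        for jk in range(len(zqx[jl])):
            ztmp_q = 0.0
            ztmp = 0.0
            nclv = len(zqx[jl][jk])
            # Fortran DO JM=1,NCLV-1 corresponds to 0-based range(nclv - 1).
            for jm in range(nclv - 1):
                q = zqx[jl][jk][jm]
                if q < rlmin:
                    zlneg[jl][jk][jm] += q   # record what was clipped
                    ztmp_q += q * zqtmst     # contribution to the tendency
                    ztmp += q                # amount moved to species ncldqv
                    zqx[jl][jk][jm] = 0.0
            pstate_q_loc[jl][jk] += ztmp_q
            zqx[jl][jk][ncldqv] += ztmp


if __name__ == "__main__":
    # One column, one level, three species; all values are made up.
    zqx = [[[-0.5, 0.25, 1.0]]]
    zlneg = [[[0.0, 0.0, 0.0]]]
    pstate_q_loc = [[0.0]]
    clip_small_species(zqx, zlneg, pstate_q_loc, rlmin=1e-8, zqtmst=0.5, ncldqv=2)
    print(zqx, zlneg, pstate_q_loc)
```

In this reading, each (JL, JK) point is independent of the others, which is what `COLLAPSE(2)` exploits, while the two scalars `ztmp_q` and `ztmp` are the only values accumulated across the inner JM loop, hence the `REDUCTION` clause. The sum over all species at a point is unchanged by the update.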


Slide 11: OpenACC scaling (K40c, in GFlops/s)

[Figure: GFlops/s versus NPROMA for 1 GPU and 2 GPUs]


Slide 12: Timing (ms) breakdown: single GPU

[Figure: time in ms versus NPROMA, split into computation, communication and other overhead, with Haswell for reference]

Slide 13: Saturating GPUs with more work

More threads here: num_threads(numgpus * 4)

    !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
    !$OMP& PRIVATE(tid, idgpu) num_threads(numgpus * 4)
    tid = omp_get_thread_num()            ! OpenMP thread number
    idgpu = mod(tid, NumGPUs)             ! Effective GPU# for this thread
    CALL acc_set_device_num(idgpu, acc_get_device_type())
    !$OMP DO SCHEDULE(STATIC)
    DO JKGLO=1,NGPTOT,NPROMA              ! NPROMA-loop
      IBL=(JKGLO-1)/NPROMA+1              ! Current block number
      ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)    ! Block length <= NPROMA
      !$acc data copyout(array(:,:,ibl),...) &  ! ~22 : GPU to Host
      !$acc&     copyin(array(:,:,ibl))         ! ~43 : Host to GPU
      CALL CLOUDSC (... array(1,1,ibl) ...)     ! Runs on GPU#<idgpu>
      !$acc end data
    END DO
    !$OMP END DO
    !$OMP END PARALLEL
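The only change from the earlier driver is the thread count, so the `mod(tid, NumGPUs)` expression now places several host threads on each GPU. A minimal Python sketch of that mapping follows. The function names are hypothetical, and the sketch models only the device assignment, not how OpenMP distributes loop iterations over the threads.

```python
def gpu_for_thread(tid, num_gpus):
    """Device number chosen by a thread, as in idgpu = mod(tid, NumGPUs)."""
    return tid % num_gpus


def threads_sharing_each_gpu(num_gpus, copies_per_gpu):
    """Group thread numbers by GPU when num_threads = num_gpus * copies_per_gpu."""
    sharing = {gpu: [] for gpu in range(num_gpus)}
    for tid in range(num_gpus * copies_per_gpu):
        sharing[gpu_for_thread(tid, num_gpus)].append(tid)
    return sharing


if __name__ == "__main__":
    print(threads_sharing_each_gpu(2, 1))  # one thread per GPU
    print(threads_sharing_each_gpu(2, 4))  # four threads time-share each GPU
```

With two GPUs and four copies per GPU, threads 0, 2, 4 and 6 drive GPU 0 and threads 1, 3, 5 and 7 drive GPU 1, so neighbouring thread numbers alternate between devices.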

Slide 14: Saturating GPUs with more work

- Consider a few facts that degrade performance at present
  - Parallelism only in the NPROMA dimension in CLOUDSC
  - Updating 60-odd arrays back and forth every time step
  - OpenACC overhead related to data transfers & ACC DATA
- Can we do better? YES!
  - We can enable concurrently executed kernels through OpenMP
  - Time-sharing GPU(s) across multiple OpenMP threads
- About 4 simultaneous OpenMP host threads can saturate a single GPU in our CLOUDSC case
- Extra care must be taken to avoid running out of memory on the GPU
  - Needs ~ 4X smaller NPROMA: 20,000 instead of 80,000
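The memory remark can be illustrated with rough arithmetic. The sketch below is an estimate under stated assumptions, not a figure from the presentation: it treats all of the roughly 65 argument arrays as NPROMA x NLEV blocks of 8-byte reals and ignores any temporaries allocated inside CLOUDSC, so the real footprint will differ. The function name `block_bytes` is hypothetical.

```python
def block_bytes(nproma, nlev=137, n_arrays=65, bytes_per_real=8):
    """Rough device footprint of one block's argument arrays.

    Assumes every one of the n_arrays arrays is NPROMA x NLEV in size
    and stored as 8-byte reals; local temporaries are not counted.
    """
    return nproma * nlev * n_arrays * bytes_per_real


if __name__ == "__main__":
    one_copy = block_bytes(80_000)         # single copy per GPU
    four_copies = 4 * block_bytes(20_000)  # 4-way time-shared GPU
    print(one_copy / 1e9, four_copies / 1e9)  # both in GB (10**9 bytes)
```

Under these assumptions, four resident copies at NPROMA = 20,000 occupy the same space as one copy at NPROMA = 80,000, which matches the 4X reduction quoted on the slide: the block size has to shrink in proportion to the number of threads sharing a GPU.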


Slide 15: Multiple copies of CLOUDSC per GPU (GFlops/s)

[Figure: GFlops/s for 1 GPU and 2 GPUs with 1, 2 and 4 copies of CLOUDSC per GPU]

Slide 16: nvvp profiler shows time-sharing impact

[Figure: two nvvp profiler timelines, one where the GPU is fed with work by one OpenMP thread only and one where the GPU is 4-way time-shared]

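Slide 17: Timing (ms): 4-way time-shared vs. no T/S

[Figure: two panels of time in ms versus NPROMA, one with the GPU 4-way time-shared and one with the GPU not time-shared; split into computation, communication and other overhead, with Haswell for reference]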

Slide 18: 24-core Haswell 2.5 GHz vs. K40c GPU(s) (GFlops/s)

[Figure: GFlops/s for Haswell, 1 GPU, 1 GPU (T/S), 2 GPUs and 2 GPUs (T/S); T/S = GPUs time-shared]

Slide 19: Conclusions

- The CLOUDSC OpenACC prototype from 3Q/2014 was ported to ECMWF's tiny GPU cluster in 3Q/2015
- Since last time the PGI compiler has improved and OpenACC overheads have been greatly reduced (PGI 14.7 vs. 15.7)
- With CUDA 7.0 and concurrent kernels, it seems that time-sharing (oversubscribing) GPUs with more work pays off
- Saturation of GPUs can be achieved, not surprisingly, with the help of the multi-core host launching more data blocks onto GPUs
- The outcome is not bad considering we seem to be under-utilizing the GPUs (parallelism just along NPROMA)
