Using OpenACC in the IFS Physics Cloud Scheme (CLOUDSC)
Sami Saarinen, ECMWF
Basic GPU Training, Sept 16-17, 2015
Slide 1
Background
- Back in 2014: adaptation of the IFS physics cloud scheme (CLOUDSC) to new architectures as part of the ECMWF Scalability Programme
- Emphasis was on GPU migration by use of OpenACC directives
- CLOUDSC consumes about 10% of IFS forecast time
- Some 3,500 lines of Fortran 2003 before OpenACC directives
- This presentation concentrates on comparing the performance of:
  - the Haswell OpenMP version of CLOUDSC
  - the NVIDIA GPU (K40) OpenACC version of CLOUDSC
Slide 2
Some earlier results
- Baseline results down from 40 s to 0.24 s on a K40 GPU
  - PGI 14.7 & CUDA 5.5 / 6.0 (runs performed ~Q3/2014)
  - The Cray CCE 8.4 OpenACC compiler was also tried
- OpenACC directives inserted automatically by use of the acc_insert Perl script, followed by manual cleanup
  - Source code expanded from 3,500 to 5,000 lines in CLOUDSC!
- The code with OpenACC directives still sustains about the same performance as before on the Intel Xeon host side
- GPU computational performance was the same as or better than Intel Haswell (36-core model, 2.3 GHz)
- Data transfers added serious overheads
  - Strange DATA PRESENT testing & memory-pinning slowdowns
Slide 3
The problem setup for this case study
- Given 160,000 grid-point columns (NGPTOT), each with 137 levels (NLEV)
- About 80,000 columns fit into one K40 GPU
- Grid-point columns are independent of each other
  - So no horizontal dependencies here, but...
  - ...the level dependency prevents parallelization along the vertical dimension
- Arrays are organized in blocks of grid-point columns
  - Instead of using ARRAY(NGPTOT, NLEV)...
  - ...we use ARRAY(NPROMA, NLEV, NBLKS)
  - NPROMA is a (runtime-)fixed blocking factor
  - Arrays are OpenMP thread-safe over NBLKS
Slide 4
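The blocking scheme above can be sketched in a few lines of Fortran. This is a minimal illustration only, not the actual IFS code: the array name ZARR and the zero-initialisation are made up, while NGPTOT, NLEV and the NPROMA-loop idiom come from these slides.

```fortran
PROGRAM nproma_blocking
  IMPLICIT NONE
  INTEGER, PARAMETER :: NGPTOT = 160000   ! total grid-point columns
  INTEGER, PARAMETER :: NLEV   = 137      ! vertical levels
  INTEGER, PARAMETER :: NPROMA = 100      ! blocking factor (fixed at runtime)
  INTEGER :: NGPBLKS, JKGLO, IBL, ICEND
  REAL(kind=8), ALLOCATABLE :: ZARR(:,:,:)

  ! Number of NPROMA-sized blocks; the last block may be partially filled
  NGPBLKS = (NGPTOT + NPROMA - 1) / NPROMA
  ALLOCATE(ZARR(NPROMA, NLEV, NGPBLKS))

  ! Each block is independent of the others, which is what makes the
  ! loop thread-safe over NBLKS (and later a natural OpenACC unit of work)
  DO JKGLO = 1, NGPTOT, NPROMA
    IBL   = (JKGLO - 1) / NPROMA + 1        ! current block index
    ICEND = MIN(NPROMA, NGPTOT - JKGLO + 1) ! columns in this block
    ZARR(1:ICEND, :, IBL) = 0.0_8           ! touch only the valid columns
  END DO
END PROGRAM
```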
Hardware, compiler & NPROMAs used
- Haswell node: 24 cores @ 2.5 GHz
  - 2 x NVIDIA K40c GPUs on each Haswell node, attached via PCIe
  - Each GPU equipped with 12 GB memory; CUDA 7.0
- PGI compiler 15.7 with OpenMP & OpenACC
  - -O4 -fast -mp=numa,allcores,bind -Mfprelaxed -tp haswell -Mvect=simd:256 [ -acc ]
- Environment variables: PGI_ACC_NOSHARED=1, PGI_ACC_BUFFERSIZE=4M
- Typical good NPROMA value for Haswell: ~10...100
- Per GPU: NPROMA up to 80,000 for maximum performance
Slide 5
Haswell: driving CLOUDSC with OpenMP

  REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)
  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
  !$OMP DO SCHEDULE(DYNAMIC,1)
  DO JKGLO=1,NGPTOT,NPROMA              ! So-called NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1              ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)    ! Block length <= NPROMA
    CALL CLOUDSC ( 1, ICEND, NPROMA, KLEV, &
       & array(1,1,IBL), &              ! ~65 arrays like this
       & ... )
  END DO
  !$OMP END DO
  !$OMP END PARALLEL

Typical values for NPROMA in the OpenMP implementation: 10...100
Slide 6
[Chart: OpenMP scaling on Haswell, in GFlops/s (0-18), for NPROMA 10 and NPROMA 100, at OMP thread counts 1, 2, 4, 6, 12, 24]
Slide 7
Development of the OpenACC/GPU version
- The driver code with the OpenMP loop was kept roughly unchanged
- GPU-to-host data mapping (ACC DATA) was added
- Note that OpenACC can (in most cases) co-exist with OpenMP
  - Allows an elegant multi-GPU implementation
- CLOUDSC was pre-processed with the acc_insert Perl script
  - Allowed automatic creation of ACC KERNELS and ACC DATA PRESENT / CREATE clauses in CLOUDSC
  - In addition, some minimal manual source-code clean-up
- CLOUDSC performance on the GPU needs a very large NPROMA
  - Lack of multi-level parallelism (only across NPROMA, not NLEV)
Slide 8
Driving OpenACC CLOUDSC with OpenMP

  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
  !$OMP& PRIVATE(tid, idgpu) num_threads(numgpus)
  tid = omp_get_thread_num()                     ! OpenMP thread number
  idgpu = mod(tid, NumGPUs)                      ! Effective GPU# for this thread
  CALL acc_set_device_num(idgpu, acc_get_device_type())
  !$OMP DO SCHEDULE(STATIC)
  DO JKGLO=1,NGPTOT,NPROMA                       ! NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1                       ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)             ! Block length <= NPROMA
  !$acc data copyout(array(:,:,IBL), ...) &      ! ~22 arrays: GPU to host
  !$acc&     copyin(array(:,:,IBL), ...)         ! ~43 arrays: host to GPU
    CALL CLOUDSC (... array(1,1,IBL) ...)        ! Runs on GPU# idgpu
  !$acc end data
  END DO
  !$OMP END DO
  !$OMP END PARALLEL

Typical values for NPROMA in the OpenACC implementation: > 10,000
Slide 9
Sample OpenACC coding of CLOUDSC

  !$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ztmp_q, ztmp) ASYNC(IBL)
  DO JK=1,KLEV
    DO JL=KIDIA,KFDIA
      ztmp_q = 0.0_JPRB
      ztmp   = 0.0_JPRB
  !$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ztmp_q, ztmp)
      DO JM=1,NCLV-1
        IF (ZQX(JL,JK,JM) < RLMIN) THEN
          ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM) + ZQX(JL,JK,JM)
          ZQADJ  = ZQX(JL,JK,JM)*ZQTMST
          ztmp_q = ztmp_q + ZQADJ
          ztmp   = ztmp + ZQX(JL,JK,JM)
          ZQX(JL,JK,JM) = 0.0_JPRB
        ENDIF
      ENDDO
      PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
      ZQX(JL,JK,NCLDQV)   = ZQX(JL,JK,NCLDQV) + ztmp
    ENDDO
  ENDDO
  !$ACC END KERNELS

ASYNC removes CUDA-thread syncs
Slide 10
[Chart: OpenACC scaling on K40c, in GFlops/s (0-12), for 1 GPU and 2 GPUs, as a function of NPROMA: 100, 1000, 10000, 20000, 40000, 80000]
Slide 11
[Chart: timing (ms, 0-12000) breakdown on a single GPU — computation, communication, other overhead — vs. Haswell, for NPROMA 10, 1000, 20000, 80000]
Slide 12
Saturating GPUs with more work

  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
  !$OMP& PRIVATE(tid, idgpu) num_threads(numgpus * 4)   ! <-- More threads here
  tid = omp_get_thread_num()                     ! OpenMP thread number
  idgpu = mod(tid, NumGPUs)                      ! Effective GPU# for this thread
  CALL acc_set_device_num(idgpu, acc_get_device_type())
  !$OMP DO SCHEDULE(STATIC)
  DO JKGLO=1,NGPTOT,NPROMA                       ! NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1                       ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)             ! Block length <= NPROMA
  !$acc data copyout(array(:,:,IBL), ...) &      ! ~22 arrays: GPU to host
  !$acc&     copyin(array(:,:,IBL), ...)         ! ~43 arrays: host to GPU
    CALL CLOUDSC (... array(1,1,IBL) ...)        ! Runs on GPU# idgpu
  !$acc end data
  END DO
  !$OMP END DO
  !$OMP END PARALLEL

Slide 13
Saturating GPUs with more work
- Consider a few performance-degrading factors at present:
  - Parallelism only in the NPROMA dimension in CLOUDSC
  - Updating 60-odd arrays back and forth every time step
  - OpenACC overhead related to data transfers & ACC DATA
- Can we do better? YES!
  - We can enable concurrently executed kernels through OpenMP
  - Time-sharing GPU(s) across multiple OpenMP threads
  - About 4 simultaneous OpenMP host threads can saturate a single GPU in our CLOUDSC case
  - Extra care must be taken to avoid running out of memory on the GPU
  - Needs ~4x smaller NPROMA: 20,000 instead of 80,000
Slide 14
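The memory argument above can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions, not measured data: it assumes each resident CLOUDSC invocation keeps roughly 65 blocked arrays of NPROMA x NLEV 8-byte reals on the GPU (the array count and the 12 GB K40 capacity come from earlier slides; everything else is hypothetical).

```fortran
PROGRAM gpu_memory_check
  IMPLICIT NONE
  INTEGER, PARAMETER :: NLEV = 137, NARRAYS = 65
  REAL, PARAMETER    :: GPU_GB = 12.0          ! K40 device memory
  INTEGER(kind=8) :: nproma, bytes_per_copy
  INTEGER :: ncopies

  ! With N concurrent host threads feeding one GPU, each resident copy
  ! must shrink its NPROMA so that all copies fit at the same time
  DO ncopies = 1, 4
    nproma = 80000 / ncopies
    bytes_per_copy = 8_8 * nproma * NLEV * NARRAYS
    PRINT '(A,I1,A,I6,A,F6.2,A,F5.1,A)', 'copies=', ncopies, &
       & '  NPROMA=', nproma, '  resident ~', &
       & ncopies * bytes_per_copy / 1.0e9, ' GB of ', GPU_GB, ' GB'
  END DO
END PROGRAM
```

The total resident footprint stays constant, which is why 4 copies need the ~4x smaller NPROMA of 20,000 quoted on this slide.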
Multiple copies of CLOUDSC per GPU (GFlops/s)
[Chart: GFlops/s (0-16) for 1 GPU and 2 GPUs, with 1, 2 and 4 copies of CLOUDSC per GPU]
Slide 15
nvvp profiler shows the time-sharing impact
[Profiler timelines: GPU fed with work by one OpenMP thread only vs. GPU 4-way time-shared]
Slide 16
Timing (ms): 4-way time-shared vs. no time-sharing
[Chart: timing (ms, 0-4500) breakdown — computation, communication, other overhead — for NPROMA 10, 20000, 80000, comparing a 4-way time-shared GPU against a non-time-shared GPU, with Haswell for reference]
Slide 17
24-core Haswell 2.5 GHz vs. K40c GPU(s) (GFlops/s)
[Chart: GFlops/s (0-18); T/S = GPUs time-shared; series: Haswell, 2 GPUs (T/S), 2 GPUs, 1 GPU (T/S), 1 GPU]
Slide 18
Conclusions
- The CLOUDSC OpenACC prototype from Q3/2014 was ported to ECMWF's tiny GPU cluster in Q3/2015
- Since last time the PGI compiler has improved, and OpenACC overheads have been greatly reduced (PGI 14.7 vs. 15.7)
- With CUDA 7.0 and concurrent kernels, time-sharing (oversubscribing) GPUs with more work pays off
- Saturation of the GPUs can be achieved, not surprisingly, with the help of the multi-core host launching more data blocks onto the GPUs
- The outcome is not bad, considering we seem to be under-utilizing the GPUs (parallelism along NPROMA only)
Slide 19