Using OpenACC in IFS Physics Cloud Scheme (CLOUDSC)
Sami Saarinen, ECMWF
Basic GPU Training, Sept 16-17, 2015
Slide 1


Background
- Back in 2014: adaptation of the IFS physics cloud scheme (CLOUDSC) to new architectures as part of the ECMWF Scalability Programme
- Emphasis was on GPU migration by use of OpenACC directives
- CLOUDSC consumes about 10% of IFS forecast time
- Some 3,500 lines of Fortran 2003 before OpenACC directives
- This presentation concentrates on comparing the performance of
  - the Haswell OpenMP version of CLOUDSC
  - the NVIDIA GPU (K40) OpenACC version of CLOUDSC
Slide 2

Some earlier results
- Baseline results down from 40s to 0.24s on a K40 GPU
  - PGI 14.7 & CUDA 5.5/6.0 (runs performed ~3Q/2014)
  - the Cray CCE 8.4 OpenACC compiler was also tried
- OpenACC directives inserted automatically
  - by use of the acc_insert Perl script, followed by manual cleanup
  - source code expanded from 3,500 to 5,000 lines in CLOUDSC!
- The code with OpenACC directives still sustains roughly the same performance as before on the Intel Xeon host side
- The GPU's computational performance was the same or better compared to Intel Haswell (36-core model, 2.3GHz)
- Data transfers added serious overheads
  - strange DATA PRESENT testing & memory-pinning slowdowns
Slide 3

The problem setup for this case study
- Given 160,000 grid-point columns (NGPTOT)
  - each with 137 levels (NLEV)
  - about 80,000 columns fit into one K40 GPU
- Grid-point columns are independent of each other
  - so no horizontal dependencies here, but...
  - ...the level dependency prevents parallelization along the vertical dimension
- Arrays are organized in blocks of grid-point columns
  - instead of using ARRAY(NGPTOT, NLEV)...
  - ...we use ARRAY(NPROMA, NLEV, NBLKS) (see the index sketch after this slide)
  - NPROMA is a (runtime-)fixed blocking factor
  - arrays are OpenMP thread-safe over NBLKS
Slide 4
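A minimal sketch (not from the slides) of how a global column index maps into the blocked layout; NGPTOT, NPROMA and NLEV follow the slide, while JG, IBL, JL and the chosen NPROMA value are illustrative:

PROGRAM blocking_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: NGPTOT = 160000   ! total grid-point columns
  INTEGER, PARAMETER :: NLEV   = 137      ! vertical levels
  INTEGER, PARAMETER :: NPROMA = 10000    ! runtime-fixed blocking factor (example value)
  INTEGER :: NBLKS, JG, IBL, JL
  REAL(KIND=8), ALLOCATABLE :: ARRAY(:,:,:)

  NBLKS = (NGPTOT + NPROMA - 1) / NPROMA  ! number of NPROMA blocks
  ALLOCATE(ARRAY(NPROMA, NLEV, NBLKS))    ! blocked layout ARRAY(NPROMA, NLEV, NBLKS)

  JG  = 123457                  ! some global column index, 1..NGPTOT
  IBL = (JG - 1) / NPROMA + 1   ! block that holds this column
  JL  = JG - (IBL - 1)*NPROMA   ! position of the column inside its block

  ARRAY(JL, 1, IBL) = 0.0_8     ! column JG, level 1 lives at ARRAY(JL, 1, IBL)
  PRINT *, 'Column', JG, 'lives in block', IBL, 'at local index', JL
END PROGRAM blocking_sketch

Because columns are independent, whole blocks (the NBLKS dimension) can be handed to different OpenMP threads or GPU kernels without any horizontal data exchange.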

Hardware, compiler & NPROMAs used
- Haswell node: 24 cores @ 2.5GHz
  - 2 x NVIDIA K40c GPUs on each Haswell node, attached via PCIe
  - each GPU equipped with 12GB memory, running CUDA 7.0
- PGI compiler 15.7 with OpenMP & OpenACC
  - -O4 -fast -mp=numa,allcores,bind -Mfprelaxed -tp haswell -Mvect=simd:256 [ -acc ]
- Environment variables
  - PGI_ACC_NOSHARED=1
  - PGI_ACC_BUFFERSIZE=4M
- Typical good NPROMA value for Haswell: ~10-100
- Per GPU, NPROMA up to 80,000 for maximum performance
Slide 5

Haswell: Driving CLOUDSC with OpenMP

REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)

!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
!$OMP DO SCHEDULE(DYNAMIC,1)
DO JKGLO=1,NGPTOT,NPROMA                ! So-called NPROMA-loop
   IBL=(JKGLO-1)/NPROMA+1               ! Current block number
   ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)     ! Block length <= NPROMA
   CALL CLOUDSC ( 1, ICEND, NPROMA, KLEV, &
        &         array(1,1,IBL), &     ! ~65 arrays like this
        &         ... )
END DO
!$OMP END DO
!$OMP END PARALLEL

Typical values for NPROMA in the OpenMP implementation: 10-100
Slide 6

OpenMP scaling (Haswell, in GFlops/s)
[Chart: GFlops/s (0-18) versus number of OpenMP threads (1, 2, 4, 6, 12, 24), for NPROMA 10 and NPROMA 100]
Slide 7

Development of the OpenACC/GPU version
- The driver code with the OpenMP loop was kept roughly unchanged
  - GPU-to-host data mapping (ACC DATA) added
  - note that OpenACC can (in most cases) co-exist with OpenMP
  - this allows an elegant multi-GPU implementation
- CLOUDSC was pre-processed with the acc_insert Perl script
  - allowed automatic creation of ACC KERNELS and ACC DATA PRESENT / CREATE clauses in CLOUDSC
  - in addition, some minimal manual source-code cleanup
- CLOUDSC performance on the GPU needs a very large NPROMA
  - lack of multilevel parallelism (only across NPROMA, not NLEV)
Slide 8

Driving OpenACC CLOUDSC with OpenMP

!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
!$OMP&         PRIVATE(tid, idgpu) NUM_THREADS(NumGPUs)
tid   = omp_get_thread_num()              ! OpenMP thread number
idgpu = mod(tid, NumGPUs)                 ! Effective GPU# for this thread
CALL acc_set_device_num(idgpu, acc_get_device_type())
!$OMP DO SCHEDULE(STATIC)
DO JKGLO=1,NGPTOT,NPROMA                  ! NPROMA-loop
   IBL=(JKGLO-1)/NPROMA+1                 ! Current block number
   ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)       ! Block length <= NPROMA
!$acc data copyout(array(:,:,IBL), ...) & ! ~22 arrays : GPU to host
!$acc&     copyin (array(:,:,IBL))        ! ~43 arrays : host to GPU
   CALL CLOUDSC (... array(1,1,IBL) ...)  ! Runs on GPU#<idgpu>
!$acc end data
END DO
!$OMP END DO
!$OMP END PARALLEL

Typical values for NPROMA in the OpenACC implementation: > 10,000
Slide 9

Sample OpenACC coding of CLOUDSC

!$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ZTMP_Q,ZTMP) ASYNC(IBL)
DO JK=1,KLEV
   DO JL=KIDIA,KFDIA
      ztmp_q = 0.0_JPRB
      ztmp   = 0.0_JPRB
!$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ZTMP_Q) REDUCTION(+:ZTMP)
      DO JM=1,NCLV-1
         IF (ZQX(JL,JK,JM) < RLMIN) THEN
            ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM) + ZQX(JL,JK,JM)
            ZQADJ  = ZQX(JL,JK,JM)*ZQTMST
            ztmp_q = ztmp_q + ZQADJ
            ztmp   = ztmp   + ZQX(JL,JK,JM)
            ZQX(JL,JK,JM) = 0.0_JPRB
         ENDIF
      ENDDO
      PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
      ZQX(JL,JK,NCLDQV)   = ZQX(JL,JK,NCLDQV)   + ztmp
   ENDDO
ENDDO
!$ACC END KERNELS

ASYNC removes CUDA-thread syncs
Slide 10
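The slides do not show where the asynchronous queue is drained; the following is a minimal, self-contained sketch of the general pattern, assuming an explicit wait on the block's queue before data leaves the device (the array A, its size N, and the queue id value are illustrative, not from CLOUDSC):

PROGRAM async_wait_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1024
  INTEGER :: I, IBL
  REAL(KIND=8) :: A(N)

  A   = 1.0_8
  IBL = 1                          ! async queue id, e.g. one per NPROMA block

!$acc data copy(a)
!$acc kernels loop async(ibl)      ! kernel is queued, host does not block here
  DO I = 1, N
     A(I) = 2.0_8*A(I)
  END DO
!$acc wait(ibl)                    ! drain queue IBL before the copyout below
!$acc end data

  PRINT *, 'A(1) =', A(1)          ! expect 2.0
END PROGRAM async_wait_sketch

Without -acc the directives are plain comments, so the same source still builds and runs on the host.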

OpenACC scaling (K40c, in GFlops/s)
[Chart: GFlops/s (0-12) versus NPROMA (100, 1000, 10000, 20000, 40000, 80000), for 1 GPU and 2 GPUs]
Slide 11

Timing (ms) breakdown: single GPU
[Chart: time in ms (0-12000) versus NPROMA (10, 1000, 20000, 80000), split into computation, communication and other overhead, with Haswell for comparison]
Slide 12

Saturating GPUs with more work

!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
!$OMP&         PRIVATE(tid, idgpu) NUM_THREADS(NumGPUs * 4)   ! More threads here
tid   = omp_get_thread_num()              ! OpenMP thread number
idgpu = mod(tid, NumGPUs)                 ! Effective GPU# for this thread
CALL acc_set_device_num(idgpu, acc_get_device_type())
!$OMP DO SCHEDULE(STATIC)
DO JKGLO=1,NGPTOT,NPROMA                  ! NPROMA-loop
   IBL=(JKGLO-1)/NPROMA+1                 ! Current block number
   ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)       ! Block length <= NPROMA
!$acc data copyout(array(:,:,IBL), ...) & ! ~22 arrays : GPU to host
!$acc&     copyin (array(:,:,IBL))        ! ~43 arrays : host to GPU
   CALL CLOUDSC (... array(1,1,IBL) ...)  ! Runs on GPU#<idgpu>
!$acc end data
END DO
!$OMP END DO
!$OMP END PARALLEL

Slide 13

Saturating GPUs with more work
- Consider a few performance-degrading facts at present:
  - parallelism only in the NPROMA dimension in CLOUDSC
  - updating 60-odd arrays back and forth every time step
  - OpenACC overhead related to data transfers & ACC DATA
- Can we do better? YES!
  - We can enable concurrently executed kernels through OpenMP
  - by time-sharing GPU(s) across multiple OpenMP threads
  - about 4 simultaneous OpenMP host threads can saturate a single GPU in our CLOUDSC case
  - extra care must be taken to avoid running out of memory on the GPU
  - needs ~4X smaller NPROMA: 20,000 instead of 80,000
Slide 14

Multiple copies of CLOUDSC per GPU (GFlops/s)
[Chart: GFlops/s (0-16) versus number of copies (1, 2, 4), for 1 GPU and 2 GPUs]
Slide 15

nvvp profiler shows the time-sharing impact
[Screenshots: GPU fed with work by one OpenMP thread only vs. GPU 4-way time-shared]
Slide 16

Timing (ms): 4-way time-shared vs. not time-shared
[Chart: time in ms (0-4500) versus NPROMA (10, 20000, 80000), split into computation, communication and other overhead, with Haswell for comparison; GPU 4-way time-shared vs. GPU not time-shared]
Slide 17

24-core Haswell 2.5GHz vs. K40c GPU(s) (GFlops/s)
[Chart: GFlops/s (0-18) for Haswell, 2 GPUs (T/S), 2 GPUs, 1 GPU (T/S), 1 GPU; T/S = GPUs time-shared]
Slide 18

Conclusions
- The CLOUDSC OpenACC prototype from 3Q/2014 was ported to ECMWF's tiny GPU cluster in 3Q/2015
- Since last time the PGI compiler has improved and OpenACC overheads have been greatly reduced (PGI 14.7 vs. 15.7)
- With CUDA 7.0 and concurrent kernels, it seems that time-sharing (oversubscribing) GPUs with more work pays off
- Saturation of the GPUs can be achieved, not surprisingly, with the help of the multi-core host launching more data blocks onto the GPUs
- The outcome is not bad considering we seem to be under-utilizing the GPUs (parallelism just along NPROMA)
Slide 19