Porting CASTEP to GPGPUs. Adrian Jackson, Toni Collis, EPCC, University of Edinburgh; Graeme Ackland, University of Edinburgh


CASTEP
- Density Functional Theory
- Plane-wave basis set with pseudopotentials
- Heavy use of FFTs
- Modern Fortran and MPI for parallelisation: plane-wave, k-point and band data decompositions
- Significant use on UK HPC systems

CASTEP Scaling
[Figure: CASTEP scaling — runtime (seconds, log scale) against number of processes, comparing measured CASTEP performance with ideal scaling.]

Capabilities

Hamiltonians
- DFT XC-functionals: LDA, PW91, PBE, RPBE, PBEsol, WC
- Hybrid functionals: PBE0, B3LYP, sX-LDA, the HSE family of functionals (including user-defined parameterisation)
- LDA+U and GGA+U
- Semi-empirical dispersion corrections (DFT+D)

Structural methods
- Full variable-cell geometry optimisation using BFGS, LBFGS and TPSD
- Geometry optimisation using internal co-ordinates
- Geometry optimisation using damped molecular dynamics
- Transition-state search using the LST/QST method

Molecular Dynamics
- Molecular dynamics including fixed and variable-cell MD
- NVE, NVT, NPH and NPT ensembles
- Path-integral MD for quantum nuclear motion

Vibrational Spectroscopy
- Phonon dispersion and DOS over the full Brillouin zone using DFPT methods
- Phonon dispersion and DOS over the full Brillouin zone using supercell methods
- IR and Raman intensities

Dielectric Properties
- Born effective charges and dielectric permittivity
- Frequency-dependent dielectric permittivity in the IR range
- Wannier functions
- Electrostatic correction for polar slab models

Solid-state NMR spectroscopy
- Chemical shifts
- Electric field gradient tensors
- J-coupling

Optical and other Spectroscopies
- EELS/ELNES and XANES spectra
- Optical matrix elements and spectra

Electronic properties
- Band-structure calculations
- Mulliken population analysis
- Hirshfeld population analysis
- Electron Localisation Functions (ELF)

Pseudopotentials
- Supports Vanderbilt ultrasoft and norm-conserving pseudopotentials
- Built-in "on the fly" pseudopotential generator
- Self-consistent pseudopotentials
- (Non-self-consistent) PAW for properties calculations

Electronic Solvers
- Block Davidson solver with density mixing
- Ensemble DFT for metals

Motivation
- Demonstrator: investigate whether it makes sense, and what data transfers are necessary
- CASTEP 7.0: no divergence from the mainstream code, no intrusion into the physics
- Single GPU: large simulations on the desktop
- Multiple GPUs: utilise large GPU-accelerated systems; enable future UK HPC systems to be GPU-accelerated

CASTEP: initial accelerator investigation
- Replace BLAS calls with CULA (CUDA BLAS/LAPACK library, http://www.culatools.com/)
- Replace FFT calls with cuFFT
- Test case: NLXC and geometry optimisation; a small simulation that fits on one CPU, with no MPI calls (4 Ti atoms, 2 O atoms, 32 electrons in total)
- No device calls: runtime = 14.6 s
- CULA BLAS calls: runtime = 31.1 s
- CULA BLAS and cuFFT calls: runtime = 418 s
- The majority of the increased runtime was due to data transfer.
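To illustrate the kind of substitution involved, the following is a minimal sketch of running a 3D complex-to-complex transform with cuFFT from CUDA Fortran. It is a hypothetical example, not CASTEP code: the grid size, array names and in-place transform are assumptions, and it relies on the cufft interface module shipped with the PGI/NVIDIA compilers.

program cufft_sketch
  use cudafor          ! CUDA Fortran device arrays
  use cufft            ! cuFFT interfaces (PGI/NVIDIA HPC compilers)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx = 64, ny = 64, nz = 64
  complex(dp), allocatable         :: grid(:,:,:)    ! host copy
  complex(dp), device, allocatable :: grid_d(:,:,:)  ! device copy
  integer :: plan, istat

  allocate(grid(nx,ny,nz), grid_d(nx,ny,nz))
  grid = (1.0_dp, 0.0_dp)
  grid_d = grid                                      ! host -> device transfer

  ! 3D double-complex plan; dimensions are passed slowest-first,
  ! i.e. reversed relative to the Fortran array declaration
  istat = cufftPlan3d(plan, nz, ny, nx, CUFFT_Z2Z)
  istat = cufftExecZ2Z(plan, grid_d, grid_d, CUFFT_FORWARD)

  grid = grid_d                                      ! device -> host transfer
  istat = cufftDestroy(plan)
  deallocate(grid, grid_d)
end program cufft_sketch

The two explicit assignments either side of the transform are exactly the host/device traffic that dominated the 418 s runtime above when transfers were done around every library call.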

GPUification of CASTEP
Aim: remove the data-transfer problem by placing most of the large data structures on the GPU.
Use OpenACC kernels, PGI CUDA Fortran, CULA BLAS and cuFFT.
The process:
- All-or-nothing approach: move the large data structures onto the GPU together with all affected routines/functions (approximately 50 subroutines)
- Focus on the serial version first
- After initial compilation, expect to spend some time optimising, particularly data transfers
- Then move on to the MPI version

OpenACC Directives
With directives inserted, the compiler will attempt to compile the key kernels for execution on the GPU, and will manage the necessary data transfers automatically.
Directive format: C: #pragma acc ...; Fortran: !$acc ...
These are ignored by non-accelerator compilers.

OpenACC

PROGRAM main
  INTEGER :: a(N)
!$acc data copy(a)
!$acc parallel loop
  DO i = 1, N
    a(i) = i
  END DO
!$acc end parallel loop
  CALL double_array(a)
!$acc end data
END PROGRAM main

SUBROUTINE double_array(b)
  INTEGER :: b(N)
!$acc kernels loop present(b)
  DO i = 1, N
    b(i) = 2*b(i)
  END DO
!$acc end kernels loop
END SUBROUTINE double_array

GPUification of CASTEP
Data structures on the device

Wavefunctions:
  complex(kind=dp) :: Wavefunction%coeffs(:,:,:,:)
  complex(kind=dp) :: Wavefunction%beta_phi(:,:,:,:)
  real(kind=dp)    :: Wavefunction%beta_phi_at_gamma(:,:,:,:)
  logical          :: Wavefunction%have_beta_phi(:,:)
  complex(kind=dp) :: Wavefunctionslice%coeffs(:,:)
  complex(kind=dp) :: Wavefunctionslice%realspace_coeffs(:,:)
  real(kind=dp)    :: Wavefunctionslice%realspace_coeffs_at_gamma(:,:)
  logical          :: Wavefunctionslice%have_realspace(:)
  complex(kind=dp) :: Wavefunctionslice%beta_phi(:,:)
  real(kind=dp)    :: Wavefunctionslice%beta_phi_at_gamma(:,:)

Bands:
  complex(kind=dp) :: coeffs(:)
  complex(kind=dp) :: beta_phi(:)
  real(kind=dp)    :: beta_phi_at_gamma(:)
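The slides do not show how these derived-type members are placed on the device. One common pattern is a manual deep copy with unstructured data directives: copy the parent derived type first, then each allocatable member, so that later kernels can rely on present(). The sketch below uses a hypothetical cut-down wavefunction type, not the CASTEP definition, purely to show the shape of that pattern.

module wvfn_device_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! Hypothetical stand-in for a CASTEP-like wavefunction type
  type :: wavefunction_t
     complex(dp), allocatable :: coeffs(:,:,:,:)
     real(dp),    allocatable :: beta_phi_at_gamma(:,:,:,:)
  end type wavefunction_t

contains

  subroutine wvfn_to_device(wvfn)
    type(wavefunction_t), intent(inout) :: wvfn
    ! Manual deep copy: parent first, then each member, so the device
    ! copies of the member arrays are attached to the device parent.
!$acc enter data copyin(wvfn)
!$acc enter data copyin(wvfn%coeffs, wvfn%beta_phi_at_gamma)
  end subroutine wvfn_to_device

  subroutine wvfn_from_device(wvfn)
    type(wavefunction_t), intent(inout) :: wvfn
!$acc exit data copyout(wvfn%coeffs, wvfn%beta_phi_at_gamma)
!$acc exit data delete(wvfn)
  end subroutine wvfn_from_device

end module wvfn_device_sketch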

Example use of kernels

subroutine wave_copy_wv_wv_ks
!$acc kernels present_or_copy(wvfn_dst, wvfn_src)
  ! map reduced representation of coefficients on k-point
  do nb = 1, nbands_to_copy
    recip_grid = cmplx_0
    call basis_recip_reduced_to_grid(wvfn_src%coeffs(:,nb,nk_s,ns_s), nk_src, recip_grid, 'STND')
    call basis_recip_grid_to_reduced(recip_grid, 'STND', wvfn_dst%coeffs(:,nb,nk_d,ns_d), nk_dst)
  end do
  ! copy rotation data
  do nb = 1, nbands_to_copy
    do nb2 = 1, nbands_to_copy
      wvfn_dst%rotation(nb, wvfn_dst%node_band_index(nb2,id_in_bnd_group), nk_dst, ns_dst) = &
        & wvfn_src%rotation(nb, wvfn_src%node_band_index(nb2,id_in_bnd_group), nk_src, ns_src)
    end do
  end do
!$acc end kernels
end subroutine wave_copy_wv_wv_ks

GPUification of CASTEP
- Module procedures are used throughout the code: multiple calls to all the core kernels, with generic interfaces supporting different data structures for the same call (the interface chooses between specific routines).
- CASTEP uses Fortran features that are not supported on devices, such as optional dummy arguments followed by if (present(...)) tests. Resolved by creating copies of subroutines with and without the optional arguments (see the sketch after this list).
- Arrays are passed to subroutines as assumed-size dimension(*) arguments. Resolved by specifying the correct dimension structure, sometimes requiring multiple copies of a subroutine.
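A hedged illustration of the optional-argument workaround (the routine names and the scaling operation are invented for the example, not taken from CASTEP): the single routine with an optional argument is split into two specific routines, and a generic interface keeps both call forms available under the original name.

module scale_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! Generic name preserved; backed by two device-friendly specifics
  interface scale_coeffs
     module procedure scale_coeffs_plain
     module procedure scale_coeffs_weighted
  end interface scale_coeffs

contains

  ! Variant without the (formerly optional) weight argument
  subroutine scale_coeffs_plain(coeffs, factor)
    complex(dp), intent(inout) :: coeffs(:)
    real(dp),    intent(in)    :: factor
    integer :: i
!$acc kernels present_or_copy(coeffs)
    do i = 1, size(coeffs)
       coeffs(i) = factor * coeffs(i)
    end do
!$acc end kernels
  end subroutine scale_coeffs_plain

  ! Variant with the weight argument made mandatory
  subroutine scale_coeffs_weighted(coeffs, factor, weight)
    complex(dp), intent(inout) :: coeffs(:)
    real(dp),    intent(in)    :: factor
    real(dp),    intent(in)    :: weight(:)
    integer :: i
!$acc kernels present_or_copy(coeffs, weight)
    do i = 1, size(coeffs)
       coeffs(i) = factor * weight(i) * coeffs(i)
    end do
!$acc end kernels
  end subroutine scale_coeffs_weighted

end module scale_sketch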

Original routine with assumed-size dummy arrays:

subroutine basis_real_to_recip_gamma(grid, grid_type, num_grids, gamma)
  real(kind=dp),    dimension(*), intent(inout) :: grid
  character(len=*),               intent(in)    :: grid_type
  complex(kind=dp), dimension(*), intent(out)   :: gamma

Example modification

interface basis_real_to_recip_gamma
  module procedure basis_real_to_recip_gamma_1d
  module procedure basis_real_to_recip_gamma_2d_grid
  module procedure basis_real_to_recip_gamma_2d_gamma
  module procedure basis_real_to_recip_gamma_2d_grid_2d_gamma
  module procedure basis_real_to_recip_gamma_3d_gamma
  module procedure basis_real_to_recip_gamma_3d_grid_3d_gamma
end interface

subroutine basis_real_to_recip_gamma_2d_grid_2d_gamma(grid, grid_type, num_grids, gamma)
  implicit none
  integer,          intent(in)                    :: num_grids
  real(kind=dp),    dimension(:,:), intent(inout) :: grid
  character(len=*), intent(in)                    :: grid_type
  complex(kind=dp), dimension(:,:), intent(out)   :: gamma
  real(kind=dp),    dimension(:), allocatable     :: temp_grid
  complex(kind=dp), dimension(:), allocatable     :: temp_gamma

  allocate(temp_grid(size(grid)))
  allocate(temp_gamma(size(gamma)))
  temp_grid  = reshape(grid, shape(temp_grid))
  temp_gamma = reshape(gamma, shape(temp_gamma))

  call basis_real_to_recip_gamma_inner(temp_grid, grid_type, num_grids, temp_gamma)

  grid  = reshape(temp_grid, shape(grid))
  gamma = reshape(temp_gamma, shape(gamma))
  deallocate(temp_grid, temp_gamma)

end subroutine basis_real_to_recip_gamma_2d_grid_2d_gamma

GPUification of CASTEP
Data that is involved in I/O needs to be taken off the device (host copies of the data need to be made).

Original code (from ion.cuf):

  read(wvfn%page_unit, rec=record, iostat=status) &
    ((wvfn%coeffs(np,nb,1,1), np=1,wvfn%waves_at_kp(nk)), nb=1,wvfn%nbands_max)

New code:

  read(wvfn%page_unit, rec=record, iostat=status) &
    ((coeffs_tmp, np=1,wvfn%waves_at_kp(nk)), nb=1,wvfn%nbands_max)
  wvfn%coeffs(np,nb,1,1) = coeffs_tmp

Sometimes the limitations on what is on and off the device result in multiple !$acc kernels regions very close together, rather than a single region covering the whole subroutine, which is not necessarily very efficient. This will require a lot of fine tuning to improve performance.
Currently still working on successfully compiling the serial code.

GPUification of CASTEP
- Still an ongoing project
- Very close to having the first version of the software ported to the device
- Expect this to be optimised to improve performance and minimise data transfer
- Using the PGI compiler (in order to use OpenACC) has resulted in multiple compiler issues:
  - tmp files not being correctly understood, with no clear error message
  - the compiler failing on large files
  - complex-number functions in Fortran not currently compatible with OpenACC (see the workaround sketch below)
  - deep data copy not handled
- Next step: OpenACC + MPI implementation.
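Where a complex-valued intrinsic is rejected inside an accelerator region, one possible workaround is to expand the operation into explicit real and imaginary parts. This is a sketch only: it assumes the elemental real/aimag/cmplx conversions are accepted by the compiler, and the routine below is illustrative rather than CASTEP code.

subroutine conjugate_scale(z, alpha, n)
  ! Illustrative workaround: compute alpha * conjg(z(i)) without calling
  ! the complex intrinsic conjg inside the accelerator region, by working
  ! on explicit real and imaginary parts instead.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer,     intent(in)    :: n
  real(dp),    intent(in)    :: alpha
  complex(dp), intent(inout) :: z(n)
  real(dp) :: re, im
  integer  :: i
!$acc kernels present_or_copy(z)
  do i = 1, n
     re =  real(z(i), kind=dp)
     im = -aimag(z(i))                 ! conjugation done by hand
     z(i) = cmplx(alpha*re, alpha*im, kind=dp)
  end do
!$acc end kernels
end subroutine conjugate_scale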

Reduced port of CASTEP
- Porting the full program proved too difficult: unsupported features and compiler immaturity, and the low-level changes affected too much code
- Changed approach to porting a contained piece of functionality: a particular feature (the NLXC calculation)
- Work from the bottom up rather than the top down: port the lowest-level kernels, then move the data regions successively higher (see the sketch below), rather than porting the data structures first and then altering all the associated code to work with them
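A schematic view of that bottom-up strategy, with hypothetical routine and array names rather than the actual CASTEP NLXC code: the low-level kernel carries its own present-or-copy clauses so it works standalone, and once a data region is introduced one level up, the same clauses become no-ops because the data is already on the device.

! Step 1: port the lowest-level kernel with its own (present-or-)copy clauses
subroutine apply_potential(psi, vloc, n)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer,     intent(in)    :: n
  real(dp),    intent(in)    :: vloc(n)
  complex(dp), intent(inout) :: psi(n)
  integer :: i
!$acc kernels pcopy(psi) pcopyin(vloc)
  do i = 1, n
     psi(i) = vloc(i) * psi(i)
  end do
!$acc end kernels
end subroutine apply_potential

! Step 2: move the data region one level up; the kernel's clauses now find
! the arrays already present on the device, so no extra transfers occur.
subroutine nlxc_like_driver(psi, vloc, n, nsteps)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer,     intent(in)    :: n, nsteps
  real(dp),    intent(in)    :: vloc(n)
  complex(dp), intent(inout) :: psi(n)
  integer :: step
!$acc data copy(psi) copyin(vloc)
  do step = 1, nsteps
     call apply_potential(psi, vloc, n)
  end do
!$acc end data
end subroutine nlxc_like_driver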

CASTEP Performance

                 Time (s)   Speedup
  1 process      6442.45
  2 processes    4368.15     1.47
  4 processes    2183.26     2.95
  8 processes    2147.57     2.99
  16 processes   1489.48     4.32
  32 processes    936.59     6.88
  64 processes    741.37     8.69
  1 GPU          1894.67     3.4

Summary
- Porting CASTEP to GPGPUs using OpenACC and CUDA libraries
- The full program defeated us; we are still porting a large number of the core kernels, but without having to update the whole program
- OpenACC has moved on, and so have the compilers: much better now, but still not trivial
- OpenACC is not OpenMP: similar in the sense that it is easy to get something working but harder to get full performance, yet OpenACC hides much more expensive data operations. Bad OpenMP will still scale a bit; badly structured OpenACC will run much slower than the serial code.
- How do you cope with the modifications to the source code needed to enable GPU usage?