Optimization and porting of a numerical code for simulations in GRMHD on CPU/GPU clusters PRACE Winter School Stage
1 Optimization and porting of a numerical code for simulations in GRMHD on CPU/GPU clusters PRACE Winter School Stage INFN - Università di Parma November 6, 2012
2 Table of contents 1 Introduction 2 3 4
3 Let me introduce myself I am a 2nd-year PhD student in Physics at Parma University Master Thesis in Florence with L. Del Zanna: oscillations of rotating Neutron Stars (ECHO code) PhD in Parma with R. De Pietri: dynamical instabilities of rotating NS with Magnetic Fields in GR (Cactus/Whisky) PRACE Winter School 2012 here in CINECA I would like to have a fast working code to perform 2D simulations of NS on a cluster with GPUs The starting point is the 2- and 3-D code ECHO The ECHO code needs to be optimized, fully parallelized and ported to GPUs
4 General Relativistic Magnetohydrodynamics (GRMHD) GRMHD is the study of magnetized fluid flows in general relativistic spacetimes It is required for the study of extremely compact astrophysical objects, like neutron stars and black holes The spacetime is a solution of Einstein's field equations of general relativity, and can be either solved for dynamically or assumed given by the initial data (Cowling approximation) Solution techniques are similar to traditional Computational Fluid Dynamics, with some important differences
5 : Overview Eulerian Conservative High Order code: the aim is to combine shock-capturing properties and accuracy for small scale wave propagation and turbulence, in a 3+1 approach (L. Del Zanna, O. Zanotti, N. Bucciantini, P. Londrillo, 2007, A&A 473, 11) GR upgrade of: Londrillo & Del Zanna 2000; Del Zanna et al. 2002, 2003 Modular structure, F90 language, MPI parallelization Any metric allowed (1-, 2- or 3-D), even time-dependent Finite-difference scheme, Runge-Kutta time-stepping UCT strategy for the magnetic field (staggered grid) Central-type Riemann solvers (LLF, HLL, HLLC) Upgrades: resistivity, radiation HD, evolving spacetimes
6 : numerical scheme The two sets of conservation laws are discretized in space according to the Upwind Constrained Transport strategy (UCT: Londrillo & Del Zanna ApJ 530, 508, 2000; JCP 195, 17, 2004) Staggered grid for magnetic and electric field components Finite differences: point values at cell centers (u), at cell faces (b and f), at edges (e).
7 : Upwind Constrained Transport The divergence-free constraint is satisfied algebraically at 2nd spatial order Single upwind state for B_i along direction i Four-state numerical fluxes for the magnetic field at edges
8 : the evolution scheme (1/3) The ECHO evolution scheme needs to: Compute the primitive fluid variables from the conservative ones (see next) and interpolate the B components at cell center 8 primitive variables at cell center: P = [ρ, v, p, B]^T For every direction, fill the boundary ghost zones with the values of the primitives and reconstruct left (L) and right (R) upwind states at interfaces (B_i along i is unchanged): P^{L,R}_{i+1/2,j} = R^{L,R}({P_{ij}}) Compute metric properties at interfaces interpolating the metrics defined on the grid (only at t = 0 in Cowling approx.)
9 : the evolution scheme (2/3) Compute the upwind fluxes for the fluid part from the primitive variables using an approximate Riemann solver (save the fast-magnetosonic and transverse transport speeds) For every direction, compute the spatial derivatives of the fluxes at cell center and the maximum wave speed Compute geometrical source terms (right-hand side of the conservation laws) Reconstruct B and v from cell faces to cell edges Compute numerical fluxes for the magnetic field (upwind electric fields) and their derivatives
10 : the evolution scheme (3/3) Compute the maximum time step obeying the CFL (Courant-Friedrichs-Lewy) condition with 0 < c < 1: Δt = c / max_i (a^M_i / h_i) (the a^M_i are the maximum speeds over the whole domain, for each direction i) Update all the variables using the flux and electric field derivatives with a 2nd-order Runge-Kutta scheme: 1. u^{(1)} = u^n + Δt R(u^n) 2. u^{n+1} = (1/2) u^n + (1/2) [u^{(1)} + Δt R(u^{(1)})]
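The time-step and update formulas above can be sketched in plain C. This is an illustrative stand-in, not ECHO code: `rhs` is a trivial decay term rather than the actual GRMHD right-hand side, and both function names are made up.

```c
#include <stddef.h>

/* Stand-in right-hand side R(u); the real ECHO RHS assembles flux
   derivatives and geometrical source terms. Here simply R(u) = -u. */
static double rhs(double u) { return -u; }

/* One step of the 2nd-order Runge-Kutta scheme from the slide:
   u1      = u_n + dt*R(u_n)
   u_{n+1} = 0.5*u_n + 0.5*(u1 + dt*R(u1))                          */
double rk2_step(double u, double dt)
{
    double u1 = u + dt * rhs(u);
    return 0.5 * u + 0.5 * (u1 + dt * rhs(u1));
}

/* CFL time step dt = c / max_i(a_i / h_i), with 0 < c < 1:
   a[i] are per-direction maximum wave speeds, h[i] grid spacings.   */
double cfl_dt(const double *a, const double *h, size_t ndim, double c)
{
    double m = 0.0;
    for (size_t i = 0; i < ndim; ++i)
        if (a[i] / h[i] > m)
            m = a[i] / h[i];
    return c / m;
}
```

For a linear decay RHS the scheme reproduces the Taylor expansion 1 - dt + dt²/2 after one step, a quick sanity check of 2nd-order accuracy.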
11 ECHO: The time evolution routine
12 Optimizations The main optimizations on the original code are the following: implicit none was added to all subroutines so that all variables are now explicitly declared all the variables read and written by a subroutine are now passed as arguments and declared with their own intent many do loops were conveniently merged or re-arranged some routines have been split into different routines for different cases many useless temporary array variables were eliminated all the physical parameters, the evolution parameters and the grid and boundary settings are now contained in one file instead of being scattered among many source files
13 CPUs vs GPUs CPU design: multi-core sophisticated control logic unit large cache memories to reduce access latencies GPU design: many cores (several hundreds) minimized control logic in order to manage lightweight threads and maximize execution throughput large number of threads to overcome long-latency memory accesses
14 OpenACC: What is it? is a directive-based API standard for programming accelerators is portable across operating systems and various types of host CPUs and accelerators allows programmers to provide directives to the compiler identifying which areas of code to accelerate, without requiring programmers to modify or adapt the code itself is aimed at incremental development of accelerator code
15 OpenACC: About the standard announced at the SC11 conference (Seattle, November 2011) offers portability between compilers drawn up by: NVIDIA, Cray, PGI, CAPS works for Fortran, C, C++ standard available at current version: 1.0a (April 2012) work is now targeting additional features for v1.1 compiler support: all complete in 2012 (PGI full support from version 12.6, late August)
16 OpenACC: Execution model Host-directed execution with attached GPU accelerator Main program executes on host (i.e. CPU) Compute-intensive regions are offloaded to the accelerator device under control of the host The device (i.e. GPU) executes parallel regions, which typically contain kernels (i.e. work-sharing loops), or kernels regions, containing loops executed as kernels The host must orchestrate the execution by: allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the parallel region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory.
17 OpenACC: Memory model Memory spaces on the host and device are distinct Different locations, different address space Data movement performed by the host using runtime library calls that explicitly move data GPUs have a weak memory model No synchronisation between different execution units (SMs) (unless explicit memory barrier) Can write OpenACC kernels with race conditions, giving inconsistent execution results Compiler will catch most errors, but not all OpenACC data movement between the memories is implicit, managed by the compiler based on directives from the programmer Device memory caches are managed by the compiler with hints from the programmer in the form of directives.
18 OpenACC: Benefits Modify original source code with directives Non-executable statements (comments, pragmas) Can be ignored by a non-accelerating compiler Don't need a separate source base, much more portable Can preserve subprogram structure Familiar programming model for those who have used traditional OpenMP A small performance gap is acceptable (target is 10-15%, currently seeing better than that for many cases) An open standard is the most attractive for developers Portability, multiple compilers for debugging
19 OpenACC: Levels of parallelism The model target architecture is a collection of processing elements or PEs, where each PE is multithreaded, and each thread on the PE can execute vector instructions. The OpenACC execution model has three levels of parallelism: the gang dimension would map across the PEs (CUDA blocks) the worker dimension across the multithreading dimension within a PE (warps) the vector dimension to the vector instructions (threads within a warp) There is no support for any synchronization between gangs, since current accelerators typically do not support synchronization across PEs.
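A minimal C sketch of how these three levels are typically spelled out on a loop nest: the outer loop is distributed across gangs (CUDA blocks), the inner one across vector lanes (threads in a warp). The routine is a made-up example; with a non-OpenACC compiler the pragmas are ignored and the loops simply run sequentially on the host.

```c
/* Scale every row of an n-by-m matrix by s, mapping the outer loop
   to gangs and the inner loop to vector lanes.                      */
void scale_rows(int n, int m, float a[n][m], float s)
{
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < m; ++j)
            a[i][j] *= s;
    }
}
```

Because gangs cannot synchronize with each other, each outer iteration must be independent, which is the case here: each row is touched by exactly one gang.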
20 OpenACC: Categories of OpenACC APIs Accelerator Parallel Region / Kernels Directives Loop Directives Data Declaration Directives Data Regions Directives Wait / update directives Runtime Library Routines Environment variables
21 OpenACC: Accelerator directives Fortran: sentinel: !$acc * paired with !$acc end * continuation lines: & + !$acc& C/C++: sentinel: #pragma acc * applies to the following structured block {... }, which avoids the need for end directives continuation lines: \ (at line end) General form: !$acc directive-name [clause [,clause]...]
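For example, in C the directive governs the structured block that follows it, so no end directive is needed. This is a minimal sketch with a made-up `saxpy` routine; for a compiler without OpenACC support the pragma is a no-op and the loop runs on the host.

```c
/* y = a*x + y over n elements; the kernels directive asks the
   compiler to turn the loop into an accelerator kernel.             */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```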
22 OpenACC: Parallel Directive 1/2 Starts parallel execution on the accelerator Specified by: !$acc parallel [clause [,clause]...] When encountered: Gangs of worker threads are created to execute on the accelerator One worker in each gang begins executing the code in the structured block The number of gangs/workers remains constant in the parallel region
23 OpenACC: Parallel Directive 2/2 The clauses for the !$acc parallel directive are: if (condition) async [(scalar-integer-expression)] num_gangs, num_workers, vector_length (scalar-integer-expr.) reduction (operator:list) copy, copyin, copyout (list) create (list) private (list) present, present_or_... (list) If async is not present, there is an implicit barrier at the end of the accelerator parallel region present_or_copy is the default for aggregate types (arrays) private or copy is the default for scalar variables
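A few of these clauses in context: an illustrative C sketch (the `dot` routine is an assumed example, not from ECHO) combining a `reduction` on the accumulator with `copyin` data clauses. Serially, without OpenACC, it is an ordinary dot product.

```c
/* Dot product with a sum reduction; x and y are copied to the
   device read-only, sum is reduced across all gangs/workers.        */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```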
24 OpenACC: Kernels Directive Defines a region of a program that is to be compiled into a sequence of kernels for execution on the accelerator Each loop nest will be a different kernel Kernels are launched in order on the device Specified by: !$acc kernels [clause [,clause]...] A kernels directive may not contain a nested parallel or kernels directive The configuration of gangs and worker threads may be different for each kernel If async is present, the kernels or parallel region will execute asynchronously on the accelerator present_or_copy is the default for aggregate types (arrays) private or copy is the default for scalar variables
25 OpenACC: Loop Directive Used to describe what type of parallelism to use to execute the loop on the accelerator Can be used to declare loop-private variables, arrays and reduction operations Specified by: !$acc loop [clause [,clause]...] followed by a do loop The clauses for the !$acc loop directive are: collapse (n) gang, worker, vector [(scalar-integer-expression)] seq independent private (list) reduction (operator:list) Combined directives are specified by: !$acc parallel loop, !$acc kernels loop
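An illustrative C use of the `collapse` and `reduction` clauses on a rectangular loop nest (`max_abs` is a made-up routine): `collapse(2)` fuses the two loops into one parallel iteration space, and the `max` reduction accumulates a global maximum.

```c
/* Maximum absolute value of an n-by-m row-major array.              */
double max_abs(int n, int m, const double *a)
{
    double amax = 0.0;
    #pragma acc parallel loop collapse(2) reduction(max:amax)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            double v = a[i * m + j];
            if (v < 0.0) v = -v;
            if (v > amax) amax = v;
        }
    return amax;
}
```

This is the same pattern ECHO needs to compute the maximum wave speed over the whole domain for the CFL condition.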
26 OpenACC: Data Directive The data construct defines scalars, arrays and subarrays to be allocated in the accelerator memory for the duration of the region Can be used to control whether data should be copied in or out from the host Specified by: !$acc data [clause [,clause]...] The clauses for the !$acc data directive are: if (condition) copy, copyin, copyout (list) create (list) present, present_or_... (list) deviceptr (list)
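A sketch of a data region spanning two kernels, so the arrays stay on the device between them instead of being transferred twice (a hypothetical C example): `x` is copied in once, `y` is copied out once at the end of the region.

```c
/* Two kernels sharing one data region: y = x*x, then y += c.
   Without the data directive each kernel would move x and y
   across the PCIe bus separately.                                   */
void square_then_shift(int n, const double *x, double *y, double c)
{
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = x[i] * x[i];

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] += c;
    }
}
```

This is exactly the pattern the later ECHO-GPU slides rely on: keep the evolved variables resident on the device across many kernels.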
27 OpenACC: Declare Directive Used in the variable declaration section of a program to specify that a variable should be allocated and copied in/out in an implicit data region of a function, subroutine or program If specified within a Fortran module, the implicit data region is valid for the whole program Specified by: !$acc declare [clause [,clause]...] Not fully implemented in PGI compiler release 12.8 yet? see later...
28 OpenACC: Update and Wait Directives OpenACC Update Directive: !$acc update [clause [,clause]...] Used within a data region to update / synchronize the values of the arrays on the host or the accelerator The clauses for the !$acc update directive are: host, device (list) if (condition) async [(scalar-integer-expression)] OpenACC Wait Directive: !$acc wait [(scalar-integer-expression)] It causes the program to wait for completion of an asynchronous activity such as an accelerator parallel or kernels region or an update directive It will test and evaluate the integer expression for completion If no argument is specified, the host process will wait until all asynchronous activities have completed
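Both directives in one small hedged C sketch (`double_in_place` is invented for illustration): the host modifies the array inside a data region, pushes the new values to the device with `update device`, launches the kernel asynchronously on queue 1, and waits before the region ends. Without an accelerator the pragmas have no effect and the code runs serially.

```c
/* Add 1 on the host, synchronize to the device, then double on the
   device asynchronously and wait for completion.                    */
void double_in_place(int n, double *x)
{
    #pragma acc data copy(x[0:n])
    {
        for (int i = 0; i < n; ++i)   /* host-side modification      */
            x[i] += 1.0;

        #pragma acc update device(x[0:n])

        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            x[i] *= 2.0;

        #pragma acc wait(1)
    }
}
```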
29 OpenACC: Directive Status The PGI compiler has been providing full OpenACC support since release 12.6 (August 2012) Some of the features supposed to be implemented in 12.6 are: !$acc parallel reduction and !$acc loop reduction !$acc parallel private() and !$acc loop private() !$acc parallel present() and !$acc parallel present_or_...() !$acc parallel deviceptr() !$acc declare device_resident() and !$acc declare deviceptr() Actually, from the pgroup user forum: We're still missing a few OpenACC features and device_resident is one of them. We expect it to be in by the release
30 OpenACC: Runtime Routines acc_get_num_devices() acc_set_device_type() acc_get_device_type() acc_set_device_num() acc_get_device_num() acc_async_wait() acc_async_wait_all() acc_init() acc_shutdown() acc_on_device() acc_malloc() acc_free()
31 OpenACC: Environment Variables ACC_DEVICE_TYPE and ACC_DEVICE_NUM ACC_NOTIFY shows the list of launched kernels with detailed information about: the number of the device that executes the kernel the name of the function the kernel is launched from, the file that contains it and the line in the file the grid and block dimensions Example: export ACC_NOTIFY=1 file=.../testgpu.f90 function=testgpu line=18 device=0 grid=1x200 block=128 queue=0 PGI_ACC_TIME is equivalent to the flag -ta=nvidia,time Example: export PGI_ACC_TIME=1 18: region entered 1000 times time(us): total=536,077 init=177 region=535,900 kernels=11,627 data=286,750 w/o init: total=535,900 max=716 min=519 avg=535
32 OpenACC: Mandatory requirements Privatize arrays (scalars are private by default) Error: Parallelization would require privatization of array a(:) All loops must be rectangular Restructure linearized arrays with computed indices Error: Non-stride-1 accesses for array b Privatize live-out scalars Accelerator restriction: induction variable live-out from loop No function calls in directive regions (manually or automatically inline subroutines) Error: Accelerator region ignored Accelerator restriction: function/procedure calls are not supported Avoid print or write operations
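The first requirement can be illustrated with a small C sketch: the scratch array `tmp` is listed in a `private` clause so each iteration gets its own copy, which avoids the "would require privatization of array" error (the `smooth3` stencil routine is made up for this example).

```c
/* 3-point moving average over the interior points; tmp is a
   per-iteration scratch array and must be privatized.               */
void smooth3(int n, const double *in, double *out)
{
    double tmp[3];
    #pragma acc parallel loop private(tmp)
    for (int i = 1; i < n - 1; ++i) {
        tmp[0] = in[i - 1];
        tmp[1] = in[i];
        tmp[2] = in[i + 1];
        out[i] = (tmp[0] + tmp[1] + tmp[2]) / 3.0;
    }
}
```

Without the clause, all iterations would share one `tmp` and race on it; the compiler refuses to parallelize rather than produce wrong answers.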
33 OpenACC: Tips All parallel regions should contain a loop directive Always start with parallel loop (not just parallel) Always use : when shaping with an entire dimension (i.e. A(:,1:2)) Watch for runtime device errors, for example: Call to cumemcpydtoh returned error: Launch failed Call to cumemcpy2d returned error: Invalid value First get your code working without data regions, then add data regions: be aware of data movement leave data on the GPU across procedure boundaries First get your code working without async, then add async Use directive clauses to optimize performance
34 OpenACC: Compilation flags The minimum set of flags needed to use OpenACC directives is: -acc (enables OpenACC) -ta=nvidia (target) -ta=nvidia:cc20 (GPU capability, cc20 stands for compute capability 2.0) -ta=nvidia,time (enables Accelerator Kernel Timing data) -ta=nvidia,host (generates two versions of routines, one that runs on the host and one on the GPU) -Minline -Mipa=fast,inline,reshape (enables IPA, automatic inlining and array reshaping) -O2 at least (if less, -Mipa forces -O2 anyway) -Minfo=inline,accel (enables compiler feedback) Example: pgfortran -fast -O3 -acc -ta=host,nvidia:cc20,time -Minline -Mipa=fast,inline,reshape -Minfo=accel
35 OpenACC: Simple examples See live examples on PLX...
36 OpenACC: Performance Example: Jacobi relaxation calculation on a 4096 x 4096 mesh directory with source code on PLX: /plx/userexternal/lfranci0/stage/openacc/ wiki page with all details and results
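One interior sweep of such a Jacobi relaxation might look like this in C, a miniature sketch of the standard benchmark rather than the actual PLX source: each point is replaced by the average of its four neighbours, and a `max` reduction tracks the largest change for the convergence test.

```c
/* One Jacobi sweep on an n-by-m row-major grid (5-point stencil);
   boundary values are held fixed. Returns the maximum update.       */
double jacobi_sweep(int n, int m, const double *restrict a,
                    double *restrict anew)
{
    double err = 0.0;
    #pragma acc parallel loop collapse(2) reduction(max:err)
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < m - 1; ++j) {
            anew[i * m + j] = 0.25 * (a[(i + 1) * m + j] + a[(i - 1) * m + j]
                                    + a[i * m + j + 1] + a[i * m + j - 1]);
            double d = anew[i * m + j] - a[i * m + j];
            if (d < 0.0) d = -d;
            if (d > err) err = d;
        }
    return err;
}
```

In the benchmark, this sweep sits inside an iteration loop wrapped in an `!$acc data` / `#pragma acc data` region, so the two grids stay on the device until convergence.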
37 ECHO-GPU: Necessary rearrangements Many changes and rearrangements were necessary to allow all subroutines in the evolution routine to be run as GPU kernels: it was mandatory to use at least the -O2 optimization flag, interprocedural analysis and automatic inlining and reshaping some small subroutines, or subroutines used only once, were manually inlined to better manage the variables many do loops were re-arranged for synchronization purposes all the print statements inside parallel regions were removed a common statement was removed many exit statements were removed temporary arrays were substituted with scalars where possible many small do loops were manually unrolled many temporary arrays were privatized
38 ECHO-GPU: partial OpenACC implementation
39 ECHO-GPU: full OpenACC implementation only the primitive and conservative variables (together with the metric terms and the grid variables) are copied to the device at the beginning of the evolution and copied back to the host at the end of the simulation all the other variables are created directly on the GPU with !$acc create or !$acc declare device_resident, and then declared present in all the subroutines called inside evolve
40 ECHO-GPU: Performance Test run: 2D mesh, 120x50 points, tmax=0.005 ms Execution time CPU-version: sec GPU-version: sec (already achieved) GPU-version: 40 sec (theoretical but plausible value with a small further effort) Speedup achieved: 1.5x theoretical: 2.3x Real runs have much longer evolution times and finer grids, and in such cases the performance is expected to be better
41 ECHO-GPU: Short-term improvements Further improvements can be quite easily achieved by: extending the data region outside the main do while loop creating all the temporary variables directly in the device memory and copying only the primitive and conservative variable arrays together with the metric terms and grids avoiding the copy of some variables defined in inlined subfunctions using the new OpenACC features implemented in the PGI compiler release tuning the parallelization with the right choice of the numbers of threads and blocks
42 ECHO-GPU: Long-term improvements Further medium- and long-term possible improvements include: implementing an OpenMP parallelization, taking advantage of the work already done to use OpenACC directives using MPI to manage multiple GPUs moving from cylindrical coordinates to Cartesian coordinates implementing a Python user interface implementing parallel HDF5 I/O
43 Thank you for your attention Comments and suggestions are welcome and enjoy your acceleration! mail:
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationFrom Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation Erik Schnetter, Perimeter Institute with M. Blazewicz, I. Hinder, D. Koppelman, S. Brandt, M. Ciznicki, M.
More informationOpenACC Fundamentals. Steve Abbott November 15, 2017
OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)
More informationExperiences with CUDA & OpenACC from porting ACME to GPUs
Experiences with CUDA & OpenACC from porting ACME to GPUs Matthew Norman Irina Demeshko Jeffrey Larkin Aaron Vose Mark Taylor ORNL is managed by UT-Battelle for the US Department of Energy ORNL Sandia
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationProfiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationINTRODUCTION TO OPENACC
INTRODUCTION TO OPENACC Hossein Pourreza hossein.pourreza@umanitoba.ca March 31, 2016 Acknowledgement: Most of examples and pictures are from PSC (https://www.psc.edu/images/xsedetraining/openacc_may2015/
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2017 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationINTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC
INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES 3 APPROACHES TO GPU PROGRAMMING Applications Libraries Compiler Directives
More informationOpenACC Course Lecture 1: Introduction to OpenACC September 2015
OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationPGPROF OpenACC Tutorial
PGPROF OpenACC Tutorial Version 2017 PGI Compilers and Tools TABLE OF CONTENTS Chapter 1. Tutorial Setup...1 Chapter 2. Profiling the application... 2 Chapter 3. Adding OpenACC directives... 4 Chapter
More informationOptimizing OpenACC Codes. Peter Messmer, NVIDIA
Optimizing OpenACC Codes Peter Messmer, NVIDIA Outline OpenACC in a nutshell Tune an example application Data motion optimization Asynchronous execution Loop scheduling optimizations Interface OpenACC
More informationA Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA
A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région
More informationOpenACC introduction (part 2)
OpenACC introduction (part 2) Aleksei Ivakhnenko APC Contents Understanding PGI compiler output Compiler flags and environment variables Compiler limitations in dependencies tracking Organizing data persistence
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2018
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2018 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationTowards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA
Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle,
More informationHigh-order, conservative, finite difference schemes for computational MHD
High-order, conservative, finite difference schemes for computational MHD A. Mignone 1, P. Tzeferacos 1 and G. Bodo 2 [1] Dipartimento di Fisica Generale, Turin University, ITALY [2] INAF Astronomic Observatory
More informationAFOSR BRI: Codifying and Applying a Methodology for Manual Co-Design and Developing an Accelerated CFD Library
AFOSR BRI: Codifying and Applying a Methodology for Manual Co-Design and Developing an Accelerated CFD Library Synergy@VT Collaborators: Paul Sathre, Sriram Chivukula, Kaixi Hou, Tom Scogland, Harold Trease,
More informationA Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC
A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC Hisashi YASHIRO RIKEN Advanced Institute of Computational Science Kobe, Japan My topic The study for Cloud computing My topic
More informationSENSEI / SENSEI-Lite / SENEI-LDC Updates
SENSEI / SENSEI-Lite / SENEI-LDC Updates Chris Roy and Brent Pickering Aerospace and Ocean Engineering Dept. Virginia Tech July 23, 2014 Collaborations with Math Collaboration on the implicit SENSEI-LDC
More information6.1 Multiprocessor Computing Environment
6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationAddressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer
Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2
More informationOpenACC Accelerator Directives. May 3, 2013
OpenACC Accelerator Directives May 3, 2013 OpenACC is... An API Inspired by OpenMP Implemented by Cray, PGI, CAPS Includes functions to query device(s) Evolving Plan to integrate into OpenMP Support of
More informationAdvanced OpenMP. Lecture 11: OpenMP 4.0
Advanced OpenMP Lecture 11: OpenMP 4.0 OpenMP 4.0 Version 4.0 was released in July 2013 Starting to make an appearance in production compilers What s new in 4.0 User defined reductions Construct cancellation
More informationAcceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP
Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Zhe Weng and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationAn Introduction to OpenACC - Part 1
An Introduction to OpenACC - Part 1 Feng Chen HPC User Services LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 01-03, 2015 Outline of today
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationIntroduction to OpenACC
Introduction to OpenACC Alexander B. Pacheco User Services Consultant LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 10-12, 2013 HPC@LSU
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationParallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)
Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program
More informationOpenACC Fundamentals. Steve Abbott November 13, 2016
OpenACC Fundamentals Steve Abbott , November 13, 2016 Who Am I? 2005 B.S. Physics Beloit College 2007 M.S. Physics University of Florida 2015 Ph.D. Physics University of New Hampshire
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationOpenACC Support in Score-P and Vampir
Center for Information Services and High Performance Computing (ZIH) OpenACC Support in Score-P and Vampir Hands-On for the Taurus GPU Cluster February 2016 Robert Dietrich (robert.dietrich@tu-dresden.de)
More informationOpenMP - II. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16. HPAC, RWTH Aachen
OpenMP - II Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT
More informationAccelerating Harmonie with GPUs (or MICs)
Accelerating Harmonie with GPUs (or MICs) (A view from the starting-point) Enda O Brien, Adam Ralph Irish Centre for High-End Computing Motivation There is constant, insatiable demand for more performance
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationOpenACC 2.5 and Beyond. Michael Wolfe PGI compiler engineer
OpenACC 2.5 and Beyond Michael Wolfe PGI compiler engineer michael.wolfe@pgroup.com OpenACC Timeline 2008 PGI Accelerator Model (targeting NVIDIA GPUs) 2011 OpenACC 1.0 (targeting NVIDIA GPUs, AMD GPUs)
More informationIntroduction to Compiler Directives with OpenACC
Introduction to Compiler Directives with OpenACC Agenda Fundamentals of Heterogeneous & GPU Computing What are Compiler Directives? Accelerating Applications with OpenACC - Identifying Available Parallelism
More information