CAF versus MPI - Applicability of Coarray Fortran to a Flow Solver


1 CAF versus MPI - Applicability of Coarray Fortran to a Flow Solver
Manuel Hasert, Harald Klimach, Sabine Roller
m.hasert@grs-sim.de
Applied Supercomputing in Engineering

2 Motivation
We develop several CFD codes at our institute.
  Language: Fortran 95/2003
  Parallelization: MPI (90%), other (10%)
How do we have to design our codes so that they perform well on future architectures without too much porting effort (or even a re-design)?
This talk compares an MPI implementation against several CAF implementations with respect to:
  Performance
  Code complexity

3 Coarrays in Fortran (CAF)
Language-inherent parallel extension to Fortran
Partitioned Global Address Space (PGAS)
Direct access to remote memory locations is possible
The compiler handles the network access
GET or PUT access to remote locations:
    remote_copy = value1[2]   ! GET: read value1 from image (process) 2
    value2[3] = localvalue    ! PUT: write localvalue into value2 on image (process) 3
    sync all                  ! Barrier: synchronize all images
Only GET access is investigated here.
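The GET, PUT and barrier statements above can be tried in a small program. The following is a minimal sketch, not code from the talk: the module `ring_images`, its helpers `right_image` and `left_image`, and the ring-shaped neighbor choice are assumptions made for illustration. Each image reads from and writes to its right-hand neighbor, so the sketch also runs with a single image (for example with gfortran's `-fcoarray=single` option).

```fortran
module ring_images
  implicit none
contains
  ! Periodic right-hand neighbor of image `me` among `np` images (1-based).
  pure integer function right_image(me, np)
    integer, intent(in) :: me, np
    right_image = merge(1, me + 1, me == np)
  end function right_image

  ! Periodic left-hand neighbor of image `me` among `np` images (1-based).
  pure integer function left_image(me, np)
    integer, intent(in) :: me, np
    left_image = merge(np, me - 1, me == 1)
  end function left_image
end module ring_images

program caf_get_put
  use ring_images
  implicit none
  integer :: value1[*], value2[*]   ! coarrays: one integer per image
  integer :: remote_copy, me, np, right

  me = this_image()
  np = num_images()
  right = right_image(me, np)

  value1 = 10 * me
  value2 = 0
  sync all   ! every image has defined value1 and value2 before remote access

  remote_copy = value1[right]   ! GET: read from the right neighbor
  value2[right] = me            ! PUT: write into the right neighbor
  sync all   ! all PUTs are complete before value2 is read locally

  if (remote_copy /= 10 * right) error stop "GET returned wrong value"
  if (value2 /= left_image(me, np)) error stop "PUT delivered wrong value"
  if (me == 1) print *, "GET and PUT verified on", np, "image(s)"
end program caf_get_put
```

The two `sync all` statements separate the segments in which a variable is written from the segments in which another image reads it; without them the remote accesses would race with the local definitions.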

4 Algorithmic background
The implementation of the Lattice Boltzmann Method

5 The Lattice Boltzmann Code (LBC)
Computation domain
  Solves weakly compressible fluid problems
  Cartesian, uniform grid
  19 DOFs (degrees of freedom) per cell: f_i, i = 1, ..., 19
Compute kernel
  Stream-collide algorithm
  Nearest-neighbor information required
Parallel execution: divide the domain into sub-cubes and use ghost cells (simple approach)

Main loop:
    do itime = 1, maxtime
      call communicate( fin, fout )
      call compute( fin, fout )
    enddo

Compute kernel (element loop):
    do k=1,nz
    do j=1,ny
    do i=1,nx
      ! Streaming from offsets cx, cy, cz
      ftmp(:) = fin( :, i-cx(:), j-cy(:), k-cz(:))
      ...
      ! collide
      fout(:,i,j,k) = ftmp(:) - (1-omega)*(ftmp(:) - feq(:))
    enddo; enddo; enddo
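A serial stream-collide step for a 19-DOF cell can be sketched as follows. This is an illustration under stated assumptions, not the LBC source: the ordering of the D3Q19 offsets, the weights, the second-order equilibrium `feq`, and the names `lbm_kernel_sketch` and `rest_state_error` are not given on the slides. The collision line follows the formula on the slide. The streaming gather is written as an explicit loop over `l`, because vector subscripts in several dimensions at once would select a full cross product of indices in Fortran.

```fortran
module lbm_kernel_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: q = 19
  ! Assumed D3Q19 offsets: rest, 6 axis directions, 12 planar diagonals.
  integer, parameter :: cx(q) = [0, 1,-1, 0, 0, 0, 0, 1,-1, 1,-1, 1,-1, 1,-1, 0, 0, 0, 0]
  integer, parameter :: cy(q) = [0, 0, 0, 1,-1, 0, 0, 1,-1,-1, 1, 0, 0, 0, 0, 1,-1, 1,-1]
  integer, parameter :: cz(q) = [0, 0, 0, 0, 0, 1,-1, 0, 0, 0, 0, 1,-1,-1, 1, 1,-1,-1, 1]
  real(dp), parameter :: w0 = 1.0_dp/3.0_dp, w1 = 1.0_dp/18.0_dp, w2 = 1.0_dp/36.0_dp
  real(dp), parameter :: w(q) = [w0, w1, w1, w1, w1, w1, w1, &
                                 w2, w2, w2, w2, w2, w2, w2, w2, w2, w2, w2, w2]
contains
  ! One stream-collide sweep over the fluid cells; index 0 and n+1 are ghost cells.
  subroutine compute(fin, fout, nx, ny, nz, omega)
    integer, intent(in) :: nx, ny, nz
    real(dp), intent(in) :: fin(q, 0:nx+1, 0:ny+1, 0:nz+1)
    real(dp), intent(inout) :: fout(q, 0:nx+1, 0:ny+1, 0:nz+1)
    real(dp), intent(in) :: omega
    real(dp) :: ftmp(q), feq(q), rho, u(3), cu, usq
    integer :: i, j, k, l
    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          do l = 1, q                      ! streaming: pull from the upstream neighbor
            ftmp(l) = fin(l, i - cx(l), j - cy(l), k - cz(l))
          end do
          rho = sum(ftmp)                  ! density and velocity moments
          u = [sum(ftmp*cx), sum(ftmp*cy), sum(ftmp*cz)] / rho
          usq = dot_product(u, u)
          do l = 1, q                      ! assumed second-order equilibrium
            cu = cx(l)*u(1) + cy(l)*u(2) + cz(l)*u(3)
            feq(l) = w(l)*rho*(1.0_dp + 3.0_dp*cu + 4.5_dp*cu*cu - 1.5_dp*usq)
          end do
          fout(:, i, j, k) = ftmp - (1.0_dp - omega)*(ftmp - feq)   ! collide (slide formula)
        end do
      end do
    end do
  end subroutine compute

  ! Largest deviation from the rest state after one step on a uniform field at rest.
  function rest_state_error(nx, ny, nz, omega) result(err)
    integer, intent(in) :: nx, ny, nz
    real(dp), intent(in) :: omega
    real(dp) :: err
    real(dp), allocatable :: fin(:,:,:,:), fout(:,:,:,:)
    integer :: l
    allocate(fin(q, 0:nx+1, 0:ny+1, 0:nz+1), fout(q, 0:nx+1, 0:ny+1, 0:nz+1))
    do l = 1, q
      fin(l, :, :, :) = w(l)
    end do
    fout = 0.0_dp
    call compute(fin, fout, nx, ny, nz, omega)
    err = 0.0_dp
    do l = 1, q
      err = max(err, maxval(abs(fout(l, 1:nx, 1:ny, 1:nz) - w(l))))
    end do
  end function rest_state_error
end module lbm_kernel_sketch

program lbm_kernel_demo
  use lbm_kernel_sketch
  implicit none
  print *, "deviation from rest state:", rest_state_error(4, 4, 4, 0.5_dp)
end program lbm_kernel_demo
```

A fluid at rest with unit density is a fixed point of streaming and collision, so the printed deviation should be at the level of rounding error. In the parallel versions, the ghost layers of `fin` are what `communicate` has to fill before `compute` is called.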

6 Data Alignment in Memory
Cartesian grid structure with 19 DOFs/cell
Fits naturally into a four-dimensional array: fin( l, i, j, k ), fout( l, i, j, k )
  i, j, k: spatial position of the fluid cell
  l = 1..19: DOFs of each cell
The position of the DOF index l within the array can be varied. Access to all 19 DOFs of a cell then yields:
  DOF-first: stride = 1
  DOF-last: stride = number of cells
Considerations for parallel execution
  Collection of data at the border planes
  Smallest chunks in x-direction:
    DOF-first: 19 entries (the DOFs of one cell)
    DOF-last: 1 entry
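The two strides follow directly from Fortran's column-major storage order, in which the first index varies fastest. The sketch below computes the linear memory offset of an array element and from it the distance between two consecutive DOFs of one cell. The module `layout_sketch`, the function names, and the example grid size are hypothetical and chosen only for illustration.

```fortran
module layout_sketch
  implicit none
contains
  ! Zero-based linear offset of element idx in a column-major array of shape dims.
  pure integer function offset4(idx, dims)
    integer, intent(in) :: idx(4), dims(4)
    integer :: d, stride
    offset4 = 0
    stride = 1
    do d = 1, 4
      offset4 = offset4 + (idx(d) - 1) * stride
      stride = stride * dims(d)
    end do
  end function offset4

  ! Distance in memory between DOF 1 and DOF 2 of cell (1,1,1), layout fin(l,i,j,k).
  pure integer function dof_stride_first(nx, ny, nz)
    integer, intent(in) :: nx, ny, nz
    dof_stride_first = offset4([2, 1, 1, 1], [19, nx, ny, nz]) &
                     - offset4([1, 1, 1, 1], [19, nx, ny, nz])
  end function dof_stride_first

  ! Same distance for the layout fin(i,j,k,l).
  pure integer function dof_stride_last(nx, ny, nz)
    integer, intent(in) :: nx, ny, nz
    dof_stride_last = offset4([1, 1, 1, 2], [nx, ny, nz, 19]) &
                    - offset4([1, 1, 1, 1], [nx, ny, nz, 19])
  end function dof_stride_last
end module layout_sketch

program layout_demo
  use layout_sketch
  implicit none
  integer, parameter :: nx = 4, ny = 5, nz = 6
  print *, "DOF-first stride between DOFs:", dof_stride_first(nx, ny, nz)
  print *, "DOF-last  stride between DOFs:", dof_stride_last(nx, ny, nz)
end program layout_demo
```

For a border plane with fixed i, the DOF-first layout therefore delivers runs of 19 contiguous values, while in the DOF-last layout the element following in memory belongs to cell i+1, which is not part of the plane, so every run has length 1.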

7 Possible parallelizations
From Explicit to Implicit Implementations

8 Starting with two different parallelization schemes
Explicit communication
  Separate communication routine
  Ghost cells
  Sender-receiver approach
  Main loop:
    do itime = 1, maxtime
      call communicate( fin, fout )
      call compute( fin, fout )
    enddo
Implicit communication
  Remote data accessed within the compute routine
  No ghost cells
  Main loop:
    do itime = 1, maxtime
      call compute( fin, fout )
    enddo
Figure legend (explicit scheme, subdomains 1 and 2): ghost cells, fluid cells, same values
Figure legend (implicit scheme): fluid cell performing coarray access, fluid cell with local access, current fluid cell, neighbor of current cell accessed as coarray
Five different implementations between these two schemes are considered.

9 1: Traditional Parallelization Approach
Domain decomposition
Introduction of ghost cells
  No computation takes place on these cells
  They provide data for the fluid cells
Communication: MPI irecv-isend, waitall, communication buffers
A simple, non-optimized MPI communication is used to keep the variants comparable.
Figure legend (subdomains 1 and 2): ghost cells, fluid cells, same values

Main loop:
    do itime = 1, maxtime
      call communicate( fin, fout )
      call compute( fin, fout )
    enddo

Compute kernel (element loop):
    do k=1,nz
    do j=1,ny
    do i=1,nx
      ! Streaming from offsets cx, cy, cz
      ftmp(:) = fin( :, i-cx(:), j-cy(:), k-cz(:))
      ...
      ! collide
      fout(:,i,j,k) = ftmp(:) - (1-omega)*(ftmp(:) - feq(:))
    enddo; enddo; enddo
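The irecv-isend-waitall pattern with communication buffers might look like the sketch below. It is not the communication routine of the talk: the one-dimensional periodic decomposition in x, the buffer names, the message tags, and the helper module `ring_ranks` are assumptions for illustration. It needs an MPI library and is typically built with an MPI compiler wrapper.

```fortran
module ring_ranks
  implicit none
contains
  ! Periodic left and right neighbors of a zero-based MPI rank.
  pure integer function left_rank(rank, nprocs)
    integer, intent(in) :: rank, nprocs
    left_rank = modulo(rank - 1, nprocs)
  end function left_rank

  pure integer function right_rank(rank, nprocs)
    integer, intent(in) :: rank, nprocs
    right_rank = modulo(rank + 1, nprocs)
  end function right_rank
end module ring_ranks

program mpi_halo_sketch
  use mpi
  use ring_ranks
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: q = 19, nx = 4, ny = 4, nz = 4
  real(dp) :: fin(q, 0:nx+1, ny, nz)          ! DOF-first field with ghost layers in x
  real(dp), dimension(q, ny, nz) :: send_l, send_r, recv_l, recv_r
  integer :: rank, nprocs, left, right, n, ierr, req(4)
  logical :: ok

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  left = left_rank(rank, nprocs)
  right = right_rank(rank, nprocs)
  n = size(send_l)

  fin = real(rank, dp)
  send_l = fin(:, 1, :, :)                    ! pack the border planes into buffers
  send_r = fin(:, nx, :, :)

  ! Post the receives first, then the sends; tag 1 travels right, tag 2 travels left.
  call MPI_Irecv(recv_l, n, MPI_DOUBLE_PRECISION, left, 1, MPI_COMM_WORLD, req(1), ierr)
  call MPI_Irecv(recv_r, n, MPI_DOUBLE_PRECISION, right, 2, MPI_COMM_WORLD, req(2), ierr)
  call MPI_Isend(send_r, n, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, req(3), ierr)
  call MPI_Isend(send_l, n, MPI_DOUBLE_PRECISION, left, 2, MPI_COMM_WORLD, req(4), ierr)
  call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)

  fin(:, 0, :, :) = recv_l                    ! unpack into the ghost layers
  fin(:, nx+1, :, :) = recv_r

  ok = all(fin(:, 0, :, :) == real(left, dp)) .and. all(fin(:, nx+1, :, :) == real(right, dp))
  if (.not. ok) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  if (rank == 0) print *, "ghost layers filled on", nprocs, "process(es)"
  call MPI_Finalize(ierr)
end program mpi_halo_sketch
```

Using distinct tags for the two directions keeps the messages apart when the left and the right neighbor are the same process, as happens with two processes in a periodic ring.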

10 2, 3: Mimic MPI Parallelization with Coarrays
Computation kernel exactly as in the MPI version
Communication with coarray GET
2: Explicit buffered CAF
  Replace the communication itself by coarrays
  Use communication buffers
3: Explicit direct CAF
  Omit buffer usage
  Direct access to remote memory locations
  Stride becomes important
Figure legend (subdomains 1 and 2): ghost cells, fluid cells, same values

Main loop:
    do itime = 1, maxtime
      call communicate( fin, fout )
      call compute( fin, fout )
    enddo
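The difference between the buffered and the direct variant can be illustrated with a ghost-layer fill in one direction. The sketch below is an assumption-laden illustration rather than the code of the talk: the periodic one-dimensional arrangement of images, the array shapes, and the names `halo_images`, `buf` and `halo_direct` are hypothetical. Variant 2 packs the border plane into a contiguous coarray buffer on the owning image and fetches that buffer; variant 3 fetches the strided plane straight from the remote field.

```fortran
module halo_images
  implicit none
contains
  ! Periodic left-hand neighbor of image `me` among `np` images (1-based).
  pure integer function left_image(me, np)
    integer, intent(in) :: me, np
    left_image = merge(np, me - 1, me == 1)
  end function left_image
end module halo_images

program caf_halo_sketch
  use halo_images
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: q = 19, nx = 4, ny = 4, nz = 4
  real(dp), allocatable :: fin(:,:,:,:)[:]   ! DOF-first field with ghost layers in x
  real(dp), allocatable :: buf(:,:,:)[:]     ! contiguous border-plane buffer (variant 2)
  real(dp) :: halo_direct(q, ny, nz)         ! target of the direct GET (variant 3)
  integer :: me, np, left

  me = this_image()
  np = num_images()
  left = left_image(me, np)
  allocate(fin(q, 0:nx+1, ny, nz)[*])
  allocate(buf(q, ny, nz)[*])

  fin = real(me, dp)
  buf = fin(:, nx, :, :)    ! variant 2: the owner packs its border plane locally
  sync all                  ! fin and buf are defined everywhere before any GET

  ! Variant 2 (explicit buffered): one contiguous remote read.
  fin(:, 0, :, :) = buf(:, :, :)[left]

  ! Variant 3 (explicit direct): remote read of a strided section,
  ! here runs of q = 19 contiguous values.
  halo_direct = fin(:, nx, :, :)[left]
  sync all                  ! nobody modifies fin or buf while others still read

  if (any(fin(:, 0, :, :) /= real(left, dp))) error stop "buffered GET failed"
  if (any(halo_direct /= real(left, dp))) error stop "direct GET failed"
  if (me == 1) print *, "both halo variants verified on", np, "image(s)"
end program caf_halo_sketch
```

With a DOF-last layout the same direct GET would touch single, widely separated values, which is where the stride mentioned on the slide becomes important.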

11 Parallelization following the Coarray Concept
Naive CAF
  Communication inside the kernel's element loop
  Identification of the remote neighbor and its location
  Access to remote data in the streaming step
  All fluid cells are accessed as coarrays
Figure legend: fluid cell performing coarray access, fluid cell with local access, current fluid cell, neighbor of current cell accessed as coarray

Main loop:
    do itime = 1, maxtime
      call compute
    enddo

Compute kernel (naive Coarray Fortran implementation):
    do k=1,nz
    do j=1,ny
    do i=1,nx
      ! streaming step (get values from neighbors)
      do l=1,nnod   ! loop over densities in each cell
        xpos = mod( crd(1)*bnd(1) + i - cx( l, 1 ) - 1, bnd(1) ) + 1
        xp( 1 ) = ( crd(1)*bnd(1) + i - cx( l, 1 ) - 1 )/bnd(1) + 1
        ...   ! analogous for the other directions
        if( xp(1) .lt. 1 ) then ...   ! correct physical boundaries
        nbp = image_index( caf_cart_comm, xp(1:3) )   ! get image number
        ftmp( l ) = fin( l, xpos, ypos, zpos )[nbp]   ! coarray get
      enddo
      ...   ! collision
    enddo; enddo; enddo
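The index arithmetic in the streaming step maps a cell position and a streaming offset to the owning image and the local index on that image. The sketch below works this out for one direction under assumptions that go beyond the slide: image coordinates `crd` are taken as zero-based, the domain is treated as periodic across `nimg` images, and `modulo` is used instead of `mod` because Fortran's `mod` takes the sign of the dividend and would return a negative remainder at the lower border. The module and function names are hypothetical.

```fortran
module neighbor_lookup
  implicit none
contains
  ! Zero-based global index of the upstream cell, wrapped periodically.
  ! crd: zero-based image coordinate, bnd: cells per image, nimg: images in this
  ! direction, i: local cell index (1..bnd), c: streaming offset (-1, 0 or 1).
  pure integer function global_index(crd, bnd, nimg, i, c)
    integer, intent(in) :: crd, bnd, nimg, i, c
    global_index = modulo(crd*bnd + i - c - 1, bnd*nimg)
  end function global_index

  ! Local index (1..bnd) of the upstream cell on the image that owns it.
  pure integer function local_pos(crd, bnd, nimg, i, c)
    integer, intent(in) :: crd, bnd, nimg, i, c
    local_pos = mod(global_index(crd, bnd, nimg, i, c), bnd) + 1
  end function local_pos

  ! One-based coordinate (1..nimg) of the image that owns the upstream cell.
  pure integer function owner_coord(crd, bnd, nimg, i, c)
    integer, intent(in) :: crd, bnd, nimg, i, c
    owner_coord = global_index(crd, bnd, nimg, i, c)/bnd + 1
  end function owner_coord
end module neighbor_lookup

program neighbor_demo
  use neighbor_lookup
  implicit none
  integer, parameter :: bnd = 4, nimg = 2
  ! First cell on the first image, streaming offset +1: the upstream cell lies
  ! across the lower border, i.e. on the last image in a periodic setup.
  print *, "local index:", local_pos(0, bnd, nimg, 1, 1), &
           " owner coordinate:", owner_coord(0, bnd, nimg, 1, 1)
end program neighbor_demo
```

In a kernel like the one above, the three owner coordinates would be handed to `image_index` to obtain the image number for the coindexed access. Doing this arithmetic for every DOF of every cell is part of what makes the naive variant slow.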

12 Improved Coarray Implementation
Segmented CAF
  Segmented approach
  Loop over inner fluid cells without coarray access
  Border nodes include coarray access
  Preserves efficiency on purely local cells
Figure legend (subdomains 1 and 2): remote data necessary, local data only, current fluid cell, neighbors of current fluid cell

Main loop:
    do itime = 1, maxtime
      call compute
    enddo

Compute kernel (segmented Coarray Fortran implementation):
    call compute_caf(fout,fin,1,nx,1,ny,1,1)          ! kernel with CAF access on every cell
    do k=2,nz-1
      call compute_caf(fout,fin,1,nx,1,1,k,k)
      do j=2,ny-1
        call compute_caf(fout,fin,1,1,j,j,k,k)
        call compute_purelylocal(fout,fin,j,k)        ! serial kernel
        call compute_caf(fout,fin,nx,nx,j,j,k,k)
      end do
      call compute_caf(fout,fin,1,nx,ny,ny,k,k)
    end do
    call compute_caf(fout,fin,1,nx,1,ny,nz,nz)
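The call pattern of the segmented kernel can be checked for completeness without any coarray access. The sketch below replays the same sequence of index ranges but only counts visits per cell. The module `segmented_sketch`, the `mark` routine, and the assumption that `compute_purelylocal` covers i = 2 .. nx-1 of a row are illustrative and not taken from the talk.

```fortran
module segmented_sketch
  implicit none
contains
  ! Count one visit for every cell in the index box.
  pure subroutine mark(cnt, i0, i1, j0, j1, k0, k1)
    integer, intent(inout) :: cnt(:, :, :)
    integer, intent(in) :: i0, i1, j0, j1, k0, k1
    cnt(i0:i1, j0:j1, k0:k1) = cnt(i0:i1, j0:j1, k0:k1) + 1
  end subroutine mark

  ! Replays the segmented sweep and returns
  ! [visits by the CAF kernel, visits by the local kernel, cells not visited exactly once].
  pure function segmented_counts(nx, ny, nz) result(c)
    integer, intent(in) :: nx, ny, nz
    integer :: c(3)
    integer :: remote(nx, ny, nz), local(nx, ny, nz)
    integer :: j, k
    remote = 0
    local = 0
    call mark(remote, 1, nx, 1, ny, 1, 1)            ! bottom plane
    do k = 2, nz - 1
      call mark(remote, 1, nx, 1, 1, k, k)           ! first row of the plane
      do j = 2, ny - 1
        call mark(remote, 1, 1, j, j, k, k)          ! first cell of the row
        call mark(local, 2, nx - 1, j, j, k, k)      ! inner cells, purely local
        call mark(remote, nx, nx, j, j, k, k)        ! last cell of the row
      end do
      call mark(remote, 1, nx, ny, ny, k, k)         ! last row of the plane
    end do
    call mark(remote, 1, nx, 1, ny, nz, nz)          ! top plane
    c = [sum(remote), sum(local), count(remote + local /= 1)]
  end function segmented_counts
end module segmented_sketch

program segmented_demo
  use segmented_sketch
  implicit none
  integer :: c(3)
  c = segmented_counts(5, 4, 6)
  print *, "CAF-kernel cells:", c(1), " local cells:", c(2), " miscounted cells:", c(3)
end program segmented_demo
```

For a 5 x 4 x 6 block, 24 inner cells are handled by the local kernel and the remaining 96 border cells by the coarray kernel. The share of border cells shrinks as the block grows, which is what preserves the serial efficiency of the inner loop.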

13 Tested Implementations
We now compare the performance of these implementations (x = yes, - = no):

Nr  Name        Paradigm  Separate communication  Buffer usage  Halos
1   MPI         MPI       x                       x             x
2   expl buf    CAF       x                       x             x
3   expl dir    CAF       x                       -             x
4   impl segm   CAF       -                       -             -
5   impl naive  CAF       -                       -             -

14 Performance Results
Serial Performance and Scalability


15 Used Hardware
Criteria
  Hardware / software support for PGAS
  Gemini has been optimized for high throughput of small messages
  Potentially many small data packages to transfer
Comparison of Seastar with Gemini:

                  XT5m             XE6
CPU               Barcelona 23 C2  Magny Cours 6128
Cores             4                8
GHz               2.4              2
L2 Cache          512 KB           512 KB
L3 Cache          2 MB             12 MB
Sockets per node  2                2
Memory            16 GB            32 GB
ASIC              Seastar          Gemini
Compiler version
MPI version       Cray MPT         Cray MPT

16 Serial Performance and Impact of Memory Layout
DOF-first
  Performance depends on the domain size, especially when running in cache
  Smallest chunk is 19 values: choose this layout
DOF-last
  Weaker impact of cache, but cache thrashing; better for large problems
  Smallest chunk is only 1 value, with severe impact on Seastar

17 Strong Scaling Results
Total fluid cells: 8 million
Three-dimensional domain decomposition with the same number of processes in each direction (2^3, 3^3, ...)
Drastic improvements from Seastar to Gemini
[Plot: methods 1 to 5 of slide 13 over the number of processes p; marked operating points at 17^3 cells/proc and 9^3 cells/proc]

18 Weak Scaling Results
MPI outperforms CAF for large problem sizes
9^3 = 729 cells/process: latency region
One-dimensional domain decomposition for p < 8
Improvement from Seastar to Gemini
[Plot: methods 1 to 5 of slide 13 over the number of processes p]

19 Programming Complexity
Compute kernel
  MPI: same as serial
  CAF: identification of neighbor process and remote address inside the kernel -> slow and obscuring; potential for compiler optimizations
Data structure
  MPI: derived-type buffers: flexible and fast
  CAF: regular coarrays: fast but inflexible; derived-type coarrays: flexible but slow on XT
Parallel infrastructure
  Neighbor and position identification with Cartesian communicator
Data validity
  MPI: given after communication
  CAF: has to be ensured by sync statements
Restrictions
  CAF: sync is slow; arrays must have the same size

20 Conclusion and Outlook
Code complexity
  A first parallelization is fairly easy to implement
  Achieving higher performance requires complex code
  Kernel code gets obscured
Performance
  MPI yields the best performance for all tested cases
  Usage of coarrays might become beneficial for a very large number of cores
Result
  At the current stage, coarrays bring more drawbacks than advantages for complex scientific codes
  Coarrays can be used on Gemini without severe implications
  Although some tasks are hidden from the user, other tasks arise
  No detailed performance tuning is possible for the user

21 Thank you!


CAF versus MPI - Applicability of Coarray Fortran to a Flow Solver
Manuel Hasert, Harald Klimach, Sabine Roller
German Research School for Simulation Sciences GmbH, 52062 Aachen, Germany; RWTH Aachen