CAF versus MPI - Applicability of Coarray Fortran to a Flow Solver

Manuel Hasert, Harald Klimach, Sabine Roller

1 German Research School for Simulation Sciences GmbH, 52062 Aachen, Germany
2 RWTH Aachen University, 52062 Aachen, Germany

Abstract. We investigate how to use coarrays in Fortran (CAF) for parallelizing a flow solver and the capabilities of current compilers with coarray support. Usability and performance of CAF in mesh-based applications are examined and compared to traditional MPI strategies. We analyze the influence of the memory layout, the usage of communication buffers against direct access to the data used in the computation, and different methods of the communication itself. Our objective is to provide insights on how common communication patterns have to be formulated when using coarrays.

Keywords: PGAS, Coarray Fortran, MPI, Performance Comparison

1 Introduction

Attempts to exploit parallelism in computing devices automatically have always been made, and compilers have done so successfully in restricted forms such as vector operations. For more general parallelization concepts with multiple instructions on multiple data (MIMD), automation was less successful, and programmers have to support the compiler with directives, as in OpenMP. The Message Passing Interface (MPI) offers a rich set of functionality for MIMD applications on distributed systems, and high-level parallelization is supplied by library APIs. The parallelization, however, has to be elaborated in detail by the programmer. Recently, an increasing effort in language-inherent parallelism has been made to leave parallel implementation details to the compiler. A concept developed in detail is the partitioned global address space (PGAS), which was brought into the Fortran 2008 standard with the notion of coarrays. Parallel features become intrinsic language properties and allow a high-level formulation of parallelism in the language itself [?]. Coarrays minimally extend Fortran to allow the creation of parallel programs with minor modifications to the sequential code. The optimistic goal is a language which inherits parallelism and allows the compiler to consider serial aspects and communication together for code optimization. In CAF, shared data objects are indicated by an additional index in square brackets, which specifies the remote location, in terms of the process (image) number, of the shared variable.
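
To make the square-bracket notation concrete, the following minimal, self-contained sketch (not taken from the paper; names are illustrative) lets every image hold one block of a 1D field and read a halo value from its left neighbor:

program caf_halo_sketch
  implicit none
  integer, parameter :: nx = 8
  real    :: u(0:nx+1)[*]     ! coarray: one block per image plus halo cells
  integer :: me, np, left

  me = this_image()
  np = num_images()
  u(1:nx) = real(me)          ! fill the local block

  sync all                            ! remote data must be valid before the get
  left = merge(np, me-1, me == 1)     ! periodic left neighbor
  u(0) = u(nx)[left]                  ! square brackets select the remote image
  sync all

  if (me == 1) print *, 'halo value on image 1:', u(0)
end program caf_halo_sketch

The coindexed read u(nx)[left] is the only syntactic difference from serial Fortran; the surrounding sync all statements provide the ordering that MPI would otherwise obtain from its message semantics.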

Contrary to MPI, no collective operations are defined in the current standard. These have been shown to constitute a large part of today's total communication on supercomputers [9], thus posing a severe limitation to a pure coarray parallelization, which is a major point of criticism by Mellor-Crummey et al. [8]. Only few publications actually give advice on how to use coarrays in order to obtain a fast and scalable parallel code. Ashby & Reid [1] ported an MPI flow solver to coarrays, and Barrett [2] uses a finite difference scheme to assess the performance of several coarray implementations. The goal of this paper is to compare several parallelization implementations with coarrays and MPI. We compare these approaches in terms of performance, flexibility and ease of programming. Speedup studies are performed on two machines with different network interfaces. The work concentrates on the implementation provided by Cray.

2 Numerical Method

We use a lattice Boltzmann method (LBM) fluid solver as a testbed. This explicit numerical scheme uses direct-neighbor stencils and a homogeneous, Cartesian grid. The numerical algorithm consists of streaming and collision, performed in each time step (Listing 1.1). The cell-local collision mimics particle interactions, and streaming represents the free motion of particles, consisting of pure memory copy operations from nearest-neighbor cells. Neighbors are directly accessed on a cubic grid, which is subdivided into rectangular sub-blocks for parallel execution.

do k=1,nz; do j=1,ny; do i=1,nx
  ftmp(:) = fin(:,i-cx(:),j-cy(:),k-cz(:))  ! advection from offsets cx, cy, cz
  ...
  ! Double buffering: read values from fin, work on ftmp and write to fout
  fout(:,i,j,k) = ftmp(:) - (1-omega)*(ftmp(:)-feq(:))  ! collide
enddo; enddo; enddo

Listing 1.1. Serial stream collide routine

2.1 Alignment in Memory

The Cartesian grid structure naturally maps to a four-dimensional array, where the indices {i, j, k} represent the fluid cells' spatial coordinates. Each cell holds n_nod = 19 density values, represented by the index l, whose position in the array can be chosen. Communication then involves strided data access, where the strides depend on the direction and the memory layout. There are two main arrays, fin and fout, which by turns hold the state of the current time step. These arrays are of size {n_nod, n_x, n_y, n_z}, depending on the position of l. In Fortran the first index is aligned consecutively in memory, yielding stride-one access. With the density-first layout lijk, the smallest data chunk for communication has at least n_nod consecutive memory entries. The smallest memory chunks of n_nod entries for communication occur in the x-direction. Communication in the y-direction involves chunks of n_nod * n_x entries and in the z-direction chunks of n_nod * n_x * n_y entries. When the density is stored last (ijkl), the x-direction again involves the smallest data chunks, but only of a single memory entry with strides of n_stride = n_x * n_y * n_z.
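
The contiguity argument can be made concrete with the two array shapes; the following sketch is illustrative only (the array names f_lijk and f_ijkl are our own, not from the paper):

program layout_sketch
  implicit none
  integer, parameter :: n_nod = 19, nx = 100, ny = 100, nz = 100
  real, allocatable :: f_lijk(:,:,:,:)   ! density-first: (n_nod, nx, ny, nz)
  real, allocatable :: f_ijkl(:,:,:,:)   ! density-last:  (nx, ny, nz, n_nod)

  allocate(f_lijk(n_nod,nx,ny,nz), f_ijkl(nx,ny,nz,n_nod))

  ! x-face exchange with lijk: each cell contributes n_nod contiguous reals
  !   f_lijk(:,1,:,:)
  ! x-face exchange with ijkl: single reals; the n_nod densities of one cell
  ! lie nx*ny*nz elements apart in memory
  !   f_ijkl(1,:,:,:)
end program layout_sketch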

2.2 Traditional Parallelization Approach

A time-dependent mesh-based flow solver is usually parallelized following the SPMD (single program multiple data) approach. All p processes execute the same program but work on different data. In the LBM, each fluid cell requires the same computing effort per time step. For ideal load balancing, the work is distributed equally among the processes, and a simple domain decomposition can be performed, splitting the regular Cartesian mesh into p equal sub-domains. Updating a given cell requires the information from the neighboring cells and from the cell itself at the previous iteration. At the borders between sub-domains it is therefore necessary to exchange data in every iteration. A common approach is to use halo cells, which are not updated by the scheme itself but provide valid data from the remote processes. This allows the actual update procedure to act just like in the serial algorithm. With MPI, data is exchanged before the update of the local cells at each time step, usually with a non-blocking point-to-point exchange.

2.3 Strategy Following the Coarray Concept

Using coarrays, data can be accessed directly in the neighbor's memory without using halos. Coarray data objects must have the same size and shape on each process. The cells on remote images can then be accessed from within the algorithm like local ones, but with the additional process address (nbp) appended. Synchronization is required between time steps to ensure data consistency across all processes, but no further communication statements are necessary. We present different approaches to coarray implementations.

In the Naive Coarray Approach, every streaming operation, i.e. every copy from a neighbor to the local cell, is done by a coarray access, even for values which are found locally. This requires either the calculation of the address of the requested value, or a lookup table for this information. Both approaches result in additional run-time costs, and the calculation of the neighbor process number and the position there obscures the kernel code.

do k=1,nz; do j=1,ny; do i=1,nx
  ! streaming step (get values from neighbors)
  do l=1,nnod   ! loop over densities in each cell
    xpos  = mod(crd(1)*bnd(1)+i-cx(l,1)-1, bnd(1)) + 1
    xp(1) =    (crd(1)*bnd(1)+i-cx(l,1)-1) / bnd(1) + 1
    ...                               ! analogous for the other directions
    if(xp(1) .lt. 1) then ...         ! correct physical boundaries
    nbp = image_index( caf_cart_comm, xp(1:3) )   ! get image number
    ftmp(l) = fin(l,xpos,ypos,zpos)[nbp]          ! coarray get
  enddo
  ...   ! collision
enddo; enddo; enddo

Listing 1.2. Naive streaming step with coarrays

The copy itself is easy to implement, but the logic for getting the remote data address requires quite some effort (Listing 1.2). Significant overhead is generated by the coarray access itself and the repetitive address calculations. If the position of each neighbor is determined in advance and saved, memory demand and, more importantly, memory accesses increase.
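
The lookup-table alternative mentioned above would replace the index arithmetic of Listing 1.2 by table reads; the fragment below is a hedged sketch under assumed table names (nb_img, nb_pos), not code from the paper:

! Illustrative lookup-table variant (assumed names nb_img, nb_pos):
! image number and remote position are precomputed per cell and direction.
do k=1,nz; do j=1,ny; do i=1,nx
  do l=1,nnod
    nbp     = nb_img(l,i,j,k)                 ! precomputed image number
    ftmp(l) = fin(l, nb_pos(1,l,i,j,k), &
                     nb_pos(2,l,i,j,k), &
                     nb_pos(3,l,i,j,k))[nbp]  ! coarray get as in Listing 1.2
  enddo
  ...   ! collision as before
enddo; enddo; enddo

The integer loads per density that replace the address calculation illustrate why this variant increases memory demand and memory traffic.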

As the LBM is a memory-bandwidth-bound algorithm, this puts even more pressure on the memory interface.

With the Segmented Coarray Approach, the inner nodes of each partition are treated as in the serial version. Coarray access is only used where necessary, i.e. on the interface cells. Separate subroutines are defined for the coarray and non-coarray streaming access, which are called according to the current position in the fluid domain (Listing 1.3). With this approach, there are two kernels, one with the additional coarray access as above, plus the loop nest for selecting the appropriate kernel. This raises the required lines of code again.

call stream_collide_caf(fout,fin,1,nx,1,ny,1,1)
do k=2,nz-1
  call stream_collide_caf(fout,fin,1,nx,1,1,k,k)
  do j=2,ny-1
    call stream_collide_caf(fout,fin,1,1,j,j,k,k)
    call stream_collide_sub(fout,fin,j,k)
    call stream_collide_caf(fout,fin,nx,nx,j,j,k,k)
  end do
  call stream_collide_caf(fout,fin,1,nx,ny,ny,k,k)
end do
call stream_collide_caf(fout,fin,1,nx,1,ny,nz,nz)

Listing 1.3. Segmented stream collide with coarrays

3 Tested Communication Schemes

3.1 Data Structures

Message-passing-based parallelization requires data structures for collecting, sending and receiving the data. With MPI, MPI types or regular arrays can be used, whereas with CAF, regular arrays or derived types containing arrays can be employed. With Regular Global Arrays and the same-size restriction for coarrays, separate data objects for each neighbor buffer are required. This applies both to send and receive buffers, which increases implementation complexity. The usage of Derived Types provides the programmer with flexibility, as the arrays inside the coarray derived types do not have to be of the same size. Before each communication, or alternatively at every array (de)allocation, information about the array size of every globally accessible data object has to be made visible to all processes.

! Regular global arrays as coarrays
real,dimension(:,:,:,:)[:],allocatable :: caf_snd1, ..

! Derived types
type caf_dt      ! Coarray derived type with regular array inside
  real,dimension(:,:,:,:),allocatable :: send
end type caf_dt
type(caf_dt) :: buffer[*]      !< Coarray definition

type reg_dt      ! Regular derived type with coarrays inside
  real,dimension(:,:,:,:)[:],allocatable :: snd1, .. snd19
end type reg_dt
type(reg_dt) :: buffer         !< Regular array definition

Listing 1.4. Derived type and regular global coarrays
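
To illustrate the size-exchange requirement for the derived-type variant, the following sketch (our own; names such as nnod, nx_loc and nbp are assumed) publishes the local buffer extents in an integer coarray so that a neighbor knows how much to fetch from the caf_dt component of Listing 1.4:

! Illustrative sketch: publish per-image buffer extents before remote gets.
integer           :: buf_shape(4)[*]      ! extents of the local send buffer
integer           :: n(4)
type(caf_dt)      :: buffer[*]            ! derived type from Listing 1.4
real, allocatable :: tmp(:,:,:,:)

allocate(buffer%send(nnod, nx_loc, ny_loc, nz_loc))  ! sizes may differ per image
buf_shape = shape(buffer%send)            ! make the local extents visible
sync all                                  ! extents and data valid everywhere

n = buf_shape(:)[nbp]                     ! read the neighbor's extents
allocate(tmp(n(1), n(2), n(3), n(4)))
tmp = buffer[nbp]%send                    ! fetch the remote component

Without such an exchange, or a query at every (de)allocation, a remote image cannot know the extent of the component it is about to read.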

       CPU         Rev    Cores  GHz  L2 (KB)  L3 (MB)  Memory  ASIC      CCE    MPI
XT5m   Shanghai    23C2   4      2.4  512      6        16 GB   SeaStar2  7.2.4  MPT 5.0.0
XE6    MagnyCours  6128   8      2.0  512      12       32 GB   Gemini    7.3.0  MPT 5.1.1

Table 1. Machine setup

3.2 Buffered and Direct Communication

The usage of explicit buffers, i.e. halos, requires separate send and receive buffers of potentially different sizes. The communication is done in an explicit, dedicated routine; before it starts, the required data is collected and placed into the halo buffer, from where it is put back into the main array after the communication. One-sided communication models allow direct remote memory access to locations which are not available to the local process, but where access is routed through the network. This allows the construction of either explicit message passing of buffers or access to remote data within the kernel itself, here referred to as implicit access.

4 Experimental Results

We performed investigations on Cray XT and XE systems (see Table 1), as they are among the few that support PGAS both in hardware and in the compiler. These architectures mainly differ in the application-specific integrated circuits (ASIC) [?], which connect the processors to the system network and offload communication functions from the processor. The Cray XT5m nodes use SeaStar ASICs, which contain, among others, a direct memory access engine to move data in the local memory, a router connecting to the system network and a remote memory access engine [3]. The XE6 nodes are equipped with the Gemini ASIC [?], which supports a global address space and is optimized for efficient one-sided point-to-point communication with a high throughput for small messages. We used the Cray Fortran compiler from the Cray Compiling Environment (CCE). We first evaluate the single-core performance for various domain sizes and memory layouts, from which we choose a suitable method for the parallel scaling runs. Coarrays and MPI are then compared on both machines. We perform a three-dimensional domain decomposition, for p > 8 with an equal number of subdivisions in all three directions. For p ≤ 8, the domain is split in the z-direction only.

4.1 Influence of the Memory Layout in Serial and Parallel

The serial performance is measured with the physics-independent quantity million lattice updates per second (MLUPs) as a function of the total number of fluid cells n_tot. Performance studies of the LBM by Donath [6] have revealed a significant influence of memory layout and domain size. With the density-first layout lijk, the performance decreases with increasing domain size. In Fig. 1 (left), the cache hierarchy is clearly visible in terms of different MLUPs levels, especially for lijk. For the density-later layouts iljk, ijlk and ijkl, the performance is relatively constant, with cache thrashing occurring for ijkl.
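
For reference, the MLUPs rate used above follows the usual definition (our formula; the paper does not spell it out): MLUPs = n_tot * n_steps / (t_run * 10^6), where n_steps is the number of time steps and t_run the measured run time in seconds.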

Fig. 1. Impact of memory layout in serial (XT5m; MLUPs over lg10(n_tot)) and parallel (direct CAF; total run time [s] over p)

The memory layout not only plays a large role for serial execution, but also when data is collected for communication. Depending on the data alignment in memory, the compiler has to gather data with varying strides, resulting in a large discrepancy between the communication times in different directions. In Fig. 1 (right), all layouts scale linearly for a one-dimensional domain decomposition with p ≤ 8 and a domain size of n_tot = 200^3. Involving more directions leads to a strong degradation of performance on the XT5m for the ijkl layout, probably due to the heavily fragmented memory that needs to be communicated. The smallest chunk in lijk remains 19 entries, as all densities are needed for communication and are stored consecutively in memory. On the XE6 with the new programming environment, this issue seems to be resolved.

4.2 Derived Type and Regular Coarray Buffers

Fig. 2. Derived type coarrays (total run time [s] over p)

On the XT5m there is an obvious difference between the two implementations (Fig. 2). The regular array implementation scales in a nearly linear fashion. The run time of the derived type version even increases linearly when using more processes within a single node. When using the network, it scales nearly linearly for p ≥ 16. This issue also seems to be resolved on the new XE6 architecture: within a single node, there are virtually no differences between the two variants. However, the derived type variant seems to scale slightly worse beyond a single node.

4.3 MPI Compared to Coarray Communication

Here we compare various coarray schemes to regular non-blocking, buffered MPI communication. We start with a typical explicit MPI scheme and work our way towards an implicit communication scheme with coarrays. The MPI and MPI-style CAF schemes handle the communication in a separate routine. The implicit coarray schemes perform the access to remote data during the streaming step. We use the lijk memory layout, where all values of one cell are stored contiguously, which results in a minimal data pack of 19 values. Strong and weak scaling experiments are performed for the following communication schemes:

1. Explicit MPI: buffered isend-irecv
2. Explicit CAF buffered: same as MPI but with coarray GET to fill the buffers (sketched below)
3. Explicit CAF direct access: no buffers but direct remote data access
4. Implicit CAF segmented loops: coarray access on border nodes only
5. Implicit CAF naive: coarray access on all nodes
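
Scheme 2 keeps the MPI communication structure and only swaps the transport; the fragment below is a hedged sketch with assumed buffer and neighbor names (recv_xlo, caf_snd_xhi, nb_xlo, ...), not code from the paper:

! Illustrative sketch of scheme 2: fill the local receive halos with
! coarray gets from the neighbors' send buffers.
sync all                                  ! neighbors' send buffers are filled
recv_xlo = caf_snd_xhi(:,:,:,:)[nb_xlo]   ! get from lower-x neighbor
recv_xhi = caf_snd_xlo(:,:,:,:)[nb_xhi]   ! get from upper-x neighbor
! ... analogous gets for the y and z faces ...
sync all                                  ! all gets done before buffers are reused
! unpack the receive buffers into the halo layer of fin,
! then run the serial kernel of Listing 1.1 unchanged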

Strong scaling. A fluid domain with a total of 200^3 grid cells is employed for all process counts. In Fig. 3 the total execution time of the main loop is shown as a function of the number of processes. Increasing the number of processes decreases the domain size per process, by which the execution time decreases. Ideal scaling is plotted for comparison.

Fig. 3. Strong scaling comparison of MPI and CAF with lijk layout and n_tot = 200^3

The MPI implementation is fastest and scales close to ideal on both machines, but the scaling stalls for p > 2000. The coarray implementations show that huge progress was made from the XT5m to the XE6. The implicit coarray access schemes show a much slower serial run time. The naive coarray implementation is slower than the explicit ones, even by a factor of 30, due to coarray accesses to local data, although it scales perfectly on the XE6. On the XE6, for p > 10000, all schemes tend to converge towards the naive implementation, which is expected, as this approach pretends all neighbors to be remote, essentially resulting in single-cell partitions. However, this results in a very low parallel efficiency with respect to the fastest serial implementation. Due to an unexplained loss of scaling in the MPI implementation beyond 2000 processes, coarrays mimicking the MPI communication pattern even get slightly faster in this range.

Weak scaling. The increasingly high parallelism due to power and clock frequency limitations [7], combined with limited memory resources, leads to extremely distributed systems. Small computational domains of n = 9^3 fluid cells, which fit completely into cache, are used to anticipate this trend in our analysis (Fig. 4). With such domain sizes, latency effects prevail. The MPI parallelization scales similarly on both machines and yields the best run time among the tested schemes. Both explicitly buffered coarray schemes scale nearly perfectly for p > 64, whereas implicit coarray addressing within the algorithm clearly loses in all respects.

Fig. 4. Weak scaling in three directions with n = 9^3 cells per process (latency area)
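
The parallel efficiency mentioned above is understood in the usual sense (our notation; the paper does not define it explicitly): speedup S(p) = T_s / T(p) and efficiency E(p) = S(p) / p, where T_s is the run time of the fastest serial implementation and T(p) the run time on p processes. Measuring E(p) against the fastest serial code is what makes the naive scheme look poor despite its good scaling behavior.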

5 Conclusion

We presented different approaches to parallelizing a fluid solver using coarrays in Fortran and compared their ease of programming and performance to traditional parallelization schemes with MPI. The code complexity of simple and slow CAF implementations is low, but it quickly increases with the constraint of high performance. We showed that the achievable performance of applications using coarrays depends on the definition of the data structures. The analysis indicates that it might become beneficial to use coarrays for very large numbers of processes; however, on the systems available today, MPI communication provides the highest performance and needs to be mimicked in coarray implementations.

Acknowledgments

We thank the DEISA Consortium (www.deisa.eu), funded through the EU FP7 project RI-222919, for support within the DEISA Extreme Computing Initiative. We also want to express our grateful thanks to M. Schliephake from the PDC KTH Stockholm for providing us with the scaling runs on the XE6 system Lindgren, and to U. Küster from HLRS Stuttgart for insightful discussions.

References

1. J. V. Ashby and J. K. Reid. Migrating a Scientific Application from MPI to Coarrays. Technical report, Comp. Sci. Eng. Dept., STFC Rutherford Appleton Laboratory, 2008.
2. R. Barrett. Co-Array Fortran Experiences with Finite Differencing Methods. Technical report, 2006.
3. R. Brightwell, K. Pedretti, and K. D. Underwood. Initial Performance Evaluation of the Cray SeaStar Interconnect. Proc. 13th Symp. on High Performance Interconnects, 2005.
4. CRAY. Cray XT System Overview, Jun 2009.
5. CRAY. Using the GNI and DMAPP APIs. User Manual, 2010.
6. S. Donath. On Optimized Implementations of the Lattice Boltzmann Method on Contemporary High Performance Architectures. Master's thesis, Friedrich-Alexander-University Erlangen-Nürnberg, 2004.
7. P. Kogge. Exascale Computing Study: Technology Challenges in Achieving Exascale Systems. Technical report, Information Processing Techniques Office and Air Force Research Lab, 2008.
8. J. Mellor-Crummey, L. Adhianto, and W. Scherer. A Critique of Co-array Features in Fortran 2008. Working Draft J3/07-007r3, 2008.
9. R. Rabenseifner. Optimization of Collective Reduction Operations. International Conference on Computational Science, 2004.
10. J. Reid. Coarrays in the Next Fortran Standard, Mar 2009.