CAF versus MPI - Applicability of Coarray Fortran to a Flow Solver

Manuel Hasert, Harald Klimach, Sabine Roller

1 German Research School for Simulation Sciences GmbH, 52062 Aachen, Germany
2 RWTH Aachen University, 52062 Aachen, Germany

Abstract. We investigate how to use coarrays in Fortran (CAF) for parallelizing a flow solver and the capabilities of current compilers with coarray support. Usability and performance of CAF in mesh-based applications are examined and compared to traditional MPI strategies. We analyze the influence of the memory layout, the usage of communication buffers against direct access to the data used in the computation, and different methods of the communication itself. Our objective is to provide insights on how common communication patterns have to be formulated when using coarrays.

Keywords: PGAS, Coarray Fortran, MPI, Performance Comparison

1 Introduction

Attempts to exploit parallelism in computing devices automatically have always been made, and compilers have done so successfully in restricted forms such as vector operations. For more general parallelization concepts with multiple instructions on multiple data (MIMD), automation was less successful, and programmers have to support the compiler with directives, as in OpenMP. The Message Passing Interface (MPI) offers a rich set of functionality for MIMD applications on distributed systems, and high-level parallelization is supplied by library APIs. The parallelization, however, has to be elaborated in detail by the programmer. Recently, an increasing effort in language-inherent parallelism has been made to leave parallel implementation details to the compiler. A concept developed in detail is the partitioned global address space (PGAS), which was brought into the Fortran 2008 standard with the notion of coarrays. Parallel features become intrinsic language properties and allow a high-level formulation of parallelism in the language itself [?]. Coarrays minimally extend Fortran to allow the creation of parallel programs with minor modifications to the sequential code. The optimistic goal is a language which inherits parallelism and allows the compiler to consider serial aspects and communication together for code optimization. In CAF, shared data objects are indicated by an additional index in square brackets, which specifies the remote location, in terms of the process (image) number, of the shared variable.
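
To make the square-bracket notation concrete, the following minimal, self-contained sketch (not taken from the paper; names are illustrative) lets every image hold one block of a 1D field and read a halo value from its left neighbor:

program caf_halo_sketch
  implicit none
  integer, parameter :: nx = 8
  real    :: u(0:nx+1)[*]     ! coarray: one block per image plus halo cells
  integer :: me, np, left

  me = this_image()
  np = num_images()
  u(1:nx) = real(me)          ! fill the local block

  sync all                            ! remote data must be valid before the get
  left = merge(np, me-1, me == 1)     ! periodic left neighbor
  u(0) = u(nx)[left]                  ! square brackets select the remote image
  sync all

  if (me == 1) print *, 'halo value on image 1:', u(0)
end program caf_halo_sketch

The coindexed read u(nx)[left] is the only syntactic difference from serial Fortran; the surrounding sync all statements provide the ordering that MPI would otherwise obtain from its message semantics.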

Contrary to MPI, no collective operations are defined in the current standard. These have been shown to constitute a large part of today's total communication on supercomputers [9], thus posing a severe limitation to a pure coarray parallelization, which is a major point of criticism by Mellor-Crummey et al. [8]. Only few publications actually give advice on how to use coarrays in order to obtain a fast and scalable parallel code. Ashby & Reid [1] ported an MPI flow solver to coarrays, and Barrett [2] uses a finite difference scheme to assess the performance of several coarray implementations. The goal of this paper is to compare several parallelization implementations with coarrays and MPI. We compare these approaches in terms of performance, flexibility and ease of programming. Speedup studies are performed on two machines with different network interfaces. The work concentrates on the implementation provided by Cray.

2 Numerical Method

We use a lattice Boltzmann method (LBM) fluid solver as a testbed. This explicit numerical scheme uses direct-neighbor stencils and a homogeneous, Cartesian grid. The numerical algorithm consists of streaming and collision, performed in each time step (Listing 1.1). The cell-local collision mimics particle interactions, and streaming represents the free motion of particles, consisting of pure memory copy operations from nearest-neighbor cells. Neighbors are directly accessed on a cubic grid, which is subdivided into rectangular sub-blocks for parallel execution.

do k=1,nz; do j=1,ny; do i=1,nx
  ftmp(:) = fin(:,i-cx(:),j-cy(:),k-cz(:))  ! advection from offsets cx, cy, cz
  ...
  ! Double buffering: read values from fin, work on ftmp and write to fout
  fout(:,i,j,k) = ftmp(:) - (1-omega)*(ftmp(:)-feq(:))  ! collide
enddo; enddo; enddo

Listing 1.1. Serial stream collide routine

2.1 Alignment in Memory

The Cartesian grid structure naturally maps to a four-dimensional array, where the indices {i, j, k} represent the fluid cells' spatial coordinates. Each cell holds n_nod = 19 density values, represented by the index l, whose position in the array can be chosen. Communication then involves strided data access, where the strides depend on the direction and the memory layout. There are two main arrays, fin and fout, which by turns hold the state of the current time step. These arrays are of size {n_nod, n_x, n_y, n_z}, depending on the position of l. In Fortran the first index is aligned consecutively in memory, yielding stride-one access. With the density-first layout lijk, the smallest data chunk for communication has at least n_nod consecutive memory entries. The smallest memory chunks of n_nod entries for communication occur in the x-direction. Communication in the y-direction involves chunks of n_nod * n_x entries and in the z-direction chunks of n_nod * n_x * n_y entries. When the density is stored last (ijkl), the x-direction again involves the smallest data chunks, but only of a single memory entry with strides of n_stride = n_x * n_y * n_z.
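
The contiguity argument can be made concrete with the two array shapes; the following sketch is illustrative only (the array names f_lijk and f_ijkl are our own, not from the paper):

program layout_sketch
  implicit none
  integer, parameter :: n_nod = 19, nx = 100, ny = 100, nz = 100
  real, allocatable :: f_lijk(:,:,:,:)   ! density-first: (n_nod, nx, ny, nz)
  real, allocatable :: f_ijkl(:,:,:,:)   ! density-last:  (nx, ny, nz, n_nod)

  allocate(f_lijk(n_nod,nx,ny,nz), f_ijkl(nx,ny,nz,n_nod))

  ! x-face exchange with lijk: each cell contributes n_nod contiguous reals
  !   f_lijk(:,1,:,:)
  ! x-face exchange with ijkl: single reals; the n_nod densities of one cell
  ! lie nx*ny*nz elements apart in memory
  !   f_ijkl(1,:,:,:)
end program layout_sketch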

2.2 Traditional Parallelization Approach

A time-dependent mesh-based flow solver is usually parallelized following the SPMD (single program multiple data) approach. All p processes execute the same program but work on different data. In the LBM, each fluid cell requires the same computing effort per time step. For ideal load balancing, the work is distributed equally among the processes, and a simple domain decomposition can be performed, splitting the regular Cartesian mesh into p equal sub-domains. Updating a given cell requires the information from the neighboring cells and from the cell itself at the previous iteration. At the borders between sub-domains it is therefore necessary to exchange data in every iteration. A common approach is to use halo cells, which are not updated by the scheme itself but provide valid data from the remote processes. This allows the actual update procedure to act just like in the serial algorithm. With MPI, data is exchanged before the update of the local cells at each time step, usually with a non-blocking point-to-point exchange.

2.3 Strategy Following the Coarray Concept

Using coarrays, data can be accessed directly in the neighbor's memory without using halos. Coarray data objects must have the same size and shape on each process. The cells on remote images can then be accessed from within the algorithm like local ones, but with the additional process address (nbp) appended. Synchronization is required between time steps to ensure data consistency across all processes, but no further communication statements are necessary. We present different approaches to coarray implementations.

In the Naive Coarray Approach, every streaming operation, i.e. every copy from a neighbor to the local cell, is done by a coarray access, even for values which are found locally. This requires either the calculation of the address of the requested value, or a lookup table for this information. Both approaches result in additional run-time costs, and the calculation of the neighbor process number and the position there obscures the kernel code.

do k=1,nz; do j=1,ny; do i=1,nx
  ! streaming step (get values from neighbors)
  do l=1,nnod   ! loop over densities in each cell
    xpos  = mod(crd(1)*bnd(1)+i-cx(l,1)-1, bnd(1)) + 1
    xp(1) =    (crd(1)*bnd(1)+i-cx(l,1)-1) / bnd(1) + 1
    ...                               ! analogous for the other directions
    if(xp(1) .lt. 1) then ...         ! correct physical boundaries
    nbp = image_index( caf_cart_comm, xp(1:3) )   ! get image number
    ftmp(l) = fin(l,xpos,ypos,zpos)[nbp]          ! coarray get
  enddo
  ...   ! collision
enddo; enddo; enddo

Listing 1.2. Naive streaming step with coarrays

The copy itself is easy to implement, but the logic for getting the remote data address requires quite some effort (Listing 1.2). Significant overhead is generated by the coarray access itself and the repetitive address calculations. If the position of each neighbor is determined in advance and saved, memory demand and, more importantly, memory accesses increase.
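
The lookup-table alternative mentioned above would replace the index arithmetic of Listing 1.2 by table reads; the fragment below is a hedged sketch under assumed table names (nb_img, nb_pos), not code from the paper:

! Illustrative lookup-table variant (assumed names nb_img, nb_pos):
! image number and remote position are precomputed per cell and direction.
do k=1,nz; do j=1,ny; do i=1,nx
  do l=1,nnod
    nbp     = nb_img(l,i,j,k)                 ! precomputed image number
    ftmp(l) = fin(l, nb_pos(1,l,i,j,k), &
                     nb_pos(2,l,i,j,k), &
                     nb_pos(3,l,i,j,k))[nbp]  ! coarray get as in Listing 1.2
  enddo
  ...   ! collision as before
enddo; enddo; enddo

The integer loads per density that replace the address calculation illustrate why this variant increases memory demand and memory traffic.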

As the LBM is a memory-bandwidth-bound algorithm, this puts even more pressure on the memory interface.

With the Segmented Coarray Approach, the inner nodes of each partition are treated as in the serial version. Coarray access is only used where necessary, i.e. on the interface cells. Separate subroutines are defined for the coarray and non-coarray streaming access, which are called according to the current position in the fluid domain (Listing 1.3). With this approach, there are two kernels, one with the additional coarray access as above, plus the loop nest for selecting the appropriate kernel. This raises the required lines of code again.

call stream_collide_caf(fout,fin,1,nx,1,ny,1,1)
do k=2,nz-1
  call stream_collide_caf(fout,fin,1,nx,1,1,k,k)
  do j=2,ny-1
    call stream_collide_caf(fout,fin,1,1,j,j,k,k)
    call stream_collide_sub(fout,fin,j,k)
    call stream_collide_caf(fout,fin,nx,nx,j,j,k,k)
  end do
  call stream_collide_caf(fout,fin,1,nx,ny,ny,k,k)
end do
call stream_collide_caf(fout,fin,1,nx,1,ny,nz,nz)

Listing 1.3. Segmented stream collide with coarrays

3 Tested Communication Schemes

3.1 Data Structures

Message-passing-based parallelization requires data structures for collecting, sending and receiving the data. With MPI, MPI types or regular arrays can be used, whereas with CAF, regular arrays or derived types containing arrays can be employed. With Regular Global Arrays and the same-size restriction for coarrays, separate data objects for each neighbor buffer are required. This applies both to send and receive buffers, which increases implementation complexity. The usage of Derived Types provides the programmer with flexibility, as the arrays inside the coarray derived types do not have to be of the same size. Before each communication, or alternatively at every array (de)allocation, information about the array size of every globally accessible data object has to be made visible to all processes.

! Regular global arrays as coarrays
real,dimension(:,:,:,:)[:],allocatable :: caf_snd1, ..

! Derived types
type caf_dt      ! Coarray derived type with regular array inside
  real,dimension(:,:,:,:),allocatable :: send
end type caf_dt
type(caf_dt) :: buffer[*]      !< Coarray definition

type reg_dt      ! Regular derived type with coarrays inside
  real,dimension(:,:,:,:)[:],allocatable :: snd1, .. snd19
end type reg_dt
type(reg_dt) :: buffer         !< Regular array definition

Listing 1.4. Derived type and regular global coarrays
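
To illustrate the size-exchange requirement for the derived-type variant, the following sketch (our own; names such as nnod, nx_loc and nbp are assumed) publishes the local buffer extents in an integer coarray so that a neighbor knows how much to fetch from the caf_dt component of Listing 1.4:

! Illustrative sketch: publish per-image buffer extents before remote gets.
integer           :: buf_shape(4)[*]      ! extents of the local send buffer
integer           :: n(4)
type(caf_dt)      :: buffer[*]            ! derived type from Listing 1.4
real, allocatable :: tmp(:,:,:,:)

allocate(buffer%send(nnod, nx_loc, ny_loc, nz_loc))  ! sizes may differ per image
buf_shape = shape(buffer%send)            ! make the local extents visible
sync all                                  ! extents and data valid everywhere

n = buf_shape(:)[nbp]                     ! read the neighbor's extents
allocate(tmp(n(1), n(2), n(3), n(4)))
tmp = buffer[nbp]%send                    ! fetch the remote component

Without such an exchange, or a query at every (de)allocation, a remote image cannot know the extent of the component it is about to read.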

       CPU         Rev    Cores  GHz  L2 (KB)  L3 (MB)  Memory  ASIC      CCE    MPI
XT5m   Shanghai    23C2   4      2.4  512      6        16 GB   SeaStar2  7.2.4  MPT 5.0.0
XE6    MagnyCours  6128   8      2.0  512      12       32 GB   Gemini    7.3.0  MPT 5.1.1

Table 1. Machine setup

3.2 Buffered and Direct Communication

The usage of explicit buffers, i.e. halos, requires separate send and receive buffers of potentially different sizes. The communication is done in an explicit, dedicated routine; before it starts, the required data is collected and placed into the halo buffer, from where it is put back into the main array after the communication. One-sided communication models allow direct remote memory access to locations which are not available to the local process, but where access is routed through the network. This allows the construction of either explicit message passing of buffers or access to remote data within the kernel itself, here referred to as implicit access.

4 Experimental Results

We performed investigations on Cray XT and XE systems (see Table 1), as they are among the few that support PGAS both in hardware and in the compiler. These architectures mainly differ in the application-specific integrated circuits (ASIC) [?], which connect the processors to the system network and offload communication functions from the processor. The Cray XT5m nodes use SeaStar ASICs, which contain, among others, a direct memory access engine to move data in the local memory, a router connecting to the system network and a remote memory access engine [3]. The XE6 nodes are equipped with the Gemini ASIC [?], which supports a global address space and is optimized for efficient one-sided point-to-point communication with a high throughput for small messages. We used the Cray Fortran compiler from the Cray Compiling Environment (CCE). We first evaluate the single-core performance for various domain sizes and memory layouts, from which we choose a suitable method for the parallel scaling runs. Coarrays and MPI are then compared on both machines. We perform a three-dimensional domain decomposition, for p > 8 with an equal number of subdivisions in all three directions. For p ≤ 8, the domain is split in the z-direction only.

4.1 Influence of the Memory Layout in Serial and Parallel

The serial performance is measured with the physics-independent quantity million lattice updates per second (MLUPs) as a function of the total number of fluid cells n_tot. Performance studies of the LBM by Donath [6] have revealed a significant influence of memory layout and domain size. With the density-first layout lijk, the performance decreases with increasing domain size. In Fig. 1 (left), the cache hierarchy is clearly visible in terms of different MLUPs levels, especially for lijk. For the density-later layouts iljk, ijlk and ijkl, the performance is relatively constant, with cache thrashing occurring for ijkl.
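
For reference, the MLUPs rate used above follows the usual definition (our formula; the paper does not spell it out): MLUPs = n_tot * n_steps / (t_run * 10^6), where n_steps is the number of time steps and t_run the measured run time in seconds.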

Fig. 1. Impact of memory layout in serial (XT5m; MLUPs over lg10(n_tot)) and parallel (direct CAF; total run time [s] over p)

The memory layout not only plays a large role for serial execution, but also when data is collected for communication. Depending on the data alignment in memory, the compiler has to gather data with varying strides, resulting in a large discrepancy between the communication times in different directions. In Fig. 1 (right), all layouts scale linearly for a one-dimensional domain decomposition with p ≤ 8 and a domain size of n_tot = 200^3. Involving more directions leads to a strong degradation of performance on the XT5m for the ijkl layout, probably due to the heavily fragmented memory that needs to be communicated. The smallest chunk in lijk remains 19 entries, as all densities are needed for communication and are stored consecutively in memory. On the XE6 with the new programming environment, this issue seems to be resolved.

4.2 Derived Type and Regular Coarray Buffers

Fig. 2. Derived type coarrays (total run time [s] over p)

On the XT5m there is an obvious difference between the two implementations (Fig. 2). The regular array implementation scales in a nearly linear fashion. The run time of the derived type version even increases linearly when using more processes within a single node. When using the network, it scales nearly linearly for p ≥ 16. This issue also seems to be resolved on the new XE6 architecture: within a single node, there are virtually no differences between the two variants. However, the derived type variant seems to scale slightly worse beyond a single node.

4.3 MPI Compared to Coarray Communication

Here we compare various coarray schemes to regular non-blocking, buffered MPI communication. We start with a typical explicit MPI scheme and work our way towards an implicit communication scheme with coarrays. The MPI and MPI-style CAF schemes handle the communication in a separate routine. The implicit coarray schemes perform the access to remote data during the streaming step. We use the lijk memory layout, where all values of one cell are stored contiguously, which results in a minimal data pack of 19 values. Strong and weak scaling experiments are performed for the following communication schemes:

1. Explicit MPI: buffered isend-irecv
2. Explicit CAF buffered: same as MPI but with coarray GET to fill the buffers (sketched below)
3. Explicit CAF direct access: no buffers but direct remote data access
4. Implicit CAF segmented loops: coarray access on border nodes only
5. Implicit CAF naive: coarray access on all nodes
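
Scheme 2 keeps the MPI communication structure and only swaps the transport; the fragment below is a hedged sketch with assumed buffer and neighbor names (recv_xlo, caf_snd_xhi, nb_xlo, ...), not code from the paper:

! Illustrative sketch of scheme 2: fill the local receive halos with
! coarray gets from the neighbors' send buffers.
sync all                                  ! neighbors' send buffers are filled
recv_xlo = caf_snd_xhi(:,:,:,:)[nb_xlo]   ! get from lower-x neighbor
recv_xhi = caf_snd_xlo(:,:,:,:)[nb_xhi]   ! get from upper-x neighbor
! ... analogous gets for the y and z faces ...
sync all                                  ! all gets done before buffers are reused
! unpack the receive buffers into the halo layer of fin,
! then run the serial kernel of Listing 1.1 unchanged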

Strong scaling. A fluid domain with a total of 200^3 grid cells is employed for all process counts. In Fig. 3 the total execution time of the main loop is shown as a function of the number of processes. Increasing the number of processes decreases the domain size per process, by which the execution time decreases. Ideal scaling is plotted for comparison.

Fig. 3. Strong scaling comparison of MPI and CAF with lijk layout and n_tot = 200^3

The MPI implementation is fastest and scales close to ideal on both machines, but the scaling stalls for p > 2000. The coarray implementations show that huge progress was made from the XT5m to the XE6. The implicit coarray access schemes show a much slower serial run time. The naive coarray implementation is slower than the explicit ones, even by a factor of 30, due to coarray accesses to local data, although it scales perfectly on the XE6. On the XE6, for p > 10000, all schemes tend to converge towards the naive implementation, which is expected, as this approach pretends all neighbors to be remote, essentially resulting in single-cell partitions. However, this results in a very low parallel efficiency with respect to the fastest serial implementation. Due to an unexplained loss of scaling in the MPI implementation beyond 2000 processes, coarrays mimicking the MPI communication pattern even get slightly faster in this range.

Weak scaling. The increasingly high parallelism due to power and clock frequency limitations [7], combined with limited memory resources, leads to extremely distributed systems. Small computational domains of n = 9^3 fluid cells, which fit completely into cache, are used to anticipate this trend in our analysis (Fig. 4). With such domain sizes, latency effects prevail. The MPI parallelization scales similarly on both machines and yields the best run time among the tested schemes. Both explicitly buffered coarray schemes scale nearly perfectly for p > 64, whereas implicit coarray addressing within the algorithm clearly loses in all respects.

Fig. 4. Weak scaling in three directions with n = 9^3 cells per process (latency area)
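
The parallel efficiency mentioned above is understood in the usual sense (our notation; the paper does not define it explicitly): speedup S(p) = T_s / T(p) and efficiency E(p) = S(p) / p, where T_s is the run time of the fastest serial implementation and T(p) the run time on p processes. Measuring E(p) against the fastest serial code is what makes the naive scheme look poor despite its good scaling behavior.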

5 Conclusion

We presented different approaches to parallelizing a fluid solver using coarrays in Fortran and compared their ease of programming and performance to traditional parallelization schemes with MPI. The code complexity of simple and slow CAF implementations is low, but it quickly increases with the constraint of high performance. We showed that the achievable performance of applications using coarrays depends on the definition of the data structures. The analysis indicates that it might become beneficial to use coarrays for very large numbers of processes; however, on the systems available today, MPI communication provides the highest performance and needs to be mimicked in coarray implementations.

Acknowledgments

We thank the DEISA Consortium (www.deisa.eu), funded through the EU FP7 project RI-222919, for support within the DEISA Extreme Computing Initiative. We also want to express our grateful thanks to M. Schliephake from the PDC KTH Stockholm for providing us with the scaling runs on the XE6 system Lindgren, and to U. Küster from HLRS Stuttgart for insightful discussions.

References

1. J. V. Ashby and J. K. Reid. Migrating a Scientific Application from MPI to Coarrays. Technical report, Comp. Sci. Eng. Dept., STFC Rutherford Appleton Laboratory, 2008.
2. R. Barrett. Co-Array Fortran Experiences with Finite Differencing Methods. Technical report, 2006.
3. R. Brightwell, K. Pedretti, and K. D. Underwood. Initial Performance Evaluation of the Cray SeaStar Interconnect. Proc. 13th Symp. on High Performance Interconnects, 2005.
4. CRAY. Cray XT System Overview, Jun 2009.
5. CRAY. Using the GNI and DMAPP APIs. User Manual, 2010.
6. S. Donath. On Optimized Implementations of the Lattice Boltzmann Method on Contemporary High Performance Architectures. Master's thesis, Friedrich-Alexander-University Erlangen-Nürnberg, 2004.
7. P. Kogge. Exascale Computing Study: Technology Challenges in Achieving Exascale Systems. Technical report, Information Processing Techniques Office and Air Force Research Lab, 2008.
8. J. Mellor-Crummey, L. Adhianto, and W. Scherer. A Critique of Co-array Features in Fortran 2008. Working Draft J3/07-007r3, 2008.
9. R. Rabenseifner. Optimization of Collective Reduction Operations. International Conference on Computational Science, 2004.
10. J. Reid. Coarrays in the Next Fortran Standard, Mar 2009.