CAF versus MPI - Applicability of Coarray Fortran to a Flow Solver


Manuel Hasert, Harald Klimach, Sabine Roller
1 German Research School for Simulation Sciences GmbH, Aachen, Germany
2 RWTH Aachen University, Aachen, Germany

Abstract. We investigate how to use coarrays in Fortran (CAF) for parallelizing a flow solver and examine the capabilities of current compilers with coarray support. Usability and performance of CAF in mesh-based applications are examined and compared to traditional MPI strategies. We analyze the influence of the memory layout, the usage of communication buffers versus direct access to the data used in the computation, and different methods for the communication itself. Our objective is to provide insights on how common communication patterns have to be formulated when using coarrays.

Keywords: PGAS, Coarray Fortran, MPI, Performance Comparison

1 Introduction

Attempts to exploit the parallelism of computing devices automatically have always been made, and compilers succeeded in restricted forms such as vector operations. For more general parallelization concepts with multiple instructions on multiple data (MIMD), the automation was less successful, and programmers had to support the compiler with directives such as in OpenMP. The Message Passing Interface (MPI) offers a rich set of functionality for MIMD applications on distributed systems, and high-level parallelization is supplied by library APIs. The parallelization, however, has to be elaborated in detail by the programmer. Recently, increasing effort has been put into language-inherent parallelism in order to leave parallel implementation details to the compiler. One concept developed in detail is the partitioned global address space (PGAS), which was brought into the Fortran 2008 standard in the form of coarrays. Parallel features become intrinsic language properties and allow a high-level formulation of parallelism in the language itself [10]. Coarrays minimally extend Fortran and allow the creation of parallel programs with minor modifications to the sequential code. The optimistic goal is a language with inherent parallelism that allows the compiler to consider serial aspects and communication together when optimizing the code. In CAF, shared data objects are indicated by an additional index in square brackets, which specifies the remote location of the shared variable in terms of an image (process) number. Contrary to MPI, no collective operations are defined in the current standard.
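
To illustrate the square-bracket syntax, the following minimal sketch declares a coarray, writes it locally and reads it from a neighboring image; all names are illustrative and not taken from the solver described below.

program caf_minimal
  implicit none
  integer :: left              ! image number of the left neighbor
  real    :: edge(4)[*]        ! coarray: every image holds its own edge(4)

  edge(:) = real(this_image()) ! fill the local data with the image number

  sync all                     ! all images must have written before anyone reads remotely

  left = this_image() - 1      ! periodic left neighbor
  if (left < 1) left = num_images()
  print *, 'image', this_image(), 'reads from', left, ':', edge(1)[left]
end program caf_minimal

The number of images is fixed at program launch and queried with num_images().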

Collective operations have been shown to constitute a large part of the total communication on today's supercomputers [9], thus posing a severe limitation to a pure coarray parallelization, which is a major point of criticism by Mellor-Crummey et al. [8]. Only few publications actually give advice on how to use coarrays in order to obtain a fast and scalable parallel code. Ashby and Reid [1] ported an MPI flow solver to coarrays, and Barrett [2] uses a finite difference scheme to assess the performance of several coarray implementations. The goal of this paper is to compare several parallelization implementations with coarrays and MPI. We compare these approaches in terms of performance, flexibility and ease of programming. Speedup studies are performed on two machines with different network interfaces. The work concentrates on the implementation provided by Cray.

2 Numerical Method

We use a lattice Boltzmann method (LBM) fluid solver as a testbed. This explicit numerical scheme uses direct-neighbor stencils and a homogeneous, Cartesian grid. The numerical algorithm consists of streaming and collision performed at each time step (Listing 1.1). The cell-local collision mimics particle interactions, and streaming represents the free motion of particles, consisting of pure memory copy operations from nearest-neighbor cells. Neighbors are directly accessed on a cubic grid, which is subdivided into rectangular sub-blocks for parallel execution.

do k=1,nz; do j=1,ny; do i=1,nx
  ftmp(:) = fin(:, i-cx(:), j-cy(:), k-cz(:))  ! advection from offsets cx, cy, cz
  ...   ! Double buffering: read values from fin, work on ftmp and write to fout
  fout(:,i,j,k) = ftmp(:) - (1-omega)*(ftmp(:) - feq(:))  ! collide
enddo; enddo; enddo

Listing 1.1. Serial stream-collide routine

2.1 Alignment in Memory

The Cartesian grid structure naturally maps to a four-dimensional array, where the indices {i, j, k} represent the spatial coordinates of the fluid cells. Each cell holds n_nod = 19 density values, addressed by the index l, whose position in the array can be chosen. Communication then involves strided data access, where the strides depend on the direction and the memory layout. There are two main arrays, fin and fout, which hold the state of the current time step in turns. These arrays are of size {n_nod, n_x, n_y, n_z}, with the ordering depending on the position of l. In Fortran, the first index is aligned consecutively in memory, yielding stride-one access. With the density-first layout lijk, the smallest data chunk for communication has at least n_nod consecutive memory entries. These smallest chunks of n_nod entries occur for communication in the x-direction; communication in the y-direction involves chunks of n_nod·n_x entries and in the z-direction chunks of n_nod·n_x·n_y entries. When the density is stored last (ijkl), the x-direction again involves the smallest data chunks, but now of a single memory entry with a stride of n_stride = n_x·n_y·n_z.
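
The consequences of the layout choice for the communication granularity can be made explicit in a small sketch; the array names and extents below are illustrative and not taken from the solver code.

program layout_chunks
  implicit none
  integer, parameter :: nnod = 19, nx = 100, ny = 100, nz = 100
  real, allocatable  :: f_lijk(:,:,:,:)   ! density-first layout: (l,i,j,k)
  real, allocatable  :: f_ijkl(:,:,:,:)   ! density-last layout:  (i,j,k,l)

  allocate(f_lijk(nnod,nx,ny,nz), f_ijkl(nx,ny,nz,nnod))

  ! An x-boundary plane (i = 1) contains ny*nz cells:
  ! lijk: f_lijk(:,1,:,:) consists of ny*nz contiguous chunks of nnod reals
  ! ijkl: f_ijkl(1,:,:,:) consists of nnod*ny*nz single reals, stride nx apart
  print *, 'lijk x-face: contiguous chunks of', nnod, 'values'
  print *, 'ijkl x-face: single values with stride', nx

  ! For lijk, the y- and z-faces give larger contiguous chunks:
  print *, 'lijk y-face chunk:', nnod*nx, '  lijk z-face chunk:', nnod*nx*ny
end program layout_chunks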

2.2 Traditional Parallelization Approach

A time-dependent, mesh-based flow solver is usually parallelized following an SPMD (single program, multiple data) approach. All p processes execute the same program but work on different data. In the LBM, each fluid cell induces the same computing effort per time step. For an ideal load balance, the work is equally distributed among the processes, and a simple domain decomposition can be performed, splitting the regular Cartesian mesh into p equal sub-domains. The update of a given cell requires the information of the neighboring cells and of the cell itself from the previous iteration. At the borders between sub-domains it is therefore necessary to exchange data in each iteration. A common approach is to use halo cells, which are not updated by the scheme itself but provide valid data from the remote processes. This allows the actual update procedure to act just like in the serial algorithm. With MPI, data is exchanged before the update of the local cells at each time step, usually with a non-blocking point-to-point exchange.

2.3 Strategy Following the Coarray Concept

Using coarrays, data can be accessed directly in the neighbor's memory without using halos. Coarray data objects must have the same size and shape on each process. The cells on remote images can then be accessed from within the algorithm like local ones, but with the additional process address (nbp) appended. Synchronization is required between time steps to ensure data consistency across all processes, but no further communication statements are necessary. We present different approaches to coarray implementations.

In the Naive Coarray Approach, every streaming operation, i.e. every copy from a neighbor to the local cell, is done by a coarray access, even for values which are found locally. This requires either the calculation of the address of the requested value or a lookup table for this information. Both options add run-time costs, and the calculation of the neighbor process number and of the position on that process obscures the kernel code.

do k=1,nz; do j=1,ny; do i=1,nx
  ! streaming step (get values from neighbors)
  do l=1,nnod  ! loop over densities in each cell
    xpos  = mod(crd(1)*bnd(1)+i-cx(l,1)-1, bnd(1)) + 1
    xp(1) =    (crd(1)*bnd(1)+i-cx(l,1)-1) / bnd(1) + 1
    ...  ! analogous for the other directions
    if(xp(1) .lt. 1) then
      ...  ! correct physical boundaries
    endif
    nbp = image_index( caf_cart_comm, xp(1:3) )  ! get image number
    ftmp(l) = fin( l, xpos, ypos, zpos )[nbp]    ! coarray get
  enddo
  ...  ! collision
enddo; enddo; enddo

Listing 1.2. Naive streaming step with coarrays

The copy itself is easy to implement, but the logic for obtaining the remote data address requires quite some effort (Listing 1.2). Significant overhead is generated by the coarray access itself and by the repetitive address calculations. If the position of each neighbor is instead determined in advance and stored, the memory demand and, more importantly, the memory accesses increase. As the LBM is a memory-bandwidth-bound algorithm, this puts even more pressure on the memory interface.
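
The neighbor image number nbp in Listing 1.2 is obtained by mapping process-grid coordinates to an image index. A minimal sketch of such a mapping with the standard intrinsics this_image and image_index is given below; the process-grid extents are illustrative assumptions, and no boundary correction is included.

program neighbor_image
  implicit none
  integer :: grid[2,2,*]   ! dummy coarray whose codimensions span the process grid
  integer :: me(3), nb(3), nbp

  me = this_image(grid)    ! cosubscripts of this image within the process grid
  nb = me + [1, 0, 0]      ! cosubscripts of the neighbor in +x direction

  nbp = image_index(grid, nb)   ! map cosubscripts to an image number (0 if outside)
  print *, 'image', this_image(), 'has +x neighbor image', nbp
end program neighbor_image

In Listing 1.2, a comparable lookup on caf_cart_comm is performed for every density of every cell, which is part of the overhead discussed above.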

With the Segmented Coarray Approach, the inner nodes of each partition are treated as in the serial version. Coarray access is only used where necessary, i.e. on the interface cells. Separate subroutines are defined for the streaming with and without coarray access, and they are called according to the current position in the fluid domain (Listing 1.3). With this approach there are two kernels, one of them with the additional coarray accesses as above, plus the loop structure for selecting the kernel. This again raises the number of required lines of code.

call stream_collide_caf(fout,fin, 1,nx, 1,ny, 1,1)
do k=2,nz-1
  call stream_collide_caf(fout,fin, 1,nx, 1,1, k,k)
  do j=2,ny-1
    call stream_collide_caf(fout,fin, 1,1,   j,j, k,k)
    call stream_collide_sub(fout,fin, j,k)
    call stream_collide_caf(fout,fin, nx,nx, j,j, k,k)
  end do
  call stream_collide_caf(fout,fin, 1,nx, ny,ny, k,k)
end do
call stream_collide_caf(fout,fin, 1,nx, 1,ny, nz,nz)

Listing 1.3. Segmented stream-collide with coarrays

3 Tested Communication Schemes

3.1 Data Structures

Message-passing-based parallelization requires data structures for collecting, sending and receiving the data. With MPI, MPI types or regular arrays can be used, whereas with CAF, regular arrays or derived types containing arrays can be employed. With Regular Global Arrays and the same-size restriction for coarrays, separate data objects are required for each neighbor buffer. This applies both to send and to receive buffers, which increases the implementation complexity. The usage of Derived Types provides the programmer with flexibility, as the arrays inside the coarray derived types do not have to be of the same size. Before each communication, or alternatively at every array (de)allocation, information about the array size of every globally accessible data object has to be made visible to all processes.

! Regular global arrays as coarrays
real,dimension(:,:,:,:)[:],allocatable :: caf_snd1, ..

! Derived types
type caf_dt  ! Coarray derived type with regular array
  real,dimension(:,:,:,:),allocatable :: send
end type caf_dt
type(caf_dt) :: buffer[*]  !< Coarray definition

type reg_dt  ! Regular derived type with coarray inside
  real,dimension(:,:,:,:)[:],allocatable :: snd1, .., snd19
end type reg_dt
type(reg_dt) :: buffer     !< Regular array definition

Listing 1.4. Derived type and regular global coarrays
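
To illustrate the size-visibility issue for derived-type buffers, the following hedged sketch lets every image allocate a differently sized send buffer inside a coarray derived type and publishes the size through a separate integer coarray before a neighbor fetches the data; all names and sizes are illustrative and not taken from the solver.

program derived_type_buffer
  implicit none
  type :: caf_dt
     real, allocatable :: send(:)   ! per-image buffer, sizes may differ
  end type caf_dt
  type(caf_dt) :: buffer[*]         ! coarray of derived type
  integer      :: bufsize[*]        ! buffer size, made visible to all images
  integer      :: nbp
  real, allocatable :: recv(:)

  ! Each image allocates a buffer of its own, image-dependent size.
  allocate(buffer%send(10 + this_image()))
  buffer%send = real(this_image())
  bufsize     = size(buffer%send)

  sync all   ! buffers and sizes are now valid on all images

  ! Fetch the buffer of the next image; its size is read remotely first.
  nbp = this_image() + 1
  if (nbp > num_images()) nbp = 1
  allocate(recv(bufsize[nbp]))
  recv(:) = buffer[nbp]%send(1:bufsize[nbp])
  print *, 'image', this_image(), 'received', size(recv), 'values from image', nbp
end program derived_type_buffer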

Table 1. Machine setup: CPU (Shanghai rev. 23C2 with 4 cores on the XT5m; MagnyCours on the XE6), clock rate, L2/L3 cache sizes, memory per node, network ASIC (SeaStar on the XT5m, Gemini on the XE6), and the CCE and MPT versions used.

3.2 Buffered and Direct Communication

The usage of explicit buffers, i.e. of halos, requires separate send and receive buffers with potentially different sizes. The communication is done in an explicit, dedicated routine: before it starts, the required data is collected and placed into the buffer, and after the communication the received data is put from the halo buffer back into the main array. One-sided communication models allow direct remote memory access to locations which are not local to the process, with the access routed through the network. This allows the construction of either explicit message passing of buffers or access to remote data from within the kernel itself, here referred to as implicit access.

4 Experimental Results

We performed our investigations on Cray XT and XE systems (see Table 1), as they are among the few that support PGAS both in hardware and in the compiler. These architectures mainly differ in the application-specific integrated circuits (ASICs) [4], which connect the processors to the system network and offload communication functions from the processor. The Cray XT5m nodes use SeaStar ASICs, which contain, among other components, a direct memory access engine to move data in the local memory, a router connecting to the system network, and a remote memory access engine [3]. The XE6 nodes are equipped with the Gemini ASIC [5], which supports a global address space and is optimized for efficient one-sided point-to-point communication with a high throughput for small messages. We used the Cray Fortran compiler from the Cray Compiling Environment (CCE).

We first evaluate the single-core performance for various domain sizes and memory layouts, from which we choose a suitable method for parallel scaling. Coarrays and MPI are then compared on both machines. We perform a three-dimensional domain decomposition: for p > 8 the domain is subdivided equally in all three directions, whereas for p ≤ 8 it is split in the z-direction only.

4.1 Influence of the Memory Layout in Serial and Parallel

The serial performance is measured with the physics-independent quantity of million lattice updates per second (MLUPs) as a function of the total number of fluid cells n_tot. Performance studies of the LBM by Donath [6] have revealed a significant influence of the memory layout and the domain size. With the density-first layout lijk, the performance decreases with increasing domain size. In Fig. 1 (left), the cache hierarchy is clearly visible in terms of different MLUPs levels, especially for lijk. For the density-later layouts iljk, ijlk and ijkl, the performance is relatively constant, with cache thrashing occurring for ijkl.
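
The MLUPs figure used throughout the results is simply the number of cell updates divided by the measured run time of the main loop; a minimal sketch, with all names and sizes illustrative:

program mlups_metric
  implicit none
  integer, parameter :: nx = 100, ny = 100, nz = 100, nsteps = 50
  integer(8) :: t0, t1, rate
  real       :: seconds, mlups

  call system_clock(t0, rate)
  call run_main_loop()   ! stream-collide iterations for nsteps time steps
  call system_clock(t1)

  seconds = real(t1 - t0) / real(rate)
  mlups   = real(nx)*real(ny)*real(nz)*real(nsteps) / seconds / 1.0e6
  print *, 'performance:', mlups, 'MLUPs'

contains
  subroutine run_main_loop()
    ! placeholder for the serial or parallel stream-collide loop
  end subroutine run_main_loop
end program mlups_metric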

Fig. 1. Impact of memory layout in serial (XT5m; MLUPs vs. lg10(n_tot), left) and parallel (direct CAF; total run time vs. p, right)

The memory layout not only plays a large role for the serial execution, but also when data is collected for communication. Depending on the data alignment in memory, the compiler has to collect data with varying strides, resulting in a large discrepancy between the communication times in different directions. In Fig. 1 (right), all layouts scale linearly for a one-dimensional domain decomposition with p ≤ 8 at a fixed domain size. Involving more decomposition directions leads to a strong degradation of performance on the XT5m for the ijkl layout, probably due to the heavily fragmented memory that needs to be communicated. The smallest chunk size in lijk remains 19 values, as all densities of a cell are needed for the communication and are stored consecutively in memory. On the XE6 with the new programming environment, this issue seems to be resolved.

4.2 Derived Type and Regular Coarray Buffers

Fig. 2. Derived type coarrays (total run time vs. p for regular and derived-type buffers on the XT5m and XE6)

On the XT5m there is an obvious difference between the two implementations (Fig. 2). The regular array implementation scales in a nearly linear fashion. The derived type version even shows linearly increasing run times when using more processes inside a single node; when the network is used, it scales nearly linearly. This issue also seems to be resolved on the newer XE6 architecture. Within a single node, there are virtually no differences between the two variants; however, the derived type variant seems to scale slightly worse beyond a single node.

4.3 MPI compared to Coarray Communications

Here we compare various coarray schemes to regular non-blocking, buffered MPI communication. We start with a typical explicit MPI scheme and work our way towards an implicit communication scheme with coarrays. The MPI and MPI-style CAF schemes handle the communication in a separate routine, whereas the implicit coarray schemes perform the access to remote data during the streaming step. We use the lijk memory layout, where all values of one cell are stored contiguously, which results in a minimal data packet of 19 values. Strong and weak scaling experiments are performed for the following communication schemes (a sketch of the explicit buffered exchange follows the list):

1. Explicit MPI: buffered isend-irecv
2. Explicit CAF buffered: same as MPI, but with coarray GET to fill the buffers
3. Explicit CAF direct access: no buffers, but direct remote data access
4. Implicit CAF segmented loops: coarray access on border nodes only
5. Implicit CAF naive: coarray access on all nodes
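
As a concrete illustration of scheme 1, the following hedged sketch exchanges the two z-faces of the local block with the z-neighbors using non-blocking isend/irecv; the routine and variable names are ours, and the handling of physical boundaries (MPI_PROC_NULL) as well as the other directions is omitted.

subroutine exchange_z_faces(f, nnod, nx, ny, nz, lo_rank, hi_rank, comm)
  use mpi
  implicit none
  integer, intent(in)    :: nnod, nx, ny, nz, lo_rank, hi_rank, comm
  real,    intent(inout) :: f(nnod, nx, ny, 0:nz+1)   ! k=0 and k=nz+1 are halo layers

  real    :: snd_lo(nnod*nx*ny), snd_hi(nnod*nx*ny)
  real    :: rcv_lo(nnod*nx*ny), rcv_hi(nnod*nx*ny)
  integer :: req(4), ierr, n

  n = nnod*nx*ny
  snd_lo = reshape(f(:,:,:,1),  [n])   ! pack the two boundary layers
  snd_hi = reshape(f(:,:,:,nz), [n])

  call MPI_Irecv(rcv_lo, n, MPI_REAL, lo_rank, 0, comm, req(1), ierr)
  call MPI_Irecv(rcv_hi, n, MPI_REAL, hi_rank, 1, comm, req(2), ierr)
  call MPI_Isend(snd_lo, n, MPI_REAL, lo_rank, 1, comm, req(3), ierr)
  call MPI_Isend(snd_hi, n, MPI_REAL, hi_rank, 0, comm, req(4), ierr)
  call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)

  f(:,:,:,0)    = reshape(rcv_lo, [nnod, nx, ny])   ! unpack into the halo layers
  f(:,:,:,nz+1) = reshape(rcv_hi, [nnod, nx, ny])
end subroutine exchange_z_faces

In scheme 2, the same buffers would instead be filled by a coarray GET from the neighbor image after a synchronization, replacing the isend/irecv pairs.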

Strong scaling. A fluid domain with a total of 200³ grid cells is employed for all process counts. In Fig. 3 the total execution time of the main loop is shown as a function of the number of processes; increasing the number of processes decreases the domain size per process, which decreases the execution time. Ideal scaling is plotted for comparison.

Fig. 3. Strong scaling comparison of MPI and CAF with the lijk layout and n_tot = 200³

The MPI implementation is the fastest and scales close to ideal on both machines, but its scaling stalls at high process counts. The coarray implementations show that huge progress was made from the XT5m to the XE6. The implicit coarray access schemes show a much slower serial run time. The naive coarray implementation is slower than the explicit ones, even by a factor of 30, due to coarray accesses to local data, although it scales perfectly on the XE6. On the XE6, for p > 10000, all schemes tend to converge towards the naive implementation, which is expected, as this approach pretends that all neighbors are remote, essentially resulting in single-cell partitions. However, this results in a very low parallel efficiency with respect to the fastest serial implementation. Due to an unexplained loss of scaling in the MPI implementation beyond 2000 processes, coarrays mimicking the MPI communication pattern even become slightly faster in this range.

Weak scaling. The increasingly high parallelism caused by power and clock frequency limitations [7], combined with limited memory resources, leads to extremely distributed systems. Small computational domains of n = 19³ fluid cells per process, which fit completely into the cache, are used to anticipate this trend in our analysis (Fig. 4). With such domain sizes, latency effects prevail. The MPI parallelization scales similarly on both machines and yields the best run time among the tested schemes. Both explicit coarray schemes scale nearly perfectly for p > 64, whereas the implicit coarray addressing within the algorithm clearly loses in all respects.

Fig. 4. Weak scaling in three directions with n = 19³ cells per process (latency-dominated regime)
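
The parallel efficiency referred to above is measured against the fastest serial implementation rather than against the serial run time of the same scheme; a minimal sketch of this bookkeeping, with purely illustrative numbers:

program parallel_efficiency
  implicit none
  real    :: t_serial_best = 120.0   ! run time of the fastest serial variant [s]
  real    :: t_parallel    = 0.9     ! measured run time on p processes [s]
  integer :: p             = 256
  real    :: speedup, efficiency

  speedup    = t_serial_best / t_parallel
  efficiency = speedup / real(p)
  print *, 'speedup    =', speedup
  print *, 'efficiency =', efficiency
end program parallel_efficiency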

5 Conclusion

We presented different approaches for parallelizing a fluid solver using coarrays in Fortran and compared their ease of programming and performance to traditional parallelization schemes with MPI. The code complexity of simple but slow CAF implementations is low, yet it increases quickly under the constraint of high performance. We showed that the achievable performance of applications using coarrays depends on the definition of the data structures. The analysis indicates that it might become beneficial to use coarrays for very large numbers of processes; on the systems available today, however, MPI communication provides the highest performance and needs to be mimicked in coarray implementations.

Acknowledgments. We thank the DEISA Consortium, funded through the EU FP7 project RI-222919, for support within the DEISA Extreme Computing Initiative. We also want to express our grateful thanks to M. Schliephake from PDC, KTH Stockholm, for providing us with the scaling runs on the XE6 system Lindgren, and to U. Küster from HLRS Stuttgart for insightful discussions.

References

1. J. V. Ashby and J. K. Reid. Migrating a Scientific Application from MPI to Coarrays. Technical report, Comp. Sc. Eng. Dept., STFC Rutherford Appleton Laboratory.
2. R. Barrett. Co-Array Fortran Experiences with Finite Differencing Methods. Technical report.
3. R. Brightwell, K. Pedretti, and K. D. Underwood. Initial Performance Evaluation of the Cray SeaStar Interconnect. Proc. 13th Symp. on High Performance Interconnects.
4. Cray Inc. Cray XT System Overview.
5. Cray Inc. Using the GNI and DMAPP APIs. User Manual.
6. S. Donath. On Optimized Implementations of the Lattice Boltzmann Method on Contemporary High Performance Architectures. Master's thesis, Friedrich-Alexander-University Erlangen-Nürnberg.
7. P. Kogge. Exascale Computing Study: Technology Challenges in Achieving Exascale Systems. Technical report, Information Processing Techniques Office and Air Force Research Lab.
8. J. Mellor-Crummey, L. Adhianto, and W. Scherer. A Critique of Co-array Features in Fortran Working Draft J3/07-007r3.
9. R. Rabenseifner. Optimization of Collective Reduction Operations. International Conference on Computational Science.
10. J. Reid. Coarrays in the next Fortran Standard, Mar 2009.
