PRACE Workshop: Application Case Study: Code_Saturne. Andrew Sunderland, Charles Moulinec, Zhi Shang. Daresbury Laboratory, UK


1 PRACE Workshop: Application Case Study: Code_Saturne. Andrew Sunderland, Charles Moulinec, Zhi Shang, Science and Technology Facilities Council, Daresbury Laboratory, UK; Yvan Fournier, Électricité de France, Paris, France; Kevin Roy, Cray Centre of Excellence, UK; Juan Uribe, University of Manchester, UK

2 Summary: Background; STFC Daresbury Laboratory; Evolution of Code_Saturne; Petascaling and Optimization; Datasets; Initial Performance Analysis; Optimization; Multigrid Solver; Petascaling; Partitioning; MPI-IO; Load Imbalance; Hybrid Model

3 STFC Daresbury Laboratory. HPC service provider to the UK academic community for more than 25 years. Jointly runs the UK national HPC services HPCx (IBM Power5) and HECToR (Cray XT4); also hosts an STFC machine, a 1-rack IBM BG/P. Research, development and support centre for leading-edge academic engineering and physical science simulation codes, e.g. DL_POLY, GAMESS-UK, MPP-CRYSTAL, PFARM.

4 Towards the Petascale. The increase in TOP500 performance is now driven by increasing core count, not processor speed; memory subsystems may continue to improve. Terascaling issues (many hundreds to a few thousand cores): parallel scalability of Diags, FFTs and preconditioned sparse solvers. Petascaling issues (thousands of cores): Diag-free, FFT-free methods? A different approach to sparse solvers? Efficient I/O, load balancing, sensitivity to partitioning, MPI vs. hybrid.

5 Code_Saturne main capabilities. Chosen as one of the core application benchmarks for PRACE WP6; a general-purpose Computational Fluid Dynamics code, to be run on the PWR6, BG/P, Cray XT5 and NEC SX-9 prototypes. Physical modelling: single-phase laminar and turbulent flows (k-ε, SST, v2f, RSM, LES, RANS); radiative heat transfer (DOM, P-1); combustion of coal, fuel and gas (EBU, pdf, LWP); electric arc and Joule effect; Lagrangian module for dispersed particle tracking; compressible and incompressible flow; conjugate heat transfer (Syrthes and 1D); specific engineering modules for nuclear waste surface storage and cooling towers; derived version for atmospheric flows (Mercure_Saturne); derived version for Eulerian multiphase flows.

6 Code_Saturne main capabilities. Evolution: Prototype; 1.0 (basic modelling, wide range of meshes, qualification for nuclear applications); 1.1 (parallelism, LES); 1.2 (state of the art in turbulence); 1.3 (massively parallel, ALE, code coupling); open source since March (source code and manuals, support). Basic capabilities: simulation of incompressible or expandable flows, with or without heat transfer and turbulence (mixing length, 2-equation models, v2f, RSM, LES, ...).

7 Code_Saturne main capabilities. Main application area: nuclear power plant optimisation in terms of lifespan, productivity and safety. But also applications in: combustion (gas and coal); electric arc and Joule effect; atmospheric flows; radiative heat transfer. Other functionalities: fluid-structure interaction; deformable meshes via the Arbitrary Lagrangian Eulerian (ALE) method; dispersed particle tracking (Lagrangian approach).

8 Code_Saturne main capabilities. Flexibility: portability (UNIX and Linux), with no major porting issues for BG/P, PWR6 or the XT series in PRACE; GUI (Python Tk/Tix, XML format); parallel on distributed-memory machines; periodic boundaries (parallel, arbitrary interfaces); wide range of unstructured meshes with arbitrary interfaces; code coupling capabilities (Code_Saturne/Code_Aster, ...).

9 Code_Saturne general features. Technology: co-located finite volume, arbitrary unstructured meshes, predictor-corrector method; lines of code: 49% FORTRAN, 41% C, 10% Python. Development: 1998: prototype (long-time EDF in-house experience: ESTET-ASTRID, N3S, ...); 2000: version 1.0 (basic modelling, wide range of meshes); 2001: qualification for single-phase nuclear thermal-hydraulic applications; 2004: version 1.1 (complex physics, LES, parallel computing); 2006: version 1.2 (state-of-the-art turbulence models, GUI); 2008: version 1.3 (more parallelism, ALE, code coupling, ...), released as open source (GPL licence).

10 Code_Saturne general features. Broad validation range for each version: ~30 cases, 1 to 15 simulations per case; academic to industrial cases (4 to ... cells, 0.04 s to 12 days of CPU time). Runs or has run on Linux (workstations, clusters), AIX, Solaris, SGI Irix64, Fujitsu VPP 5000, HP AlphaServer, Blue Gene/L and P, PowerPC, Bull NovaScale, Cray XT. Qualification for single-phase nuclear applications: best-practice guidelines in a specific and critical domain; usual real-life industrial studies (... to ... cells).

11 Code_Saturne subsystems (diagram): meshes feed the Code_Saturne Pre-processor (mesh import, mesh joining, periodicity, domain partitioning), which feeds the Parallel Kernel (ghost cell creation) and the CFD Solver; these rely on the FVM library (parallel mesh management, code coupling, parallel treatment) and the BFT library (serial I/O, memory management); coupling partners include Code_Saturne, Syrthes, Code_Aster, the Salome platform, ...; inputs and outputs include restart files, the XML data file from the GUI, and postprocessing output.

12 Code_Saturne subsystems (same diagram): within this architecture, the FVM library is the component targeted by the PRACE WP6 work.

13 Code_Saturne: features of note for HPC. Segregated solver: diagonal-preconditioned CG is used for the pressure equation, Jacobi (or BiCGStab) for the other variables. Matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra. Indirect addressing plus the absence of dense blocks means fewer opportunities for MatVec optimization, as memory bandwidth is as important as peak flops. Linear equation solvers usually amount to 80% of the CPU cost (dominated by the pressure step), gradient reconstruction to about 20%; the larger the mesh, the higher the relative cost of the pressure step.
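
The face-based, indirect-addressing structure described above can be illustrated with a minimal C sketch (assumed array names, symmetric-matrix case; this is not the actual _mat_vec_p_l_native routine):

```c
/* Minimal sketch of a face-based matrix-vector product y = A.x on an
 * unstructured mesh, assuming one diagonal term per cell (da) and one
 * extra-diagonal term per internal face (xa); face_cell[2*f] and
 * face_cell[2*f+1] are the two cells sharing face f.  The x array is
 * assumed to hold ghost-cell values updated by a prior halo exchange. */

#include <stddef.h>

void mat_vec_face_based(size_t n_cells, size_t n_faces,
                        const size_t *face_cell,   /* 2 * n_faces        */
                        const double *da,          /* n_cells            */
                        const double *xa,          /* n_faces            */
                        const double *x,           /* n_cells (+ ghosts) */
                        double *y)                 /* n_cells            */
{
  /* Diagonal contribution: contiguous access, vectorizes well. */
  for (size_t i = 0; i < n_cells; i++)
    y[i] = da[i] * x[i];

  /* Extra-diagonal contributions: indirect addressing through the
   * face -> cell connectivity, which is why memory bandwidth matters
   * as much as peak flops for this kernel. */
  for (size_t f = 0; f < n_faces; f++) {
    size_t i = face_cell[2*f];
    size_t j = face_cell[2*f + 1];
    y[i] += xa[f] * x[j];
    y[j] += xa[f] * x[i];
  }
}
```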

14 Base parallel operations. Distributed-memory parallelism using domain partitioning; the classical ghost cell method is used for both parallelism and periodicity. Most operations require only ghost cells sharing faces. Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm (a minimal sketch of the global reduction follows below).
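
A minimal sketch, assuming a hypothetical helper name and locally owned cell arrays (not the Code_Saturne source), of the global reduction pattern: a local sum followed by one MPI_Allreduce per dot product.

```c
/* Global dot product over the locally owned cells of each rank. */

#include <stddef.h>
#include <mpi.h>

double distributed_dot(size_t n_cells, const double *u, const double *v,
                       MPI_Comm comm)
{
  double local = 0.0, global = 0.0;

  for (size_t i = 0; i < n_cells; i++)
    local += u[i] * v[i];

  /* One MPI_Allreduce per dot product: this is the global
   * synchronization point that shows up in the CrayPat profiles later. */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);

  return global;
}
```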

15 Benchmark test cases (ranked by number of cells in the mesh, from industrial to exploratory studies): FATHER test case, 1M cells, turbulence = LES; HYPI test case, 10M cells, turbulence = LES; GRILLE test case, 100M cells, turbulence = k-ε.

16 Code_Saturne initial performance (i), 10M cell dataset [chart: performance per timestep (arbitrary units) vs. number of cores, for Louhi (Cray XT), Huygens (IBM PWR6) and Jugene (IBM BG/P)].

17 Code_Saturne initial performance (ii), 10M cell dataset [chart: performance per timestep (arbitrary units) vs. number of cores, for Louhi (Cray XT), Huygens (IBM PWR6) and Jugene (IBM BG/P)].

18 Code_Saturne initial performance (ii), 100M cell dataset [chart: "Mixer Grid 100M Cells - Solver", relative performance (arbitrary units) vs. number of cores for Cray XT4 and IBM BG/P, with an ideal scaling line].

19 Optimization: Multigrid. V1.4 replaces the standard conjugate gradient solver with a multigrid solver. Two potential performance gains: the solver may converge in fewer iterations, and it requires fewer operations per coarse-grid iteration. It also improves the robustness of the code.

20 Multigrid performance, Cray XT4, 10M cell dataset [chart: performance comparison, conjugate gradient vs. multigrid; relative performance (arbitrary units) vs. number of cores].

21 Multigrid future optimizations (1/2). Currently, multigrid coarsening does not cross processor boundaries. This implies that on p processors the coarsest matrix may not contain fewer than p cells. With a high processor count, fewer grid levels will be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count; this reduces scalability. Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small.

22 Multigrid future optimizations (2/2). Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small, mapping onto the underlying multicore architecture (see the sketch below). Most ranks will then have empty grids, but latency dominates anyway at this stage. The communication pattern is not expected to change much, as the partitioning is of a recursive nature (whether using recursive graph partitioning or space-filling curves) and should already exhibit some sort of multigrid nature. This may be less optimal than some methods using a different partitioning for each rank, but setup time should also remain much cheaper.
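
A minimal sketch of the planned rank mapping, under the assumptions that "nearest multiple" means rounding down and that the threshold and stride are run-time parameters; this is an illustration, not Code_Saturne's implementation.

```c
/* Decide which rank should own this rank's coarse grid at a given
 * multigrid level: once the mean local grid size drops below a
 * threshold, grids are gathered onto ranks that are multiples of
 * merge_stride (4 or 8), so coarsening can continue on fewer,
 * fuller ranks. */

int coarse_grid_owner(int rank, long mean_local_cells,
                      long min_cells_per_rank, int merge_stride)
{
  if (mean_local_cells >= min_cells_per_rank)
    return rank;                     /* grid still large enough: keep it */

  /* Round down to the nearest multiple of merge_stride; ranks that are
   * not multiples end up with empty grids at this level, which is cheap
   * since latency dominates on the coarse levels anyway. */
  return (rank / merge_stride) * merge_stride;
}
```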

23 CrayPat: multigrid serial performance. The profile at 128 cores shows 65% of the runtime in scalar numerical routines: Total 100.0%; USER routines include _mat_vec_p_l_native 10.9%, _conjugate_gradient_mp 7.9%, _alpha_a_x_p_beta_y_native 7.5%, cblas_daxpy 4.8%, _polynomial_preconditionning 4.6%, gradrc_ 2.5% and _jacobi. All of these are targets for optimization. Most have already been optimised for the IBM and Bull platforms, e.g. #if defined(__xlc__) / #pragma disjoint(*x, *y, *da, *xa1, *xa2) / #endif / #if defined(ia64_optim). This shows there is a way forward for increased performance targeting the AMD quad-core nodes.
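
As a hedged illustration of that "way forward": the no-alias information conveyed by the IBM-specific #pragma disjoint can also be expressed with the portable C99 restrict qualifier, which the compilers on the XT4's AMD quad-core nodes can exploit. The kernel below is a simplified stand-in with assumed names and a diagonal-only operation, not the real _alpha_a_x_p_beta_y_native routine.

```c
/* Sketch of an alpha*A.x + beta*y style update with explicit no-alias
 * hints: "restrict" promises the compiler that da, x and y never
 * overlap, which allows unrolling, software pipelining and
 * vectorization of the loop. */

#include <stddef.h>

void alpha_a_x_p_beta_y(size_t n, double alpha, double beta,
                        const double *restrict da,
                        const double *restrict x,
                        double *restrict y)
{
  for (size_t i = 0; i < n; i++)
    y[i] = alpha * da[i] * x[i] + beta * y[i];
}
```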

24 CrayPat: multigrid parallel performance. Timing results show that it is not scaling: the profile at 128 cores shows 35% of the time in non-compute operations, while the profile at 512 cores shows only 11% in serial operations. This points to either a load-balancing problem or communication written inefficiently for an XT. Early profiles suggest collectives are dominating: Total 100.0%; MPI routines include MPI_Allreduce 9.3%, MPI_Waitall 4.7%, MPI_Barrier 4.6%, MPI_Recv 3.8%, MPI_Isend 1.0% and MPI_Irecv. Significant time is spent in Waitall, which is also imbalanced (3rd and 4th columns of the CrayPat table).

25 CrayPat analysis: message exchange. The predominant message-exchanging routine is cs_halo_sync_var, called 180,000 times in 10 iterations. The exchange is implemented with isends and irecvs, with a global barrier between the irecvs and the isends to ensure each receive is posted before the matching send. For better performance we will re-order the isends; it would also be better not to issue isends and irecvs of zero length, and to have some work to do between the irecvs and the isends so that the communication can happen asynchronously. In many cases the calls to cs_halo_sync_var can be combined to send one message rather than four. A sketch of this restructured exchange follows below.
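
A minimal sketch of the suggested structure, with hypothetical argument names (this is not the actual cs_halo_sync_var): receives posted first, zero-length neighbours skipped, no global barrier, room for local work before a single MPI_Waitall.

```c
/* Halo exchange sketch: one message per neighbouring rank. */

#include <mpi.h>
#include <stdlib.h>

void halo_sync_sketch(int n_neighbours, const int *neighbour_rank,
                      const int *recv_count, double **recv_buf,
                      const int *send_count, double **send_buf,
                      MPI_Comm comm)
{
  MPI_Request *req = malloc(2 * (size_t)n_neighbours * sizeof(MPI_Request));
  int n_req = 0;

  /* Post all receives first: no global barrier is needed. */
  for (int n = 0; n < n_neighbours; n++)
    if (recv_count[n] > 0)
      MPI_Irecv(recv_buf[n], recv_count[n], MPI_DOUBLE,
                neighbour_rank[n], 0, comm, &req[n_req++]);

  /* Then the sends, skipping empty messages. */
  for (int n = 0; n < n_neighbours; n++)
    if (send_count[n] > 0)
      MPI_Isend(send_buf[n], send_count[n], MPI_DOUBLE,
                neighbour_rank[n], 0, comm, &req[n_req++]);

  /* Ideally some purely local computation would be done here, so the
   * exchanges proceed asynchronously while the CPU stays busy. */

  MPI_Waitall(n_req, req, MPI_STATUSES_IGNORE);
  free(req);
}
```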

26 CrayPat analysis: global communications. The synchronization time for the collectives is more significant than the collectives themselves, so we should be able to reduce them: remove the barrier within cs_halo_sync_var, and collate consecutive global collectives (see the sketch below). This should save time spent in the collectives and also give opportunities for overlapping.
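
A minimal sketch of collating two back-to-back reductions into one; the names are illustrative, and this only applies when the reduced scalars do not depend on each other.

```c
/* Pack two independent local scalars into a single MPI_Allreduce
 * instead of issuing one collective per scalar, halving the number of
 * global synchronization points. */

#include <mpi.h>

void collated_allreduce(double local_a, double local_b,
                        double *global_a, double *global_b, MPI_Comm comm)
{
  double local[2]  = { local_a, local_b };
  double global[2];

  MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, comm);

  *global_a = global[0];
  *global_b = global[1];
}
```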

27 Parallelization of partitioning. Version 1.4 is already prepared for parallel mesh partitioning: the mesh is read by blocks in a "canonical / global" numbering and redistributed using a cell-to-domain number mapping (sketched below). All that is required is to plug in a parallel mesh partitioning algorithm to obtain an alternative cell domain mapping; the redistribution infrastructure is already in place and already being used. Possible choices: ParMETIS, PT-Scotch.
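
A minimal sketch of that block-to-partition redistribution (not the actual FVM infrastructure), assuming each rank holds a block of cell values in canonical order and a cell_domain array produced by whichever partitioner is plugged in.

```c
/* Redistribute cell values from the canonical block layout to the
 * layout given by the cell -> domain mapping, using one MPI_Alltoallv. */

#include <mpi.h>
#include <stdlib.h>

void redistribute_by_domain(int n_block_cells, const int *cell_domain,
                            const double *block_vals, MPI_Comm comm,
                            double **part_vals, int *n_part_cells)
{
  int n_ranks;
  MPI_Comm_size(comm, &n_ranks);

  int *send_count = calloc(n_ranks, sizeof(int));
  int *recv_count = malloc(n_ranks * sizeof(int));
  int *send_disp  = malloc((n_ranks + 1) * sizeof(int));
  int *recv_disp  = malloc((n_ranks + 1) * sizeof(int));

  /* Count cells per destination domain. */
  for (int i = 0; i < n_block_cells; i++)
    send_count[cell_domain[i]] += 1;

  /* Everyone learns how much it will receive from everyone else. */
  MPI_Alltoall(send_count, 1, MPI_INT, recv_count, 1, MPI_INT, comm);

  send_disp[0] = recv_disp[0] = 0;
  for (int r = 0; r < n_ranks; r++) {
    send_disp[r + 1] = send_disp[r] + send_count[r];
    recv_disp[r + 1] = recv_disp[r] + recv_count[r];
  }
  *n_part_cells = recv_disp[n_ranks];

  /* Pack cell values grouped by destination domain. */
  double *send_buf = malloc((send_disp[n_ranks] + 1) * sizeof(double));
  int *offset = calloc(n_ranks, sizeof(int));
  for (int i = 0; i < n_block_cells; i++) {
    int d = cell_domain[i];
    send_buf[send_disp[d] + offset[d]++] = block_vals[i];
  }

  /* Single collective exchange from block layout to partition layout. */
  *part_vals = malloc((*n_part_cells + 1) * sizeof(double));
  MPI_Alltoallv(send_buf, send_count, send_disp, MPI_DOUBLE,
                *part_vals, recv_count, recv_disp, MPI_DOUBLE, comm);

  free(offset); free(send_buf);
  free(send_count); free(recv_count); free(send_disp); free(recv_disp);
}
```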

28 Parallel partitioner performance [chart: time taken (seconds) vs. number of cores for PT-Scotch and ParMETIS].

29 I/O overheads [chart: performance (arbitrary units) vs. number of processors on the Cray XT4, per-iteration time and total time against ideal scaling].

30 I/O overheads [same chart, with the gap between the per-iteration and total-time curves annotated as the I/O overhead].

31 Parallel I/O (i). Version 1.4 introduces parallel I/O, using block-to-partition redistribution when reading and partition-to-block redistribution when writing. It is fully implemented for reading the preprocessor and partitioner output, as well as for restart files; the infrastructure for postprocessor output is in progress.

32 Parallel I/O (ii). Parallel I/O is only of benefit when using parallel filesystems. Use of MPI IO may be disabled either when building the FVM library or, for a given file, using specific hints. Without MPI IO, the data for each block is written or read successively by rank 0, through the same FVM file functions (a sketch of the MPI IO read path is given below).
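
A minimal sketch of the MPI IO path, under the assumption of a flat file of doubles in which each rank owns a contiguous block starting at global index block_start; the FVM file functions themselves are not shown.

```c
/* Collective block read: each rank reads its own block at an explicit
 * offset.  Without MPI IO, rank 0 would instead read block after block
 * and forward each one to its owner. */

#include <mpi.h>

int read_block_mpi_io(const char *path, MPI_Offset block_start,
                      int n_local, double *buf, MPI_Comm comm)
{
  MPI_File fh;
  int err = MPI_File_open(comm, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
  if (err != MPI_SUCCESS)
    return err;

  err = MPI_File_read_at_all(fh, block_start * (MPI_Offset)sizeof(double),
                             buf, n_local, MPI_DOUBLE, MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
  return err;
}
```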

33 Parallel I/O (iii). Prior to using parallel I/O, we would use a similar mapping of partitions to blocks, but the blocks would be assembled in succession on rank 0: each block is written before the next is assembled, to avoid requiring a very large buffer, while a minimum buffer size is enforced so as to limit the number of blocks when the data is small. Otherwise we would be latency-bound and exhibit inverse scalability.

34 Load imbalance (1/3). RANS, 100M tetrahedra + polyhedra (most I/O factored out). Polyhedra due to mesh joinings may lead to higher load imbalance in the local MatVec at large core counts: 96286/... min/max cells at 1024 cores (5.8% imbalance); 11344/12781 min/max cells at 8192 cores (8.9% imbalance).

35 Load imbalance (2/3). If load imbalance increases with processor count, scalability decreases. If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, but processor power is wasted. Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces), as sketched below.
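
A minimal sketch of such a weighting; the linear form chosen for f() and the reference face count are assumptions, since the slide leaves f() open.

```c
/* Build per-cell weights for the partitioner: cells gain weight with
 * their number of faces, so ranks holding face-rich polyhedral cells
 * receive fewer cells overall. */

void build_cell_weights(int n_cells, const int *n_faces_per_cell,
                        int faces_ref, int *weight)
{
  for (int i = 0; i < n_cells; i++) {
    /* weight = 1 + f(n_faces); here f() simply counts faces beyond a
     * reference count (e.g. 6 for hexahedra). */
    int extra = n_faces_per_cell[i] - faces_ref;
    weight[i] = 1 + (extra > 0 ? extra : 0);
  }
}
```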

36 Load imbalance (3/3). Another possible source of load imbalance is differing cache miss rates on different ranks, which are difficult to estimate in advance. With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another a cache miss every 400 instructions, and the cost of a cache miss is at least 100 instructions, the corresponding imbalance reaches 20%.

37 Hybrid MPI / OpenMP (1/2). Currently only the MPI model is used: by default everything is parallel, and synchronization is explicit when required. On multiprocessor / multicore nodes, shared-memory parallelism could also be used (via OpenMP directives). Parallel sections must be marked, and parallel loops must avoid modifying the same values; specific numberings must be used, similar to those used for vectorization but with different constraints: avoid false sharing and keep locality to limit cache misses (see the sketch below).
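
A minimal sketch of the renumbering constraint described above, assuming faces have already been grouped (hypothetical group_start array with n_groups + 1 entries) so that no two faces in the same group touch the same cell; only then can the face loop be divided between OpenMP threads without two threads updating the same entry.

```c
/* Face gather loop parallelized with OpenMP: groups are processed one
 * after another, and within a group the faces are independent by
 * construction, so threads never write to the same y[] entry.
 * Static scheduling keeps each thread on a contiguous face range,
 * which helps locality and limits false sharing. */

#include <stddef.h>

void face_gather_openmp(int n_groups, const size_t *group_start,
                        const size_t *face_cell, const double *xa,
                        const double *x, double *y)
{
  for (int g = 0; g < n_groups; g++) {
    #pragma omp parallel for schedule(static)
    for (size_t f = group_start[g]; f < group_start[g + 1]; f++) {
      size_t i = face_cell[2*f];
      size_t j = face_cell[2*f + 1];
      y[i] += xa[f] * x[j];
      y[j] += xa[f] * x[i];
    }
  }
}
```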

38 Hybrid MPI / OpenMP (2/2). EDF plans to test hybrid MPI / OpenMP on Blue Gene, and also pure OpenMP parallelism for ease of packaging and installation on Linux distributions: no dependencies on multiple MPI library choices, only on the gcc runtime, which is good enough for current multicore workstations. Coupling with SYRTHES 4 will still require MPI.

39 Code_Saturne: summary. Several projects exist (in addition to PRACE) to improve the performance of the code: pre-processing, mesh generation and mesh partitioning; improvements to the CFD solver; code optimization, particularly on the Cray XT4 / XT5 (and Jaguar); parallel I/O; hybrid MPI / OpenMP. Parallel performance of the existing code is very good, particularly for large problem sizes; we hope to benchmark the 100M-cell mixing grid for PRACE as soon as possible. The introduction of multigrid has reduced scalability but improved performance. Load balancing is difficult to perfect on large processor counts.
