EDF's Code_Saturne and parallel IO. Toolchain evolution and roadmap


1 EDF's Code_Saturne and parallel IO. Toolchain evolution and roadmap

2 Code_Saturne main capabilities. Physical modelling: single-phase laminar and turbulent flows (k-ε, k-ω SST, v2f, RSM, LES); radiative heat transfer (DOM, P-1); combustion of coal, fuel and gas (EBU, pdf, LWP); electric arc and Joule effect; Lagrangian module for dispersed particle tracking; compressible flow; ALE method for deformable meshes; conjugate heat transfer (Syrthes & 1D); specific engineering modules for nuclear waste surface storage and cooling towers; derived version for atmospheric flows (Mercure_Saturne); derived version for Eulerian multiphase flows.

3 Code_Saturne main capabilities. Version milestones: prototype; 1.0: basic modelling, wide range of meshes, qualification for nuclear applications; open source since March; 1.1: parallelism, LES; 1.2: state of the art in turbulence, open source; 1.3: massively parallel, ALE, code coupling (source code and manuals) (support). Basic capabilities: simulation of incompressible or expandable flows, with or without heat transfer and turbulence (mixing length, 2-equation models, v2f, RSM, LES, ...).

4 Code_Saturne main capabilities. Main application area: nuclear power plant optimisation in terms of lifespan, productivity and safety. But also applications in: combustion (gas and coal); electric arc and Joule effect; atmospheric flows; radiative heat transfer. Other functionalities: fluid-structure interaction; deformable meshes (Arbitrary Lagrangian Eulerian method, ALE); dispersed particle tracking (Lagrangian approach).

5 Code_Saturne main capabilities. Flexibility: portability (UNIX and Linux); GUI (Python TkTix, XML format); parallel on distributed memory machines; periodic boundaries (parallel, arbitrary interfaces); wide range of unstructured meshes with arbitrary interfaces; code coupling capabilities (Code_Saturne/Code_Saturne, Code_Saturne/Code_Aster, ...).

6 Code_Saturne general features. Technology: co-located finite volume, arbitrary unstructured meshes, predictor-corrector method; code base 49% FORTRAN, 41% C, 10% Python. Development: 1998: prototype (long-time EDF in-house experience, ESTET-ASTRID, N3S, ...); 2000: version 1.0 (basic modelling, wide range of meshes); 2001: qualification for single-phase nuclear thermal-hydraulic applications; 2004: version 1.1 (complex physics, LES, parallel computing); 2006: version 1.2 (state of the art turbulence models, GUI); 2008: version 1.3 (more parallelism, ALE, code coupling, ...), released as open source (GPL licence).

7 Code_Saturne general features. Broad validation range for each version: ~30 cases, 1 to 15 simulations per case; academic to industrial cases (4 to … cells, 0.04 s to 12 days CPU time). Runs or has run on Linux (workstations, clusters), Solaris, SGI Irix64, Fujitsu VPP 5000, HP AlphaServer, Blue Gene/L and P, PowerPC, BULL NovaScale, Cray XT. Qualification for single-phase nuclear applications: best practice guidelines in a specific and critical domain. Usual real-life industrial studies (… to … cells).

8 Code_Saturne subsystems. Pre-processor (reads meshes, serial): mesh import, mesh joining, periodicity, domain partitioning. Parallel kernel (CFD solver): ghost cells creation; FVM library: parallel mesh management, code coupling, parallel treatment; BFT library: serial I/O, memory management. Couples with Code_Saturne, Syrthes, Code_Aster, the Salome platform, ... Inputs and outputs: restart files, XML data file (GUI), postprocessing output.

9 Code_Saturne mesh examples. Example of mesh with stretched cells and hanging nodes (PWR lower plenum); example of composite mesh with 3D polyhedral cells.

10 Code_Saturne: features of note for HPC. Segregated solver: all variables are solved for independently, coupling terms are explicit. Diagonal-preconditioned CG is used for the pressure equation, Jacobi (or BiCGStab) for the other variables. More importantly, matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra. Indirect addressing plus the absence of dense blocks means fewer opportunities for MatVec optimization, as memory bandwidth matters as much as peak flops. Linear equation solvers usually amount to 80% of CPU cost (dominated by the pressure step), gradient reconstruction to about 20%. The larger the mesh, the higher the relative cost of the pressure step.
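
As an illustration of the point above, here is a minimal sketch (in C, not the actual Code_Saturne kernel; all names are illustrative) of a matrix-vector product for such a co-located finite-volume matrix, stored as one diagonal term per cell plus one extra-diagonal coefficient per interior face. The face loop goes through the face-to-cells connectivity, and this indirect, mostly non-contiguous access pattern is what makes the operation memory-bandwidth bound.

```c
/* Minimal sketch (illustrative, not the actual Code_Saturne kernel):
 * y = A.x for a co-located finite-volume matrix stored as a diagonal
 * term per cell plus one extra-diagonal term per interior face. */
void mat_vec_fv(int           n_cells,
                int           n_faces,
                const int    *face_cells,  /* size 2*n_faces: the 2 cells adjacent to each face */
                const double *da,          /* diagonal terms, size n_cells */
                const double *xa,          /* extra-diagonal terms (symmetric case), size n_faces */
                const double *x,
                double       *y)
{
    /* Diagonal contribution: contiguous, streaming access. */
    for (int i = 0; i < n_cells; i++)
        y[i] = da[i] * x[i];

    /* Face contributions: indirect addressing through face_cells,
       so accesses to x and y are mostly non-contiguous. */
    for (int f = 0; f < n_faces; f++) {
        int c0 = face_cells[2*f];
        int c1 = face_cells[2*f + 1];
        y[c0] += xa[f] * x[c1];
        y[c1] += xa[f] * x[c0];
    }
}
```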

11 Base parallel operations (1/6). Distributed memory parallelism using domain partitioning. The classical ghost cell method is used for both parallelism and periodicity. Most operations require only ghost cells sharing faces; extended neighborhoods for gradients also require ghost cells sharing vertices. Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm.
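
A minimal sketch of the ghost cell update pattern described above, assuming illustrative data structures (per-neighbor send lists and a ghost section appended to the local array) rather than the actual halo API:

```c
/* Minimal sketch (illustrative data structures): update ghost cell
 * values from neighboring ranks. Each rank packs the cells its
 * neighbors need, posts non-blocking receives into the ghost section
 * of the array, then sends the packed values. */
#include <mpi.h>
#include <stdlib.h>

void halo_sync(int        n_neighbors,
               const int *neighbor_rank,  /* size n_neighbors */
               const int *send_index,     /* size n_neighbors+1, into send_cells */
               const int *send_cells,     /* local ids of cells to send */
               const int *recv_index,     /* size n_neighbors+1, into the ghost zone */
               int        n_local_cells,
               double    *val,            /* size n_local_cells + n_ghost_cells */
               MPI_Comm   comm)
{
    MPI_Request *req = malloc(2 * n_neighbors * sizeof(MPI_Request));
    double *send_buf = malloc(send_index[n_neighbors] * sizeof(double));

    /* Post receives directly into the ghost part of the array. */
    for (int n = 0; n < n_neighbors; n++)
        MPI_Irecv(val + n_local_cells + recv_index[n],
                  recv_index[n+1] - recv_index[n], MPI_DOUBLE,
                  neighbor_rank[n], 0, comm, &req[n]);

    /* Pack and send the cells needed by each neighbor's ghost zone. */
    for (int n = 0; n < n_neighbors; n++) {
        for (int i = send_index[n]; i < send_index[n+1]; i++)
            send_buf[i] = val[send_cells[i]];
        MPI_Isend(send_buf + send_index[n],
                  send_index[n+1] - send_index[n], MPI_DOUBLE,
                  neighbor_rank[n], 0, comm, &req[n_neighbors + n]);
    }

    MPI_Waitall(2 * n_neighbors, req, MPI_STATUSES_IGNORE);

    free(send_buf);
    free(req);
}
```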

12 Base parallel operations (2/6). Use of global numbering: we associate a global number with each mesh entity. A specific C type (fvm_gnum_t) is used for this; it is currently an unsigned integer (usually 32-bit), but an unsigned long integer (64-bit) will become necessary. For hexahedral cells, the face-cell connectivity has size 4·n_faces, with n_faces about 3·n_cells, so its size is around 12·n_cells; global numbers therefore exceed the 32-bit range (2^32 / 12) at around 350 million cells. The global number is currently equal to the initial (pre-partitioning) number. This allows for partition-independent single-image files: essential for restart files, also used for postprocessor output, and for legacy coupling where matches can be saved.

13 Base parallel operations (3/6). Use of global numbering: redistribution on n_blocks blocks, with n_blocks ≤ n_cores. A minimum block size may be set to avoid many small blocks (for some communication or usage schemes), or to force a single block (for I/O with non-parallel libraries).
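
A minimal sketch of such a block distribution, under the assumption (illustrative, not the actual FVM code) that blocks are defined by an even split of the global numbering rounded up to a minimum block size; the owning rank of any global number is then a simple division:

```c
/* Minimal sketch (illustrative): block layout and block ownership
 * derived from global numbers. A very large min_block_size forces a
 * single block on rank 0, which is what non-parallel I/O libraries need. */
#include <stdint.h>

typedef struct {
    int      n_blocks;    /* number of blocks actually used (<= n_ranks) */
    uint64_t block_size;  /* global numbers per block (last block may be smaller) */
} block_dist_t;

block_dist_t block_dist_create(uint64_t n_global, int n_ranks,
                               uint64_t min_block_size)
{
    block_dist_t bd;
    bd.block_size = (n_global + n_ranks - 1) / n_ranks;  /* even split, rounded up */
    if (bd.block_size < min_block_size)
        bd.block_size = min_block_size;
    if (bd.block_size == 0)
        bd.block_size = 1;
    bd.n_blocks = (int)((n_global + bd.block_size - 1) / bd.block_size);
    return bd;
}

/* Owning rank of a (1-based) global number: a simple division, so no
   lookup structure or communication is needed. */
int block_dist_rank(const block_dist_t *bd, uint64_t gnum)
{
    return (int)((gnum - 1) / bd->block_size);
}
```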

14 Base parallel operations (4/6). Conversely, simply using global numbers allows reconstructing the mapping of entities to their equivalents on neighboring partitions. This is used for parallel ghost cell construction, starting from an initially partitioned mesh with no ghost data. The arbitrary (block) distribution is inefficient for halo exchange, but allows simpler data structures and algorithms with deterministic performance bounds: the owning processor is determined simply from the global number, and messages are aggregated.

15 Base parallel operations (5/6). Use of global numbering: using a block distribution for I/O avoids the need for indexed MPI datatypes. We layer our raw I/O subsystem so that we may use either pre-distribution or direct MPI-IO with indexed datatypes in the future (and compare performance), but I/O using libraries with no indexed-datatype equivalent may require using the block distribution. Switching between partition-based and block-based distributions is currently done using MPI_Alltoall (for counts) and MPI_Alltoallv (for data). On high-performance machines with well-optimized MPI libraries, this does not seem to lead to scaling issues.
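
A minimal sketch of the partition-to-block exchange described above, assuming values have already been sorted by destination rank (names and signatures are illustrative, not the actual FVM functions):

```c
/* Minimal sketch (illustrative): move values from a partition-based
 * distribution to a block distribution. The destination rank of each
 * local entity is known from its global number; counts are exchanged
 * with MPI_Alltoall, then the values with MPI_Alltoallv. */
#include <mpi.h>
#include <stdlib.h>

double *part_to_block(int           n_send,     /* local entities */
                      const int    *dest_rank,  /* size n_send, non-decreasing */
                      const double *send_val,   /* size n_send, sorted like dest_rank */
                      int          *n_recv_out,
                      MPI_Comm      comm)
{
    int n_ranks;
    MPI_Comm_size(comm, &n_ranks);

    int *send_count = calloc(n_ranks, sizeof(int));
    int *recv_count = malloc(n_ranks * sizeof(int));
    int *send_disp  = malloc(n_ranks * sizeof(int));
    int *recv_disp  = malloc(n_ranks * sizeof(int));

    for (int i = 0; i < n_send; i++)
        send_count[dest_rank[i]] += 1;

    /* Exchange counts, then build displacements. */
    MPI_Alltoall(send_count, 1, MPI_INT, recv_count, 1, MPI_INT, comm);

    send_disp[0] = 0;
    recv_disp[0] = 0;
    for (int r = 1; r < n_ranks; r++) {
        send_disp[r] = send_disp[r-1] + send_count[r-1];
        recv_disp[r] = recv_disp[r-1] + recv_count[r-1];
    }

    int n_recv = recv_disp[n_ranks-1] + recv_count[n_ranks-1];
    double *recv_val = malloc((n_recv > 0 ? n_recv : 1) * sizeof(double));

    MPI_Alltoallv((void *)send_val, send_count, send_disp, MPI_DOUBLE,
                  recv_val, recv_count, recv_disp, MPI_DOUBLE, comm);

    free(send_count); free(recv_count); free(send_disp); free(recv_disp);

    *n_recv_out = n_recv;
    return recv_val;
}
```

The reverse (block-to-partition) move, used when reading, follows the same pattern with the roles of send and receive exchanged.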

16 Base parallel operations (6/6). Such all-to-all communication is only used for specific operations, such as parallel sort, identification of equivalent entities on processor boundaries, and I/O. Algorithms such as the crystal router exist in which the message count (and thus latency) is of complexity O(log(p)); we trust MPI implementors to use the best routing algorithms and heuristics in most cases, as they are the specialists. We avoid explicit O(p) communication loops in new code, and plan to remove a few such loops still residing in the post-processor file output. There is often less than one such call per time step on average, versus hundreds or thousands of calls to MPI_Isend / MPI_Irecv and MPI_Allreduce.

17 Benchmark test cases. FATHER test case: 1 M cells, 200 iterations, LES turbulence. HYPI test case: 10 M cells, 200 iterations, LES turbulence. GRILLE test case: 100 M cells, 50 iterations, k-ε turbulence. Mesh sizes range from usual industrial studies to exploratory studies.

18 Current performance (1/6). Two LES test cases (most I/O factored out). 1 M cells: (n_cells_min + n_cells_max)/2 = 880 cells per core at 1024 cores, 109 at … cores; 10 M cells: (n_cells_min + n_cells_max)/2 = 9345 at 1024 cores, 1150 at … cores. Figures: elapsed time vs. number of cores for the FATHER (1 M hexahedra, LES) and HYPI (10 M hexahedra, LES) test cases on Opteron + InfiniBand, Opteron + Myrinet, NovaScale and Blue Gene/L.

19 Current performance (2/6). RANS, 100 M tetrahedra + polyhedra (most I/O factored out). Polyhedra due to mesh joinings may lead to higher load imbalance in the local MatVec at large core counts: 96286/… min/max cells per core at 1024 cores, 11344/12781 min/max cells per core at 8192 cores. Figure: elapsed time per iteration vs. number of cores for the GRILLE (fuel assembly grid) RANS test case on NovaScale, Blue Gene/L (CO) and Blue Gene/L (VN).

20 Current performance (3/6). Figure: speed-up vs. number of processors on the FATHER test case (HP workstation, Chatou cluster, Tantale, Platine, Blue Gene/L).

21 Current performance (4/6). Figure: speed-up vs. number of processors on the HYPI test case (cluster, Tantale, Platine, Blue Gene).

22 Current performance (5/6). Serialized I/O: elapsed time for post-processor output on the FATHER test case. Start / restart is up to 4 times faster than postprocessor output, due to communication scheme differences. Figure: elapsed time (s) for a typical post-processing output with FATHER (1 M) vs. number of processors (HP workstation, Tantale cluster, Platine, Blue Gene).

23 Current performance (6/6). Serialized I/O: elapsed time for post-processor output on the HYPI test case. Figure: elapsed time (s) for a typical post-processing output on HYPI (10 M) vs. number of processors (Blue Gene, Platine, Tantale, Chatou cluster).

24 Toolchain evolution (1). Code_Saturne V1.3 added many HPC-oriented improvements compared to prior versions: post-processor output handled by FVM / kernel; ghost cell construction handled by FVM / kernel, with up to 40% gain in preprocessor memory peak compared to V1.2 (parallelized and scales, though complex, as we may have 2 ghost cell sets and multiple periodicities). Well adapted for models up to 100 million cells (with a 64 GB preprocessing machine), or less. All fundamental limitations are pre-processing related. Toolchain: meshes → Pre-Processor (serial run) → Kernel + FVM (distributed run) → postprocessing output.

25 Toolchain evolution (2). Basic pre-processing is separated from partitioning in version 1.4. This allows running the two stages on different machines, and reusing the same pre-processing run for different partitionings. Joining of non-conforming meshes usually requires only a few minutes under 10 million cells, but can take close to a day for 50 million+ cell meshes built of many sub-parts. Partitioning requires slightly less memory than pre-processing. Toolchain: meshes → Pre-Processor (serial run) → Partitioner (serial run) → Kernel + FVM (distributed run) → postprocessing output.

26 Toolchain evolution (3). Mesh joining is parallelized for V2.0: joining of conforming/non-conforming meshes, periodicity. Parallel edge intersection is implemented but not merged yet; sub-face reconstruction is in progress. The current (serial) algorithm will probably remain available as a backup option for V2.0 (and be removed no sooner than 6 months after the full parallel algorithm). Allow restart in case of memory fragmentation due to a complex mesh joining step. Toolchain: meshes → Kernel + FVM (distributed init) → Partitioner (serial run) → Kernel + FVM (distributed run) → postprocessing output.

27 Toolchain evolution (4). Domain partitioning is parallelized for V2.0: replace serial METIS or SCOTCH with parallel versions. SCOTCH uses the free CeCILL-C licence, whereas METIS is more restrictive, but SCOTCH does not yet seem as stable (possibly due to the METIS wrapper, though feedback from Code_Aster seems to indicate the same). ParMetis partitioning is reportedly of lower quality than serial METIS: is it that bad? PT-SCOTCH does not suffer from this, but its functionality is not complete yet.

28 Large case example (1/2). 125 million hexahedra channel at Daresbury; 100 million tetrahedra calculation at EDF (simplified PWR fuel assembly mixing grid). Run on … to … Blue Gene/L cores, tested also on … BULL NovaScale cores. Biggest issue: mesh generation.

29 Large case example (2/2).

30 Parallelization of partitioning. Version 1.4 is already prepared for parallel mesh partitioning: the mesh is read by blocks in «canonical / global» numbering, then redistributed using the cell domain number mapping. All that is required is to plug in a parallel mesh partitioning algorithm to obtain an alternative cell domain mapping; the redistribution infrastructure is already in place and already being used. Possible choice: use Sandia's ZOLTAN library, which may call METIS, SCOTCH, its own high-quality hypergraph partitioning, recursive bisection or space-filling curves (Z-curve or Hilbert), and has an LGPL license.

31 Parallel I/O. Version 1.4 introduces parallel I/O: it uses block-to-partition redistribution when reading, and partition-to-block redistribution when writing. Use of indexed datatypes may be tested in the future, but will not be possible everywhere. Parallel I/O is fully implemented for reading of preprocessor and partitioner output, as well as for restart files; the infrastructure is in progress for postprocessor output. A layered approach is used, as we allow for multiple formats.

32 Standard I/O (1/2). By default, all stdout-type messages go to a log file, using Fortran write. Even in C, we print to a buffer, which is written as a character string by Fortran, to ensure messages appear in the correct order. It is easier to define (and possibly overload) a bft_printf function in C than to modify Fortran I/O in a portable manner, so C adapts to Fortran. The log is opened as "listing" on rank 0, or unit 6 if we want to send it to the screen; we may use "listing_rank_num" on other ranks, but /dev/null is the default.
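
A minimal sketch of this buffered printing scheme, assuming a hypothetical Fortran routine (here called csprnt) that writes a character string to the listing unit; the actual bft_printf interface and Fortran binding may differ:

```c
/* Minimal sketch (illustrative names, not the actual BFT API):
 * C code formats the message into a buffer, then hands the string to
 * a Fortran routine which performs the write, so that C and Fortran
 * output appear in order in the same "listing" file. */
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Fortran subroutine writing a character string to the log unit
   (hypothetical name and binding). */
extern void csprnt(const char *msg, const int *len);

int log_printf(const char *format, ...)
{
    char buffer[16384];
    va_list args;

    va_start(args, format);
    int n = vsnprintf(buffer, sizeof(buffer), format, args);
    va_end(args);

    if (n < 0)
        return n;

    int len = (int)strlen(buffer);
    csprnt(buffer, &len);   /* Fortran does the actual write */

    return n;
}
```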

33 Standard I/O (2/2). In case of failure (an error detected by the code, such as a consistency or convergence error) or when a signal is caught: open a special error_rank_num file, and print details and possibly a backtrace to this file. In the case of a signal handler, do this only on rank 0 for signals such as SIGTERM, ..., but on all failing ranks for SIGFPE or SIGSEGV. Also copy the message to the regular output if possible (i.e. if va_copy is available).
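
A minimal sketch of such a signal handler, with illustrative names (the error-file naming and rank handling are assumptions, not the actual BFT error handling):

```c
/* Minimal sketch (illustrative): on a fatal signal such as SIGFPE or
 * SIGSEGV, the failing rank opens its own error file and writes a
 * short diagnostic before re-raising the signal. stdio is not
 * async-signal-safe, but the process is about to abort anyway. */
#include <signal.h>
#include <stdio.h>

static int log_rank = 0;   /* set from MPI_Comm_rank at startup */

static void fatal_signal_handler(int sig)
{
    char file_name[64];
    snprintf(file_name, sizeof(file_name), "error_r%04d", log_rank);

    FILE *f = fopen(file_name, "w");
    if (f != NULL) {
        fprintf(f, "rank %d: caught signal %d\n", log_rank, sig);
        /* a backtrace could also be written here */
        fclose(f);
    }

    signal(sig, SIG_DFL);   /* fall back to default handling */
    raise(sig);
}

void install_fatal_signal_handlers(int rank)
{
    log_rank = rank;
    signal(SIGFPE, fatal_signal_handler);
    signal(SIGSEGV, fatal_signal_handler);
}
```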

34 Parallel I/O (1/5). Parallel I/O is only of benefit when using parallel filesystems. Use of MPI IO may be disabled either when building the FVM library, or for a given file using specific hints. Without MPI IO, data for each block is written or read successively by rank 0, using the same FVM file functions.

35 Parallel I/O (2/5). Prior to using parallel I/O, we would use a similar mapping of partitions to blocks, but blocks would be assembled in succession on rank 0, writing each block before assembling the next to avoid requiring a very large buffer, and enforcing a minimum buffer size so as to limit the number of blocks when data is small. Otherwise, we would be latency-bound and exhibit inverse scalability.

36 Parallel I/O (3/5). Uses an fvm_file intermediary layer, which only provides a subset of MPI IO possibilities, privileging collectives. It handles offsets and other metadata, and provides fvm_file_write_global, fvm_file_write_block, and the corresponding read functions. We will try fvm_file_read/write_list in the future, using indexed datatypes to move all data shuffling inside MPI IO instead of outside. The layer allows using (synchronous) explicit-offset and individual-file-pointer implementations; shared pointers were tried, but removed due to issues on some filesystems and limited interest.
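
A minimal sketch (not the actual fvm_file API) of what a collective block write with explicit offsets can look like in plain MPI-IO: each rank writes its contiguous block of a global array with MPI_File_write_at_all.

```c
/* Minimal sketch (illustrative): collective, explicit-offset block
 * write, the kind of pattern block write functions map to when
 * MPI IO is enabled. */
#include <mpi.h>
#include <stdlib.h>

/* Write n_local doubles forming a contiguous block of a global array;
   global_shift is this rank's starting index in the global array,
   file_offset the byte offset of the array in the file. */
static void write_block(MPI_File fh, MPI_Offset file_offset,
                        long long global_shift, long long n_local,
                        const double *values)
{
    MPI_Offset offset = file_offset
                      + (MPI_Offset)global_shift * (MPI_Offset)sizeof(double);

    /* Collective variant: all ranks of the communicator attached to fh
       must call this, even those with n_local == 0. */
    MPI_File_write_at_all(fh, offset, values, (int)n_local, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long long n_local = 1000;                 /* block size on this rank */
    long long shift = (long long)rank * n_local;

    double *values = malloc(n_local * sizeof(double));
    for (long long i = 0; i < n_local; i++)
        values[i] = (double)(shift + i);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "block_data.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    write_block(fh, 0, shift, n_local, values);

    MPI_File_close(&fh);
    free(values);
    MPI_Finalize();
    return 0;
}
```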

37 Parallel I/O (4/5). Recently added for mesh input and checkpoint/restart files. A specific format is used for these files, based on successive header + data sections, with possible padding for alignment. The initial header indicates the minimum header size, header alignment, and data alignment, so the file is easy to analyze and index for restart purposes by reading headers and seeking to the next header.
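
A minimal sketch of the alignment padding implied by this format; the helper below simply rounds a file offset up to the alignment announced in the initial header (names are illustrative):

```c
/* Minimal sketch (illustrative): round the current file offset up to
 * the next multiple of the alignment given in the initial header, so
 * that each data section starts on an aligned boundary. */
#include <stdint.h>

static int64_t align_offset(int64_t offset, int64_t alignment)
{
    if (alignment <= 1)
        return offset;
    return ((offset + alignment - 1) / alignment) * alignment;
}

/* Example: with a 56-byte header and data aligned on 512 bytes, the
   data section starts at align_offset(56, 512) = 512; bytes 56 to 511
   are padding. */
```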

38 Parallel I/O (5/5). This is a recent development, with not much feedback yet. It would be most useful for postprocessor output, but that will require more work: indexed output data must be distributed to blocks, preferably balanced by data size rather than by index size, for good load balance. We will replace the use of fvm_gather_... type functions by fvm_part_to_block_... type functions in the file output and output helper code. Multiple layers are involved due to the handling of several file formats.

39 Possible future changes. Base the global (canonical) numbering not on the initial number, but on a space-filling curve. This is a domain partitioning method in its own right, independent of processor count. It ensures the block-to-partition mapping is «sparse» (i.e. only a few blocks are referenced by a partition, and vice-versa), and the block-to-partition mapping is 1-to-1 if no additional partitioning method is used.
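
A minimal sketch of a space-filling-curve numbering of this kind, using a 3D Morton (Z-order) code obtained by interleaving the bits of quantized cell-center coordinates; sorting cells by this code and numbering them in that order gives a global numbering independent of the processor count (an illustration, not the actual implementation):

```c
/* Minimal sketch (illustrative): 3D Morton (Z-curve) code built by
 * interleaving the bits of quantized coordinates. */
#include <stdint.h>

/* Spread the lower 21 bits of x so they occupy every third bit. */
static uint64_t spread_bits_3d(uint64_t x)
{
    x &= 0x1fffffULL;                      /* keep 21 bits */
    x = (x | x << 32) & 0x1f00000000ffffULL;
    x = (x | x << 16) & 0x1f0000ff0000ffULL;
    x = (x | x << 8)  & 0x100f00f00f00f00fULL;
    x = (x | x << 4)  & 0x10c30c30c30c30c3ULL;
    x = (x | x << 2)  & 0x1249249249249249ULL;
    return x;
}

/* Morton code from integer coordinates (each < 2^21), e.g. obtained by
   quantizing cell centers onto a regular grid over the bounding box. */
static uint64_t morton3d(uint32_t ix, uint32_t iy, uint32_t iz)
{
    return spread_bits_3d(ix)
         | (spread_bits_3d(iy) << 1)
         | (spread_bits_3d(iz) << 2);
}
```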

40 Load imbalance (1/3). In this example, using 8 partitions (with METIS), we have the following local minima and maxima: cells: 416 / 440 (6% imbalance); cells + ghost cells: 469 / 519 (11% imbalance); interior faces: 852 / 946 (11% imbalance). Most loops are on cells, but some are on cells + ghosts, and MatVec is on cells + faces.

41 Load imbalance (2/3). If load imbalance increases with processor count, scalability decreases. If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, but processor power is wasted. Perfect balancing is impossible to reach, as different loops show different imbalance levels, and synchronizations may be required between these loops (PCG uses MatVec and dot products). Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces).

42 Load imbalance (3/3). Another possible source of load imbalance is different cache miss rates on different ranks, which are difficult to estimate in advance. With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another a cache miss every 400 instructions, and considering that the cost of a cache miss is at least 100 instructions, the corresponding imbalance can reach 20%.

43 Multigrid (1/2). Currently, multigrid coarsening does not cross processor boundaries. This implies that on p processors, the coarsest matrix may not contain fewer than p cells. With a high processor count, fewer grid levels will be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count. This reduces scalability, and may be checked (if suspected) using the solver summary info at the end of the log file. Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small.

44 Multigrid (2/2). Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small; see the sketch below. Most ranks will then have empty grids, but latency dominates anyway at this stage. The communication pattern is not expected to change too much, as partitioning is of a recursive nature (whether using recursive graph partitioning or space-filling curves), and should already exhibit some sort of multigrid nature. This may be less optimal than some methods using a different partitioning for each grid level, but setup time should also remain much cheaper; this is important, as grids may be rebuilt at each time step.
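
A minimal sketch of the rank-grouping rule mentioned above; the merge stride and the threshold test are illustrative assumptions:

```c
/* Minimal sketch (illustrative): when the mean number of coarse-grid
 * cells per rank drops below a threshold, each group of `stride` ranks
 * gathers its coarse grid onto the lowest rank of the group; the other
 * ranks of the group keep an empty grid for that level and below. */

static int coarse_grid_owner(int rank, int stride)
{
    /* stride = 4: ranks 0..3 -> 0, 4..7 -> 4, ... */
    return (rank / stride) * stride;
}

static int need_grid_merge(long long n_global_cells, int n_ranks,
                           long long min_cells_per_rank)
{
    return (n_global_cells / n_ranks) < min_cells_per_rank;
}
```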

45 Parallelization of mesh joining (1/2). Though mesh joining may not be a long-term solution (as it tends to be used excessively, in ways detrimental to mesh quality), it is a functionality we cannot do without as long as we do not have a proven alternative. Periodicity is also constructed as an extension of mesh joining. In addition, joining of meshes built in several pieces to circumvent meshing-tool memory limitations is very useful.

46 Parallelization of mesh joining (2/2). Parallelizing this algorithm requires the same main steps as the serial algorithm: (1) detect intersections (within a given tolerance) between edges of overlapping faces (done); this uses a parallel octree for face bounding boxes, built in a bottom-up fashion (no balance condition required here), whereas the preprocessor version used a lexicographical binary search, whose best case was O(n log n) and worst case O(n²); (2) subdivide edges according to inserted intersection vertices; (3) merge coincident or nearly-coincident vertices/intersections, which requires recursive merge accept/refuse (done); (4) re-build sub-faces.

47 Parallel point location. Parallel point location is used by the cooling tower and overset grid functionalities, as well as by coupling with SYRTHES 4, none of which is fully standard yet. The algorithm is parallel, but the communication pattern could lead to some serialization. A simple improvement would be to order communication by communicating first among ranks in the same half of the rank set, then among ranks in the other half, ordering these communications recursively in the same manner; load imbalance may remain significant, but serialization would be limited. A rendez-vous approach using an octree or a recursive-bisection intermediate mesh may be more efficient, but its setup is more complex and expensive.

48 Hybrid MPI / OpenMP (1/3). Currently, only the MPI model is used: by default, everything is parallel, and synchronization is explicit when required. On multiprocessor / multicore nodes, shared memory parallelism could also be used (using OpenMP directives). Parallel sections must be marked, and parallel loops must avoid modifying the same values. Specific numberings must be used, similar to those used for vectorization, but with different constraints: avoid false sharing, and keep locality to limit cache misses. A sketch is given below.
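
A minimal sketch of the two situations, with illustrative names: a cell loop can be parallelized directly with OpenMP, whereas a face loop scattering to adjacent cells is assumed here to have been renumbered into groups in which no two faces share a cell, so that updates within a group are race-free.

```c
/* Minimal sketch (illustrative, not Code_Saturne source). */
#include <omp.h>

void cell_axpy(int n_cells, double alpha,
               const double *restrict x, double *restrict y)
{
    /* Cell loop: each iteration writes a distinct y[i], so a simple
       "parallel for" is enough. */
    #pragma omp parallel for
    for (int i = 0; i < n_cells; i++)
        y[i] += alpha * x[i];
}

void face_scatter(int n_groups, const int *group_index,
                  const int *face_cells, const double *face_flux,
                  double *cell_sum)
{
    /* Face loop: faces are processed group by group; each group is
       assumed to be built so that no two of its faces touch the same
       cell, so the scatter to cell_sum has no write conflicts. */
    for (int g = 0; g < n_groups; g++) {
        #pragma omp parallel for
        for (int f = group_index[g]; f < group_index[g + 1]; f++) {
            int c0 = face_cells[2*f];
            int c1 = face_cells[2*f + 1];
            cell_sum[c0] += face_flux[f];
            cell_sum[c1] -= face_flux[f];
        }
    }
}
```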

49 Hybrid MPI / OpenMP (3/3). Hybrid MPI / OpenMP will be tested; IBM will do testing on Blue Gene/P. We also hope to provide OpenMP parallelism for ease of packaging / installation on Linux distributions: no dependencies on multiple MPI library choices, only on the GCC runtime, which is good enough for current multicore workstations. Coupling with SYRTHES 4 will still require MPI.

50 HPC roadmap: application examples
