EDF's Code_Saturne and parallel I/O. Toolchain evolution and roadmap
1 EDF's Code_Saturne and parallel I/O: toolchain evolution and roadmap
2 Code_Saturne main capabilities
Physical modelling:
- Single-phase laminar and turbulent flows: k-ε, k-ω SST, v2f, RSM, LES
- Radiative heat transfer (DOM, P-1)
- Combustion: coal, fuel, gas (EBU, pdf, LWP)
- Electric arc and Joule effect
- Lagrangian module for dispersed particle tracking
- Compressible flow
- ALE method for deformable meshes
- Conjugate heat transfer (Syrthes & 1D)
- Specific engineering modules for nuclear waste surface storage and cooling towers
- Derived version for atmospheric flows (Mercure_Saturne)
- Derived version for Eulerian multiphase flows
3 Code_Saturne main capabilities
- Prototype: basic modelling
- 1.0: wide range of meshes, qualification for nuclear applications
- 1.1: parallelism, L.E.S.
- 1.2: state of the art in turbulence, open source
- 1.3: massively parallel, ALE, code coupling
Open source since March (source code and manuals, support)
Basic capabilities: simulation of incompressible or expandable flows, with or without heat transfer and turbulence (mixing length, 2-equation models, v2f, RSM, LES, ...)
4 Code_Saturne main capabilities
Main application area: nuclear power plant optimisation in terms of lifespan, productivity and safety.
But also applications in:
- Combustion (gas and coal)
- Electric arc and Joule effect
- Atmospheric flows
- Radiative heat transfer
Other functionalities:
- Fluid-structure interaction
- Deformable meshes: Arbitrary Lagrangian-Eulerian method (ALE)
- Dispersed particle tracking (Lagrangian approach)
5 Code_Saturne main capabilities
Flexibility:
- Portability (UNIX and Linux)
- GUI (Python TkTix, XML format)
- Parallel on distributed memory machines
- Periodic boundaries (parallel, arbitrary interfaces)
- Wide range of unstructured meshes with arbitrary interfaces
- Code coupling capabilities (Code_Saturne/Code_Saturne, Code_Saturne/Code_Aster, ...)
6 Code_Saturne general features
Technology: co-located finite volume, arbitrary unstructured meshes, predictor-corrector method; lines of code: 49% FORTRAN, 41% C, 10% Python
Development:
- 1998: Prototype (long-time EDF in-house experience: ESTET-ASTRID, N3S, ...)
- 2000: Version 1.0 (basic modelling, wide range of meshes)
- 2001: Qualification for single-phase nuclear thermal-hydraulic applications
- 2004: Version 1.1 (complex physics, LES, parallel computing)
- 2006: Version 1.2 (state of the art turbulence models, GUI)
- 2008: Version 1.3 (more parallelism, ALE, code coupling, ...), released as open source (GPL licence)
7 Code_Saturne general features
- Broad validation range for each version: ~30 cases, 1 to 15 simulations per case
- Academic to industrial cases (4 to … cells, 0.04 s to 12 days CPU time)
- Runs or has run on Linux (workstations, clusters), Solaris, SGI Irix64, Fujitsu VPP 5000, HP AlphaServer, Blue Gene/L and P, PowerPC, BULL NovaScale, Cray XT
- Qualification for single-phase nuclear applications: best practice guidelines in a specific and critical domain
- Usual real-life industrial studies (… to … cells)
8 Code_Saturne subsystems
- Pre-processor (serial): mesh import, mesh joining, periodicity, domain partitioning
- Parallel Kernel: ghost cell creation, CFD solver
- FVM library: parallel mesh management, code coupling, parallel treatment
- BFT library: serial I/O, memory management
- Couplings: Syrthes, Code_Aster, Salome platform, ...
- Files: meshes (input), restart files, XML data file (GUI), postprocessing output
9 Code_Saturne mesh examples
- Example of mesh with stretched cells and hanging nodes (PWR lower plenum)
- Example of composite mesh with 3D polyhedral cells
10 Code_Saturne: features of note to HPC
Segregated solver:
- All variables are solved for independently; coupling terms are explicit
- Diagonal-preconditioned CG used for the pressure equation, Jacobi (or BiCGStab) used for other variables
- More important, matrices have no block structure, and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
- Indirect addressing + no dense blocks means fewer opportunities for MatVec optimization, as memory bandwidth is as important as peak flops
- Linear equation solvers usually amount to 80% of CPU cost (dominated by pressure), gradient reconstruction about 20%
- The larger the mesh, the higher the relative cost of the pressure step
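The point about indirect addressing can be made concrete with a minimal sketch of a scalar CSR matrix-vector product of the kind described above (one value per non-zero, no dense blocks). The function and type names here are illustrative, not Code_Saturne's actual API.

```c
#include <stddef.h>

/* Illustrative scalar CSR MatVec: one value per non-zero (no dense
 * blocks), ~7 non-zeros per row for hexahedra. The indirect access
 * x[col_id[j]] is what makes the loop memory-bandwidth bound. */
void csr_mat_vec(size_t n_rows, const size_t *row_index,
                 const int *col_id, const double *val,
                 const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double s = 0.0;
        for (size_t j = row_index[i]; j < row_index[i+1]; j++)
            s += val[j] * x[col_id[j]];
        y[i] = s;
    }
}
```

With only ~7 non-zeros per row, each `val[j]` is used once per product, so peak flops matter far less than how fast `val`, `col_id` and `x` stream through the memory hierarchy.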
11 Base parallel operations (1/6)
- Distributed memory parallelism using domain partitioning
- Use of the classical ghost cell method for both parallelism and periodicity
- Most operations require only ghost cells sharing faces; extended neighborhoods for gradients also require ghost cells sharing vertices
- Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm
12 Base parallel operations (2/6)
Use of global numbering:
- We associate a global number to each mesh entity
- A specific C type (fvm_gnum_t) is used for this; currently an unsigned integer (usually 32-bit), but an unsigned long integer (64-bit) will be necessary
- Face-cell connectivity for hexahedral cells: size 4·n_faces, with n_faces about 3·n_cells, so size around 12·n_cells; global numbers thus require 64 bits above roughly 350 million cells
- Currently equal to the initial (pre-partitioning) number
- Allows for partition-independent single-image files: essential for restart files, also used for postprocessor output
- Also used for legacy coupling, where matches can be saved
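The 350-million-cell threshold above follows directly from the connectivity size estimate; a quick back-of-envelope check:

```c
#include <stdint.h>

/* Check of the slide's estimate: for hexahedra, face-cell connectivity
 * has ~4*n_faces entries with n_faces ~ 3*n_cells, i.e. ~12*n_cells
 * global numbers. A 32-bit unsigned gnum therefore overflows once
 * 12*n_cells exceeds 2^32. */
uint64_t cells_needing_64bit(void)
{
    return (UINT64_C(1) << 32) / 12;  /* ~357 million cells */
}
```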
13 Base parallel operations (3/6)
Use of global numbering:
- Redistribution on n_blocks blocks, with n_blocks ≤ n_cores
- A minimum block size may be set to avoid many small blocks (for some communication or usage schemes), or to force 1 block (for I/O with non-parallel libraries)
14 Base parallel operations (4/6)
- Conversely, simply using global numbers allows reconstructing a mapping of equivalent entities on neighboring partitions
- Used for parallel ghost cell construction from an initially partitioned mesh with no ghost data
- An arbitrary distribution is inefficient for halo exchange, but allows for simpler data-structure-related algorithms with deterministic performance bounds
- Owning processor determined simply by global number; messages are aggregated
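"Owning processor determined simply by global number" can be sketched as arithmetic on a uniform block distribution: given the block size, the owner of global number g needs no lookup table. Names here are illustrative, not the FVM library's actual API.

```c
#include <stdint.h>

/* Uniform block distribution sketch: entity with (1-based) global
 * number g belongs to block (g-1)/block_size, so its owning rank
 * follows from g alone, with deterministic cost. */
typedef uint64_t gnum_t;

static inline gnum_t block_size(gnum_t n_global, int n_blocks)
{
    return (n_global + n_blocks - 1) / n_blocks;  /* ceiling division */
}

static inline int owning_block(gnum_t g, gnum_t n_global, int n_blocks)
{
    return (int)((g - 1) / block_size(n_global, n_blocks));
}
```

Because ownership is computable, each rank can bin its entities by destination and send one aggregated message per target block.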
15 Base parallel operations (5/6)
Use of global numbering:
- Using a block distribution for I/O avoids the need for indexed MPI datatypes
- We layer our raw I/O subsystem so that we may use either predistribution or direct MPI I/O with indexed datatypes in the future (and compare performance), but I/O using libraries with no indexed-datatype equivalent may require using the block distribution
- Switching from a partition-based to a block-based distribution is currently done using MPI_Alltoall (for counts) and MPI_Alltoallv (for data)
- On high-performance machines with well-optimized MPI libraries, this does not seem to lead to scaling issues
16 Base parallel operations (6/6)
- Such all-to-all communication is only used for specific operations, such as parallel sort, identification of equivalent entities on processor boundaries, and I/O
- Algorithms such as the crystal router exist, in which message count (and thus latency) is of complexity O(log(p))
- We trust MPI implementors to use the best routing algorithms and heuristics in most cases; they are the specialists
- We avoid explicit O(p) communication loops in new code, and have plans to remove a few such loops still residing in the post-processor file output
- Often less than 1 such call per time step on average, versus hundreds or thousands of calls to MPI_Isend / MPI_Irecv and MPI_Allreduce
17 Benchmark test cases (number of cells in the mesh, industrial to exploratory studies)
- FATHER test case: 1 M cells, 200 iterations, turbulence = L.E.S.
- HYPI test case: 10 M cells, 200 iterations, turbulence = L.E.S.
- GRILLE test case: 100 M cells, 50 iterations, turbulence = k-ε
18 Current performance (1/6)
2 LES test cases (most I/O factored out):
- FATHER, 1 M hexahedra: (n_cells_min + n_cells_max)/2 = 880 at 1024 cores, 109 at … cores
- HYPPI, 10 M hexahedra: (n_cells_min + n_cells_max)/2 = 9345 at 1024 cores, 1150 at … cores
[Plots: elapsed time vs. number of cores on Opteron + InfiniBand, Opteron + Myrinet, NovaScale, Blue Gene/L]
19 Current performance (2/6)
- RANS, 100 M tetrahedra + polyhedra (most I/O factored out)
- Polyhedra due to mesh joinings may lead to higher load imbalance in the local MatVec for large core counts
- 96286/… min/max cells per core at 1024 cores; 11344/12781 min/max cells per core at 8192 cores
[Plot: FA Grid RANS test case, elapsed time per iteration vs. number of cores on NovaScale, Blue Gene/L (CO), Blue Gene/L (VN)]
20 Current performance (3/6)
Speedups on the FATHER test case
[Plot: speed-up vs. number of processors on HP workstation, Chatou cluster, Tantale, Platine, BGL]
21 Current performance (4/6)
Speedups on the HYPI test case
[Plot: speed-up vs. number of processors on cluster, Tantale, Platine, Blue Gene]
22 Current performance (5/6)
Serialized I/O elapsed time for post-processor output, FATHER test case
- Start / restart up to 4 times faster than postprocessor output (communication scheme differences)
[Plot: elapsed time (s) for a typical post-processing output with FATHER (1 M) vs. number of processors, on HP workstation, Tantale cluster, Platine, Blue Gene]
23 Current performance (6/6)
Serialized I/O elapsed time for post-processor output, HYPI test case
[Plot: elapsed time (s) for a typical post-processing output on HYPI (10 M) vs. number of processors, on Blue Gene, Platine, Tantale, Chatou cluster]
24 Toolchain evolution (1)
Code_Saturne V1.3 added many HPC-oriented improvements compared to prior versions:
- Post-processor output handled by FVM / Kernel
- Ghost cell construction handled by FVM / Kernel: up to 40% gain in preprocessor memory peak compared to V1.2; parallelized and scales, though complex, as we may have 2 ghost cell sets and multiple periodicities
- Well adapted for models up to 100 million cells or so (with a 64 GB preprocessing machine)
- All fundamental limitations are pre-processing related
Toolchain: Meshes → Pre-Processor (serial run) → Kernel + FVM (distributed run) → Postprocessing output
25 Toolchain evolution (2)
Separate basic pre-processing from partitioning in version 1.4:
- Allows running both stages on different machines
- Allows using the same pre-processing run for different partitionings
- Joining of non-conforming meshes usually requires only a few minutes under 10 million cells, but can take close to a day for 50 million+ cell meshes built of many sub-parts
- Partitioning requires slightly less memory than pre-processing
Toolchain: Meshes → Pre-Processor (serial run) → Partitioner (serial run) → Kernel + FVM (distributed run) → Postprocessing output
26 Toolchain evolution (3)
Parallelize mesh joining for V2.0:
- Joining of conforming/non-conforming meshes, periodicity
- Parallel edge intersection implemented, not merged yet; sub-face reconstruction in progress
- The current (serial) algorithm will probably remain available as a backup option for V2.0 (and be removed no sooner than 6 months after the full parallel algorithm)
- Allow restart in case of memory fragmentation due to a complex mesh joining step
Toolchain: Meshes → Kernel + FVM (distributed init) → Partitioner (serial run) → Kernel + FVM (distributed run) → Postprocessing output
27 Toolchain evolution (4)
Parallelize domain partitioning for V2.0: replace serial METIS or SCOTCH with parallel versions
- SCOTCH uses the free CeCILL-C licence, whereas METIS's is more restrictive, but SCOTCH does not yet seem as stable (possibly due to the METIS wrapper, though feedback from Code_Aster seems to indicate the same)
- ParMETIS partitioning reportedly of lower quality than serial METIS; is it that bad?
- PT-SCOTCH does not suffer from this, but its functionality is not complete yet
28 Large case example (1/2)
- 125 million hexahedra canal at Daresbury; 100 million tetrahedra calculation at EDF
- Simplified PWR fuel assembly mixing grid
- Run on up to … Blue Gene/L cores; tested also on … BULL NovaScale cores
- Biggest issue: mesh generation
29 Large case example (2/2)
30 Parallelization of partitioning
- Version 1.4 already prepared for parallel mesh partitioning: mesh read by blocks in «canonical / global» numbering, redistributed using the cell → domain number mapping
- All that is required is to plug in a parallel mesh partitioning algorithm to obtain an alternative cell → domain mapping; the redistribution infrastructure is already in place, and already being used
- Possible choice: use Sandia's ZOLTAN library; may call METIS, SCOTCH, its own high-quality hypergraph partitioning, recursive bisection or space-filling curves (Z-curve or Hilbert); LGPL license
31 Parallel I/O (1/)
Version 1.4 introduces parallel I/O:
- Uses block-to-partition redistribution when reading, partition-to-block when writing
- Use of indexed datatypes may be tested in the future, but will not be possible everywhere
- Fully implemented for reading of preprocessor and partitioner output, as well as for restart files
- Infrastructure in progress for postprocessor output; layered approach, as we allow for multiple formats
32 Standard I/O (1/2)
- By default, all stdout-type messages go to a log file, using Fortran write
- Even in C, we print to a buffer, which is written as a character string by Fortran, to ensure messages appear in the correct order
- It is easier to define (and possibly overload) a bft_printf function in C than to modify Fortran I/O in a portable manner, so C adapts to Fortran...
- Opened as listing on rank 0, or unit 6 if we want to send it to the screen
- We may use listing_rank_num on other ranks, but /dev/null is the default
33 Standard I/O (2/2)
In case of failure (an error detected by the code, such as a consistency or convergence error) or a caught signal:
- Open a special error_rank_num file, and print details and possibly a backtrace to this file
- In the case of a signal handler, only do this on rank 0 for signals such as SIGTERM, ...; do it on all ranks that fail for SIGFPE or SIGSEGV
- Also copy the message to the regular output if possible (if va_copy is available)
34 Parallel I/O (1/5)
- Parallel I/O is only of benefit when using parallel filesystems
- Use of MPI I/O may be disabled either when building the FVM library, or for a given file using specific hints
- Without MPI I/O, data for each block is written or read successively by rank 0, using the same FVM file functions on top of the MPI I/O subsystem
35 Parallel I/O (2/5)
Prior to using parallel I/O, we would use a similar mapping of partitions to blocks, but blocks would be assembled in succession on rank 0:
- writing each block before assembling the next, to avoid requiring a very large buffer
- enforcing a minimum buffer size so as to limit the number of blocks when data is small; otherwise, we would be latency-bound, and exhibit inverse scalability
36 Parallel I/O (3/5)
Uses the fvm_file intermediary layer:
- Only provides a subset of MPI I/O possibilities, privileging collectives
- Handles offsets and other metadata
- Provides fvm_file_write_global, fvm_file_write_block, and corresponding read functions
- Will try fvm_file_read/write_list in the future, using indexed datatypes to have all data shuffling inside MPI I/O instead of outside
- Allows for using (synchronous) explicit offsets and individual file pointer implementations; shared pointers were tried, but removed due to issues on some filesystems and limited interest
37 Parallel I/O (4/5)
- Recently added for mesh input and checkpoint/restart files
- Specific format used for these files, based on successive header + data sections, with possible padding for alignment
- The initial header indicates the minimum header size, header alignment, and data alignment
- Easy to analyze and index for restart purposes, by reading headers and seeking to the next header
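The alignment padding mentioned above boils down to rounding a file offset up to the next multiple of the alignment declared in the initial header. A minimal sketch (the actual file layout details are Code_Saturne's own):

```c
#include <stdint.h>

/* Round a file offset up to the next multiple of `align`, so that
 * the following header or data section starts on an aligned boundary;
 * the gap between the old and new offset is the padding. */
static inline uint64_t align_offset(uint64_t offset, uint64_t align)
{
    return ((offset + align - 1) / align) * align;
}
```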
38 Parallel I/O (5/5)
- Recent development, not much feedback yet
- Most useful for postprocessor output, but this will require more work
- We need to distribute indexed output data to blocks, preferably by data blocks rather than index blocks, for good balancing
- We will replace use of fvm_gather_... type functions by fvm_part_to_block_... type functions in file output and output helper code
- Multiple layers due to handling of several file formats
39 Possible future changes
Base the global (canonical) numbering not on the initial number, but on a space-filling curve:
- A domain partitioning method in its own right
- Independent of processor count
- Ensures the block-to-partition mapping is «sparse» (i.e. only a few blocks are referenced by a partition, and vice-versa)
- Block-to-partition mapping is 1-to-1 if no additional partitioning method is used
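As an illustration of the space-filling-curve idea, a Z-curve (Morton) index can be built by interleaving the bits of quantized (i, j, k) cell coordinates; sorting cells by this key gives a partition-count-independent numbering with good locality. This is a generic sketch, not the numbering Code_Saturne actually adopted.

```c
#include <stdint.h>

/* Insert two zero bits between each of the low 21 bits of v
 * (standard bit-spreading constants for 3D Morton codes). */
static uint64_t spread_bits(uint64_t v)
{
    v &= 0x1fffff;  /* 21 bits per coordinate -> 63-bit key */
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v << 8)  & 0x100f00f00f00f00fULL;
    v = (v | v << 4)  & 0x10c30c30c30c30c3ULL;
    v = (v | v << 2)  & 0x1249249249249249ULL;
    return v;
}

/* Z-curve key: interleave bits of i, j, k. */
static uint64_t morton3d(uint64_t i, uint64_t j, uint64_t k)
{
    return spread_bits(i) | (spread_bits(j) << 1) | (spread_bits(k) << 2);
}
```

Nearby (i, j, k) triplets map to nearby keys, which is what keeps the block-to-partition mapping sparse.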
40 Load imbalance (1/3)
In this example, using 8 partitions (with METIS), we have the following local minima and maxima:
- Cells: 416 / 440 (6% imbalance)
- Cells + ghost cells: 469 / 519 (11% imbalance)
- Interior faces: 852 / 946 (11% imbalance)
Most loops are on cells, but some are on cells + ghosts, and MatVec is on cells + faces
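The percentages above are consistent with measuring imbalance as (max − min)/min; that reading is our assumption, since the slide does not state the formula. A small helper reproducing the numbers:

```c
/* One plausible definition matching the slide's figures:
 * imbalance (%) = 100 * (max - min) / min.
 * E.g. cells 416/440 -> ~5.8% (~6%), faces 852/946 -> ~11%. */
static double imbalance_pct(double vmin, double vmax)
{
    return 100.0 * (vmax - vmin) / vmin;
}
```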
41 Load imbalance (2/3)
- If load imbalance increases with processor count, scalability decreases
- If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, but processor power is wasted
- Perfect balancing is impossible to reach, as different loops show different imbalance levels, and synchronizations may be required between these loops; PCG uses MatVec and dot products
- Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces)
42 Load imbalance (3/3)
Another possible source of load imbalance is different cache miss rates on different ranks:
- Difficult to estimate in advance
- With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another a cache miss every 400 instructions, considering that the cost of a cache miss is at least 100 instructions, the corresponding imbalance reaches 20%
43 Multigrid (1/2)
- Currently, multigrid coarsening does not cross processor boundaries; this implies that on p processors, the coarsest matrix may not contain fewer than p cells
- With a high processor count, fewer grid levels will be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count
- This reduces scalability, and may be checked (if suspected) using the solver summary info at the end of the log file
- Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small
44 Multigrid (2/2)
Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small
- Most ranks will then have empty grids, but latency dominates anyway at this stage
- The communication pattern is not expected to change too much, as partitioning is of a recursive nature (whether using recursive graph partitioning or space-filling curves), and should already exhibit some sort of multigrid nature
- This may be less optimal than some methods using a different partitioning for each rank, but setup time should also remain much cheaper; this is important, as grids may be rebuilt each time step
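The rank-multiple placement described above could be sketched as follows; this is our illustration of the planned scheme (with the "nearest" multiple taken as the nearest lower one), not the implemented code.

```c
/* Sketch of coarse-grid gathering: when the mean local grid size is
 * too small, each rank's coarse-grid rows are assigned to the nearest
 * lower multiple of `stride` (4 or 8), leaving the other ranks with
 * empty grids at that level. */
static int coarse_grid_owner(int rank, int stride)
{
    return (rank / stride) * stride;
}
```

Ranks 0..3 all map to 0, ranks 4..7 to 4, and so on, so the gather stays local to small rank neighborhoods and the recursive partitioning structure is preserved.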
45 Parallelization of mesh joining (1/2)
- Though mesh joining may not be a long-term solution (as it tends to be used excessively, in ways detrimental to mesh quality), it is a functionality we cannot do without as long as we do not have a proven alternative
- Periodicity is also constructed as an extension of mesh joining
- In addition, joining of meshes built in several pieces to circumvent meshing tool memory limitations is very useful
46 Parallelization of mesh joining (2/2)
Parallelizing this algorithm requires the same main steps as the serial algorithm:
- Detect intersections (within a given tolerance) between edges of overlapping faces (done): uses a parallel octree for face bounding boxes, built in a bottom-up fashion (no balance condition required here); the preprocessor version used a lexicographical binary search, whose best case was O(n log n) and worst case was O(n²)
- Subdivide edges according to inserted intersection vertices
- Merge coincident or nearly-coincident vertices/intersections: requires recursive merge accept/refuse (done)
- Re-build sub-faces
47 Parallel point location
- Parallel point location is used by the cooling tower and overset grid functionalities, as well as by coupling with SYRTHES 4, none of which is fully standard yet
- The algorithm is parallel, but the communication pattern could lead to some serialization
- A simple improvement would be to order communications by communicating first among ranks in the same half of the rank set, then among ranks in the other half, ordering these communications recursively in the same manner; load imbalance may remain important, but serialization would be limited
- A rendez-vous approach using an octree or recursive bisection intermediate mesh may be more efficient, but its setup is more complex and expensive
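The recursive ordering suggested above can be sketched as hypercube-style rounds: in round r, rank p exchanges with p XOR 2^r, so pairs form first within small rank neighborhoods and then across progressively larger halves, with no rank ever serving as a bottleneck. This is our illustration of the idea, not Code_Saturne's implementation.

```c
/* Hypercube-style pairing: in round r (0, 1, ...), rank p exchanges
 * with p ^ (1 << r). For a power-of-two rank count, every rank has
 * exactly one distinct partner per round, and the pairing is symmetric,
 * so the schedule is deadlock-free and never serializes on one rank. */
static int partner_in_round(int rank, int round)
{
    return rank ^ (1 << round);
}
```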
48 Hybrid MPI / OpenMP (1/3)
Currently, only the MPI model is used: by default, everything is parallel, and synchronization is explicit when required
- On multiprocessor / multicore nodes, shared memory parallelism could also be used (using OpenMP directives)
- Parallel sections must be marked, and parallel loops must avoid modifying the same values
- Specific numberings must be used, similar to those used for vectorization, but with different constraints: avoid false sharing, keep locality to limit cache misses
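A minimal sketch of what "marking parallel loops that avoid modifying the same values" looks like for a cell-based loop; this is illustrative, not Code_Saturne code. Each iteration writes only its own y[i], so the directive is safe; face-based loops would instead need the renumbering mentioned above so that two threads never update the same cell. Compilers without OpenMP simply ignore the pragma and run the loop serially.

```c
/* Cell-based loop with no write conflicts: each iteration touches
 * only y[i], so an OpenMP work-sharing directive applies directly. */
void cell_axpy(int n_cells, double alpha, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n_cells; i++)
        y[i] += alpha * x[i];
}
```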
49 Hybrid MPI / OpenMP (3/3)
- Hybrid MPI / OpenMP will be tested; IBM will do testing on Blue Gene/P
- We also hope to provide OpenMP parallelism for ease of packaging / installation on Linux distributions: no dependencies on multiple MPI library choices, only on the gcc runtime; good enough for current multicore workstations
- Coupling with SYRTHES 4 will still require MPI
50 HPC roadmap: application examples
More informationRed Storm / Cray XT4: A Superior Architecture for Scalability
Red Storm / Cray XT4: A Superior Architecture for Scalability Mahesh Rajan, Doug Doerfler, Courtenay Vaughan Sandia National Laboratories, Albuquerque, NM Cray User Group Atlanta, GA; May 4-9, 2009 Sandia
More informationSELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND
Student Submission for the 5 th OpenFOAM User Conference 2017, Wiesbaden - Germany: SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND TESSA UROIĆ Faculty of Mechanical Engineering and Naval Architecture, Ivana
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationsmooth coefficients H. Köstler, U. Rüde
A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin
More informationOptimisation of LESsCOAL for largescale high-fidelity simulation of coal pyrolysis and combustion
Optimisation of LESsCOAL for largescale high-fidelity simulation of coal pyrolysis and combustion Kaidi Wan 1, Jun Xia 2, Neelofer Banglawala 3, Zhihua Wang 1, Kefa Cen 1 1. Zhejiang University, Hangzhou,
More informationReal Parallel Computers
Real Parallel Computers Modular data centers Overview Short history of parallel machines Cluster computing Blue Gene supercomputer Performance development, top-500 DAS: Distributed supercomputing Short
More informationHIGH PERFORMANCE LARGE EDDY SIMULATION OF TURBULENT FLOWS AROUND PWR MIXING GRIDS
HIGH PERFORMANCE LARGE EDDY SIMULATION OF TURBULENT FLOWS AROUND PWR MIXING GRIDS U. Bieder, C. Calvin, G. Fauchet CEA Saclay, CEA/DEN/DANS/DM2S P. Ledac CS-SI HPCC 2014 - First International Workshop
More informationGraph Partitioning for High-Performance Scientific Simulations. Advanced Topics Spring 2008 Prof. Robert van Engelen
Graph Partitioning for High-Performance Scientific Simulations Advanced Topics Spring 2008 Prof. Robert van Engelen Overview Challenges for irregular meshes Modeling mesh-based computations as graphs Static
More informationcontact:
Fluid Dynamics, Power Generation and Environment Department Single Phase Thermal-Hydraulics Group 6, quai Watier F-78401 Chatou Cedex Tel: 33 1 30 87 75 40 Fax: 33 1 30 87 79 16 JUNE 2017 version 5.0.0
More informationHYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE
HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S
More informationEN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University
EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to
More informationEarly Experiences with Trinity - The First Advanced Technology Platform for the ASC Program
Early Experiences with Trinity - The First Advanced Technology Platform for the ASC Program C.T. Vaughan, D.C. Dinge, P.T. Lin, S.D. Hammond, J. Cook, C. R. Trott, A.M. Agelastos, D.M. Pase, R.E. Benner,
More informationGeneric Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids
Generic Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids Matthieu Lefebvre 1, Jean-Marie Le Gouez 2 1 PhD at Onera, now post-doc at Princeton, department of Geosciences,
More informationPuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks
PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 1 Sandia National Laboratories, 2 The Pennsylvania
More informationParticleworks: Particle-based CAE Software fully ported to GPU
Particleworks: Particle-based CAE Software fully ported to GPU Introduction PrometechVideo_v3.2.3.wmv 3.5 min. Particleworks Why the particle method? Existing methods FEM, FVM, FLIP, Fluid calculation
More informationSecond Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering
State of the art distributed parallel computational techniques in industrial finite element analysis Second Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering Ajaccio, France
More informationMultigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK
Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible
More informationA Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography
1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography
More informationS0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS
S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS John R Appleyard Jeremy D Appleyard Polyhedron Software with acknowledgements to Mark A Wakefield Garf Bowen Schlumberger Outline of Talk Reservoir
More informationPARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS
Proceedings of FEDSM 2000: ASME Fluids Engineering Division Summer Meeting June 11-15,2000, Boston, MA FEDSM2000-11223 PARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS Prof. Blair.J.Perot Manjunatha.N.
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationAdvances of parallel computing. Kirill Bogachev May 2016
Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being
More informationOP2 FOR MANY-CORE ARCHITECTURES
OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC
More informationFast Dynamic Load Balancing for Extreme Scale Systems
Fast Dynamic Load Balancing for Extreme Scale Systems Cameron W. Smith, Gerrett Diamond, M.S. Shephard Computation Research Center (SCOREC) Rensselaer Polytechnic Institute Outline: n Some comments on
More informationWhy Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends
Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer
More informationLarge-scale Gas Turbine Simulations on GPU clusters
Large-scale Gas Turbine Simulations on GPU clusters Tobias Brandvik and Graham Pullan Whittle Laboratory University of Cambridge A large-scale simulation Overview PART I: Turbomachinery PART II: Stencil-based
More informationGPU PROGRESS AND DIRECTIONS IN APPLIED CFD
Eleventh International Conference on CFD in the Minerals and Process Industries CSIRO, Melbourne, Australia 7-9 December 2015 GPU PROGRESS AND DIRECTIONS IN APPLIED CFD Stan POSEY 1*, Simon SEE 2, and
More informationGeneric finite element capabilities for forest-of-octrees AMR
Generic finite element capabilities for forest-of-octrees AMR Carsten Burstedde joint work with Omar Ghattas, Tobin Isaac Institut für Numerische Simulation (INS) Rheinische Friedrich-Wilhelms-Universität
More informationEULAG: high-resolution computational model for research of multi-scale geophysical fluid dynamics
Zbigniew P. Piotrowski *,** EULAG: high-resolution computational model for research of multi-scale geophysical fluid dynamics *Geophysical Turbulence Program, National Center for Atmospheric Research,
More informationCommodity Cluster Computing
Commodity Cluster Computing Ralf Gruber, EPFL-SIC/CAPA/Swiss-Tx, Lausanne http://capawww.epfl.ch Commodity Cluster Computing 1. Introduction 2. Characterisation of nodes, parallel machines,applications
More informationTAU mesh deformation. Thomas Gerhold
TAU mesh deformation Thomas Gerhold The parallel mesh deformation of the DLR TAU-Code Introduction Mesh deformation method & Parallelization Results & Applications Conclusion & Outlook Introduction CFD
More informationChapter 4: Threads. Chapter 4: Threads
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationFigure 6.1: Truss topology optimization diagram.
6 Implementation 6.1 Outline This chapter shows the implementation details to optimize the truss, obtained in the ground structure approach, according to the formulation presented in previous chapters.
More informationCode Performance Analysis
Code Performance Analysis Massimiliano Fatica ASCI TST Review May 8 2003 Performance Theoretical peak performance of the ASCI machines are in the Teraflops range, but sustained performance with real applications
More informationAutomatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014
Automatic Generation of Algorithms and Data Structures for Geometric Multigrid Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Introduction Multigrid Goal: Solve a partial differential
More informationANSYS FLUENT. Lecture 3. Basic Overview of Using the FLUENT User Interface L3-1. Customer Training Material
Lecture 3 Basic Overview of Using the FLUENT User Interface Introduction to ANSYS FLUENT L3-1 Parallel Processing FLUENT can readily be run across many processors in parallel. This will greatly speed up
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationComputational Fluid Dynamics with the Lattice Boltzmann Method KTH SCI, Stockholm
Computational Fluid Dynamics with the Lattice Boltzmann Method KTH SCI, Stockholm March 17 March 21, 2014 Florian Schornbaum, Martin Bauer, Simon Bogner Chair for System Simulation Friedrich-Alexander-Universität
More informationIndustrial finite element analysis: Evolution and current challenges. Keynote presentation at NAFEMS World Congress Crete, Greece June 16-19, 2009
Industrial finite element analysis: Evolution and current challenges Keynote presentation at NAFEMS World Congress Crete, Greece June 16-19, 2009 Dr. Chief Numerical Analyst Office of Architecture and
More informationFault tolerant issues in large scale applications
Fault tolerant issues in large scale applications Romain Teyssier George Lake, Ben Moore, Joachim Stadel and the other members of the project «Cosmology at the petascale» SPEEDUP 2010 1 Outline Computational
More informationMassively Parallel Phase Field Simulations using HPC Framework walberla
Massively Parallel Phase Field Simulations using HPC Framework walberla SIAM CSE 2015, March 15 th 2015 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich
More informationSome aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)
Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously
More informationEFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI
EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,
More informationCommunication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.
Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance
More informationSeminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm
Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of
More informationChapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin
More informationPerformance Benefits of NVIDIA GPUs for LS-DYNA
Performance Benefits of NVIDIA GPUs for LS-DYNA Mr. Stan Posey and Dr. Srinivas Kodiyalam NVIDIA Corporation, Santa Clara, CA, USA Summary: This work examines the performance characteristics of LS-DYNA
More informationCP2K Performance Benchmark and Profiling. April 2011
CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox
More informationFirst Experiences with Intel Cluster OpenMP
First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May
More informationParallel Computer Architecture II
Parallel Computer Architecture II Stefan Lang Interdisciplinary Center for Scientific Computing (IWR) University of Heidelberg INF 368, Room 532 D-692 Heidelberg phone: 622/54-8264 email: Stefan.Lang@iwr.uni-heidelberg.de
More informationReconstruction of Trees from Laser Scan Data and further Simulation Topics
Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair
More informationImplicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC
Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationComparison of XT3 and XT4 Scalability
Comparison of XT3 and XT4 Scalability Patrick H. Worley Oak Ridge National Laboratory CUG 2007 May 7-10, 2007 Red Lion Hotel Seattle, WA Acknowledgements Research sponsored by the Climate Change Research
More informationAllScale Pilots Applications AmDaDos Adaptive Meshing and Data Assimilation for the Deepwater Horizon Oil Spill
This project has received funding from the European Union s Horizon 2020 research and innovation programme under grant agreement No. 671603 An Exascale Programming, Multi-objective Optimisation and Resilience
More informationHigh performance Computing and O&G Challenges
High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating
More information