Large-scale Gas Turbine Simulations on GPU clusters


Tobias Brandvik and Graham Pullan, Whittle Laboratory, University of Cambridge

A large-scale simulation

Overview
PART I: Turbomachinery
PART II: Stencil-based PDE solvers on multi-core
PART III: A new solver for turbomachinery flows
PART IV: Example simulations

PART I: Turbomachinery

Turbomachinery: thousands of blades, arranged in rows. Each row has a bespoke blade profile designed with CFD. (Left image: from the Rolls-Royce Trent 1000 brochure; right image: courtesy of Dave Sizer.)

Research problem: surge prediction

Deverson low-speed compressor rig

Deverson simulation: a typical routine simulation. Structured grid, steady state, 3 million grid nodes. 8 hours on four CPU cores; 25 minutes on a Fermi GPU.

PART II: Stencil-based PDE solvers

Aim: to produce an order-of-magnitude reduction in the run-time of CFD solvers for the same hardware cost, and to maintain performance portability across current and future multi-core processors.

Structured grids: our work focuses on multi-block structured grids. A structured grid solver is a series of stencil operations, and the stencil operations are discrete approximations of the governing equations.

Solving PDEs on multi-core processors
Navier-Stokes, discretised with finite volume, explicit time stepping, and second-order central differences (Scree scheme).
Stencil operations (set flux, sum flux, shear stress...): 90% of run-time.
Non-stencil operations (multigrid, mixing plane, sliding plane...): 10% of run-time.

Stencil operations on multi-core processors: a single implementation? Multiple implementations? Our alternative: a high-level language for stencil operations with source-to-source compilation.

Stencil example: structured grid indexing. A node (i, j, k) and its six neighbours (i±1, j, k), (i, j±1, k), (i, j, k±1).

Stencil example in Fortran
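The Fortran listing itself is not reproduced in this transcription. As an illustration only (not the authors' code, and written in Python/NumPy rather than Fortran 77), a seven-point smoothing stencil of the kind discussed here might look like:

```python
import numpy as np

def smooth(q):
    """Seven-point smoothing stencil on a 3D structured block.

    q has shape (ni, nj, nk); only interior nodes are updated, mirroring
    the i/j/k triple loop of a Fortran stencil routine.
    """
    out = q.copy()
    out[1:-1, 1:-1, 1:-1] = (
        q[2:, 1:-1, 1:-1] + q[:-2, 1:-1, 1:-1] +   # i+1, i-1
        q[1:-1, 2:, 1:-1] + q[1:-1, :-2, 1:-1] +   # j+1, j-1
        q[1:-1, 1:-1, 2:] + q[1:-1, 1:-1, :-2]     # k+1, k-1
    ) / 6.0
    return out
```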

Stencil example: stencil definition.
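The definition shown on the slide is not reproduced here. Since the stencil definitions were written in Python (see the implementation notes later in the talk), a hypothetical definition in the same spirit, with made-up field and kernel names, might look roughly like this sketch:

```python
# Hypothetical stencil definition in a Python-hosted DSL (illustrative only;
# the actual Turbostream definition language may differ).
smooth_kernel = {
    "name": "smooth",
    "inputs": ["q"],        # fields read by the stencil
    "outputs": ["q_new"],   # fields written by the stencil
    # Neighbour accesses are expressed as (di, dj, dk) offsets from (i, j, k).
    "expression": (
        "q_new[0,0,0] = (q[1,0,0] + q[-1,0,0] + q[0,1,0] + q[0,-1,0]"
        " + q[0,0,1] + q[0,0,-1]) / 6.0"
    ),
}
```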

Source-to-source compilation: the stencil definition is transformed at compile time into code that can run on the chosen processor. The transformation is performed by filling in a pre-defined template using the stencil definition: a CPU template yields CPU source (.c), a GPU template yields GPU source (.cu), and a template for some future processor X yields X source.
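A minimal sketch of the template-filling step, assuming Python's string.Template and a simplified, hypothetical stencil description (the real code generator and templates are considerably more involved):

```python
from string import Template

# Hypothetical 1D stencil description (illustrative only).
stencil = {"name": "smooth",
           "body": "q_new[idx] = 0.5f * (q[idx - 1] + q[idx + 1]);"}

# A heavily simplified CUDA kernel template; real templates would also
# handle 3D tiling, shared memory and halo nodes.
GPU_TEMPLATE = Template("""
__global__ void ${name}(const float *q, float *q_new, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx > 0 && idx < n - 1) {
        ${body}
    }
}
""")

# Fill the template from the stencil definition and emit a .cu source file.
with open(stencil["name"] + ".cu", "w") as f:
    f.write(GPU_TEMPLATE.substitute(**stencil))
```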

Stencil example: other stencils (set fluxes, sum fluxes, smoothing) are in principle the same.

Implementation details: there are many optimisation strategies for stencil operations (see the Supercomputing 2008 paper by Datta et al.). CPUs: parallelisation with pthreads, SSE vectorisation. GPUs: cyclic queues.

Software framework: templates + a run-time library. The library provides a uniform API for both CPUs and GPUs: MPI, reductions, and parallel sparse matrix-vector multiplication.

Testbed. CPU 1: Intel Core i7 920 (2.66 GHz); CPU 2: AMD Phenom II X4 940 (3.0 GHz); GPU: NVIDIA GTX 280.

Stencil benchmark


PART III: A new solver for turbomachinery flows

Turbostream: we have implemented a new solver that can run on both CPUs and GPUs. The starting point was an existing solver called TBLOCK; the new solver is called Turbostream.

TBLOCK: developed by John Denton. Blocks with arbitrary patch interfaces; a simple and fast algorithm. 15,000 lines of Fortran 77, of which the main solver routines are only 5,000 lines.

Denton codes: TBLOCK is the latest of the Denton codes, a family that dates back to the late 1970s and is used for some part of the design process at most turbomachinery manufacturers.

Turbostream: 3,000 lines of stencil definitions (~15 different stencil kernels); the code generated from the stencil definitions is 15,000 lines; an additional 5,000 lines of C for boundary conditions, file I/O, etc. The source code is very similar to TBLOCK: every subroutine has an equivalent stencil definition.

Implementation notes: the stencil kernels are easy, translated from Fortran 77 to Python by hand. What about the last 10%?

The last 10%: it becomes important once the run-time of the other 90% has been reduced by a factor of 10. It comprises multigrid, sliding planes, mixing planes, and inlet/outlet flow.

Multigrid: an algorithm for improving convergence. The coarsening and interpolation steps are implemented as a parallel sparse matrix-vector multiplication (pspmv), as sketched below.
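A minimal sketch of the idea, using SciPy's sparse module purely for illustration (the solver's own pspmv is a distributed CPU/GPU implementation): restriction of a fine-grid residual to a coarser level is just a sparse matrix applied to a vector.

```python
import numpy as np
import scipy.sparse as sp

# Toy 1D example: restrict a fine-grid residual to a coarse grid by
# averaging pairs of fine nodes.  In the solver, coarsening and
# interpolation between multigrid levels take this matrix-vector form.
n_fine = 8
rows = np.repeat(np.arange(n_fine // 2), 2)   # coarse node index
cols = np.arange(n_fine)                      # fine node index
vals = np.full(n_fine, 0.5)
R = sp.csr_matrix((vals, (rows, cols)), shape=(n_fine // 2, n_fine))

residual_fine = np.random.rand(n_fine)
residual_coarse = R @ residual_fine           # the pspmv step
print(residual_coarse)
```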

Sliding planes: 1D interpolation between grids in relative motion, also implemented as a pspmv.

Mixing planes: an interface that mixes the flow between a stationary and a rotating blade row. This is the difficult case: complicated coding (needs many small reductions), it can take up a significant fraction of run-time for particular cases, and it requires a hand-coded implementation for each processor.

Inlet and outflow: the straightforward case. Easy coding (a local operation) that consumes a small fraction of run-time, so it is handled as: 1. get data from the GPU; 2. perform the computation on the CPU; 3. write the data back to GPU memory (see the sketch below).
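A minimal sketch of this three-step pattern, using CuPy purely for illustration (CuPy is not the authors' CUDA/C toolchain, and the boundary update shown is a toy relaxation, not the actual boundary condition):

```python
import numpy as np
import cupy as cp

def apply_inlet_bc(q_gpu, inlet_profile):
    """Apply an inlet boundary condition on the CPU.

    1. Copy the boundary slice from GPU to host memory.
    2. Perform the cheap, local computation on the CPU.
    3. Write the updated slice back to GPU memory.
    """
    boundary = cp.asnumpy(q_gpu[0, :, :])            # 1. device -> host
    boundary[:] = 0.5 * (boundary + inlet_profile)   # 2. compute on CPU
    q_gpu[0, :, :] = cp.asarray(boundary)            # 3. host -> device

# Example usage with a toy flow field and inlet profile.
nj, nk = 64, 32
q_gpu = cp.zeros((128, nj, nk), dtype=cp.float32)
apply_inlet_bc(q_gpu, np.ones((nj, nk), dtype=np.float32))
```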

Turbostream performance: TBLOCK runs in parallel on all four CPU cores using MPI. Initial Fermi results are 2x faster, helped by the larger shared memory.

Multi-processor performance: the University of Cambridge became an NVIDIA CCoE in December 2008. NVIDIA donated 32 S1070s, which were added to the existing Darwin supercomputer (Nehalem-based CPU nodes from Dell, QDR InfiniBand).

Multi-processor performance: the benchmark case is an unsteady simulation of a turbine stage.

Multi-processor implementation: 1. stencil kernels; 2. boundary conditions; 3. exchange halo nodes (MPI). A sketch of the halo exchange is shown below.
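A minimal sketch of the halo exchange in step 3, assuming mpi4py and a simple 1D block decomposition for illustration (the solver itself handles arbitrary block-to-block patch interfaces):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Local block with one layer of halo nodes at each end of the i direction.
ni, nj, nk = 16, 8, 8
q = np.zeros((ni + 2, nj, nk))

# Send the first/last interior planes, receive into the halo planes.
comm.Sendrecv(sendbuf=q[1],  dest=left,  sendtag=0,
              recvbuf=q[-1], source=right, recvtag=0)
comm.Sendrecv(sendbuf=q[-2], dest=right, sendtag=1,
              recvbuf=q[0],  source=left,  recvtag=1)
```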

Multi-processor performance


PART IV: Example simulations

Example simulations: a three-stage turbine (small scale), a cooling hole simulation (medium scale), and a three-stage compressor (large scale).

Three-stage turbine

Single stage geometry - clean

Single stage geometry - real

Entropy function through the machine: 4 million grid nodes; 10 hours on a single CPU; 8 minutes on four GPUs.

Total pressure loss at Stator 3 exit: experiment vs. Turbostream.

Cooling hole simulation: gas at ~1800 K arrives from the combustor, so holes are drilled in the blade for cooling; currently tens of holes are modelled at a time. (Image from the Texas A&M website: http://www1.mengr.tamu.edu/thtl/projects.html)

Cooling hole simulation: an LES calculation of a single hole, used for better understanding and better models. 15 million grid nodes, time-accurate; 2 days on 16 GPUs.

Compressor simulation: a three-stage compressor test rig at Siemens, UK. 160 million grid nodes; 5 revolutions needed to obtain a periodic solution (22,500 time steps). On 32 NVIDIA GT200 GPUs, each revolution takes 24 hours.

Entropy function contours at mid-span

Conclusions: the switch to multi-core processors enables a step change in performance, but existing codes have to be rewritten. The differences between processors make it difficult to hand-code a solver that will run on all of them; we suggest a high-level abstraction coupled with source-to-source compilation.

Conclusions: a new solver called Turbostream, based on Denton's TBLOCK, has been implemented. Turbostream is ~10 times faster when running on an NVIDIA GPU than TBLOCK running on a quad-core Intel or AMD CPU. Single blade-row calculations are almost interactive on a desktop (10-30 seconds), multi-stage calculations take a few minutes on a small cluster ($10,000), and full-annulus unsteady calculations complete overnight on a modest cluster ($100,000).