60x Computational Fluid Dynamics and Visualisation
1 60x Computational Fluid Dynamics and Visualisation. Jamil Appa, BAE Systems Advanced Technology Centre
2 Outline: BAE Systems - Introduction; Aerodynamic Design Challenges; Why GPUs?; CFD on GPUs; Example Kernel; Compute and Visualisation on GPUs; Summary
3 Some of our products
5 Aerodynamic Design Challenges: Current fluid simulation tools and technologies are very capable for a limited range of design tasks. The design space cannot be fully explored within usable timescales and to the required (bounded) accuracy. Properties such as turbulence and flow separation are difficult to simulate accurately, but have a significant impact on the performance of the product.
8 Why GPUs?
10 CUDA-enabled CFD: 3D explicit finite volume; 2nd order in time and space; time accurate; two-equation turbulence model; arbitrary polyhedra; multiple-GPU implementation; uses MPI to enable use of a GPU cluster.
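As a rough illustration of what an explicit finite-volume step over arbitrary polyhedra can look like in CUDA, the sketch below loops over faces rather than cells, so the cell shape never matters. It is not the solver described on the slide: the `Face` record, the first-order upwind flux for a single scalar, and all function names are assumptions made for this example (the slide states second-order accuracy and a turbulence model, neither of which is shown here).

```cuda
#include <cuda_runtime.h>

// Hypothetical face record: each face joins an "owner" and a "neighbour" cell
// and carries a precomputed area-weighted normal velocity (u . n * area),
// positive when the flow goes from owner to neighbour.
struct Face {
    int   owner;
    int   neighbour;
    float normalFlux;
};

// First-order upwind flux of a scalar q through one face.
// __host__ __device__ lets the same source run on the CPU and the GPU.
__host__ __device__ inline float upwindFlux(float qOwner, float qNeighbour, float normalFlux)
{
    return normalFlux * (normalFlux >= 0.0f ? qOwner : qNeighbour);
}

// Explicit update of one cell: q_new = q + dt / V * residual.
__host__ __device__ inline float explicitUpdate(float q, float residual, float dt, float volume)
{
    return q + dt * residual / volume;
}

// GPU version: one thread per face. atomicAdd is used because several faces
// contribute to the same cell residual.
__global__ void faceResidualKernel(const Face *faces, int nFaces,
                                   const float *q, float *residual)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f < nFaces) {
        Face face = faces[f];
        float flux = upwindFlux(q[face.owner], q[face.neighbour], face.normalFlux);
        atomicAdd(&residual[face.owner], -flux);     // flux leaves the owner
        atomicAdd(&residual[face.neighbour], flux);  // and enters the neighbour
    }
}

// CPU reference with the same arithmetic, useful for checking the kernel.
inline void faceResidualHost(const Face *faces, int nFaces,
                             const float *q, float *residual)
{
    for (int f = 0; f < nFaces; ++f) {
        float flux = upwindFlux(q[faces[f].owner], q[faces[f].neighbour],
                                faces[f].normalFlux);
        residual[faces[f].owner]     -= flux;
        residual[faces[f].neighbour] += flux;
    }
}
```

In a multi-GPU setting each MPI rank would typically own one mesh partition and exchange halo cells between steps; that part is omitted here.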
11 Mesh and Physics Complexity
12 Validation
13 Rotating Laminar Flow Cylinder
14 GPU Speed-up
15 Nehalem Calc (with Tau): Volumes (Cells) (M): 33.5; Volumes (Points) (M): 18.5; Calc unit (edges) (M): 68.3; Iterations: 5902; Computing unit (Cores): 128; Time (s): [value missing]; Time (s) per iteration: [value missing]; Time (s) per iteration per calc unit (x10^-6): [value missing]. Summary columns: Computing unit; Time (s) per iteration per calc unit (x10^-6) [values missing].
16 Veloxi CFD calc: Volumes (Cells) (M): 6.7; Calc unit (faces) (M): 21.9; Iterations: 10; Computing unit (Cards): 1; Time (s): 94; Time (s) per iteration: 9.4; Time (s) per iteration per calc unit (x10^-6): [value missing]. Summary columns: Computing unit; Time (s) per iteration per calc unit (x10^-6) [values missing].
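The per-unit cost for the GPU run can be worked out from the figures that did survive on this slide. The arithmetic below assumes that "(x10-6)" in the column header means the value is quoted in units of 10^-6 seconds; the result is derived here, not taken from the slide.

```latex
\begin{align*}
t_{\text{iter}} &= \frac{94\ \text{s}}{10\ \text{iterations}} = 9.4\ \text{s per iteration} \\
t_{\text{unit}} &= \frac{9.4\ \text{s}}{21.9 \times 10^{6}\ \text{faces}}
                 \approx 0.43 \times 10^{-6}\ \text{s per iteration per face}
\end{align*}
```

The equivalent Nehalem figure cannot be derived the same way, because the total run time for that case is not present in the transcription.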
17 Comparison Figures: Nehalem core equivalent per GPU card; number of cards needed for a 128-core equivalent [values missing].
19 Example Kernel

    __global__ void UpdateKernel(VarTypes::var_t   *cellData_d,
                                 VarTypes::var_t   *cellDataCopy_d,
                                 VarTypes::var_t   *residualData_d,
                                 Precision::Stype4 *timeStepData_d,
                                 Precision::Stype  *cellVolume,
                                 Precision::Stype4 *cellVelocity_d,
                                 Precision::Stype RK, Precision::Stype cfl,
                                 int end, int start)
    {
        const int T = threadIdx.x;
        const int B = blockIdx.x;
        const int D_INDEX = T + (B * UPDATE_K_THREAD) + start;

        if (D_INDEX < end) {
            VarTypes::var_t   cellData;      // <---- Data types chosen to maximise coalesced data transfer
            VarTypes::var_t   cellDataCopy;
            VarTypes::var_t   residualData;
            Precision::Stype4 timeStep;
            Precision::Stype4 cellVelocity;

            // Copy data into registers (hopefully!)
    #ifdef PRECON
            cellData     = cellData_d[D_INDEX];
    #endif
            cellDataCopy = cellDataCopy_d[D_INDEX];
            residualData = residualData_d[D_INDEX];
            timeStep     = timeStepData_d[D_INDEX];
            cellVelocity = cellVelocity_d[D_INDEX];

            // Call function using data in registers
            Update::Execute((Precision::Stype*)&cellData,   // <---- This function is the same for host or GPU execution
                            (Precision::Stype*)&cellDataCopy,
                            (Precision::Stype*)&residualData,
                            (Precision::Stype*)&timeStep,
                            (Precision::Stype)1.0 / cellVolume[D_INDEX],
                            RK, cfl,
                            params.gamma, params.gasconstant,
                            params.pinf, params.velref,     // <---- Key solver parameters stored in constant memory
                            (Precision::Stype*)&cellVelocity);

            // Copy data back to main memory
            cellData_d[D_INDEX] = cellData;
        }
    }
20 Example Kernel (continued). The same kernel, with the loads from global memory into per-thread variables highlighted:

            // Copy data into registers (hopefully!)
    #ifdef PRECON
            cellData     = cellData_d[D_INDEX];
    #endif
            cellDataCopy = cellDataCopy_d[D_INDEX];
            residualData = residualData_d[D_INDEX];
            timeStep     = timeStepData_d[D_INDEX];
            cellVelocity = cellVelocity_d[D_INDEX];
21 Example Kernel (continued). The same kernel, with the call into the shared update routine highlighted. This function is the same for host or GPU execution, and the key solver parameters are stored in constant memory:

            Update::Execute((Precision::Stype*)&cellData,
                            (Precision::Stype*)&cellDataCopy,
                            (Precision::Stype*)&residualData,
                            (Precision::Stype*)&timeStep,
                            (Precision::Stype)1.0 / cellVolume[D_INDEX],
                            RK, cfl,
                            params.gamma, params.gasconstant,
                            params.pinf, params.velref,     // <---- Key solver parameters stored in constant memory
                            (Precision::Stype*)&cellVelocity);

            // Copy data back to main memory
            cellData_d[D_INDEX] = cellData;
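Two ideas from the kernel slides can be shown in a much smaller, self-contained form: a routine compiled once for both host and device, and solver parameters held in constant memory. The sketch below is an assumption-laden stand-in, not the presenter's code: `SolverParams`, `d_params`, `pressureFromEnergy`, `pressureKernel` and `uploadParams` are hypothetical names, and the physics is reduced to a one-dimensional ideal-gas pressure formula.

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter block. The slides only say that "key solver
// parameters" are stored in constant memory; these two fields are examples.
struct SolverParams {
    float gamma;        // ratio of specific heats
    float gasConstant;  // unused in this sketch, kept to mirror the slide
};

// Device-side copy of the parameters, placed in constant memory.
__constant__ SolverParams d_params;

// Shared routine: the __host__ __device__ qualifiers let the CPU path and the
// GPU kernel call exactly the same code.
// Ideal gas, 1D: p = (gamma - 1) * (rhoE - 0.5 * rho * u^2)
__host__ __device__ inline float pressureFromEnergy(float rho, float rhoE, float u,
                                                    const SolverParams &p)
{
    return (p.gamma - 1.0f) * (rhoE - 0.5f * rho * u * u);
}

// Kernel in the style of the slide: compute a cell index from thread and
// block IDs plus a start offset, load into locals, call the shared routine.
__global__ void pressureKernel(const float *rho, const float *rhoE, const float *u,
                               float *pressure, int end, int start)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x + start;
    if (i < end) {
        const float r = rho[i];
        const float e = rhoE[i];
        const float v = u[i];
        pressure[i] = pressureFromEnergy(r, e, v, d_params);
    }
}

// Host helper that copies the parameters into constant memory.
inline cudaError_t uploadParams(const SolverParams &hostParams)
{
    return cudaMemcpyToSymbol(d_params, &hostParams, sizeof(SolverParams));
}
```

Keeping the numerical routine free of CUDA-specific indexing is what makes it reusable: the kernel only decides which cell a thread owns, and the host build can call the same function in an ordinary loop for validation.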
22 Profiler Output
23 Lessons Learnt: Maximise coalesced memory transfers. Use appropriate data structures. Pay close attention to the .cubin and .ptx outputs. Manually move data that the compiler places in local memory (lmem) into shared memory; do NOT rely on the compiler to do this (at the moment). Use the profiler. Maintaining optimum code for different generations of products is time-consuming.
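The first two lessons are closely linked: whether loads coalesce depends largely on how per-cell data is laid out. A minimal sketch, assuming one thread per cell and a flat `float` array (the function and kernel names are hypothetical), contrasts array-of-structures and structure-of-arrays indexing:

```cuda
#include <cuda_runtime.h>

// Array-of-structures: all variables of one cell are adjacent, so neighbouring
// threads (neighbouring cells) are nVars elements apart in memory.
__host__ __device__ inline int aosIndex(int cell, int var, int nVars)
{
    return cell * nVars + var;
}

// Structure-of-arrays: one variable for all cells is adjacent, so neighbouring
// threads read neighbouring addresses, which is what coalescing needs.
__host__ __device__ inline int soaIndex(int cell, int var, int nCells)
{
    return var * nCells + cell;
}

// One thread per cell, scaling every variable of that cell. With the SoA
// layout, each iteration of the loop is a contiguous access across the warp.
__global__ void scaleSoA(float *data, int nCells, int nVars, float factor)
{
    const int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < nCells) {
        for (int var = 0; var < nVars; ++var) {
            data[soaIndex(cell, var, nCells)] *= factor;
        }
    }
}
```

Packing a few variables into a built-in vector type such as `float4`, which the `Stype4` name in the example kernel suggests, is another common way to get wide, aligned loads; which layout is faster on a given GPU generation is best checked with the profiler.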
24 Wish List: A toolkit to simplify the use of shared memory as a register spill-over. Improved debugging, especially for complex kernels. Improved error diagnosis: many bugs can only be found by trial and error. Resolution of the InfiniBand/CUDA pinned memory conflict: zero-copy DMA is required for both stacks. Next-generation GPUs as soon as possible.
26 Visualisation Techniques: Vector plots: messy, difficult to interpret. Streamlines: results depend on start points; without prior knowledge of the flow field, features can be missed.
27 Line Integral Convolution: A local streamline is calculated for each pixel. A white noise image is smeared along these streamlines.
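The two sentences above map naturally onto one GPU thread per pixel. The following is a minimal sketch of that idea, not the implementation behind the presentation: it assumes a 2D velocity field stored as two row-major arrays, fixed-step Euler integration, nearest-neighbour sampling and an unweighted (box) average, and every name in it is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>

__host__ __device__ inline int clampInt(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Trace a short streamline forwards and backwards from pixel (px, py) and
// average the noise values it passes through.
__host__ __device__ inline float licPixel(const float *vx, const float *vy,
                                          const float *noise,
                                          int width, int height,
                                          int px, int py,
                                          int halfLength, float step)
{
    float sum = noise[py * width + px];
    int count = 1;
    for (int dir = -1; dir <= 1; dir += 2) {          // backwards, then forwards
        float x = px + 0.5f;
        float y = py + 0.5f;
        for (int s = 0; s < halfLength; ++s) {
            const int ix = clampInt((int)x, 0, width - 1);
            const int iy = clampInt((int)y, 0, height - 1);
            const float u = vx[iy * width + ix];
            const float v = vy[iy * width + ix];
            const float mag = sqrtf(u * u + v * v);
            if (mag < 1e-12f) break;                  // stagnation point: stop
            x += dir * step * u / mag;                // unit-speed Euler step
            y += dir * step * v / mag;
            if (x < 0.0f || x >= width || y < 0.0f || y >= height) break;
            sum += noise[(int)y * width + (int)x];
            ++count;
        }
    }
    return sum / count;
}

// One thread per output pixel.
__global__ void licKernel(const float *vx, const float *vy, const float *noise,
                          float *out, int width, int height,
                          int halfLength, float step)
{
    const int px = blockIdx.x * blockDim.x + threadIdx.x;
    const int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px < width && py < height) {
        out[py * width + px] =
            licPixel(vx, vy, noise, width, height, px, py, halfLength, step);
    }
}
```

Because every pixel is independent, the cost grows with image size and streamline length, and no start points have to be chosen, which is the advantage over seeded streamlines noted on the previous slide.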
30 Results: table of Size (pixels), CPU Time (s), GPU Time (s), Speedup and Frame Time (s) [row values missing].
32 Summary: GPU-based high-fidelity CFD is possible today. An NVIDIA PSC-sized system is equivalent to 100 Opteron cores or 60 Nehalem cores. Large TCO savings (software, hardware and power) are possible. Combined compute and visualisation will enable real-time simulation and interpretation of results.
Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationMulti-GPU simulations in OpenFOAM with SpeedIT technology.
Multi-GPU simulations in OpenFOAM with SpeedIT technology. Attempt I: SpeedIT GPU-based library of iterative solvers for Sparse Linear Algebra and CFD. Current version: 2.2. Version 1.0 in 2008. CMRS format
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationCSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store
More informationHigh performance Computing and O&G Challenges
High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating
More informationCUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin
CUDA Development Using NVIDIA Nsight, Eclipse Edition David Goodwin NVIDIA Nsight Eclipse Edition CUDA Integrated Development Environment Project Management Edit Build Debug Profile SC'12 2 Powered By
More informationOptimizing Parallel Reduction in CUDA
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf Parallel Reduction Tree-based approach used within each
More informationAcceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP
Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Zhe Weng and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationSupporting Data Parallelism in Matcloud: Final Report
Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by
More informationRyan C. Hulguin TACC-Intel Highly Parallel Computing Symposium April 10th-11th, 2012 Austin, TX
Ryan C. Hulguin TACC-Intel Highly Parallel Computing Symposium April 10th-11th, 2012 Austin, TX Outline Introduction Knights Ferry Technical Specifications CFD Governing Equations Numerical Algorithm Solver
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationAutomatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo
Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015 Goals Running CUDA code on CPUs. Why? Performance portability! A major challenge faced
More informationReview. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API
Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot
More informationAdding CUDA Support to Cling: JIT Compile to GPUs
Published under CC BY-SA 4.0 DOI: 10.5281/zenodo.1412256 Adding CUDA Support to Cling: JIT Compile to GPUs S. Ehrig 1,2, A. Naumann 3, and A. Huebl 1,2 1 Helmholtz-Zentrum Dresden - Rossendorf 2 Technische
More information