High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters
|
|
- Denis Rice
- 5 years ago
- Views:
Transcription
1 SIAM PP 2014 High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters C. Riesinger, A. Bakhtiari, M. Schreiber Technische Universität München February 20, 2014 SIAM PP 2014, February 20,
2 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,
3 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,
4 Motivation Main Goal Development of a highly scalable code for heterogeneous clusters to investigate new ideas. Minor Goal Extend the software step by step with additional features (never loosing the main goal out of sight). SIAM PP 2014, February 20,
5 Motivation Main Goal Development of a highly scalable code for heterogeneous clusters to investigate new ideas. Minor Goal Extend the software step by step with additional features (never loosing the main goal out of sight). Properties of our Software application: algorithm: heterogenous system: development platform: Computation Fluid Dynamics Lattice Boltzmann Method CPUs + GPUs OpenCL Yet another Lattice Boltzmann Code SIAM PP 2014, February 20,
6 Features Features implemented so far: Implementation of LBM Tailored software design for heterogeneous clusters Overlapping communication with computation Support for free surfaces Considering turbulent flows Validation of the simulation results SIAM PP 2014, February 20,
7 Features Features implemented so far: Implementation of LBM Tailored software design for heterogeneous clusters Overlapping communication with computation Support for free surfaces Considering turbulent flows Validation of the simulation results Features to be implemented: Support for (complex) geometries Investigation of additional turbulence models Exploiting CPU power... SIAM PP 2014, February 20,
8 Features Features implemented so far: Implementation of LBM Tailored software design for heterogeneous clusters Overlapping communication with computation Support for free surfaces Considering turbulent flows Validation of the simulation results Features to be implemented: Support for (complex) geometries Investigation of additional turbulence models Expoiting CPU power... SIAM PP 2014, February 20,
9 The Lattice Boltzmann Method Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,
10 The Lattice Boltzmann Method 1/3 Boltzmann Equation f t + v f = (f f eq ) }{{} Collision operator f ( x, v, t): f eq : Probability density function: Probability for finding particles at x and time t with velocity v Equilibrium distribution SIAM PP 2014, February 20,
11 The Lattice Boltzmann Method 1/3 Boltzmann Equation f t + v f = (f f eq ) }{{} Collision operator f ( x, v, t): f eq : Probability density function: Probability for finding particles at x and time t with velocity v Equilibrium distribution Lattice Boltzmann Method (LBM) LBM simulates fluid flow by... tracking virtual particles (more specifically, their distribution) colliding and streaming them (collision & streaming phase) only along a limited number of directions (DdQq discretization schemes) SIAM PP 2014, February 20,
12 The Lattice Boltzmann Method 2/3 Discretization Schemes d: number of dimensions q: number of density distributions/lattice vectors Collision & Streaming During collision phase, the collision operator is approximated (BGK model): f i ( x, t + δt) = f i ( x, t) + 1 i f i ) τ f eq i ( u) = ω i ρ (1 + 3( e i u) + 92 ( e i u) 2 32 ) u2 with ω i = 1 for i 0,..., 3, 16, 17, 1 for i 4,..., 15, and 1 for i = Movement of the virtual particles is simulated during streaming phase: (f eq f i ( x + e i, t + δt) = f i ( x, t + δt) SIAM PP 2014, February 20,
13 The Lattice Boltzmann Method 3/3 Densities & Velocities Only density distributions along the lattice vectores are stored. Densities & velocities are NOT explicitly stored per cell, but can be calculated: ρ = 18 i=0 f i ρ v = LBM for High Performance Computing 18 i=0 (f i e i ) Data access only for discretized density distributions Communication only with neighboring cells Calculations only use simple instructions SIAM PP 2014, February 20,
14 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,
15 Single GPU optimization: The A-B Pattern Collision Density distributions of own cell are read Collision operator is executed Updated density distributions are written SIAM PP 2014, February 20,
16 Single GPU optimization: The A-B Pattern Collision Density distributions of own cell are read Collision operator is executed Updated density distributions are written Propagation Updated density distributions are copied to the corresponding neighboring cells SIAM PP 2014, February 20,
17 Single GPU optimization: The A-B Pattern Collision Density distributions of own cell are read Collision operator is executed Updated density distributions are written Propagation Updated density distributions are copied to the corresponding neighboring cells To avoid race conditions, two dedicated arrays for the density distributions are required! They are used in a ping pong like manner. SIAM PP 2014, February 20,
18 Single GPU optimization: The A-A Pattern Only one memory layout is used for all iterations. This saves memory by a factor of two twice as many cells per GPU α Step (odd iterations) Performed during odd iterations Only one collision is performed Memory belongs to own cell SIAM PP 2014, February 20,
19 Single GPU optimization: The A-A Pattern Only one memory layout is used for all iterations. This saves memory by a factor of two twice as many cells per GPU α Step (odd iterations) β Step (even iterations) Performed during odd iterations Only one collision is performed Memory belongs to own cell Performed during even iterations Two streamings and one collision Memory belongs to neighbors SIAM PP 2014, February 20,
20 Single GPU optimization: The A-A Pattern Only one memory layout is used for all iterations. This saves memory by a factor of two twice as many cells per GPU α Step (odd iterations) β Step (even iterations) Performed during odd iterations Only one collision is performed Memory belongs to own cell Performed during even iterations Two streamings and one collision Memory belongs to neighbors Data is always read and written to the same memory location. SIAM PP 2014, February 20,
21 OpenCL Features CommandQueue * Event Command Queue Contains a list of commands Commands can be executed in-order or out-of-order Event Can be used to determine the state of a command Mechanism to implement synchronization between commands Complex dependency graphs can be realized by dependencies of events SIAM PP 2014, February 20,
22 OpenCL Features CommandQueue * Event Command Queue Contains a list of commands Commands can be executed in-order or out-of-order Event Can be used to determine the state of a command Mechanism to implement synchronization between commands Complex dependency graphs can be realized by dependencies of events Only the concurrent execution of several command queues enables the parallel execution of several commands! SIAM PP 2014, February 20,
23 Single Boundary Kernel - Single Command Queue (SBK-SCQ) 1/2 communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n CPU MPI_Isend() MPI_Irecv() communication (GPU CPU) one single command queue X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n GPU one signle kernel for halo cells compute halo cells compute inner cells computations done by GPU computations done by CPU communication command queue idle times SIAM PP 2014, February 20,
24 Single Boundary Kernel - Single Command Queue (SBK-SCQ) 2/2 communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n CPU MPI_Isend() MPI_Irecv() communication (GPU CPU) one single command queue X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n GPU one signle kernel for halo cells compute halo cells compute inner cells computations done by GPU computations done by CPU communication command queue idle times SIAM PP 2014, February 20,
25 Multi Boundary Kernels - Single Command Queue (MBK-SCQ) 1/2 communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n MPI_Isend() CPU communication (GPU CPU) one single command queue MPI_Irecv() X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n GPU multiple kernels for halo cells compute halo cells compute inner cells computations done by GPU computations done by CPU communication command queue idle times SIAM PP 2014, February 20,
26 Multi Boundary Kernels - Single Command Queue (MBK-SCQ) 2/2 communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n MPI_Isend() CPU communication (GPU CPU) one single command queue MPI_Irecv() X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n GPU multiple kernels for halo cells compute halo cells compute inner cells computations done by GPU computations done by CPU communication command queue idle times SIAM PP 2014, February 20,
27 Multi Boundary Kernels - Multi Command Queues (MBK-SCQ) communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n X 0 X n Y 0 Y n Z 0... CPU MPI_Isend() MPI_Isend() multiple command queues for halo cells MPI_Irecv() communication (GPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n X 0 X n Y 0 Y n Z 0 Z N multiple kernels for halo cells multiple kernels for halo cells GPU compute halo cells compute inner cells compute halo cells computations done by GPU communication computations done by CPU command queue idle times SIAM PP 2014, February 20,
28 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,
29 Test Configuration Tödi Cluster in Switzerland CPUs: AMD Opteron Interlagos 6272 number of cores per CPU: 16 total number of CPUs: 272 total number of CPU cores: 4352 GPUs: NVIDIA Tesla K20x number of CUDA cores per GPU: 2688 total number of GPUs: 272 (256 used) total number of CUDA cores: SIAM PP 2014, February 20,
30 Runtime (Weak Scaling) Runtime (Weak Scaling) Optimal Runtime Basic SBC-SCQ MBC-SCQ MBC-MCQ 3,91 3,8 normalized Runtime (linear scale) 3,3 2,8 2,3 1,8 1,3 0,8 1,00 1,00 1,00 1,00 1,10 1,13 1,17 1,18 1,02 1,19 1,20 1,02 1,18 1,23 1,20 1,11 2,23 1,96 1,48 1,44 1,37 1,71 1,70 1,55 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 2,98 2,08 2,15 2,06 2,61 2,48 2, Number of GPUs (logarithmic scale) SIAM PP 2014, February 20,
31 Lattice Updates (Weak Scaling) Lattice Updates (Weak Scaling) Optimal Scaling Basic SBC-SCQ MBC-SCQ MBC-MCQ 205,25 GLUPS (logarithmic) ,60 1,77 1,74 1,57 1,60 3,21 3,02 2,94 3,05 2,90 6,41 5,98 5,78 6,09 5,67 12,83 11,58 11,53 11,19 10,82 25,66 19,26 19,36 18,14 13,07 51,31 33,19 32,68 31,94 23,00 102,62 54,75 51,75 48,19 34,35 87,45 89,44 82, Number of GPUs (logarithmic) 53,98 SIAM PP 2014, February 20,
32 Parallel Efficiency (Weak Scaling) Parallel Efficiency (Weak Scaling) Optimal Efficiency Efficiency Basic Efficiency SBC-SCQ Efficiency MBC-SCQ Efficiency MBC-MCQ Parallel Efficiency (linear scale) 1,1 1 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 0,98 0,98 0,85 0,85 0,91 0,84 0,83 0,88 0,82 0,83 0,90 0,84 0,68 0,70 0,73 0,58 0,59 0,64 0,51 0,45 0,48 0,46 0,49 0,38 0,40 0,41 0,34 0, Number of GPUs (logarithmic scale) SIAM PP 2014, February 20,
33 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,
34 Conclusion & Outlook Conclusion Biggest gain in performance is achieved by using SBK-SCQ instead of a basic scheme More sophisticated schemes (MBK-MCQ) achieve results similar to less sophisticated schemes (SBK-SCQ) Looks like NVIDIA OpenCL does not support overlapping of command queues Degree of parallel efficiency decreases linearly with number of GPUs! SIAM PP 2014, February 20,
35 Conclusion & Outlook Conclusion Biggest gain in performance is achieved by using SBK-SCQ instead of a basic scheme More sophisticated schemes (MBK-MCQ) achieve results similar to less sophisticated schemes (SBK-SCQ) Looks like NVIDIA OpenCL does not support overlapping of command queues Degree of parallel efficiency decreases linearly with number of GPUs! Outlook Improve poor parallel efficiency ( 40%), by running multiple command queues in parallel by considering the network topology Exploit CPUs not just for communication but also for computation SIAM PP 2014, February 20,
36 Technische Universita t Mu nchen Final slide SIAM PP 2014, February 20,
Computational Science and Engineering (Int. Master s Program)
Computational Science and Engineering (Int. Master s Program) Technische Universität München Master s Thesis MPI Parallelization of GPU-based Lattice Boltzmann Simulations Author: Arash Bakhtiari 1 st
More informationInternational Supercomputing Conference 2009
International Supercomputing Conference 2009 Implementation of a Lattice-Boltzmann-Method for Numerical Fluid Mechanics Using the nvidia CUDA Technology E. Riegel, T. Indinger, N.A. Adams Technische Universität
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationA Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters
computation Article A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters Christoph Riesinger 1, ID, Arash Bakhtiari 1 ID, Martin Schreiber ID,
More informationA Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters
computation Article A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters Christoph Riesinger 1, ID, Arash Bakhtiari 1 ID, Martin Schreiber ID,
More informationPerformance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures
Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures Dirk Ribbrock, Markus Geveler, Dominik Göddeke, Stefan Turek Angewandte Mathematik, Technische Universität Dortmund
More informationLattice Boltzmann with CUDA
Lattice Boltzmann with CUDA Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Overview of LBM An usage of LBM Algorithm Implementation in CUDA and Optimization
More informationIntroducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method
Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System
More informationLATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS
LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS NAVIER-STOKES EQUATIONS u t + u u + 1 ρ p = Ԧg + ν u u=0 WHAT IS COMPUTATIONAL FLUID DYNAMICS? Branch of Fluid Dynamics which uses computer power to approximate
More informationSailfish: Lattice Boltzmann Fluid Simulations with GPUs and Python
Sailfish: Lattice Boltzmann Fluid Simulations with GPUs and Python Micha l Januszewski Institute of Physics University of Silesia in Katowice, Poland Google GTC 2012 M. Januszewski (IoP, US) Sailfish:
More informationCUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata
CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids
More information(LSS Erlangen, Simon Bogner, Ulrich Rüde, Thomas Pohl, Nils Thürey in collaboration with many more
Parallel Free-Surface Extension of the Lattice-Boltzmann Method A Lattice-Boltzmann Approach for Simulation of Two-Phase Flows Stefan Donath (LSS Erlangen, stefan.donath@informatik.uni-erlangen.de) Simon
More informationDynamic Mode Decomposition analysis of flow fields from CFD Simulations
Dynamic Mode Decomposition analysis of flow fields from CFD Simulations Technische Universität München Thomas Indinger Lukas Haag, Daiki Matsumoto, Christoph Niedermeier in collaboration with Agenda Motivation
More informationThe walberla Framework: Multi-physics Simulations on Heterogeneous Parallel Platforms
The walberla Framework: Multi-physics Simulations on Heterogeneous Parallel Platforms Harald Köstler, Uli Rüde (LSS Erlangen, ruede@cs.fau.de) Lehrstuhl für Simulation Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de
More informationSimulation of Liquid-Gas-Solid Flows with the Lattice Boltzmann Method
Simulation of Liquid-Gas-Solid Flows with the Lattice Boltzmann Method June 21, 2011 Introduction Free Surface LBM Liquid-Gas-Solid Flows Parallel Computing Examples and More References Fig. Simulation
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationA Peta-scale LES (Large-Eddy Simulation) for Turbulent Flows Based on Lattice Boltzmann Method
GTC (GPU Technology Conference) 2013, San Jose, 2013, March 20 A Peta-scale LES (Large-Eddy Simulation) for Turbulent Flows Based on Lattice Boltzmann Method Takayuki Aoki Global Scientific Information
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationNumerical Algorithms on Multi-GPU Architectures
Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications
More informationX10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management
X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large
More informationPerformance Analysis of the Lattice Boltzmann Method on x86-64 Architectures
Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures Jan Treibig, Simon Hausmann, Ulrich Ruede Zusammenfassung The Lattice Boltzmann method (LBM) is a well established algorithm
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationFlux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters
Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationA portable OpenCL Lattice Boltzmann code for multi- and many-core processor architectures
Procedia Computer Science Volume 29, 2014, Pages 40 49 ICCS 2014. 14th International Conference on Computational Science A portable OpenCL Lattice Boltzmann code for multi- and many-core processor architectures
More informationBlocking Vs. Non-Blocking Halo Exchange for Lattice Boltzmann. Anthony Bourached arxiv: v1 [cs.dc] 18 Sep 2017
Blocking Vs. Non-Blocking Halo Exchange for Lattice Boltzmann Anthony Bourached arxiv:1709.06175v1 [cs.dc] 18 Sep 2017 MSc in High Performance Computing The University of Edinburgh Year of Presentation:
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationEfficient Algorithmic Techniques for Heterogeneous Computing
Efficient Algorithmic Techniques for Heterogeneous Computing Thesis submitted in partial fulfillment of the requirements for the degree of MS in Computer Science Engineering by Shashank Sharma 200802034
More informationVirtual EM Inc. Ann Arbor, Michigan, USA
Functional Description of the Architecture of a Special Purpose Processor for Orders of Magnitude Reduction in Run Time in Computational Electromagnetics Tayfun Özdemir Virtual EM Inc. Ann Arbor, Michigan,
More informationThe Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview
More informationDirected Optimization On Stencil-based Computational Fluid Dynamics Application(s)
Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2
More informationGPU-based Distributed Behavior Models with CUDA
GPU-based Distributed Behavior Models with CUDA Courtesy: YouTube, ISIS Lab, Universita degli Studi di Salerno Bradly Alicea Introduction Flocking: Reynolds boids algorithm. * models simple local behaviors
More informationLATTICE-BOLTZMANN METHOD FOR THE SIMULATION OF LAMINAR MIXERS
14 th European Conference on Mixing Warszawa, 10-13 September 2012 LATTICE-BOLTZMANN METHOD FOR THE SIMULATION OF LAMINAR MIXERS Felix Muggli a, Laurent Chatagny a, Jonas Lätt b a Sulzer Markets & Technology
More informationGAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES TAKAHIRO HARADA, AMD DESTRUCTION FOR GAMES ERWIN COUMANS, AMD
GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES TAKAHIRO HARADA, AMD DESTRUCTION FOR GAMES ERWIN COUMANS, AMD GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES Jason Yang, Takahiro Harada AMD HYBRID CPU-GPU
More informationGraph traversal and BFS
Graph traversal and BFS Fundamental building block Graph traversal is part of many important tasks Connected components Tree/Cycle detection Articulation vertex finding Real-world applications Peer-to-peer
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationReal-time Thermal Flow Predictions for Data Centers
Real-time Thermal Flow Predictions for Data Centers Using the Lattice Boltzmann Method on Graphics Processing Units for Predicting Thermal Flow in Data Centers Johannes Sjölund Computer Science and Engineering,
More informationParallel Summation of Inter-Particle Forces in SPH
Parallel Summation of Inter-Particle Forces in SPH Fifth International Workshop on Meshfree Methods for Partial Differential Equations 17.-19. August 2009 Bonn Overview Smoothed particle hydrodynamics
More informationNVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film
NVIDIA Interacting with Particle Simulation in Maya using CUDA & Maximus Wil Braithwaite NVIDIA Applied Engineering Digital Film Some particle milestones FX Rendering Physics 1982 - First CG particle FX
More informationUnstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications
Unstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications Per-Olof Persson (persson@mit.edu) Department of Mathematics Massachusetts Institute of Technology http://www.mit.edu/
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationVolumetric processing and visualization on heterogeneous architecture
Volumetric processing and visualization on heterogeneous architecture Wei Li Siemens Corporation Princeton, NJ In collaboration with Wei Hong & Gianluca Paladini Volumetric computation Areas Medical imaging,
More informationCAF versus MPI Applicability of Coarray Fortran to a Flow Solver
CAF versus MPI Applicability of Coarray Fortran to a Flow Solver Manuel Hasert, Harald Klimach, Sabine Roller m.hasert@grs-sim.de Applied Supercomputing in Engineering Motivation We develop several CFD
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationFlux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters
Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationThreads Implementation. Jo, Heeseung
Threads Implementation Jo, Heeseung Today's Topics How to implement threads? User-level threads Kernel-level threads Threading models 2 Kernel/User-level Threads Who is responsible for creating/managing
More informationPerformance analysis of single-phase, multiphase, and multicomponent lattice-boltzmann fluid flow simulations on GPU clusters
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2010) Published online in Wiley Online Library (wileyonlinelibrary.com)..1645 Performance analysis of single-phase,
More informationSPEED-UP GEARBOX SIMULATIONS BY INTEGRATING SCORG. Dr. Christine Klier, Sahand Saheb-Jahromi, Ludwig Berger*
SPEED-UP GEARBOX SIMULATIONS BY INTEGRATING SCORG Dr. Christine Klier, Sahand Saheb-Jahromi, Ludwig Berger* CFD SCHUCK ENGINEERING Engineering Services in computational fluid Dynamics (CFD) 25 employees
More informationUberFlow: A GPU-Based Particle Engine
UberFlow: A GPU-Based Particle Engine Peter Kipfer Mark Segal Rüdiger Westermann Technische Universität München ATI Research Technische Universität München Motivation Want to create, modify and render
More informationRadiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System
Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins, Todd Harman Scientific Computing and Imaging Institute & University of Utah I. Uintah
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationParallelization Strategy
COSC 335 Software Design Parallel Design Patterns (II) Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure
More informationFrom Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp
From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with andreas.schaefer@cs.fau.de Friedrich-Alexander-Universität Erlangen-Nürnberg GPU Technology Conference 2013, San José,
More informationParallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs
Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationWebinar #3 Lattice Boltzmann method for CompBioMed (incl. Palabos)
Webinar series A Centre of Excellence in Computational Biomedicine Webinar #3 Lattice Boltzmann method for CompBioMed (incl. Palabos) 19 March 2018 The webinar will start at 12pm CET / 11am GMT Dr Jonas
More informationOverview of research activities Toward portability of performance
Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into
More informationTechnical report: How to implement your DdQq dynamics with only q variables per node (instead of 2q)
Technical report: How to implement your DdQq dynamics with only q variables per node (instead of 2q) Jonas Latt Tufts University Medford, USA jonas.latt@gmail.com June 2007 1 Overview Naive implementations
More informationSimulation of moving Particles in 3D with the Lattice Boltzmann Method
Simulation of moving Particles in 3D with the Lattice Boltzmann Method, Nils Thürey, Christian Feichtinger, Hans-Joachim Schmid Chair for System Simulation University Erlangen/Nuremberg Chair for Particle
More informationME964 High Performance Computing for Engineering Applications
ME964 High Performance Computing for Engineering Applications Outlining Midterm Projects Topic 3: GPU-based FEA Topic 4: GPU Direct Solver for Sparse Linear Algebra March 01, 2011 Dan Negrut, 2011 ME964
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationarxiv: v1 [cs.ms] 8 Aug 2018
ACCELERATING WAVE-PROPAGATION ALGORITHMS WITH ADAPTIVE MESH REFINEMENT USING THE GRAPHICS PROCESSING UNIT (GPU) XINSHENG QIN, RANDALL LEVEQUE, AND MICHAEL MOTLEY arxiv:1808.02638v1 [cs.ms] 8 Aug 2018 Abstract.
More informationPorting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms
Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms http://www.scidacreview.org/0902/images/esg13.jpg Matthew Norman Jeffrey Larkin Richard Archibald Valentine Anantharaj
More informationHeterogeneous-Race-Free Memory Models
Heterogeneous-Race-Free Memory Models Jyh-Jing (JJ) Hwang, Yiren (Max) Lu 02/28/2017 1 Outline 1. Background 2. HRF-direct 3. HRF-indirect 4. Experiments 2 Data Race Condition op1 op2 write read 3 Sequential
More informationTurbostream: A CFD solver for manycore
Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationPerformance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla
Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla SIAM PP 2016, April 13 th 2016 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer,
More informationComparison of CPU and GPGPU performance as applied to procedurally generating complex cave systems
Comparison of CPU and GPGPU performance as applied to procedurally generating complex cave systems Subject: Comp6470 - Special Topics in Computing Student: Tony Oakden (U4750194) Supervisor: Dr Eric McCreath
More informationDomain Decomposition for Colloid Clusters. Pedro Fernando Gómez Fernández
Domain Decomposition for Colloid Clusters Pedro Fernando Gómez Fernández MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2004 Authorship declaration I, Pedro Fernando
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationHow to Optimize Geometric Multigrid Methods on GPUs
How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient
More informationEfficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI
Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from
More informationOverlapping Computation and Communication for Advection on Hybrid Parallel Computers
Overlapping Computation and Communication for Advection on Hybrid Parallel Computers James B White III (Trey) trey@ucar.edu National Center for Atmospheric Research Jack Dongarra dongarra@eecs.utk.edu
More informationGPU Implementation of Implicit Runge-Kutta Methods
GPU Implementation of Implicit Runge-Kutta Methods Navchetan Awasthi, Abhijith J Supercomputer Education and Research Centre Indian Institute of Science, Bangalore, India navchetanawasthi@gmail.com, abhijith31792@gmail.com
More informationA D3Q19 Lattice Boltzmann Solver on a GPU Using Constant-Time Circular Array Shifting
Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'18 383 A D3Q19 Lattice Boltzmann Solver on a GPU Using Constant-Time Circular Array Shifting Mohammadreza Khani and Tai-Hsien Wu Department of Electrical
More informationLattice Boltzmann Method for Simulating Turbulent Flows
Lattice Boltzmann Method for Simulating Turbulent Flows by Yusuke Koda A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science
More informationGPU Debugging Made Easy. David Lecomber CTO, Allinea Software
GPU Debugging Made Easy David Lecomber CTO, Allinea Software david@allinea.com Allinea Software HPC development tools company Leading in HPC software tools market Wide customer base Blue-chip engineering,
More informationHigh Performance Computing
High Performance Computing ADVANCED SCIENTIFIC COMPUTING Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research Group Leader, Juelich
More informationApplication of GPU technology to OpenFOAM simulations
Application of GPU technology to OpenFOAM simulations Jakub Poła, Andrzej Kosior, Łukasz Miroslaw jakub.pola@vratis.com, www.vratis.com Wroclaw, Poland Agenda Motivation Partial acceleration SpeedIT OpenFOAM
More informationOptimisation Myths and Facts as Seen in Statistical Physics
Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationLARGE-SCALE FREE-SURFACE FLOW SIMULATION USING LATTICE BOLTZMANN METHOD ON MULTI-GPU CLUSTERS
ECCOMAS Congress 2016 VII European Congress on Computational Methods in Applied Sciences and Engineering M. Papadrakakis, V. Papadopoulos, G. Stefanou, V. Plevris (eds.) Crete Island, Greece, 5 10 June
More informationarxiv: v1 [physics.ins-det] 11 Jul 2015
GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University
More informationDesigning a Domain-specific Language to Simulate Particles. dan bailey
Designing a Domain-specific Language to Simulate Particles dan bailey Double Negative Largest Visual Effects studio in Europe Offices in London and Singapore Large and growing R & D team Squirt Fluid Solver
More informationProgress in automatic GPU compilation and why you want to run MPI on your GPU
TORSTEN HOEFLER Progress in automatic GPU compilation and why you want to run MPI on your GPU with Tobias Grosser and Tobias Gysi @ SPCL presented at CCDSC, Lyon, France, 2016 #pragma ivdep 2 !$ACC DATA
More informationIterative Sparse Triangular Solves for Preconditioning
Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations
More informationShape of Things to Come: Next-Gen Physics Deep Dive
Shape of Things to Come: Next-Gen Physics Deep Dive Jean Pierre Bordes NVIDIA Corporation Free PhysX on CUDA PhysX by NVIDIA since March 2008 PhysX on CUDA available: August 2008 GPU PhysX in Games Physical
More informationAdaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics
More informationHigh Performance Computing. University questions with solution
High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The
More informationOpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs
www.bsc.es OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs Hugo Pérez UPC-BSC Benjamin Hernandez Oak Ridge National Lab Isaac Rudomin BSC March 2015 OUTLINE
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationCS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1 Markus Hadwiger, KAUST Reading Assignment #2 (until Feb. 17) Read (required): GLSL book, chapter 4 (The OpenGL Programmable
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationAdaptive Mesh Astrophysical Fluid Simulations on GPU. San Jose 10/2/2009 Peng Wang, NVIDIA
Adaptive Mesh Astrophysical Fluid Simulations on GPU San Jose 10/2/2009 Peng Wang, NVIDIA Overview Astrophysical motivation & the Enzo code Finite volume method and adaptive mesh refinement (AMR) CUDA
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More informationIntroduction à OpenCL et applications
Philippe Helluy 1, Anaïs Crestetto 1 1 Université de Strasbourg - IRMA Lyon, journées groupe calcul, 10 novembre 2010 Sommaire OpenCL 1 OpenCL 2 3 4 OpenCL GPU architecture A modern Graphics Processing
More informationAsynchronous K-Means Clustering of Multiple Data Sets. Marek Fiser, Illia Ziamtsov, Ariful Azad, Bedrich Benes, Alex Pothen
Asynchronous K-Means Clustering of Multiple Data Sets Marek Fiser, Illia Ziamtsov, Ariful Azad, Bedrich Benes, Alex Pothen Motivation Clustering bottleneck in Flow Cytometry research 3,000 data sets 25,000
More information