High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters

Size: px
Start display at page:

Download "High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters"

Transcription

1 SIAM PP 2014 High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters C. Riesinger, A. Bakhtiari, M. Schreiber Technische Universität München February 20, 2014 SIAM PP 2014, February 20,

2 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,

3 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,

4 Motivation Main Goal Development of a highly scalable code for heterogeneous clusters to investigate new ideas. Minor Goal Extend the software step by step with additional features (never loosing the main goal out of sight). SIAM PP 2014, February 20,

5 Motivation Main Goal Development of a highly scalable code for heterogeneous clusters to investigate new ideas. Minor Goal Extend the software step by step with additional features (never loosing the main goal out of sight). Properties of our Software application: algorithm: heterogenous system: development platform: Computation Fluid Dynamics Lattice Boltzmann Method CPUs + GPUs OpenCL Yet another Lattice Boltzmann Code SIAM PP 2014, February 20,

6 Features Features implemented so far: Implementation of LBM Tailored software design for heterogeneous clusters Overlapping communication with computation Support for free surfaces Considering turbulent flows Validation of the simulation results SIAM PP 2014, February 20,

7 Features Features implemented so far: Implementation of LBM Tailored software design for heterogeneous clusters Overlapping communication with computation Support for free surfaces Considering turbulent flows Validation of the simulation results Features to be implemented: Support for (complex) geometries Investigation of additional turbulence models Exploiting CPU power... SIAM PP 2014, February 20,

8 Features Features implemented so far: Implementation of LBM Tailored software design for heterogeneous clusters Overlapping communication with computation Support for free surfaces Considering turbulent flows Validation of the simulation results Features to be implemented: Support for (complex) geometries Investigation of additional turbulence models Expoiting CPU power... SIAM PP 2014, February 20,

9 The Lattice Boltzmann Method Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,

10 The Lattice Boltzmann Method 1/3 Boltzmann Equation f t + v f = (f f eq ) }{{} Collision operator f ( x, v, t): f eq : Probability density function: Probability for finding particles at x and time t with velocity v Equilibrium distribution SIAM PP 2014, February 20,

11 The Lattice Boltzmann Method 1/3 Boltzmann Equation f t + v f = (f f eq ) }{{} Collision operator f ( x, v, t): f eq : Probability density function: Probability for finding particles at x and time t with velocity v Equilibrium distribution Lattice Boltzmann Method (LBM) LBM simulates fluid flow by... tracking virtual particles (more specifically, their distribution) colliding and streaming them (collision & streaming phase) only along a limited number of directions (DdQq discretization schemes) SIAM PP 2014, February 20,

12 The Lattice Boltzmann Method 2/3 Discretization Schemes d: number of dimensions q: number of density distributions/lattice vectors Collision & Streaming During collision phase, the collision operator is approximated (BGK model): f i ( x, t + δt) = f i ( x, t) + 1 i f i ) τ f eq i ( u) = ω i ρ (1 + 3( e i u) + 92 ( e i u) 2 32 ) u2 with ω i = 1 for i 0,..., 3, 16, 17, 1 for i 4,..., 15, and 1 for i = Movement of the virtual particles is simulated during streaming phase: (f eq f i ( x + e i, t + δt) = f i ( x, t + δt) SIAM PP 2014, February 20,

13 The Lattice Boltzmann Method 3/3 Densities & Velocities Only density distributions along the lattice vectores are stored. Densities & velocities are NOT explicitly stored per cell, but can be calculated: ρ = 18 i=0 f i ρ v = LBM for High Performance Computing 18 i=0 (f i e i ) Data access only for discretized density distributions Communication only with neighboring cells Calculations only use simple instructions SIAM PP 2014, February 20,

14 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,

15 Single GPU optimization: The A-B Pattern Collision Density distributions of own cell are read Collision operator is executed Updated density distributions are written SIAM PP 2014, February 20,

16 Single GPU optimization: The A-B Pattern Collision Density distributions of own cell are read Collision operator is executed Updated density distributions are written Propagation Updated density distributions are copied to the corresponding neighboring cells SIAM PP 2014, February 20,

17 Single GPU optimization: The A-B Pattern Collision Density distributions of own cell are read Collision operator is executed Updated density distributions are written Propagation Updated density distributions are copied to the corresponding neighboring cells To avoid race conditions, two dedicated arrays for the density distributions are required! They are used in a ping pong like manner. SIAM PP 2014, February 20,

18 Single GPU optimization: The A-A Pattern Only one memory layout is used for all iterations. This saves memory by a factor of two twice as many cells per GPU α Step (odd iterations) Performed during odd iterations Only one collision is performed Memory belongs to own cell SIAM PP 2014, February 20,

19 Single GPU optimization: The A-A Pattern Only one memory layout is used for all iterations. This saves memory by a factor of two twice as many cells per GPU α Step (odd iterations) β Step (even iterations) Performed during odd iterations Only one collision is performed Memory belongs to own cell Performed during even iterations Two streamings and one collision Memory belongs to neighbors SIAM PP 2014, February 20,

20 Single GPU optimization: The A-A Pattern Only one memory layout is used for all iterations. This saves memory by a factor of two twice as many cells per GPU α Step (odd iterations) β Step (even iterations) Performed during odd iterations Only one collision is performed Memory belongs to own cell Performed during even iterations Two streamings and one collision Memory belongs to neighbors Data is always read and written to the same memory location. SIAM PP 2014, February 20,

21 OpenCL Features CommandQueue * Event Command Queue Contains a list of commands Commands can be executed in-order or out-of-order Event Can be used to determine the state of a command Mechanism to implement synchronization between commands Complex dependency graphs can be realized by dependencies of events SIAM PP 2014, February 20,

22 OpenCL Features CommandQueue * Event Command Queue Contains a list of commands Commands can be executed in-order or out-of-order Event Can be used to determine the state of a command Mechanism to implement synchronization between commands Complex dependency graphs can be realized by dependencies of events Only the concurrent execution of several command queues enables the parallel execution of several commands! SIAM PP 2014, February 20,

23 Single Boundary Kernel - Single Command Queue (SBK-SCQ) 1/2 communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n CPU MPI_Isend() MPI_Irecv() communication (GPU CPU) one single command queue X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n GPU one signle kernel for halo cells compute halo cells compute inner cells computations done by GPU computations done by CPU communication command queue idle times SIAM PP 2014, February 20,

24 Single Boundary Kernel - Single Command Queue (SBK-SCQ) 2/2 communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n CPU MPI_Isend() MPI_Irecv() communication (GPU CPU) one single command queue X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n GPU one signle kernel for halo cells compute halo cells compute inner cells computations done by GPU computations done by CPU communication command queue idle times SIAM PP 2014, February 20,

25 Multi Boundary Kernels - Single Command Queue (MBK-SCQ) 1/2 communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n MPI_Isend() CPU communication (GPU CPU) one single command queue MPI_Irecv() X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n GPU multiple kernels for halo cells compute halo cells compute inner cells computations done by GPU computations done by CPU communication command queue idle times SIAM PP 2014, February 20,

26 Multi Boundary Kernels - Single Command Queue (MBK-SCQ) 2/2 communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n MPI_Isend() CPU communication (GPU CPU) one single command queue MPI_Irecv() X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n GPU multiple kernels for halo cells compute halo cells compute inner cells computations done by GPU computations done by CPU communication command queue idle times SIAM PP 2014, February 20,

27 Multi Boundary Kernels - Multi Command Queues (MBK-SCQ) communication (CPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n X 0 X n Y 0 Y n Z 0... CPU MPI_Isend() MPI_Isend() multiple command queues for halo cells MPI_Irecv() communication (GPU CPU) X 0 X n Y 0 Y n Z 0 Z n X' 0 X' n Y' 0 Y' n Z' 0 Z' n X 0 X n Y 0 Y n Z 0 Z N multiple kernels for halo cells multiple kernels for halo cells GPU compute halo cells compute inner cells compute halo cells computations done by GPU communication computations done by CPU command queue idle times SIAM PP 2014, February 20,

28 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,

29 Test Configuration Tödi Cluster in Switzerland CPUs: AMD Opteron Interlagos 6272 number of cores per CPU: 16 total number of CPUs: 272 total number of CPU cores: 4352 GPUs: NVIDIA Tesla K20x number of CUDA cores per GPU: 2688 total number of GPUs: 272 (256 used) total number of CUDA cores: SIAM PP 2014, February 20,

30 Runtime (Weak Scaling) Runtime (Weak Scaling) Optimal Runtime Basic SBC-SCQ MBC-SCQ MBC-MCQ 3,91 3,8 normalized Runtime (linear scale) 3,3 2,8 2,3 1,8 1,3 0,8 1,00 1,00 1,00 1,00 1,10 1,13 1,17 1,18 1,02 1,19 1,20 1,02 1,18 1,23 1,20 1,11 2,23 1,96 1,48 1,44 1,37 1,71 1,70 1,55 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 2,98 2,08 2,15 2,06 2,61 2,48 2, Number of GPUs (logarithmic scale) SIAM PP 2014, February 20,

31 Lattice Updates (Weak Scaling) Lattice Updates (Weak Scaling) Optimal Scaling Basic SBC-SCQ MBC-SCQ MBC-MCQ 205,25 GLUPS (logarithmic) ,60 1,77 1,74 1,57 1,60 3,21 3,02 2,94 3,05 2,90 6,41 5,98 5,78 6,09 5,67 12,83 11,58 11,53 11,19 10,82 25,66 19,26 19,36 18,14 13,07 51,31 33,19 32,68 31,94 23,00 102,62 54,75 51,75 48,19 34,35 87,45 89,44 82, Number of GPUs (logarithmic) 53,98 SIAM PP 2014, February 20,

32 Parallel Efficiency (Weak Scaling) Parallel Efficiency (Weak Scaling) Optimal Efficiency Efficiency Basic Efficiency SBC-SCQ Efficiency MBC-SCQ Efficiency MBC-MCQ Parallel Efficiency (linear scale) 1,1 1 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 1,00 0,98 0,98 0,85 0,85 0,91 0,84 0,83 0,88 0,82 0,83 0,90 0,84 0,68 0,70 0,73 0,58 0,59 0,64 0,51 0,45 0,48 0,46 0,49 0,38 0,40 0,41 0,34 0, Number of GPUs (logarithmic scale) SIAM PP 2014, February 20,

33 Topics Motivation The Lattice Boltzmann Method Optimization Single GPU optimization: A-B vs. A-A Pattern Multi GPU optimization: Communication Schemes Results Runtime Lattice Updates Parallel Efficiency Conclusion & Outlook SIAM PP 2014, February 20,

34 Conclusion & Outlook Conclusion Biggest gain in performance is achieved by using SBK-SCQ instead of a basic scheme More sophisticated schemes (MBK-MCQ) achieve results similar to less sophisticated schemes (SBK-SCQ) Looks like NVIDIA OpenCL does not support overlapping of command queues Degree of parallel efficiency decreases linearly with number of GPUs! SIAM PP 2014, February 20,

35 Conclusion & Outlook Conclusion Biggest gain in performance is achieved by using SBK-SCQ instead of a basic scheme More sophisticated schemes (MBK-MCQ) achieve results similar to less sophisticated schemes (SBK-SCQ) Looks like NVIDIA OpenCL does not support overlapping of command queues Degree of parallel efficiency decreases linearly with number of GPUs! Outlook Improve poor parallel efficiency ( 40%), by running multiple command queues in parallel by considering the network topology Exploit CPUs not just for communication but also for computation SIAM PP 2014, February 20,

36 Technische Universita t Mu nchen Final slide SIAM PP 2014, February 20,

Computational Science and Engineering (Int. Master s Program)

Computational Science and Engineering (Int. Master s Program) Computational Science and Engineering (Int. Master s Program) Technische Universität München Master s Thesis MPI Parallelization of GPU-based Lattice Boltzmann Simulations Author: Arash Bakhtiari 1 st

More information

International Supercomputing Conference 2009

International Supercomputing Conference 2009 International Supercomputing Conference 2009 Implementation of a Lattice-Boltzmann-Method for Numerical Fluid Mechanics Using the nvidia CUDA Technology E. Riegel, T. Indinger, N.A. Adams Technische Universität

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters

A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters computation Article A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters Christoph Riesinger 1, ID, Arash Bakhtiari 1 ID, Martin Schreiber ID,

More information

A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters

A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters computation Article A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters Christoph Riesinger 1, ID, Arash Bakhtiari 1 ID, Martin Schreiber ID,

More information

Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures

Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures Dirk Ribbrock, Markus Geveler, Dominik Göddeke, Stefan Turek Angewandte Mathematik, Technische Universität Dortmund

More information

Lattice Boltzmann with CUDA

Lattice Boltzmann with CUDA Lattice Boltzmann with CUDA Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Overview of LBM An usage of LBM Algorithm Implementation in CUDA and Optimization

More information

Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method

Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System

More information

LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS

LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS NAVIER-STOKES EQUATIONS u t + u u + 1 ρ p = Ԧg + ν u u=0 WHAT IS COMPUTATIONAL FLUID DYNAMICS? Branch of Fluid Dynamics which uses computer power to approximate

More information

Sailfish: Lattice Boltzmann Fluid Simulations with GPUs and Python

Sailfish: Lattice Boltzmann Fluid Simulations with GPUs and Python Sailfish: Lattice Boltzmann Fluid Simulations with GPUs and Python Micha l Januszewski Institute of Physics University of Silesia in Katowice, Poland Google GTC 2012 M. Januszewski (IoP, US) Sailfish:

More information

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids

More information

(LSS Erlangen, Simon Bogner, Ulrich Rüde, Thomas Pohl, Nils Thürey in collaboration with many more

(LSS Erlangen, Simon Bogner, Ulrich Rüde, Thomas Pohl, Nils Thürey in collaboration with many more Parallel Free-Surface Extension of the Lattice-Boltzmann Method A Lattice-Boltzmann Approach for Simulation of Two-Phase Flows Stefan Donath (LSS Erlangen, stefan.donath@informatik.uni-erlangen.de) Simon

More information

Dynamic Mode Decomposition analysis of flow fields from CFD Simulations

Dynamic Mode Decomposition analysis of flow fields from CFD Simulations Dynamic Mode Decomposition analysis of flow fields from CFD Simulations Technische Universität München Thomas Indinger Lukas Haag, Daiki Matsumoto, Christoph Niedermeier in collaboration with Agenda Motivation

More information

The walberla Framework: Multi-physics Simulations on Heterogeneous Parallel Platforms

The walberla Framework: Multi-physics Simulations on Heterogeneous Parallel Platforms The walberla Framework: Multi-physics Simulations on Heterogeneous Parallel Platforms Harald Köstler, Uli Rüde (LSS Erlangen, ruede@cs.fau.de) Lehrstuhl für Simulation Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de

More information

Simulation of Liquid-Gas-Solid Flows with the Lattice Boltzmann Method

Simulation of Liquid-Gas-Solid Flows with the Lattice Boltzmann Method Simulation of Liquid-Gas-Solid Flows with the Lattice Boltzmann Method June 21, 2011 Introduction Free Surface LBM Liquid-Gas-Solid Flows Parallel Computing Examples and More References Fig. Simulation

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

A Peta-scale LES (Large-Eddy Simulation) for Turbulent Flows Based on Lattice Boltzmann Method

A Peta-scale LES (Large-Eddy Simulation) for Turbulent Flows Based on Lattice Boltzmann Method GTC (GPU Technology Conference) 2013, San Jose, 2013, March 20 A Peta-scale LES (Large-Eddy Simulation) for Turbulent Flows Based on Lattice Boltzmann Method Takayuki Aoki Global Scientific Information

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

Numerical Algorithms on Multi-GPU Architectures

Numerical Algorithms on Multi-GPU Architectures Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures

Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures Jan Treibig, Simon Hausmann, Ulrich Ruede Zusammenfassung The Lattice Boltzmann method (LBM) is a well established algorithm

More information

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea

More information

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

A portable OpenCL Lattice Boltzmann code for multi- and many-core processor architectures

A portable OpenCL Lattice Boltzmann code for multi- and many-core processor architectures Procedia Computer Science Volume 29, 2014, Pages 40 49 ICCS 2014. 14th International Conference on Computational Science A portable OpenCL Lattice Boltzmann code for multi- and many-core processor architectures

More information

Blocking Vs. Non-Blocking Halo Exchange for Lattice Boltzmann. Anthony Bourached arxiv: v1 [cs.dc] 18 Sep 2017

Blocking Vs. Non-Blocking Halo Exchange for Lattice Boltzmann. Anthony Bourached arxiv: v1 [cs.dc] 18 Sep 2017 Blocking Vs. Non-Blocking Halo Exchange for Lattice Boltzmann Anthony Bourached arxiv:1709.06175v1 [cs.dc] 18 Sep 2017 MSc in High Performance Computing The University of Edinburgh Year of Presentation:

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

Efficient Algorithmic Techniques for Heterogeneous Computing

Efficient Algorithmic Techniques for Heterogeneous Computing Efficient Algorithmic Techniques for Heterogeneous Computing Thesis submitted in partial fulfillment of the requirements for the degree of MS in Computer Science Engineering by Shashank Sharma 200802034

More information

Virtual EM Inc. Ann Arbor, Michigan, USA

Virtual EM Inc. Ann Arbor, Michigan, USA Functional Description of the Architecture of a Special Purpose Processor for Orders of Magnitude Reduction in Run Time in Computational Electromagnetics Tayfun Özdemir Virtual EM Inc. Ann Arbor, Michigan,

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

GPU-based Distributed Behavior Models with CUDA

GPU-based Distributed Behavior Models with CUDA GPU-based Distributed Behavior Models with CUDA Courtesy: YouTube, ISIS Lab, Universita degli Studi di Salerno Bradly Alicea Introduction Flocking: Reynolds boids algorithm. * models simple local behaviors

More information

LATTICE-BOLTZMANN METHOD FOR THE SIMULATION OF LAMINAR MIXERS

LATTICE-BOLTZMANN METHOD FOR THE SIMULATION OF LAMINAR MIXERS 14 th European Conference on Mixing Warszawa, 10-13 September 2012 LATTICE-BOLTZMANN METHOD FOR THE SIMULATION OF LAMINAR MIXERS Felix Muggli a, Laurent Chatagny a, Jonas Lätt b a Sulzer Markets & Technology

More information

GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES TAKAHIRO HARADA, AMD DESTRUCTION FOR GAMES ERWIN COUMANS, AMD

GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES TAKAHIRO HARADA, AMD DESTRUCTION FOR GAMES ERWIN COUMANS, AMD GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES TAKAHIRO HARADA, AMD DESTRUCTION FOR GAMES ERWIN COUMANS, AMD GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES Jason Yang, Takahiro Harada AMD HYBRID CPU-GPU

More information

Graph traversal and BFS

Graph traversal and BFS Graph traversal and BFS Fundamental building block Graph traversal is part of many important tasks Connected components Tree/Cycle detection Articulation vertex finding Real-world applications Peer-to-peer

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Real-time Thermal Flow Predictions for Data Centers

Real-time Thermal Flow Predictions for Data Centers Real-time Thermal Flow Predictions for Data Centers Using the Lattice Boltzmann Method on Graphics Processing Units for Predicting Thermal Flow in Data Centers Johannes Sjölund Computer Science and Engineering,

More information

Parallel Summation of Inter-Particle Forces in SPH

Parallel Summation of Inter-Particle Forces in SPH Parallel Summation of Inter-Particle Forces in SPH Fifth International Workshop on Meshfree Methods for Partial Differential Equations 17.-19. August 2009 Bonn Overview Smoothed particle hydrodynamics

More information

NVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film

NVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film NVIDIA Interacting with Particle Simulation in Maya using CUDA & Maximus Wil Braithwaite NVIDIA Applied Engineering Digital Film Some particle milestones FX Rendering Physics 1982 - First CG particle FX

More information

Unstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications

Unstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications Unstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications Per-Olof Persson (persson@mit.edu) Department of Mathematics Massachusetts Institute of Technology http://www.mit.edu/

More information

Asynchronous OpenCL/MPI numerical simulations of conservation laws

Asynchronous OpenCL/MPI numerical simulations of conservation laws Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation

More information

Volumetric processing and visualization on heterogeneous architecture

Volumetric processing and visualization on heterogeneous architecture Volumetric processing and visualization on heterogeneous architecture Wei Li Siemens Corporation Princeton, NJ In collaboration with Wei Hong & Gianluca Paladini Volumetric computation Areas Medical imaging,

More information

CAF versus MPI Applicability of Coarray Fortran to a Flow Solver

CAF versus MPI Applicability of Coarray Fortran to a Flow Solver CAF versus MPI Applicability of Coarray Fortran to a Flow Solver Manuel Hasert, Harald Klimach, Sabine Roller m.hasert@grs-sim.de Applied Supercomputing in Engineering Motivation We develop several CFD

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Threads Implementation. Jo, Heeseung

Threads Implementation. Jo, Heeseung Threads Implementation Jo, Heeseung Today's Topics How to implement threads? User-level threads Kernel-level threads Threading models 2 Kernel/User-level Threads Who is responsible for creating/managing

More information

Performance analysis of single-phase, multiphase, and multicomponent lattice-boltzmann fluid flow simulations on GPU clusters

Performance analysis of single-phase, multiphase, and multicomponent lattice-boltzmann fluid flow simulations on GPU clusters CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2010) Published online in Wiley Online Library (wileyonlinelibrary.com)..1645 Performance analysis of single-phase,

More information

SPEED-UP GEARBOX SIMULATIONS BY INTEGRATING SCORG. Dr. Christine Klier, Sahand Saheb-Jahromi, Ludwig Berger*

SPEED-UP GEARBOX SIMULATIONS BY INTEGRATING SCORG. Dr. Christine Klier, Sahand Saheb-Jahromi, Ludwig Berger* SPEED-UP GEARBOX SIMULATIONS BY INTEGRATING SCORG Dr. Christine Klier, Sahand Saheb-Jahromi, Ludwig Berger* CFD SCHUCK ENGINEERING Engineering Services in computational fluid Dynamics (CFD) 25 employees

More information

UberFlow: A GPU-Based Particle Engine

UberFlow: A GPU-Based Particle Engine UberFlow: A GPU-Based Particle Engine Peter Kipfer Mark Segal Rüdiger Westermann Technische Universität München ATI Research Technische Universität München Motivation Want to create, modify and render

More information

Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System

Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins, Todd Harman Scientific Computing and Imaging Institute & University of Utah I. Uintah

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Parallelization Strategy

Parallelization Strategy COSC 335 Software Design Parallel Design Patterns (II) Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure

More information

From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp

From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with andreas.schaefer@cs.fau.de Friedrich-Alexander-Universität Erlangen-Nürnberg GPU Technology Conference 2013, San José,

More information

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Webinar #3 Lattice Boltzmann method for CompBioMed (incl. Palabos)

Webinar #3 Lattice Boltzmann method for CompBioMed (incl. Palabos) Webinar series A Centre of Excellence in Computational Biomedicine Webinar #3 Lattice Boltzmann method for CompBioMed (incl. Palabos) 19 March 2018 The webinar will start at 12pm CET / 11am GMT Dr Jonas

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

Technical report: How to implement your DdQq dynamics with only q variables per node (instead of 2q)

Technical report: How to implement your DdQq dynamics with only q variables per node (instead of 2q) Technical report: How to implement your DdQq dynamics with only q variables per node (instead of 2q) Jonas Latt Tufts University Medford, USA jonas.latt@gmail.com June 2007 1 Overview Naive implementations

More information

Simulation of moving Particles in 3D with the Lattice Boltzmann Method

Simulation of moving Particles in 3D with the Lattice Boltzmann Method Simulation of moving Particles in 3D with the Lattice Boltzmann Method, Nils Thürey, Christian Feichtinger, Hans-Joachim Schmid Chair for System Simulation University Erlangen/Nuremberg Chair for Particle

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Outlining Midterm Projects Topic 3: GPU-based FEA Topic 4: GPU Direct Solver for Sparse Linear Algebra March 01, 2011 Dan Negrut, 2011 ME964

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

arxiv: v1 [cs.ms] 8 Aug 2018

arxiv: v1 [cs.ms] 8 Aug 2018 ACCELERATING WAVE-PROPAGATION ALGORITHMS WITH ADAPTIVE MESH REFINEMENT USING THE GRAPHICS PROCESSING UNIT (GPU) XINSHENG QIN, RANDALL LEVEQUE, AND MICHAEL MOTLEY arxiv:1808.02638v1 [cs.ms] 8 Aug 2018 Abstract.

More information

Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms

Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms http://www.scidacreview.org/0902/images/esg13.jpg Matthew Norman Jeffrey Larkin Richard Archibald Valentine Anantharaj

More information

Heterogeneous-Race-Free Memory Models

Heterogeneous-Race-Free Memory Models Heterogeneous-Race-Free Memory Models Jyh-Jing (JJ) Hwang, Yiren (Max) Lu 02/28/2017 1 Outline 1. Background 2. HRF-direct 3. HRF-indirect 4. Experiments 2 Data Race Condition op1 op2 write read 3 Sequential

More information

Turbostream: A CFD solver for manycore

Turbostream: A CFD solver for manycore Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware

More information

simulation framework for piecewise regular grids

simulation framework for piecewise regular grids WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler

More information

Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla

Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla SIAM PP 2016, April 13 th 2016 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer,

More information

Comparison of CPU and GPGPU performance as applied to procedurally generating complex cave systems

Comparison of CPU and GPGPU performance as applied to procedurally generating complex cave systems Comparison of CPU and GPGPU performance as applied to procedurally generating complex cave systems Subject: Comp6470 - Special Topics in Computing Student: Tony Oakden (U4750194) Supervisor: Dr Eric McCreath

More information

Domain Decomposition for Colloid Clusters. Pedro Fernando Gómez Fernández

Domain Decomposition for Colloid Clusters. Pedro Fernando Gómez Fernández Domain Decomposition for Colloid Clusters Pedro Fernando Gómez Fernández MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2004 Authorship declaration I, Pedro Fernando

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

How to Optimize Geometric Multigrid Methods on GPUs

How to Optimize Geometric Multigrid Methods on GPUs How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient

More information

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from

More information

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers Overlapping Computation and Communication for Advection on Hybrid Parallel Computers James B White III (Trey) trey@ucar.edu National Center for Atmospheric Research Jack Dongarra dongarra@eecs.utk.edu

More information

GPU Implementation of Implicit Runge-Kutta Methods

GPU Implementation of Implicit Runge-Kutta Methods GPU Implementation of Implicit Runge-Kutta Methods Navchetan Awasthi, Abhijith J Supercomputer Education and Research Centre Indian Institute of Science, Bangalore, India navchetanawasthi@gmail.com, abhijith31792@gmail.com

More information

A D3Q19 Lattice Boltzmann Solver on a GPU Using Constant-Time Circular Array Shifting

A D3Q19 Lattice Boltzmann Solver on a GPU Using Constant-Time Circular Array Shifting Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'18 383 A D3Q19 Lattice Boltzmann Solver on a GPU Using Constant-Time Circular Array Shifting Mohammadreza Khani and Tai-Hsien Wu Department of Electrical

More information

Lattice Boltzmann Method for Simulating Turbulent Flows

Lattice Boltzmann Method for Simulating Turbulent Flows Lattice Boltzmann Method for Simulating Turbulent Flows by Yusuke Koda A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science

More information

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software GPU Debugging Made Easy David Lecomber CTO, Allinea Software david@allinea.com Allinea Software HPC development tools company Leading in HPC software tools market Wide customer base Blue-chip engineering,

More information

High Performance Computing

High Performance Computing High Performance Computing ADVANCED SCIENTIFIC COMPUTING Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research Group Leader, Juelich

More information

Application of GPU technology to OpenFOAM simulations

Application of GPU technology to OpenFOAM simulations Application of GPU technology to OpenFOAM simulations Jakub Poła, Andrzej Kosior, Łukasz Miroslaw jakub.pola@vratis.com, www.vratis.com Wroclaw, Poland Agenda Motivation Partial acceleration SpeedIT OpenFOAM

More information

Optimisation Myths and Facts as Seen in Statistical Physics

Optimisation Myths and Facts as Seen in Statistical Physics Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY

More information

Large scale Imaging on Current Many- Core Platforms

Large scale Imaging on Current Many- Core Platforms Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

More information

LARGE-SCALE FREE-SURFACE FLOW SIMULATION USING LATTICE BOLTZMANN METHOD ON MULTI-GPU CLUSTERS

LARGE-SCALE FREE-SURFACE FLOW SIMULATION USING LATTICE BOLTZMANN METHOD ON MULTI-GPU CLUSTERS ECCOMAS Congress 2016 VII European Congress on Computational Methods in Applied Sciences and Engineering M. Papadrakakis, V. Papadopoulos, G. Stefanou, V. Plevris (eds.) Crete Island, Greece, 5 10 June

More information

arxiv: v1 [physics.ins-det] 11 Jul 2015

arxiv: v1 [physics.ins-det] 11 Jul 2015 GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University

More information

Designing a Domain-specific Language to Simulate Particles. dan bailey

Designing a Domain-specific Language to Simulate Particles. dan bailey Designing a Domain-specific Language to Simulate Particles dan bailey Double Negative Largest Visual Effects studio in Europe Offices in London and Singapore Large and growing R & D team Squirt Fluid Solver

More information

Progress in automatic GPU compilation and why you want to run MPI on your GPU

Progress in automatic GPU compilation and why you want to run MPI on your GPU TORSTEN HOEFLER Progress in automatic GPU compilation and why you want to run MPI on your GPU with Tobias Grosser and Tobias Gysi @ SPCL presented at CCDSC, Lyon, France, 2016 #pragma ivdep 2 !$ACC DATA

More information

Iterative Sparse Triangular Solves for Preconditioning

Iterative Sparse Triangular Solves for Preconditioning Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations

More information

Shape of Things to Come: Next-Gen Physics Deep Dive

Shape of Things to Come: Next-Gen Physics Deep Dive Shape of Things to Come: Next-Gen Physics Deep Dive Jean Pierre Bordes NVIDIA Corporation Free PhysX on CUDA PhysX by NVIDIA since March 2008 PhysX on CUDA available: August 2008 GPU PhysX in Games Physical

More information

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics

More information

High Performance Computing. University questions with solution

High Performance Computing. University questions with solution High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The

More information

OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs

OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs www.bsc.es OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs Hugo Pérez UPC-BSC Benjamin Hernandez Oak Ridge National Lab Isaac Rudomin BSC March 2015 OUTLINE

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1 Markus Hadwiger, KAUST Reading Assignment #2 (until Feb. 17) Read (required): GLSL book, chapter 4 (The OpenGL Programmable

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

Adaptive Mesh Astrophysical Fluid Simulations on GPU. San Jose 10/2/2009 Peng Wang, NVIDIA

Adaptive Mesh Astrophysical Fluid Simulations on GPU. San Jose 10/2/2009 Peng Wang, NVIDIA Adaptive Mesh Astrophysical Fluid Simulations on GPU San Jose 10/2/2009 Peng Wang, NVIDIA Overview Astrophysical motivation & the Enzo code Finite volume method and adaptive mesh refinement (AMR) CUDA

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

Introduction à OpenCL et applications

Introduction à OpenCL et applications Philippe Helluy 1, Anaïs Crestetto 1 1 Université de Strasbourg - IRMA Lyon, journées groupe calcul, 10 novembre 2010 Sommaire OpenCL 1 OpenCL 2 3 4 OpenCL GPU architecture A modern Graphics Processing

More information

Asynchronous K-Means Clustering of Multiple Data Sets. Marek Fiser, Illia Ziamtsov, Ariful Azad, Bedrich Benes, Alex Pothen

Asynchronous K-Means Clustering of Multiple Data Sets. Marek Fiser, Illia Ziamtsov, Ariful Azad, Bedrich Benes, Alex Pothen Asynchronous K-Means Clustering of Multiple Data Sets Marek Fiser, Illia Ziamtsov, Ariful Azad, Bedrich Benes, Alex Pothen Motivation Clustering bottleneck in Flow Cytometry research 3,000 data sets 25,000

More information