Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures


Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures
Dirk Ribbrock, Markus Geveler, Dominik Göddeke, Stefan Turek
Angewandte Mathematik, Technische Universität Dortmund
May 31, 2010

Overview
1 High performance CFD simulations: Hardware, Software, Implementational issues
2 Results
3 Conclusion + Future Work


From serial to parallel hardware

Different levels of parallelism
1 microprocessor level (and beyond): ILP, SIMD, multi-threading on a chip
2 node level: parallel microprocessors in a compute node
3 clusters: parallel compute nodes in a cluster

Multi- and many-core architectures
1 doubling the number of transistors now means doubling the number of cores
2 multi-core architectures have been established in the mass market
3 many-core architectures too (GPUs)
4 ever more heterogeneity: Cell BE

The challenge

Application programmers have to deal with both implementing the specific functionality and the paradigm shift of the underlying hardware:
- high-level optimisation on the application level
- hardware-oriented implementation to exploit parallel hardware

Today: focus on the method with respect to multi-level parallelism.

Efficient methods for CFD

Application programmers have to deal with:
- an increase in data being acquired, for fast ad-hoc processing or future analysis
- applications hungry for ever larger amounts of processing and memory resources
- application design that is highly interdisciplinary

The impact on CFD simulation software
- CFD becomes ubiquitous: applications in many different fields
- fast access to result data is an economic issue; software solutions must be reliable, portable, designed for testability, extendable and flexible
- real-time constraints become more important

Model Problem

Incompressible, laminar fluids with free surfaces: render fluid volumes at interactive rates for use in a 3D environment (interactive demonstrations, computer games, tsunami forecast, ...).

Quality criteria
- real-time constraint, embedded in a virtual environment
- qualitative correctness: correct behaviour
- quantitative correctness: numerical quality less important
- stability (!)

Design goals
- implementation within a building-blocks library for maximum reusability
- exploit multi-level parallelism

LBM for SWE

D2Q9 lattice: the direct method simulates particle propagation on a lattice with nine propagation directions.

Lattice Boltzmann equation (LBE):

f_\alpha(\mathbf{x} + \mathbf{e}_\alpha \Delta t,\; t + \Delta t) = f_\alpha(\mathbf{x}, t) - \frac{1}{\tau}\left(f_\alpha - f_\alpha^{eq}\right)
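To make the update rule concrete, here is a minimal single-patch C++ sketch of one BGK collide-and-stream step on a D2Q9 lattice for the shallow water equations. The Zhou-type equilibrium, the array layout and all names are illustrative assumptions for this talk's setting, not the HONEI implementation.

```cpp
// One BGK collide-and-stream step for the D2Q9 shallow water LBM,
//   f_a(x + e_a*dt, t + dt) = f_a(x, t) - (1/tau) * (f_a - f_a^eq).
// Sketch only: layout and equilibrium are assumptions, not the HONEI code.
#include <vector>

constexpr int Q = 9;
constexpr int ex[Q] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };  // lattice directions
constexpr int ey[Q] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

struct Patch {
    int nx, ny;
    std::vector<double> f[Q];      // one distribution array per direction
    std::vector<double> h, u, v;   // water depth and velocity per site
};

// Shallow water equilibrium distribution (Zhou-type); e is the lattice speed
// dx/dt, g the gravitational acceleration.
double feq(int a, double h, double u, double v, double g, double e)
{
    const double eu = e * (ex[a] * u + ey[a] * v);
    const double uu = u * u + v * v;
    const double e2 = e * e;
    if (a == 0)
        return h - 5.0 * g * h * h / (6.0 * e2) - 2.0 * h * uu / (3.0 * e2);
    if (a <= 4)   // axis-aligned directions
        return g * h * h / (6.0 * e2) + h * eu / (3.0 * e2)
             + h * eu * eu / (2.0 * e2 * e2) - h * uu / (6.0 * e2);
    // diagonal directions
    return g * h * h / (24.0 * e2) + h * eu / (12.0 * e2)
         + h * eu * eu / (8.0 * e2 * e2) - h * uu / (24.0 * e2);
}

// One collide-and-stream step on a periodic patch (obstacles omitted here).
void collide_stream(const Patch& src, Patch& dst, double tau, double g, double e)
{
    for (int y = 0; y < src.ny; ++y)
        for (int x = 0; x < src.nx; ++x) {
            const int i = y * src.nx + x;
            for (int a = 0; a < Q; ++a) {
                const double post = src.f[a][i]
                    - (src.f[a][i] - feq(a, src.h[i], src.u[i], src.v[i], g, e)) / tau;
                const int xn = (x + ex[a] + src.nx) % src.nx;   // periodic wrap
                const int yn = (y + ey[a] + src.ny) % src.ny;
                dst.f[a][yn * src.nx + xn] = post;              // stream to neighbour
            }
        }
}
```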

Implementation

But: relying upon compilers? Dealing with hardware details? The memory wall? ...?

Framework: HONEI libraries

Primary design goal: abstract from the target hardware!

Backends:
- x86 SIMD (SSE2 intrinsics)
- x86 multi-core (pthreads)
- distributed-memory clusters (MPI)
- NVIDIA GPUs (CUDA 2.2), including multiple GPUs
- Cell BE (libspe2)
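The following hypothetical C++ sketch shows the compile-time tag-dispatch pattern such a hardware-abstracting building-blocks library can use to select a backend. The tag names echo the backend labels on the results slides (CPU::SSE, GPU::CUDA, ...), but this is not the actual HONEI interface.

```cpp
// Hypothetical backend tag dispatch: each building block is specialised per
// hardware tag, so application code stays independent of the target hardware.
#include <cstddef>

namespace tags {
    struct CPU {};            // generic x86 fallback
    struct SSE {};            // x86 SIMD backend (SSE2 intrinsics)
    struct MultiCore {};      // pthread-based multi-core backend
    struct CUDA {};           // NVIDIA GPU backend
}

// Building block: y <- y + alpha * x, specialised per backend tag.
template <typename Tag_> struct ScaledSum;

// Portable reference implementation; an SSE specialisation would process two
// doubles per instruction, a CUDA specialisation would launch a device kernel.
template <> struct ScaledSum<tags::CPU> {
    static void value(double* y, const double* x, double alpha, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += alpha * x[i];
    }
};

// Usage: ScaledSum<tags::CPU>::value(y, x, alpha, n);
```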

Data structures

Packed lattice, domain decomposition:
- cut the domain in one dimension (patch 0, patch 1, ...)
- compress the lattice: leave out stationary obstacles and store the remaining sites contiguously in memory
- well suited for SIMD / SIMT
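A minimal sketch of the packed-lattice idea, under the assumption that obstacle sites are simply dropped and the remaining fluid sites are stored back to back so that kernels sweep plain linear arrays; names and layout are illustrative, not the exact HONEI data structures.

```cpp
// Illustrative "packed lattice": obstacle sites are removed, fluid sites are
// stored contiguously, which maps well onto SIMD/SIMT hardware.
#include <cstddef>
#include <cstdint>
#include <vector>

struct PackedLattice {
    std::vector<std::uint32_t> site;   // packed index -> original index y*nx + x
    std::vector<double> h, u, v;       // macroscopic fields, one entry per fluid site
    std::vector<double> f[9];          // distributions, one contiguous array per direction
};

// Build the packed lattice from an obstacle mask (true = stationary obstacle).
PackedLattice pack(const std::vector<bool>& obstacle, int nx, int ny)
{
    PackedLattice p;
    for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x)
            if (!obstacle[static_cast<std::size_t>(y) * nx + x])
                p.site.push_back(static_cast<std::uint32_t>(y * nx + x));

    const std::size_t n = p.site.size();
    p.h.assign(n, 0.0); p.u.assign(n, 0.0); p.v.assign(n, 0.0);
    for (auto& fa : p.f) fa.assign(n, 0.0);
    return p;
}
```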

Define building blocks


Results

Results: solver performance

LBM solver, 1500² lattice sites (MFLUPS = million fluid lattice updates per second):

Configuration           MFLUPS   Speedup
QS22 Blade                22.36      -
1 Core i7 core            15         -
4 Core i7 cores           40        2.66
1 GTX 285                193         -
2 GTX 285                348        1.80
4 CPU cores + 1 GPU      217         -
4 CPU cores + 2 GPUs     362        1.66

Hybrid LBM solver (figure): patches assigned to the GPU::CUDA and CPU::SSE / CPU::MultiCore::SSE backends.

Results: kernel runtime breakdown (share of total runtime in %)

                       EquilibriumDistribution   CollideStream   Extraction   Boundaries
Opteron, 3 threads               61                    24             15            0
Core i7, 4 threads               48                    35             17            0
QS22 blade                        7                    72             16            5
GeForce 8800 GTX                 19                    52             19           10
GeForce GTX 285                  32                    37             24            7

Results: Opteron cluster, MPI

Strong scalability (figure): 4 nodes, Opteron 2214; MLUP/s over lattice sizes from 250² to 2800², for 1 job on 1 node, 2 jobs on 2 nodes, 3 jobs on 3 nodes, 4 jobs on 4 nodes, and 8 jobs on 4 nodes.

Weak scalability (LiDO, n = 600):

#lattice sites   time
n²                9.5
2n²               9.7
4n²               9.7
8n²               9.8
16n²              9.9

Results: real-time performance on the GPU

Hardware: GeForce 8800 GTX and GTX 285. (Figure: total execution time per timestep over lattice size N = M, up to 3000², with the 30 fps marks for the GTX 285, the 8800 GTX and a Core i7.)


Conclusion + Future Work

Lessons learned
- Advanced CFD applications can be performed in real time on a single node at reasonable resolution.
- The LBM is well suited for this kind of application: it is easy to extend on the discrete level and, in its basics, well suited for spatial parallelism.
- Drawbacks: the presented method is only first-order accurate and, as always, hard to stabilise.

Perspectives
- More numerical and performance comparisons with state-of-the-art SWE solvers (FD, FV)
- Large-scale faster-than-real-time simulations for tsunami forecast
- Higher-order numerics for real-time simulations (FEM)

Acknowledgements

Supported by the DFG (project TU 102/22-2) and the BMBF ("HPC Software für skalierbare Parallelrechner": SKALB project 01IH08003D). Thanks to IBM for access to Cell hardware.