High-level Abstraction for Block Structured Applications: A lattice Boltzmann Exploration


High-level Abstraction for Block Structured Applications: A lattice Boltzmann Exploration
Jianping Meng, Xiao-Jun Gu, David R. Emerson, Gihan Mudalige, István Reguly and Mike B. Giles
Scientific Computing Department, STFC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, UK
david.emerson@stfc.ac.uk

Outline of presentation
- Today's challenge for scientific software
- The solution concept
- OPS: brief background and history
- High-level abstraction
- Performance and early results
- Summary and conclusions

Challenge for scientific software
- Parallel hardware: GPU, Intel Xeon Phi, GPDSP, FPGA
- Software frameworks: MPI, OpenMP, CUDA, OpenCL, OpenACC
A brief timeline:
- 1999: OpenMP released
- 2001: became practical and popular; matrix multiplication tested
- 2006: NVIDIA launched CUDA (November 2006)
- 2009: OpenCL 1.0 released with Mac OS X Snow Leopard
- 2010: Mathematica 8 introduced systematic GPGPU features
- 2011: OpenACC released
- 2013: ANSYS Fluent 15, 3D AMG coupled pressure-based solver

Challenge for scientific software
For scientific application developers, adapting to emerging hardware is a time-consuming and risky task. But next-generation machines (e.g. Summit/Sierra) will have integrated GPUs. Also, the TOP500 list already includes more than 100 systems using accelerators, a number growing at roughly 25 systems per year.

OPS solution concept
The basic idea is for scientists and engineers to stay focused on their software development and let OPS provide the parallelism by abstracting away all of the message passing and the conversion to accelerators. This is particularly attractive to industry, where software investment and software lifetime are key factors in making critical engineering decisions. The complexity of today's hardware gets in the way, forcing industry to face mission-critical decisions about where to invest for future applications. The OPS (Oxford Parallel Structured software) project is developing an open-source framework for executing multi-block structured mesh applications on clusters of GPUs or multi-core CPUs and accelerators. Although OPS is designed to look like a conventional library, the implementation uses source-to-source translation to generate the appropriate back-end code for the different target platforms. For structured grid problems, OPS offers a potential way forward.

High-level abstraction
Scientific applications share many similarities:
- Data: vectors, matrices, globals (sum, max, min, ...)
- Loops: operations over each point, and time marching
- Stencils: information from neighbouring points
- Common algorithms: linear algebra solvers, multigrid, ...
So we may be able to:
- Build a high-level framework on top of these similarities
- Provide clear abstractions with reasonable performance
- Develop applications with the hardware details hidden
Example: the PETSc library
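As an illustration only (plain C++, not OPS code), this is the kind of structured-grid kernel those similarities describe: data on a regular grid, a loop over every interior point, and a five-point stencil reading neighbouring values. The function and array names here are hypothetical.

    // Illustrative sketch: a plain 5-point Jacobi update, i.e. the
    // data + loop + stencil pattern a high-level framework can abstract.
    void jacobi_step(const double *u, double *u_new, int nx, int ny) {
      for (int j = 1; j < ny - 1; j++) {      // loop over interior points
        for (int i = 1; i < nx - 1; i++) {
          // 5-point stencil: update each point from its four neighbours
          u_new[j * nx + i] = 0.25 * (u[j * nx + i - 1] + u[j * nx + i + 1] +
                                      u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
        }
      }
    }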

Higher-level abstraction for CFD
For typical applications using multi-block structured meshes:
- Blocks: complex geometry
- Data: defined on blocks
- Loops: over each block
- Stencils: finite difference, finite volume
- Halos: exchange between blocks
In other words, a domain-specific language!

Brief review of OP development
- 1994 (Fortran, PVM, unstructured): OPlus developed as a flexible parallel library for unstructured grids on distributed-memory systems using PVM (see Crumpton & Giles)
- 2011 (C/C++, MPI, CUDA, unstructured): OP2 is the second iteration of the OPlus library, now handling C/C++ and NVIDIA GPUs
- 2013 (C/C++, MPI, CUDA, structured): OPS takes the abstraction idea and applies it to structured grid problems
Reference: P.I. Crumpton and M.B. Giles, "Multigrid aircraft computations using the OPlus parallel library", Parallel CFD 1995, Kyoto, Japan, pp. 339-346

OPS: Data abstraction
[Figure: two blocks, Block 1 and Block 2, each holding a velo field, connected by a halo transfer]

    ops_block* grid2d;
    grid2d = new ops_block[block_num];
    grid2d[ib] = ops_decl_block(2, "grid2d ib");

    ops_dat* velo;
    velo = new ops_dat[block_num];
    velo[ib] = ops_decl_dat(grid2d[ib], 2, size, base, d_m, d_p,
                            (Real*)temp, "double", "velo ib");

    ops_halo* halos;
    halos = new ops_halo[halo_num];   // allocation implicit on the slide
    ops_halo_group f_halos;
    halos[1] = ops_decl_halo(velo[0], velo[1], halo_iter,
                             base_from, base_to, dir, dir);
    f_halos = ops_decl_halo_group(halo_num, halos);
    ops_halo_transfer(f_halos);
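As a complement, here is a minimal self-contained sketch of how these declarations fit together for two 2D blocks joined by one halo. This is not the authors' code: the header name, block sizes, halo indices and variable names (block_num, size, halo_iter, ...) are illustrative assumptions.

    // Minimal sketch, assuming two 256x256 blocks joined along one edge.
    #define OPS_2D
    #include "ops_seq.h"   // header name may differ between OPS versions

    int main(int argc, char **argv) {
      ops_init(argc, argv, 1);                 // last argument: diagnostics level

      const int block_num = 2;
      ops_block grid2d[block_num];
      ops_dat   velo[block_num];

      int size[] = {256, 256};                 // points per block (assumed)
      int base[] = {0, 0};                     // base index of the dat
      int d_m[]  = {-1, -1};                   // halo depth below the block
      int d_p[]  = {1, 1};                     // halo depth above the block
      double *temp = NULL;                     // NULL: let OPS allocate the data

      for (int ib = 0; ib < block_num; ib++) {
        grid2d[ib] = ops_decl_block(2, "grid2d");
        velo[ib]   = ops_decl_dat(grid2d[ib], 2, size, base, d_m, d_p,
                                  temp, "double", "velo");
      }

      // Connect the right edge of block 0 to the left halo of block 1
      int halo_iter[] = {1, 256};
      int base_from[] = {255, 0};
      int base_to[]   = {-1, 0};
      int dir[]       = {1, 2};
      ops_halo halos[1];
      halos[0] = ops_decl_halo(velo[0], velo[1], halo_iter,
                               base_from, base_to, dir, dir);
      ops_halo_group f_halos = ops_decl_halo_group(1, halos);

      ops_halo_transfer(f_halos);              // exchange the halo data

      ops_exit();
      return 0;
    }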

OPS: Parallel loop and stencil
[Figure: the same two blocks with velo fields and halo transfer as on the previous slide]

The user writes a kernel and calls it through ops_par_loop:

    void collision(const Real *f, const Real *feq, Real *fcoll, const Real *omega) {
      for (int l = 0; l < nc; l++) {
        fcoll[ops_acc_md2(l, 0, 0)] = (1 - (*omega)) * f[ops_acc_md0(l, 0, 0)]
                                      + (*omega) * feq[ops_acc_md1(l, 0, 0)];
      }
    }

    ops_par_loop(collision, "collision", grid2d[0], 2, iter_range_coll,
                 ops_arg_dat(f[0], nc, S2D_00, "double", OPS_READ),
                 ops_arg_dat(feq[0], nc, S2D_00, "double", OPS_READ),
                 ops_arg_dat(fcoll[0], nc, S2D_00, "double", OPS_WRITE),
                 ops_arg_gbl(&omega, 1, "double", OPS_READ));
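The stencil S2D_00 and the iteration range iter_range_coll used above are declared elsewhere in the application; a hypothetical sketch of those declarations (the values are assumptions, not from the slides) is:

    // Zero-offset stencil: the collision step only touches the local (0,0) point;
    // the nc distribution components are stored as a multi-dimensional ops_dat.
    int s2d_00[] = {0, 0};
    ops_stencil S2D_00 = ops_decl_stencil(2, 1, s2d_00, "00");

    // Iteration range over the block: {x_begin, x_end, y_begin, y_end}
    int iter_range_coll[] = {0, 4096, 0, 4096};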

Work flow
Develop the application using the provided function calls, then test the code with the non-optimised serial and MPI back-ends. The source-to-source translator then generates platform-specific, optimised code, which is compiled with the usual tool chain (e.g. NVCC, mpicxx).

Lattice Boltzmann model
Stream-collision scheme (the slide illustrates the collision and stream steps):
- Easy to understand
- Easy to parallelise
- Particularly suitable for the Navier-Stokes equations
- Other schemes can be implemented
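Written out, the BGK stream-collision update implemented by the collision kernel shown earlier is as follows (standard notation, not taken from the slides; \omega is the relaxation parameter and f_i^{eq} the equilibrium distribution):

    % Collision at each node, then streaming along the lattice velocities c_i.
    \begin{align}
      f_i^{\mathrm{coll}}(\mathbf{x}, t) &= (1 - \omega)\, f_i(\mathbf{x}, t)
          + \omega\, f_i^{\mathrm{eq}}(\mathbf{x}, t), \\
      f_i(\mathbf{x} + \mathbf{c}_i\,\Delta t,\; t + \Delta t) &= f_i^{\mathrm{coll}}(\mathbf{x}, t),
      \qquad i = 0, \dots, 8 \;\text{(D2Q9)}.
    \end{align}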

Performance and early results
- Problem: 2D Taylor-Green vortex
- Grid size: 4096 x 4096
- Total time steps: 500
- Lattice: D2Q9
- Hardware: Ivy Bridge E5 (iDataPlex), POWER8
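For context, the 2D Taylor-Green vortex has an analytic decaying solution against which a lattice Boltzmann code can be checked. The formula below is the standard one and is not taken from the slides; U_0 is the initial velocity amplitude, k the wavenumber, \nu the kinematic viscosity and \rho_0 the reference density.

    \begin{align}
      u(x, y, t) &= -U_0 \cos(kx)\,\sin(ky)\, e^{-2\nu k^2 t}, \\
      v(x, y, t) &= \phantom{-}U_0 \sin(kx)\,\cos(ky)\, e^{-2\nu k^2 t}, \\
      p(x, y, t) &= p_0 - \frac{\rho_0 U_0^2}{4}\,\bigl(\cos(2kx) + \cos(2ky)\bigr)\, e^{-4\nu k^2 t}.
    \end{align}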

Performance and early results
Performance on POWER8:
- Compiler: IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)
- CUDA compilation tools: release 7.0, V7.0.27
- MPI: MPICH 3.1.4, built with GCC

Performance and early results
- iDataPlex: Intel C/C++ 15.2.164, Intel MPI 5.0.3
- POWER8: IBM XL C/C++ 13.1.2; CUDA 7.0 V7.0.27 (K40); CUDA 7.5 V7.5.21 (K80); OpenMPI

Performance and early results
Energy consumption (CPU figures measured on the Intel Ivy Bridge E5 system, taking idle cores into account):

    Cores/GPU    DL (kWh)    OPS (kWh)
    12           0.0072      0.0061
    24           0.00397     0.00477
    48           0.0028      0.00371
    1 K40        -           0.00931
    1 K80        -           0.00459

Summary and conclusions
- OPS provides a concise high-level abstraction
- OPS can reduce complexity and speed up parallel application development
- OPS shows decent performance
- Users need a general understanding of CUDA, OpenACC, ...
- To adapt to new hardware, we mainly need to update the translator
- Knowledge of Python and regular expressions is necessary
- Although not a new concept, modern hardware complexity makes OPS or a similar methodology look attractive... but...

Acknowledgements
This work was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/K038494/1, "Future-proof massively-parallel execution of multiblock applications". For more information on this work, please visit http://oerc.ox.ac.uk/projects/ops


29th International Conference on Parallel Computational Fluid Dynamics
15-17 May 2017, Technology and Innovation Centre, University of Strathclyde, Glasgow, www.parcfd.org
Parallel CFD is the annual conference series dedicated to the discussion of recent developments and applications of parallel computing in the field of CFD and related disciplines.
Chairman: Professor Dimitris Drikakis
Co-chairmen: Professor David Emerson and Dr. Marco Fossati
Extended abstract submission deadline: 5 February 2017