Generating high-performance multiplatform finite element solvers using the Manycore Form Compiler and OP2


Graham R. Markall, Florian Rathgeber, David A. Ham, Paul H. J. Kelly, Carlo Bertolli, Adam Betts (Imperial College London); Mike B. Giles, Gihan R. Mudalige (University of Oxford); Istvan Z. Reguly (Pazmany Peter Catholic University, Hungary); Lawrence Mitchell (University of Edinburgh)

How do we get performance portability for the finite element method?
- Use a form compiler with pluggable backend support
- One backend so far: CUDA for NVIDIA GPUs
- Long-term plan: target an intermediate representation

Manycore Form Compiler
(Toolchain diagram: Dolfin and Fluidity both provide UFL input; FFC compiles UFL to UFC code for Dolfin, while MCFC compiles UFL to CUDA code for Fluidity.)
- Compile-time code generation, with plans to move to runtime code generation
- Generates assembly and marshalling code
- Designed to support isoparametric elements

MCFC pipeline
(Pipeline: code string -> preprocessing -> execution -> form processing -> partitioning -> backend code generator)
- Preprocessing: insert Jacobian and transformed gradient operators into the forms
- Execution: run the code in the Python interpreter and retrieve the Form objects from its namespace
- Form processing: compute_form_data()
- Partitioning: helps loop-nest generation
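A minimal sketch (not MCFC source) of the execution and form-processing stages described above, assuming only standard Python and UFL's compute_form_data; the helper name execute_and_collect_forms is hypothetical.

    import ufl
    from ufl.algorithms import compute_form_data

    def execute_and_collect_forms(ufl_source):
        """Run a UFL code string and return the Form objects it defines."""
        namespace = {}
        exec(ufl_source, namespace)            # execution stage: run in the Python interpreter
        return {name: obj for name, obj in namespace.items()
                if isinstance(obj, ufl.Form)}  # retrieve the Form objects from the namespace

    # Form-processing stage: lower each retrieved form ready for code generation.
    # forms = execute_and_collect_forms(source_string)
    # form_data = {name: compute_form_data(f) for name, f in forms.items()}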

Preprocessing
Handles the coordinate transformation as part of the form, using UFL primitives:

    x = state.vector_fields['coordinate']
    J = Jacobian(x)
    invj = Inverse(J)
    detj = Determinant(J)

Each form is multiplied by detj, and the derivative operators are overloaded, e.g.:

    def grad(u):
        return ufl.dot(invj, ufl.grad(u))

Code generation gives no special treatment to the Jacobian, its determinant or its inverse.

Loop nest generation
The loops in a typical assembly kernel:

    for (int i = 0; i < 3; ++i)
      for (int j = 0; j < 3; ++j)
        for (int q = 0; q < 6; ++q)
          for (int d = 0; d < 2; ++d)
            ...

The loop structure is inferred from the preprocessed form:
- Basis-function loops: use the rank of the form
- Quadrature loop: the quadrature degree is known
- Dimension loops: find all the IndexSum indices
The form graph is descended recursively, identifying maximal sub-graphs that share sets of indices.
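The inference rules above can be pictured with a short hypothetical sketch; form_data and its attributes (rank, dofs_per_element, num_quadrature_points, index_sum_indices, geometric_dimension) stand in for whatever MCFC actually records and are not its real API.

    def infer_loop_nest(form_data):
        """Return (loop name, extent) pairs for an assembly kernel."""
        loops = []
        # Basis-function loops: one per argument, i.e. the rank of the form.
        for r in range(form_data.rank):
            loops.append(("basis_%d" % r, form_data.dofs_per_element))
        # Quadrature loop: the number of points is known at compile time.
        loops.append(("quadrature", form_data.num_quadrature_points))
        # Dimension loops: one per distinct IndexSum index in the form graph.
        for idx in form_data.index_sum_indices:
            loops.append(("dim_%s" % idx, form_data.geometric_dimension))
        return loops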

Partitioning example
(Figure, shown over three slides with successive partitions highlighted: the expression graph of the form is a Sum of two sub-graphs, a product of basis functions scaled by an IntValue and an IndexSum over a product of SpatialDerivatives; each maximal sub-graph sharing a set of indices becomes one partition.)

Partition code generation
Once we know which loops to generate:
- Generate an expression for each partition (a subexpression)
- Insert each subexpression into the loop nest at the depth determined by the indices it refers to
- Traverse the topmost expression of the form, generate an expression that combines the subexpressions, and insert it into the loop nest
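A small sketch of the placement rule, using the loop order of the example that follows (i, j, q, d); the function name and the representation of a subexpression by its set of indices are hypothetical.

    def placement_depth(subexpr_indices, loop_order=("i", "j", "q", "d")):
        """A subexpression is inserted just inside the innermost loop whose index it uses."""
        depth = 0
        for level, index in enumerate(loop_order, start=1):
            if index in subexpr_indices:
                depth = level
        return depth

    # arg[i,q]*arg[j,q] uses {i, j, q}            -> depth 3 (inside the q loop)
    # d_arg[d,i,q]*d_arg[d,j,q] uses {i, j, q, d} -> depth 4 (inside the d loop)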

Code gen example (the loop nest is built up incrementally; the final state is):

    for (int i = 0; i < 3; ++i) {
      for (int j = 0; j < 3; ++j) {
        LocalTensor[i][j] = 0.0f;                         // zero the local tensor entry
        for (int q = 0; q < 6; ++q) {
          SubExpr0 = 0.0f;                                // one accumulator per partition
          SubExpr1 = 0.0f;
          SubExpr0 += arg[i][q] * arg[j][q];              // partition with indices {i, j, q}
          for (int d = 0; d < 2; ++d) {
            SubExpr1 += d_arg[d][i][q] * d_arg[d][j][q];  // partition with indices {i, j, q, d}
          }
          LocalTensor[i][j] += SubExpr0 + SubExpr1;       // combine the partitions (the Sum node)
        }
      }
    }

Benchmarking MCFC and DOLFIN
Comparing and profiling assembly + solve of an advection-diffusion test case:

    t = Coefficient(FiniteElement("CG", "triangle", 1))
    p = TrialFunction(t)
    q = TestFunction(t)
    diffusivity = 0.1

    M = p*q*dx
    adv_rhs = (q*t + dt*dot(grad(q), u)*t)*dx
    t_adv = solve(M, adv_rhs)

    d = -dt*diffusivity*dot(grad(q), grad(p))*dx
    A = M - 0.5*d
    diff_rhs = action(M + 0.5*d, t_adv)
    tnew = solve(A, diff_rhs)

Experiment setup
- Term-split advection-diffusion equation: advection with Euler timestepping, diffusion with an implicit theta scheme
- Solver: CG with Jacobi preconditioning (DOLFIN: PETSc; MCFC: solver from Markall, 2009)
- CPU: 2 x 6-core Intel Xeon E5650 Westmere (HT off), 48 GB RAM; GPU: NVIDIA GTX 480
- Mesh: 34428 unstructured elements on a square domain, run for 64 timesteps
- DOLFIN setup: tensor representation, C++ optimisations on, form compiler optimisations off, MPI parallel

Adv-diff runtime

Breakdown of solver runtime (8 cores)

Dolfin profile
% Exec.    Function
5.8549     pair<boost::unordered_detail::hash_iterator_base<allocator<unsigned ...>::emplace()
.9482      MatSetValues_MPIAIJ()
.247       malloc_consolidate
7.48235    _int_malloc
6.9363     dolfin::SparsityPattern::~SparsityPattern()
2.68       dolfin::UFC::update()
2.48799    MatMult_SeqAIJ()
2.48758    ffc_form_d2c6cdbe28542a53997b6972359545bb3cc_cell_integral::tabulate_tensor()
2.368      /usr/lib/openmpi/lib/libopen-pal.so...
2.2247     boost::unordered_detail::hash_table<boost::unordered_detail::set<boost::hash<...>::rehash_impl()
.9389      dolfin::MeshEntity::entities()
.89775     _int_free
.83794     free
.737       malloc
.523       /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
.47677     /usr/lib/x86_64-linux-gnu/libstdc++.so.6..5
.47279     poll
.42863     ffc_form_95862b38a944a3a64374d9d4be688fdbd8_cell_integral::tabulate_tensor()
.8282      dolfin::SparsityPattern::insert()
.3536      ffc_form_ba8885bc23bf6ecc84f2b9c72327944f_cell_integral::tabulate_tensor()
.8694      ffc_form_23b22f9865ca4de7884edcf285d35d5a55a3_cell_integral::tabulate_tensor()
.983646    dolfin::GenericFunction::evaluate()
.95484     dolfin::Function::restrict()
.8699      VecSetValues_MPI()

MCFC CUDA profile
% Exec.    Kernel
28.7       Matrix addto
4.9        Diffusion matrix local assembly
7.         Vector addto
4.         Diffusion RHS
2.         Advection RHS
.5         Mass matrix local assembly
42.6       Solver kernels

Thoughts
- Targeting the hardware directly allows efficient implementations to be generated
- The MCFC CUDA backend embodies form-specific and hardware-specific knowledge
- We need to target a performance-portable intermediate representation

Layers manage complexity. Each layer of the IR introduces optimisations that are not possible in the higher layers, with less complexity than the lower layers:
- Unified Form Language (form compiler): local assembly; quadrature vs. tensor representation; FErari optimisations
- OP2, an unstructured-mesh domain-specific language (DSL): global assembly; matrix format; assembly algorithm; data structures
- Backend-specific layer (classic optimisations): large parallel clusters using MPI; multicore CPUs using OpenMP, SSE, AVX; GPUs using CUDA and OpenCL; streaming dataflow using FPGAs

Why OP2 for MCFC?
- It isolates a kernel that performs an operation for every mesh component (local assembly)
- The job of OP2 is to control all of the code necessary to apply that kernel, fast, pushing the OpenMP, MPI, OpenCL, CUDA and AVX issues into the OP2 compiler
- It abstracts away the matrix representation, so OP2 controls whether (and how/when) the matrix is assembled

Summary
- High-performance implementations are obtained by flattening out abstractions
- Flattening abstractions increases complexity; we need to combat this with a new, appropriate abstraction
- This greatly reduces the implementation space the form compiler has to work with, whilst still allowing performance portability
- The MCFC OP2 implementation is ongoing

Spare slides

MCFC compile/run flow
(Diagram: the Fluidity markup file is read by SPUD into an options tree, which provides the UFL code and state (fields, elements) to MCFC; MCFC emits OP2 code, the OP2 translator turns it into CUDA code for the NVIDIA compiler, and the resulting object code is linked with the Fluidity runtime code into a simulation executable that produces the output.)

OP2 matrix support
Matrix support follows from iteration spaces: what is the mapping between threads and elements? For example, on GPUs: for low order, one thread per element; for higher order, one thread block per element.
- OP2 extends iteration spaces to the matrix indices
- OP2 abstracts matrices completely from the user: they are inherently temporary data types
- There is no concept of getting the matrix back from OP2
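A tiny illustrative sketch (plain Python, not OP2 code) of the extended iteration space: every point carries the element together with the pair of local matrix indices, and the backend is free to map these points onto threads or thread blocks as it sees fit.

    def iteration_space(num_elements, m, n):
        """Yield every (element, i, j) point of an m-by-n matrix iteration space."""
        for e in range(num_elements):
            for i in range(m):
                for j in range(n):
                    yield (e, i, j)

    # For the mass-matrix kernel below, the space would be iteration_space(num_cells, 3, 3).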

An OP2 kernel for local assembly of the mass matrix:

    // OP2 kernel. A is a pointer to a single matrix element, x points to the
    // coordinates of the current element, and i, j are the iteration-space variables.
    void mass(float *A, float *x[2], int i, int j)
    {
      int q;
      float J[2][2];
      float detj;
      const float w[3] = { 0.166667f, 0.166667f, 0.166667f };
      const float CG[3][3] = { { 0.666667f, 0.166667f, 0.166667f },
                               { 0.166667f, 0.666667f, 0.166667f },
                               { 0.166667f, 0.166667f, 0.666667f } };

      J[0][0] = x[1][0] - x[0][0];
      J[0][1] = x[2][0] - x[0][0];
      J[1][0] = x[1][1] - x[0][1];
      J[1][1] = x[2][1] - x[0][1];
      detj = J[0][0] * J[1][1] - J[0][1] * J[1][0];

      for (q = 0; q < 3; q++)
        *A += CG[i][q] * CG[j][q] * detj * w[q];
    }

The corresponding parallel loop invocation:

    // Kernel: void mass(float *A, float *x[2], int i, int j)
    op_par_loop(mass,
                op_iteration_space(elements, 3, 3),          // set and matrix dimensions
                op_arg_mat(mat,                              // matrix dataset
                           op_i(1), elem_node,               // iteration-space indices for the gather,
                           op_i(2), elem_node,               //   through the elem_node mapping
                           OP_INC),                          // access specifier
                op_arg_dat(xn, OP_ALL, elem_node, OP_READ)); // dataset, mapping, access specifier

The OP2 abstraction
The mesh is represented in a general manner as a graph. Primitives:
- Sets (e.g. cells, vertices, edges)
- Mappings (e.g. from cells to vertices)
- Datasets (e.g. coefficients)
No mesh entity requires special treatment: cells, vertices, etc. are simply entities of different arity.
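A toy illustration (plain Python, not the OP2 API) of this data model: sets are just sizes, mappings are per-entity index tuples, and datasets are arrays of values.

    num_cells    = 2                              # set of cells (two triangles)
    num_vertices = 4                              # set of vertices
    cell_vertex  = [(0, 1, 2), (1, 3, 2)]         # mapping: cells -> vertices (arity 3)
    coordinates  = [[0.0, 0.0], [1.0, 0.0],       # dataset: one coordinate pair per vertex
                    [0.0, 1.0], [1.0, 1.0]]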

The OP2 abstraction
Parallel loops specify:
- A kernel
- An iteration space: a set
- An access descriptor: the datasets to pass to the kernel, and the mappings through which they are accessed
The OP2 runtime handles applying the kernel at each point in the iteration space, feeding it the data specified in the access descriptor; a minimal sequential model of these semantics is sketched below.
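A minimal sequential model of these semantics (an illustration only, not the OP2 API): apply the kernel at each element of the iteration set, gathering each dataset through its mapping as named in the access descriptor. Datasets are lists of mutable per-entity lists, so a kernel can increment them in place.

    READ, WRITE, INC = "READ", "WRITE", "INC"   # access specifiers (illustrative)

    def par_loop(kernel, set_size, args):
        """args is a list of (dataset, mapping, access) triples; mapping is None
        when the dataset is defined directly on the iteration set."""
        for e in range(set_size):
            gathered = []
            for data, mapping, access in args:
                if mapping is None:
                    gathered.append(data[e])                        # direct access
                else:
                    gathered.append([data[v] for v in mapping[e]])  # indirect gather
            kernel(e, *gathered)

    # Using the toy mesh above: hand each cell the coordinates of its three vertices.
    # par_loop(lambda e, x: print(e, x), num_cells, [(coordinates, cell_vertex, READ)])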