From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation


Erik Schnetter, Perimeter Institute
with M. Blazewicz, I. Hinder, D. Koppelman, S. Brandt, M. Ciznicki, M. Kierzynka, F. Löffler, J. Tao
XSCALE 2012, Chicago (July 15-16, 2012)

Motivation

- In astrophysics (and elsewhere), physical systems often described via systems of partial differential equations (PDEs); examples: general relativity, hydrodynamics, electrodynamics, ...
- Solved via grid-based methods: space: FD, FV, FE, ...; time: RK, ...
- Adaptive mesh refinement (AMR), multi-block
- Large HPC systems (MPI, OpenMP, GPUs)
- http://einsteintoolkit.org

[Figure: example simulation snapshot in the x-y plane]

Binary Black Hole Merger (Image: I. Hinder, AEI)

A Hard Problem

The path from a physics model to published results spans a deep software stack:

- Physics Model (EDL)
- Kranc Code Generator
- CaKernel Programming Abstractions
- Cactus-Carpet Computational Infrastructure
- Parallel Programming Environment: C++/OpenMP, CUDA, OpenCL (TBB? HPX? OpenACC?)
- Hardware: CPU, GPU, MIC?
- Results, Publication

EDL: Equation Description Language

Aim: write systems of PDEs in an easily readable manner
- High-level, LaTeX-like syntax
- Better error reporting than e.g. standard Mathematica input
- Describe: variables, equations, initial/boundary conditions, parameters, analysis quantities
- Also describe discretisation (e.g. stencils)

Scalar wave equation:

    begin calculation Init
      u = 0
      rho = A exp(-1/2 (r/w)**2)
      v_i = 0
    end calculation

    begin calculation RHS
      D_t u = rho
      D_t rho = delta^ij D_i v_j
      D_t v_i = D_i rho
    end calculation

    begin calculation Energy
      eps = 1/2 (rho**2 + delta^ij v_i v_j)
    end calculation
    ...
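
For orientation, the RHS calculation above is what the toolchain ultimately turns into a per-point stencil loop. The following is a hand-written sketch of that kind of kernel, not actual Kranc output: it uses 2nd-order centred differences and invented variable names, whereas the generated code is higher order and heavily optimised.

    // Hand-written sketch of an RHS kernel for the scalar wave equation:
    //   D_t u = rho,  D_t rho = delta^ij D_i v_j,  D_t v_i = D_i rho.
    // 2nd-order centred differences for brevity; real Kranc output differs.
    void wave_rhs(int ni, int nj, int nk, double dx,
                  const double *u, const double *rho,
                  const double *vx, const double *vy, const double *vz,
                  double *u_rhs, double *rho_rhs,
                  double *vx_rhs, double *vy_rhs, double *vz_rhs)
    {
      const double idx2 = 1.0 / (2.0 * dx);
      #define IDX(i,j,k) ((i) + ni * ((j) + nj * (k)))
      for (int k = 1; k < nk-1; ++k)
        for (int j = 1; j < nj-1; ++j)
          for (int i = 1; i < ni-1; ++i) {
            const int p = IDX(i,j,k);
            // centred first derivatives of rho and divergence of v
            const double drho_x = (rho[IDX(i+1,j,k)] - rho[IDX(i-1,j,k)]) * idx2;
            const double drho_y = (rho[IDX(i,j+1,k)] - rho[IDX(i,j-1,k)]) * idx2;
            const double drho_z = (rho[IDX(i,j,k+1)] - rho[IDX(i,j,k-1)]) * idx2;
            const double div_v  = (vx[IDX(i+1,j,k)] - vx[IDX(i-1,j,k)]) * idx2
                                + (vy[IDX(i,j+1,k)] - vy[IDX(i,j-1,k)]) * idx2
                                + (vz[IDX(i,j,k+1)] - vz[IDX(i,j,k-1)]) * idx2;
            u_rhs[p]   = rho[p];
            rho_rhs[p] = div_v;
            vx_rhs[p]  = drho_x;
            vy_rhs[p]  = drho_y;
            vz_rhs[p]  = drho_z;
          }
      #undef IDX
    }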

Kranc: Automated Code Generation and Optimisation

- Input in Mathematica (e.g. generated from EDL)
- Generate complete Cactus modules (employing AMR, multi-block, ...)
- Customised FD stencils
- Coordinate transformations for multi-block systems
- http://kranccode.org/

Kranc: Automated Code Generation and Optimisation (continued)

- Automatically check consistency with other components
- High-level optimisations (semi-automatic): loop fission, loop fusion, loop blocking, unrolling
- Low-level optimisations: CSE, SIMD, OpenMP
- Generate C++, OpenCL, CUDA
- http://kranccode.org/
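
As a hand-written illustration (not actual Kranc output) of the low-level side, here is what hoisting a common subexpression and parallelising the point loop with OpenMP looks like; in the generated code, SIMD vectorisation is applied on top of this.

    // Illustrative point-wise kernel after CSE and OpenMP annotation.
    void pointwise_kernel(int n, const double *a, const double *b,
                          double c, double *x, double *y)
    {
      #pragma omp parallel for
      for (int p = 0; p < n; ++p) {
        // common subexpression hoisted so it is evaluated once per point
        const double t = a[p] * b[p] + c;
        x[p] = t * t;
        y[p] = 2.0 * t;
      }
    }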

Cactus Framework, Carpet AMR Driver

- Cactus: software framework for HPC
  - all math/physics/computer science located in modules (components)
  - encourages/supports decentralized code development
- Carpet: Cactus driver for AMR, multi-block
  - parallelized via MPI and OpenMP
  - parallel I/O (HDF5)
- http://cactuscode.org/
- http://carpetcode.org/

Cactus Programming Abstractions

- Grid Hierarchy (GH): list of coordinates, boxes etc. describing domain and discretisation
  - grid functions live on the GH
  - storage managed by driver component
- Schedule: set of active routines and their dependencies
  - routines implement kernels (operators applied to grid functions)
  - have associated meta-data (variables read/written, stencil sizes, iteration region, ...)
- Allows high-level introspection (debugging, optimisation)
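
Conceptually, the meta-data attached to each scheduled routine amounts to a small record like the following hypothetical sketch (invented names; in Cactus this information is declared in configuration files rather than as a C++ struct):

    #include <string>
    #include <vector>

    // Hypothetical per-routine meta-data record, for illustration only.
    struct KernelMetadata {
      std::string name;                    // scheduled routine
      std::vector<std::string> reads;      // grid functions read
      std::vector<std::string> writes;     // grid functions written
      int stencil_radius;                  // ghost points required
      enum class Region { Interior, Everywhere, Boundary };
      Region region;                       // iteration region
    };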

New in Cactus: Accelerator Framework

- Not all code suitable for GPUs: administrative code, legacy code, I/O
- Avoid copying data from/to GPU if possible
  - track which routines require/provide what data where (host or device)
  - components already provide metadata for this
  - can copy data automatically
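
The underlying idea can be sketched as follows, with hypothetical names and structure rather than the actual Cactus accelerator API, assuming CUDA for the device copies:

    #include <cstddef>
    #include <cuda_runtime.h>

    // Hypothetical host/device validity tracking for one grid function.
    enum class Where { Host, Device, Both };

    struct GridFunctionBuffer {
      double *host = nullptr;     // CPU copy
      double *device = nullptr;   // GPU copy
      Where valid = Where::Host;  // which copy is up to date
    };

    // Before a GPU kernel that reads this grid function: copy only if the
    // device copy is stale.
    void require_on_device(GridFunctionBuffer &gf, std::size_t count) {
      if (gf.valid == Where::Host) {
        cudaMemcpy(gf.device, gf.host, count * sizeof(double),
                   cudaMemcpyHostToDevice);
        gf.valid = Where::Both;
      }
    }

    // After a GPU kernel that writes it: the host copy is now stale.
    void mark_written_on_device(GridFunctionBuffer &gf) {
      gf.valid = Where::Device;
    }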

Dynamic CPU/GPU Optimisations

- Emphasis: automatic, dynamic, fast optimisation for many different kernels (autotuning)
- Configurations are not known in advance (AMR, run-time parameters, varying code, different systems), so static autotuning cannot be used

Dynamic Tile Selection

- Need to tile loops to improve cache performance (GPU: to use fast memory)
  - know stencil sizes, shapes
  - know cache characteristics (line size)
  - array size changes dynamically (AMR!)
- Apply heuristics to choose tile sizes, validated by experiments
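
One plausible shape for such a heuristic is sketched below (illustrative only, not the actual CaKernel logic): keep the unit-stride dimension long for streaming, then grow the other tile extents while the working set of all touched arrays, padded by the stencil ghost zones, still fits the target cache size.

    #include <cstddef>

    struct Tile { int ti, tj, tk; };  // tile extents in i, j, k

    // Illustrative tile-size heuristic, not the actual CaKernel logic.
    Tile choose_tile(int ni, int nj, int nk, int ghost,
                     int nvars, std::size_t cache_bytes)
    {
      // Working-set size of a tile: all touched arrays, padded by ghosts.
      auto bytes = [=](int ti, int tj, int tk) {
        return std::size_t(ti + 2*ghost) * (tj + 2*ghost) * (tk + 2*ghost)
               * nvars * sizeof(double);
      };

      Tile t{ni, 1, 1};  // keep the unit-stride i extent as long as possible
      // Grow j and k while the working set still fits in cache.
      while (2*t.tj <= nj && bytes(t.ti, 2*t.tj, t.tk) <= cache_bytes) t.tj *= 2;
      while (2*t.tk <= nk && bytes(t.ti, t.tj, 2*t.tk) <= cache_bytes) t.tk *= 2;
      return t;
    }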

Lightweight Kernel Generation

- Instruction cache is small: need to generate short code
  - reduce number of array pointers, simplify index arithmetic
- Dynamic code generation turns run-time parameters into constants
  - can even hard-code loop sizes when recompiling after AMR regridding
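
A minimal sketch of the idea, assuming the generated source is subsequently compiled at run time (e.g. by the OpenCL compiler); the emitted names and body are placeholders, not the real CaKernel templates:

    #include <sstream>
    #include <string>

    // Bake run-time values (grid extents, grid spacing) into the kernel
    // source as compile-time constants; placeholder body for illustration.
    std::string emit_specialized_kernel(int ni, int nj, int nk, double dx)
    {
      std::ostringstream src;
      src << "#define NI " << ni << "\n";
      src << "#define NJ " << nj << "\n";
      src << "#define NK " << nk << "\n";
      src << "#define INV_2DX (" << (1.0 / (2.0 * dx)) << ")\n";
      src << "/* ... kernel body with fixed loop bounds and simplified"
             " index arithmetic ... */\n";
      return src.str();  // compile at run time; redo after AMR regridding
    }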

Fat Kernel Detection

- Some kernels are still just too large (e.g. the Einstein equations have 7,000 flop)
- (Semi-automated) loop fission when generating code
  - restructure code to reduce number of live variables
  - use heuristics to reduce number of threads
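
Loop fission here means splitting one large point-wise update into several passes so that fewer temporaries are live at once; a minimal hand-written illustration (not taken from the generated Einstein-equation code):

    // Before fission, one loop computes many intermediates whose lifetimes
    // overlap, so many registers are live simultaneously. After fission,
    // each pass has a small register footprint; the intermediate quantity
    // is stored in a scratch grid function between passes.
    void pass1(int n, const double *a, const double *b, double *scratch) {
      for (int p = 0; p < n; ++p)
        scratch[p] = a[p] * a[p] + b[p];
    }

    void pass2(int n, const double *b, const double *scratch, double *y) {
      for (int p = 0; p < n; ++p)
        y[p] = scratch[p] * b[p];
    }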

Integrated Performance Monitoring

- Use performance monitoring (PAPI, NVIDIA CUPTI)
- Collect data after running each kernel
- Inform the user about performance
- Planned: use this also to e.g. identify fat kernels
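
On the CPU side, per-kernel counting with PAPI might look like the following minimal sketch (error handling omitted; the PAPI_FP_OPS preset is not available on every system; CUPTI plays the analogous role on the GPU):

    #include <cstdio>
    #include <papi.h>

    // Count floating-point operations around one kernel invocation.
    void run_and_measure(void (*kernel)())
    {
      int eventset = PAPI_NULL;
      long long counts[1] = {0};

      PAPI_library_init(PAPI_VER_CURRENT);
      PAPI_create_eventset(&eventset);
      PAPI_add_event(eventset, PAPI_FP_OPS);  // floating-point operations

      PAPI_start(eventset);
      kernel();                               // the kernel under measurement
      PAPI_stop(eventset, counts);

      std::printf("kernel executed %lld floating-point operations\n", counts[0]);
    }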

Choosing CPU Affinity

- Example system: Forge, a hybrid CPU/GPU system at NCSA

Benchmarks

- Input: high-level description of the Einstein equations
- Code generated automatically, run in the Cactus framework
- 8th order finite differencing (5 ghost points); ~7,000 flop in kernel
- Einstein equations on GPU, other code on CPU
- ~10% efficiency on GPU (cf. ~20% efficiency on CPU); these are good numbers!

[Figure: breakdown of run time by activity, including RHS evaluations, RHS derivatives, RHS advection, interprocess synchronization, memory/data copies, wait/load imbalance, multipole, BH tracking, horizons, computing Psi4, OutputGH, time integrator, and boundary conditions]

Scaling Results

[Figure: time per grid point update per GPU (ns), 0 to 2000, vs. number of GPUs or CPUs (1 to 96), comparing CPU on cane (2p(12t)/node and 4p(6t)/node), GPU on cane (no x-split and x-split), and CPU on datura (2p(6t)/node)]

Conclusion

- We can auto-generate reasonably efficient code for complex equations from high-level input
- The infrastructure is portable to (almost) all HPC architectures
- Need to optimise while generating C++/CUDA code; compilers don't do that