OpenStaPLE, an OpenACC Lattice QCD Application


Enrico Calore, Postdoctoral Researcher, Università degli Studi di Ferrara and INFN Ferrara, Italy
GTC Europe, Munich, October 10th, 2018

Outline
1 OpenACC Staggered Parallel LatticeQCD Everywhere
  - Motivations
  - Design & Implementation
2 Performance analysis
  - COKA Cluster
  - DAVIDE Cluster
3 Conclusions


HPC community specificity
For HPC scientific applications, software development has to adapt to specific characteristics:
- Software lifetime may be very long, even tens of years.
- Software must be portable across current and future HPC hardware architectures, which are very heterogeneous (e.g., CPU, GPU, MIC, etc.).
- Software has to be strongly optimized to exploit the available hardware and obtain the best performance.

Making decisions in uncertain times
A large fraction of the computing power of modern HPC systems is provided by highly parallel accelerators, such as GPUs. Although the community is reluctant to embrace technologies that are not yet consolidated, the quest for performance has led to the adoption of languages such as CUDA or OpenCL.
- Proprietary languages prevent code portability: one needs to maintain multiple code versions.
- Open-specification languages may not be supported by all vendors: one needs to re-implement the code and to maintain multiple code versions.

The use of OpenACC as a prospective solution
Code modifications could be minimal, thanks to the annotation of pre-existing C code using #pragma directives. Programming effort is needed mainly to re-organize the data structures and to efficiently design data movements.
If OpenACC were ever superseded, the programming effort would not be lost:
- other directive-based languages would also benefit from the data re-organization and the efficiently designed data movements;
- switching between directive-based languages should be just a matter of changing the #pragma directive syntax (see the sketch below).
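As a minimal illustration of the last point, the same saxpy-like loop can be offloaded either with an OpenACC directive or with an OpenMP 4.5+ offload directive; only the pragma line changes, while the C code and the data layout stay identical. This is a toy example, not OpenStaPLE code:

    #include <stddef.h>

    void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
    {
        /* OpenACC version; the OpenMP 4.5 offload equivalent would only swap the directive:
           #pragma omp target teams distribute parallel for map(to:x[0:n]) map(tofrom:y[0:n]) */
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }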


The use of OpenACC as a prospective solution: the case of Lattice QCD
Existing versions of the code target different architectures:
- C++ targeting x86 CPUs;
- C++/CUDA targeting NVIDIA GPUs.
The goal is to design and implement one single version:
- with good performance on present high-end architectures;
- portable across the different architectures;
- easy to maintain, allowing scientists to change and improve the code;
- possibly portable, or easily portable, also to future, yet unknown, architectures.


Hot Spot: the Dirac Operator
Most of the running time of an LQCD simulation is spent applying the Dirac operator, a stencil operator over a 4-dimensional lattice:
- D_eo reads from the even sites of the lattice and writes to the odd ones;
- D_oe reads from the odd sites and writes to the even ones.
Both perform SU(3) matrix-vector multiplications on complex floating-point numbers.
It is a strongly memory-bound operation on most architectures: roughly 1 FLOP/Byte.
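The even/odd split mentioned above is the standard checkerboard decomposition: a site is even or odd according to the parity of the sum of its coordinates, and each half-lattice of sizeh = nx*ny*nz*nt/2 sites is indexed separately. A minimal sketch of such an indexing follows; the function names and the exact index ordering are illustrative assumptions, not necessarily the OpenStaPLE ones:

    /* Illustrative even/odd (checkerboard) indexing; names and ordering are assumed. */
    static inline int site_parity(int x, int y, int z, int t)
    {
        return (x + y + z + t) & 1;        /* 0 = even site, 1 = odd site */
    }

    static inline int half_lattice_index(int x, int y, int z, int t,
                                         int nx, int ny, int nz)
    {
        /* two sites of equal parity share the same (x/2, y, z, t) coordinates,
           so the x dimension of each half lattice has nxh = nx/2 entries      */
        return x / 2 + (nx / 2) * (y + ny * (z + nz * t));
    }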


Planning the memory layout for LQCD: AoS vs SoA
The first version in C++, targeting CPU-based clusters, adopts an Array of Structures (AoS):

    /* fermions stored as AoS */
    typedef struct {
        double complex c1;   /* component 1 */
        double complex c2;   /* component 2 */
        double complex c3;   /* component 3 */
    } vec3_aos_t;

    vec3_aos_t fermions[sizeh];

A later version in C++/CUDA, targeting NVIDIA GPU clusters, adopts a Structure of Arrays (SoA):

    /* fermions stored as SoA */
    typedef struct {
        double complex c0[sizeh];   /* component 1 */
        double complex c1[sizeh];   /* component 2 */
        double complex c2[sizeh];   /* component 3 */
    } vec3_soa_t;

    vec3_soa_t fermions;
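The motivation for the SoA layout is the resulting memory access pattern: threads (or vector lanes) processing consecutive lattice sites touch consecutive memory locations, while with AoS the same accesses are strided by the size of the whole structure. A toy loop making the difference explicit, built on the typedefs above (the function itself is only an illustration):

    #include <complex.h>

    /* Toy illustration of the access-pattern difference (not OpenStaPLE code);
       vec3_aos_t, vec3_soa_t and sizeh are the definitions shown above.       */
    double complex sum_component(const vec3_aos_t *aos, const vec3_soa_t *soa)
    {
        double complex acc_aos = 0.0, acc_soa = 0.0;
        for (int i = 0; i < sizeh; i++) {
            acc_aos += aos[i].c1;   /* AoS: stride of 48 bytes (3 double complex)   */
            acc_soa += soa->c0[i];  /* SoA: contiguous 16-byte elements, which map
                                       to coalesced GPU loads and CPU vector loads */
        }
        return acc_aos + acc_soa;
    }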

The SU(3) matrix-fermion multiplication performance: testing data layout and data type

Table: Execution time [ms] to perform 32^4 vector-SU(3) multiplications (DP)

    Data Type   Layout   NVIDIA K20 GPU   Intel E5-2620v2       Intel E5-2630v3
                                          Naive     Vect.       Naive     Vect.
    Complex     AoS      8.75             30.16     n.a. (1)    20.47     n.a. (1)
    Complex     SoA      1.45             45.75     32.21       18.69     13.93
    Double      SoA      1.48             106.90    38.58       43.69     16.08

(1) Vectorization is not possible when using the AoS data layout.
The Intel Xeon E5-2620v2 implements AVX instructions; the Intel Xeon E5-2630v3 implements AVX2 and FMA3 instructions.

C. Bonati, E. Calore, S. Coscetti, M. D'Elia, M. Mesiti, F. Negro, S. F. Schifano, R. Tripiccione, "Development of Scientific Software for HPC Architectures Using OpenACC: The Case of LQCD", IEEE/ACM SE4HPCS 2015. doi: 10.1109/SE4HPCS.2015.9

Fermion vector data structure

    typedef struct {
        double complex c0[sizeh];
        double complex c1[sizeh];
        double complex c2[sizeh];
    } vec3_soa_t;

Since C99, the standard float/double complex data type stores the real part followed by the imaginary part (8 bytes each for double complex).
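A minimal stand-alone check of this layout is shown below, purely as an illustration; C99 guarantees that a complex value has the same representation and alignment as an array of two elements of the corresponding real type:

    #include <complex.h>
    #include <stdio.h>

    int main(void)
    {
        double complex z = 1.0 + 2.0 * I;
        double *p = (double *) &z;   /* legal: same representation as double[2] */
        printf("sizeof = %zu, real = %g, imag = %g\n", sizeof z, p[0], p[1]);
        return 0;
    }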

Gauge field matrix data structure

    typedef struct {
        vec3_soa r0;
        vec3_soa r1;
        vec3_soa r2;
    } su3_soa_t;

Each SU(3) matrix is stored by rows: r0, r1 and r2 hold, for every lattice site, the three complex components of the corresponding matrix row.
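With these two structures, the basic building block of the Dirac operator is an SU(3) matrix times color-vector product at a given site index. A naive sketch follows; the field names come from the structures above, but the routine itself is an illustrative assumption, and the production code may additionally exploit the unitarity of SU(3) matrices (e.g. reconstructing the third row on the fly) to save memory bandwidth:

    /* Naive SU(3) * color-vector product at lattice site idx (illustrative). */
    static inline void su3_times_vec3(const su3_soa_t *m, const vec3_soa_t *in,
                                      vec3_soa_t *out, int idx)
    {
        out->c0[idx] = m->r0.c0[idx] * in->c0[idx]
                     + m->r0.c1[idx] * in->c1[idx]
                     + m->r0.c2[idx] * in->c2[idx];
        out->c1[idx] = m->r1.c0[idx] * in->c0[idx]
                     + m->r1.c1[idx] * in->c1[idx]
                     + m->r1.c2[idx] * in->c2[idx];
        out->c2[idx] = m->r2.c0[idx] * in->c0[idx]
                     + m->r2.c1[idx] * in->c1[idx]
                     + m->r2.c2[idx] * in->c2[idx];
    }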

OpenACC example for the Deo function

    void Deo( restrict const su3_soa * const u,
              restrict vec3_soa * const out,
              restrict const vec3_soa * const in,
              restrict const double_soa * const bfield )
    {
        int hx, y, z, t;
        #pragma acc kernels present(u) present(out) present(in) present(bfield)
        #pragma acc loop independent gang collapse(2)
        for (t = 0; t < nt; t++) {
            for (z = 0; z < nz; z++) {
                #pragma acc loop independent vector tile(TDIM0, TDIM1)
                for (y = 0; y < ny; y++) {
                    for (hx = 0; hx < nxh; hx++) {
                        ...

Nested loops over the lattice sites are annotated with OpenACC directives.
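The present() clauses assume that the data structures have already been allocated and copied to the device, typically once at the beginning of the simulation, using unstructured data directives. A hedged sketch of such a setup is given below; the array extents, the function name and the update direction are assumptions for illustration, not the actual OpenStaPLE calling sequence:

    /* Assumed setup: copy the operands to the device once, run the operator on
       device-resident data, and bring results back only when needed.           */
    void apply_deo_once(su3_soa *u, vec3_soa *out, vec3_soa *in, double_soa *bfield)
    {
        #pragma acc enter data copyin(u[0:1], in[0:1], bfield[0:1]) create(out[0:1])

        Deo(u, out, in, bfield);           /* satisfies the present() clauses above */

        #pragma acc update self(out[0:1])  /* copy the result back to the host      */
        #pragma acc exit data delete(u[0:1], in[0:1], bfield[0:1], out[0:1])
    }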

Single Device Performance: Dirac Operator

Table: Measured execution time per lattice site [ns], on several processors, in single (SP) and double (DP) precision. PGI Compiler 16.10.

    Lattice   NVIDIA GK210    NVIDIA P100     Intel E5-2630v3    Intel E5-2697v4
              SP      DP      SP      DP      SP       DP        SP       DP
    24^4      4.43    8.62    1.58    2.90    70.44    94.42     51.13    66.87
    32^4      4.02    9.54    1.32    2.40    79.05    100.19    43.90    54.88

C. Bonati, E. Calore, S. Coscetti, M. D'Elia, M. Mesiti, F. Negro, S. F. Schifano, G. Silvi, R. Tripiccione, "Design and optimization of a portable LQCD Monte Carlo code using OpenACC", International Journal of Modern Physics C, 28(5), 2017. doi: 10.1142/S0129183117500632

Multi-Device Implementation with MPI
Different kernels for border and bulk operations (using async queues) allow overlapping computation and communication.

C. Bonati, E. Calore, M. D'Elia, M. Mesiti, F. Negro, F. Sanfilippo, S. F. Schifano, G. Silvi, R. Tripiccione, "Portable multi-node LQCD Monte Carlo simulations using OpenACC", International Journal of Modern Physics C, 29(1), 2018. doi: 10.1142/S0129183118500109
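A minimal sketch of this border/bulk overlap pattern follows, assuming a CUDA-aware MPI and a simple 1-D halo exchange on a flat array; the function name, variable names and the work done per site are placeholders, not the actual OpenStaPLE kernels:

    #include <mpi.h>

    /* field is assumed device-resident with n interior elements followed by
       halo_n receive slots; left/right are the neighbor MPI ranks.           */
    void stencil_step(double *field, int n, int halo_n, int left, int right)
    {
        /* 1) compute the border sites first, on async queue 1 */
        #pragma acc parallel loop async(1) present(field[0:n + halo_n])
        for (int i = 0; i < halo_n; i++)
            field[i] *= 2.0;                   /* placeholder border work */

        /* 2) compute the bulk concurrently, on async queue 2 */
        #pragma acc parallel loop async(2) present(field[0:n + halo_n])
        for (int i = halo_n; i < n; i++)
            field[i] *= 2.0;                   /* placeholder bulk work   */

        #pragma acc wait(1)                    /* border results are ready */

        /* 3) exchange halos passing device pointers to a CUDA-aware MPI */
        #pragma acc host_data use_device(field)
        {
            MPI_Sendrecv(field, halo_n, MPI_DOUBLE, left, 0,
                         field + n, halo_n, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        #pragma acc wait(2)                    /* bulk done before the next step */
    }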

Overlap between computation and communication
One-dimensional tiling of a 32^3 x 48 lattice across:
- 8 GPUs: local lattice of 32^3 x 6 per GPU;
- 12 GPUs: local lattice of 32^3 x 4 per GPU.


The COKA (Computing On Kepler Architecture) Cluster
Dual-socket Intel Haswell nodes, each hosting 8 NVIDIA K80 boards.

Relative Speedup on NVIDIA K80 GPUs: Dirac Operator in double precision

C. Bonati, E. Calore, M. D'Elia, M. Mesiti, F. Negro, F. Sanfilippo, S. F. Schifano, G. Silvi, R. Tripiccione, "Portable multi-node LQCD Monte Carlo simulations using OpenACC", International Journal of Modern Physics C, 29(1), 2018. doi: 10.1142/S0129183118500109

Strong Scaling Results on COKA
Roberge-Weiss simulation over a 32^3 x 48 lattice, with mass 0.0015 and beta 3.3600, using mixed-precision floating point.
Using 2 CPUs we measure a 14x increase in the execution time with respect to using 2 GPUs, and the gap widens for more devices.


D.A.V.I.D.E. Cluster (Development for an Added Value Infrastructure Designed in Europe)
45 nodes, each containing:
- 2 POWER8+ CPUs (POWER8 with NVLink);
- 4 NVIDIA Tesla P100 GPUs;
- 2 Mellanox InfiniBand EDR adapters (100 Gb/s).
An energy-efficient HPC cluster designed by E4 Computer Engineering for the European PRACE Pre-Commercial Procurement (PCP) programme.

DAVIDE Cluster
Dual-socket POWER8+ nodes, each hosting 4 NVIDIA Tesla P100 GPUs.
Designed to meet the computing and data-transfer requirements of data-analytics applications.

Strong Scaling Performance (DAVIDE vs COKA)
Figure: Dirac operator of OpenStaPLE, running respectively on COKA GK210 and DAVIDE P100 GPUs. One K80 board contains two GK210 GPUs.

C. Bonati, E. Calore, M. D'Elia, M. Mesiti, F. Negro, S. F. Schifano, G. Silvi, R. Tripiccione, "Early Experience on Running OpenStaPLE on DAVIDE", International Workshop on OpenPOWER for HPC (IWOPH 18). In press.

Scaling is limited by inter-socket communications
Lattice of 32^3 x 48 split across the 4 GPUs of one node.
Figure: NVIDIA profiler view of the computing kernels and communications performed on one P100 GPU. Purple/blue: execution of D_eo and D_oe on the borders of the lattice. Turquoise: execution of D_eo and D_oe on the bulk of the lattice. Gold: communication steps.

Strong scaling of larger lattice sizes
Figure: Aggregate GFLOP/s and bandwidth, showing the strong scaling behavior of the Dirac operator implementation of OpenStaPLE, running on the P100 GPUs of DAVIDE.

Strong scaling of larger lattice sizes
Lattice of 48^3 x 96 split across the 16 GPUs contained in 4 DAVIDE nodes:
- GPU #0 communicates via InfiniBand and via NVLink;
- GPU #2 communicates via X-Bus and via NVLink.


Conclusions
OpenStaPLE: a successful implementation of a parallel and portable staggered-fermions LQCD application using MPI and OpenACC.
Takeaways:
- Planning for an optimal domain data layout is essential.
- Overlapping communication and computation is necessary to scale.
- Inter-socket links can be a serious bottleneck for scaling.
- Running some functions inefficiently on GPUs, in order to avoid data transfers between host and device, can pay off.

Conclusions
Limitations:
- Large lattices cannot fit on few nodes due to limited GPU memory.
- Scaling to a high number of devices is limited by the inter-socket and inter-node bandwidths.
Future work:
- Performance analysis on machines with richer NVLink interconnects.
- Investigate the performance of multi-dimensional slicing.
- Improve the data layout to increase performance on CPUs.
- Investigate energy aspects and the use of all the power/energy metrics collected by DAVIDE.

Thanks for your attention.