Progress Report on QDP-JIT

Progress Report on QDP-JIT
F. T. Winter, Thomas Jefferson National Accelerator Facility
USQCD Software Meeting 2014, April 16-17, 2014, at Jefferson Lab
F. Winter (Jefferson Lab), QDP-JIT, USQCD Software 2014, slide 1 / 16

QDP-JIT/PTX: A Framework for Lattice QCD Calculations for GPUs
QDP-JIT/PTX provides a reimplementation of QDP++ for NVIDIA GPUs:
Automatic off-loading of expressions to the accelerators
Multi-GPU support
Dynamic code generation
Additional Just-In-Time (JIT) compilation step with the NVIDIA driver
Data layout optimized for coalesced memory accesses
Automatic H2D/D2H memory transfers via a software cache
Automatic tuning of CUDA kernels
[Figure: trajectory time in seconds vs. XE sockets / XK nodes; V = 40^3 x 256, 2+1 flavor anisotropic Clover, m_pi ~ 230 MeV, tau = 0.2; comparing CPU only (XE nodes), CPU+QUDA, and QDP-JIT+QUDA]
F. T. Winter, M. A. Clark, R. G. Edwards, B. Joo: paper accepted for publication in the IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2014.
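The off-loading mechanism builds on QDP++'s expression templates: the right-hand side of an assignment is captured as a type tree and evaluated in a single fused loop, which QDP-JIT turns into one generated kernel. A minimal, self-contained sketch of the idea (the names `Lattice`, `Add`, `Scale`, and `assign` are illustrative, not QDP++'s actual API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A lattice field reduced to a bare array of site values.
struct Lattice {
    std::vector<double> v;
    explicit Lattice(std::size_t n, double x = 0.0) : v(n, x) {}
    double operator[](std::size_t i) const { return v[i]; }
};

// Expression nodes: the operator overloads build a type tree instead of
// computing anything; evaluation happens per site via operator[].
template <class L, class R> struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};
template <class L> struct Scale {
    double a; const L& l;
    double operator[](std::size_t i) const { return a * l[i]; }
};

template <class L, class R> Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }
template <class L> Scale<L> operator*(double a, const L& l) { return {a, l}; }

// The "kernel": one fused loop over all sites, regardless of how deep
// the expression tree is -- this is the loop QDP-JIT JIT-compiles.
template <class Expr> void assign(Lattice& dst, const Expr& e) {
    for (std::size_t i = 0; i < dst.v.size(); ++i) dst.v[i] = e[i];
}
```

For example, `assign(r, 2.0 * x + y)` instantiates `Add<Scale<Lattice>, Lattice>` and runs exactly one loop over the lattice, rather than one loop per operator.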

QDP-JIT/LLVM Motivation
Code maintainability:
No template specializations (SSE, AVX, etc.) for each architecture
No heavy usage of #ifdef constructs
Performance portability:
Efficient code generation for all relevant targets
Not committed to a compiler's ability to deal with templated codes
Support for vector units, memory prefetchers, etc.
Efficient code: threading, scheduling, cache blocking, etc.
Why QDP-JIT/LLVM targets LLVM IR:
Architecture-independent implementation of QDP++
LLVM IR is architecture independent, which makes LLVM a framework worth targeting
LLVM is embraced by the HPC industry, e.g. NVIDIA, IBM, Intel, ...

QDP-JIT/LLVM Overview
[Diagram: QDP-JIT/LLVM emits LLVM IR, which is lowered via the nvptx backend or libNVVM to NVIDIA GPUs, via the x86-64 backend to Intel/AMD CPUs, and via ppc64+QPX to Blue Gene/Q]
QDP-JIT/PTX is limited to NVIDIA GPUs. To target a broader range of architectures, a new LLVM IR code generator was implemented.
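What the new code generator produces is plain LLVM IR, which any LLVM backend can then lower. As a rough illustration of what such IR looks like for a site loop, here is a hypothetical helper that emits textual IR for an axpy-style kernel (the real generator drives LLVM's C++ IRBuilder API rather than printing strings):

```cpp
#include <sstream>
#include <string>

// Emit architecture-independent LLVM IR (pre-opaque-pointer syntax) for
// the loop  y[i] = a * x[i] + y[i],  i = 0..n-1.  Any backend -- nvptx,
// x86-64, ppc64 -- can lower this same IR to machine code.
std::string emit_axpy_ir() {
    std::ostringstream ir;
    ir << "define void @axpy(double %a, double* %x, double* %y, i64 %n) {\n"
       << "entry:\n"
       << "  %empty = icmp eq i64 %n, 0\n"
       << "  br i1 %empty, label %exit, label %loop\n"
       << "loop:\n"
       << "  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]\n"
       << "  %px = getelementptr double, double* %x, i64 %i\n"
       << "  %py = getelementptr double, double* %y, i64 %i\n"
       << "  %xv = load double, double* %px\n"
       << "  %yv = load double, double* %py\n"
       << "  %ax = fmul double %a, %xv\n"
       << "  %s  = fadd double %ax, %yv\n"
       << "  store double %s, double* %py\n"
       << "  %i.next = add i64 %i, 1\n"
       << "  %done = icmp eq i64 %i.next, %n\n"
       << "  br i1 %done, label %exit, label %loop\n"
       << "exit:\n"
       << "  ret void\n"
       << "}\n";
    return ir.str();
}
```

The loop structure (phi node, pointer arithmetic, conditional back-edge) carries no architecture-specific assumptions; vectorization and scheduling are left to the chosen backend.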

QDP-JIT/LLVM Overview
The GPU route is still there, via two approaches: the open-source NVPTX backend or the closed-source libNVVM library.

QDP-JIT/LLVM Overview
libNVVM has been part of CUDA since version 5.5 and includes PTX-specific optimizations.

QDP-JIT/LLVM Overview
Generate x86 code with LLVM's mature x86 backend (excellent SSE/AVX support).

QDP-JIT/LLVM Overview
Generate PowerPC 64 code; some support for QPX (work in progress).

QDP-JIT/LLVM Overview
New architectures can be supported, provided the corresponding LLVM backend supports JIT compilation.

Optimization: Custom Data Layout
QDP++ specifies the data layout through the nesting order of templated data types:
Outer < Spin < Color < Reality < float > > > >
QDP-JIT splits the outer loop by an optional inner vector length I:
Outer < Spin < Color < Reality < Inner < float > > > > >
The code generation step intercepts and changes the data layout:
Spin < Color < Reality < Outer < Inner < float > > > > > (GPUs, I = 1)
Outer < Spin < Color < Reality < Inner < float > > > > > (CPUs with SSE/AVX, I = 2/4/8)
Outer < Spin < Color < Inner < Reality < float > > > > > (BG/Q, I = 2 (DP))
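The nesting order translates directly into index arithmetic: the innermost type runs fastest in memory. A small sketch of the flat-index computation for two of the layouts above, using the standard dimensions (4 spin, 3 color, 2 reality components per site); the helper names `idx_cpu` and `idx_gpu` are illustrative, not QDP-JIT's:

```cpp
#include <cstddef>

// Per-site component counts for a lattice propagator-like object.
constexpr std::size_t NS = 4, NC = 3, NR = 2;

// CPU layout  Outer<Spin<Color<Reality<Inner<float>>>>> :
// the inner index runs fastest, so the same component of I consecutive
// sites is contiguous -- exactly one SIMD vector of length I.
std::size_t idx_cpu(std::size_t outer, std::size_t s, std::size_t c,
                    std::size_t r, std::size_t inner, std::size_t I) {
    return (((outer * NS + s) * NC + c) * NR + r) * I + inner;
}

// GPU layout  Spin<Color<Reality<Outer<Inner<float>>>>>  with I = 1:
// the site (outer) index runs fastest, so adjacent GPU threads working
// on adjacent sites touch adjacent addresses -- coalesced accesses.
std::size_t idx_gpu(std::size_t outer, std::size_t s, std::size_t c,
                    std::size_t r, std::size_t n_outer) {
    return ((s * NC + c) * NR + r) * n_outer + outer;
}
```

With I = 8 on an AVX CPU, for instance, one 256-bit load fetches the same (spin, color, reality) component of eight consecutive sites, which is what makes the generated loops vectorizable.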

Benchmark on Intel Sandy Bridge
[Figure: t_linalg single-precision performance vs. local lattice size, QDP++(SSE) vs. QDP-JIT/LLVM, with 19 operation panels from M=M*M, M=adj(M)*M, ... through V=M*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H]
Out of L2 cache for local problem sizes larger than L = 4.
Within cache the code achieves up to 78% of the single-precision peak of an E5-2650 at 2.0 GHz (256 GFLOPS SP).

Benchmark on Intel Sandy Bridge
[Figure: t_linalg double-precision performance vs. local lattice size, QDP++(SSE) vs. QDP-JIT/LLVM, with the same 19 operation panels from M=M*M through H=adj(M)*H]
Out of L2 cache for local problem sizes larger than L = 16^4.
Within cache the code achieves up to 78% of the double-precision peak of an E5-2650 at 2.0 GHz (128 GFLOPS DP).
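For reference, the quoted peak numbers follow from the E5-2650's specifications: 8 cores at 2.0 GHz, each able to issue one 256-bit AVX add and one 256-bit AVX multiply per cycle (Sandy Bridge has no FMA). A one-line sanity check (`peak_gflops` is an illustrative helper, not part of any benchmark):

```cpp
// Theoretical peak = cores * clock (GHz) * flops per cycle,
// where flops per cycle = 2 (one add + one mul port) * SIMD lanes.
double peak_gflops(int cores, double ghz, int simd_lanes) {
    const int flops_per_cycle = 2 * simd_lanes;
    return cores * ghz * flops_per_cycle;
}
// Single precision: 8 floats per 256-bit AVX register.
// Double precision: 4 doubles per 256-bit AVX register.
```

This gives peak_gflops(8, 2.0, 8) = 256 GFLOPS (SP) and peak_gflops(8, 2.0, 4) = 128 GFLOPS (DP), matching the percentages quoted on the two benchmark slides.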

Benchmark on Blue Gene/Q (single node, preliminary)
[Figure: t_linalg double precision, 1 node, threads=32, inner=4, layout=oscri; performance vs. local lattice size for the same 19 operation panels from M=M*M through H=adj(M)*H]
QPX instructions are generated; however, there are still alignment issues.
Out of L2 cache for local problem sizes larger than L = 16^4.

Benchmark on Blue Gene/Q (single node)
[Figure: t_linalg double precision, 1 node, threads=64, vanilla QDP++ with OpenMP, gcc -O3; same 19 operation panels]
GCC on vanilla QDP++ is currently doing better on the linear algebra than QDP-JIT/LLVM, mainly because the LLVM BG/Q backend misses essential performance features.

Benchmark on Blue Gene/Q (preliminary)
rb2 Wilson DSlash, local volume V = 12^4, DP, 1 MPI rank per node
[Figure: performance vs. number of BG/Q nodes (16 to 256); QDP-JIT with 32 threads vs. QDP++ with 16 threads]
Shifting of sub-lattices: overlapping of computation and off-node communication.
For the rb2 Wilson DSlash, preliminary measurements show a speedup factor of 12.4.
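The overlap scheme splits each site loop into interior sites, which need no remote data, and boundary sites, which must wait for the halo exchange. A schematic sketch of that structure (`std::async` stands in for the actual MPI_Isend/MPI_Irecv exchange, and the stencil is reduced to a toy update; all names are illustrative, not QDP-JIT's API):

```cpp
#include <future>
#include <vector>

// Overlap pattern: start the halo exchange, compute interior sites while
// the transfer is in flight, then finish the boundary sites that depend
// on the received neighbor data.
void dslash_overlapped(std::vector<double>& out,
                       const std::vector<double>& in,
                       const std::vector<int>& interior,
                       const std::vector<int>& boundary) {
    // 1. Post the halo exchange (placeholder for MPI_Isend/MPI_Irecv).
    auto halo = std::async(std::launch::async, [&] {
        return in;  // pretend this is the neighbor rank's boundary data
    });

    // 2. Interior sites: no remote dependence, so this work hides the
    //    communication latency.
    for (int s : interior) out[s] = 2.0 * in[s];

    // 3. Wait for the halo (placeholder for MPI_Wait), then apply the
    //    boundary sites using the received data.
    const std::vector<double> recv = halo.get();
    for (int s : boundary) out[s] = 2.0 * in[s] + recv[s];
}
```

The payoff grows with node count: as the local volume shrinks, the fraction of boundary sites rises, and hiding the off-node transfer behind the interior computation is what keeps the DSlash scaling.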

Summary & Outlook
QDP-JIT/LLVM provides an architecture-independent implementation of QDP++.
Runs Chroma HMC (Wilson Clover) on GPUs, x86, and BG/Q.
Optimizations: custom data layout to support vectorization, multi-threading, sub-lattice shifting.
Outlook:
Improve performance on BG/Q (QPX, SPI)
Intel Xeon Phi (KNL)
Apply advanced optimizations: polyhedral model, cache blocking, memory prefetching, overlapping MPI and compute