arxiv: v2 [hep-lat] 21 Nov 2018

Similar documents
Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor

arxiv: v1 [hep-lat] 1 Dec 2017

arxiv: v2 [hep-lat] 3 Nov 2016

arxiv: v1 [physics.comp-ph] 4 Nov 2013

Progress Report on QDP-JIT

Lattice QCD on Graphics Processing Units?

arxiv: v1 [hep-lat] 12 Nov 2013

Optimization of Lattice QCD codes for the AMD Opteron processor

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

QDP-JIT/PTX: A QDP++ Implementation for CUDA-Enabled GPUs

Multi GPU Performance of Conjugate Gradient Algorithm with Staggered Fermions

Lattice QCD code Bridge++ on arithmetic accelerators

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

Intel Knights Landing Hardware

arxiv: v1 [hep-lat] 18 Nov 2013

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

OpenStaPLE, an OpenACC Lattice QCD Application

GRID Testing and Profiling. November 2017

arxiv: v1 [hep-lat] 29 Oct 2008

Conflict Detection-based Run-Length Encoding AVX-512 CD Instruction Set in Action

Towards modernisation of the Gadget code on many-core architectures Fabio Baruffa, Luigi Iapichino (LRZ)

Performance optimization of the Smoothed Particle Hydrodynamics code Gadget3 on 2nd generation Intel Xeon Phi

Performance Analysis and Modeling of the SciDAC MILC Code on Four Large-scale Clusters

Benchmark results on Knight Landing (KNL) architecture

HPC Architectures evolution: the case of Marconi, the new CINECA flagship system. Piero Lanucara

Intel Xeon Phi архитектура, модели программирования, оптимизация.

FY17/FY18 Alternatives Analysis for the Lattice QCD Computing Project Extension II (LQCD-ext II)

HPC Architectures. Types of resource currently in use

PoS(LATTICE2014)028. The FUEL code project

Bei Wang, Dmitry Prohorov and Carlos Rosales

MILC Performance Benchmark and Profiling. April 2013

Architecture, Programming and Performance of MIC Phi Coprocessor

Overview Implicit Vectorisation Explicit Vectorisation Data Alignment Summary. Vectorisation. James Briggs. 1 COSMOS DiRAC.

ATLAS NOTE. December 4, ATLAS offline reconstruction timing improvements for run-2. The ATLAS Collaboration. Abstract

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Claude Tadonki. Mines ParisTech Centre de Recherche Informatique

Machine Learning for (fast) simulation

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA

April 2 nd, Bob Burroughs Director, HPC Solution Sales

VLPL-S Optimization on Knights Landing

OpenCL Vectorising Features. Andreas Beckmann

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX

arxiv: v1 [hep-lat] 1 Nov 2013

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

Lattice Simulations using OpenACC compilers. Pushan Majumdar (Indian Association for the Cultivation of Science, Kolkata)

Workpackage 5: High Performance Mathematical Computing

Vectorization on KNL

HPCG on Intel Xeon Phi 2 nd Generation, Knights Landing. Alexander Kleymenov and Jongsoo Park Intel Corporation SC16, HPCG BoF

Geant4 MT Performance. Soon Yung Jun (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept , 2016

HPC in the Multicore Era

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

The DEEP (and DEEP-ER) projects

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign

Sarah Knepper. Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018

DEVITO AUTOMATED HIGH-PERFORMANCE FINITE DIFFERENCES FOR GEOPHYSICAL EXPLORATION

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors

Reusing this material

Intel Architecture for HPC

Update of Post-K Development Yutaka Ishikawa RIKEN AICS

arxiv: v1 [hep-lat] 13 Jun 2008

Performance analysis and optimization of the JOREK code for many-core CPUs

The GeantV prototype on KNL. Federico Carminati, Andrei Gheata and Sofia Vallecorsa for the GeantV team

Intel Architecture and Tools Jureca Tuning for the platform II. Dr. Heinrich Bockhorst Intel SSG/DPD/ Date:

Simulating Stencil-based Application on Future Xeon-Phi Processor

An Efficient SIMD Implementation of Pseudo-Verlet Lists for Neighbour Interactions in Particle-Based Codes

Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests

Introduction to Performance Tuning & Optimization Tools

Intel Many Integrated Core (MIC) Architecture

Online Course Evaluation. What we will do in the last week?

Introduction to Xeon Phi. Bill Barth January 11, 2013

Code optimization in a 3D diffusion model

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

HPC future trends from a science perspective

Intel Math Kernel Library 10.3

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

Portable Power/Performance Benchmarking and Analysis with WattProf

QDP++/Chroma on IBM PowerXCell 8i Processor

ARE WE OPTIMIZING HARDWARE FOR

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

A Test Suite for High-Performance Parallel Java

SIMD Exploitation in (JIT) Compilers

Analysis and Visualization Tools for Lattice QCD

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

Preliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths

Double-precision General Matrix Multiply (DGEMM)

The Mont-Blanc approach towards Exascale

Trends in HPC (hardware complexity and software challenges)

SIMD: Data parallel execution

The Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation

PC DESY Peter Wegner. PC Cluster Definition 1

Dan Stafford, Justine Bonnot

Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures

Trends of Network Topology on Supercomputers. Michihiro Koibuchi National Institute of Informatics, Japan 2018/11/27

Matheus Serpa, Eduardo Cruz and Philippe Navaux

Benchmark results on Knight Landing architecture

QDP++ Primer. Robert Edwards David Richards

Transcription:

arxiv:1806.06043v2 [hep-lat] 21 Nov 2018 E-mail: j.m.o.rantaharju@swansea.ac.uk Ed Bennett E-mail: e.j.bennett@swansea.ac.uk Mark Dawson E-mail: mark.dawson@swansea.ac.uk Michele Mesiti E-mail: michele.mesiti@swansea.ac.uk We publish an extension of openqcd-1.6 with vector instructions using Intel intrinsics. Recent Intel processors support extended instruction sets with operations on 512-bit wide vectors, increasing both the capacity for floating point operations and register memory. Optimal use of the new capabilities requires reorganising data and floating point operations into these wider vector units. We report on the implementation and performance of the OpenQCD extension on clusters using Intel Knights Landing and Xeon Scalable (Skylake) CPUs. In complete HMC trajectories with physically relevant parameters we observe a performance increase of 5% to 10%. The 36th Annual International Symposium on Lattice Field Theory - LATTICE2018 22-28 July, 2018 Michigan State University, East Lansing, Michigan, USA. Speaker. c Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). https://pos.sissa.it/

1. Introduction The newest generations of Intel processors, Xeon Scalable (Skylake) and Xeon Phi (Knights Landing), extend the current standard of vector instructions with the 512-bit wide instruction sets [1]. The width of registers is similarly increased and, in addition, the number of floating point registers is doubled to 32. Compilers can make use of the new instructions, but targeted code is required to reach optimal performance. Efficient vectorisation can lead to a doubling of the available register memory and floating point capability OpenQCD-1.6 [2] already includes optional and SSE targeted implementations and an extension targeting BlueGene/Q also exists [3]. Following the logic of these extensions we have reimplemented several performance critical functions using Intel intrinsic instructions for vectors. These are mainly the Dirac operator and the vector operations necessary for the conjugate gradient algorithm. Here we publish the extension to openqcd-1.6 [4]. In addition, we report scaling studies performed for OpenQCD-FASTSUM, the FASTSUM collaboration s extension of openqcd [5], with the implementation. 2. Implementation The implementation of the extension is guided by the expectation that lattice QCD simulations are memory bandwidth bound. The performance of the application is limited by the memory bandwidth between the processor and different cache levels rather than the capacity for floating point operations. The same assumption guides the existing, SSE and BlueGene/Q targeted implementations. Gauge matrices and spinors of the Wilson formulation are stored in memory as structures and SIMD vectors are constructed out of spinor degrees of freedom. Our implementation combines spinor and direction indices to construct the 512-bit vectors required. The construction of a vector does not require any rearrangement of the data before loading into registers. However, different Dirac and directional indices are often handled differently, increasing the number of floating point instructions required. Since the primary objective is to optimise memory use, we consider this acceptable. In the SSE and extensions, direct insertions of assembly code are used to achieve full control over the compiled code. We use Intel intrinsic instructions. Compilers generally replace these routines with assembly instructions in a one-to-one correspondence. Intrinsic functions offer more freedom in choosing the optimal compiler and allows easier porting to different processor types. They leave the compiler with the task of choosing the optimal instruction for each operation and assigning data to registers. This is especially important since the number of registers is increased in Xeon Scalable CPUs [1]. Using prewritten assembly code would confine the extension only to future processors with the higher register count. We implement several core functions, including the Dirac operator, the application of the Sheikholeslami-Wohlert term, and several linear algebra functions. The extension is activated using the AVX512 preprocessor flag, similarly to the existing and SSE preprocessor flags. When the flags are combined, the implementation is used when available. 1

Knights Landing Skylake Speedup Speedup Single Precision 4 4 4 4 8521 5455 1.56 36177 29839 1.21 8 4 4 4 6276 4130 1.52 34649 27769 1.25 8 8 4 4 6063 4042 1.50 36167 29289 1.23 8 8 8 4 5286 3791 1.39 34894 28476 1.23 8 8 8 8 5088 3721 1.37 26408 21617 1.22 16 8 8 8 4506 3338 1.35 25300 19180 1.32 Double precision 4 4 4 4 6164 3725 1.65 26737 24681 1.08 8 4 4 4 4105 2857 1.44 26690 24609 1.08 8 8 4 4 3533 2517 1.40 26521 19687 1.35 8 8 8 4 3296 2421 1.36 25267 19312 1.31 8 8 8 8 3191 2405 1.33 18772 14471 1.29 16 8 8 8 2911 2131 1.37 15513 15125 1.03 Table 1: The performance of the functions Dw and Dw_dble (single and double precision respectively) in Mflops on single Knights Landing and Skylake cores. 3. Benchmarking 35.0 26 32.5 24 30.0 27.5 25.0 22 20 18 22.5 16 20.0 14 Figure 1: The performance of Dw() (left), Dw_dble() (right) performance measures run on a single Skylake core using the and implementations. We run several performance tests on the FASTSUM extension of OpenQCD 1.6 with and without the implementation on Cineca Marconi A2 cluster Intel Knights Landing nodes and the Supercomputing Wales Sunbird cluster with Intel Skylake nodes. On Sunbird we use the Intel C Compiler to build the implementation with the compiler flags -std=c89 -xcore-avx512 -mtune=skylake -O3 -DAVX512 -DAVX -DFMA3 -DPM. The original version is compiled with 2

8 6 7 5 6 5 4 4 3 3 2 2 Figure 2: The Dw() (left), Dw_dble() (right) performance measures on a single KNL core using the AVX- 512 and implementations. -std=c89 -xcore-avx512 -march=skylake -O3 -DPM -DAVX -DFMA3 -DPM. On the Knights Landing cluster the version is compiled with -std=c89 -xmic-avx512 -O3 -DAVX512 -DAVX -DFMA3 -DPM and compared against the version compiled using -std=c89 -xmic-avx512 -O3 -DAVX -DFMA3 -DPM. Firstly, we have measured the single core performance of the Dirac operator itself using the timing tools provided in the openqcd code. These tests do not account for memory dependencies, but only measure floating point performance. The performance in Mflops per second is shown in Tab. 1 and in Fig. 1 and 2. With small lattice sizes we observe a significant improvement, exceeding a factor of two with certain cases. Naturally the improvement is smaller in a realistic test case due to data dependencies and MPI communication. 24 3 x32 1.22 $32^4 1.12 48 3 x32 1.20 1.11 1.10 1.09 1.08 1.18 1.16 1.14 1.07 1.12 1.06 2 3 2 4 2 5 2 6 2 7 2 8 2 9 2 10 2 6 2 7 2 8 2 9 2 10 2 11 Figure 3: Left: Strong scaling performance on the Sunbird Skylake cluster measured against the implementation. Right: Strong scaling performance on the Marconi Knights Landing cluster measured against the implementation 3

1.12 1.19 1.11 1.18 1.10 1.09 1.08 1.17 1.16 1.15 1.07 1.14 1.06 2 4 2 5 2 6 2 7 2 8 2 9 2 10 1.13 2 11 2 12 2 13 2 14 Figure 4: Left: Weak scaling performance on the Sunbird Skylake cluster using n cores and the lattice size V = 48 3 n measured against the implementation. Right: Weak scaling performance on the Marconi Knights Landing cluster using n cores the lattice size V = 32 3 n measured against the implementation. L T N Time (s) / trajectory Speedup 32 32 1 1.59e3 1.86e3 1.17 32 32 2 8.41e2 9.67e2 1.15 32 32 4 4.39e2 5.17e2 1.18 32 32 8 2.49e2 3.03e2 1.22 32 32 16 1.52e2 1.71e2 1.13 32 32 32 1.01e2 1.12e2 1.11 32 32 1 1.59e3 1.86e3 1.17 32 64 2 1.62e3 1.93e3 1.19 32 128 4 1.73e3 1.96e3 1.13 32 256 8 1.70e3 2.00e3 1.18 Table 2: Average timings per trajectory using N Knights Landing nodes with the volume T L 3. Speedup is measured against the implementation. To get a more complete picture we measure the average time taken to generate a HMC trajectory. We have produced 6 trajectories starting from a random gauge configuration and report the average time per trajectory. We use a Wilson-Yukawa gauge action with β = 1.5, c 0 = 5/3 and c 1 = 1/12 and include two fermions with κ = 0.278000465 and 0.276509194 and c sw = 1. Two levels of smearing are enabled in the fermion actions. Domain Deflation and blocking are enabled. The timings for several lattice sizes and configurations of nodes are given in Tabs. 2 and 3 and shown in Fig. 3 and 4. Full compilation and runtime parameters and the simulation output are publicly available [6] and [7]. The two version of the code scale similarly, with the version remaining faster in each case. On the Sunbird Skylake machine, in a full trajectory the improvement is between 6% and 13%. Each node has 40 cores and a minimal number allocation of nodes is used in each case. On the Marconi KNL system the improvement is between 11% and 4

L T N n Time (s) / trajectory Relative Speedup 32 48 1 16 1.31e3 1.43e3 1.09 32 48 1 32 7.96e2 8.58e2 1.08 32 48 2 64 3.96e2 4.28e2 1.08 32 48 4 128 1.99e2 2.14e2 1.08 32 48 7 256 1.16e2 1.23e2 1.06 32 48 13 512 6.04e1 6.41e1 1.06 32 48 26 1024 2.91e1 3.13e1 1.08 32 24 1 8 2.48e2 2.79e2 1.13 32 24 1 16 1.54e2 1.73e2 1.12 32 24 1 32 9.55e1 1.03e2 1.08 32 24 2 64 4.66e1 5.06e1 1.09 32 24 4 128 2.36e1 2.59e1 1.10 48 16 1 16 6.38e2 7.146e2 1.12 48 32 1 32 7.96e2 8.55e2 1.07 48 64 2 64 7.99e2 8.64e2 1.08 48 128 4 128 8.01e2 8.64e2 1.08 48 256 7 256 8.87e2 9.81e2 1.11 48 512 13 512 9.32e2 9.94e2 1.07 48 1024 26 1024 9.72e2 1.03e3 1.06 Table 3: Average timings per trajectory using N Skylake nodes with n cores and the volume T L 3. Speedup is measured relative to the implementation. L T N n Time (s) / trajectory speedup Skylake 32 24 1 32 2.28e3 2.47e3 1.08 32 24 2 64 1.12e3 1.21e3 1.08 32 24 4 128 5.68e2 6.22e2 1.10 Table 4: Average timings starting from a thermalised configuration per trajectory with the volume T L 3 with N nodes using n cores. 22%. In this case all 64 cores are used on each node. No clear dependence on the lattice size or number of nodes can be deduced from the data. Finally, we perform the same test with a thermalised starting configuration and a light quark. Two fermions are included κ = 0.27831 and 0.276509. The results are reported in Table 4. The speedup achieved is similar to the previous tests, between 8% and 10%. 4. Conclusion We announce an open source implementation of the Dirac operator in OpenQCD 1.6 with 5

extended vector operations using Intel s intrinsic operations. These operations allow the application to make full use of the wider, 512-bit registers, reducing the total number of memory request, in particular those to L1 cache, and the number of floating point operations. The implementation assumes that memory bandwidth is the main bottleneck in the application. Tradeoffs that reduce memory use at the cost of floating point operations and vector shuffles are considered acceptable. The application performs significantly better than the existing implementation on Knights Landing and Skylake processors. In realistic benchmarking cases the improvement factor is between 6% and 12% on Skylake nodes and 11% and 22% on Knight Landing nodes. 5. Acknowledgements We acknowledge the support of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via Welsh Government. References [1] Intel Architecture Instruction Set Extensions and Future Features Programming Reference, January 2018, Ref. # 319433-032 [2] luscher.web.cern.ch/luscher/openqcd/ [3] hpc.desy.de/simlab/codes/openqcd_bgopt/ [4] github.com/sa2c/openqcd-avx512 DOI: 10.5281/zenodo.1451764 [5] fastsum.gitlab.io [6] Sunbird Performance Data for Fastsum OpenQCD 1.0, DOI: 10.5281/zenodo.1475136 [7] Marconi KNL Performance Data for Fastsum OpenQCD 1.0, DOI: 10.5281/zenodo.1479288 6