Enabling the ARM high performance computing (HPC) software ecosystem


Enabling the ARM high performance computing (HPC) software ecosystem. Ashok Bhat, Product Manager, HPC and Server tools. ARM Tech Symposia India, December 7th 2016

Are these supercomputers? Take the Samsung S6, for example. No doubt it is pretty amazing: four Cortex-A53 cores (1.5GHz), four Cortex-A57 cores (2.1GHz) and a Mali GPU (772MHz). Random googling* gives its performance as 34.6 GFLOPs (http://www.androidauthority.com/flagship-camera-shootout-688406). That means it can do 34,600,000,000 floating point calculations every second. For comparison, Intel Haswells are now up to about 44 GFLOPs per (3.2GHz) core. That phone would have been the world's most powerful computer back in 1992. (* http://pages.experts-exchange.com/processing-power-compared/)

The Road to Exascale (chart in GFLOPS, logarithmic scale; courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy): the performance of the TOP500 No. 1 and No. 500 systems plotted from 1985 to 2020, climbing through the gigascale, terascale and petascale eras towards exascale, with recent marks at 33 PFLOPs and 93 PFLOPs; the Samsung S6's 34.6 GFLOPs sits near the bottom of the scale.

So what do people really solve on HPC systems? Weather and climate modelling: the new Met Office machine has over 480,000 cores, 2PB of RAM and weighs 140 tonnes. Computational fluid dynamics: modelling cars, planes, beaches, blood. Computational chemistry: molecular dynamics, quantum interactions. Atomic weapons simulations: don't mess with the nuclear stockpile. The Earth's mantle, galaxy formation, biological processes: you name it, someone's modelling it at scale. (Image: Kitware/R.N. Elias)

Parallel programming in HPC. The languages are mainly Fortran (77, 90, 95, 2003), some C and C++, no Java, with Python as glue. Instruction-level parallelism (multiple pipelines, FMA) relies on the architecture, and vectorization relies mainly on the compiler. OpenMP provides source-code annotation specifying how loops may be parallelized; MPI provides message passing, explicitly stating how many bytes to send and where. A minimal sketch of both models follows below.
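To make the two models concrete, a minimal hybrid MPI + OpenMP sketch in C, assuming an MPI installation and an OpenMP-capable compiler (e.g. mpicc -fopenmp); the array size and values are purely illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                 /* explicit message passing between processes */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[1000];
    /* OpenMP: annotate the loop and let the runtime split iterations over threads */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        local[i] = rank + 0.001 * i;

    /* MPI: explicitly state what data to send, how much of it, and to whom */
    double sum0 = 0.0;
    MPI_Reduce(&local[0], &sum0, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of first elements across %d ranks = %f\n", size, sum0);

    MPI_Finalize();
    return 0;
}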

Scalable Vector Extension (SVE)

Introducing the Scalable Vector Extension (SVE): general-purpose 64-bit ARMv8-A with scalable wide vectors of 128 to 2048 bits, extending processing capability.

Introducing Scalable Vector Extension (SVE) Extending ARMv8-A with AArch64 extension which expands vector length up to a maximum of 2048 bits Expands fine-grain data parallelism for HPC scientific workloads Better compiler target, reduces software deployment effort Beginning engagement with open-source community and wider ARM ecosystem 8
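To illustrate the vector-length-agnostic style that SVE enables, a small daxpy sketch using the ACLE SVE intrinsics; this is not code from the presentation, and it assumes an SVE-enabled compiler (for example armclang with -march=armv8-a+sve).

#include <arm_sve.h>
#include <stdint.h>

/* y += alpha * x, written once and usable unchanged on any SVE
   implementation from 128-bit to 2048-bit vectors. */
void daxpy_sve(int64_t n, double alpha, const double *x, double *y)
{
    svfloat64_t valpha = svdup_n_f64(alpha);
    for (int64_t i = 0; i < n; i += svcntd()) {     /* svcntd() = doubles per vector */
        svbool_t pg = svwhilelt_b64_s64(i, n);      /* predicate masks the loop tail */
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_f64_m(pg, vy, vx, valpha);       /* vy += vx * alpha (predicated FMA) */
        svst1_f64(pg, &y[i], vy);
    }
}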

Post-K, the Japanese supercomputer: 100x capacity, 50x capability, 15x efficiency. ARMv8-A with SVE.

ARM HPC Ecosystem

ARM HPC ecosystem roadmap, spanning released, planned and concept items across 2016, 2017 and beyond:
Hardware: AppliedMicro X-Gene 1 & 2, AMD Seattle and Cavium ThunderX (released); AppliedMicro X-Gene 3, Phytium Mars and Cavium ThunderX2 (planned); Fujitsu Post-K (SVE) (concept).
Open-source software: OpenHPC 1.2, ARM Optimized Routines (with vector versions to follow), Altair PBS Pro, GCC (gcc/g++/gfortran), LLVM clang, LLVM Flang.
ARM HPC tools: ARM C/C++ Compiler (ahead of LLVM trunk), ARM Fortran Compiler, ARM Performance Libraries, ARM Code Advisor (Beta now, full release to follow), ARM Instruction Emulator.
ISV software: Allinea DDT and MAP, NAG Library & Compiler, PathScale ENZO, Rogue Wave TotalView, with further ISV software to come.

OpenHPC, now on ARM. OpenHPC is a community effort to provide a common, verified set of open-source packages for HPC deployments. ARM's participation: ARM is a silver member of OpenHPC and sits on the OpenHPC Technical Steering Committee in order to drive ARM architecture build support. Status (November 2016): the 1.2.0 release is out now, all packages are built on ARMv8 for both CentOS and SUSE, and ARM-based machines are being used for building and in the OpenHPC build infrastructure. Functional areas and components include:
Base OS: RHEL/CentOS 7.1, SLES 12
Administrative tools: Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, prun
Provisioning: Warewulf
Resource management: SLURM, Munge, Altair PBS Pro*
I/O services: Lustre client (community version)
Numerical/scientific libraries: Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, MUMPS
I/O libraries: HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), ADIOS
Compiler families: GNU (gcc, g++, gfortran)
MPI families: OpenMPI, MVAPICH2
Development tools: Autotools (autoconf, automake, libtool), Valgrind, R, SciPy/NumPy
Performance tools: PAPI, Intel IMB, mpiP, pdtoolkit, TAU

ARM HPC tools portfolio:
ARM C/C++ Compiler: commercially supported for HPC applications.
ARM Performance Libraries: BLAS, LAPACK and FFT, micro-architecturally tuned.
ARM Code Advisor: actionable advice to optimize your code.
ARM SVE C/C++ Compiler: compiler support for the ARM Scalable Vector Extension.
ARM Instruction Emulator: develop software for tomorrow's hardware today.

ARM Code Advisor (Beta) combines static and dynamic information to produce actionable insights.
Performance advice: compiler vectorization hints, compilation-flag advice, Fortran subarray warnings, OpenMP instrumentation.
Insights from compilation and runtime: compiler insights are embedded into the application binary by the ARM compilers, and the OMPT interface is used to instrument the OpenMP runtime.
Extensible architecture: users can write plugins to add their own analysis information, and the data is accessible via web browser, command line and a REST API to support new user interfaces.

ARM Code Advisor (Beta), typical workflow: source code is compiled into a binary with insight embedded; the binary is profiled to produce a runtime profile; the profile is analysed and the results are served over HTTP to the web view.

ARM Performance Libraries: optimized BLAS, LAPACK and FFT.
Commercial 64-bit ARMv8 math libraries: commonly used low-level math routines (BLAS, LAPACK and FFT), validated with NAG's test suite, a de-facto standard.
Best-in-class performance with commercial support: tuned by ARM for Cortex-A72, Cortex-A57 and Cortex-A53; maintained and supported by ARM for a wide range of ARM-based SoCs; regularly benchmarked against open-source alternatives, with performance on par with best-in-class math libraries.
Silicon partners can provide tuned micro-kernels for their SoCs: partners can collaborate directly, working with our source code and test suite, or alternatively contribute through the open-source route.
Applications call the routines through the standard BLAS/LAPACK interfaces, as in the sketch below.
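For illustration, a minimal DGEMM call through the standard CBLAS interface (provided by ARM Performance Libraries and most other BLAS implementations); the exact link flags depend on the installation, so the build line here is only indicative.

#include <stdio.h>
#include <cblas.h>

/* Build (indicative): gcc dgemm_demo.c -o dgemm_demo <BLAS link flags for your library> */
int main(void)
{
    /* C = alpha*A*B + beta*C with 2x2 matrices in row-major storage */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* m, n, k       */
                1.0, A, 2,        /* alpha, A, lda */
                B, 2,             /*        B, ldb */
                0.0, C, 2);       /* beta,  C, ldc */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);   /* expect 19 22 / 43 50 */
    return 0;
}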

Deep dive into optimizing DGEMM

DGEMM, the maths. Double-precision GEneral Matrix-Matrix multiplication: C = alpha*A*B + beta*C. Normally we assume alpha=1 and beta=0; however, a BLAS implementation must cater for all values. The matrices are also not necessarily square: A is m x k, B is k x n and C is m x n. Not to mention that the allocated storage may hold each matrix as a small part of a wider array, so extra parameters (the leading dimensions) are needed to handle this. A reference triple-loop version with these parameters is sketched below.
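As a reminder of the full semantics before the optimization steps that follow, a simple unoptimized reference DGEMM in C covering alpha, beta and the leading dimensions, using column-major storage as in Fortran/BLAS (the later slides switch to plain row-major C arrays for simplicity); this is an illustrative sketch, not library code.

#include <stddef.h>

/* C = alpha*A*B + beta*C
   A is m x k with leading dimension lda, B is k x n (ldb), C is m x n (ldc),
   all stored column-major. A real implementation special-cases beta == 0
   so that the initial contents of C are never read. */
void ref_dgemm(size_t m, size_t n, size_t k, double alpha,
               const double *A, size_t lda,
               const double *B, size_t ldb,
               double beta, double *C, size_t ldc)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}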

Coding DGEMM: the naïve version.

for (j=0; j<n; j++)
  for (i=0; i<n; i++)
    for (k=0; k<n; k++)
      c[i][j] += a[i][k]*b[k][j];

In C, memory access is stride 1 in the second array index, so this gives good access to A but very bad access to B.

DGEMM loop reordering. We want better use of the data in loaded cache lines, so make a[i][k] loop invariant for the inner loop:

for (k=0; k<n; k++)
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      c[i][j] += a[i][k]*b[k][j];

Memory access for B and C is now good; however, these cache lines will need reloading for the next element of A. This also enables automatic vectorization as a by-product of the optimization.

DGEMM loop unrolling. We want better reuse of the data in loaded cache lines, so unroll the outer loop to enable multiple A values to be used without further loads:

for (k=0; k<n; k+=4)
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      c[i][j] += a[i][k]*b[k][j]
               + a[i][k+1]*b[k+1][j]
               + a[i][k+2]*b[k+2][j]
               + a[i][k+3]*b[k+3][j];

Memory access for A now uses more of each loaded cache line, and the cache is big enough for multiple lines of B to be loaded to update a single element of C. A clean-up loop is needed when n is not a multiple of the unrolling factor.

DGEMM cache blocking. Small matrices that fit in cache solve faster, so split the matrices into blocks in each direction. Here I = ii+i, J = jj+j and K = kk+k index the full matrices:

for (ii=0; ii<n; ii+=blk)
  for (kk=0; kk<n; kk+=blk)
    for (jj=0; jj<n; jj+=blk)
      for (k=0; k<blk; k+=4)
        for (i=0; i<blk; i++)
          for (j=0; j<blk; j++)
            c[I][J] += a[I][K]*b[K][J]
                     + a[I][K+1]*b[K+1][J]
                     + a[I][K+2]*b[K+2][J]
                     + a[I][K+3]*b[K+3][J];

DGEMM, adding OpenMP. Parallelism is key to getting the best performance. Ideally we want each thread to work on updating its own values, arranging the parallelism to avoid locks on data. Cache topology (shared L2 and L3 caches) can possibly be exploited for further performance, and the number of threads should be limited for small problems. A sketch of one way to thread the blocked loop nest follows.
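An illustrative way (not necessarily how ARM Performance Libraries does it) to thread the blocked, unrolled loop nest from the previous slides: parallelize over blocks of C so that every thread updates a disjoint set of elements and no locking is needed. Flattened row-major indexing is used so the sketch is self-contained; compile with -fopenmp.

#include <stddef.h>

/* Row-major n x n matrices; n assumed to be a multiple of blk and
   blk a multiple of 4, as on the earlier slides. */
void dgemm_blocked_omp(size_t n, size_t blk,
                       const double *a, const double *b, double *c)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ii = 0; ii < n; ii += blk)
      for (size_t jj = 0; jj < n; jj += blk)      /* each thread owns whole blocks of C */
        for (size_t kk = 0; kk < n; kk += blk)
          for (size_t k = kk; k < kk + blk; k += 4)
            for (size_t i = ii; i < ii + blk; i++)
              for (size_t j = jj; j < jj + blk; j++)
                c[i*n + j] += a[i*n + k]    *b[k*n + j]
                            + a[i*n + k + 1]*b[(k+1)*n + j]
                            + a[i*n + k + 2]*b[(k+2)*n + j]
                            + a[i*n + k + 3]*b[(k+3)*n + j];
}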

DGEMM register blocking. Calculate as many elements of C as there are registers available for, working through the current cache block with a block of registers. With 32 SIMD registers: 8 x 3 for accumulators of C, 4 to load from A and 3 to load from B. This also means less flow control. A simplified scalar sketch of the idea follows.
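A much-simplified sketch of the idea: a hypothetical 4x2 scalar micro-kernel rather than the 8x3 SIMD blocking described above. The accumulators live in local variables that the compiler can keep in registers for the whole k loop, so each element of C is written only once per block.

#include <stddef.h>

/* Hypothetical 4x2 micro-kernel: accumulates the product into
   C[i0..i0+3][j0..j0+1], with row-major n x n arrays. */
static void micro_kernel_4x2(size_t n, const double *a, const double *b,
                             double *c, size_t i0, size_t j0)
{
    double c00 = 0, c01 = 0, c10 = 0, c11 = 0,
           c20 = 0, c21 = 0, c30 = 0, c31 = 0;

    for (size_t k = 0; k < n; k++) {
        /* two values of B and four of A feed eight multiply-adds */
        double b0 = b[k * n + j0], b1 = b[k * n + j0 + 1];
        double a0 = a[(i0 + 0) * n + k];
        double a1 = a[(i0 + 1) * n + k];
        double a2 = a[(i0 + 2) * n + k];
        double a3 = a[(i0 + 3) * n + k];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
        c20 += a2 * b0;  c21 += a2 * b1;
        c30 += a3 * b0;  c31 += a3 * b1;
    }

    c[(i0 + 0) * n + j0] += c00;  c[(i0 + 0) * n + j0 + 1] += c01;
    c[(i0 + 1) * n + j0] += c10;  c[(i0 + 1) * n + j0 + 1] += c11;
    c[(i0 + 2) * n + j0] += c20;  c[(i0 + 2) * n + j0 + 1] += c21;
    c[(i0 + 3) * n + j0] += c30;  c[(i0 + 3) * n + j0 + 1] += c31;
}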

DGEMM memory reordering. GEMMs are O(N^3) while memory reordering is O(N^2), so it is worthwhile if it improves performance in the kernel. Transposing B means we can read A and B sequentially. Interleaving rows means multiple rows can be loaded from a single memory stream and vector registers can be populated from sequential reads. Fewer memory streams mean better prefetching and fewer cache misses at the ends of rows. A packing sketch follows.
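A hypothetical sketch of the kind of packing step meant here: groups of four columns of B (rows of the transposed B) are interleaved into one contiguous buffer so the kernel reads a single sequential stream. The interleave factor and layout are illustrative, not the ARM Performance Libraries layout.

#include <stddef.h>

/* Pack a k x n row-major matrix B into panels of 4 columns:
   within each panel the values are stored b[0][j..j+3], b[1][j..j+3], ...
   so the kernel walks packed[] with stride 1. Assumes n is a multiple
   of 4 for brevity. */
void pack_b(size_t k, size_t n, const double *b, double *packed)
{
    size_t idx = 0;
    for (size_t j = 0; j < n; j += 4)           /* one panel of 4 columns  */
        for (size_t p = 0; p < k; p++)          /* walk down the panel     */
            for (size_t jj = 0; jj < 4; jj++)
                packed[idx++] = b[p * n + j + jj];
}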

Performance gains by moving to assembly: guaranteed instruction ordering with no extra bits, which is needed for good performance on in-order micro-architectures; explicit vectorization, although alignment must then be managed by hand; FMA instructions explicitly included; optimal register utilization and instruction selection; and explicit memory prefetching, fetching data before we need to use it to reduce the cost of memory stalls. An intrinsics sketch of these ingredients follows.
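The slide is about hand-written assembly; as a rough stand-in, a NEON intrinsics fragment in C showing the same ingredients, namely explicit vectorization, explicit FMA and explicit software prefetching. This is illustrative only, not an actual library kernel, and the prefetch distance is a guess.

#include <arm_neon.h>
#include <stddef.h>

/* acc2[0..1] += a[i] * b[i] over 2-wide double vectors, prefetching ahead.
   The loop tail (odd n) is omitted for brevity. */
void fma_prefetch_demo(size_t n, const double *a, const double *b, double *acc2)
{
    float64x2_t acc = vld1q_f64(acc2);
    for (size_t i = 0; i + 2 <= n; i += 2) {
        __builtin_prefetch(&a[i + 16]);          /* fetch a couple of cache lines ahead */
        __builtin_prefetch(&b[i + 16]);
        float64x2_t va = vld1q_f64(&a[i]);
        float64x2_t vb = vld1q_f64(&b[i]);
        acc = vfmaq_f64(acc, va, vb);            /* fused multiply-add: acc += va * vb */
    }
    vst1q_f64(acc2, acc);
}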

Summary. Machines today are all multicore, and parallel computing is the way to use them effectively, both for server and scientific applications. Efficient code needs careful design to scale up to many cores; just writing serial code and expecting it to parallelize will not work. As machines get larger, energy costs are reaching megawatts, so lower-power technologies are more important than ever. ARM HPC is well placed to be a major rival to current legacy architectures, and work is happening today to make sure the ARM HPC software ecosystem is ready to support our partners' deployments.

http://www.arm.com/hpc The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited