Enabling the ARM high performance computing (HPC) software ecosystem
Ashok Bhat, Product Manager, HPC and Server tools
ARM Tech Symposia India, December 7th 2016
Are these supercomputers? For example, the Samsung Galaxy S6. No doubt it is pretty amazing:
- Four Cortex-A53 cores (1.5GHz)
- Four Cortex-A57 cores (2.1GHz)
- A Mali GPU (772MHz)
Random googling* gives its performance as 34.6 GFLOPs (http://www.androidauthority.com/flagship-camera-shootout-688406). That means it can do 34,600,000,000 floating point calculations every second. Note that Intel Haswells are now up to about 44 GFLOPs per (3.2GHz) core. Actually, that phone would have been the world's most powerful computer back in 1992.
* http://pages.experts-exchange.com/processing-power-compared/
The Road to Exascale (chart, courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy): GFLOPS plotted on a log scale from 1985 to 2020, rising through gigascale, terascale and petascale towards exascale. The No. 1 and No. 500 systems are tracked over time; the No. 1 system has moved from 33 PFLOPs to 93 PFLOPs, while the phone's 34.6 GFLOPs is marked for comparison.
So what do people really solve on HPC systems?
- Weather and climate modelling: the new Met Office machine has over 480,000 cores, 2PB of RAM and weighs 140 tonnes
- Computational fluid dynamics: modelling cars, planes, beaches, blood, ...
- Computational chemistry: molecular dynamics, quantum interactions
- Atomic weapons simulations: don't mess with the nuclear stockpile
- Earth's mantle, galaxy formation, biological processes: you name it, someone's modelling it at scale
(Image: Kitware/R.N. Elias)
Parallel programming in HPC
- Languages: mainly Fortran (77, 90, 95, 2003), some C and C++, no Java, Python as glue
- Vectorization: multiple pipelines and FMA; relies on the architecture and, mainly, on the compiler
- OpenMP: source code instrumentation specifying how loops may be parallelized
- MPI: message passing, explicitly stating how many bytes to send where
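As a minimal illustration of the OpenMP style above, a single pragma is enough to parallelize a loop; the function name and the reduction example here are illustrative, not taken from any particular HPC code (compile with an OpenMP flag such as -fopenmp, otherwise the pragma is ignored and the loop runs serially):

```c
/* Sum of squares, parallelized across threads with OpenMP.
   The reduction clause gives each thread a private partial sum
   and combines them safely at the end of the loop. */
double sum_squares(const double *x, int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += x[i] * x[i];
    return s;
}
```

MPI, by contrast, sits at a lower level: the programmer explicitly calls send/receive routines stating buffer addresses, element counts and destination ranks.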
Scalable Vector Extensions (SVE)
Introducing the Scalable Vector Extension (SVE)
- General purpose 64-bit ARMv8-A
- Scalable wide vectors: 128–2048 bits
- Extending processing capability
Introducing the Scalable Vector Extension (SVE)
- Extends the AArch64 state of ARMv8-A with vector lengths up to a maximum of 2048 bits
- Expands fine-grain data parallelism for HPC scientific workloads
- A better compiler target, reducing software deployment effort
- Beginning engagement with the open-source community and the wider ARM ecosystem
Post-K Japanese supercomputer
- 100x capacity, 50x capability, 15x efficiency
- ARMv8-A with SVE
ARM HPC Ecosystem
ARM HPC ecosystem roadmap (items span released, planned and concept stages across 2016, 2017 and beyond)
- Hardware: AppliedMicro X-Gene 1 & 2, AMD Seattle, Cavium ThunderX; AppliedMicro X-Gene 3, Phytium Mars, Cavium ThunderX2; Fujitsu Post-K (SVE)
- Open-source software: OpenHPC 1.2, ARM Optimized Routines (with vector versions planned), Altair PBS Pro, GCC (gcc/g++/gfortran), LLVM clang, LLVM Flang
- ARM HPC tools: ARM C/C++ Compiler (ahead of LLVM trunk), ARM Fortran Compiler, ARM Performance Libraries, ARM Code Advisor (beta, then full release), ARM Instruction Emulator
- ISV software: Allinea DDT and MAP, NAG Library & Compiler, PathScale ENZO, Rogue Wave TotalView
OpenHPC: now on ARM
OpenHPC is a community effort to provide a common, verified set of open source packages for HPC deployments.
ARM's participation:
- Silver member of OpenHPC
- ARM is on the OpenHPC Technical Steering Committee in order to drive ARM architecture build support
Status (November 2016):
- 1.2.0 release out now
- All packages built on ARMv8 for both CentOS and SUSE
- ARM-based machines are being used for building and also in the OpenHPC build infrastructure
Functional areas and components:
- Base OS: RHEL/CentOS 7.1, SLES 12
- Administrative tools: Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, prun
- Provisioning: Warewulf
- Resource management: SLURM, Munge, Altair PBS Pro*
- I/O services: Lustre client (community version)
- Numerical/scientific libraries: Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, Mumps
- I/O libraries: HDF5 (phdf5), NetCDF (including C++ and Fortran interfaces), Adios
- Compiler families: GNU (gcc, g++, gfortran)
- MPI families: OpenMPI, MVAPICH2
- Development tools: Autotools (autoconf, automake, libtool), Valgrind, R, SciPy/NumPy
- Performance tools: PAPI, Intel IMB, mpiP, pdtoolkit, TAU
ARM HPC tools portfolio
- ARM C/C++ Compiler: commercially supported for HPC applications
- ARM Performance Libraries: BLAS, LAPACK and FFT, micro-architecturally tuned
- ARM Code Advisor: actionable advice to optimize your code
- ARM SVE C/C++ Compiler: compiler support for the ARM Scalable Vector Extension
- ARM Instruction Emulator: develop software for tomorrow's hardware today
ARM Code Advisor (Beta)
Combines static and dynamic information to produce actionable insights.
- Performance advice: compiler vectorization hints, compilation flags advice, Fortran subarray warnings, OpenMP instrumentation
- Insights from compilation and runtime: compiler insights are embedded into the application binary by the ARM compilers; the OMPT interface is used to instrument the OpenMP runtime
- Extensible architecture: users can write plugins to add their own analysis information; data is accessible via web browser, command line, and a REST API to support new user interfaces
ARM Code Advisor (Beta): typical workflow
Source code -> compile (producing a binary with embedded insight) -> profile (producing a runtime profile) -> analyse -> web view (served over HTTP)
ARM Performance Libraries: optimized BLAS, LAPACK and FFT
- Commercial 64-bit ARMv8 math libraries: commonly used low-level math routines (BLAS, LAPACK and FFT), validated with NAG's test suite, a de-facto standard
- Best-in-class performance with commercial support: tuned by ARM for Cortex-A72, Cortex-A57 and Cortex-A53; maintained and supported by ARM for a wide range of ARM-based SoCs; regular benchmarking against open source alternatives shows performance on par with best-in-class math libraries
- Silicon partners can provide tuned micro-kernels for their SoCs: partners can collaborate directly, working with our source code and test suite, or alternatively contribute through the open source route
Deep dive into optimizing DGEMM
DGEMM: the maths
Double precision GEneral Matrix-Matrix multiplication: C = αAB + βC
- We normally assume α=1, β=0; however, a BLAS implementation must cater for all values
- The matrices are not necessarily square: A is m x k, B is k x n, C is m x n
- Not to mention that the allocated storage may hold each matrix as a small part of a wider space, needing extra parameters (the leading dimensions) to handle this
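To make the full interface concrete, here is a hedged reference-style sketch handling arbitrary α and β, rectangular shapes, and leading dimensions; the routine name is ours, and row-major storage is used for illustration (the reference BLAS is column-major):

```c
/* C = alpha*A*B + beta*C, with A m-by-k, B k-by-n, C m-by-n,
   each matrix possibly embedded in a wider allocation via its
   leading dimension (lda, ldb, ldc = row stride here). */
void dgemm_ref(int m, int n, int k,
               double alpha, const double *A, int lda,
               const double *B, int ldb,
               double beta, double *C, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[i*lda + p] * B[p*ldb + j];
            C[i*ldc + j] = alpha * acc + beta * C[i*ldc + j];
        }
}
```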
Coding DGEMM: naïve version

    for (j=0; j<n; j++)
      for (i=0; i<n; i++)
        for (k=0; k<n; k++)
          c[i][j] += a[i][k]*b[k][j];

In C, memory access is stride 1 in the second array index:
- Good access to A
- Very bad access to B (stride n)
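The slide's naive loop nest can be made runnable as follows; C99 variable-length array parameters are used so the 2-D indexing matches the slide, and the function name is illustrative:

```c
/* Naive j-i-k DGEMM (alpha=1, beta accumulation into C).
   The inner k loop reads B[k][j] with stride n: poor locality. */
void dgemm_naive(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```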
DGEMM: loop reordering
We want better use of data from loaded cache lines, so make a[i][k] loop invariant for the inner loop:

    for (k=0; k<n; k++)
      for (i=0; i<n; i++)
        for (j=0; j<n; j++)
          c[i][j] += a[i][k]*b[k][j];

- Memory access for B and C is now good
- However, these cache lines will need reloading for the next element of A
- Enables automatic vectorization as a by-product of the optimization
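A runnable version of the reordered loop (function name illustrative); hoisting a[i][k] into a local makes the invariance explicit to the compiler:

```c
/* k-i-j DGEMM: the inner j loop walks B and C rows with stride 1,
   and a[i][k] is held in a local for the whole inner loop. */
void dgemm_kij(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double aik = A[i][k];            /* loop invariant for j */
            for (int j = 0; j < n; j++)
                C[i][j] += aik * B[k][j];    /* stride-1 over B and C */
        }
}
```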
DGEMM: loop unrolling
We want better reuse of data from loaded cache lines, so unroll the outer loop to enable multiple A values to be used without further loads:

    for (k=0; k<n; k+=4)
      for (i=0; i<n; i++)
        for (j=0; j<n; j++)
          c[i][j] += a[i][k]*b[k][j]
                   + a[i][k+1]*b[k+1][j]
                   + a[i][k+2]*b[k+2][j]
                   + a[i][k+3]*b[k+3][j];

- Memory access for A now uses more data from each loaded cache line
- The cache is big enough for multiple lines of B to be loaded to update a single element of C
- A clean-up loop is needed when n is not a multiple of the unrolling factor
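A runnable version of the unrolled loop, including the clean-up loop the slide mentions (function name illustrative):

```c
/* k loop unrolled by 4, with a clean-up loop for n % 4 != 0. */
void dgemm_unroll4(int n, double A[n][n], double B[n][n], double C[n][n]) {
    int k;
    for (k = 0; k + 3 < n; k += 4)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k]   * B[k][j]
                         + A[i][k+1] * B[k+1][j]
                         + A[i][k+2] * B[k+2][j]
                         + A[i][k+3] * B[k+3][j];
    for (; k < n; k++)                 /* clean-up for remaining k */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```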
DGEMM: cache blocking
Small matrices that fit in cache solve faster, so split the matrices up into blocks in each direction. Note that I, J, K below stand for the combined indices I = ii+i, J = jj+j, K = kk+k:

    for (ii=0; ii<n; ii+=blk)
      for (kk=0; kk<n; kk+=blk)
        for (jj=0; jj<n; jj+=blk)
          for (k=0; k<blk; k+=4)
            for (i=0; i<blk; i++)
              for (j=0; j<blk; j++)
                c[I][J] += a[I][K]*b[K][J]
                         + a[I][K+1]*b[K+1][J]
                         + a[I][K+2]*b[K+2][J]
                         + a[I][K+3]*b[K+3][J];
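A runnable version of the blocked loop nest; the block size is a parameter here, the combined indices are written out explicitly, the block bounds are clamped so n need not be a multiple of blk, and the four-way unrolling is omitted for brevity:

```c
/* Cache-blocked DGEMM: each (ii, kk, jj) iteration multiplies a
   blk-by-blk tile of A with a tile of B, accumulating into C. */
void dgemm_blocked(int n, int blk,
                   double A[n][n], double B[n][n], double C[n][n]) {
    for (int ii = 0; ii < n; ii += blk)
        for (int kk = 0; kk < n; kk += blk)
            for (int jj = 0; jj < n; jj += blk) {
                int ie = ii + blk < n ? ii + blk : n;   /* clamp edges */
                int ke = kk + blk < n ? kk + blk : n;
                int je = jj + blk < n ? jj + blk : n;
                for (int k = kk; k < ke; k++)
                    for (int i = ii; i < ie; i++)
                        for (int j = jj; j < je; j++)
                            C[i][j] += A[i][k] * B[k][j];
            }
}
```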
DGEMM: adding OpenMP
Parallelism is key to getting the best performance.
- Ideally, ensure that each thread works on updating its own values
- Arrange the parallelism to avoid locks on data
- Possibly use the cache topology (shared L2 & L3 caches) to extract further performance
- Limit the number of threads for small problems
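A minimal sketch of the lock-free arrangement described above: parallelizing over rows of C means each thread updates a disjoint set of elements, so no synchronization is needed (assumes an OpenMP-enabled build; without it the pragma is ignored and the code runs serially):

```c
/* Row-parallel DGEMM: thread t owns whole rows of C, so writes
   never conflict and no locks or atomics are required. */
void dgemm_omp(int n, double A[n][n], double B[n][n], double C[n][n]) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = A[i][k];
            for (int j = 0; j < n; j++)
                C[i][j] += aik * B[k][j];
        }
}
```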
DGEMM: register blocking
Calculate as many elements of C as there are available registers for, working through the current block with a block of registers. With 32 SIMD registers:
- 8 x 3 for accumulators to C
- 4 to load from A
- 3 to load from B
- Less flow control
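Register blocking can be sketched in portable C with a micro-kernel whose accumulators are locals, which the compiler can keep in registers across the whole k loop; a 4x4 scalar block is used for illustration rather than the 8x3 SIMD-register block on the slide, and n is assumed to be a multiple of 4:

```c
/* 4x4 micro-kernel: 16 accumulators stay live (ideally in
   registers) while the k loop streams through A and B once. */
void dgemm_4x4(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < n; j += 4) {
            double c[4][4] = {{0}};
            for (int k = 0; k < n; k++)
                for (int ii = 0; ii < 4; ii++)
                    for (int jj = 0; jj < 4; jj++)
                        c[ii][jj] += A[i+ii][k] * B[k][j+jj];
            for (int ii = 0; ii < 4; ii++)       /* write block back */
                for (int jj = 0; jj < 4; jj++)
                    C[i+ii][j+jj] += c[ii][jj];
        }
}
```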
DGEMM: memory reordering
GEMMs are O(N^3) but memory reordering is O(N^2), so it is worthwhile if we can improve performance in the kernel.
- Transposing B means we can read A and B sequentially
- Interleaving rows: multiple rows can be loaded from a single memory stream, and vector registers can be populated from sequential reads
- Fewer memory streams means better prefetching and fewer cache misses at the ends of rows
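The transpose idea can be sketched as follows: an O(n^2) copy of B into transposed storage makes both inner-loop reads stride 1 (names are illustrative; the row-interleaved packing used by real kernels is a further refinement of the same idea):

```c
#include <stdlib.h>

/* DGEMM via an explicit transpose of B: the inner dot product
   then reads A[i][.] and Bt[j][.] with unit stride. */
void dgemm_bt(int n, double A[n][n], double B[n][n], double C[n][n]) {
    double (*Bt)[n] = malloc(sizeof(double) * n * n);
    for (int k = 0; k < n; k++)          /* O(n^2) reordering */
        for (int j = 0; j < n; j++)
            Bt[j][k] = B[k][j];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++)
                acc += A[i][k] * Bt[j][k];   /* both stride-1 */
            C[i][j] += acc;
        }
    free(Bt);
}
```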
Performance gains by moving to assembly
- Guaranteed instruction ordering, with no extra bits: needed for good performance on in-order micro-architectures
- Explicit vectorization: we need to manage alignment ourselves; FMA instructions are explicitly included
- Optimal register utilization and instruction selection
- Explicit memory prefetching: getting data into cache before we need to use it reduces the cost of memory stalls
Summary
- Machines today are all multicore; parallel computing is the way to use them effectively, both for server and scientific applications
- Efficient code needs careful design to scale up to many cores: just writing serial code and expecting it to parallelize will not work
- As machines get larger, energy costs are reaching megawatts, so lower-power technologies are more important than ever
- ARM HPC is well placed to be a major rival to current legacy architectures
- Work is happening today to ensure the ARM HPC software ecosystem is ready to support our partners' deployments
http://www.arm.com/hpc
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2016 ARM Limited