Enabling the ARM high performance computing (HPC) software ecosystem

1 Enabling the ARM high performance computing (HPC) software ecosystem. Ashok Bhat, Product manager, HPC and Server tools. ARM Tech Symposia India, December 7th 2016.

2 Are these supercomputers? For example, the Samsung S6: no doubt it is pretty amazing. It has four Cortex-A53 cores (1.5GHz), four Cortex-A57 cores (2.1GHz) and a Mali GPU (772MHz). Random Googling* gives its performance as 34.6 GFLOPs, which means it can do about 34.6 billion floating point calculations every second. Note that Intel Haswells are now up to about 44 GFLOPs per (3.2GHz) core. Actually, that would have been the world's most powerful computer back in 1992.*

3 The Road to Exascale. [Figure: performance in GFLOPS of the No. 1 system over time, with GIGASCALE, TERASCALE, PETASCALE and EXASCALE bands; labelled points include 34.6 GFLOPs and 93 PFLOPs. Courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy.]

4 So what do people really solve on HPC systems? Weather and climate modelling: the new Met Office machine has over [...] cores, 2PB RAM and weighs 140 tonnes. Computational Fluid Dynamics: modelling cars, planes, beaches, blood, and more. Computational chemistry: molecular dynamics, quantum interactions. Atomic weapons simulations: don't mess with the nuclear stockpile. Earth's mantle, galaxy formation, biological processes: you name it, someone's modelling it at scale. (Image: Kitware/R.N. Elias)

5 Parallel programming in HPC. Mainly Fortran (77, 90, 95, 2003), some C and C++, no Java, Python as glue. Multiple pipelines, FMA: relies on the architecture. Vectorization: relies mainly on the compiler. OpenMP: source code instrumentation specifying how loops may be parallelized. MPI: message passing, explicitly stating how many bytes to send where.
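As a minimal sketch of the OpenMP style described on this slide, a single directive states how a loop may be parallelized. The function name `dot_product` and the choice of a dot product are illustrative assumptions, not taken from the slides. When the file is built without OpenMP support the pragma is ignored and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: dot product of two vectors.
 * The pragma asks OpenMP to split the loop across threads and to combine
 * the per-thread partial sums at the end. */
double dot_product(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];   /* multiply-add: a candidate for an FMA */
    return sum;
}
```

With GCC or Clang the OpenMP version is typically enabled with `-fopenmp`; the source itself does not change.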

6 Scalable Vector Extensions (SVE)

7 Introducing the Scalable Vector Extension (SVE). General purpose 64-bit ARMv8-A. Scalable wide vectors. Extending processing capability.

8 Introducing Scalable Vector Extension (SVE). Extends ARMv8-A with an AArch64 extension which expands vector length up to a maximum of 2048 bits. Expands fine-grain data parallelism for HPC scientific workloads. Better compiler target, reduces software deployment effort. Beginning engagement with the open-source community and the wider ARM ecosystem.
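One way to picture why a scalable vector length reduces deployment effort is a loop that never hard-codes the vector width. The sketch below is a scalar model in plain C, not SVE intrinsics or real SVE code: `lanes` stands in for the hardware vector length in doubles, and the `active` test plays the role of a per-lane predicate. The function name and the DAXPY operation are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 32   /* 2048 bits divided by 64-bit doubles */

/* Scalar model of a vector-length-agnostic DAXPY: y = a*x + y.
 * The same code gives the same result for any value of `lanes`. */
void daxpy_vla_model(double a, const double *x, double *y,
                     size_t n, size_t lanes)
{
    assert(lanes >= 1 && lanes <= MAX_LANES);
    for (size_t base = 0; base < n; base += lanes) {
        for (size_t lane = 0; lane < lanes; lane++) {
            /* Predicate: lanes past the end of the array are inactive,
             * so no separate scalar tail loop is needed. */
            int active = (base + lane) < n;
            if (active)
                y[base + lane] += a * x[base + lane];
        }
    }
}
```

Calling it with `lanes = 2` and `lanes = 8` on the same five-element input produces identical output, which is the property that lets one binary run on implementations with different vector lengths.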

9 Post-K Japanese supercomputer: 100x capacity, 50x capability, 15x efficiency. ARMv8-A with SVE.

10 ARM HPC Ecosystem

11 ARM HPC ecosystem roadmap (legend: Released, Planned, Concept).
Hardware: AppliedMicro X-Gene 1 & 2, AMD Seattle, Cavium ThunderX, AppliedMicro X-Gene 3, Phytium Mars, Cavium ThunderX2, Fujitsu Post K (SVE).
Open-source software: OpenHPC 1.2, ARM Optimized Routines, ARM Optimized Routines vector versions, Altair PBS Pro, GCC (gcc/g++/gfortran), LLVM clang, LLVM Flang.
ARM HPC tools: ARM C/C++ Compiler (ahead of LLVM trunk), ARM Fortran Compiler, ARM Performance Libraries, ARM Code Advisor (Beta), ARM Code Advisor (Full release), ARM Instruction Emulator.
ISV software: Allinea DDT and MAP, NAG Library & Compiler, PathScale ENZO, Rogue Wave TotalView, future ISV software.

12 OpenHPC now on ARM. OpenHPC is a community effort to provide a common, verified set of open source packages for HPC deployments. ARM's participation: Silver member of OpenHPC; ARM is on the OpenHPC Technical Steering Committee in order to drive ARM architecture build support. Status (November 2016): release out now; all packages built on ARMv8 for both CentOS and SUSE; ARM-based machines are being used for building and also in the OpenHPC build infrastructure.
Functional areas and the components they include:
Base OS: RHEL/CentOS 7.1, SLES 12
Administrative Tools: Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, prun
Provisioning: Warewulf
Resource Mgmt.: SLURM, Munge, Altair PBS Pro*
I/O Services: Lustre client (community version)
Numerical/Scientific Libraries: Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, Mumps
I/O Libraries: HDF5 (phdf5), NetCDF (including C++ and Fortran interfaces), Adios
Compiler Families: GNU (gcc, g++, gfortran)
MPI Families: OpenMPI, MVAPICH2
Development Tools: Autotools (autoconf, automake, libtool), Valgrind, R, SciPy/NumPy
Performance Tools: PAPI, Intel IMB, mpip, pdtoolkit, TAU

13 ARM HPC tools portfolio. ARM C/C++ Compiler: commercially supported for HPC applications. ARM Performance Libraries: BLAS, LAPACK and FFT, micro-architecturally tuned. ARM Code Advisor: actionable advice to optimize your code. ARM SVE C/C++ Compiler: compiler support for the ARM Scalable Vector Extension. ARM Instruction Emulator: develop software for tomorrow's hardware today.

14 ARM Code Advisor (Beta): combines static and dynamic information to produce actionable insights. Performance advice: compiler vectorization hints, compilation flags advice, Fortran subarray warnings, OpenMP instrumentation. Insights from compilation and runtime: compiler insights are embedded into the application binary by the ARM compilers, and the OMPT interface is used to instrument the OpenMP runtime. Extensible architecture: users can write plugins to add their own analysis information, and data is accessible via web browser, command line and REST API to support new user interfaces.

15 ARM Code Advisor (Beta): typical workflow. Source code -> Compile -> Compiled binary + insight -> Profile -> Runtime profile -> Analyse -> Web view (HTTP).

16 ARM Performance Libraries: optimized BLAS, LAPACK and FFT. Commercial 64-bit ARMv8 math libraries: commonly used low-level math routines (BLAS, LAPACK and FFT), validated with NAG's test suite, a de-facto standard. Best-in-class performance with commercial support: tuned by ARM for Cortex-A72, Cortex-A57 and Cortex-A53; maintained and supported by ARM for a wide range of ARM-based SoCs; regular benchmarking against open source alternatives shows performance on par with best-in-class math libraries. Silicon partners can provide tuned micro-kernels for their SoCs: partners can collaborate directly, working with our source code and test suite, or alternatively contribute through the open source route.

17 Deep dive into optimizing DGEMM

18 DGEMM: the maths. Double precision GEneral Matrix-Matrix multiplication: $C = \alpha A B + \beta C$. Normally assume $\alpha = 1$, $\beta = 0$; however, a BLAS implementation must cater for all values. Also, matrices are not necessarily square: $A$ is $m \times k$, $B$ is $k \times n$, $C$ is $m \times n$. Not to mention that the allocated storage may hold each matrix as a small part of a wider space, needing extra parameters to handle this.
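The full operation, including the scaling factors, the non-square shapes and the "wider space" parameters, can be sketched as below. This is an unoptimized reference under stated assumptions: storage is row-major to match the C loops in the following slides (the standard Fortran BLAS interface is column-major), and the name `dgemm_ref` and the parameter names `lda`, `ldb`, `ldc` for the row strides are chosen for this example.

```c
#include <assert.h>

/* Reference sketch: C = alpha*A*B + beta*C, row-major storage.
 * A is m x k, B is k x n, C is m x n. lda, ldb and ldc are the row strides
 * of the (possibly wider) arrays that hold each matrix. */
void dgemm_ref(int m, int n, int k, double alpha,
               const double *a, int lda,
               const double *b, int ldb,
               double beta, double *c, int ldc)
{
    assert(lda >= k && ldb >= n && ldc >= n);
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
        }
    }
}
```

For example, a 2 x 3 matrix A stored inside rows of width 4 is passed with `lda = 4`; only the first three entries of each row are read.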

19 Coding DGEMM: naïve

    for (j = 0; j < n; j++)
      for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
          c[i][j] += a[i][k] * b[k][j];

In C, memory access is stride 1 in the second array index: good access to A, very bad access to B.

20 DGEMM: loop reordering. Want better use of data from loaded cache lines. Make a[i][k] loop invariant for the inner loop:

    for (k = 0; k < n; k++)
      for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
          c[i][j] += a[i][k] * b[k][j];

Memory access for B and C is now good. However, these cache lines will need reloading for the next element of A. Enables automatic vectorization as a by-product of the optimization.
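The two loop orders can be put side by side as compilable functions to confirm that they compute the same product and differ only in memory access pattern. This is a sketch assuming square n x n matrices passed as C99 variable-length array parameters; the function names `matmul_jik` and `matmul_kij` are made up for the example.

```c
#include <assert.h>

/* Naive order (j, i, k): the inner loop walks b[k][j] down a column,
 * so successive accesses to B are n doubles apart. */
void matmul_jik(int n, double c[n][n], double a[n][n], double b[n][n])
{
    assert(n > 0);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Reordered (k, i, j): a[i][k] is invariant in the inner loop and both
 * b[k][j] and c[i][j] are accessed with stride 1. */
void matmul_kij(int n, double c[n][n], double a[n][n], double b[n][n])
{
    assert(n > 0);
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double aik = a[i][k];          /* hoisted loop invariant */
            for (int j = 0; j < n; j++)
                c[i][j] += aik * b[k][j];
        }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first if a plain product is wanted.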

21 DGEMM: loop unrolling. Want better reuse of data from loaded cache lines. Unroll the outer loop to enable multiple A values to be used without further loads:

    for (k = 0; k < n; k += 4)
      for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
          c[i][j] += a[i][k]   * b[k][j]
                   + a[i][k+1] * b[k+1][j]
                   + a[i][k+2] * b[k+2][j]
                   + a[i][k+3] * b[k+3][j];

Memory access for A now uses more data from the loaded cache line. The cache is big enough for multiple lines of B to be loaded to update a single element of C. A clean-up loop is needed for sizes that are not multiples of the unrolling factor.
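The clean-up loop mentioned on this slide is not shown there; one possible form is sketched below. It assumes square n x n matrices passed as C99 variable-length array parameters, and the name `matmul_unroll4` is illustrative.

```c
#include <assert.h>

/* k loop unrolled by 4, followed by a clean-up loop for the n % 4
 * iterations that do not fit the unrolling factor. */
void matmul_unroll4(int n, double c[n][n], double a[n][n], double b[n][n])
{
    assert(n > 0);
    int k4 = n - (n % 4);   /* largest multiple of 4 not exceeding n */

    for (int k = 0; k < k4; k += 4)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i][j] += a[i][k]     * b[k][j]
                         + a[i][k + 1] * b[k + 1][j]
                         + a[i][k + 2] * b[k + 2][j]
                         + a[i][k + 3] * b[k + 3][j];

    /* Clean-up: remaining values of k, one at a time. */
    for (int k = k4; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```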

22 DGEMM: cache blocking. Small matrices that fit better in cache solve faster, therefore split the matrix up into blocks in each direction. Note: I, J and K have unlisted assignments based on i and ii, j and jj, and k and kk.

    for (ii = 0; ii < n; ii += blk)
      for (kk = 0; kk < n; kk += blk)
        for (jj = 0; jj < n; jj += blk)
          for (k = 0; k < blk; k += 4)
            for (i = 0; i < blk; i++)
              for (j = 0; j < blk; j++)
                c[I][J] += a[I][K]   * b[K][J]
                         + a[I][K+1] * b[K+1][J]
                         + a[I][K+2] * b[K+2][J]
                         + a[I][K+3] * b[K+3][J];
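A compilable sketch of the blocking idea follows, with the global indices written out explicitly as I, J and K. It makes two simplifications that are assumptions of this example rather than part of the slide: the unrolling by 4 is left out for brevity, and block edges are clamped with a small helper so that n does not have to be a multiple of `blk`. The names `matmul_blocked` and `min_int` are illustrative.

```c
#include <assert.h>

static int min_int(int x, int y) { return x < y ? x : y; }

/* Cache-blocked multiply: each (ii, kk, jj) triple works on one
 * blk x blk tile of A, B and C at a time. */
void matmul_blocked(int n, int blk,
                    double c[n][n], double a[n][n], double b[n][n])
{
    assert(n > 0 && blk > 0);
    for (int ii = 0; ii < n; ii += blk)
        for (int kk = 0; kk < n; kk += blk)
            for (int jj = 0; jj < n; jj += blk) {
                int i_end = min_int(ii + blk, n);
                int k_end = min_int(kk + blk, n);
                int j_end = min_int(jj + blk, n);
                /* I, J, K are the global indices inside the tile. */
                for (int K = kk; K < k_end; K++)
                    for (int I = ii; I < i_end; I++) {
                        double aik = a[I][K];
                        for (int J = jj; J < j_end; J++)
                            c[I][J] += aik * b[K][J];
                    }
            }
}
```

The block size is a tuning parameter; a sensible value depends on the cache sizes of the target core and is not something the slides specify.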

23 DGEMM: adding OpenMP. Parallelism is key to getting the best performance. Ideally, ensure that each thread works on updating its own values. Arrange parallelism to avoid locks on data. Possibly make use of cache topology (shared L2 and L3 cache) to extract further performance. Limit the number of threads for small problems.
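One way to meet these goals is to parallelize over rows of C, as in the sketch below. Each thread then writes only its own rows, so no locking is needed. The function name `matmul_omp` and the threshold of 64 in the `if` clause are placeholders for this example; a real threshold would have to be measured. Without OpenMP support the pragma is ignored and the code runs serially.

```c
#include <assert.h>

/* Row-parallel multiply: iteration i of the outer loop owns row i of C. */
void matmul_omp(int n, double c[n][n], double a[n][n], double b[n][n])
{
    assert(n > 0);
    /* The if clause keeps small problems on a single thread, where the
     * cost of starting a parallel region would outweigh the gain. */
    #pragma omp parallel for if (n >= 64) schedule(static)
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[i][k];
            for (int j = 0; j < n; j++)
                c[i][j] += aik * b[k][j];
        }
}
```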

24 DGEMM: register blocking. Calculate as many elements of C as there are registers available for. Work through the current block with a block of registers. With 32 SIMD registers: 8 * 3 for accumulators to C, 4 to load from A, 3 to load from B. Less flow control.
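The idea can be sketched in portable C with a small tile of local accumulators that the compiler is free to keep in registers. The 4 x 4 tile below is an illustrative choice, not the register allocation quoted on the slide, and the sketch assumes n is a multiple of the tile size. The names `matmul_regblock`, `MR` and `NR` are made up for the example.

```c
#include <assert.h>

#define MR 4   /* rows of C per register block (illustrative) */
#define NR 4   /* columns of C per register block (illustrative) */

/* Register-blocked multiply: an MR x NR tile of C is accumulated in local
 * variables across the whole k loop and written back once. */
void matmul_regblock(int n, double c[n][n], double a[n][n], double b[n][n])
{
    assert(n > 0 && n % MR == 0 && n % NR == 0);
    for (int i = 0; i < n; i += MR)
        for (int j = 0; j < n; j += NR) {
            double acc[MR][NR] = {{0.0}};
            for (int k = 0; k < n; k++)
                for (int r = 0; r < MR; r++) {
                    double ar = a[i + r][k];      /* one value of A ...   */
                    for (int s = 0; s < NR; s++)  /* ... reused NR times  */
                        acc[r][s] += ar * b[k][j + s];
                }
            for (int r = 0; r < MR; r++)
                for (int s = 0; s < NR; s++)
                    c[i + r][j + s] += acc[r][s];
        }
}
```

Because the tile loops have fixed, small trip counts, a compiler can unroll them completely, which is where the reduction in flow control comes from.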

25 DGEMM: memory reordering. GEMMs are $O(N^3)$ and memory reordering is $O(N^2)$, so reordering is worthwhile if it improves performance in the kernel. Transposing B means we can read A and B sequentially. Interleaving rows: multiple rows can be loaded from a single memory stream, and vector registers can be populated from sequential reads. Fewer memory streams means better prefetching and fewer cache misses at the end of rows.
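The transposition case can be sketched as follows: an $O(N^2)$ copy of B into transposed form, followed by an $O(N^3)$ kernel in which both operands are read with stride 1. Row interleaving is not shown. The function name `matmul_bt` and the convention of returning -1 on allocation failure are assumptions of this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Multiply using a transposed copy of B. Returns 0 on success,
 * -1 if the temporary buffer cannot be allocated. */
int matmul_bt(int n, double c[n][n], double a[n][n], double b[n][n])
{
    assert(n > 0);
    double (*bt)[n] = malloc(sizeof(double[n][n]));
    if (bt == NULL)
        return -1;

    /* O(n^2) reordering: bt[j][k] holds b[k][j]. */
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            bt[j][k] = b[k][j];

    /* O(n^3) kernel: row i of A and row j of bt are both read
     * sequentially in the inner loop. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++)
                acc += a[i][k] * bt[j][k];
            c[i][j] += acc;
        }

    free(bt);
    return 0;
}
```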

26 Performance gains by moving to assembly. Guaranteed instruction ordering: no extra bits, needed for good performance on in-order micro-architectures. Explicit vectorization: need to manage alignment ourselves, FMA instructions explicitly included. Optimal register utilization and instruction selection. Explicit memory prefetching: getting data in memory before we need to use it, which reduces the cost of memory stalls.
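Explicit prefetching can be approximated from C without dropping to assembly by using the `__builtin_prefetch` builtin provided by GCC and Clang, as in the sketch below. This is only an illustration of the idea on a simple dot product, not the kernel described in the slides; the name `dot_prefetch` and the prefetch distance of 64 elements are assumptions that would need tuning on real hardware.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 64   /* elements ahead; illustrative only */

/* Dot product that requests data ahead of the element it is using.
 * __builtin_prefetch(addr, 0, 3) hints a read with high temporal locality. */
double dot_prefetch(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 3);
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += x[i] * y[i];
    }
    return sum;
}
```

Issuing a prefetch on every iteration is wasteful, since one per cache line would do; a hand-written kernel has full control over that spacing, which is part of the argument the slide makes for assembly.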

27 Summary. Machines today are all multicore. Parallel computing is the way to use parallel machines effectively, both for server and scientific applications. Efficient code needs careful design to scale up effectively to many cores; just writing serial code and expecting it to parallelize will not work. As machines get larger, energy costs are reaching megawatts, hence lower-power technologies are more important than ever. ARM HPC is well placed to be a major rival to current legacy architectures. Work is happening today to ensure the ARM HPC software ecosystem is ready to support our partners' deployments.

28 The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited
