Knights Landing Scalability and the Role of Hybrid Parallelism

Sergi Siso (1), Aidan Chalk (1), Alin Elena (2), James Clark (1), Luke Mason (1)
(1) Hartree Centre @ STFC - Daresbury Labs
(2) Scientific Computing Department @ STFC - Daresbury Labs

New Hartree Scafell Pike System
Bull Sequana X1000 Supercomputer, ~4 PFLOP.
800+ nodes with 2x Intel Xeon Gold E5-6142 v5.
800+ self-hosted Intel KNL nodes.
24 high-memory nodes with 1 TB each.
30 data-hierarchy nodes, each equipped with a self-hosted KNL, 384 GB of memory and a local NVMe drive.
Xeon Phi KNL Access Programme planned.
Tuesday, 05 September 2017, RSE'17, Knight's Landing Parallelism

Peak Performance normalized against Ivy Bridge Serial
Threading & vectorization importance
[Figure: bar chart of peak serial, peak threading, peak vectorization, and peak threading & vectorization performance on Ivy Bridge E5-2697 v2, Broadwell E5-2697A v4, Xeon Phi KNC 5110p and Xeon Phi KNL 7210; vertical axis from 0 to 500.]

Software implications
Code needs to be modernized to benefit from newer platforms: vectorization, threading, micro-architecture optimizations.
But it must still leverage MPI for multi-KNL scalability.
We need to deal with the increasing complexity.
Software needs good software-engineering abstractions to separate the parallel and platform-specific optimizations from the science domain: task-based parallelism, domain-specific languages.

Hartree IPCC
Code modernization
Code generation tool
Task-based parallelism: DL_POLY
Code optimization: DL_MESO
Today's presentation: task-based parallelism (DL_POLY) and code optimization (DL_MESO).

DL_MESO LATTICE BOLTZMANN

DL_MESO LBE: Old Code
Lattice Boltzmann code supporting multiple fluids and multiple phases, coupled with solute diffusion and heat transfer.
[Figure: DL_MESO Lattice Boltzmann Scalability (BGK Shan Chen with 4 fluids, size 160^3); series: OpenMP speedup, MPI speedup, ideal; both axes from 0 to 30.]
Intel VTune showed significant memory and vectorization issues (poor performance on KNC and KNL).

DL_MESO MINILBE: Current Code
[Figure: DL_MESO MINILBE Performance in MLUPS (BGK Shan Chen with 4 fluids, size 160^3), original code vs optimized code, on 2 x Intel Xeon E5-2697 v2, Intel Xeon Phi 5110p and Intel Xeon Phi 7210; vertical axis from 0 to 350.]
2.5x when moving to KNL (memory-bound code).
A promising result, but OpenMP alone limits the code to a single KNL.
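For readers unfamiliar with the metric, MLUPS (million lattice updates per second) is conventionally defined as below. The symbols N_x, N_y, N_z (lattice dimensions), N_steps (number of time steps) and t (wall-clock time in seconds) are introduced here for illustration. The slides do not say whether each of the 4 fluids is counted as a separate update.

```latex
\[
  \text{MLUPS} = \frac{N_x \, N_y \, N_z \, N_{\text{steps}}}{10^{6} \, t}
\]
% For the 160^3 lattice used here, one time step is
% 160^3 = 4{,}096{,}000 site updates, i.e. 4.096 million updates.
% A hypothetical rate of 100 MLUPS would therefore correspond to
% 100 / 4.096 \approx 24.4 time steps per second.
```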

DL_MESO MINILBE: Future
Reintroduce MPI to minilbe (work in progress).
Goal: take the minilbe code and add distributed-memory parallelism while retaining performance.
The large drop in performance when moving to MPI is mainly due to halo swaps: there are 4 sends and 4 receives for each dimension, and the data is not in contiguous memory regions.
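To illustrate why non-contiguous halo data is costly, the following minimal C++ sketch packs one face of a 3D field into a contiguous buffer before it could be handed to a send call, and unpacks it on the receiving side. The `Field3D` type, its row-major layout and the function names are hypothetical and are not taken from minilbe. The MPI calls themselves are left out so the example runs standalone.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major 3D field: index (x, y, z) -> (x * ny + y) * nz + z.
struct Field3D {
    std::size_t nx, ny, nz;
    std::vector<double> data;

    Field3D(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nz(nz_), data(nx_ * ny_ * nz_, 0.0) {}

    double& at(std::size_t x, std::size_t y, std::size_t z) {
        return data[(x * ny + y) * nz + z];
    }
    const double& at(std::size_t x, std::size_t y, std::size_t z) const {
        return data[(x * ny + y) * nz + z];
    }
};

// Gather the plane z = z_plane into a contiguous buffer.
// In this layout consecutive face elements are nz doubles apart in memory,
// so every element touched lies on a different stride.
std::vector<double> pack_z_face(const Field3D& f, std::size_t z_plane) {
    assert(z_plane < f.nz);
    std::vector<double> buf;
    buf.reserve(f.nx * f.ny);
    for (std::size_t x = 0; x < f.nx; ++x) {
        for (std::size_t y = 0; y < f.ny; ++y) {
            buf.push_back(f.at(x, y, z_plane));
        }
    }
    return buf;
}

// Scatter a received contiguous buffer back into the plane z = z_plane.
void unpack_z_face(Field3D& f, std::size_t z_plane,
                   const std::vector<double>& buf) {
    assert(z_plane < f.nz);
    assert(buf.size() == f.nx * f.ny);
    std::size_t i = 0;
    for (std::size_t x = 0; x < f.nx; ++x) {
        for (std::size_t y = 0; y < f.ny; ++y) {
            f.at(x, y, z_plane) = buf[i++];
        }
    }
}
```

In a distributed run the packed buffer would be passed to a send and the unpack would follow the matching receive. An alternative that avoids the explicit copy is to describe the strided face with an MPI derived datatype such as one built by `MPI_Type_vector`. Which option performs better on KNL is not something the slides address.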

DL_POLY

DL_POLY_4
Classical molecular dynamics code developed at Daresbury Lab, mainly by Dr I Todorov and Dr W Smith.
Suitable for a wide variety of simulation types, including biological systems and materials under extreme conditions.
This work focused on the two-body interactions, and in particular the Van der Waals forces.
DL_POLY_4 uses a Verlet list + link-cell approach for neighbour finding. We rewrote the neighbour finding to use a sorted cell-list [1] approach, which does not store neighbour lists but computes them on the fly.
[1] Gonnet, Pedro. "Efficient and scalable algorithms for smoothed particle hydrodynamics on hybrid shared/distributed-memory architectures." SIAM Journal on Scientific Computing 37.1 (2015): C95-C121.
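The core idea of a sorted cell-list can be sketched for a single pair of neighbouring cells. This is a simplified illustration, not the DL_POLY implementation: it assumes the two cells are adjacent along the x axis, so particles are sorted by their x coordinate, whereas the general method in [1] sorts along the axis joining the two cells. The `Particle` type and function name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle { double x, y, z; };

// Count pairs (i in cell A, j in cell B) closer than `cutoff`.
// Assumption: cell B lies on the +x side of cell A.
// No neighbour list is stored; pairs are found on the fly.
std::size_t count_pairs_sorted(std::vector<Particle> a,
                               std::vector<Particle> b,
                               double cutoff) {
    assert(cutoff > 0.0);
    if (a.empty() || b.empty()) return 0;

    // A: largest x first (closest to B). B: smallest x first (closest to A).
    std::sort(a.begin(), a.end(),
              [](const Particle& p, const Particle& q) { return p.x > q.x; });
    std::sort(b.begin(), b.end(),
              [](const Particle& p, const Particle& q) { return p.x < q.x; });

    const double cutoff2 = cutoff * cutoff;
    std::size_t count = 0;
    for (const Particle& pi : a) {
        // If even the nearest particle of B is too far along x,
        // every remaining particle of A (smaller x) is too far as well.
        if (b.front().x - pi.x > cutoff) break;
        for (const Particle& pj : b) {
            const double dx = pj.x - pi.x;
            // B is sorted by x, so all later pj are further away along x.
            if (dx > cutoff) break;
            const double dy = pj.y - pi.y;
            const double dz = pj.z - pi.z;
            if (dx * dx + dy * dy + dz * dz <= cutoff2) ++count;
        }
    }
    return count;
}
```

The two early exits are what the sorting buys: distance along the sort axis is a lower bound on the true distance, so whole runs of particles can be skipped without computing the full separation.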

DL_POLY: Hybrid MPI+OpenMP
Time to solution on KNC.
Work done by Alin Elena while at ICHEC IPCC (Dublin).
MPI-only starts with much better performance.
The hybrid implementation beats MPI-only in the SMT region.

DL_POLY using task-based parallelism
We implemented a new version of shared-memory parallelism using OpenMP task-based parallelism.
The computation is divided into interdependent tasks.
A task can be executed by any thread once its dependencies are satisfied.
The parallelism and load balancing are handled by the runtime. The programmer merely tells the runtime what the tasks are and what data each task requires.
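As a rough illustration of this model, the C++ sketch below declares three OpenMP tasks whose ordering is derived by the runtime from their `depend` clauses. The function name, the arrays and the toy arithmetic are hypothetical and are not taken from DL_POLY. If the code is compiled without OpenMP support, the pragmas are ignored and the tasks simply run in program order with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy three-stage pipeline expressed as OpenMP tasks with data dependencies.
double run_task_graph(std::size_t n) {
    assert(n > 0);
    std::vector<double> positions(n, 1.0);
    std::vector<double> forces(n, 0.0);
    double energy = 0.0;
    double* pos = positions.data();
    double* frc = forces.data();

    #pragma omp parallel
    #pragma omp single
    {
        // Task A writes the positions.
        #pragma omp task depend(out: pos[0:n])
        {
            for (std::size_t i = 0; i < n; ++i) pos[i] *= 2.0;
        }

        // Task B reads positions and writes forces, so it runs after A.
        #pragma omp task depend(in: pos[0:n]) depend(out: frc[0:n])
        {
            for (std::size_t i = 0; i < n; ++i) frc[i] = -pos[i];
        }

        // Task C reads the forces, so it runs after B.
        #pragma omp task depend(in: frc[0:n]) shared(energy)
        {
            for (std::size_t i = 0; i < n; ++i) energy += 0.5 * frc[i] * frc[i];
        }
    }  // The implicit barrier here guarantees all tasks have finished.

    return energy;
}
```

Only the data each task reads and writes is stated. Which thread runs which task, and when, is left to the runtime.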

DL_POLY Taskyield on KNL
Other attempts:
OpenMP 5.0 reductions (unavailable as of Intel 18 beta update 0).
OmpSs commutative dependencies (in progress, currently poor performance).
QuickSched + conflicts (unsuccessful; we couldn't explain the performance loss).
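The slides do not show how `taskyield` was used. One common pattern, assumed here purely for illustration, is to protect each shared cell with a try-lock and yield to the runtime while the lock is held by another task, instead of declaring a strict dependency between the tasks. The `Cell` type, the atomic flag used as a lock and the function name are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cell with a spin flag guarding its accumulated value.
struct Cell {
    std::atomic<bool> busy{false};
    double force = 0.0;
};

// Each task adds 1.0 to one cell. Tasks that hit a busy cell yield
// so the runtime may schedule other work instead of blocking.
double accumulate_with_taskyield(std::size_t n_cells, std::size_t n_tasks) {
    assert(n_cells > 0);
    std::vector<Cell> cells(n_cells);

    #pragma omp parallel
    #pragma omp single
    {
        for (std::size_t t = 0; t < n_tasks; ++t) {
            // `untied` lets the runtime switch to unrelated tasks at the
            // yield point; a tied task could mostly only spin there.
            #pragma omp task untied firstprivate(t) shared(cells)
            {
                Cell& c = cells[t % n_cells];
                // Try to take the cell; on failure, yield and retry.
                while (c.busy.exchange(true, std::memory_order_acquire)) {
                    #pragma omp taskyield
                }
                c.force += 1.0;  // exclusive update of the cell
                c.busy.store(false, std::memory_order_release);
            }
        }
    }  // implicit barrier: all tasks are complete

    double total = 0.0;
    for (const Cell& c : cells) total += c.force;
    return total;
}
```

The lock is never held across a yield point, so a task that owns a cell is always actively running. Tasks touching the same cell may execute in any order, which is the relaxed constraint that commutative dependencies would express directly.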

Scalability of the Parallel Region w/ Taskyield on KNL

DL_POLY Conclusions
The first OpenMP-tasks implementation works well, though some of the task relationships available in other libraries are still lacking.
This can make OpenMP tasks difficult to use for problems that don't need the full constraints of dependencies.
New additions expected in OpenMP 5.0 and beyond should help with this.
We also need to try new and upcoming features such as task reductions and commutative dependencies from OmpSs.

Sergi Siso: sergi.siso@stfc.ac.uk
Aidan Chalk: aidan.chalk@stfc.ac.uk
Alin Marin Elena: alin-marin.elena@stfc.ac.uk
Luke Mason: luke.mason@stfc.ac.uk
James Clark: james.clark@stfc.ac.uk
http://www.hartree.stfc.ac.uk
Thanks For Your Attention. Any Questions?