Knights Landing Scalability and the Role of Hybrid Parallelism

Sergi Siso (1), Aidan Chalk (1), Alin Elena (2), James Clark (1), Luke Mason (1)
(1) Hartree Centre, STFC - Daresbury Labs
(2) Scientific Computing, STFC - Daresbury Labs

New Hartree Scafell Pike System
Bull Sequana X1000 Supercomputer, ~4 PFLOP
+800 2x Intel Xeon Gold E5-6142 v5
+800 self-hosted Intel KNL nodes
24 1TB high-memory nodes
30 data-hierarchy nodes equipped with a self-hosted KNL, 384GB of memory and a local NVMe drive
Xeon Phi KNL Access Programme planned.

Tuesday, 05 September 2017 RSE'17 Knight's Landing Parallelism

Threading & vectorization importance
[Figure: peak performance normalized against Ivy Bridge serial. Series: peak serial, peak threading, peak vectorization, peak threading & vectorization. Platforms: Ivy Bridge E v2, Broadwell E5-2697A v4, Xeon Phi KNC 5110p, Xeon Phi KNL 7210.]

Software implications
Code needs to be modernized to benefit from newer platforms: vectorization, threading, micro-architecture optimizations. MPI is still needed for multi-KNL scalability.
We need to deal with the increasing complexity. Software needs good software-engineering abstractions that separate the parallel and platform-specific optimizations from the science domain: task-based parallelism and domain-specific languages.
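As a small illustration of the threading and vectorization points above, the following C sketch combines both on one loop with a single OpenMP directive. The kernel and the name `scaled_add` are hypothetical and are not taken from DL_MESO or DL_POLY.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not from the talk): y[i] += a * x[i].
 * "parallel for" distributes iterations across threads, and "simd" asks the
 * compiler to vectorize each thread's share. If OpenMP is not enabled the
 * pragma is ignored and the loop runs serially with the same result. */
void scaled_add(size_t n, double a, const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);

    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Whether such a loop actually speeds up on KNL depends on memory behaviour as much as on the directive, which is why profiling comes before and after the change.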

Hartree IPCC: Code modernization
Code generation tool
Task-based parallelism: DL_POLY
Code optimization: DL_MESO
Today's presentation: DL_POLY and DL_MESO

DL_MESO Lattice Boltzmann

DL_MESO LBE: Old Code
Lattice Boltzmann with multiple fluids and multiple phases, coupled with solute diffusion and heat transfer.
[Figure: DL_MESO Lattice Boltzmann scalability (BGK Shan Chen with 4 fluids, size 160^3). Series: OpenMP speedup, MPI speedup, ideal.]
Intel VTune showed significant memory and vectorization issues (bad performance on KNC and KNL).

DL_MESO MINILBE: Current Code
[Figure: DL_MESO MINILBE performance in MLUPS (BGK Shan Chen with 4 fluids, size 160^3), original code vs optimized code, on 2 x Intel Xeon E v2, Intel Xeon Phi 5110p and Intel Xeon Phi 7210.]
x2.5 when moving to KNL (memory-bound code).
Promising result, but OpenMP alone only allows for a single KNL.

DL_MESO MINILBE: Future
Reintroduce MPI to minilbe (work in progress).
Goal: take the minilbe code and add distributed-memory parallelism while retaining performance.
The large drop in performance when moving to MPI is mainly due to halo swaps: there are 4 sends and 4 receives for each dimension, and the data is not in contiguous memory regions.
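One common way to deal with non-contiguous halo data is to pack each face into a contiguous buffer before sending and unpack it on arrival. The C sketch below shows only the packing step for a row-major 3D array. It is an illustration under stated assumptions, not the minilbe implementation: the array layout and the names `idx`, `pack_z_face` and `unpack_z_face` are hypothetical, and the MPI calls themselves are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index into an nx*ny*nz array (assumed layout): k varies fastest. */
static size_t idx(size_t i, size_t j, size_t k, size_t ny, size_t nz)
{
    return (i * ny + j) * nz + k;
}

/* Copy the plane k = k_fixed, whose elements are nz apart in memory, into the
 * contiguous buffer buf (capacity nx*ny). Returns the number of values packed. */
size_t pack_z_face(const double *field, size_t nx, size_t ny, size_t nz,
                   size_t k_fixed, double *buf)
{
    assert(k_fixed < nz);
    size_t n = 0;
    for (size_t i = 0; i < nx; ++i)
        for (size_t j = 0; j < ny; ++j)
            buf[n++] = field[idx(i, j, k_fixed, ny, nz)];
    return n;
}

/* Inverse operation: scatter a received contiguous buffer into plane k_fixed. */
size_t unpack_z_face(double *field, size_t nx, size_t ny, size_t nz,
                     size_t k_fixed, const double *buf)
{
    assert(k_fixed < nz);
    size_t n = 0;
    for (size_t i = 0; i < nx; ++i)
        for (size_t j = 0; j < ny; ++j)
            field[idx(i, j, k_fixed, ny, nz)] = buf[n++];
    return n;
}
```

With packing, each face can travel in a single message, at the cost of an extra copy. MPI derived datatypes are an alternative that leaves the copy to the library; which is faster on KNL would need measuring.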

DL_POLY

DL_POLY_4
Classical molecular dynamics code developed at Daresbury Lab, mainly by Dr I Todorov and Dr W Smith. It is suitable for a wide variety of simulation types, including biological systems or materials under extreme conditions.
This work focused on the two-body interactions, in particular the Van der Waals forces.
DL_POLY_4 uses a Verlet list + link-cell approach for neighbour finding. We rewrote the neighbour finding to use a sorted cell-list approach [1], which does not store neighbour lists but computes them on the fly.
[1] Gonnet, Pedro. "Efficient and scalable algorithms for smoothed particle hydrodynamics on hybrid shared/distributed-memory architectures." SIAM Journal on Scientific Computing 37.1 (2015): C95-C121.
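The idea of finding neighbours on the fly from sorted cells can be sketched in C as follows. This is a simplified illustration in the spirit of [1], not the DL_POLY_4 code: particles of one cell are sorted by their projection onto a unit axis, and the sweep stops as soon as the projected gap alone exceeds the cutoff. The types `vec3` and `proj` and the function `count_pairs_sorted` are hypothetical names, and the sketch only counts pairs instead of computing forces.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double x, y, z; } vec3;
typedef struct { double d; size_t id; } proj;  /* projection on axis, particle index */

static int cmp_proj(const void *a, const void *b)
{
    double da = ((const proj *)a)->d, db = ((const proj *)b)->d;
    return (da > db) - (da < db);
}

static double dot(vec3 u, vec3 v) { return u.x * v.x + u.y * v.y + u.z * v.z; }

/* Count pairs (one particle from cell a, one from cell b) closer than rc.
 * axis must be a unit vector, so a projected gap never exceeds the true distance. */
size_t count_pairs_sorted(const vec3 *a, size_t na, const vec3 *b, size_t nb,
                          vec3 axis, double rc)
{
    double len2 = dot(axis, axis);
    assert(len2 > 0.999999 && len2 < 1.000001);

    /* Sort cell b along the axis; no neighbour list is stored. */
    proj *sb = malloc((nb ? nb : 1) * sizeof *sb);
    if (sb == NULL) abort();
    for (size_t j = 0; j < nb; ++j) {
        sb[j].d = dot(b[j], axis);
        sb[j].id = j;
    }
    qsort(sb, nb, sizeof *sb, cmp_proj);

    size_t count = 0;
    for (size_t i = 0; i < na; ++i) {
        double da = dot(a[i], axis);
        for (size_t j = 0; j < nb; ++j) {
            double gap = sb[j].d - da;
            if (gap >= rc) break;      /* every later particle is farther still */
            if (gap <= -rc) continue;  /* too far behind along the axis */
            vec3 q = b[sb[j].id];
            double dx = a[i].x - q.x, dy = a[i].y - q.y, dz = a[i].z - q.z;
            if (dx * dx + dy * dy + dz * dz < rc * rc) ++count;
        }
    }
    free(sb);
    return count;
}
```

In a force kernel the counter would be replaced by the pair interaction; the early `break` is what saves work compared with testing every pair of two neighbouring cells.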

DL_POLY: Hybrid MPI+OpenMP
[Figure: time to solution on KNC.] Work done by Alin Elena, ICHEC IPCC (Dublin).
MPI-only starts with much better performance. The hybrid implementation beats MPI in the SMT region.

DL_POLY using task-based parallelism
We implemented a new version of shared-memory parallelism using OpenMP task-based parallelism.
The computation is divided into interdependent tasks. A task can be executed by any thread, provided its dependencies are satisfied.
The parallelism and load balancing are handled by the runtime; the programmer merely tells the runtime what the tasks are and what data each task requires.
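A minimal C sketch of this model, assuming nothing about the actual DL_POLY task graph: two independent tasks produce values and a third task consumes both. The `depend` clauses are the only scheduling information given to the runtime. The function name `run_task_graph` and the arithmetic are hypothetical.

```c
#include <assert.h>

/* Hypothetical three-task graph: a and b can be computed in any order or in
 * parallel; c must wait for both. Without OpenMP the pragmas are ignored and
 * the statements run in program order, giving the same result. */
int run_task_graph(int input)
{
    int a = 0, b = 0, c = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a) shared(a)
        a = input + 1;

        #pragma omp task depend(out: b) shared(b)
        b = input * 2;

        /* Runs only after both producers have finished. */
        #pragma omp task depend(in: a, b) depend(out: c) shared(a, b, c)
        c = a + b;
    }   /* implicit barrier: all tasks are complete here */

    assert(c == a + b);  /* sanity check on the dependency ordering */
    return c;
}
```

Calling `run_task_graph(3)` yields 10 regardless of which threads execute which task, because the ordering comes from the declared data dependencies and not from the thread count.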

DL_POLY Taskyield on KNL
Other attempts:
OpenMP 5.0 reductions (unavailable as of Intel 18 beta update 0).
OmpSs commutative dependencies (in progress, currently poor performance).
QuickSched + conflicts (unsuccessful; couldn't explain the performance loss).
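The slides do not show how taskyield was used. One common pattern, sketched here in C purely as an assumption, protects each shared cell with a lock and lets a task that fails to acquire it yield to other ready tasks instead of spinning. The names `accumulate_with_taskyield` and `NCELLS`, and the use of C11 `atomic_flag` as the lock, are hypothetical choices for this example.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCELLS 4  /* hypothetical number of shared cells */

/* Each task adds its index t to cell t % NCELLS. Tasks that target the same
 * cell conflict; a task that cannot take the cell's lock yields so the thread
 * can run another task meanwhile. Returns the sum over all cells. */
long accumulate_with_taskyield(int ntasks)
{
    atomic_flag lock[NCELLS];
    long cell[NCELLS] = {0};
    assert(ntasks >= 0);
    for (int c = 0; c < NCELLS; ++c)
        atomic_flag_clear(&lock[c]);

    #pragma omp parallel
    #pragma omp single
    for (int t = 0; t < ntasks; ++t) {
        #pragma omp task firstprivate(t) shared(lock, cell)
        {
            int c = t % NCELLS;
            while (atomic_flag_test_and_set(&lock[c])) {
                #pragma omp taskyield
            }
            cell[c] += t;                 /* exclusive access to this cell */
            atomic_flag_clear(&lock[c]);  /* release before the task ends */
        }
    }

    long total = 0;
    for (int c = 0; c < NCELLS; ++c)
        total += cell[c];
    return total;
}
```

The lock is never held across a yield, so a waiting task cannot block the task that owns the cell. This expresses mutual exclusion without imposing an order, which is the kind of relationship that plain `depend` clauses cannot state.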

[Figure: scalability of the parallel region with taskyield on KNL.]

DL_POLY Conclusions
The first OpenMP-tasks implementation works well, though some of the task relationships available in other libraries are still lacking. This can make OpenMP tasks difficult to use for problems that don't need the full constraints of dependencies.
New additions expected in OpenMP 5.0 and beyond should help with this.
We also need to try new and upcoming features such as task reductions, and commutative dependencies from OmpSs.
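For reference, task reductions as later standardized in OpenMP 5.0 look roughly like the C sketch below. It assumes a compiler with OpenMP 5.0 task-reduction support, which was not available to the authors at the time, and the function `sum_with_task_reduction` is a hypothetical example and not DL_POLY code.

```c
#include <assert.h>

/* Sum v[0..n-1] using one task per chunk. Each task accumulates into its own
 * private copy of total, and the runtime combines the copies when the
 * taskgroup ends, so no locks or dependencies between tasks are needed.
 * Without OpenMP the pragmas are ignored and the loops run serially. */
double sum_with_task_reduction(const double *v, int n, int chunk)
{
    double total = 0.0;
    assert(chunk > 0);

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+: total)
        {
            for (int start = 0; start < n; start += chunk) {
                #pragma omp task in_reduction(+: total) firstprivate(start)
                {
                    int end = (start + chunk < n) ? start + chunk : n;
                    for (int i = start; i < end; ++i)
                        total += v[i];
                }
            }
        }
    }
    return total;
}
```

Tasks that only accumulate into a shared quantity do not have to be ordered against each other here, which addresses the case of problems that don't need the full constraints of dependencies.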

Thanks for your attention. Any questions?
Sergi Siso, Aidan Chalk, Alin Marin Elena, Luke Mason, James Clark
