Knights Landing Scalability and the Role of Hybrid Parallelism


Sergi Siso (1), Aidan Chalk (1), Alin Elena (2), James Clark (1), Luke Mason (1)
(1) Hartree Centre @ STFC - Daresbury Labs
(2) Scientific Computing Department @ STFC - Daresbury Labs

RSE'17, Tuesday 5 September 2017

New Hartree Scafell Pike System

Bull Sequana X1000 supercomputer, ~4 PFLOP/s:
- 800+ nodes with 2x Intel Xeon Gold 6142
- 800+ self-hosted Intel KNL nodes
- 24 high-memory nodes with 1 TB of memory
- 30 data-hierarchy nodes, each equipped with a self-hosted KNL, 384 GB of memory and a local NVMe drive

A Xeon Phi KNL Access Programme is planned.

Threading & vectorization importance

[Chart: peak performance normalized against Ivy Bridge serial, comparing peak serial, peak threading, peak vectorization, and peak threading & vectorization on Ivy Bridge E5-2697 v2, Broadwell E5-2697A v4, Xeon Phi KNC 5110P and Xeon Phi KNL 7210.]

Software implications

Code needs to be modernized to benefit from newer platforms: vectorization, threading and micro-architecture optimizations, while still leveraging MPI for multi-KNL scalability. To manage the resulting complexity, software needs good software-engineering abstractions that separate the parallel and platform-specific optimizations from the science domain, such as task-based parallelism and domain-specific languages.
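
As an illustration of the threading-plus-vectorization combination measured in the chart above, a stencil-style update can be threaded across rows with OpenMP and vectorized along the contiguous inner loop. This is a generic sketch, not code from DL_MESO or DL_POLY; the relax kernel and array names are hypothetical.

    #include <stddef.h>

    /* Threaded + vectorized update of an m x n field: OpenMP distributes
       rows across threads, and the simd directive asks the compiler to
       vectorize the unit-stride inner loop. */
    void relax(double *restrict out, const double *restrict in,
               size_t m, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 1; i + 1 < m; i++) {
            #pragma omp simd
            for (size_t j = 1; j + 1 < n; j++) {
                out[i*n + j] = 0.25 * (in[(i-1)*n + j] + in[(i+1)*n + j]
                                     + in[i*n + j-1]  + in[i*n + j+1]);
            }
        }
    }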

Hartree IPCC: code modernization

- Code generation tool
- Task-based parallelism: DL_POLY (today's presentation)
- Code optimization: DL_MESO (today's presentation)

DL_MESO: Lattice Boltzmann

DL_MESO LBE: Old Code

Lattice Boltzmann with multiple fluids and multiple phases, coupled with solute diffusion and heat transfer.

[Chart: DL_MESO Lattice Boltzmann scalability (BGK Shan Chen with 4 fluids, size 160^3); OpenMP speedup and MPI speedup against ideal scaling.]

Intel VTune showed significant memory and vectorization issues (poor performance on KNC and KNL).

DL_MESO MINILBE: Current Code

[Chart: DL_MESO MINILBE performance in MLUPS (BGK Shan Chen with 4 fluids, size 160^3), original vs. optimized code, on 2x Intel Xeon E5-2697 v2, Intel Xeon Phi 5110P and Intel Xeon Phi 7210.]

2.5x speedup when moving to KNL (the code is memory bound). A promising result, but OpenMP alone only allows a single KNL.

DL_MESO MINILBE: Future

Reintroduce MPI to minilbe (work in progress). Goal: take the minilbe code and add distributed-memory parallelism while retaining performance. The large drop in performance when moving to MPI is mainly due to halo swaps: there are 4 sends and 4 receives for each dimension, and the data is not in contiguous memory regions (one way to handle this without manual packing is sketched below).
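
A common way to exchange strided halo data without manual packing is an MPI derived datatype. Below is a minimal sketch for one dimension of a 2D row-major block; the function, layout and neighbour ranks are illustrative assumptions, not the actual minilbe code.

    #include <mpi.h>

    /* Swap the left/right halo columns of an m x n row-major block.
       Columns are strided in memory, so describe one with MPI_Type_vector
       (m blocks of 1 double, stride n) instead of packing by hand. */
    void halo_swap_x(double *a, int m, int n, int left, int right,
                     MPI_Comm comm)
    {
        MPI_Datatype column;
        MPI_Type_vector(m, 1, n, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* columns 0 and n-1 are halos; 1 and n-2 are the interior edges */
        MPI_Sendrecv(&a[1],     1, column, left,  0,
                     &a[n - 1], 1, column, right, 0, comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&a[n - 2], 1, column, right, 1,
                     &a[0],     1, column, left,  1, comm, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
    }

MPI_Type_create_subarray generalizes the same idea to the face exchanges of a 3D lattice.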

DL_POLY

DL_POLY_4

A classical molecular dynamics code developed at Daresbury Lab, mainly by Dr I. Todorov and Dr W. Smith. It is suitable for a wide variety of simulation types, including biological systems and materials under extreme conditions. This work focused on the two-body interactions, in particular the Van der Waals forces. DL_POLY_4 uses a Verlet list + link-cell approach for neighbour finding; we rewrote the neighbour finding to use a sorted cell-list approach [1], which does not store neighbour lists but computes them on the fly.

[1] Gonnet, Pedro. "Efficient and scalable algorithms for smoothed particle hydrodynamics on hybrid shared/distributed-memory architectures." SIAM Journal on Scientific Computing 37.1 (2015): C95-C121.
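
The idea behind the sorted cell-list, in a simplified sketch: for each pair of neighbouring cells, project the particles onto the axis joining the cell centres, sort by that projection, and sweep the two sorted lists so that most pairs are rejected with a single 1D comparison (the separation along the axis is a lower bound on the 3D distance) before any full distance is computed. The data layout and the compute_force helper below are hypothetical; the real DL_POLY_4 kernel differs.

    /* hypothetical force kernel applied to a close pair */
    extern void compute_force(int i, int j, double r2);

    typedef struct { double x[3]; } part;

    /* Interact two neighbouring cells A and B. da/db hold each cell's
       particle positions projected onto the unit vector from A's centre
       to B's centre, sorted ascending; ia/ib map back to particle indices. */
    void cell_pair(const part *p,
                   const int *ia, const double *da, int na,
                   const int *ib, const double *db, int nb, double rcut)
    {
        for (int i = 0; i < na; i++) {
            for (int j = 0; j < nb; j++) {
                if (db[j] - da[i] > rcut)
                    break;            /* axis gap > rcut: later j only grow */
                double r2 = 0.0;
                for (int k = 0; k < 3; k++) {
                    double d = p[ia[i]].x[k] - p[ib[j]].x[k];
                    r2 += d * d;
                }
                if (r2 < rcut * rcut)
                    compute_force(ia[i], ib[j], r2);
            }
        }
    }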

DL_POLY: Hybrid MPI+OpenMP

[Chart: time to solution on KNC.] Work done by Alin Elena while at the ICHEC IPCC (Dublin). MPI-only starts with much better performance, but the hybrid implementation beats MPI-only in the SMT region.

DL_POLY using task-based parallelism

We implemented a new version of the shared-memory parallelism using OpenMP task-based parallelism. The computation is divided into interdependent tasks; a task can be executed by any thread once its dependencies are satisfied. Parallelism and load balancing are handled by the runtime: the programmer merely tells the runtime what the tasks are and what data each task requires.
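
In OpenMP this pattern is expressed with task depend clauses. A minimal sketch with generic per-cell arrays follows; sort_cell and pair_forces are hypothetical kernels, not DL_POLY's actual task graph.

    /* hypothetical per-cell kernels */
    extern void sort_cell(double *c);
    extern void pair_forces(double *a, double *b);

    /* One task per cell sort and one per cell-pair force computation.
       The runtime starts a pair task on any free thread as soon as both
       of its cells have been sorted. */
    void run_tasks(double **cell, int ncell)
    {
        #pragma omp parallel
        #pragma omp single
        {
            for (int c = 0; c < ncell; c++) {
                #pragma omp task depend(out: cell[c][0])
                sort_cell(cell[c]);
            }
            for (int c = 0; c + 1 < ncell; c++) {
                #pragma omp task depend(inout: cell[c][0], cell[c+1][0])
                pair_forces(cell[c], cell[c + 1]);
            }
        }   /* all tasks finish by the parallel region's implicit barrier */
    }

Note that inout serializes pair tasks sharing a cell even though force updates commute; that over-constraint is exactly what the commutative dependencies and task reductions discussed below aim to relax.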

DL_POLY taskyield on KNL

Other attempts:
- OpenMP 5.0 task reductions (unavailable as of Intel 18 beta update 0).
- OmpSs commutative dependencies (in progress, currently poor performance).
- QuickSched + conflicts (unsuccessful; we could not explain the performance loss).
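
For reference, the usual taskyield idiom pairs omp_test_lock with taskyield, so a thread whose cells are busy runs another ready task instead of spinning. This is a generic sketch of the idiom, not DL_POLY's implementation; cell_lock and pair_forces are hypothetical.

    #include <omp.h>

    extern omp_lock_t cell_lock[];          /* one lock per cell */
    extern void pair_forces(int a, int b);

    /* Spawn a task needing exclusive access to cells a and b. On failing
       to take both locks, release anything held (no hold-and-wait, hence
       no deadlock) and yield so the thread can run another ready task. */
    void spawn_pair_task(int a, int b)
    {
        #pragma omp task firstprivate(a, b)
        {
            for (;;) {
                if (omp_test_lock(&cell_lock[a])) {
                    if (omp_test_lock(&cell_lock[b]))
                        break;              /* got both locks */
                    omp_unset_lock(&cell_lock[a]);
                }
                #pragma omp taskyield
            }
            pair_forces(a, b);
            omp_unset_lock(&cell_lock[b]);
            omp_unset_lock(&cell_lock[a]);
        }
    }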

Scalability of the parallel region with taskyield on KNL

[Chart: scaling of the task-based parallel region with taskyield on KNL.]

DL_POLY Conclusions

The first OpenMP-tasks implementation works well, though some of the task relationships available in other libraries are still lacking; this makes the OpenMP constructs harder to use for problems that do not need the full constraints of dependencies. New additions expected in OpenMP 5.0 and beyond should help with this. We also need to try new and upcoming features such as task reductions and commutative dependencies from OmpSs.

Thanks for your attention. Any questions?

Sergi Siso: sergi.siso@stfc.ac.uk
Aidan Chalk: aidan.chalk@stfc.ac.uk
Alin Marin Elena: alin-marin.elena@stfc.ac.uk
Luke Mason: luke.mason@stfc.ac.uk
James Clark: james.clark@stfc.ac.uk

http://www.hartree.stfc.ac.uk