The Mont-Blanc project Updates from the Barcelona Supercomputing Center

The Mont-Blanc project: Updates from the Barcelona Supercomputing Center
Filippo Mantovani
montblanc-project.eu - @MontBlanc_EU
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement n. 671697.

The legacy Mont-Blanc vision
Vision: to leverage the fast-growing market of mobile technology for scientific computation, HPC and data centers.
Timeline (2012-2018): Mont-Blanc, Mont-Blanc 2, Mont-Blanc 3.

The legacy Mont-Blanc vision
Vision: to leverage the fast-growing market of mobile technology for scientific computation, HPC and data centers.
The phases (Mont-Blanc, Mont-Blanc 2, Mont-Blanc 3; 2012-2018) share a common structure:
- Experiment with real hardware: Android dev-kits, mini-clusters, prototypes, production-ready systems
- Push software development: system software, HPC benchmarks/mini-apps/production codes
- Study next-generation architectures: learn from hardware deployment and evaluation for planning new systems

Hardware platforms
We started here ... and we ended up here.
N. Rajovic et al., "The Mont-Blanc Prototype: An Alternative Approach for HPC Systems", in Proceedings of SC'16, pp. 38:1-38:12.

System software and use cases
We started here:
- Source files (C, C++, Fortran, Python, ...)
- Compilers: GNU, Arm HPC, Mercurium
- Scientific libraries: ATLAS, FFTW, HDF5, clBLAS, LAPACK, Boost, PETSc, Arm PL
- Developer tools: Extrae, Perf, Scalasca, Allinea
- Cluster management: Nagios, Ganglia, Puppet, SLURM, OpenLDAP, NTP
- Runtime libraries: Nanos++, OpenCL, CUDA, MPI
- OS: Linux / Ubuntu, OpenCL driver, network driver
- Hardware support / storage: DVFS, power monitor, NFS, Lustre
- Hardware: CPUs, GPU, network
We ended up here:
- Different OS flavors
- Arm HPC Compiler
- Arm Performance Libraries
- Allinea tools
- All well packed and distributed through OpenHPC
Several complex HPC production codes have run on Mont-Blanc: Alya, AVL codes, WRF, FEniCS.

Study of next-generation architectures
We started here ... and we ended up here.
A Multi-level Simulation Approach (MUSA) allows us:
- To gather performance traces on any current HPC architecture
- To replay them using almost any architecture configuration
- To study scalability and performance figures at scale, changing the number of simulated MPI processes
Credits: N. Rajovic; MUSA team @ BSC

Where is BSC contributing today?
Evaluation of solutions
- Hardware solutions: mini-clusters deployed, liaising with SoC providers and system integrators
- Software solutions: Arm Performance Libraries, Arm HPC Compiler
  F. Banchelli et al., "Is Arm software ecosystem ready for HPC?", poster at SC17.
Use cases
- Alya: finite element code where we experiment with atomics-avoiding techniques
  GOAL: test new runtime features to be pushed into OpenMP
- HPCG: benchmark where we started looking at vectorization
  GOAL: explore techniques for exploiting the Arm Scalable Vector Extension
Simulation of next-generation large clusters
- MUSA: combining detailed trace-driven simulation with sampling strategies to explore how architectural parameters affect performance at scale.
  T. Grass et al., "MUSA: A Multi-level Simulation Approach for Next-Generation HPC Machines", in SC16 proceedings, pp. 526-537.

Evaluation of Arm Performance Libraries
Goal: test an HPC code making use of arithmetic and FFT libraries
Method: Quantum Espresso (pwscf input), compiled with GCC 7.1.0
Platform configuration #1 (poster at SC17): AMD Seattle; Arm PL 2.2, ATLAS 3.11.39, OpenBLAS 0.2.20, FFTW 3.3.6
Platform configuration #2: Cavium ThunderX2; Arm PL 18.0, OpenBLAS 0.2.20, FFTW 3.3.7
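These libraries can be compared against the same application source because they expose the same standard interfaces: CBLAS for dense linear algebra and the FFTW3 API for transforms (Arm PL provides an FFTW3-compatible layer). The sketch below is illustrative only and is not taken from the slides or from Quantum Espresso; matrix and transform sizes are arbitrary.

```c
/* Minimal sketch, assuming a CBLAS + FFTW3 installation.
 * ATLAS, OpenBLAS and Arm PL all implement the standard CBLAS calls, so
 * the same source can simply be relinked against each library for a fair
 * comparison. Header names may vary per installation (e.g. armpl.h). */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <fftw3.h>

int main(void) {
    const int n = 256;

    /* Dense linear algebra through CBLAS: C = 1.0 * A * B + 0.0 * C */
    double *A = calloc((size_t)n * n, sizeof(double));
    double *B = calloc((size_t)n * n, sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));
    for (int i = 0; i < n; ++i) A[i * n + i] = B[i * n + i] = 1.0;
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    /* 1-D complex-to-complex transform through the FFTW3 API */
    fftw_complex *x = fftw_malloc(sizeof(fftw_complex) * (size_t)n);
    for (int i = 0; i < n; ++i) { x[i][0] = (double)i; x[i][1] = 0.0; }
    fftw_plan p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);

    printf("C[0][0] = %.1f, X[0] = (%.1f, %.1f)\n", C[0], x[0][0], x[0][1]);

    fftw_destroy_plan(p);
    fftw_free(x);
    free(A); free(B); free(C);
    return 0;
}
```

Swapping libraries is then a matter of relinking, e.g. against Arm PL in one build and OpenBLAS plus FFTW in another; the exact link flags depend on the installation.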

Evaluation of the Arm HPC Compiler
Goal: evaluate the Arm HPC Compiler v18.0 vs v1.4
Method: run the Polybench benchmark suite (30 benchmarks, by Ohio State University) on Cavium ThunderX2
Metrics: execution time increment, v18.0 vs v1.4; SIMD instructions, v18.0 vs v1.4
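Polybench kernels are small, self-contained loop nests, which makes the suite a convenient probe of the auto-vectorizer: differences between compiler versions show up directly in run time and in the number of SIMD instructions executed. The kernel below is a gemm-like loop nest written in the Polybench style for illustration; it is not the exact benchmark source.

```c
/* Polybench-style loop nest (illustrative). How well each compiler version
 * vectorizes loops like the inner reduction is reflected in the executed
 * SIMD instruction count and in the measured execution time. */
#define N 1024
static double A[N][N], B[N][N], C[N][N];

void kernel_gemm(double alpha, double beta) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] *= beta;
            for (int k = 0; k < N; k++)
                C[i][j] += alpha * A[i][k] * B[k][j];
        }
}
```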

High Performance Conjugate Gradient (HPCG)
Problem:
- Scalability of HPCG is very limited
- OpenMP parallelization of the reference HPCG version is poor
Goals:
1. Improve the OpenMP parallelization of HPCG
2. Study current auto-vectorization for leveraging SVE
3. Analyze other performance limitations (e.g. cache effects)
[Figures: speed-up vs. number of OpenMP threads (1 to 28), Arm HPC Compiler 1.4 vs. GCC 7.1.0, on Cavium ThunderX2]
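The scalability limit comes from the sparse kernels that dominate HPCG's run time: the sparse matrix-vector product parallelizes cleanly over rows, while the symmetric Gauss-Seidel smoother does not (see the next slide). A rough row-parallel SpMV in a CSR layout is sketched below; it is illustrative and not the actual HPCG source.

```c
#include <omp.h>

/* Row-parallel CSR SpMV: y = A*x. Each row is independent, so a plain
 * "parallel for" scales with the thread count; the SYMGS smoother lacks
 * this property, which is why the reference HPCG scales poorly. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```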

HPCG - SIMD parallelization
First approach: check auto-vectorization on current platforms
Method: count SIMD instructions in the ComputeSYMGS region
- On Cavium ThunderX2, using the Arm HPC Compiler v18.0
- On Intel Xeon Platinum 8160 (Skylake), using ICC with AVX-512 support
[Figure: SIMD instruction counts, in units of 10^6]
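ComputeSYMGS performs a symmetric Gauss-Seidel sweep in which each row update reads values written earlier in the same sweep through indirect indexing, so the outer loop carries dependencies and only the short inner reduction is available for SIMD. The forward sweep below is a simplified sketch of that structure, not the HPCG reference code.

```c
/* Forward sweep of a Gauss-Seidel smoother on a CSR matrix.
 * x[col_idx[k]] may refer to entries updated earlier in this sweep, so the
 * outer loop cannot be vectorized or parallelized as-is; only the short
 * inner reduction is a candidate for SIMD. */
void symgs_forward(int nrows, const int *row_ptr, const int *col_idx,
                   const double *val, const double *diag,
                   const double *r, double *x) {
    for (int i = 0; i < nrows; ++i) {
        double sum = r[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum -= val[k] * x[col_idx[k]];
        sum += diag[i] * x[i];   /* the diagonal term was subtracted above */
        x[i] = sum / diag[i];
    }
}
```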

HPCG - SVE emulation
First approach: check auto-vectorization when SVE is enabled
Method:
- Evaluate auto-vectorization over a whole execution of HPCG (one iteration)
- Generate the binary using the Arm HPC Compiler v1.4 with SVE enabled
- Emulate the SVE instructions using the Arm Instruction Emulator on Cavium ThunderX2
[Figure: increment in SIMD instructions relative to NEON, for SVE vector lengths of 128, 256, 512, 1024 and 2048 bits]
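SVE code is vector-length agnostic: the compiler emits predicated loops that adapt to whatever vector length the hardware, or here the emulator, provides, so the same binary can be replayed at 128- to 2048-bit vectors and only the executed SIMD instruction count changes. A trivially vectorizable loop of the kind that benefits is sketched below; `-march=armv8-a+sve` is a common way to request SVE code generation, though the exact flags depend on the compiler version.

```c
/* A trivially vectorizable update; with SVE enabled the compiler can emit
 * length-agnostic, predicated vector code, so running the same binary under
 * the Arm Instruction Emulator at different vector lengths changes the
 * number of SIMD instructions executed, not the binary itself. */
void daxpy(long n, double a, const double *restrict x, double *restrict y) {
    for (long i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```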

HPCG - Memory access evaluation
The cache hit ratio degrades when using multi-coloring approaches:
- Data refer to ComputeSYMGS
- Gathered on Cavium ThunderX2, compiled with GCC
- ~13% L1D miss ratio, ~35% L2D miss ratio
Next steps:
- Optimize data access patterns in memory
- Simulate SVE gather-load instructions in order to quantify the benefits
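Multi-coloring restores parallelism in the smoother by making rows of the same color independent, but sweeping the matrix color by color touches rows scattered through memory, which is consistent with the degraded hit ratios; the indexed loads through the column-index array are exactly the pattern SVE gather loads target. The colored sweep below is an illustrative sketch with made-up coloring data structures, not the code actually measured.

```c
#include <omp.h>

/* Colored Gauss-Seidel sweep: rows within one color are independent, so each
 * color can be processed in parallel, but the rows of a color are scattered
 * through the matrix, so row_ptr/val/col_idx/x are no longer walked
 * contiguously (the likely source of the higher L1D/L2D miss ratios). */
void symgs_colored(int ncolors, const int *color_ptr, const int *color_rows,
                   const int *row_ptr, const int *col_idx, const double *val,
                   const double *diag, const double *r, double *x) {
    for (int c = 0; c < ncolors; ++c) {
        #pragma omp parallel for schedule(static)
        for (int j = color_ptr[c]; j < color_ptr[c + 1]; ++j) {
            int i = color_rows[j];              /* indirect row index */
            double sum = r[i];
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum -= val[k] * x[col_idx[k]];  /* indexed (gather-like) load */
            sum += diag[i] * x[i];
            x[i] = sum / diag[i];
        }
    }
}
```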

Alya: BSC code for multi-physics problems
Parallelization of a finite element code; analysis with Paraver.
Reductions with indirect accesses on large arrays, using:
- No coloring: the use of atomic operations harms performance
- Coloring: the use of coloring harms locality
- Commutative multidependences: an OmpSs feature that will hopefully be included in OpenMP
Credits: M. Garcia, J. Labarta
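The three variants differ only in how they protect the indirect updates of the assembly reduction. The sketch below is illustrative: the 8-node element connectivity is invented, and the commutative multidependence pragma follows OmpSs-style syntax, which is not part of standard OpenMP; the exact clause spelling is approximate.

```c
#include <omp.h>

/* Assembly-style reduction: every element scatters a contribution into a
 * large global array through an index list, so different elements may
 * update the same entries. */

/* (a) No coloring: correctness via atomics, which serializes hot entries. */
void assemble_atomic(int nelem, const int (*conn)[8],
                     const double (*elem_val)[8], double *global) {
    #pragma omp parallel for
    for (int e = 0; e < nelem; ++e)
        for (int v = 0; v < 8; ++v) {
            #pragma omp atomic
            global[conn[e][v]] += elem_val[e][v];
        }
}

/* (b) Coloring: elements of one color never share entries, so no atomics are
 * needed, but iterating color by color destroys spatial locality (as in the
 * colored SYMGS sketch above). */

/* (c) OmpSs-style commutative multidependence (approximate syntax): tasks
 * that touch the same entries may run in any order but never concurrently,
 * avoiding both atomics and coloring. */
void assemble_commutative(int nelem, const int (*conn)[8],
                          const double (*elem_val)[8], double *global) {
    for (int e = 0; e < nelem; ++e) {
        #pragma omp task commutative({global[conn[e][v]], v=0;8})
        for (int v = 0; v < 8; ++v)
            global[conn[e][v]] += elem_val[e][v];
    }
    #pragma omp taskwait
}
```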

Alya: taskification and dynamic load balancing
Goal: quantify the effect of commutative dependences and DLB on an HPC code
Method: run the assembly phase of Alya (the phase containing the atomics)
- On MareNostrum 3: 2x Intel Xeon SandyBridge-EP E5-2670
- On Cavium ThunderX: 2x CN8890
Configuration: 16 nodes x P processes/node x T threads/process
Credits: M. Josep, M. Garcia, J. Labarta

Multi-Level Simulation Approach
Level 1: trace generation, during the HPC application execution
- OpenMP runtime system plugin: task/chunk creation events, dependencies
- MPI call instrumentation: MPI calls
- Pintool / DynamoRIO: dynamic instructions
Output: a trace
Credits: T. Grass, C. Gomez, M. Casas, M. Moreto

Multi-Level Simulation Approach
Level 2: network simulation (Dimemas) - the trace is replayed in the network simulator, rank by rank over time
Level 3: multi-core simulation (TaskSim + Ramulator + McPAT) - thread by thread over time
Credits: T. Grass, C. Gomez, M. Casas, M. Moreto

Multi-Level Parameters
Architectural: CPU architecture, number of cores, core frequency, threads per core, reorder buffer size, SIMD width
Micro-architectural: L1/L2/L3 cache size and latency
Main memory: memory technology, capacity, bandwidth, latency
Problem: simulation time diverges
Solution: we supported different modes (Burst, Detailed, Sampling), trading accuracy for speed
Credits: T. Grass, C. Gomez, M. Casas, M. Moreto

MUSA: status
SC'16 paper:
- Validation of the methodology with 5 applications: BT-MZ, SP-MZ, LU-MZ, HYDRO, SPECFEM3D
- Proven performance figures at scale, up to 16k MPI ranks
Status update:
- Added parameter sets for state-of-the-art architectures
- Support for power consumption modeling, including CPU, NoC and memory hierarchy
- Incremented set of applications
- Expanded trace database, including traces gathered on MareNostrum4 (Intel Skylake + Omni-Path)
- Included support for DynamoRIO
Credits: T. Grass, C. Gomez, M. Casas, M. Moreto

Student Cluster Competition
Rules:
- 12 teams of 6 undergraduate students
- 1 cluster operating within a 3 kW power budget
- 3 HPC applications + 2 benchmarks
One team from the Universitat Politècnica de Catalunya (UPC, Spain) is participating with Mont-Blanc technology.
3 awards to win:
- Best HPL
- 1st, 2nd, 3rd overall places
- Fan favorite
We are looking for an Arm-based cluster for 2018!

Interested in any of the topics presented? Visit our booths at SC17: booth #1694, booth #1925, booth #1975.
Follow us: montblanc-project.eu - @MontBlanc_EU - filippo.mantovani@bsc.es