Butterfly effect of porting scientific applications to ARM-based platforms

Transcription:

montblanc-project.eu @MontBlanc_EU Butterfly effect of porting scientific applications to ARM-based platforms. Filippo Mantovani, September 12th, 2017. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 671697.

Mont-Blanc projects at a glance (2012-2018). Vision: to leverage the fast-growing market of mobile technology for scientific computation, HPC and data centers.
Mont-Blanc. Hardware: mobile technology (ARMv7). Software: enabling ARM-based clusters. Next-generation studies: gathering information from the first prototypes and extrapolating.
Mont-Blanc 2. Hardware: mobile and server technology (ARMv8). Software: massive improvement of the ecosystem (development tools, resiliency). Next-generation studies: pre-exascale HPC compute node.
Mont-Blanc 3. Hardware: server technology (ARMv8). Software: focusing on the study of mini-apps and real industrial applications. Next-generation studies: focusing on an HPC SoC. Goal: get a market-ready system!

Mont-Blanc projects at a glance (continued). More details on Mont-Blanc, Mont-Blanc 2 and Mont-Blanc 3 at the Collaborative Research Projects Workshop, Wednesday 13 September 2017, 9:50, Plenary Room, Crausaz Wordsworth Building (Etienne Walter).

Mont-Blanc contributions. Scientific applications: porting and benchmarking of mini-apps and full-scale applications; scalability studies on real ARM-based platforms.

The Mont-Blanc prototype ecosystem (2012-2017). Prototypes are critical to accelerate software development (system software stack + applications).
PRACE prototypes: Tibidabo, Carma, Pedraforca.
Mini-clusters: Arndale, Odroid XU, Odroid XU-3, NVIDIA Jetson.
Mont-Blanc prototype: 1080 compute cards with a fine-grained power monitoring system; installed between January and May 2015, operational since May 2015 at BSC.
ARM 64-bit mini-clusters: APM X-Gene 2, Cavium ThunderX, NVIDIA TX1.
Mont-Blanc 3 demonstrator: based on the new-generation ARM 64-bit Cavium ThunderX2 SoC, targeting the HPC market.

The first Mont-Blanc prototype.
Compute card (8.5 cm x 5.6 cm): Exynos 5 Dual SoC with 2 ARM Cortex-A15 cores and 1 ARM Mali-T604 GPU; 4 GB of LPDDR3-1600 memory; microSD local storage up to 64 GB; USB 3.0 to 1 GbE network bridge.
System: 2 racks, 8 BullX chassis, 72 compute blades, 1080 compute cards, 2160 CPUs, 1080 GPUs, 4.3 TB of DRAM, 17.2 TB of flash.
Operational since May 2015 at BSC.

The Mont-Blanc 3 demonstrator. Codename: Dibona.
48 compute nodes with 96 Cavium ThunderX2 CPUs: 3000 cores, 12000 threads.
InfiniBand EDR switches, management network, redundant management server and storage, power supply units, direct liquid cooling.

System software stack for ARM.
Source files (C, C++, FORTRAN, Python, ...).
Compilers: GNU, Arm HPC Compiler, Mercurium.
Scientific libraries: ATLAS, FFTW, HDF5, clBLAS, LAPACK, Boost, PETSc, Arm Performance Libraries.
Developer tools: Extrae, Perf, Allinea, Scalasca.
Cluster management: Nagios, Ganglia, Puppet, SLURM, OpenLDAP, NTP.
Runtime libraries: Nanos++, OpenCL, CUDA, MPI.
Hardware support / storage: Linux OS (Ubuntu), DVFS, OpenCL driver, network driver, NFS, Lustre, power monitor.
The stack has been developed since 2011, today in collaboration with all major OpenHPC partners; it is tested on several ARM-based platforms and is mostly based on open-source packages. There is also an effort to standardize power-measurement formats for efficient use of existing systems. A minimal usage sketch follows below.
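As an illustration of how the pieces fit together, here is a minimal hybrid MPI+OpenMP C program of the kind this stack targets. The mpicc build line is an assumed local setup (any of the listed compilers could sit behind the MPI wrapper), not a prescription.

/* hello_hybrid.c - minimal MPI+OpenMP sketch.
 * Assumed build: mpicc -fopenmp hello_hybrid.c -o hello_hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI rank spawns an OpenMP team, e.g. one rank per socket
     * and one thread per core on an ARM-based compute node. */
    #pragma omp parallel
    printf("rank %d/%d, thread %d/%d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}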

Scientific applications: methodology.
Applications: benchmarks, mini-apps, production/industrial codes. We trace applications with the objective of:
Testing current solutions and providing feedback to technology providers: exercising the SoC (e.g. PAPI on Cavium ThunderX), measuring power consumption and correlating it with performance, and evaluating the Arm HPC Compiler.
Understanding code limitations and helping the developers restructure the code: applying OmpSs/OpenMP 4.0 and analyzing the effect, quantifying the benefit of taskification, and exploring new techniques, e.g. Dynamic Load Balancing.
Performing extrapolation studies using next-generation machine parameters: MUltiscale Simulation Architecture; tech report: http://upcommons.upc.edu/handle/2117/107063; poster accepted at SC 17.

PAPI support for Cavium ThunderX.
Hardware performance counters are handled by a Performance Monitor Unit (PMU) that is part of the SoC, so accessing them is SoC-dependent. To access the PMU, the Device Tree must include the PMU definition and the Linux kernel must implement a way to reach it. We patched kernel 4.2.6; native support for Cavium is provided from kernel 4.4 onwards.
We extended the PAPI and libpfm libraries to support the ThunderX PMU: providing all the functions and definitions needed, including both common and implementation-defined PMUv3 events, and making several PAPI events available. The patch is published here: https://goo.gl/mxn2h2
Credits: L. Wikelius et al. @ Cavium
A usage sketch follows below.
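Once the patched stack is in place, counters can be read with the standard PAPI low-level API. A minimal sketch, assuming the two preset events below are exposed by the PMU (event availability is kernel- and SoC-dependent):

/* papi_ipc.c - count instructions and cycles around a region of
 * interest; the dummy loop stands in for a real kernel. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);  /* total instructions */
    PAPI_add_event(evset, PAPI_TOT_CYC);  /* total cycles */

    PAPI_start(evset);

    volatile double x = 0.0;              /* region of interest */
    for (long i = 0; i < 10000000L; i++)
        x += (double)i * 0.5;

    PAPI_stop(evset, counts);
    printf("instructions = %lld, cycles = %lld, IPC = %.2f\n",
           counts[0], counts[1], (double)counts[0] / (double)counts[1]);
    return EXIT_SUCCESS;
}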

Lattice Boltzmann D2Q37.
A fluid dynamics code (MPI+OpenMP) for simulating, e.g., the mixing-layer evolution of fluids at different temperatures/densities.
Simple structure: serial initialization and closing around a time loop of two kernels, propagate (memory bound) and collide (compute bound); see the structural sketch below.
Credits: E. Calore
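A structural sketch of that pattern, with placeholder kernel bodies and lattice sizes (the MPI domain decomposition of the real code is omitted; this is not the D2Q37 source):

/* d2q37_skeleton.c - serial init/close around a time loop that
 * alternates a memory-bound propagate and a compute-bound collide. */
#include <string.h>

#define NX   256
#define NY   256
#define NPOP 37                      /* D2Q37: 37 populations per site */
#define N    ((long)NX * NY * NPOP)

static double f_old[N], f_new[N];

static void propagate(void)          /* memory bound: data movement */
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        f_new[i] = f_old[i];         /* stand-in for the streaming stencil */
}

static void collide(void)            /* compute bound: per-site arithmetic */
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        f_old[i] = 0.99 * f_new[i] + 0.01; /* stand-in for the collision operator */
}

int main(void)
{
    memset(f_old, 0, sizeof f_old);  /* serial initialization */
    for (int t = 0; t < 100; t++) {
        propagate();
        collide();
    }
    return 0;                        /* serial closing */
}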

D2Q37 clustering analysis (on ThunderX).
Different runs were performed over the same lattice size with a varying number of threads: 6, 8, 12, 16, 24, 32, 48. The collide function scales almost perfectly up to 48 threads; for propagate, increasing the number of threads makes them compete for the same resource (memory). Access to hardware counters is what enables this kind of study.

Power traces on Mont-Blanc mini-clusters.
Power measurements on the mini-clusters are highly heterogeneous: they come from different devices (in-band, out-of-band, ...), with different sampling rates and with synchronization issues. However, we tried to make power monitoring easy for the user: thanks to the SLURM integration, simply by launching jobs to a *-power queue, users get power traces in CSV and Paraver formats.

D2Q37 Correlating performance, power, energy.
Even with its limitations (time shift, coarse power measurements), the user can access power figures: a timeline of instantaneous power consumption, and energy-to-solution in one click; a sketch of that computation follows below.
Credits: Enrico Calore
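For intuition, energy-to-solution is just the power timeline integrated over the run. A minimal sketch, assuming a hypothetical "time_s,power_W" CSV layout (not necessarily the prototype's actual trace format):

/* energy.c - trapezoidal integration of a power trace read from
 * stdin, one "time,power" sample per line. */
#include <stdio.h>

int main(void)
{
    double t, p, t_prev = 0.0, p_prev = 0.0, energy = 0.0;
    int first = 1;

    while (scanf("%lf,%lf", &t, &p) == 2) {
        if (!first)
            energy += 0.5 * (p + p_prev) * (t - t_prev); /* joules */
        t_prev = t; p_prev = p; first = 0;
    }
    printf("energy to solution: %.1f J\n", energy);
    return 0;
}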

D2Q37 Correlating performance, power, energy (continued; figure only).
Credits: Enrico Calore

D2Q37 Correlating performance, power, energy: histogram of cycles per microsecond (i.e., frequency).
Credits: Enrico Calore

Evaluation of the Arm HPC Compiler.
Comparison of GCC 7.1.0 vs. Arm HPC Compiler 1.3 (Arm HPC Compiler 1.4 was released on August 17). Evaluation on Polybench, LULESH, CoMD and QuantumESPRESSO. Preliminary results show a large difference in the number of instructions generated. More details in the poster accepted at SC 17.

Alya: BSC code for multi-physics problems.
Parallelization of a finite-element code, analyzed with Paraver. Reductions with indirect accesses on large arrays are implemented using: no coloring, coloring, commutative, or multidependences (an OmpSs feature that will hopefully be included in OpenMP). A sketch of the commutative variant follows below.
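A minimal sketch (not the Alya source) of the commutative variant, with hypothetical names and sizes, in OmpSs syntax for the Mercurium/Nanos++ toolchain listed earlier. The commutative clause lets the runtime serialize conflicting updates in any order, whereas coloring would pre-split the elements into conflict-free sets:

#define NELEM 100000
#define NNODE 50000
#define NPEL  4                          /* nodes per element (assumed) */

/* hypothetical per-element kernel: scatter-add into rhs[] through
 * the connectivity array conn[] (the indirect access). */
static void assemble_element(int e, const int *conn, double *rhs)
{
    for (int j = 0; j < NPEL; j++)
        rhs[conn[e * NPEL + j]] += 1.0;  /* stand-in contribution */
}

void assembly(const int *conn, double *rhs)
{
    for (int e = 0; e < NELEM; e++) {
        /* Conservative commutative dependence over the whole array;
         * multidependences could instead name the exact entries. */
        #pragma omp task commutative(rhs[0;NNODE])
        assemble_element(e, conn, rhs);
    }
    #pragma omp taskwait
}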

Alya: taskification and dynamic load balancing. Towards throughput computing.
Runs on 16 nodes x P processes/node x T threads/process, for the assembly and subscale phases (tasks + DLB shown as dotted lines in the plots). DLB helps in all cases, and even more in the bad ones.
Side effects: hybrid MPI+OmpSs with an Nx1 configuration can perform better than pure MPI! Nx1 + DLB: hope for lazy programmers.

Alya: coupled codes.
Observations (MareNostrum 3, 32 nodes, 512 cores): in the original version, the configuration and the kind of coupling (fluid-dominated vs. particle-dominated) have a big impact. DLB Lend-When-Idle brings an important improvement in all cases, with almost constant performance independent of the configuration and the kind of coupling.
Credits: J. Labarta, M. Garcia

Scientific applications in Mont-Blanc 3

Conclusions / Vision.
Most hardware limitations will evolve, eventually: in the original market of the devices, when extending to the server market, or pushed by other markets (e.g. mobile, automotive).
Tools can help understand the real problems and suggest/evaluate alternatives; having access to architectural specifics (e.g. hardware counters) is extremely important, for example for correlating performance and power.
Programming models and runtimes will help address asynchrony and overlap, resilience, and variability/load balancing.
The work performed on Mont-Blanc 3 applications is not strictly ARM-specific: it benefits any new platform!

Student Cluster Competition.
Rules: 12 teams of 6 undergraduate students; 1 cluster operating within a 3 kW power budget; 3 HPC applications + 2 benchmarks.
One team from the Universitat Politècnica de Catalunya (UPC, Spain) is participating with Mont-Blanc technology.
3 awards to win: best HPL; 1st, 2nd and 3rd overall places; fan favorite.

For more information: montblanc-project.eu, @MontBlanc_EU, filippo.mantovani@bsc.es