Optimize HPC - Application Efficiency on Many Core Systems

Similar documents
Arm crossplatform. VI-HPS platform October 16, Arm Limited

Software Ecosystem for Arm-based HPC

DPDK on Arm64 Status Review & Plan

Accelerate Ceph By SPDK on AArch64

A New Security Platform for High Performance Client SoCs

A Developer's Guide to Security on Cortex-M based MCUs

Introduction to parallel Computing

The Changing Face of Edge Compute

Arm s Latest CPU for Laptop-Class Performance

Connect your IoT device: Bluetooth 5, , NB-IoT

WAVE ONE MAINFRAME WAVE THREE INTERNET WAVE FOUR MOBILE & CLOUD WAVE TWO PERSONAL COMPUTING & SOFTWARE Arm Limited

Unleash the DSP performance of Arm Cortex processors

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

Rendering Structures Analyzing modern rendering on mobile

How to Build Optimized ML Applications with Arm Software

Hardware- Software Co-design at Arm GPUs

What is gem5 and where do I get it?

Building firmware update: The devil is in the details

TZMP-1 Software Reference Implementation. Ken Liu 2018-Mar-12

2017 Arm Limited. How to design an IoT SoC and get Arm CPU IP for no upfront license fee

Standard Cell Design and Optimization Methodology for ASAP7 PDK

How to Build Optimized ML Applications with Arm Software

Supplier Training Visual Guide

Connect Your IoT Device: Bluetooth 5, , NB-IoT

Advanced Software Features for the LA-950

Accelerating intelligence at the edge for embedded and IoT applications

Acknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text

Implementing debug. and trace access. through functional I/O. Alvin Yang Staff FAE. Arm Tech Symposia Arm Limited

Bringing Intelligence to Enterprise Storage Drives

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

Diversity of. connectivity required for scalable IoT devices. Sam Grove Principal Software Engineer Arm. Arm TechCon 2017.

Improve the container image compatibility on Arm

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

OPERATING SYSTEM. Chapter 4: Threads

How to run applications on Aziz supercomputer. Mohammad Rafi System Administrator Fujitsu Technology Solutions

Trends and Challenges in Multicore Programming

Arm Mbed Edge. Shiv Ramamurthi Arm. Arm Tech Symposia Arm Limited

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Analytical Modeling of Parallel Programs

A Lightweight OpenMP Runtime

Advanced IP solutions enabling the autonomous driving revolution

CSCE 626 Experimental Evaluation.

How Can You Trust Formally Verified Software?

Chapter 4: Threads. Chapter 4: Threads

How Can You Trust Formally Verified Software?

New Approaches to Connected Device Security

Compute solutions for mass deployment of autonomy

Web Programming Pre-01A Web Programming Technologies. Aryo Pinandito, ST, M.MT

CCIX: a new coherent multichip interconnect for accelerated use cases

High Performance Computing

OpenMP 4.0/4.5. Mark Bull, EPCC

Distributed Systems CS /640

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2

Parallelization of Sequential Programs

Tools and Methodology for Ensuring HPC Programs Correctness and Performance. Beau Paisley

Barbara Chapman, Gabriele Jost, Ruud van der Pas

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

Introduction to Parallel Computing

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

A Secure and Connected Intelligent Future. Ian Smythe Senior Director Marketing, Client Business Arm Tech Symposia 2017

OpenMP 4.0. Mark Bull, EPCC

How Can You Trust Formally Verified Software?

Programming for Fujitsu Supercomputers

Accelerate HPC Development with Allinea Performance Tools

A Uniform Programming Model for Petascale Computing

Performance Analysis with Periscope

Chapter 4: Multithreaded Programming

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Parallel Programming with MPI and OpenMP

Callisto-RTS: Fine-Grain Parallel Loops

Allows program to be incrementally parallelized

Beyond TrustZone PSA Reed Hinkel Senior Manager Embedded Security Market Development

EE/CSCI 451: Parallel and Distributed Computation

Concurrent Programming with OpenMP

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Performance analysis basics

Introduction to OpenMP

Making progress vs strategy

PRACE Autumn School Basic Programming Models

[Potentially] Your first parallel application

CSE 613: Parallel Programming. Lecture 2 ( Analytical Modeling of Parallel Algorithms )

Introduction to Performance Tuning & Optimization Tools

CS420: Operating Systems

Using Virtual Platforms To Improve Software Verification and Validation Efficiency

Introduction to parallel computing

Speeding Up Reactive Transport Code Using OpenMP. OpenMP

Techniques to improve the scalability of Checkpoint-Restart

Introduction to Parallel Programming

Introduction to Parallel Programming

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

Scalasca performance properties The metrics tour

Review. 35a.cpp. 36a.cpp. Lecture 13 5/29/2012. Compiler Directives. Library Functions Environment Variables

CSC630/CSC730 Parallel & Distributed Computing

OPPORTUNITIES AND CHALLENGES OF DIGITAL TRANSFORMATION FOR ORGANISATIONS WITH MOBILE WORKERS

Adopt-a-JSR July Meeting

Transcription:

Meet the experts Optimize HPC - Application Efficiency on Many Core Systems 2018 Arm Limited Florent Lebeau 27 March 2018

2 2018 Arm Limited

Speedup Multithreading and scalability I wrote my program to run in parallel with a few OpenMP directives But the performance is not really what I expected. 9 8 Example with modified version of Cloverleaf Multi-threaded version with OpenMP No MPI, no IO 7 6 5 4 Ideal Measured Number of threads Runtime (seconds) 1 281 3 2 192 2 4 144 1 1 2 3 4 5 6 7 8 Threads 8 155 3 2018 Arm Limited

Outline What are the challenges of multithreaded applications? How to identify issues and act quickly? How to understand the performance and optimise the code? 4 2018 Arm Limited

sync sync Challenges Sequential sections Regions outside of parallel sections Master or single OpenMP sections Synchronisation overhead Load imbalance Implicit and explicit barriers Scheduling policy Communications Hardware contention A B C D E F A B C D E F A E B C D F 5 2018 Arm Limited

Identifying the amount of sequential code Arm Performance Reports is an application reporting tool for HPC Easy to use: no re-compiling required Gives a comprehensible and readable summary of the application behavior 6 2018 Arm Limited

Sequential sections and scalability Running Performance Reports on the example using 1 thread indicates that: Time spent in serial code is 11% Theoretical speedup of 4.5 with 8 threads With 8 threads: Speedup is only 1.8 7 2018 Arm Limited

Where does code run serially? Arm MAP is a lightweight multi-node profiling tool Compiling with debugging flag required Shows processes and threads activity over time See source code is annotated Information aggregated by stacks and function Compute, IO and MPI 8 2018 Arm Limited

Critical sections Profiling our example using 8 threads with Arm MAP shows OpenMP activity in Light Green Single thread activity in Dark Green Synchronisation in Grey The tool shows serial execution happens in the pdv_module::pdv subroutine In a OpenMP single section Can be replaced with OpenMP parallel do section 9 2018 Arm Limited

Speedup Synchronization overhead 9 8 7 6 5 4 Ideal Measured Amdahl Theoretical speedup: 4.5 Measured: 1.8 3 2 1 1 2 3 4 5 6 7 8 Threads 10 2018 Arm Limited

Identifying the amount of synchronization Performance Reports on the example using 8 threads shows: Low amount of computation System load is only 78% Possible reasons: Load imbalance Implicit and explicit barriers Scheduling policy Communications Hardware contention 11 2018 Arm Limited

Speedup System Load and thread binding Dynamic adjustment of the number of threads is enabled OMP_DYNAMIC='TRUE Threads are not bound OMP_PROC_BIND= FALSE Binding the threads and disabling dynamic adjustment slightly improve performance Performance Reports detects a sign of overly fine grained parallelism 12 2018 Arm Limited 9 8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 Ideal No binding Binding Amdahl Threads

Where are threads waiting? Profiling Cloverleaf using 8 threads (bound to cores) with Arm MAP shows Overhead in one OpenMP region Implicit barriers Inner loop parallelization only 13 2018 Arm Limited

Understanding resource usage Memory accesses Max Avg Min Balanced Imbalanced 14 2018 Arm Limited

Additional metrics PAPI 15 2018 Arm Limited

Thank You Danke Merci 谢谢ありがとう Gracias Kiitos 감사합니다 धन यव द תודה 16 2018 Arm Limited