Tools and Methodology for Ensuring HPC Programs Correctness and Performance. Beau Paisley


Tools and Methodology for Ensuring HPC Programs Correctness and Performance. Beau Paisley, bpaisley@allinea.com

About Allinea
- Over 15 years of business focused on parallel programming development tools
- Strong R&D investment to drive innovation in a changing landscape
- Committed to giving great support to the HPC community

Where to find Allinea's tools
- Over 65% of the Top 100 HPC systems: from small to very large tools provision
- 8 of the Top 10 HPC systems: tools usage on up to 700,000 cores
- Future leadership systems: usage on millions of cores

Allinea: Industry Standard Tools for HPC (and hundreds more)

Performance in a Nutshell
- Algorithmic issues
- Balance and shared bottlenecks
- The memory wall
- Use of processor capability

Performance Improvement Workflow
- Get a realistic test case
  - Performance on real data matters
  - Keep the test case for reference and re-use
- Profile your code
  - Add the -g flag to your compilation
  - Run with a profiler
- Look for the significant
  - Which part or phase of the code dominates time?
  - Is there any unexpected significant time use?
- What is the nature of the problem?
  - Compute? I/O? MPI? Thread synchronization?
  - Display the metrics that show the problem best
- Apply brain to solve
  - MPI: can you balance the work better?
  - Compute: is memory time dominant? Can you improve the data layout?
- Think of the future
  - Try larger process or thread counts to watch for scalability problems
  - Keep the profile (.map file) for future comparison
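The "look for the significant" and "can you balance the work better?" steps can be illustrated with a minimal Python sketch. It is not part of the original slides and does not use any Allinea API: the helper names `phase`, `dominant_phase`, and `load_imbalance` are hypothetical, and a real profiler gathers this data without code changes.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical helper state: wall-clock seconds accumulated per named phase.
phase_seconds = defaultdict(float)

@contextmanager
def phase(name):
    """Time the enclosed block and add it to the named phase."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[name] += time.perf_counter() - start

def dominant_phase(totals):
    """Return (name, fraction of total time) for the most expensive phase."""
    total = sum(totals.values())
    name = max(totals, key=totals.get)
    return name, (totals[name] / total if total > 0 else 0.0)

def load_imbalance(per_rank_seconds):
    """Return max/mean - 1: 0.0 is perfect balance, 0.5 means the slowest
    rank takes 50% longer than the average rank."""
    mean = sum(per_rank_seconds) / len(per_rank_seconds)
    return max(per_rank_seconds) / mean - 1.0

if __name__ == "__main__":
    with phase("compute"):
        sum(i * i for i in range(200_000))
    with phase("io"):
        time.sleep(0.01)
    name, share = dominant_phase(phase_seconds)
    print(f"{name} takes {share:.0%} of measured time")
    print(f"imbalance: {load_imbalance([1.0, 1.0, 2.0]):.2f}")
```

A high imbalance value suggests that most ranks wait for the slowest one, which is the situation the MPI question in the workflow is asking about.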


The Uncomfortable Truth about Applications

Obtaining Program Correctness in a Nutshell
- Interactive multi-process and multithreaded debugging at any scale
- Support for common architectures and coprocessors
- Offline debugging for large runs and nondeterministic bugs
- Support for integration with regression testing

Debugging at Scale Requires Powerful Visual Representations

Enable Debugging Across a Range of Architectures
- IBM OpenPOWER
- Intel KNL
- NVIDIA GPUs
- 64-bit ARM

Enable Large Scale Debugging and Regression Testing with Offline Debugging

Overview of Allinea Tools
[Diagram; recoverable labels:]
- Demands: software efficiency, developer efficiency, performance optimization, debugging
- Tools: Forge; Performance Reports (Measure); MAP (Profile and Optimize); DDT (Debug)
- Flows: pull for MAP to develop a performance fix; leads to MAP to optimize performance; leads to DDT to understand and fix
- Integration: open interfaces (e.g., JSON APIs), continuous integration, version control
- Cycle: debug, optimize, edit, commit, build, repeat

Analyze and tune application performance
- A single-page report on application performance for users and administrators
- Identify configuration problems and resource bottlenecks immediately
- Track mission-critical performance over time and after system upgrades
- Ensure key applications run at full speed on a new cluster or architecture

Vectorization, MPI, I/O, memory, energy

Allinea MAP: The Profiler
- Small data files
- <5% slowdown
- No instrumentation
- No recompilation

How Allinea MAP is different
- Adaptive sampling
  - Sample frequency decreases over time
  - Data never grows too much
  - Run for as long as you want
- Scalable
  - Same scalable infrastructure as Allinea DDT
  - Merges sample data at end of job
  - Handles very high core counts, fast
- Instruction analysis
  - Categorizes instructions sampled
  - Knows where the processor spends time
  - Shows vectorization and memory bandwidth
- Thread profiling
  - Core-time, not thread-time, profiling
  - Identifies lost compute time
  - Detects OpenMP issues
- Integrated
  - Part of the Forge tool suite
  - Zoom and drill into the profile
  - Profiling within your code
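The adaptive sampling idea (sample frequency decreases over time, so the data never grows too much) can be sketched in Python as follows. This illustrates the general technique under simple assumptions and is not MAP's actual implementation; the class name `AdaptiveSampler` and its parameters are hypothetical.

```python
class AdaptiveSampler:
    """Keep fewer than `capacity` samples by coarsening as the run gets longer."""

    def __init__(self, capacity=8, interval=1):
        if capacity < 2 or capacity % 2:
            raise ValueError("capacity must be an even number >= 2")
        self.capacity = capacity
        self.interval = interval   # raw ticks averaged into one stored sample
        self.samples = []          # each entry is the mean of `interval` ticks
        self._pending = []         # raw values waiting for the next stored sample

    def record(self, value):
        self._pending.append(value)
        if len(self._pending) < self.interval:
            return
        self.samples.append(sum(self._pending) / len(self._pending))
        self._pending.clear()
        if len(self.samples) == self.capacity:
            # Buffer full: merge neighbouring samples and halve the frequency.
            self.samples = [
                (a + b) / 2
                for a, b in zip(self.samples[0::2], self.samples[1::2])
            ]
            self.interval *= 2


if __name__ == "__main__":
    sampler = AdaptiveSampler(capacity=8)
    for tick in range(1000):
        sampler.record(float(tick))
    # Storage stays bounded however long the run lasts.
    print(len(sampler.samples), sampler.interval)
```

Because neighbouring samples always cover the same number of ticks, averaging pairs preserves the mean of the underlying data while halving the storage.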

Why MAP?
- Easy to use
- Fast, with low overhead
- Time-based indexing of data
- Extensive metrics (e.g., I/O, memory, floating-point operations, line-level granularity)
- Customized views into the data
- Extensible, with an API that can be used to capture custom measurements
- High accuracy
- Professional, responsive support

Allinea DDT: The Debugger
- Workflow: run with Allinea tools, identify a problem, gather info (who, where, how, why), fix
- Who had rogue behavior?
  - Merges stacks from processes and threads
- Where did it happen?
  - Leaps to source
- How did it happen?
  - Diagnostic messages
  - Some faults are evident instantly from the source
- Why did it happen?
  - Unique "Smart Highlighting"
  - Sparklines comparing data across processes
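Merging stacks to answer "who had rogue behavior?" can be sketched as grouping processes by their call stack and flagging the ranks that differ from the majority. The Python below is a simplified illustration, not DDT's implementation (which presents a merged tree of frames); the function names and the sample stacks are hypothetical.

```python
from collections import defaultdict

def merge_stacks(stacks_by_rank):
    """Group ranks by identical call stack (outermost frame first)."""
    groups = defaultdict(list)
    for rank, frames in sorted(stacks_by_rank.items()):
        groups[tuple(frames)].append(rank)
    return dict(groups)

def outliers(groups):
    """Return the ranks whose stack differs from the most common stack."""
    majority = max(groups, key=lambda stack: len(groups[stack]))
    return sorted(
        rank
        for stack, ranks in groups.items()
        if stack != majority
        for rank in ranks
    )

if __name__ == "__main__":
    # Hypothetical snapshot: rank 2 is still reading input while the
    # other ranks wait in a collective operation.
    snapshot = {
        0: ["main", "solve", "MPI_Allreduce"],
        1: ["main", "solve", "MPI_Allreduce"],
        2: ["main", "solve", "read_input"],
        3: ["main", "solve", "MPI_Allreduce"],
    }
    groups = merge_stacks(snapshot)
    for stack, ranks in groups.items():
        print(f"{len(ranks)} rank(s) {ranks}: {' > '.join(stack)}")
    print("outliers:", outliers(groups))
```

With thousands of processes, the grouped view stays a handful of lines long, which is why merged stacks scale where per-process listings do not.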

Bottling it
- Lock in the results you obtain: performance AND correctness
- Save your results nightly
- Tie your performance results to your continuous integration server
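Tying performance results to a continuous integration server can be as simple as comparing each nightly run against a saved baseline and failing the build on a regression. The Python sketch below assumes the results are available as name-to-seconds mappings; the metric names, the 10% tolerance, and the function `check_regression` are hypothetical and not taken from any Allinea tool.

```python
import json

def check_regression(baseline, current, tolerance=0.10):
    """Return the metrics whose current value exceeds the baseline by more
    than `tolerance` (0.10 = 10%). Lower values are assumed to be better."""
    regressions = []
    for name, base in baseline.items():
        now = current.get(name)
        if now is not None and now > base * (1.0 + tolerance):
            regressions.append(name)
    return sorted(regressions)

if __name__ == "__main__":
    # Hypothetical saved results; in practice these would be files kept
    # alongside the nightly build artifacts.
    baseline = json.loads('{"wall_s": 100.0, "mpi_s": 20.0}')
    current = json.loads('{"wall_s": 105.0, "mpi_s": 25.0}')
    failed = check_regression(baseline, current)
    if failed:
        raise SystemExit(f"performance regression in: {', '.join(failed)}")
    print("performance within tolerance")
```

A non-zero exit status is what most CI servers use to mark a job as failed, so the same script can gate merges as well as nightly runs.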

Top Tips for HPC Development Success
- Performance is important
- Software needs performance attention
- Regular profiling pays rewards
- Test correctness and validate performance on real workloads
- Integrate your debugger with program correctness regression testing
- Constant diligence pays off