Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.

Similar documents
Allinea Unified Environment

Understanding Dynamic Parallelism

Tools and Methodology for Ensuring HPC Programs Correctness and Performance. Beau Paisley

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

Debugging HPC Applications. David Lecomber CTO, Allinea Software

Debugging for the hybrid-multicore age (A HPC Perspective) David Lecomber CTO, Allinea Software

Developing, Debugging, and Optimizing GPU Codes for High Performance Computing with Allinea Forge

Debugging at Scale Lindon Locks

IBM High Performance Computing Toolkit

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Welcomes PRACE/LinkSCEEM 2011 Winter School Jacques Philouze Vice President Sales & Marketing

AutoTune Workshop. Michael Gerndt Technische Universität München

The Eclipse Parallel Tools Platform

The Titan Tools Experience

An Introduction to OpenACC

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy.! Thomas C.

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

GPU Architecture. Alan Gray EPCC The University of Edinburgh

Development Tools for Parallel Computing. David Lecomber CTO, Allinea Software

Development tools to enable Multicore

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

ECMWF Workshop on High Performance Computing in Meteorology. 3 rd November Dean Stewart

Parallel Programming. Libraries and Implementations

Debugging Programs Accelerated with Intel Xeon Phi Coprocessors

An Introduction to the SPEC High Performance Group and their Benchmark Suites

OpenACC 2.6 Proposed Features

STARTING THE DDT DEBUGGER ON MIO, AUN, & MC2. (Mouse over to the left to see thumbnails of all of the slides)

Large Scale Debugging

GPU Technology Conference Three Ways to Debug Parallel CUDA Applications: Interactive, Batch, and Corefile

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design

Debugging and Optimizing Programs Accelerated with Intel Xeon Phi Coprocessors

The Cray Programming Environment. An Introduction

Debugging on Blue Waters

Accelerate HPC Development with Allinea Performance Tools

Programming for the Intel Many Integrated Core Architecture By James Reinders. The Architecture for Discovery. PowerPoint Title

Pedraforca: a First ARM + GPU Cluster for HPC

Debugging Intel Xeon Phi KNC Tutorial

Parallel Programming Libraries and implementations

Intel Xeon Phi Coprocessors

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

HPC-BLAST Scalable Sequence Analysis for the Intel Many Integrated Core Future

Hybrid Implementation of 3D Kirchhoff Migration

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin

[Scalasca] Tool Integrations

Scalability Improvements in the TAU Performance System for Extreme Scale

GPU Fundamentals Jeff Larkin November 14, 2016

ARCHER Single Node Optimisation

GPU Computing with NVIDIA s new Kepler Architecture

Debugging and profiling of MPI programs

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Early Experiences Writing Performance Portable OpenMP 4 Codes

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Trends in HPC (hardware complexity and software challenges)

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

Towards a Reconfigurable HPC Component Model

Illinois Proposal Considerations Greg Bauer

Improved Event Generation at NLO and NNLO. or Extending MCFM to include NNLO processes

Improving the Productivity of Scalable Application Development with TotalView May 18th, 2010

Reusing this material

Debugging with GDB and DDT

Parallel Programming. Libraries and implementations

LS-DYNA Scalability Analysis on Cray Supercomputers

Accelerating HPL on Heterogeneous GPU Clusters

Running the FIM and NIM Weather Models on GPUs

COMP528: Multi-core and Multi-Processor Computing

Guillimin HPC Users Meeting July 14, 2016

Performance Tools for Technical Computing

Debugging, benchmarking, tuning i.e. software development tools. Martin Čuma Center for High Performance Computing University of Utah

Operational Robustness of Accelerator Aware MPI

Oracle Developer Studio Performance Analyzer

Bright Cluster Manager Advanced HPC cluster management made easy. Martijn de Vries CTO Bright Computing

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Sami Saarinen Peter Towers. 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1

Bei Wang, Dmitry Prohorov and Carlos Rosales

Damaris. In-Situ Data Analysis and Visualization for Large-Scale HPC Simulations. KerData Team. Inria Rennes,

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Peta-Scale Simulations with the HPC Software Framework walberla:

HPC Architectures. Types of resource currently in use

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda

Heidi Poxon Cray Inc.

Using Intel VTune Amplifier XE for High Performance Computing

LDetector: A low overhead data race detector for GPU programs

Intel Xeon Phi архитектура, модели программирования, оптимизация.

MPI RUNTIMES AT JSC, NOW AND IN THE FUTURE

User Training Cray XC40 IITM, Pune

OpenACC Course. Office Hour #2 Q&A

Intel VTune Amplifier XE. Dr. Michael Klemm Software and Services Group Developer Relations Division

Automatic Tuning of HPC Applications with Periscope. Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München

DDT Debugging Techniques

Introduction to debugging. Martin Čuma Center for High Performance Computing University of Utah

Parallel Systems. Project topics

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

CSCS Proposal writing webinar Technical review. 12th April 2015 CSCS

Improving Applica/on Performance Using the TAU Performance System

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

Automatic trace analysis with the Scalasca Trace Tools

Transcription:

Debugging CUDA Applications with Allinea DDT Ian Lumb Sr. Systems Engineer, Allinea Software Inc. ilumb@allinea.com GTC 2013, San Jose, March 20, 2013

Embracing GPUs GPUs a rival to traditional processors New languages, compilers, standards Great price/performance ratios CUDA, OpenACC, OpenCL,... HPC developers need to consider Data transfer Multiple memory levels Grid/block layout and thread scheduling Synchronization Bugs are inevitable Processing flow

Full-scale parallel debugging Full CUDA 5 support OpenMP + OpenACC Mixed MPI / OpenMP / CUDA Debug the real device World-class Parallel Debugging

Allinea DDT Demo: CUDA Debugging Essentials

Debugging Dynamic Parallelism

Full-scale parallel profiling Just 5% runtime overhead No need to instrument code MPI, memory and CPU See the slowest lines of code World-class Parallel Profiling

Use of Allinea DDT at SciNet Productively debug parallel CUDA code Completely understand parallel CUDA code Interact with data, algorithms, codes, programs and applications in real time Develop parallel CUDA code from scratch Port parallel algorithms, codes, programs and applications to CUDA Scale CUDA algorithms, codes, programs and applications http://www.scinethpc.ca/

Petascale Debugging Cray XK* systems ORNL Titan MPI debugging proven at 220,000 CPU cores in 2010 ~300,000 CPU cores in 2012 MPI-OpenACC hybrid codes expected to scale similarly NCSA Blue Waters ~700,000 CPU cores

Allinea at GTC 2013 Please visit our booth!!! Live demonstrations Further information Call to action: Evaluate Allinea tools! http://www.allinea.com/products/

Allinea DDT: Fixing bugs made easy A tool that allows you to solve your problems faster Control threads and processes en-masse Syntax-highlighting source browser Parallel stacks and variable views that highlight divergence Supports the latest in MPI, OpenMP, CUDA and more CUDA 5.0 and Kepler K20 Intel Xeon Phi coprocessor

Allinea DDT: Proven to the extreme Scalability by design User interface that scales High performance tree architecture Proven performance at Petascale Measured in milliseconds Routine use at 100,000+ cores

Allinea DDT: More than debugger Integrated automated detection of bugs Static analysis Memory leaks and errors Open plugin architecture MPI checking tools Offline mode - debug in batch mode

Allinea DDT 3.2.1 (10/2012) CUDA 5.0 support Intel Xeon Phi support MPMD support MPICH 2 Intel MPI IBM BlueGene /Q port

Allinea Tools 4.0 (3/2013) GA introduction of Allinea MAP Observe and improve the performance of your MPI application Improved syntax highlighting and code folding in code viewer Remote client Native client for Linux, Windows and Mac Redesigned welcome page Thread viewer Shows OpenMP thread ID for supported implementations

Allinea Tools 4.0 (2) Support for MPICH 3 Cray-specific Attach to GPU codes (Cray XK7) Memory Debugging Library preloading C/Fortran (no threads/threads) now supported (Cray XK6/7) C++ (no threads/threads) not supported

Allinea MAP: The profiler you'll want to use Works first time, every time From one process to tens of thousands 5% slowdown No instrumentation and no huge data files No need to recompile

Allinea MAP: Refreshing simple See where time is spent in your source code. Visualize the entire run Zoom in to explore iterations, functions and loops.

Allinea MAP: Surprisingly deep A wide range of metrics for deep insights into performance: CPU: Time spent in FPU, vector (SSE/AVX) and memory operations MPI: Time spent and bytes transferred in P2P / collective operations Visual display of distribution over ranks, min / max ranks labelled Export data via XML or CSV for deeper analysis and reports

Allinea MAP: Built on strength World-class scalability Shares Allinea DDT tree architecture proven beyond Petascale Data is merged on the cluster: no huge files.