Understanding Dynamic Parallelism

Understanding Dynamic Parallelism: Know your code and know yourself
Presenter: Mark O'Connor, VP Product Management

Agenda
Introduction and Background
Fixing a Dynamic Parallelism Bug
Understanding Dynamic Parallelism
Questions & Answers

Allinea: The Company
Parallel development tools company since 2002
Leading in the HPC software tools market worldwide
Global customer base
Making parallel programming accessible to the widest range of scientists and programmers
  Design an unrivaled productive and easy-to-use development environment
  To help you reach the highest level of performance and scalability
  Define a new standard of customer support

Allinea Unified Environment
A modern integrated environment for HPC developers
Supporting the lifecycle of application development and improvement
  Allinea DDT: Productively debug code
  Allinea MAP: Enhance application performance
Designed for productivity
  Consistent, easy-to-use tools
  Enables effective HPC development
Improve system usage
  Fewer failed jobs
  Higher application performance

Unified building blocks in production since 2010
Shared Graphical Interface
Shared Configuration Files
Shared Scalable Architecture
Shared Intelligence and Data Consolidation

Allinea MAP: Increase application performance
Parallel profiler designed for:
  C/C++, Fortran
  Multiprocess code
    Interdependent or independent processes
  Multithreaded code
    Monitor the main threads for each process
  Accelerated codes
    GPUs, Intel Xeon Phi
Improve productivity:
  Helps you detect performance issues quickly and easily
  Tells you immediately where your time is spent in your source code
  Helps you to optimize your application efficiently

Allinea MAP: Find performance issues quickly
Look at the entire application on real data sets
  Visualize the entire run at full scale, not just reduced sets
  Zoom in to explore iterations, functions and loops
Understand the nature of bottlenecks
  Source code viewer pinpoints bottleneck locations
  CPU, MPI and memory access metrics identify the cause

Allinea DDT: Fix software problems, fast
Graphical debugger designed for:
  C/C++, Fortran, UPC, CUDA
  Multithreaded code
    Single address space
  Multiprocess code
    Interdependent or independent processes
  Accelerated codes
    GPUs, Intel Xeon Phi
  Any mix of the above
Slash your time to debug:
  Reproduces and triggers your bugs instantly
  Helps you quickly understand where issues come from
  Helps you to fix them as swiftly as possible

Allinea DDT: Scalable debugging by design
Where did it happen?
  Allinea DDT leaps to source automatically
  Merges stacks from processes and threads
How did it happen?
  Some faults evident instantly from source
Why did it happen?
  Real-time data comparison and consolidation
  Unique Smart Highlighting coloring differences and changes
  Sparklines comparing data across processes

New in Allinea DDT 4.1: Debug problems even quicker
Debugging logbook
  Records debugging activity
  Compare runs side-by-side
  Extends offline debugging capabilities
Benefit: Compare sane runs to buggy runs to quickly narrow down your problem.

New in Allinea DDT 4.1: Debug problems even quicker
Version control integration
  Highlights where source code has been changed
  Source code annotated with a change heatmap
  Support for Mercurial, CVS, SVN, Git
Benefit: Quickly identify the cause of regressions by seeing at a glance what has changed.

New in Allinea DDT 4.1: Tighten the link with VisIt
Visualization enhancements
  Pick cells and interact with them in the debugger, e.g. set a watchpoint
  Display of multiple datasets
  Wizard to guide data layout
Benefit: Link visualization to precise memory areas to shorten the debugging process.

Leading the way to Innovation
Support for accelerated environments
  CUDA 5.0 and Kepler 20
  Intel Xeon Phi Coprocessor
  GPU directives (both OpenACC and non-OpenACC)
Support for complex architectures
  Debug and profile MPI, OpenMP and CUDA combinations
  Supports low power CPU architectures (Moonshot program)
  Support for all major compilers, MPI and OpenMP implementations
Murex: NVIDIA Carma Dev Kit
Quick resolution of our customer issues
  90% of support tickets are resolved within 7 days
University of Gent

Today: Debugging Dynamic Parallelism on K20

Debugging Dynamic Parallelism

Debugging Dynamic Parallelism wait, what?

Debugging is About Understanding
What actually happens here?
Which values are put into data and when?
What's the relationship between n and data?
How many kernels are launched?
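The kernel these questions refer to was shown on the slides and is not in this transcript. As a rough illustration only, the sketch below is a host-side C++ model (not CUDA, and not the presenter's code) of one hypothetical launch pattern: every launch that covers more than one element starts two child launches on the two halves of its range, and a launch that covers a single element writes its nesting depth into data. The names model_kernel, count_launches and LaunchStats are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counters gathered while walking the hypothetical launch tree.
struct LaunchStats {
    int launches = 0;   // total launches, including the first one from the host
    int max_depth = 0;  // deepest nesting level reached
};

// Models one kernel launch covering data[offset .. offset + n).
// Assumption: a launch with n > 1 starts two children on the two halves;
// a launch with n == 1 writes its nesting depth into its single element.
void model_kernel(std::vector<int>& data, int offset, int n, int depth,
                  LaunchStats& stats) {
    assert(n >= 1 && static_cast<std::size_t>(offset + n) <= data.size());
    stats.launches += 1;
    if (depth > stats.max_depth) stats.max_depth = depth;
    if (n == 1) {
        data[offset] = depth;  // the only place values are put into data
        return;
    }
    int half = n / 2;
    model_kernel(data, offset, half, depth + 1, stats);             // first child
    model_kernel(data, offset + half, n - half, depth + 1, stats);  // second child
}

// Resizes data to n elements, runs the model, and returns the counters.
// For n <= 0 nothing is launched.
LaunchStats count_launches(int n, std::vector<int>& data) {
    LaunchStats stats;
    if (n <= 0) {
        data.clear();
        return stats;
    }
    data.assign(static_cast<std::size_t>(n), -1);
    model_kernel(data, 0, n, 0, stats);
    return stats;
}
```

Calling count_launches(8, data) in this model reports 15 launches with a maximum depth of 3, and every element of data ends up holding 3. Under the assumed pattern the launch count is 2n - 1, so it grows linearly with n even though each launch looks small in the source. That relationship is hard to read off the code and easy to see once the launches are counted.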

Allinea DDT + MAP
An integrated, ready-to-run development suite
Get correct results fast using the industry-leading parallel debugger
Full support for NVIDIA CUDA 5
See which loops can be offloaded to the GPU most effectively with Allinea MAP

Questions and Answers
Mark O'Connor, VP Product Management
Robert Rick, VP Sales, Director of Operations, Americas

Upcoming GTC Express Webinars
July 10: Introduction to the CUDA Toolkit as an Application Build Tool. Adam DeConinck, HPC Systems Engineer, NVIDIA
July 11: Uncovering the Elusive HIV Capsid with Kepler GPUs. Juan R. Perilla, Postdoctoral Fellow, University of Illinois at Urbana-Champaign
Register at www.gputechconf.com/gtcexpress