Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Similar documents
Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

An Introduction to OpenACC

A High Level Programming Environment for Accelerated Computing. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Hybrid multicore supercomputing and GPU acceleration with next-generation Cray systems

Addressing Performance and Programmability Challenges in Current and Future Supercomputers

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

The Cray Programming Environment. An Introduction

The Arm Technology Ecosystem: Current Products and Future Outlook

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

CloverLeaf: Preparing Hydrodynamics Codes for Exascale

Alistair Hart, Roberto Ansaloni, Alan Gray, Kevin Stratford (EPCC) ( Cray Exascale Research Initiative Europe)

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

OpenACC Course. Office Hour #2 Q&A

Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.

Steps to create a hybrid code

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

The Eclipse Parallel Tools Platform

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

The Titan Tools Experience

Using the Cray Programming Environment to Convert an all MPI code to a Hybrid-Multi-core Ready Application

OpenACC programming for GPGPUs: Rotor wake simulation

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

John Levesque Nov 16, 2001

Understanding Dynamic Parallelism

Heidi Poxon Cray Inc.

Parallel Programming. Libraries and Implementations

Steve Scott, Tesla CTO SC 11 November 15, 2011

The Cray Programming Environment. An Introduction

Programming Environment 4/11/2015

Illinois Proposal Considerations Greg Bauer

User Training Cray XC40 IITM, Pune

Adrian Tate XK6 / openacc workshop Manno, Mar

Trends and Challenges in Multicore Programming

Lattice Simulations using OpenACC compilers. Pushan Majumdar (Indian Association for the Cultivation of Science, Kolkata)

How to Write Code that Will Survive the Many-Core Revolution

Preparing GPU-Accelerated Applications for the Summit Supercomputer

CUDA. Matthew Joyner, Jeremy Williams

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

Parallel Programming Libraries and implementations

Allinea Unified Environment

It s not my fault! Finding errors in parallel codes 找並行程序的錯誤

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design

Programming NVIDIA GPUs with OpenACC Directives

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy.! Thomas C.

Developing, Debugging, and Optimizing GPU Codes for High Performance Computing with Allinea Forge

Parallel Programming. Libraries and implementations

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

GPU Computing with NVIDIA s new Kepler Architecture

Lecture 1: an introduction to CUDA

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Debugging HPC Applications. David Lecomber CTO, Allinea Software

Adapting Numerical Weather Prediction codes to heterogeneous architectures: porting the COSMO model to GPUs

OpenACC Accelerator Directives. May 3, 2013

CUDA Experiences: Over-Optimization and Future HPC

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

GPU Architecture. Alan Gray EPCC The University of Edinburgh

Optimising the Mantevo benchmark suite for multi- and many-core architectures

Wednesday : Basic Overview. Thursday : Optimization

An Introduction to the SPEC High Performance Group and their Benchmark Suites

Chapter 3 Parallel Software

Accelerating Financial Applications on the GPU

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

Directive-based Programming for Highly-scalable Nodes

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3

Debugging for the hybrid-multicore age (A HPC Perspective) David Lecomber CTO, Allinea Software

ET International HPC Runtime Software. ET International Rishi Khan SC 11. Copyright 2011 ET International, Inc.

Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray

OpenACC 2.6 Proposed Features

General Purpose GPU Computing in Partial Wave Analysis

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

GPU Technology Conference Three Ways to Debug Parallel CUDA Applications: Interactive, Batch, and Corefile

Trends in HPC (hardware complexity and software challenges)

arxiv: v1 [hep-lat] 12 Nov 2013

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

Thinking Outside of the Tera-Scale Box. Piotr Luszczek

Reveal Heidi Poxon Sr. Principal Engineer Cray Programming Environment

Porting the parallel Nek5000 application to GPU accelerators with OpenMP4.5. Alistair Hart (Cray UK Ltd.)

Porting COSMO to Hybrid Architectures

Auto-tuning a High-Level Language Targeted to GPU Codes. By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos

Critically Missing Pieces on Accelerators: A Performance Tools Perspective

Accelerating sequential computer vision algorithms using commodity parallel hardware

Portable and Productive Performance on Hybrid Systems with OpenACC Compilers and Tools

Overview of research activities Toward portability of performance

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO

Addressing Heterogeneity in Manycore Applications

Operational Robustness of Accelerator Aware MPI

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015

OpenACC (Open Accelerators - Introduced in 2012)

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC. Stefan Maintz, Dr. Markus Wetzstein

Transcription:

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1

The New Generation of Supercomputers Hybrid multicore has arrived and is here to stay Fat nodes are getting fatter Accelerators have leapt into the Top500 The Cray XK Hybrid Architecture XK6 Announced in May 2011 NVIDIA Fermi X2090 GPU XK7 Announced in October 2012 NVIDIA Kepler K20 GPU Fully upgradeable from Cray XT/XE systems Unified X86/GPU programming environment Fully compatible with Cray XE6 product line Cray Gemini interconnect high bandwidth/low latency scalability 2

Cray s Vision for Accelerated Computing Most important hurdle for widespread adoption of accelerated computing in HPC is programming difficulty Need a single programming model that is portable across machine types Portable expression of heterogeneity and multi-level parallelism Programming model and optimization should not be significantly difference for accelerated nodes and multi-core x86 processors Allow users to maintain a single code base Cray s approach to Accelerator Programming is to provide an ease of use tightly coupled high level programming environment with compilers, libraries, and tools that can hide the complexity of the system Ease of use is possible with Compiler making it feasible for users to write applications in Fortran, C, and C++ Tools to help users port and optimize for hybrid systems Auto-tuned scientific libraries SC'12 Luiz DeRose 2012 3

OpenACC Accelerator Programming Model Why a new model? There are already many ways to program: CUDA and OpenCL All are quite low-level and closely coupled to the GPU PGI CUDA Fortran: still CUDA just in a better base language User needs to write specialized kernels: Hard to write and debug Hard to optimize for specific GPU Hard to update (porting/functionality) OpenACC Directives provide high-level approach Simple programming model for hybrid systems Easier to maintain/port/extend code Non-executable statements (comments, pragmas) The same source code can be compiled for multicore CPU Based on the work in the OpenMP Accelerator Subcommittee PGI accelerator directives, CAPS HMPP First steps in the right direction Needed standardization Possible performance sacrifice A small performance gap is acceptable (do you still hand-code in assembly?) Goal is to provide at least 80% of the performance obtained with hand coded CUDA Compiler support: all complete in 2012 Cray CCE: complete in the 8.1 release PGI Accelerator version 12.6 onwards CAPS Full support in version 1.3 4

Steps to Create a Hybrid Code SC'12 Luiz DeRose 2012 5

Why You Should Care For the next decade (at least) all HPC systems will have the same basic architecture: Communication between nodes MPI, SHMEM, UPC, Coarray Multithreading within the node MPI will not do Vectorization at the lowest level Current petascale applications are not structured to take advantage of these architectures Currently 80-90% of applications use a single level of parallelism message passing between cores of the MPP system Looking forward, application developers are faced with a significant task in preparing their applications for the future 6

Tools needed to Create Hybrid Codes Need a good Programming Environment to close the gap between peak performance and possible performance A lot more than just a compiler Specific tools needed for identifying the parallelism in an application Fine-grained profiling: loop level rather than routine Profiling and characterizing looping structures in a complex application Scoping tool for investigating parallelizability of high-level looping structures Tools for maintaining performance-portable applications Application developers want to develop a single code that can run efficiently on multi-core nodes with or without an accelerator Performance Analysis Debugger Software such as Cray's Hybrid Programming Environment provides tools to help Cannot replace the developer's inside knowledge 7

Reveal New analysis and code restructuring assistant Key Features Uses both the performance toolset and CCE s program library functionality to provide static and runtime analysis information Annotated source code with compiler optimization information Provides feedback on critical dependencies that prevent optimizations Assists user with the code optimization phase by correlating source code with analysis to help identify which areas are key candidates for optimization Scoping analysis Identifies shared, private and ambiguous arrays Allows user to privatize ambiguous arrays Allows user to override dependency analysis Source code navigation Uses performance data collected through CrayPat 8

Scoping Assistance Review Scoping Results User addresses parallelization issues for unresolved variables Parallelization inhibitor messages are provided to assist user with analysis Loopmark and optimization annotations Loops with scoping information are highlighted red needs user assistance Compiler feedback 9

Scoping Assistance Generate Directive Reveal generates example OpenMP directive 10

Cray Apprentice2 Overview with GPU Data 11

OpenACC Debugging On device debugging with Allinea DDT Variables arrays, pointers, full F90 and C support Set breakpoints and step warps and blocks Requires Cray compiler for on-device debugging Other compilers to follow Identical to CUDA Full warp/block/kernel controls OpenACC DIRECTIVES FOR ACCELERATORS Step host & device View variables Set breakpoints Compatibility with Cray CCE 8.x OpenACC SC'12 Luiz DeRose 2012 12

What is Cray Libsci_acc? Provide basic scientific libraries optimized for hybrid systems Incorporate the existing GPU libraries into Cray libsci Independent to, but fully compatible with OpenACC Multiple use case support Get the base use of accelerators with no code change Get extreme performance of GPU with or without code change Provide additional performance and usability Two interfaces Simple interface Auto-adaptation Base performance of GPU with minimal (or no) code change Target for anybody: non-gpu users and non-gpu expert Expert interface Advanced performance of the GPU with controls for data movement Target for CUDA, OpenACC, and GPU experts Does not imply that the expert interfaces are always needed to get great performance SC'12 Luiz DeRose 2012 13

Case Studies SC'12 Luiz DeRose 2012 14

Performance (TFlop/s) Porting Parallel Himeno to the Cray XK6 Several versions tested, with communication implemented in MPI and Fortran coarrays ~600 lines of Fortran GPU version using OpenACC accelerator directives Total number of accelerator directives: 27 plus 18 "end" directives Arrays reside permanently on the GPU memory Cray XK6 timings compared to best Cray XE6 results (hybrid MPI/OpenMP) Data transfers between host and GPU are: Communication buffers for the halo exchange Control value 5.0 4.0 3.0 2.0 1.0 Himeno Benchmark - XL configuration XE6 MPI/OMP XK6 async XK6 blocking 0.0 0 32 64 96 128 Number of nodes 15

CloverLeaf on the Cray XK6 2D hydro code, with several stencil-type operations Developed by AWE Using to explore programming models to be released as Open Source to the Mantevo project hosted by Sandia (miniapps) Current performance for 87 steps Mesh CUDA OpenACC 960x960 2.44 2.03 3840x3840 37.42 31.77 SC'12 Luiz DeRose 2012 16

GAMESS Computational chemistry package suite developed and maintained by the Gordon Group at Iowa State University http://www.msg.ameslab.gov/gamess/ ijk-tuples kernel - Source changes CUDA - 1800 lines of hand-coded CUDA OpenACC approximately 75 directives added to the original source Performance of ijk-tuples on 16 XK6 Nodes with Fermi CPU Only (16 ranks per node) 311 Seconds CUDA 134 seconds OpenACC 138 seconds CUDA was only ~3% faster than OpenACC Performance of ijk-tuples on 16 XK6 Nodes with Kepler CPU Only (16 ranks per node) 311 Seconds CUDA 76.6 seconds OpenACC 68.1 seconds OpenACC was ~12.5% faster than CUDA!! SC'12 Luiz DeRose 2012 17

Summary Cray provides a high level programming environment for acceletate Computing Fortran, C, and C++ compilers OpenACC directives to drive compiler optimization Compiler optimizations to take advantage of accelerator and multi-core X86 hardware appropriately Cray Reveal Scoping analysis tool to assist user in understanding their code and taking full advantage of SW and HW system Cray Performance Measurement and Analysis toolkit Single tool for GPU and CPU performance analysis with statistics for the whole application Parallel Debugger support with DDT or TotalView Auto-tuned Scientific Libraries support Getting performance from the system no assembly required SC'12 Luiz DeRose 2012 18