Reveal
Heidi Poxon, Sr. Principal Engineer, Cray Programming Environment

Legal Disclaimer

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary, based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion, or marketing, and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.

- Cray Compiler Optimization Feedback
- OpenMP Assistance
- MCDRAM Allocation Assistance

Reveal Overview
- Reduce effort associated with adding OpenMP to MPI programs when a pure MPI program no longer scales
- Produce performance-portable code through OpenMP directives
- Get insight into optimizations performed by the Cray compiler
- Use as a first step to parallelize loops that will target GPUs
- Track requests to memory and evaluate the bandwidth contribution of objects within a program

When to Move to a Hybrid Programming Model
- When code is network bound
  - Increased MPI collective and point-to-point wait times
- When MPI starts leveling off
  - Too much memory used, even if on-node shared communication is available
  - As the number of MPI ranks increases, more off-node communication can result, creating a network injection issue
- When contention for shared resources increases

Approach to Adding Parallelism
1. Identify key high-level loops: determine where to add additional levels of parallelism
2. Perform parallel analysis and scoping: split loop work among threads
3. Add an OpenMP layer of parallelism: insert OpenMP directives (a minimal sketch follows this list)
4. Analyze performance for further optimization, specifically vectorization of innermost loops
We want a performance-portable application at the end.
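As a rough illustration of steps 2 and 3, the sketch below shows a simple high-level loop after scoping, with an OpenMP layer added and the innermost loop left for the compiler to vectorize. The routine and variable names (smooth, a, b) are hypothetical and are not taken from the application used later in these slides.

subroutine smooth(a, b, n, m)
  implicit none
  integer, intent(in)  :: n, m
  real,    intent(in)  :: a(n, m)
  real,    intent(out) :: b(n, m)
  integer :: i, j

  !$omp parallel do default(none) private(i, j) shared(a, b, n, m)
  do j = 2, m - 1                    ! outer loop: work split among threads
    do i = 2, n - 1                  ! inner loop: candidate for vectorization
      b(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) + a(i, j-1) + a(i, j+1))
    end do
  end do
  !$omp end parallel do
end subroutine smooth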

The Problem: How Do I Parallelize This Loop?
- How do I know this is a good loop to parallelize?
- What prevents me from parallelizing this loop?
- Can I get help building a directive?

subroutine sweepz
  do j = 1, js
    do i = 1, isz
      radius = zxc(i+mypez*isz)
      theta  = zyc(j+mypey*js)
      do m = 1, npez
        do k = 1, ks
          n = k + ks*(m-1) + 6
          r(n) = recv3(1,j,k,i,m)
          p(n) = recv3(2,j,k,i,m)
          u(n) = recv3(5,j,k,i,m)
          v(n) = recv3(3,j,k,i,m)
          w(n) = recv3(4,j,k,i,m)
          f(n) = recv3(6,j,k,i,m)
        enddo
      enddo
      do k = 1, kmax
        n = k + 6
        xa (n) = zza(k)
        dx (n) = zdz(k)
        xa0(n) = zza(k)
        dx0(n) = zdz(k)
        e  (n) = p(n)/(r(n)*gamm)+0.5 &
                 *(u(n)**2+v(n)**2+w(n)**2)
      enddo
      call ppmlr
    enddo
  enddo

subroutine ppmlr
  call boundary
  call flatten
  call paraset(nmin-4, nmax+5, para, dx, xa)
  call parabola(nmin-4,nmax+4,para,p,dp,p6,pl,flat)
  call parabola(nmin-4,nmax+4,para,r,dr,r6,rl,flat)
  call parabola(nmin-4,nmax+4,para,u,du,u6,ul,flat)
  call states(pl,ul,rl,p6,u6,r6,dp,du,dr,plft,ulft, &
              rlft,prgh,urgh,rrgh)
  call riemann(nmin-3,nmax+4,gam,prgh,urgh,rrgh, &
               plft,ulft,rlft,pmid,umid)
  call evolve(umid, pmid)                          ! contains more calls
  call remap                                       ! contains more calls
  call volume(nmin,nmax,ngeom,radius,xa,dx,dvol)
  call remap                                       ! contains more calls
  return
end

Loop Work Estimates
- Gather loop statistics using the Cray performance tools and the Cray Compiling Environment (CCE) to determine which loops have the most work
- Helps identify high-level serial loops to parallelize
- Based on runtime analysis, approximates how much work exists within a loop

Collect Loop Work Estimates
- Set up the loop work estimates experiment with the Cray compiler and the Cray performance tools:
    $ module load PrgEnv-cray perftools-lite-loops
- Build the program with a Cray program library: -h pl=/full_path/program.pl
- Run the program to get loop work estimates

Example Loop Statistics

Table 2: Loop Stats by Function

      Loop |   Loop |  Loop |  Loop |  Loop | Function=/.LOOP[.]
      Incl |    Hit | Trips | Trips | Trips |  PE=HIDE
      Time |        |   Avg |   Min |   Max |
     Total |        |       |       |       |
 ----------------------------------------------------------------
  8.995914 |    100 |    25 |     0 |    25 | sweepy_.loop.1.li.33
  8.995604 |   2500 |    25 |     0 |    25 | sweepy_.loop.2.li.34
  8.894750 |     50 |    25 |     0 |    25 | sweepz_.loop.05.li.49
  8.894637 |   1250 |    25 |     0 |    25 | sweepz_.loop.06.li.50
  4.420629 |     50 |    25 |     0 |    25 | sweepx2_.loop.1.li.29
  4.420536 |   1250 |    25 |     0 |    25 | sweepx2_.loop.2.li.30
  4.387534 |     50 |    25 |     0 |    25 | sweepx1_.loop.1.li.29
  4.387457 |   1250 |    25 |     0 |    25 | sweepx1_.loop.2.li.30
  2.523214 | 187500 |   107 |     0 |   107 | riemann_.loop.2.li.63

View Source and Optimization Information

Scope Selected Loop(s)
- Trigger dependence analysis; scope loops above a given threshold

Review Scoping Results
- Loops with scoping information are flagged; red indicates that user assistance is needed
- Parallelization inhibitor messages are provided to assist the user with the analysis

Review Scoping Results (continued)
- Reveal identifies calls that prevent parallelization
- Reveal identifies shared reductions down the call chain
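The sketch below illustrates what a reduction down the call chain can look like and one common way to handle it. The module and routine names (vel_mod, update_cell, sweep, sweep_fixed) are hypothetical; only the variable name svel is borrowed from the variable list shown later in these slides.

module vel_mod
  implicit none
  real :: svel = 0.0              ! global maximum, updated inside the callee
end module vel_mod

subroutine update_cell(c)
  use vel_mod
  implicit none
  real, intent(in) :: c
  svel = max(svel, abs(c))        ! the reduction is hidden one level down the call chain
end subroutine update_cell

subroutine sweep(c, n)
  implicit none
  integer, intent(in) :: n
  real,    intent(in) :: c(n)
  integer :: i
  do i = 1, n                     ! Reveal flags this loop: svel is a shared
    call update_cell(c(i))        ! max-reduction reached through update_cell
  end do
end subroutine sweep

! One common fix (a sketch, not the only option): return the contribution
! to the caller so the OpenMP reduction clause can live where the loop does.
subroutine sweep_fixed(c, n, svel)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: c(n)
  real,    intent(inout) :: svel
  integer :: i
  !$omp parallel do private(i) shared(c, n) reduction(max:svel)
  do i = 1, n
    svel = max(svel, abs(c(i)))
  end do
end subroutine sweep_fixed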

Review Scoping Results (continued)

Generate OpenMP Directives

!Directive inserted by Cray Reveal.  May be incomplete.
!$omp parallel do default(none)                                         &
!$OMP&  unresolved (dvol,dx,dx0,e,f,flat,p,para,q,r,radius,svel,u,v,w,  &
!$OMP&              xa,xa0)                                             &
!$OMP&  private (i,j,k,m,n,$$_n,delp2,delp1,shock,temp2,old_flat,       &
!$OMP&           onemfl,hdt,sinxf0,gamfac1,gamfac2,dtheta,deltx,fractn, &
!$OMP&           ekin)                                                  &
!$OMP&  shared (gamm,isy,js,ks,mypey,ndim,ngeomy,nlefty,npey,nrighty,   &
!$OMP&          recv1,send2,zdy,zxc,zya)
  do k = 1, ks
    do i = 1, isy
      radius = zxc(i+mypey*isy)

      ! Put state variables into 1D arrays, padding with 6 ghost zones
      do m = 1, npey
        do j = 1, js
          n = j + js*(m-1) + 6
          r(n) = recv1(1,k,j,i,m)
          p(n) = recv1(2,k,j,i,m)
          u(n) = recv1(4,k,j,i,m)
          v(n) = recv1(5,k,j,i,m)
          w(n) = recv1(3,k,j,i,m)
          f(n) = recv1(6,k,j,i,m)
        enddo
      enddo

      do j = 1, jmax
        n = j + 6

Reveal generates the OpenMP directive with an illegal 'unresolved' clause marking the variables that still need addressing.
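For illustration only, here is a small self-contained sketch of what resolving an unresolved clause amounts to. The routine and variables (scale_and_track, tmp, vmax) are hypothetical; in the real application the correct scoping depends on how each flagged variable is used, including inside routines such as ppmlr called from the loop.

subroutine scale_and_track(a, n, vmax)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n)
  real,    intent(out)   :: vmax
  integer :: i
  real    :: tmp

  vmax = 0.0
  ! Reveal-style draft (not legal OpenMP, so it will not compile until resolved):
  !   !$omp parallel do default(none) unresolved(tmp, vmax) private(i) shared(a, n)
  ! Resolved by hand: tmp is private to each iteration, vmax is a max reduction.
  !$omp parallel do default(none) private(i, tmp) reduction(max:vmax) shared(a, n)
  do i = 1, n
    tmp  = 2.0 * a(i)
    a(i) = tmp
    vmax = max(vmax, tmp)
  end do
end subroutine scale_and_track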

Reveal Auto-Parallelization
- Minimal user time investment: just the time to set up and run the optimization experiment
  - Collect loop work estimates and build the program library
  - Click a button in Reveal to trigger dependence analysis, scope loops above a given threshold, and create a new binary
  - Run the experimental binary and compare against the original program
- Even if the experiment does not yield a performance improvement, Reveal will provide insight into parallelization issues
- Use on x86 or Armv8 hardware

Validate User-Inserted Directives
- A user-inserted directive with the mis-scoped variable n
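A minimal sketch of the kind of mis-scoping referred to here, using hypothetical names (pack_rows, src, dst): the index n is computed inside the parallel loop but was scoped shared, which creates a data race of the sort Reveal's scoping analysis would flag.

subroutine pack_rows(src, dst, js, npey)
  implicit none
  integer, intent(in)  :: js, npey
  real,    intent(in)  :: src(js, npey)
  real,    intent(out) :: dst(js*npey)
  integer :: j, m, n

  ! BUG (intentional, for illustration): n should be private, but it was
  ! scoped shared, so every thread races on the same n.
  !$omp parallel do default(none) private(j, m) shared(src, dst, js, npey, n)
  do m = 1, npey
    do j = 1, js
      n = j + js*(m-1)
      dst(n) = src(j, m)
    end do
  end do
end subroutine pack_rows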

Look for Vectorization Opportunities
- Choose the Compiler Messages view to access message filtering, then select the desired type of message
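As a small illustration (a hypothetical routine, vec_examples, not taken from the slides): the first loop below carries a dependence through a(i-1) and would typically show up in the compiler messages as not vectorized, while the second has independent iterations and is a straightforward vectorization candidate.

subroutine vec_examples(a, b, c, n)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n)
  real,    intent(in)    :: b(n)
  real,    intent(out)   :: c(n)
  integer :: i

  do i = 2, n
    a(i) = a(i-1) + b(i)          ! loop-carried dependence on a(i-1): usually not vectorized
  end do

  do i = 1, n
    c(i) = a(i) * b(i) + 2.0      ! independent iterations: vectorizable
  end do
end subroutine vec_examples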

Thank You!