HPC with PGI and Scalasca


1 HPC with PGI and Scalasca
Stefan Rosenberger
Supervisor: Univ.-Prof. Dipl.-Ing. Dr. Gundolf Haase
Institut für Mathematik und wissenschaftliches Rechnen, Universität Graz
May 28, 2015

2 Outline: 1 PGI Tools; 2 Scalasca

3–4 Parallel Programming with PGI
Automatic compilation of shared-memory parallel programs. PGI unrolls loops automatically.
Normal code:

double A[100], B[100];   /* array size assumed to match the loop bound */
double Z = 0.0;
for (int i = 0; i < 100; i++) {
    Z = Z + A[i] * B[i];
}

Unrolled code:

double A[100], B[100];
double Z = 0.0;
for (int i = 0; i < 100; i += 2) {
    Z = Z + A[i]   * B[i];
    Z = Z + A[i+1] * B[i+1];
}

5–6 Parallel Programming with PGI
Supports compilation of OpenMP shared-memory parallel programs.
Supports distributed computing using an MPI message-passing library for communication between distributed processes.
Common tasks during development:
Code optimization: efficient execution may require more time to compile.
Function inlining: replaces a call to a function or subroutine with the body of that function or subroutine (see the example below).
Directives and pragmas: allow users to place optimization hints in the source code.
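A hedged illustration of requesting inlining on the PGI command line (the file name is hypothetical):

pgcc -Minline=levels:2 -Minfo=inline -c mycode.c

Here -Minline enables function inlining, the levels:2 suboption also inlines calls inside already-inlined functions, and -Minfo=inline reports which call sites were inlined.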

7 Auto-Parallelization Using -Mconcur
-Mconcur scans the code for loops that are candidates for auto-parallelization.
-Mconcur must be used at both compile time and link time.
-Mconcur finds opportunities for auto-parallelization (the -Minfo information option reports which loops are parallelized).
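A minimal build-and-run sketch under these flags (the program name is an assumption; NCPUS is the PGI environment variable that sets the thread count of auto-parallelized programs):

pgcc -Mconcur -Minfo=par -o prog prog.c
export NCPUS=4
./prog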

8 Auto-Parallelization Using -Mconcur
Auto-parallelization fails in some situations:
Innermost loops: the PGI compiler will not parallelize innermost loops by default (it is usually not profitable).
Timing loops, for example (Fortran syntax), where the outer loop merely repeats the work for timing purposes:

do j = 1, 2
   do i = 1, n
      a(i) = b(i) + c(i)
   enddo
enddo

Every j iteration writes the same elements of a, so the compiler must treat the outer loop conservatively.

9 Auto-Parallelization Using -Mconcur
Auto-parallelization fails in some situations:
Scalars: consider the following example, where the scalar x must be privatized before the outer loop can be parallelized:

do j = 1, n
   x = b(j)
   do i = 1, n
      a(i,j) = x + c(i,j)
   enddo
enddo

Scalar last values: problems can arise if a privatized scalar is accessed outside the loop. Consider the following example, where t is only assigned on some iterations, so its value after the loop depends on the sequential iteration order:

for (i = 1; i < N; i++) {
    if (x[i] > 5.0)
        t = x[i];
}
v = t;
f(v);
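If the programmer knows the last value is in fact well defined, PGI provides the -Msafe_lastval option to assert that loops of this form may still be parallelized; a hedged sketch (the file name is hypothetical):

pgcc -Mconcur -Msafe_lastval -Minfo=par -c lastval.c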

10 Interprocedural Analysis (IPA)
The command-line option -Mipa activates IPA. IPA occurs in three phases:
1 Collection: create a summary of each function (the -Mipa switch must be present on the command line).
2 Propagation: propagate the summary information across all function and file boundaries.
3 Recompile/Optimization: recompile each of the object files with the propagated interprocedural information, producing specialized object files.
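A minimal sketch of an IPA build (file names hypothetical; -Mipa=fast is a commonly used suboption set). Because the recompile phase is triggered at link time, -Mipa must appear on the link line as well as on every compile line:

pgcc -Mipa=fast -c a.c
pgcc -Mipa=fast -c b.c
pgcc -Mipa=fast -o app a.o b.o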

11–12 Using OpenMP with PGI

#pragma omp parallel for shared(u, f, dele) \
        private(i, n, c0, c1, c2, t0, t1, t2, pdele) schedule(guided, 2)
for (i = 0; i < nsize; i++) {
    pdele = dele + (i * dpn);
    n = (i / dpn) * dpn;
    c0 = n; c1 = n + 1; c2 = n + 2;
    t0 = *pdele++;
    t1 = *pdele++;
    t2 = *pdele++;
    u[i] = omega * (t0 * f[c0] + t1 * f[c1] + t2 * f[c2]);
}

PGI understands the #pragma and handles the code correctly. Necessary command-line option: -mp
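A minimal compile-and-run sketch for such a kernel (the file name and thread count are assumptions):

pgcc -mp -Minfo=mp -o relax relax.c
export OMP_NUM_THREADS=8
./relax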

13–14 PGI Tool Options
-Mneginfo ... prints informational messages to standard error explaining why certain optimizations were not performed.
-Msafeptr ... can significantly improve performance of C/C++ programs in which there is known to be no pointer aliasing.
-Munroll ... unrolls loops.
-Mvect ... searches for loops that are candidates for high-level transformations such as loop distribution, loop interchange, etc.
Some of these options are automatically included in the -O1 ... -O4 options.
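A hedged example combining these flags (the file name is hypothetical; use -Msafeptr only when you are certain no pointers alias):

pgcc -O2 -Munroll -Mvect -Msafeptr -Mneginfo -c kernel.c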

15 Local and Global Optimization
One can invoke local and global optimization with the following command-line options:
-O0 ... no optimization.
-O1 ... specifies local optimization (good for irregular codes with many short if statements).
-O ... when no level is specified, level-two global optimizations are performed, including traditional scalar optimizations, induction recognition, and loop-invariant motion; no SIMD vectorization is enabled.
-O2 ... level two specifies global optimization.
-O3 ... level three specifies aggressive global optimization; it performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar-replacement optimizations that may or may not be profitable.
-O4 ... level four performs all level-one, level-two, and level-three optimizations and enables hoisting of guarded invariant floating-point expressions.

16 More Information, Quick Start
The -fast and -fastsse options create a generally optimal set of flags. Some of the options implied by -fast and -fastsse:
-O2 ... specifies a code optimization level of 2.
-Munroll=c:1 ... unrolls loops, executing multiple instances of the original loop during each iteration.
-Mnoframe ... indicates not to generate code to set up a stack frame.
-Mlre ... indicates loop-carried redundancy elimination.
-Mpre ... indicates partial redundancy elimination.
Many more options can be found in the PGI User's Guide.
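A reasonable quick-start compile line is therefore simply (hedged; the file name is hypothetical):

pgcc -fast -Minfo -o app app.c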

17 Scalasca
Outline: 1 PGI Tools; 2 Scalasca

18–19 Getting Started with Scalasca
Scalasca is a tool for improving the performance of programs on multi-core systems; in particular, it analyses where the computing time of a code is spent.
scalasca -instrument (short: skin): prepends the needed instrumentation flags to your compile/link commands.
scalasca -analyze (short: scan): controls the Score-P measurement environment during the execution of the target application.
scalasca -examine (short: square): post-processes the analysis report generated by a Score-P profiling measurement (Cube browser).
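A minimal end-to-end sketch of the three stages (compiler, file name, and process count are assumptions; scan names the experiment directory automatically, e.g. scorep_app_4_sum):

skin mpicc -O2 -o app app.c
scan mpirun -np 4 ./app
square scorep_app_4_sum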

20 Scalasca Instrumentation
All necessary instrumentation of user routines, OpenMP constructs, and MPI functions should be handled by the Score-P instrumenter, which is accessed through the scorep command. The scorep instrumenter must also be used on the link command.
Attention: Scalasca does not support CUDA, SHMEM, or OpenMP nested parallelism.
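Using scorep directly amounts to prefixing the existing build rule; a minimal sketch (the build lines are hypothetical):

scorep mpicc -O2 -fopenmp -c solver.c
scorep mpicc -O2 -fopenmp -o solver solver.o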

21 Runtime Measurement Collection & Analysis
We consider the following example (including filtering):

export SCOREP_FILTERING_FILE=.../src/example_cpu/FilterFile.filt
skin /usr/bin/mpicxx -O0 -g -fopenmp -DOPENMP -Wall -DFAST_ACC -DFAST_AMG -DNOSSE -DP2P_v1 example_cgamg.cpp -o example_cg
scan /usr/bin/mpirun -np 0 ./example_cg
scorep-score -r -f .../src/example_cpu/FilterFile.filt scorep_example_cg_XxO_sum/profile.cubex
square -f .../src/example_cpu/FilterFile.filt scorep_example_cg_XxO_sum/

22–23 Knowledge on Time Tracing
The Scalasca structure:
skin ... prepares and links the application with the measurement libraries.
scan ... collects measurement data in a new experiment directory.
square ... Scalasca's graphical interface.
One should note that during skin, Scalasca inserts time-measuring functions; the time measurement itself can therefore be distorted. Use filter files to exclude simple functions from the scan process (a sketch follows below). Note: Scalasca requires the filter files to be ASCII-encoded.
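A minimal sketch of a filter file in the Score-P format used above (the region names are hypothetical; wildcards such as vec_* match families of short, frequently called functions whose instrumentation overhead would otherwise distort the measurement):

SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    small_helper
    vec_*
SCOREP_REGION_NAMES_END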

24 Runtime Measurement Collection & Analysis
One gets a visualisation like the following: [Figure: Cube browser showing the analysis report]
