HPC with PGI and Scalasca


1 HPC with PGI and Scalasca
Stefan Rosenberger
Supervisor: Univ.-Prof. Dipl.-Ing. Dr. Gundolf Haase
Institut für Mathematik und wissenschaftliches Rechnen, Universität Graz
May 28, 2015

2 Outline: 1 PGI Tools; 2 Scalasca

3–4 Parallel Programming with PGI
Automatic compilation of shared-memory parallel programs. PGI unrolls loops automatically.
Normal code:

double A[100], B[100];   /* array size assumed to match the loop bound */
double Z = 0.0;
for (int i = 0; i < 100; i++) {
    Z = Z + A[i] * B[i];
}

Unrolled code:

double A[100], B[100];
double Z = 0.0;
for (int i = 0; i < 100; i += 2) {
    Z = Z + A[i]   * B[i];
    Z = Z + A[i+1] * B[i+1];
}

5–6 Parallel Programming with PGI
Supports compilation of OpenMP shared-memory parallel programs.
Supports distributed computing using an MPI message-passing library for communication between distributed processes.
Common tasks during development:
Code optimization: efficient execution may require more time to compile.
Function inlining: replaces a call to a function or subroutine with the body of that function or subroutine (see the example below).
Directives and pragmas: allow users to place optimization hints in the source code.
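A hedged illustration of requesting inlining on the PGI command line (the file name is hypothetical):

pgcc -Minline=levels:2 -Minfo=inline -c mycode.c

Here -Minline enables function inlining, the levels:2 suboption also inlines calls inside already-inlined functions, and -Minfo=inline reports which call sites were inlined.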

7 Auto-Parallelization Using -Mconcur
-Mconcur scans the code for loops that are candidates for auto-parallelization.
-Mconcur must be used at both compile time and link time.
-Mconcur finds opportunities for auto-parallelization (the -Minfo information option reports which loops are parallelized).
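A minimal build-and-run sketch under these flags (the program name is an assumption; NCPUS is the PGI environment variable that sets the thread count of auto-parallelized programs):

pgcc -Mconcur -Minfo=par -o prog prog.c
export NCPUS=4
./prog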

8 Auto-Parallelization Using -Mconcur
Auto-parallelization fails in some situations:
Innermost loops: the PGI compiler will not parallelize innermost loops by default (it is usually not profitable).
Timing loops, for example (Fortran syntax), where the outer loop merely repeats the work for timing purposes:

do j = 1, 2
   do i = 1, n
      a(i) = b(i) + c(i)
   enddo
enddo

Every j iteration writes the same elements of a, so the compiler must treat the outer loop conservatively.

9 Auto-Parallelization Using -Mconcur
Auto-parallelization fails in some situations:
Scalars: consider the following example, where the scalar x must be privatized before the outer loop can be parallelized:

do j = 1, n
   x = b(j)
   do i = 1, n
      a(i,j) = x + c(i,j)
   enddo
enddo

Scalar last values: problems can arise if a privatized scalar is accessed outside the loop. Consider the following example, where t is only assigned on some iterations, so its value after the loop depends on the sequential iteration order:

for (i = 1; i < N; i++) {
    if (x[i] > 5.0)
        t = x[i];
}
v = t;
f(v);
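If the programmer knows the last value is in fact well defined, PGI provides the -Msafe_lastval option to assert that loops of this form may still be parallelized; a hedged sketch (the file name is hypothetical):

pgcc -Mconcur -Msafe_lastval -Minfo=par -c lastval.c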

10 Interprocedural Analysis (IPA)
The command-line option -Mipa activates IPA. IPA occurs in three phases:
1 Collection: create a summary of each function (the -Mipa switch must be present on the command line).
2 Propagation: propagate the summary information across all function and file boundaries.
3 Recompile/Optimization: recompile each of the object files with the propagated interprocedural information, producing specialized object files.
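A minimal sketch of an IPA build (file names hypothetical; -Mipa=fast is a commonly used suboption set). Because the recompile phase is triggered at link time, -Mipa must appear on the link line as well as on every compile line:

pgcc -Mipa=fast -c a.c
pgcc -Mipa=fast -c b.c
pgcc -Mipa=fast -o app a.o b.o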

11–12 Using OpenMP with PGI

#pragma omp parallel for shared(u, f, dele) \
        private(i, n, c0, c1, c2, t0, t1, t2, pdele) schedule(guided, 2)
for (i = 0; i < nsize; i++) {
    pdele = dele + (i * dpn);
    n = (i / dpn) * dpn;
    c0 = n; c1 = n + 1; c2 = n + 2;
    t0 = *pdele++;
    t1 = *pdele++;
    t2 = *pdele++;
    u[i] = omega * (t0 * f[c0] + t1 * f[c1] + t2 * f[c2]);
}

PGI understands the #pragma and handles the code correctly. Necessary command-line option: -mp
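A minimal compile-and-run sketch for such a kernel (the file name and thread count are assumptions):

pgcc -mp -Minfo=mp -o relax relax.c
export OMP_NUM_THREADS=8
./relax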

13–14 PGI Tool Options
-Mneginfo ... prints informational messages to standard error explaining why certain optimizations were not performed.
-Msafeptr ... can significantly improve performance of C/C++ programs in which there is known to be no pointer aliasing.
-Munroll ... unrolls loops.
-Mvect ... searches for loops that are candidates for high-level transformations such as loop distribution, loop interchange, etc.
Some of these options are automatically included in the -O1 ... -O4 options.
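A hedged example combining these flags (the file name is hypothetical; use -Msafeptr only when you are certain no pointers alias):

pgcc -O2 -Munroll -Mvect -Msafeptr -Mneginfo -c kernel.c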

15 Local and Global Optimization
One can invoke local and global optimization with the following command-line options:
-O0 ... no optimization.
-O1 ... specifies local optimization (good for irregular codes with many short if statements).
-O ... when no level is specified, level-two global optimizations are performed, including traditional scalar optimizations, induction recognition, and loop-invariant motion; no SIMD vectorization is enabled.
-O2 ... level two specifies global optimization.
-O3 ... level three specifies aggressive global optimization; it performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar-replacement optimizations that may or may not be profitable.
-O4 ... level four performs all level-one, level-two, and level-three optimizations and enables hoisting of guarded invariant floating-point expressions.

16 More Information, Quick Start
The -fast and -fastsse options create a generally optimal set of flags. Some of the options implied by -fast and -fastsse:
-O2 ... specifies a code optimization level of 2.
-Munroll=c:1 ... unrolls loops, executing multiple instances of the original loop during each iteration.
-Mnoframe ... indicates not to generate code to set up a stack frame.
-Mlre ... indicates loop-carried redundancy elimination.
-Mpre ... indicates partial redundancy elimination.
Many more options can be found in the PGI User's Guide.
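A reasonable quick-start compile line is therefore simply (hedged; the file name is hypothetical):

pgcc -fast -Minfo -o app app.c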

17 Scalasca
Outline: 1 PGI Tools; 2 Scalasca

18–19 Getting Started with Scalasca
Scalasca is a tool for improving the performance of programs on multi-core systems; in particular, it analyses where the computing time of a code is spent.
scalasca -instrument (short: skin): prepends the needed instrumentation flags to your compile/link commands.
scalasca -analyze (short: scan): controls the Score-P measurement environment during the execution of the target application.
scalasca -examine (short: square): post-processes the analysis report generated by a Score-P profiling measurement (Cube browser).
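A minimal end-to-end sketch of the three stages (compiler, file name, and process count are assumptions; scan names the experiment directory automatically, e.g. scorep_app_4_sum):

skin mpicc -O2 -o app app.c
scan mpirun -np 4 ./app
square scorep_app_4_sum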

20 Scalasca Instrumentation
All necessary instrumentation of user routines, OpenMP constructs, and MPI functions should be handled by the Score-P instrumenter, which is accessed through the scorep command. The scorep instrumenter must also be used on the link command.
Attention: Scalasca does not support CUDA, SHMEM, or OpenMP nested parallelism.
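Using scorep directly amounts to prefixing the existing build rule; a minimal sketch (the build lines are hypothetical):

scorep mpicc -O2 -fopenmp -c solver.c
scorep mpicc -O2 -fopenmp -o solver solver.o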

21 Runtime Measurement Collection & Analysis
We consider the following example (including filtering):

export SCOREP_FILTERING_FILE=.../src/example_cpu/FilterFile.filt
skin /usr/bin/mpicxx -O0 -g -fopenmp -DOPENMP -Wall -DFAST_ACC -DFAST_AMG -DNOSSE -DP2P_v1 example_cgamg.cpp -o example_cg
scan /usr/bin/mpirun -np 0 ./example_cg
scorep-score -r -f .../src/example_cpu/FilterFile.filt scorep_example_cg_XxO_sum/profile.cubex
square -f .../src/example_cpu/FilterFile.filt scorep_example_cg_XxO_sum/

22–23 Knowledge on Time Tracing
The Scalasca structure:
skin ... prepares and links the application with the measurement libraries.
scan ... collects measurement data in a new experiment directory.
square ... Scalasca's graphical interface.
One should note that during skin, Scalasca inserts time-measuring functions; the time measurement itself can therefore be distorted. Use filter files to exclude simple functions from the scan process (a sketch follows below). Note: Scalasca requires the filter files to be ASCII-encoded.
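A minimal sketch of a filter file in the Score-P format used above (the region names are hypothetical; wildcards such as vec_* match families of short, frequently called functions whose instrumentation overhead would otherwise distort the measurement):

SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    small_helper
    vec_*
SCOREP_REGION_NAMES_END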

24 Runtime Measurement Collection & Analysis
One gets a visualisation like the following: [Figure: Cube browser showing the analysis report]
