Module II: Optimizing Serial Programs Part 1 - Compilation and Linking
- Dominic Bailey
1 Performance Programming: Theory, Practice and Case Studies Module II: Optimizing Serial Programs Part 1 - Compilation and Linking 38
2 Compilation overview Outline Compiler optimizations General optimizations Specifying target architecture Function inlining Data alignment optimizations Data prefetching Aliasing optimizations Compiler macro options Compiler pragmas and directives Software pipelining Linking overview Static and dynamic linking Using optimized mathematical libraries 39
3 Compilation Overview 40 Compiler: translates source language program into target language program Analysis stage Synthesis stage Analysis stage converts source program into intermediate representation includes three stages: lexical, syntax and semantic analysis Synthesis stage converts intermediate representation into target language program is divided into two steps: code optimization and code generation Other activities: symbol table management, error handling, OS interface
4 Compiler Organization Example: Sun ONE Studio compilers 41 Different frontends (C, C++, Fortran 77, Fortran 95) generate intermediate representation (SunIR) Backend generates code for target architectures (SPARC, x86)
5 Using Compilers Optimizing compiler: most important performance enhancing tool Compiler optimizations from a usage perspective: Applicability Utility Cost Complementary Optimizations Example: using double-word alignment and architecture-specific optimizations results in better performance than when these are used in isolation Use the latest release of compiler for best performance 42
6 Selecting Compiler Options Determining optimal compiler options is an iterative process Order of flags can make a difference, e.g. later options may take precedence over earlier ones Possible cross-compilation: compilation and target systems may be different Default settings of options should be considered Macro options can be used Trade-off between optimization level and compilation time and resource utilization 43
7 Setting Optimization Level 44 The most basic and most actively used optimization option. Generally set with the -O option. Different compilers have different numbers of levels and different meanings for each of them, for example GNU: -O1 to -O3, Sun: -O1 to -O5. Compaq (DEC) -O2 and HP +O2 don't necessarily mean the same thing. By default (no option) usually no optimization is performed. The default optimization level (usually plain -O) can mean different things and is not recommended. Higher optimization levels lead to longer compilation times and larger binaries
8 Different Optimization Levels 45 Example: Optimization levels for Sun ONE Studio compilers. -O1: Basic local optimization, assembly postpass. -O2: -O1 plus basic global optimization, including algebraic simplification, local and global subexpression elimination, register allocation, dead-code elimination, constant propagation, tail-call elimination. -O3: -O2 plus loop unrolling, fusion, software pipelining. -O4: -O3 plus function inlining within the module, aggressive global optimization. -O5: Highest optimization level, likely to improve performance when used in combination with profile feedback.
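The transformations named for -O2 can be seen in a small hand-annotated sketch. This is an illustrative fragment (the function names are made up, not from the slides); the compiler performs these rewrites itself, the comments only point out where each one applies.

```c
#include <assert.h>

/* Fragment annotated with typical -O2-class optimizations. */
int transform(int x, int y) {
    int unused = x * 100;    /* dead-code elimination: never used, removed */
    int a = (x + y) * 2;
    int b = (x + y) * 3;     /* common subexpression elimination:
                                (x + y) is computed once and reused */
    int k = 4 * 8;           /* constant propagation/folding: k becomes 32 */
    (void)unused;
    return a + b + k;
}

/* Tail-call elimination: the recursive call in tail position can be
   turned into a jump, so no stack frame is consumed per step. */
int gcd(int a, int b) {
    if (b == 0) return a;
    return gcd(b, a % b);
}
```

The point is that at -O0 each line is translated literally, while at -O2 the generated code for `transform` is just a few arithmetic instructions.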
9 Example: Effect of Optimization Runtime for basic matrix-matrix multiplication compiled with different optimization levels, HP C compiler. [Chart: run time in seconds for no optimization, -O1, -O2, -O3, -O4.]
10 Increased Compilation Time Compilation time for the dblat3.f routine from Netlib, Compaq Fortran Compiler X5.4A. [Chart: compilation time in seconds at -g, -O1, -O2, -O3, -O4, -O5.]
11 Setting Target Architecture Selecting specific target architecture allows the compiler backend to optimally use available resources Instructions understood by the CPU Registers (number, types) CPU properties (latencies, pipeline depth, etc.) Example options IBM: -qarch=pwr4 Sun: -xarch=v8plusa SGI: -mips4 GNU: -march=pentiumpro DEC: -arch ev67 48
12 Benefit of Architecture Setting Example: repeated coordinate transformation; Sun Forte 6 Fortran 77 compiler; runs on a Sun Ultra 80 (400MHz UltraSPARC-II). This CPU is characterized by the -xarch=v8plusa option; -xarch=v8 (for SuperSPARC) is suboptimal. [Chart: runtime in seconds for v8 vs. v8plusa.]
13 Function Call Inlining 50 Inlining: replicating the body of the called function at the call site in the caller. Pros: eliminates function call overhead; improved optimization due to code transparency across function calls. Cons: larger binaries and increased compilation time; possibility of increased instruction cache misses. Inlining can be controlled by compiler options (optimization levels -Ox and specialized options). Some environments allow inlining assembly templates
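A minimal C sketch of what inlining buys: at higher -O levels (or with the options above) the compiler substitutes the body of `sq` into the loop, removing per-iteration call overhead and exposing the multiply for scheduling. The function names here are illustrative, not from the slides.

```c
#include <assert.h>

/* A small leaf function: a prime inlining candidate. */
static inline double sq(double x) { return x * x; }

double sum_of_squares(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += sq(a[i]);   /* after inlining this is just a[i]*a[i],
                            with no call/return in the loop body */
    return s;
}
```

With the call inlined, the loop becomes a candidate for unrolling and software pipelining as well, which is the "code transparency" benefit above.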
14 Example: Inlining Basic example. [Diagram: the inlining point in the caller and the function body replicated there.] 51
15 Example: Inlining (continued) Building the executable with and without cross-file inlining; Sun ONE Studio 7 compilers; Ultra 60 (360 MHz). [Chart: runtime in seconds with and without inlining.]
16 Inlining Options 53 Options that control inlining for various compilers: -inline, -xinline, +Oinline, -Q, -Minline, -xcrossfile. High -Ox optimization levels can imply inlining. Functions can be inlined selectively, e.g. -xinline=[f1..fn]. Selective inlining may reduce the number of inlined functions (compared to those otherwise inlined with -Ox)
17 Vectorization of Standard Calls Standard math calls in large loops can be vectorized (exp, log, trig functions, etc.) Compiler can replace calls with vectorized versions provided in libraries Can be a dedicated option HP: +Ovectorize Sun: -xvector Can be included in some optimization -Ox level Can be controlled by directives or pragmas in the code 54
18 Profile/Feedback Optimization 55 Feedback directed optimizations based on runtime execution frequency data: Optimizations performed on more frequently executed portions Register allocation, basic block ordering, code motion and rearrangement, inlining Corresponding compiler options HP: +Oprofile=collect, +Oprofile=use Sun: -xprofile=collect, -xprofile=use Compaq: -feedback (in combination with Pixie) GNU: -fprofile-arcs, -fbranch-probabilities Using profile-feedback optimization Compile program with options to collect data Run on a training data set. Profile file or directory is generated Recompile with option to optimize based on collected data Other options should be used consistently
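One of the things profile feedback tells the compiler is which branches are hot, so the hot path can fall through and the cold path can be moved out of line. GCC and Clang also let the programmer state this by hand with the real `__builtin_expect` builtin; the sketch below wraps it in macros so the code still compiles with other compilers. The function itself is a made-up example.

```c
#include <assert.h>

/* Manual branch-probability hints: the in-source analogue of what
   -fprofile-arcs / -fbranch-probabilities derives from training runs. */
#if defined(__GNUC__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

int checked_div(int a, int b, int *err) {
    if (UNLIKELY(b == 0)) {  /* cold path: laid out off the hot trace */
        *err = 1;
        return 0;
    }
    *err = 0;                /* hot path: falls straight through */
    return a / b;
}
```

Profile feedback is usually preferable to hand hints, since the training run measures the real frequencies instead of guessing them.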
19 Data Prefetching 56 Prefetching allows overlapping execution with fetching data from memory. Benefits memory-latency-bound applications (particularly on high-latency machines). Efficient for programs with repeatable memory access patterns. Best used in combination with options that specify microarchitecture features. Sample compiler options: HP: +Odataprefetch; Sun: -xprefetch, -xprefetch_level; SGI: pf<n>, prefetch, prefetch_ahead, prefetch_manual. Prefetching can also be manually controlled with pragmas or directives
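What the compiler's prefetch insertion amounts to can be shown by hand with the real GCC/Clang builtin `__builtin_prefetch`: touch data a fixed distance ahead of the current element so the memory fetch overlaps with computation. The loop is a daxpy-type kernel like the one in the next slide's example, and the prefetch distance of 16 is an illustrative guess, not a tuned value.

```c
#include <assert.h>

/* daxpy with manual prefetching a fixed distance ahead. Prefetches past
   the end of the array are harmless: the builtin never faults. */
double daxpy_prefetch(double alpha, const double *x, double *y, int n) {
    double checksum = 0.0;
    for (int i = 0; i < n; i++) {
#if defined(__GNUC__)
        __builtin_prefetch(&x[i + 16], 0, 1);  /* 0 = for reading */
        __builtin_prefetch(&y[i + 16], 1, 1);  /* 1 = for writing */
#endif
        y[i] = y[i] + alpha * x[i];
        checksum += y[i];
    }
    return checksum;
}
```

In practice the compiler options above (-xprefetch, +Odataprefetch) do this automatically and pick the distance from the target's latency model.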
20 Benefits of Prefetching Example: daxpy-type loop calculation. Moderate effect on Sun Ultra 60; big impact on the high-latency Sun Enterprise; even higher effect on the UltraSPARC-III based SunFire 480R (extra prefetch cache, and higher clock speed makes relative latencies higher). [Chart: scaled runtimes with and without -xprefetch on Ultra 60, Sun E1000, Sun 480R.] 57
21 Floating Point Optimizations 58 IEEE 754 floating point standard ensures similar numerical behavior on different platforms Binary representation of FP numbers Operation precedence Rounding Underflow, overflow and trap handling Some compiler optimizations relax IEEE 754 requirements (slight numeric differences can occur) Algebraic simplifications Underflow control Rounding control Trap handling Example options that affect FP behavior Sun: -fsimple, -fround, -fns, -ftrap HP: +FPstring, +FPVZO
22 Example: Algebraic Simplifications 59 The Sun -fsimple option allows non-IEEE arithmetic and specifies FP simplifying assumptions. -fsimple=0: No simplifying assumptions; IEEE conformant. -fsimple=1: Conservative simplification; does not strictly conform to IEEE 754, but numeric results are typically unchanged. IEEE 754 default rounding and trapping modes are unchanged. Infinities and NaNs are not propagated; that is, x*0 can be replaced by 0. Computations do not depend on the sign of zero. -fsimple=2: Aggressive optimizations that may lead to different numeric results. Example: with -fsimple=2 the computation of x/y is replaced with x*z, where z=1/y is computed once. Other optimizations: cycle shrinking, height reduction of the directed acyclic graph, cross-iteration common subexpression elimination, and scalar replacement. -fsimple=2 is best used in conjunction with other high optimization flags.
23 Algebraic Simplifications (cont.) Example 1: elimination of redundant FP operations causing exceptions:
  dum1 = 0.d0
  dum2 = 20.d0/dum1
With -fsimple=0 the program aborts; with -fsimple=1 it runs. Example 2: -fsimple=1 vs. -fsimple=2 for a dot product (Forte 6 Update 1 Fortran 90; Ultra 60):
  sum = 0.d0
  do i = 1, nele
    sum = sum + a(i)*b(i)
  enddo
[Chart: runtime in seconds for -fsimple=1 and -fsimple=2.]
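A minimal C sketch of the -fsimple=2 rewrite described above, done by hand: the reciprocal is computed once and the n divides become multiplies. The results can differ from the divide version in the last bits, which is exactly why the option is opt-in. Function names are illustrative.

```c
#include <assert.h>

/* Literal form: n floating-point divides. */
void scale_div(const double *x, double *out, double y, int n) {
    for (int i = 0; i < n; i++)
        out[i] = x[i] / y;
}

/* Simplified form: one divide hoisted out, n multiplies in the loop.
   Divides have much higher latency than multiplies, hence the speedup. */
void scale_recip(const double *x, double *out, double y, int n) {
    double z = 1.0 / y;
    for (int i = 0; i < n; i++)
        out[i] = x[i] * z;
}
```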
24 Data Alignment 61 Most microprocessors have preferred data alignment Data on natural or preferred byte-boundaries accessed faster, e.g. data on 8-byte boundaries can be accessed in 1 doubleword load/store instruction Misaligned data may cause restrictive load/store instructions Programming language standards specify language-specific alignment rules Compiler makes conservative assumption on data-alignment Compiler options (+align, -dalign) cause generation of double-word load/store for DP data Padding may be inserted in Fortran COMMON blocks Must be used consistently (if one module compiled with it, compile all) Optimizer can better carry out other optimizations (e.g. software pipelining)
25 Example: Data Alignment Example: array copy on SunFire 480; Forte Developer 7 compiler. [Chart: runtime in seconds with and without -dalign, for data in memory and in L2 cache.]
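The natural-boundary rule can be inspected directly in C. In the struct below the compiler inserts padding after the char so the double lands on its preferred boundary and can be accessed with a single doubleword load/store. This is a generic illustration (C11 `alignof`), not code from the slides.

```c
#include <stddef.h>
#include <stdalign.h>

/* The compiler pads after 'tag' so 'value' sits on the natural
   alignment boundary of double rather than at offset 1. */
struct sample {
    char   tag;
    double value;
};
```

`offsetof(struct sample, value)` is always a multiple of `alignof(double)`; misaligned placement would force the restrictive load/store sequences mentioned above.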
26 Pointer Alias Analysis Options 63 C programs: pointers can point to overlapping regions of memory (aliasing), causing memory ambiguity. Memory alias disambiguation in general programs is very complex for the compiler to perform, so the compiler is conservative: optimizations (load hoisting, unrolling, software pipelining) are suppressed. Options can be used to inform the compiler about the aliasing properties of the code: IBM: -qalias, SGI: alias, Sun: -xrestrict, -xalias_level, HP: -alias, GNU: -fstrict-aliasing, -fargument-alias. The Fortran standard makes it the programmer's responsibility to ensure absence of aliasing
27 Pointer Alias Analysis Alias relationships in a program:
Does alias:
  int *i, j = 1;
  i = &j;
Does not alias:
  double func(double b) {
      double *c;
      c = (double *) malloc(sizeof(double));
      *c = 10.0;
      b = b + *c;
      free(c);
      return(b);
  }
May alias:
  int sum, *vloc;
  function foo (double *a)
  for (i = 0; i < n; i++) {
      for (j = 0; j < m; j++) {
          a[vloc[i]+j] = a[vloc[i]+j] + sum
      }
  }
Aliasing can have a big performance impact in memory-bound programs. The compiler generates conservative code in the does-alias situation. In large programs with many pointers, may-alias is the most common relationship; it is treated as equivalent to does-alias (conservative code is generated). The more does-not-alias relationships there are, the more flexibility the compiler has in generating optimized code
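Besides compiler-wide options, C99 provides the `restrict` qualifier (a real keyword) as an in-source way to turn a may-alias relationship into does-not-alias for a specific function: the programmer promises that x and y do not overlap, so loads of x can be hoisted and the loop pipelined. The kernel itself is an illustrative axpy.

```c
#include <assert.h>

/* restrict asserts x and y never overlap; the compiler may then keep
   x values in registers across stores to y and pipeline the loop. */
void axpy(int n, double a, const double * restrict x,
          double * restrict y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

As with the options above, the promise is the programmer's responsibility: calling `axpy` with overlapping arrays is undefined behavior.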
28 Effect of Pointer Alias Analysis Computational program on a Sun Blade 1000, with structures similar to those used in graph partitioning. Type-based alias disambiguation with the -xalias_level=<setting> option; 7 settings: any, basic, weak, layout, strict, std, strong. Without any setting, the default level (layout) is used. For programs conforming to the ISO 1999 C standard: std. [Chart: runtime in seconds; loads generated: -xalias_level=strong (14 ld instructions), std (21), strict (21), layout (31), weak (31), basic (36), any (66).] 65
29 Compiler Macro Options 66 Some compilers provide meta-options that combine the most effective optimizations. Examples: Sun: -fast, SGI: -Ofast. Can be a good starting point for selecting options. Caveats: can interfere with other options; macro options can change between releases; can have components that should be used consistently on all parts of the code (e.g. data alignment); some component options should also be used for linking; macro options can have architecture settings. Sun ONE Studio C compiler: -fast expands to -fns -fsimple=2 -fsingle -ftrap=%none -xalias_level=basic -native -xbuiltin=%all -xdepend -xlibmil -xmemalign=8s -xO5 -xprefetch=auto,explicit
30 Compiler Directives and Pragmas Directives and pragmas: Annotations inserted in source Provide compiler with specific information about parts of the program (usually statements following the annotation) Directives/pragmas specific to a compiler are usually ignored by other compilers First step in source-code modification Directives are used for Parallelization (e.g. OpenMP) Pipelining control Prefetching control Data alignment Alias analysis Storage allocation / array padding 67
31 Pipelining 68 Software pipelining: a technique to extract instruction-level parallelism (ILP) in loops. Runtime of a program is T = N_instr x cycle time x CPI; higher ILP decreases CPI. Software pipelining breaks loop iterations into multiple parts; parts from disjoint iterations are overlapped such that they map to different functional units of the processor. Parallelism across loop iterations is thus mapped onto the ILP of the processor. Software pipelining and loop unrolling are independent but related approaches, often used together: loop unrolling decreases loop overhead but does no scheduling of instructions; software pipelining schedules instructions to decrease processor pipeline stalls by hiding instruction latencies
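The unrolling half of the pair can be sketched by hand. Below is a 4-way unrolled dot-product reduction, the transformation -O3-class unrolling performs automatically: loop overhead (test, branch, increment) is paid once per four elements, and the four independent accumulators give the scheduler the parallelism it needs to pipeline the multiply-adds. The function name is illustrative.

```c
#include <assert.h>

/* 4-way unrolled dot product with independent partial sums. */
double dot4(const double *a, const double *b, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i]     * b[i];      /* the four accumulators have no   */
        s1 += a[i + 1] * b[i + 1];  /* dependences on each other, so   */
        s2 += a[i + 2] * b[i + 2];  /* their latencies can be overlapped */
        s3 += a[i + 3] * b[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)              /* remainder loop when n % 4 != 0 */
        s += a[i] * b[i];
    return s;
}
```

Note that reassociating the sum into four partial sums can change FP rounding slightly, which is why compilers tie such reductions to relaxed-FP options like -fsimple.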
32 Pipelining (cont.) Software pipelining via modulo scheduling. In many cases it is not safe (or not profitable) for the compiler to pipeline a loop: loops with indirection; loops with branches or conditionals; fat (computationally dense) loops; loops where the trip-count calculation cannot be safely performed. 69
33 Pipelining Control Example: extracted from a reservoir simulation application (data set fits in the 4MB level-2 cache); Sun Forte 6 update 1 compilers.
      do ivr = isvs(1,ksld), isvs(2,ksld)
        iv1 = isvr(1,ivr)
        iv2 = isvr(2,ivr)
        iv  = isvr(3,ivr)
!$pragma sun pipeloop = 0
        do ic = iv1, iv2
          in1 = ic + iv
          in2 = ic - icf
          wrk(in1) = wrk(in1) - a(in2)*wrk(ic)
        enddo
      enddo
[Chart: runtime in seconds with the pipeloop pragma vs. no pragma.]
34 Linking Overview 71 Linking: stages in generating and running executables (link-editing and runtime linking). Link-editor: concatenates relocatable object files (produced by the compiler or assembler) to generate libraries or executables; performs symbol resolution to bind external symbols to implementations using the symbol tables in object files and libraries. Runtime (dynamic) linker: loads executable files and shared libraries and generates a runnable process; maps the files produced by the link-editor to memory and performs relocations. Linker stages can be hidden: link-editing can be done by the compiler, and the runtime linker is invoked by running the executable
35 Static and Dynamic Linking Static linking: object files from archives (*.a files) go into the executable. The executable contains all the code it needs to run; easier to deploy and test applications; only the code for the required functions is used; limited portability/flexibility. Dynamic linking: the executable may have dependencies on runtime libraries. Executables tend to be smaller (do not replicate the code); libraries can be shared between several processes running on the system (improves memory utilization and reduces paging); greater flexibility for building and testing an application 72
36 Types of Libraries 73 System libraries (usually in /usr/lib) Typically shared libraries (can be static as well, but use of shared libs is recommended for portability) Some environments provide both 64-bit and 32-bit versions Compiler libraries Static and dynamic libraries In some environments these libraries might not be available on user system, developers can either link them statically or link dynamic versions and distribute libraries with the application The dynamically linked compiler libraries can be later replaced with a new version or patch User or application libraries An application can use dynamic or static libraries as well as the mixture of the two Greater flexibility with dynamic linking
37 Linker Mapfiles Mapfiles can be used to specify the layout of functions in libraries (and memory) Can improve instruction cache utilization and reduce paging activity if callers/callees are placed nearby Mapfiles can be generated by profiling tools Can be used for other purposes (e.g. to reduce the scope of symbols) 74
38 Optimized Mathematical Libraries Using optimized libraries takes advantage of the highly efficient implementations of standard mathematical functions Examples of optimized libraries Optimized versions of standard UNIX math (libm) calls Vectorized versions of standard calls Optimized inline templates Optimized and/or parallelized BLAS, FFT, etc. Libraries/calls with SIMD instructions Libraries for distributed computing 75
39 Vectorized Math Libraries Some libraries offer vector versions of some elementary mathematical functions Functions are evaluated for an entire vector of values at once Can be invoked (replacing standard calls) using compiler options Examples: HP: HP-VML (Itanium) Sun: Vector Math Library (libmvec) IBM: Vector Mathematical Acceleration SubSystem (MASS) 76
40 Optimized BLAS, FFT, etc. Optimized/parallelized libraries for BLAS, sparse BLAS, LAPACK, ScaLAPACK, FFT. Versions for 32- and 64-bit, different CPU architectures, etc. Examples: Sun: Performance Library (libsunperf), Scientific Subroutine Library (libs3l); HP: MLIB (VECLIB, LAPACK, ScaLAPACK, and SuperLU_DIST); IBM: Engineering and Scientific Subroutine Library (ESSL); SGI: Scientific Computing Software Library (SCSL); Compaq: Compaq Extended Math Library (CXML) 77
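What these libraries replace is hand-written kernels like the naive matrix-matrix multiply below. A tuned BLAS (libsunperf, ESSL, MLIB, ...) computes the same result through dgemm, but blocked for cache and software-pipelined; swapping this loop for a library call is usually the single biggest win for dense linear algebra. The function name is illustrative, and matrices are stored row-major in flat arrays for simplicity.

```c
#include <assert.h>

/* Naive n-by-n matrix multiply, C = A * B, row-major flat storage.
   This is the reference computation an optimized BLAS dgemm replaces. */
void naive_dgemm(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}
```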
41 Single Instruction Multiple Data Single Instruction Multiple Data (SIMD) One instruction operates on several pieces of data Used in media, bioinformatics, etc. Can be accessed by wrappers provided in libraries SIMD implementations Sun: VIS Intel: MMX, SSE, SSE-2 AMD: 3DNow! Motorola: AltiVec 78
42 Summary 79 Compiler is the most efficient tool for optimizing applications Compiler options should be carefully selected Higher optimization leads to larger binaries and higher compilation times Large performance potential in using options related to setting architecture, data alignment and alias disambiguation Prefetching options can be used to hide memory latency Directives can be used to provide additional information to the compiler about regions of the code Optimized mathematical libraries can efficiently implement common APIs
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit
More information55:132/22C:160, HPCA Spring 2011
55:132/22C:160, HPCA Spring 2011 Second Lecture Slide Set Instruction Set Architecture Instruction Set Architecture ISA, the boundary between software and hardware Specifies the logical machine that is
More informationInstruction Set Architecture
Computer Architecture Instruction Set Architecture Lynn Choi Korea University Machine Language Programming language High-level programming languages Procedural languages: C, PASCAL, FORTRAN Object-oriented
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationThe SGI Pro64 Compiler Infrastructure - A Tutorial
The SGI Pro64 Compiler Infrastructure - A Tutorial Guang R. Gao (U of Delaware) J. Dehnert (SGI) J. N. Amaral (U of Alberta) R. Towle (SGI) Acknowledgement The SGI Compiler Development Teams The MIPSpro/Pro64
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationSRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY DEPARTMENT OF ECE EC6504 MICROPROCESSOR AND MICROCONTROLLER (REGULATION 2013)
SRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY DEPARTMENT OF ECE EC6504 MICROPROCESSOR AND MICROCONTROLLER (REGULATION 2013) UNIT I THE 8086 MICROPROCESSOR PART A (2 MARKS) 1. What are the functional
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationBindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core
Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationCompiler Architecture
Code Generation 1 Compiler Architecture Source language Scanner (lexical analysis) Tokens Parser (syntax analysis) Syntactic structure Semantic Analysis (IC generator) Intermediate Language Code Optimizer
More informationAgenda. What is the Itanium Architecture? Terminology What is the Itanium Architecture? Thomas Siebold Technology Consultant Alpha Systems Division
What is the Itanium Architecture? Thomas Siebold Technology Consultant Alpha Systems Division thomas.siebold@hp.com Agenda Terminology What is the Itanium Architecture? 1 Terminology Processor Architectures
More informationPipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &
More informationCSCE 5610: Computer Architecture
HW #1 1.3, 1.5, 1.9, 1.12 Due: Sept 12, 2018 Review: Execution time of a program Arithmetic Average, Weighted Arithmetic Average Geometric Mean Benchmarks, kernels and synthetic benchmarks Computing CPI
More informationCode optimization techniques
& Alberto Bertoldo Advanced Computing Group Dept. of Information Engineering, University of Padova, Italy cyberto@dei.unipd.it May 19, 2009 The Four Commandments 1. The Pareto principle 80% of the effects
More informationLECTURE 19. Subroutines and Parameter Passing
LECTURE 19 Subroutines and Parameter Passing ABSTRACTION Recall: Abstraction is the process by which we can hide larger or more complex code fragments behind a simple name. Data abstraction: hide data
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationChapter 9 Memory Management
Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual
More informationMath 230 Assembly Programming (AKA Computer Organization) Spring MIPS Intro
Math 230 Assembly Programming (AKA Computer Organization) Spring 2008 MIPS Intro Adapted from slides developed for: Mary J. Irwin PSU CSE331 Dave Patterson s UCB CS152 M230 L09.1 Smith Spring 2008 MIPS
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationAbout the Authors... iii Introduction... xvii. Chapter 1: System Software... 1
Table of Contents About the Authors... iii Introduction... xvii Chapter 1: System Software... 1 1.1 Concept of System Software... 2 Types of Software Programs... 2 Software Programs and the Computing Machine...
More informationDSP Mapping, Coding, Optimization
DSP Mapping, Coding, Optimization On TMS320C6000 Family using CCS (Code Composer Studio) ver 3.3 Started with writing a simple C code in the class, from scratch Project called First, written for C6713
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationThese slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information.
11 1 This Set 11 1 These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information. Text covers multiple-issue machines in Chapter 4, but
More informationMulti-core processors are here, but how do you resolve data bottlenecks in native code?
Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session
More informationPage # Let the Compiler Do it Pros and Cons Pros. Exploiting ILP through Software Approaches. Cons. Perhaps a mixture of the two?
Exploiting ILP through Software Approaches Venkatesh Akella EEC 270 Winter 2005 Based on Slides from Prof. Al. Davis @ cs.utah.edu Let the Compiler Do it Pros and Cons Pros No window size limitation, the
More informationIntel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth
Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3
More informationMake Your C/C++ and PL/I Code FLY With the Right Compiler Options
Make Your C/C++ and PL/I Code FLY With the Right Compiler Options Visda Vokhshoori/Peter Elderon IBM Corporation Session 13790 Insert Custom Session QR if Desired. WHAT does good application performance
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationOverview Implicit Vectorisation Explicit Vectorisation Data Alignment Summary. Vectorisation. James Briggs. 1 COSMOS DiRAC.
Vectorisation James Briggs 1 COSMOS DiRAC April 28, 2015 Session Plan 1 Overview 2 Implicit Vectorisation 3 Explicit Vectorisation 4 Data Alignment 5 Summary Section 1 Overview What is SIMD? Scalar Processing:
More informationAn Oracle White Paper June Optimizing Applications with Oracle Solaris Studio Compilers and Tools
An Oracle White Paper June 2010 Optimizing Applications with Oracle Solaris Studio Compilers and Tools Introduction...1 Oracle Solaris Studio Compilers and Tools...2 Optimizing Applications for Serial
More informationLecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 2009. Thanks to many sources for slide material: Computer Organization
More informationNumber Representations
Number Representations times XVII LIX CLXX -XVII D(CCL)LL DCCC LLLL X-X X-VII = DCCC CC III = MIII X-VII = VIIIII-VII = III 1/25/02 Memory Organization Viewed as a large, single-dimension array, with an
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationDEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level
More informationMemory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358
Memory Management Reading: Silberschatz chapter 9 Reading: Stallings chapter 7 1 Outline Background Issues in Memory Management Logical Vs Physical address, MMU Dynamic Loading Memory Partitioning Placement
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationIntel Math Kernel Library 10.3
Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)
More informationCHAPTER 5 A Closer Look at Instruction Set Architectures
CHAPTER 5 A Closer Look at Instruction Set Architectures 5.1 Introduction 199 5.2 Instruction Formats 199 5.2.1 Design Decisions for Instruction Sets 200 5.2.2 Little versus Big Endian 201 5.2.3 Internal
More informationIntel s MMX. Why MMX?
Intel s MMX Dr. Richard Enbody CSE 820 Why MMX? Make the Common Case Fast Multimedia and Communication consume significant computing resources. Providing specific hardware support makes sense. 1 Goals
More informationHP PA-8000 RISC CPU. A High Performance Out-of-Order Processor
The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA
More informationRISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.
COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped
More information