Introduction. No Optimization. Basic Optimizations. Normal Optimizations. Advanced Optimizations. Inter-Procedural Optimizations

Introduction

Optimization options control compile-time optimizations that generate an application whose code executes more quickly. Absoft Fortran 90/95 is an advanced optimizing compiler, and its various optimizers can be turned on to discover different opportunities to optimize Fortran code. Choosing optimizations involves a trade-off: the application will execute much faster after compilation, but compilation itself will be slower. Some of the optimizations described below will benefit almost any Fortran code, while others should be applied only in specific situations.

No Optimization

The -O0 option is the default optimization level and provides the fastest compilation speed. It disables all optimizations and is useful for debugging. The -g option is the common debugging flag used with this level of optimization.

Basic Optimizations

The -O1 option will cause most code to run faster and enables optimizations that apply only within straight-line code sequences. The optimizations include, but are not limited to, common subexpression elimination, constant propagation, dead code elimination, and instruction scheduling. This option is compatible with debugging options.

Normal Optimizations

The -O2 option enables normal optimizers that can make most code run faster. It can substantially rearrange the code generated for a program. The optimizations include, but are not limited to, strength reduction, partial redundancy elimination, innermost loop unrolling, control flow optimizations, and advanced instruction scheduling. This option is not generally usable with debugging options.

Advanced Optimizations

The -O3 option enables advanced optimizers that can significantly rearrange and modify the code generated for a program. It provides all optimizations applied with -O1 and -O2. Additional optimizations include automatic loop vectorization, loop permutation (loop reordering), loop tiling (improved cache performance), loop skewing, loop reversal, loop fusion and fission, unimodular transformations, forward substitution, and expression simplification. This option is not usable with debugging options.

Inter-Procedural Optimizations

The -O4 or -Ofast option enables advanced optimizers that can significantly rearrange and modify the code generated for a program. The optimizations include all optimizations enabled by -O3, and additionally turn on inter-procedural analysis with the inlining optimizers.
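
To make the levels concrete, here is a minimal sketch of a kernel with hypothetical compile lines for each level (the af90 driver name and the file name are assumptions; substitute the driver supplied with your installation):

    ! Hypothetical compile lines (the af90 driver name is an assumption):
    !   af90 -O0 -g saxpy.f90    debuggable build, no optimization
    !   af90 -O1 saxpy.f90       straight-line optimizations only
    !   af90 -O2 saxpy.f90       adds strength reduction, unrolling, scheduling
    !   af90 -O3 saxpy.f90       adds loop vectorization and loop transformations
    !   af90 -O4 saxpy.f90       adds inter-procedural analysis and inlining
    program saxpy
      implicit none
      integer :: i
      real :: x(10000), y(10000), a
      x = 1.0; y = 2.0; a = 3.0
      do i = 1, 10000              ! a loop the -O3 vectorizer can target
         y(i) = a*x(i) + y(i)
      end do
      print *, y(1)
    end program saxpy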

Auto-Parallelization

The -apo option enables automatic parallelization of loops in your source program to take advantage of multiple cores or processors.

SSE Optimizations

These optimizations leverage the Streaming SIMD Extension (SSE) instruction set of the CPU. These instructions were first introduced with the Intel Pentium III as part of the SIMD (single instruction, multiple data) processor supplementary instruction sets.

SSE2 instructions

The -msse2 and -mno-sse2 options enable and disable, respectively, the use of SSE2 instructions for floating-point operations. The -msse2 option is automatically enabled on processors that support SSE2; it may be disabled with the -mno-sse2 option.

SSE3 instructions

The -msse3 option enables the use of SSE3 instructions for floating-point operations. This option is automatically turned on when the -march=host option is specified and the host supports SSE3 instructions.

SSE4a instructions

The -msse4a option enables the use of SSE4a instructions. This option is automatically turned on when the -march=host option is specified and the host supports SSE4a instructions.

SSE4.1 instructions

The -msse4.1 option enables the use of SSE4.1 instructions. This option is automatically turned on when the -march=host option is specified and the host supports SSE4.1 instructions.
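
A minimal sketch of the kind of code these flags affect (the af90 driver name is an assumption): a single-precision reduction loop that packed SSE arithmetic can accelerate.

    ! Hypothetical compile line: af90 -O3 -march=host sdot.f90
    ! -march=host turns on -msse3/-msse4a/-msse4.1 as the host supports them;
    ! -mno-sse2 would fall back to x87 floating point instead.
    program sdot
      implicit none
      integer :: i
      real :: a(4096), b(4096), s
      a = 0.5; b = 2.0; s = 0.0
      do i = 1, 4096               ! candidate for packed SSE multiply/add
         s = s + a(i)*b(i)
      end do
      print *, s
    end program sdot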

Math Optimization Level

The -speed_math=n option enables aggressive math optimizations that may improve performance at the expense of accuracy. Valid arguments for n are 0-11. The following table describes the effect of each level:

n    effect
0    enable wrap-around optimization
1    allow relational operator folding; may cause signed integer overflow
2    enable partial redundancy elimination for loads and stores
3    enable memory optimization for functions without aliased arrays
4    inline NINT and related intrinsics with a limited-domain algorithm; use fast_powf in libm instead of powf
5    use multiplication and square root for exp() where faster
6    allow optimizations that reassociate floating-point operators
7    see notes below
8    allow use of the reciprocal instruction; convert a/b to a*(1/b)
9    use fast algorithms with limited domains for complex norm and divide; use x*rsqrt(x) for sqrt(x) on machines where it is faster; perform dead casgn function elimination
10   use the AMD ACML library if applicable
11   allow relational operator folding (may cause unsigned integer overflow); use IEEE rounding instead of Fortran rounding for the NINT and ANINT intrinsics

NOTES:
A. Departure from strict rounding is applied at three levels: level 1 at n=5, level 2 at n=7, and level 3 at n=10.
B. Conformance to IEEE-754 arithmetic rules is relaxed at two levels: level 1 at n=6, level 2 at n=10.
C. At n=10, the loop unrolling constraints are modified: loop size is increased to 7000, the unroll limit is increased to 9, and the minimum iteration count is decreased to 200.

Enable OpenMP Directives

The -openmp option enables the recognition of OpenMP directives. OpenMP directives usually begin in column one in the form:

C$OMP   for fixed source format
!$OMP   for free source format

Please refer to the OpenMP Specification for details.
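
As a minimal sketch of the free source format sentinel in context (array names and bounds are illustrative), compile with -openmp so the directive is honored:

    program omp_demo
      implicit none
      integer :: i, n
      real :: a(1000), b(1000), c(1000)
      n = 1000; b = 1.0; c = 2.0
    !$OMP PARALLEL DO PRIVATE(i)
      do i = 1, n
         a(i) = b(i) + c(i)
      end do
    !$OMP END PARALLEL DO
      print *, a(1)
    end program omp_demo

In fixed source format the same directive would be written with the C$OMP sentinel beginning in column one. Without -openmp the sentinel lines are treated as ordinary comments.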

OpenMP Optimization Level

The -speed_openmp=n option enables progressively more aggressive OpenMP optimizations based on the value of n, as follows:

n    effect
0    allow code optimization and movement through OpenMP barriers
1    enable the loose memory equivalence algorithm during optimization
2    enable MU generation in SSA generation for OpenMP pragmas
3    enable CHI generation in SSA generation for OpenMP pragmas
4    allow loop unrolling for loops with an OpenMP chunksize directive
5    use a risky but faster algorithm to handle thread-private common blocks

Each level includes all previous optimizations (e.g., 3 includes 0, 1, and 2).

Safe Floating-Point

The -safefp option disables optimizations that may produce inaccurate or invalid floating-point results in numerically sensitive codes. The effect of this option is to preserve the FPU control word, enable NaN checks, disable CABS inlining, and disable floating-point register variables.

Optimization Diagnostics

These options do not perform optimizations; rather, they provide diagnostic information on which optimizations were performed and, equally importantly, on which optimizations were not performed and why not. Example command lines combining these diagnostics appear at the end of this section.

Report Parallelization Results

The -LNO:verbose=on option displays the results of the -apo option. It reports which loops were parallelized, which were not, and why not.

Report Vectorization Results

The -LNO:simd_verbose=on option displays the results of the loop vectorization that occurs at optimization level -O3 and above. It reports which loops were vectorized, which were not, and why not.

Loop Nest Optimizations

These options control the loop nest optimizer in the compiler to deliver the best performance for a specific code. The default level is designed to achieve a performance improvement in most situations; the aggressive levels may achieve additional speedups, but they may also degrade performance. A set of options for individual optimizations is therefore provided so that professional tuners can pursue the best possible performance. The loop nest optimization options are prefixed with -LNO:.
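
A hedged illustration of how the diagnostic options and -LNO: options combine on a command line (the af90 driver name and the solver.f90 file name are assumptions; the flags are those described above and below):

    af90 -apo -LNO:verbose=on solver.f90
    af90 -O3 -LNO:simd_verbose=on -LNO:simd=2 solver.f90

The first line reports the parallelization decisions made under -apo; the second requests aggressive vectorization (described below) and reports which loops were vectorized.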

Loop Vectorization (SIMD Vectorization)

The -LNO:simd=n option controls loop vectorization, which uses the Streaming SIMD Extension (SSE) instruction sets provided by certain processors to work on multiple pieces of data at once. n=0 disables loop vectorization; n=1 is the default vectorization; n=2 enables aggressive vectorization. When -O3 is used, -LNO:simd=1 is implied.

Loop Tiling/Blocking

The -CG:blocking=[on|off] option enables/disables the loop tiling optimization, which transforms the loop iteration space to significantly improve cache locality.

Loop Fusion and Fission

The -LNO:fusion=n option controls loop fusion, which merges two small consecutive loops into one larger loop to improve the utilization of CPU resources (an illustrative sketch appears at the end of this section). n=0 disables fusion; n=1 is the default level; n=2 is the aggressive level. When -O3 is used, -LNO:fusion=1 is implied.

The -LNO:fission=n option controls loop fission, which splits one large loop into two smaller loops to reduce excessive register usage. n=0 disables fission; n=1 is the default level; n=2 is the aggressive level. When -O3 is used, -LNO:fission=2 is implied.

Note: When both fusion and fission are performed on the same loop, fusion has a higher priority than fission. The -LNO:fusion_max_size=n option, where n=0-99999, can be used to prevent fusion of loops whose size is greater than n.

Unroll-and-Jam

The -LNO:fu=n option, where n=0-100, controls unroll-and-jam of the innermost loop, which completely unrolls the innermost loop body into straight-line code when the iteration count is no more than the limit n. The enlarged loop body may improve the utilization of CPU resources; however, if the loop body becomes too large, performance may degrade due to register pressure.

The -LNO:full_unroll_size=n option, where n=0-10000, limits unroll-and-jam of the innermost loop: unrolling is disabled if the size of the innermost loop body is greater than the limit n.

The -LNO:ou=n option, where n=1-32, sets the number of iterations by which the outer loop is unrolled, up to the limit n.

Prefetch

The -LNO:prefetch=n option controls how the compiler generates prefetch instructions for memory loads and stores to improve cache locality. n=0 disables prefetch generation; n=1 is the default level; n=2 is the aggressive level.
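
As promised above, a minimal sketch of a loop fusion candidate (array names and bounds are illustrative): two small consecutive loops over the same range that -LNO:fusion may merge into one.

    program fusion_demo
      implicit none
      integer :: i, n
      real :: a(1000), b(1000), c(1000), d(1000)
      n = 1000; b = 1.0; c = 2.0
      do i = 1, n                  ! these two adjacent loops ...
         a(i) = b(i) + c(i)
      end do
      do i = 1, n                  ! ... may be fused into one loop body,
         d(i) = a(i) * 2.0         ! improving locality and CPU utilization
      end do
      print *, d(n)
    end program fusion_demo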

Code Generation Optimizations

The code generator contains its own optimizer, and various options are provided to fine-tune its optimizations. These options are prefixed with -CG:.

Global Code Motion

The -CG:gcm=[on|off] option enables/disables the instruction-level global code motion optimization, which is designed to improve scheduling efficiency.

Instruction-Level Loop Optimization

The -CG:loop_opt=[on|off] option enables/disables the instruction-level loop optimization, especially loop unrolling. Instruction-level loop unrolling may make some OpenMP programs fail due to changes in iteration counts.

Peephole Optimizations

The -CG:peephole_optimize=[on|off] option enables/disables the instruction-level peephole optimizations, which comprise many small optimizations that improve the efficiency of the instructions produced by the code generator.

Control Flow Optimization

The -CG:cflow=[on|off] option enables/disables the instruction-level control flow optimization, which rearranges the order of basic blocks to improve performance on critical paths.

Hyperblock Scheduling

The -CG:hb_sched=[on|off] option enables/disables hyperblock scheduling, an advanced scheduling technique that schedules code based on a hyperblock instead of a superblock or a basic block.

Inter-Procedural Analysis and Optimization

Inter-procedural analysis collects information from every piece of a program and analyzes the relationships among them in order to optimize for better performance. The options that control inter-procedural analysis are prefixed with -IPA:.

Inlining

The -IPA:inline=[on|off] option enables/disables the inliner. This optimization improves performance by removing calls to small functions and incorporating their bodies inline at the point of reference (see the sketch at the end of this section). The -IPA:plimit=n option, where n=0-INT_MAX, limits the inliner to a specific number of calls and basic blocks inside the function under consideration.

Cloning

The -IPA:multi_clone=n option, where n=0-INT_MAX, sets the maximum number of clones per call graph node to the upper bound n. When n is 0, cloning is disabled.
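
To illustrate the kind of call the inliner targets (a minimal sketch; the module, function, and array names are illustrative), a small function invoked inside a hot loop:

    module demo_mod
    contains
      real function scale3(x)      ! small body: a prime inlining candidate
        real, intent(in) :: x
        scale3 = 3.0*x
      end function scale3
    end module demo_mod

    program inline_demo
      use demo_mod
      implicit none
      integer :: i
      real :: a(1000)
      a = 1.0
      do i = 1, 1000
         a(i) = scale3(a(i))       ! with -IPA:inline=on this call may be
      end do                       ! replaced by the body of scale3
      print *, a(1)
    end program inline_demo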

Inter-Procedural Structure Optimization

The -IPA:optimize_struct=[0|1] option controls the inter-procedural structure optimization: 0 disables it and 1 enables it. This optimization splits structure types to improve performance.

Inter-Procedural Constant Propagation

The -IPA:cprop=[on|off] option enables/disables inter-procedural constant propagation.

AMD ACML Library

The -OPT:fast_math=[on|off] option enables/disables the use of the AMD ACML library, a free math library released by AMD to improve the speed of mathematical computation on AMD processors. When the option is on, the compiler converts certain math functions to their ACML equivalents to improve performance. The library must be specified to the linker when this option is on.
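
A hedged sketch of code that can benefit (the af90 driver name and the -lacml link syntax are assumptions; consult the Absoft documentation for the exact way to specify the ACML library to the linker):

    ! Hypothetical build line: af90 -O3 -OPT:fast_math=on fmath.f90 -lacml
    ! With fast_math on, the EXP and SQRT calls below may be routed to ACML.
    program fmath
      implicit none
      integer :: i
      real(8) :: x(100000)
      do i = 1, 100000
         x(i) = exp(sqrt(dble(i)))  ! math intrinsics eligible for conversion
      end do
      print *, sum(x)
    end program fmath

Copyright 2010 Absoft Corporation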