Optimising with the IBM compilers


Overview

- Introduction
- Optimisation techniques
  - compiler flags
  - compiler hints
  - code modifications
- Optimisation topics
  - locals and globals
  - conditionals
  - data types
  - CSE
  - divides and square roots
  - register use and spilling
  - loop unrolling/pipelining

Introduction

- Unless we write assembly code, we are always using a compiler.
- Modern compilers are (quite) good at optimisation
  - memory optimisations are an exception
- Usually much better to get the compiler to do the optimisation:
  - avoids machine-specific coding
  - compilers break codes much less often than humans do
- Even modifying code can be thought of as helping the compiler.

Compiler flags

- A typical compiler has hundreds of flags/options.
  - most are never used
  - many are not related to optimisation
- Most compilers have flags for different levels of general optimisation: -O1, -O2, -O3, ...
- When first porting code, switch optimisation off.
  - only when you are satisfied that the code works, turn optimisation on and test again
  - but don't forget to use the optimisation flags!
  - also don't forget to turn off debugging, bounds checking and profiling flags
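For illustration, assuming the XL Fortran driver xlf90 and a source file prog.f90 (both names hypothetical), a porting build and a tuned build might look like:

    xlf90 -O0 -g -C -o prog prog.f90    # porting: no optimisation, debug info, bounds checking
    xlf90 -O3 -o prog prog.f90          # once correct: optimise, drop the checks, re-test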

Optimisation -On

- -O0 or -qnoopt: no optimisation.
- There is no -O1.
- -O2: low-level optimisation
  - e.g. redundant code elimination, invariant code removal from loops, some loop unrolling and pipelining
  - consider using -qmaxmem=-1 to allow more memory for space-intensive optimisations (this is the default at optimisation levels above 2)
- -O3: more extensive optimisation
  - e.g. some floating-point reordering, deeper loop unrolling
  - consider using -qstrict to avoid numerical differences

Optimisation -On (cont.)

- -O4 includes all the optimisations of -O3, plus:
  - -qhot (for Fortran), -qipa, -qarch=auto, -qtune=auto, -qcache=auto
- -qhot: Higher Order Transformations
  - optimises F90 array syntax statements
  - changes loop nests to improve cache behaviour
  - allows the user to transform loops to vector library calls (MASS)
  - can introduce array padding
  - syntax: -qhot[=[no]vector|arraypad[=n]]

Optimisation -On (cont.)

- -qipa expands the scope of optimisation to an entire program unit, and allows inlining
  - syntax: -qipa[=level=n|inline=name|fine tuning]
- -O5 includes all of the -O4 optimisations, plus:
  - -qipa=level=2
- Target machine options specify a particular processor and memory system:
  - -qarch: restricts the compiler to a particular instruction set
  - -qtune: optimises for a particular machine
  - -qcache: defines a particular cache/memory geometry
- We are compiling on Power4s, so the defaults should be fine.

Inlining

- Enabled with the -qipa option
  - can also use -Q for routines in the same compilation unit
- Not always beneficial: may help to be selective
  - can list the functions to be inlined
  - also makes profiling harder...
- Very important for OO code
  - OO design encourages methods with very small bodies
  - the inline keyword in C++ can be used as a hint
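As a minimal C sketch (names hypothetical), this is the kind of routine worth inlining: the call overhead would exceed the body itself.

    /* tiny accessor called from inner loops: a good inlining candidate;
       C99's 'static inline' plays the same hint role as C++'s inline */
    static inline double elem(const double *a, int i, int j, int n)
    {
        return a[i*n + j];
    }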

Aliasing

- The compiler has to make conservative assumptions about aliasing
  - storage-associated data in Fortran
  - pointers in C (and Fortran)
- You can tell the compiler that aliasing is not present in a program unit with the -qalias flag.
  - various options for different types and levels of aliasing
  - see the man pages and online documentation
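A minimal C sketch of the problem (names hypothetical): the compiler must assume s might point into a, so *s is reloaded on every iteration unless the compiler is told otherwise, or the programmer hoists the load by hand.

    void scale(double *a, double *s, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] *= *s;   /* conservative code reloads *s each time */
    }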

Compiler flags (cont.)

- Note that the highest levels of optimisation may cause your code to break.
  - might be a compiler bug, but more likely your code is not strictly standard-conforming (e.g. passing the same argument twice)
  - may also make your code go slower: -O5 is not always best
- Isolate the routines and flags which cause the problem.
  - binary chop
  - try high levels on key routines only
  - one routine per file may help

Compiler hints

- A mechanism for giving additional information to the compiler, e.g.:
  - values of variables (e.g. loop trip counts)
  - independence of loop iterations
  - independence of index array elements
  - aliasing properties
- Appear as comments (Fortran) or preprocessor pragmas (C)
  - don't affect portability
- IBM hints are only taken into account if -qhot or -qsmp (autoparallelisation) is used.
  - not much use for sequential C/C++ code

ASSERT directive

- Gives the compiler extra information about a loop:
  - trip count
  - independence of iterations

    !IBM* ASSERT (NODEPS,ITERCNT(50))
    DO I=1,N
       A(I) = A(I) + FUNC(I)
    END DO

CNCALL directive

- Tells the compiler that no procedure called in the loop has a loop-carried dependency
  - weaker than ASSERT(NODEPS), as the loop itself may still have dependencies

    !IBM* CNCALL
    DO I=2,N
       A(I) = A(I) + A(I-1) + FUNC(I)
    END DO

PERMUTATION directive

- Tells the compiler that an array contains no repeated values in the scope of a loop

    !IBM* PERMUTATION (IDX)
    DO I=1,N
       A(IDX(I)) = A(IDX(I)) + B(I)
    END DO

Code modification

- When flags and hints don't solve the problem, we have to resort to code modification.
- Be aware that this may:
  - introduce bugs
  - make the code harder to read/maintain
  - only be effective on certain architectures
- Try to think about:
  - what optimisation the compiler is failing to do
  - how rewriting can help

How can we work out what the compiler has done?

- eyeball the assembly code
- use diagnostic flags
- It is increasingly difficult to work out what actually occurred in the processor:
  - superscalar, out-of-order, speculative execution
- Can estimate expected performance:
  - count flops and loads/stores, estimate cache misses
  - compare actual performance with expectations

- To dump assembly code, compile with the -S option
  - produces a .s file
- To produce a loop optimisation report (from -qhot), compile with -qreport
  - produces a .lst file
  - contains transformed source code, showing unrolled loops, address computations, and array syntax expanded into loops (not very pretty!)
  - includes notes on loop transformations (e.g. fusion, unrolling)
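For example (again assuming the hypothetical xlf90 driver and prog.f90 file from above):

    xlf90 -O3 -S prog.f90               # writes prog.s (assembly)
    xlf90 -O3 -qhot -qreport prog.f90   # writes prog.lst (loop optimisation report)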

Locals and globals

- Compiler analysis is more effective with local variables
- It has to make worst-case assumptions about global variables:
  - globals could be modified by any called procedure (or by another thread)
- Use local variables where possible
  - automatic variables are stack allocated: allocation is essentially free
- In C, use file-scope globals in preference to externals
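A minimal C sketch (names hypothetical): copying the global into a local makes it explicit that the opaque call cannot change the value used in the loop.

    extern double scale_factor;       /* extern global: any call might modify it */
    extern void log_progress(int i);  /* opaque routine the compiler cannot see into */

    double sum_scaled(const double *a, int n)
    {
        double f = scale_factor;      /* local copy can live in a register */
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            s += f * a[i];            /* without the copy, scale_factor may be
                                         reloaded after every log_progress call */
            log_progress(i);
        }
        return s;
    }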

Conditionals

- Even with sophisticated branch prediction hardware, branches are bad for performance.
- If you can't eliminate them, at least try to get them out of the critical loops.
- Simple example: the test on n is loop invariant, so it can be hoisted out of the loop:

    do i=1,k
       if (n .eq. 0) then
          a(i) = b(i) + c
       else
          a(i) = 0.
       endif
    end do

becomes

    if (n .eq. 0) then
       do i=1,k
          a(i) = b(i) + c
       end do
    else
       do i=1,k
          a(i) = 0.
       end do
    endif

A little harder for the compiler: here the test depends on i, but the loop can be split at j:

    do i=1,k
       if (i .le. j) then
          a(i) = b(i) + c
       else
          a(i) = 0.
       endif
    end do

becomes

    do i=1,j
       a(i) = b(i) + c
    end do
    do i=j+1,k
       a(i) = 0.
    end do

Data types

- Performance can be affected by the choice of data types
  - complicated by trade-offs with memory usage and cache hit rates
  - avoid INTEGER*1, INTEGER*2 and INTEGER*8 (unless compiling with -q64)
- Avoid unnecessary type conversions
  - especially integer to floating point and vice versa
  - N.B. some type conversions are implicit
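A minimal C sketch (hypothetical): multiplying by the loop index implicitly converts the integer i to a double on every iteration; carrying a floating-point counter removes the conversion.

    void weight(double *a, int n)
    {
        double x = 0.0;
        for (int i = 0; i < n; i++) {
            a[i] *= x;    /* was: a[i] *= i; (implicit int-to-double each time) */
            x += 1.0;     /* floating-point counter instead */
        }
    }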

CSE

- Compilers are generally good at Common Subexpression Elimination.
- A couple of cases where they might have trouble:
  - different order of operands:

        d = a + c
        e = a + b + c

  - function calls:

        d = a + func(c)
        e = b + func(c)
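A minimal C sketch of doing the CSE by hand (all names hypothetical). Note that reassociating a + b + c as (a + c) + b can change floating-point rounding, and hoisting func(c) is only safe if func has no side effects.

    extern double func(double);   /* assumed pure */

    void cse_by_hand(double a, double b, double c, double *d, double *e)
    {
        /* case 1: make the shared subexpression explicit */
        double t = a + c;
        *d = t;
        *e = t + b;

        /* case 2: hoist the repeated function call */
        double fc = func(c);
        *d = a + fc;
        *e = b + fc;
    }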

Divides and Square Roots

- Divides and square roots are expensive:
  - not pipelined
  - a divide takes 30 cycles
  - a sqrt takes 36 cycles
  - can use both FPUs in parallel
- Can often reduce the divide count by storing reciprocals
- Use of look-up tables is a trade-off between computation and memory access time.
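A minimal C sketch of the reciprocal trick: n divides become one divide plus n multiplies. The results can differ in the last bit, which is exactly the kind of reordering -qstrict suppresses.

    void scale_down(double *a, const double *b, double y, int n)
    {
        double r = 1.0 / y;          /* one divide, reciprocal stored */
        for (int i = 0; i < n; i++)
            a[i] = b[i] * r;         /* multiply replaces each divide */
    }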

Register use

- Most compilers make a reasonable job of register allocation.
- They can have problems in some cases:
  - loops with large numbers of temporary variables (such loops may be produced by inlining or unrolling)
  - array elements with complex index expressions
- Can help the compiler by introducing explicit scalar temporaries:

    for (i=0;i<n;i++){
       b[i] += a[c[i]];
       c[i+1] = 2*i;
    }

becomes

    tmp = c[0];
    for (i=0;i<n;i++){
       b[i] += a[tmp];
       tmp = 2*i;
       c[i+1] = tmp;
    }

Spilling

- If the compiler runs out of registers it will generate spill code:
  - store a value and then reload it later on
- Examine your source code and count how many loads/stores are required
- Compare with the assembly code
- May need to distribute loops, as sketched below
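A minimal sketch of loop distribution (hypothetical code, illustrative only; a real spilling case would have many more temporaries): if the single loop keeps too many values live at once, splitting it gives each loop fewer simultaneously live values, at the cost of streaming a and b twice.

    /* before: one loop, all temporaries live together */
    void combined(double *x, double *y,
                  const double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++) {
            x[i] = 2.0*a[i] + 3.0*b[i];
            y[i] = 4.0*a[i] - 5.0*b[i];
        }
    }

    /* after: two loops, lower register pressure */
    void distributed(double *x, double *y,
                     const double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] = 2.0*a[i] + 3.0*b[i];
        for (int i = 0; i < n; i++)
            y[i] = 4.0*a[i] - 5.0*b[i];
    }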

Loop unrolling

- Loop unrolling and software pipelining are two of the most important optimisations for scientific codes on modern RISC processors.
- Compilers are generally good at this.
- If the compiler fails, it is usually better to try to remove the impediment than to unroll by hand:
  - cleaner, more portable, better performance

Loop unrolling (cont.)

- Replace the loop body by multiple copies of the body
- Modify the loop control
  - take care of arbitrary loop bounds
- Example, unrolled by a factor of 4:

    do i=1,n
       a(i)=b(i)+d*c(i)
    end do

becomes

    do i=1,n-3,4
       a(i)=b(i)+d*c(i)
       a(i+1)=b(i+1)+d*c(i+1)
       a(i+2)=b(i+2)+d*c(i+2)
       a(i+3)=b(i+3)+d*c(i+3)
    end do
    do j = i,n
       a(j)=b(j)+d*c(j)
    end do

Software pipelining

The original loop:

    for (i=0;i<n;i++){
       a(i) += b;
    }

Each iteration is a load (L), an add (A) and a store (S):

    for (i=0;i<n;i++){
       t1 = a(i);      //L i
       t2 = b + t1;    //A i
       a(i) = t2;      //S i
    }

The pipelined version overlaps operations from different iterations:

    //prologue
    t1 = a(0);         //L 0
    t2 = b + t1;       //A 0
    t1 = a(1);         //L 1
    for (i=0;i<n-2;i++){
       a(i) = t2;      //S i
       t2 = b + t1;    //A i+1
       t1 = a(i+2);    //L i+2
    }
    //epilogue
    a(n-2) = t2;       //S n-2
    t2 = b + t1;       //A n-1
    a(n-1) = t2;       //S n-1

Impediments to unrolling

- Function calls
  - except in the presence of good interprocedural analysis and inlining
- Conditionals
  - especially control transfer out of the loop
- Pointer/array aliasing

Example

    for (i=0;i<m;i++){
       a[i] += c[i] * a[n];
    }

- The compiler may not know that a[i] and a[n] don't overlap.
- The loop can be unrolled, but with limited benefit, as instructions from different iterations can't be interleaved.
- Rewrite:

    tmp = a[n];
    for (i=0;i<m;i++){
       a[i] += c[i] * tmp;
    }

Outer loop unrolling

- If we have a loop nest, it is possible to unroll one of the outer loops instead of the innermost one. The compiler may do this.
- Original (2 loads for 1 flop):

    do i=1,n
       do j=1,m
          a(j,i)=c*d(j)
       end do
    end do

- Outer loop unrolled by 4 (5 loads for 4 flops):

    do i=1,n,4
       do j=1,m
          a(j,i)=c*d(j)
          a(j,i+1)=c*d(j)
          a(j,i+2)=c*d(j)
          a(j,i+3)=c*d(j)
       end do
    end do

Summary

- Remember the compiler is always there.
- Try to help the compiler, rather than do its job for it.
- Use flags and hints as much as possible
- Minimise code modifications

Documentation

- XL Fortran for AIX: User's Guide
- XL Fortran for AIX: Language Reference
- C for AIX: C/C++ Language Reference
- C for AIX: Compiler Reference
- All available locally from the HPCx User Guide "References and Further Reading" section, or from publib.boulder.ibm.com (click on "IBM Publications Center" and search).