Systems I. Code Optimization I: Machine Independent Optimizations

Similar documents
Code Optimization September 27, 2001

Lecture 11 Code Optimization I: Machine Independent Optimizations. Optimizing Compilers. Limitations of Optimizing Compilers

Great Reality #4. Code Optimization September 27, Optimizing Compilers. Limitations of Optimizing Compilers

! Must optimize at multiple levels: ! How programs are compiled and executed , F 02

CISC 360. Code Optimization I: Machine Independent Optimizations Oct. 3, 2006

CISC 360. Code Optimization I: Machine Independent Optimizations Oct. 3, 2006

Code Optimization I: Machine Independent Optimizations Oct. 5, 2009

CS 105. Code Optimization and Performance

Program Optimization. Jo, Heeseung

Code Optimization April 6, 2000

Great Reality # The course that gives CMU its Zip! Code Optimization I: Machine Independent Optimizations Feb 11, 2003

CS429: Computer Organization and Architecture

Code Optimization & Performance. CS528 Serial Code Optimization. Great Reality There s more to performance than asymptotic complexity

Cache Memories. Lecture, Oct. 30, Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition

CS 33. Architecture and Optimization (1) CS33 Intro to Computer Systems XV 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

Computer Organization and Architecture

CSC 252: Computer Organization Spring 2018: Lecture 14

Program Optimization. Today. DAC 2001 Tutorial. Overview Generally Useful Optimizations. Optimization Blockers

Many of the following slides are taken with permission from. The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506.

Performance Optimizations

Program Optimization

These Slides. Program Optimization. Optimizing Compilers. Performance Realities. Limitations of Optimizing Compilers. Generally Useful Optimizations

Today Overview Generally Useful Optimizations Program Optimization Optimization Blockers Your instructor: Based on slides originally by:

Today Overview Generally Useful Optimizations Program Optimization Optimization Blockers Your instructor: Based on slides originally by:

CS 201. Code Optimization, Part 2

How to Write Fast Numerical Code

Code Optimization. What is code optimization?

Code Optimization II: Machine Dependent Optimizations Oct. 1, 2002

Code Optimization II: Machine Dependent Optimizations

Introduction to Computer Systems , fall th Lecture, Sep. 14 th

The Hardware/Software Interface CSE351 Spring 2013

CS241 Computer Organization Spring Loops & Arrays

Machine-Level Prog. IV - Structured Data

CS 33. Architecture and Optimization (2) CS33 Intro to Computer Systems XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

Arrays. Young W. Lim Wed. Young W. Lim Arrays Wed 1 / 19

4) C = 96 * B 5) 1 and 3 only 6) 2 and 4 only

Arrays. Young W. Lim Mon. Young W. Lim Arrays Mon 1 / 17

Lecture 12 Code Optimization II: Machine Dependent Optimizations. Previous Best Combining Code. Machine Independent Opt. Results

Assembly IV: Complex Data Types. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Referencing Examples

Systems I. Machine-Level Programming VII: Structured Data

Machine-Level Programming III: Procedures

CS213. Machine-Level Programming III: Procedures

Previous Best Combining Code The course that gives CMU its Zip!

Compiler Construction D7011E

CS241 Computer Organization Spring 2015 IA

Machine-level Programming (3)

The course that gives CMU its Zip! Machine-Level Programming III: Procedures Sept. 17, 2002

Task. ! Compute sum of all elements in vector. ! Vector represented by C-style abstract data type. ! Achieved CPE of " Cycles per element

Question 4.2 2: (Solution, p 5) Suppose that the HYMN CPU begins with the following in memory. addr data (translation) LOAD 11110

Previous Best Combining Code The course that gives CMU its Zip! Code Optimization II: Machine Dependent Optimizations Oct.

administrivia final hour exam next Wednesday covers assembly language like hw and worksheets

Optimization part 1 1

Assembly III: Procedures. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Changelog. Performance. locality exercise (1) a transformation

Foundations of Computer Systems

Changelog. Changes made in this version not seen in first lecture: 26 October 2017: slide 28: remove extraneous text from code

Principles. Performance Tuning. Examples. Amdahl s Law: Only Bottlenecks Matter. Original Enhanced = Speedup. Original Enhanced.

Machine-Level Programming IV: Structured Data

Introduction to Computer Systems: Semester 1 Computer Architecture

CS 33. Architecture and Optimization (2) CS33 Intro to Computer Systems XV 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

Corrections made in this version not in first posting:

Assembly III: Procedures. Jo, Heeseung

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties

ASSEMBLY III: PROCEDURES. Jo, Heeseung

Turning C into Object Code Code in files p1.c p2.c Compile with command: gcc -O p1.c p2.c -o p Use optimizations (-O) Put resulting binary in file p

Computer Systems C S Cynthia Lee

CS61 Section Solutions 3

Optimizing Program Performance

Computer Organization - Overview

CS , Fall 2001 Exam 1

Instruction Set Architecture

Instruction Set Architecture

Homework 0: Given: k-bit exponent, n-bit fraction Find: Exponent E, Significand M, Fraction f, Value V, Bit representation

ASSEMBLY II: CONTROL FLOW. Jo, Heeseung

Fundamentals of Programming Session 13

What is a Compiler? Compiler Construction SMD163. Why Translation is Needed: Know your Target: Lecture 8: Introduction to code generation

Data Structures in Memory!

Instruction Set Architecture

CISC 360 Instruction Set Architecture

CMSC 313 Fall2009 Midterm Exam 2 Section 01 Nov 11, 2009

7. Optimization! Prof. O. Nierstrasz! Lecture notes by Marcus Denker!

Assembly II: Control Flow. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

USC 227 Office hours: 3-4 Monday and Wednesday CS553 Lecture 1 Introduction 4

Y86 Processor State. Instruction Example. Encoding Registers. Lecture 7A. Computer Architecture I Instruction Set Architecture Assembly Language View

EECS 213 Fall 2007 Midterm Exam

Chapter 3 Machine-Level Programming II Control Flow

CS241 Computer Organization Spring Addresses & Pointers

Assembly Language: IA-32 Instructions

Basic Data Types The course that gives CMU its Zip! Machine-Level Programming IV: Data Feb. 5, 2008

Cache Memories. Topics. Next time. Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance

CAS CS Computer Systems Spring 2015 Solutions to Problem Set #2 (Intel Instructions) Due: Friday, March 20, 1:00 pm

Introduction to Computer Systems

Introduction to Computer Systems

1 /* file cpuid2.s */ 4.asciz "The processor Vendor ID is %s \n" 5.section.bss. 6.lcomm buffer, section.text. 8.globl _start.

Integral. ! Stored & operated on in general registers. ! Signed vs. unsigned depends on instructions used

Basic Data Types The course that gives CMU its Zip! Machine-Level Programming IV: Structured Data Sept. 18, 2003.

Data Representa/ons: IA32 + x86-64

The course that gives CMU its Zip! Memory System Performance. March 22, 2001

Credits to Randy Bryant & Dave O Hallaron

Transcription:

Systems I Code Optimization I: Machine Independent Optimizations Topics Machine-Independent Optimizations Code motion Reduction in strength Common subexpression sharing Tuning Identifying performance bottlenecks

Great Reality Thereʼs s more to performance than asymptotic complexity Constant factors matter too! Easily see 10:1 performance range depending on how code is written Must optimize at multiple levels: algorithm, data representations, procedures, and loops Must understand system to optimize performance How programs are compiled and executed How to measure program performance and identify bottlenecks How to improve performance without destroying code modularity and generality 2

Optimizing Compilers Provide efficient mapping of program to machine register allocation code selection and ordering eliminating minor inefficiencies Donʼt t (usually) improve asymptotic efficiency up to programmer to select best overall algorithm big-o savings are (often) more important than constant factors but constant factors also matter Have difficulty overcoming optimization blockers potential memory aliasing potential procedure side-effects 3

Limitations of Optimizing Compilers Operate Under Fundamental Constraint Must not cause any change in program behavior under any possible condition Often prevents it from making optimizations when would only affect behavior under pathological conditions. Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles e.g., data ranges may be more limited than variable types suggest Most analysis is performed only within procedures whole-program analysis is too expensive in most cases Most analysis is based only on static information compiler has difficulty anticipating run-time inputs When in doubt, the compiler must be conservative 4

Machine-Independent Optimizations Optimizations you should do regardless of processor / compiler Code Motion Reduce frequency with which computation performed If it will always produce same result Especially moving code out of loop for (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; for (i = 0; i < n; i++) { int ni = n*i; for (j = 0; j < n; j++) a[ni + j] = b[j]; } 5

Compiler-Generated Code Motion Most compilers do a good job with array code + simple loop structures Code Generated by GCC for (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; for (i = 0; i < n; i++) { int ni = n*i; int *p = a+ni; for (j = 0; j < n; j++) *p++ = b[j]; } imull %ebx,%eax # i*n movl 8(%ebp),%edi # a leal (%edi,%eax,4),%edx # p = a+i*n (scaled by 4) # Inner Loop movl 12(%ebp),%edi # b.l40: movl (%edi,%ecx,4),%eax # b+j (scaled by 4) movl %eax,(%edx) # *p = b[j] addl $4,%edx # p++ (scaled by 4) incl %ecx # j++ cmpl %ebx,%ecx # loop if j<n jl.l40 6

Reduction in Strength Replace costly operation with simpler one Shift, add instead of multiply or divide 16*x --> x << 4 Utility machine dependent Depends on cost of multiply or divide instruction On Pentium II or III, integer multiply only requires 4 CPU cycles Recognize sequence of products for (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; int ni = 0; for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n; } 7

Make Use of Registers Reading and writing registers much faster than reading/writing memory Limitation Compiler not always able to determine whether variable can be held in register Possibility of Aliasing See example later 8

Machine-Independent Opts. (Cont.) Share Common Subexpressions Reuse portions of expressions Compilers often not very sophisticated in exploiting arithmetic properties /* Sum neighbors of i,j */ up = val[(i-1)*n + j]; down = val[(i+1)*n + j]; left = val[i*n + j-1]; right = val[i*n + j+1]; sum = up + down + left + right; int inj = i*n + j; up = val[inj - n]; down = val[inj + n]; left = val[inj - 1]; right = val[inj + 1]; sum = up + down + left + right; 3 multiplications: i*n, (i 1)*n, (i+1)*n 1 multiplication: i*n leal -1(%edx),%ecx # i-1 imull %ebx,%ecx # (i-1)*n leal 1(%edx),%eax # i+1 imull %ebx,%eax # (i+1)*n imull %ebx,%edx # i*n 9

Time Scales Absolute Time Typically use nanoseconds 10 9 seconds Time scale of computer instructions Clock Cycles Most computers controlled by high frequency clock signal Typical Range 100 MHz» 10 8 cycles per second» Clock period = 10ns 2 GHz» 2 X 10 9 cycles per second» Clock period = 0.5ns 10

Example of Performance Measurement Loop unrolling Assume even number of elements void vsum1(int n) { int i; for(i=0; i<n; i++) c[i] = a[i] + b[i]; } void vsum2(int n) { int i; for(i=0; i<n; i+=2) { c[i] = a[i] + b[i]; c[i+1] = a[i+1] + b[i+1]; } 11

Cycles Per Element Convenient way to express performance of program that operators on vectors or lists Length = n T = CPE*n + Overhead 1000 900 Cycles 800 700 600 500 400 300 200 100 vsum1 Slope = 4.0 vsum2 Slope = 3.5 0 0 50 100 150 200 Elements 12

Code Motion Example Procedure to Convert String to Lower Case void lower(char *s) { int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } 13

Lower Case Conversion Performance Time quadruples when string length doubles Quadratic performance lower1 CPU Seconds 1000 100 10 1 0.1 0.01 0.001 0.0001 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 String Length 14

Convert Loop To Goto Form void lower(char *s) { int i = 0; if (i >= strlen(s)) goto done; loop: if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); i++; if (i < strlen(s)) goto loop; done: } strlen executed every iteration strlen linear in length of string Must scan string until finds '\0' Overall performance is quadratic 15

Improving Performance void lower(char *s) { int i; int len = strlen(s); for (i = 0; i < len; i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } Move call to strlen outside of loop Since result does not change from one iteration to another Form of code motion 16

Lower Case Conversion Performance Time doubles when double string length Linear performance CPU Seconds 1000 100 10 1 0.1 0.01 0.001 0.0001 0.00001 0.000001 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 String Length lower1 lower2 17

Optimization Blocker: Procedure Calls Why couldnʼt t the compiler move strlen out of the inner loop? Procedure may have side effects Alters global state each time called Function may not return same value for given arguments Depends on other parts of global state Procedure lower could interact with strlen Why doesnʼt t compiler look at code for strlen? Linker may overload with different version Unless declared static Interprocedural optimization is not used extensively due to cost Warning: Compiler treats procedure call as a black box Weak optimizations in and around them 18

Summary Today Improving program performance (machine independent) Mostly focusing on instruction count Next time Optimization blocker: procedure calls Optimization blocker: memory aliasing Tools (profiling) for understanding performance 19