CS 3114

Many of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron, http://csapp.cs.cmu.edu/public/lectures.html

The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506. Many other slides were based on notes written by Dr Godmar Back for CS 3214.

Asymptotic Complexity

Big-O O( ), Θ( )
    Describes asymptotic behavior of the time or space cost function of an algorithm as input size grows
    Subject of complexity analysis (CS 3114)
    Determine if a problem is tractable or not

Example:
    Quicksort: O(n log n) average case
    Bubble Sort: O(n^2) average case

Actual cost may be C1 * N^2 + C2 * N + C3
    These constants can matter and optimization can reduce them
    Determine how big of a problem a tractable algorithm can handle in a concrete implementation on a given machine

Roles of Programmer vs Compiler

Programmer (high-level):
    Choice of algorithm, Big-O
    Manual application of some optimizations
    Choice of program structure that's amenable to optimization
    Avoidance of optimization blockers

Optimizing compiler (low-level):
    Applies transformations that preserve semantics, but reduce the amount of, or time spent in, computations
    Provides efficient mapping of code to machine:
        Selects and orders code
        Performs register allocation
    Usually consists of multiple stages

Controlling Optimization with gcc

-O0 ("O zero")
    This is the default: minimal optimizations
-O1
    Apply optimizations that can be done quickly
-O2
    Apply more expensive optimizations. That's a reasonable default for running production code. Typical ratio between -O2 and -O0 is 5-20.
-O3
    Apply even more expensive optimizations
-Os
    Optimize for code size

See info gcc for a list of which optimizations are enabled when; note that -f switches may enable additional optimizations that are not included in -O

Note: ability to debug code symbolically under gdb decreases with optimization level; usually use -O0 -g or -O1 -g or -ggdb3

Limitations of Optimizing Compilers

Fundamentally, must emit code that implements specified semantics under all conditions
    Can't apply optimizations even if they would only change behavior in a corner case a programmer may not think of
        Due to memory aliasing
        Due to unseen procedure side-effects

Do not see beyond current compilation unit

Intraprocedural analysis typically more extensive (since cheaper) than interprocedural analysis

Usually base decisions on static information
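To illustrate the "unseen procedure side-effects" limitation, here is a minimal sketch. The names counter, f, func1 and func2 are hypothetical and do not appear in the slides. A compiler that cannot see the body of f (for example, because it lives in another compilation unit) must assume each call may change global state, so it cannot rewrite func1 as func2.

```c
/* Hypothetical illustration; names are not from the slides. */
long counter = 0;

/* f has a side effect: it modifies global state on every call. */
long f(void) {
    return counter++;
}

/* Calls f four times: returns 0 + 1 + 2 + 3 = 6 when counter starts at 0. */
long func1(void) {
    return f() + f() + f() + f();
}

/* Calls f once: returns 4 * 0 = 0 when counter starts at 0. */
long func2(void) {
    return 4 * f();
}
```

With counter reset to 0 before each call, func1() yields 6 and func2() yields 0, so replacing one with the other would change program behavior.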

Types of Optimizations

Copy Propagation
Code Motion
Strength Reduction
Common Subexpression Elimination
Eliminating Memory Accesses
    Through use of registers
Inlining

Copy Propagation

    int arith1(int x, int y, int z) {
        int x_plus_y = x + y;
        int x_minus_y = x - y;
        int x_plus_z = x + z;
        int x_minus_z = x - z;
        int y_plus_z = y + z;
        int y_minus_z = y - z;
        int xy_prod = x_plus_y * x_minus_y;
        int xz_prod = x_plus_z * x_minus_z;
        int yz_prod = y_plus_z * y_minus_z;
        return xy_prod + xz_prod + yz_prod;
    }

    int arith2(int x, int y, int z) {
        return (x + y) * (x - y) + (x + z) * (x - z) + (y + z) * (y - z);
    }

Which produces faster code?

Copy Propagation

The compiler produces the same instruction sequence for arith1 and arith2:

    arith1:                         # arith2: identical
        leal    (%rdx,%rdi), %ecx
        movl    %edi, %eax
        subl    %edx, %eax
        imull   %ecx, %eax
        movl    %esi, %ecx
        subl    %edx, %ecx
        addl    %esi, %edx
        imull   %edx, %ecx
        movl    %edi, %edx
        subl    %esi, %edx
        addl    %edi, %esi
        imull   %esi, %edx
        addl    %ecx, %eax
        addl    %edx, %eax
        ret

Constant Propagation

    #include <stdio.h>

    int sum(int a[], int n) {
        int i, s = 0;
        for (i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    int main() {
        int v[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        int s = sum(v, 10);
        printf("Sum is %d\n", s);
    }

In the compiled code for main, the call to sum has been evaluated at compile time and replaced by the constant 55:

    .LC0:
        .string "Sum is %d\n"
    main:
        movl    $55, 4(%esp)
        movl    $.LC0, (%esp)
        call    printf

Code Motion

Do not repeat computations if the result is known
Usually moved out of loops ("code hoisting")

Before:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After:

    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }
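A second, commonly cited case of code motion is a loop whose condition calls a function on every iteration. The sketch below is not from the slides; lower_naive and lower_hoisted are hypothetical names. Because the loop body writes to the string, a compiler often cannot prove that strlen(s) stays constant, so the programmer hoists the call manually.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration; names are not from the slides. */

/* strlen(s) is re-evaluated in the loop condition on every iteration. */
void lower_naive(char *s) {
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
}

/* The length is computed once, before the loop (code motion by hand). */
void lower_hoisted(char *s) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
}
```

Both functions turn "Hello" into "hello"; the hoisted version simply avoids rescanning the string each time around the loop.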

Strength Reduction

Substitute a lower cost operation for a more expensive one
    E.g., replace 48*x with (x << 6) - (x << 4), since 64*x - 16*x = 48*x
    Often machine dependent
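The identity behind this rewrite can be checked with a small sketch. The function names times48_mul and times48_shift are hypothetical, and unsigned arithmetic is an assumption made here so that the shifts are well defined for every input.

```c
/* Hypothetical illustration; names are not from the slides. */

/* Straightforward multiplication by 48. */
unsigned times48_mul(unsigned x) {
    return 48u * x;
}

/* Strength-reduced form: 64*x - 16*x, using two shifts and a subtraction. */
unsigned times48_shift(unsigned x) {
    return (x << 6) - (x << 4);
}
```

For example, both functions return 240 for x = 5. Whether the shifted form is actually faster depends on the target machine, which is why compilers make this choice per architecture.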

Common Subexpression Elimination

Reuse already computed expressions

Before (3 multiplications: i*n, (i-1)*n, (i+1)*n):

    /* Sum neighbors of i,j */
    up = val[(i-1)*n + j];
    down = val[(i+1)*n + j];
    left = val[i*n + j-1];
    right = val[i*n + j+1];
    sum = up + down + left + right;

After (1 multiplication: i*n):

    int inj = i*n + j;
    up = val[inj - n];
    down = val[inj + n];
    left = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;
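As a runnable sketch of the same transformation, the two fragments can be wrapped in functions and compared. The function names and the parameter list (val, n, i, j) are assumptions for illustration; the caller is assumed to pass an interior cell so that all four neighbors exist.

```c
/* Hypothetical wrappers around the slide's fragments. */

/* Original form: three multiplications by n. */
int sum_neighbors_naive(const int *val, int n, int i, int j) {
    int up    = val[(i - 1) * n + j];
    int down  = val[(i + 1) * n + j];
    int left  = val[i * n + j - 1];
    int right = val[i * n + j + 1];
    return up + down + left + right;
}

/* After common subexpression elimination: i*n + j is computed once. */
int sum_neighbors_cse(const int *val, int n, int i, int j) {
    int inj   = i * n + j;
    int up    = val[inj - n];
    int down  = val[inj + n];
    int left  = val[inj - 1];
    int right = val[inj + 1];
    return up + down + left + right;
}
```

On a 3x3 grid holding 1 through 9 in row-major order, both functions return 2 + 8 + 4 + 6 = 20 for the center cell (i = 1, j = 1).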

Eliminating Memory Accesses, Take 1

Registers are faster than memory

    double sp1(double *x, double *y) {
        double sum = *x * *x + *y * *y;
        double diff = *x * *x - *y * *y;
        return sum * diff;
    }

How many memory accesses?

    sp1:
        movsd   (%rdi), %xmm1
        movsd   (%rsi), %xmm2
        mulsd   %xmm1, %xmm1
        mulsd   %xmm2, %xmm2
        movapd  %xmm1, %xmm0
        subsd   %xmm2, %xmm1
        addsd   %xmm2, %xmm0
        mulsd   %xmm1, %xmm0
        ret

The number of memory accesses is not related to how often pointer dereferences occur in the source code

Eliminating Memory Accesses, Take 2

Order of accesses matters

    void sp1(double *x, double *y, double *sum, double *prod) {
        *sum = *x + *y;
        *prod = *x * *y;
    }

How many memory accesses?

    sp1:
        movsd   (%rdi), %xmm0
        addsd   (%rsi), %xmm0
        movsd   %xmm0, (%rdx)
        movsd   (%rdi), %xmm0
        mulsd   (%rsi), %xmm0
        movsd   %xmm0, (%rcx)
        ret

Eliminating Memory Accesses, Take 3

Compiler can't know that sum or prod will never point to the same location as x or y!

    void sp2(double *x, double *y, double *sum, double *prod) {
        double xlocal = *x;
        double ylocal = *y;
        *sum = xlocal + ylocal;
        *prod = xlocal * ylocal;
    }

How many memory accesses?

    sp2:
        movsd   (%rdi), %xmm0
        movsd   (%rsi), %xmm2
        movapd  %xmm0, %xmm1
        mulsd   %xmm2, %xmm0
        addsd   %xmm2, %xmm1
        movsd   %xmm1, (%rdx)
        movsd   %xmm0, (%rcx)
        ret
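The following sketch shows why the compiler must reload *x and *y in the Take 2 version: if a caller passes the same address for x and sum, the two functions give different answers. The function bodies follow the slides (the four-argument sp1 and sp2); the aliased call pattern is an illustrative assumption.

```c
/* Take 2 version: re-reads *x and *y after the store to *sum. */
void sp1(double *x, double *y, double *sum, double *prod) {
    *sum = *x + *y;
    *prod = *x * *y;
}

/* Take 3 version: reads *x and *y once into locals before any store. */
void sp2(double *x, double *y, double *sum, double *prod) {
    double xlocal = *x;
    double ylocal = *y;
    *sum = xlocal + ylocal;
    *prod = xlocal * ylocal;
}
```

With x = 3.0 and y = 4.0, calling sp1(&x, &y, &x, &prod) first overwrites x with 7.0 and then computes prod = 7.0 * 4.0 = 28.0, whereas sp2(&x, &y, &x, &prod) computes prod = 3.0 * 4.0 = 12.0. Since the results differ under aliasing, the compiler cannot turn sp1 into sp2 on its own. In C99 and later, the restrict qualifier is one way for the programmer to promise that the pointers do not alias.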