Principles. Performance Tuning. Examples. Amdahl s Law: Only Bottlenecks Matter. Original Enhanced = Speedup. Original Enhanced.

Similar documents
CS 31: Intro to Systems Functions and the Stack. Martin Gagne Swarthmore College February 23, 2016

Computer Labs: Profiling and Optimization

W4118: PC Hardware and x86. Junfeng Yang

Assembly Language: Function Calls

COMP 210 Example Question Exam 2 (Solutions at the bottom)

Assembly Language: Function Calls" Goals of this Lecture"

Assembly Language: Function Calls" Goals of this Lecture"

4) C = 96 * B 5) 1 and 3 only 6) 2 and 4 only

Second Part of the Course

CS241 Computer Organization Spring 2015 IA

CS241 Computer Organization Spring Introduction to Assembly

Computer Organization - Overview

CSE 351: Week 4. Tom Bergan, TA

Assembly Language: Function Calls. Goals of this Lecture. Function Call Problems

Procedure Calls. Young W. Lim Sat. Young W. Lim Procedure Calls Sat 1 / 27

Introduction to Computer Systems

Introduction to Computer Systems

Homework. In-line Assembly Code Machine Language Program Efficiency Tricks Reading PAL, pp 3-6, Practice Exam 1

Y86 Processor State. Instruction Example. Encoding Registers. Lecture 7A. Computer Architecture I Instruction Set Architecture Assembly Language View

211: Computer Architecture Summer 2016

x86 Assembly Crash Course Don Porter

Simple C Program. Assembly Ouput. Using GCC to produce Assembly. Assembly produced by GCC is easy to recognize:

Procedure Calls. Young W. Lim Mon. Young W. Lim Procedure Calls Mon 1 / 29

Credits and Disclaimers

Computer System Architecture

Putting the pieces together

Turning C into Object Code Code in files p1.c p2.c Compile with command: gcc -O p1.c p2.c -o p Use optimizations (-O) Put resulting binary in file p

CS 31: Intro to Systems ISAs and Assembly. Martin Gagné Swarthmore College February 7, 2017

Computer Labs: Profiling and Optimization

Function Calls COS 217. Reading: Chapter 4 of Programming From the Ground Up (available online from the course Web site)

Systems Architecture I

Introduction to Computer Systems: Semester 1 Computer Architecture

You may work with a partner on this quiz; both of you must submit your answers.

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam

Performance Evaluation. December 2, 1999

Instruction Set Architecture

Instruction Set Architecture

Instruction Set Architectures

x86 architecture et similia

The von Neumann Machine

Assembly Programmer s View Lecture 4A Machine-Level Programming I: Introduction

Instruction Set Architecture

CISC 360 Instruction Set Architecture

Assembly Language. Lecture 2 x86 Processor Architecture

ASSEMBLY I: BASIC OPERATIONS. Jo, Heeseung

Code Optimization September 27, 2001

CPEG421/621 Tutorial

Code Optimization April 6, 2000

Memory Models. Registers

Program Optimization

Instruction Set Architectures

Final exam. Scores. Fall term 2012 KAIST EE209 Programming Structures for EE. Thursday Dec 20, Student's name: Student ID:

Great Reality #4. Code Optimization September 27, Optimizing Compilers. Limitations of Optimizing Compilers

administrivia today start assembly probably won t finish all these slides Assignment 4 due tomorrow any questions?

Lecture 11 Code Optimization I: Machine Independent Optimizations. Optimizing Compilers. Limitations of Optimizing Compilers

Assembly I: Basic Operations. Computer Systems Laboratory Sungkyunkwan University

Question 4.2 2: (Solution, p 5) Suppose that the HYMN CPU begins with the following in memory. addr data (translation) LOAD 11110

Introduction to Computer Systems. Exam 1. February 22, This is an open-book exam. Notes are permitted, but not computers.

Machine-level Representation of Programs. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Machine Programming 1: Introduction

Protection and System Calls. Otto J. Anshus

History of the Intel 80x86

Credits and Disclaimers

Introduction to 8086 Assembly

IA-32 Architecture COE 205. Computer Organization and Assembly Language. Computer Engineering Department

Towards the Hardware"

Process Layout, Function Calls, and the Heap

Compiler Construction D7011E

Virtual Machines and Dynamic Translation: Implementing ISAs in Software

Lecture #16: Introduction to Runtime Organization. Last modified: Fri Mar 19 00:17: CS164: Lecture #16 1

CS429: Computer Organization and Architecture

Instruction Set Architecture

Lecture 16 Optimizing for the memory hierarchy

Assembly I: Basic Operations. Jo, Heeseung

! Must optimize at multiple levels: ! How programs are compiled and executed , F 02

x86 assembly CS449 Fall 2017

CS61 Section Solutions 3

What is a Compiler? Compiler Construction SMD163. Why Translation is Needed: Know your Target: Lecture 8: Introduction to code generation

CS 33: Week 3 Discussion. x86 Assembly (v1.0) Section 1G

Assembly Language: Overview!

CS213. Machine-Level Programming III: Procedures

CS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College February 9, 2016

Subprograms: Local Variables

EE282H: Computer Architecture and Organization. EE282H: Computer Architecture and Organization -- Course Overview

CS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College September 25, 2018

Machine Code and Assemblers November 6

ECE 391 Exam 1 Review Session - Spring Brought to you by HKN

CS 105. Code Optimization and Performance

Assembly III: Procedures. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CIS 2107 Computer Systems and Low-Level Programming Spring 2012 Final

Register Allocation, iii. Bringing in functions & using spilling & coalescing

Multiprocessor Solution

Memory Hierarchy. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Assembly Language Programming Introduction

Meet & Greet! Come hang out with your TAs and Fellow Students (& eat free insomnia cookies) When : TODAY!! 5-6 pm Where : 3rd Floor Atrium, CIT

What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control.

CISC 360. Code Optimization I: Machine Independent Optimizations Oct. 3, 2006

Lecture 18. Optimizing for the memory hierarchy

CISC 360. Code Optimization I: Machine Independent Optimizations Oct. 3, 2006

Time Measurement. CS 201 Gerson Robboy Portland State University. Topics. Time scales Interval counting Cycle counters K-best measurement scheme

Transcription:

Principles Performance Tuning CS 27 Don t optimize your code o Your program might be fast enough already o Machines are getting faster and cheaper every year o Memory is getting denser and cheaper every year o Hand optimization may make the code less readable, less robust, and more difficult to test Performance tuning of bottlenecks o Identify performance bottlenecks o Machine independent algorithm improvements o Machine instruction dependent, but architecture dependent improvements Try not to sacrifice correctness, readability and robustness 2 Amdahl s Law Only Bottlenecks Matter Definition of speedup Speedup = Original Enhanced Amdahl s law (967) OverallSpeedup = ( f ) + o f is the fraction of program enhanced o s is the speedup of the enhanced portion Original Enhanced = Speedup f s Examples Amdahl s law What is the overall speedup if you make 0% of a program 90 times faster? What is the overall speedup if you make 90% of a program 0 times faster OverallSpeedup = ( f ) + 0. ( 0.) + 90 = 0.9 ( 0.9) + 0 0.90 0.9 f s. 5.26 3 4

Identify Performance Bottlenecks Use tools such as gprof to learn where the time goes Each sample counts as 0.0 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 76.2 3.46 3.46 6664590 0.00 0.00 partition 6.74 4.22 0.76 54358002 0.00 0.00 swap 3.74 4.39 0.7 0.7 0.7 fillarray 2.86 4.52 0.3 0.3 4.35 quicksort 0.44 4.54 0.02 printarray More sophisticated tools o Tools that use performance counters to show cache miss/hit etc (e.g. VTune) o Tools for multiprocessor systems (for multi-threaded programs) o Tools to investigate where I/O operations take place 5 Strategies to Speedup Use a better algorithm o Complexity of the algorithm makes a big difference Simple code optimizations o Extract common expression f(x*y + x*z) + g(x*y+x*z) o Loop unrolling for (i=0; i<n; i++) x[i]=y[i]; for (i=0; i<n; i+=4) { /* if N is divisible by 4 */ x[i] = y[i]; x[i+] = y[i+]; x[i+2] = y[i+2]; x[i+3] = y[i+3]; Enable compiler optimizations o Modern compilers perform most of the above optimizations o Example use level 3 optimization in gcc gcc O3 foo.c 6 Strategies to Speedup, con d Memory Hierarchy Improve performance with deep memory hierarchy o Make the code cache-aware o Reduce the number of I/O operations Inline procedures o Remove the procedure call overhead (compilers can do this) Inline assembly o Almost never do this unless you deal with hardware directly o Or when the high-level language is in the way 7 Hardware trends o CPU clock rate doubles every 8-24 months (50% per year) o DRAM and disk Access times improve at a rate about 0% per year o Memory hierarchy is getting deeper (L, L2 and L3 caches) Software performance has become more sensitive to cache misses o Register cycle o L cache hit 2-4 cycles o L2 cache hit ~0 cycles o L3 cache hit ~50 cycles o L3 miss ~500 cycles o Disk I/O ~30M cycles Register L cache L2 cache L3 cache DRAM Disk 8

Example Matrix Multiply Transpose Matrix B First C = A B C = A B T int i, j, k; for (i=0; i<n; i++) for (j=0; j<n; j++) for (k=0; k<n; k++) C[i][j] += A[i][k] * B[k][j]; int i, j, k; for (i=0; i<n; i++) for (j=0; j<n; j++) for (k=0; k<n; k++) C[i][j] += A[i][k] * BT[j][k]; How many cache misses? What about the cache miss situation now? Execution time on tux (N=000, -O3 with gcc) 8.5sec Execution time on tux (N=000, -O3 with gcc) 3sec 9 0 A Blocked Matrix Multiply Inline Procedure C = A B T To specify an inline procedure static inline int plus5(int x) { return x + 5; int i, j, ii, jj, k, block; block = 0; for (ii=0; ii<n; ii+=block) for (jj=0; jj<n; jj+=block) for (i=ii; i<ii+block; i++) for (j=jj; j<jj+block; j++) for (k=0; k<n; k++) C[i][j] += A[i][k] * BT[j][k]; Is this better than using macro? #define plus5(x) (x+5) Execution time on tux (N=000, -O3 with gcc) 4.4sec 2

Why Inline Assembly? For most system modules (>99%), programming in C delivers adequate performance It is more convenient to write system programs in C o Robust programming techniques apply to C better o Modular programming is easier o Testing is easier When do you have to use assembly? o You need to use certain instructions that the compiler don t generate (MMX, SSE, SSE2, and IA32 special instructions) o You need to access some hardware, which is not possible in a highlevel language A compromise is to write most programs in C and as little as possible in assembly inline assembly 3 Inline Assembly Basic format for gcc compiler asm [volatile] ( "asm-instructions" ); asm [volatile] ( "asm-instructions" ); o asm-instructions will be inlined into where this statement is in the C program o The key word volatile is optional telling the gcc compiler not to optimize away the instructions o Need to use \n\t to separate instructions. Otherwise, the strings will be concatenated without space in between. Example o asm volatile( "cli" ); o asm ( pushl %eax\n\t incl %eax ); But, to integrate assembly with C programs, we need a contract on register and memory operands 4 Extended Inline Assembly Register Allocation Extended format asm [volatile] ( asm-instructions" out-regs in-regs usedregs); o Both asm and volatile can be enclosed by o volatile is telling gcc compiler not to optimize away o asm-instructions are assembly instructions o out-regs provide output registers (optional) o in-regs provide input registers (optional) o used-regs list registers used in the assembly program (optional) Use a single letter to specify register allocation constrain Example asm ("addl %0, %" "r" (a), "r" (b) ); o gcc will save and load registers for you o If you use a, b, D, you will need to specify %%eax, %%ebx, a b c d S D I q r g A Meaning eax ebx ecx edx esi edi Constant value (0 to 3) Allocate a register from eax, ebx, ecx, and edx Allocate a register from eax, ebx, ecx, edx, esi, edi eax, ebx, ecx, edx or variable in memory eax and edx combined into a 64-bit integer 5 6

Compile with O (Optimize) Result Is Elsewhere C program gcc S -O foo.c C program gcc S O foo.c asm ("addl %0, % "r" (a), "r" (b) );.text.globl add2.type add2 pushl movl leave ret %ebp %esp, %ebp asm volatile ("addl %0, % "r" (a), "r" (b) );.text.globl add2 add2 pushl %ebp movl %esp, %ebp movl 8(%ebp), %edx movl 2(%ebp), %eax #APP addl %eax, %edx #NO_APP leave ret gcc optimized away the asm instructions! 7 The result is not in %eax. 8 Constrain Register Allocation Summary C program asm ("addl %, %%eax a" (a), "r" (b) ); gcc S O foo.c.text.globl add2 add2 pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax movl 2(%ebp), %edx #APP addl %edx, %eax #NO_APP leave ret 9 Don t optimize your code, unless it is really necessary Use a better algorithm is choice # Then, tune the bottleneck first (Amdahl s law) o Identify the bottlenecks by using tools o Make program cache aware o Reduce I/O operations o Inline procedures o Inline assembly (to access hardware including special instructions) Additional reading besides the textbook o Jon Bentley s Writing Efficient Programs (Prentice-Hall, 982), Programming Pearls and More Programming Pearls (Addision Wesley, 986 and 988) o John Hennessy and David Patterson s Computer Organization and Design The Hardware/Software Interface (Morgan Kaufman, 997) 20

What s Covered in The Final Exam? Rephrase What do I expect you all to know o Master the C language o Modules, interfaces and abstract data types o Memory allocation o Robust programming o Testing o Concept of computer architecture o Basic IA32 instruction set and assembly o How assemblers and linkers work o Use UNIX system services (signal, processes and interprocess communication) o How to write portable code o Performance tuning The final will be in COS 04, 30-330pm, 5/20 Open book and open notes 2