Program Optimization


Professor Jennifer Rexford
http://www.cs.princeton.edu/~jrex

Goals of Today's Class
Improving program performance
o When and what to optimize
o Better algorithms & data structures vs. tuning the code
Exploiting an understanding of the underlying system
o Compiler capabilities
o Hardware architecture
o Program execution
Why?
o To be effective, and efficient, at making programs faster
  - Avoid optimizing the fast parts of the code
  - Help the compiler do its job better
o To review material from the second half of the course

Improving Program Performance
Most programs are already fast enough
o No need to optimize performance at all
o Save your time, and keep the program simple/readable
Most parts of a program are already fast enough
o Usually only a small part makes the program run slowly
o Optimize only this portion of the program, as needed
Steps to improve execution (time) efficiency
o Do timing studies (e.g., with gprof, or by hand as in the sketch below)
o Identify the hot spots
o Optimize that part of the program
o Repeat as needed

Ways to Optimize Performance
Better data structures and algorithms
o Improves the asymptotic complexity
  - Better scaling of computation/storage as the input grows
  - E.g., going from an O(n^2) sorting algorithm to O(n log n)
o Clearly important if large inputs are expected
o Requires understanding data structures and algorithms
Better source code the compiler can optimize
o Improves the constant factors
  - Faster computation during each iteration of a loop
  - E.g., going from 1000n to 10n running time
o Clearly important if a portion of code is running slowly
o Requires understanding hardware, compiler, execution
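To make the timing-study step concrete, here is a minimal sketch of a by-hand measurement using clock() from the standard library; gprof gives the same kind of answer per function without modifying the code. The work() function is a hypothetical stand-in for the code under study.

#include <stdio.h>
#include <time.h>

/* Hypothetical hot spot; substitute the code being studied. */
static long work(long n) {
    long sum = 0;
    long i;
    for (i = 0; i < n; i++)
        sum += i % 7;
    return sum;
}

int main(void) {
    clock_t start = clock();
    long result = work(100000000L);
    clock_t end = clock();
    printf("result = %ld, CPU time = %.3f sec\n",
           result, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}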

Helping the Compiler Do Its Job

Optimizing Compilers
Provide efficient mapping of program to machine
o Register allocation
o Code selection and ordering
o Eliminating minor inefficiencies
Don't (usually) improve asymptotic efficiency
o Up to the programmer to select the best overall algorithm
Have difficulty overcoming optimization blockers
o Potential function side effects
o Potential memory aliasing

Limitations of Optimizing Compilers
Fundamental constraint
o The compiler must not change program behavior
o Ever, even under rare pathological inputs
Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
o Data ranges may be more limited than the variable types suggest
o Array elements may remain unchanged by function calls
Most analysis is performed only within functions
o Whole-program analysis is too expensive in most cases
Most analysis is based only on static information
o The compiler has difficulty anticipating run-time inputs

Avoiding Repeated Computation
A good compiler recognizes simple optimizations
o Avoiding redundant computations in simple loops
o Still, the programmer may want to make them explicit
Example: repetition of the computation n * i

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

can be rewritten as

for (i = 0; i < n; i++) {
    int ni = n * i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

Worrying About Side Effects
The compiler cannot always avoid repeated computation
o It may not know if the code has a side effect
o that would make the transformation change the code's behavior
Is this transformation okay?

int func1(int x) {
    return f(x) + f(x) + f(x) + f(x);
}

rewritten as

int func1(int x) {
    return 4 * f(x);
}

Not necessarily, if f has a side effect:

int counter = 0;

int f(int x) {
    return counter++;
}

And f may be defined in another file, known only at link time!

Another Example on Side Effects
Is this optimization okay?

for (i = 0; i < strlen(s); i++) {
    /* Do something with s[i] */
}

rewritten as

length = strlen(s);
for (i = 0; i < length; i++) {
    /* Do something with s[i] */
}

Short answer: it depends
o The compiler often cannot tell
o Most compilers do not try to identify side effects
The programmer knows best
o And can decide whether the optimization is safe

Memory Aliasing
Is this optimization okay?

void twiddle(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

rewritten as

void twiddle(int *xp, int *yp) {
    *xp += 2 * *yp;
}

Not necessarily: what if xp and yp are equal?
o First version: the result is 4 times the original *xp
o Second version: the result is 3 times the original *xp

Memory Aliasing
Memory aliasing
o A single data location accessed through multiple names
o E.g., two pointers that point to the same memory location
Modifying the data using one name
o implicitly modifies the values seen through the other names
Aliasing blocks optimization by the compiler
o The compiler cannot tell when aliasing may occur
o and so must forgo optimizing the code
The programmer often does know
o And can optimize the code accordingly
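When the programmer does know that aliasing cannot occur, C99 provides a way to say so. Below is a sketch (an illustration added here, not from the original slides) using the restrict qualifier, which promises the compiler that xp and yp never refer to the same location; behavior is undefined if the promise is broken.

/* The programmer promises no aliasing, so the two updates
 * can safely be combined into one. */
void twiddle(int *restrict xp, int *restrict yp) {
    *xp += 2 * *yp;
}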

Another Aliasing Example
Is this optimization okay?

int *x, *y;
...
*x = 5;
*y = 10;
printf("x=%d\n", *x);

rewritten as

printf("x=5\n");

Not necessarily
o If y and x point to the same location in memory,
o the correct output is "x=10\n"

Summary: Helping the Compiler
The compiler can perform many optimizations
o Register allocation
o Code selection and ordering
o Eliminating minor inefficiencies
But often the compiler needs your help
o Knowing if code is free of side effects
o Knowing if memory aliasing will not happen
Modifying the code can lead to better performance
o Profile the code to identify the hot spots
o Look at the assembly language the compiler produces
o Rewrite the code to get the compiler to do the right thing

Exploiting the Hardware

Underlying Hardware
Implements a collection of instructions
o The instruction set varies from one architecture to another
o Some instructions may be faster than others
Registers and caches are faster than main memory
o The number of registers and the sizes of caches vary
o Exploit both spatial and temporal locality
Exploits opportunities for parallelism
o Pipelining: decoding one instruction while running another
  - Benefits from code that runs in a sequence
o Superscalar: performing multiple operations per clock cycle
  - Benefits from operations that can run independently
o Speculative execution: performing instructions before knowing they will be reached (e.g., without knowing the outcome of a branch)

Addition Faster Than Multiplication
Adding instead of multiplying
o Addition is faster than multiplication
Recognize sequences of products, and replace the multiplication with repeated addition

for (i = 0; i < n; i++) {
    int ni = n * i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

becomes

int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}

Bit Operations Faster Than Arithmetic
Shift operations multiply/divide by powers of 2
o x >> 3 is faster than x / 8
o x << 3 is faster than x * 8

    53      = 00110101
    53 << 2 = 11010100  (= 212)

Bit masking is faster than the mod operation
o x & 15 is faster than x % 16

    53      = 00110101
  & 15      = 00001111
    -----------------
     5      = 00000101
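A quick check of the equivalences above, as a self-contained sketch; note that they hold as written only for non-negative values (in C, shifting and masking negative numbers does not match / and % with negative operands).

#include <stdio.h>

int main(void) {
    int x = 53;
    printf("%d %d\n", x << 2, x * 4);    /* 212 212 */
    printf("%d %d\n", x >> 3, x / 8);    /* 6 6 */
    printf("%d %d\n", x & 15, x % 16);   /* 5 5 */
    return 0;
}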

Caching: Matrix Multiplication
Caches
o Slower than registers, but faster than main memory
o Both instruction caches and data caches
Locality
o Temporal locality: recently referenced items are likely to be referenced in the near future
o Spatial locality: items with nearby addresses tend to be referenced close together in time
Matrix multiplication
o Multiply n-by-n matrices A and B, and store the result in matrix C
o Performance depends heavily on effective use of the caches

Matrix Multiply: Cache Effects (ijk order)

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

Reasonable cache effects
o Good spatial locality for A: the inner loop scans row (i,*)
o Poor spatial locality for B: the inner loop strides down column (*,j)
o Good temporal locality for C: element (i,j) is reused on every inner iteration

Matrix Multiply: Cache Effects (jki order)

for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

Rather poor cache effects
o Bad spatial locality for A: the inner loop strides down column (*,k)
o Good temporal locality for B: element (k,j) is reused on every inner iteration
o Bad spatial locality for C: the inner loop strides down column (*,j)

Matrix Multiply: Cache Effects (kij order)

for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

Good cache effects
o Good temporal locality for A: element (i,k) is reused on every inner iteration
o Good spatial locality for B: the inner loop scans row (k,*)
o Good spatial locality for C: the inner loop scans row (i,*)
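The difference between these orderings is easy to measure. Below is a minimal sketch that times the kij and jki orders on the same matrices; the size N, the static arrays, and the use of clock() are arbitrary illustrative choices, not from the lecture. (c accumulates across both runs, which does not affect the timing.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

/* Run one N-by-N multiply in kij order; report seconds. */
static double time_kij(void) {
    int i, j, k;
    clock_t start = clock();
    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Same multiply in jki order; report seconds. */
static double time_jki(void) {
    int i, j, k;
    clock_t start = clock();
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            for (i = 0; i < N; i++)
                c[i][j] += a[i][k] * b[k][j];
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void) {
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = (double)rand() / RAND_MAX;
            b[i][j] = (double)rand() / RAND_MAX;
            c[i][j] = 0.0;
        }
    printf("kij: %.2f sec\n", time_kij());
    printf("jki: %.2f sec\n", time_jki());   /* typically much slower */
    return 0;
}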

Parallelism: Loop Unrolling
What limits the performance?

for (i = 0; i < length; i++)
    sum += data[i];

Limited apparent parallelism
o One main operation per iteration (plus book-keeping)
o Not enough work to keep multiple functional units busy
o Disruption of the instruction pipeline from frequent branches
Solution: unroll the loop
o Perform multiple operations on each iteration

Parallelism: After Loop Unrolling
Original code

for (i = 0; i < length; i++)
    sum += data[i];

After loop unrolling (by three)

/* Combine three elements at a time */
limit = length - 2;
for (i = 0; i < limit; i += 3)
    sum += data[i] + data[i+1] + data[i+2];

/* Finish any remaining elements */
for ( ; i < length; i++)
    sum += data[i];
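Unrolling can be pushed one step further. Since a single running sum still chains every addition through one variable, a common follow-on (a sketch added here, not shown in the lecture) is to split the sum into independent accumulators so consecutive additions do not depend on each other and can keep multiple functional units busy:

/* Two independent accumulators; sum, data, i, and length as above. */
int sum0 = 0, sum1 = 0;
for (i = 0; i + 1 < length; i += 2) {
    sum0 += data[i];     /* these two additions are independent */
    sum1 += data[i+1];
}
for ( ; i < length; i++)   /* finish any leftover element */
    sum0 += data[i];
sum = sum0 + sum1;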

Program Execution

Avoiding Function Calls
Function calls are expensive
o The caller saves registers and pushes arguments on the stack
o The callee saves registers and pushes local variables on the stack
o Call and return disrupt the sequential flow of the code
Function inlining replaces a call with the body of the called function:

void g(void) {
    /* Some code */
}

void f(void) {
    g();
}

becomes

void f(void) {
    /* Some code */
}

Some compilers support an inline keyword directive.
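To make the inline hint concrete, here is a minimal sketch using C99's inline (combined with static so the definition stays local to the file); the compiler may, but is not required to, expand g in place.

/* The compiler is free to substitute g's body at the call site. */
static inline void g(void) {
    /* Some code */
}

void f(void) {
    g();   /* likely expanded in place: no call/return overhead */
}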

Writing Your Own Malloc and Free
Dynamic memory management
o malloc() to allocate blocks of memory
o free() to release blocks of memory
Existing malloc and free implementations
o Designed to handle a wide range of request sizes
o Good most of the time, but rarely the best for all workloads
Designing your own dynamic memory management
o Forgo the traditional malloc/free, and write your own
o E.g., if you know all blocks will be the same size (see the sketch after the Conclusion slide)
o E.g., if you know blocks will usually be freed in the order allocated
o E.g., <insert your known special property here>

Conclusion
Work smarter, not harder
o No need to optimize a program that is fast enough
o Optimize only when, and where, necessary
Speeding up a program
o Better data structures and algorithms: better asymptotic behavior
o Optimized code: smaller constants
Techniques for speeding up a program
o Coax the compiler
o Exploit the capabilities of the hardware
o Capitalize on knowledge of program execution
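Returning to the first special case above (all blocks the same size): below is a minimal sketch of such an allocator, carving fixed-size chunks out of a static pool and chaining freed chunks on a free list. The names (Chunk, pool_alloc, pool_free) and the sizes are hypothetical, chosen for illustration.

#include <stddef.h>

#define CHUNK_SIZE 64     /* fixed payload size, by assumption */
#define NUM_CHUNKS 1024   /* capacity of the static pool */

typedef union Chunk {
    union Chunk *next;          /* link, valid while on the free list */
    char payload[CHUNK_SIZE];   /* user data, valid while allocated */
} Chunk;

static Chunk pool[NUM_CHUNKS];
static Chunk *freeList = NULL;  /* chunks returned by pool_free() */
static size_t nextUnused = 0;   /* chunks never yet handed out */

void *pool_alloc(void) {
    if (freeList != NULL) {     /* reuse a freed chunk: O(1) */
        Chunk *c = freeList;
        freeList = freeList->next;
        return c;
    }
    if (nextUnused < NUM_CHUNKS)
        return &pool[nextUnused++];   /* carve a fresh chunk */
    return NULL;                      /* pool exhausted */
}

void pool_free(void *p) {
    Chunk *c = p;               /* caller passes a pool_alloc'd block */
    c->next = freeList;         /* push onto the free list: O(1) */
    freeList = c;
}

Both operations run in constant time with no per-block header overhead, which is why a specialized allocator can beat a general-purpose malloc when the workload fits its assumptions.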

Course Wrap Up

The Rest of the Semester
Dean's Date: Tuesday, May 12
o Final assignment due at 9pm
o Cannot be accepted after 11:59pm
Final Exam: Friday, May 15
o 1:30-4:20pm in Friend Center 101
o Exams from previous semesters are online at http://www.cs.princeton.edu/courses/archive/spring09/cos217/exam2prep/
o Covers the entire course, with emphasis on the second half of the term
o Open book, open notes, open slides, etc. (just no computers!)
  - No need to print/bring the IA-32 manuals
Office hours during reading/exam period
o Daily; times TBA on the course mailing list
Review sessions
o May 13-14; time TBA on the course mailing list

Goals of COS 217
Understand the boundary between code and computer
o Machine architecture
o Operating systems
o Compilers
Learn C and the Unix development tools
o C is widely used for programming low-level systems
o Unix has a rich development environment
o Unix is open and well-specified, good for study & research
Improve your programming skills
o More experience in programming
o Challenging and interesting programming assignments
o Emphasis on modularity and debugging

Relationship to Other Courses
Machine architecture
o Logic design (306) and computer architecture (471)
o COS 217: assembly language and basic architecture
Operating systems
o Operating systems (318)
o COS 217: virtual memory, system calls, and signals
Compilers
o Compiling techniques (320)
o COS 217: the compilation process, symbol tables, assembly and machine language
Software systems
o Numerous courses, independent work, etc.
o COS 217: programming skills, UNIX tools, and ADTs

Lessons About Computer Science
Modularity
o Well-defined interfaces between components
o Allows changing the implementation of one component without changing another
o The key to managing complexity in large systems
Resource sharing
o Time sharing of the CPU by multiple processes
o Sharing of the physical memory by multiple processes
Indirection
o Representing the address space with virtual memory
o Manipulating data via pointers (or addresses)

Lessons Continued
Hierarchy
o Memory: registers, cache, main memory, disk, tape, ...
o Balancing the trade-off between fast/small and slow/big
Bits can mean anything
o Code, addresses, characters, pixels, money, grades, ...
o Arithmetic can be done through logic operations
o The meaning of the bits depends entirely on how they are accessed, used, and manipulated

Have a Great Summer!!!