CS553 Lecture: Profile-Guided Optimizations


Profile-Guided Optimizations

Last time
- Instruction scheduling
- Register renaming
- Balanced load scheduling
- Loop unrolling
- Software pipelining

Today
- More instruction scheduling
- Profiling
- Trace scheduling

Motivation for Profiling

Limitations of static analysis
- Compilers can analyze all possible paths but must behave conservatively
- Frequency information cannot be obtained through static analysis

How runtime information helps
- Control flow information: if a branch on some condition c goes one way 90% of the time and the other way 10%, optimize the more frequent path (perhaps at the expense of the less frequent path)
- Memory conflicts:

    st r1,0(r5)
    ld r2,0(r4)

  If r5 and r4 always hold different values, we can move the load above the store

Profile-Guided Optimizations

Basic idea
- Instrument the program and run it on sample inputs to learn its likely runtime behavior
- Use this information to improve instruction scheduling

Many other uses
- Code placement
- Inlining
- Value speculation
- Branch prediction
- Class-based optimization (static method lookup)

Profiling Issues

Profile data
- Collected over a whole program run
- May not be useful (unbiased branches)
- May not reflect all runs
- May be expensive and inconvenient to gather

Continuous profiling [Anderson 97]
- May interfere with the program

Control-Flow Profiles

We commonly gather two types of information
- Execution frequencies of basic blocks
- Branch frequencies of conditional branches

Represent the information in a weighted flow graph. (Figure: blocks 1 and 2 have execution frequency 100; the branch at the end of block 2 takes one edge with frequency 70 and the other with frequency 30, so blocks 3 and 4 have execution frequencies 70 and 30.)

Instrumentation
- Insert instrumentation code at basic block entrances and before each branch
- Take the average of the values from multiple training runs
- Fairly inexpensive

Code Motion Using Control Flow Profiles

Code motion across basic blocks gives increased scheduling freedom
- Move code above a join: if we want to move a statement s1 above the join into one of the joining blocks, we must copy s1 into both joining blocks
- Move code below a split: if we want to move s1 below the split into one successor, we must copy s1 into both successors
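The counter-based instrumentation above can be sketched as a toy profiler over an explicit flow graph. The representation here (a `cfg` successor map, a `decide` callback standing in for the program's actual branch outcomes) is invented for illustration, not part of any real profiling tool:

```python
from collections import Counter

def profile_cfg(cfg, entry, exit_block, decide, runs):
    """Accumulate basic-block execution counts and branch (edge) counts
    across several training runs, as the slide describes.

    cfg maps each block to its list of successors; decide(block, run)
    picks which successor a conditional branch takes on a given run.
    """
    block_counts = Counter()
    edge_counts = Counter()
    for run in runs:
        block = entry
        while True:
            block_counts[block] += 1        # instrumentation at block entrance
            if block == exit_block:
                break
            succs = cfg[block]
            nxt = succs[0] if len(succs) == 1 else decide(block, run)
            edge_counts[(block, nxt)] += 1  # instrumentation before the branch
            block = nxt
    return block_counts, edge_counts
```

Running this on a diamond-shaped graph with a 70/30 branch reproduces the weighted flow graph from the slide's figure.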

Code Motion Using Control Flow Profiles (cont)

Code motion across basic blocks gives increased scheduling freedom
- Move code below a join: duplicating the tail of the trace (tail duplication) prevents the off-trace path from also executing s1
- Move code above a split: if we want to move s1 up from one successor into the block before the split, and s1 would destroy a value along the other path, do renaming (in hardware or software)
- What if s1 might cause an exception?

Memory-Dependence Profiles

Gather information about memory conflicts
- Frequencies of address matches between pairs of loads and stores
- Attempts to answer the question: are two references independent of one another?
- Concentrate on ambiguous reference pairs (those that the compiler cannot disambiguate)

Example:

    st1: store r5
    ld2: load r4

The profile entry (st1, ld2, 7) records how often the two addresses matched. If this number is low, we can speculatively assume that st1 and ld2 do not conflict.

Instrumentation
- Much more expensive (in both space and time) to gather than control flow information
- First perform control flow profiling; apply memory-dependence profiling only to the most frequently executed blocks
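The profile table the slide describes could be gathered with a sketch like the following. The event-trace representation (tuples of operation kind, static instruction id, and effective address) is assumed for illustration:

```python
from collections import Counter

def memdep_profile(trace):
    """Count address matches between each (store, load) pair of static
    instructions, given an executed trace of (op, static_id, address)
    events. A low count for a pair suggests the compiler can
    speculatively reorder the load above the store.
    """
    conflicts = Counter()
    earlier_stores = []              # (static_id, address) events seen so far
    for op, sid, addr in trace:
        if op == "store":
            earlier_stores.append((sid, addr))
        else:                        # a load: compare against earlier stores
            for st_id, st_addr in earlier_stores:
                if st_addr == addr:
                    conflicts[(st_id, sid)] += 1
    return conflicts
```

This is quadratic in the worst case, which reflects the slide's point that memory-dependence profiling is much more expensive than control-flow profiling and should be applied only to hot blocks.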

Trace Scheduling [Fisher 81] [Ellis 85]

Basic idea
- We want large blocks to create large scheduling windows, but basic blocks are small because branches are frequent
- Create superblocks to increase the scheduling window
- Use profile information to create good superblocks
- Optimize each superblock independently

Superblocks
- A sequence of basic blocks with a single entrance and multiple exits

Goals
- Want large superblocks
- Want to avoid early exits
- Want blocks that match actual execution paths

Trace Scheduling (example)

Original code:

    b[i] = old
    a[i] = ...
    if (a[i] > 0) then
      b[i] = new
    else
      stmt X
      stmt Y
    endif
    c[i] = ...

Trace, assuming the then-branch is the frequent path:

    b[i] = old
    a[i] = ...
    b[i] = new
    c[i] = ...
    if (a[i] <= 0) then goto repair
    continue: ...

    repair:
      restore old b[i]
      stmt X
      stmt Y
      recalculate c[i]?
      goto continue

Trace Scheduling (cont)

Three steps
1. Create superblocks
2. Enlarge superblocks
3. Compact (optimize) superblocks

Step 1: Superblock formation
- Create traces using the mutual-most-likely heuristic: two blocks A and B are mutual-most-likely if B is the most likely successor of A and A is the most likely predecessor of B
- A trace is a maximal sequence of mutual-most-likely blocks that does not contain a back edge
- Each block belongs to exactly one trace

(Figure: a weighted flow graph with edge frequencies 70, 30, and 10 illustrating trace selection.)

Convert traces into superblocks
- Use tail duplication to eliminate side entrances: duplicate the tail of the trace so that an off-trace path jumps to the copy instead of entering the trace in the middle
- Tail duplication increases code size
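The mutual-most-likely heuristic can be sketched as a greedy pass over edge frequencies. The `blocks` and `edge_freq` dictionaries below are an assumed toy representation for illustration, not Fisher's original formulation:

```python
def form_traces(blocks, edge_freq, back_edges=()):
    """Greedy trace formation: grow a trace from the hottest unassigned
    block, extending from A to successor B only if B is A's most likely
    successor, A is B's most likely predecessor, B is still unassigned,
    and (A, B) is not a back edge. Every block ends up in exactly one
    trace, as the slide requires.
    """
    def most_likely_succ(a):
        succs = [(f, b) for (x, b), f in edge_freq.items() if x == a]
        return max(succs)[1] if succs else None

    def most_likely_pred(b):
        preds = [(f, a) for (a, y), f in edge_freq.items() if y == b]
        return max(preds)[1] if preds else None

    assigned, traces = set(), []
    # seed traces from the most frequently executed blocks first
    for seed in sorted(blocks, key=blocks.get, reverse=True):
        if seed in assigned:
            continue
        trace, cur = [seed], seed
        assigned.add(seed)
        while True:
            nxt = most_likely_succ(cur)
            if (nxt is None or nxt in assigned or (cur, nxt) in back_edges
                    or most_likely_pred(nxt) != cur):
                break
            trace.append(nxt)
            assigned.add(nxt)
            cur = nxt
        traces.append(trace)
    return traces
```

On a small graph where A branches to B (70) and C (30) and both rejoin at E, this picks the hot trace A-B-E and leaves C in its own trace.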

Trace Scheduling (cont)

Step 2: Superblock enlargement
- Enlarge superblocks that are too small
- But code expansion can hurt i-cache performance

Three techniques for enlargement
- Branch target expansion: if the last branch in a superblock is likely to jump to the start of another superblock, append the contents of the target superblock to the first superblock
- Loop peeling
- Loop unrolling

The last two techniques apply to superblock loops, which are superblocks whose last blocks are likely to jump back to their first blocks. Assume that each loop body has a single dominant path.

Step 3: Optimizations
- Perform list scheduling for each superblock
- Memory-dependence profiles can be used to speculatively assume that load/store pairs do not conflict; insert repair code in case the assumption is incorrect
- Software pipelining
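Branch target expansion, the first enlargement technique, might be sketched as follows. The superblock record with `code`, `target`, and `prob` fields is a made-up representation, and a real compiler would also bound the resulting code growth:

```python
def branch_target_expansion(superblocks, threshold=0.7):
    """If a superblock's final branch jumps to the start of another
    superblock with probability >= threshold, append a copy of the
    target superblock's code (the slide's branch target expansion).
    Processes superblocks in dictionary order, so the result can
    depend on that order; this sketch ignores that subtlety.
    """
    for name, sb in superblocks.items():
        tgt = sb.get("target")
        if tgt is not None and tgt != name and sb.get("prob", 0) >= threshold:
            sb["code"] = sb["code"] + list(superblocks[tgt]["code"])
    return superblocks
```

Note that the target superblock is copied, not moved: it keeps its own code, which is exactly why enlargement trades i-cache footprint for a larger scheduling window.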

Speculation Based on Memory-Dependence Profiles (example)

Original code:

    b[i] = old
    a[i] = ...
    if (a[i] > 0) then
      b[i] = new
    else
      stmt X
      stmt Y
    endif
    c[i] = a[j]

Trace, with the load of a[j] speculatively hoisted above the store to a[i]:

    b[i] = old
    c[i] = a[j]
    a[i] = ...
    b[i] = new
    if (i == j) then goto deprepair
    if (a[i] <= 0) then goto repair
    continue: ...

    deprepair:
      c[i] = a[i]
      if (a[i] <= 0) then goto repair
      goto continue

    repair:
      restore old b[i]
      stmt X
      stmt Y
      goto continue

Enhancements to Profile-Guided Code Scheduling

Path profiling [Ball and Larus 96]
- Collect information about entire paths instead of about individual edges (a single edge profile can be consistent with many different path profiles)
- Limit paths to some specified length, which makes loops manageable; can also stop paths at back edges
- Disadvantages of path profiling?
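The core idea of path profiling, counting whole acyclic paths rather than individual edges, can be sketched as below. This toy collector truncates paths at back edges as the slide suggests; it is far simpler (and far less efficient) than Ball and Larus's path-numbering scheme:

```python
from collections import Counter

def path_profile(runs_of_blocks, back_edges):
    """Count executed paths instead of edges. Each run is a sequence of
    basic-block names; a new path starts whenever execution follows a
    back edge, so every counted path is acyclic.
    """
    paths = Counter()
    for run in runs_of_blocks:
        path, prev = [], None
        for block in run:
            if prev is not None and (prev, block) in back_edges:
                paths[tuple(path)] += 1   # finish the path at the back edge
                path = []
            path.append(block)
            prev = block
        if path:
            paths[tuple(path)] += 1       # final partial path of the run
    return paths
```

A loop with header H and two alternative bodies yields distinct path counts for H-A-B and H-A-C, information an edge profile alone cannot recover.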

Lessons

Larger scope helps
- How can we increase scope?
- How do we schedule across control dependences?

Static information is limited
- Use profiles
- How else can profiles be used in optimization?
- Can we do these kinds of optimizations at runtime?

Concepts

Instruction scheduling
- Software pipelining
- Trace scheduling
- Both use profile information
- Both look at scopes beyond basic blocks

Miscellany
- Path profiling
- Speculative hedging

Next Time

Reading
- Mahlke 92

Lecture
- Speculation and predication