FSU. Avoiding Unconditional Jumps by Code Replication. Frank Mueller and David B. Whalley. Florida State University DEPARTMENT OF COMPUTER SCIENCE


Avoiding Unconditional Jumps by Code Replication by Frank Mueller and David B. Whalley Florida State University

Overview

- uncond. jumps occur often in programs [4-10% dynamically]
- produced by loops, conditional statements, etc.
- can almost always be avoided
- technique to avoid uncond. jumps
- introduction
- evaluation

Motivation

- code generated includes uncond. jumps
  - for while-loops and for-loops typically
  - for if-then-else constructs always
  - for other language constructs (break, goto, continue in C)
  - as a side-effect of optimizations
- uncond. jump instruction can be avoided when code is replicated from the target
- methods often employed for a certain class of loops in the front-end
- new method is part of the optimizations in the back-end of the compiler
  - can be applied universally to all uncond. jumps
  - may introduce sources for other optimizations

Example 1 While-Loop (RTLs for 68020)

    i = 1;
    while (x[i]) {
        x[i-1] = x[i];
        i++;
    }

without replication:

          a[0]=a[6]+x.;
    L5:   NZ=B[a[0]+1]?0;
          PC=NZ==0,L6;
          B[a[0]]=B[a[0]+1];
          a[0]=a[0]+1;
          PC=L5;
    L6:   ...

with replication:

          a[0]=a[6]+x.+1;
          NZ=B[a[6]+x.+1]?0;
          a[1]=a[0];
          PC=NZ==0,L6;
    L000: B[a[0]-1]=B[a[1]++];
          a[0]=a[0]+1;
          NZ=B[a[0]]?0;
          PC=NZ!=0,L000;
    L6:   ...
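At the source level, the transformation in Example 1 amounts to peeling a replicated copy of the loop test in front of the loop, so the body can end with a single conditional branch instead of an unconditional jump back to the test. A minimal C sketch of that effect (illustrative only; the function and variable names are mine, not from the paper):

```c
#include <assert.h>
#include <string.h>

/* Original shape: the compiled loop tests at the top and the body ends
 * with an unconditional jump back to that test. */
static void shift_plain(char *x) {
    int i = 1;
    while (x[i]) {
        x[i - 1] = x[i];
        i++;
    }
}

/* After code replication: the test is replicated before the loop, so the
 * loop becomes a do-while whose only branch is the conditional one at the
 * bottom; no unconditional jump remains. */
static void shift_replicated(char *x) {
    int i = 1;
    if (x[i]) {
        do {
            x[i - 1] = x[i];
            i++;
        } while (x[i]);
    }
}
```

Both versions compute identical results; only the control flow differs, which is why the transformation is safe to apply universally in the back-end.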

Example 2 For-Loop (RTLs for 68020)

    for (i = k; i < 10; i++)
        x[i] = y[i];

without replication:

          d[0]=L[a[6]+k.];
          a[0]=d[0]+a[6]+x.;
          a[1]=d[0]+a[6]+y.;
          PC=L8;
    L9:   B[a[0]++]=B[a[1]++];
          d[0]=d[0]+1;
    L8:   NZ=d[0]?10;
          PC=NZ<0,L9;
          ...

with replication:

          d[0]=L[a[6]+k.];
          NZ=d[0]?10;
          PC=NZ>=0,L000;
          a[0]=d[0]+a[6]+x.;
          a[1]=d[0]+a[6]+y.;
    L9:   B[a[0]++]=B[a[1]++];
          d[0]=d[0]+1;
          NZ=d[0]?10;
          PC=NZ<0,L9;
    L000: ...

Example 3 Exit Condition in the Middle of a Loop (RTLs for 68020)

    i = 1;
    while (i++ < n)
        x[i-1] = x[i];

without replication:

          d[1]=1;
          a[0]=a[6]+x.;
    L5:   d[0]=d[1];
          d[1]=d[1]+1;
          NZ=d[0]?L[_n];
          PC=NZ>=0,L6;
          a[0]=a[0]+1;
          B[a[0]]=B[a[0]+1];
          PC=L5;
    L6:   ...

with replication:

          d[0]=1;
          d[1]=2;
          NZ=d[0]?L[_n];
          PC=NZ>=0,L6;
          a[0]=a[6]+x.+1;
    L000: B[a[0]]=B[a[0]+1];
          a[0]=a[0]+1;
          d[0]=d[1];
          d[1]=d[1]+1;
          NZ=d[0]?L[_n];
          PC=NZ<0,L000;
    L6:   ...

Example 4 If-Then-Else Statement (RTLs for 68020)

    if (i > 5)
        i = i / n;
    else
        i = i * n;
    return(i);

without replication:

          NZ=L[a[6]+i.]?5;
          PC=NZ<=0,L22;
          d[0]=L[a[6]+i.];
          d[0]=d[0]/L[a[6]+n.];
          L[a[6]+i.]=d[0];
          PC=L23;
    L22:  d[0]=L[a[6]+i.];
          d[0]=d[0]*L[a[6]+n.];
          L[a[6]+i.]=d[0];
    L23:  a[6]=uk;
          PC=RT;

with replication:

          NZ=L[a[6]+i.]?5;
          PC=NZ<=0,L22;
          d[0]=L[a[6]+i.];
          d[0]=d[0]/L[a[6]+n.];
          L[a[6]+i.]=d[0];
          a[6]=uk;
          PC=RT;
    L22:  d[0]=L[a[6]+i.];
          d[0]=d[0]*L[a[6]+n.];
          L[a[6]+i.]=d[0];
          a[6]=uk;
          PC=RT;
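The source-level effect of Example 4 is that the function epilogue is replicated into each arm of the conditional, so neither arm needs to jump to a common join point before returning. A small C sketch of the idea (hypothetical function names, not from the paper):

```c
#include <assert.h>

/* Without replication: both arms fall into one shared return, so the
 * first arm must end with an unconditional jump over the second. */
static int scale_plain(int i, int n) {
    if (i > 5)
        i = i / n;
    else
        i = i * n;
    return i;
}

/* With replication: the return code is copied into each arm, so each arm
 * ends in a return and the jump to the join point disappears. */
static int scale_replicated(int i, int n) {
    if (i > 5)
        return i / n;
    return i * n;
}
```

The two functions are semantically identical; the replicated version trades a few extra instructions of code size for one less executed jump on the then-path.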

Remote Preheader

[figure: control-flow graphs of the same loop shown twice, labeled "No Preheader" and "Remote Preheader"]

Algorithm JUMPS

1. set up matrix (used to find shortest replication sequence)
2. traverse basic blocks until uncond. jump found
3. choose replication sequence (towards return or loop)
4. expand replication sequence to include all blocks within a loop
5. replicate code and adjust its control flow
6. adjust control flow of the portion of the loop which was not copied
7. if control flow has become non-reducible, then remove replicated code
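The core of steps 2, 3, and 5 can be sketched on a toy control-flow graph: find a block that ends in an unconditional jump, copy the chain of blocks starting at the jump target up to a return, and let the original block fall through into the copy. This is a simplified illustration with hypothetical data structures (not VPO's internals); the shortest-sequence matrix, loop expansion, and reducibility check of steps 1, 4, 6, and 7 are omitted:

```c
#include <assert.h>

#define MAXBLOCKS 16

/* How a basic block ends. */
typedef enum { FALLTHROUGH, JUMP, RETURN } Ending;

typedef struct {
    Ending end;   /* kind of the block's final transfer of control */
    int target;   /* successor index for FALLTHROUGH and JUMP endings */
} Block;

/* Given block b ending in an unconditional jump, append a copy of the
 * replication sequence (the target chain up to and including a block
 * ending in a return) after the last block, and make b fall through into
 * the copy.  Returns the new block count. */
static int replicate_jump_target(Block *bb, int n, int b) {
    int t = bb[b].target;        /* start of the replication sequence */
    bb[b].end = FALLTHROUGH;     /* the unconditional jump disappears */
    bb[b].target = n;            /* ...replaced by the replicated code */
    for (;;) {
        bb[n] = bb[t];           /* replicate one basic block */
        if (bb[t].end == RETURN)
            return n + 1;        /* sequence ends at the return */
        bb[n].target = n + 1;    /* chain the copies together */
        n++;
        t = bb[t].target;
    }
}
```

In the real algorithm the sequence may instead end at the loop header (step 3), and step 7 undoes the replication if it made the flow graph non-reducible.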

Interference with Natural Loops

[figure: control-flow graphs labeled "Without Replication", "With Partial Replication", and "With Loop Replication"]

Partial Overlapping of Natural Loops

[figure: control-flow graphs labeled "Initial Control Flow", "After Replication", and "Adjusted Control Flow"]

Measurements for Motorola 68020/68881 and Sun SPARC

- test programs included benchmarks, UNIX utilities, and applications
- instrumentation of programs at code generation time
- compiled with different sets of optimizations:
  - SIMPLE: standard opt.
  - LOOPS: standard opt. + code replication at loops
  - JUMPS: standard opt. + generalized code repl.

Measurements (cont.) Number of Static and Dynamic Instructions (Sun SPARC)

    program       static instructions           dynamic instructions executed
                  SIMPLE    LOOPS     JUMPS     SIMPLE        LOOPS     JUMPS
    cal              338   +3.25%    +2.89%         37,237   -2.95%    -3.5%
    quicksort         32   +5.6%    +50.6%         836,404   -2.86%    -4.2%
    wc               209   +0.96%   +58.37%        540,58    -0.00%    -1.96%
    grep             968   +4.24%   +79.34%       ,930,79    -0.04%    -3.57%
    sort           1,966   +4.63%   +89.7%        ,8,960     -0.7%     -0.49%
    od             1,352   +4.59%   +95.9%      2,336,04     -8.84%    -0.22%
    mincost        1,068   +6.84%   +30.99%       335,750    -0.59%    -3.9%
    bubblesort        75   +7.43%    +5.4%     29,07,668     -0.05%    -0.07%
    matmult           28   +4.59%    +3.67%     4,403,74     -0.08%    -0.28%
    banner            69   +7.69%   +66.27%         2,565    -1.68%    -0.25%
    sieve             93   +3.23%    +3.23%     2,84,965     -3.73%    -3.73%
    compact          ,49   +1.07%   +75.8%      3,409,945    -1.94%    -4.86%
    queens             4   +0.00%    +7.89%       263,58     -0.00%    -0.03%
    deroff         7,987   +1.50%  +204.98%       448,58     -0.0%     -3.3%
    average          ,76   +3.97%   +56.53%     4,784,59     -2.39%    -5.7%

Measurements (cont.) Number of Static and Dynamic Instructions (Motorola 68020)

    program       static instructions           dynamic instructions executed
                  SIMPLE    LOOPS     JUMPS     SIMPLE        LOOPS     JUMPS
    cal              323   +3.72%   +24.77%        36,290    -3.09%    -3.7%
    quicksort        245   +3.67%   +37.96%       536,566    -0.39%    -3.96%
    wc                73   +0.58%   +56.65%        42,038    -0.00%    -5.32%
    grep             775   +3.35%   +80.90%     1,309,586    -0.03%    -3.44%
    sort           1,558   +3.98%   +63.67%       902,075    -1.49%    -2.43%
    od               ,98   +2.92%   +85.73%     1,980,808    -9.45%    -0.30%
    mincost          906   +3.20%   +35.98%       302,062    -1.0%     -5.3%
    bubblesort        37   +3.65%    +2.92%    20,340,23     -8.92%    -8.92%
    matmult           46   +3.42%    +3.42%     4,89,507     -0.2%     -0.2%
    banner            77   +3.95%   +55.93%         2,473    -1.42%    -3.34%
    sieve             70   +1.43%    +1.43%     1,759,088    -8.53%    -8.53%
    compact          ,43   +0.70%   +73.93%     0,602,59     -1.54%    -5.26%
    queens            94   +0.00%    +2.77%        89,58     -0.00%    -0.05%
    deroff         5,730   +1.06%   +55.7%        360,05     -0.03%    -7.05%
    average          905   +2.55%   +49.37%      3,6,675     -3.30%    -6.94%

Future Work

- handle indirect jumps in the algorithm
  - should improve dynamic savings
  - may reduce size of generated code
- limit length of replication sequence, use depth-bound DFS
  - should reduce size of generated code
  - should improve compile-time overhead of the optimization phase
- determine best phase ordering
  - trade-off between compile time and exploited optimizations

Conclusions

- code replication avoids almost all uncond. jumps
- reduces number of executed instructions by 6%
- increases number of instructions between branches by 1.5 on SPARC
- results in 4% decreased cache work (except for small caches)
- increases code size by 53%
- outperforms traditional methods to avoid uncond. jumps
- should be applied in the back-end of highly optimizing compilers

Order of Optimizations

    branch chaining;
    dead code elimination;
    reorder basic blocks to minimize jumps;
    code replication (either JUMPS or LOOPS);
    dead code elimination;
    instruction selection;
    register assignment;
    if (change)
        instruction selection;
    do {
        register allocation by register coloring;
        instruction selection;
        common subexpression elimination;
        dead variable elimination;
        code motion;
        strength reduction;
        recurrences;
        instruction selection;
        branch chaining;
        constant folding at conditional branches;
        code replication (either JUMPS or LOOPS);
        dead code elimination;
    } while (change);
    filling of delay slots for RISCs;