If-Conversion SSA Framework and Transformations SSA 09

If-Conversion SSA Framework and Transformations SSA 09 Christian Bruel 29 April 2009

Motivations
Embedded VLIW processors have architectural constraints:
- No out-of-order support, no full predication, several functional units
- Need conditional support and compiler techniques to provide aggressive ILP
Traditional if-conversion techniques for partial predication:
- Local if-then-else constructs or pattern-matching-style optimizations
- Global hyperblock approach followed by a full-predication if-conversion algorithm
- No global framework that allows extending the ISA's set of predicated instructions => a configurable if-conversion algorithm is needed
This paper presents:
- An SSA if-conversion algorithm using the select instruction and speculation
- An extension to this algorithm to support a configurable set of predicated instructions

If-Conversion
If-conversion: the process of converting a control-flow region into a sequence of conditional instructions.
- Removes conditional branches and simplifies control flow (needs architectural support)
- Increases ILP
- Increases locality
- Some optimisations are more efficient with a single basic block (software pipelining, instruction scheduling)
Global problem:
- Control-flow regions can be complex
- Not limited to simple regions ("hammock") or pattern matching
- Lots of tradeoffs
- Limit code-size explosion, or even reduce it
Local problem:
- Predicate construction and allocation
- Variable renaming
- Predicated instruction construction
- Predicated instruction optimizations

3 types of architectural support
Example: a conditional assignment, if-converted under each support:
- select:              c = cmp t,0 ; t = r+1 ; r = select c,t,r
- Partial predication: c = cmp t,0 ; c? r = ldw
- Full predication:    c = cmp t,0 ; c? r = r+1
The framework is able to support all, or a mix, of those architectural supports:
- Balance speculation and predication
Minimum requirement is a form of conditional move, plus predicate building (cmp) and merging (logical and/or)
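The select flavor above can be mimicked in plain code. A minimal sketch (the names `select`, `branchy` and `if_converted` are illustrative, not from the paper) showing that the branchy form and the speculated select form compute the same result:

```python
def select(c, t, f):
    """Hardware-style select: returns t when condition c holds, else f."""
    return t if c else f

def branchy(t, r):
    # Original control flow: a conditional branch guards the update.
    if t != 0:
        r = r + 1
    return r

def if_converted(t, r):
    # If-converted form: the addition is speculated unconditionally,
    # then a select picks the right value. No branch remains.
    c = (t != 0)   # predicate building (cmp)
    tmp = r + 1    # speculated instruction
    return select(c, tmp, r)
```

Both paths are straight-line code after conversion, which is what makes the result amenable to scheduling into a single basic block.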

If-Conversion (why SSA)
(CFG diagram: one branch defines r, the other computes l=2 ; r=l+3, and the join block uses r)
Q1: which r to use?
Q2: what is the best connection to the definition? (smallest predicate set)
Q3: which values need to hold a predicate?

If-Conversion (why SSA)
(CFG diagram, after SSA: the branch definitions become r1 = ... and l=2 ; r2=l+3, joined by r=ф(r1,r2) before Use r)
Q1: which r to use? A1: SSA does the renaming
Q2: what is the best connection to the definition? (smallest predicate set) A2: the PHI shows which variables are conditional; merge the immediate condition
Q3: which values need to hold a predicate? A3: only those that have a reaching join point; the others are speculated

If-Conversion (why SSA)
(CFG diagram, after if-conversion: l=2 ; r2=l+3 and r1 = ... are flattened, followed by r=ф(r1,r2) and Use r)
- r1 depends on the TRUE predicate
- r2 depends on c
- l is not processed: speculated
SSA holds (almost) all the information we need:
- Need a global framework to handle more complex regions
- And a phi-walking process

Local SSA transformations
ф removal:
  before: r1=s+1 ; r2=2 ; r=ф(r1,r2)
  after:  r1=s+1 ; r2=2 ; r=select c?r1:r2
ф reduction:
  before: r1=s+1 ; r2=2 ; r=ф(r1,r2,r3)
  after:  r1=s+1 ; r2=2 ; r4=select c?r1:r2 ; r=ф(r4,r3)
ф augmentation:
  before: r1=s+1 ; r2=2 ; r3=ф(r2,r4,r5) ; r=ф(r1,r3)
  after:  r1=s+1 ; r2=2 ; r6=r2 ; r3=ф(r4,r5) ; t=ф(r1,r6,r3)
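As a rough executable sketch, ф removal and ф reduction can be written over a tiny tuple-based instruction representation (the IR encoding and function names here are mine, not the paper's):

```python
# Minimal IR: ("phi", dest, [srcs]) and ("select", dest, cond, a, b).
# phi_removal turns a 2-operand phi into a single select on the
# controlling predicate; phi_reduction peels the first two operands of a
# wider phi into a select feeding a smaller phi, as in "ф reduction".

def phi_removal(phi, cond):
    op, dest, srcs = phi
    assert op == "phi" and len(srcs) == 2
    a, b = srcs
    return [("select", dest, cond, a, b)]

def phi_reduction(phi, cond, fresh):
    op, dest, srcs = phi
    assert op == "phi" and len(srcs) > 2
    a, b, *rest = srcs
    return [("select", fresh, cond, a, b),
            ("phi", dest, [fresh] + rest)]
```

Repeated ф reduction followed by a final ф removal is one way such a pipeline can lower any phi, condition by condition, into a chain of selects.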

Local SSA transformation (predicate merging)
When a block is reached on two conditions, the predicates are merged before ф removal:
  before: r1=s+1 ; r2=2 ; r=ф(r1,r2)
  after:  c=c1&c2 ; r1=s+1 ; r2=2 ; r=select c?r1:r2
Costs:
- One predicate register
- One extra op in the critical path
- Higher dependence height

IFC-SSA Global Framework
A set of transformations is applied iteratively in postorder on the CFG:
- Considered regions are groups of basic blocks with a single conditional entry
- Blocks reached on multiple conditions are detected (predicate merge)
- Side entries can be removed using block duplication
While the region grows, when do we stop?
- The decision to continue is reconsidered at each iteration, based on cost functions and static information
- Continue as long as all instructions from the candidate region can be removed, speculated or predicated
- The objective function is local, so it is easy to compute: the whole must not exceed the sum of the parts
- Stop on hazardous instructions (generally)

IFC-SSA running example
(four CFG snapshots of blocks BB1 to BB6, showing the region being progressively if-converted)
Regions processed in postorder: BB4, BB2, BB1

SSA for partial predication
Problem: transforming ф into select realizes the join points
- Need to relink selects into the equivalent predicated instructions
- Principle: transform speculation into predication
Example (select form):
  t=test1 ; br<t>
  p=test2
  r1=ldw ; r2=sub ; s=select p?r1:r2
  r3=add
  r=ф(s,r3)

SSA for partial predication
Problem: transforming ф into select realizes the join points
- Need to relink selects into the equivalent predicated instructions
- Principle: transform speculation into predication
Not very convenient: not SSA anymore
- Predication introduces a renaming problem
- Either keep the select form and add a new speculation->predication pass after out-of-SSA
  - Breaks the objective function and is incompatible with an SSA code generator
- Or directly generate an extended SSA for predication (the current implementation uses ψ-SSA)
Example:
  select form:      t=test1 ; br<t> ; p=test2 ; r1=ldw ; r2=sub ; s=select p?r1,r2 ; r3=add ; r=ф(s,r3)
  predicated form:  t=test1 ; br<t> ; p=test2 ; p? s=ldw ; !p? s=sub ; r3=add ; r=ф(s,r3)

SSA for partial predication (predicate construction)
Nested predicates are automatically supported: the select is speculated like any other instruction
- Predicates are handled as data operand dependencies
The predicated version is even more difficult to express fully in SSA
- Need to express SSA renaming with speculated merging points
- And complicated renaming after SSA
- Let SSA do it with an extension
Example:
  select form:
    t=test1 ; p=test2
    r1=ldw ; r2=sub
    s=select p?r1,r2
    r3=add
    r=select t?s,r3
  predicated select form:
    t=test1 ; p=test2
    p&t? r1=ldw ; !p&t? r2=sub
    s=select p?r1,r2
    !t? r3=add
    r=select t?s,r3
  (r1 is defined on p and r2 is defined on !p -> coalesced, can be renamed; s is defined for t, r3 is defined on !t -> can be renamed)
  fully predicated form:
    t=test1 ; p=test2
    p&t? r=ldw ; !p&t? r=sub
    !t? r=add
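The nested-select form can be checked mechanically. A small Python sketch (the helper names are mine; the three value arguments stand in for the ldw, sub and add results) evaluating the slide's example for every predicate combination:

```python
def select(c, a, b):
    return a if c else b

def nested_selects(t, p, ldw, sub, add):
    # Speculated form from the slide: both selects are computed
    # unconditionally; predicates are consumed as plain data operands.
    s = select(p, ldw, sub)    # s = select p ? r1 : r2
    return select(t, s, add)   # r = select t ? s : r3

def fully_predicated(t, p, ldw, sub, add):
    # Fully predicated form: exactly one guarded definition of r fires.
    if p and t:
        return ldw             # p&t?  r = ldw
    if (not p) and t:
        return sub             # !p&t? r = sub
    return add                 # !t?   r = add
```

Enumerating all four (t, p) combinations confirms that the two lowerings agree, which is why the renaming described on the slide is safe.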

Status
Implemented for 3 kinds of conditional-instruction support:
- ST231 speculative model: 4-issue VLIW (4 32-bit ALUs, 2 multipliers, 1 load/store unit, 64 x 32-bit registers)
  - select instruction, 8 1-bit branch registers, wired and/or instructions
  - Speculative loads using static analysis or dismissible loads
  - Speculative stores using compiler-generated dummy stack slots
- ST240: predicated variant using predicated loads and stores
- Internal experimental core with a fully predicated ISA
Implemented in the production C/C++ Open64 code generator:
- After middle-end optimizations and code selection
- Before loop unrolling and local and global scheduling
Further possible improvements:
- Allow multiple-exit regions -> efficient region selection
- Add backedge coalescing to improve if-converted loop bodies
- Improve scalar optimisations on predicated instructions

Results
Benchmark set: MiBench and SPEC 2000
No code-size bloat (or even a decrease)
Select speculated model:
- 23% geometric mean improvement for multimedia applications
- (~3% for scalar applications)
Partially predicated model with ψ-SSA:
- 25% geometric mean improvement for multimedia applications
wc.c: only one backedge branch remaining
- Statically scheduled loop in a single basic block
- 31 cycles -> 23 cycles

Conclusion
Performance experience:
- Fitness to resource availability
- Objective function very easy to compute locally: the if-converted region's cost doesn't exceed the sum of the parts, weighted by edge frequencies
- Dynamically adapts to the speculated or predicated model when an alternative is possible
- Speculation proves more efficient than predication due to smaller dependence height
  - Predication needs many predicate registers (while speculation needs more general registers)
  - But predication increases the set of if-convertible regions (alleviates hazards)
SSA naturally sets the environment for if-conversion analysis:
- Needs to be maintained to be used incrementally
- Nested conditionals can be fed back into the process
Speculated model:
- We enter in strict SSA form and produce strict SSA
- Conditional instructions are realized with a select instruction
Predicated model:
- Enter in strict SSA form and produce extended (ψ) SSA, or strict SSA + select (needs post-SSA transformations)
Papers:
- "If-Conversion SSA Framework for Partially Predicated VLIW Architectures", C. Bruel, CGO ODES-04: Workshop on Optimizations for DSP and Embedded Systems
- "If-Conversion for Embedded VLIW Architectures", C. Bruel, IJES - International Journal of Embedded Systems. To be published.

Thanks
Thanks to Christophe Guillon and Francois de Ferriere for their feedback, suggestions and (sometimes) bug reports. Questions?