Optimal ILP and Register Tiling: Analytical Model and Optimization Framework

Lakshminarayanan Renganarayana, Upadrasta Ramakrishna, Sanjay Rajopadhye
Computer Science Department, Colorado State University

Overview
  ILP and register reuse
  Execution time and register pressure functions
  Optimal ILP and register tiling problem
  Optimal tiling problem as a convex optimization problem
  Validation
  Related work
  Conclusions and future work
October 21, 2005 LCPC '05

ILP and Register Reuse
  Loop programs
    dominate application execution time
    are the main sources of ILP and register reuse
  Transformations
    expose / exploit ILP
    enable register reuse
  These transformations interact in subtle ways
  ILP - Register Reuse tradeoff?

ILP - Register Reuse Tradeoff
  Optimal combination of transformations
  Quantification of interactions
  A mathematical model
    to study the interactions
    to choose the optimal transformation parameters
  To the best of our knowledge, no such model has been studied

Contributions
  Cost model with transformation parameters as variables
    closed forms: execution time and register pressure
  Convex optimization problem formulation
  A globally optimal solution
  First such formulation and optimal solution

Exposing and Exploiting ILP
  Exposing ILP
    Unroll and Jam
    Loop permutation or skewing
    Multi-dimensional scheduling
  Exploiting ILP
    DAG schedulers
    Software pipelining

Exposing ILP with Unroll and Jam
    for i1
      for i2
        A[i1, i2] = A[i1-1, i2] + A[i1, i2-1]
  Unrolled loop body (9 iterations)
  DAG exposes parallelism
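A minimal Python sketch of unroll and jam on a two-deep recurrence of the kind this slide shows. The array name A, the boundary values of 1, and both function names are assumptions made for illustration. The outer loop is unrolled by 2 here to keep the body short, whereas the slide's example has a 9-iteration unrolled body.

```python
def recurrence(n1, n2):
    """Reference loop nest: A[i1][i2] = A[i1-1][i2] + A[i1][i2-1]."""
    # Row 0 and column 0 hold boundary values (assumed to be 1).
    A = [[1] * (n2 + 1) for _ in range(n1 + 1)]
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A


def recurrence_unroll_jam(n1, n2):
    """Unroll the outer loop by 2 and jam both copies into one inner loop."""
    assert n1 % 2 == 0, "sketch assumes an even outer trip count"
    A = [[1] * (n2 + 1) for _ in range(n1 + 1)]
    for i1 in range(1, n1 + 1, 2):
        for i2 in range(1, n2 + 1):
            # Two statements per inner iteration: a larger body gives the
            # scheduler a bigger DAG to extract parallelism from.
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
            A[i1 + 1][i2] = A[i1][i2] + A[i1 + 1][i2 - 1]
    return A
```

Both functions compute the same array; only the order of iterations and the size of the inner loop body differ.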

Exposing ILP with Permutation
  Original loop order:
    for i1
      for i2
        A[i1, i2] = 3.23 * A[i1, i2-1]
  Permuted loop order:
    for i2
      for i1
        A[i1, i2] = 3.23 * A[i1, i2-1]
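A small Python sketch of the permutation idea, assuming a loop body of the form A[i1][i2] = 3.23 * A[i1][i2-1] (the array name and boundary values are illustrative assumptions). The only dependence runs along i2, so once i1 is the inner loop its iterations are independent of each other.

```python
def scale_original(n1, n2):
    """i1 outer, i2 inner: the inner loop carries the dependence."""
    A = [[1.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i1 in range(n1 + 1):
        for i2 in range(1, n2 + 1):
            A[i1][i2] = 3.23 * A[i1][i2 - 1]
    return A


def scale_permuted(n1, n2):
    """i2 outer, i1 inner: no two inner iterations depend on each other."""
    A = [[1.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i2 in range(1, n2 + 1):
        for i1 in range(n1 + 1):
            A[i1][i2] = 3.23 * A[i1][i2 - 1]
    return A
```

Each element is produced by the same chain of multiplications in both versions, so the results match exactly.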

Exposing ILP with Skewing
    for i1
      for i2
        A[i1, i2] = A[i1-1, i2] + A[i1, i2-1]
  All the iterations of the inner-most loop are parallel
  Sufficient ILP
  Performance limited only by the available execution resources
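A hedged sketch of what skewing buys for a recurrence with dependences along both loops. The outer loop walks anti-diagonals w = i1 + i2 and the inner loop walks the points on one diagonal. The array name A, the boundary values, and the `reverse` flag are assumptions for illustration. Running the inner loop backwards gives the same answer, which is one way to see that its iterations are independent.

```python
def wavefront(n1, n2, reverse=False):
    """Execute A[i1][i2] = A[i1-1][i2] + A[i1][i2-1] diagonal by diagonal."""
    A = [[1] * (n2 + 1) for _ in range(n1 + 1)]
    for w in range(2, n1 + n2 + 1):          # w = i1 + i2 (skewed outer loop)
        lo, hi = max(1, w - n2), min(n1, w - 1)
        order = range(lo, hi + 1)
        if reverse:
            order = reversed(order)          # any order works inside a diagonal
        for i1 in order:
            i2 = w - i1
            # Both operands lie on diagonal w - 1, already complete.
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A
```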

Register Reuse
  Unroll and Jam, scalar replacement
    scalar replacement enables register placement
    classic register allocators are sufficient
  Loop tiling, array register allocation
    registers allocated to array values
    no code size increase
    requires an array register allocator

Scalar Replacement
  Original loop:
    for i1
      for i2
        A[i1, i2] = A[i1-1, i2] + A[i1, i2-1]
  Unrolled 2 x 2:
    for i1 step 2
      for i2 step 2
        A[i1, i2]     = A[i1-1, i2]   + A[i1, i2-1]
        A[i1+1, i2]   = A[i1, i2]     + A[i1+1, i2-1]
        A[i1, i2+1]   = A[i1-1, i2+1] + A[i1, i2]
        A[i1+1, i2+1] = A[i1, i2+1]   + A[i1+1, i2]
  With A[i1, i2] replaced by a scalar T:
    for i1 step 2
      T = A[i1, 0]
      for i2 step 2
        T             = A[i1-1, i2]   + T
        A[i1+1, i2]   = T             + A[i1+1, i2-1]
        A[i1, i2+1]   = A[i1-1, i2+1] + T
        A[i1+1, i2+1] = A[i1, i2+1]   + A[i1+1, i2]
        A[i1, i2]     = T
  Saves 2 loop-independent loads plus 1 loop-carried load
  T can be allocated to a register
  Which array references to scalar replace?
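A simplified Python sketch of scalar replacement, deliberately without the 2 x 2 unrolling used on the slide. The value that the next inner iteration would reload is kept in a scalar `t` instead. The array name A, the boundary values, and the explicit `loads` counter are assumptions added to make the saving visible.

```python
def baseline(n1, n2):
    """Every iteration reads two array elements."""
    A = [[1] * (n2 + 1) for _ in range(n1 + 1)]
    loads = 0
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
            loads += 2
    return A, loads


def scalar_replaced(n1, n2):
    """Carry A[i1][i2-1] in the scalar t across inner iterations."""
    A = [[1] * (n2 + 1) for _ in range(n1 + 1)]
    loads = 0
    for i1 in range(1, n1 + 1):
        t = A[i1][0]                 # one load per row to seed the scalar
        loads += 1
        for i2 in range(1, n2 + 1):
            t = A[i1 - 1][i2] + t    # only the row above is read from the array
            loads += 1
            A[i1][i2] = t
    return A, loads
```

For a 4 x 5 iteration space the counter drops from 40 array reads to 24 while the computed array is unchanged.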

Tiling
  Similar to Unroll and Jam
  Decreases lifetime of values
  Limits MAXLIVE
    for i1
      for i2
        A[i1, i2] = A[i1-1, i2] + A[i1, i2-1]

Register Tiling
    for i1
      for i2
        A[i1, i2] = A[i1-1, i2] + A[i1, i2-1]
  3x3 register tile
  Tile sizes
    affect load/store savings
    are constrained by the number of registers
  How to choose the tile sizes?
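A hedged Python sketch of rectangular tiling for the same kind of recurrence, counting only the operands that come from outside the current tile (the values a register tile would have to load). The array name A, boundary values, function name, and the counting rule are illustrative assumptions, not the cost function from the talk.

```python
def tiled(n1, n2, t1, t2):
    """Run the recurrence tile by tile and count operands read from outside a tile."""
    A = [[1] * (n2 + 1) for _ in range(n1 + 1)]
    outside_loads = 0
    for b1 in range(1, n1 + 1, t1):                      # tile origin along i1
        for b2 in range(1, n2 + 1, t2):                  # tile origin along i2
            for i1 in range(b1, min(b1 + t1, n1 + 1)):
                for i2 in range(b2, min(b2 + t2, n2 + 1)):
                    if i1 == b1:
                        outside_loads += 1               # A[i1-1][i2] is above the tile
                    if i2 == b2:
                        outside_loads += 1               # A[i1][i2-1] is left of the tile
                    A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1]
    return A, outside_loads
```

With a 6 x 6 iteration space, 3 x 3 tiles read 24 operands from outside their tile, against 72 when every iteration is its own 1 x 1 tile. Larger tiles save more loads but need more registers to hold the tile's values.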

Traditional vs. Our Approach
  Traditional approach: choose optimal unroll and scalar promotion parameters
    Code transformation: Unroll and Jam + Scalar Promotion
    Scheduling and register allocation: DAG Scheduler or Software Pipelining + Scalar Register Allocation
  Our approach: choose optimal skew and tile parameters
    Code transformation: Permutation or Skewing + Tiling
    Scheduling and register allocation: Software Pipelining + Scalar & Array Register Allocation

Program, Tiling, and Architecture Class
  Input loops
    perfectly nested, rectangular loops
    uniform dependence bodies
  Rectangular tiling
    we assume the input loop nest admits rectangular tiling
  ILP exposed by permutation or skewing
  Architectures: superscalar or VLIW

Execution Time (when permutation exposes ILP)
    T = (ntiles * tile_cost) + loop_overhead
    tile_cost = max(comp_cost, load_store_cost)
    comp_cost = α * tile_vol
    load_store_cost = β * LS(t, D)
    loop_overhead = η * LO(t, N)
    ntiles = (N_1 / t_1) * ... * (N_n / t_n)
  where
    t = vector of tile sizes
    N = vector of iteration space sizes
    D = dependence matrix
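The model above can be written down almost literally. The sketch below assumes tile_vol is the product of the tile sizes; the slides do not give closed forms for LS and LO, so they are passed in as callables, and the two lambdas in the usage lines are hypothetical stand-ins rather than the authors' functions.

```python
from math import prod


def execution_time(t, N, alpha, beta, eta, ls, lo):
    """T = ntiles * max(alpha * tile_vol, beta * LS) + eta * LO."""
    ntiles = prod(n / s for n, s in zip(N, t))   # (N_1/t_1) * ... * (N_n/t_n)
    tile_vol = prod(t)                           # assumed: iterations per tile
    comp_cost = alpha * tile_vol
    load_store_cost = beta * ls(t)
    tile_cost = max(comp_cost, load_store_cost)
    loop_overhead = eta * lo(t, N)
    return ntiles * tile_cost + loop_overhead


# Hypothetical stand-ins: loads grow with the tile perimeter,
# loop overhead is paid once per tile.
example_ls = lambda t: t[0] + t[1]
example_lo = lambda t, N: (N[0] / t[0]) * (N[1] / t[1])

T = execution_time((4, 5), (100, 100), alpha=1, beta=2, eta=3,
                   ls=example_ls, lo=example_lo)
```

With these stand-ins the tile is compute-bound (comp_cost 20 against load_store_cost 18), so T = 500 * 20 + 3 * 500 = 11500.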

Execution Time Model (when permutation cannot expose ILP: skew)
  Skewing affects
    iteration space shape: makes counting partial tiles, full tiles, and the number of tiles hard
    dependence lengths: affects the amount of data loaded / stored in a tile
  Partial tiles are treated as full tiles
  Number of tiles is approximated by (N_1 / t_1) * ... * (N_n / t_n)
  Dependence matrix = SD
  LS(t, SD) is the load-store volume
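A small sketch of how a skew changes the dependence matrix, assuming the convention that each column of D is one dependence vector and the skewed matrix is the product S * D. The particular S and D below are hypothetical examples, not taken from the talk.

```python
def skew_dependences(S, D):
    """Return S * D for matrices given as lists of rows."""
    inner, cols = len(D), len(D[0])
    return [[sum(S[r][k] * D[k][c] for k in range(inner)) for c in range(cols)]
            for r in range(len(S))]


# Dependence vectors (1,0), (0,1), (1,1) stored as columns.
D = [[1, 0, 1],
     [0, 1, 1]]
# Skew the second loop by the first: i2' = i1 + i2.
S = [[1, 0],
     [1, 1]]

SD = skew_dependences(S, D)
```

Here the columns become (1,1), (0,1), (1,2): two of the three dependences got longer, which is why the load-store volume has to be evaluated on SD rather than D.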

Optimal ILP and Register Tiling: Optimization Problem Formulation
    minimize    TotalExecutionTime(t, S)
    subject to  LoadStoreVolume(t, S) ≤ Registers
  For a fixed skew S
    t is the only variable
    the optimization problem reduces to an integer convex optimization problem
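To make the shape of the fixed-skew problem concrete, here is a brute-force sketch that enumerates integer tile sizes and keeps the cheapest one that fits the register budget. This is exhaustive search swapped in for illustration; the talk solves the problem as an integer convex program instead. The cost and footprint functions, the problem sizes, and all names are hypothetical.

```python
from itertools import product


def best_tile_sizes(n_dims, registers, cost, footprint, max_size):
    """Return (cost, t) minimizing cost(t) subject to footprint(t) <= registers."""
    best = None
    for t in product(range(1, max_size + 1), repeat=n_dims):
        if footprint(t) > registers:
            continue                      # violates the register constraint
        c = cost(t)
        if best is None or c < best[0]:
            best = (c, t)
    return best


N = (64, 64)                              # hypothetical iteration space sizes


def example_cost(t):
    # Same structure as the execution time model, with made-up coefficients.
    ntiles = (N[0] / t[0]) * (N[1] / t[1])
    tile_cost = max(1.0 * t[0] * t[1], 4.0 * (t[0] + t[1]))
    return ntiles * tile_cost + 10.0 * ntiles


def example_footprint(t):
    return t[0] * t[1]                    # assumed: one register per tile element


best = best_tile_sizes(2, 16, example_cost, example_footprint, max_size=16)
```

With 16 registers this toy instance picks the square tile (4, 4). Enumeration grows exponentially with loop depth, which is one reason a convex formulation with standard solvers is attractive.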

Solution Steps
  Can permutation expose a parallel loop?
  Yes:
    No skewing, only tiling
    Fix S = I in the optimization problem
    Solve for optimal tile sizes
    Single integer convex optimization problem
  No:
    Construct the set Γ of valid skews
    For each element in Γ, solve the fixed-skew optimization problem
    Pick the best
    Only d(d-1) problems
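The first decision in these steps can be sketched with a simple sufficient test: a loop whose component is zero in every dependence vector carries no dependence, so it can be moved innermost and its iterations run in parallel. The slides do not spell out the test the authors use, so the function below is an assumption for illustration, with dependence vectors given as tuples.

```python
def loops_parallel_after_permutation(deps):
    """Return indices of loops that no dependence vector has a nonzero entry for."""
    n = len(deps[0])
    return [k for k in range(n) if all(d[k] == 0 for d in deps)]


# Body with a single dependence along i2: loop 0 (i1) can go innermost.
only_i2 = loops_parallel_after_permutation([(0, 1)])

# Dependences along both loops: nothing qualifies, so skewing is needed.
both = loops_parallel_after_permutation([(1, 0), (0, 1)])
```

An empty result corresponds to the "No" branch, where a set of valid skews has to be tried.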

Solving for Optimal Tile Sizes
  The optimization problem for tile sizes is an Integer Geometric Program (à la Integer Linear Programs)
  GPs can be transformed into convex optimization problems
  Standard solvers are available
  Running time
    depends on the number of variables and constraints
    a few seconds (< 10 secs.)
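For reference, the standard change of variables behind the GP-to-convex step, written for tile sizes t_1, ..., t_n. This is textbook geometric programming rather than anything specific to the talk, and the symbols c_k, a_{ik}, and y_i are introduced here for illustration.

```latex
% A posynomial in the tile sizes (all coefficients positive):
f(t) = \sum_{k} c_k \prod_{i=1}^{n} t_i^{a_{ik}}, \qquad c_k > 0 .

% Substitute t_i = e^{y_i} and take logarithms:
\log f(e^{y}) = \log \sum_{k} \exp\Big( \sum_{i=1}^{n} a_{ik}\, y_i + \log c_k \Big),

% which is a log-sum-exp of affine functions of y, hence convex in y.

% Example: the tile count is a monomial, so its logarithm is affine in y:
\log \prod_{i=1}^{n} \frac{N_i}{t_i} = \sum_{i=1}^{n} \big( \log N_i - y_i \big).
```

The integrality of the tile sizes is what keeps the problem an integer program after this transformation.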

Validation
  Experimental validation requires
    an array register allocator
    architectural support (like rotating registers)
  A similar model is used for finding the optimal unroll factor
    optimal unroll factors can be found with small tweaks
  In tiling for the memory hierarchy we have successfully used a similar model
    almost all the cost models used by other researchers can be cast into our GP framework [RR-SC04]

Related Work
  Unroll and Jam approach: [Callahan et al.-90], [Carr-Kennedy-94], [Sarkar-01]
  Hierarchical tiling: [Carter et al.-95], [Mitchell et al.-98]
  Software pipelining of loop nests: [Ramanujam-94], [Rong et al.-04], [Rong et al.-05]
  Code generation for register tiling: [Jiménez et al.-02], [Sarkar-01]

Conclusions & Future Work
  A mathematical formulation of the combined ILP and register tiling problem
  A globally optimal solution
  Future work
    adapting modulo schedulers to pipeline skewed loops
    developing an array register allocator
    experimental validation on benchmarks