IA-64 Compiler Technology
1 IA-64 Compiler Technology David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav (speaker), Carole Dulong Microcomputer Software Lab Page-1
2 Introduction. IA-32 compiler optimizations: Profile Guidance (PGOPTI); Inter-procedural Optimizer (IPO); machine-independent optimizations; Vectorizer; machine-dependent optimizations (strength reduction, code scheduling). IA-64 compilers are focused on optimizations increasing instruction-level parallelism.
3 Agenda: Compiler optimizations designed to achieve the best performance on IA-64. Overview of Intel's IA-64 compiler; machine-independent optimizations; IA-64 code samples; IA-64-specific optimizations. IA-64 compiler technology is ready, and exploits the architecture.
4 Overview of Intel's IA-64 compiler. Compiler architecture: C++ front end; FORTRAN 90 front end; Profile Guidance (PGOPTI); Inter-procedural Optimizer (IPO); High-Level Optimizer (HLO); Global Optimizer (IL0); Code Generator (ECG).
5 Overview of Intel's IA-64 compiler. Global Optimizer (IL0): loop unrolling; register variable detection; constant propagation; convert optimization; strength reduction; copy propagation; re-association; redundancy elimination; dead store elimination; dead code elimination.
6 Overview of Intel's IA-64 compiler. Code Generator (ECG): machine instruction lowering; software pipelining (rotating registers, loop branches); predication (parallel compares); global scheduling (control and data speculation, multi-way branches); block ordering (branch hints); global register allocation (register stack, ALAT, UNAT); function splitting (I-cache and TLB locality).
7 Agenda: Compiler optimizations designed to achieve the best performance on IA-64. Overview of Intel's IA-64 compiler; machine-independent optimizations; IA-64 code samples; IA-64-specific optimizations.
8 Machine-independent optimizations: profile-guided optimizations (PGOPTI); inter-procedural optimizations (IPO). The larger the compilation scope, the better the performance.
9 Machine-independent optimizations: profile-guided optimizations (PGOPTI). Execution profiling (profile feedback): compile once with counters inserted; run the profiled program with a sample input set; compile again, using the counter values. Its effects: branch probabilities guide code-generator regions and predication regions; execution frequencies guide procedure inlining/partial inlining, code placement, and speculation heuristics. Profile information is key to getting the best out of predication and speculation techniques.
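The compile/run/recompile cycle described above maps directly onto the profile-feedback switches of present-day compilers. A minimal sketch, assuming GCC's `-fprofile-generate`/`-fprofile-use` flags rather than the Intel compiler's own PGOPTI switches (`app.c` and `sample_input.txt` are illustrative names):

```shell
# Pass 1: instrument -- the compiler inserts execution counters
gcc -O2 -fprofile-generate app.c -o app

# Training run: counter values are written out (as .gcda files)
./app sample_input.txt

# Pass 2: recompile using the counter values to guide inlining,
# code placement, and branch-probability decisions
gcc -O2 -fprofile-use app.c -o app
```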
10 Machine-independent optimizations: inter-procedural optimizations (IPO). Procedure integration: inlining copies a function body into a call site; partial inlining copies the hot portion of a function into a call site, and the remainder becomes a splinter function; cloning specializes a function to a specific class of call sites. Its effects: exact inter-procedural information for a specific call site; a larger code region to schedule; optimization spreads across function boundaries. A larger scope enables better scheduling and register allocation.
11 Machine-independent optimizations: IPO procedure integration, inlining. BEFORE: void func1() { int i; for (i=0; ) func2(i); } void func2(int x) { a[x] = 1.0; } AFTER inlining: void func1() { int i; for (i=0; ) a[i] = 1.0; } Eliminates function call overhead.
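The slide's pair, made runnable; the array `a`, its element type, and the loop bound `N` are assumptions filled in for illustration (the slide elides them):

```c
#include <stddef.h>

#define N 8
static double a[N];

/* BEFORE: func1 pays one function call per iteration */
static void func2(int x) { a[x] = 1.0; }

static void func1_before(void) {
    for (int i = 0; i < N; i++)
        func2(i);
}

/* AFTER inlining: the callee's body replaces the call site,
   eliminating the function call overhead */
static void func1_after(void) {
    for (int i = 0; i < N; i++)
        a[i] = 1.0;
}
```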
12 Machine-independent optimizations: IPO procedure integration, partial inlining. BEFORE: void func(tree *p) { func2(p); } void func2(tree *p) { if (p->left == NULL) return; ... printf( ); } AFTER partial inlining: void func(tree *p) { if (p->left == NULL) return; else { splinter(p); } } void splinter(tree *p) { ... printf( ); } Avoids unnecessary function calls.
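A runnable sketch of the AFTER shape; the `tree` type's layout, the printf payload, and the call counter are illustrative assumptions, not from the slide:

```c
#include <stdio.h>

/* Hypothetical tree node -- only the field the hot test needs */
typedef struct tree { struct tree *left; } tree;

static int splinter_calls = 0;

/* The cold remainder of func2, split off as the splinter function */
static void splinter(tree *p) {
    splinter_calls++;
    printf("processing node %p\n", (void *)p);
}

/* AFTER partial inlining: the hot early-return test is inlined into
   the caller, so the common case makes no call at all */
static void func(tree *p) {
    if (p->left == NULL)
        return;              /* hot path: no call */
    splinter(p);             /* cold path only */
}
```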
13 Machine-independent optimizations: IPO procedure integration, cloning. BEFORE: void func1() { func2_0(1, n); func2_0(1, m); func2(i, m); } -- wait, before cloning: void func1() { func2(1, n); func2(1, m); func2(i, m); } void func2(int i, int j) { if (i == 0) /* do something */ else /* do something else */ } AFTER cloning: void func1() { func2_0(1, n); func2_0(1, m); func2(i, m); } void func2_0(int i, int j) { /* do something */ } Eliminates branches.
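A runnable sketch of cloning; the bodies of the two arms are assumptions (the slide only shows comments), so each arm returns a distinct value to make the specialization visible:

```c
/* General version: must test i on every call */
static int func2(int i, int j) {
    if (i == 0)
        return j;         /* "do something" */
    else
        return i + j;     /* "do something else" */
}

/* Clone specialized for the call sites that always pass i == 1:
   the i == 0 test and its branch are resolved at compile time */
static int func2_0(int i, int j) {
    (void)i;              /* known to be 1 at these call sites */
    return 1 + j;
}
```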
14 Machine-independent optimizations: IPO inter-procedural analysis: alias propagation; mod/ref propagation; constant propagation. Its effects: alias propagation enables indirect-call conversion, better disambiguation, and better scheduling and speculation; mod/ref propagation improves registerization; constant propagation makes constant parameters and globals explicit and removes unnecessary conditional code. IPO improves many classical scalar optimizations.
15 Agenda: Compiler optimizations designed to achieve the best performance on IA-64. Overview of Intel's IA-64 compiler; machine-independent optimizations; IA-64 code samples; IA-64-specific optimizations.
16 IA-64 code samples: strength reduction; post-increments. Techniques to leverage IA-64 architecture features.
17 IA-64 code samples: strength reduction. When it applies: one operand is a loop invariant; the other operand is an induction variable. Its effects: multiplication is replaced by addition; better chances for post-increment creation. Strength reduction is key for optimization of loops.
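The same transformation sketched at source level; the array contents, stride, and function names are illustrative:

```c
/* BEFORE: each iteration multiplies the induction variable by a
   loop-invariant stride to form the element index */
static long sum_strided(const long *base, long stride, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += base[i * stride];
    return s;
}

/* AFTER strength reduction: the multiply becomes a running add --
   a pointer that advances by the stride each iteration */
static long sum_strided_sr(const long *base, long stride, int n) {
    long s = 0;
    const long *p = base;          /* tracks base + i*stride */
    for (int i = 0; i < n; i++) {
        s += *p;
        p += stride;
    }
    return s;
}
```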
18 IA-64 code samples: post-increment loads and stores. When it applies: the memory address is modified after the memory reference. Its effects: the addition proceeds in parallel with the load/store; no extra instructions are required for the induction-variable update. Post-increment increases parallelism.
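At source level the analogue is pointer post-increment: one expression both loads and advances the address, mirroring IA-64's ld f34 = [r36], r35 form (the function itself is an illustrative sketch):

```c
/* Sum an array using post-increment addressing: the load and the
   induction-variable update are a single operation */
static long sum_postinc(const long *p, int n) {
    long s = 0;
    while (n-- > 0)
        s += *p++;     /* load *p, then advance p */
    return s;
}
```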
19 IA-64 code samples: integer multiplies in loops (cont.). SOURCE: FUNCTION FUNC(A,B,N) REAL A(N,N), B(N,N) DO I = 1, N A(1,I) = B(1,I) RETURN END. Addressing: address of B(1,I) = base of B + size of element * (I-1). Naive translation, ASSEMBLY: r35 = base of B; loop: setf.sig f32 = r32; setf.sig f33 = r33; xma f34 = f32, f33; getf.sig r34 = f34; add r36 = r34, r35; ld f34 = [r36]; br ...
20 IA-64 code samples: strength reduction. BEFORE: r35 = base of B; loop: setf.sig f32 = r32; setf.sig f33 = r33; xma f34 = f32, f33; getf.sig r34 = f34; add r36 = r34, r35; ld f34 = [r36]; br ... AFTER strength reduction: r36 = base of B; r35 = stride of B; loop: ld f34 = [r36]; add r36 = r36, r35; br ... Strength reduction replaces multiplication with additions.
21 IA-64 code samples: strength reduction + post-increment. BEFORE: r36 = base of B; r35 = stride of B; loop: ld f34 = [r36] (cycle 0); add r36 = r36, r35 (cycle 1); br ... AFTER post-increment: r36 = base of B; r35 = stride of B; loop: ld f34 = [r36], r35; br ... Post-increment replaces the addition.
22 Agenda: Compiler optimizations designed to achieve the best performance on IA-64. Overview of Intel's IA-64 compiler; machine-independent optimizations; IA-64 code samples; machine-specific optimizations.
23 Machine-specific optimizations: predication; software pipelining.
24 Machine-specific optimizations: predication. BEFORE (branching): if (A < B) then S += A else S += B; *p = S. AFTER (predicated): p1,p2 = cmp(a < b); (p1) S += A; (p2) S += B; *p = S. Objective: eliminate hard-to-predict branches; increase parallelism; reduce the critical path. A side effect: reduced register usage (as compared to speculation). Eliminates branches, increases parallelism.
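If-conversion can be imitated at source level with a predicate pair. A sketch -- the mask arithmetic stands in for IA-64 predicate registers, which the compiler would use directly, and the function names are illustrative:

```c
#include <stdint.h>

/* BEFORE: a data-dependent, hard-to-predict branch */
static int64_t step_branchy(int64_t s, int64_t a, int64_t b) {
    if (a < b)
        s += a;
    else
        s += b;
    return s;
}

/* AFTER if-conversion: one compare produces the predicate pair
   p1/p2, and both arms execute under their predicates --
   (p1) S += A; (p2) S += B -- with no branch on the critical path */
static int64_t step_predicated(int64_t s, int64_t a, int64_t b) {
    int64_t p1 = (a < b);
    int64_t p2 = !p1;
    return s + p1 * a + p2 * b;
}
```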
25 Machine-specific optimizations: predication with parallel compare. Source: if (a > b && b > c) then Y else X. Code: p1 = 0; cmp.le.or p1,p0 = a,b; cmp.le.or p1,p0 = b,c; (p1) br X; Y. Objective: reduce control height; reduce the number of branches. Parallel compare reduces control height.
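The control-height reduction sketched at source level; the single OR-combined predicate stands in for the two cmp.le.or instructions (function names are illustrative):

```c
/* BEFORE: two dependent tests -- a control height of two branches */
static int take_y_nested(int a, int b, int c) {
    if (a > b)
        if (b > c)
            return 1;    /* Y */
    return 0;            /* X */
}

/* AFTER: both compares evaluate independently and OR into one
   predicate, as the cmp.le.or pair does on IA-64 */
static int take_y_parallel(int a, int b, int c) {
    int p1 = (a <= b) | (b <= c);   /* p1 set => branch to X */
    return !p1;                     /* otherwise fall into Y */
}
```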
26 Machine-specific optimizations: multi-way branches with predication. Source: if (a > b && b > c) then Y else X. Code: cmp.le p1,p0 = a,b; cmp.gt p2,p0 = b,c; (p1) br X; (p2) br Y. Use multi-way branches; speculate the compare (i.e., move it above the branch). This avoids predicate initialization, allows more than one branch in a single cycle, and allows n-way branching. Predication and multi-way branches increase ILP.
27 Machine-specific optimizations: software pipelining, loop example: convert a string to uppercase. for (i=0; i<len; i++) { if (IS_LOWERCASE(line[i])) newline[i] = CNVT_TO_UPPERCASE(line[i]); else newline[i] = line[i]; } After macro expansion: for (i=0; i<len; i++) { if (line[i] >= 'a' && line[i] <= 'z') newline[i] = line[i] - 32; else newline[i] = line[i]; } A typical integer-type loop.
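The macro-expanded loop, made runnable as a function:

```c
/* Convert ASCII lowercase letters to uppercase, as in the slide's
   macro-expanded loop; other characters are copied unchanged */
static void to_upper(const char *line, char *newline, int len) {
    for (int i = 0; i < len; i++) {
        if (line[i] >= 'a' && line[i] <= 'z')
            newline[i] = line[i] - 32;   /* 'a'..'z' -> 'A'..'Z' */
        else
            newline[i] = line[i];
    }
}
```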
28 Machine-specific optimizations: software pipelining. Overlapping execution of different loop iterations, vs. computing a whole loop iteration in one cycle. Overlap iterations for parallelism.
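A two-stage source-level imitation of the idea; the real transformation is done by the code generator on machine instructions, so this only shows iterations overlapping, with an explicit prologue and epilogue (function names and the +1 payload are illustrative):

```c
/* BEFORE: each iteration's load and store run back to back */
static void add1_copy(const int *src, int *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* AFTER 2-stage software pipelining: while iteration i stores,
   the load for iteration i+1 is already in flight */
static void add1_copy_swp(const int *src, int *dst, int n) {
    if (n <= 0)
        return;
    int next = src[0];                 /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
        int cur = next;
        next = src[i + 1];             /* stage 1 of iteration i+1 */
        dst[i] = cur + 1;              /* stage 2 of iteration i */
    }
    dst[n - 1] = next + 1;             /* epilogue: drain last store */
}
```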
29 Machine-specific optimizations: software pipelining, stage diagram. Cycle 1: ld (pipe filling); cycle 2: ld, cmps; cycle 3: ld, cmps, sub; cycle 4: ld, cmps, sub, st (kernel: all stages active); then cmps, sub, st drain the pipe.
30 Machine-specific optimizations: software pipelining, introducing rotating registers. GRs and FRs can rotate, with a separate rotating register base (RRB) for each. Loop branches decrement all register rotating bases. Instructions contain a virtual register number; RRB + virtual register number = physical register number. Rotating registers make data transfer between stages transparent.
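The name mapping can be modeled in a few lines. The slide states only "RRB + virtual register number = physical register number"; the rotating-region size (96) and its base (r32) below are illustrative assumptions, not the slide's figures:

```c
/* Virtual-to-physical mapping for a rotating register file:
   physical = virtual + RRB, wrapping within the rotating region */
#define ROT_BASE 32   /* assumed first rotating register */
#define ROT_SIZE 96   /* assumed size of the rotating region */

static int phys_reg(int virt, int rrb) {
    int off = virt - ROT_BASE + rrb;
    off = ((off % ROT_SIZE) + ROT_SIZE) % ROT_SIZE;  /* wrap */
    return ROT_BASE + off;
}
```

After one loop branch (RRB goes from 0 to -1), the value an instruction wrote as r34 is read back as r35, because both names now map to the same physical register.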
31 Machine-specific optimizations: software pipelining, a pipelined loop over the string "G o B l u e !". Kernel code: loop: ld r34 = [ra], 1; cmp p1 = true; cmp.and p1 = (r35 > 96); cmp.and p1 = (r35 < 123); (p1) sub r36 = r36, 32; st [rb] = r37, 1; br.ctop loop.
32-35 Fill the pipe: each prologue iteration executes only the stages that already have data; the loop branch then decrements lc and rotates the registers by decrementing the RRB (RRB = 0, -1, -2, ...), so the character loaded into virtual r34 reappears as r35 in the compare stage, r36 in the subtract stage, and r37 in the store stage of later iterations.
36-38 Execute the kernel: once the pipe is full, a whole iteration completes per cycle, with the ld, the compares, the sub, and the st of four different iterations executing together.
39 Machine-specific optimizations: software pipelining, IA-64 features that make this possible: full predication; special branch-handling features; register rotation, which removes loop-copy overhead. Traditional architectures use loop unrolling instead, with high overhead: extra code for the loop body. Software pipelining is especially useful for integer code with a small number of loop iterations.
40 Summary. IA-64 compilers use existing compiler technologies: inter-procedural optimizations to increase the compilation scope, and profile-guided optimizations to make efficient use of predication and speculation. Compilers take advantage of the IA-64 architecture with new compiler technology: predication to remove branches and increase parallelism, and architecture features that allow significant speedup in software-pipelined loops.
41 IA-64 Compiler Status, by OS: Win64 (Microsoft), Monterey64 (IBM/SCO), IA-64 Linux, Modesto (Novell), Solaris (Sun). C/C++: MS, IBM / IBM, EPC / Cygnus, EPC / Sun / SGI, EPC. FORTRAN: EPC, Compaq / EPC / PGI, EPC / N/A / PGI. COBOL: Fujitsu / IBM / N/A / Fujitsu. JAVA: IBM / IBM / Novell / Sun. Early access versions planned for Q1 '00; engaging with additional tools vendors in Q3 '99; production compilers in sync with Itanium system availability.
42 Call to Action. IA-64 compiler technology is ready and able; get started on IA-64 today. Review the IA-64 track software sessions at intel.com/design/ia64/devinfo.htm. Getting Software Ready for IA-64: learn how to get the best performance from your apps; learn what you can do to get started today. OSV breakout sessions: get the latest OS status from the vendors themselves; learn more details on OS tools availability and optimizations.
More informationCS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines Assigned April 7 Problem Set #5 Due April 21 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationCOSC 6385 Computer Architecture - Review for the 2 nd Quiz
COSC 6385 Computer Architecture - Review for the 2 nd Quiz Fall 2006 Covered topic area End of section 3 Multiple issue Speculative execution Limitations of hardware ILP Section 4 Vector Processors (Appendix
More informationSimone Campanoni Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple
More informationWhat Compilers Can and Cannot Do. Saman Amarasinghe Fall 2009
What Compilers Can and Cannot Do Saman Amarasinghe Fall 009 Optimization Continuum Many examples across the compilation pipeline Static Dynamic Program Compiler Linker Loader Runtime System Optimization
More informationLecture: Static ILP. Topics: loop unrolling, software pipelines (Sections C.5, 3.2)
Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2) 1 Loop Example for (i=1000; i>0; i--) x[i] = x[i] + s; Source code FPALU -> any: 3 s FPALU -> ST : 2 s IntALU -> BR :
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines
ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More informationJust-In-Time Compilers & Runtime Optimizers
COMP 412 FALL 2017 Just-In-Time Compilers & Runtime Optimizers Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved.
More informationIF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4
12 1 CMPE110 Fall 2006 A. Di Blas 110 Fall 2006 CMPE pipeline concepts Advanced ffl ILP ffl Deep pipeline ffl Static multiple issue ffl Loop unrolling ffl VLIW ffl Dynamic multiple issue Textbook Edition:
More informationECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 15 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationCS 701. Class Meets. Instructor. Teaching Assistant. Key Dates. Charles N. Fischer. Fall Tuesdays & Thursdays, 11:00 12: Engineering Hall
CS 701 Charles N. Fischer Class Meets Tuesdays & Thursdays, 11:00 12:15 2321 Engineering Hall Fall 2003 Instructor http://www.cs.wisc.edu/~fischer/cs703.html Charles N. Fischer 5397 Computer Sciences Telephone:
More informationCOMPUTER ORGANIZATION & ARCHITECTURE
COMPUTER ORGANIZATION & ARCHITECTURE Instructions Sets Architecture Lesson 5a 1 What are Instruction Sets The complete collection of instructions that are understood by a CPU Can be considered as a functional
More informationThe CPU IA-64: Key Features to Improve Performance
The CPU IA-64: Key Features to Improve Performance Ricardo Freitas Departamento de Informática, Universidade do Minho 4710-057 Braga, Portugal freitas@fam.ulusiada.pt Abstract. The IA-64 architecture is
More informationCS 6354: Static Scheduling / Branch Prediction. 12 September 2016
1 CS 6354: Static Scheduling / Branch Prediction 12 September 2016 On out-of-order RDTSC 2 Because of pipelines, etc.: RDTSC can actually take its time measurement after it starts Earlier instructions
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationOptimization Prof. James L. Frankel Harvard University
Optimization Prof. James L. Frankel Harvard University Version of 4:24 PM 1-May-2018 Copyright 2018, 2016, 2015 James L. Frankel. All rights reserved. Reasons to Optimize Reduce execution time Reduce memory
More informationHiroaki Kobayashi 12/21/2004
Hiroaki Kobayashi 12/21/2004 1 Loop Unrolling Static Branch Prediction Static Multiple Issue: The VLIW Approach Software Pipelining Global Code Scheduling Trace Scheduling Superblock Scheduling Conditional
More informationROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define
More informationSpring 2 Spring Loop Optimizations
Spring 2010 Loop Optimizations Instruction Scheduling 5 Outline Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler Induction Variable Recognition
More informationLast time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls
CS 6354: Branch Prediction (con t) / Multiple Issue 14 September 2016 Last time: scheduling to avoid stalls 1 Last time: forwarding/stalls add $a0, $a2, $a3 ; zero or more instructions sub $t0, $a0, $a1
More informationComputer Science 160 Translation of Programming Languages
Computer Science 160 Translation of Programming Languages Instructor: Christopher Kruegel Code Optimization Code Optimization What should we optimize? improve running time decrease space requirements decrease
More informationCS 614 COMPUTER ARCHITECTURE II FALL 2005
CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix
More informationLecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )
Lecture 7: Static ILP, Branch prediction Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections 2.2-2.6) 1 Predication A branch within a loop can be problematic to schedule Control
More informationARM Assembly Programming II
ARM Assembly Programming II Computer Organization and Assembly Languages Yung-Yu Chuang 2007/11/26 with slides by Peng-Sheng Chen GNU compiler and binutils HAM uses GNU compiler and binutils gcc: GNU C
More informationInstruction-Level Parallelism (ILP)
Instruction Level Parallelism Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance 2 approaches to exploit ILP: 1. Rely on hardware to help discover and exploit
More information