IA-64 Compiler Technology
1 IA-64 Compiler Technology David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav (speaker), Carole Dulong Microcomputer Software Lab Page-1
2 Introduction. IA-32 compiler optimizations: Profile Guidance (PGOPTI); Inter-procedural Optimizer (IPO); machine-independent optimizations; Vectorizer; machine-dependent optimizations (strength reduction, code scheduling). IA-64 compilers are focused on optimizations increasing instruction-level parallelism.
3 Agenda: Compiler optimizations designed to achieve the best performance on IA-64. Overview of Intel's IA-64 compiler; machine-independent optimizations; IA-64 code samples; IA-64-specific optimizations. IA-64 compiler technology is ready, and exploits the architecture.
4 Overview of Intel's IA-64 compiler. Compiler architecture: C++ front end; FORTRAN 90 front end; Profile Guidance (PGOPTI); Inter-procedural Optimizer (IPO); High-Level Optimizer (HLO); Global Optimizer (IL0); Code Generator (ECG).
5 Overview of Intel's IA-64 compiler. Global Optimizer (IL0): loop unrolling; register variable detection; constant propagation; convert optimization; strength reduction; copy propagation; re-association; redundancy elimination; dead store elimination; dead code elimination.
6 Overview of Intel's IA-64 compiler. Code Generator (ECG): machine instruction lowering; software pipelining (rotating registers, loop branches); predication (parallel compares); global scheduling (control and data speculation, multi-way branches); block ordering (branch hints); global register allocation (register stack, ALAT, UNAT); function splitting (I-cache and TLB locality).
7 Agenda: Compiler optimizations designed to achieve the best performance on IA-64. Overview of Intel's IA-64 compiler; machine-independent optimizations; IA-64 code samples; IA-64-specific optimizations.
8 Machine-independent optimizations: profile-guided optimizations (PGOPTI); inter-procedural optimizations (IPO). The larger the compilation scope, the better the performance.
9 Machine-independent optimizations: profile-guided optimizations (PGOPTI). Execution profiling (profile feedback): compile once with counters inserted; run the profiled program with a sample input set; compile again, using the counter values. Its effects: branch probabilities guide code-generator regions and predication regions; execution frequencies guide procedure inlining/partial inlining, code placement, and speculation heuristics. Profile information is key to getting the best out of predication and speculation techniques.
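The compile/run/recompile cycle described above maps directly onto the profile-feedback switches of present-day compilers. A minimal sketch, assuming GCC's `-fprofile-generate`/`-fprofile-use` flags rather than the Intel compiler's own PGOPTI switches (`app.c` and `sample_input.txt` are illustrative names):

```shell
# Pass 1: instrument -- the compiler inserts execution counters
gcc -O2 -fprofile-generate app.c -o app

# Training run: counter values are written out (as .gcda files)
./app sample_input.txt

# Pass 2: recompile using the counter values to guide inlining,
# code placement, and branch-probability decisions
gcc -O2 -fprofile-use app.c -o app
```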
10 Machine-independent optimizations: inter-procedural optimizations (IPO). Procedure integration: inlining copies a function body into a call site; partial inlining copies the hot portion of a function into a call site, and the remainder becomes a splinter function; cloning specializes a function to a specific class of call sites. Its effects: exact inter-procedural information for a specific call site; a larger code region to schedule; optimization spreads across function boundaries. A larger scope enables better scheduling and register allocation.
11 Machine-independent optimizations: IPO procedure integration, inlining. BEFORE: void func1() { int i; for (i=0; ) func2(i); } void func2(int x) { a[x] = 1.0; } AFTER inlining: void func1() { int i; for (i=0; ) a[i] = 1.0; } Eliminates function call overhead.
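The slide's pair, made runnable; the array `a`, its element type, and the loop bound `N` are assumptions filled in for illustration (the slide elides them):

```c
#include <stddef.h>

#define N 8
static double a[N];

/* BEFORE: func1 pays one function call per iteration */
static void func2(int x) { a[x] = 1.0; }

static void func1_before(void) {
    for (int i = 0; i < N; i++)
        func2(i);
}

/* AFTER inlining: the callee's body replaces the call site,
   eliminating the function call overhead */
static void func1_after(void) {
    for (int i = 0; i < N; i++)
        a[i] = 1.0;
}
```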
12 Machine-independent optimizations: IPO procedure integration, partial inlining. BEFORE: void func(tree *p) { func2(p); } void func2(tree *p) { if (p->left == NULL) return; ... printf( ); } AFTER partial inlining: void func(tree *p) { if (p->left == NULL) return; else { splinter(p); } } void splinter(tree *p) { ... printf( ); } Avoids unnecessary function calls.
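A runnable sketch of the AFTER shape; the `tree` type's layout, the printf payload, and the call counter are illustrative assumptions, not from the slide:

```c
#include <stdio.h>

/* Hypothetical tree node -- only the field the hot test needs */
typedef struct tree { struct tree *left; } tree;

static int splinter_calls = 0;

/* The cold remainder of func2, split off as the splinter function */
static void splinter(tree *p) {
    splinter_calls++;
    printf("processing node %p\n", (void *)p);
}

/* AFTER partial inlining: the hot early-return test is inlined into
   the caller, so the common case makes no call at all */
static void func(tree *p) {
    if (p->left == NULL)
        return;              /* hot path: no call */
    splinter(p);             /* cold path only */
}
```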
13 Machine-independent optimizations: IPO procedure integration, cloning. BEFORE: void func1() { func2_0(1, n); func2_0(1, m); func2(i, m); } -- wait, before cloning: void func1() { func2(1, n); func2(1, m); func2(i, m); } void func2(int i, int j) { if (i == 0) /* do something */ else /* do something else */ } AFTER cloning: void func1() { func2_0(1, n); func2_0(1, m); func2(i, m); } void func2_0(int i, int j) { /* do something */ } Eliminates branches.
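A runnable sketch of cloning; the bodies of the two arms are assumptions (the slide only shows comments), so each arm returns a distinct value to make the specialization visible:

```c
/* General version: must test i on every call */
static int func2(int i, int j) {
    if (i == 0)
        return j;         /* "do something" */
    else
        return i + j;     /* "do something else" */
}

/* Clone specialized for the call sites that always pass i == 1:
   the i == 0 test and its branch are resolved at compile time */
static int func2_0(int i, int j) {
    (void)i;              /* known to be 1 at these call sites */
    return 1 + j;
}
```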
14 Machine-independent optimizations: IPO inter-procedural analysis: alias propagation; mod/ref propagation; constant propagation. Its effects: alias propagation enables indirect-call conversion, better disambiguation, and better scheduling and speculation; mod/ref propagation improves registerization; constant propagation makes constant parameters and globals explicit and removes unnecessary conditional code. IPO improves many classical scalar optimizations.
15 Agenda: Compiler optimizations designed to achieve the best performance on IA-64. Overview of Intel's IA-64 compiler; machine-independent optimizations; IA-64 code samples; IA-64-specific optimizations.
16 IA-64 code samples: strength reduction; post-increments. Techniques to leverage IA-64 architecture features.
17 IA-64 code samples: strength reduction. When it applies: one operand is a loop invariant; the other operand is an induction variable. Its effects: multiplication is replaced by addition; better chances for post-increment creation. Strength reduction is key for optimization of loops.
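The same transformation sketched at source level; the array contents, stride, and function names are illustrative:

```c
/* BEFORE: each iteration multiplies the induction variable by a
   loop-invariant stride to form the element index */
static long sum_strided(const long *base, long stride, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += base[i * stride];
    return s;
}

/* AFTER strength reduction: the multiply becomes a running add --
   a pointer that advances by the stride each iteration */
static long sum_strided_sr(const long *base, long stride, int n) {
    long s = 0;
    const long *p = base;          /* tracks base + i*stride */
    for (int i = 0; i < n; i++) {
        s += *p;
        p += stride;
    }
    return s;
}
```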
18 IA-64 code samples: post-increment loads and stores. When it applies: the memory address is modified after the memory reference. Its effects: the addition proceeds in parallel with the load/store; no extra instructions are required for the induction-variable update. Post-increment increases parallelism.
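At source level the analogue is pointer post-increment: one expression both loads and advances the address, mirroring IA-64's ld f34 = [r36], r35 form (the function itself is an illustrative sketch):

```c
/* Sum an array using post-increment addressing: the load and the
   induction-variable update are a single operation */
static long sum_postinc(const long *p, int n) {
    long s = 0;
    while (n-- > 0)
        s += *p++;     /* load *p, then advance p */
    return s;
}
```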
19 IA-64 code samples: integer multiplies in loops (cont.). SOURCE: FUNCTION FUNC(A,B,N) REAL A(N,N), B(N,N) DO I = 1, N A(1,I) = B(1,I) RETURN END. Addressing: address of B(1,I) = base of B + size of element * (I-1). Naive translation, ASSEMBLY: r35 = base of B; loop: setf.sig f32 = r32; setf.sig f33 = r33; xma f34 = f32, f33; getf.sig r34 = f34; add r36 = r34, r35; ld f34 = [r36]; br ...
20 IA-64 code samples: strength reduction. BEFORE: r35 = base of B; loop: setf.sig f32 = r32; setf.sig f33 = r33; xma f34 = f32, f33; getf.sig r34 = f34; add r36 = r34, r35; ld f34 = [r36]; br ... AFTER strength reduction: r36 = base of B; r35 = stride of B; loop: ld f34 = [r36]; add r36 = r36, r35; br ... Strength reduction replaces multiplication with additions.
21 IA-64 code samples: strength reduction + post-increment. BEFORE: r36 = base of B; r35 = stride of B; loop: ld f34 = [r36] (cycle 0); add r36 = r36, r35 (cycle 1); br ... AFTER post-increment: r36 = base of B; r35 = stride of B; loop: ld f34 = [r36], r35; br ... Post-increment replaces the addition.
22 Agenda: Compiler optimizations designed to achieve the best performance on IA-64. Overview of Intel's IA-64 compiler; machine-independent optimizations; IA-64 code samples; machine-specific optimizations.
23 Machine-specific optimizations: predication; software pipelining.
24 Machine-specific optimizations: predication. BEFORE (branching): if (A < B) then S += A else S += B; *p = S. AFTER (predicated): p1,p2 = cmp(a < b); (p1) S += A; (p2) S += B; *p = S. Objective: eliminate hard-to-predict branches; increase parallelism; reduce the critical path. A side effect: reduced register usage (as compared to speculation). Eliminates branches, increases parallelism.
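If-conversion can be imitated at source level with a predicate pair. A sketch -- the mask arithmetic stands in for IA-64 predicate registers, which the compiler would use directly, and the function names are illustrative:

```c
#include <stdint.h>

/* BEFORE: a data-dependent, hard-to-predict branch */
static int64_t step_branchy(int64_t s, int64_t a, int64_t b) {
    if (a < b)
        s += a;
    else
        s += b;
    return s;
}

/* AFTER if-conversion: one compare produces the predicate pair
   p1/p2, and both arms execute under their predicates --
   (p1) S += A; (p2) S += B -- with no branch on the critical path */
static int64_t step_predicated(int64_t s, int64_t a, int64_t b) {
    int64_t p1 = (a < b);
    int64_t p2 = !p1;
    return s + p1 * a + p2 * b;
}
```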
25 Machine-specific optimizations: predication with parallel compare. Source: if (a > b && b > c) then Y else X. Code: p1 = 0; cmp.le.or p1,p0 = a,b; cmp.le.or p1,p0 = b,c; (p1) br X; Y. Objective: reduce control height; reduce the number of branches. Parallel compare reduces control height.
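The control-height reduction sketched at source level; the single OR-combined predicate stands in for the two cmp.le.or instructions (function names are illustrative):

```c
/* BEFORE: two dependent tests -- a control height of two branches */
static int take_y_nested(int a, int b, int c) {
    if (a > b)
        if (b > c)
            return 1;    /* Y */
    return 0;            /* X */
}

/* AFTER: both compares evaluate independently and OR into one
   predicate, as the cmp.le.or pair does on IA-64 */
static int take_y_parallel(int a, int b, int c) {
    int p1 = (a <= b) | (b <= c);   /* p1 set => branch to X */
    return !p1;                     /* otherwise fall into Y */
}
```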
26 Machine-specific optimizations: multi-way branches with predication. Source: if (a > b && b > c) then Y else X. Code: cmp.le p1,p0 = a,b; cmp.gt p2,p0 = b,c; (p1) br X; (p2) br Y. Use multi-way branches; speculate the compare (i.e., move it above the branch). This avoids predicate initialization, allows more than one branch in a single cycle, and allows n-way branching. Predication and multi-way branches increase ILP.
27 Machine-specific optimizations: software pipelining, loop example: convert a string to uppercase. for (i=0; i<len; i++) { if (IS_LOWERCASE(line[i])) newline[i] = CNVT_TO_UPPERCASE(line[i]); else newline[i] = line[i]; } After macro expansion: for (i=0; i<len; i++) { if (line[i] >= 'a' && line[i] <= 'z') newline[i] = line[i] - 32; else newline[i] = line[i]; } A typical integer-type loop.
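The macro-expanded loop, made runnable as a function:

```c
/* Convert ASCII lowercase letters to uppercase, as in the slide's
   macro-expanded loop; other characters are copied unchanged */
static void to_upper(const char *line, char *newline, int len) {
    for (int i = 0; i < len; i++) {
        if (line[i] >= 'a' && line[i] <= 'z')
            newline[i] = line[i] - 32;   /* 'a'..'z' -> 'A'..'Z' */
        else
            newline[i] = line[i];
    }
}
```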
28 Machine-specific optimizations: software pipelining. Overlapping execution of different loop iterations, vs. computing a whole loop iteration in one cycle. Overlap iterations for parallelism.
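A two-stage source-level imitation of the idea; the real transformation is done by the code generator on machine instructions, so this only shows iterations overlapping, with an explicit prologue and epilogue (function names and the +1 payload are illustrative):

```c
/* BEFORE: each iteration's load and store run back to back */
static void add1_copy(const int *src, int *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* AFTER 2-stage software pipelining: while iteration i stores,
   the load for iteration i+1 is already in flight */
static void add1_copy_swp(const int *src, int *dst, int n) {
    if (n <= 0)
        return;
    int next = src[0];                 /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
        int cur = next;
        next = src[i + 1];             /* stage 1 of iteration i+1 */
        dst[i] = cur + 1;              /* stage 2 of iteration i */
    }
    dst[n - 1] = next + 1;             /* epilogue: drain last store */
}
```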
29 Machine-specific optimizations: software pipelining, stage diagram. Cycle 1: ld (pipe filling); cycle 2: ld, cmps; cycle 3: ld, cmps, sub; cycle 4: ld, cmps, sub, st (kernel: all stages active); then cmps, sub, st drain the pipe.
30 Machine-specific optimizations: software pipelining, introducing rotating registers. GRs and FRs can rotate, with a separate rotating register base (RRB) for each. Loop branches decrement all register rotating bases. Instructions contain a virtual register number; RRB + virtual register number = physical register number. Rotating registers make data transfer between stages transparent.
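The name mapping can be modeled in a few lines. The slide states only "RRB + virtual register number = physical register number"; the rotating-region size (96) and its base (r32) below are illustrative assumptions, not the slide's figures:

```c
/* Virtual-to-physical mapping for a rotating register file:
   physical = virtual + RRB, wrapping within the rotating region */
#define ROT_BASE 32   /* assumed first rotating register */
#define ROT_SIZE 96   /* assumed size of the rotating region */

static int phys_reg(int virt, int rrb) {
    int off = virt - ROT_BASE + rrb;
    off = ((off % ROT_SIZE) + ROT_SIZE) % ROT_SIZE;  /* wrap */
    return ROT_BASE + off;
}
```

After one loop branch (RRB goes from 0 to -1), the value an instruction wrote as r34 is read back as r35, because both names now map to the same physical register.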
31 Machine-specific optimizations: software pipelining, a pipelined loop over the string "G o B l u e !". Kernel code: loop: ld r34 = [ra], 1; cmp p1 = true; cmp.and p1 = (r35 > 96); cmp.and p1 = (r35 < 123); (p1) sub r36 = r36, 32; st [rb] = r37, 1; br.ctop loop.
32-35 Fill the pipe: each prologue iteration executes only the stages that already have data; the loop branch then decrements lc and rotates the registers by decrementing the RRB (RRB = 0, -1, -2, ...), so the character loaded into virtual r34 reappears as r35 in the compare stage, r36 in the subtract stage, and r37 in the store stage of later iterations.
36-38 Execute the kernel: once the pipe is full, a whole iteration completes per cycle, with the ld, the compares, the sub, and the st of four different iterations executing together.
39 Machine-specific optimizations: software pipelining, IA-64 features that make this possible: full predication; special branch-handling features; register rotation, which removes loop-copy overhead. Traditional architectures use loop unrolling instead, with high overhead: extra code for the loop body. Software pipelining is especially useful for integer code with a small number of loop iterations.
40 Summary. IA-64 compilers use existing compiler technologies: inter-procedural optimizations to increase the compilation scope, and profile-guided optimizations to make efficient use of predication and speculation. Compilers take advantage of the IA-64 architecture with new compiler technology: predication to remove branches and increase parallelism, and architecture features that allow significant speedup in software-pipelined loops.
41 IA-64 Compiler Status, by OS: Win64 (Microsoft), Monterey64 (IBM/SCO), IA-64 Linux, Modesto (Novell), Solaris (Sun). C/C++: MS, IBM / IBM, EPC / Cygnus, EPC / Sun / SGI, EPC. FORTRAN: EPC, Compaq / EPC / PGI, EPC / N/A / PGI. COBOL: Fujitsu / IBM / N/A / Fujitsu. JAVA: IBM / IBM / Novell / Sun. Early access versions planned for Q1 '00; engaging with additional tools vendors in Q3 '99; production compilers in sync with Itanium system availability.
42 Call to Action. IA-64 compiler technology is ready and able; get started on IA-64 today. Review the IA-64 track software sessions at intel.com/design/ia64/devinfo.htm. Getting Software Ready for IA-64: learn how to get the best performance from your apps; learn what you can do to get started today. OSV breakout sessions: get the latest OS status from the vendors themselves; learn more details on OS tools availability and optimizations.
More informationCS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines Assigned April 7 Problem Set #5 Due April 21 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationCOSC 6385 Computer Architecture - Review for the 2 nd Quiz
COSC 6385 Computer Architecture - Review for the 2 nd Quiz Fall 2006 Covered topic area End of section 3 Multiple issue Speculative execution Limitations of hardware ILP Section 4 Vector Processors (Appendix
More informationSimone Campanoni Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple
More informationWhat Compilers Can and Cannot Do. Saman Amarasinghe Fall 2009
What Compilers Can and Cannot Do Saman Amarasinghe Fall 009 Optimization Continuum Many examples across the compilation pipeline Static Dynamic Program Compiler Linker Loader Runtime System Optimization
More informationLecture: Static ILP. Topics: loop unrolling, software pipelines (Sections C.5, 3.2)
Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2) 1 Loop Example for (i=1000; i>0; i--) x[i] = x[i] + s; Source code FPALU -> any: 3 s FPALU -> ST : 2 s IntALU -> BR :
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines
ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More informationJust-In-Time Compilers & Runtime Optimizers
COMP 412 FALL 2017 Just-In-Time Compilers & Runtime Optimizers Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved.
More informationIF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4
12 1 CMPE110 Fall 2006 A. Di Blas 110 Fall 2006 CMPE pipeline concepts Advanced ffl ILP ffl Deep pipeline ffl Static multiple issue ffl Loop unrolling ffl VLIW ffl Dynamic multiple issue Textbook Edition:
More informationECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 15 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationCS 701. Class Meets. Instructor. Teaching Assistant. Key Dates. Charles N. Fischer. Fall Tuesdays & Thursdays, 11:00 12: Engineering Hall
CS 701 Charles N. Fischer Class Meets Tuesdays & Thursdays, 11:00 12:15 2321 Engineering Hall Fall 2003 Instructor http://www.cs.wisc.edu/~fischer/cs703.html Charles N. Fischer 5397 Computer Sciences Telephone:
More informationCOMPUTER ORGANIZATION & ARCHITECTURE
COMPUTER ORGANIZATION & ARCHITECTURE Instructions Sets Architecture Lesson 5a 1 What are Instruction Sets The complete collection of instructions that are understood by a CPU Can be considered as a functional
More informationThe CPU IA-64: Key Features to Improve Performance
The CPU IA-64: Key Features to Improve Performance Ricardo Freitas Departamento de Informática, Universidade do Minho 4710-057 Braga, Portugal freitas@fam.ulusiada.pt Abstract. The IA-64 architecture is
More informationCS 6354: Static Scheduling / Branch Prediction. 12 September 2016
1 CS 6354: Static Scheduling / Branch Prediction 12 September 2016 On out-of-order RDTSC 2 Because of pipelines, etc.: RDTSC can actually take its time measurement after it starts Earlier instructions
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationOptimization Prof. James L. Frankel Harvard University
Optimization Prof. James L. Frankel Harvard University Version of 4:24 PM 1-May-2018 Copyright 2018, 2016, 2015 James L. Frankel. All rights reserved. Reasons to Optimize Reduce execution time Reduce memory
More informationHiroaki Kobayashi 12/21/2004
Hiroaki Kobayashi 12/21/2004 1 Loop Unrolling Static Branch Prediction Static Multiple Issue: The VLIW Approach Software Pipelining Global Code Scheduling Trace Scheduling Superblock Scheduling Conditional
More informationROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define
More informationSpring 2 Spring Loop Optimizations
Spring 2010 Loop Optimizations Instruction Scheduling 5 Outline Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler Induction Variable Recognition
More informationLast time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls
CS 6354: Branch Prediction (con t) / Multiple Issue 14 September 2016 Last time: scheduling to avoid stalls 1 Last time: forwarding/stalls add $a0, $a2, $a3 ; zero or more instructions sub $t0, $a0, $a1
More informationComputer Science 160 Translation of Programming Languages
Computer Science 160 Translation of Programming Languages Instructor: Christopher Kruegel Code Optimization Code Optimization What should we optimize? improve running time decrease space requirements decrease
More informationCS 614 COMPUTER ARCHITECTURE II FALL 2005
CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix
More informationLecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )
Lecture 7: Static ILP, Branch prediction Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections 2.2-2.6) 1 Predication A branch within a loop can be problematic to schedule Control
More informationARM Assembly Programming II
ARM Assembly Programming II Computer Organization and Assembly Languages Yung-Yu Chuang 2007/11/26 with slides by Peng-Sheng Chen GNU compiler and binutils HAM uses GNU compiler and binutils gcc: GNU C
More informationInstruction-Level Parallelism (ILP)
Instruction Level Parallelism Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance 2 approaches to exploit ILP: 1. Rely on hardware to help discover and exploit
More information