The Use of Traces in Optimization


The Use of Traces in Optimization

by

Borys Jan Bradel

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 2004

© Copyright by Borys Jan Bradel 2004

ABSTRACT

We build trace collection systems for Jupiter and the Jikes Research Virtual Machine. We use these systems to create traces based on the execution of the SPECjvm98 and Java Grande benchmarks. We characterize the traces and show that they contain the most frequently executed instructions, that traces are a compact representation of a program, and that traces are a good means of predicting the control flow of a program. Furthermore, we evaluate the use of traces for inlining. We execute the benchmarks using Jikes while providing inlining information based on previously collected traces. We find that the use of traces leads to a 10% lower execution time compared to providing similar information from Jikes's adaptive system from a previous execution. This increase in performance, however, has an associated code expansion of 47%. Our work indicates that traces are beneficial for a single optimization and may also be beneficial for general optimizations.

ACKNOWLEDGEMENTS

I am grateful to everyone who has made this thesis possible. First, I would like to thank my supervisor, Professor Tarek S. Abdelrahman. He has provided guidance and help throughout this research. I would like to thank Henry Jo for proofreading this thesis. His insights and persistence have improved this thesis greatly. I would like to thank Patrick Doyle and the creators of the Jikes Research Virtual Machine for creating the Java Virtual Machines that I used. I would like to thank Carlos Cavanna and Patrick Doyle for helping me set up Jupiter and answering all of the questions that I had. I would like to thank Tomasz Czajkowski for his suggestions regarding my presentation in defence of this thesis. I would also like to thank all of my family and friends who have provided support and encouragement while I was working on this thesis. Lastly, I would like to acknowledge the financial support provided by the Natural Sciences and Engineering Research Council of Canada.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTERS

1. Introduction
   1.1 Thesis Overview
   1.2 Thesis Organization
2. Background
   2.1 Program Model
   2.2 Control Flow Graphs
   2.3 Traces
   2.4 Trace Collection
       2.4.1 Trace Collection Example
       2.4.2 Mispredicting Returns
   2.5 Optimization using Traces
   2.6 Jupiter
   2.7 Jikes
       2.7.1 The Adaptive System
       2.7.2 The Optimization Test Harness
       2.7.3 Inlining in Jikes
       2.7.4 Inlining Oracles
   2.8 Related Work
       2.8.1 Static Trace Scheduling
       2.8.2 Path Profiling
       2.8.3 Hardware Trace Systems
       2.8.4 Software Trace Systems
       2.8.5 Feedback Directed Systems in Java
       2.8.6 Java Trace Systems
3. Trace Collection
   3.1 Trace Collection within Jupiter
       3.1.1 Basic Block Identification
       3.1.2 Profile Information
   3.2 Trace Collection within Jikes
       3.2.1 Control Flow Information
       3.2.2 Basic Block Identification
       3.2.3 Trace Formation
4. Trace Characterization
   4.1 Number of Traces
   4.2 Static Trace Length
   4.3 Dynamic Trace Length
   4.4 Static Program Coverage
   4.5 Dynamic Program Coverage
   4.6 Method Coverage
   4.7 Exit Behaviour
   4.8 Exit Predictability
   4.9 Trace Cache Execution
   4.10 Trace Execution Patterns
5. Inlining
   5.1 Benefits of Inlining
   5.2 Approaches to Inlining
   5.3 Traces and Inlining
   5.4 Candidate Method Selection
6. Results
   6.1 Methodology
       6.1.1 Experimental Platform (Jupiter, Jikes, Benchmarks)
       6.1.2 Comparison of the Adaptive System and the Optimization Test Harness
   6.2 Trace Characterization Results
       6.2.1 Number of Traces
       6.2.2 Static Trace Length
       6.2.3 Dynamic Trace Length
       6.2.4 Static Program Coverage
       6.2.5 Dynamic Program Coverage
       6.2.6 Method Coverage
       6.2.7 Exit Behaviour
       6.2.8 Exit Predictability
       6.2.9 Trace Cache Execution
       6.2.10 Trace Execution Patterns
   6.3 Alternate Trace Collection Parameters
   6.4 Runtime Performance Results
       6.4.1 Inlining with Traces from Jupiter
       6.4.2 Inlining with Traces from Jikes
       6.4.3 Effects of Inline Sequences
       6.4.4 Effects of Compilation Queue Filling
       6.4.5 Ahead of Time Compilation
       6.4.6 Details of the Provided Inline Information
       6.4.7 Trace Collection Overhead
7. Conclusion
   7.1 Future Work

APPENDICES

A. Trace Characterization for Alternate Parameters
B. Inlining Performance Data

BIBLIOGRAPHY

LIST OF TABLES

6.1 Number of traces
6.2 Average static lengths in instructions
6.3 Dynamic trace lengths
6.4 Static program coverage
6.5 Dynamic program coverage
6.6 Method coverage
6.7 Trace exits
6.8 Biases
6.9 Bias differentiation
6.10 Improved biases
6.11 Execution lives
6.12 Several characterizations for alternate traces
6.13 Total number of methods to optimize
6.14 Number of methods that are part of the benchmarks
6.15 Number of methods that are part of the Java class library
6.16 Number of inlining requests
6.17 Time spent in the main and organizer threads
A.1 Number of traces
A.2 Average static lengths in instructions
A.3 Dynamic trace lengths
A.4 Static program coverage
A.5 Dynamic program coverage
A.6 Method coverage
A.7 Trace exits
A.8 Biases
A.9 Bias differentiation
A.10 Improved biases
A.11 Execution lives
B.1 Main thread's execution time in the adaptive system
B.2 Compile time in the adaptive system
B.3 Machine code in kilobytes in the adaptive system
B.4 Benchmark execution time in the optimization test harness
B.5 User time in the optimization test harness
B.6 LIR instructions generated in the optimization test harness

LIST OF FIGURES

1.1 Two types of feedback directed systems
2.1 Source code of a simple loop
2.2 Control flow graph for the simple loop
2.3 Instruction sequence of a simple program
2.4 Traces mapped onto the control flow graph
2.5 Example of interaction between a JVM and a trace collection system
2.6 Source code for a return misprediction example
2.7 Traces for a return misprediction example
2.8 Source code for an optimization example
2.9 Control flow graph for the optimization example
2.10 Control flow graph with a trace for the optimization example
2.11 An example of inlining using inline sequences
5.1 Execution mapped to target sequences
5.2 Potential invocation sequence identification. Invocation of method a() is common in the three cases
5.3 Example frequency graphs
5.4 Example of inline sequences
6.1 Frequency graphs for SPECjvm98 benchmarks
6.2 Frequency graphs for Java Grande benchmarks
6.3 Inlining with traces from Jupiter
6.4 Inlining with traces from Jikes
6.5 Inlining with inline sequences
6.6 Inlining with filling of compilation queue
6.7 Inlining with ahead-of-time compilation
6.8 Overhead of collecting traces in Jikes
A.1 Frequency graphs for SPECjvm98 benchmarks
A.2 Frequency graphs for Java Grande benchmarks

CHAPTER 1

Introduction

Traditional static compilation has a shortcoming in that it cannot take advantage of information available only at runtime to produce high-performance executables. Such runtime information includes the processor architecture, specific input data characteristics, and control flow patterns within the program. One way to incorporate this runtime information is to use a feedback-directed system that monitors the execution of a program and uses the information it collects to optimize the program.

Feedback-directed systems can be divided into two categories: offline and online. The overall structure of each category is depicted in Figure 1.1. Offline feedback-directed systems monitor the execution of a program and use the collected data to optimize the program after it has completed executing, so that its next execution will complete more quickly. Online feedback-directed systems, on the other hand, collect information and optimize the program while it is executing. One advantage of online systems is that the feedback loop is shorter, because the information collected is based on the executing program and is immediately applied to it. An offline system uses information gathered during previous executions of the program, which may exhibit different runtime behaviour from the execution of the optimized program due to user interaction and/or different program inputs. The information collected in an online system may therefore be more relevant and beneficial. The main disadvantage of an online system is that the system must execute simultaneously with the program. Whatever time and resources the system uses will therefore increase the execution time of the program or use up resources that could otherwise be utilized by the program. This dictates that the information cannot be analyzed as extensively in an online system, which may lead to less effective optimizations. This thesis focuses on offline feedback-directed systems.

Figure 1.1: Two types of feedback directed systems. (In an offline system, feedback collected at run time is applied by the compiler before the next run; in an online system, feedback is applied to the program while it runs.)

There have been many systems that employ online and offline feedback. Examples include the feedback-directed systems created by Arnold et al. [Mat00], Whaley [Wha01], and Suganuma, Yasue, and Nakatani [SYN02]. One aspect common to these systems is that they employ counters to collect information about which instructions and methods are frequently executed and to direct optimization. This work focuses on using traces, instead of counters, to direct program optimization.

1.1 Thesis Overview

A trace is a sequence of unique basic blocks that is executed by a program [BDB99]. (A basic block is a sequence of consecutive instructions that are executed together.) In this thesis we explore the effects of collecting runtime information in a different manner, by employing traces. We characterize traces for a collection of different programs and apply the traces to one type of optimization, inlining. Our hypothesis is that traces are a better method of representing runtime information than counters and that they can be utilized more effectively when a feedback-directed system performs optimization.

We support our hypothesis by performing two studies. The first is a characterization of traces that shows their potential benefit to optimization. Our characterization shows that compilation and optimization based on traces indeed have the potential to improve program performance. The second study employs traces to perform a specific optimization, inlining, and the results show that traces are useful in this regard.

Our work is based on two Java Virtual Machines (JVMs): Jupiter and Jikes. We have added a trace collection system to the Jupiter JVM. Our system collects traces and detailed statistics of their execution. We have also added a trace collection system to the Jikes Research Virtual Machine, referred to simply as Jikes. Furthermore, we have modified and used Jikes to analyze the effects of using traces within an offline feedback-directed Java system.

1.2 Thesis Organization

In Chapter 2 we present background information on traces, Jupiter, Jikes, and related research. In Chapter 3 we describe our trace collection architectures in Jupiter and Jikes. Chapter 4 contains a description of the trace characterizations that we employ. This is followed by Chapter 5, in which we describe several approaches to inlining. We present our results in Chapter 6. The final chapter, Chapter 7, gives concluding remarks and directions for future work.

CHAPTER 2

Background

The first part of this chapter contains a brief description of how traces can be used in compilers. We first give simple definitions of programs, control flow graphs, and traces. We then focus on how traces can be collected, and we show how traces can be beneficial to optimization. The second part of this chapter contains a description of related work. We first describe the Jupiter and Jikes virtual machines, which we use as underlying frameworks for our work. Then, in the last section of this chapter, we describe related research in the areas of feedback-directed optimization and trace-based optimization.

2.1 Program Model

A program contains a sequence of instructions to be executed by a computer. These instructions are generated from the source code of the program and are grouped into methods. Each method consists of a sequence of consecutive instructions. In this thesis, programs are written in Java and their instructions are therefore Java bytecodes, executed by a Java Virtual Machine (JVM). Each Java method consists of a sequence of bytecodes, and each bytecode has an associated index that indicates its location within the method. The computer, which in our setting is a JVM, keeps track of which instruction it is executing by using a pointer to this instruction, referred to as the instruction pointer. When the computer finishes executing an instruction, it executes the next instruction in the instruction sequence and updates the instruction pointer to point to this new instruction. This is repeated until the computer encounters an instruction that tells it to stop.

Some instructions can change the instruction pointer to point to an instruction other than the next instruction in the sequence. These instructions are called control flow instructions. The instructions that can be executed after a control flow instruction are the control flow instruction's targets.

Instructions can be grouped into basic blocks. A basic block is a maximal sequence of consecutive instructions such that execution must begin at the first instruction and the remaining instructions must then be executed [Muc97]. Usually, basic blocks end in control flow instructions. The different types of control flow instructions are: branches, jumps, switch statements, invokes, returns, and exceptions. Branches and jumps are generated from if statements and loops in the original source code. A branch or jump is backwards if its target appears earlier than the branch or jump in the instruction sequence of the enclosing method. (This definition differs from the traditional one, which states that a branch or jump is backwards if its target is visited before the branch or jump during a depth-first traversal of the instructions [Muc97].) These backward control flow instructions often close loops in the source code and can be used to identify the corresponding loops.
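Because this definition depends only on bytecode indices, backward branches are cheap to detect. The following Java sketch is a minimal illustration with hypothetical names rather than code from Jupiter or Jikes; it collects the targets of a method's backward branches, which are the loop entry points at which trace recording can later be triggered (Section 2.4).

import java.util.ArrayList;
import java.util.List;

class BackwardBranchScanner {
    // One entry per branch: the bytecode index of the branch instruction
    // and the bytecode index of its target.
    record Branch(int index, int target) {}

    // A branch or jump is backwards if its target appears earlier in the
    // method's instruction sequence than the branch itself. The targets
    // of such branches are candidate trace starting points.
    static List<Integer> backwardTargets(List<Branch> branches) {
        List<Integer> targets = new ArrayList<>();
        for (Branch b : branches)
            if (b.target() < b.index())
                targets.add(b.target());
        return targets;
    }
}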

2.2 Control Flow Graphs

Control flow graphs (CFGs) are a commonly used representation of programs. A control flow graph is a directed graph in which each node represents a basic block [ASU86]. An edge from a basic block A to a basic block B exists if it is possible for the first instruction in basic block B to be executed immediately after the last instruction in basic block A. For simplicity of presentation, and without loss of generality, we ignore certain programming constructs, such as exceptions, that complicate control flow analyses and graphs. Figure 2.2 contains the control flow graph for the method shown in Figure 2.1; each node in the CFG is a basic block from the example method. We say that execution flows through a basic block, or that the basic block is executed, if the instructions in that basic block are executed. We use the same terminology when describing the execution of instructions that are on a trace.

public static int foo() {
  int a=0;
  for (int i=0;i<5;i++)
    a+=i;
  return a;
}

Figure 2.1: Source code of a simple loop.

B0: a=0; i=0; goto B2
B1: a+=i; i++
B2: if (i<5) goto B1
B3: return a

Figure 2.2: Control flow graph for the simple loop. (Edges: B0 to B2, B2 to B1, B1 to B2, and B2 to B3.)
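A CFG of this kind needs very little machinery to represent. The following Java sketch is illustrative only (the class name and string block identifiers are assumptions, not part of any of the systems described later); it stores the successor relation defined above.

import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ControlFlowGraph {
    // successors.get(A) holds every block B whose first instruction can
    // execute immediately after A's last instruction.
    private final Map<String, Set<String>> successors = new HashMap<>();

    void addEdge(String from, String to) {
        successors.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    Set<String> successorsOf(String block) {
        return successors.getOrDefault(block, Set.of());
    }
}

The graph of Figure 2.2, for example, is built from the four edges listed in its caption; a trace, defined next, is then simply a path in this graph that visits each block at most once.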

2.3 Traces

A trace is a sequence of n unique basic blocks (b1, b2, ..., bn) such that the basic blocks b1, b2, ..., bn are executed in sequential order during the execution of a program [BDB99]. Block b1 is called the start of the trace and bn is the end of the trace. The trace may contain any basic blocks of the program as long as the sequence corresponds to a path in the control flow graph. This is just one of several different definitions of traces [Fis81]; we use it because it expresses the traces that we collect more precisely. The sequence of basic blocks that corresponds to our execution of the program in Figure 2.1 is shown in Figure 2.3. Three potential traces are (B0,B2,B1), (B1,B2,B3), and (B1,B2). Note that the sequence (B1,B2,B1) is not a valid trace because B1 appears twice, and the sequence therefore does not consist of unique basic blocks. The three traces mapped onto the control flow graph are shown in Figure 2.4.

A trace is executed when each of its basic blocks is executed in sequence. The execution of the trace stops when the executed sequence diverges from the sequence of basic blocks on the trace. We refer to this point of divergence as a trace exit. There are two reasons for the two sequences to diverge. The first is that all the basic blocks in the trace have been executed; we refer to an exit that occurs in this case as a normal trace exit. The second is that the actual execution differs from the trace: several consecutive basic blocks on the trace are executed, but the next basic block that is executed is not the next basic block on the trace. We refer to such a trace exit as an early trace exit.

Normal trace exits can be further divided into regular and self-loop exits [Ber03]. A self-loop exit is an exit such that the instruction executed after the trace is the first instruction of the same trace. Regular exits are normal trace exits that are not self-loop exits. Categorizing trace exits in this manner is useful when reasoning about traces and their use in optimization. In particular, optimizations that create a large amount of cleanup code to execute on early trace exits can result in poor program performance. A large number of self-loop exits indicates that loops repeatedly executed the same path, while a low number indicates that either the loops did not take the same path repeatedly or the path that was taken does not correspond exactly to the initially recorded trace for that loop.
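This exit taxonomy can be stated compactly in code. The following Java sketch, with hypothetical names and no connection to our actual trace collection systems, compares an executed block sequence that begins at a trace's head against the trace and classifies the resulting exit.

import java.util.List;

class TraceExitClassifier {
    enum Exit { EARLY, REGULAR, SELF_LOOP }

    // 'executed' is the sequence of blocks observed at run time, starting
    // at the trace's first block; 'trace' is the recorded trace.
    static Exit classify(List<String> trace, List<String> executed) {
        for (int i = 0; i < trace.size(); i++) {
            if (i >= executed.size() || !executed.get(i).equals(trace.get(i)))
                return Exit.EARLY;      // execution diverged from the trace
        }
        // Every block on the trace was executed: a normal exit. If the
        // next executed block is the trace head again, the exit is a
        // self-loop exit; otherwise it is a regular exit.
        if (executed.size() > trace.size()
                && executed.get(trace.size()).equals(trace.get(0)))
            return Exit.SELF_LOOP;
        return Exit.REGULAR;
    }
}

For Trace 2 = (B1,B2,B3), the executed sequence B1, B2, B1 yields an early trace exit; for Trace 3 = (B1,B2), the sequence B1, B2, B1 yields a self-loop exit and B1, B2, B3 yields a regular exit.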

Figure 2.3: Instruction sequence of a simple program. (For the loop of Figure 2.1, the executed block sequence is B0, B2, then B1, B2 five times, and finally B3.)

2.4 Trace Collection

Each trace is created by starting at some basic block in the control flow graph and building a path by adding basic blocks until the desired trace is generated. This is done using a trace collection system (TCS). In this section we describe a generic such system, which operates by monitoring a program's execution, collecting information, and creating traces based on the information that it collects. There are two types of information that the system collects: profile information regarding how often certain events occur, and the traces themselves. (A JVM can also interact with native code; in this thesis we do not keep any information regarding the details of the native code's execution.) The profile information includes the number of times certain basic blocks are executed, how often methods are invoked, and how often branches are taken. By default the TCS collects only profile information.

Figure 2.4: Traces mapped onto the control flow graph. (Three copies of the CFG of Figure 2.2, highlighting Trace 1 = (B0,B2,B1), Trace 2 = (B1,B2,B3), and Trace 3 = (B1,B2).)

When certain events occur, however, the TCS records the sequence of basic blocks being executed (i.e. a trace) and continuously checks whether recording should stop. When recording stops, the trace is stored in a buffer referred to as the trace cache, and the TCS returns to its default behaviour of collecting profile information. This process is repeated until the execution of the program ends. At this point the traces in the trace cache may be saved for further analysis.

The system keeps track of when basic blocks and control flow instructions are executed. Furthermore, it keeps track of when traces are executed. When the TCS detects that the first block of a trace is executed, it notes this event as the start of that trace and keeps track of which basic blocks are executed. A trace exit occurs when all the basic blocks of the trace have been executed or when the block that is executed is not the next block in the sequence of basic blocks on the trace. When a trace exit occurs, the TCS records this event and resumes operating as before. The recorded information represents what would happen if the TCS were able to execute traces.

There are certain key events that trigger the recording of basic blocks, as well as events that stop this recording. The events that start recording occur when some counter exceeds a certain threshold. In this section we consider two events that start the recording of a trace and several different events that stop the recording.

The first event that starts trace recording is a basic block being executed immediately after a backward taken branch or jump a specific number of times (i.e. the associated counter reaches some threshold). Traces that start at the targets of backward branches and jumps will usually be frequently executed paths from the top of a loop to the end of the loop. The second event is a certain trace exit occurring a specific number of times. Traces that start at trace exits represent other frequently executed paths.

The recording of a trace stops when:

- A backward branch or jump is taken. This corresponds to the end of a loop. It ensures that a trace starts at most a single loop and that separate loops are not represented by a single trace.
- The block that is about to be recorded is the start of a different trace. Recording stops because it is assumed that the instructions that are about to execute are already on a trace and have already been optimized. This prevents code explosion and duplicate work.
- The block that is about to be recorded is already in the trace that is being recorded. This ensures that the basic blocks on a trace are unique.
- The recorded trace is too long. The length limit is arbitrary and may be selected based on hardware limitations, such as the instruction cache size.

A code sketch of a recorder that follows these rules is given at the end of the next subsection.

2.4.1 Trace Collection Example

We will now give an example demonstrating the operation of a trace collection system. Figure 2.5 shows a JVM and a TCS. The TCS is linked to a JVM that executes the program in Figure 2.1. The JVM contains the program to execute as well as storage for the program's variables; the variables' values are modified as shown when the program is executed. The lower part of the figure shows the sequence of steps that the JVM performs when executing the program. Each line represents the step taken for a single instruction. (The instruction t=(c)?a:b in the figure is equivalent to if (c) t=a; else t=b;.)

When the JVM executes a control flow instruction, it calls the TCS with information regarding which instruction is executed and what it does. (We assume that the JVM executes bytecode instructions as they appear in the program and does not perform any optimizations while the trace collection system is being called.) We represent this behaviour by showing the JVM calling a function in the TCS named notify. (This is not the only possible approach: the JVM could instead call the TCS at every instruction, or save the list of executed control flow instructions and process the list later.) The TCS keeps track of what is being executed and records profile information when this notify function is executed.

The TCS contains three components: a set of event counters that are used to determine when recording should start, a recording buffer that holds basic blocks as they are recorded, and a trace cache. The tasks that are performed when the control flow instructions of the example program are executed are shown on the left side of the TCS in the figure. The solid arrows show the effects of certain commands executed by the JVM and the TCS, while the dotted arrows indicate the recording of a trace.

In the figure, the JVM executes basic blocks B1 and B2 repeatedly, and the TCS keeps track of how often the backward branch from B2 to B1 is taken. When the backward branch has been executed often enough (we set the counter threshold to 2 in this example for demonstration purposes), the TCS starts recording a sequence of basic blocks (i.e. a trace). The system then records the execution of the program until it detects that the next basic block is already in the recorded sequence. After recording stops, the sequence of basic blocks, (B1,B2), is stored in the trace cache.

After the trace (B1,B2) is saved, the instruction i3 is executed and the system detects that i3 is in B1, which is the head of a trace. The recorded trace is then treated as executing, and the TCS keeps track of this trace's execution; this is shown by incrementing the number of times that the trace starts. As the loop is repeatedly executed, which is not shown, this trace is executed to completion several times. After the loop exits, a return is executed. Assuming that the method was called by the first instruction, i0, in main(), the target of the return is the second instruction, i1, in main(), and the appropriate counter is incremented. If this return were executed often enough, a trace would be recorded starting at the second instruction in main().

Figure 2.5: Example of interaction between a JVM and a trace collection system. (The JVM executes the code i0: a=0; i1: i=0; i2: goto i5; i3: a+=i; i4: i++; i5: if (i<5) goto i3; i6: return a, calling notify() for each control flow instruction. The TCS increments the counter for the backward taken branch at i5; when the counter reaches its threshold the TCS starts recording, stops when the next block is already in the recording buffer, and stores Trace 1, consisting of blocks B1 and B2, in the trace cache. A later notify(i6, main:i1) increments the counter for the return to main().)
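The recording discipline of this section can be summarized in code. The following Java sketch is a simplified illustration under assumed representations (blocks named by strings, a fixed threshold, and only the backward-branch start event); the names TraceRecorder, onBackwardTaken, and onBlock are hypothetical and do not correspond to the implementations in Jupiter or Jikes described in Chapter 3.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TraceRecorder {
    static final int THRESHOLD = 2;    // counter value that triggers recording
    static final int MAX_LENGTH = 64;  // arbitrary limit on trace length

    Map<String, Integer> counters = new HashMap<>(); // per-target event counters
    Set<String> traceHeads = new HashSet<>();        // first blocks of cached traces
    List<String> buffer = new ArrayList<>();         // recording buffer
    List<List<String>> traceCache = new ArrayList<>();
    boolean recording = false;

    // Called when a backward branch or jump transfers control to 'target'.
    void onBackwardTaken(String target) {
        if (recording) { stopRecording(); return; } // end of loop ends the trace
        if (counters.merge(target, 1, Integer::sum) >= THRESHOLD)
            recording = true;                        // begin recording at 'target'
    }

    // Called for every basic block that is about to execute.
    void onBlock(String block) {
        if (!recording) return;
        if (traceHeads.contains(block)          // block starts another trace
                || buffer.contains(block)       // block is already on this trace
                || buffer.size() >= MAX_LENGTH) // trace is too long
            stopRecording();
        else
            buffer.add(block);
    }

    void stopRecording() {
        recording = false;
        if (!buffer.isEmpty()) {
            traceCache.add(new ArrayList<>(buffer));
            traceHeads.add(buffer.get(0));      // future recordings stop here
        }
        buffer.clear();
    }
}

For the example of Figure 2.5, onBackwardTaken("B1") is called each time the branch at i5 is taken; on the second call recording begins, B1 and B2 are buffered, and the third taken branch stops recording and stores the trace (B1,B2).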

2.4.2 Mispredicting Returns

One observation that we make is that returns on traces can be frequently mispredicted. The problem arises because traces can start in methods that are invoked in multiple places and can extend beyond those methods. If several of these call sites are frequently executed, then it is likely that the trace expects to return to one of the call sites when the actual execution returns to another. In this case a trace exit occurs and the return is mispredicted.

This behaviour is illustrated using the example program in Figure 2.6. Figure 2.7 contains three traces of this program. The traces are generated when method a() is called and then method b() is called. Trace 1 starts in method a(), goes through one path in method c(), and returns to method a(). Trace 2 starts at the early trace exit of Trace 1 in method c() and returns to method a(). Trace 3 starts in method b(), goes through one path in method c(), and returns to method b(). Half the time that Trace 3 is started, an early trace exit to Trace 2 occurs. This always results in the misprediction of the return in Trace 2: on Trace 2 the return is to method a(), but execution must return to method b(). It is therefore possible for returns on traces to be frequently mispredicted.

void a() {
  for (int i=0;i<10000;i++)
    c(i);
}

void b() {
  for (int i=0;i<10000;i++)
    c(i);
}

void c(int i) {
  if ((i%2)==0)
    ... work 1 ...
  else
    ... work 2 ...
  return;
}

Figure 2.6: Source code for a return misprediction example.

2.5 Optimization using Traces

The quality of program optimization depends on the scope and exactness of the analysis that is performed on the program's control flow graph. The use of traces may improve the opportunities for optimization in three ways. First, traces can span multiple methods, thus facilitating inter-procedural analysis and extending the scope of analyses. Second, traces contain only the most frequently executed portions of a program and therefore can be used to optimize only frequently executed instructions, saving compilation and optimization time.

Trace 1: A1: invoke c(i); C0: if ((i%2)==0); C1: work 1; C3: return; A2: i++; A3: if (i<10000) goto A1
Trace 2: C2: work 2; C3: return; A2: i++; A3: if (i<10000) goto A1
Trace 3: B1: invoke c(i); C0: if ((i%2)==0); C1: work 1; C3: return; B2: i++; B3: if (i<10000) goto B1

Figure 2.7: Traces for a return misprediction example.

Finally, traces can be used to eliminate infrequently executed instructions from the control flow graph. The resulting control flow graph is simpler and therefore more amenable to optimization. In this case, however, because execution may go off trace, fix-up code must be added to ensure that the program's execution remains correct when this occurs.

We illustrate the impact that traces have on optimization through an example. We describe the execution of a program and a trace that is recorded. We then show how the trace affects the control flow analysis and how this can be advantageous for optimization. The program in Figure 2.8 contains an array variable, zarray, and two methods, f() and foo(). Assuming that the condition of the if statement in method f() is true one hundred times and false five times, the trace that will be collected when foo() is executed is (B1,BB0,BB1,BB3,B2,B3,B5,B6). This is shown on the program's control flow graph in Figure 2.9. This trace has the three beneficial qualities described earlier. First, the trace spans two methods. Second, the basic blocks on the trace are frequently executed. Finally, the instructions that are not frequently executed are not on the trace.

The program can be optimized based on the extra information provided by this trace. Figure 2.10 contains the control flow graph with the optimized trace linked to the remainder of the program. The trace is optimized by adding, removing, and reordering instructions. The goto statements at the ends of basic blocks BB1 and B3 are removed, since they are not necessary on the recorded trace. The second if statement that was recorded is also removed because it is redundant on the trace: on the trace, the method f() returns the value 3, which is less than 50.

There are two cases that can cause the trace to exit early. The first case occurs when the method that the program invokes is not the one on the trace. This is possible because Java's methods are virtual; that is, the method that an invoke actually calls may be different from the one specified by the instruction. An extra check must therefore be inserted to ensure that the appropriate method is executed. The second case occurs when the value of zarray[i] is greater than or equal to 5. In general it may be necessary to execute extra clean-up instructions after an early trace exit, because the trace is optimized in such a way that the execution may only be valid if the trace executes to completion.

Furthermore, the fact that the trace has a potential self-loop exit can be used to optimize the trace further. This optimization involves moving the computation of a+b to the beginning of the trace and linking the end of the trace to its second instruction instead of its first. This in effect shortens the trace when it is executed repeatedly with no early exits. This code motion is possible because the optimization considers only the straight-line execution of the trace instead of all possible paths of execution in the program.

The amount of time that it takes to optimize the trace should be less than the amount of time that it takes to optimize both functions. The main reason for this is that the trace's control flow is only a sequence of instructions, while the functions have more complicated control flow.

2.6 Jupiter

Jupiter is an interpreter-based JVM developed by Patrick Doyle at the University of Toronto [Doy02]. It is written in C in an object-oriented style. Jupiter is composed of many separate modules that interact with each other via simple interfaces. The modules can be divided into three parts:

- data structures used when executing Java, such as classes, methods, threads, locks, stacks, and stack frames;
- modules that manage these structures, such as a class source or memory source; and
- the execution engine module, which directs all the other modules.

int f(int i) {
  int result;
  if (zarray[i]<5)
    result=3;
  else
    result=(i*zarray[i]*zarray[i-1]+zarray[i-2]*zarray[i-3]);
  return result;
}

int foo() {
  int res=0;
  int a=3;
  int b=6;
  int c;
  int d;
  for (int i=0;i<105;i++) {
    c=f(i);
    if (c<50)
      d=a+b;
    else
      d=a+c;
    res+=d;
  }
  return res;
}

Figure 2.8: Source code for an optimization example.

B0: res=0; i=0; a=3; b=6; goto B6
B1: invoke f(i)
BB0: if (zarray[i]>=5) goto BB2
BB1: result=3; goto BB3
BB2: result=i*zarray[i]*...
BB3: return result
B2: c=returned value; if (c>=50) goto B4
B3: d=a+b; goto B5
B4: d=a+c
B5: res+=d; i++
B6: if (i<105) goto B1
B7: return res

Figure 2.9: Control flow graph for the optimization example. (Trace 1 covers B1, BB0, BB1, BB3, B2, B3, B5, and B6.)

Trace 1:
  d=a+b
  if (f(i) is wrong) goto trace_exit1
  if (zarray[i]>=5) goto trace_exit2
  res+=d
  i++
  if (i<105) goto the trace's second instruction, else goto trace_exit3

trace_exit1: cleanup code
trace_exit2: cleanup code
trace_exit3

(The trace is linked to the original control flow graph of Figure 2.9, which remains for off-trace execution.)

Figure 2.10: Control flow graph with a trace for the optimization example.

Jupiter's execution engine module is an interpreter in which each bytecode is handled by a case statement within one big switch block. The interpreter's main function contains this switch block and is passed a set of bytecodes along with information about the context in which the bytecodes should execute. The case statement for each bytecode performs the required tasks by calling the appropriate functions in various modules, including the class, object, thread, and memory source modules. Many of the frequently executed portions of the interpreter are written as macros to make the interpreter easier to manage. We have extended Jupiter with a framework that collects information about the runtime behaviour of a program and generates traces.
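The switch-based dispatch just described, together with the notification hook that such a framework relies on, can be sketched as follows. This is a schematic Java illustration with invented opcodes; Jupiter itself is written in C and its actual interpreter is considerably more involved.

class SwitchInterpreter {
    // Hypothetical opcodes for illustration only.
    static final int OP_PUSH = 0, OP_ADD = 1, OP_GOTO = 2, OP_HALT = 3;

    int[] stack = new int[64];
    int sp = 0;

    void execute(int[] code) {
        int pc = 0;
        while (true) {
            switch (code[pc]) {
                case OP_PUSH:               // operand follows the opcode
                    stack[sp++] = code[pc + 1];
                    pc += 2;
                    break;
                case OP_ADD:
                    sp--;
                    stack[sp - 1] += stack[sp];
                    pc++;
                    break;
                case OP_GOTO: {             // a control flow instruction
                    int target = code[pc + 1];
                    notifyTCS(pc, target);  // tell the trace collection system
                    pc = target;
                    break;
                }
                case OP_HALT:
                    return;
                default:
                    throw new IllegalStateException("unknown opcode " + code[pc]);
            }
        }
    }

    // Hook where a trace collection system would update counters and
    // record traces (Section 2.4); empty in this sketch.
    void notifyTCS(int pc, int target) { }
}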

2.7 Jikes

The Jikes Research Virtual Machine (RVM) is an open-source JVM developed at IBM that is designed to be used for just-in-time (JIT) compiler research [AFG+00]. Jikes is written in Java and bootstraps itself (a scaled-down version of Jikes is executed by another JVM to produce the final Jikes executable). It is designed to deliver performance that is comparable to commercial JVMs. To achieve this, it uses a compile-only strategy and employs two compilers. The first, or baseline, compiler quickly translates Java bytecodes into unoptimized native code. The second is an optimizing compiler that takes longer to generate native code, but the code it produces is much faster than that produced by the baseline compiler. The optimizing compiler is only used to compile methods that are frequently executed; this strategy is necessary because the time spent optimizing infrequently executed sections of the program would not be recovered by the increase in the speed of their execution.

Jikes consists of four major parts:

- The core runtime, which in turn consists of threads, the class loader, the hardware interface, etc. It is the portion of Jikes that loads programs from files and executes native code produced by the compilers. It also provides a mechanism for the native code to access the other parts of Jikes.
- A memory manager that is responsible for all of the memory that Jikes uses when executing a program; it is called when new memory is allocated or when garbage collection is performed.
- Baseline and optimizing compilers that take methods loaded by the runtime and turn their bytecode instructions into native code that can be executed.
- Optimizing systems that control when certain frequently executed methods are recompiled and which optimizations are performed during the recompilation.

The different optimization systems in Jikes are: a static system, an adaptive system, and an ahead-of-time system. The static system just uses the default compiler that is selected by the user. The adaptive system uses instrumentation and a second thread so that it can monitor the behaviour of the program and optimize code as the program executes. The ahead-of-time system, which is also called the optimization test harness, performs compilation before the program executes. The compilation is based on options given to Jikes by the user. The options include methods of both Jikes and the program to be compiled, as well as the optimization options that should be used when compiling. In all cases, when Jikes is invoked, parameters can be passed via the command line to affect the behaviour of the optimization system and all the other components of Jikes.

Consider how Jikes, with the adaptive system, executes a program that contains method a() in class A and method main() in class B. When Jikes starts executing this program, it loads the program's primary class, class B, and compiles the main method of the program using one of the two compilers. Once the main method and anything else that is required is compiled, the program is started in its own thread. Classes and objects are loaded and compiled when they are first used; so if main() is executed and calls method a(), then Jikes will load and compile method a(). The program's thread is suspended periodically so that Jikes can perform internal tasks and, if a multi-threaded program is executing, other threads can execute. The internal tasks include running the optimization system, which keeps track of what is being frequently executed, and compiling methods. As the program executes, the optimization system controls method compilation.

Frequently executed methods are therefore compiled and recompiled, and the newest versions are always used. Methods that are infrequently executed, on the other hand, are compiled only at the beginning, with little or no optimization.

The Jikes executable can only have one type of memory manager and one type of optimization system. The user must therefore choose between these different types of components when the Jikes executable is created, via command line parameters. The selected components are compiled along with the core runtime, the compilers, and certain parts of the Java class library that the selected components use. All of this is put into one large executable. The remainder of this section consists of a more detailed description of the operation of the optimization systems and of how the optimizing compiler performs inlining.

2.7.1 The Adaptive System

The adaptive system controls the recompilation of frequently executed methods so that they are optimized and will therefore execute faster. The adaptive system consists of a set of listeners, a set of organizer threads, a compilation directing thread, and a set of compilation threads. Listeners record information about the program's execution. Organizer threads analyze this information and generate compilation requests based on the analysis. The compilation directing thread takes the compilation requests and creates a compilation thread for each request that it services. The compilation threads perform the actual compilation by invoking the optimizing compiler.

The program that is being executed is instrumented by Jikes's compilers so that at certain time intervals the executing thread yields to another thread. While the thread is in the process of yielding, it calls the adaptive system's listeners, which record the point at which the yield occurred. When a listener determines that there is enough information to process, it wakes up its associated organizer thread. The organizer thread carries out the necessary analyses to determine what should be compiled and how it should be compiled based on the data available, and then it puts all compilation requests on a compilation queue. When the compilation directing thread executes, it goes through this queue and processes the compilation requests through the use of compilation threads.
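The flow of information from listeners through organizers to the compilation threads is essentially a producer-consumer pipeline. The following Java sketch is a schematic rendering of that structure with hypothetical names and a trivial policy; it is not Jikes's actual adaptive system, which applies a cost-benefit analysis before issuing requests.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class AdaptivePipeline {
    // A compilation request names a method and an optimization level.
    record Request(String method, int optLevel) {}

    final BlockingQueue<String> samples = new LinkedBlockingQueue<>();
    final BlockingQueue<Request> compileQueue = new LinkedBlockingQueue<>();

    // Listener: called from a yield point; records where execution was.
    void onYield(String currentMethod) {
        samples.add(currentMethod);
    }

    // Organizer thread: analyzes samples and generates compilation requests.
    final Thread organizer = new Thread(() -> {
        try {
            while (true) {
                String method = samples.take();
                compileQueue.add(new Request(method, 1)); // trivial policy
            }
        } catch (InterruptedException e) { /* shut down */ }
    });

    // Compilation directing thread: services the queue of requests.
    final Thread director = new Thread(() -> {
        try {
            while (true) {
                Request r = compileQueue.take();
                System.out.println("recompile " + r.method()
                        + " at level " + r.optLevel());
            }
        } catch (InterruptedException e) { /* shut down */ }
    });
}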

The adaptive system can have multiple listener-organizer thread pairs so that it can use multiple adaptive strategies at the same time. The adaptive system can also be configured through command line parameters to behave in various ways. These parameters specify which optimizations are enabled, how inlining is performed, and other aspects of compilation.

2.7.2 The Optimization Test Harness

Another type of optimization system within Jikes is the optimization test harness. The harness takes compiler commands as input. All of the commands are passed in as command line parameters or in a file whose name is passed in as a command line parameter. The commands consist of: parameters to the baseline and optimizing compilers, the methods to compile using the baseline compiler, the methods to compile using the optimizing compiler, which inlining plan to use, and what the main method to execute should be. The harness acts on each command one at a time, starting from the first one that it is given. Once the test harness compiles all the methods that it is supposed to, it executes the program. If the program's execution encounters a method that has not been compiled ahead of time, execution is suspended until the compilation is performed. Once the program has completely executed, the harness prints out the elapsed time between the beginning and end of the program's execution. This allows us to measure the amount of time that a program takes to execute given that everything is compiled and inlined according to a selected strategy. We can therefore compare different compilation and inlining strategies in terms of the corresponding execution times.

2.7.3 Inlining in Jikes

Inlining is the replacement of an invoke of a method with that method's instructions. When inlining is performed at an invoke, referred to as a call site, the call site is said to be inlined. Inlining enables inter-procedural analysis by turning several methods into one large method. In Jikes, inlining is performed via a set of inlining oracles, which are objects that specify whether a specific call site should be inlined or not.

The oracle can base its decision on several factors: the current call site, the target method, information about the current method and the target method, as well as an inline sequence, which is a list of call sites that specifies the location of the call site within an inlining hierarchy.

Consider the example in Figure 2.11, where method a() is being compiled. In this example a call to method b() at bytecode index 5 is already inlined, and there is a call site in method b() at bytecode index 10 to method c() for which the inlining oracle is called. An inlining oracle that bases its decision on inline sequences must decide whether the inline sequence a() 5, b() 10, c() is valid. The oracle has a list of acceptable inline sequences that it uses for this purpose. The specified sequence is valid only when one of the acceptable inline sequences is a suffix of it. Therefore, if the list of acceptable sequences contains the sequence a() 5, b() 10, c(), then the sequence a() 5, b() 10, c() is valid. Furthermore, if b() 10, c() is in the list of acceptable sequences, then a() 5, b() 10, c() is also valid. These are the only two sequences in the acceptable-sequence list that would make the sequence a() 5, b() 10, c() valid.

Figure 2.11: An example of inlining using inline sequences. (Method a() is being compiled; b() has already been inlined into a() at bytecode index 5, and the oracle must decide whether to inline c() at bytecode index 10 of b(). If one of the acceptable sequences is b() 10, c() or a() 5, b() 10, c(), then c() is inlined; otherwise it is not.)
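The suffix rule is straightforward to express in code. The following Java sketch is a minimal illustration of an inline-sequence check of the kind just described; the representation of sequences as lists of strings and the class and method names are assumptions, not Jikes's classes.

import java.util.List;

class InlineSequenceCheck {
    // An inline sequence is represented as a list such as
    // ["a():5", "b():10", "c()"]: the call sites, then the target method.
    static boolean isValid(List<String> sequence, List<List<String>> acceptable) {
        for (List<String> acc : acceptable) {
            int offset = sequence.size() - acc.size();
            // Valid when some acceptable sequence is a suffix of 'sequence'.
            if (offset >= 0 && sequence.subList(offset, sequence.size()).equals(acc))
                return true;
        }
        return false;
    }
}

For the sequence of Figure 2.11, isValid(List.of("a():5", "b():10", "c()"), acceptable) returns true exactly when acceptable contains ["b():10", "c()"] or ["a():5", "b():10", "c()"].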

34 Chapter 2. Background Inlining Oracles There are several inlining oracles. The simplest is a static inlining oracle that just looks at the source of the method and how much code expansion would be caused by inlining the method. Inlining will occur only if the expansion is below a certain threshold. This is done because it is often beneficial to inline small methods since they do not cause a large code expansion and the overhead of calling them is a substantial part of their execution. Another type of oracle bases its decision on a set of call site/method pairs that have been identified as beneficial. This oracle first calls the static oracle to determine whether a method should be inlined regardless of what the inlining plan contains. If the static inlining oracle decides that the method should be inlined then the method is inlined. Otherwise, the target method is inlined only if a call site method pair exists that corresponds to a call site and a target method that is being considered for inlining. The inline may not be performed in some cases, such as when the inline would cause the code to grow beyond memory limits imposed by Jikes. Both the adaptive and the ahead-of-time systems use inlining oracles with inline plans. In the case of the adaptive system, the inlining plan is generated as the system collects information about the program. Optionally, the system can read in an initial inline plan that can be further enhanced as the system collects information about the executing program. When the system decides it is time to compile a method, it invokes a compiler and passes to it a reference an inlining oracle with a specific inlining plan. The name of the file that contains an initial plan can be passed in as a command line parameter by the user. The ahead-of-time system, also referred to as the optimization test harness, can also be controlled via the command line to load in a certain inlining plan and to use an inlining oracle based on it. 2.8 Related Work There is a great deal of work that has been done both in the area of feedback directed optimization [Smi00] and just-in-time (JIT) compilation [Ayc01]. Although in a sense it is all related to our work, we will only focus on related work that deals with traces and related work that deals with feedback directed optimization in Java.


More information

Atropos User s manual

Atropos User s manual Atropos User s manual Jan Lönnberg 22nd November 2010 1 Introduction Atropos is a visualisation tool intended to display information relevant to understanding the behaviour of concurrent Java programs,

More information

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008 Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.

More information

Run-time Environments. Lecture 13. Prof. Alex Aiken Original Slides (Modified by Prof. Vijay Ganesh) Lecture 13

Run-time Environments. Lecture 13. Prof. Alex Aiken Original Slides (Modified by Prof. Vijay Ganesh) Lecture 13 Run-time Environments Lecture 13 by Prof. Vijay Ganesh) Lecture 13 1 What have we covered so far? We have covered the front-end phases Lexical analysis (Lexer, regular expressions,...) Parsing (CFG, Top-down,

More information

Just-In-Time Compilers & Runtime Optimizers

Just-In-Time Compilers & Runtime Optimizers COMP 412 FALL 2017 Just-In-Time Compilers & Runtime Optimizers Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved.

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained

The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices LINGLI ZHANG, CHANDRA KRINTZ University of California, Santa Barbara Java Virtual Machines (JVMs)

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

Project. there are a couple of 3 person teams. a new drop with new type checking is coming. regroup or see me or forever hold your peace

Project. there are a couple of 3 person teams. a new drop with new type checking is coming. regroup or see me or forever hold your peace Project there are a couple of 3 person teams regroup or see me or forever hold your peace a new drop with new type checking is coming using it is optional 1 Compiler Architecture source code Now we jump

More information

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #29 Arrays in C

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #29 Arrays in C Introduction to Programming in C Department of Computer Science and Engineering Lecture No. #29 Arrays in C (Refer Slide Time: 00:08) This session will learn about arrays in C. Now, what is the word array

More information

Topics on Compilers Spring Semester Christine Wagner 2011/04/13

Topics on Compilers Spring Semester Christine Wagner 2011/04/13 Topics on Compilers Spring Semester 2011 Christine Wagner 2011/04/13 Availability of multicore processors Parallelization of sequential programs for performance improvement Manual code parallelization:

More information

Exploiting the Behavior of Generational Garbage Collector

Exploiting the Behavior of Generational Garbage Collector Exploiting the Behavior of Generational Garbage Collector I. Introduction Zhe Xu, Jia Zhao Garbage collection is a form of automatic memory management. The garbage collector, attempts to reclaim garbage,

More information

Just-In-Time Compilation

Just-In-Time Compilation Just-In-Time Compilation Thiemo Bucciarelli Institute for Software Engineering and Programming Languages 18. Januar 2016 T. Bucciarelli 18. Januar 2016 1/25 Agenda Definitions Just-In-Time Compilation

More information

Visual Amortization Analysis of Recompilation Strategies

Visual Amortization Analysis of Recompilation Strategies 2010 14th International Information Conference Visualisation Information Visualisation Visual Amortization Analysis of Recompilation Strategies Stephan Zimmer and Stephan Diehl (Authors) Computer Science

More information

Program Correctness and Efficiency. Chapter 2

Program Correctness and Efficiency. Chapter 2 Program Correctness and Efficiency Chapter 2 Chapter Objectives To understand the differences between the three categories of program errors To understand the effect of an uncaught exception and why you

More information

Reducing Trace Selection Footprint for Large-scale Java Applications without Performance Loss

Reducing Trace Selection Footprint for Large-scale Java Applications without Performance Loss Reducing Trace Selection Footprint for Large-scale Java Applications without Performance Loss Peng Wu Hiroshige Hayashizaki Hiroshi Inoue Toshio Nakatani IBM Research pengwu@us.ibm.com,{hayashiz,inouehrs,nakatani}@jp.ibm.com

More information

Complex, concurrent software. Precision (no false positives) Find real bugs in real executions

Complex, concurrent software. Precision (no false positives) Find real bugs in real executions Harry Xu May 2012 Complex, concurrent software Precision (no false positives) Find real bugs in real executions Need to modify JVM (e.g., object layout, GC, or ISA-level code) Need to demonstrate realism

More information

Software Speculative Multithreading for Java

Software Speculative Multithreading for Java Software Speculative Multithreading for Java Christopher J.F. Pickett and Clark Verbrugge School of Computer Science, McGill University {cpicke,clump}@sable.mcgill.ca Allan Kielstra IBM Toronto Lab kielstra@ca.ibm.com

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

Enterprise Architect. User Guide Series. Profiling

Enterprise Architect. User Guide Series. Profiling Enterprise Architect User Guide Series Profiling Investigating application performance? The Sparx Systems Enterprise Architect Profiler finds the actions and their functions that are consuming the application,

More information

Enterprise Architect. User Guide Series. Profiling. Author: Sparx Systems. Date: 10/05/2018. Version: 1.0 CREATED WITH

Enterprise Architect. User Guide Series. Profiling. Author: Sparx Systems. Date: 10/05/2018. Version: 1.0 CREATED WITH Enterprise Architect User Guide Series Profiling Author: Sparx Systems Date: 10/05/2018 Version: 1.0 CREATED WITH Table of Contents Profiling 3 System Requirements 8 Getting Started 9 Call Graph 11 Stack

More information

Soot A Java Bytecode Optimization Framework. Sable Research Group School of Computer Science McGill University

Soot A Java Bytecode Optimization Framework. Sable Research Group School of Computer Science McGill University Soot A Java Bytecode Optimization Framework Sable Research Group School of Computer Science McGill University Goal Provide a Java framework for optimizing and annotating bytecode provide a set of API s

More information

Programming Style and Optimisations - An Overview

Programming Style and Optimisations - An Overview Programming Style and Optimisations - An Overview Summary In this lesson we introduce some of the style and optimization features you may find useful to understand as a C++ Programmer. Note however this

More information

Java Performance Tuning

Java Performance Tuning 443 North Clark St, Suite 350 Chicago, IL 60654 Phone: (312) 229-1727 Java Performance Tuning This white paper presents the basics of Java Performance Tuning and its preferred values for large deployments

More information

Network Working Group. Obsoletes: 3452, 3695 March 2009 Category: Standards Track

Network Working Group. Obsoletes: 3452, 3695 March 2009 Category: Standards Track Network Working Group M. Watson Request for Comments: 5445 Digital Fountain Obsoletes: 3452, 3695 March 2009 Category: Standards Track Status of This Memo Basic Forward Error Correction (FEC) Schemes This

More information

Combining Analyses, Combining Optimizations - Summary

Combining Analyses, Combining Optimizations - Summary Combining Analyses, Combining Optimizations - Summary 1. INTRODUCTION Cliff Click s thesis Combining Analysis, Combining Optimizations [Click and Cooper 1995] uses a structurally different intermediate

More information

Last class: OS and Architecture. OS and Computer Architecture

Last class: OS and Architecture. OS and Computer Architecture Last class: OS and Architecture OS and Computer Architecture OS Service Protection Interrupts System Calls IO Scheduling Synchronization Virtual Memory Hardware Support Kernel/User Mode Protected Instructions

More information

Last class: OS and Architecture. Chapter 3: Operating-System Structures. OS and Computer Architecture. Common System Components

Last class: OS and Architecture. Chapter 3: Operating-System Structures. OS and Computer Architecture. Common System Components Last class: OS and Architecture Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System Design and Implementation

More information

CS455: Introduction to Distributed Systems [Spring 2019] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2019] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [THREADS] The House of Heap and Stacks Stacks clean up after themselves But over deep recursions they fret The cheerful heap has nary a care Harboring memory

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Java Performance: The Definitive Guide

Java Performance: The Definitive Guide Java Performance: The Definitive Guide Scott Oaks Beijing Cambridge Farnham Kbln Sebastopol Tokyo O'REILLY Table of Contents Preface ix 1. Introduction 1 A Brief Outline 2 Platforms and Conventions 2 JVM

More information

Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler

Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler , Compilation Technology Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler Daryl Maier, Pramod Ramarao, Mark Stoodley, Vijay Sundaresan TestaRossa JIT compiler

More information

Chapter 3: Operating-System Structures

Chapter 3: Operating-System Structures Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System Design and Implementation System Generation 3.1

More information

infix expressions (review)

infix expressions (review) Outline infix, prefix, and postfix expressions queues queue interface queue applications queue implementation: array queue queue implementation: linked queue application of queues and stacks: data structure

More information

Adaptive Optimization using Hardware Performance Monitors. Master Thesis by Mathias Payer

Adaptive Optimization using Hardware Performance Monitors. Master Thesis by Mathias Payer Adaptive Optimization using Hardware Performance Monitors Master Thesis by Mathias Payer Supervising Professor: Thomas Gross Supervising Assistant: Florian Schneider Adaptive Optimization using HPM 1/21

More information

Field Analysis. Last time Exploit encapsulation to improve memory system performance

Field Analysis. Last time Exploit encapsulation to improve memory system performance Field Analysis Last time Exploit encapsulation to improve memory system performance This time Exploit encapsulation to simplify analysis Two uses of field analysis Escape analysis Object inlining April

More information

QUIZ Friends class Y;

QUIZ Friends class Y; QUIZ Friends class Y; Is a forward declaration neeed here? QUIZ Friends QUIZ Friends - CONCLUSION Forward (a.k.a. incomplete) declarations are needed only when we declare member functions as friends. They

More information

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Roman Kennke Principal Software Engineers Red Hat

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Roman Kennke Principal Software Engineers Red Hat Shenandoah: An ultra-low pause time garbage collector for OpenJDK Christine Flood Roman Kennke Principal Software Engineers Red Hat 1 Shenandoah Why do we need it? What does it do? How does it work? What's

More information

CS61C Machine Structures. Lecture 4 C Pointers and Arrays. 1/25/2006 John Wawrzynek. www-inst.eecs.berkeley.edu/~cs61c/

CS61C Machine Structures. Lecture 4 C Pointers and Arrays. 1/25/2006 John Wawrzynek. www-inst.eecs.berkeley.edu/~cs61c/ CS61C Machine Structures Lecture 4 C Pointers and Arrays 1/25/2006 John Wawrzynek (www.cs.berkeley.edu/~johnw) www-inst.eecs.berkeley.edu/~cs61c/ CS 61C L04 C Pointers (1) Common C Error There is a difference

More information

Run-time Environments

Run-time Environments Run-time Environments Status We have so far covered the front-end phases Lexical analysis Parsing Semantic analysis Next come the back-end phases Code generation Optimization Register allocation Instruction

More information

Exploiting Statistical Correlations for Proactive Prediction of Program Behaviors

Exploiting Statistical Correlations for Proactive Prediction of Program Behaviors Exploiting Statistical Correlations for Proactive Prediction of Program Behaviors Yunlian Jiang, Eddy Zhang, Kai Tian, Feng Mao, Malcom Gethers, Xipeng Shen CAPS ResearchR Group, College of William and

More information

Run-time Environments

Run-time Environments Run-time Environments Status We have so far covered the front-end phases Lexical analysis Parsing Semantic analysis Next come the back-end phases Code generation Optimization Register allocation Instruction

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Performance Profiling

Performance Profiling Performance Profiling Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Outline History Understanding Profiling Understanding Performance Understanding Performance

More information

CS 221 Review. Mason Vail

CS 221 Review. Mason Vail CS 221 Review Mason Vail Inheritance (1) Every class - except the Object class - directly inherits from one parent class. Object is the only class with no parent. If a class does not declare a parent using

More information

CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [THREADS] Frequently asked questions from the previous class survey

CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [THREADS] Frequently asked questions from the previous class survey CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [THREADS] Shrideep Pallickara Computer Science Colorado State University L6.1 Frequently asked questions from the previous class survey L6.2 SLIDES CREATED BY:

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

Deallocation Mechanisms. User-controlled Deallocation. Automatic Garbage Collection

Deallocation Mechanisms. User-controlled Deallocation. Automatic Garbage Collection Deallocation Mechanisms User-controlled Deallocation Allocating heap space is fairly easy. But how do we deallocate heap memory no longer in use? Sometimes we may never need to deallocate! If heaps objects

More information

Measuring and Improving the Potential Parallelism of Sequential Java Programs

Measuring and Improving the Potential Parallelism of Sequential Java Programs Measuring and Improving the Potential Parallelism of Sequential Java Programs Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The

More information

Chapter 1 GETTING STARTED. SYS-ED/ Computer Education Techniques, Inc.

Chapter 1 GETTING STARTED. SYS-ED/ Computer Education Techniques, Inc. Chapter 1 GETTING STARTED SYS-ED/ Computer Education Techniques, Inc. Objectives You will learn: Java platform. Applets and applications. Java programming language: facilities and foundation. Memory management

More information

The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices

The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices LINGLI ZHANG and CHANDRA KRINTZ University of California, Santa Barbara Java Virtual Machines (JVMs)

More information

Intermediate Code & Local Optimizations

Intermediate Code & Local Optimizations Lecture Outline Intermediate Code & Local Optimizations Intermediate code Local optimizations Compiler Design I (2011) 2 Code Generation Summary We have so far discussed Runtime organization Simple stack

More information

point in worrying about performance. The goal of our work is to show that this is not true. This paper is organised as follows. In section 2 we introd

point in worrying about performance. The goal of our work is to show that this is not true. This paper is organised as follows. In section 2 we introd A Fast Java Interpreter David Gregg 1, M. Anton Ertl 2 and Andreas Krall 2 1 Department of Computer Science, Trinity College, Dublin 2, Ireland. David.Gregg@cs.tcd.ie 2 Institut fur Computersprachen, TU

More information

Debugging Tools for MIDP Java Devices

Debugging Tools for MIDP Java Devices Debugging Tools for MIDP Java Devices Olli Kallioinen 1 and Tommi Mikkonen 2 1 Sasken Finland, Tampere, Finland olli.kallioinen@sasken.com 2 Tampere University of Technology, Tampere, Finland tommi.mikkonen@tut.fi

More information

(Refer Slide Time: 1:27)

(Refer Slide Time: 1:27) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 1 Introduction to Data Structures and Algorithms Welcome to data

More information

Garbage Collection (2) Advanced Operating Systems Lecture 9

Garbage Collection (2) Advanced Operating Systems Lecture 9 Garbage Collection (2) Advanced Operating Systems Lecture 9 Lecture Outline Garbage collection Generational algorithms Incremental algorithms Real-time garbage collection Practical factors 2 Object Lifetimes

More information

Advanced Programming & C++ Language

Advanced Programming & C++ Language Advanced Programming & C++ Language ~6~ Introduction to Memory Management Ariel University 2018 Dr. Miri (Kopel) Ben-Nissan Stack & Heap 2 The memory a program uses is typically divided into four different

More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

9/5/17. The Design and Implementation of Programming Languages. Compilation. Interpretation. Compilation vs. Interpretation. Hybrid Implementation

9/5/17. The Design and Implementation of Programming Languages. Compilation. Interpretation. Compilation vs. Interpretation. Hybrid Implementation Language Implementation Methods The Design and Implementation of Programming Languages Compilation Interpretation Hybrid In Text: Chapter 1 2 Compilation Interpretation Translate high-level programs to

More information

Life Cycle of Source Program - Compiler Design

Life Cycle of Source Program - Compiler Design Life Cycle of Source Program - Compiler Design Vishal Trivedi * Gandhinagar Institute of Technology, Gandhinagar, Gujarat, India E-mail: raja.vishaltrivedi@gmail.com Abstract: This Research paper gives

More information

Hardware-Supported Pointer Detection for common Garbage Collections

Hardware-Supported Pointer Detection for common Garbage Collections 2013 First International Symposium on Computing and Networking Hardware-Supported Pointer Detection for common Garbage Collections Kei IDEUE, Yuki SATOMI, Tomoaki TSUMURA and Hiroshi MATSUO Nagoya Institute

More information

Running class Timing on Java HotSpot VM, 1

Running class Timing on Java HotSpot VM, 1 Compiler construction 2009 Lecture 3. A first look at optimization: Peephole optimization. A simple example A Java class public class A { public static int f (int x) { int r = 3; int s = r + 5; return

More information