ONLINE PROFILING AND FEEDBACK-DIRECTED OPTIMIZATION OF JAVA


ONLINE PROFILING AND FEEDBACK-DIRECTED OPTIMIZATION OF JAVA
BY MATTHEW ARNOLD
A dissertation submitted to the Graduate School New Brunswick, Rutgers, The State University of New Jersey, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Graduate Program in Computer Science.
Written under the direction of Barbara Gershon Ryder and approved by
New Brunswick, New Jersey
October, 2002

© 2002 Matthew Arnold ALL RIGHTS RESERVED

ABSTRACT OF THE DISSERTATION
Online Profiling and Feedback-directed Optimization of Java
by MATTHEW ARNOLD
Dissertation Director: Barbara Gershon Ryder
The dynamic nature of the Java™ programming language presents a number of challenges for Java Virtual Machine (JVM) implementations. Constructs such as dynamic class loading and reflection make traditional whole program analysis and optimization difficult, or even impossible; however, Java's dynamic execution environment also presents a potential performance advantage over the traditional static compilation model: the ability to perform feedback-directed optimizations, where profiling information is collected at runtime and used to tailor the optimization decisions that are made. Although previous work has shown that feedback-directed optimizations can substantially improve program performance, most of these systems used offline profiles collected using a separate training run. Performing profiling and optimization online during the same run is an attractive approach because it avoids the need for a separate training run. Unfortunately, the overhead of collecting online profiles is often a problem, and is one of the main reasons why today's JVMs perform only limited forms of feedback-directed optimizations. The first contribution of this thesis is a new technique called an instrumentation sampling framework, a mechanism that allows previously expensive instrumentation to be executed with low overhead. The instrumentation sampling framework is designed as an automatic code transformation that takes instrumented code as input, and produces a

modified version of the code that collects a similar profile, but executes with low overhead. We implemented and evaluated our framework in the Jikes Research Virtual Machine; our results demonstrate that the sampling framework effectively reduces the overhead of several types of instrumentation while having only a minimal effect on the accuracy of the profiles collected. The second contribution of this thesis is the design and implementation of an online system that uses instrumentation sampling to drive feedback-directed optimizations and improve the performance of Java programs. Our implementation is built on top of the general adaptive optimization architecture of the Jikes RVM. Our system collects intra-procedural edge profiles using the instrumentation sampling framework and uses the resulting profiles to drive four feedback-directed optimizations. Our empirical evaluation demonstrates that our online approach can improve the performance of long-running programs without degrading the performance of short-running programs.

Acknowledgements
This thesis would not have been possible without the help of many people... I would like to start by thanking my advisor, Professor Barbara Ryder, for the guidance and support that she gave me throughout my entire graduate school experience. She taught me many of the fundamental skills required to become an independent researcher, and I am thankful for the time that she spent working with me. I appreciated her encouragement to pursue ideas that interested me and her continued support while following through on these ideas. I would also like to thank Michael Hind from IBM Research, who acted as a mentor to me during my time at IBM. I am grateful not only for the numerous technical interactions that we shared while discussing research ideas, but also for the advice and support that he provided to help ensure that my time at IBM was a success; without his help, the outcome of this thesis would most likely have been very different. In addition, his help in writing our OOPSLA '02 paper directly influenced the second half of this thesis. I had a tremendous experience at IBM Research working with Stephen Fink, David Grove, Michael Hind, and Peter Sweeney on the Jikes RVM adaptive optimization system. I would like to thank them for creating an environment in which I could learn so much, yet have fun in the process. The conversations and brainstorming sessions we had together are responsible for many of the ideas present in this thesis. I would also like to thank Michael Burke for offering me my original summer internship at IBM, and Peter Sweeney who not only guided me during that internship, but also helped convince me to extend my internship and continue my relationship with IBM. I would also like to thank everyone else at IBM who helped me along the way, including David Bacon, Julian Dolby, Igor Pechtchanski, Vivek Sarkar, Martin Trapp, and Mark Wegman. In addition, I would like to acknowledge IBM Research for their financial support. I am grateful to Craig Chambers, Michael Hind, Rich Martin, and Barbara Ryder for

their helpful comments on this thesis. I also thank all of the members of the PROLANGS lab for providing an enjoyable research environment during my time at Rutgers. In particular, Nasko Rountev and I had endless conversations about topics ranging from research, to the meaning of life; these conversations were not only educational, but helped make graduate school a memorable experience. Finally, I would like to thank my family for their endless support during my time in school. From the very beginning my parents encouraged me to stay in school, and made several strategic moves along the way to ensure that I did. I am very grateful for their efforts that ultimately helped me choose the path that I did. And last, but certainly not least, I want to thank my wife Jelena for so many things. First, she deserves an award simply for putting up with me through all that we experienced during the last several years. But even more importantly, I owe her more than can be expressed for the continual support that she gave me: being there when I needed help, encouraging me to make rational decisions and, in general, helping me to maintain my sanity. I am forever grateful.

Dedication
To my family: To my parents, who encouraged and supported me throughout the long journey of my college career; it helped more than you may realize. And to my wife Jelena, who endured all of the ups and downs together with me. I will always remember and appreciate the conversations that we had and the support that you gave me.

Table of Contents

Abstract
Acknowledgements
Dedication
List of Tables
List of Figures
1. Introduction
  Thesis Contributions
    Framework for Low-overhead Instrumentation
    Online System Performing Instrumentation and Feedback-Directed Optimization
  Thesis Organization
2. Background: The Jikes RVM
  The Jikes RVM Optimizing Compiler
  Threading Model
  Adaptive Optimization System
    General Architecture
    Current Instantiation
    Controller Model
Part I: Low-overhead Instrumentation
3. Instrumentation Sampling Framework
  3.1. Technique
  Check placement
    Reducing dynamic check frequency
  Trigger mechanisms
    Counter-based sampling
      Implementation options
    Timer-based sampling
    Event-based sampling
    Discussion: The effect of polling
  Applicability to various types of instrumentation
  Space-saving variations
    Variation 1: Partial-Duplication
    Variation 2: No-Duplication
4. Implementation and Experimental Evaluation
  Implementation
    Full-Duplication code transformation
    Check implementation
    Jikes RVM yieldpoint optimization
    Phase ordering
    Interaction with Instrumentation
  Experimental Results
    Benchmarks
    Methodology
    Instrumentation examples
    Framework overhead
      Full-Duplication algorithm
      No-Duplication algorithm
    Sampled instrumentation overhead and accuracy
      Overhead
      Accuracy
    Jikes RVM-specific optimization
    Trigger Mechanisms
Part II: Online Instrumentation and Feedback-Directed Optimization
5. Online FDO: Design and Implementation
  Background
    Challenges in Performing Online FDO
    Existing Online Strategies
      Profile early during unoptimized execution
      Profile optimized code
      Sample throughout execution
  Design Goals
  Online Strategy
  Implementation
    Online Strategy
    Instrumentation: Intraprocedural Edge Profiles
      Collecting Edge Profiles
      Using Edge Profiles
  Feedback-Directed Optimizations
    Splitting
    Method Inlining
    Code Positioning
    Loop Unrolling
6. Online FDO: Experimental Results
  6.1. SPECjvm98 Benchmark Suite
    Steady-State Performance
    Online Performance
    Space Overhead
  Server Benchmark Performance
7. Related Work
  Complete Online Systems
    Self
    Java
    IBM DK
    Hotspot
    MRL
    Binary translators
    Prefetching
    Dynamo
    Ephemeral Instrumentation
    Oberon
    Other Systems
  Offline Optimization
    FDO
    Selective Optimization
  Profiling
    Exhaustive Instrumentation
    Sampling
8. Conclusions and Future Work
  Low-overhead Instrumentation
  Online FDO
  8.3. Discussion and Future Work
References
Vita

List of Tables
4.1. Benchmark suite used to evaluate sampling framework
Time overhead of example instrumentations (without framework)
Time overhead of Full-Duplication framework
Space overhead of Full-Duplication framework
Time overhead of No-Duplication framework
Total overhead and accuracy of sampling example instrumentations
Comparing timer- and counter-based triggers
Cost/benefit ratios for controller model
Characteristics of long-running SPECjvm98 suite
Recompilation and space statistics of FDO on SPECjvm98 benchmarks
SPECjbb2000 server benchmark performance, with and without online feedback-directed optimization

List of Figures
2.1. Overview of the Jikes RVM optimizing compiler
Design of Jikes RVM adaptive optimization system
Implementation of Jikes RVM adaptive optimization system
High-level view of instrumentation-sampling framework
A high-level view of an instrumented method generated by the sampling framework
Detailed example of sampling framework
Code inserted for a counter-based check
Example of how a timer-based trigger can lead to non-intuitive sampling results
Removing nodes increases checks
Example of Partial-Duplication
Example of No-Duplication
Performing the Full-Duplication code transformation
Assembly pseudocode for counter-based check
Placement of Full-Duplication transformation within optimizing compiler
Graphical overlap percentage example
Overhead of Full-Duplication framework with yieldpoint optimization applied
Online strategy for online FDO
Modifications to Jikes RVM adaptive optimization system for performing FDO
Feedback-directed splitting algorithm
5.4. One iteration of feedback-directed splitting
Peak performance improvement when using instrumentation and feedback-directed optimization
Online performance of non-FDO and FDO systems
Online improvement of FDO vs. non-FDO systems
SPECjbb2000 server benchmark performance

Chapter 1
Introduction
The dynamic nature of the Java™ programming language presents a number of challenges for Java Virtual Machine (JVM) implementations. Constructs such as dynamic class loading and reflection make traditional whole program analysis and optimization difficult, or even impossible. To improve application performance in the presence of these restrictions, many of today's JVMs employ a dynamic optimizing compiler which compiles Java bytecode into native code at runtime, while the application program is executing. Dynamic compilation has a number of disadvantages compared to traditional static compilation, most notably that the overhead incurred by performing compilation at runtime can be substantial. To minimize this overhead, attention has been focused on 1) reducing the execution time of the optimizer, and 2) applying optimization to only the key portions of the application [51, 64, 71, 36, 8]. This second approach, often referred to as selective optimization, avoids the overhead of optimizing all methods, and is thus particularly beneficial for shorter running programs that do not execute long enough to recoup the time spent optimizing all methods [11]. Despite its potential disadvantages, dynamic compilation also has a potential performance advantage over traditional static compilation: the ability to tailor optimizations to the current execution environment. Such an approach, typically referred to as feedback-directed optimization (FDO), not only instructs the optimizer what to optimize, but also specifies how the method should be optimized. By observing and optimizing the common execution patterns of the executing program, a system performing feedback-directed optimization has the potential to outperform a traditional static compiler. Several systems [45, 59, 58, 26] have shown that performance can be substantially improved by exploiting invariant runtime values; however, these systems were not fully

automatic and relied on programmer directives to identify regions of code to be optimized. There exists a large body of work on collecting profiling information by performing instrumentation [25, 44, 4, 18, 17], as well as fully-automatic optimizations based on instrumented profiles [34, 27, 46, 30, 31, 10, 62, 65, 53]. However, this work assumes an execution model where profiles can be collected offline, using a separate training run. Although the resulting speedups are often promising, this approach fails in scenarios where 1) it is impractical to collect a profile prior to execution, or 2) the application does not behave like the training run. Performing profiling and optimization online, during the same run, is an attractive approach because it avoids the previously mentioned drawbacks of offline profiling. Unfortunately, using online profiles to guide optimization has limitations of its own because the amount of work that must be performed at runtime is increased, including 1) collecting the profiling information, 2) examining the profile data and making decisions based on it, and 3) performing the actual feedback-directed optimizations. All three of these steps involve overhead, creating the potential for degrading performance rather than improving it. Most importantly, the overhead of collecting instrumented profiles is a problem. Overheads in the range of 30%-1,000% above non-instrumented code are not uncommon [46, 17, 18, 27, 26, 4] for collecting the kinds of profiles often used to drive feedback-directed optimizations, and overheads in the range of 10,000% (100 times slower) have been reported [26]. This overhead is one of the main reasons why today's JVMs perform only limited forms of feedback-directed optimizations [8, 71, 36, 64]. Optimizations that are currently being used online are usually based on profiles that can be collected easily with low overhead. Some online systems, such as Dynamo [16], are designed to identify when performance is being degraded so that profile-guided optimizations can be disabled for the remainder of execution. This thesis presents a new approach for performing online instrumentation and feedback-directed optimization. First, we present instrumentation sampling [12], a new technique for reducing the runtime overhead of executing instrumented code. By allowing a wide range of traditionally offline instrumentation techniques to be collected

with low overhead, one of the biggest obstacles to performing feedback-directed optimizations online is eliminated. Second, we describe how instrumentation sampling can be incorporated into an online, adaptive Java Virtual Machine. We show that instrumentation sampling can be used effectively to collect online instrumented profiles with minimal overhead. Using several examples of feedback-directed optimizations, we also show that our online approach can effectively improve the performance of long-running Java applications, without sacrificing the performance of short-running applications.
1.1 Thesis Contributions
The specific contributions of this thesis can be divided into the following two categories.
1.1.1 Framework for Low-overhead Instrumentation
The first contribution of this thesis is a new technique called an instrumentation sampling framework, a mechanism that allows previously expensive instrumentation to be executed with low overhead. The main goal of the framework is to automate the process of reducing instrumentation overhead, allowing a wide range of profiles to be collected efficiently, without requiring a separate low-overhead implementation for each. The instrumentation sampling framework is designed as an automatic code transformation that takes instrumented code as input, and produces a modified version of the code that collects a similar profile, but executes with low overhead. We implemented and evaluated our framework in the Jikes Research Virtual Machine; our results demonstrate that the sampling framework effectively reduces the overhead of several types of instrumentation while having only a minimal effect on the accuracy of the profiles collected.
1.1.2 Online System Performing Instrumentation and Feedback-Directed Optimization
The second contribution of this thesis is the design and implementation of an online system that uses instrumentation sampling to drive feedback-directed optimizations to improve the

performance of Java programs. Our implementation is built on top of the general adaptive optimization architecture of the Jikes RVM described in [8]. We describe a fully automated system that makes online decisions regarding when instrumentation and feedback-directed optimization should be performed. Our system collects intra-procedural edge profiles using the instrumentation sampling framework and uses the resulting profiles to drive four feedback-directed optimizations. Our empirical evaluation demonstrates that our online approach can improve the performance of long-running programs without degrading the performance of short-running programs.
1.2 Thesis Organization
The remainder of this thesis is organized as follows. Chapter 2 describes background information about the Jikes RVM, the infrastructure in which this thesis work is implemented. Part I (Chapters 3 and 4) describes the design and implementation of the instrumentation sampling framework. Part II (Chapters 5 and 6) describes the design and implementation of an online system that uses instrumentation sampling to perform online profiling and optimization of Java programs. Chapter 7 presents the related work, and Chapter 8 presents our conclusions.

Chapter 2
Background: The Jikes RVM
The Jikes Research Virtual Machine (Jikes RVM) 1 is a virtual machine developed at the IBM T.J. Watson Research Center. This chapter gives a brief overview of the Jikes RVM system, and provides a detailed description of those components that are directly relevant to this thesis. The Jikes RVM is written almost entirely in Java. It begins execution by reading from a boot image file, which contains the core services of the Jikes RVM precompiled to machine code. The Jikes RVM uses a compile-only approach (no interpreter); thus all methods are compiled to native code upon first execution. The Jikes RVM currently contains two compilers: a fast, non-optimizing compiler called the baseline compiler, and an aggressive optimizing compiler [23]. The Jikes RVM also contains an adaptive optimization system [8] which profiles the application and makes online decisions regarding when and how the optimizing compiler should be applied. Although it is not a fully complete JVM, the performance of the Jikes RVM has been shown to be competitive with that of commercial JVMs on the PowerPC platform. Components of the Jikes RVM that are of particular relevance to this thesis include 1) the optimizing compiler, 2) the quasi-preemptive threading model, and 3) the adaptive optimization system. These components are discussed in detail in the three sections that follow.
1 The Jikes RVM is an open-source version of the Jalapeño Research Virtual Machine [3, 8] and is available at

2.1 The Jikes RVM Optimizing Compiler
The Jikes RVM optimizing compiler takes Java bytecode as input and produces native code as output. The optimizing compiler begins by converting the bytecode into a register-based intermediate representation, referred to as the IR. The Jikes RVM's optimizing compiler consists of a series of optimization phases that transform the intermediate representation (IR) of a method from an unoptimized to an optimized state. As shown in Figure 2.1, there are three categories of optimization phases: 1) high-level optimizations, which are architecture and VM independent, 2) low-level optimizations, which are architecture independent, but not necessarily VM independent, and 3) machine-level optimizations, which are specific to the target architecture. The optimizations performed by the optimizing compiler are grouped into the following three predefined optimization levels.
Level 0: Local, on-the-fly optimizations and register allocation are performed. No inlining is performed.
Level 1: Augments Level 0 with more sophisticated local optimizations such as common subexpression elimination, array bounds check elimination and redundant load elimination. Inlining is performed based on size heuristics.
Level 2: Augments Level 1 with SSA-based global optimizations.
Each optimization level performs a superset of the optimizations performed at lower optimization levels, and therefore incurs additional compilation cost with the hope of generating better quality code. These optimization levels are chosen automatically by the adaptive optimization system, or can be chosen manually for non-adaptive versions of the Jikes RVM.
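
For illustration only, the three levels can be summarized as a small Java enum; this restatement is not Jikes RVM code, but the cost/benefit sketch later in this chapter indexes these levels as 0, 1, and 2.

    // Illustrative only: a compact restatement of the optimization levels described above.
    enum OptLevel {
        O0,  // local, on-the-fly optimizations and register allocation; no inlining
        O1,  // adds CSE, bounds-check elimination, redundant load elimination; size-based inlining
        O2   // adds SSA-based global optimizations
    }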

Figure 2.1: Overview of the Jikes RVM optimizing compiler. This figure is recreated from the original description of the Jalapeño optimizing compiler [23] (Figure 3). HIR, LIR, and MIR represent (respectively) High-, Low-, and Machine-level Intermediate Representations.

23 8 current implementation, N is the number of physical processors being used by the application, so there is one operating system thread for each physical processor. The Jikes RVM implements its own thread scheduler which multiplexes the Java threads on top of these operating system threads. The Jikes RVM scheduling model is quasi-preemptive, meaning that Java threads can be preempted, but only at certain predefined points, called yieldpoints. A yieldpoint is a sequence of instructions that checks a threadswitch bit to determine whether it is time for the currently executing thread to stop executing and yield control back to the thread scheduler. This bit is set every 10 milliseconds by an operating system interrupt. To guarantee that a Java thread cannot execute indefinitely, the Jikes RVM must ensure that only a finite amount of execution can occur before a yieldpoint is executed. The guarantee is currently met by simply placing yieldpoints on all method entries and loop backedges Adaptive Optimization System This sections describes the Jikes RVM adaptive optimization system [8], focusing on the components that are extended by this thesis General Architecture Figure 2.2 gives an overview of the general design of the Jikes RVM s adaptive optimization system. The architecture contains three main components: runtime measurements, the controller, and the recompilation subsystem. Methods are compiled with the non-optimizing baseline compiler upon their first execution, and an aggressive optimizing compiler is applied selectively by the adaptive optimization system. 2 There are slight exceptions to this yieldpoint placement, depending on whether the adaptive optimization system is being used. For versions of the Jikes RVM without the adaptive optimization system, yieldpoints are excluded from the prologues of trivial methods that are guaranteed to execute for a finite amount of time. For versions with the adaptive optimization system, yieldpoints are placed on the prologues, as well as the epilogues of all methods; this placement is not necessary for correctness but helps increase the accuracy of profiling system.

Figure 2.2: Overview of the general design of the Jikes RVM adaptive optimization system [8].
The runtime measurements component collects profiling information about the currently executing methods, and periodically summarizes the information and passes it to the controller. The architecture of the runtime measurements component is designed to be flexible enough to support collecting profiling information in a variety of ways, including the use of periodic sampling, instrumentation, and hardware performance monitors. The controller is the brain of the adaptive optimization system; it makes all decisions regarding profiling and optimization activity. Based on the profiling information provided by the runtime measurements system, the controller can choose to perform additional profiling, or perform optimization by constructing a compilation plan and passing it to the recompilation subsystem. The recompilation subsystem consists of one or more compilation threads that execute compilation plans created by the controller. A compilation plan provides instructions regarding how the method should be compiled. The next section reviews the current instantiation of this general design.

Figure 2.3: Default implementation of the Jikes RVM adaptive optimization system [8].
2.3.2 Current Instantiation
Figure 2.3 presents the base implementation (version 1.1b) of the Jikes RVM adaptive optimization system architecture as described in [8]. This differs from Figure 2.2, which provides a general design of what could be implemented. The runtime measurements subsystem performs profiling throughout execution using a low-overhead timer-based sampling mechanism to identify frequently executed methods. This timer-based sampling is based on the yieldpoint mechanism described in Section 2.2. A sample is taken each time a yieldpoint notifies the thread scheduler that it is time to switch threads. The thread switching code triggers a callback to the runtime measurements system, which examines the currently executing method and increments a counter for that method. Because this sampling mechanism is built into the thread-switching code of the Jikes RVM thread scheduler, no instrumentation of the executing method is required. Periodically the Hot Method Organizer notifies the controller with a set of methods that have been sampled frequently. For each method, the controller then examines the profile data and makes a decision regarding whether the method should be optimized, and if so, at what optimization level. The controller makes these decisions using a cost/benefit model, which is discussed in detail in the next section. Once the controller has decided

that a method should be recompiled, the controller notifies the recompilation subsystem, which currently consists of a single compilation thread. The Jikes RVM also performs one form of FDO, called adaptive inlining. Adaptive inlining uses the same timer-based sampling mechanism described above to detect hot call edges, so that they can be inlined if the caller method is recompiled in the future. These samples are processed by the inlining organizer and decayed by the decay organizer. If the inlining organizer detects a hot call edge in a method that is already optimized at the highest level (O2), it informs the controller of this fact, which can lead to the recompilation of that method (from O2 to O2) for the sole purpose of incorporating the new inlining decision. The general design of the adaptive system also allows the controller to request that the optimizing compiler insert instrumentation during compilation to collect additional profiling information for driving feedback-directed optimizations. This feature, however, as well as the necessary modifications to the three components of the adaptive system, was not implemented in [8].
Controller Model
The controller in the current adaptive optimization system uses a cost/benefit model to determine what action should be taken for each recompilation candidate it considers. The goal of the controller is to make decisions in such a way that good performance is achieved by both short- and long-running applications. To evaluate the desirability of optimizing a method M at optimization level O, the controller computes two values:
1. Estimated cost: The cost of performing the optimization is the expected time required to compile method M at optimization level O.
2. Estimated benefit: The benefit is the speedup that can be expected after method M is compiled at optimization level O. This speedup is estimated using the assumption that the past will repeat itself and method M will execute twice as long as it already has.

When considering whether to optimize a particular method, the viable choices are to do nothing, or recompile at one of the Jikes RVM's three optimization levels (O0, O1, O2). The controller estimates the cost and benefit of each potential recompilation choice, then picks the choice that would minimize total execution time.
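
The following is a minimal sketch of that cost/benefit comparison, under the stated assumption that a method will execute as long again in the future as it already has. The class, method, and parameter names are hypothetical stand-ins for the model's inputs (per-level compile-time and speedup estimates); this is not the Jikes RVM controller code.

    class RecompilationChoiceSketch {
        // compileCost[i]: expected time to compile the method at level Oi.
        // speedup[i]:     expected speedup of level-Oi code over the method's current code.
        // Returns -1 for "do nothing", or 0/1/2 for "recompile at O0/O1/O2".
        static int choose(double timeSpentSoFar, double[] compileCost, double[] speedup) {
            double futureTime = timeSpentSoFar;      // the past repeats itself
            int best = -1;
            double bestTotal = futureTime;           // expected remaining time if we do nothing
            for (int level = 0; level < compileCost.length; level++) {
                double total = compileCost[level] + futureTime / speedup[level];
                if (total < bestTotal) {
                    bestTotal = total;
                    best = level;
                }
            }
            return best;
        }
    }

As a worked example under these assumptions: a method that has already executed for 400 ms, and that would take 50 ms to compile at O1 for an estimated 1.5x speedup, has an expected remaining time of 50 + 400/1.5 ≈ 317 ms, versus 400 ms for doing nothing, so recompiling at O1 is preferred over doing nothing (the controller would compare the other levels the same way).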

Part I
Low-overhead Instrumentation

Chapter 3
Instrumentation Sampling Framework
This chapter presents an instrumentation sampling framework, a technique that allows previously expensive instrumentation to be performed with low overhead. The sampling framework is an automatic code transformation that takes an instrumented method (that would execute with high overhead) as input, and transforms the method to produce a modified version that will execute with low overhead, yet collect a similar profile (see Figure 3.1). The main goal of the framework is to automate the process of reducing instrumentation overhead, allowing a wide range of profiles to be collected efficiently, without requiring a separate low-overhead implementation for each. Instead, high-overhead versions of instrumentation can be used and the sampling framework automatically reduces the overhead. Our framework offers the following advantages:
Overhead is reduced substantially, allowing previously expensive instrumentation techniques to be used at runtime, even in situations where performance is important.
In our experience, the accuracy of the profile being collected remains high, allowing the technique to be used even when accurate profiles are needed.
The technique can be used to collect a wide range of profiles. Many common instrumentation techniques can be incorporated into our framework without modification. Multiple types of instrumentation can be inserted simultaneously and sampled by the framework.
The sampling framework is easy to implement and can be applied at any level of abstraction, ranging from a source-to-source transformation to a binary-to-binary

Figure 3.1: An illustration of the goal of the instrumentation-sampling framework. The input to the framework is an instrumented method that would execute with high overhead; the output is a modified version that will execute with low overhead, but produce a similar profile.
transformation.
The framework is tunable, allowing the tradeoff between overhead and accuracy to be adjusted easily at runtime.
The framework does not rely on any hardware or operating system support.
Sections 3.1 through 3.3 describe the instrumentation sampling framework in detail. Section 3.4 describes the types of profiles for which this framework is effective, and possible modifications for collecting other types of profiles. Section 3.5 describes two variations of the framework designed to reduce the space requirements.
3.1 Technique
Assume that a method F is to be instrumented. The sampling transformation of F is accomplished as follows. A second version of F, called the duplicated code, is introduced within the instrumented method, as shown in Figure 3.2. The duplicated code contains all of the heavyweight (high-overhead) instrumentation. The original version of the code is now referred to as the checking code because it is modified only slightly in a way that allows execution to swap back and forth between the checking code and the duplicated code in a fine-grained, controlled manner. At regular sample intervals, execution moves into the duplicated code for a small, bounded amount of time. Total overhead can be kept

Figure 3.2: A high-level view of an instrumented method generated by the sampling framework. A second version of the code is introduced, called the duplicated code, which contains all instrumentation. The original code becomes the checking code, which is minimally instrumented to allow control to transfer in and out of the duplicated code in a fine-grained and controlled manner.
to a minimum by ensuring that the majority of execution occurs in the checking code. As long as the duplicated code is executed infrequently, expensive instrumentation inserted into the duplicated code will have only a small impact on overall overhead. This version of the framework will be referred to as Full-Duplication, since all of the code in the method is duplicated. The switching between the checking and duplicated code is illustrated in Figure 3.3. The checking code has conditional branches inserted (which will be referred to as checks) that monitor a sample condition. When a check determines that the sample condition is true, a sample is triggered and control jumps to the duplicated code, rather than continuing in the checking code. The duplicated code is also slightly modified to ensure that only a bounded amount of execution occurs in the duplicated code before the sample condition is re-evaluated. This is accomplished by modifying the backward branches (which will be referred to as backedges) in the duplicated code to transfer control back to the checking code, allowing the sample condition to be re-evaluated to determine whether execution should continue in the duplicated code or checking code. Therefore, taking a sample implies executing one

Figure 3.3: Illustration of the flow of control between the checking code and duplicated code. All method entries and backedges in the checking code contain a conditional branch that jumps to the duplicated code when a sample condition is true. All backedges in the duplicated code are modified to return to the checking code.
acyclic path through the duplicated code, and then re-evaluating the sample condition. Where the checks are placed within the checking code, and how samples are triggered are two of the flexible aspects of the framework; they are discussed in Sections 3.2 and 3.3, respectively. The key idea of the Full-Duplication framework is that the ratio of time spent in each version of the code can be controlled by changing the rate at which the sample condition is true. The overhead of the duplicated code, and all instrumentation inserted into it, can be controlled by reducing the rate at which samples are taken. As the sample rate is decreased, the total overhead will converge to the overhead of the checks being executed in the checking code. Another more subtle advantage of this framework is that it makes it easy to stop executing the instrumentation when no more profiling information is needed. In an online system (as discussed further in Chapter 5) it may be desirable to execute instrumented code for some period of time, after which execution should transfer back to the non-instrumented version. It is important to have some mechanism for stopping the instrumented code from

executing to prevent the program from running indefinitely with poor performance. This could be achieved by using dynamic code patching [49, 75, 71, 72] to insert and remove instrumentation without recompiling the method, or by performing on-stack replacement [50] to hot-swap execution back to the non-instrumented version while the method is running. In our sampling framework, this problem is avoided because setting the sample condition to be permanently false will ensure that execution remains in the checking code and no more instrumentation will be executed. Unless the system performs on-stack replacement, execution cannot switch back to a totally non-instrumented version of the method (i.e., a version of the method with no instrumentation and without the sampling transformation applied) until the method exits; however, no samples will be triggered during this time, so the total overhead will be that of the checking code. Depending on the implementation of the checks, this overhead should be small compared to the cost of instrumentation.
3.2 Check placement
Placement of the checks within the checking code is one of the flexible aspects of the sampling framework. To reduce overhead it is desirable to minimize the number of checks executed; however, to ensure that an accurate profile is collected, enough checks should be placed so that all of the duplicated code has a chance to be sampled. One approach is to place checks on all method entries and backedges in the checking code. This placement of the checks ensures that (a) only a bounded amount of execution occurs between checks, so that execution cannot continue indefinitely in either the checking code or the duplicated code, and (b) all the code has an opportunity to be sampled. An important property of this check placement is that the number of checks executed (and thus the overhead of the checking code) is completely independent of the instrumentation inserted in the duplicated code. For ease of reference this will be referred to as Property 1, defined as follows:
Property 1: The number of checks executed in the checking code is less than or equal to the number of backedges and method entries executed, independent of the instrumentation being performed.
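
As a concrete illustration of this structure, the following source-level sketch shows the shape of code that the checking/duplicated split and the entry/backedge checks produce for a single loop. It collapses the two code versions into one structured Java loop for readability and is not the actual Jikes RVM IR transformation; sampleDue(), recordEdge(), and work() are hypothetical stand-ins for the sample condition, the heavyweight instrumentation, and the loop body.

    class FullDuplicationSketch {
        static void transformedLoop(int n) {
            for (int i = 0; i < n; i++) {
                // Check on the method entry / loop backedge in the checking code.
                if (sampleDue()) {
                    // Duplicated code: one acyclic path with full instrumentation;
                    // its backedge returns control to the check above.
                    recordEdge(i);
                    work(i);
                } else {
                    // Checking code: no instrumentation, only the cheap check.
                    work(i);
                }
            }
        }
        static boolean sampleDue() { return false; }    // e.g., a counter-based trigger (Section 3.3)
        static void recordEdge(int i) { }                // placeholder heavyweight instrumentation
        static void work(int i) { }                      // placeholder loop body
    }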

Property 1 is an important characteristic of the Full-Duplication framework because it implies scalability in regard to the amount of instrumentation that can be inserted in the duplicated code. The checking overhead is a fixed-cost overhead that is independent of the instrumentation in the duplicated code; therefore, it becomes possible to insert as much instrumentation as desired, and the total overhead can be reduced to approach that of the checking code by reducing the sample rate.
3.2.1 Reducing dynamic check frequency
The runtime overhead caused by the checks depends on many factors, such as the efficiency of the checks themselves. The number of checks executed at runtime plays a major role in determining the overhead of the checking code. The simple check placement described above (method entries and loop backedges) resulted in acceptable overhead in our implementation (see Section 4.2). However, there could be situations in which the checking overhead is unacceptably high. For example, the backedge checks will introduce more overhead in programs that contain tight loops. Similarly, the method entry checks will introduce more overhead in programs that make frequent method calls. In scenarios where the overhead of this basic check placement is too high, there are several possible approaches to help reduce the number of checks executed at runtime. First, several common code transformations can help reduce the number of checks executed at runtime. For example, loop unrolling or loop tiling can be used to reduce the number of backedge checks executed. These transformations increase the number of instructions executed per loop iteration, and thus increase the number of instructions executed between successive backedge checks. By performing more work between checks, the checking overhead caused by backedge checks will be reduced. Similarly, more aggressive inlining can be used to reduce the number of method entries executed, thus reducing the overhead caused by method entry checks. A second possibility for reducing the number of checks executed is to perform analysis of the instrumentation that is inserted in the duplicated code, and remove checks that are unnecessary. For example, if a particular loop contains no instrumentation then no

backedge check is needed for that loop. This approach was used by Hirzel and Chilimbi [48] to reduce the checking overhead when collecting memory reference profiles.
3.3 Trigger mechanisms
The sampling framework relies on runtime checks, which need some kind of trigger mechanism to determine when execution should be transferred from the checking code to the duplicated code. There is a wide range of possible strategies that could be used for triggering samples. Different triggers may make sense in different situations, and the ability to select a trigger mechanism to match the desired use of the profiling system is one of the flexible aspects of the framework. Three examples of trigger mechanisms are discussed in the sections that follow.
3.3.1 Counter-based sampling
Counting a particular event and sampling when the counter reaches a threshold (which we refer to as counter-based sampling) is an effective mechanism for triggering samples proportionally to the frequency of that event. Counter-based sampling is particularly appealing when collecting profiles to guide feedback-directed optimizations because these optimizations often rely on the relative execution frequencies of certain events. Counter-based sampling has been used in previous systems, such as DCPI [6], where interrupts signaled by hardware performance counters are used to sample instructions proportionally to their execution frequency. To trigger samples in our framework we propose implementing a counter-based trigger in software by having the compiler insert code to decrement and check a global counter, as shown in Figure 3.4. We call this technique compiler-inserted counter-based sampling. As long as the overhead of the counting and checking is kept to a minimum, the advantages of compiler-inserted counter-based sampling are numerous. Such advantages include:
Easy to implement
Counter-based sampling is a simple but effective approach for triggering samples. It

globalcounter--;
if (globalcounter <= 0) {
    globalcounter = resetvalue;
    takesample();
}
Figure 3.4: Code inserted for a counter-based check
is easy to implement and does not rely on any hardware or operating system support. 1
Flexible, high-frequency sample rate
A counter-based trigger provides a flexible, high-frequency sample rate. Any desired sample rate can be achieved by simply changing the value of resetvalue, and this value can even be changed dynamically to vary the sampling rate during profiling. This is an important advantage over other sampling mechanisms, such as hardware or operating system timer interrupts, which provide a fixed-frequency sample rate that may be too infrequent for some profiling scenarios [6].
Samples are triggered proportionally to execution frequency
The number of times each check triggers a sample is proportional to the number of times that particular check is executed; therefore, the instructions in the duplicated code are executed proportionally to their execution frequency in the non-instrumented code. This property makes counter-based sampling effective for estimating the execution frequencies of program events.
Deterministic sampling
A counter-based trigger has the advantage of triggering samples deterministically. If the application being executed is deterministic, two runs of the program (with the same input) will produce the same sampled profile. One potential disadvantage of using a deterministic sampling strategy is that it is possible for the program behavior to correlate with the sampling behavior, resulting in a
1 Although hardware and operating system techniques may be used to lower the overhead of the checks, no support from either is required.

highly inaccurate profile. For example, if a program performs some uncommon behavior every 1000th loop iteration, any sample interval that is a multiple of 1000 could result in the uncommon behavior being observed on every sample. Although our experimental results suggest that this problem did not occur for benchmarks used in this study (see Section 4.2), the problem can be easily avoided by adding some degree of randomness to the sampling mechanism. One possibility is to add a small pseudo-random factor to the reset value (as done in [6]) to reduce the probability of program behavior correlating with the sample interval. Such an approach could potentially increase accuracy in the average case as well by eliminating inaccuracy caused by correlation between the sample interval and program behavior.
Implementation options
There are several options for implementing a counter-based sampling approach; the simplest approach is to have each check execute the code exactly as shown in Figure 3.4. The counter variable (globalcounter) will most likely be in a register, or in the cache, and the branch will be predicted (not taken); therefore, the performance overhead should be low. Such an approach was implemented in Jikes RVM without using a dedicated register, placing the code in Figure 3.4 on all backedges and method entries, and the overhead averaged 4.9% (for executing the checks only, when no samples were taken). A detailed evaluation of the overhead of counter-based checks is included in Section 4.2. For multi-threaded applications, the global counter may raise some concerns. First, access to the global counter is not synchronized for performance reasons, so data races may occur. Fortunately, it may not be necessary to maintain 100% accuracy of the global counter, as it is simply a means of triggering samples roughly proportional to execution frequency. Having the counter value off-by-one occasionally would have little effect on the resulting accuracy. A more serious problem is that access to a single global counter could become a performance bottleneck as the number of threads and processors increases. In this case, the global counter could be replaced by thread- or processor-specific counters, allowing access to the counter with no resource contention.
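
A minimal sketch of that thread-local variant, using standard Java ThreadLocal purely for illustration; inside a JVM the counter would more likely live in per-thread or per-processor VM state. The names resetvalue and takesample() follow Figure 3.4, and the class itself is hypothetical, not Jikes RVM code.

    class PerThreadCheckSketch {
        static final int resetvalue = 1000;                    // sample interval (tunable at runtime)
        static final ThreadLocal<int[]> counter =
                ThreadLocal.withInitial(() -> new int[] { resetvalue });

        // Replaces the global-counter check on method entries and backedges.
        static void check() {
            int[] c = counter.get();
            if (--c[0] <= 0) {                                 // no contention across threads
                c[0] = resetvalue;
                takesample();                                  // transfer to the duplicated code
            }
        }
        static void takesample() { }                           // placeholder, as in Figure 3.4
    }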


More information

SABLEJIT: A Retargetable Just-In-Time Compiler for a Portable Virtual Machine p. 1

SABLEJIT: A Retargetable Just-In-Time Compiler for a Portable Virtual Machine p. 1 SABLEJIT: A Retargetable Just-In-Time Compiler for a Portable Virtual Machine David Bélanger dbelan2@cs.mcgill.ca Sable Research Group McGill University Montreal, QC January 28, 2004 SABLEJIT: A Retargetable

More information

Trace Compilation. Christian Wimmer September 2009

Trace Compilation. Christian Wimmer  September 2009 Trace Compilation Christian Wimmer cwimmer@uci.edu www.christianwimmer.at September 2009 Department of Computer Science University of California, Irvine Background Institute for System Software Johannes

More information

Jazz: A Tool for Demand-Driven Structural Testing

Jazz: A Tool for Demand-Driven Structural Testing Jazz: A Tool for Demand-Driven Structural Testing J. Misurda, J. A. Clause, J. L. Reed, P. Gandra, B. R. Childers, and M. L. Soffa Department of Computer Science University of Pittsburgh Pittsburgh, Pennsylvania

More information

Complex, concurrent software. Precision (no false positives) Find real bugs in real executions

Complex, concurrent software. Precision (no false positives) Find real bugs in real executions Harry Xu May 2012 Complex, concurrent software Precision (no false positives) Find real bugs in real executions Need to modify JVM (e.g., object layout, GC, or ISA-level code) Need to demonstrate realism

More information

Hardware Emulation and Virtual Machines

Hardware Emulation and Virtual Machines Hardware Emulation and Virtual Machines Overview Review of How Programs Run: Registers Execution Cycle Processor Emulation Types: Pure Translation Static Recompilation Dynamic Recompilation Direct Bytecode

More information

Dynamic Feedback: An Effective Technique for Adaptive Computing

Dynamic Feedback: An Effective Technique for Adaptive Computing Dynamic Feedback: An Effective Technique for Adaptive Computing Pedro Diniz and Martin Rinard Department of Computer Science Engineering I Building University of California, Santa Barbara Santa Barbara,

More information

Performance Profiling

Performance Profiling Performance Profiling Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Outline History Understanding Profiling Understanding Performance Understanding Performance

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Java On Steroids: Sun s High-Performance Java Implementation. History

Java On Steroids: Sun s High-Performance Java Implementation. History Java On Steroids: Sun s High-Performance Java Implementation Urs Hölzle Lars Bak Steffen Grarup Robert Griesemer Srdjan Mitrovic Sun Microsystems History First Java implementations: interpreters compact

More information

Design, Implementation, and Evaluation of a Compilation Server

Design, Implementation, and Evaluation of a Compilation Server Design, Implementation, and Evaluation of a Compilation Server Technical Report CU--978-04 HAN B. LEE University of Colorado AMER DIWAN University of Colorado and J. ELIOT B. MOSS University of Massachusetts

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management WHITE PAPER: ENTERPRISE AVAILABILITY Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management White Paper: Enterprise Availability Introduction to Adaptive

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

Last Class: Processes

Last Class: Processes Last Class: Processes A process is the unit of execution. Processes are represented as Process Control Blocks in the OS PCBs contain process state, scheduling and memory management information, etc A process

More information

Class Analysis for Testing of Polymorphism in Java Software

Class Analysis for Testing of Polymorphism in Java Software Class Analysis for Testing of Polymorphism in Java Software Atanas Rountev Ana Milanova Barbara G. Ryder Rutgers University, New Brunswick, NJ 08903, USA {rountev,milanova,ryder@cs.rutgers.edu Abstract

More information

Operating Systems Design Fall 2010 Exam 1 Review. Paul Krzyzanowski

Operating Systems Design Fall 2010 Exam 1 Review. Paul Krzyzanowski Operating Systems Design Fall 2010 Exam 1 Review Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 To a programmer, a system call looks just like a function call. Explain the difference in the underlying

More information

Enhanced Web Log Based Recommendation by Personalized Retrieval

Enhanced Web Log Based Recommendation by Personalized Retrieval Enhanced Web Log Based Recommendation by Personalized Retrieval Xueping Peng FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY UNIVERSITY OF TECHNOLOGY, SYDNEY A thesis submitted for the degree of Doctor

More information

Proceedings of the Third Virtual Machine Research and Technology Symposium

Proceedings of the Third Virtual Machine Research and Technology Symposium USENIX Association Proceedings of the Third Virtual Machine Research and Technology Symposium San Jose, CA, USA May 6 7, 2004 2004 by The USENIX Association All Rights Reserved For more information about

More information

The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices

The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices LINGLI ZHANG and CHANDRA KRINTZ University of California, Santa Barbara Java Virtual Machines (JVMs)

More information

MODEL-DRIVEN CODE OPTIMIZATION. Min Zhao. B.E. Computer Science and Engineering, Xi an Jiaotong University, P.R. China, 1996

MODEL-DRIVEN CODE OPTIMIZATION. Min Zhao. B.E. Computer Science and Engineering, Xi an Jiaotong University, P.R. China, 1996 MODEL-DRIVEN CODE OPTIMIZATION by Min Zhao B.E. Computer Science and Engineering, Xi an Jiaotong University, P.R. China, 1996 M.S. Computer Science, University of Pittsburgh, 2001 Submitted to the Graduate

More information

Pointer Analysis in the Presence of Dynamic Class Loading

Pointer Analysis in the Presence of Dynamic Class Loading Pointer Analysis in the Presence of Dynamic Class Loading Martin Hirzel, Amer Diwan University of Colorado at Boulder Michael Hind IBM T.J. Watson Research Center 1 Pointer analysis motivation Code a =

More information

Vertical Profiling: Understanding the Behavior of Object-Oriented Applications

Vertical Profiling: Understanding the Behavior of Object-Oriented Applications Vertical Profiling: Understanding the Behavior of Object-Oriented Applications Matthias Hauswirth, Amer Diwan University of Colorado at Boulder Peter F. Sweeney, Michael Hind IBM Thomas J. Watson Research

More information

ANALYZING THREADS FOR SHARED MEMORY CONSISTENCY BY ZEHRA NOMAN SURA

ANALYZING THREADS FOR SHARED MEMORY CONSISTENCY BY ZEHRA NOMAN SURA ANALYZING THREADS FOR SHARED MEMORY CONSISTENCY BY ZEHRA NOMAN SURA B.E., Nagpur University, 1998 M.S., University of Illinois at Urbana-Champaign, 2001 DISSERTATION Submitted in partial fulfillment of

More information

Software Speculative Multithreading for Java

Software Speculative Multithreading for Java Software Speculative Multithreading for Java Christopher J.F. Pickett and Clark Verbrugge School of Computer Science, McGill University {cpicke,clump}@sable.mcgill.ca Allan Kielstra IBM Toronto Lab kielstra@ca.ibm.com

More information

Enterprise Architect. User Guide Series. Profiling

Enterprise Architect. User Guide Series. Profiling Enterprise Architect User Guide Series Profiling Investigating application performance? The Sparx Systems Enterprise Architect Profiler finds the actions and their functions that are consuming the application,

More information

Enterprise Architect. User Guide Series. Profiling. Author: Sparx Systems. Date: 10/05/2018. Version: 1.0 CREATED WITH

Enterprise Architect. User Guide Series. Profiling. Author: Sparx Systems. Date: 10/05/2018. Version: 1.0 CREATED WITH Enterprise Architect User Guide Series Profiling Author: Sparx Systems Date: 10/05/2018 Version: 1.0 CREATED WITH Table of Contents Profiling 3 System Requirements 8 Getting Started 9 Call Graph 11 Stack

More information

Chapter 12. UML and Patterns. Copyright 2008 Pearson Addison-Wesley. All rights reserved

Chapter 12. UML and Patterns. Copyright 2008 Pearson Addison-Wesley. All rights reserved Chapter 12 UML and Patterns Copyright 2008 Pearson Addison-Wesley. All rights reserved Introduction to UML and Patterns UML and patterns are two software design tools that can be used within the context

More information

By Arjan Van De Ven, Senior Staff Software Engineer at Intel.

By Arjan Van De Ven, Senior Staff Software Engineer at Intel. Absolute Power By Arjan Van De Ven, Senior Staff Software Engineer at Intel. Abstract: Power consumption is a hot topic from laptop, to datacenter. Recently, the Linux kernel has made huge steps forward

More information

FORMULATION AND BENEFIT ANALYSIS OF OPTIMIZATION MODELS FOR NETWORK RECOVERY DESIGN

FORMULATION AND BENEFIT ANALYSIS OF OPTIMIZATION MODELS FOR NETWORK RECOVERY DESIGN FORMULATION AND BENEFIT ANALYSIS OF OPTIMIZATION MODELS FOR NETWORK RECOVERY DESIGN Approved by: Dr. Richard Barr Dr. Eli Olinick Dr. Marion Sobol Dr. Jerrell Stracener Dr. Stephen A. Szygenda FORMULATION

More information

Just-In-Time Compilation

Just-In-Time Compilation Just-In-Time Compilation Thiemo Bucciarelli Institute for Software Engineering and Programming Languages 18. Januar 2016 T. Bucciarelli 18. Januar 2016 1/25 Agenda Definitions Just-In-Time Compilation

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

Field Analysis. Last time Exploit encapsulation to improve memory system performance

Field Analysis. Last time Exploit encapsulation to improve memory system performance Field Analysis Last time Exploit encapsulation to improve memory system performance This time Exploit encapsulation to simplify analysis Two uses of field analysis Escape analysis Object inlining April

More information

MODEL-DRIVEN CODE OPTIMIZATION. Min Zhao. B.E. Computer Science and Engineering, Xi an Jiaotong University, P.R. China, 1996

MODEL-DRIVEN CODE OPTIMIZATION. Min Zhao. B.E. Computer Science and Engineering, Xi an Jiaotong University, P.R. China, 1996 MODEL-DRIVEN CODE OPTIMIZATION by Min Zhao B.E. Computer Science and Engineering, Xi an Jiaotong University, P.R. China, 1996 M.S. Computer Science, University of Pittsburgh, 2001 Submitted to the Graduate

More information

Fault Tolerant Java Virtual Machine. Roy Friedman and Alon Kama Technion Haifa, Israel

Fault Tolerant Java Virtual Machine. Roy Friedman and Alon Kama Technion Haifa, Israel Fault Tolerant Java Virtual Machine Roy Friedman and Alon Kama Technion Haifa, Israel Objective Create framework for transparent fault-tolerance Support legacy applications Intended for long-lived, highly

More information

Kasper Lund, Software engineer at Google. Crankshaft. Turbocharging the next generation of web applications

Kasper Lund, Software engineer at Google. Crankshaft. Turbocharging the next generation of web applications Kasper Lund, Software engineer at Google Crankshaft Turbocharging the next generation of web applications Overview Why did we introduce Crankshaft? Deciding when and what to optimize Type feedback and

More information

A Framework for Optimistic Program Optimization

A Framework for Optimistic Program Optimization A Framework for Optimistic Program Optimization by Igor Pechtchanski A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Computer Science

More information

Tag der mündlichen Prüfung: 03. Juni 2004 Dekan / Dekanin: Prof. Dr. Bernhard Steffen Gutachter / Gutachterinnen: Prof. Dr. Francky Catthoor, Prof. Dr

Tag der mündlichen Prüfung: 03. Juni 2004 Dekan / Dekanin: Prof. Dr. Bernhard Steffen Gutachter / Gutachterinnen: Prof. Dr. Francky Catthoor, Prof. Dr Source Code Optimization Techniques for Data Flow Dominated Embedded Software Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften der Universität Dortmund am Fachbereich Informatik

More information

A GENERIC SIMULATION OF COUNTING NETWORKS

A GENERIC SIMULATION OF COUNTING NETWORKS A GENERIC SIMULATION OF COUNTING NETWORKS By Eric Neil Klein A Thesis Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements for the Degree of

More information

JOVE. An Optimizing Compiler for Java. Allen Wirfs-Brock Instantiations Inc.

JOVE. An Optimizing Compiler for Java. Allen Wirfs-Brock Instantiations Inc. An Optimizing Compiler for Java Allen Wirfs-Brock Instantiations Inc. Object-Orient Languages Provide a Breakthrough in Programmer Productivity Reusable software components Higher level abstractions Yield

More information

N N Sudoku Solver. Sequential and Parallel Computing

N N Sudoku Solver. Sequential and Parallel Computing N N Sudoku Solver Sequential and Parallel Computing Abdulaziz Aljohani Computer Science. Rochester Institute of Technology, RIT Rochester, United States aaa4020@rit.edu Abstract 'Sudoku' is a logic-based

More information

Running class Timing on Java HotSpot VM, 1

Running class Timing on Java HotSpot VM, 1 Compiler construction 2009 Lecture 3. A first look at optimization: Peephole optimization. A simple example A Java class public class A { public static int f (int x) { int r = 3; int s = r + 5; return

More information

Native POSIX Thread Library (NPTL) CSE 506 Don Porter

Native POSIX Thread Library (NPTL) CSE 506 Don Porter Native POSIX Thread Library (NPTL) CSE 506 Don Porter Logical Diagram Binary Memory Threads Formats Allocators Today s Lecture Scheduling System Calls threads RCU File System Networking Sync User Kernel

More information

Intel Hyper-Threading technology

Intel Hyper-Threading technology Intel Hyper-Threading technology technology brief Abstract... 2 Introduction... 2 Hyper-Threading... 2 Need for the technology... 2 What is Hyper-Threading?... 3 Inside the technology... 3 Compatibility...

More information

ADAPTIVE VIDEO STREAMING FOR BANDWIDTH VARIATION WITH OPTIMUM QUALITY

ADAPTIVE VIDEO STREAMING FOR BANDWIDTH VARIATION WITH OPTIMUM QUALITY ADAPTIVE VIDEO STREAMING FOR BANDWIDTH VARIATION WITH OPTIMUM QUALITY Joseph Michael Wijayantha Medagama (08/8015) Thesis Submitted in Partial Fulfillment of the Requirements for the Degree Master of Science

More information

Ch 4 : CPU scheduling

Ch 4 : CPU scheduling Ch 4 : CPU scheduling It's the basis of multiprogramming operating systems. By switching the CPU among processes, the operating system can make the computer more productive In a single-processor system,

More information

IA-64 Compiler Technology

IA-64 Compiler Technology IA-64 Compiler Technology David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav (speaker), Carole Dulong Microcomputer Software Lab Page-1 Introduction IA-32 compiler optimizations Profile Guidance (PGOPTI)

More information

HotPy (2) Binary Compatible High Performance VM for Python. Mark Shannon

HotPy (2) Binary Compatible High Performance VM for Python. Mark Shannon HotPy (2) Binary Compatible High Performance VM for Python Mark Shannon Who am I? Mark Shannon PhD thesis on building VMs for dynamic languages During my PhD I developed: GVMT. A virtual machine tool kit

More information

Process Scheduling Part 2

Process Scheduling Part 2 Operating Systems and Computer Networks Process Scheduling Part 2 pascal.klein@uni-due.de Alexander Maxeiner, M.Sc. Faculty of Engineering Agenda Process Management Time Sharing Synchronization of Processes

More information

Using Cache Line Coloring to Perform Aggressive Procedure Inlining

Using Cache Line Coloring to Perform Aggressive Procedure Inlining Using Cache Line Coloring to Perform Aggressive Procedure Inlining Hakan Aydın David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115 {haydin,kaeli}@ece.neu.edu

More information

About the Authors... iii Introduction... xvii. Chapter 1: System Software... 1

About the Authors... iii Introduction... xvii. Chapter 1: System Software... 1 Table of Contents About the Authors... iii Introduction... xvii Chapter 1: System Software... 1 1.1 Concept of System Software... 2 Types of Software Programs... 2 Software Programs and the Computing Machine...

More information

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS An Undergraduate Research Scholars Thesis by DENISE IRVIN Submitted to the Undergraduate Research Scholars program at Texas

More information

Parallelizing SPECjbb2000 with Transactional Memory

Parallelizing SPECjbb2000 with Transactional Memory Parallelizing SPECjbb2000 with Transactional Memory JaeWoong Chung, Chi Cao Minh, Brian D. Carlstrom, Christos Kozyrakis Computer Systems Laboratory Stanford University {jwchung, caominh, bdc, kozyraki}@stanford.edu

More information

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that

More information

Process- Concept &Process Scheduling OPERATING SYSTEMS

Process- Concept &Process Scheduling OPERATING SYSTEMS OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne PROCESS MANAGEMENT Current day computer systems allow multiple

More information

Phases in Branch Targets of Java Programs

Phases in Branch Targets of Java Programs Phases in Branch Targets of Java Programs Technical Report CU-CS-983-04 ABSTRACT Matthias Hauswirth Computer Science University of Colorado Boulder, CO 80309 hauswirt@cs.colorado.edu Recent work on phase

More information

Operating Systems Unit 3

Operating Systems Unit 3 Unit 3 CPU Scheduling Algorithms Structure 3.1 Introduction Objectives 3.2 Basic Concepts of Scheduling. CPU-I/O Burst Cycle. CPU Scheduler. Preemptive/non preemptive scheduling. Dispatcher Scheduling

More information

OS Schedulers: Fair-Share Scheduling in the Windows Research Kernel (WRK) Version 1.0 June 2007

OS Schedulers: Fair-Share Scheduling in the Windows Research Kernel (WRK) Version 1.0 June 2007 Version 1.0 June 2007 Marty Humphrey Assistant Professor Department of Computer Science University of Virginia Charlottesville, VA 22904 The purpose of this experiment is to gain more experience with CPU

More information

Processes and Non-Preemptive Scheduling. Otto J. Anshus

Processes and Non-Preemptive Scheduling. Otto J. Anshus Processes and Non-Preemptive Scheduling Otto J. Anshus Threads Processes Processes Kernel An aside on concurrency Timing and sequence of events are key concurrency issues We will study classical OS concurrency

More information

Chapter 9. Software Testing

Chapter 9. Software Testing Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of

More information

Design Patterns for Real-Time Computer Music Systems

Design Patterns for Real-Time Computer Music Systems Design Patterns for Real-Time Computer Music Systems Roger B. Dannenberg and Ross Bencina 4 September 2005 This document contains a set of design patterns for real time systems, particularly for computer

More information

Sista: Improving Cog s JIT performance. Clément Béra

Sista: Improving Cog s JIT performance. Clément Béra Sista: Improving Cog s JIT performance Clément Béra Main people involved in Sista Eliot Miranda Over 30 years experience in Smalltalk VM Clément Béra 2 years engineer in the Pharo team Phd student starting

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information