
Draft

Debugging of Optimized Code through Comparison Checking

Clara Jaramillo, Rajiv Gupta and Mary Lou Soffa

Abstract

We present a new approach to the debugging of optimized code through comparison checking. In this scheme, both the unoptimized and optimized versions of an application execute, and the values they compute are compared to ensure that the behaviors of the two versions are the same. To determine which values should be compared and where the comparisons must take place, statement instances in the unoptimized code are mapped to statement instances in the optimized code. The mappings are derived automatically as optimizations are performed. Annotations for both versions of the code are developed from the mappings. Using the annotations, a driver checks, while the programs are executing, that both programs are producing the same values. If values differ, the user determines whether there is a bug in the unoptimized code. If so, a conventional debugger is used to debug the code. If the bug is in the optimized code, the user is told where in the code the problem occurred and which optimizations were involved in producing the error. The user can then turn off those offending optimizations and leave the other optimizations in place. This information is also helpful to the optimizer writer in debugging the optimizer. We implemented our checker, COP, and ran experiments which indicate that the approach is practical.

Keywords: code optimization, program transformation, comparison checking, debugging.

Supported in part by a grant from Hewlett Packard Labs to the University of Pittsburgh.

1 Introduction

Although code transformations are important in improving the performance of programs, an application programmer typically compiles a program during the development phases with the optimizer turned off. One reason for not using the optimizer is that since a program under development is expected to have bugs, and therefore to undergo changes and recompilation, the time spent on optimization is often wasted. An equally important reason is the lack of effective tools for debugging optimized programs. If an error is detected while debugging an optimized program, the user is uncertain as to whether the error was present in the original program or was introduced by the optimizer. Determining the cause of the error is hampered by the limitations of current techniques for debugging optimized code. For example, if the user wishes to observe the value of a variable at some program point while debugging the optimized program, the debugger may not be able to report this value, because the value requested by the user may not have been computed yet or may have been overwritten. Techniques that are able to recover values for reporting purposes work in limited situations and for a limited set of optimizations [15, 12, 6, 19, 10, 14, 16, 5, 18, 4, 9, 21, 3].

While it may be acceptable to turn off optimizations during the development of the application software, the optimizations should be turned on when the application is in production in order to gain the performance benefits provided by optimizations. However, when the application, apparently free of bugs, is optimized, its behavior may not be the same as that of the unoptimized program. In this situation the programmer is likely to assume that errors in the optimizer are responsible for the change in behavior, and thus the optimizer is turned off. However, information that reveals the cause of the differing behaviors of unoptimized and optimized code would prove useful. It is possible that the program contained an error which was not previously observed, and changes in the program due to optimization have unmasked the error. For example, optimizations change the data layout of a program, which may cause an uninitialized variable to be assigned different values in the unoptimized and optimized programs, causing them to behave differently [7]. Clearly, in this case the application program must be further debugged. If instead an error was introduced by the optimizer, it would be beneficial to know the statements and optimizations that were involved in the error. Using this information, the application programmer could turn off offending optimizations in the affected parts of the program. Thus, the error would be removed without sacrificing the benefit of correctly applied optimizations. Moreover, the same information could be used by the optimizer writer to debug the optimizer.

In this paper we present comparison checking, a new approach to the debugging of optimized code. In comparison checking, both optimized and unoptimized versions of a program are executed, and any deviations between the behaviors of the two programs are detected. If the outputs of both the optimized and unoptimized programs are the same and correct, the optimized program can be run with confidence. If the outputs are different, and if the user determines that the output of the unoptimized program is incorrect, the user can debug the unoptimized program using conventional debuggers.
On the other hand, if the output of the unoptimized program is correct but the behavior of the optimized program differs from that of the unoptimized program, then the application programmer is provided with the information necessary to turn off selected optimizations in parts of the program. In this manner the application can benefit from correctly applied optimizations without the application programmer ever having to directly debug the optimized code. The optimizer writer is also provided with valuable information that can be used to debug the optimizer.

Our comparison checking system, COP, is shown in Figure 1. In this system, the behaviors of the optimized and unoptimized programs are compared by checking that corresponding assignments of values to source level variables and results of corresponding branch predicates are the same throughout the execution of the optimized and unoptimized versions of the program. In addition, when assignments are made through arrays and pointers, we ensure that the addresses to which the values are assigned correspond to each other. All assignments to source level variables are compared, with the exception of values that are dead and hence never computed in the optimized code. The compiler generates annotations for the optimized and unoptimized programs that enable the comparisons of corresponding values and addresses to be made. The advantage of such a detailed comparison of the two program versions is that when a comparison fails, we can report the statement where the failure occurred and the optimizations that involved the statement. This information is valuable in determining the cause of the differing behaviors and hence in locating the bugs in the application program or the optimizer.

[Figure 1: The Comparison Checking System. The compiler produces both the unoptimized and the optimized program, which the checking system executes on the same input. If the unoptimized output is incorrect, the application programmer uses conventional means to debug the unoptimized program and modify it; if the unoptimized output is correct but comparisons are unsuccessful, the application programmer turns off selected optimizations in parts of the program, and the optimizer writer debugs the optimizer using information on the statements and optimizations related to the failed checks.]

The merits of a comparison checking system include:

- The user has the benefit of the full capabilities of a conventional debugger and does not need to deal with the optimized code.
- The optimized code is not changed, and thus no recompilation is required.
- The user can confidently use optimizers without concern for the correct semantics of the program or the correctness of the optimizations.
- Information about where an optimized program differs from the unoptimized version benefits both the user and the optimizer writer.
- A wide range of optimizations, including code reordering transformations, loop transformations, register allocation, and inlining, can be handled.

In this paper, we present the design of the comparison checking system, COP. We also implemented COP and present experimental results that demonstrate the practicality of the system.

2 Comparison Checking

The comparison checker executes the unoptimized and optimized programs and compares the values computed by the two programs to ensure that their behaviors are the same. To accomplish this task, three questions must be answered:

How should the programs be executed? One approach to comparison checking is to execute the optimized and unoptimized versions one at a time, save the values computed in a trace, and then compare the traces. To avoid the problem of generating long traces, we present a strategy that simultaneously executes the optimized and unoptimized programs and orchestrates their relative progress so that values can be checked on the fly as they are computed.

Which values must be compared? To correctly perform the checks, it is necessary to determine the correspondences between instances of statements in the unoptimized and optimized code. Code transformations can result in the addition, deletion, and reordering of statement instances. Using the knowledge of the changes made by the optimizations, a mapping between statement instances in the unoptimized and optimized code is established.

When should the comparisons be made? By analyzing the mappings between statement instances, the optimized and unoptimized programs are annotated with directives that guide the checking of values during execution. These annotations indicate which values must be compared and when the comparisons must be made. Since the values to be compared are not computed in the same order by the unoptimized and optimized code, some values may need to be temporarily saved until they can be compared. The values are saved in a memory pool, and annotations direct the checker as to when the values are to be saved in the pool and when they can safely be discarded.

While the basic steps of the above strategy are generally applicable to a wide range of optimizations, from simple code reordering transformations to complex loop transformations, the complexity of the steps depends upon the nature of the optimizations. In the remainder of this section, we focus on a class of transformations that satisfy the following constraints:

Control flow structure constraint: The branching structure of the program is not altered by the optimizations. While statements may be inserted along edges of the control flow graph or existing statements may be removed by the optimizations, no branches can be added or deleted by the optimizations.

Instance reordering constraint: The execution of instances of a statement in the unoptimized code cannot be arbitrarily reordered by the optimizations. If a statement lies within the same loop nest before and after optimization, then the order in which its instances are executed by the unoptimized and optimized code is the same. If a statement is inside a loop nest in one program version and outside the loop nest in the other, then either all instances within the loop nest correspond to the single instance outside the loop nest (e.g., loop invariant code motion) or a specific single instance within the loop nest corresponds to the single instance outside the loop nest (e.g., partial dead code elimination, PDE). Finally, if the statement is in different loop nests in the two programs, then all values computed by all instances in the two programs must be the same.

Extensions to handle transformations that do not satisfy these restrictions are discussed in a later section. We assume that code improving transformations are done at the intermediate code level, and thus our checker is language independent. We also assume a flow graph representation of the program. In the remainder of this section we discuss details of the execution strategy, mappings, and annotations.

2.1 Execution Strategy

The execution of the unoptimized code drives the execution of the optimized code in COP. A statement in the unoptimized code is executed, and using the annotations, the checker determines whether the value computed can be checked at this point. If so, the optimized program executes until the corresponding value is computed, at which time the check is performed on the two values. While the optimized program executes, any values that are computed "early" (i.e., before the corresponding value in the unoptimized code has been computed) are saved in the memory pool, as directed by the annotations. If the annotations indicate that the checking of the value computed by the unoptimized program cannot be performed at this point, then the value is saved for future checking. The system continues to alternate between executions of the unoptimized and optimized programs. Annotations also indicate when values that were saved for future checking can finally be checked and when the values can be removed from the memory pool. Any statement instances that are eliminated in the optimized code are not checked.

Consider the example program segment in Figure 2, and assume that all the statements shown are source level statements. The unoptimized code is given in Figure 2(a) and the optimized code is given in Figure 2(b). The mappings between statements of the unoptimized and optimized representations are shown by dotted lines, and annotations are displayed in dotted boxes.
In the example, the following optimizations have been applied:

- constant propagation: the constant 1 in S1 is propagated, as shown in S2' and S9'.
- loop invariant code motion: S3 is moved out of the loop.
- (partial) redundancy elimination (PRE): S7 is partially redundant with S6.
- copy propagation: the copy M in S4 is propagated, as shown by S5' and S8'.
- dead code elimination: S1 and S4 are dead after constant and copy propagation.
- (partial) dead code elimination (PDE): S8 is partially dead along all paths in the loop except the last iteration.
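For concreteness, the two versions in Figure 2 correspond roughly to the following C fragments (a sketch reconstructed from the figure; the declarations, the source of X, and the use of E are assumed):

    /* Sketch of the Figure 2 example; surrounding context is assumed. */
    int unoptimized(int X) {
        int A, T1, M, B, C, D, E;
        A = 1;                /* S1: dead once the constant is propagated */
        T1 = A;               /* S2                                       */
        do {
            M = X * X;        /* S3: loop invariant                       */
            B = M;            /* S4: dead once the copy is propagated     */
            if (B > T1)       /* S5                                       */
                C = T1 + X;   /* S6                                       */
            C = T1 + X;       /* S7: partially redundant with S6          */
            D = B + C;        /* S8: partially dead, needed only on the
                                     last iteration                       */
            T1 = T1 + A;      /* S9                                       */
        } while (T1 < 100);   /* S10                                      */
        E = D * 2;            /* S11                                      */
        return E;
    }

    int optimized(int X) {
        int T1, M, C, D, E;
        T1 = 1;               /* S2': constant propagated                 */
        M = X * X;            /* S3': hoisted by loop invariant motion    */
        do {
            if (M > T1)       /* S5': copy propagated                     */
                C = T1 + X;   /* S6'                                      */
            else
                C = T1 + X;   /* S7': PRE's copy on the other path        */
            T1 = T1 + 1;      /* S9': constant propagated                 */
        } while (T1 < 100);   /* S10'                                     */
        D = M + C;            /* S8': sunk below the loop by PDE          */
        E = D * 2;            /* S11'                                     */
        return E;
    }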

[Figure 2: Mappings and Annotations for Unoptimized and Optimized Code. The two versions appear side by side; dotted lines give the statement mappings (e.g., all instances of S3 map to the one instance of S3', and the last instance of S8 maps to the one instance of S8'), and dotted boxes give the Check, Save, Delay, Checkable, and Delete annotations discussed below.]

Given the mappings and annotations, we now describe the operation of the checker on the example in Figure 2. Techniques to determine the mappings and annotations are discussed in subsequent sections. The unoptimized program starts to execute with S1. Since S1 is eliminated from the optimized program, the unoptimized program continues to execute. After S2 executes, the checker determines that the value computed can be checked at this point, and so the optimized program executes until Check S2 is encountered, which occurs at S2'. The values computed by S2 and S2' are compared. The unoptimized program resumes execution and the loop iteration at S3 begins. S3 executes, and again the optimized program executes until the value computed by S3 can be compared, as indicated by the annotation Check S3. However, when S3' executes, the annotation Save S3' is encountered, and consequently the value computed by S3' is stored in the memory pool. This is done because a number of comparisons must be performed using this value. The next check is the control predicate found at S5. Assume that the T branch is taken. S6 executes and its value is checked against the value computed by S6'. The value of S6' is also saved in the memory pool because another comparison will need this value. S7 in the unoptimized code executes and is compared with the value saved from S6'. The value computed by S6' (or S7' if the F branch was taken) is now deleted from the pool, as directed by the annotation Delete S6',S7'. S8 executes and the checker finds the Delay S8 annotation, indicating that the check on S8 cannot be performed at this point, and so the value computed by S8 is stored in the memory pool. S9 and S10 are executed and compared. Assume that only one iteration of the loop is performed. As the unoptimized program continues to execute, the checker finds the Checkable S8 annotation. The checker knows that S8 can now be checked, and the optimized code resumes. The checker immediately finds the Delete S3' annotation, so the value computed by S3' can now be deleted from the pool, as it will never be needed again. S8' of the optimized code executes and the values of S8 and S8' are compared. Now, back in the unoptimized code, the value computed by S8 can be deleted, as directed by the Delete S8 annotation. S11 is treated similarly.

2.2 Mappings

The mappings capture the correspondences between statement instances in the unoptimized program and those in the optimized program, and they are created as program transformations are applied. Transformations can be applied in any order and as many times as desired. A mapping must indicate which statements and, in particular, which instances of the statements must produce the same value in the unoptimized and optimized programs. Thus, a mapping has two components: an association of a statement in the unoptimized code with a statement in the optimized code, and an association of instances of the statements.
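One concrete way to picture a mapping entry is as a pair of statement ids plus an instance descriptor for each side. The following is an illustrative sketch, not COP's actual data structure:

    /* Illustrative mapping entry; names and representation are assumed. */
    typedef enum { ONE, ALL, LAST } InstanceKind;

    typedef struct {
        int unopt_stmt;          /* statement id in the unoptimized code       */
        int opt_stmt;            /* corresponding id in the optimized code     */
        InstanceKind unopt_inst; /* which instance(s) on the unoptimized side  */
        InstanceKind opt_inst;   /* which instance(s) on the optimized side    */
    } Mapping;

    /* Example: hoisting S3 out of the loop yields { S3, S3', ALL, ONE },
       i.e., all loop instances of S3 must equal the value of the single
       hoisted instance S3'. */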

For the class of transformations that we are considering in this section, we refer to a statement's instances as follows. If a statement is not enclosed in a loop, then we refer to the one instance of the statement. If the statement is enclosed in a loop nest (possibly just one loop), we refer either to any one instance of the statement in the loop nest, to all instances of the statement in the nest, or to the last instance of the statement in the loop nest. A mapping is defined from statement instances in the unoptimized code to statement instances in the optimized code, e.g., one → one, all → one, etc.

Consider the case when the corresponding statements from the unoptimized and optimized code are inside the same loop nest and their instances are referred to as one. The number of times the two statements are executed is the same, and the values computed by corresponding instances must be the same. If the statement on the optimized side is moved inside a loop, then its instances are referred to as all or last. The values computed by all instances or the last instance in the optimized code should be equal to the value computed by the one instance in the unoptimized code.

Now assume that a statement from the unoptimized code inside a loop nest is referred to as all. In this case all values computed by the statement in the unoptimized code during a single complete execution of the loop nest must be equal. This value must equal one or more values computed by the corresponding statement in the optimized code. If the corresponding statement instance in the optimized code is immediately outside the loop nest and therefore referred to as one, then this is the value that must be compared. If the corresponding statement is inside a different loop nest, then the values of all instances or the last instance in a single execution of this loop nest must be compared to the corresponding value from the unoptimized code.

Finally, assume that the statement in the unoptimized code is inside a loop nest and its last instance is of interest. If the corresponding statement instance in the optimized code is immediately outside the loop nest and therefore referred to as one, then the last value from the execution of the loop nest in the unoptimized code should equal the corresponding one value computed in the optimized code. If instead the statement is inside a different loop nest in the optimized code, then the last or all values computed by the statement in a single execution of this loop nest must be compared with the last value from the unoptimized code.

We determine the mappings for individual transformations by using the semantics of those transformations. From the mappings of individual transformations, the mapping for any series of transformations can be easily determined. The optimized program initially starts as an identical copy of the unoptimized program, with a one → one mapping between statements in the two programs. As optimizations are applied, the mappings are updated according to the mappings of the individual transformations. As code reordering transformations are applied, the mappings are changed to reflect the effects of the transformation on a statement and its instances. Code transformations that do not involve code motion into or out of loops either do not change the mapping, delete the mapping, or require copies of the mapping to be placed on inserted statements. Moving statements into and out of loops causes the instance associations to change, as the sketch below illustrates and the following paragraph details.
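Using the illustrative Mapping sketch above, these updates amount to a few transition rules (again a sketch; the precise rules are stated in the next paragraph):

    /* How a mapping's instance kinds change when the optimized copy of a
       statement is moved across a loop boundary (illustrative only). */
    void moved_above_loop(Mapping *m) {      /* e.g., loop invariant code motion */
        if (m->unopt_inst == ONE) m->unopt_inst = ALL;   /* one -> all  */
    }
    void moved_below_loop(Mapping *m) {      /* e.g., partial dead code elim.    */
        if (m->unopt_inst == ONE) m->unopt_inst = LAST;  /* one -> last */
    }
    void moved_into_loop(Mapping *m) {       /* e.g., code scheduling            */
        if (m->opt_inst == ONE) m->opt_inst = ALL;       /* one -> all; all stays */
    }

    /* Figure 3's sequence: {ONE,ONE} -below-> {LAST,ONE} -below-> {LAST,ONE}
       -into-> {LAST,ALL}, i.e., one -> one becomes last -> all. */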
When moving a statement out of a loop to above the loop, a one association changes to all; it changes to last when moving a statement out of a loop to below the loop. If the instance association was already all or last, it remains as it was. When moving into loops, the association changes from one to all. When moving a statement with an all association into another loop, the instance association is not changed.

Consider the example in Figure 3, which shows a series of transformations and the way the instance associations change. In the first graph, assume the statement x = a + b is partially dead and is moved out of the inner loop, as shown in the second graph. The original association of one → one is changed to last → one. Assuming that the statement x = a + b is still partially dead, it is moved out of the outer loop, as shown in the third graph, and the mapping remains last → one. Now assume that the statement x = a + b is moved into the lower loop due to code scheduling. The last → one mapping changes to a last → all mapping. Figure 3(b) shows the final mapping of the statement after all three transformations have been applied. The instance association has changed from the initial one → one to the final association last → all. This is semantically correct, as the last value computed by the inner loop must match the values computed by all instances of the bottom loop.

[Figure 3: Combinations of Transformations. (a) Applying PDE and Code Scheduling. (b) Final Mapping.]

2.3 Annotations

Code annotations are derived from the mappings after all of the code transformations have been applied. Code annotations guide the comparison checking of values computed by corresponding statement instances from the optimized and unoptimized code. These annotations (1) identify program points where comparison checks can be made, (2) indicate whether values should be saved in the memory pool so that they will be available when checks are performed, and (3) indicate when a value currently residing in the memory pool can be discarded. Each annotation is associated with a statement or program point in either the optimized code or the unoptimized code. In all cases, once the statement or program point is reached, the actions associated with the annotation are executed by the checker. Four different types of annotations are needed to implement our comparison checking strategy.

Check S_uopt annotation: This annotation is associated with a statement or program point in the optimized code to indicate that a check of a value from the unoptimized program is to be performed. The corresponding value with which it must be compared is either the result of the most recently executed statement in the optimized code or a value in the memory pool. The positions of checks are determined as follows. Given the position of the unoptimized statement in the flow graph, if the corresponding statement in the optimized code is at the same or a later point in the flow graph, then the check annotation is associated with the statement in the optimized code. On the other hand, if the statement is executed by the optimized code at an earlier point, then the check annotation is associated with the program point in the optimized code that represents the original position of the statement in the unoptimized code. In this case the value computed by the statement in the optimized code is in the memory pool. In Figure 2, since the corresponding predicates S5 and S5' are at the same positions in the flow graph, the annotation Check S5 is associated with S5'. Since S7 has been moved to an earlier point S7', the annotation Check S7 is associated with its original position in the optimized code. Finally, since the check for statement S8 must be delayed until the point after the loop, the annotation Check S8 is introduced at this point in the optimized code.

Save S_opt annotation: If a value computed by a statement S_opt in the optimized code cannot be immediately compared with the corresponding value computed by the unoptimized code, then the value of S_opt must be saved in the memory pool. In some situations a value computed by S_opt is to be compared with multiple values computed by the unoptimized code, and therefore it must be saved until all those values have been computed and compared. The annotation Save S_opt is associated with S_opt in the optimized code to ensure that the value is saved. In Figure 2 the statement S3 in the unoptimized code, which is moved out of the loop by invariant code motion, corresponds to statement S3' in the optimized code. The value computed by S3' cannot be immediately compared with the corresponding values computed by S3 in the unoptimized code, since S3' is executed prior to the execution of S3. Thus, the annotation Save S3' is associated with S3'.

Delay S_uopt and Checkable S_uopt annotations: If the value computed by the execution of a statement S_uopt in the unoptimized code cannot be immediately compared with the corresponding value in the optimized code, because the correspondence between the values cannot yet be established, then the value of S_uopt must be saved in the memory pool. The annotation Delay S_uopt is associated with S_uopt to indicate that the checking of the value computed by S_uopt should be delayed until the correspondence can be established, and that the value must be saved in the memory pool until then. The point in the unoptimized code at which the checking can finally be performed is marked with the annotation Checkable S_uopt.
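As a toy illustration of how a checker might act on these annotations, consider a pool keyed only by statement id (COP's actual pool must also distinguish instances, and the Delete annotation used here is defined below; all names are hypothetical):

    #include <stdio.h>

    #define MAXSTMT 128
    static long pool[MAXSTMT];     /* one slot per statement id (toy keying) */
    static int  present[MAXSTMT];

    /* Save S_opt / Delay S_uopt: park a value until it can be compared. */
    static void save(int s, long v) { pool[s] = v; present[s] = 1; }

    /* Delete S: the value has served all its checks and can be discarded. */
    static void discard(int s) { present[s] = 0; }

    /* Check S: compare the unoptimized value against the pooled value,
       or against the most recently computed optimized value. */
    static void check(int s, long unopt_v, long most_recent_opt_v) {
        long v = present[s] ? pool[s] : most_recent_opt_v;
        if (unopt_v != v)
            printf("comparison failed at S%d: %ld != %ld\n", s, unopt_v, v);
    }

    int main(void) {
        save(3, 25);        /* Save S3': X*X with X == 5, computed early    */
        check(3, 25, 0);    /* Check S3: a loop instance of S3 matches      */
        discard(3);         /* Delete S3': after the loop, no longer needed */
        return 0;
    }

Here save serves both Save and Delay, since in both cases a value waits in the pool for a later comparison.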

In some situations the correspondence between the statement instances cannot be established unless the execution of the unoptimized code has advanced further. Thus, the checking of the value computed by the unoptimized code must be delayed. In Figure 2, statement S8 inside the loop in the unoptimized code is moved after the loop in the optimized code by the PDE optimization. In this situation only the value computed by statement S8 during the last iteration of the loop is to be compared with the value computed by S8' in the optimized code. However, we can determine that an execution of S8 corresponds to the last loop iteration only when the unoptimized code exits the loop. Therefore the checking of S8's value is delayed.

There is another situation in which a check is delayed for reasons of efficiency. Consider the example in Figure 4(a), in which the computation of x's value is moved from before the loop to after the loop. In this case, after x has been computed by the unoptimized code, the execution of the optimized code is advanced to the point after the loop and the value of x is checked. However, all values of y that are computed inside the loop would have to be saved, resulting in a potentially large memory pool. To avoid the creation of a large pool, we can instead delay the checking of the value of x until after the loop, as shown in Figure 4(b).

[Figure 4: Types of Annotations. (a) Using a save annotation. (b) Using a delay annotation for efficiency.]

Delete S annotation: This annotation is associated with a statement or program point in the optimized/unoptimized code to indicate that the value computed by statement S in the optimized/unoptimized code can now be discarded. A value computed by the unoptimized/optimized code that is placed in the pool is removed by a delete annotation in the unoptimized/optimized code. Since a value may be involved in multiple checks, a delete annotation must be introduced at a point where we are certain that all relevant checks have been performed, and it is therefore safe to discard the value. In Figure 2 the annotation Delete S3' is introduced after the loop in the optimized code, because at that point we are certain that all values computed by statement S3 in the unoptimized code have been compared with the corresponding value computed by S3' in the optimized code.

The algorithms for introducing annotations require control flow analysis to locate the positions at which the annotations must be introduced. We omit the details of these algorithms from this abstract due to space limitations; the algorithms will be included in the completed paper.

3 Experimental Results

We implemented COP to test our algorithms for instruction mapping, annotation placement, and checking, and performed experiments to assess the practicality of COP. Lcc [11] was used as the compiler for the application program and was extended to include a set of optimizations, namely loop invariant code motion, dead code elimination, PRE, copy propagation, and constant propagation and folding. As a program is optimized, the mappings are updated. Besides generating target code, lcc was extended to generate a file containing breakpoint information and annotations that are derived from the optimization mappings.
Thus, compilation and optimization of the application program produce the target code for both the unoptimized and the optimized program, together with auxiliary files containing breakpointing information and annotations for both versions. These auxiliary files are used by the checker. Breakpoints are generated whenever the value of a source level assignment or a predicate is computed, and whenever array and pointer addresses are computed. Breakpoints are also generated to save base addresses for dynamically allocated storage of structures (e.g., malloc, free, etc.). We compare array addresses and pointer addresses by comparing their offsets from the closest base addresses collected by the checker. Breakpointing is implemented using fast breakpoints [17].

Two versions of COP were implemented. In the first version, traces of the unoptimized and optimized programs were collected, and the comparison checks were performed on the traces. Since traces can be arbitrarily large, our second version instead performs on-the-fly checking, carrying out the comparisons during the execution of both programs. The unoptimized program, optimized program, and checker can execute on the same machine or on different machines, and messages are sent between the programs and the checker. A buffer is used to reduce the number of messages that are sent between the executing programs and the checker. In our experiments we ran COP on an HP 712/100 and the unoptimized and optimized programs on separate SPARC 5 workstations. Messages were passed through sockets. We ran some of the integer Spec95 benchmarks as well as some smaller test programs. (Note: we are continuing to run more programs and will add the results to the final paper.) Although the benchmarks did not include floating point numbers, these can be handled by our system by allowing for inexact equality, that is, by allowing two floating point numbers to differ by a certain small delta [2].

Table 1 shows the total number of times various optimizations were applied by our optimizer.

Table 1: Test Program Optimization Statistics

                          yacc    wc   8q.c  sort.c  124.m88ksim  130.li
    Loop invariants          9     0      6       0           86       8
    Copies propagated      146    41     24      11         4637     572
    Constants propagated     2     9      8       1          184       9
    Dead statements        281    64     48      20         7919     991
    PRE expressions        135    32     29      14         4653     562

Table 2 shows the cpu execution times of the unoptimized and optimized programs, with and without annotations.

Table 2: Execution Times (minutes:seconds)

    Program      Source length  Unoptimized  Unoptimized      Optimized  Optimized
                 (lines)        (cpu)        annotated (cpu)  (cpu)      annotated (cpu)
    yacc         591            00:01.27     00:15.55         00:01.28   00:21.48
    wc           338            00:01.15     00:17.56         00:01.04   00:26.86
    8q           39             00:00.04     00:00.19         00:00.03   00:00.23
    sort         65             00:00.21     00:00.36         00:00.25   00:00.49
    124.m88ksim  17939          00:31.48     06:48.52         00:31.50   10:06.10
    130.li       6916           01:04.04     14:45.18         01:03.78   20:35.76

On average, the annotations slowed down the execution of the unoptimized programs by a factor of 10 and that of the optimized programs by a factor of 15. The optimized program experiences greater overhead than the unoptimized program because more annotations are added to the optimized program. Although the optimizations were frequently applied, no significant reduction in execution time was observed. This is because we have not yet incorporated a global register allocator into our compiler; at present, the introduction of temporaries during PRE actually slows down program execution.

The response time of the checker depends greatly upon the lengths of the execution runs of the programs. For small programs, comparison checking took from a few seconds (7 seconds for sort) to a few minutes (12 minutes for wc). Both value and address comparisons were performed in this experiment. For the two Spec95 benchmarks, the comparison checker took several hours to execute (3 hours for 124.m88ksim and 6 hours for 130.li). These times are clearly acceptable if comparison checking is performed off-line. We found that the response time of the checker is bounded by the speed of the network, which was 10 Mbits of data per second for our experiments.
A faster network would considerably lower these response times. We also measured the memory pool size during our experiments and found it to be quite small. A maximum pool size of 90 was observed, during the execution of the 124.m88ksim benchmark.

4 Extensions

Although in the previous sections we described mappings, annotations, and an implementation that considered a certain class of code transformations, our basic approach is general and can be extended to handle more complex transformations. In this section we describe the extensions that allow inlining, loop transformations, and checking in the presence of register allocation.

Inlining: Function inlining replaces calls to a function in the unoptimized code by bodies of the function in the optimized code, and each of the inlined bodies may be optimized differently. Therefore, for each call site, a separate mapping is maintained between the statements in the function in the unoptimized code and the inlined copy in the optimized code. By analyzing the mappings corresponding to each call site, a set of annotations is computed. At run time, when the function is executed, the checker must select and follow the appropriate set of annotations, using knowledge of the call site encountered during program execution.

Loop transformations: Our approach can also be extended to allow loop transformations such as loop reversal, distribution, fusion, interchange, etc. The statement instances must be identified more precisely by the mappings for these transformations. It is no longer sufficient to refer to the instances as one, all, or last. Instead, we must refer to the instances by the iteration numbers (or formulas) during which they are executed. For example, consider a loop whose control variable i takes the values 1 through 10 and which contains an assignment to A[i]. After loop reversal is applied to the loop, the mapping (1, 10) → (10, 1) can be used to express the relationship between the instances of the assignment to A[i] in the unoptimized and optimized code. The annotations and checks can then be performed correctly according to this mapping. If a loop nest is involved in the transformation (e.g., loop interchange), then the mapping will be multidimensional.

Register allocation: We can handle code in which values are stored in registers. In terms of the machine code, the mappings would exist between instructions whose resulting values are to be compared. By examining an instruction, we can determine whether its result is stored in a register or a memory location, and the value can be retrieved appropriately. If the value cannot yet be compared, it is saved in the pool.

5 Related Work

The problem of debugging optimized code has long been recognized, with most of the previous work focusing on the development of debugging tools for optimized code [15, 12, 22, 6, 19, 10, 14, 16, 5, 18, 4, 8, 9, 21, 3]. In these approaches, limitations have been placed on the debugging system by restricting the type or placement of optimizations, modifying the optimized program, or inhibiting debugging capabilities. These approaches also vary in their handling of the code location and data value problems introduced by code optimizations, and none can handle all of the problems. Unfortunately, these techniques have not found their way into production environments, and debugging of optimized code still remains a problem. The goal of our system is not to build a debugger for optimized code but a comparison checker for optimized code. In our approach, the user can still use the conventional debuggers for unoptimized code that are currently in use.
The work most closely related to our approach is Guard, a relative debugger that was not, however, designed to debug optimized programs [20, 2, 1]. Using Guard, users can compare the execution of one program, the reference program, with the execution of another program, the development version. Guard requires the user to formulate assertions about the key data structures in both versions, specifying the locations at which the data structures should be identical. The relative debugger is then responsible for managing the execution of the two programs and reporting any differences in values. The technique does not require any modifications to user programs and can perform comparisons on the fly. The important difference between Guard and COP is that in Guard the user essentially has to supply all of the mappings and annotations by hand, while in COP this is done automatically. Thus, using COP, the optimized program is transparent to the user. We are also able to check the entire program, which would be difficult in Guard, since that would require the user to supply all of the mappings. In COP, we can easily restrict checking to certain regions or statements, as Guard does. We can also report the particular optimizations that were involved in producing erroneous behavior.

The concept of a bisection debugging model, a high level approach that also has as its goal the identification of semantic differences between two versions of the same program, one of which is assumed to be correct, was recently presented [13]. The bisection debugger attempts to identify the earliest point where the two versions diverge. However, in order to handle the debugging of optimized code, all data value problems have to be solved at all breakpoints.

References

[1] Abramson, D. A., Foster, I., Michalakes, J., and Sosic, R. Relative Debugging and its Application to the Development of Large Numerical Models. In Proceedings of IEEE Supercomputing 1995, December 1995.

[2] Abramson, D., Foster, I., Michalakes, J., and Sosic, R. A New Methodology for Debugging Scientific Applications. Communications of the ACM, 39(11):69-77, November 1996.

[3] Adl-Tabatabai, A. Source-Level Debugging of Globally Optimized Code. PhD dissertation, Carnegie Mellon University, 1996. Technical Report CMU-CS-96-133.

[4] Adl-Tabatabai, A., and Gross, T. Evicted Variables and the Interaction of Global Register Allocation and Symbolic Debugging. In Proceedings of the 20th POPL Conference, pages 371-383, January 1993.

[5] Brooks, G., Hansen, G. J., and Simmons, S. A New Approach to Debugging Optimized Code. In Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 1-11, June 1992.

[6] Chase, B., and Hood, R. Selective Interpretation as a Technique for Debugging Computationally Intensive Programs. In ACM SIGPLAN '87 Symposium on Interpreters and Interpretive Techniques, pages 113-124, June 1987.

[7] Copperman, M. Debugging Optimized Code Without Being Misled. Technical Report 92-01, Board of Studies in Computer and Information Sciences, University of California at Santa Cruz, May 1992.

[8] Copperman, M., and McDowell, C. E. Detecting Unexpected Data Values in Optimized Code. Technical Report 90-56, Board of Studies in Computer and Information Sciences, University of California at Santa Cruz, October 1990.

[9] Copperman, M. Debugging Optimized Code Without Being Misled. ACM Transactions on Programming Languages and Systems, 16(3):387-427, 1994.

[10] Coutant, D. S., Meloy, S., and Ruscetta, M. A Practical Approach to Source-Level Debugging of Globally Optimized Code. In Proceedings of the ACM SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 125-134, June 1988.

[11] Fraser, C., and Hanson, D. A Retargetable C Compiler: Design and Implementation. Benjamin/Cummings, 1995.

[12] Fritzson, P. A Systematic Approach to Advanced Debugging through Incremental Compilation. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on High-Level Debugging, pages 130-139, 1983.

[13] Gross, T. Bisection Debugging. In Proceedings of the AADEBUG '97 Workshop, pages 185-191, May 1997.

[14] Gupta, R. Debugging Code Reorganized by a Trace Scheduling Compiler. Structured Programming, 11:141-150, 1990.

[15] Hennessy, J. Symbolic Debugging of Optimized Code. ACM Transactions on Programming Languages and Systems, 4(3):323-344, July 1982.

[16] Holzle, U., Chambers, C., and Ungar, D. Debugging Optimized Code with Dynamic Deoptimization. In Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 32-43, June 1992.

[17] Kessler, P. Fast Breakpoints: Design and Implementation. In Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pages 78-84, 1990.

[18] Pineo, P. P., and Soffa, M. L. A Practical Approach to the Symbolic Debugging of Parallelized Code. In Proceedings of the International Conference on Compiler Construction, pages 357-373, April 1994.

[19] Pollock, L. L., and Soffa, M. L. High-Level Debugging with the Aid of an Incremental Optimizer. In 21st Annual Hawaii International Conference on System Sciences, volume 2, pages 524-531, January 1988.

[20] Sosic, R., and Abramson, D. A. Guard: A Relative Debugger. Software Practice and Experience, February 1997.

[21] Wismueller, R. Debugging of Globally Optimized Programs Using Data Flow Analysis. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, pages 278-289, June 1994.

[22] Zellweger, P. T. An Interactive High-Level Debugger for Control-Flow Optimized Programs. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on High-Level Debugging, pages 159-171, 1983.