Human Aspects of Computing

Programmers Use Slices When Debugging

Mark Weiser, University of Maryland
Henry Ledgard, Editor

Computer programmers break apart large programs into smaller coherent pieces. Each of these pieces: functions, subroutines, modules, or abstract datatypes, is usually a contiguous piece of program text. The experiment reported here shows that programmers also routinely break programs into one kind of coherent piece which is not contiguous. When debugging unfamiliar programs, programmers use program pieces called slices, which are sets of statements related by their flow of data. The statements in a slice are not necessarily textually contiguous, but may be scattered through a program.

CR Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging--debugging aids; H.1.2 [Models and Principles]: User/Machine Systems--human information processing; D.2.7 [Software Engineering]: Distribution and Maintenance--corrections, enhancement, restructuring
General Terms: Experimentation, Human Factors, Languages
Additional Key Words and Phrases: program decomposition, slice

Introduction

Experts differ from novices in their processing of information. This difference has been studied in chess [2, 4], physics [3, 10], and computer programming [12, 16]. An expert in physics problem-solving encodes and processes physics problems in terms of laws such as conservation of energy or Newton's second law. A chess expert processes only reasonable positions when thinking about a game. An expert computer programmer encodes and processes information semantically, ignoring programming language syntactic details [17].

This work was supported in part by the Air Force Office of Scientific Research Grant no. F4962-8-C-1, by National Science Foundation Grant no. MCS-8-18294, and by a grant from the General Research Board of the University of Maryland. Computer time was provided in part by the Computer Science Center of the University of Maryland.
Author's Present Address: Mark Weiser, Computer Science Department, University of Maryland, College Park, MD 20742.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1982 ACM 0001-0782/82/0700-0446 $00.75.

How do expert programmers encode and process information for program debugging? Gould [7] reports that many programmers start debugging by carefully reading the faulty program from top to bottom, without ever bothering to look closely at the erroneous program output. Dijkstra [5] and others have proposed that debugging time could be shortened by rigorous reasoning about a program's correctness. However, perhaps the most basic method of debugging is to start at the point in the program where an error first becomes manifest, and then proceed to reason about the sequence of events (as verified by the program text) that could have led to that error. Since this reasoning generally moves through the program's flow-of-control backwards (compared to its ordinary execution sequence), this debugging strategy is called working backwards. Gould [7] and Lukey [11] report instances of programmers working backwards from an error's appearance, attempting to locate its source. Supporting this, Sime, Green, and Guest [20] report that debugging is better aided by program constructs describing program state than by the usual program constructs describing flow-of-control. A flow-of-control construct (such as ELSE) can be understood only in context (with its accompanying IF) while program state constructs (Sime, Green, and Guest use ELSE (assertion)) have meaning in isolation and hence are more useful while working backwards.
Zelkowitz [27] reports on the efficacy of an interactive debugger capable of backwards execution. Less rigorously, programmers generally accept working backwards as an important debugging method [15], but there has been little investigation of the working backwards process or its advantages for the programmer. The results reported here clarify this process by showing that while working backwards, programmers construct in their minds a specific kind of abstract representation of the program being debugged.

Communications of the ACM, July 1982, Volume 25, Number 7

Program Slicing

Tracing backwards from a particular variable in a particular statement to identify all possible sources of influence on the value of that variable often reveals that many statements in a program have no influence. The process of stripping a program of statements without influence on a given variable at a given statement is called program slicing. A brief summary of automatic program slicing follows. More details may be found in [21, 22, 23]. Proofs of many of the assertions below are in [24]. Example slices are shown in Fig. 1.

Definition: An elementary slicing criterion of a program P is a tuple (i, V), where i denotes a specific statement in P and V is a subset of variables in P. An elementary slicing criterion determines a projection function from a program's state trajectory, in which
only the value of variables in V just prior to the execution of i are preserved.

Fig. 1. A Program and Some Program Slices.

The original program:

 1  BEGIN
 2  READ(X, Y)
 3  TOTAL := 0.0
 4  SUM := 0.0
 5  IF X <= 1
 6    THEN SUM := Y
 7    ELSE BEGIN
 8      READ(Z)
 9      TOTAL := X*Y
10    END
11  WRITE(TOTAL, SUM)
12  END.

Slice on the variable Z at statement 12:

READ(X, Y)
IF X <= 1
  THEN
  ELSE READ(Z)
END.

Slice on the variable X at statement 9:

READ(X, Y)
END.

Slice on the variable TOTAL at statement 12:

READ(X, Y)
TOTAL := 0.0
IF X <= 1
  THEN
  ELSE TOTAL := X*Y
END.

Definition: A slicing criterion is a set of elementary slicing criteria. The projection function associated with a slicing criterion is the union of the projection functions of its elementary slicing criteria.

Definition: A slice S of a program P on a slicing criterion C is any executable program with the following two properties. (1) S can be obtained from P by deleting zero or more statements from P. (2) Whenever P halts on an input I with state trajectory T, then S also halts on input I with state trajectory T', and PROJ(T) = PROJ(T'), where PROJ is the projection function associated with criterion C.

There can be many different slices for a given program and slicing criterion. There is always at least one slice for a given slicing criterion--the program itself. Slices are most useful for understanding a program when they are considerably smaller than the original program. One way to automatically find small slices is by dataflow analysis [8]. The author has built several automatic slicers based on dataflow analysis and finds that they generate tolerably small slices in most cases. For what follows, "slice" will refer to a dataflow and control-flow generated slice. (See King [9] for a symbolic execution approach to slicing-like program decomposition.)

Hypothesis

The hypothesis about debugging and slicing to be tested is (H1): "...
debugging programmers, working backwards from the variable and statement of a bug's appearance, use that variable and statement as a slicing criterion to construct mentally the corresponding program slice." The debugging and slicing hypothesis is in contrast to the contiguous chunk hypothesis (H2): "... programmers look at code only in contiguous pieces." Atwood and Ramsey [1], for instance, assume that the meaningful chunks of programs are small contiguous code sequences. Slices are generally not contiguous pieces, but contain statements scattered throughout the code. As the results below show, programmers remembered contiguous code as well as, but no better than, relevant slices.

Method

The experiment to test H1 against H2 took the following form: (1) have programmers debug three programs; (2) after completing all three programs, see if the programmers remember various code fragments embedded in the programs--particularly the slice relevant to the bug in each program. If the programmers did slice, then their memories for the relevant slices should be at least as good as their memories of contiguous code, and somewhat better than their memories of other noncontiguous code.

One procedure for testing program comprehension is to have programmers memorize and then reconstruct the program [18]. Programmers were not asked to reconstruct slices in this experiment because explaining about slices would have biased the results and because reconstruction might be an insensitive detector of a weak and transient memory for slices. Instead, programmers were required only to recognize the slices. This procedure also has the advantage of sometimes detecting knowledge that participants are unaware they possess [13, 14]. The "Materials" section describes the three programs, their bugs, and the code fragments embedded in each program for which the programmers' memories were tested.
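The slices of Fig. 1 can be reproduced mechanically. The following toy Python sketch of dataflow-based backward slicing is an illustration only, far short of the algorithms in [21, 22, 23, 24]: the representation of the Fig. 1 program is hand-encoded, and loops, kill information, and procedures are all ignored.

```python
# Toy backward slicer over the Fig. 1 program. Each statement is encoded as
# (line, variables-defined, variables-used, controlling-predicate-line).
# Hand-built sketch; not Weiser's full dataflow algorithm.
PROGRAM = [
    (2, {"X", "Y"}, set(), None),   # READ(X, Y)
    (3, {"TOTAL"}, set(), None),    # TOTAL := 0.0
    (4, {"SUM"}, set(), None),      # SUM := 0.0
    (5, set(), {"X"}, None),        # IF X <= 1  (predicate)
    (6, {"SUM"}, {"Y"}, 5),         # THEN SUM := Y
    (8, {"Z"}, set(), 5),           # ELSE READ(Z)
    (9, {"TOTAL"}, {"X", "Y"}, 5),  # TOTAL := X*Y
]

def backward_slice(program, variables, at_line):
    """Lines that may influence `variables` just before statement `at_line`."""
    uses_of = {line: uses for line, _, uses, _ in program}
    relevant = set(variables)
    in_slice = set()
    changed = True
    while changed:                        # iterate to a fixpoint
        changed = False
        for line, defs, uses, ctrl in reversed(program):
            if line >= at_line or line in in_slice:
                continue
            if defs & relevant:           # defines a relevant variable
                in_slice.add(line)
                relevant |= uses          # its inputs become relevant,
                if ctrl is not None:      # ...as does its controlling predicate
                    in_slice.add(ctrl)
                    relevant |= uses_of[ctrl]
                changed = True
    return in_slice
```

Slicing on TOTAL just before the WRITE at statement 11 yields lines 2, 3, 5, and 9, and slicing on Z at statement 12 yields lines 2, 5, and 8, matching the slices shown in Fig. 1.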
The "Participants" section describes the group of experienced programmers who participated in this study, and the "Procedure" section describes the instructions the participants received and the manner of presenting the problems.

Materials

The programs to be debugged were designed to appear to be difficult to debug, but in reality had only simple errors. The programs had to appear difficult because Gould and Drongowski [6] found that programmers approach debugging hierarchically, resorting to a more difficult strategy only when easier ways fail to work. The present experiment used difficult-looking programs so that programmers, expecting the worst, would initially choose a more difficult but reliable debugging strategy. Slicing was expected to be such a strategy. The errors were constructed to be easy to find so that the entire experiment could be completed in less than an hour.

Table I. Program Bugs.

Program    Original Code                        Bug
EVADE      LEFTOT := LEFTOT - HORT*THRUST       LEFTOT := HORT*THRUST
TALLY      SCNT := SCNT + 1                     SCNT := SCNT - 1
PAYROLL    EXEMPT_HOURS := OVERTIMEPAY := 0.0   OVERTIMEPAY := 0.0

Each debugging task was on a program of from 75 to 150 lines of Algol-W [26], with at least one major subroutine that performed most of the computation. In this subroutine, a single statement was changed to cause a bug (see Table I). One of the programs, TALLY, was taken from the IBM scientific subroutine package. It computes various statistics about a set of variables and displays poor coding practices such as GOTO's and nonmnemonic variable names. The other two programs, PAYROLL and EVADE, were written specifically for this experiment. EVADE is a simulation of an airplane making random turns. PAYROLL computes salaries and deductions for two kinds of employees. EVADE and PAYROLL are both well structured. They are well modularized, use good control structures, and have mnemonic variable names. Each has global comments only.

After completing all three programs, participants were shown various fragments of code drawn from the three programs. Each program contributed each of the following five fragments, except TALLY, which had no irrelevant contiguous fragment, making 14 fragments altogether. Fragments were truncated at top and bottom to be about ten lines long, although fragments with lots of BEGIN's or END's were a little longer. As Table II shows, fragments often had several statements in common.

(1) Relevant Slice.
A set of statements necessary for understanding the bug, usually near the faulty statement, taken from the slice on the variable and write statement whose execution first caused an error to appear in the program output.
(2) Relevant Contiguous. A region of contiguous code, overlapping the relevant slice fragment.
(3) Irrelevant Contiguous. A region of contiguous code, not overlapping the relevant contiguous or the relevant slice fragments.
(4) Irrelevant Slice. A set of statements near the faulty statement, taken from a slice on a variable not directly related to the bug.
(5) Jumble. Every third or fourth statement in the program, minimally modified to display reasonable syntax.

There were several problems with choosing fragments for the specific programs used. EVADE had several sections of code all alike and all adjacent, which consisted of a test followed by a subroutine call, and the faulty statement was located in one of these sections. In order to focus on algorithmic rather than syntactic memories, the two relevant fragments were chosen from a more computational portion of EVADE's main subroutine. PAYROLL had a rather short main subroutine, and so its fragments, uniquely among the three programs, crossed subroutine boundaries. As the results show, this may have been too severe a test of the programmers' integrating abilities. A general problem in constructing relevant fragments for all three programs was whether the two relevant fragments should be made to overlap or be disjoint. The choice was made to overlap the fragments so that their relevance would more likely be similar.

Table II. Overlap¹ and Correlations² of Some Fragments.

                          EVADE             TALLY             PAYROLL
                          Overlap  Corr.    Overlap  Corr.    Overlap  Corr.
Relevant sequence vs.
  Relevant slices          .75      .0       .58      .29      .33     -.33
  Irrelevant slices        .50      .0       .53      .44      .33     -.30
  Jumble                   .35      .46      .25      .21      .18      .21
Irrelevant sequence vs.
  Relevant slices          .0       .42      .0       .0       .0       .20
  Irrelevant slices        .0       .39      .0       .0       .0       .0
  Jumble                   .25      .31      .0       .0       .0       .27

¹ Overlap is the fraction of statements shared by the two fragments.
² Correlation is the Spearman's rank correlation coefficient [19], which has critical value .37 for a significance level of .05.

However, this
raised the possibility that participants would recognize relevant slice fragments merely because of the overlap with relevant sequential fragments. However, a comparison of fragment overlap with correlations of fragment memory (see Table II) makes it appear unlikely that this happened. A possible source of error in testing for recognition of programs is programmers' memory for detail. Successful debugging often depends upon noticing significant details such as a counter off by one or a line containing one blank instead of two. This kind of memory could throw results awry because experts might recognize code fragments by certain idiosyncratic details. To get at programmers' semantic understanding of the fragments apart from any memory of syntactic details, syntactic changes were made in the appearance of each fragment. Variables and constants in the fragments were renamed as single letters followed by a unique number. For variables the letter was the variable's original first letter. For constants the letter was X. Indenting was adjusted from the original program to a form internally consistent with each fragment. Two judges, familiar with the experimental design but unfamiliar with the specific programs used, looked at all the fragments. They rated the fragments for inherent plausibility as code sequences and found them approximately equal. This provided a check on any inherent bias in the particular fragments chosen, apart from their use in these programs. As a check on how well these programs and bugs represented all programs and bugs, at the end of the experiment participants rated each program and each bug for its typicalness in Algol-W programming. Ratings were on a four-point scale, from "very typical" to "not at all typical." "Not sure" was also permitted as an answer, but was not used. Such ratings do not show that any program is typical, even if there were such a thing as a typical program. 
But they can and do show that no program was especially atypical, and thus there is no reason to expect on this basis that the results do not generalize from the three programs used here to all programs.

Table III. Demographics of Participants.

Question                              Min   Max   Median   Mean   Standard Deviation
FTE months of programming              6    110     26     39.0        31.4
Number of software courses taken       2     16      8      7.9         3.8
FTE months of user counseling          0    100      1      8.5        21.9
Number of software courses taught      0     31      1      5.4         9.6
Percent experience with Algol-W        3     95     15     28.6        28.3
"Ever heard of slicing?"¹              0      7      1      1.5         2.2

¹ A 10 meant "know enough to use in my work," a 7 meant "know something about it," a 3 meant "have heard the term," and a 0 meant "never heard of it."

Participants

The participants in the experiment were experienced Algol-W programmers. They were chosen from Computer and Communication Sciences Department graduate student teaching assistants and Computing Center programming and counseling staff, all from the University of Michigan in Ann Arbor. Out of 31 potential participants with Algol-W experience, 26 volunteered. Four of these participated in pilot studies and one did not follow instructions in the actual experiment, leaving 21 final participants. The participants' background is summarized in Table III.

Procedure

Each participant signed a consent form, filled out a questionnaire about his or her programming experience, and then began the experiment. The first eight participants carried out the study with the experimenter present. Since none of those eight had any questions or comments during the study, the remaining participants were permitted to perform the experiment at home during an uninterrupted hour. One participant did not follow the instructions and his data were not used. Each participant was given all three programs to debug, in random order. Each program began with a brief description of what it was supposed to do, followed by a sample run which revealed the bug, followed by the program text itself.
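The syntactic normalization applied to the fragments (each variable renamed to its first letter plus a unique number, each constant to X plus a number, as in Fig. 2) might be sketched as follows in Python. This is a hypothetical reconstruction: the paper does not give the exact tokenizer or keyword list, so both are guesses here.

```python
import re

# Hypothetical re-implementation of the fragment normalization described
# above: variables become first-letter-plus-number, constants become
# X-plus-number, keywords are left alone. Keyword list is an assumption.
KEYWORDS = {"BEGIN", "END", "IF", "THEN", "ELSE", "FOR", "DO", "UNTIL", "WHILE"}
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+(?:\.\d*)?")

def normalize(fragment):
    mapping, counters = {}, {}
    def rename(tok):
        if tok not in mapping:
            # constants map under 'X', variables under their first letter
            letter = "X" if tok[0].isdigit() else tok[0].upper()
            counters[letter] = counters.get(letter, 0) + 1
            mapping[tok] = f"{letter}{counters[letter]}"
        return mapping[tok]
    return TOKEN.sub(
        lambda m: m.group(0) if m.group(0).upper() in KEYWORDS else rename(m.group(0)),
        fragment,
    )
```

Applied to the opening lines of Fig. 2's Part B, this reproduces Part A: TOTAL_EXEMPT becomes T1, the constant 0.0 becomes X1, EMPLOYEE becomes E1, and so on in order of first appearance.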
Participants could refer to a program's description and output any time they were working on that program. After finding a correction to each program, participants recorded the time and then were shown the correct answer. They were told not to look back at a program after beginning work on the next. After completing all three debugging tasks, participants were told they could take a short break before being shown the code fragments. They were asked to rate each fragment for how sure they were it had been used in one of the three programs. The rating scale is shown in Fig. 2. The instructions described the fragments as algorithms with changed variable names, truncated at top and bottom, all of whose individual statements were taken from the three programs. Participants were encouraged to give their first impressions of each fragment rather than use detailed analyses. Code fragments were presented in random order, each on a separate page with its rating scale. Participants were told not to look back either at the programs or at previously rated code fragments. After rating all 14 fragments, participants concluded the experiment by rating each program and its bug for typicalness.

Results

All 21 participants found the bugs in TALLY and EVADE, but only 17 found the bug in PAYROLL. The mean time to debug each program is shown in Table IV.
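The fragment-memory correlations of Table II use Spearman's rank correlation coefficient [19]. As a reminder of what that statistic computes, here is a small self-contained Python version (the textbook definition via Pearson correlation of average ranks, not any code used in the study):

```python
def ranks(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                        # extend over a run of ties
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman_rho(xs, ys):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Perfectly concordant rankings give rho = 1, perfectly reversed rankings give rho = -1; values near zero, like most entries in Table II, indicate that participants' ratings of one fragment carried little information about their ratings of the other.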
Fig. 2. An Example Code Fragment. This fragment is a portion of the relevant slice for program PAYROLL. The statements were originally at line numbers 47, 49, 51, 14, 17, 19, 21, 22, 6, and 65, respectively. Part A shows the fragment with its rating scale as presented to the participants. Part B shows the same fragment before changing variable names.

A

T1 := X1;
FOR E1 := X2 UNTIL N1 DO
R1(N2, H1, P1, E2);
IF H1 > X3 THEN
IF E2 THEN
E3 := (H1 - X3)
T1 := T1 + E3;

( ) almost certainly used
( ) probably used
( ) probably not used
( ) almost certainly not used

B

TOTAL_EXEMPT := 0.0;
FOR EMPLOYEE := 1 UNTIL NUM_EMPLOYEES DO
READ(NAME, HOURS, PAY_RATE, EXEMPT);
IF HOURS > 40.0 THEN
IF EXEMPT THEN
EXEMPT_HOURS := (HOURS - 40.0)
TOTAL_EXEMPT := TOTAL_EXEMPT + EXEMPT_HOURS

Time to debug showed a significant correlation only with the particular program being debugged. The ratings for typicalness given by participants to the programs and bugs are shown in Table V. There seemed to be a tendency for low typicalness ratings to be correlated with slicing (see Fig. 5 and the discussion below).

The hypothesis that programmers mentally construct slices when debugging (H1) was tested by comparing the ratings given the relevant slice fragments to the ratings given the other fragments. A two-way analysis of variance using Friedman's test [19] indicated that there was an overall difference in the ratings of the different fragments. Pairwise comparisons of ratings were then made in two ways: the first by pooling ratings for all three programs, and the second by looking at ratings program by program.

Ratings across all three programs were pooled by considering the total number of times participants recognized each kind of code fragment. Ratings of "probably used" and "almost certainly used" were considered to be recognition. Under this scheme, the relevant slice was recognized 54 percent of the time, while the irrelevant slice and the jumble were recognized only 28 percent and 20 percent of the time, respectively. (See Fig. 3.) Using the Wilcoxon matched-pairs signed-ranks test [19], the difference between relevant slices and irrelevant slices is significant at the .03 level, and the difference between relevant slices and jumbles is very significant at the .005 level.

Table IV. Debugging Times (minutes).

Program    Mean   Standard Deviation
TALLY      13.0          6.9
EVADE       8.0          6.0
PAYROLL     9.2          3.1

To evaluate results program by program, participants' answers were converted to a one-to-four ordinal scale. Mean (median) scores, given in order for the relevant slice, the irrelevant slice, and the jumble, were: for EVADE, 2.6 (3), 2.1 (2), and 1.6 (1); for TALLY, 3.0 (3), 2.0 (2), and 2.0 (1); for PAYROLL, 2.4 (2), 2.1 (2), and 1.6 (2). Wilcoxon's matched-pairs signed-ranks test was used to test for significant differences in rankings for the fragments. Differences between relevant slice scores and jumble scores were very significant at the .02 and .03 levels for EVADE and TALLY, respectively, and marginally significant at the .08 level for PAYROLL. Differences between relevant slice scores and irrelevant slice scores were marginally statistically significant for EVADE at the .08 level, very significant for TALLY at the .006 level, and not at all significant for PAYROLL. (See Fig. 4.)

Because the relevant slice fragment overlapped the relevant sequential fragment in each program, this experiment gives no absolute assurance that relevant slices were not recognized only because of that overlap. However, Table II indicates that participants were probably not recognizing the nonsequential fragments based on their overlap with sequential fragments. It shows correlation scores (computed using the Spearman rank correlation [19]) between each sequential fragment and the other fragments in a program.
The generally low correlations and their indifference to overlap indicate that individual participants probably did not rate fragments based on similarity to a sequential chunk.

There was no statistically significant difference between participants' memory for the relevant slice vs. the relevant contiguous code, or between the relevant slice and the irrelevant contiguous code.

Table V. Typicalness of Programs and Bugs, N = 17.¹

                     Program Rating                   Bug Rating
Debugging Task   Mean   Std. Dev.   Median      Mean   Std. Dev.   Median
EVADE            3.1       .8         3         2.9      1.0         3
TALLY            2.4      1.2         3         2.6      1.1         2
PAYROLL          3.2      1.1         4         3.5       .9         4

¹ Typicalness was rated by all but the first four participants, using a 1 to 4 scale where 4 meant "very typical" and 1 meant "not at all typical."

There was also no
statistically significant difference between any component of participants' experience and their memory for different kinds of code fragments.

Fig. 3. Percent Recognition of Code Fragment Type. The totals exclude the four participants who did not find the PAYROLL bug.

Discussion

The results are evidence that programmers slice when debugging. However, EVADE and PAYROLL do not show this as clearly as TALLY, and it is useful to try to account for TALLY's exceptional performance.

Before discussing TALLY, the data compel asking why contiguous code far away from the bug was recognized so well. This was very likely an artifact of the small program size, which restricted the area from which irrelevant code could be drawn. The irrelevant contiguous fragment was often very near the output statements which wrote the erroneous variable values. Therefore, participants were very likely to have looked at this region of the program first, and their ratings reflect this.

Fig. 4. Mean Recognition Scores for Code Fragments.

What was exceptional about TALLY? Two factors are supported by the data. TALLY took significantly longer to debug than the other programs, implying it was more difficult. TALLY was also the only unstructured program with many GOTO's, which would have contributed to its difficulty. Thus, EVADE and PAYROLL may have been too easy to debug or they may not have appeared sufficiently difficult to elicit a slicing strategy from some participants. A second exceptional factor may have been participants' familiarity with programs of this type. Slicing is likely a difficult mental task, useful primarily for programs of uncertain structure and purpose. An unfamiliar program is therefore more likely to be sliced. Evidence for this is that the group of 12 participants who rated EVADE less than very typical also rated the relevant slice higher than the irrelevant slice (see Fig.
5), a result significant at the .01 level by Wilcoxon's test.

Fig. 5. Mean Recognition Scores of EVADE, N = 12. Mean scores are shown for participants who rated EVADE less than very typical, i.e., between 1 and 3 on the scale shown in Table V.

Implications

Slicing may be put to practical use. For instance, an automatic slicer could be used interactively by a programmer to ask for statements which could be the source of erroneous behavior. Slicing could also be the basis of program complexity measures, by examining the number of distinct slices in a program or their relative distribution through the code. An automatic slicer is not difficult to build, and various applications are now under investigation [24, 25].

As an example of some slices from real programs, Fig. 6 shows slices from three compilers, using slicing criteria automatically generated from each variable at every "write" statement. That is, if statement i in the program wrote variables X, Y, and Z, then slices were found for criteria (i, X), (i, Y), and (i, Z); this was done for every write statement. Using this method, many slices end up being nearly the same, and so only slices which differed by more than ten statements were considered to be distinct. Figure 6 shows only the distinct slices. Recall that each slice, executed on the compiler's input data, accurately reproduces its portion of the compiler's output. The "approximate contents" column of Fig. 6 represents the author's best guess of the purpose of the code in each slice.

Fig. 6. Examples of Slices from Student-Written Compilers. (Only distinct slices are shown.)

No. of Statements   Approximate Contents

COMPILER 1, 554 statements. Slices:
474    Interpreter output.
429    Scanning and high-level parsing.
8      Constant messages.
149    Error: no program.
432    Symbol table listing.
453    Object code listing.

COMPILER 2, 662 statements. Slices:
563    Interpreter execution error.
288    Miscellaneous 1.
257    Error: no "end" card.
184    Miscellaneous 2.
148    Flow-of-control statements.
79     "read" and assignment.
125    Expressions.
49     Symbol table search.
8      Array subscript parsing.
14     Memory trace.
55     Error: procedure syntax.

COMPILER 3, 497 statements. Slices:
476    Scanning, parsing, interpreter.
2      Constant messages.
11     Memory dump headers.
481    Memory dump.

Slicing is now reinvented by every programmer who uses it. Beginning programmers taught the concept of slicing could avoid this reinvention and could more rapidly improve their debugging skills. The idea of ignoring irrelevant code also goes well with using diagnostic write statements to narrow in on the cause of a program failure.

Conclusion

When debugging, programmers view programs in ways that need not conform to the programs' textual or modular structures. In particular, the statements in a slice may be scattered throughout the code of the larger program, and yet experienced programmers routinely abstract the slices from a program. Understanding slicing may be useful in producing debugging and maintenance aids and in training programmers, and further research on slicing may lead to a more complete understanding of the many skills that make up debugging ability.

Acknowledgments.
Ben Shneiderman and a very helpful referee contributed substantially to the clarity of this paper. With his constant encouragement, Stephen Kaplan helped this experiment become a reality.

Received 3/81; revised 6/81; accepted 7/81

References

1. Atwood, M.E. and Ramsey, H.R. Cognitive structures in the comprehension and memory of computer programs: An investigation of computer program debugging. TR-78-A21, U.S. Army Research Institute for the Behavioral and Social Sciences, Alexandria, Virginia, August 1978.
2. Chase, W.G. and Simon, H.A. Perception in chess. Cognitive Psychology 5, 4 (Oct. 1973), 55-81.
3. Chi, M.T.H. and Glaser, R. Encoding process characteristics of experts and novices in physics. Symposium on Process Models of Skilled and Less Skilled Behavior in Technical Domains, American Educational Research Association, April 1979.
4. DeGroot, A.D. Thought and Choice in Chess. Mouton Press, The Hague, 1965.
5. Dijkstra, E.W. Correctness concerns and, among other things, why they are resented. Proc. Int. Conf. on Reliable Software, June 1975, 546-550. SIGPLAN Notices 10, 6.
6. Gould, J.D. and Drongowski, P. An exploratory study of computer program debugging. Human Factors 16, 3 (June 1974), 258-277.
7. Gould, J.D. Some psychological evidence on how people debug computer programs. International J. of Man-Machine Studies 7, 1 (Jan. 1975), 151-182.
8. Hecht, M.S. Flow Analysis of Computer Programs. North-Holland, New York, 1977.
9. King, J. Program reduction using symbolic evaluation. Software Engineering Notes 6, 1 (Jan. 1981), ACM SIGSOFT.
10. Larkin, J., McDermott, J., Simon, D.P., and Simon, H.A. Expert and novice performance in solving physics problems. Science 208 (June 20, 1980), 1335-1342.
11. Lukey, F.J. Understanding and debugging programs. International J. of Man-Machine Studies 12, 2 (Feb. 1980), 189-202.
12. McKeithen, K.B. Assessing knowledge structures in novice and expert programmers. PhD Thesis, University of Michigan, Ann Arbor, MI, 1979.
13. Posner, M.I.
and Keele, S.W. On the genesis of abstract ideas. J. of Experimental Psychology 77, 3 (July 1968), 353-363.
14. Posner, M.I. and Keele, S.W. Retention of abstract ideas. J. of Experimental Psychology 83, 2 (Feb. 1970), 304-308.
15. Schwartz, J.T. An overview of bugs. In Debugging Techniques in Large Systems, Randall Rustin, Ed. Prentice-Hall, Englewood Cliffs, NJ, 1971.
16. Shneiderman, B. Exploratory experiments in programmer behavior. International J. of Computer and Information Sciences 5, 2 (April 1976), 123-143.
17. Shneiderman, B. and Mayer, R. Syntactic/semantic interactions in programmer behavior: A model and experimental results. International J. of Computer and Information Sciences 7 (1979), 219-239.
18. Shneiderman, B. Software Psychology: Human Factors in Computer and Information Systems. Winthrop, Reading, MA, 1980.
19. Siegel, S. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York, 1956.
20. Sime, M.E., Green, T.R.G., and Guest, D.J. Scope marking in computer conditionals--a psychological evaluation. International J. of Man-Machine Studies 5 (1973), 105-113.
21. Weiser, M. Program slices: Formal, psychological, and practical investigations of an automatic program abstraction method. PhD Thesis, University of Michigan, Ann Arbor, MI, 1979.
22. Weiser, M. The slicing abstraction in software production and maintenance. Reliable Software Systems Group Technical Memo (RSSM), University of Michigan, Ann Arbor, MI, 1979.
23. Weiser, M. Theoretical foundations of program slices. Reliable Software Systems Group Technical Memo (RSSM) 69, University of Michigan, Ann Arbor, MI, 1979.
24. Weiser, M. Program slicing. Proceedings of the Fifth International Conference on Software Engineering, San Diego, CA, March 1981.
25. Weiser, M. Towards an iterative enhancement software development environment. Fourteenth Hawaii International Conference on System Sciences, Honolulu, HI, Jan. 1981.
26. Wirth, N. and Hoare, C.A.R.
A contribution to the development of ALGOL. Comm. ACM 9, 6 (June 1966), 413-431.
27. Zelkowitz, M.V. Reversible execution. Comm. ACM 16, 9 (Sept. 1973), 566.