How We Refactor, and How We Know It

Size: px

Start display at page:

Download "How We Refactor, and How We Know It"

Laura Green
6 years ago
Views:

1 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 1 How We Refactor, and How We Know It Emerson Murphy-Hill, Chris Parnin, and Andrew P. Black Abstract Refactoring is widely practiced by developers, and considerable research and development effort has been invested in refactoring tools. However, little has been reported about the adoption of refactoring tools, and many assumptions about refactoring practice have little empirical support. In this article, we examine refactoring tool usage and evaluate some of the assumptions made by other researchers. To measure tool usage, we randomly sampled code changes from four Eclipse and eight Mylyn developers and ascertained, for each refactoring, if it was performed manually or with tool support. We found that refactoring tools are seldom used: 11% by Eclipse developers and 9% by Mylyn developers. To understand refactoring practice at large, we drew from a variety of datasets spanning more than developers, tool-assisted refactorings, 2500 developer hours, and version control commits. Using these data, we cast doubt on several previously-stated assumptions about how programmers refactor, while validating others. Finally, we interviewed the Eclipse and Mylyn developers to help us understand why they did not use refactoring tools, and to gather ideas for future research. Index Terms refactoring, refactoring tools, floss refactoring, root-canal refactoring 1 INTRODUCTION Refactoring is the process of changing the structure of software without changing its behavior. While the practice of restructuring software has existed ever since software has been structured, the term was introduced by Opdyke and Johnson [12]. Later, Fowler popularized refactoring when he cataloged 72 different refactorings, ranging from localized changes such as INLINE TEMP, to more global changes such as TEASE APART INHERITANCE [5]. Especially in the last decade, the body of research about refactoring has been growing rapidly. Examples of such research include studies of the effect of refactoring on errors [20] and the relationship between refactoring and software quality [17]. Such research builds upon a foundation of previous work about how programmers refactor, such as what kinds of refactorings programmers perform and how frequently they perform them. Unfortunately, this foundation is, in some cases, based on limited evidence, or on no evidence at all. For example, consider Murphy-Hill and Black s Refactoring Cues tool that allows programmers to refactor several program elements at once [8]. If the assumption that programmers frequently want to refactor several program elements at once holds, this tool would be very useful. However, prior to the tool being introduced, no foundational research existed to support this assumption. As we show in this article, this case is not isolated; other research also rests on unsupported or weakly-supported E. Murphy-Hill is with the Department of Computer Science, North Carolina State University, Raleigh, NC, emerson@csc.ncsu.edu C. Parnin is with the College of Computing, Georgia Institute of Technology, Atlanta, GA, chris.parnin@gatech.edu A. P. Black is with the Department of Computer Science, Portland State University, Portland, OR, black@cs.pdx.edu foundations. Without strong foundations for refactoring research, we can have only limited confidence that research built on these foundations will be valid in the larger context of real-world software development. As a step towards strengthening these foundations, this article revisits some of the assumptions and conclusions drawn in previous research. Our experimental method takes data from eight different sources (described in Section 2) and applies several different analysis strategies to them. The contributions of our work lie in both the experimental method and in the conclusions that we are able to draw: The RENAME refactoring tool is used much more frequently by ordinary programmers than by the developers of refactoring tools (Section 3.1); about 40% of refactorings performed using a tool occur in batches (Section 3.2); about 90% of configuration defaults in refactoring tools are not changed when programmers use the tools (Section 3.3); messages written by programmers in version-control commit logs do not reliably indicate the presence of refactoring in the commit (Section 3.4); programmers frequently floss refactor, that is, they interleave refactoring with other types of programming activity (Section 3.5); about half of refactorings are not high-level, so refactoring detection tools that look exclusively for high-level refactorings will not detect them (Section 3.6); refactorings are performed frequently (Section 3.7); close to 90% of refactorings are performed manually, without the help of tools (Section 3.8); and the kind of refactoring performed with a tool differs from the kind performed manually (Section 3.9). In Section 4 we discuss the interaction between these

2 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 2 conclusions and the assumptions and conclusions of other researchers. This article is an extension of work reported at ICSE 2009 [11], which provided evidence for each of the conclusions above. The primary weakness of the ICSE work was that several of our conclusions were based on data from a single development team. This article includes analysis of four additional data sets, with the consequence that every conclusion drawn here is based on data from at least two development teams. 2 THE DATA THAT WE ANALYZED The work described in this article is based on eight sets of data. The first set we will call Users; it was originally collected in the latter half of 2005 by Murphy and colleagues [7], who used the Mylyn Monitor tool to capture and analyze fine-grained usage data from 41 volunteer programmers in the wild using the Eclipse development environment ( These data capture an average of 66 hours of development time per programmer; about 95 percent of the programmers wrote in Java. The data include information on which Eclipse commands were executed, and at what time. Murphy and colleagues originally used these data to characterize the way programmers used Eclipse, including a coarse-grained analysis of which refactoring tools were used most often. Murphy-Hill and Black have also used these data as a source of evidence for the claim that refactoring tools are underused [10]. The second set of data we will call Everyone; it is publicly available from the Eclipse Usage Collector [18], and includes data from every user of the Eclipse Ganymede release who consented to an automated request to send data back to the Eclipse Foundation. These data aggregate activity from over Java developers between April 2008 and January 2009, but also include non-java developers. The data include information on the number of programmers who have used each Eclipse command (including the refactoring commands), and how many times each command was executed. We know of no other researchers who have used this data for investigating programmer behavior. The third set of data we will call Toolsmiths; it includes refactoring histories from 4 developers who primarily maintain Eclipse s refactoring tools. These data include detailed histories of which refactorings were executed, when they were performed, and with what configuration parameters. These data include all the information necessary to recreate the usage of a refactoring tool, assuming that the original source code is also available. These data were collected between December 2005 and August 2007, although the date ranges are different for each developer. This data set is not publicly available. The only author that we know of using similar data is Robbes [16]; he reports on refactoring tool usage by himself and one other developer. The fourth set of data we will call Eclipse CVS, because it is the version history of the Eclipse and JUnit ( code bases as extracted from their Concurrent Versioning System (CVS) repositories. Commonly, CVS data must be preprocessed before analysis. This is because CVS does not record which file revisions were committed in a single transaction. The standard approach for recovering transactions is to find revisions committed by the same developer with the same commit message within a small time window [22]; we use a 60 second time window. Henceforth, we use the word revision to refer to a particular version of a file, and the word commit to refer to one of these synthesized commit transactions. We excluded from our sample (a) commits to CVS branches, which would have complicated our analysis, and (b) commits that did not include a change to a Java file. In our experiments, we focus on a subset of the commits in Eclipse CVS. Specifically, we randomly sampled from about 3400 source file commits (Section 3.4) that correspond to the same time period, the same projects, and the same developers represented in Toolsmiths. Using these data, two of the authors (Murphy-Hill and Parnin) inferred which refactorings were performed by comparing adjacent commits manually. While many authors have mined software repositories automatically for refactorings (for example, Weißgerber and Diehl [20]), we know of no other researchers that have compared refactoring tool logs with code histories. The fifth set of data we call Mylyn; it includes the refactoring histories from 8 developers who primarily maintain the Mylyn project, a task-focused interface for Eclipse (www. eclipse.org/mylyn). The data format is the same as for Toolsmiths, although we obtained it through different means; the developers checked their refactoring histories into CVS while working on the project, so those histories are publicly available from Mylyn s open-source code repository. The refactoring history spans the period from February 2006 to August 2009, though different developers worked on the project at different periods of time. The sixth dataset we call Mylyn CVS, which is the version control history corresponding to the Mylyn refactoring history, in the same way that Eclipse CVS corresponds to Toolsmiths. We analyzed, filtered, randomly drew from, and inspected the data in the same way as with Eclipse CVS. The seventh dataset is called UDC Events, which is a subset of the Everyone data containing more detail: instead of aggregating counts of Eclipse command uses, UDC Events contains timestamps for each command usage. This data is much like the Users data but includes developers spanning several weeks in June and July The final dataset is called Developer Responses. When we completed our analysis of the first six data sets, we sent a survey to three developers in the Toolsmiths dataset and four developers in the Mylyn dataset. The survey included several questions about those developers refactoring habits and refactoring tool use habits. 1 In total, we received two responses from Toolsmith developers and three responses from Mylyn developers. We use this qualitative Developer Responses data to augment the quantitative data in the other seven data sets. 3 FINDINGS ON REFACTORING BEHAVIOR In this section we analyze these eight sets of data and discuss our findings. 1. The survey template appears in Appendix 2.

3 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 3 Refactoring Tool Toolsmiths Mylyn Users Everyone Uses Use % Batched Batched % Uses Use % Batched Batched % Uses Use % Batched Batched % Uses Use % Rename % % % % % % % Extract Local Variable % % % % % % % Inline % % % % % % % Extract Method % % % % % % % Move % % % % % % % Change Method Signature % % % % % % % Convert Local To Field % % % % % % % Introduce Parameter % % 1 0.1% % % % Extract Constant % % % % % % % Convert Anonymous To Nested % 0 0.0% % 0 0.0% % % % Move Member Type to New File % 0 0.0% % 4 7.3% % % % Pull Up % 0 0.0% % % % % % Encapsulate Field % % % % 4 0.1% % % Extract Interface 2 0.1% 0 0.0% % 2 6.9% % 0 0.0% % Generalize Declared Type 2 0.1% 0 0.0% 0 0.0% % % % Push Down 1 0.1% % % 1 0.1% % Infer Generic Type Arguments 0 0.0% % % 3 0.1% 0 0.0% % Use Supertype Where Possible 0 0.0% % % 0 0.0% % Introduce Factory 0 0.0% % % % Extract Superclass 7 0.3% 0 0.0% % 0 0.0% * - * * % Extract Class 1 0.1% 0 0.0% 0 0.0% 0 - * - * * % Introduce Parameter Object 0 0.0% % 0 - * - * * % Introduce Indirection 0 0.0% % 0 - * - * * % Total % % % % % % % TABLE 1: Refactoring tool usage in Eclipse. Some tool logging began in the middle of the Toolsmiths and Mylyn data collection (shown in light grey) and after the Users data collection (denoted with a *). The - symbol denotes a percentage corresponding to a fraction for which the denominator is zero.

4 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR Toolsmiths and Users Differ We hypothesize that the refactoring behavior of the programmers who develop the Eclipse refactoring tools differs from that of the programmers who use them. Toleman and Welsh assume a variant of this hypothesis that the designers of software tools erroneously consider themselves typical tool users and argue that the usability of software tools should be objectively evaluated [19]. To test our hypothesis, we compared the refactoring tool usage in the Toolsmith data set against the tool usage in the User and Everyone data sets. For this comparison, we will omit the Mylyn dataset, because Users and Everyone are more likely to represent the behavior of people who do not develop refactoring tools. In Table 1, the Uses columns indicate the number of times each refactoring tool was invoked in that dataset. The Use % column presents the same measure as a percentage of the total number of refactorings. (The Batched columns are discussed in Section 3.2.) Notice that while the rank order of each tool is similar across Toolsmiths, Users, and Everyone RENAME, for example, always ranks first the proportion of uses of the individual refactoring tools varies widely between Toolsmiths and Users/Everyone. In Toolsmiths, RENAME accounts for about 29% of all refactorings, whereas in Users it accounts for about 62% and in Everyone for about 72%. We suspect that this difference is not because Users and Everyone perform more RENAMES than Toolsmiths, but because Toolsmiths are more frequent users of the other refactoring tools. This analysis is limited in two ways. First, each data set was gathered over a different period of time, and the tools themselves may have changed between those periods. Second, the Users data include both Java and non-java RENAME and MOVE refactorings, but the Toolsmiths, Mylyn and Everyone data report on just Java refactorings. This may inflate actual RENAME and MOVE percentages in Users. 3.2 Programmers Repeat Refactorings We hypothesize that when programmers perform a refactoring, they typically perform several more refactorings of the same kind within a short time period. For instance, a programmer may perform several EXTRACT LOCAL VARIABLES in preparation for a single EXTRACT METHOD, or may RENAME several related instance variables at once. Based on personal experience and anecdotes from programmers, we suspect that programmers often refactor multiple pieces of code because several related program elements need to be refactored in order to perform a composite refactoring. In previous research, Murphy-Hill and Black built a refactoring tool that supported refactoring multiple program elements at once, on the assumption that this is common [8]. To determine how often programmers do in fact repeat refactorings, we used the Toolsmiths, Mylyn and Users data to measure the temporal proximity of multiple invocations of a refactoring tool. We say that refactorings of the same kind that execute within 60 seconds of each another form a batch. From our personal experience, we think that 60 seconds is usually long enough to allow the programmer to complete an Eclipse wizard-based refactoring, yet short enough to exclude Refactoring Tool Toolsmiths Mylyn MOVE 1 (n=147) 1 (n=958) PUSH DOWN 5 (n=1) 3 (n=3) PULL UP 2.5 (n=12) 4 (n=17) EXTRACT INTERFACE 4.5 (n=2) 4 (n=29) EXTRACT SUPERCLASS 17 (n=7) 9 (n=16) TABLE 2: The median number of explicitly batched elements used for several refactoring tools, where n is the number of total uses of that refactoring tool. % batched Users Mylyn Toolsmiths batch threshold, in seconds Fig. 1: Percentage of refactorings that appear in batches as a function of batch threshold, in seconds. 60-seconds, the batch size used in Table 1, is drawn in green. refactorings that are not part of the same conceptual group. Additionally, a few refactoring tools, such as PULL UP in Eclipse, can refactor multiple program elements, so a single application of such a tool is an explicit batch of related refactorings; we measured the median batch size for these tools. In Table 1, each Batched column indicates the number of refactorings that appeared as part of a batch, while each Batched % column indicates the percentage of refactorings that appeared as part of a batch. Overall, we can see that certain refactorings, such as RENAME, INLINE, and ENCAP- SULATE FIELD, are more likely to appear as part of a batch, while others, such as EXTRACT METHOD and PULL UP, are less likely to appear in a batch. In total, we see that 30% of Toolsmiths refactorings, 39% of Mylyn refactorings, and 47% of Users refactorings appear as part of a batch. 2 The median batch size for explicitly batched refactorings in tools that can refactor multiple program elements varied between tools in both Toolsmiths and Mylyn (Table 2). Overall, the table indicates that, with the exception of MOVE, most refactorings are performed in batches. The main limitation of this analysis is that, while we wished to measure how often several related refactorings are performed in sequence, we instead used a 60-second heuristic. It may be that some related refactorings occur outside our We suspect that the difference in percentages arises partially because the Toolsmiths and Mylyn data set counts the number of completed refactorings while Users counts the number of initiated refactorings. We have observed that programmers occasionally initiate a refactoring tool on some code, cancel the refactoring, and then re-initiate the same refactoring shortly thereafter [9].

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 5 Fig. 2: A configuration dialog box in Eclipse. second window, and that some unrelated refactorings occur inside the window.

5 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 5 Fig. 2: A configuration dialog box in Eclipse. second window, and that some unrelated refactorings occur inside the window. To show how sensitive these results are to the batch threshold, Figure 1 displays the total percentage of batched refactorings for several different batch thresholds. Other metrics for detecting batches, such as burstiness, should be investigated in the future. 3.3 Programmers often don t Configure Refactoring Tools Refactoring tools are typically of two kinds: they either force the programmer to provide configuration information, such as whether a newly created method should be public or private an example is shown in Figure 2 or they quickly perform a refactoring without allowing any configuration. Configurable refactoring tools are more common in some environments, such as Netbeans ( whereas non-configurable tools are more common in others, such as X- develop ( Which interface is preferable depends on how often programmers configure refactoring tools. We hypothesize that programmers often don t configure refactoring tools. We suspect this is because tweaking code manually after the refactoring may be easier than configuring the tool. In the past, we have found some limited evidence that programmers perform only a small amount of configuration of refactoring tools. When we conducted a small survey in September 2007 at a Portland Java User s Group meeting, 8 programmers estimated that, on average, they supply configuration information only 25% of the time. To validate this hypothesis, we counted how often programmers used various configuration options in the Toolsmiths and Mylyn data, when performing the 5 refactorings most frequently performed by Toolsmiths. We skipped refactorings that did not have configuration options. The results of the analysis are shown in Table 3. Configuration Option refers to a configuration parameter that the user can change. Default Value refers to the default value that the tool assigns to that option. Change Frequency refers to how often a user used a configuration option other than the default. The data suggest that refactoring tools are configured infrequently: the overall mean change frequency for these options is about 10% in Toolsmiths and 12% in Mylyn. Although different configuration options are changed from defaults with varying frequencies, almost all configuration options that we inspected were below the average configuration percentage predicted by the Portland Java User s Group survey. This analysis has several limitations. First, we could not count how often certain configuration options were changed, such as how often parameters are reordered when EXTRACT METHOD is performed. Second, we examined only the 5 mostcommon refactorings; configuration may be more frequent for less popular refactorings. Third, we measured how often a single configuration option is changed, but not how often any configuration option is changed for a refactoring. Fourth, we were not able to distinguish between the case where a developer purposefully used a non-default configuration option, and the case where she blindly used the non-default option left over from the last time that she used the tool. 3.4 Commit Messages don t predict Refactoring Several researchers have used messages attached to commits into a version control as indicators of refactoring activity [6], [14], [15], [17]. For example, if a programmer commits code to CVS and attaches the commit message refactored class Foo, we might predict that the committed code contains more refactoring activity than if a programmer commits with a message that does not contain the word refactor. However, we hypothesize that this assumption is false, perhaps because refactoring can be an unconscious activity [2, p. 47], and perhaps because the programmer may consider the refactoring subordinate to some other activity, such as adding a feature [10]. In his thesis, Ratzinger describes the most sophisticated strategy for finding refactoring messages of which we are aware [14]: searching for the occurrence of 13 keywords, such as move and rename, and excluding needs refactoring. Using two different project histories, the author randomly drew 100 file modifications from each project and classified each as either a refactoring or as some other change. He found that his keyword technique accurately classified modifications 95.5% of the time. Based on this technique, Ratzinger and colleagues concluded that an increase in refactoring activity tends to be followed by a decrease in software defects [15]. We replicated Ratzinger s experiment for the Eclipse code base. Using the Eclipse CVS data, we grouped individual file revisions into global commits as previously discussed in Section 2. We also manually removed commits whose messages referred to changes to a refactoring tool (for example,

6 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 6 Refactoring Tool Configuration Option Default Value Change Frequency Toolsmiths Mylyn Extract Local Variable Declare the local variable as final false 5% 0% Extract Method New method visibility private 6% 6% Declare thrown runtime exceptions false 24% 0% Generate method comment false 9% 1% Rename Type Update references true 3% 0% Update similarly named variables and methods false 24% 26% Update textual occurrences in comments and strings false 15% 23% Update fully qualified names in non-java text files true 7% 35% Rename Method Update references true 0% 0% Keep original method as delegate to renamed method false 1% 1% Inline Method Delete method declaration true 9% 1% TABLE 3: Refactoring tool configuration in Eclipse. [refactoring] Convert Local Variable to Field has problems with arrays ), because such changes are false positives that occur only because the project s code is itself implementing refactoring tools. Next, using Ratzinger s 13 keywords, we automatically classified the log messages for the remaining 2788 commits. 10% of these commits matched the keywords, which compares with Ratzinger s reported 11% and 13% for two other projects [14]. Next, a third party randomly drew 20 commits from the set that matched the keywords (which we will call Labeled ) and 20 from the set that did not match ( Unlabeled ). Without knowing whether a commit was in the Labeled or Unlabeled group, two of the authors (Murphy-Hill and Parnin) manually compared each committed version of the code against the previous version, inferring how many and which refactorings were performed, and whether at least one non-refactoring change was made. Together, Murphy-Hill and Parnin compared these 40 commits over the span of about 6 hours, comparing the code using a single computer and Eclipse s standard compare tool. The results are shown in Table 4, under the Eclipse CVS heading. In the left column, the kind of Change is listed. Pure Whitespace means that the developer changed only whitespace or comments; No Refactoring means that the developer did not refactor but did change program behavior; Some Refactoring means that the developer both refactored and changed program behavior, and Pure Refactoring means the programmer refactored but did not change program behavior. The center column counts the number of Labeled commits with each kind of change, and the right column counts the number of Unlabeled commits. The parenthesized lists record the number of refactorings found in each commit. For instance, the table shows that in 5 commits, when a programmer mentioned a refactoring keyword in the CVS commit message, the programmer made both functional and refactoring changes. The 5 commits contained 1, 4, 11, 15, and 17 refactorings. These results suggest that classifying CVS commits by commit message does not provide a complete picture of refactoring activity. While all 6 pure-refactoring commits were identified by commit messages that contained one of the refactoring keywords, commits labeled with a refactoring keyword contained far fewer refactorings (63, or 36% of the total) than those not so labeled (112, or 64%). Figure 3 shows the variety of refactorings in Labeled (darker bars) commits and Unlabeled (lighter bars) commits. We will explain the (H), (M), and (L) tags in Section 3.6. We replicated this experiment once more for the Mylyn CVS data set. As with Eclipse CVS, 10% of the commits were classified as refactorings using Ratzinger s method. Under the Mylyn CVS heading, Table 4 shows the results of this experiment. These results confirm that classifying CVS commits by commit message does not provide a complete picture of refactoring, because commits labeled with a refactoring keyword contained fewer refactorings (52, or 46%) than those not so labeled (60, or 54%). There are several limitations to this analysis. First, while we tried to replicate Ratzinger s experiment [14] as closely as was practicable, the original experiment was not completely specified, so we cannot say with certainty that the observed differences were not due to methodology. Likewise, observed differences may be due to differences in the projects studied. Indeed, after we completed this analysis, a personal communication with Ratzinger revealed that the original experiment included and excluded keywords specific to the projects being analyzed. Second, because the process of gathering and inspecting subsequent code revisions is labor intensive, our sample size (40 commits in total) is smaller than would otherwise be desirable. Third, the classification of a code change as a refactoring is somewhat subjective. For example, if a developer removes code known to her to never be executed, she may legitimately classify that activity as a refactoring, although to an outside observer it may appear to be the removal of a feature. We tried to be conservative, classifying changes as refactorings only when we were confident that they preserved behavior. Moreover, because the comparison was blind, any bias introduced in classification would have applied equally to both Labeled and Unlabeled commit sets. 3.5 Floss Refactoring is Common In previous work, Murphy-Hill and Black distinguished two tactics that programmers use when refactoring: floss refactoring and root-canal refactoring [10]. During floss refactoring, the programmer uses refactoring as a means to reach a specific

7 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 7 Eclipse CVS Mylyn CVS Change Labeled Unlabeled Labeled Unlabeled Pure Whitespace No Refactoring Some Refactoring 5 (1,4,11,15,17) 6 (2,9,11,23,30,37) 9 (1,1,1,2,3,5,7,9,11) 11(1,1,2,2,2,3,4,6,8,9,14) Pure Refactoring 6 (1,1,2,3,3,5) 0 4 (1,1,4,6) 2 (1,7) Total 20(63) 20(112) 20(52) 20(60) TABLE 4: Refactoring between commits in Eclipse CVS and Mylyn CVS. Plain numbers count commits in the given category; tuples contain the number of refactorings in each commit. end, such as adding a feature or fixing a bug. Thus, during floss refactoring the programmer intersperses other kinds of program changes with refactorings to keep the code healthy. Root-canal refactoring, in contrast, is used for correcting deteriorated code and involves a protracted process consisting exclusively of refactoring. A survey of the literature suggested that floss refactoring is the recommended tactic, but provided only limited evidence that it is the more common tactic [10]. Why does this matter? Case studies in the literature, for example those reported by Pizka [13] and by Bourqun and Keller [1], describe root-canal refactoring. However, inferences drawn from these studies will be generally applicable only if most refactorings are indeed root-canals. We can estimate which refactoring tactic is used more frequently from the Eclipse CVS and Mylyn CVS data. We first define behavioral indicators of floss and root-canal refactoring during programming sessions, which (in contrast to the intentional definitions given above) we can hope to recognize in the data. For convenience, we define a programming session as the period of time between consecutive commits to CVS by a single programmer. In a particular session, if a programmer both refactors and makes a semantic change, then we say that that the programmer is floss refactoring. If a programmer refactors during a session but does not change the semantics of the program, then we say that the programmer is rootcanal refactoring. Note that a true root-canal refactoring must also last an extended period of time, or take place over several sessions. The above behavioral definitions relax this requirement, and so will tend to over-estimate the number of root canals. The results suggest that floss refactoring is more common than root-canal refactoring. Returning to Table 4, we can see that Some Refactoring, indicative of floss refactoring, accounted for 28% (11/40) of commits in Eclipse CVS and 50% (20/40) in Mylyn CVS. Comparatively, Pure Refactoring, indicative of root-canal refactoring, accounts for 15% (6/40) of commits in both Eclipse CVS and Mylyn CVS. Normalizing for the fact that only 10% (4/40) of all commits were labeled with refactoring keywords in Eclipse CVS, commits indicating floss refactoring would account for 30% of all commits while commits indicating root-canal would account for only 3% of all commits. 3 Normalizing in Mylyn CVS, floss commits account for 54% while root-canal commits account for 11%. Looking at the Eclipse CVS data another way, 98% 3. Our normalization procedure is described in Appendix 1. of individual refactorings would occur as part of a Some Refactoring (floss) commit, while only 2% would occur as part of a Pure Refactoring (root-canal) commit, again after normalizing for labeled commits. For the Mylyn CVS data set, 86% of individual refactorings would occur as part of a Some Refactoring (floss) commit, while 14% would occur as part of a Pure Refactoring (root-canal) commit. We also notice that for Eclipse CVS in Table 4, the Some Refactoring (floss) row tends to show more refactorings per commit than the Pure Refactoring (root-canal) row. However, this trend was not confirmed in the Mylyn CVS data; the number of refactorings in floss commits does not appear to be significantly different from the number in rootcanal commits. Pure refactoring with tools is relative infrequent in the Users data set, suggesting that very little root-canal refactoring occurred in Users as well. We counted the number of refactorings performed using a tool during sessions in that data. In no more than 10 out of 2671 sessions did programmers use a refactoring tool without also manually editing their program. In other words, in less that 0.4% of commits did we observe the possibility of root-canal refactoring using only refactoring tools. Our analysis of Table 4 is subject to the same limitations described in Section 3.4. The analysis of the Users data set (but not the analysis of Table 4) is also limited in that we consider only those refactorings performed using tools. Some refactorings may have been performed by hand; these would appear in the data as edits, thus possibly inflating the count of floss refactoring and reducing the count of root-canal refactoring. 3.6 Many Refactorings are Medium and Low-level Refactorings operate at a wide range of levels, from as low-level as single expressions to as high-level as whole inheritance hierarchies. Past research has often drawn conclusions based on observations of high-level refactorings. For example, several researchers have used automatic refactoringdetection tools to find refactorings in version histories, but these tools can generally detect only those refactorings that modify packages, classes, and member signatures [3], [4], [20], [21]. The tools generally do not detect sub-method level refactorings, such as EXTRACT LOCAL VARIABLE and INTRODUCE ASSERTION. We hypothesize that in practice programmers also perform many lower-level refactorings. We suspect this because lower-level refactorings will not change

8 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 8 Rename Constant (H) Move Member (H) Push Down (H) Extract Method (M) (H) Generalize Declared Type (L) Extract Local (L) Remove Parameter (H) Inline Local (L) Move Class (H) Add Parameter (H) Rename Field (H) Rename Method (H) Inline Constant (M) Inline Method (M) Rename Local (L) Rename Constant (H) Rename Type (H) Decrease Method Visibility (H) Extract Constant (M) Increase Method Visibility (H) Introduce Parameter (M) Reorder Parameter (H) Extract Class (H) Rename Class (H) Manual (Labeled) Rename Resource (H) Manual (Unlabeled) Rename Parameter (H) Tool (Labeled) Rename Package (H) Tool (Unlabeled) Introduce Factory (H) Fig. 3: Refactorings over 40 sessions for Mylyn (at left) and 40 sessions for Eclipse (at right). Refactorings shown include only those with tool support in Eclipse. the program s interface, and thus programmers may feel more free to perform them. To investigate this hypothesis, we divided the refactorings that we observed in our manual inspection of Eclipse CVS commits into three levels High, Medium and Low. We classified refactoring tool uses in the Mylyn and Toolsmiths data in the same way. High-level refactorings are those that change the signatures of classes, methods, or fields; refactorings at this level include RENAME CLASS, MOVE STATIC FIELD, and ADD PARAMETER. Medium-level refactorings are those that change the signatures of classes, methods, and fields and also significantly change blocks of code; this level includes EXTRACT METHOD, INLINE CONSTANT, and CONVERT ANONYMOUS TYPE TO NESTED TYPE. Because medium-level refactorings affect both signatures and code alike, more sophistication is needed for automated analysis and may not be properly identified by automated refactoring detectors. Low-level refactorings are those that make changes only to blocks of code; low level refactorings include EX- TRACT LOCAL VARIABLE, RENAME LOCAL VARIABLE, and ADD ASSERTION. Refactorings with tool support that were found in the Eclipse CVS and Mylyn CVS data set are labeled as high (H), medium (M), and low (L) in Figure 3. 4 The results of this analysis are displayed in Table 5. For each level of refactoring, we show what percentage of refactorings from Eclipse CVS (normalized), Toolsmiths, Mylyn CVS (nor- 4. Note that one refactoring, GENERALIZE DECLARED TYPE can be either high (if the type is declared in a signature) or low (if the type is declared in the body of a method). This refactoring was excluded from the analysis reflected in the data in Table 5.

9 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 9 Eclipse CVS Toolsmiths Mylyn CVS Mylyn Low 21% 33% 28% 10% Medium 11% 27% 22% 14% High 68% 40% 50% 76% TABLE 5: Refactoring level percentages in four data sets. malized), and Mylyn make up that level. We see that many low and medium-level refactorings do indeed take place; as a consequence, tools that detect only high-level refactorings will miss 24 to 60 percent of refactorings. 3.7 Refactorings are Frequent While the concept of refactoring is now popular, it is not entirely clear how commonly refactoring is practiced. In Xing and Stroulia s automated analysis of the Eclipse code base, the authors conclude that indeed refactoring is a frequent practice [21]. The authors make this claim largely based on observing a large number of structural changes, 70% of which are considered to be refactoring. However, this figure is based on manually excluding 75% of semantic changes resulting in refactorings accounting for 16% of all changes. Further, their automated approach suffers from several limitations, such as the failure to detect low-level refactorings, imprecision when distinguishing signature changes from semantic changes, and the course granularity available from the inspection of CVS revisions. To validate the hypothesis that refactoring is a frequent practice, we characterize the occurrence of refactoring activity in the Users, Toolsmiths, Mylyn data. Note that these data sets contain records of only those refactorings that were performed with tools. In order for refactoring activity to be defined as frequent, we seek to apply criteria that require refactoring to be habitual and to occur at regular intervals. For example, if refactoring occurs just before a software release, but not at other times, then we would not want to claim that refactoring is frequent. First, we examined the Toolsmiths and Mylyn data to ascertain how refactoring activity was spread throughout the development cycle. Second, we examined the Users data to determine how often refactoring occurred within a programming session, and whether there was significant variation across the population. In the Toolsmiths data, we found that refactoring occurred throughout the Eclipse development cycle. In 2006, an average of 30 refactorings took place each week; in 2007, there were 46 refactorings per week. Only two weeks in 2006 did not have any refactoring activity, and one of these was a winter holiday week. In 2007, refactoring occurred every week. In the Mylyn data, we find a similar trend for the first two years, but drop in the frequency and eventually the magnitude of refactoring in the last two years. Specially, refactoring activity did not occur for 1 week in both 2006 and 2007, 7 weeks in 2008, and 8 (out of 34) weeks in The respective averages were 31, 28, 36, and 6 tool refactorings per week. However, the average number of commits per week also declined in recent years (62, 65, 41 and 22); because there was decreased development activity, we would expect lower refactoring activity. In the Users data set, we found refactoring activity distributed throughout the programming sessions. Overall, 41% of programming sessions contained refactoring activity. More interestingly, if we assume that the number of edits (changes to a program made with an editor) approximates how much work was done during a session, then significantly more work was done in sessions with refactoring than without refactoring. We found that, on average, sessions without refactoring activity contained an order of magnitude fewer edits than sessions with refactoring. Looking at it a different way, sessions that contained refactoring also contained, on average, 71% of the total edits made by the programmer. This was consistent across the population: 22 of 31 programmers had an average greater than 72%, whereas the remaining 9 ranged from 0% to 63%. This analysis of the Users data suggests that when programmers must make large changes to a code base, refactoring is a common way to prepare for those changes. Inspecting refactorings performed using a tool does not have the limitations of automated analysis; it is independent of the granularity of commits and semantic changes, and captures all levels of refactoring activity. However, it has its own limitation: the exclusion of manual refactoring. Including manual refactorings can only increase the observed frequency of refactoring. Indeed, this is likely: as we will see in Section 3.8, many refactorings are in fact performed manually. 3.8 Refactoring Tools are Underused A programmer may perform a refactoring manually, or may choose to use an automated refactoring tool if one is available for the refactoring that she needs to perform. Ideally, a programmer will always use a refactoring tool if one is available, because automated refactorings are theoretically faster and less error-prone than manual refactorings. However, in one survey of 16 students, only 2 reported having used refactoring tools, and even then only 20% and 60% of the time [10]. In another survey of 112 agile enthusiasts, we found that the developers reported refactoring with a tool a median of 68% of the time [10]. Both of these estimates of usage are surprisingly low, but they are still only estimates. We hypothesize that programmers often do not use refactoring tools. We suspect this is because existing tools may not have a sufficiently usable user-interface. To validate this hypothesis, we correlated the refactorings that we observed by manually inspecting Eclipse CVS commits with the refactoring tool usages in the Toolsmiths data set. Similarly, we performed the same correlation for the Mylyn CVS commits and tool usages in the Mylyn data set. A refactoring found by manual inspection can be correlated with the application of a refactoring tool by looking for tool applications between commits. For example, the Toolsmiths data provides sufficient detail (time, new variable name and location) to correlate an EXTRACT LOCAL VARIABLE performed with a tool with an EXTRACT LOCAL VARIABLE observed by manually inspecting adjacent commits in Eclipse CVS. After analyzing the Toolsmiths data, we were unable to link 89% of 145 observed refactorings that had tool support to

10 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 10 Frequency >10 Percentage 79% 10% 8% 1% 0.05% TABLE 6: Distribution of number of tool refactorings per week from tool-using developers collected from weeks with development activity. any use of a refactoring tool (also 89% when normalized). After analyzing the Mylyn data, we were unable to link 78% of 72 observed refactorings that had tool support to any use of a refactoring tool (91% when normalized). This suggests that the developers associated with the Toolsmiths and Mylyn data primarily refactor manually. An unexpected finding was that 31 refactorings that were performed with tools were not visible by comparing revisions in CVS for the Toolsmiths data; the same phenomenon was observed 13 times with the Mylyn data. It appeares that most of these refactorings occurred in methods or expressions that were later removed, or in newly created code that had not yet been committed to CVS. Overall, the results support the hypothesis that programmers refactor manually in lieu of using tools. Measured tool usage was even lower than the median estimate from the professional agile developer survey. This suggests that either programmers unconsciously or consciously overestimate their tool usage (perhaps refactoring is often an unconscious activity, or perhaps expert programmers are embarrassed to admit that they do not use refactoring tools), or that expert programmers prefer to refactor manually. To observe if tools were underused by a larger population, we analyzed the UDC Events data, which includes timestamped Eclipse commands from developers. In the UDC Events data, of the participants, only participants had used refactoring tools during the period covered by the data. We would expect that not all of the participants would have used refactoring tools, because this dataset included non- Java developers, and many participants were not active users of Eclipse. From the tool-using participants, we examined how many times they used a refactoring tool each week. We also counted weeks where developers had development activity, but no refactoring tool usage. The distribution is presented in Table 6. Nearly 80% of the weekly development sessions did not have any refactoring tool usage, even among those who had used refactoring tools at some point. When developers did use refactoring tools, the usage within a week mostly remained in the single digits. The results suggest that tool usage may be as low among a wider population as with the Eclipse and Mylyn developers we have studied. This analysis suffers from several limitations. First, it is possible that some tool usage data from Toolsmiths may be missing. If programmers used multiple computers during development, some of which were not included in the data set, this would result in under-reporting of tool usage. Given a single commit, we could be more certain that we have a record of all refactoring tool uses over the code in that commit if we have a record of at least one refactoring tool use applied to that code since the previous commit. If we apply our analysis only to those commits, then 73% of refactorings (also 73% when normalized) cannot be linked with a tool usage. Likewise, data from Mylyn may be missing because developers may not have checked their refactoring histories into CVS. In fact, one developer in Developer Responses explicitly confirmed that sometimes he does not commit refactoring history. Applying our analysis to only those commits for which we have at least one refactoring tool use, then 41% of refactorings (45% when normalized) cannot be linked with a tool usage. Second, refactorings that occurred at an earlier time might not be committed until much later; this would inflate the count of refactorings found in CVS that we could not correlate to the use of a tool, and thus cause us to underestimate tool usage. We tried to limit this possibility by looking back several days before a commit to find uses of refactoring tools, but may not have been completely successful. Finally, in our analysis of UDC Events, we cannot discount the possibility that this population refactors less frequently in general, because we have no estimate of their manual refactoring. 3.9 Different Refactorings are Performed with and without Tools Some refactorings are more prone to being performed by hand than others. We have recently identified a surprising discrepancy between how programmers want to refactor and how they actually refactor using tools [10]. Programmers typically want to perform EXTRACT METHOD more often than RENAME, but programmers actually perform RENAME with tools more often than they perform EXTRACT METHOD with tools. (This can also be seen in all four groups of programmers in Table 1.) Comparing these results, we inferred that the EXTRACT METHOD tool is underused: the refactoring is instead being performed manually. However, it is unclear what other refactoring tools are underused. Moreover, there may be some refactorings that must be performed manually because no tool yet exists. We suspect that the reason that some kinds of refactoring especially RENAME are more often performed with tools is because these tools have simpler user interfaces. To explore this suspicion, we examined how the kinds of refactorings differed between refactorings performed by hand and refactorings performed using a tool. We once again correlated the refactorings that we found by manually inspecting Eclipse CVS commits with the refactoring tool usage in the Toolsmiths data. We repeated this process for the Mylyn CVS commits and Mylyn data. Finally, when inspecting the Eclipse CVS and Mylyn CVS commits, we identified several refactorings that currently have no tool support. The results are shown in Figure 3. Tool indicates how many refactorings were performed with a tool; Manual indicates how many were performed without. The figure shows that manual refactorings were performed much more often for certain kinds of refactoring. For example, EXTRACT METHOD is performed 9 times manually but just once with a tool in Eclipse CVS; REMOVE PARAMETER is never performed with a tool in the Mylyn CVS commits. Moreover, no refactorings were performed more often with a tool than manually in both Eclipse CVS and Mylyn CVS together. We can also see from the figure that many kinds of refactorings were performed exclusively by hand, despite having tool support.

11 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 11 Most refactorings that programmers performed had tool support. However, 30 refactorings from the Eclipse CVS commits and 36 refactorings from the Mylyn CVS commits did not have tool support. One of the most popular of these was MODIFY ENTITY PROPERTY, performed 8 times in the Eclipse CVS commits and 4 times in the Mylyn CVS commits, which would allow developers to safely modify properties such as static or final. A frequent but unsupported refactoring, REMOVE DECLARED EXCEPTION, occurred 12 times in the Mylyn CVS commits; it was commonly used to remove unnecessary exceptions from method signatures. Finally, we observed 3 instances of a REPLACE ARRAY WITH LIST refactoring in the Mylyn CVS commits. The same limitations apply as in Section DISCUSSION How do the results presented in Section 3 affect future refactoring research and tools? 4.1 Tool-Usage Behavior Several of our findings illuminate the behavior of programmers using refactoring tools. For example, our finding about how toolsmiths differ from ordinary programmers in terms of refactoring tool usage (Section 3.1) suggests that most kinds of refactorings will not be used as frequently as the toolsmiths hoped, when compared to the RENAME refactoring. For the toolsmith, this means that improving the underused tools or their documentation (especially the tool for EXTRACT LOCAL VARIABLE) may increase tool use. Other findings provide insight into the typical workflow involved in refactoring. Consider that refactoring tools are often used repeatedly (Section 3.2), and that programmers often do not configure refactoring tools (Section 3.3). For the toolsmith, this means that configuration-less refactoring tools, which have recently seen increasing support in Eclipse and other environments, will suit the majority of, but not all, refactoring situations. In addition, our findings about the batching of refactorings provides evidence that tools that force the programmer to repeatedly select, initiate, and configure can waste programmers time. This was in fact one of the motivations for Murphy-Hill and Black s refactoring cues, a tool that allows the programmer to select several program elements for refactoring at one time [8]. Questions still remain for researchers to answer. Why is the RENAME refactoring tool so much more popular than other refactoring tools? Why do some refactorings tend to be batched while others do not? Moreover, our experiments should be repeated in other projects and for other refactorings to validate our findings. 4.2 Detecting Refactoring In our experiments we investigated the assumptions underlying several commonly used refactoring-detection techniques. It appears that some techniques may need refinement to address some of the concerns that we have uncovered. Our finding that commit messages in version histories are unreliable indicators of refactoring activity (Section 3.4) is at variance with an earlier finding by Ratzinger [14]. It also casts doubt on the reliability of previous research that relies on this technique [6], [15], [17]. Thus, further replication of this experiment in other contexts is needed to establish more conclusive results. Our finding that many refactorings are medium or lowlevel suggests that refactoring-detection techniques used by Weißgerber and Diehl [20], Dig and colleagues [4], Counsell and colleagues [3], and to a lesser extent, Xing and Stroulia [21], will not detect a significant proportion of refactorings. The effect that this has on the conclusions drawn by these authors depends on the scope of those conclusions. For example, Xing and Stroulia s conclusion that refactorings are frequent can be strengthened by taking low-level refactorings into consideration. In contrast, Dig and colleagues tool was intended to help automatically upgrade library clients, and thus has no need to find low-level refactorings. In general, researchers who wish to detect refactorings automatically should be aware of what level of refactorings their tool can detect. Researchers can make refactoring detection techniques more comprehensive. For example, we observed that a common reason for Ratzinger s keyword-matching to mis-classify changes as refactorings was that a bug-report title had been included in the commit message, and this title contained refactoring keywords. By excluding bug-report titles from the keyword search, accuracy could be increased. In general, future research can complement existing refactoring detection tools with refactoring logs from tools to increase recall of low-level refactorings. 4.3 Refactoring Practice Several of our findings add to existing evidence about refactoring practice across a large population of programmers. Unfortunately, the findings also suggest that refactoring tools need further improvements before programmers will use them frequently. First, our finding that programmers refactor frequently (Section 3.7) confirms the same finding by Weißgerber and Diehl [20] and Xing and Stroulia [21]. For toolsmiths, this highlights the potential of refactoring tools, telling them that increased tool support for refactoring may be beneficial to programmers. Second, our finding that floss refactoring is a more frequently practiced refactoring tactic than root-canal refactoring (Section 3.5) confirms that floss refactoring, in addition to being recommended by experts [5], is also popular among programmers. This has implications for toolsmiths, researchers, and educators. For toolsmiths, this means that refactoring tools should support flossing by allowing the programmer to switch quickly between refactoring and other development activities, which is not always possible with existing refactoring tools, such as those that force the programmer s attention away from the task at hand with modal dialog boxes [10]. For researchers, studies should focus on floss refactoring for the greatest generality. For educators, it means that when they teach refactoring to students, they should teach it throughout the course rather than as one unit during which students are

12 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 12 taught to refactor their programs intensively. Students should understand that refactoring can be practiced both as a way of incrementally improving the whole program design, and also as a way to simplify the process of adding a feature, or to make a fragment of code easier to understand. Lastly, our findings that many refactorings are performed without the help of tools (Section 3.8) and that the kinds of refactorings performed with tools differ from the kinds performed manually (Section 3.9) confirm the results of our survey on programmers under-use of refactoring tools [10]. Toolsmiths need to explore alternative interfaces and identify common refactoring workflows, such as reminding users to EXTRACT LOCAL VARIABLE before an EXTRACT METHOD or finding a easy way to combine these refactorings: the goal should be to encourage and support programmers in taking full advantage of refactoring tools. For researchers, more information is needed about exactly why programmers do not use refactoring tools as much as they could. 4.4 Developer Responses on Manual Refactoring Our discussion with Eclipse and Mylyn developers (Developer Responses data) about their refactoring behavior generated several insights that may guide future research. We do not claim any generality for these insights, but believe that they do provide useful seeds for future investigation. We provided developers with several examples of a refactoring that they themselves performed with a tool in one case but without a tool in another. We then asked them to explain this behavior. From their responses, we identified three factors awareness, opportunity, and trust and two issues with tool workflow touch points and disrupted flow that may limit tool usage. Awareness is whether a developer realizes that there is a refactoring tool in her programming environment that can change her code on her behalf. As an example, one developer was not familiar with how the tools for the INLINE refactoring worked despite being an experienced developer. Similarly, one toolsmith described awareness problems occuring in the following scenario: I already know exactly how I want the code to look like. Because of that, my hands start doing copy-paste and the simple editing without my active control. After a few seconds, I realize that this would have been easier to do with a refactoring [tool]. But since I already started performing it manually, I just finish it and continue. Awareness is partially a problem of education, exposure, and encouragement; future research can consider how to improve sharing and exposing developers to different IDE features. But, as the toolsmith s scenario suggests, awareness can also be a problem when a developer knows what refactoring tools are available; future research may be able to help developers realize that the change that they are about to make can be automated with a refactoring tool. Opportunity is similar to awareness, but differs in that the developer knows about a feature, but does not know of an opportunity to use that feature. One developer described this situation: My problem isn t so much with how these tools do their job, but more with [opportunity]. Are these tools... available within the editor when pressing Ctrl+1 [a Eclipse automated suggestion tool]? Perhaps more should be recommended based on where I m at in the code and then perhaps I d use them more. As it is I have to remember the functionality exists, locate it in a popup menu (hands leaving keyboard = bad) and figure out the proper name for what it is I want to achieve. Researchers and toolsmiths can consider how to improve refactoring opportunities by introducing mechanisms for filtering applicable refactorings or conveying bad smells local to the source code currently being worked on. The final factor is trust: does a developer have faith in an automated refactoring tool? Developers must often act on faith because they are not informed of the limitations of a tool, or of cases in which an automated refactoring may fail. Several developers mentioned they would avoid using a refactoring tool because of worries about introducing errors or unintended side-effects. Perhaps future research can improve trust by generating inspection checklists to cover situations where the tool may fail? Developers also described several limitations with the way that refactoring tools fit into their development workflow. The first limitation involves touch points: the set of places in the code that may be affected by a potential change. One developer described how using an automated refactoring tool would curtail what would otherwise be a potentially more rich and thorough experience: More often than not, when starting a manual refactoring, upon saving that first change, all dependent code throughout the code base will light up (compile errors). At this point you can survey what the refactoring will involve and potentially discover potions of the code you didn t realize the refactoring would affect. Tool support tries to achieve this by presenting users with a preview of the changes, but this information is always being presented in an unfamiliar way (i.e., presented in a wizard dialog instead of the PackageExplorer for example). The previews are not only unfamiliar UI, but usually more difficult to use to explore the affected code than it would be in the native package explorer and editor. Current preview dialogs do not provide the same level of participation and interaction available by visiting error locations one by one in the editor. Researchers and toolsmiths might well consider how to better engage a developer with a thorough review or more intelligent summarization of a proposed change. A second limitation involves disrupted flow, an interruption to the focus and working memory of the developer. Besides perceived slowness, developers mentioned disrupted flow as a general concern when using refactoring tools. Consider

13 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 13 why refactoring tools may be disruptive: development often requires deep concentration and maintenance of mental thoughts and plans, which are facilitated by the availability of textual cues and fluid movement through the source code. A developer may feel that using a refactoring tool would disrupt her concentration: by initiating a refactoring that requires configuration, a developer traps herself within a modal window, temporarily isolating herself from the source code until the refactoring is completed. The modal window blocks source text previously available on the screen and limits mobility to other parts of the code that the developer may want to inspect. Toolsmiths should consider how to better integrate refactoring tools within the programming environment in a way that limits disruption. Finally, we believe that further examination of why developers use refactoring tools in some cases, but not in others, will help identify why refactoring tools break down and what can be done to improve these tools. 4.5 Limitations of this Study In addition to the limitations noted in each subsection of Section 3, some characteristics of our data limit the validity of all of our analyses. First, all the data report on refactoring Java programs in the Eclipse environment. While this is a widelyused language and environment, the results presented in this article may not hold for other languages and environments. Second, Users, Toolsmiths, and Mylyn may not represent programmers in general. Third, the Users and Everyone data sets may overlap the Toolsmith or Mylyn data set: both the Users and Everyone data sets were gathered from volunteers, and some of those volunteers may have been Toolsmiths or Mylyn developers. Finally, our inspection of the Eclipse CVS and Mylyn CVS data excluded commits to branches; readers of our previous work [11] have expressed concern that developers may have refactored more in branch commits than in regular commits. However, our interviews with both Toolsmith and Mylyn developers confirmed that refactoring in branches was discouraged because of the difficulty of later merging refactored code. 4.6 Experimental Data and Tools Our publicly available data, the SQL queries used for correlating and summarizing that data, and the tools we used for batching refactorings and grouping CVS revisions can be found at Our normalization procedure and survey sent to developers can be found in the appendices. 5 CONCLUSIONS Research about refactoring, like research in all areas, relies on a solid foundation of data. In this article, we have examined the foundations of refactoring from several perspectives. In some cases, the foundations appear solid; for example, programmers do indeed refactor frequently. In other cases, the foundations appear weak; for example, when committing code to version control, developers messages appear not to reliably indicate refactoring activity. Our results can help the research community build better refactoring tools and techniques in the future through enhanced knowledge of how we refactor, and how we know it. ACKNOWLEDGMENTS We thank Barry Anderson, Christian Bird, Tim Chevalier, Danny Dig, Thomas Fritz, Markus Keller, Ciarán Llachlan Leavitt, Ralph London, Gail Murphy, Suresh Singh, and the Portland Java User s Group for their assistance, as well as the National Science Foundation for partially funding this research under CCF Thanks to our anonymous reviewers and the participants of the Software Engineering seminar at UIUC for their excellent suggestions. ERRATA In the ICSE paper on which this work is based [11], we made two minor mistakes that have been corrected in this article. We highlight the corrections here. In the ICSE paper, we said that Low-, Medium-, and High-level refactorings made in Eclipse CVS accounted for 18%, 22% and 60% of refactorings, respectively. We did not correctly apply our normalization procedures when we reported these numbers. In this article, in Table 5, we include the corrected numbers: 21%, 11% and 68%. In the ICSE paper, we reported finding 21 PUSH DOWN refactorings. We misclassified 5 of these refactorings; they should have been classified as MOVE METHOD refactorings. The correct classification is shown in this article in Figure 3. APPENDIX 1: NORMALIZATION PROCEDURE In Section 3.5, we discussed a normalization procedure for some reported data. To explain the procedure we used for calculating, we ll give the intuitive explanation and an example calculation below: We wish to estimate how many pure-refactoring commits were made to CVS. Recall that previously, we sampled 20 Labeled projects and 20 Unlabeled projects, and we know that 6 Labeled commits were pure-refactoring and 0 Unlabeled commits were pure-refactoring. Naively, we might simply do the addition (6+0) and divide over the total commits to get the estimate: 6/40 = 15%. However, this is a good estimate for our sample, but a bad estimate for the population as a whole, because our sample was drawn from two unequal strata. Specifically, in this naive estimate, we are giving too much weight to the 6 pure-refactoring commits, because Labeled commits only account for about 10% of total commits. So what do we do? Instead of the naive approach, we normalize our estimate for the relative proportions of Labeled ( 10%) to Unlabeled commits ( 90%). The following calculation gives the normalized result: 6 is the number of Labeled pure-refactoring commits. 0 is the number of Unlabeled pure-refactoring commits. 290 is the number of Labeled commits is the number of Unlabeled commits.

14 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 14 (6/20)*(290/( )) + (0/20)*(1-290/( )) = And thus, we estimate that about 3% of commits contained pure-refactorings. APPENDIX 2: SURVEY The following template describes a survey sent to people who developed code for the Mylyn and Toolsmiths data sets. In this template, XXX was instantiated with the developer s first name. YYY was instantiated with either Mylyn or Eclipse, depending on which project the developer worked on. ZZZ.refactoringname was instantiated with a transaction number and a refactoring name, DD/MM/YYYY indicates when that refactoring was checked in to CVS, and some comment was instantiated with the comment that the developer made when checking in to CVS. Subject: Your thoughts on our refactoring study Dear XXX, My colleagues Chris Parnin, Andrew Black, and I are completing a study about refactoring and refactoring tool use. We investigated two case studies of refactoring, one of which was of the YYY project, of which you were a committer. In short, our analysis compared your refactoring tool history (produced when you used the Eclipse refactoring tools) with what we inferred as refactoring in the code when you committed to CVS. From this, we made estimates about how often people refactor, what kinds of refactoring tools they use, and when they use or do not use refactoring tools. We are hoping you will answer a few questions about your thoughts on issues related to your refactoring. We hope this will provide some insights into how we can interpret our results. You are one of less than 10 people that we are inviting to participate, so your comments are extremely valuable to us. Below, you will find several interview questions; we anticipate that they will take around 15 minutes to complete in total. Unless you indicate otherwise, we would like to reserve the right to summarize or repeat your answers verbatim in our forthcoming paper. As for privacy, in the paper we do not personally identify developers by name or by CVS username (although we do say which projects we analyzed). You can respond to this questionnaire simply by replying to this . Because we are on a tight publishing deadline, if you choose to participate, we would appreciate your response by February 25th. Sincerely, Emerson Murphy-Hill, The University of British Columbia Chris Parnin, Georgia Institute of Technology Andrew P. Black, Portland State University We published a first version of our analysis in a research paper called How We Refactor, and How We Know It in 2009 at the International Conference on Software Engineering. Did you happen to read it? One of our main findings was that, by comparing refactoring tool histories and the refactorings apparent in CVS, developers appear to use refactoring tools for about 10% refactorings for which a refactoring tool is available. Speaking for yourself, why do you think you would not use a refactoring tool when one was available? In the attached PDFs, we have included a few snippets of code where we inferred that you performed a refactoring for which Eclipse has tool support. However, for some of the examples, we did not have a record of you using a refactoring tool and thus concluded that you refactored without one. If you are able to remember the change that you made, could you recall or infer why you did or didn t use the tool for that refactoring? (For each file, we use green and red annotations highlights to show what was added and removed, and grey highlights to draw your attention to specific parts of code.) File: ZZZ.refactoringname-tool.diff.pdf Change date: DD/MM/YYYY CVSComment: some comment Another finding was that developers sometimes repeatedly used the same refactoring tool in quick succession (e.g., used Inline, then used it again in the next few seconds). Can you think of any reasons why you might do this? Off the top of your head, please try to name the three refactorings that you perform most often, and three that you perform most often using refactoring tools. Do you plan long-term refactoring campaigns, where you engage in extended refactoring for a period of time? If so, what is the motivation? How long do these usually take? Are there pitfalls during these campaigns? How would you want refactoring tools to help at those times? In our analysis, we looked at only refactorings in commits to the main line. Do you think you refactored differently when you committed to branches? How you think the fact that your team developed tools for Eclipse affected how you used refactoring tools? Do you think you used the refactoring tools more/less/about the same as the average Eclipse Java developer? Is there some particular Eclipse refactoring tool (or part of a tool) that doesn t fit with the way that you refactor? If so, please tell us which tool, and what the problem is. Do you desire additional tool support when you refactor? (Below, we ve included a list of Eclipse refactoring tools to help jog your memory.) Rename Extract Local Variable Inline Extract Method Move Change Method Signature Convert Local To Field Introduce Parameter Extract Constant Convert Anonymous To Nested

15 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 15 Move Member Type to New File Pull Up Encapsulate Field Extract Interface Generalize Declared Type Push Down Infer Generic Type Arguments Use Supertype Where Possible Introduce Factory Extract Superclass Extract Class Introduce Parameter Object Introduce Indirection Would you like a copy of the paper when it is complete? REFERENCES [1] F. Bourqun and R. K. Keller. High-impact refactoring based on architecture violations. In CSMR 07: Proceedings of the 11th European Conference on Software Maintenance and ReEngineering, pages , Washington, DC, USA, IEEE Computer Society. [2] S. Counsell, Y. Hassoun, R. Johnson, K. Mannock, and E. Mendes. Trends in Java code changes: the key to identification of refactorings? In PPPJ 03: Proceedings of the 2nd International Conference on Principles and Practice of Programming in Java, pages 45 48, New York, NY, USA, Computer Science Press, Inc. [3] S. Counsell, Y. Hassoun, G. Loizou, and R. Najjar. Common refactorings, a dependency graph and some code smells: an empirical study of Java OSS. In ISESE 06: Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, pages , New York, NY, USA, ACM. [4] D. Dig, C. Comertoglu, D. Marinov, and R. Johnson. Automated detection of refactorings in evolving components. In D. Thomas, editor, ECOOP, volume 4067 of Lecture Notes in Computer Science, pages Springer, [5] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, [6] A. Hindle, D. M. German, and R. Holt. What do large commits tell us?: A taxonomical study of large commits. In MSR 08: Proceedings of the 2008 International Workshop on Mining Software Repositories, pages , New York, ACM. [7] G. C. Murphy, M. Kersten, and L. Findlater. How are Java software developers using the Eclipse IDE? IEEE Software, 23(4):76 83, [8] E. Murphy-Hill and A. P. Black. High velocity refactorings in Eclipse. In Proceedings of the Eclipse Technology exchange at OOPSLA 2007, New York, ACM. [9] E. Murphy-Hill and A. P. Black. Breaking the barriers to successful refactoring: Observations and tools for extract method. In ICSE 08: Proceedings of the 30th International Conference on Software Engineering, pages , New York, ACM. [10] E. Murphy-Hill and A. P. Black. Refactoring tools: Fitness for purpose. IEEE Software, 25(5):38 44, [11] E. Murphy-Hill, C. Parnin, and A. P. Black. How we refactor, and how we know it. In ICSE 09: Proceedings of the 31st International Conference on Software Engineering, New York, [12] W. F. Opdyke and R. E. Johnson. Refactoring: An aid in designing application frameworks and evolving object-oriented systems. In SOOPPA 90: Proceedings of the 1990 Symposium on Object-Oriented Programming Emphasizing Practical Applications, September [13] M. Pizka. Straightening spaghetti-code with refactoring? In H. R. Arabnia and H. Reza, editors, Software Engineering Research and Practice, pages CSREA Press, [14] J. Ratzinger. space: Software Project Assessment in the Course of Evolution. PhD thesis, Vienna University of Technology, Austria, [15] J. Ratzinger, T. Sigmund, and H. C. Gall. On the relation of refactorings and software defect prediction. In MSR 08: Proceedings of the 2008 International Workshop on Mining Software Repositories, pages 35 38, New York, ACM. [16] R. Robbes. Mining a change-based software repository. In MSR 07: Proceedings of the Fourth International Workshop on Mining Software Repositories, pages 15 23, Washington, DC, USA, IEEE Computer Society. [17] K. Stroggylos and D. Spinellis. Refactoring does it improve software quality? In WoSQ 07: Proceedings of the 5th International Workshop on Software Quality, pages 10 16, Washington, DC, USA, IEEE Computer Society. [18] The Eclipse Foundation. Usage Data Collector Results, February 12, Website, commands.csv. [19] M. A. Toleman and J. Welsh. Systematic evaluation of design choices for software development tools. Software - Concepts and Tools, 19(3): , [20] P. Weißgerber and S. Diehl. Are refactorings less error-prone than other changes? In MSR 06: Proceedings of the 2006 International Workshop on Mining Software Repositories, pages , New York, ACM. [21] Z. Xing and E. Stroulia. Refactoring practice: How it is and how it should be supported an Eclipse case study. In ICSM 06: Proceedings of the 22nd IEEE International Conference on Software Maintenance, pages , Washington, DC, USA, IEEE Computer Society. [22] T. Zimmermann and P. Weißgerber. Preprocessing CVS data for fine-grained analysis. In MSR 04: Proceedings of the International Workshop on Mining Software Repositories, pages 2 6, Emerson Murphy-Hill Emerson is an assistant professor at North Carolina State University. His research interests include human-computer interaction and software tools. He holds a Ph.D. in Computer Science from Portland State University. Contact him at emerson@csc.ncsu.edu;

Black Andrew is a professor at Portland State University. His research interests include the design of programming languages and programming environments.

16 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 16 Chris Parnin Chris is a PhD student at the Georgia Institute of Technology. His research interests includes psychology of programming and empirical software engineering. Contact him at chris.parnin@gatech.edu; vector. Andrew P. Black Andrew is a professor at Portland State University. His research interests include the design of programming languages and programming environments. In addition to his academic posts he has also worked as an engineer at Digital Equipment Corp. He holds a D.Phil in Computation from the University of Oxford. Contact him at black@cs.pdx.edu; black.

How We Refactor, and How We Know it. Emerson Murphy-Hill, Chris Parnin, Andrew P. Black ICSE 2009

How We Refactor, and How We Know it Emerson Murphy-Hill, Chris Parnin, Andrew P. Black ICSE 2009 Introduction Refactoring is the process of changing the structure of a program without changing the way