Cloning by Accident? - PDF Free Download

Cloning by Accident? An Empirical Study of Source Code Cloning Across Software Systems Project Report for CS 846: Software Evolution and Design Winter 2005 By Raihan Al-Ekram, rekram@swag.uwaterloo.ca SWAG, School of Computer Science April 4, 2005 University of Waterloo

Abstract Code duplication or cloning is found to be a common phenomenon in software development. Studies showed that a large software system typically consists of 10-15% of duplicated code. Clones are introduced in a system because of ad-hoc code reuse by intentionally copying part of the existing code. Occasionally there can code fragments that are just accidentally identical and not a copy of another. Therefore if there is any cloning across two software systems, not originated from each other, is expected to be accidental. If the two software systems belong to the same problem domain, the accidental clones can be resulted from code fragments implementing similar functionality in that domain. The cloning can also be resulted from the reuse of functions and source files from system to another. In this report we present our findings on the nature of cloning across software systems based on an empirical study we have performed on 2 different problem domains and 7 software systems in each domain. The goal of the study is to determine whether the cloning across software systems is actually by accident, if the problem domain contributes to the amount of cloning, if there is significant code reuse among software systems and if there is any single factor that contributes to the cloning most. Cloning by Accident? Page 2 of 18

Table of Contents Abstract... 2 1 Introduction... 4 1.1 Source Code Cloning... 4 1.2 Reasons of Cloning... 4 1.3 Cloning Across Software Systems... 5 1.4 Cloning vs. Reuse... 5 1.5 Project Goal... 5 2 Study Approach... 6 2.1 Steps... 6 2.2 CCFinder... 6 2.3 CICS... 7 3 Case Study: Text Editors... 7 3.1 Cloning Statistics... 8 3.2 Clone Classification... 9 3.3 Detailed Analysis... 9 3.4 Summary... 13 4 Case Study: Network Firewalls... 13 4.1 Cloning Statistics... 14 4.2 Clone Classification... 15 4.3 Detailed Analysis... 15 4.4 Summary... 17 5 Conclusion... 17 References... 18 Cloning by Accident? Page 3 of 18

1 Introduction 1.1 Source Code Cloning Source code cloning can be defined as the act of intentional duplication of code fragments by brute-force copying with possible customization in it [3]. Studies have shown cloning to be a common phenomenon in the development of a large software system. Baker [1] reported a 19% of duplication out of 515K lines of C code in the X- Window System source. Kamiya et al. [4] demonstrated that 21% of 570K lines of Java code in the JDK source are part of a clone. Kapser [5, 6] in his study revealed that at least 12% out of 280K lines of code written in C in the File System subsystem of Linux source code are involved in cloning. Baxter et al. [3] analyzed a 400K lines industrial process control system and identified about 28% cloned code. The highest amount of cloning reported so far is by Ducasse [2], which is a Payroll system with 40K lines of COBOL code containing as high as 50% duplication. 1.2 Reasons of Cloning The main reason of cloning is the ad-hoc reuse of some existing code in the program that implements a similar functionality. Instead of forming a proper abstraction of the similar code a programmer may simply make a copy of it in order to save effort and time. Sometimes it can be difficult to identify a proper abstraction due to the presence of crosscutting concerns [7]. A second reason of cloning can be the planned duplication of code, for example, to avoid the formation of an abstraction due to strict performance requirement. This type of cloning is desirable. Another such desirable cloning is the code duplication between two architectural entities order to maintain architectural clarity. Finally, there can be code fragments that are just accidentally identical. Technically these are not clones since they were not intentionally copied from each other. But any clone detection tool will identify them as clones since they look similar. We name such accidental identical code fragments Clones by Accident. Cloning by Accident? Page 4 of 18

1.3 Cloning Across Software Systems The clone detection studies mentioned in section 1.1 identify duplications in a single software system. But Kamiya et al. [4] presents an interesting analysis of cloning across the source code of three different operating systems: 2.4 millions lines Linux, 2.2 million lines FreeBSD and 2.6 million lines NetBSD. Their analysis showed that there are about 20% cloning between FreeBSD and NetBSD, whereas this amount is less than 1% between Linux and FreeBSD or NetBSD. FreeBSD and NetBSD originate from the same BSD OS and hence they share a certain amount of common code. Linux, on the other hand, originated and grew totally independently. Any cloning between Linux and FreeBSD or NetBSD is therefore expected to be accidental. But it was found that the cloning were mostly reuse of codes for the device drivers. 1.4 Cloning vs. Reuse Cloning is the ad-hoc reuse of some existing code fragment with possible customization, hence giving slightly different copies of the same code. In a true reuse scenario, instead of copying the code, a common abstraction will be formed from the copies to accommodate the customization and the abstraction will be called in place of the copies. Reuse is traditionally achieved using program libraries. Object-oriented languages provide reusability through inheritance and genericity. Reuse between two software systems is usually achieved by means of library routines and API calls provided by them. Source level reuse between two software systems can be described as sharing an abstraction at a level higher than just few lines in the source code, e.g. functions, classes or files. 1.5 Project Goal In this research we extend the study of Kamaiya [4] and investigate code cloning across different software systems in more details. We select a number of problem domains and a number of software for each of the domains. For each problem domain we perform clone detection across all the software in that domain. We then analyze the results of the clone detection to answer the following questions: Cloning by Accident? Page 5 of 18

What is the amount of code cloning across the software systems? What are the different types clones according to Kapser s categorization [5]? Is there any dominant category? How much code is reused from the available existing software when developing new software in the same problem domain? Does a problem domain contribute to the amount of cloning? Is there a single factor that contributes to the cloning most? What is the amount of accidental cloning across the software systems? 2 Study Approach 2.1 Steps We select two problem domains to perform our study: text editor and network firewall. Software in these domains can be characterized as front-end and back-end software respectively. We then select 7 software in each domain to perform the cloning analysis. All of the selected software are from open source, since we do not have access to the source code of any closed source software in those domains. Finally we apply CCFinder [4] and CICS [6] tools on the combined source code of the software on each domain for the clone detection and analysis purpose. 2.2 CCFinder The CCFinder uses a parameterized string matching technique and a suffix-tree based matching algorithm on the token stream of the input code. CCFinder currently works on C/C++, Java and COBOL source files. Because of the parameterized string matching technique CCFinder gives a high recall and low precision on the result [8]. In our study, for each domain we setup the source files of each software as a group and configure CCFinder to detect clone pairs between the source files that belong to the different groups only. We configure the minimum number of tokens in a code fragment to be considered as clone to be 30. Cloning by Accident? Page 6 of 18

2.3 CICS In order to improve the precision we use the Clone Interpretation and Classification System (CICS) to filter out the false positives from the result generated by CCFinder. In addition CICS provides different cloning statistics and a categorization of clones based on the subsystem distance taxonomy. Since we are interested in the clones across different software system, the location wise partitioning of clones in the taxonomy is not relevant for us. We are more interested in the region wise partitioning. The most important category of clones in this taxonomy is the clones between two functions, the Function to Function Clones. Function to Function clones are further divided into Function Clones, Partial Function Clones, Cloned Function Body and Cloned Blocks. Function Clones are functions sharing at least 60% of the code, Partial Function Clones are Function Clones where one function is a slightly extended copy of the other, Cloned Function Body are where one function is copied into a considerably larger one and Cloned Blocks are blocks of code not large enough to fall in one of the above categories. 3 Case Study: Text Editors Our first case study is on the text editor domain. In this domain we have selected seven text editors as guinea pigs for analysis: Emacs, Gedit, Glimmer, Nano, Nedit, Vim and Xenon. Table 1 lists the basic information about them. Table 1: Text Editor Guinea Pigs Editor Release Files LOC Lang Platform Description Emacs 21.3 495 359,259 C Console GNU text editor Gedit 2.8.2 121 47,086 C X11 Gnome text editor Glimmer 1.2.1 124 44,560 C X11 X-based code editor Nano 1.2.4 12 13,602 C Console pico clone Nedit 5.5 136 124,055 C X11 Motif text editor Vim 6.3 94 260,348 C Console vi clone Xenon 0.6.6 78 17,064 C X11 X-based text editor Cloning by Accident? Page 7 of 18

The size of the source code of each editor ranges from some 13K to over 350K lines of code. All of them are written in C. Some of them use the console mode, whereas others use X11 graphical mode. The combined size of all the editors is 866K lines of code spanned in 1,060 files. 3.1 Cloning Statistics Facts about the cloning in the text editors are gathered after applying the CCFinder and CICS tools on the source code. Table 2 shows the overall cloning statistics in the text editor domain and Table 3 shows statistics about each of the software in the domain. Table 2: Overall Cloning Stats Source 1,060 Number of files Clone 114 % 10.75 Source 865,974 Lines of code (LOC) Clone 4,858 % 0.56 Number of clone pairs 11,606 Number of clone classes 516 Average Lines per Clone 7 Table 3: Cloning Stats for Individual Guinea Pig Editor #Clones %Clones Density = LOC:Clones Emacs 9,378 80 38 Gedit 522 4.5 90 Glimmer 566 4.8 79 Nano 115 1 118 Nedit 10,543 90 12 Vim 2,083 18 125 Xenon 5 0.04 3,413 Cloning by Accident? Page 8 of 18

There are 11,606 clone pairs and 516 clone classes across the 7 text editors. Each clone has an average size of 7 lines of code. The clones encompass over 10% of the total files but only 0.56% lines of total source code. The editor Nedit contains the most amounts of clones, 90% of the total clones pairs. Followed by Emacs containing 80%. Emacs is one of the largest software among the seven and therefore is expected to have many clones in it. A more meaningful metric would be the clone density, the ratio of LOC and number of clones, in a system. We find that Nedit has the maximum clone density, one clone for every 12 lines of code, followed by Emacs. 3.2 Clone Classification Table 4 presents the significant portion of the classification of clones obtained from CICS. It shows that almost all clones are Function to Function with no bias to any specific subcategory. This implies that the clones are of all different size and granularity inside the functions. Table 4: Classification of Clones Taxonomy #Clone #Clone Pairs Classes Total Clones 11,606 516 Function to Function 11,600 510 Function Clones 3,700 13 Partial Function Clones 2,593 11 Cloned Function Body 1,659 46 Cloned Blocks 3,648 440 3.3 Detailed Analysis If we take closer look at the cloning stats we find that a total of 9,837 out of 10,543 clones in Nedit comes from only 7 out of 136 files in the source code. Only 3 three functions of 109 in the search.c file contribute to over half of them. In Emacs 9,160 clones out of 9,378 comes from only 3 files out of 495. Only 1 function out of 55 in the lwlib-xm.c file contributes to 6,419 of them. In Vim a single file gui_motif.c out of 93 contributes to 1,936 out of 2,083 clones in the system. Interestingly, almost all the clone Cloning by Accident? Page 9 of 18

pairs that originate in the files mentioned above of one system also terminate in the files mentioned above of another system. So most of the clones originating from the 3 files of Emacs form a pair with either the 7 files in Nedit or the 1 file in Vim; forming the most and the largest clone classes. We investigate the source code of these clone classes and apprehend that the purpose of the functions are setting up and creating widgets and dialog boxes for graphical user interaction. Table 5 depicts fragments of code with 4 clone pairs between the make_dialog function of lwlib-xm.c file of Emacs and CreateFindDlog function of search.c file of Nedit. Table 6 lists the files and functions involved in the clone classes of this group that accounts for over 10,000 of the total clone pairs. The clones in these classes clearly were not copied across different text editors; they are similar because they perform similar tasks in different systems. But there is a good chance that the similar code inside one editor system were copied and modified as needed. Therefore, the cloning across different editors in this clone class is by accident. Table 5: Example Clones of Class 1 Emacs: lwlib-xm.c/make_dialog 1093: XtSetArg(al[ac], XmNnumColumns, left_buttons + right_buttons + 1); ac++; 1094: XtSetArg(al[ac], XmNmarginWidth, 0); ac++; 1095: XtSetArg(al[ac], XmNmarginHeight, 0); ac++; 1096: XtSetArg(al[ac], XmNspacing, 13); ac++; 1097: XtSetArg(al[ac], XmNadjustLast, False); ac++;... 1259: XtSetArg(al[ac], XmNleftAttachment, XmATTACH_WIDGET); ac++; 1260: XtSetArg(al[ac], XmNleftOffset, 13); ac++; 1261: XtSetArg(al[ac], XmNleftWidget, icon); ac++; 1262: XtSetArg(al[ac], XmNrightAttachment, XmATTACH_FORM); ac++; 1263: XtSetArg(al[ac], XmNrightOffset, 13); ac++; Nedit: search.c/createfinddlog 1219: XtSetArg(args[argcnt], XmNrightAttachment, XmATTACH_FORM); argcnt++; 1220: XtSetArg(args[argcnt], XmNtopWidget, label1); argcnt++; 1221: XtSetArg(args[argcnt], XmNleftOffset, 6); argcnt++; 1222: XtSetArg(args[argcnt], XmNrightOffset, 6); argcnt++; 1223: XtSetArg(args[argcnt], XmNmaxLength, SEARCHMAX); argcnt++;... 1351: XtSetArg(args[argcnt], XmNlabelString, st1=mkstring("cancel")); argcnt++; 1352: XtSetArg(args[argcnt], XmNtopAttachment, XmATTACH_FORM); argcnt++; 1353: XtSetArg(args[argcnt], XmNbottomAttachment, XmATTACH_NONE); argcnt++; 1354: XtSetArg(args[argcnt], XmNleftAttachment, XmATTACH_NONE); argcnt++; 1355: XtSetArg(args[argcnt], XmNrightAttachment, XmATTACH_POSITION); argcnt++; Cloning by Accident? Page 10 of 18

Table 6: Clone Class 1 GUI Specific Clones Editor File #Clones Function Purpose xfns.c 421 x_window Setup X Widgets Emacs lwlib-xaw.c lwlib-xm.c 2,320 make_dialog xaw_create_scrollbar 6,419 make_dialog Lucid Widget Library search.c 5,380 CreateFindDlog CreateReplaceDlog CreateReplaceMultiFileDlog smartindent.c 677 EditCommonSmartIndentMacro EditSmartIndentMacros Nedit usercmds.c highlightdata.c 533 EditShellMenu EditMacroOrBGMenu 496 EditHighlightPatterns Create Dialog Boxes EditHighlightStyles shell.c 355 createoutputdialog fontsel.c 1,628 FontSel printutils.c 768 CreateForm gui_motif.c 1,936 find_replace_dialog_create Vim gui_mch_add_menu_item gui_mch_dialog Motif Support gui_x11_create_widgets A second significant clone class contains clones between mdi-routines.c file of Glimmer and gedit-mdi-child.c file of Gedit. Although the class consists of only 127 clones in it, it is significant because of the code in the clones. The codes in the two files deal with signaling among the MDI children when editing multiple files in the editor. Table 7 depicts code fragments with a clone pair for the clone class given in Table 8. The clones in this class are specific to the text editor problem domain, since they both implement the support for editing multiple documents in the editors. The third interesting clone class, as given in Table 9, comprises of two files of the same name regex.c in two different text editors Glimmer and Emacs. There are a total 54 clone pairs between the files. Investigating the source code reveals that regex.c is an Cloning by Accident? Page 11 of 18

extended regular expression matching and search library that is used by both the editors. But the version in Emacs is newer and contains additional helper macros and functions with the older macros and functions almost unchanged. This is clearly a case of code reuse between the two editors. Table 7: Example Clones of Class 2 Glimmer: mdi-routines.c 67: gtk_signal_connect (GTK_OBJECT (new_file->text), "undo_changed", 68: GTK_SIGNAL_FUNC (undo_changed), new_file); 69: gtk_signal_connect (GTK_OBJECT (new_file->text), "move_to_column", 70: GTK_SIGNAL_FUNC (file_pos_changed), 0); 71: gtk_signal_connect (GTK_OBJECT (new_file->text), "focus_in_event", 72: GTK_SIGNAL_FUNC (editor_window_focused), new_file); 73: gtk_widget_set_style (new_file->text, extext_style); Gedit: gedit-mdi-child.c 322: g_signal_connect (G_OBJECT (child->document), "name_changed", 323: G_CALLBACK (gedit_mdi_child_document_state_changed_handler), 324: child); 325: g_signal_connect (G_OBJECT (child->document), "readonly_changed", 326: G_CALLBACK (gedit_mdi_child_document_readonly_changed_handler), 327: child); 328: g_signal_connect (G_OBJECT (child->document), "modified_changed", 329: G_CALLBACK (gedit_mdi_child_document_state_changed_handler), 330: child); 331: g_signal_connect (G_OBJECT (child->document), "can_undo", Table 8: Clone Class 2 Domain Specific Clones Editor File #Clones Purpose Glimmer mdi-routines.c 79 MDI Signaling Gedit gedit-mdi-child.c 48 Table 9: Clone Class 3 Code Reuse Editor File LOC #Clones Purpose Glimmer 4,948 Extended regular Emacs regex.c 6,057 54 expression matching and search library Cloning by Accident? Page 12 of 18

3.4 Summary The presented cloning statistics and analysis indicates that almost all the clones in text editor domain are related to user interface, e.g. setting-up environments and widgets for creating dialog boxes. These clones are not due to intentional copying of code, but rather resulted accidentally while implementing similar functionality. The problem domain has very little contribution in cloning, but the front-end nature of the domain contributes most of the clones. We also observe a little reuse of code between the systems. 4 Case Study: Network Firewalls Our second case study is on the network firewall domain. In this domain we have selected seven firewalls as guinea pigs for analysis: Firestarter, Guarddog, Ip_fil, LnxFire, Nessus, Seahorse and Xfwall. Table 10 lists the basic information about them. Table 10: Network Firewall Guinea Pigs Firewall Release Files LOC Lang Description Firestarter 1.0.3 778 284,159 C/C++ Visual firewall tool for Linux Guarddog 2.4.0 17 8,771 C++ Firewall generation for KDE Ip_fil 4.1.6 239 86,033 C TCP/IP packet filter LnxFire 0.1 48 9,478 C Firewall-creation wizard Nessus 2.2.3 290 103,468 C Security auditing tool Seahorse 0.7.6 94 29,173 C Gnu Privacy Guard program Xfwall 1.3 52 29,570 C/C++ Firewall for corporate users The size of the source code of firewalls ranges from some 8K to over 280K lines of code. The firewalls are written in C, C++ or a combination of both. The combined size of all the firewalls is 550K lines of code spanned in 1,518 files. Cloning by Accident? Page 13 of 18

4.1 Cloning Statistics Table 11 shows the overall cloning statistics in the network firewall domain and Table 12 shows statistics about each of the software in the domain. Table 11: Overall Cloning Stats Source 1,518 Number of files Clone 69 % 4.55 Source 550,652 Lines of code (LOC) Clone 4,070 % 0.74 Number of clone pairs 6,030 Number of clone classes 210 Average Lines per Clone 12 Table 12: Cloning Stats for Individual Guinea Pig Editor #Clones %Clones Density = LOC:Clones Firestarter 2,015 33 141 Guarddog 35 0.6 250 Ip_fil 1,736 29 50 LnxFire 2,975 49 3.2 Nessus 4,733 78 22 Seahorse 169 2.8 173 Xfwall 397 6.6 74 There are 6,030 clone pairs and 210 clone classes across the 7 firewalls. Each clone has an average size of 12 lines of code. The clones encompass over 4% of the total files but only 0.74% lines of total source code. LnxFire has the maximum clone density of one clone for every 3 lines of code, whereas Guarddog is the firewall with lowest clone density. Cloning by Accident? Page 14 of 18

4.2 Clone Classification Table 13 presents the significant portion of the classification of clones obtained from CICS. It shows that most of the clones are Function to Function with more clones being Cloned Function Body or Cloned Blocks. This implies more clones to be of smaller granularity. Table 13: Classification of Clones Taxonomy #Clone #Clone Pairs Classes Total Clones 6,030 210 Function to Function 6,027 207 Function Clones 176 22 Partial Function Clones 891 11 Cloned Function Body 2,122 38 Cloned Blocks 2,908 136 4.3 Detailed Analysis If we take closer look at the cloning stats we find that a total of 3,305 out of 4,733 clones in Nessus comes from only 4 files in the source code. In LnxFire 2,952 clones out of 2,975 comes from only 3 files. In Firestarter 2 files contributes to 1,872 out of 2,015 clones in the system. Most of the cloning activities in this domain are among these files. Table 14 lists the files involved in this clone class. We investigate the source code of clones and apprehend that the purpose of the files is generating reports in different output format or to generate code for creating IP filters. The clones in this class were not copied across from one firewall to another; they are similar because they perform similar tasks in different systems. Therefore, the cloning across different firewalls in this clone class is by accident. A second interesting clone class comprises of two files of the same name eggtrayicon.c in two different firewalls Firestarter and Seahorse. It is basically the same file of about 470 LOC. Investigating the source code reveals that the file contains code for Gnome Egg Library that is used by both the firewalls. This is clearly a case of code Cloning by Accident? Page 15 of 18

reuse between the two firewalls. Similarly Ip_fil and and Nessus share code from bpf_filter.c file. These two firewalls also share two functions inet_aton in inet_addr.c file in Ip_fil and system.c file in Nessus and swap_hdr function in ipft_pc.c file in Ip-fil and savefile.c in Nessus respectively. Table 15 lists the code reuses between systems. Table 14: Clone Class 1 Report or Code Generation Clones Firewall File #Clones Purpose Nessus LnxFire Firestarter latex_output.c 1,243 report_ng.c 984 Reporting in different html_output.c 806 format html_graph_output.c 272 writer.c 1,746 Create shell script interface.c 631 Generate GUI app pref.c 575 Dialog box netfilterscript.c 1,746 Create shell script wizard.c 126 Generate setup window Table 15: Clone Class 2 Code Reuse Firewall File/Function LOC #Clones Purpose Firestarter 468 eggtrayicon.c Seahorse 471 Ip-fil 515 bpf_filter.c Nessus 532 Ip-fil inet_addr.c/inet_aton - Nessus system.c/inet_aton - Ip-fil ipft_pc.c/swap_hdr - Nessus savefile.c/swap_hdr - 11 Gnome Egg Library 7 Enet packet filter 3 Check valid IP addres 2 Swap header Cloning by Accident? Page 16 of 18

4.4 Summary The presented cloning statistics and analysis indicates that most the clones in network firewall domain are related IO, e.g. generating report or generating code. These clones are not due to intentional copying of code, but rather resulted accidentally while implementing similar functionality. The problem domain has almost no contribution in cloning. We also observe some reuse of code between the systems. 5 Conclusion Based on our findings from the two case studies we now attempt to answer the questions we posed at section 1.5. What is the amount of code cloning across the software systems? The amount of cloning across software systems is less than 1%. This supports the initial findings of Kamiya on the OS cloning study. What are the different types clones according to Kapser s categorization [5]? Is there any dominant category? Most of the cloning is Function to Function with no significant bias to any of its sub categories. How much code is reused from the available existing software when developing new software in the same problem domain? There is very little source code level reuse in open source software development (when a software is not a clear descendent of another). Does a problem domain contribute to the amount of accidental cloning? Contribution of problem domain in cloning is almost non-existent. Is there a single factor that contributes to the cloning most? User interface and IO contributes to most of the cloning across software systems. What is the amount of accidental cloning across the software systems? Almost all of the cloning across software systems is by accident. Cloning by Accident? Page 17 of 18

References [1] B. S. Baker. On Finding Duplication and Near-Duplication in Large Software Systems. Proceedings of the Working Conference on Reverse Engineering, 1995. [2] S. Ducasse, M. Rieger and S. Demeyer. A Language Independent Approach for Detecting Duplicated Code. Proceedings of the International Conference on Software Maintenance, 1999. [3] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna and L. Bier. Clone Detection Using Abstract Syntax Trees. Proceedings of the International Conference on Software Maintenance, 1998. [4] T. Kamiya, S. Kusumoto and K. Inoue. CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE Transactions on Software Engineering, July 2002. [5] C. Kapser and M. W. Godfrey. Toward a Taxonomy of Clones in Source Code: A Case Study. Proceedings of the International Workshop on Evolution of Large Scale Industrial Software Applications, 2003. [6] C. Kapser and M. W. Godfrey. Aiding Comprehension of Cloning Through Categorization. Proceedings of the International Workshop on Principles of Software Evolution, 2004 [7] M. Bruntink, A. van Deursen and T. Tourwe. An Evaluation of Clone Detection Techniques for Identifying Crosscutting Concerns. Proceedings of the International Conference on Software Maintenance, 2004. [8] F. Van Rysselbergh and S. Demeye. Evaluating Clone Detection Techniques. Proceedings of the International Workshop on Evolution of Large Scale Industrial Software Applications, 2003 Cloning by Accident? Page 18 of 18