A Technique to Detect Multi-grained Code Clones

Detection Time The Number of Detectable Clones A Technique to Detect Multi-grained Code Clones Yusuke Yuki, Yoshiki Higo, and Shinji Kusumoto Graduate School of Information Science and Technology, Osaka University, Japan {y-yusuke, higo, k-kusumoto@ist.osaka-u.ac.jp Abstract It is said that the presence of code s makes software maintenance more difficult. For such a reason, it is important to understand how code s are distributed in source code. A variety of code detection tools has been developed before now. Recently, some researchers have detected code s from a large set of source code to find library candidates or overlooked bugs. In general, the smaller the granularity of the detection target is, the longer the detection time. On the other hand, the larger the granularity of the detection target is, the fewer detectable code s are. In this paper, we propose a technique that detects in order from coarse code s to fine-grained ones. In the coarse-tofine-grained-detections, code fragments detected as code s at a certain granularity are excluded from detection targets of more fine-grained detections. Our proposed technique can detect code s faster than fine-grained detection techniques. Besides, it can detect more code s than coarse detection techniques. Index Terms multi-grained detection technique, file-level, method-level, code-fragment-level Short File Method Block Few I. INTRODUCTION & RELATED WORK A code (hereafter, ) is a code fragment that is similar or identical to another code fragment in source code. Copy-and-paste operation in code implementation is the main reason of occurrences [1] [2]. For example, if a code fragment includes a bug, we need to check its s too. For such a reason, it is important to understand how s are distributed in source code. A variety of detection tools has been developed before now. Recently, some researchers have detected s from a large set of source code to find library candidates or overlooked bugs [3]. Merging the same function in multiple software into a library is beneficial from the viewpoint of development efficiency improvement. The existing detection techniques detect only single granularity s, such as file-level ones or code-fragmentlevel ones. The existing detection techniques can be roughly classified into the following two detection techniques [4]. Coarse detection technique: detecting files, methods, and blocks as s. Fine-grained detection technique: detecting any code fragments as s. Some s are large but others are not so large. Any size duplicated code fragments are found as s if they are larger than a given threshold. Fine-grained detection techniques can detect s whose code fragments are parts of classes, methods, or blocks. Long Code-Fragment Fig. 1. The Features of Each Detection Techniques Many Coarse and fine-grained detection techniques have the following advantages and disadvantages, respectively. The features of those detection techniques are shown in Figure 1. Detection Time: the smaller the granularity of the detection target is, the longer the detection time. The number of files, methods, and blocks is much fewer than the number of statements, tokens, vertices in tree structures. Therefore, coarse detection techniques can detect s at a higher speed than fine-grained detection ones. The number of detectable s: the larger the granularity of the detection target is, the fewer detectable s are. Coarse detection techniques cannot detect s whose code fragments are parts of files, methods, or blocks. Therefore,

class A { methodb (int x) { File-level class D{ methode (int x) { methodc (int y) { while (y > 0) { methodf (int y) { while (y > 0) { Fig. 3. Secondary Merit of Our Proposed Technique STEP1: File-Level Clone Detection Detect STEP2: Method-Level Clone Detection Detect method1 method2 method3 method4 method5 STEP3: Code-fragment-Level Clone Detection method1 method2 method4 method5 Detect Fig. 2. Overview of Our Proposed Technique fine-grained detection techniques can detect more s than coarse detection ones. In this paper, we propose a technique that detects in order from coarse s to fine-grained ones (multi-grained detection technique). More concretely, our proposed technique detects s in the order of file-level, method-level, and code-fragment-level. In the coarse-to-fine-grained-detections, code fragments detected as s at a certain granularity are excluded from detection targets of more fine-grained detections. Our proposed technique can detect s faster than fine-grained detection techniques. Besides, it can detect more s than coarse detection techniques. In a previous study, equivalence files partitioning is conducted to the input source files prior to the code-fragmentlevel (token-level) detection [5]. This preprocessing is similar to the file-level detection. However, the number of filelevel s is fewer than the number of method-level s. Excluding method-level s in addition to file-level s is more effective to save the detection time. Thus, our proposed technique detects method-level s in addition to file-level ones and code-fragment-level ones. Moreover, our proposed technique can generate detection results which are easier to understand than the technique in the previous study. This is because our proposed technique detects file-level s, not as preprocessing, and clearly indicates the granularity levels of detected s. We have developed a tool based on our proposed technique and applied it to some open source software. Then, we evaluated our proposed technique compared to coarse detection techniques and fine-grained detection ones. The main contributions of this research are as follows. We proposed a technique that detects in order from coarse s to fine-grained ones. can detect s faster than fine-grained detection techniques. can detect more s than coarse detection techniques. can generate detection results that are easier to analyze than coarse detection techniques and fine-grained detection ones. II. PROPOSED TECHNIQUE We propose multi-grained detection technique to solve the disadvantages described in Section I. A. Overview of Our Proposed Technique Our proposed technique detects s in the order of filelevel, method-level, and code-fragment-level. An image of our proposed technique is shown in Figure 2. In the file-level detection, files detected as s are excluded from detection targets of the next method-level detection. In the method-level detection, methods detected as s are excluded from detection targets of the next code-fragment-level detection. There is a research report that about 49% of methods in about 13,000 software were method-level s [3]. Since there are a large number of method-level s across software. The authors expect that the code-fragment-level

E.html $A $C F.java STEP1-3. Filtering out files Input $B $D STEP2-3. Filtering out methods method1 $1 method2 $2 method3 $3 method6 $6 method1 $1 method2 $2 method6 $6 STEP1-1. Identifying detection target files STEP1-4. Generating hash values 20 F.java 30 STEP1-2. Normalizing files STEP1-5. Making hash groups (a) File-Level Clone Detection STEP2-1. Extracting methods method1 method2 method3 method4 method5 method6 STEP2-4. Generating method1 $1 hash method2 $2 20 method3 $3 values 30 method6 $6 40 STEP2-2. Normalizing methods STEP2-5. Making hash groups (b) Method-Level Clone Detection STEP3-1. Identifying statements STEP3-3. Identifying similar hash sequences method1 $1 20 30 40 50 60 20 30 50 60 void $ () { $. $. $ ( $ ) ; STEP3-4. Outputting information $A $C 20 method1 $1 method2 $2 method3 $3 method5 $5 method6 $6 $B $D F.java $F 30 method1 $1 method2 $2 20 method3 $3 30 method6 $6 40 STEP3-2. Generating hash values (c) Code-Fragment-Level Clone Detection Output Fig. 4. The Details of Each Clone Detection detection time can be greatly saved by excluding detected methods from detection targets of the next code-fragment-level detection. B. Secondary Merit of Our Proposed Technique Detecting multiple pairs as a single pair can reduce the number of detected s. An example is shown in Figure 3. For example, in the method-level detection, two pairs of method B and method E, method C and method F are detected. However, it is possible to detect as a single pair of class A and class D in the file-level detection prior to the method-level detection. In other words, by detecting such contiguous s at a stage with coarser detections, our proposed technique can generate 20 detection results which are easier to analyze than fine-grained detection techniques. III. IMPLEMENTATION We have developed a tool, Decrescendo based on our proposed technique. The inputs of Decrescendo are as follows: single or multiple software (currently, Decrescendo targets only source code written in Java), minimum length (minimum number of tokens considered to be s), and maximum gap rate (ratio of gapped tokens in the detected tokens). The outputs are detection results, such as location information and types. The detection results are output to a database. A simplified procedure of Decrescendo is as follows. STEP1: File-level detection STEP1-1: identifying detection target files STEP1-2: normalizing files STEP1-3: filtering out files STEP1-4: generating hash values STEP1-5: making hash groups STEP2: detection STEP2-1: extracting methods STEP2-2: normalizing methods STEP2-3: filtering out methods STEP2-4: generating hash values STEP2-5: making hash groups STEP3: Code-fragment-level detection STEP3-1: identifying statements STEP3-2: generating hash values STEP3-3: identifying similar hash sequences STEP3-4: outputting information Decrescendo can switch detection on/off for each granularity by setting. Even if the method-level detection is off, the inputs of the code-fragment-level detection are methods. Besides, even if the file-level and, or the methodlevel detection are off, filterings (STEP1-3, STEP2-3) are not conducted. The details of each STEP are described below. A. File-level Clone Detection The procedure for detecting file-level s is as follows. An overview is shown in Figure 4(a). STEP1-1: identifying detection target files Decrescendo identifies detection target files from single or multiple software given as the inputs. Decrescendo currently analyzes only source code written in Java. STEP1-2: normalizing files The following normalization processing is conducted to files identified in STEP1-1. Removing white space, tabs, blank lines, annotations, and comments.

Replacing variables and literals with special tokens. This process makes it possible to detect two files as s even if their coding styles differ between them, or their variables and literals are different. STEP1-3: filtering out files If the number of tokens included in a normalized file is less than a given minimum length, the file is excluded from the detection targets. The reasons for the filtering are as follows: saving time required for the matching process in STEP1-5, excluding files not to be detected as s. STEP1-4: generating hash values Hash values are generated for each detection target file. Files with the same hash value are a file-level. Decrescendo uses MD5 as a hash function. STEP1-5: making hash groups Decrescendo groups files whose hash values are the same. If a group consists of two or more files, it is detected as a filelevel. For example, in Figure 4(a), and are a file-level. In the file-level detection, Type-1 and Type-2 s are detected. B. Clone Detection The procedure for detecting method-level s is as follows. An overview is shown in Figure 4(b). STEP2-1: extracting methods The inputs are files which did not have the same hash values in STEP1-5. In addition, Decrescendo randomly selects a file from each group including two or more files whose hash values are the same. Those files are added to the inputs as files representing the groups. For example, in Figure 4(a), and are a file-level. In Figure 4(b), is added to the input as a representing file. This is a process for preventing oversight in the method-level detection. If the inputs were normalized files, types of method-level s could not be classified. Thus, the inputs are files before the normalization. Then, Decrescendo builds abstract syntax trees for every file and extract their subtrees corresponding to methods. STEP2-2: normalizing methods Decrescendo normalizes methods identified in STEP2-1 as well as STEP1-2. STEP2-3: filtering out methods If the number of tokens included in a normalized method is less than a given minimum length, it is excluded from the detection targets. STEP2-4: generating hash values Hash values are generated for each detection target method. Methods with the same hash value are a method-level. Decrescendo uses MD5 as a hash function. STEP2-5: making hash groups Decrescendo groups methods whose hash value are the same. If a group consists of two or more methods, it is detected as a method-level. For example, in Figure 4(b), method1 of and method3 of are a method-level. In the method-level detection, Type-1 and Type-2 s are detected. C. Code-fragment-level Clone Detection The procedure for detecting code-fragment-level s is as follows. An overview is shown in Figure 4(c). STEP3-1: identifying statements The inputs are methods which did not have the same hash values in STEP2-5. In addition, Decrescendo randomly selects a method from each group including two or more methods whose hash values are the same. Those methods are added to the inputs as methods representing the groups. For example, in Figure 4(b), method1 of and method3 of are method-level s. In Figure 4(c), method1 is added to the input as a representing method. Then, Decrescendo identifies statements for these input methods. We define a statement as every subsequence between semicolon, opening brace and closing brace. Decrescendo records the number of tokens included in every statement. STEP3-2: generating hash values Hash value is generated for every statement identified in STEP3-1. Decrescendo uses MD5 as a hash function. STEP3-3: identifying similar hash sequences We use Smith-Waterman algorithm [6] to identify similar hash sequences. Smith-waterman algorithm is used in the field of biology. This algorithm detects pairs of similar partial alignments from two alignments. It has an advantage that it can identify similar alignments even if they include some gaps. The reason why we uses Smith-Waterman algorithm is that the detection tool to which Smith-Waterman algorithm is applied is faster to detect than the other Type-3 detection tools [7]. Decrescendo identifies similar hash sequences for each combination of detection target methods. STEP3-4: outputting information If the following two conditions are satisfied, information is output. In the code-fragment-level detection, Type-1, Type-2 and Type-3 s are detected. match θ, gap/match ϕ, where match, gap, θ, and ϕ represent the number of tokens included in the similar statements, the number of tokens included in the mismatch statements, a given minimum length, and a given maximum gap ratio. IV. EXPERIMENT We applied Decrescendo to some open software. Then, we evaluated our proposed technique compared to coarse detection technique and fine-grained detection technique. In this experiment, minimum length is 50 tokens, and maximum gap rate is 0.3. These are based on previous studies [8] [9]. A. Experimental Questions EQ1: can the multi-grained detection technique detect s faster than fine-grained detection technique?

TABLE II EXPERIMENTAL RESULTS Detection Granularity # of Detected Clone Pairs Detection Time [s] File Method Code-fragment File Method Code-fragment Sum 1,067 - - 1,067 4.0-31,807-31,807 15.3 - - 202,537 202,537 4,695.4 1,067 31,291-32,358 15.9 1,067-202,021 203,088 4,137.7-31,807 170,730 202,537 690.2 1,067 31,291 170,730 203,088 671.6 EQ2: can the multi-grained detection technique detect more s than coarse detection technique? EQ3: are multi-grained detection results easier to analyze than coarse detection results and fine-grained detection ones? To research these questions, we executed Decrescendo in all cases where detection on each granularity was switched on/off. B. Experimental Environment The CPU of the computer used in this experiment is 2.40 GHz Intel Xeon CPU (8 logic processors), and the memory size is 32.0 GB. Experimental targets and output database were located on SSD. C. Experimental Target Table I shows an overview of target software. We randomly selected latest software from Apache s repository 1 as of September 13, 2016. To avoid detection from the same software with different versions, only files under trunk are detection target. D. Experimental Results Table II shows experiment results. Columns of Detection Granularity indicate whether the detection is on or not. We answer to each EQ with Table II. 1) EQ1: comparing the case of detecting all granularities s and the case of detecting only code-fragment-level s, the detection time is greatly saved (4,695.4s 671.6s). The time required for each STEP is shown in Table III. The execution time from STEP2-1 to STEP2-4 is slightly saved TABLE I OVERVIEW OF TARGET SOFTWARE software # of Java files LOC Any23 369 46,957 ctakes 1253 209,545 Forrest 252 32,491 JSPWiki 491 7,530 juddi 938 163,3 Onami 572 43,784 OODT 1,628 223,846 OpenOffice 3871 774,670 Roller 612 96,617 Wink 1,372 2,167 Sum 11,358 1,908,7 1 http://svn.apache.org/repos/asf/ (13.4s.1s). This is because detected file-level s are excluded from the inputs of the method-level detection. Furthermore, the most remarkable point is that the execution time of STEP3-3 was greatly saved (4,672.0s 637.0s). In detecting all granularities s, the execution time is about 1/7 against the code-fragment detection. In both cases, almost all of the total execution time is the execution time of STEP3-3. Thus, excluding detected files and methods, and also, files and methods which are less than a given minimum length is effective for saving the detection time. Moreover, comparing the case of detecting file-level, codefragment-level s and the case of detecting all granularities s, excluding detected methods, and also, methods which are less than a given minimum length is especially effective for saving the detection time. In conclusion, our answer to EQ1 is YES. The multigrained detection technique can detect s greatly faster than the fine-grained detection techniques. A large number of file-level and method-level s were detected in cases of detecting s from a large set of source code in previous studies [3] [] [11]. Especially in such cases, the authors consider that the multi-grained detection technique can save the detection time effectively. 2) EQ2: comparing the case of detecting file-level, method-level s and the case of detecting all granularities s, the number of detected pairs increased (32,358 203,088). Comparing the case of detecting codefragment-level s and the case of detecting all granularities s, the number of detected pairs did not differ much (202,537 203,088). In conclusion, our answer to EQ2 is YES. The multigrained detection technique can detect more s than the coarse detection techniques. Moreover, multi-grained detection TABLE III THE TIME REQUIRED FOR EACH PROCESS STEP All [s] Code-fragment [s] STEP1-1 STEP1-4 3.7 4.0 STEP1-5 0.0 - Output(File) 2.0 - STEP2-1 STEP2-4.1 13.4 STEP2-5 0.0 - Output(Method) 1.4 - STEP3-1 STEP3-2 1.6 1.5 STEP3-3 637.0 4,672.0 Output(Code-fragment) 13.8 3.8

156 public class NewViewDoc extends Wizard implements INewWizard { 182 public boolean performfinish() { 183 final String containername = page.getcontainername(); 184 final String filename = page.getfilename(); 6 114 private void dofinish( 115 String containername, 116 String filename, 145 150 protected void createfile( 151 IResource resource, 152 String filename, 171 187 File-level 156 public class NewXDoc extends Wizard implements INewWizard { 182 public boolean performfinish() { 183 final String containername = page.getcontainername(); 184 final String filename = page.getfilename(); 6 114 private void dofinish( 115 String containername, 116 String filename, 145 150 protected void createfile( 151 IResource resource, 152 String filename, 171 187 Fig. 5. An Example where Multiple Clones are Detected as a File-level Clone 28 public class Menu { 30 private List<MenuTab> tabs = new ArrayList<MenuTab>(); 33 public void addtab(menutab tab) { 34 this.tabs.add(tab); 35 38 public List<MenuTab> gettabs() { 39 return tabs; 40 42 public void settabs(list<menutab> menus) { 43 this.tabs = menus; 44 46 File-level method-level method-level 28 public class ParsedMenu { 30 private List<ParsedTab> tabs = new ArrayList<ParsedTab>(); 32 public void addtab(parsedtab tab) { 33 this.tabs.add(tab); 34 36 public List<ParsedTab> gettabs() { 37 return tabs; 38 40 public void settabs(list<parsedtab> tabs) { 41 this.tabs = tabs; 42 44 Fig. 6. An Example where a File-level Clone can be Refactored 186 public class NeuralEventTimeSelfRelationAnnotator extends TemporalRelationExtractorAnnotator { 168 @Override 169 public List<IdentifiedAnnotationPair> getcandidaterelationargumentpairs( 170 JCas jcas, 171 Annotation sentence) { 206 return pairs; 207 283 file-level 186 public class EventTimeSelfRelationAnnotator extends TemporalRelationExtractorAnnotator { 171 @Override 172 public List<IdentifiedAnnotationPair> getcandidaterelationargumentpairs( 173 JCas jcas, 174 Annotation sentence) { 209 return pairs; 2 286 Fig. 7. An Example where a Clone can be Refactored technique detects as many s as fine-grained detection techniques. This indicates that the detection time has saved without losing detected s. 3) EQ3: we confirmed three cases generating detection results which were easy to analyze. The first case is that multiple method-level pairs were detected as a file-level pair. An example is shown in Figure 5. When the file-level detection is off, performfinish, dofinish, and createfile methods were detected as different method-level pairs. However, when the file-level detection is on, they were detected as a single pair of NewXDoc and NewViewDoc classes. In the file-level detection, 516 method-level pairs were detected as 404 file-level pairs. By detecting such contiguous s at a stage with coarser detections, there is a possibility that we notice refactoring opportunities to merge multiple classes into a class. In this case, the multi-grained detection technique can generate detection results which are easier to analyze than fine-grained detection techniques. The second case is that methods in a file were not detected in the method-level detection although the file was detected in the file-level detection. This is because

all methods in the file-level were less than a given minimum length. It is not necessary to detect methods as method-level s, but there are cases where the entire class needs to be detected as a file-level. An example is shown in Figure 6. When the file-level detection is off, addtab, gettabs, and settabs methods were not detected because those methods were less than a given minimum length. However, when the file-level detection is on, Menu and ParsedMenu classes were detected as a file-level because those classes were more than a given minimum length. By detecting such file-level s, there is a possibility of refactoring opportunities. In this case, the multigrained detection technique can also generate detection results which are easy to analyze. The third case is that methods in files which were not detected as file-level s were detected as method-level s. An example is shown in Figure 7. In the methodlevel detection, getcandidaterelationargumentpairs methods in NeuralEventTimeSelfRelationAnnotator and EventTimeSelfRelationAnnotator classes were detected as method-level s. In addition, NeuralEventTimeSelfRelationAnnotator and EventTimeSelfRelationAnnotator classes were extended the same TemporalRelationExtractorAnnotator class, getcandidaterelationargumentpairs methods were overridden. Those methods can be pulled up to the parent class. The multigrained detection technique generates detection results which include granularity levels of detected s, and from above examples, each granularity is refactored in different ways. Thus, by using the information of granularity levels, its detection results become easy to analyze because refactoring techniques to be applied become clearer. In conclusion, our answer to EQ3 is YES. V. THREAD TO VALIDITY Target systems: we targeted only software. If we had selected other software, the results might have been different from this experiment. However, even if we select other software, we consider that we can obtain the same experimental results. Hash collision: we use MD5 as a hash function. If hash collisions occur, non-duplicated files, methods, and code fragments are accidentally regarded as s. However, in this research, MD5 which outputs a 128-bit hash value is used, and the probability of a hash collision is considered to be sufficiently low. Normalization: in this experiment, the normalization described in Section III was conducted. If we had conducted a different normalization, the detection results would have been different from this experiment. VI. CONCLUSION We proposed a technique that detects in order from coarse s to fine-grained ones (multi-grained detection technique). We have developed a tool based on our proposed technique and applied it to some open source software. Then, we evaluated our proposed technique compared to coarse detection techniques and fine-grained detection techniques. The main contributions of this research are as follows. We proposed a technique that detects in order from coarse s to fine-grained ones. can detect s faster than fine-grained detection techniques. can detect more s than coarse detection techniques. can generate detection results that are easier to analyze than coarse detection techniques and fine-grained detection ones. Our future works are as follows. Supporting for other languages. Comparing with other detection tools. Evaluating the multi-grained detection technique with traditional quality metrics such as precision, recall, and f-measure. Detecting s from larger set of source code to find library candidates or overlooked bugs. ACKNOWLEDGMENT This work was supported by JSPS KAKENHI Grant Number JP25220003. REFERENCES [1] C. K. Roy, J. R. Cordy, and R. Koschke, Comparison and evaluation of code detection techniques and tools: A qualitative approach, Science of Computer Programming, vol. 74, no. 7, pp. 470 495, 2009. [2] D. Rattan, R. Bhatia, and M. Singh, Software detection: A systematic review, Information and Software Technology, vol. 55, no. 7, pp. 1165 1199, 2013. [3] T. Ishihara, K. Hotta, Y. Higo, H. Igaki, and S. Kusumoto, Inter-project functional detection toward building libraries-an empirical study on 13,000 projects, in Proc. of the 19th Working Conference on Reverse Engineering, 2012, pp. 387 391. [4] K. Hotta, J. Yang, Y. Higo, and S. Kusumoto, How accurate is coarsegrained detection?: Comparison with fine-grained detectors, in Proc. of the 8th International Workshop on Software Clones, 2014, pp. 1 18. [5] E. Choi, N. Yoshida, Y. Higo, and K. Inoue, Proposing and evaluating detection approaches with preprocessing input source files, IEICE Transactions on Information and Systems, vol. 98, no. 2, pp. 325 333, 2015. [6] T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, Journal of molecular biology, vol. 147, no. 1, pp. 195 197, 1981. [7] H. Murakami, K. Hotta, Y. Higo, H. Igaki, and S. Kusumoto, Gapped code detection with lightweight source code analysis, in Proc. of the 21st International Conference on Program Comprehension, 2013, pp. 93 2. [8] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, Comparison and evaluation of detection tools, Software Engineering, IEEE Transactions on, vol. 33, no. 9, pp. 577 591, 2007. [9] J. Svajlenko and C. K. Roy, Evaluating detection tools with bigbench, in Proc. of the 31st International Conference on Software Maintenance and Evolution, 2015, pp. 131 140. [] Y. Sasaki, T. Yamamoto, Y. Hayase, and K. Inoue, Finding file s in freebsd ports collection, in Proc. of the 7th Working Conference on Mining Software Repositories, 20, pp. 2 5. [11] J. Ossher, H. Sajnani, and C. Lopes, File cloning in open source java projects: The good, the bad, and the ugly, in Proc. of the 27th International Conference on Software Maintenance, 2011, pp. 283 292.