Code Duplication++ Status Report Dolores Zage, Wayne Zage, Nathan White Ball State University November 2018


Long Term Goals

The goal of this project is to enhance the identification of code duplication across projects, which can yield large cost reductions for a minimal investment.

Background for Long Term Goals

Code duplication (or cloning) increases software size and, sooner or later, every additional line of code enters the maintenance process, thereby increasing time and cost for an organization. Studies report code duplication percentages ranging from 15% to 25%, which makes the software industry a principal candidate for improvement. Often, a modification is performed on only one instance at a time because developers are not aware that the other instances exist. Routinely, the full resolution of a bug fix occurs only after several costly implementation-test iterations. Code duplication is a common occurrence in commercial software systems; it can adversely impact the development and stability of software engineering projects, and multiple studies have confirmed this harmful effect [JARZ10].

Clones can be introduced unintentionally or intentionally through independent development efforts. Clones may be intentionally created to improve program reliability or development speed. The same functionality may be intentionally duplicated to minimize dependencies and retain individual control. On the other hand, applying design patterns (e.g., pattern-driven development on component platforms such as JEE and .NET) increases the possibility of unintentional cloning. A survey of clone research by Koschke lists many other root causes for software clones [KOSC07].

A clone is a segment of code that is similar to another segment in text, lexical or syntactic structure, pattern or semantics. Identical code fragments varying only in layout and comments are labeled Type-1 clones.
Code can be used as is or slightly altered to fit a purpose. Lexically identical fragments that differ only in identifiers, literals and types are labeled Type-2 clones. Type-3 clones occur when statements have been added, removed or modified within segments. Type-4 clones, the semantic clones, have similar semantics but different implementations and are the most difficult type of clone to detect [RATT13]. As the number in the clone type designation increases, clone identification becomes more difficult. Type-1 and Type-2 clones can be detected by many of the current tools. A handful of tools claim to detect Type-3 clones. A study by Tiarks reports that Type-3 clone detectors found only 25% of the clones accepted by a human oracle [TIAR10]. Tools do not report the distance between near-miss clones, and very few tools can detect all of the Type-3 clones. Type-4 clones remain mainly undetectable [SHEN16].

Two other basic types of clones have been added to the clone taxonomy: model-based and structural. Since clones are present in code written in higher-level languages, it is not surprising that clones also exist in other development products such as UML models. Structural clones are found by comparing the patterns of interrelated classes emerging from the design. A poor design with clones will result in rework, lower development productivity and a decrease in the software's future extendibility. These consequences indicate the necessity for continuous assessment of the design, including tracking clones throughout the software lifecycle. Identifying design clones and tracking them through implementation offers valuable insights into their behavior and the consequential development actions. Of all the currently identified basic clone types, Type-4, model-based and structural clones have been the subject of very little previous research.

Why is there a growing interest in clones or code duplication? Several studies have shown that redundancy increases the risk of update anomalies and increases maintenance effort. For example, if an error is detected in a code fragment, all similar fragments should be checked for the same error. Duplicated fragments can also significantly increase the work to be done when enhancing or adapting code. Kent Beck states that programmers read more code than they write [McCR15]. When cloning is intentional, it can be useful, such as in reuse or in the presence of an aspect. Clone detection can find shared and common features in software product lines. It can also assist in plagiarism detection, copyright infringement detection, software evolution analysis, virus detection and bug detection, making it an important and valuable part of software analysis [RATT13].

Clones are defined based on their granularity, which sets the boundary of comparison. Granularity can be fixed, as in a file, function or class, or free, such as a number of statements or tokens. Selecting an appropriate granularity is important. Segments that are too small are of little pragmatic significance, while those that are too large are unlikely to be reused. The granularity chosen should be useful for refactoring and maintenance.

Approaches to identifying clones are an active research topic. Most techniques compare the identified artifacts expressed as direct or abstracted structures or as measures. The match required in each clone comparison can range from exact to various degrees of precision. Each technique has both benefits and drawbacks in identifying exact copies, syntactically identical copies and copies with modifications.
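To make the distinction between Type-1 and Type-2 clones concrete, the following is a minimal token-based sketch, not the method used in this project. It assumes a deliberately small, hypothetical tokenizer for Java-like code: the regular expression, the `KEYWORDS` set and the function names are all illustrative. Fragments that match after comments and layout are dropped are reported as Type-1; fragments that match only after identifiers, type names and literals are replaced by placeholders are reported as Type-2.

```python
import re

# Illustrative subset of keywords; a real lexer would use the full language list.
KEYWORDS = {"if", "else", "for", "while", "return", "new", "class"}

TOKEN_RE = re.compile(r"""
    //[^\n]*            |   # line comment
    /\*.*?\*/           |   # block comment
    "(?:\\.|[^"\\])*"   |   # string literal
    \d+(?:\.\d+)?       |   # numeric literal
    [A-Za-z_]\w*        |   # identifier, type name or keyword
    \S                      # any other single non-space character
""", re.VERBOSE | re.DOTALL)


def tokens(code):
    """Token list with comments and whitespace removed (layout is ignored)."""
    return [t for t in TOKEN_RE.findall(code)
            if not (t.startswith("//") or t.startswith("/*"))]


def normalized(code):
    """Replace literals with LIT and identifiers/type names with ID."""
    out = []
    for tok in tokens(code):
        if tok[0] == '"' or tok[0].isdigit():
            out.append("LIT")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return out


def clone_type(a, b):
    """Return 'Type-1', 'Type-2', or None when neither comparison matches."""
    if tokens(a) == tokens(b):
        return "Type-1"
    if normalized(a) == normalized(b):
        return "Type-2"
    return None


original = "int sum = 0; // total\nfor (int i = 0; i < n; i++) { sum += i; }"
relaid = "int sum = 0;\nfor (int i = 0; i < n; i++) {\n    sum += i; /* add */\n}"
renamed = "long acc = 1; for (long k = 1; k < m; k++) { acc += k; }"
extended = "int sum = 0; for (int i = 0; i < n; i++) { sum += i; sum += 1; }"

print(clone_type(original, relaid))
print(clone_type(original, renamed))
print(clone_type(original, extended))
```

The last pair differs by an added statement, which is the Type-3 case; a purely token-equality comparison such as this one cannot recognize it, which illustrates why Type-3 and Type-4 detection needs a distance measure rather than an exact match.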
Metrics-based techniques are rated as the best choice for identifying all types of clones and can scale up to large systems [KOSC07]. Several researchers have noted that clone detection techniques can be improved through hybrid methods [SHEN16].

Our Previous Code Duplication Research:

We have created an ordered 3-tuple distance metric that can identify Type-3 and Type-4 clones not previously identified by other tools. This advancement will support developers in clone tracking, thus avoiding the problems caused by clones. Breaking down information access barriers to code, such as identifying and tracking clones, can save organizations millions of wasted dollars each year and help refocus development on strategically significant opportunities.

A study by Roy, Cordy, and Koschke [CHAN09] compared the detection and classification adequacy of 42 clone detection tools. For the study set, 16 different code clones were divided into 4 scenarios. Most of the tools performed well on three of the scenarios and the majority failed on one scenario. The failed scenario was based on the code contained in Figure 1.

Figure 1: Source Basis for Clones

Four clones were created from the source code in Figure 1 and are displayed in Table 1. Thirty of these 42 popular clone detection tools could not detect the Table 1 clones. The remaining 12 tools detected only a partial set of the Table 1 clones. The Table 1 clones were also used in other, more recent studies [SHEN16], producing similar results. Thus, Table 1's code was used initially to validate our method of calculating naming, sequence and usage distances to identify clones and their respective types. This approach proved to be successful.

Table 1: Clone types created from code in Figure 1


The type of clone can be distinguished by using the values of the three distance metrics. A pair of blocks is a Type-1 clone if all three metrics are zero. A pair of blocks is a Type-2 clone if the sequence and usage metrics are zero and the naming metric is greater than zero. If the usage metric is greater than zero and less than the threshold (the maximum distance defined by the user), then the pair is a Type-3 clone. If the sequence metric returns a number greater than zero, the pair of blocks is a possible Type-4 clone. For a Type-4 clone, user feedback is required to determine if the pair is actually a clone. Table 2 presents the possible outcomes of the three metrics and the resulting possible clone types.

Table 2: Clone types determined by Naming, Usage and Sequence Metrics

Currently, there is no tool capable of discovering all kinds of Type-4 clones. Using the distance metrics improves the process of finding Type-4 clones by highlighting all the possible candidates, and relies on the user's feedback to determine whether they are actual clones.

Current Spectrum of Clone Research:

The keynote paper from the 2014 IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), entitled "The Vision of Software Clone Management: Past, Present, and Future", states that software clone research in the past mostly focused on the detection and analysis of code clones, while research in recent years extends to the whole spectrum of clone management [ROY14]. The paper included a Kiviat graph, Figure 2, depicting the achievements and scopes along different dimensions of clone management activities. The graph contains ten dimensions of code clone management and scopes for further improvement, concentrating on clone identification, tracing, refactoring, and cost-benefit analysis. A red oval has been added to draw attention to a particular deficiency in clone research that this research addresses.
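The decision rules for the three distance metrics given earlier in this section can be sketched as a small function. This is only an illustration under stated assumptions: the function name and return strings are hypothetical, the metrics are assumed to be non-negative numbers, and because the contents of Table 2 are not reproduced here, the order in which overlapping conditions are checked (a non-zero sequence metric is tested before the usage threshold) is an assumption rather than the project's documented behavior.

```python
def classify_pair(naming, usage, sequence, threshold):
    """Map a (naming, usage, sequence) distance triple to a clone label.

    threshold is the user-defined maximum usage distance for a Type-3 clone.
    """
    if naming == 0 and usage == 0 and sequence == 0:
        return "Type-1"
    if usage == 0 and sequence == 0:
        # Only the naming metric is non-zero.
        return "Type-2"
    if sequence > 0:
        # Assumption: a non-zero sequence metric takes precedence.
        return "possible Type-4 (user feedback required)"
    if 0 < usage < threshold:
        return "Type-3"
    # Usage distance at or beyond the threshold: not reported as a clone.
    return "not a clone"


for triple in [(0, 0, 0), (3, 0, 0), (1, 2, 0), (0, 9, 0), (0, 1, 4)]:
    print(triple, classify_pair(*triple, threshold=5))
```

Keeping the Type-4 outcome labeled as "possible" mirrors the point in the text that the metrics only highlight candidates and that a person must confirm them.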
Also, an 11th axis has been added to highlight the fact that, in addition to detecting clones, it can be useful to subdivide clones of a specific type into meaningful classes, such as Type-4 semantic clones and Type-4 syntactic clones, or vulnerable clones and non-vulnerable clones. As denoted by the region in the red oval in Figure 2, the detection of Type-3 and Type-4 clones remains an open problem. Part of the problem is vagueness in the definition of clones and the different levels of analysis granularity, which make it difficult to identify and compare techniques and tools. Our new metrics, labeled usage and sequence, and our new technique, module metric distance, can address these issues.

Type-3 clones, those with similar fragments that differ through modified, deleted or added statements, may be the most useful to identify. Our metric approach can also identify Type-3 clones. However, it is not enough to label a function or block of code as a Type-3 clone. It is also important to inform the user about the transformations between the clone pairs. Currently, we are working on a technique to map the identified Type-3 clone's XML tokens to a matrix representation and compare clone matrices to identify differences in patterns, thereby highlighting the transformations.

Figure 2: Achievements and scopes along different dimensions of clone management activities

However, identifying significant or flawed patterns within the code is more important and is not addressed within the ten areas of clone research seen in the Kiviat graph. We have developed a metrics-based approach for analyzing software designs (during both design and implementation) that can help designers engineer quality into the product. From our analysis of large industrial projects, we have discovered that within a software system are hidden relationships and structures that can be illuminated by evaluating and measuring software development artifacts. Among these patterns are clones. These relationships can be used to answer questions about the stability of the software product and, perhaps more importantly, guide software development techniques. Software development is itself a pattern-selecting and pattern-making process, and the development pattern is inherent in the structure and execution of the software. The coherent patterns or arrangements of modules that are gradually realized become effective and ontologically significant by virtue of their development. Our aim is to understand and categorize software modules to support developers' efforts to recognize and apply patterns.

Previous Related Project Findings:

In the Secure Coding project, we concentrated on the Open Web Application Security Project (OWASP) Benchmark test suite, designed to verify the accuracy of software vulnerability detection tools. Benchmark version 1.1 contains 21,041 test cases across 11 different attack categories, including false positives. A beta version 1.2 consisting of 2,740 test cases was also published to assist in identifying vulnerabilities. The Benchmark was carefully designed to test variants of each vulnerability in order to measure tool strengths and weaknesses. Each of these vulnerability types corresponds to a Common Weakness Enumeration (CWE) identification number. It is important to note that the OWASP Benchmark tests are written in Java and are derived from coding patterns in actual applications, with the intent of helping users determine whether a detection tool is capable of finding vulnerabilities accurately.
Even though this is the Benchmark's intended purpose, our Secure Coding project can use the established test cases as a basis for uncovering patterns in vulnerabilities. We analyzed the 23,781 OWASP Java test cases, each consisting of at least five separate modules, and transformed them into a specific XML format using a research tool called the Design Metrics Analyzer (DMA). From this XML format, we collect 82 unique XML tag count metrics and calculate 114 Java primitive and composite metrics for each method and class. One class or method is a metric clone of another if both possess exactly the same values for all of the metrics considered.

As part of the analysis, each unique Java tag type from the XML project files is totaled. Since these tags identify the various structures of the program, they are a more precise representation of its content. The distribution of the individual tags can be used to calibrate different types of programs. The XML tags provide a static internal view of each module. The calculated Java metrics include such information as fan-in, fan-out, data-in and data-out, providing a static, but external, view of each module. Together, the XML tags and the metrics can provide a powerful technique to identify a wider spectrum of clones.

The resulting record of 198 metric entities plus identification fields was matched with the result of the test case file. After this intensive data preparation phase, a preliminary similarity analysis was conducted using Euclidean and cosine distance techniques. The result indicated that the vectors of module metrics were all very similar (correlation of 0.95 and higher), suggesting their ability to identify patterns. The next step in the analysis of the OWASP modules was to determine which techniques could yield relevant results in finding patterns in a large set of numbers. We tried traditional statistical models and applied numerous machine learning techniques.
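The preliminary similarity analysis mentioned above compares module metric vectors with Euclidean and cosine measures. A minimal standard-library sketch of both follows; the function names and the short three-metric vectors are hypothetical stand-ins for the 198-value records, and nothing here reflects how the DMA tool itself computes them.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length metric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two metric vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


module_a = [12, 3, 7]   # hypothetical metric values
module_b = [24, 6, 14]  # same proportions, twice the size
module_c = [12, 3, 8]   # differs from module_a in one metric

print(euclidean_distance(module_a, module_c))
print(cosine_similarity(module_a, module_b))
```

Cosine similarity ignores overall magnitude, so `module_a` and `module_b` look identical to it even though every metric differs, while Euclidean distance grows with the size of each difference. Neither measure reports how many metrics differ, which is relevant to the module metric distance idea discussed later.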
Our study employed the module data of the OWASP 1.2 test set and analyzed each model's or technique's predictive quality in identifying either the type of vulnerability (the CWE) or the vulnerability status (true or false). We repeated the analysis on the combined test sets 1.1 and 1.2. We focused first on identifying the type of vulnerability (the CWE). All of the statistical tests clearly demonstrate the likelihood that, by using the 198 metrics to describe the modules, patterns can be established to determine the vulnerability type of vulnerable Java code.

Further research included a metric clone analysis, which indicated a liberal use of repeated code in the test cases. A clone analysis on the 198 metrics of the doPost modules in the combined OWASP test set produced the following result:

3004 clone groups + singletons (non-clones) = 9481 unique modules

Only 7 of the clone groups possessed more than one vulnerability type, and in all seven clone groups the types were CWE 327 (weak cryptography) and CWE 328 (hashing). The remaining 2997 clone groups each possessed one unique CWE. The statistical tests and the clone analysis demonstrated the likelihood that patterns can be established to determine the vulnerability type of vulnerable Java code. With such strong data discriminating the vulnerability type, the outcome will be an identifying pattern per vulnerability. The next step is to determine if other source code also contains these patterns or minor variations.

In the OWASP study, we repeated the process, focusing on identifying whether a module (test case) was vulnerable or not vulnerable. The statistical tests and techniques were rerun with this new objective. A big question is: why did the metrics distinctly indicate a module's vulnerability type and not a module's vulnerability status? The analysis clearly shows that clone groups cluster around vulnerability type, and it can perhaps provide a clue about the vulnerability status. Reviewing the clone groups, over 80% of the clone groups possessed both true and false vulnerability status within the same group. This implies that even with 198 identical metric values, the status of the vulnerability could still be different.
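The clone-group analysis described above, grouping modules whose metric values are all identical and then checking each group for mixed vulnerability types or statuses, can be sketched as follows. The record layout `(metrics, cwe, vulnerable)`, the function names and the toy data are assumptions made for illustration; real records would carry 198 metric values.

```python
from collections import defaultdict


def clone_groups(modules):
    """Group (metrics, cwe, vulnerable) records by identical metric values."""
    groups = defaultdict(list)
    for metrics, cwe, vulnerable in modules:
        groups[tuple(metrics)].append((cwe, vulnerable))
    return groups


def summarize(groups):
    """Count clone groups, singletons, and groups with mixed CWE or status."""
    clones = [members for members in groups.values() if len(members) > 1]
    return {
        "clone_groups": len(clones),
        "singletons": len(groups) - len(clones),
        "mixed_cwe": sum(1 for g in clones if len({cwe for cwe, _ in g}) > 1),
        "mixed_status": sum(1 for g in clones if len({s for _, s in g}) > 1),
    }


# Hypothetical two-metric records.
modules = [
    ((1, 2), "CWE-327", True),
    ((1, 2), "CWE-328", False),
    ((3, 4), "CWE-89", True),
    ((3, 4), "CWE-89", True),
    ((5, 6), "CWE-79", False),
]

print(summarize(clone_groups(modules)))
```

In this toy data the first group mixes both CWE and status, the second is uniform, and the last record is a singleton, which is the same kind of breakdown reported for the doPost modules.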
When we investigated the code structure of some of the clone groups containing both true and false vulnerability status, the code was almost identical except for a single assignment to the same variable that changed the execution path (i.e., one module executing the true branch and the other executing the false branch).

Both the Code Duplication and the Secure Coding projects use metrics as the basis for identifying duplication and insecure patterns. Some metrics, such as lines of code, have values that depend on format manipulation. Other metrics, such as the number of comments, do not reflect the semantics but are useful to collect. As we search for duplication and matching patterns in the code, it could prove useful to determine how each technique provides additional or complementary information to enhance the identification process, by either confirming the result or narrowing the cases that require user feedback, such as in the case of a Type-4 clone. The Code Duplication project added two more metrics, usage and sequence, to enhance the analysis for patterns. These two clone distance metrics provide a granular measure of the differences in almost-alike code or in code labeled as metric clones. These differences may be of value when assessing the vulnerability status of a pattern.

The 198 metrics of the OWASP 1.2 test cases were visualized in TensorBoard using a technique known as t-SNE to reduce our multi-dimensional metric data to a three-dimensional representation. A highly clustered picture was presented and, therefore, we hypothesize that our metric patterns are also clustered. To extract these clusters, we will use a new technique we call module metric distance (MMD). What is a metric distance? Given a set of metrics, m1, m2, m3, ..., mn, for a set of modules (records), determine the similarity (distance) among the modules in the set by comparing the individual values of the metrics.
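One possible reading of this definition is to count, for a pair of module records, the number of metrics whose values differ, and then to report for each module the smallest such count to any other module. The sketch below follows that reading only; the function names and data are hypothetical, and the actual MMD implementation may differ.

```python
def metric_differences(a, b):
    """Number of metric positions at which two module records differ."""
    if len(a) != len(b):
        raise ValueError("records must contain the same number of metrics")
    return sum(1 for x, y in zip(a, b) if x != y)


def smallest_differences(records):
    """For each module, find the other module with the fewest differing metrics.

    records maps a module name to its tuple of metric values.
    Returns {name: (closest_other_name, difference_count)}.
    """
    result = {}
    for name, metrics in records.items():
        best = None
        for other, other_metrics in records.items():
            if other == name:
                continue
            count = metric_differences(metrics, other_metrics)
            if best is None or count < best[1]:
                best = (other, count)
        result[name] = best
    return result


# Hypothetical records with four metrics each.
records = {
    "A": (1, 2, 3, 4),
    "B": (1, 2, 3, 5),
    "C": (9, 9, 3, 4),
}

print(smallest_differences(records))
```

Under this reading a large change in a single metric still counts as one difference, whereas a Euclidean distance would weight it by magnitude. A count of zero corresponds to the metric clones discussed earlier.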
The output will be the smallest number of metric differences between the metric records. By applying these two additional discriminators, we expect to improve the classification of significant patterns in software.

To accomplish our intermediate term objectives, we have been revising our tools to incorporate the calculation and analysis of module, tag and distance metrics individually, as well as the analysis of the combination of module and tag metrics and the combination of module, tag and distance metrics. Each analysis will be compared to determine its contribution to highlighting unique or interesting patterns. Moreover, the software has been updated to include our MMD metric. There are many techniques to measure the distance or similarity between structured data. Many rely on actual distance measures between data points. Intuitively, we believe that pairwise distance comparisons may not provide a good measure of similarity between patterns and that the number of metrics that differ agrees better with pattern similarity. An example of the MMD visualization is depicted in Figure 3.

Figure 3: The visualization of 10 modules with 6 metrics, where the first integer column denotes the module's vulnerability status.

The current DMA tool supports the Java 1.6 syntax. Newer Java releases are available. When we move to a new release, it is important to have a complete and stable parser available, and for this reason we have chosen Java 1.8. Many of the enhanced Java features do not vary from the syntactical constructs used in 1.6 and are handled by the compiler's internal typing mechanisms. The syntactical language changes are the most important for correct parsing. A new syntax for lambda expressions was introduced in Java 8. We tested a Java example using lambda expressions as input to the DMA and, as expected, the tool raises an error on these code blocks because it does not understand the syntax, but it does continue with the analysis. The OWASP Java files do not use any of the newer Java 8 constructs and produce no errors in the XML output. Many newer programming language features are not used by developers, and development teams often shy away from the newest features because of unintended side effects.
If some of the industrial files contain lambda expressions, these can be manipulated manually in the short term to prevent the parser from raising an error. However, we are currently working on upgrading to Java 8 to handle lambda expressions, and a new Java language XML tag will be added to the set of current Java language tags for the DMA tool.

Intermediate Term Objectives

Combine the module and tag metrics with the two clone distance metrics on the OWASP test data and analyze the pairing.
Calculate the MMD for the data set and evaluate the attributes of each of the clusters.

Schedule of Major Steps Planned and Actual:

Update the software to unite the module and tag metrics and the two clone distance metrics.
Test the MMD software.
Execute on the test set and analyze results.
Repeat, applying different parameters to determine the relationship between metric clones and distance metrics.
Try the approach on a new test set, the Juliet Java Test Suite (NIST).

Dependencies:

Our analysis will be on Java systems and some of the insights gained may only reflect these types of systems.

Major Risks:

Tools to assist in analysis require significant effort. Researchers have found it difficult to detect semantic clones (Type-4). The functional equivalence problem is undecidable in general.

Budget: $76,040

Staffing:

Dolores Zage, Wayne Zage, principal investigators
Nathan White, graduate student
Nicolas Egierski, undergraduate student

Category of Current Stage:

Preliminary work and analysis have begun.

Contacts with Affiliates:

Through conversations with affiliates about the Secure Coding project, we have been able to establish the importance of identifying clones and their types. We are providing monthly updates to Cisco personnel via telephone conferences.

Publications and Research Products:

1. Status reports on major steps.
2. Analysis of the technique of using additional discriminators to uniquely identify vulnerable and nonvulnerable patterns.
3. Rudimentary Java clone classifier and MMD analyzer.

References:

[CHAN09] Roy, C. K., J. R. Cordy and R. Koschke, Comparison and evaluation of code clone detection techniques and tools: A qualitative approach, Science of Computer Programming, Volume 74 (2009), no. 7.
[JARZ10] Jarzabek, S. and Y. Xue, Are Clones Harmful for Maintenance?, Proceedings of the 4th International Workshop on Software Clones (IWSC 10), May 8, 2010, Cape Town, South Africa, ACM, New York, NY.
[KOSC07] Koschke, R., Survey of research on software clones, Duplication, Redundancy, and Similarity in Software, number 06301, Dagstuhl Seminar Proceedings, Dagstuhl, Germany.
[McCR15] McCreary, J., Patterns, Code Smells and the Pragmatic Programmer, Slideshare.net, May 2015. Accessed on May 10.
[RATT13] Rattan, D., R. Bhatia and M. Singh, Software clone detection: A systematic review, Information and Software Technology, Volume 55, Issue 7, July 2013, Elsevier.
[ROY14] Roy, C. K., M. Zibran and R. Koschke, The Vision of Software Clone Management: Past, Present, and Future (Keynote paper), 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), Antwerp, Belgium, 2014.
[SHEN16] Sheneamer, A. and J. Kalita, A Survey of Software Clone Detection Techniques, International Journal of Computer Applications, Volume 137, No. 10, March 2016. Accessed May.
[TIAR10] Tiarks, R., R. Koschke and R. Falke, An extended assessment of type-3 clones as detected by state-of-the-art tools, Software Quality Journal. Published online November. Accessed May 10.
[ZAGE09] Before and After Project, Design Metric Analysis of P3, S2ERC Final Report. Copies made available upon request.
