Aspect Mining Using Self-Organizing Maps With Method Level Dynamic Software Metrics as Input Vectors


Aspect Mining Using Self-Organizing Maps With Method Level Dynamic Software Metrics as Input Vectors

By Sayyed Garba Maisikeli

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Information Systems

Graduate School of Computer and Information Sciences
Nova Southeastern University
2009

We hereby certify that this dissertation, submitted by Sayyed G. Maisikeli, conforms to acceptable standards and is fully adequate in scope and quality to fulfill the dissertation requirements for the degree of Doctor of Philosophy.

Frank Mitropoulos, Ph.D., Chairperson of Dissertation Committee
Sumitra Mukherjee, Ph.D., Dissertation Committee Member
Junping Sun, Ph.D., Dissertation Committee Member

Approved:
Edward Lieblein, Ph.D., Dean

Graduate School of Computer and Information Sciences
Nova Southeastern University
2009

An Abstract of a Dissertation Submitted to Nova Southeastern University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Aspect Mining Using Self-Organizing Maps With Method Level Dynamic Software Metrics as Input Vectors

By Sayyed Garba Maisikeli
March 2009

As the size and sophistication of modern software systems increase, so does the need for high-quality software that is easy to scale and maintain. Software systems evolve and undergo change over time. Unstructured software updates and refinements can lead to code scattering and tangling, the main symptoms of crosscutting concerns. The presence of crosscutting concerns can lead to a bloated and inefficient software system that is difficult to evolve, hard to analyze, difficult to reuse, and costly to maintain. A crosscutting concern in a software system represents the implementation of a unique idea or functionality that could not be properly encapsulated. The presence of crosscutting concerns in software systems is attributed to the limitations of programming languages and to the structural degradation associated with repeated change. No matter how well large software applications are decomposed, some ideas are difficult to modularize and encapsulate, resulting in crosscutting concerns scattered across the entire software system, where code implementing one concept is tangled and mixed with code implementing an unrelated concept.

Aspect Mining is a reverse software engineering exploration technique concerned with the development of concepts, principles, methods and tools supporting the identification and extraction of refactorable aspect candidates in legacy software systems. The main goal of Aspect Mining is to help software engineers and developers locate and identify crosscutting concerns and portions of the software that may need refactoring, with the aim of improving the quality, scalability, maintainability and evolution of the software system.
The aspect mining approach presented in this dissertation involved three phases. In the first phase, selected large-scale legacy benchmark test programs were dynamically traced and investigated, and metrics representing interaction between code fragments were derived from the collected data. In the second phase, the formulated dynamic metrics were submitted as input to Self-Organizing Maps (SOMs) for clustering. In the third phase, the clusters produced by the SOM were mapped against the benchmark test program in order to identify code scattering and tangling symptoms; crosscutting concerns were identified and candidate aspect seeds were mined.

Overall, the methodology used in this dissertation is found to perform no worse than other existing Aspect Mining methodologies, and in some cases it outperformed existing Aspect Mining methods that use the same set of benchmark test programs. With regard to Aspect Mining precision, 100% precision was attained for LDA, and 51% precision was attained for JHD, the same as attained by existing Aspect Mining methods. Lessons learned from the experiments carried out in this dissertation show that even highly structured software systems based on best-practice software design principles are laden with code repetition and crosscutting concerns. One of the major contributions of this dissertation is a new unsupervised Aspect Mining approach that minimizes human interaction, in which hidden software features can be identified and inferences about the general structure of a software system can be made, thereby addressing one of the drawbacks of currently existing dynamic Aspect Mining methodologies. The strength of the Aspect Mining approach presented in this dissertation is that the input metrics required to represent software code fragments can easily be derived from other viable software metric formulations without complex mathematical formalisms. Other contributions include a viable software visualization technique that can be used for the exploration, study and understanding of the internal structure and behavioral nature of large-scale software systems. Areas that may need further study include determining the optimal number of vector components required to effectively represent extractable software components.
Other issues worth considering include the establishment of a set of datasets, derived from well-known test benchmarks, that can be used as a common standard for the comparison, evaluation and validation of newly introduced Aspect Mining techniques.

Acknowledgements

This dissertation is dedicated to my late father Sheikh Garba Maisikeli, who exemplified perfection and demanded excellence, and who gave me the spirit of persistence. To my mother, who nurtured me through her tender hands and gave me guaranteed unconditional love. To my late grandmother Nana Halima for the early morning breakfast (Maiwuta-Biyu), a person who provided tough love, acted as my attorney when I was wrong, and provided a vacation escape when I needed one. My dedication is incomplete if I do not mention my wife Amina, who persevered with me through trials and triumph, and my big brother Kamilu Maisikeli (a.k.a. Labaran), who taught me my three R's and, with a little bit of bullying, subliminally challenged me to excel. Last but not least, to my late brother Mujtaba Maisikeli, who fought my fights and defended me when my sceptre and armour were down.

My special gratitude goes to my dissertation committee for giving me the guidance and help I needed to complete my dissertation work. My special thanks and appreciation to Dr. Mitropoulos (dissertation committee chairman), and to committee members Dr. Mukherjee and Dr. Sun, for their patience in listening and providing answers to my never-ending questions.

Table of Contents

Abstract ii
Acknowledgement iv
Table of Contents v
List of Tables viii
List of Figures ix

Chapters

1. Introduction 1
   Problem Statement and Goal 4
   Research Questions/Research Goal 6
   Relevance, Significance 7
   Barriers and Issues 9
   Elements, Hypothesis and Research Questions 11
   Limitations and Delimitations 11
   Definitions of Terms 12
   Summary
2. Brief Review of Literature 15
   Advantages
   Similarities and Differences to Previous Methodologies
3. Methodology/Approach 24
   Phase-1. Benchmark Program Tracing and Data Collection 26
   Metric and Features Extracted From Benchmark Software 30
   Targeted Features 31
   Phase-2. Vector Component Representation 32
   Vector Component Metric 1. Dynamic Fan-In (FI)/Fan-Out (F/O) 32
   Vector Component Metric 2. Information Flow Metric (IF) 32
   Vector Component Metric 3. Method Signature 33
   Vector Component Metric 4. Method Spread 33
   Vector Component Metric 5. Method Internal Coupling (MIC) 34
   Vector Component Metric 6. Method External Coupling (MEC) 34
   Vector Component Metric 7. Method Cohesion Contribution (MCC) 35
   SOM Data Input Structure 35
   The Essence of Clustering 37
   Clustering Code Fragments Using SOM 38
   How SOM Performs Data Clustering 39
   Steps of the SOM Training Algorithm 41
   SOM Result Visualization 42
   Component Planes 43
   SOM Versus Other Clustering Techniques 44
   Benchmark/Test Source Programs 45
   Notes on Laffra's Dijkstra's Algorithm Implementation (LDA) and (JHD) 46

   Phase-3. Structure Discovery, Concern Isolation, Pattern Identification 47
   Identification of Aspect Candidates 48
   Anatomy of Concerns in Software Systems 48
   Definition of Scattering 49
   Definition of Tangling 49
   Definition of Crosscutting Concern 50
   Definition of Concern Decomposition 50
   Symbolic Definition of Crosscutting 51
   Formal Model for Clustering-based Aspect Mining 52
   The Partition of a Software System 52
   Cluster Mapping 53
   Formal Presentation of Mapping 55
   Validation of Methodology 56
   Validation Step-1. Software Metrics 57
   Validation Step-2. Number of Clusters and Cluster Compactness 58
   Validation Step-3. Validation of Aspect Candidates 60
   Validation Step-4. Recall and Precision 61
   Summary of Methodology Used
4. Results 64
   Analysis of LDA Results 66
   LDA Mapping Example 69
   LDA Findings and Result Comparisons 71
   Findings and Data Analysis (JHotDraw 5.4b1) 72
   JHD Component Planes 75
   Comparison of Results (JHotDraw 5.4b1) 76
   Result Precision 78
   Summary of Results (General Observation and Findings) 79
   Visualization Problems 80
   Seed Disparity: Ceccato et al. vs. Dissertation Approach 81
   Problem of Incomplete Coverage 81
   Data Analysis Problem 82
   Summary of Results
5. Conclusions, Implications, Recommendations, and Summary 84
   Implications 86
   Recommendations 87
   Summary 88

Appendixes 93
   Appendix A. AspectJ Event Trace Code 93
   Appendix B. VBA Code that Constructs Metrics 95
   Appendix C. Sample LDA Execution Trace Data 97
   Appendix D. Sample Metric Data (LDA) 97
   Appendix E. MatLab SOM Code 98
   Appendix F. List of Aspects and Associated Seeds Found 99
   Appendix G. Method Detail and Cluster Tables (LDA) Seeds 104
   Appendix H. SQL Statements Used for Linking Tables to Mine Aspects 105
Reference List 106

List of Tables

1. Number of Event Traces for Benchmark Programs
2. List of Functionalities Exercised in JHD
3. Characteristics of Benchmark Programs
4. Rules for Extracting Aspect Seeds
5. Sample Data Representing Formulated Metrics
6. Comparison of Discovered JHD Seeds
7. Select Concern Type Capability Comparisons
8. JHD Seeds Comparison: Dynamic Analysis vs. Dissertation Approach
9. LDA Benchmark Comparison
10. JHD Benchmark Precision Comparison
11. Series of Quantization and Topographic Errors Collected
12. Proportion of Executed Unique Methods 81

List of Figures

Figure 1. Steps Used in Methodology 25
Figure 2. JHotDraw User Interface Screen Shot 28
Figure 3. LDA Screen Shot 29
Figure 4. Sample Java Method Signature 33
Figure 5. Method Spread 33
Figure 6. Method Internal Coupling (MIC) 34
Figure 7. Method External Coupling (MEC) 34
Figure 8. Method Cohesion Contribution (MCC) 35
Figure 9. Definition of Vector Components 36
Figure 10. Data Input Layout for SOM 36
Figure 11. SOM Architecture 39
Figure 12. Updating the Winner Neuron, Known as the Best Matching Unit (BMU) 40
Figure 13. U-Matrix Representation of the Self-Organizing Map 42
Figure 14. Concern Decomposition: The Prism Analogy 50
Figure 15. Relation Between Source and Target Elements 51
Figure 17. Mapped Clusters and Obtained Aspect Seeds and Their Types 54
Figure 18. Mathematical Definition of Mapping 55
Figure 19. Radar Plots for Metrics Used on Benchmark Test Programs 57
Figure 20. Optimal Number of Clusters Based on Davies-Bouldin Index 59
Figure 21. Precision and Recall 62
Figure 22. LDA U-Matrix and Component Planes 67
Figure 23. LDA Clusters Produced by SOM 68

Figure 24. LDA Method Mapping Showing Two Identified Aspects 70
Figure 25. LDA Mined Aspect Seeds 70
Figure 26. Method Mapping Showing the Two Identified (LDA) Aspects 70
Figure 27. JHD U-Matrix 72
Figure 28. 3D JHD SOM Clusters 71
Figure 29. Graph Showing Seed Discovery Disparity 74
Figure 30. JHD Component Plane Maps 75
Figure 31. Comparison of Quantization and Topological Errors 80

Chapter 1

Introduction

A concern in a software system represents an implementation of a unique idea or functionality. Robillard (2000) defined a concern as any consideration about the implementation of a program. Similarly, Ossher and Tarr (2001) defined a concern as a part of a software system that is relevant to a specific concept or purpose. Concerns in software systems are usually not properly modularized or encapsulated in a single unit: they may be scattered across different modules, leading to what is known as the code scattering problem, and/or intermingled with other code implementing different functionalities, known as the code tangling problem. A crosscutting concern is an obstacle encountered by software developers, brought about by the presence of one or more modules that cannot be perfectly localized or modularized, so that many module boundaries are crossed, making software maintenance and scalability very difficult. Reasons for code scattering and tangling include implementation language limitations, poor design, and software scavenging (known as code cloning). Rieger (2005) defined code cloning as a form of primitive software reuse, where reuse is applied in an informal and uncontrollable way, and observed that since such duplications are not documented, dependencies between parts of the system code are usually hidden. The task of identifying and detecting crosscutting concerns in software systems is called Aspect Mining. Loughran and Rashid (2002) stated that mining software aspects is an important exercise because it allows software engineers and developers to locate, manage and adapt assets efficiently.

Aspect Mining involves searching for source code elements or components that belong to a crosscutting concern; the resulting discovered elements are called aspect seeds. Cojocar and Serban (2007) presented a theoretical model of Aspect Mining as follows. Consider a software system represented by a set M = {m1, m2, ..., mn}, where mi is the i-th method of the software system, 1 ≤ i ≤ n, and n is the number of methods in the system. A crosscutting concern in such a software system is a set of methods C ⊆ M, C = {c1, c2, ..., c_cn}, that together implement the concern; the number of methods in the crosscutting concern is cn = |C|. Let CCC = {C1, C2, ..., Cq} be the set of all crosscutting concerns that exist within M; the number of crosscutting concerns in the system M is then q = |CCC|. Following this definition, a set K = {K1, K2, ..., Kp} is called a partition of the system M if and only if 1 ≤ p ≤ n, Ki ⊆ M and Ki ≠ ∅ for 1 ≤ i ≤ p, M = K1 ∪ K2 ∪ ... ∪ Kp, and Ki ∩ Kj = ∅ for 1 ≤ i, j ≤ p, i ≠ j.

One of the main goals of Aspect Mining is to identify parts of software that may need to be refactored. According to Fowler (1999), refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior. The aim of Aspect Mining is to find and isolate crosscutting concerns in existing non-aspect-oriented legacy systems. The identified crosscutting concerns can then be improved by applying aspect-oriented solutions. The goal is to provide a means of refactoring aspects so that program comprehension is improved, thereby making reusability, extendibility and the general management and maintainability of the software system easier.
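The partition conditions in the formal model can be checked mechanically. The sketch below is a minimal illustration, not part of the dissertation's tooling; the method names and the sample partition are hypothetical.

```python
def is_partition(M, K):
    """Check the conditions for K = [K1, ..., Kp] to be a partition of M:
    1 <= p <= n, each Ki is a non-empty subset of M, the Ki are pairwise
    disjoint, and their union is exactly M."""
    if not K or len(K) > len(M):
        return False
    union = set()
    for Ki in K:
        if not Ki or not Ki <= M:   # Ki must be a non-empty subset of M
            return False
        if union & Ki:              # Ki must be disjoint from earlier blocks
            return False
        union |= Ki
    return union == M               # the blocks must cover all of M

# Hypothetical method set and a candidate partition of it
M = {"draw", "log_entry", "log_exit", "undo", "redo"}
K = [{"draw"}, {"log_entry", "log_exit"}, {"undo", "redo"}]
print(is_partition(M, K))  # True: non-empty, disjoint, covering
```

A grouping that leaves methods uncovered, or assigns one method to two blocks, fails the same check.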

The basic idea of dynamic Aspect Mining is to observe the run-time behavior of a software system and to extract relevant information about the program under investigation. When a program is run and event traces are collected, the hidden behavior of the program is reflected, and within the execution traces, recurring patterns may provide pertinent information about potential crosscutting concerns in the software system. A three-phase dynamic Aspect Mining process is applied in this dissertation. In the first phase, a set of legacy benchmark programs was selected as test programs, and an AspectJ tracing tool was developed to trace and profile them. Extractable features targeted for this purpose include dynamic inter-method interactions such as fan-in and fan-out, parameter-passing interactions, and method-level cohesion and coupling characteristics of the investigated programs. In the second phase, software metrics derived from the test programs were submitted as input to a neural network clustering methodology known as the Self-Organizing Map (SOM). The third phase investigated the clusters produced by the SOM to identify recurring relationship patterns from which crosscutting concerns and aspect seeds can be identified and mined. At the end of the project, validation methodologies were applied to assess the performance of the Aspect Mining methodology used in the dissertation. Experimental data collected were then compared with results from other Aspect Mining research efforts to determine how well the methodology employed fares. Cluster compactness was also validated to ensure that optimal clustering was achieved, with a view to obtaining the best possible results.
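As a rough illustration of the first phase, the sketch below derives dynamic fan-in and fan-out counts from a simplified call-event trace. The trace format and method names are hypothetical, not the dissertation's actual AspectJ output.

```python
from collections import defaultdict

def fan_in_out(trace):
    """Given a run-time trace of (caller, callee) call events, count the
    distinct callers of each method (fan-in) and the distinct callees of
    each method (fan-out)."""
    callers = defaultdict(set)
    callees = defaultdict(set)
    for caller, callee in trace:
        callers[callee].add(caller)
        callees[caller].add(callee)
    methods = set(callers) | set(callees)
    return {m: (len(callers[m]), len(callees[m])) for m in methods}

# Hypothetical trace: a logging method invoked from many places is a
# classic scattering symptom (high fan-in).
trace = [("main", "draw"), ("main", "save"),
         ("draw", "log"), ("save", "log"), ("undo", "log")]
print(fan_in_out(trace)["log"])  # (3, 0): called by three methods, calls none
```

A method with unusually high fan-in relative to its peers is exactly the kind of code fragment the clustering step is meant to surface.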

Problem Statement and Goal

According to Bruntink (2004), no matter how well large software systems are decomposed, crosscutting concerns will not always fit the chosen decomposition; as a result, implementations of such crosscutting concerns will be scattered across the system, tangled and mixed with non-related code addressing different concerns. In other words, the functional structure of system requirements sometimes does not lend itself to proper and perfect modularization and encapsulation, leading to what is known as code scattering and tangling. Code scattering and tangling cause a piece of software to be difficult to extend, maintain and evolve. Giesecke (2006) mentioned that code clones in software systems arise as a result of inefficient design and software scavenging. Code cloning occurs when a piece of tested code is copied and customized to implement a similar or closely related concept. Other reasons that lead to code cloning include adaptation of a piece of code for use in a different context, and a lack of abstraction mechanisms in the programming language used in the development of the software. Code duplication in a software system can lead to code scattering and a high possibility of code tangling (mixing unrelated functions in a software system), causing the presence of crosscutting concerns. According to Davey, Barson, Field, Frank and Tansley (1995), software code cloning contributes to redundancy of statements, and therefore to a steady increase in the complexity of a software system. To highlight the problems associated with code scattering and tangling, Greenan (2005) stated that code duplication can over-complicate routine maintenance, and observed that a change in one method can lead to cascading change across many other methods, with possible propagation of bugs across the software system.
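Clone detection of the kind discussed above is often approximated by hashing normalized code fragments. The sketch below is a toy, line-based illustration (whitespace-normalized exact matching) with made-up fragments; it is far simpler than the token-based techniques of industrial tools such as CCFinder.

```python
from collections import defaultdict

def find_clones(fragments):
    """Group code fragments whose whitespace-normalized text is identical.
    fragments maps a fragment id (e.g. a method name) to its source text.
    Returns groups of fragment ids that are exact-match clones."""
    buckets = defaultdict(list)
    for frag_id, text in fragments.items():
        key = " ".join(text.split())   # collapse whitespace differences
        buckets[key].append(frag_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

# Hypothetical fragments: two methods duplicate the same logging wrapper.
fragments = {
    "A.save":  "log.enter();  doWork();\nlog.exit();",
    "B.load":  "log.enter(); doWork(); log.exit();",
    "C.undo":  "history.pop();",
}
print(find_clones(fragments))  # [['A.save', 'B.load']]
```

Real clone detectors also handle renamed identifiers and near-miss clones, which this exact-match sketch deliberately ignores.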

One of the areas recommended by Krinke and Breu (2003) for future work in Aspect Mining is the development of a filtering methodology that extracts refactorable candidates from discovered clones. This dissertation aimed at attaining this goal by using method-level metrics derived from extractable code features of the benchmark test programs, with the data for the metrics collected dynamically as the benchmark programs execute. The code granularity level used in this project is the method unit rather than the class. Aggarwal and Singh (2003) argued that the traditional static metrics used for measuring software systems are inadequate for the analysis of object-oriented systems. Rather than representing the static structural nature of the software system, dynamic metrics represent the dynamic behavior of the program, and are therefore a better representation of the source code. Dufour, Goard, Hendren, de Moor, Sittampalam and Verbrugge (2004) stated that when data about source code is collected dynamically, features such as data structure, memory use, concurrency, size and polymorphic behavior of the system are exposed, and that although dynamic metrics may require more work than static ones, the results obtained are more meaningful. Based on these advantages of dynamic metrics over static ones, this dissertation used dynamic metrics as the vector component input to the SOM. Extractable features used for this purpose include lexical patterns, dynamic behavior and other relevant method-level characteristics of the investigated programs. The aim was to extract features capable of uniquely qualifying and quantifying the structural as well as the dynamic behavior of the code fragments corresponding to methods in the benchmark program under investigation.
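To make the clustering step concrete, here is a minimal one-dimensional SOM sketch in pure Python. It is illustrative only: the dissertation used MATLAB SOM code (Appendix E), real input vectors would be the seven method-level metrics rather than the toy 2-D points used here, and all names and parameter values below are assumptions.

```python
import math, random

def train_som(data, n_units=4, epochs=200, lr0=0.5, radius0=2.0, seed=1):
    """Train a 1-D self-organizing map: for each input vector, find the
    best matching unit (BMU) and pull the BMU and its grid neighbors
    toward the input, with learning rate and radius decaying over time."""
    random.seed(seed)
    dim = len(data[0])
    weights = [[random.random() for _ in range(dim)] for _ in range(n_units)]
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        radius = max(radius0 * (1 - t / epochs), 0.5)
        for x in data:
            # BMU = unit with the smallest Euclidean distance to x
            bmu = min(range(n_units),
                      key=lambda u: sum((w - xi) ** 2
                                        for w, xi in zip(weights[u], x)))
            for u in range(n_units):
                # Gaussian neighborhood over grid distance to the BMU
                h = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                weights[u] = [w + lr * h * (xi - w)
                              for w, xi in zip(weights[u], x)]
    return weights

def bmu_of(weights, x):
    return min(range(len(weights)),
               key=lambda u: sum((w - xi) ** 2 for w, xi in zip(weights[u], x)))

# Two obvious groups of 2-D "metric" vectors; after training, each group
# should map to its own region of the map.
data = [(0.1, 0.1), (0.15, 0.05), (0.9, 0.9), (0.85, 0.95)]
w = train_som(data)
print(bmu_of(w, (0.1, 0.1)) != bmu_of(w, (0.9, 0.9)))  # True: separate units
```

The same BMU/neighborhood update is what a full SOM toolbox performs on a 2-D grid; the 1-D grid here just keeps the sketch short.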

Research Questions

To accomplish the goal of presenting an Aspect Mining approach that addresses some of the drawbacks in existing Aspect Mining research, this dissertation attempted to answer the following questions:

1. Can method-level software metrics be used as input to SOM so that code fragments with similar and overlapping functionalities can be properly clustered, providing a base from which crosscutting concerns can be identified?

2. Will mapping SOM clusters against the benchmark program help identify code scattering and tangling, leading to identification of possible aspect candidates?

Research Goals

In line with the research questions raised above, this dissertation attempted to achieve the following three goals:

1. Show that clustering based on easily extractable software features can be used to detect code clones, and hence provide a steppingstone towards identification of crosscutting concerns, leading to Aspect Mining.

2. Show that method interactions through calls and parameter sharing, represented as metrics and clustered by SOM, can be used to determine code scattering and/or tangling, thereby identifying crosscutting concerns and leading to the mining of possible aspect candidates.

3. Show that the Aspect Mining approach proposed in this dissertation does no worse than existing Aspect Mining methods.

Relevance and Significance

According to Kontogiannis, De Mori, Merlo, Galler and Bernstein (2003), legacy systems are operational, large-scale software systems that are maintained beyond the first generation of programmers. Kontogiannis et al. (2003) also stated that legacy systems typically represent massive economic investments and are usually critical to the mission of the organization they serve, and that, as systems age, they become increasingly complex and brittle, and hence harder and riskier to evolve and maintain. By their nature, legacy systems outlive their designers and developers; proper documentation for such large-scale software systems is usually not available, and when available, is often lacking in detail. Most software systems undergo changes over time, and unstructured and poorly planned software updates and refinements can lead to code scattering and tangling that affect software efficiency, making the software difficult to evolve, hard to analyze, difficult to reuse and costly to maintain. Software applications reach the end of their life cycle when they fail to keep up with evolving needs or become too risky, difficult and expensive to modify. As a result of changes in business requirements or adaptation to changes occurring in industry, legacy software systems usually undergo changes, updates and modifications, and such modifications require refactoring existing code. Bruntink (2004) defined Aspect Mining as a specialized reverse engineering process in which legacy system source code is investigated in order to discover the parts of the system representing aspects. When refactoring large-scale software systems, software engineers and developers usually focus attention on identifying portions of the software that are possible candidates for change; this process is known as Aspect Mining.

Aspect Mining is concerned with identifying system-wide crosscutting concerns. Loughran and Rashid (2002) stated that mining software aspects is an important exercise because it allows software engineers and developers to locate and adapt assets efficiently. To highlight the benefits derived through Aspect Mining and refactoring, Shepherd et al. (2005) observed that novice programmers can benefit through the recognition of refactoring opportunities, while experienced programmers gain from more subtle opportunities. Greenan (2005) observed that, due to the dynamic nature and complexity of modern software systems, a substantial fraction of such large-scale systems contain code clones, and the presence of such clones can pose problems when the need to refactor the software arises. In small-scale software systems, code and logic duplication may not be a problem, but for industrial-strength software systems, management, maintainability and extendibility are important issues, and refactoring software bloated with logic and code duplication and crosscutting concerns can be a difficult and expensive exercise. According to Kamiya et al. (2002), 15-20% of the lines of code in large-scale software systems are cloned; they reiterated this point by noting that the clone detection tool CCFinder reports that 21.35% of the software code behind the Java Software Development Kit (JSDK) is cloned. Davey, Barson, Field, Frank and Tansley (1995) also observed that results of unofficial surveys carried out within large, long-term software development projects suggested that 25-30% of the modules in large-scale legacy systems have been cloned. In support of using code clone detection as a means of mining aspects, Bruntink, Van Deursen, Van Engelen and Tourwe (2005) hypothesized that clone detection techniques might be suitable for identifying some kinds of crosscutting concern code, since they automatically detect duplicated code in system source code.

Barriers and Issues

A major challenge usually faced by software developers and engineers is the desire to develop and deliver high-quality software within a given time constraint. Sampaio, Loughran, Rashid and Rayson (2004) observed that, to accelerate the development process, developers in many cases resort to ad-hoc shortcuts, leading to an unstructured way of software development that impacts the quality of the process and its deliverables. Programming-language-supported constructs like modules, classes, and aspects enable encapsulation of certain concerns, but due to the limitations of programming languages, structural degradation due to repeated changes, and the continual emergence of new issues, code implementing concerns is often found scattered and tangled throughout the system. Roy, GiasUddin, and Dean (2007) stated that studies and experience have shown that scattering and tangling of concerns greatly increase the difficulty of evolving software in a correct and cost-effective manner. Aspect Mining tries to identify crosscutting concerns with the aim of providing a means for refactoring and improving the quality, scalability and maintainability of software systems. Bruntink, Van Deursen, Van Engelen and Tourwe (2005) observed that software systems need refactoring to reduce the software decay that occurs over time, and that refactoring decreases complexity, improves maintainability and understandability, and facilitates future changes. Shepherd, Palm, Pollock and Chu-Carroll (2005) highlighted the need for Aspect Mining by stating that identifying aspect candidates can have many benefits: novice programmers can be helped to recognize refactorable opportunities in their code, while advanced programmers can benefit from subtle refactoring opportunities.

Due to the exorbitant costs associated with new development, it is usually not feasible to re-build existing systems using new techniques and technologies. Gibbs (1994) estimated that the average cost for well-managed code is about $100 per line. To highlight the costs associated with maintenance of large-scale software systems, Davey et al. (1995) estimated that the NASA space shuttle software required 25.6 million lines of code, 22,096 staff-years of effort, and cost about $1.2 billion. Another example of software maintenance concerns, presented by Schendler (1989), is that of Covia, a subsidiary of United Airlines, which spends $120 million each year maintaining and updating software for its Apollo reservation system. Schendler (1989) also quoted Peat Marwick (KPMG) as reporting that software maintenance activities typically consume 80% of most corporate software budgets. With the increasing capital investment necessary to develop such large-scale software systems and the associated maintenance costs, the management of these systems becomes even more vital, both during and after initial development. These examples highlight the need for new methodologies that can efficiently help software engineers and developers in the maintenance and refactoring of large-scale software systems. In support of new Aspect Mining methodologies that minimize manual interaction and increase efficiency in Aspect Mining, Shepherd, Palm, Pollock and Chu-Carroll (2005) observed that although existing tools provide the means to identify aspect seeds, most suffer drawbacks: such methodologies are not automated, and even those that are automated usually return spurious seeds, tend to miss many potential seeds, require a large amount of human effort, and provide little or no help in Aspect Mining decisions.

Elements, Hypothesis, and Research Questions Investigated

This dissertation hypothesized that crosscutting concerns and aspects can be mined from clustered extractable software fragments at method granularity level, after mapping the obtained clusters against the entire benchmark software being investigated.

Research Questions

The two research questions this dissertation attempted to answer are as follows:

1. Can method-level software metrics be used as input to SOM such that similar code fragments with overlapping functionalities can be clustered, providing a base from which crosscutting concerns can be identified?

2. Does mapping SOM clusters against a test benchmark program result in the identification of code scattering and tangling, leading to identification of possible aspect candidates?

Limitations/Delimitations of Study

The effectiveness of the Aspect Mining methodology used in this dissertation may have been impacted by the quality of the software features targeted, the quality of the software metrics used as input vector components, and how focused and relevant these vector components (i.e., extractable code fragments) are. Results may also have been impacted by how compact the resultant SOM clusters are. The nature and inherent general structure of the software being investigated, and how well modularized the components of the benchmark programs are, is another factor that may have influenced the obtained results. Since human interaction is required in exercising the tested software, not all functionalities may have been exercised, resulting in some trivial functionalities not being traced.

Definition of Terms

Artificial Intelligence: The art of creating systems that perform functions requiring intelligence, making it possible to perceive, reason, and act rationally and intelligently.

Aspect: A concern in a software system representing the implementation of a unique idea or functionality.

Aspect Mining: A reverse software engineering exploration technique concerned with the development of concepts, principles, methods, and tools supporting the identification and extraction of re-factorable aspect candidates in legacy software systems.

Best Matching Unit (BMU): The neuron with the shortest distance to the input vector, used to declare a winner in SOM clustering.

Concern: A representation of the implementation of a unique idea or functionality in a software system; a part of a software system that is relevant to a specific concept or purpose.

Clustering: The process of putting similar things together into groups such that similarity or cohesion among the units is high.

Crosscutting Concern: The scattering and tangling of implementation in software that cuts across module boundaries.

Design Patterns: A set of tested software design solutions representing best-practice approaches to solving commonly encountered programming problems.

Dynamic Metric: A measure used to describe and quantify the complexity of a piece of software, collected while the program is being executed.

Event Trace: A sequence of method invocations and exits keeping track of the relative order in which method executions are started and finished.

False Negative: The number of aspects reported as non-aspects when they are actually aspects.

False Positive: The number of reported aspect candidates that are in fact not aspects.

Precision: The percentage of relevant aspect candidates in a set of reported candidates; the ratio of confirmed good aspect candidates to the total of good and bad candidates.

Quantization Error: The mean of the Euclidean distances of each data vector to the weight vector of its Best Matching Unit (BMU).

Recall: The measure of how much of the code of a crosscutting concern is found.

Refactoring: The activity of transforming potential aspects into real aspects in a software system.

Scattering: A concern is scattered if it is related to multiple target elements.

Self-Organizing Map (SOM): An unsupervised Artificial Intelligence data visualization technique, attributed to Kohonen (1968), that reduces the high dimensionality of data through the use of neural networks, converting non-linear statistical relationships between high-dimensional data into simple geometric relationships of their image points.

Separation of Concerns: The principle of breaking a program into smaller distinct parts without much overlap in functionality. Separation of concerns is an important concept in software engineering and a desired feature in software system modeling and design, since it helps in managing the complexity of the system.

Tangling: A concern is tangled if both it and at least one other concern are related to the same target element.

Topographic Error: The proportion of all data vectors for which the first and second BMUs are not adjacent units in the SOM grid.

U-Matrix: A graphical representation of the distances between the reference vectors of neighboring map units, usually presented as a grid (matrix) depicting neuron distances.

Weaving: The phase during which aspect functionality is composed and injected into the modules of a system.
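The two map-quality measures defined above (quantization error and topographic error) can be computed directly from a trained SOM codebook. The Python sketch below is illustrative only: the array shapes and the 8-neighbour adjacency rule for "adjacent units" are assumptions, not taken from the dissertation's tool.

```python
import numpy as np

def quantization_error(data, weights):
    """Mean Euclidean distance from each data vector to its BMU's weight vector.
    data: (n, dim) array; weights: (rows, cols, dim) SOM codebook."""
    flat = weights.reshape(-1, weights.shape[-1])
    dists = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def topographic_error(data, weights):
    """Proportion of data vectors whose first and second BMUs are not
    adjacent units on the SOM grid (8-neighbour adjacency assumed)."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    dists = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)        # unit indices, nearest first
    bmu, bmu2 = order[:, 0], order[:, 1]
    r1, c1 = np.divmod(bmu, cols)            # grid coordinates of BMU
    r2, c2 = np.divmod(bmu2, cols)           # grid coordinates of 2nd BMU
    non_adjacent = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2)) > 1
    return non_adjacent.mean()
```

Lower values of both measures indicate a map that fits and preserves the topology of the input data well.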

Summary

Enterprise-wide, large-scale software systems change and evolve over time. The evolvability and ever-changing nature of such systems is a function of factors such as changes in business rules and programming paradigm shifts. The end of life of a software system comes about not because it fails to deliver the required functionalities, but because the system has become too brittle, difficult to maintain, or too risky and expensive to evolve. A major impediment to program comprehension, maintenance, and evolvability is the presence of crosscutting concerns scattered across different modules and tangled with other implementations. Eaddy et al. (2008) have shown that a correlation exists between the presence of crosscutting concerns and defects in software, implying that the more scattered concern implementations are, the more defects a software system may have. This dissertation presents a dynamic Aspect Mining approach that targets extractable software features (code fragments) within the investigated benchmark software (through dynamic execution event tracing), converting event trace values to software metrics that were then submitted to SOM for clustering. The resultant clusters were then mapped back to the test software to identify crosscutting concerns. The metrics derived and used in this dissertation proved to be very useful and a good representation of the latent features in the investigated benchmark software systems. In addition, the mapping strategy used helped expose the scattering and tangling behavior exhibited in the investigated benchmark software systems, thereby helping in the identification of aspect candidate seeds. Based on the capabilities exhibited by the applied methodology and a comparison of the obtained results against existing documented results, the two main questions raised in this dissertation have been answered.

Chapter 2

Brief Review of the Literature

The tyranny of the dominant decomposition, as presented by Tarr, Ossher, Harrison and Sutton (1999), states that no matter how well a system is decomposed into modular units like functions and classes, some software functionality will always cut across that modularity. When refactoring large-scale software systems, software engineers and developers usually focus attention on identifying portions of the software that are possible candidates for change; this process is known as Aspect Mining. Marin, Deursen and Moonen (2004) stated that the origin of Aspect Mining can be traced back to the Concept Assignment Problem, which deals with discovering domain concepts and assigning them to their realizations within a specific program. Aspect Mining is concerned with identifying system-wide crosscutting concerns. Loughran and Rashid (2002) stated that mining software aspects is an important exercise because it allows software engineers and developers to locate and adapt assets efficiently. According to Favre (2002), maintaining a software system is difficult not only because of the number of artifacts but also because of their variety, and the difficulty associated with managing and maintaining a legacy software system is greatly associated with how the software's constituents are designed and laid out. Breu, Zimmermann and Lindig (2006) defined Aspect Mining as the identification of crosscutting concerns in legacy software, and presented an Aspect Mining methodology that uses project history, where program version archives are mined and the results used to identify possible aspect candidates.

Deursen, Marin and Moonen (2003) also defined Aspect Mining as the search for candidate aspects in existing systems, isolating them from the system into separately described aspects, and further explained how Aspect Mining is concerned with the development of concepts, principles, methods, and tools supporting the identification of aspects in software systems. Breu (2004) considered separation of concerns the Holy Grail of software engineering, since it is a crucial and fundamental software engineering principle. In order to fully understand the design principles behind legacy systems, and to be able to extract pertinent information, efficient reverse-engineering principles must be properly applied. Several Aspect Mining approaches have been published. Marin, Deursen and Moonen (2004) stated that Aspect Mining can generally be classified into two groups, the query-based and the generative approach. While the query-based approach utilizes manual input such as textual patterns, the generative approach aims at extracting and generating aspect seeds automatically using structural information from the program source code being investigated. Generative approaches usually utilize program analysis techniques to look for symptoms of code scattering and tangling, thereby identifying code elements qualifying as aspect seeds. Examples of the query-based approach include AspectBrowser, the Aspect Mining Tool (AMT), and the Feature Exploration and Analysis Tool (FEAT). Examples of generative approaches include Fan-In analysis, Gybels and Kellens (2005); clone detection based on the Program Dependence Graph (PDG), Shepherd et al. (2004); and Formal Concept Analysis, attributed to Tourwe and Mens (2004). Furthermore, the generative approach can be either static, as in Tonella and Ceccato (2004), or dynamic, such as the work done by Marin, Deursen and Moonen (2004).

Shepherd and Pollock (2005) performed experiments using agglomerative hierarchical clustering to group related methods. This technique recursively clusters the input methods, merging clusters based on a given threshold. In similar research, He and Bai (2005) proposed another Aspect Mining technique based on cluster analysis, on the assumption that if the same methods are called frequently from within different modules, then a hidden crosscutting concern exists. He and Bai (2005) also used the distance-based Static Direct Invocation Relationship (SDIR) between methods as input to a clustering algorithm to identify crosscutting concerns. Moldovan and Serban (2006) used a vector-space clustering approach, based on the number of calling methods and the number of calling classes, to identify symptoms of code scattering and possible aspect candidates. Moldovan and Serban's (2006) findings show that clusters obtained from different clustering methods contain almost the same set of methods, independent of the clustering method used, and that most of the clustered methods implement crosscutting concerns. According to Moldovan and Serban (2006), aspects should be identified based on both code scattering and code tangling; so far, however, most existing Aspect Mining research emphasizes only code scattering. Based on the success of the three clustering algorithms cited above in mining aspects, this dissertation used a neural network methodology known as Self-Organizing Maps (SOM) as the clustering methodology, after which the obtained clusters were mapped against the benchmark software in order to identify code scattering and tangling patterns, leading to the mining of possible aspect candidates.
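As a rough illustration of the SOM clustering step just described, the sketch below trains a small map by repeatedly finding each input vector's best matching unit (BMU) and pulling the BMU and its grid neighbours toward that vector. The grid size, decay schedules, and Gaussian neighbourhood here are illustrative choices, not the dissertation's actual configuration.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=40, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal online SOM: for each sample, move the BMU and its grid
    neighbours toward the sample, shrinking the learning rate and
    neighbourhood radius over the epochs."""
    rng = np.random.default_rng(seed)
    w = rng.random((rows, cols, data.shape[1]))          # random codebook
    grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                 indexing="ij"))         # (rows, cols, 2)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                      # decaying rate
        sigma = sigma0 * (1 - t / epochs) + 0.5          # decaying radius
        for x in data[rng.permutation(len(data))]:
            d = np.linalg.norm(w - x, axis=2)            # unit distances
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbourhood around the BMU on the grid
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2)
                       / (2 * sigma ** 2))
            w += lr * g[..., None] * (x - w)
    return w
```

After training, each method's input vector can be assigned to its BMU, and vectors sharing a BMU (or a tight neighbourhood of units) form the clusters that are later mapped back onto the source code.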

Compared to using Euclidean distance techniques alone to determine closeness and similarity between code fragments, SOM has the advantage of performing comparison and clustering at the same time, along with the means to visualize the results graphically. Based on the dynamic metrics used in this project, the use of the Self-Organizing Map (SOM) helped decrease processing time; an increase in recall, and possibly better precision, was also observed. In developing a code clone detection tool, Davey, Barson, Field, Frank and Tansley (1995) used a Self-Organizing Map for clone detection purposes only. Three input vector components (frequency of keywords, code statement indentation, and the length of each line in the software modules) were used. Due to the poor results obtained using SOM, Davey et al. later used a Dynamic Competitive Learning (DCL) methodology; the poor performance might be attributed to the type of input vector components used. Moonen (2002) explained that Aspect Mining typically involves three steps: (1) data collection from the source code, (2) knowledge inference based on abstraction from the collected data, and (3) information presentation from the collected data. This dissertation used the steps presented by Moonen (2002) as a guideline. In Object-Oriented programming, methods are the building blocks, and classes are the component units that organize and represent the conceptual ideas being implemented; as such, observing the characteristics of the code contained in methods can provide very valuable information about the software system. According to Ceccato, Marin, Mens, Moonen, Tonella and Tourwe (2005), methods are the most likely places where code clones, and possibly code scattering and code tangling, may occur.

In support of using the method as a viable code unit for investigating software, Ceccato, Marin, Mens, Moonen, Tonella and Tourwe (2005) also stated that crosscutting functionalities at the method level reside in calls to methods that address different concerns than the core logic of the caller. Since methods in Object-Oriented software represent implementations of concepts, the granularity of detail used in this project is the class method unit. In line with this, the metrics derived from the benchmark programs and used in this project were based at the class method level. Since the dissertation used code clone-class detection in the second phase through SOM clustering, after which Aspect Mining techniques were applied in the third phase, some notable research efforts in both areas are presented in the following paragraphs. In Aspect Mining research, Baxter, Yahin, Moura, Sant'Anna and Bier (1998) used the Abstract Syntax Tree (AST) to determine code similarities. Komondoor and Horwitz (2001) used the Program Dependence Graph (PDG) to detect clones from semantic information such as the data flow and control flow of a program. Other methods include metric-based clone detection by Mayrand, Leblanc and Merlo (1996), and information-retrieval-based methods presented by Marcus and Maletic (2001). Most existing Aspect Mining techniques and methodologies can be categorized into two groups: static and dynamic Aspect Mining. While static Aspect Mining focuses on the structural nature of the software, dynamic Aspect Mining observes the run-time behavior of the software system. In early Aspect Mining work, lexical and exploratory tools were produced with the capability to help the user find possible aspect seeds, providing little or no help in decision making.

Breu, Zimmermann and Lindig (2006) defined dynamic Aspect Mining as an approach that analyzes program traces reflecting the runtime behavior of a system, searching for recurring execution patterns that reflect recurring functionalities and identify possible aspect candidates. The work of Tonella and Ceccato (2004) and of Bruntink (2004) are examples of static and dynamic Aspect Mining, respectively. In addition to static and dynamic Aspect Mining methods, Loughran and Rashid (2002) and Breu (2005) discussed and presented hybrid Aspect Mining approaches in which both static and dynamic methodologies are applied to mine aspects in software systems. The use of clone detection to mine software aspects was first suggested by Kamiya et al. (2002); Bruntink and Deursen (2004), however, implemented a manual investigation to help explain the relationship between code clones and Aspect Mining. In similar research, Bruntink (2004) used clone detection to create clone classes, and then derived clone-class metrics from which possible aspect candidates were identified and mined. The relationship between clone detection and Aspect Mining is based on the fact that scattered code is by its nature not modularized, and unmodularized code is poorly encapsulated, leading to a spread of functionality across different parts of the software system. Bruntink and Deursen (2004) also stated that most clone detection methodologies generate output consisting of pairs of clones that are similar enough to be called clones. In another research effort, Kontogiannis (2003) experimented with the use of five data- and control-flow software metrics to detect program patterns, and used information retrieval methods to retrieve similar code fragments.

Bruntink, van Deursen, van Engelen and Tourwe (2005) argued that finding crosscutting concerns is a completely new application area, potentially requiring specialized types of clone detection, and hypothesized that since clone detection methods are capable of detecting duplicated code, they are therefore capable of identifying some types of crosscutting-concern code. Moldovan and Serban (2006) used a vector-space clustering approach to mine aspects, based on the number of calling methods and the number of calling classes, to identify symptoms of code scattering. Previous work in this area used text-based representations of code modules. Instead of static metrics, the input metrics used in this dissertation were dynamically collected, thereby reflecting the benchmark programs' dynamic behavior. The formulated metrics were then submitted to SOM for clustering. This dissertation applied a combination of the Moldovan and Serban (2006) and Kamiya et al. (2002) approaches, where a group of code fragments that are all clones of each other can be handled at the same time rather than in pairs. Another advantage is that, instead of using text-based comparisons with O(n²) processing time, the methodology in this dissertation uses SOM to cluster code fragments into clone classes, where code fragments that are structurally and functionally similar are grouped together in a cluster. A further advantage of the methodology used in this dissertation is that a substantial amount of processing time was saved, because the obtained clusters already contain much more information than mere textual similarity. The clustered code fragments provided a convenient way of mapping collections of similar code units back to the general structure of the source code of the investigated software.

So far, only one research effort, by Davey et al. (1995), is known to have used SOM, and for the purpose of clone detection only. The limitation of the Davey et al. (1995) approach is that the input vectors used were based on merely three static software factors, namely frequency of keywords, code line indentation, and the length of each line in the software modules being investigated. Arguably, the metrics used by Davey et al. (1995) were inadequate in the sense that they were statically collected and do not reflect the internal structure and dynamic behavior of the benchmark software. As far as is known, this is the first work that uses dynamic method-level metrics as input to SOM, where clustered results are mapped against the benchmark software being investigated in order to identify relationship patterns that may lead to the identification of crosscutting concerns and the mining of possible aspect candidates from the investigated software systems. According to Bauer (1996), one of the steps required to investigate a legacy software system is to understand how the software system works and how its structure is composed; to do so, it is necessary to capture its internal and behavioral structure, and to establish the interrelationships between the various components of the legacy system. The dynamic method-level metrics formulated and used in this project provided a means of measuring the run-time behavior of each of the methods in the benchmark programs, thereby reflecting the dynamic interaction of the software fragments from which they were derived. The contribution of the idea proposed in this dissertation lies in the unique hybrid approach in which dynamically collected metrics were supplied as input to SOM for clustering in an unsupervised manner.

Advantages, Similarities and Differences Relative to Previous Methodologies

The idea presented in this dissertation differs from other similar research efforts, such as Bruntink (2004), Shepherd and Pollock (2005), and Moldovan and Serban (2006), in the following ways:

1. Direct text comparison was not applied, as is the case with most vector space model approaches, improving performance over O(n²) processing time.

2. The use of SOM in the second phase of the project produced a set of clusters containing code representations that are not only textually related, but also syntactically and semantically related.

3. The use of the Self-Organizing Map (SOM) and the clustered clone classes is expected to perform better because it does not apply the traditional binary clone relations, as is the case with other similar research efforts. Moreover, SOM is an unsupervised clustering method.

4. At the initial stage, extracting the most frequently executed parts of the benchmark software being investigated (MFEM) helped focus effort on the portions of the software artifact that are most frequently executed, eliminating the less executed portions of the software from further scrutiny.

5. Mapping the clone classes against the entire benchmark program helped expose structural and dynamic behavioral patterns of the software, leading to the identification of possible aspect candidates.

6. Since the methodology compares numeric values (the vector components of the code fragments), it is time- and space-efficient compared to other methodologies where direct textual comparisons are required.

Chapter 3

Methodology/Approach

Biggerstaff et al. (1993) defined the concern location problem in software systems as the problem of discovering human-oriented concepts and assigning them to their realizations. To address the research questions and goals presented in Chapter 1, this dissertation presents an approach that targets extractable software features within the investigated benchmark software (through dynamic tracing), converting these to metric values that were then submitted to SOM for clustering. The resultant clusters were then mapped back to the test software to identify crosscutting concerns, leading to the mining of possible aspect candidates. The presence of scattered and tangled code, and of cloned code fragments, in large software systems results in higher maintenance costs and less modular systems. Various code clone detection methodologies have been presented in the research literature; Rieger (2005), for instance, suggested the use of source code partitioning, followed by code transformation, code classification, and clustering, to mine aspects. Moldovan and Serban (2006), for example, used a vector-space-model-based clustering approach to identify and mine aspects, and presented indicators of the presence of crosscutting concerns based on a large number of calling methods and a large number of calling classes. The dissertation applied three phases, namely:

1. Data Collection, Preprocessing and Metrics Formulation
2. Clustering Phase (classification and clustering of code clones)
3. Data Mining Phase (identification of crosscutting concerns)

In the first phase of the project, a preprocessing method was applied to filter the most executed portions of the benchmark software being investigated. The filtered portions of the benchmark software were then parsed and profiled, and metrics were collected dynamically; an AspectJ profiling and tracing tool was developed for this purpose. The collected data were then transformed into metrics and used as input vectors to a neural network clustering method called the Self-Organizing Map (SOM). This clustering method was used to classify and cluster the modules based on the given input vectors. The output from the second phase was then passed to the third phase for further processing, mapping, structure analysis, and pattern discovery. The diagram displayed in Figure 1 below summarizes the steps and phases of the approach applied in the dissertation.

Figure 1. Steps Used in Methodology

Phase-1. Benchmark Program Tracing and Data Collection

Obtaining Trace Events From Test Programs

Methods in Object-Oriented programming represent the modular unit by which programmers express well-defined abstractions of ideas and concepts. Deitel and Deitel (2003) defined methods in the object-oriented paradigm as self-contained units where distinct tasks are defined and implementation details reside, making software reusability possible. According to Giesecke (2006), methods are less complex than classes, are easier to compare, provide significant coverage and easy distinction, and have a high probability of informal reuse. Therefore, the granularity level of the code fragments considered in investigating the benchmark software is the method unit. Kontogiannis (1997) stated that the first step towards analyzing a software system is to represent the code at a higher level of abstraction. Six software metrics were calculated for each of the methods in the software being observed, with the aim of representing each of the software modules constituting the software being investigated. The data needed to compute the required metrics was collected dynamically using the tracing and profiling features available in AspectJ (AJDT). This allows the insertion of probes at specific points of the test program being investigated, providing the means to collect execution traces and runtime method interactions of the benchmark software under investigation. To make sure suitable and appropriate feature trace events were executed, every effort was made to ensure that the latent functionalities inherent in the benchmark software being tested were utilized, so that most, if not all, pertinent aspects of the software were exercised, thereby guaranteeing the invocation of all major test program components during execution.

Two benchmark test programs used in other Aspect Mining research efforts were selected for the purpose of collecting the data required in this dissertation. The first benchmark is an implementation of Dijkstra's algorithm attributed to Laffra (1996). This benchmark program was used by Tonella and Ceccato (2004) and Ceccato et al. (2005) as a base for Aspect Mining exercises. The second benchmark, JHotDraw (an open-source graphics application), was designed as a framework for graphical design. Both programs were designed and developed to represent and exemplify best programming practice. Ceccato et al. (2005) and Marin, van Deursen and Moonen (2004) suggested using the selected test programs as benchmarks for Aspect Mining exercises. Table 1 below summarizes the event traces obtained from the benchmark test programs investigated in this dissertation. The second column shows the number of raw traces for each test program, and the third column shows the number of unique event traces after preprocessing. Note that duplicated method calls were eliminated; that is, if a method is called from the same class more than once, all duplicated calls are discarded, retaining only one call as a representative. This ensures that metric values are not distorted by such duplication.

Table 1. Number of Event Traces, Raw and Filtered

Test Program | Raw count | Filtered count
Laffra's Dijkstra's Algorithm Implementation (LDA) | … | …
JHotDraw (JHD) | 105,… | …

Since the approach presented in this dissertation was intended to be a semi-automated Aspect Mining approach, the noise usually associated with collecting event trace data was not preprocessed. A good example was the JHotDraw execution session, in which mouse events were observed to generate a great deal of unnecessary noise having little or nothing to do with crosscutting concerns.

To preserve integrity and minimize human intervention, such noise was left intact. The encouraging result was that SOM was able to discriminate such noise by not including it in any of the clusters; the noise remained simply scattered entities on the SOM U-Matrix. To make sure every aspect of the JHotDraw test application was exercised, the list of activities in Table 2 below was used as a guideline.

Table 2. List of Functionalities Exercised in JHotDraw

1. Create: draw a square rectangle, a rounded rectangle, an ellipse, a straight line, and a wiggle (curve) line
2. Fill figures: select and fill the rounded rectangle, the square rectangle, and the ellipse with blue color
3. Join: join figures using arrows
4. Delete: delete drawn figures
5. Save: save the figures drawn
6. Print: print the drawn graphics
7. Change color: border, line, edges
8. Animation: animate (move the entire drawn graphics)
9. Undo an activity: undo an activity such as reverting a fill color

The screenshot shown in Figure 2 below depicts some of the graphic figures drawn when the JHotDraw benchmark program was exercised with the AspectJ trace program running in the background.

Figure 2. JHotDraw User Interface Screen Shot

Similarly, based on Laffra's implementation of Dijkstra's algorithm, which was designed and developed exclusively to solve the shortest path between any two nodes in a network, a series of network graph problems was solved and event trace data collected. Every possible user interaction with the graphical user interface was performed to make sure that event executions were properly traced. Figure 3 is a screenshot of the LDA user interface showing a graph network with nodes. On the right side of the screen are the six main functionalities embedded in the program; at the top of the screen, documentation is displayed to the user as a guide to the selected functionality.

Figure 3. LDA Screen Shot

Kellens, Mens and Tonella (2007) observed that all known dynamic Aspect Mining techniques are structural and behavioral and work at the method level. Since the granularity level used in this dissertation is the method level, the main target for tracing during the execution of the benchmark programs was the invocation of methods within classes and across class borders.

Metrics and Features Extracted From the Benchmark Software

Pressman (1997) described software metrics as a broad range of measurements applied to the software process with the intent of improving it. Software measures (metrics) are indicators describing the complexity of software products and processes. By their very nature, software metrics expose and describe a number of complex, high-dimensional data patterns that attempt to provide useful insight into the very nature of the software systems under investigation. Such exposure helps in investigating and quantifying key properties of the systems, such as reliability, readability, and maintainability. According to Aggarwal and Singh (2003), the behavior of a software system depends on its "-ilities", such as dependability, usability, and performance. According to Voas (2003), a new generation of software engineers and researchers is realizing that software quality is a behavioral trait and not simply a static one. Discussing the best way to represent dynamic features of software systems, Aggarwal and Singh (2003) suggested targeting points around the "-ilities" and presented a metric to support this view. The strength and validity of the proposed project depends on the quality of the software metrics collected and used as input vector components. Davey et al. (1995) emphasized that, for dynamic metrics to be useful, they should provide a concise yet informative summary of different aspects of the dynamic behavior of programs, and be able to differentiate between programs with different behaviors. Weyuker's (1988) characteristics (widely respected and accepted in the software engineering discipline) were used as a guideline to make sure that appropriate metrics were selected and properly formulated.

Targeted Features

When representing program behavior quantitatively, the usual problem faced is that of matching an abstract feature with a concrete implementation. Wong, Gokhale and Horgan (1999) observed that since a feature is an abstract description of a given specification, and a component is a concrete element, it is always a problem to make a connection and association between the two; they opined that carefully defined metrics are necessary to obtain a quantitative measure of the interactions between components and features within a software system. In selecting the dynamic metrics used in this dissertation, emphasis was placed on metrics capable of providing the required discriminative and behavioral characteristics of the methods, as well as capturing how the code fragments (methods) in the investigated software modules participate in, and lend themselves to, code clone infestation, code scattering, and code tangling. The key features targeted for collection from the investigated legacy benchmark software include the following:

1. Coupling between components of the software system (Fan-In/Fan-Out)
2. Information Flow
3. Similarity and dissimilarity between methods (method signatures and fingerprints)
4. Level of interaction between a method and other classes (Method Spread)
5. Method Internal Coupling (MIP): a measure of how methods are coupled internally
6. Method External Coupling (MEC): a measure of how methods are coupled with other external units
7. Method Cohesion Contribution (MCC): a measure of how cohesive a method in a class is towards implementing a feature
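Once per-method measurements of this kind have been collected, assembling a SOM input vector is a matter of laying the metric values out in a fixed order. The sketch below is hypothetical: the record keys and the particular choice of six components are illustrative, not the dissertation's exact formulation.

```python
def method_vector(rec):
    """Assemble one fixed-order input vector for a method record.
    rec: per-method measurements gathered from the execution trace
    (illustrative keys, not the dissertation's tool)."""
    return [
        rec["fan_in"],
        rec["fan_out"],
        rec["fan_in"] * rec["fan_out"],  # Information Flow
        rec["signature"],                # numeric method-signature value
        rec["spread"],                   # Method Spread (MSP)
        rec["cohesion"],                 # Method Cohesion Contribution (MCC)
    ]
```

Keeping the component order fixed across all methods is what makes the resulting vectors comparable under the SOM's distance computation.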

Phase-2 Vector Component Representation

Vector Component Metric 1. Dynamic Fan-In/Fan-Out (FI/FO)

Yuying, Qingshan, Ping and Chude (2005) defined the dynamic Fan-In metric Fan_In(s, m_i) as the count of the number of times method m_i in a piece of software is invoked by other methods in an execution scenario. Marin et al. (2004) defined the Fan-In metric for a method as the number of distinct method bodies that can invoke the method. In an experiment presented by Marin et al. (2004), one-third of the methods found with high Fan-In counts were seeds leading to aspects. Fan_Out(s, m_i), on the other hand, indicates the number of times method m_i invokes other methods in the execution of scenarios. FO is usually used to measure coupling between components of software systems. Note that, in their original form, both the FI and FO metrics reflect structural dependency and are based on static analysis of source code. To support the use of these two metrics in understanding software system structure and behavior, Yuying et al. (2005) hypothesized that methods with high Fan-In or Fan-Out values are, with high probability, the ones that implement the main system functions.

Vector Component Metric 2. Information Flow (IF)

Information flow is defined as the product of the dynamic Fan-In and Fan-Out metrics. This metric is formally defined as

Information Flow (IF) = Fan-In * Fan-Out

Note that the components required to derive this metric come from vector component metric 1 described above. Yuying et al. (2005) stated that Fan-In helps in identifying classes with high reusability, that methods with higher Fan-In are usually aspect candidates, and that software design patterns are also found to have high Fan-In values.
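To make these two metrics concrete, the sketch below computes dynamic Fan-In, Fan-Out and Information Flow from a recorded list of caller/callee pairs. The trace and method names are invented for illustration and are not taken from the benchmark programs.

```python
from collections import defaultdict

def fan_metrics(call_trace):
    """Compute per-method dynamic Fan-In, Fan-Out and Information Flow.

    call_trace is a list of (caller, callee) pairs recorded during an
    execution scenario.
    """
    fan_in = defaultdict(int)   # times a method was invoked by others
    fan_out = defaultdict(int)  # times a method invoked other methods
    for caller, callee in call_trace:
        fan_out[caller] += 1
        fan_in[callee] += 1
    methods = set(fan_in) | set(fan_out)
    # Information Flow (IF) = Fan-In * Fan-Out, as defined above.
    return {m: (fan_in[m], fan_out[m], fan_in[m] * fan_out[m]) for m in methods}

trace = [("main", "lock"), ("main", "draw"), ("draw", "lock"),
         ("draw", "unlock"), ("main", "unlock")]
metrics = fan_metrics(trace)
```

A method such as `draw` above, which is both called and calls others, is the only one with a non-zero Information Flow, matching the intent of IF as a combined coupling indicator.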

Vector Component Metric 3. Method Signature (MSig)

A common definition of a method signature in software engineering consists of the method name, the parameter types and their order. The visibility modifier, return type and exception throws are not considered. According to Kim, Pan and Whitehead (2005), since a method signature does not change frequently, signature patterns can be used in deciding similarity between methods. Sample method signatures are shown in Figure 4.

public Object myMethod(String param1, Object param2) throws Exception;
public void myMethod(Object param2, String param1);

Figure 4. Sample Java Method Signatures (Different but Similar)

According to Bucci, Fioravanti, Nesi and Perlini (1998), the set of parameters of a method represents a cognitive measure of its complexity. The principal notion behind the method fingerprint is similar to Rabin's (1981) premise that if two fingerprints are different, then the corresponding objects are different, and that there is only a small probability that two different objects have the same fingerprint. In this project, an aggregate sum of the equivalent ASCII code of each character in the method name and its parameters was used to represent the method signature. It was realized that the signature values obtained this way were more discriminatory than mere Boolean 0/1 values.

Vector Component Metric 4. Method Spread (MSP)

Borrowing from Lai and Murphy's (1999) spread metric, we replace feature in their definition with class method. This metric measures the level of interaction between the method and other classes. The lower the method spread value, the lower the interaction of the method with classes in the software system. MSP is formally defined as follows:

MSP_i = (number of classes from which the method is called) / (total number of classes in the subject software)

Figure 5. Method Spread
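The ASCII-sum fingerprint and the MSP ratio can be sketched as follows; the helper names are ours, not the dissertation's tooling. Note that a plain character-code sum is insensitive to parameter order, so the two reordered signatures in Figure 4 map to the same fingerprint value.

```python
def signature_fingerprint(method_name, param_types):
    """Aggregate sum of the ASCII code of each character in the method
    name and its parameter types (the MSig vector component)."""
    return sum(ord(ch) for ch in method_name + "".join(param_types))

def method_spread(num_calling_classes, total_classes):
    """MSP: number of classes from which the method is called, divided
    by the total number of classes in the subject software."""
    return num_calling_classes / total_classes

# The two "different but similar" signatures of Figure 4:
a = signature_fingerprint("myMethod", ["String", "Object"])
b = signature_fingerprint("myMethod", ["Object", "String"])
```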

Vector Component Metric 5. Method Internal Coupling (MIC)

Coupling is used to measure how dependent a software unit is on other units in the software system. It is a software engineering requirement that coupling between units in a software system should be low; in this project, however, coupling was used to help differentiate between module units. Although coupling is traditionally calculated at the class level, method coupling can reveal a lot of information about a method in a software system. Joshi and Joshi (2006) defined Relative Method Coupling (RMC) as a measure of how coupled a method or an attribute is with its owner class. In line with this, a new method coupling metric, Method Internal Coupling (MIC), is introduced as follows:

MIC = L / (L + E)

where
L = count of all local calls to a method
E = count of all external calls to a method

Figure 6. Method Internal Coupling

Vector Component Metric 6. Method External Coupling (MEC)

Joshi and Joshi (2006) defined the Relative Inward Coupling metric (RIC), which measures the external usage of a given attribute or method relative to its internal usage. Similarly, we introduce Method External Coupling (MEC), which measures how coupled a method is to classes outside the method's home class. This metric tells us whether the method is used more by external classes or more within its home class. MEC is defined as follows:

MEC = E / (L + E)

where
L = count of all local calls to a method
E = count of all external calls to a method

Figure 7. Method External Coupling
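A minimal sketch of the two coupling ratios, assuming the local and external call counts have already been extracted from the dynamic trace:

```python
def mic(local_calls, external_calls):
    """Method Internal Coupling: fraction of calls to the method that
    originate in its home class."""
    return local_calls / (local_calls + external_calls)

def mec(local_calls, external_calls):
    """Method External Coupling: fraction of calls that originate in
    other classes. MIC and MEC sum to 1 for any called method."""
    return external_calls / (local_calls + external_calls)
```

Because the two ratios share the denominator L + E, a single pair of counts fully determines both metrics.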

Vector Component Metric 7. Method Cohesion Contribution (MCC)

Cohesion Among Methods in a Class (CAMC), attributed to Bansiya, Etzkorn and Li (1998), measures the extent of interactions between an individual method's parameter type list and the parameter lists of all methods in its home class. CAMC is formally defined as follows:

CAMC = (Σ_{i=1..N} |P_i|) / (N * |T|)

where
M_i = set of parameter object types of method i
T = union of the method parameter type lists P_m1 ... P_mN
P_i = M_i ∩ T
N = number of methods in the method's home class

Figure 8. Methods Cohesion Contribution (MCC)

Method Cohesion Contribution (MCC) is a derivative of CAMC. Instead of the summation of unique parameters over all methods in the denominator, MCC considers an individual method's cohesion contribution to its home class. MCC is defined as follows:

MCC = |P_i| / |T|

where P_i and T are as defined in Figure 8 above. The intersection between a method's signature and the signatures of all methods in its home class is used to derive this metric.

SOM Data Input Structure

The method-level metrics collected were used as vector components. Implicitly, the combination of these components is a representation of the structural and behavioral nature of the methods. The vector components also served as unique characteristic identifiers for each method, and therefore as a good representation and solid base for similarity comparison between methods in the software being investigated. Each method in the target benchmark software had an associated set of attributes represented as numeric values. The derived vectors were used to uniquely qualify and identify each of the methods in the software being investigated.
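The following sketch computes CAMC and a per-method MCC from parameter type sets. It follows one plausible reading of the definitions (P_i = M_i ∩ T, with MCC = |P_i| / |T| for a single method), so treat it as an interpretation rather than the dissertation's exact implementation.

```python
def camc(param_type_sets):
    """CAMC: average of |P_i| over the N methods of a class, scaled by
    |T|, where T is the union of all parameter type sets."""
    T = set().union(*param_type_sets)
    if not T:
        return 0.0
    n = len(param_type_sets)
    return sum(len(m & T) for m in param_type_sets) / (n * len(T))

def mcc(method_param_types, class_param_type_sets):
    """MCC for one method: its individual cohesion contribution."""
    T = set().union(*class_param_type_sets)
    if not T:
        return 0.0
    return len(set(method_param_types) & T) / len(T)
```

For a class whose two methods take parameter types {int} and {int, str}, T = {int, str}; the first method contributes 1/2 and the second 2/2, so CAMC averages to 0.75.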

The assumption is that similar code fragments should have similar representational vectors. Similarity was determined based on how close the source code vector representations are in Euclidean space. The vector representation of method m_i is as follows:

v_i = {a_i1, a_i2, ..., a_in}

where
a_i1 = first attribute (vector component) of method i, and so on
i = row number of the vector representation
n = number of attributes (the six metric measures in our case)

Figure 9. Definition of Vector Components

The set of vectors v_1, v_2, ..., v_m represents the metrics formulated from the dynamic execution of the test programs. A data matrix of dimension (m by n) was constructed in which each row represents a vector (data about a method) and each column represents an individual vector component (i.e., a software metric). Figure 10 shows a sample data set, listing rows for the methods lock and unlock of home class GraphAlgorithm and action of home class Options, under the columns Method Name, HomeClass, MethodSig, Information Flow, MethodSpread, MIC, MEC and MCC. For the detailed vector representation, see Appendix D.

Figure 10. Data Input Layout for SOM

The constructed m by n matrix was then submitted as input to the MatLab SOM toolbox, from which data is collected and clusters are graphically represented as a unified distance matrix (U-Matrix). The visual representation makes it easy to visualize distances between the neurons and how vectors are clustered. In the background, the SOM toolbox provides the data related to the clustered vectors that are required at the mapping and Aspect Mining steps.
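The matrix construction just described can be sketched as stacking one metric row per method. The numeric values below are invented purely to show the (m by n) layout; they are not measurements from the benchmarks.

```python
metric_names = ["MethodSig", "InformationFlow", "MethodSpread",
                "MIC", "MEC", "MCC"]

# (method, home class) -> metric row; values are illustrative only.
method_metrics = {
    ("lock", "GraphAlgorithm"):   [842.0, 2.0, 0.33, 0.75, 0.25, 0.50],
    ("unlock", "GraphAlgorithm"): [957.0, 0.0, 0.17, 0.60, 0.40, 0.50],
    ("action", "Options"):        [623.0, 4.0, 0.50, 0.20, 0.80, 0.25],
}

row_labels = sorted(method_metrics)               # fixed row order
matrix = [method_metrics[k] for k in row_labels]  # the m-by-n SOM input
```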

The Essence of Clustering

The key idea behind clustering is to group similar things into clusters, such that similarity (cohesion) among the members of a cluster is high and inter-cluster similarity (coupling) is low. Coupling has a great impact on many quality attributes, such as maintainability, verifiability, flexibility, portability, reusability, interoperability and expandability. Lung (1998) observed that the main objective of clustering is similar to that of software partitioning, and stated that clustering techniques can be applied to software systems at different life-cycle phases. According to Lung (1998), clustering techniques have been used in many disciplines to support the grouping of similar objects of a system. Clustering analysis is one of the most fundamental techniques adopted in science and engineering; examples of its use in knowledge structure discovery include the study of botanic species and genetic engineering. Clustering techniques can effectively support both software architecture partitioning in the early phase of the forward engineering process and software architecture recovery of legacy systems in the reverse engineering process. Essentially, the primary objective of clustering analysis is to identify distinct groups within a dataset and to place data objects within those groups according to their relationships with each other, facilitating better understanding of the observations and the subsequent construction of complex knowledge structures from the features of the objects within the derived clusters.

Clustering Code Fragments Using SOM

Sambasivam and Theodosopoulos (2006) defined clustering as the classification of objects with similarities into different groups; this is accomplished by partitioning data into groups known as clusters, such that the elements in each cluster share some common trait, usually proximity according to a defined distance measure. Lung (1998) observed that clustering techniques have been used in many disciplines to support the grouping of similar objects of a system.

The Self-Organizing Map (SOM), attributed to Kohonen (1998), is an unsupervised and effective data visualization technique that reduces the high dimensionality of data through the use of neural networks. SOM converts non-linear statistical relationships between high-dimensional data into simple geometric relationships of their image points, usually displayed as a two-dimensional grid of nodes. SOM classifies input vectors according to similarity, preserving the topology of the input vectors by assigning nearby vectors to nearby categories, thereby organizing and clustering sample data so that, in the final result, samples are usually surrounded by other samples with similar characteristics. The use of SOM helped produce a set of clone classes, as defined by Kamiya, Kusumoto and Inoue (2002), from which code fragment similarities can be collected and analyzed.

Once the dynamic metrics were collected from each method in the benchmark software, the derived values were encoded as vectors. It was observed that similar blocks of code have similar representational vectors. By including these method-level dynamic metrics as vector components, we inherently establish an identity, signature and profile for each method in the software artifact being investigated.

How SOM Performs Data Clustering

A Self-Organizing Map organizes neurons in a two-dimensional grid representing the feature space. The SOM neural network structure consists of two layers of neurons, as shown in Figure 11 below. Each neuron in the input layer represents an input variable with a weighted connection to each neuron of the output layer. During the iterations of the training step, the weighted connections adapt and change. The first layer receives the input and transfers it to the second layer. Each neuron of the second layer has its own weight vector whose dimension is equal to the dimension of the input layer. Neurons are connected to adjacent neurons by a topological neighborhood function, which dictates the topology of the map.

Figure 11. SOM Architecture, Culled from Vesanto and Alhoniemi (2000)

In the second layer, the weight vectors of the neurons are initially set to random values. After that, an input vector from the set of learning vectors is selected and presented to the input of the neural network. The differences between the input vector x and all neuron weight vectors are calculated as distances

D_ij = ||x - w_ij||

where i and j are the indices of neurons in the output layer, and the winner satisfies D(k_1, k_2) = min_ij D_ij.

A neuron whose weight vector is closest to the input vector is chosen as the winner neuron; k_1 and k_2 are the indices of the winner neuron. Corrections and adjustments of the weight vectors of the winner and all adjacent neurons are then made. The neighborhood of neurons is determined by the topological neighborhood function

h(ρ, t) = exp(-ρ² / 2σ²(t))

where ρ is the grid distance to the winner neuron and σ is a function dictating the extent of the neighborhood. In the beginning almost the whole space of the grid is involved, but with time the value of σ decreases. The function equals 1 when ρ is equal to zero. After calculating the topological neighborhood function for each neuron, the weights of all neurons are updated using the function

w_ij(t+1) = w_ij(t) + α(t) h(ρ, t) (x(t) - w_ij(t))

where α(t) is a learning rate function that also decreases with time. If a neuron is the winner or adjacent to the winner, its weight vector is updated; otherwise it remains unchanged. On each step, the neural network determines the neuron whose weight vector is the most similar to the input vector and corrects its and its neighbors' weight vectors to make them closer to the input vector, as shown in Figure 12 below. Note the solid and dashed lines, which represent the situations before and after updating and adjusting the weight vectors.

Figure 12. Updating the Winner Neuron, Known as the Best Matching Unit (BMU), and Its Neighbors Towards the Input Vector Marked With X

Each input vector from the training set is presented to the neural network, and learning continues until either some specified number of cycles is reached or the difference between the input and weight vectors reaches some threshold. The difference between adjacent neurons decreases with time, resulting in organized groups of clusters.

Steps of the SOM Training Algorithm

Each neuron of the input layer represents an input variable (the six metrics), with a weighted connection to each node of the output layer. During training, the connection weights change adaptively with each iteration of the training steps presented below:

1. The input to SOM is a dataset of vectors {X(t)}, each consisting of N variables: X(t) = {x_1(t), x_2(t), ..., x_N(t)}.
2. The codebook vectors W = {w_ij : i = 1, 2, ..., n} of each neuron j are initialized with random numbers in the interval [0, 1].
3. At iteration t, the input vector x(t) is compared with all the SOM neuron weights using a distance measure such as the squared Euclidean distance: d_j(t) = Σ_i (x_i(t) - w_ij(t))².
4. The neuron with the shortest distance to the input vector, known as the Best Matching Unit (BMU), is declared the winner.
5. The weights of the BMU and its neighboring neurons are then updated in order to reduce the distance between them and the input vector, using the weight function W_ij(t+1) = W_ij(t) + η(t) N(t, r) (x_i(t) - w_ij(t)), where η(t) is the fractional increment of the correction and N(t, r) is the time-varying neighborhood function that determines the radius around the BMU, which is gradually reduced until convergence is achieved.
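The steps above can be sketched as a minimal SOM trainer in plain Python (no toolbox). The linear decay schedules for the learning rate and neighborhood width are our simplifying assumptions, not the MatLab SOM toolbox defaults.

```python
import math
import random

def train_som(data, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Minimal SOM: random codebook init, BMU search by squared
    Euclidean distance, Gaussian neighborhood, decaying learning rate."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    # Step 2: codebook vectors initialized with random numbers in [0, 1].
    w = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
         for _ in range(rows)]
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1.0 - frac)              # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5  # shrinking neighborhood
        x = data[rng.randrange(len(data))]
        bi, bj = best_matching_unit(w, x)    # steps 3-4
        # Step 5: pull the BMU and its neighbors toward the input.
        for i in range(rows):
            for j in range(cols):
                rho2 = (i - bi) ** 2 + (j - bj) ** 2
                h = math.exp(-rho2 / (2.0 * sigma * sigma))
                w[i][j] = [wv + lr * h * (xv - wv)
                           for wv, xv in zip(w[i][j], x)]
    return w

def best_matching_unit(w, x):
    """Grid index of the neuron whose weight vector is closest to x."""
    return min(((i, j) for i in range(len(w)) for j in range(len(w[0]))),
               key=lambda ij: sum((a - b) ** 2
                                  for a, b in zip(x, w[ij[0]][ij[1]])))

# Two tight groups of 2-D vectors should settle on different map regions.
data = [[0.0, 0.0], [0.05, 0.05], [1.0, 1.0], [0.95, 0.95]]
som = train_som(data, rows=2, cols=4)
```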

SOM Result Visualization

Once the SOM has converged and clusters have formed, the relevant information about the formed clusters is stored in the SOM codebook vectors. Through the available visualization features and tools, the stored information can be extracted and displayed in several ways.

U-Matrix Presentation and Interpretation

The U-matrix is used to present visual information showing distances between elements on the map so as to reveal the clusters present in the processed data, allowing the cluster structure to be displayed in the equivalent 2D lattice, either in gray shades or in color levels depicting the mean distance of each unit to its closest neighbors.

Figure 13. U-matrix Representation of the Self-Organizing Map

In the U-matrix presented in Figure 13, the blue hexagons represent the neurons, and the red lines are the connections between neighboring neurons. The colors in the regions containing the red lines indicate the distances between neurons, with darker colors representing larger distances and thus a gap between the values in the input space. Lighter colors represent smaller distances, indicating that the vectors are close to each other in the input space.

In the U-matrix representation, light areas can be thought of as clusters and dark areas as cluster separators. The size of the SOM map (the number of output neuron units) has a strong influence on the quality of the obtained clusters. If the selected map size is too small, important differences present in the data may be missed; on the other hand, if the map size is too large, the differences in the data may be too difficult to detect. For this reason, Eaddy (2008) opined that the default number of neurons for an optimal U-matrix is calculated using the following heuristic formula:

Optimal number of neurons = 5 * √N

where N = number of samples in the training data (i.e., observations). Another important consideration is that the output map should not have a square form (n by n); rather, one side of the map should be bigger than the other in a 1:2 ratio. That is, if m and n represent the dimension factors for a map, then m >= 2n or n >= 2m, where m ≠ n.

Representing SOM results in a U-matrix offers a fast way to get insight into the data distribution. This can be a helpful presentation tool when one tries to find clusters in the input data without having any a priori information about the clusters. In addition to the graphical representation, some tools present tables summarizing cluster statistics, such as the number of items in each cluster, mean and variation, and inter- and intra-cluster distances, for further analysis.

Component planes. Component planes display the behavior of a given input variable (vector component) along the whole data set. These maps show the value of a given input feature for each SOM unit in the 2D lattice, and help in comparing the correlation or behavioral similarity between different input variables.
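Assuming the heuristic is 5·√N target units arranged on a roughly 1:2 rectangle, a helper for picking the map size might look like this; the rounding scheme is our assumption.

```python
import math

def som_grid_size(n_samples):
    """Pick a map of roughly 5 * sqrt(N) neurons with the longer side
    twice the shorter one (non-square map, 1:2 ratio)."""
    target_units = 5 * math.sqrt(n_samples)
    short = max(1, round(math.sqrt(target_units / 2.0)))
    return short, 2 * short   # (rows, cols), cols = 2 * rows
```

For example, 153 methods (the size of the Dijkstra benchmark) yields a target of about 62 units, rounded up to a 6-by-12 grid.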

SOM Versus Other Clustering Techniques

In support of using SOM to cluster data, Azuaje (2002) argued that SOM exhibits significant advantages compared to other known clustering methods such as k-means and agglomerative hierarchical clustering algorithms. Levine, Davidson and Westover (2004) also observed that SOM has advantages as well as drawbacks when compared to principal component plots and hierarchical clustering methods. Some of the advantages SOM has over other clustering techniques include the following:

1. SOM is relatively easy to implement and evaluate, computationally inexpensive and scalable, and the input data is preserved as faithfully as possible.
2. SOM is able to reduce data in an unsupervised way, allowing the resulting homogeneous clusters to be represented by a symbolic formalism.
3. Unlike hierarchical clustering, SOM facilitates automatic detection and inspection of clusters.

Azuaje (2002) further argued that, unlike Bayesian-based clustering, SOM does not require prior knowledge about the data, and that, compared to the k-means clustering algorithm, SOM exemplifies a robust and structured classification process. Vesanto and Alhoniemi (2000) performed an experiment in which agglomerative and k-means clustering algorithms and SOM were used; the results indicated that SOM clustering is the computationally more efficient approach. In another experiment, Mangiameli, Chen and West (1996) used SOM to demonstrate that it is superior to hierarchical clustering methods. The performance of SOM and seven hierarchical clustering methods was tested on 252 data sets with various levels of imperfection, including data dispersion, outliers, irrelevant variables and non-uniform cluster densities.

Mangiameli and Chen (1996) observed that the superior accuracy and robustness of SOM could improve the effectiveness of decisions and research based on clustering messy empirical data. Manninen, Pirkola and Heiniemi (1999) articulated the superiority of SOM by stating that, because SOM does not require supervision and is non-parametric (no assumption about the distribution of the data needs to be made beforehand), SOM may even find unexpected hidden structures in the data being investigated.

Benchmark/Test Source Programs

In order to provide a basis for testing the methodology used and to determine whether the adopted methodology works, medium- to large-scale software packages were selected as the benchmark test suite. Ceccato et al. (2005) and Marin, van Deursen and Moonen (2004) suggested JHotDraw (an open source graphics program) and PetStore (a simulated online store program) as good Aspect Mining validation benchmarks. Also, Revelle, Broadbent and Coppit (2005) suggested Minesweeper (a simple game) and Sort.c (a sort utility program) as good validation benchmarks. Shepherd, Palm and Pollock (2005) and Roy et al. (2007) also used JHotDraw and Laffra's Dijkstra's algorithm implementation as a benchmark test suite. Since these two tested software programs have been used in many research works, this dissertation also adopted the same set as a benchmark, and the associated research works as bases for validation and comparison. Table 3 below lists some of the characteristics of the selected benchmark test programs.

Table 3. Characteristics of Benchmark Programs

Benchmark Name                                  LOC     Classes  Methods  Motivation
JHotDraw 5.41b                                  11,…    …        …,800    Roy et al. (2007)
Laffra's Dijkstra's Algorithm Implementation    1,…     6        153      Laffra (1996)

Notes on Laffra's Dijkstra's Algorithm Implementation (LDA)

Laffra's implementation of Dijkstra's algorithm is a Java application developed to solve the shortest path problem. It is one of the benchmarks frequently used in Aspect Mining exercises. The application consists of 6 classes and a total of 153 methods. Since it is a small application, it is an interesting benchmark for study. The user can place and connect nodes, adjust distances between nodes, or reset to start all over. The application can be run in step mode or with default settings to observe the calculations.

Notes on JHotDraw (JHD)

JHotDraw version 5.4b1 is a GUI framework for technical and structured graphics. It is an application for two-dimensional graphics and a strongly typed object-oriented framework written in Java as an exercise in developing software to showcase the use of design patterns (Gamma et al. 1994). Its design relies heavily on some well-known design patterns. The key contracts of the framework are defined in interfaces, and the key behavior is implemented in abstract classes and default classes. The framework is fine grained, allowing for the creation of different kinds of drawing editors, ranging from editors with a strong focus on visual design to engineering and simulation tools with a strong focus on structure and behavior.

Each of the benchmark programs was traced and profiled dynamically, data were collected, and the relevant metrics were formulated. The derived metrics were then submitted as vector input to the MatLab SOM toolbox for clustering. The clustered results collected from SOM were then used as input to the third phase of the project (pattern discovery) for analysis, concern pattern matching, and the eventual identification of concerns leading to the mining of possible aspect candidates.

Phase-3. Structure Discovery, Concern Isolation, Pattern Identification

After Phase 2 of the project was completed, the clusters created by SOM were investigated in an attempt to isolate and identify crosscutting concerns in the benchmark test programs. Most existing Aspect Mining methodologies seem to focus only on code scattering as the symptom for identifying candidate aspects. According to Moldovan and Serban (2006), crosscutting concerns in object-oriented systems have two symptoms, code tangling and code scattering, and code scattering can be identified by observing a large number of calls to a method from other methods and classes. To make sure that all possible aspects in the benchmark software are detected, the methodology applied in this dissertation used a mapping strategy that takes each of the clusters produced by SOM and maps it back to the software system being investigated, to determine whether the detected aspect candidate seeds are cases of code scattering or of tangling.

It should be noted that code fragments in a cluster returned by the Self-Organizing Map (SOM), known as a clone class, are assumed to be similar and closely related. According to Kothari, Denton, Mancoridis and Shokoufandeh (2006), features in a cluster with similar functionality also share similarities. Each of the clone class clusters created by the SOM was analyzed and investigated. A heuristic methodology similar to, but different from, that of Gybels, Tonella and Kellens (2007) was applied to map each cluster to the entire software being investigated in order to find, identify and establish any recurring patterns that may point at code scattering and/or code tangling. Methods from the same cluster that map to different classes indicate symptoms of scattering, while methods from different clusters mapping to the same class indicate tangling.
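The mapping rule in the last sentence can be made concrete with a small sketch: given each method's cluster assignment and home class (the data below is hypothetical), clusters whose members span several classes are flagged as scattering symptoms, and classes hit by several clusters as tangling symptoms.

```python
from collections import defaultdict

def map_clusters_to_classes(cluster_of, class_of):
    """Return (scattered clusters, tangled classes) from two maps:
    method -> cluster id and method -> home class."""
    classes_per_cluster = defaultdict(set)
    clusters_per_class = defaultdict(set)
    for method, cluster in cluster_of.items():
        home = class_of[method]
        classes_per_cluster[cluster].add(home)
        clusters_per_class[home].add(cluster)
    # One cluster spread over several classes -> scattering symptom.
    scattered = {c for c, s in classes_per_cluster.items() if len(s) > 1}
    # Several clusters landing in one class -> tangling symptom.
    tangled = {k for k, s in clusters_per_class.items() if len(s) > 1}
    return scattered, tangled

cluster_of = {"lock": 0, "unlock": 0, "draw": 1, "save": 1}
class_of = {"lock": "Editor", "unlock": "Storage",
            "draw": "Editor", "save": "Storage"}
scattered, tangled = map_clusters_to_classes(cluster_of, class_of)
```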

Identification of Aspect Candidates

According to He, Bai, Zhang and Hu (2005), the behavior of object-oriented systems is identified by the invocation relationships between methods; if there is a group of code units that have similar behavior (i.e., a similar called-method sequence) and that appears frequently in execution traces, then a crosscutting concern exists. Finding code features that implement common concerns among code fragments requires developing a heuristic that provides a systematic way of pinpointing and highlighting the presence of crosscutting concerns in the source code of a software system. He, Bai, Zhang and Hu (2005) also indicated that a high frequency of calls to the same methods from within different modules is an indication of a hidden crosscutting concern. According to Breu (2004), recurring execution patterns represent certain behavioral aspects of a software system. In light of these ideas, the method used to identify patterns of crosscutting concerns in the benchmark software was based on the guidelines presented by Eaddy, Aho and Murphy (2007), presented below.

Anatomy of Concerns in Software Systems

Concerns in the object-oriented paradigm (OOP) have complex relationships, dependencies and interactions occurring not only at predefined points but also at arbitrary points within the code. Furthermore, concerns in OOP are not entirely orthogonal; in fact, many of them overlap in the sense that they share common fragments. Trifu and Kuttruff (2005) articulated the problem of code tangling by stating that, due to the non-orthogonal nature of concerns, a class in OOP rarely incorporates a single concern; instead, it contains tightly tangled fragments of several concerns.

Definition: Scattering

A concern in a software system is scattered if it is related to multiple target elements. If the implementation of some concern is not well modularized but cuts across the decomposition hierarchy of the system, then a crosscutting concern exists. Scattered code belongs to one concern but is distributed through many programming modules. Trifu and Kuttruff (2005) defined scattering and tangling by stating that if a software unit is defined as m_i ∈ M, where M = {m_1, m_2, ..., m_n}, then m_i is said to be localized with respect to M if it is related to one and only one element of M. Equally, the software unit m_i of M is scattered with respect to M if it is related to multiple elements of M. In other words, a concern is scattered if multiple units within a software system implement it. A formal definition by Conejero, Hernández, Jurado and van den Berg (2007) states that an element s ∈ Source is scattered if card(f(s)) > 1. In the same token, Eaddy, Aho and Murphy (2005) defined scattering as a situation where a concern is related to more than one target item, that is, (C, t_i) ∈ R and (C, t_j) ∈ R, with i ≠ j.

Definition: Tangling

A concern is tangled if both it (the concern itself) and at least one other concern are related to the same target element. When code is tangled, a module may contain implementation elements (code) for various concerns. Tangled code belongs to different concerns but is contained in one programming module unit. Eaddy, Aho and Murphy (2006) observed that tangling occurs when the concern itself and one other concern are related to the same target item, based on the condition (C_m, t) ∈ R and (C_n, t) ∈ R and C_m ≠ C_n. Conejero et al. (2007) also presented a formal definition by stating that tangling occurs when a target element is related to multiple source elements; that is, an element t ∈ Target is tangled if card(g(t)) > 1.

Definition: Crosscutting Concern

Conejero et al. defined a crosscutting concern by stating that, if we let s_1, s_2 ∈ Source with s_1 ≠ s_2, then s_1 crosscuts s_2 (s_1 cc s_2) if the following conditions are met:

1. card(f(s_1)) > 1
2. ∃ t ∈ f(s_1): s_2 ∈ g(t)

In this definition, it is not required that the second source element (s_2) is scattered. Equivalently, if s_1, s_2 ∈ Source and s_1 ≠ s_2, then s_1 crosscuts s_2 if and only if card(f(s_1)) > 1 and f(s_1) ∩ f(s_2) ≠ ∅. Diagrammatically, a crosscutting concern is shown in Figure 15.

Definition: Concern Decomposition

Laddad (2004) stated that a typical software system may consist of several kinds of concerns, including business logic, performance, data persistence, logging and debugging, authentication, security, multithread safety, error checking, and so on. In defining concern decomposition, Laddad (2004) used a prism analogy in which a requirements light beam is passed through a concern-identifier prism, which separates out each concern.

Figure 14. Concern Decomposition: The Prism Analogy

Symbolic Definition of Crosscutting

The concept of crosscutting has been presented and discussed in the Aspect Mining literature. Conejero et al. (2007) defined crosscutting as one thing with respect to another thing, which mathematically means that two different domains are related to each other through a mapping. If three concerns and four requirements (use cases) are considered, then, based on the mapping shown in Figure 15 below, it can be deduced that s_1 crosscuts s_3. Note that source and target are the two domains.

Figure 15. Relation Between Source and Target Elements

According to Masuhara and Kiczales' definition, if methods M_x and M_y in an object-oriented system are considered as programming units, then M_x crosscuts M_y with respect to Z if and only if their projections onto Z intersect and neither of the projections is a subset of the other. That is, crosscutting exists between M_x and M_y if the following hold:

1. f(M_x) ∩ f(M_y) ≠ ∅
2. f(M_x) ⊄ f(M_y)
3. f(M_y) ⊄ f(M_x)
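Both formal definitions above reduce to simple set tests. The sketch below implements the Conejero-style predicate and the intersection/subset conditions, using a toy source-to-target mapping f that is entirely made up for illustration.

```python
def crosscuts_conejero(f, s1, s2):
    """s1 crosscuts s2 iff s1 is scattered (card(f(s1)) > 1) and the
    two source elements share at least one target."""
    return len(f[s1]) > 1 and bool(f[s1] & f[s2])

def crosscuts_projection(f, mx, my):
    """Projection-style test: projections intersect and neither is a
    subset of the other."""
    return bool(f[mx] & f[my]) and not f[mx] <= f[my] and not f[my] <= f[mx]

# Hypothetical mapping from concerns to the target elements they touch.
f = {"logging": {"t1", "t2", "t3"},
     "drawing": {"t2", "t4"},
     "undo": {"t4"}}
```

Note the asymmetry of the first predicate: logging crosscuts drawing (it is scattered and shares t2), but undo does not crosscut drawing because undo touches only one target.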

52 Formal Model for Clustering-Based Aspect Mining

Moldovan and Serban (2006) presented a formal definition of clustering-based Aspect Mining as follows. Let a software system be represented as M = {m1, m2, m3, ..., mn}, where each mi, 1 ≤ i ≤ n, is a method of the system (mi could equally be a statement, a class, or a module of the software system). The number of methods in the system is then n = |M|, and the number of methods in a crosscutting concern C is cn = |C|. If the set of all crosscutting concerns in the software system M is CCC = {C1, C2, ..., Cq}, then the number of crosscutting concerns in the system is q = |CCC|.

The Partition of a Software System

The set K = {K1, K2, ..., Kp} is called a partition of the system M if and only if the following hold:

(1) 1 ≤ p ≤ n
(2) Ki ⊆ M, Ki ≠ ∅, ∀ i ∈ {1, 2, ..., p}
(3) M = K1 ∪ K2 ∪ ... ∪ Kp
(4) Ki ∩ Kj = ∅, ∀ i, j ∈ {1, 2, ..., p}, i ≠ j

Based on these definitions, if K is a set of clusters, then Ki is the i-th cluster of K. Formally, the problem of Aspect Mining (whether approached by a clustering-based technique or a graph-based method) can be viewed as the problem of identifying a partition K of the software system M. In abstract terms, a clustering-based Aspect Mining technique T can be viewed as a tuple of functions T = (divide, select, order), where divide is a function that maps a software system M to a partition K of the system, i.e., divide(M) = K.
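The four partition conditions can be checked mechanically. The following Python sketch (a hypothetical validator over toy method sets, not the dissertation's code) tests each condition in order:

```python
def is_partition(K, M):
    """Check the four conditions of Moldovan and Serban's partition model:
    1 <= p <= n, every K_i is a non-empty subset of M, the K_i together
    cover M, and distinct clusters are pairwise disjoint."""
    p, n = len(K), len(M)
    if not (1 <= p <= n):                           # condition (1)
        return False
    if any(not k or not k <= M for k in K):         # condition (2)
        return False
    if set().union(*K) != M:                        # condition (3): union covers M
        return False
    return all(K[i].isdisjoint(K[j])                # condition (4): disjointness
               for i in range(p) for j in range(i + 1, p))

M = {"m1", "m2", "m3", "m4"}
print(is_partition([{"m1", "m2"}, {"m3"}, {"m4"}], M))  # True
print(is_partition([{"m1"}, {"m1", "m2"}], M))          # False: overlap, no cover
```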

53 Consequently, the domain of divide is the set of all software systems, and its codomain is the set of partitions of a software system. In this definition, select is a function that indicates the clusters from K that will be analyzed by the user of the Aspect Mining technique, i.e., select(M, K) = SK, SK ⊆ K, and order is a function that indicates the order in which the selected clusters (given by the function select) will be analyzed by the user of the technique. Moldovan and Serban (2006) also observed that for an Aspect Mining technique T to be effective, the condition CCC = SK must hold; that is, the selected clusters must coincide with the set of crosscutting concerns.

Cluster Mapping

Each of the clusters obtained from the SOM toolbox was analyzed, and the members of each cluster were mapped back to the benchmark software being investigated, thereby establishing code scattering and/or tangling behavior. For an example of how this was achieved, see the sample rule table shown in table 4 and the cluster mapping data format shown in figure 16.

Figure 16. Example of Clusters Mapped Against Benchmark Software

54 When two code fragments from the same cluster are found to map to different classes, that pair of code fragments has exhibited scattering behavior. In figure 16, Row-B, Row-E, and Row-C from cluster-1, and Row-F and Row-G from cluster-2, are scattered, and the other methods in the respective clusters should be investigated and considered as aspect seeds. If two methods from different clusters (implementing different concerns) are found to map to the same class, that pair is considered a possible tangling case. In figure 16, Row-C from cluster-1 and Row-F from cluster-2 are tangled. Although Row-E and Row-A from cluster-1 belong to the same Class-2, this may simply be a case of code cloning. Table 4 summarizes the rules for extracting aspect seeds.

Table 4. Rules for Extracting Aspect Seeds

Rule  Characteristics/Symptoms                                             Aspect seed type
1     Code fragments from same cluster came from different classes         Scattering
2     Code fragments from different clusters came from different classes   No problem
3     Code fragments from different clusters came from same class          Tangling
4     Code fragments from same cluster came from same class                Code cloning

Using the rules presented in table 4 as a guideline and executing the mapping SQL statement, the SOM cluster results were matched by linking the table fields Home-class ID and Cluster ID to the corresponding fields in the Method-Class table, thereby obtaining the aspect seeds shown on the right side of figure 17.

Code Fragment Pair    Aspect Seed Type
Row-B, Row-E          Possible Code Scattering
Row-B, Row-A          Possible Code Scattering
Row-E, Row-A          Possible Code Cloning
Row-C, Row-F          Possible Code Tangling
Row-F, Row-G          Possible Code Scattering
Row-G, Row-D          Possible Code Scattering

Figure 17. Mapped Clusters and Obtained Aspect Seeds With Their Types
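The four rules of table 4 amount to a two-bit decision on a pair of code fragments: same or different cluster, same or different home class. A minimal sketch (cluster and class labels are stand-ins for the figure-16 rows, not real data):

```python
def classify_pair(cluster_a, class_a, cluster_b, class_b):
    """Apply the four rules of table 4 to a pair of code fragments,
    each described by its SOM cluster and its home class."""
    same_cluster = cluster_a == cluster_b
    same_class = class_a == class_b
    if same_cluster and not same_class:
        return "scattering"      # rule 1
    if not same_cluster and same_class:
        return "tangling"        # rule 3
    if same_cluster and same_class:
        return "code cloning"    # rule 4
    return "no problem"          # rule 2

# Pairs described as (cluster, home class), echoing figure 16
print(classify_pair(1, "Class-1", 1, "Class-2"))  # scattering
print(classify_pair(1, "Class-3", 2, "Class-3"))  # tangling
```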

55 Formal Representation of Mapping

Let the software system being investigated be represented as S, let C = {c1, ..., ck} be the set of all classes in S, let M = {m1j, m2j, ..., mnj} be the set of methods in cluster αj, and let A = {α1, α2, ..., αk} be the set of clusters obtained by the SOM. Since every method in every cluster αj came from S, each mij ∈ αj belongs to some class ck ∈ C, and the methods mnj in each cluster αj can be mapped to their appropriate classes in S as follows.

Figure 18. Mathematical Definition of Mapping

The mathematical definition of mapping states that A is mapped to B if no member of set A is left without a matching member in set B; some class cx in set B may not be mapped to by any method in A, and no single member of set A maps to two different members of set B. In line with this definition, it can be stated that if mi and mj are two different members of the same cluster, and mi maps to class cx while mj maps to a different class cy, then mi and mj are involved in scattering. Similarly, if methods from two different clusters map to the same class cx in B, then we have a tangling case. The number of aspect seeds in S is |scat| + |tang|, where scat is the set of methods that participate in implementing scattering and tang is the set of methods that participate in implementing code tangling in the software system S. Following the same definition, the number of aspects mined in a software system S can be expressed as Distinct-count(scat + tang).

56 Validation of Methodology

Kankanhalli and Tan (2004) observed that, without a trend to follow or an expected value to compare against, a calculated measure gives little or no information. Marin, van Deursen, and Moonen (2007) observed that, to ensure repeatability of experiments in Aspect Mining, research results have to be validated by means of a series of case studies. To meet these requirements, a set of open-source benchmark programs was selected, such as those listed in table 3 (Benchmark Programs).

The main problem with clustering data is that the optimal number of clusters is not known a priori; to complicate matters, different distance measures may result in clusters of different shapes (compact, hyperspherical, hyperellipsoidal, etc.). Fortunately, MatLab and associated tools such as CVAP have built-in facilities that help determine the optimal number of clusters based on the nature of the submitted data, with validity indices such as the Davies-Bouldin index also provided. According to Handl, Knowles, and Kell (2005), the use of cluster validation can help improve the quality of results and increase confidence in the final results. To achieve the required validation, the following procedure was used:

1. Validation of the software metrics used as input to clustering.
2. Validation of the optimal number of clusters and the compactness of each cluster, using existing tools such as MatLab and CVAP.
3. Validation of the obtained aspect candidate seeds.
4. Calculation of Aspect Mining precision to compare and determine how well the methodology used in this dissertation performs relative to other Aspect Mining research.

57 Validation Step-1. Software Metrics

Six software metrics were formulated from the event trace data collected dynamically while the benchmark programs were executed and exercised. The derived metrics were then used to represent the extractible code features of the tested software packages (i.e., Laffra's Dijkstra's algorithm and JHotDraw). The metrics were carefully selected to ensure that inherent and relevant features within the test programs were properly represented. From the SOM results collected, radar (spider) charts were constructed, as displayed in figure 19 below.

Figure 19. Radar Plots for Metrics Used on Benchmark Test Programs (left: LDA; right: JHD)

As can be seen from the radar plots shown in figure 19, all the metrics used have shown substantial influence on the clusters obtained from the SOM. Of particular distinction are four metrics (MethodSig, InfoFlow, MEC, and MethodSpread) that have shown substantial influence on all obtained clusters across the board; the only exception is their low influence on clusters 1 and 3 in the LDA case. To complement this, MIC has shown significant influence where the four mentioned metrics have exhibited less influence. The strength of MCC's influence is indeterminate. All in all, the influence exhibited by the metrics as shown in figure 19 is an indication that the combination of dynamic metrics used in this dissertation is a good choice.

58 Validation Step-2. Number of Clusters and Cluster Compactness

Since the optimal number of clusters is not known beforehand, and feature information is inherent in the data set, internal validity indices were considered in assessing how accurate the obtained SOM results are. Other factors that may influence the quality of SOM clusters include the size of the map (the number of output neurons) and the map topology. To obtain an optimal number of clusters from a dataset, a balanced approach must be applied: if the map size is too small, some important differences in the data may be missed; on the other hand, if the map size is too large, data differences may become too small to be detected.

To attain an appropriate validation of clustering results, it is important that objective measures for evaluating clustering quality are applied. To achieve this goal, the optimality measures presented by Kohonen (2001) and Kiviluoto (1996), namely the quantization and topographic error measures, were considered. The quantization error is the mean Euclidean distance of each data vector to the weight vector of its Best Matching Unit (BMU). The topographic error is calculated as the proportion of all data vectors for which the first and second BMUs are not adjacent units in the grid.

Several SOM packages were tried, and MatLab with the Neural Network toolbox was found to contain all the quality indices needed to ensure that compact clusters were produced. Using MatLab statistical tools, quantization and topographic errors were plotted to help pinpoint and select the optimal number of map units. Similarly, the Davies-Bouldin index measure was obtained from the MatLab-associated tool CVAP. Graphs were produced in which the number of SOM units was plotted against the number of clusters to determine the optimal number of clusters.
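Both error measures can be computed directly from a trained map's codebook. The following Python/NumPy sketch implements the two definitions above on a toy 2x2 map with invented data (illustrative only, not the dissertation's MatLab computation):

```python
import numpy as np

def som_errors(data, weights, grid):
    """Quantization error: mean Euclidean distance from each data vector to
    its best matching unit (BMU). Topographic error: fraction of data vectors
    whose first and second BMUs are not adjacent on the map grid.
    `weights` is (units, dim); `grid` is (units, 2) unit coordinates."""
    qe, te = 0.0, 0
    for x in data:
        d = np.linalg.norm(weights - x, axis=1)
        bmu1, bmu2 = np.argsort(d)[:2]
        qe += d[bmu1]
        # adjacent units differ by at most one step in each grid direction
        if np.abs(grid[bmu1] - grid[bmu2]).max() > 1:
            te += 1
    return qe / len(data), te / len(data)

# Toy 2x2 map over 2-D data (illustrative values)
grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
weights = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
data = np.array([[0.1, 0.0], [0.9, 1.0]])
qe, te = som_errors(data, weights, grid)
print(round(qe, 3), te)  # 0.1 0.0
```

Low values of both measures together indicate a map that fits the data closely while preserving its topology.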

59 The Davies-Bouldin index estimates the optimal number of clusters through minimization of the ratio between intra-cluster and inter-cluster distances, where a low index value indicates a number of clusters that can be used to attain good clustering results. The goal of Dunn's validity index is to identify cluster sets that are compact and well separated. The index is usually calculated using the following formula:

D = min_{1 ≤ i ≤ n} { min_{1 ≤ j ≤ n, j ≠ i} [ d(c_i, c_j) / max_{1 ≤ k ≤ n} d'(c_k) ] }

where d(c_i, c_j) is the distance between clusters c_i and c_j (inter-cluster distance), d'(c_k) is the intra-cluster distance of cluster c_k, and n is the number of clusters. The goal is to maximize the inter-cluster distances and minimize the intra-cluster distances; the number of clusters that maximizes D is taken as the optimal number of clusters to be used.

Many software packages that measure the compactness of clusters exist, containing such indices as the Dunn, Davies-Bouldin, and Silhouette validation indices. For instance, figure 20 below shows how the Davies-Bouldin index was used to determine the optimal number of clusters to be considered for LDA and JHD (10 and 20, respectively).

Figure 20. Plots of Optimal Number of Clusters Based on Davies-Bouldin Index
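Dunn's index as described above can be computed with a few lines of Python. This sketch uses single-linkage inter-cluster distance and cluster diameter as the intra-cluster distance, one common convention among several; the toy points are invented for illustration:

```python
def dunn_index(clusters):
    """Dunn's index: minimum over cluster pairs of d(c_i, c_j) divided by
    the maximum intra-cluster distance max_k d'(c_k)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    def inter(ci, cj):                     # single-linkage inter-cluster distance
        return min(dist(a, b) for a in ci for b in cj)
    def diameter(ck):                      # intra-cluster distance (diameter)
        return max((dist(a, b) for a in ck for b in ck), default=0.0)
    max_diam = max(diameter(c) for c in clusters)
    n = len(clusters)
    return min(inter(clusters[i], clusters[j])
               for i in range(n) for j in range(n) if i != j) / max_diam

# Two compact, well-separated toy clusters give a large index value
tight = [[(0.0, 0.0), (0.0, 0.1)], [(5.0, 5.0), (5.0, 5.1)]]
print(round(dunn_index(tight), 2))  # well above 1
```

As the text states, larger values are better: the candidate clustering with the largest D is taken as optimal.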

60 Validation Step-3. Validation of Aspect Candidates

Although there are no established, agreed-upon Aspect Mining benchmark programs among researchers, applications such as JHotDraw, PetStore, and MineSweeper have been used widely. As a basis for comparison, aspect candidates obtained in this dissertation were compared to existing results obtained by Ceccato et al. (2005), Marin et al. (2004), Tonella and Ceccato (2004), and Moldovan and Serban (2006). This comparison helped confirm whether the aspect seeds obtained in the project are indeed aspects. The comparison was made on a one-to-one basis, meaning that if an aspect seed found in the project matches one found by another study, e.g., Ceccato et al. (2005), then the aspect candidate is declared an actual aspect; otherwise, it is considered a false positive. Through this comparison, actual aspects were filtered, false positives identified, and data for the calculation of recall and precision collected.

It should be noted, however, that there is no agreed and established baseline for Aspect Mining confirmation, and even where such comparison datasets exist, comparison and confirmation can only be made subjectively. Another complicating problem is that, although there might be some overlap in the aspect seeds found by different Aspect Mining methods, different methods produce different sets of aspect seeds due to the disparate capabilities associated with different Aspect Mining methodologies.

Ceccato et al. (2005) compared the performance and capabilities of three types of Aspect Mining methods, one comparison of which was based on JHotDraw. That work provides an objective comparison based on an exhaustive, compiled list of aspects mined by Marin, van Deursen, and Moonen (2004). For this reason, this dissertation uses data from Marin, van Deursen, and Moonen (2004) as a comparison base.

61 Of the three Aspect Mining methodologies discussed in the study by Ceccato et al. (2005), Dynamic Aspect Mining was found to be closest to the work done in this dissertation; for this reason, the following two types of comparison were used to validate the results obtained in this dissertation:

1. Capability comparison in terms of identifying crosscutting concern sorts, as done by Ceccato et al. (2005).
2. Recall and precision comparison (from false positives and false negatives).

For JHD, the methodology of comparing selected crosscutting concern types, as presented by Tonella and Ceccato (2004) and discussed in Ceccato et al. (2005), was used to determine how well the dissertation's Aspect Mining methodology performs. With respect to the second benchmark program (Laffra's implementation of Dijkstra's algorithm), the dissertation results were compared with those obtained by Moldovan and Serban (2006) and by Tonella and Ceccato (2004). From a literature search of Aspect Mining work, the aforementioned research works and benchmarks were found to be the most prominent and were therefore selected as a basis for comparison.

Validation Step-4. Recall and Precision

Recall and precision have been used to assess the quality of literature search and have been widely used in data mining and information retrieval (IR) systems. Recall measures how well a search method finds what is required, and precision measures how well the method weeds out what is not required. With respect to Aspect Mining, Bruntink et al. (2005) defined recall as the measure of how much of the code of a crosscutting concern is found, and precision as the ratio of crosscutting concern code to unrelated code in the results.

62 Definitions of recall and precision are presented in the literature depending on the data at hand. According to Eaddy (2008), recall and precision can be defined as follows:

Recall = Good Results / (Good Results + False Negatives)
Precision = Good Results / (Good Results + False Positives)

If the number of aspect results returned by an Aspect Mining methodology is represented as R, the set of aspect results used as a base for comparison as C, n is the number of aspect seeds that are reported not to be aspects but are confirmed to actually be aspects (false negatives), and p is the number of aspect seeds from R that are thought to be aspects but were found to be non-aspects (false positives), then the relationship among these quantities is summarized in figure 21.

Figure 21. Precision and Recall (C = number of confirmed aspect seeds in the comparison base; R = number of returned results; false-negative rate = n/R; false-positive rate = p/R)

Eaddy (2008) also defined recall and precision in the following set representation. If E_m is the set of program elements (methods) that are actually relevant to concern c (the relevant element set), and E_d is the set of elements judged to be relevant by the Aspect Mining methodology (i.e., the retrieved element set), then:

Recall = |E_m ∩ E_d| / |E_m|
Precision = |E_m ∩ E_d| / |E_d|

When a system retrieves all the relevant documents without introducing irrelevant elements into the returned result, 100% precision is attained; in reality this rarely happens, due to the presence of noise in the data or to inappropriate use of the retrieval methodology.
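The Eaddy-style set definitions above compute directly over element sets. In this sketch the method names are invented stand-ins, not actual LDA results:

```python
def recall_precision(relevant, retrieved):
    """Recall = |E_m ∩ E_d| / |E_m|; Precision = |E_m ∩ E_d| / |E_d|,
    where `relevant` is E_m and `retrieved` is E_d."""
    hits = len(relevant & retrieved)
    return hits / len(relevant), hits / len(retrieved)

relevant = {"lock", "unlock", "showline"}          # E_m: actually relevant methods
retrieved = {"lock", "unlock", "init", "runalg"}   # E_d: reported by the miner
r, p = recall_precision(relevant, retrieved)
print(r, p)  # 2/3 recall, 0.5 precision
```

Here two of the three relevant methods were retrieved (recall 2/3), but only two of the four retrieved elements were relevant (precision 1/2), illustrating the usual trade-off.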

63 Summary of Methodology Used

A concern in a software system is a representation of the implementation of a unique idea or functionality. Due to the limitations of programming-language constructs and the structural degradation associated with repeated changes and continual enhancements, some concerns (known as crosscutting concerns) cannot be modularized. Aspect Mining is a reverse software engineering exploration technique concerned with the development of concepts, principles, methods, and tools supporting the identification and extraction of re-factorable aspect candidates in legacy software systems.

The Aspect Mining approach presented in this dissertation involved three phases. In the first phase, selected large-scale legacy test systems were dynamically traced and investigated, and metrics representing interactions between code fragments were derived from the collected data. In the second phase, the formulated metrics were submitted as input to Self-Organizing Maps (SOMs) for clustering. In the third phase, the clusters produced by the SOM were mapped against the benchmark software under investigation in order to identify code scattering and tangling symptoms, from which crosscutting concerns were identified and candidate aspect seeds mined.

To validate the methodology employed and assess performance, the following validation techniques were employed:

1. Validation of the software metrics used as input to clustering.
2. Validation of the optimal number of clusters and the compactness of each cluster.
3. Validation of the obtained aspect candidate seeds.

Obtained results were then compared to results obtained by existing Aspect Mining approaches in order to determine performance.
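The second phase, clustering metric vectors with a SOM, can be illustrated with a minimal self-contained implementation. This Python/NumPy sketch uses an invented grid size, learning schedule, and data; it is not the MatLab SOM Toolbox configuration used in the dissertation:

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=200, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal SOM: method-level metric vectors in, a trained codebook and
    per-vector BMU assignments out. Parameters are illustrative defaults."""
    rng = np.random.default_rng(seed)
    units = rows * cols
    grid = np.array([[i, j] for i in range(rows) for j in range(cols)], float)
    w = rng.random((units, data.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                 # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 1e-9    # shrinking neighborhood
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(np.linalg.norm(w - x, axis=1))
            # Gaussian neighborhood pulls units near the BMU toward the sample
            h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * sigma ** 2))
            w += lr * h[:, None] * (x - w)
    bmus = np.array([np.argmin(np.linalg.norm(w - x, axis=1)) for x in data])
    return w, bmus

# Two well-separated groups of fake 6-metric vectors
data = np.vstack([np.full((5, 6), 0.1), np.full((5, 6), 0.9)])
w, bmus = train_som(data)
print(bmus)
```

In the dissertation's pipeline the BMU assignments play the role of cluster labels, which are then mapped back to home classes in phase three.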

64 Chapter 4

Results: Data Analysis, Findings, and Summary

Introduction

This dissertation introduced a new Aspect Mining methodology based on extracting method-level dynamic metrics from software code and clustering them using Self-Organizing Maps (SOMs). The obtained clusters were then mapped against the software being investigated in order to identify aspect seeds based on code scattering and tangling patterns. The clustering and mapping methodology used helped identify implementations of crosscutting concerns, leading to Aspect Mining.

Two benchmark test programs were used, namely Laffra's implementation of Dijkstra's algorithm (LDA) and JHotDraw (JHD). Both programs were designed and developed to showcase and exemplify best programming practice, and both have been widely used in other Aspect Mining research. These benchmark programs were dynamically traced and investigated. Data representing interactions between the code fragments was collected and transformed into metrics, which were then used as input to the SOMs.

Aspect Mining results obtained in this dissertation were then compared to results obtained by two existing research works that used the same set of benchmark programs, namely the work done by Moldovan and Serban (2006) and by Ceccato et al. (2005). With respect to JHD, all 18 aspect seed types identified by Ceccato et al. (2005) were also identified by this dissertation; the only difference is that the number of seeds is higher in the case of Ceccato et al. (2005), due to the fact that their work was guided while the methodology used in this dissertation was unguided.

65 Since both methodologies identified the same set of aspect seed types, it can be concluded that, with regard to JHD, the methodology presented in this dissertation did as well as, and no worse than, the methodology used by Tonella and Ceccato (2004). For a detailed list of the aspect seeds identified by this dissertation and by the Tonella and Ceccato (2004) work, refer to table 6 on page 74.

Since both Moldovan and Serban (2006) and Tonella and Ceccato (2004) used the LDA benchmark in their Aspect Mining research, aspect seeds identified in this dissertation were compared to the results obtained by both works. It was found that the method used in this dissertation outperformed the methodology presented by Moldovan and Serban (2006), since their work did not find the three aspect seeds that were found by the methodology used in this dissertation. Moldovan and Serban (2006) opined that the threshold they used, and the fact that their approach focused only on scattering, might have constrained their methodology, resulting in its failure to find the aspect seeds that exist in LDA.

With regard to comparison of the LDA results obtained in this dissertation with those found in the work done by Tonella and Ceccato (2004), it was found that both methodologies identified exactly the same three sets of aspect seeds (lock and unlock methods) that represent crosscutting concerns in the LDA benchmark.

The following paragraphs give a detailed discussion of the findings and a comparison of the results obtained in this dissertation with those obtained by existing Aspect Mining methodologies that utilized the same two benchmarks (i.e., LDA and JHD).

66 Analysis of LDA Results

The set of aspect seeds (lock() and unlock()) identified and mined in this dissertation was found to be exactly the same set of seeds identified by Tonella and Ceccato (2004). The three method pairs were found to come from clusters 2 and 3: two pairs were found in cluster 3, while one pair belongs to cluster 2. In the case of the first pair of seeds, since each method came from a different class and both belong to the same cluster, clear scattering behavior is exhibited, qualifying this set of methods as aspect seeds. It should be noted that these two method pairs were invoked from the other pair, which belongs to cluster 2. Although refactoring is not the theme of this dissertation, the identified method pairs should be considered as a base from which such an exercise could be carried out. It should also be noted that the method pair belonging to cluster 2 is an example of an attempt toward modularization that leads to a case of code scattering. To illustrate some of the characteristics and features identified in LDA and make the analysis easier, table 5 is presented below.

Table 5. Sample Data Representing Formulated Metrics
(Columns: MethodSig, Information Flow, MethodSpread, MIC, MEC, MCC, Method Name, Cluster; the rows pair the LDA methods lock, unlock, action, showline, init, runalg, stepalg, initalg, nextstep, and run with their metric values and cluster assignments.)

67 Table 5, shown above, is a sample of the LDA dataset after the six metrics were formulated and before submission to the SOM for clustering. Similar data for the JHotDraw test program can be found in appendix D. The last column of the table was added to facilitate analysis. As can be seen from this table, rows (vectors) with the same or closely similar values in the MethodSig and MethodSpread metrics are found to belong to the same cluster. Although the method pair highlighted in gray (stepalg and initalg) in table 5 is similar in value in the key columns and belongs to the same cluster 3, these methods are found not to be aspect seeds. In the SOM U-Matrix displayed in figure 22 below, it can be seen that these methods are placed far away from the aspect seeds identified above. This highlights the need for further research to establish a threshold that can be used in such cases. For short programs with a few hundred lines of code, manual investigation may be possible, but for industrial-size applications this may not be feasible.

Figure 22. LDA U-Matrix on Left and Component Planes on Right

68 Referring to figure 23 below, it can be seen that ten clusters were obtained. The right side of figure 22 shows a weight plane for each element of the input; these are visualizations of the weights that connect each input to each of the neurons. Similarity of component planes is an indication of correlation between the variables; for instance, the component planes for InfoFlow and MethodSpread appear directly correlated, while MIC and MEC are not correlated with any other variable.

Figure 23. LDA Clusters Produced by SOM

Figure 23 above shows 5 of the 10 obtained clusters with their respective members within the confines of the clusters. Of special interest are the clusters (labeled 2 and 3) that contain the three matching pairs of the methods lock() and unlock(). It should be noted that although some individual isolated dots (not circled) are reported as clusters, these are either representations of isolated implementations of non-aspect features or some noise existing in the event trace data.

69 LDA Mapping Example

To map cluster members to classes in order to mine aspect seeds, the SQL statement shown in appendix H was used; it links the method detail table with the cluster result table. The detail table is simply a list of all methods whose vectors were formulated before being clustered by the SOM. For the sake of simplicity, only the method ID, method name, and class names were extracted. The table in figure 24 below is a sample of the obtained results.

Figure 24. LDA Aspect Seed Mapping After SQL Execution

Figure 24 above shows the Aspect Mining results after the SOM cluster data was subjected to the SQL query (see appendix H) that maps cluster members to their home classes in order to determine whether a particular set of seeds participated in implementing scattering or tangling cases.
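The join-and-filter step described here can be sketched with an in-memory database. This is not the appendix-H statement itself: the schema, table and column names, and class names below are invented for illustration, with lock()/unlock() scattered across two classes to mimic the LDA situation:

```python
import sqlite3

# Toy schema mirroring the mapping step: a method-detail table and a
# SOM cluster-result table joined on method id.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE method_detail (method_id INTEGER, method_name TEXT, home_class TEXT);
CREATE TABLE cluster_result (method_id INTEGER, cluster_id INTEGER);
INSERT INTO method_detail VALUES
  (1, 'lock',   'GraphAlgorithm'),
  (2, 'unlock', 'GraphAlgorithm'),
  (3, 'lock',   'StepPanel'),
  (4, 'unlock', 'StepPanel'),
  (5, 'run',    'Dijkstra');
INSERT INTO cluster_result VALUES (1, 3), (2, 3), (3, 3), (4, 3), (5, 2);
""")
# Keep only (cluster, class) groups backed by more than one method, i.e.
# drop classes pointed at by one and only one method of a cluster.
rows = con.execute("""
    SELECT c.cluster_id, d.home_class, d.method_name
    FROM method_detail d
    JOIN cluster_result c ON d.method_id = c.method_id
    JOIN (SELECT c2.cluster_id AS cid, d2.home_class AS cls
          FROM method_detail d2
          JOIN cluster_result c2 ON d2.method_id = c2.method_id
          GROUP BY c2.cluster_id, d2.home_class
          HAVING COUNT(*) > 1) g
      ON g.cid = c.cluster_id AND g.cls = d.home_class
    ORDER BY c.cluster_id, d.home_class, d.method_name
""").fetchall()
print(rows)
```

The surviving rows are the cluster-3 lock()/unlock() pairs in two different home classes, i.e., the scattering pattern that qualifies them as aspect seeds, while the lone cluster-2 method is filtered out.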

70 When the query shown in appendix H is modified to exclude classes at which one and only one method in a cluster points, the result shown in figure 25 is obtained. This result contains the two pairs of methods lock() and unlock() that implement locking and unlocking of the graphical user interface each time a functionality of LDA is executed.

Figure 25. LDA Mined Aspect Seeds

Due to the small size of LDA, it is possible to explain how the two aspects were mined through mapping and which methods (i.e., seeds) participated in the implementation of the identified aspects. The diagram in figure 26 shows the obtained cluster members mapped to their home classes. For the sake of explanation, only mapped clusters are shown in this example. On the right-hand side is the set of all the LDA classes, with only the three affected classes shown.

Figure 26. LDA Method Mapping Showing the Two Identified Aspects

71 Arrows from two different clusters pointing to their home classes show a clear case of scattering. Based on the scattering and tangling guidelines presented in table 4, the lock() and unlock() methods from three different classes have exhibited scattering behavior across three classes, and hence were declared possible aspect seeds.

LDA Findings and Result Comparisons

The work done by Tonella and Ceccato (2004) on LDA discovered two crosscutting concerns (lock() and unlock()) for locking and unlocking the LDA graphical user interface each time a functionality is executed. In presenting their results, Moldovan and Serban (2006) also investigated LDA and presented an example that shows how methods are grouped into clusters using the k-means approach. Using two different models, the optimal numbers of clusters obtained by k-means for models M1 and M2 were 7 and 5, respectively. Moldovan and Serban (2006) explained that their investigation did not discover these crosscutting concerns as Tonella and Ceccato (2004) had done, and opined that the inability of their approach to discover them may have been due to the threshold used and the fact that their approach focused solely on scattering, as opposed to the approach of Tonella and Ceccato (2004), which targets both scattering and tangling; they opined that a better choice of threshold would have helped in identifying these crosscutting concerns.

Given that the two LDA crosscutting concerns were identified and discovered right from the clustering stage, and given how vividly they can be seen and identified in figures 22 and 23, it can be said that the approach used in this dissertation has shown a remarkable added advantage in terms of feature visualization and aspect seed identification.

72 Findings and Data Analysis (JHotDraw 5.4b1)

The MatLab tool CVAP suggested the creation of 23 clusters for JHD. Through the SOM data structure, details of the methods belonging to the different clusters were obtained. Due to the size of JHD and the limitation of screen space, method names were overwritten and superimposed with other cluster member names. Despite this limitation, the obtained clusters can be seen clearly in the U-Matrix shown in figure 27 below.

Figure 27. JHD U-Matrix

The U-Matrix shown in figure 27 above shows the obtained SOM clusters. It should be noted that some methods did not fit into any of the major clusters; despite this, most are located close to the clusters to which they are naturally closest.

73 After observing the 23 obtained clusters, members of five clusters were found to contain non-aspect seeds. The other 18 clusters were found to correspond to the same 18 aspect seeds discovered in the Tonella and Ceccato (2004) work. The graph displayed in figure 28 below shows the cluster depiction of the methods that participated in the implementation of the eighteen discovered aspects.

Figure 28. 3D JHD SOM Clusters

Since JHD was designed and developed to exemplify best-practice software development, it should be expected that the software would be laden with design patterns. For this reason, the main source of crosscutting concerns for this package is formed from various known design patterns. From the data collected (see appendix F), all 18 aspects identified by Tonella and Ceccato (2004) were also discovered by the approach used in this dissertation. The only difference is in the number of seeds (i.e., the number of methods that participate in the implementation of some of the aspects). Table 6, displayed on the next page, shows a comparison of the JHD aspect seeds obtained by the dissertation approach and those obtained by Tonella and Ceccato (2004).

74 Table 6. Comparison of Discovered JHD Seeds

Aspect                            | Tonella & Ceccato (2004) Dynamic Analysis | This Dissertation
Undo                              |                                           |
Bring to Front                    | 3                                         | 3
Send to Back                      | 3                                         | 3
Connect Text                      |                                           |
Persistence                       |                                           |
Manage Handle                     |                                           |
Move Figure                       | 7                                         | 7
Command Executability             |                                           |
Connect Figures                   |                                           |
Figure Observer (Manage Figure)   | 11                                        | 3
Add Text                          |                                           |
Add URL to Figure                 | 10                                        | 7
Manage Figures Outside Drawing    | 2                                         | 2
Get Attribute                     | 2                                         | 2
Set Attribute                     | 2                                         | 2
Manage View Rectangle             | 2                                         | 2
Visitor                           | 6                                         | 6

The graph shown in figure 29 below shows the seed disparity between the two compared methodologies for the 17 established and discovered aspects.

Figure 29. Graph Showing Seed Discovery Disparity (per-aspect seed counts for the Dynamic Analysis and Dissertation methodologies)

The disparity between the two methods can be explained by the fact that, while the Tonella and Ceccato (2004) work benefited from using the Dynamo tool and utilized use cases as a guideline, the methodology used in this dissertation is unsupervised.

Most of the JHD aspects and their associated seeds were found to point at portions of code that implement design patterns. This is not a coincidence, because JHotDraw was intentionally designed and developed as a framework exemplifying good programming habits through the use of design patterns. Marin, Deursen and Moonen (2004) observed that, since a substantial number of the aspects found in JHD and similar packages are related to design patterns, it is worthwhile to investigate the use of design-pattern-oriented Aspect Mining techniques.

JHD Component Planes

Component planes are used to show the values of the individual input variables in situations where there is a lot of information to visualize. Figure 30 below shows the component planes corresponding to the 5 vector components used for JHD.

Figure 30. JHD Component Plane Maps

Comparing the component planes, it can be seen that the planes for MIC and MEC are directly correlated, while those for InfoFlow and MethodSpread are inversely correlated.
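The direct and inverse correlations read off the component planes can also be checked numerically. The following sketch uses made-up metric vectors (not data from the study) to show the Pearson correlation computation that underlies this kind of comparison:

```java
// Sketch: quantifying component-plane similarity with Pearson correlation.
// The metric vectors below are illustrative placeholders only.
public class PlaneCorrelation {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;       // scaled covariance
        double vx = sxx - sx * sx / n;        // scaled variance of x
        double vy = syy - sy * sy / n;        // scaled variance of y
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] mic = {1, 2, 3, 4, 5};
        double[] mec = {2, 4, 6, 8, 10};       // moves with mic
        double[] spread = {5, 4, 3, 2, 1};     // moves against mic
        System.out.printf("MIC vs MEC: %.2f%n", pearson(mic, mec));
        System.out.printf("MIC vs MethodSpread: %.2f%n", pearson(mic, spread));
    }
}
```

A value near +1 corresponds to the directly correlated planes (MIC/MEC), and a value near -1 to the inversely correlated ones (InfoFlow/MethodSpread).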

Comparison of Results (JHotDraw 5.41b)

Shepherd, Palm and Pollock (2005) observed that, due to the complex nature of large-scale software systems, none of the existing Aspect Mining techniques (Fan-In, Identifier and Dynamic Analysis) is by itself sufficient to discover all the possible aspect seeds in a software system. For instance, dynamic analysis and formal concept analysis may do better than Fan-In analysis in discovering some crosscutting concern types, while Fan-In analysis may do better than dynamic analysis and formal concept analysis in detecting other sorts of aspects. Since no single Aspect Mining methodology is capable of detecting every possible concern type in a software system, and the capabilities of different methodologies differ, Ceccato et al. (2005) selected 6 main concern types, obtained from testing JHotDraw with three different methodologies, as the basis for comparison. This comparison is appropriate because it highlights the capabilities and weaknesses of the Aspect Mining methodologies. In line with this, the dissertation's performance is compared against the results presented by Ceccato et al. (2005), as shown in table 7 below.

Table 7. Select Concern Type Capability Comparisons

Concern                        Fan-In Analysis   Identifier Analysis   Dynamic Analysis   Dissertation Results
Observer
Consistent Behavior
Command Execution
Bring to Front/Send to Back
Manage Handles
Move Figures

Of the three Aspect Mining approaches analyzed by Ceccato et al. (2005), Fan-In and Identifier analysis have shown strength in identifying aspects that exhibit scattering behavior; the Dynamic Analysis approach, on the other hand, is found to be capable of detecting both scattering and tangling cases.

With regard to similarities, Dynamic analysis is found to be the closest to the methodology presented in this dissertation; for this reason, it is appropriate to compare these two approaches in terms of capability for the selected concern sorts. Table 8 below summarizes the crosscutting concern types mined by both approaches, depicting how well the two methods did on the selected concern sorts. To compare the capabilities of the approach presented in this dissertation against existing results, the table is constructed in line with the presentation of Ceccato et al. (2005). A plus sign in a column (for a particular methodology) indicates that the methodology is capable of discovering the concern sort displayed in the first column, while a minus sign indicates that the methodology is lacking in discovering that concern sort.

Table 8. JHD Seeds Comparison, Dynamic Analysis Vs Dissertation Approach

Selected Concern Sort          Description                                                   Dynamic Analysis   Dissertation
                                                                                             Aspect Seeds       Aspect Seeds
Consistent Behavior            Checking and refreshing figures                               +                  +
Contract Enforcement           Contract enforcement                                          +                  +
Undo                           Checks whether a command is doable or not                     -                  +
Persistence and Resurrection   Deals with read/write operations                              +                  +
Command Design Pattern         Deals with execute in command classes and command             -                  -
                               constructors
Observer Design Pattern        Observer manipulation and method notification                 +                  +
Composite Design Pattern       Method of manipulating child components                       +                  +
Decorator Design Pattern       Passes calls on to decorator components                       +                  +
Adapter Design Pattern         Manipulates reference from handle adapter to Figure adapter   +                  +

Nine out of the 18 known and confirmed aspects were selected for the comparison. It should be noted that these represent 50% of the confirmed aspects and are also known and established design patterns.
From the performance comparison shown in table 8 above, for the selected concern types, the approach used in this dissertation has matched, and in one instance exceeded, the capabilities exhibited by the Dynamic analysis of Tonella and Ceccato (2004).

Result Precision

Since the Dynamic analysis data from Marin, Deursen and Moonen (2004) is the basis on which the work of Ceccato et al. (2005) was built (for JHD), the following definitions and terms were used in determining the precision of the results obtained in this dissertation. Let c represent the number of aspects obtained when Dynamic Analysis was applied to JHD as presented by Ceccato et al. (2005) (which was 18), and let d be the number of aspects found by the methodology presented in this dissertation. Since the dynamic analysis of Marin, Deursen and Moonen (2004) using Dynamo attained 51% precision, and the dissertation approach discovered exactly the same 18 aspects found by Ceccato et al. (2004), the precision attained by the dissertation is d/c * 51% = 18/18 * 51% = 51%, which is the same as that of the comparison base. The fraction d/c is multiplied by 51% because this was the reported precision attained by the dynamic analysis work of Marin, Deursen and Moonen (2004), from which the comparison dataset was created. As stated earlier, although the same level of precision is attained, the number of seeds found by Marin, Deursen and Moonen (2004) is higher because their mining was conducted manually, guided by software use-cases and the Dynamo tool.

Table 9. LDA Benchmark Precision Comparison

Benchmark: LDA    Moldovan & Serban (2006)   Tonella & Ceccato (2004)   This Dissertation
Confirmed Seeds
Non-Seeds
Concerns

Table 10. JHD Benchmark Precision Comparison

Benchmark: JHotDraw   Marin, Deursen and Moonen (2004)   This Dissertation
Confirmed Seeds       58 (51%)                           18/18 * (51) = (51%)
Non-Seeds             56 (49%)                           25/56 = (45%)
Concerns              10                                 10
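The precision calculation above is simple enough to express directly in code. The sketch below (class and method names are illustrative, not part of the study's tooling) reproduces the d/c scaling against the 51% base precision:

```java
// Sketch: precision scaled against the comparison base, as described above.
public class PrecisionCalc {
    // found = aspects discovered by the approach (d),
    // base = aspects in the comparison dataset (c),
    // basePrecision = reported precision of the base study (percent).
    static double precision(int found, int base, double basePrecision) {
        return (double) found / base * basePrecision;
    }

    public static void main(String[] args) {
        // 18 of the 18 base aspects rediscovered, 51% base precision
        System.out.println(precision(18, 18, 51.0) + "%");
    }
}
```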

Summary of Results (General Observation and Findings)

After investigating the data collected from event traces and formulating the dynamic metrics, it was found that the two vector components MethodSig and InfoFlow (for both test benchmark programs) are jointly good indicators of similarity or closeness between vectors. The results obtained after SOM clustering also confirmed that the combination of the MethodSig and InfoFlow metrics is a strong indicator pointing towards possible aspect candidate seeds. Vectors that are tightly related on these components are found to ultimately end up in the same cluster. Results obtained in this dissertation have also confirmed the premise of Fan-In analysis from Ceccato et al. (2005) that methods with high Fan-In values are associated with aspects, confirming that there is a high correlation between high fan-in values and the aspect seeds found in a software system.

As presented on page 30, six vector components were initially considered as input to the SOM, but during the analysis it was discovered that the Method Cohesion Contribution (MCC) metric had a negative influence on how well the SOM clusters were formed. When the MCC column was removed, the clustering results improved drastically. To determine whether MCC should be dropped from consideration, trial runs were carried out on JHD and LDA (changing normalization parameters) and pairs of quantization and topographic error values were collected, as shown in table 11 below.

Table 11. Series of Quantization and Topographic Errors Collected

JHD / LDA Parameters          HistC   HistD   Log   Var   Range   Logistic
6 component vectors used, Qe
5 component vectors used, Qe
6 component vectors used, Te
5 component vectors used, Te
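Quantization and topographic error, the two quality measures collected in table 11, are standard SOM diagnostics. The sketch below uses a toy map and toy data (not the study's) to show how they are conventionally computed: quantization error is the mean distance from each input to its best-matching unit, and topographic error is the fraction of inputs whose two best-matching units are not neighbors on the map grid:

```java
// Sketch: standard SOM quality measures on illustrative data.
public class SomErrors {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Mean distance from each sample to its closest map unit.
    static double quantizationError(double[][] data, double[][] units) {
        double total = 0;
        for (double[] v : data) {
            double best = Double.MAX_VALUE;
            for (double[] u : units) best = Math.min(best, dist(v, u));
            total += best;
        }
        return total / data.length;
    }

    // Fraction of samples whose first and second best units are not
    // adjacent on the map grid (pos holds each unit's grid coordinates).
    static double topographicError(double[][] data, double[][] units, int[][] pos) {
        int errors = 0;
        for (double[] v : data) {
            int b1 = -1, b2 = -1;
            for (int i = 0; i < units.length; i++) {
                double d = dist(v, units[i]);
                if (b1 < 0 || d < dist(v, units[b1])) { b2 = b1; b1 = i; }
                else if (b2 < 0 || d < dist(v, units[b2])) { b2 = i; }
            }
            int dx = Math.abs(pos[b1][0] - pos[b2][0]);
            int dy = Math.abs(pos[b1][1] - pos[b2][1]);
            if (dx > 1 || dy > 1) errors++;
        }
        return (double) errors / data.length;
    }

    public static void main(String[] args) {
        double[][] units = {{0.0}, {1.0}, {10.0}};
        int[][] pos = {{0, 0}, {1, 0}, {2, 0}};
        double[][] data = {{0.1}, {0.9}};
        System.out.println(quantizationError(data, units));
        System.out.println(topographicError(data, units, pos));
    }
}
```

Lower values of both measures indicate a better-fitting, better-ordered map, which is why the 5-component runs in table 11 were preferred.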

When the data from table 11 was plotted on a graph, as shown in figure 31, lower quantization and topographic errors (the preferred situation) were obtained when 5 vector components were used than when 6 vector components were used. This showed that MCC negatively impacts how the SOM clusters the given data. Based on this, a decision was made to drop MCC and use five vector components rather than six.

Figure 31. Comparison of Quantization and Topographic Errors, 5 Vs 6 Components (JHD and LDA)

Visualization Problems

The ability to visualize results and make inferences without delving into volumes of data is of great importance, particularly in software exploration, where relationships among abstract concepts and ideas are being sought. Although much visualization advantage is gained through the SOM U-Matrix and other visualization tools, it was observed that, as the number of methods increases, visualizing the clusters produced by the SOM becomes problematic without intrusive manipulation of the SOM data structure. In situations where methods are tightly clustered, labels are superimposed over each other, making it difficult to read the display. Another problem was the case of methods with the same name existing in different classes. It is difficult to keep track of such methods, and differentiating between them is difficult without tagging their class names in the datasets.

Seed Disparity, Ceccato et al. Vs Dissertation Approach

When analyzing the dynamic Aspect Mining results compiled by Marin, Deursen and Moonen (2004), it was observed that a higher number of aspect seeds was returned than by the methodology presented in this dissertation. However, on a one-on-one basis, the dissertation methodology matched the actual number (18) of aspects detected by Ceccato et al. (2005). For instance, Ceccato et al. (2005) found 36 seeds that implement the undo aspect while the dissertation approach found only 11, with both detecting the same undo aspect in JHD. The disparity is attributed to the fact that while the Ceccato et al. (2005) approach is guided, the approach in this dissertation is unguided.

Problem of Incomplete Coverage

One problem associated with the event trace data collection methodology used in this dissertation is the difficulty of exercising the benchmark test programs such that all possible features get invoked. This problem is known as incomplete test coverage and is known to affect both recall and precision. For instance, out of 3,230 JHD methods, only 315 unique methods were invoked when JHD was exercised. With respect to LDA, 30 unique methods were invoked out of a total of 42. See table 12 below for a summary.

Table 12. Proportion of Executed Unique Methods

Benchmark Program   Number of Methods   Number of Exercised Methods   Coverage %   Incomplete Coverage %
JHD                 3,230               315                           9%           91%
LDA                 42                  30                            71%          29%

The LDA benchmark has a higher coverage percentage, probably due to its small size, while the size and complexity of JHD (being framework oriented) give it a very low coverage and a high incomplete coverage. To mitigate this problem, a set of carefully selected invoking inputs was formulated, as shown in table 2 in phase 1 of the methodology section.

Data Analysis Problems

Data management and analysis of the data associated with large-scale software systems can be daunting. For instance, with JHD having over 300 classes and over 2000 methods, visual display of the datasets associated with discovered aspect seeds is not easy; but with specially designed software exploration tools such as Extravis, due to Cornelissen, Holten, Zaidman, Moonen, van Wijk and Deursen (2007), much valuable information about the software being investigated can be deduced.

Noise Problem in Data

Noise in a dataset can be defined as non-core objects that do not lie within the closest bounds (neighborhood) of any core object. Noise is introduced by an imperfect data collection process or by inherent data elements that cannot easily be identified for removal. Data corruption due to noise may impact interpretations of the data, models created from the data, and decisions made based on the data. Noise can reduce system performance in terms of classification accuracy. Fortunately, existing learning algorithms have integrated approaches to enhance their ability to learn in noisy environments, but the existence of noise can still introduce serious negative impacts. A good example of data noise encountered during the JHotDraw event trace data collection was the case in which mouse events were observed to generate a lot of unnecessary noise having little or nothing to do with methods that implement crosscutting concerns. Most existing noise removal methodologies use statistical outlier detection techniques. The MATLAB SOM tool appears to cope well with noise in the data, as data representing noise were displayed as sparse outlier objects placed as closely as possible to their best possible home cluster.
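As noted above, most noise removal relies on statistical outlier detection. The following sketch flags trace entries whose call counts deviate strongly from the mean, in the spirit of a z-score test. The data and threshold are illustrative, and this is not the algorithm used by the MATLAB toolbox:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: flagging outlier metric values with a simple z-score test,
// one common statistical outlier-detection technique.
public class OutlierFilter {
    // Returns the indices of values more than zThreshold standard
    // deviations away from the mean.
    static List<Integer> outliers(double[] values, double zThreshold) {
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.length;
        double var = 0;
        for (double v : values) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / values.length);
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (sd > 0 && Math.abs(values[i] - mean) / sd > zThreshold) idx.add(i);
        }
        return idx;
    }

    public static void main(String[] args) {
        // e.g. call counts where one mouse-event handler dominates the trace
        double[] callCounts = {3, 4, 5, 4, 3, 5, 4, 120};
        System.out.println(outliers(callCounts, 2.0)); // index of the spike
    }
}
```

Entries flagged this way could be reviewed (or removed) before the vectors are submitted to the SOM, reducing the influence of, for example, mouse-event noise.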

Summary of Results

To validate the Aspect Mining methodology presented in this dissertation, the obtained results were compared with those obtained by other existing Aspect Mining methodologies. When the LDA results obtained by the dissertation methodology were compared with those of Tonella and Ceccato (2004), it was found that the same two crosscutting concerns were discovered, spanning 3 different classes. When compared to the results obtained by Moldovan and Serban (2004), the dissertation approach performed better, in the sense that the Moldovan and Serban (2004) approach failed to discover the two crosscutting concerns, possibly due to a wrong choice of threshold. With respect to the second benchmark (JHD), the results obtained in this dissertation matched or exceeded those obtained by Ceccato et al. (2005). In this case, all 18 aspects discovered in that study were also discovered in this dissertation. On a one-on-one basis, both methodologies found the same 18 aspects and attained the same level of precision; the only difference is that the number of seeds found under each of the 18 discovered aspects is higher for Ceccato et al. (2004) and Marin, Deursen and Moonen (2004) than for this dissertation. The reason for this disparity is that while the Dynamic Analysis of Marin, Deursen and Moonen (2004) was manually guided through the use of the Dynamo Aspect Mining tool, the approach presented in this dissertation is unguided. All in all, the Aspect Mining results obtained in this dissertation matched or exceeded results obtained in similar existing Aspect Mining works, confirming the premise that the approach presented in this dissertation performs at least on par with similar existing Aspect Mining research.

Chapter 5

Conclusion

Understanding the internal structure of a software system and the components that compose it requires a clear understanding of the different concerns that the software addresses, which in turn requires efficient methodologies for identifying those concerns in the source code. While some crosscutting concerns are explicitly represented by program entities, others are not captured by a single program entity and are therefore scattered over many program units and tangled with code implementing other concerns. Aspect Mining is concerned with identifying system-wide crosscutting concerns. The presence of scattered and tangled code, and of cloned code fragments, in large software systems results in highly fragile and less modular systems that are difficult to maintain and extend. Dynamic software analysis is recognized as a useful means of gaining insight into the inner workings of a software system. With this useful feature comes the disadvantage of handling large amounts of event trace data. According to Bauer (1996), one of the steps required to investigate a legacy software system is to understand how the software system works and how its structure is composed. To do so, it is necessary to capture the internal and behavioral interactions that represent the interrelationships between the various components of the legacy system. This dissertation investigated the use of dynamic method-level software metrics as input to a SOM to determine whether aspect seeds can be found by mapping the obtained SOM cluster elements against the classes that make up the investigated benchmark programs. A selected set of benchmark software programs was investigated through execution and dynamic collection of event trace data representing latent features existing in the programs.

With respect to the first research question raised, the results obtained have shown that the derived metrics represent hidden features in the benchmark programs very well, and are relevant and useful in providing the required discriminating characteristics. With respect to the second research question, which addresses the issue of mapping clusters against the investigated benchmark programs, this question is also answered, since the mapped cluster results helped expose the scattering and tangling patterns that led to the identification of aspect seeds and the mining of aspects. All in all, the results obtained in this dissertation matched or exceeded results obtained in other existing works. With respect to LDA, the approach presented in this dissertation outperformed the methodology used by Moldovan and Serban (2004), and when compared to the results obtained by Tonella and Ceccato (2004), the dissertation results were found to match exactly the same 2 LDA crosscutting concerns found by Tonella and Ceccato (2004). With respect to the second benchmark (JHD), all 18 aspects discovered by Ceccato et al. (2005) were also obtained by the approach presented in this dissertation. The disparity in the number of seeds returned by the dissertation and those in the comparison base data is due to the fact that while the Dynamic analysis used in the Ceccato et al. (2005) presentation was guided by use-cases and manual use of the Dynamo tool, the approach presented in this dissertation is unguided. Since the SOM clustering method is unsupervised, one of the major contributions of this dissertation is the presentation of a new Aspect Mining approach that minimizes human interaction, in which hidden software features can be identified and inferences about the general structure of a software system can be made, thereby addressing one of the drawbacks of existing dynamic Aspect Mining methodologies.

Implications

The contribution of the methodology presented in this dissertation lies in the dynamic collection of data about extractible software fragments at method-level granularity, the representation of these fragments as metrics (capturing the internal behavior of the software), and the use of those metrics as guidance for clustering. The obtained results have shown that the approach presented is a viable technique that can be used in software visualization and exploration. The strength of the Aspect Mining approach presented in this dissertation is that the input metrics required to represent software code fragments can easily be derived from other viable software metrics formulations without complex formalisms. The following list is a summary of the contributions of the approach used in this dissertation:

1. The methodology used in this dissertation has shown that with proper formulation of software metrics and proper dynamic event trace data collection, the internal structure and behavioral nature of large-scale software systems can be explored.

2. SOM and other similar clustering methodologies can be used (as a first step) to help identify aspect seeds, leading to Aspect Mining. The visual graphical representation of code fragments provided through SOM can also help support software exploration and visualization.

3. Results obtained in this dissertation have shown that the Aspect Mining approach presented does as well as, and no worse than, existing Aspect Mining methodologies.

4. The approach presented in this dissertation can also be used to provide a means of gaining useful insights into the nature of large-scale software systems.

Recommendations

Although the specified goals and objectives of the dissertation were met, there are some suggestions for improvement and other areas of research worth pursuing. For instance, further research to determine the optimal number of vector components sufficient to cluster a set of vectors well (the question of how many metrics are enough) would help reduce the amount of data processing and sharpen the obtained results. It would also be interesting to investigate the effect of combining n!/(n-1)! vector components to perform a sensitivity analysis in order to determine the best possible input combinations to the SOM for clustering. The obtained results would help in determining the effect of including a metric as a vector component and how the metric impacts the results obtained in Aspect Mining. There is also a need for further work to determine how close vector components have to be (a threshold) in order to determine the cluster destinations of the vectors from which the components come. This would allow for an intervening step that could be used to further reduce data processing time through identification and elimination of vectors that may represent noise in the experiment datasets. One major problem with Aspect Mining validation is the lack of data against which to compare Aspect Mining results. Since there is no established industry-wide acceptable suite that can serve this purpose, it would be a worthwhile effort to conduct research to establish such datasets from popularly known Aspect Mining benchmark test programs, namely the JHotDraw, Laffra's Dijkstra's algorithm implementation, and PetStore test suites. The established Aspect Mining datasets would then serve as a common base for comparison and validation of newly discovered Aspect Mining techniques.
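The proposed sensitivity analysis over n!/(n-1)! = n component combinations amounts to the leave-one-out subsets of the metric vector, generalizing the manual MCC experiment. The sketch below enumerates these subsets; the component names are the metrics discussed in this dissertation, and each subset would correspond to one SOM training run:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: enumerate the n leave-one-out subsets of the metric components
// for the suggested sensitivity analysis. Only the enumeration is shown;
// training a SOM per subset and comparing Qe/Te would follow.
public class SensitivitySubsets {
    static List<List<String>> leaveOneOut(List<String> components) {
        List<List<String>> subsets = new ArrayList<>();
        for (int drop = 0; drop < components.size(); drop++) {
            List<String> subset = new ArrayList<>(components);
            subset.remove(drop);   // drop one component per subset
            subsets.add(subset);
        }
        return subsets;
    }

    public static void main(String[] args) {
        List<String> metrics = List.of("MethodSig", "InfoFlow", "MIC", "MEC",
                                       "MethodSpread", "MCC");
        for (List<String> s : leaveOneOut(metrics)) {
            System.out.println(s);   // one SOM training run per subset
        }
    }
}
```

Comparing quantization and topographic errors across these runs would show which single component most helps, or most harms, the clustering, exactly as was observed for MCC.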

Summary

Enterprise-wide large-scale software systems change and evolve over time. The evolvability and ever-changing nature of such systems is a function of changes in business rules, changes in technology platforms, third-party dependencies and programming paradigm shifts. Enterprise systems are typically not replaced because they stop working or fail to keep up with evolving needs; rather, the end of the life cycle for an application or system is often brought about not because it does not deliver the required functionalities, but because the system has become too difficult and risky to maintain, or too expensive to modify. Dijkstra (1982) pointed out that separation of concerns in a software system is not only desirable but also necessary for ordered and systematic development, especially considering the increasing complexity of today's software systems. New ideas intended to improve this separation have appeared in the literature and in other research works that outline new emerging paradigms such as Aspect-Oriented Programming (AOP). The principle of Separation of Concerns (SOC) states that the components of an application should be organized as a set of distinct modular units, where each unit encapsulates one particular entity, concern or functionality of the application. A concern is any piece of requirement of interest or focus in a program. One major impediment to program comprehension, maintenance and evolvability is the presence of crosscutting concerns that are scattered across different modules and tangled with the implementations of other concerns. Separation of concerns allows features in a software system to be localized, making them easily adaptable and easily evolvable, resulting in a software system that is easy to understand, maintain and extend.

Eaddy et al. (2008) have shown that there is a strong correlation between the presence of crosscutting concerns and defects in a software system, implying that the more scattered software concern implementations are, the more defects the software may have. According to Bauer (1996), one of the steps required to investigate a legacy software system is to understand how the software system works and how its structure is composed; to achieve this, the approach presented in this dissertation targeted the internal and behavioral structure and the interrelationships between the various components of the investigated legacy systems. A major impediment to program comprehension, maintenance and evolvability is the existence of crosscutting concerns scattered across different modules and tangled with the implementations of other concerns. Roy, GiasUddin, and Dean (2007) stated that studies and experience have shown that scattering and tangling of concerns greatly increase the difficulty of evolving software in a correct and cost-effective manner. Aspect Mining tries to identify crosscutting concerns with the aim of providing means for re-factoring and improving the quality, scalability and maintainability of software systems. The task of identifying and detecting crosscutting concerns in software systems is called Aspect Mining. Loughran and Rashid (2002) stated that mining software aspects is an important exercise because it allows software engineers and developers to locate, manage and adapt assets efficiently. Aspect Mining is a reverse software engineering exploration technique concerned with the development of concepts, principles, methods and tools supporting the identification and extraction of re-factorable aspect candidates in legacy software systems. The main goal of Aspect Mining is to identify portions of the software system that are possible candidates for change, with a view to improving the sustainability and extendibility of the software system.

This dissertation presents a dynamic Aspect Mining approach that targets extractible software features (software code fragments) within the investigated benchmark software through dynamic collection of execution event traces. The basic idea of dynamic Aspect Mining is to observe the run-time behavior of a software system and to extract relevant information about the program under investigation. When a program is run and event traces are collected, the hidden behavior of the program is reflected, and recurring patterns within the execution traces may provide much pertinent information about potential crosscutting concerns in the software system. The Aspect Mining approach presented in this dissertation involved three phases. In the first phase, selected large-scale legacy test systems were dynamically traced, investigated and profiled, and metrics representing the interaction between code fragments were derived from the collected data. In the second phase, the formulated metrics were submitted as input to Self-Organizing Maps (SOMs) for clustering. In the third phase, the clusters produced by the SOM were mapped against the test benchmark software under investigation in order to identify code scattering and tangling symptoms, from which crosscutting concerns were identified and candidate aspect seeds mined. The mapped clusters obtained from the SOM helped expose scattering and tangling cases in the benchmark software, thereby helping in the identification of aspect seeds. At the end of the project, viable validation methodologies were applied to assess performance and establish the validity of the methodologies used. This dissertation has shown that dynamic collection of extractible software features, formulation of metrics that are used as clustering features, and mapping of the obtained clusters to the software package being investigated can lead to the identification of aspect seeds.
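Phase 1 of the approach can be illustrated with a small sketch of metric construction from trace data. The class below is hypothetical (method names included), but it mirrors the fan-in/fan-out/information-flow construction shown in Appendix B, applied to caller-to-callee pairs from an event trace:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: deriving per-method fan-in, fan-out and information flow
// (fan-in * fan-out) from caller->callee pairs in an event trace.
// Method names are illustrative placeholders.
public class TraceMetrics {
    final Map<String, Integer> fanIn = new HashMap<>();
    final Map<String, Integer> fanOut = new HashMap<>();

    // Record one caller->callee event from the trace.
    void record(String caller, String callee) {
        fanOut.merge(caller, 1, Integer::sum);
        fanIn.merge(callee, 1, Integer::sum);
    }

    // Information flow metric for one method.
    int infoFlow(String method) {
        return fanIn.getOrDefault(method, 0) * fanOut.getOrDefault(method, 0);
    }

    public static void main(String[] args) {
        TraceMetrics m = new TraceMetrics();
        m.record("Main.run", "Figure.draw");
        m.record("Editor.refresh", "Figure.draw");
        m.record("Figure.draw", "Canvas.paint");
        System.out.println(m.infoFlow("Figure.draw")); // fan-in 2 * fan-out 1 = 2
    }
}
```

Vectors of such per-method values (together with the other metrics) are what phase 2 submits to the SOM; a high fan-in, as noted earlier, is itself a hint of a possible aspect seed.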

The clustering technique used to cluster the code fragments was the SOM, and since the SOM is central to the clustering part of this work, some notes regarding this technique are presented in the next few paragraphs. The Self-Organizing Map (SOM) is an unsupervised and effective artificial intelligence data clustering and visualization technique, attributed to Kohonen (1998), that reduces the high dimensionality of data through the use of neural networks. The SOM converts nonlinear statistical relationships between high-dimensional data into simple geometric relationships of their image points, usually displayed as a two-dimensional grid of nodes. The SOM classifies input vectors according to similarity, preserving the topology of the input vectors by assigning nearby vectors to nearby categories, thereby organizing and clustering the sample data so that in the final result samples are usually surrounded by other samples with similar characteristics. The SOM is an excellent tool in the exploratory phase of data mining that can be effectively utilized to visualize and explore properties of data. A Self-Organizing Map organizes neurons in a 2-dimensional grid representing the feature space. The SOM neural network structure consists of two layers of neurons. Each neuron in the input layer represents an input variable with a weighted connection to each neuron in the output layer of the network. During the iterations of the training step, the weighted connections adapt and change. The SOM is an iterative technique used to map multivariate data, in which the network is able to learn and display the topology of the data. A major advantage of using the SOM is that the methodology is unsupervised. Other advantages include ease of use, and the fact that outliers usually affect only one map unit. One disadvantage of the SOM is that outliers can have a drastic and disproportionate effect on principal component plots.
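The training behavior described above can be sketched as a single SOM update step: find the best-matching unit for an input vector and pull it, together with its grid neighborhood, toward the input. The code below is a minimal illustration (grid size, learning rate and neighborhood radius are arbitrary), not the MATLAB toolbox implementation used in the study:

```java
// Sketch: one iteration of SOM training on a 2-D grid of weight vectors.
public class SomStep {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // Index of the best-matching unit (closest weight vector to x).
    static int bmu(double[][] weights, double[] x) {
        int best = 0;
        for (int i = 1; i < weights.length; i++)
            if (dist2(weights[i], x) < dist2(weights[best], x)) best = i;
        return best;
    }

    // One update: move the BMU and its grid neighborhood toward x.
    // pos holds each unit's grid coordinates; lr is the learning rate.
    static void step(double[][] weights, int[][] pos, double[] x,
                     double lr, double radius) {
        int b = bmu(weights, x);
        for (int i = 0; i < weights.length; i++) {
            double gd2 = dist2(new double[]{pos[i][0], pos[i][1]},
                               new double[]{pos[b][0], pos[b][1]});
            double h = Math.exp(-gd2 / (2 * radius * radius)); // neighborhood
            for (int k = 0; k < x.length; k++)
                weights[i][k] += lr * h * (x[k] - weights[i][k]);
        }
    }

    public static void main(String[] args) {
        double[][] w = {{0.0, 0.0}, {1.0, 1.0}};
        int[][] pos = {{0, 0}, {1, 0}};
        step(w, pos, new double[]{1.0, 1.0}, 0.5, 1.0);
        System.out.println(w[0][0] + " " + w[0][1]);
    }
}
```

Repeating this step over many inputs, while shrinking the learning rate and radius, is what makes nearby grid units end up representing similar input vectors, the topology-preserving behavior described above.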

When the results obtained from this dissertation were compared to existing Aspect Mining results, it was found that the methodology presented in this dissertation is on par with, and in some instances surpassed, the results obtained by other Aspect Mining methodologies. It was also found that the clusters obtained from the SOM provide hints for initial identification of crosscutting patterns, and provided some level of understanding of the basic internal structure of the software under investigation. With the capabilities exhibited by the applied methodology and the attained level of Aspect Mining capability, the two dissertation questions are therefore answered. The strength of the Aspect Mining approach presented in this dissertation is that it can easily be used with other viable formulated software metrics, and the representation of software code fragments as metrics does not depend on complex formalisms. The following list is a summary of the contributions of the approach used in this dissertation:

1. The methodology applied has shown that with proper formulation of software metrics and appropriate dynamic data collection, the internal structure and behavioral nature of large-scale software systems can be explored.

2. The use of SOMs and similar clustering methodologies, together with proper application of cluster mappings, can serve (as a first step) to help identify code scattering and tangling patterns leading to the mining of aspect seeds.

3. Results have shown that the Aspect Mining approach used in this dissertation does as well as, and no worse than, existing Aspect Mining methods.

4. The approach presented in this dissertation also provided a viable method through which useful insights can be gained into the internal structure of large-scale software systems.

Appendixes

Appendix A AspectJ Event Traces Code

import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
//import java.sql.*;
//import java.lang.System;

public aspect TraceJHD54 {

    public int count = 1;
    String theString;

    // Trace constructor executions reached from method executions, and
    // public method executions, excluding the tracing aspect itself.
    pointcut traceMethods() :
        ((cflow(execution(* *.*(..))) && execution(*.new(..)))
            || execution(public * *..*+.*(..)))
        && !within(TraceJHD54);

    before() : traceMethods() {
        count += 1;
        int beginIndex = thisJoinPoint.getSignature().toString().indexOf("(");
        int endIndex = thisJoinPoint.getSignature().toString().indexOf(")");

        int s1;    String s2, s3, s4;
        int s5;    String s6, s7, s8;
        // theParams = theParams.replace(',', ' ');
        s1 = 0; s2 = ""; s3 = ""; s4 = "";
        s5 = 0; s6 = ""; s7 = ""; s8 = "";

        if (count % 2 == 0) {
            s1 = thisJoinPoint.getSourceLocation().getLine();
            s2 = thisJoinPointStaticPart.getSignature().getDeclaringType().getSimpleName().toString();
            s3 = thisJoinPointStaticPart.getSignature().getName();
            s4 = thisJoinPoint.getSignature().toString().substring(beginIndex, endIndex + 1);
            System.out.print(s1 + " " + s2 + " " + s3 + " " + s4 + " ");
            //theString = s1 + " " + s2 + " " + s3 + " " + s4 + " ";
            //printToFile(theString, count);
        } else { // (count % 2 != 0)
            s5 = thisJoinPoint.getSourceLocation().getLine();
            s6 = thisJoinPointStaticPart.getSignature().getDeclaringType().getSimpleName();
            s7 = thisJoinPointStaticPart.getSignature().getName();
            s8 = thisJoinPoint.getSignature().toString().substring(beginIndex, endIndex + 1);
            System.out.println(s5 + " " + s6 + " " + s7 + " " + s8);
            //theString = s5 + " " + s6 + " " + s7 + " " + s8;
            //printToFile(theString, count);
        }
    }

    public void printToFile(String st, int i) {
        BufferedWriter bufferedWriter = null;  // declared here so finally can reach it
        try {
            bufferedWriter = new BufferedWriter(new FileWriter("JHD541bTrace.Txt"));
            if (i % 2 == 0) { // --- If even ---
                bufferedWriter.write(st);
                bufferedWriter.newLine();
            } else {          // --- We have Odd ---
                bufferedWriter.write(st);
            }
        } catch (FileNotFoundException ex) {
            ex.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        } finally {
            // Close the BufferedWriter
            try {
                if (bufferedWriter != null) {
                    bufferedWriter.flush();
                    bufferedWriter.close();
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
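Each trace line emitted by the aspect above pairs a caller record with a callee record, each printed as a source line number, class, method, and a parenthesized parameter list. A small parser for that layout might look like the sketch below; the field order is inferred from the print statements in Appendix A, and the sample input is a made-up trace line, not actual JHotDraw trace output.

```python
import re

# One record: "<line> <class> <method> (<params>)"; a full trace line holds
# a caller record followed by a callee record.
RECORD = re.compile(r"(\d+)\s+(\S+)\s+(\S+)\s+\(([^)]*)\)")

def parse_trace_line(text):
    """Split a raw trace line into (caller, callee) tuples."""
    records = [(int(line), cls, method, params)
               for line, cls, method, params in RECORD.findall(text)]
    if len(records) != 2:
        raise ValueError("expected a caller and a callee record: %r" % text)
    return records[0], records[1]

caller, callee = parse_trace_line(
    "42 StandardDrawing draw (Graphics) 108 AbstractFigure draw (Graphics)")
print(caller)  # (42, 'StandardDrawing', 'draw', 'Graphics')
print(callee)  # (108, 'AbstractFigure', 'draw', 'Graphics')
```

Parsed tuples like these correspond to the Caller/Callee columns of the TraceData table consumed by the VBA routine in Appendix B.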

Appendix B VBA Code That Constructs Metrics

Private Sub cmdRun_Click()
    Dim str As String
    Dim SQLStr As String
    Dim IntCallCount As Integer, ExtCallCount As Integer
    Dim MethodSig As Integer, FanIn As Integer, FanOut As Integer, InfoFlow As Integer
    Dim MIC As Double, MEC As Double
    Dim MCC As Double, NumberOfTimesMethodIsCalled As Double, MethodSpread As Double
    Dim TotNumberOfMethodsInSubjectSoftware As Integer

    '--- Clean up tblResults
    SQLStr = "DELETE tblResults.* FROM tblResults;"
    DoCmd.RunSQL (SQLStr)
    '--- Clean up tblUniqueMethods
    SQLStr = "DELETE tblUniqueMethods.* FROM tblUniqueMethods;"
    DoCmd.RunSQL (SQLStr)

    '--- Create a table containing all methods, either Caller or Callee
    '--- SQL for Caller ---
    SQLStr = "INSERT INTO tblUniqueMethods ( LineNumber, Class, Method, Params ) " & _
             "SELECT TraceData.[Caller Line Number], TraceData.[Caller Class] AS Class, " & _
             "TraceData.[Caller Method] AS Method, TraceData.[Caller Params] AS Params FROM TraceData;"
    DoCmd.RunSQL (SQLStr)
    '--- SQL for Callee ---
    SQLStr = "INSERT INTO tblUniqueMethods ( LineNumber, Class, Method, Params ) " & _
             "SELECT TraceData.[Callee Line Number], TraceData.[Callee Class] AS Class, " & _
             "TraceData.[Callee Method] AS Method, TraceData.[Callee Params] AS Params FROM TraceData;"
    DoCmd.RunSQL (SQLStr)

    '--- Start processing data to formulate the required metrics ---
    Dim db As DAO.Database
    Dim rst As DAO.Recordset
    DoCmd.SetWarnings (0)
    Set db = CurrentDb()
    Set rst = db.OpenRecordset("SELECT * FROM tblUniqueMethods")
    str = ""
    While Not rst.EOF()
        '--- Calculate FanIn/FanOut and the information-flow metric for each method
        FanIn = GetFanInCount(rst![Method])
        FanOut = GetFanOutCount(rst![Method])
        InfoFlow = (FanIn * FanOut)

        '--- Calculate internal/external method calls needed to calculate MIC and MEC
        IntCallCount = KountInternal(rst![Method], rst![Class])
        ExtCallCount = KountExternal(rst![Method], rst![Class])
        If IntCallCount <> 0 Then
            MIC = (IntCallCount / (IntCallCount + ExtCallCount))
        Else
            MIC = 0
        End If
        If ExtCallCount <> 0 Then
            MEC = (ExtCallCount / (IntCallCount + ExtCallCount))
        Else
            MEC = 0
        End If

        '--- Calculate Method Spread
        NumberOfTimesMethodIsCalled = (IntCallCount + ExtCallCount)
        TotNumberOfMethodsInSubjectSoftware = 45   '--- Calculate and change this ---
        MethodSpread = NumberOfTimesMethodIsCalled / TotNumberOfMethodsInSubjectSoftware

        '--- Calculate Method Cohesion Contribution
        MCC = CalculateMCC(rst![Class], rst![Params])

        '--- Determine the method signature (the sum of the ASCII values of the
        '--- individual characters that make up the method name and its parameters)
        MethodSig = Val(CreateMethodSignature(rst![Method] & rst![Params]))

        SQLStr = "INSERT INTO tblResults ([MethodSig],[Information Flow],[MethodSpread],[MIC],[MEC],[MCC]," & _
                 " [Method Name], [Method Class]) VALUES (" & MethodSig & ",'" & InfoFlow & "','" & _
                 MethodSpread & "','" & MIC & "','" & MEC & "','" & MCC & "', '" & _
                 rst![Method] & "', '" & rst![Class] & "')"
        DoCmd.RunSQL (SQLStr)
        rst.MoveNext
    Wend
    DoCmd.SetWarnings (1)

    '--- Export the generated results (tblResults) as a text file for the SOM to use
    DoCmd.OutputTo acOutputTable, "tblResults", acFormatTXT, "C:\InputToSOM.txt", False
End Sub

NOTE: Individual metrics were calculated by calls to a module that contains the code.
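The per-method formulas in the VBA routine above are simple ratios and products, and can be restated compactly. The sketch below recomputes them for one method; the divide-by-zero guards mirror the VBA's If/Else checks, and the sample counts (fan-in 3, fan-out 4, 2 internal and 6 external calls, 45 total methods) are illustrative only.

```python
def method_metrics(fan_in, fan_out, internal_calls, external_calls, total_methods):
    """Recompute the per-method metrics built by the VBA routine above."""
    info_flow = fan_in * fan_out                    # information-flow metric
    total_calls = internal_calls + external_calls   # times the method is called
    # MIC/MEC: fraction of calls coming from inside vs. outside the method's class.
    mic = internal_calls / total_calls if total_calls else 0.0
    mec = external_calls / total_calls if total_calls else 0.0
    spread = total_calls / total_methods            # MethodSpread
    return {"InfoFlow": info_flow, "MIC": mic, "MEC": mec, "MethodSpread": spread}

m = method_metrics(fan_in=3, fan_out=4, internal_calls=2, external_calls=6,
                   total_methods=45)
print(m["InfoFlow"], m["MIC"], m["MEC"])  # 12 0.25 0.75
```

A high MEC combined with a high MethodSpread is exactly the profile that makes a method a scattering suspect in the later clustering step.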

Appendix C Sample LDA Execution Trace Data Before Metrics Formulation

Appendix D Sample Metric Data (LDA)

Columns: MethodSig, Information Flow, MethodSpread, MIC, MEC, MCC, Method Name
Sample method names: <init>, hasMoreElements, draw, isExecutable, selected, getWidth, access$, getHeight, activate, nextFigure, get, outline, displayBox, deactivate

Appendix E MatLab SOM Code

addpath('C:\somtoolbox\somtoolbox');
sd = som_read_data('C:\somtoolbox\jhddec26.txt');
% Normalize the data so that all components are on the same footing
sd = som_normalize(sd, 'histD');
sm = som_make(sd);
% sm = som_autolabel(sm, sd, 'vote');
som_show(sm, 'umat', 'all');
som_show_add('label', sm, 'subplot', 1);
title('JHD')
% pause % Strike any key to display component planes...
% pause % Strike any key to display clusters in 3D (diagram with red 'o')...
% plot3(sd.data(:,1), sd.data(:,2), sd.data(:,3), 'ro', ...
%       sm.codebook(:,1), sm.codebook(:,2), sm.codebook(:,3), 'k+')
% rotate3d on
% grid on
% axis square
T = clusterdata(sd.data, 'maxclust', 9);
% scatter3(sd.data(:,1), sd.data(:,2), sd.data(:,3), 100, T, 'filled');
scatter3(sd.data(:,2), sd.data(:,3), sd.data(:,4), 100, T, 'filled');
find(T==1)
find(T==2)
find(T==3)
find(T==4)
find(T==5)
find(T==6)
find(T==7)
find(T==8)
find(T==9)
find(T==10)
% som_show(sm, 'comp', 1:5);
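The `find(T==k)` calls above list which rows (methods) fall in each cluster; the mined aspect seeds of Appendix F come from inspecting those member lists. A plain-Python analogue of that step is sketched below; the method names and labels are illustrative (a few names borrowed from Appendix F), and taking the prefix before the first dot as the owning class is an assumption made for the example.

```python
from collections import defaultdict

def clusters_to_seeds(methods, labels):
    """Group methods by cluster label (the analogue of find(T==k)) and flag
    clusters whose members span several classes -- a code-scattering symptom."""
    clusters = defaultdict(list)
    for method, label in zip(methods, labels):
        clusters[label].append(method)
    report = {}
    for label, members in sorted(clusters.items()):
        classes = {m.split(".")[0] for m in members}      # owning class of each member
        report[label] = (members, len(classes) > 1)       # (members, scattered?)
    return report

methods = ["UndoCommand.execute", "UndoManager.popUndo",
           "UndoableAdapter.undo", "RectangleFigure.draw"]
labels = [1, 1, 1, 2]
for label, (members, scattered) in clusters_to_seeds(methods, labels).items():
    print(label, members, "scattered" if scattered else "localized")
```

Here the undo-related methods land in one cluster spread over three classes, which is the kind of pattern that nominates a candidate aspect such as "Undo" in Appendix F.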

Appendix F List of Aspects and Associated Seeds Found

Candidate aspect # 1: Undo
CH.ifa.draw.standard.UndoActivity.undo()
CH.ifa.draw.util.UndoCommand.execute()
CH.ifa.draw.util.UndoManager.getLastElement(List)
CH.ifa.draw.util.UndoManager.isUndoable()
CH.ifa.draw.util.UndoManager.peekRedo()
CH.ifa.draw.util.UndoManager.peekUndo()
CH.ifa.draw.util.UndoManager.popUndo()
CH.ifa.draw.util.UndoManager.pushRedo(Undoable)
CH.ifa.draw.util.UndoableAdapter.isRedoable()
CH.ifa.draw.util.UndoableAdapter.undo()
-- CH.ifa.draw.application.DrawApplication.getUndoManager()

Candidate aspect # 2: Bring to front
CH.ifa.draw.standard.BringToFrontCommand.createUndoActivity()
CH.ifa.draw.standard.BringToFrontCommand.execute()
CH.ifa.draw.standard.CompositeFigure.bringToFront(Figure)

Candidate aspect # 3: Send to back
CH.ifa.draw.standard.CompositeFigure.sendToBack(Figure)
CH.ifa.draw.standard.SendToBackCommand.createUndoActivity()
CH.ifa.draw.standard.SendToBackCommand.execute()

Candidate aspect # 4: Connect text
CH.ifa.draw.figures.ConnectedTextTool.activate()
CH.ifa.draw.figures.ConnectedTextTool.createUndoActivity()
CH.ifa.draw.figures.ConnectedTextTool.endEdit()
CH.ifa.draw.figures.ConnectedTextTool.getConnectedFigure()
CH.ifa.draw.figures.ConnectedTextTool.mouseDown(MouseEvent,int,int)
CH.ifa.draw.figures.ConnectedTextTool.setConnectedFigure(Figure)
CH.ifa.draw.figures.ElbowConnection.connectedTextLocator(Figure)
CH.ifa.draw.figures.ElbowTextLocator.locate(Figure)
CH.ifa.draw.figures.TextFigure.connect(Figure)
CH.ifa.draw.figures.UndoActivity.setConnectedFigure(Figure)
CH.ifa.draw.standard.AbstractFigure.connectedTextLocator(Figure)
CH.ifa.draw.standard.AbstractFigure.getTextHolder()

Candidate aspect # 5: Persistence
CH.ifa.draw.figures.AbstractLineDecoration.write(StorableOutput)
CH.ifa.draw.figures.ArrowTip.write(StorableOutput)
CH.ifa.draw.figures.AttributeFigure.write(StorableOutput)
CH.ifa.draw.figures.EllipseFigure.write(StorableOutput)
CH.ifa.draw.figures.FigureAttributes.write(StorableOutput)
CH.ifa.draw.figures.FigureAttributes.writeColor(StorableOutput,String,Color)
CH.ifa.draw.figures.LineConnection.write(StorableOutput)
CH.ifa.draw.figures.PolyLineFigure.write(StorableOutput)
CH.ifa.draw.figures.RectangleFigure.write(StorableOutput)
CH.ifa.draw.figures.RoundRectangleFigure.write(StorableOutput)
CH.ifa.draw.figures.TextFigure.write(StorableOutput)
CH.ifa.draw.samples.javadraw.AnimationDecorator.write(StorableOutput)
CH.ifa.draw.standard.AbstractConnector.write(StorableOutput)
CH.ifa.draw.standard.AbstractFigure.write(StorableOutput)
CH.ifa.draw.standard.DecoratorFigure.write(StorableOutput)
CH.ifa.draw.util.StorableOutput.writeBoolean(boolean)
CH.ifa.draw.util.StorableOutput.writeColor(Color)
CH.ifa.draw.util.StorableOutput.writeDouble(double)
CH.ifa.draw.util.StorableOutput.writeInt(int)
CH.ifa.draw.util.StorableOutput.writeStorable(Storable)
CH.ifa.draw.util.StorableOutput.writeString(String)

Candidate aspect # 6: Manage handles
CH.ifa.draw.figures.ElbowConnection.handles()
CH.ifa.draw.figures.ElbowHandle.ElbowHandle(LineConnection,int)
CH.ifa.draw.figures.ElbowHandle.draw(Graphics)
CH.ifa.draw.figures.ElbowHandle.locate()
CH.ifa.draw.figures.ElbowHandle.ownerConnection()
CH.ifa.draw.figures.EllipseFigure.handles()
CH.ifa.draw.figures.LineConnection.handles()
CH.ifa.draw.figures.PolyLineFigure.handles()
CH.ifa.draw.figures.PolyLineHandle.PolyLineHandle(PolyLineFigure,Locator,int)
CH.ifa.draw.figures.RectangleFigure.handles()
CH.ifa.draw.standard.BoxHandleKit.addCornerHandles(Figure,List)
CH.ifa.draw.standard.BoxHandleKit.addHandles(Figure,List)
CH.ifa.draw.standard.BoxHandleKit.east(Figure)
CH.ifa.draw.standard.BoxHandleKit.north(Figure)
CH.ifa.draw.standard.BoxHandleKit.northEast(Figure)
CH.ifa.draw.standard.BoxHandleKit.northWest(Figure)
CH.ifa.draw.standard.BoxHandleKit.south(Figure)
CH.ifa.draw.standard.BoxHandleKit.southEast(Figure)
CH.ifa.draw.standard.BoxHandleKit.southWest(Figure)
CH.ifa.draw.standard.BoxHandleKit.west(Figure)
CH.ifa.draw.standard.ChangeConnectionEndHandle.ChangeConnectionEndHandle(Figure)
CH.ifa.draw.standard.EastHandle.EastHandle(Figure)
CH.ifa.draw.standard.NorthEastHandle.NorthEastHandle(Figure)
CH.ifa.draw.standard.NorthHandle.NorthHandle(Figure)
CH.ifa.draw.standard.NorthWestHandle.NorthWestHandle(Figure)
CH.ifa.draw.standard.ResizeHandle.ResizeHandle(Figure,Locator)
CH.ifa.draw.standard.SouthEastHandle.SouthEastHandle(Figure)
CH.ifa.draw.standard.SouthHandle.SouthHandle(Figure)
CH.ifa.draw.standard.SouthWestHandle.SouthWestHandle(Figure)
CH.ifa.draw.standard.WestHandle.WestHandle(Figure)

CH.ifa.draw.figures.FontSizeHandle.FontSizeHandle(Figure,Locator)
CH.ifa.draw.figures.TextFigure.handles()
CH.ifa.draw.standard.AbstractHandle.AbstractHandle(Figure)
CH.ifa.draw.standard.LocatorHandle.LocatorHandle(Figure,Locator)
CH.ifa.draw.standard.NullHandle.NullHandle(Figure,Locator)
CH.ifa.draw.figures.FontSizeHandle.draw(Graphics)
CH.ifa.draw.standard.NullHandle.draw(Graphics)
CH.ifa.draw.figures.RadiusHandle.draw(Graphics)
CH.ifa.draw.figures.RadiusHandle.locate()
CH.ifa.draw.standard.AbstractHandle.draw(Graphics)

Candidate aspect # 7: Manage figure changed event
CH.ifa.draw.figures.LineConnection.figureChanged(FigureChangeEvent)
CH.ifa.draw.standard.FigureChangeEventMulticaster.figureChanged(FigureChangeEvent)
CH.ifa.draw.standard.AbstractFigure.removeFigureChangeListener(FigureChangeListener)
CH.ifa.draw.standard.CompositeFigure.figureChanged(FigureChangeEvent)

Candidate aspect # 8: Move figure
CH.ifa.draw.figures.EllipseFigure.basicMoveBy(int,int)
CH.ifa.draw.figures.PolyLineFigure.basicMoveBy(int,int)
CH.ifa.draw.figures.RectangleFigure.basicMoveBy(int,int)
CH.ifa.draw.figures.RoundRectangleFigure.basicMoveBy(int,int)
CH.ifa.draw.figures.TextFigure.moveBy(int,int)
CH.ifa.draw.standard.AbstractFigure.moveBy(int,int)
CH.ifa.draw.standard.DecoratorFigure.moveBy(int,int)

Candidate aspect # 9: Command executability
CH.ifa.draw.figures.GroupCommand.isExecutableWithView()
CH.ifa.draw.figures.UngroupCommand.isExecutableWithView()
CH.ifa.draw.standard.AbstractCommand.isExecutable()
CH.ifa.draw.standard.AbstractCommand.isExecutableWithView()
CH.ifa.draw.standard.BringToFrontCommand.isExecutableWithView()
CH.ifa.draw.standard.CopyCommand.isExecutableWithView()
CH.ifa.draw.standard.CutCommand.isExecutableWithView()
CH.ifa.draw.standard.DeleteCommand.isExecutableWithView()
CH.ifa.draw.standard.DuplicateCommand.isExecutableWithView()
CH.ifa.draw.standard.PasteCommand.isExecutableWithView()
CH.ifa.draw.standard.SelectAllCommand.isExecutableWithView()
CH.ifa.draw.standard.SendToBackCommand.isExecutableWithView()
CH.ifa.draw.util.CommandMenu.checkEnabled()
CH.ifa.draw.util.RedoCommand.isExecutableWithView()
CH.ifa.draw.util.UndoCommand.isExecutableWithView()
CH.ifa.draw.util.UndoableCommand.isExecutable()

Candidate aspect # 10: Connect figures
CH.ifa.draw.figures.ChopEllipseConnector.ChopEllipseConnector(Figure)
CH.ifa.draw.figures.EllipseFigure.connectorAt(int,int)

CH.ifa.draw.figures.LineConnection.canConnect()
CH.ifa.draw.figures.LineConnection.canConnect(Figure,Figure)
CH.ifa.draw.figures.LineConnection.insertPointAt(Point,int)
CH.ifa.draw.figures.LineConnection.readObject(ObjectInputStream)
CH.ifa.draw.figures.LineConnection.setPointAt(Point,int)
CH.ifa.draw.figures.PolyLineConnector.PolyLineConnector(Figure)
CH.ifa.draw.figures.PolyLineConnector.chop(Figure,Point)
CH.ifa.draw.figures.PolyLineFigure.connectorAt(int,int)
CH.ifa.draw.figures.PolyLineFigure.insertPointAt(Point,int)
CH.ifa.draw.standard.ConnectionTool.findConnectableFigure(int,int,Drawing)
CH.ifa.draw.standard.ConnectionTool.findConnection(int,int,Drawing)
CH.ifa.draw.standard.ConnectionTool.findConnectionStart(int,int,Drawing)
CH.ifa.draw.standard.ConnectionTool.findConnector(int,int,Figure)
CH.ifa.draw.standard.ConnectionTool.findSource(int,int,Drawing)
CH.ifa.draw.standard.ConnectionTool.findTarget(int,int,Drawing)
CH.ifa.draw.standard.ConnectionTool.getAddedFigure()
CH.ifa.draw.standard.ConnectionTool.getConnection()
CH.ifa.draw.standard.ConnectionTool.getEndConnector()
CH.ifa.draw.standard.ConnectionTool.getStartConnector()
CH.ifa.draw.standard.ConnectionTool.getTargetConnector()
CH.ifa.draw.standard.ConnectionTool.getTargetFigure()
CH.ifa.draw.standard.ConnectionTool.mouseDown(MouseEvent,int,int)
CH.ifa.draw.standard.ConnectionTool.mouseDrag(MouseEvent,int,int)
CH.ifa.draw.standard.ConnectionTool.mouseMove(MouseEvent,int,int)
CH.ifa.draw.standard.ConnectionTool.mouseUp(MouseEvent,int,int)
CH.ifa.draw.standard.ConnectionTool.setAddedFigure(Figure)
CH.ifa.draw.standard.ConnectionTool.setConnection(ConnectionFigure)
CH.ifa.draw.standard.ConnectionTool.setEndConnector(Connector)
CH.ifa.draw.standard.ConnectionTool.setStartConnector(Connector)
CH.ifa.draw.standard.ConnectionTool.setTargetConnector(Connector)
CH.ifa.draw.standard.ConnectionTool.setTargetFigure(Figure)
CH.ifa.draw.standard.ConnectionTool.trackConnectors(MouseEvent,int,int)

Candidate aspect # 11: Figure update
CH.ifa.draw.standard.CompositeFigure.figureRequestUpdate(FigureChangeEvent)
CH.ifa.draw.standard.DecoratorFigure.figureRequestUpdate(FigureChangeEvent)
CH.ifa.draw.standard.FigureChangeEventMulticaster.figureRequestUpdate(FigureChangeEvent)

Candidate aspect # 12: Add text
CH.ifa.draw.figures.TextFigure.acceptsTyping()
CH.ifa.draw.figures.TextFigure.getText()
CH.ifa.draw.figures.TextFigure.textDisplayBox()
CH.ifa.draw.figures.TextTool.createFloatingTextField()
CH.ifa.draw.figures.TextTool.createPasteUndoActivity()
CH.ifa.draw.figures.TextTool.deactivate()
CH.ifa.draw.figures.TextTool.endEdit()
CH.ifa.draw.figures.TextTool.fieldBounds(TextHolder)
CH.ifa.draw.figures.TextTool.getFloatingTextField()
CH.ifa.draw.figures.TextTool.isDeleteTextFigure()
CH.ifa.draw.figures.TextTool.mouseDown(MouseEvent,int,int)
CH.ifa.draw.figures.TextTool.mouseUp(MouseEvent,int,int)
CH.ifa.draw.figures.TextTool.setFloatingTextField(FloatingTextField)
CH.ifa.draw.figures.TextTool.setSelectedFigure(Figure)
CH.ifa.draw.figures.TextTool.setTypingTarget(TextHolder)

Candidate aspect # 13: Add URL to figure
CH.ifa.draw.samples.javadraw.URLTool.beginEdit(Figure)
CH.ifa.draw.samples.javadraw.URLTool.endEdit()
CH.ifa.draw.samples.javadraw.URLTool.fieldBounds(Figure)
CH.ifa.draw.samples.javadraw.URLTool.getURL(Figure)
CH.ifa.draw.samples.javadraw.URLTool.mouseDown(MouseEvent,int,int)
CH.ifa.draw.samples.javadraw.URLTool.mouseUp(MouseEvent,int,int)
CH.ifa.draw.samples.javadraw.URLTool.setURL(Figure,String)

Candidate aspect # 14: Manage figures outside drawing
CH.ifa.draw.standard.CompositeFigure.orphan(Figure)
CH.ifa.draw.standard.StandardDrawing.orphan(Figure)

Candidate aspect # 15: Get attribute
CH.ifa.draw.figures.PolyLineFigure.getAttribute(FigureAttributeConstant)
CH.ifa.draw.standard.DecoratorFigure.getAttribute(FigureAttributeConstant)

Candidate aspect # 16: Set attribute
CH.ifa.draw.figures.PolyLineFigure.setAttribute(FigureAttributeConstant,Object)
CH.ifa.draw.standard.DecoratorFigure.setAttribute(FigureAttributeConstant,Object)

Candidate aspect # 17: Manage view rectangle
CH.ifa.draw.standard.CompositeFigure._removeFromQuadTree(Figure)
CH.ifa.draw.standard.QuadTree.remove(Object)

Candidate aspect # 18: Visitor
CH.ifa.draw.figures.LineConnection.visit(FigureVisitor)
CH.ifa.draw.standard.DeleteFromDrawingVisitor.DeleteFromDrawingVisitor(Drawing)
CH.ifa.draw.standard.DeleteFromDrawingVisitor.getDrawing()
CH.ifa.draw.standard.DeleteFromDrawingVisitor.setDrawing(Drawing)
CH.ifa.draw.standard.DeleteFromDrawingVisitor.visitFigure(Figure)
CH.ifa.draw.standard.DeleteFromDrawingVisitor.visitHandle(Handle)

Appendix G Method Detail and Cluster Tables That Were Linked to Mine LDA Seeds


More information

Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2

Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2 IJSRD - International Journal for Scientific Research & Development Vol. 2, Issue 03, 2014 ISSN (online): 2321-0613 Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2 1, 2 Department

More information

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management WHITE PAPER: ENTERPRISE AVAILABILITY Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management White Paper: Enterprise Availability Introduction to Adaptive

More information

Dynamic Design of Cellular Wireless Networks via Self Organizing Mechanism

Dynamic Design of Cellular Wireless Networks via Self Organizing Mechanism Dynamic Design of Cellular Wireless Networks via Self Organizing Mechanism V.Narasimha Raghavan, M.Venkatesh, Divya Sridharabalan, T.Sabhanayagam, Nithin Bharath Abstract In our paper, we are utilizing

More information

Automatic Identification of Important Clones for Refactoring and Tracking

Automatic Identification of Important Clones for Refactoring and Tracking Automatic Identification of Important Clones for Refactoring and Tracking Manishankar Mondal Chanchal K. Roy Kevin A. Schneider Department of Computer Science, University of Saskatchewan, Canada {mshankar.mondal,

More information

Managing Data Resources

Managing Data Resources Chapter 7 Managing Data Resources 7.1 2006 by Prentice Hall OBJECTIVES Describe basic file organization concepts and the problems of managing data resources in a traditional file environment Describe how

More information

INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME

INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME CO-1 Programming fundamental using C The purpose of this course is to introduce to students to the field of programming using C language. The students

More information

STRUCTURED SYSTEM ANALYSIS AND DESIGN. System Concept and Environment

STRUCTURED SYSTEM ANALYSIS AND DESIGN. System Concept and Environment STRUCTURED SYSTEM ANALYSIS AND DESIGN Definition: - System Concept and Environment A system is an orderly grouping of independent components linked together according to plan to achieve a specific objective.

More information

identified and grouped together.

identified and grouped together. Segmentation ti of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is

More information

Chapter 9. Software Testing

Chapter 9. Software Testing Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of

More information

A Study of Bad Smells in Code

A Study of Bad Smells in Code International Journal for Science and Emerging ISSN No. (Online):2250-3641 Technologies with Latest Trends 7(1): 16-20 (2013) ISSN No. (Print): 2277-8136 A Study of Bad Smells in Code Gurpreet Singh* and

More information

Software Development Chapter 1

Software Development Chapter 1 Software Development Chapter 1 1. Introduction Software Applications are increasingly used to tackle problems that concern everyday life : Automatic Bank tellers Airline reservation systems Air traffic

More information

automatic digitization. In the context of ever increasing population worldwide and thereby

automatic digitization. In the context of ever increasing population worldwide and thereby Chapter 1 Introduction In the recent time, many researchers had thrust upon developing various improvised methods of automatic digitization. In the context of ever increasing population worldwide and thereby

More information

Automated Inference of Pointcuts in Aspect-Oriented Refactoring

Automated Inference of Pointcuts in Aspect-Oriented Refactoring Automated Inference of Pointcuts in Aspect-Oriented Refactoring Prasanth Anbalagan 1 Tao Xie 2 Department of Computer Science, North Carolina State University, Raleigh, NC 27695, USA 1 panbala@ncsu.edu

More information

"Learn to do Verification with AOP? We've just learned OOP!"

Learn to do Verification with AOP? We've just learned OOP! "Learn to do Verification with AOP? We've just learned OOP!" Dr David Robinson, Jason Sprott, Gordon Allan Verilab Ltd. david.robinson@verilab.com, jason.sprott@verilab.com, gordon.allan@verilab.com ABSTRACT:

More information

PORTAL RESOURCES INFORMATION SYSTEM: THE DESIGN AND DEVELOPMENT OF AN ONLINE DATABASE FOR TRACKING WEB RESOURCES.

PORTAL RESOURCES INFORMATION SYSTEM: THE DESIGN AND DEVELOPMENT OF AN ONLINE DATABASE FOR TRACKING WEB RESOURCES. PORTAL RESOURCES INFORMATION SYSTEM: THE DESIGN AND DEVELOPMENT OF AN ONLINE DATABASE FOR TRACKING WEB RESOURCES by Richard Spinks A Master s paper submitted to the faculty of the School of Information

More information

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May

More information

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Lecture 25 Clone Detection CCFinder Today s Agenda (1) Recap of Polymetric Views Class Presentation Suchitra (advocate) Reza (skeptic) Today s Agenda (2) CCFinder, Kamiya et al. TSE 2002 Recap of Polymetric

More information

How to Harvest Reusable Components in Existing Software. Nikolai Mansurov Chief Scientist & Architect

How to Harvest Reusable Components in Existing Software. Nikolai Mansurov Chief Scientist & Architect How to Harvest Reusable Components in Existing Software Nikolai Mansurov Chief Scientist & Architect Overview Introduction Reuse, Architecture and MDA Option Analysis for Reengineering (OAR) Architecture

More information

Component-Based Software Engineering TIP

Component-Based Software Engineering TIP Component-Based Software Engineering TIP X LIU, School of Computing, Napier University This chapter will present a complete picture of how to develop software systems with components and system integration.

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

CLASSIFICATION FOR SCALING METHODS IN DATA MINING

CLASSIFICATION FOR SCALING METHODS IN DATA MINING CLASSIFICATION FOR SCALING METHODS IN DATA MINING Eric Kyper, College of Business Administration, University of Rhode Island, Kingston, RI 02881 (401) 874-7563, ekyper@mail.uri.edu Lutz Hamel, Department

More information

On Refactoring Support Based on Code Clone Dependency Relation

On Refactoring Support Based on Code Clone Dependency Relation On Refactoring Support Based on Code Dependency Relation Norihiro Yoshida 1, Yoshiki Higo 1, Toshihiro Kamiya 2, Shinji Kusumoto 1, Katsuro Inoue 1 1 Graduate School of Information Science and Technology,

More information

code pattern analysis of object-oriented programming languages

code pattern analysis of object-oriented programming languages code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s

More information

Implementing evolution: Aspect-Oriented Programming

Implementing evolution: Aspect-Oriented Programming 2IS55 Software Evolution Implementing evolution: Aspect-Oriented Programming Alexander Serebrenik Last week Assignment 8 How is it going? Questions to Marcel: m.f.v.amstel@tue.nl Deadline: Tuesday, June

More information

Keywords Code cloning, Clone detection, Software metrics, Potential clones, Clone pairs, Clone classes. Fig. 1 Code with clones

Keywords Code cloning, Clone detection, Software metrics, Potential clones, Clone pairs, Clone classes. Fig. 1 Code with clones Volume 4, Issue 4, April 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Detection of Potential

More information

1 Executive Overview The Benefits and Objectives of BPDM

1 Executive Overview The Benefits and Objectives of BPDM 1 Executive Overview The Benefits and Objectives of BPDM This is an excerpt from the Final Submission BPDM document posted to OMG members on November 13 th 2006. The full version of the specification will

More information

A Tree Kernel Based Approach for Clone Detection

A Tree Kernel Based Approach for Clone Detection A Tree Kernel Based Approach for Clone Detection Anna Corazza 1, Sergio Di Martino 1, Valerio Maggio 1, Giuseppe Scanniello 2 1) University of Naples Federico II 2) University of Basilicata Outline Background

More information

Categorizing Migrations

Categorizing Migrations What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

More information

Impact of Dependency Graph in Software Testing

Impact of Dependency Graph in Software Testing Impact of Dependency Graph in Software Testing Pardeep Kaur 1, Er. Rupinder Singh 2 1 Computer Science Department, Chandigarh University, Gharuan, Punjab 2 Assistant Professor, Computer Science Department,

More information

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks SOMSN: An Effective Self Organizing Map for Clustering of Social Networks Fatemeh Ghaemmaghami Research Scholar, CSE and IT Dept. Shiraz University, Shiraz, Iran Reza Manouchehri Sarhadi Research Scholar,

More information

6.871 Expert System: WDS Web Design Assistant System

6.871 Expert System: WDS Web Design Assistant System 6.871 Expert System: WDS Web Design Assistant System Timur Tokmouline May 11, 2005 1 Introduction Today, despite the emergence of WYSIWYG software, web design is a difficult and a necessary component of

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

Clone Detection using Textual and Metric Analysis to figure out all Types of Clones

Clone Detection using Textual and Metric Analysis to figure out all Types of Clones Detection using Textual and Metric Analysis to figure out all Types of s Kodhai.E 1, Perumal.A 2, and Kanmani.S 3 1 SMVEC, Dept. of Information Technology, Puducherry, India Email: kodhaiej@yahoo.co.in

More information

Minsoo Ryu. College of Information and Communications Hanyang University.

Minsoo Ryu. College of Information and Communications Hanyang University. Software Reuse and Component-Based Software Engineering Minsoo Ryu College of Information and Communications Hanyang University msryu@hanyang.ac.kr Software Reuse Contents Components CBSE (Component-Based

More information

Introduction to Software Engineering

Introduction to Software Engineering Introduction to Software Engineering Gérald Monard Ecole GDR CORREL - April 16, 2013 www.monard.info Bibliography Software Engineering, 9th ed. (I. Sommerville, 2010, Pearson) Conduite de projets informatiques,

More information

The Analysis and Design of the Object-oriented System Li Xin 1, a

The Analysis and Design of the Object-oriented System Li Xin 1, a International Conference on Materials Engineering and Information Technology Applications (MEITA 2015) The Analysis and Design of the Object-oriented System Li Xin 1, a 1 Shijiazhuang Vocational Technology

More information

A Survey of Concern-Oriented Development Approaches Nicolas Lopez

A Survey of Concern-Oriented Development Approaches Nicolas Lopez A Survey of Concern-Oriented Development Approaches Nicolas Lopez Advisor: André van der Hoek Abstract Concern-oriented development has been of significant interest to the software engineering community

More information

CPS352 Database Systems Syllabus Fall 2012

CPS352 Database Systems Syllabus Fall 2012 CPS352 Database Systems Syllabus Fall 2012 Professor: Simon Miner Fall Semester 2012 Contact: Simon.Miner@gordon.edu Thursday 6:00 9:00 pm KOSC 128 978-380- 2626 KOSC 243 Office Hours: Thursday 4:00 6:00

More information

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION 6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm

More information

A HIERARCHICAL CLUSTERING BASED APPROACH IN ASPECT MINING. Gabriela Czibula, Grigoreta Sofia Cojocar

A HIERARCHICAL CLUSTERING BASED APPROACH IN ASPECT MINING. Gabriela Czibula, Grigoreta Sofia Cojocar Computing and Informatics, Vol. 29, 2010, 881 900 A HIERARCHICAL CLUSTERING BASED APPROACH IN ASPECT MINING Gabriela Czibula, Grigoreta Sofia Cojocar Department of Computer Science Babeş-Bolyai University

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Appendix A - Glossary(of OO software term s)

Appendix A - Glossary(of OO software term s) Appendix A - Glossary(of OO software term s) Abstract Class A class that does not supply an implementation for its entire interface, and so consequently, cannot be instantiated. ActiveX Microsoft s component

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Observe novel applicability of DL techniques in Big Data Analytics. Applications of DL techniques for common Big Data Analytics problems. Semantic indexing

More information

Collaborative Framework for Testing Web Application Vulnerabilities Using STOWS

Collaborative Framework for Testing Web Application Vulnerabilities Using STOWS Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Part I: Preliminaries 24

Part I: Preliminaries 24 Contents Preface......................................... 15 Acknowledgements................................... 22 Part I: Preliminaries 24 1. Basics of Software Testing 25 1.1. Humans, errors, and testing.............................

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

QA Best Practices: A training that cultivates skills for delivering quality systems

QA Best Practices: A training that cultivates skills for delivering quality systems QA Best Practices: A training that cultivates skills for delivering quality systems Dixie Neilson QA Supervisor Lynn Worm QA Supervisor Maheen Imam QA Analyst Information Technology for Minnesota Government

More information

Software Engineering - I

Software Engineering - I Software Engineering - I An Introduction to Software Construction Techniques for Industrial Strength Software Chapter 3 Requirement Engineering Copy Rights Virtual University of Pakistan 1 Requirement

More information

Software Clone Detection Using Cosine Distance Similarity

Software Clone Detection Using Cosine Distance Similarity Software Clone Detection Using Cosine Distance Similarity A Dissertation SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF DEGREE OF MASTER OF TECHNOLOGY IN COMPUTER SCIENCE & ENGINEERING

More information

Analyzing effect of Aspect Oriented concepts in design and implementation of design patterns with case study of Observer Pattern

Analyzing effect of Aspect Oriented concepts in design and implementation of design patterns with case study of Observer Pattern Analyzing effect of Aspect Oriented concepts in design and implementation of design patterns with case study of Observer Pattern Deepali A. Bhanage 1, Sachin D. Babar 2 Sinhgad Institute of Technology,

More information

Deliver robust products at reduced cost by linking model-driven software testing to quality management.

Deliver robust products at reduced cost by linking model-driven software testing to quality management. Quality management White paper September 2009 Deliver robust products at reduced cost by linking model-driven software testing to quality management. Page 2 Contents 2 Closing the productivity gap between

More information

Sort-based Refactoring of Crosscutting Concerns to Aspects

Sort-based Refactoring of Crosscutting Concerns to Aspects Sort-based Refactoring of Crosscutting Concerns to Aspects Robin van der Rijst Delft University of Technology rvdrijst@gmail.com Marius Marin Accenture Marius.Marin@accenture.com Arie van Deursen Delft

More information

Predictive Coding. A Low Nerd Factor Overview. kpmg.ch/forensic

Predictive Coding. A Low Nerd Factor Overview. kpmg.ch/forensic Predictive Coding A Low Nerd Factor Overview kpmg.ch/forensic Background and Utility Predictive coding is a word we hear more and more often in the field of E-Discovery. The technology is said to increase

More information