Effective Estimation of Modules Metrics in Software Defect Prediction

Similar documents
International Journal of Advanced Research in Computer Science and Software Engineering

IQPI: An Incremental System for Answering Imprecise Queries Using Approximate Dependencies and Concept Similarities

Software Defect Prediction System Decision Tree Algorithm With Two Level Data Preprocessing

An Artificial Immune Recognition System-based Approach to Software Engineering Management: with Software Metrics Selection

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

SOFTWARE DEFECT PREDICTION USING IMPROVED SUPPORT VECTOR MACHINE CLASSIFIER

SOFTWARE DEFECT PREDICTION USING PARTICIPATION OF NODES IN SOFTWARE COUPLING

International Journal of Advance Research in Engineering, Science & Technology

Evolutionary Decision Trees and Software Metrics for Module Defects Identification

Functional Dependencies and Single Valued Normalization (Up to BCNF)

A HYBRID FEATURE SELECTION MODEL FOR SOFTWARE FAULT PREDICTION

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

A Fuzzy Model for Early Software Quality Prediction and Module Ranking

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

Further Thoughts on Precision

A New Method For Forecasting Enrolments Combining Time-Variant Fuzzy Logical Relationship Groups And K-Means Clustering

Correlation Based Feature Selection with Irrelevant Feature Removal

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

International Journal of Applied Sciences, Engineering and Management ISSN , Vol. 04, No. 01, January 2015, pp

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Classification Using Unstructured Rules and Ant Colony Optimization

Cross-project defect prediction. Thomas Zimmermann Microsoft Research

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Research Article ISSN:

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

Classification with Diffuse or Incomplete Information

Maintenance of the Prelarge Trees for Record Deletion

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Anale. Seria Informatică. Vol. XVI fasc Annals. Computer Science Series. 16 th Tome 1 st Fasc. 2018

International Journal of Advanced Research in Computer Science and Software Engineering

Empirical Study on Impact of Developer Collaboration on Source Code

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM

Software Defect Prediction Using Static Code Metrics Underestimates Defect-Proneness

Using Association Rules for Better Treatment of Missing Values

An Improved Apriori Algorithm for Association Rules

A Study of Software Metrics

Reliability of Software Fault Prediction using Data Mining and Fuzzy Logic

Systematic Detection And Resolution Of Firewall Policy Anomalies

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Efficiently Mining Positive Correlation Rules

An Empirical Study of Lazy Multilabel Classification Algorithms

Impact of Dependency Graph in Software Testing

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Software Metrics. Lines of Code

Association Rule with Frequent Pattern Growth. Algorithm for Frequent Item Sets Mining

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

What Causes My Test Alarm? Automatic Cause Analysis for Test Alarms in System and Integration Testing

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Datasets Size: Effect on Clustering Results

Rough Set Approach to Unsupervised Neural Network based Pattern Classifier

Predicting Source Code Quality with Static Analysis and Machine Learning. Vera Barstad, Morten Goodwin, Terje Gjøsæter

Handling Missing Values via Decomposition of the Conditioned Set

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,

Available online at ScienceDirect. Procedia Computer Science 89 (2016 )

Sathyamangalam, 2 ( PG Scholar,Department of Computer Science and Engineering,Bannari Amman Institute of Technology, Sathyamangalam,

OO Development and Maintenance Complexity. Daniel M. Berry based on a paper by Eliezer Kantorowitz

A STUDY OF SOME DATA MINING CLASSIFICATION TECHNIQUES

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Comparison of FP tree and Apriori Algorithm

Evaluating the Effect of Inheritance on the Characteristics of Object Oriented Programs

International Journal of Computer Engineering and Applications, Volume XI, Issue XII, Dec. 17, ISSN

Finding Effective Software Security Metrics Using A Genetic Algorithm

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

Business Club. Decision Trees

ERA -An Enhanced Ripper Algorithm to Improve Accuracy in Software Fault Prediction

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

Software Metrics Reduction for Fault-Proneness Prediction of Software Modules

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1

Bellman : A Data Quality Browser

Automated Software Defect Prediction Using Machine Learning

Stochastic propositionalization of relational data using aggregates

Fault-Proneness Estimation and Java Migration: A Preliminary Case Study

Type-2 Fuzzy Logic Approach To Increase The Accuracy Of Software Development Effort Estimation

MACHINE LEARNING BASED METHODOLOGY FOR TESTING OBJECT ORIENTED APPLICATIONS

Generating Cross level Rules: An automated approach

Mining of Web Server Logs using Extended Apriori Algorithm

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem

A Framework for Source Code metrics

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

The Encoding Complexity of Network Coding

SNS College of Technology, Coimbatore, India

Bottom-up Induction of Functional Dependencies from Relations

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data

Web page recommendation using a stochastic process model

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs

3. Data Preprocessing. 3.1 Introduction

1 Introduction. Abstract

Performance Based Study of Association Rule Algorithms On Voter DB

An Object-Oriented Metrics Suite for Ada 95

Chapter 8 The C 4.5*stat algorithm

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

Improving Naïve Bayes Classifier for Software Architecture Reconstruction

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

A study of classification algorithms using Rapidminer

Transcription:

Effective Estimation of Modules Metrics in Software Defect Prediction S.M. Fakhrahmad, A.Sami Abstract The prediction of software defects has recently attracted the attention of software quality researchers. Many predictive classification systems have already been proposed, which aim at early discovery of software modules that are faultprone and versa. The proposed methods are usually assessed using datasets available from NASA Metrics Data repository. These datasets include a combination of design-level and codelevel metrics for different modules. To apply a defect predictor, all metrics have to be measured for any of the modules (to be used as the classifier inputs). The measurement of some of these metrics is easy and can be done straight forward. However, there are a number of metrics which are more difficult or timeconsuming to quantify. Moreover, many of them do not have an exact value; so, they may get different values when using different formulas or tools. In this paper, we first discuss this hypothesis that some strong dependencies exist among various features of these datasets. Based on this hypothesis, we search for short combinations of features from the first category (easy-tomeasure features), which can describe any of the features from the second category (hard-to-measure features) with a high accuracy. Then, we introduce a set of fuzzy modeling systems, each of which estimates the value of one of the second category features from its specified determinants. The evaluation of the estimation systems is carried out by computing the MSE values for all features. The experimental results are promising. The presented estimation system provides usability of the defect prediction system rather than its accuracy. Using this system, the user will not have to measure all the required mentioned metrics for any of the modules. All the features of the second category will automatically be estimated with a high accuracy. Index Terms Software defect prediction, Fuzzy Modeling, Fuzzy Classification, Parameter Estimation, Approximate Dependencies I. INTRODUCTION Quality of a software system is relative to the number of defects reported in the final product. Early discovery of software errors is very important and may cause significant cost savings, especially for large and complex systems. Software defect prediction is the task of classifying software modules into fault-prone (fp) and non-fault-prone (nfp) by means of metric-based classification [1, 2]. S. M. Fakhrahmad is Faculty member in the department of computer engineering, Islamic Azad University of Shiraz, and phd student in Shiraz University, Iran; e-mail: mfakhrahmad@ cse.shirazu.ac.ir. A. Sami is Assistant Professor in the department of computer science and engineering, Shiraz University, Iran; e-mail: asami@ieee.org. Software testing is the most expensive and time consuming issue in the process of software development. It usually requires about 50% of the whole project schedule. On the other hand, it has been proved by experience that the majority of a system s faults exist in a small fraction of modules. Thus, efficient prediction models are helpful for software testing. Accurate estimates of defective modules may help software developers in terms of allocating the limited resources and thus, decreasing testing times [3, 4]. During the past decade, several classification systems have been proposed, which perform predictive modeling efforts for detection of modules that are likely to contain faults. The evaluation of such systems has almost been carried out using a set of datasets available from NASA MDP repository [5]. Recently, via a set of experiments on NASA datasets, Lessman et al. [6] concluded and reported that there is not a high gap between predictive accuracies of different classification methods. In other words, even simple classifiers are able to classify software modules according to code attributes with a good accuracy. This is maybe due to the nature of majority of data sets which have been observed to be linearly separable. Each of the NASA MDP datasets is related to one of the NASA projects and contains several modules. Each module is described by a set of code-level and design-level attributes. All discovered faults of the system are also registered in each dataset, together with the number of module containing the fault. Hence, there are two categories of modules; the modules with 1 or more errors (nfp modules) and those containing no fault (fp modules). The number of module metrics in different NASA datasets varies from 21 to 43 metrics. To apply a defect prediction system to detect fp and nfp modules of a real software system, we have to measure all of these metrics for any of the modules (to be used as the classifier inputs). The measurement of some of these metrics is easy and can be done straight forward. For instance, LOC metrics (such as LOC-Comments which represents the no. of code lines containing comments) and some other attributes can be easily measured. The measurement of such metrics can be handled manually or automatically. On the other hand, some of the metrics are more difficult or time-consuming to measure. The main challenging metrics are the design metrics (e.g., complexity metrics) which require availability of design phase artifacts and design diagrams such

as DFDs, control flow graphs, Formal Description Language (FDL) graphs and UML diagrams, to be extracted from. Moreover, some of the metrics can not be quantified easily and directly. Different formulas and tools have been proposed to estimate these features. Many of the formulas relate to some specific applications or processes. Some others have been defined to be used for some specific programming languages. The other challenge of this kind of metrics is that they require the program to be completed. Effort is an example of such features. This metric represents the mental effort needed to develop or maintain a software module. A high value for Effort means that the module is difficult to change. One of the estimations for this metric has been developed by Maurice Halstead [7], denoted by Halstead_Effort. Halstead_Effort is calculated for modules written in COBOL and PL1 using the following formula: E = D * V (1) Loc_Blank Cyclomatic_complexity (1) Branch_Count Design_Complexity In this formula, D stands for Difficulty and is calculated as Loc_Code_And_Comment Essential_Complexity D = (n1 / 2) * (N2 / n2), (2) Loc_Comments Halstead_Content Where n1 is the number of distinct operators Loc_Executable Halstead_difficulty, n2 is the number of distinct operands and N2 is the whole number of operands. The second parameter, V, stands for Volume and is calculated as Num_Operands Num_Operators Num_Unique_Operands Num_Unique_Operators Halstead_Effort Halstead_Error-Est Halstead_Length Halstead_Level Loc_Total Halstead_ProgTime V = N * (LOG2 n) (3) Halstead_Volume Where n is the module vocabulary size and is calculated as n = n1 + n2 and N is the module length calculated as N = N1 + N2. During the past years, many researchers have attempted to evaluate different methods and several defect prediction systems have been proposed. The results of these systems are given in terms of classification accuracy, precision, performance, etc. However, these factors do not really show the goodness of the model. In this paper, we will introduce an accurate software defect prediction system which aims to provide high usability beside good accuracy. In other words, the user will not have to measure all the above mentioned metrics for any of the modules. The rest of the paper is organized as follows. In Section 2, we describe our approach and introduce different parts of the system in detail. The experimental results of any part of the system are also given in this section. Finally, Section 3 concludes the paper. II. THE PROPOSED SYSTEM As mentioned in Section 1, the main goal of this paper is to develop a defect prediction system which brings about a high degree of usability. If we can discover some probable hidden relations among different metrics, we will be able to develop an estimation system to estimate some metrics values from a combination of others. So, the user will be required to provide just a few metric values. For this purpose, we first divide the set of metrics into two categories. The first category denoted as Type-1-Faeatures includes the set of metrics which are easily quantifiable. The second category denoted as Type-1- Faeatures represents design and Halstead metrics which have some problems to be measured, as discussed in the previous section. Table 1 indicates an example this categorization for the metrics belonging to KC1, which is one of the NASA MDP datasets. This example will be used in the next subsections to illustrate the experimental results. Table I. Categorization of KC1 dataset metrics according to ease of measurement Type-1-Features Type-2-Features The proposed defect prediction system is composed of three major components, namely the approximate dependency miner, the estimation part and the fuzzy rule-based classifier. A high-level view of the system is shown in Fig. 1. A. Approximate Dependency Miner Functional dependencies (FDs) are defined as relationships between attributes of a relational scheme R, and are presented in expressions of the form X A. In this expression X (referred to as the Left-Hand Side (LHS) of the dependency) is a subset of attributes belonging to R and A (referred to as the Right-Hand Side (RHS) of the dependency) is an attribute of R. A functional dependency is said to be valid in a given relation r over R, if for all pairs of tuples t, u belonging to r, we have (t[x i ] = u[x i ], for all X i in X) t[a] = u[a] (4), where t[x] is the value assigned to the attribute x of the tuple t.

Values of Type-1-Features Dataset Dependency Miner Determinants of all Type-2-Features Estimation part Values of Type-2-Features Fuzzy Classifier Classification rate Fig. 1. The high-level architectural view of the proposed defect prediction system Classical Functional dependencies are used in relational schema design in order to normalize relations to be free of redundancy and update anomalies. These dependencies do not allow for exceptions and are sensitive to noisy data. Approximate Dependencies (ADs) are dependencies which do not hold over a fraction of data and thus have a higher flexibility for exceptions and noisy data. In dependency mining part of the system, we use AD-Miner which was proposed in [8] as an incremental mining algorithm. This algorithm uses logical operations on binary strings to find the set of minimal approximate dependencies (having an acceptable accuracy) between attributes. Most of other dependency mining approaches already proposed are not incremental and so have to re-scan all data and repeat the whole computations when a number of records are added to the database [9-13]. In this part of the system, we are interested in discovery of any existing dependency between the values of type-1 and type-2 features. In other words, we look for any short combination of Type-1-Features which can describe one or more features of Type-2. Since the algorithm requires discrete or nominal data, as a preprocessing step, we discretized all features into equi-size partitions. To decrease the dependence on the no. of partitions, the algorithm was run frequently, using 3 to 7 partitions for discritization. Finally, for any of Type-2-Features, the best combination of Type-1-Features (having the highest value of dependency in average) was selected as its determinant. In other words, the most accurate dependencies between a Type-2-Feature and a combination of Type-1-Features were extracted. The results of this phase over KC1 dataset features are shown in Table2. Determinant features found for Type-2-Features will then be used in the estimation part of the system. Table II. The results of the Dependency Miner part: shortlength dependencies between Type-1 and Type-2 Features Approximate Dependency Branch_Count Cylomatic_Complexity Branch_Count, Num_Operands Design_Complexity Branch_Count, Loc_Blank Essential_Complexity Num_Unique_Operands, Loc_Executable Halstead_content Branch_Count, Loc_Blank Halstead_difficulty Branch_Count, Loc_Total Halstead_Effort Num_Operands, Loc_Executable Halstead_Error-Est Num_ Operands, Num_ Operators Halstead_Length Branch_Count, Loc_Blank Halstead_Level Num_Unique_Operands, Branch_Count Halstead_ProgTime Num_ Operands, Num_ Operators Halstead_Volume Accuracy (%) B. Fuzzy Estimation part In this part of the system, we follow the Wang and Mendel s fuzzy rule learning method [14] and develop a set of fuzzy modeling systems with similar structures. Each of these fuzzy modeling systems is constructed to estimate the value of a Type-2-Feature using a combination of Type-1-features as its determinants (discovered in the previous section). Unlike the Dependency Mining part, this part of the system will be used online, when the user will give a set of metric values for a module to the prediction system to be classified. At this time, the user will present just the values of Type-1- Features. As shown in Fig. 1, the values of Type-2-Features will be automatically estimated and forwarded to the classification system. For each dataset, we used 90% of whole data to train the system and 10% for the test. For example, 1900 (out of 2107) modules contained in KC1 were used as train data and the remaining were left for test. In order to evaluate the 98.9 98.2 95.6 88.7 93.4 96.5 96.2 93.6 91.8 93.3 95.7

performance of the estimation part, the MSE values were computed on both train and test data for every Type-2-Feature. Table3 shows the evaluation results of this part of the system for KC1 dataset. Fig. 2 visualizes the behavior of the estimation system in mapping the Branch-Count metric to Cyclomatic-Complexity. Fig. 2.(a) shows the existing relation (dependency) between these two metrics through 2107 modules included in KC1. Fig. 2.(b) indicates the approximated function generated by the estimation part which maps any value of Branch-Count to Cyclomatic-Complexity. The approximated functions for other metrics (all having 2 determinants) are shown in Fig. 3. Table III. Evaluation results of estimation system in terms of MSE and error significance Estimation MSE on Train Data MSE on Test Data Branch_Count Cylomatic_Complexity 0.05 0.09 Branch_Count, Num_Operands Design_Complexity Branch_Count, Loc_Blank Essential_Complexity Num_Unique_Operands, Loc_Executable Halstead_content Branch_Count, Loc_Blank Halstead_difficulty Branch_Count, Loc_Total Halstead_Effort Num_Operands, Loc_Executable Halstead_Error-Est Num_ Operands, Num_ Operators Halstead_Length Branch_Count, Loc_Blank Halstead_Level Num_Unique_Operands, Branch_Count Halstead_ProgTime Num_ Operands, Num_ Operators Halstead_Volume 0.07 0.04 0.81 0.35 4.82 0.009 1.24 0.02 3.5 2.88 0.1 0.04 0.93 0.38 6.03 0.01 1.11 0.08 5.13 3.43 Cyclomatic-Complexity 50 45 40 35 30 25 20 15 10 5 0 0 20 40 60 80 100 Branch-Count (a) Fig. 2. (a) Branch-Count Vs Cyclomatic-Complexity for KC1 dataset; (b) The approximated function which maps Branch- Count to Cyclomatic-Complexity (b)

Fig. 3. The surface viewer of approximated functions mapping easy-to-measure metrics into hard-to-measure ones III. CONCLUSION In this paper, we introduced a highly usable software defect prediction system. The system was assessed using NASA which is a widely used benchmark dataset. In the mining part of the system, a set of dependencies among hard-to-measure features of the dataset and easy-to-measure ones were discovered. Then, we developed a set of fuzzy modeling systems, each of which estimates the value of one of the hardto-obtain features from its specified determinants. In this part of the system, we followed the Wang and Mendel s fuzzy rule learning method. The evaluation of the estimation systems was accomplished by computing the MSE values for all features. The results showed the high ability of the system in terms of approximation. Using this system, the user will not have to measure all the required mentioned metrics for any of the modules. All of the hard-to-measure features will automatically be estimated with a high accuracy. As a future task, we are going to develop a fuzzy classification system using NASA datasets. The classifier accuracy will be verified in two ways. First, we use the actual values of all features from test data. The accuracy value obtained for this case will then be compared to the similar case, where just the first category features are fed into the classifier and the other features are automatically estimated (using the system proposed in this paper) and used. The very low difference between these two classification rates can be promising and will be another evidence for successfulness of the fuzzy estimation part proposed in this paper. REFERENCES [1] L.C. Briand, W.L. Melo, and J. Wu st, (2002). Assessing the Applicability of Fault-Proneness Models Across Object- Oriented Software Projects, IEEE Trans. Software Eng., 28 (7), pp. 706-720. [2] J. Dem_sar, (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Machine Learning Research, 7, pp. 1-30. [3] N. Nagappan, T. Ball, B. Murphy, Using Historical Data and Product Metrics for Early Estimation of Software Failures, In Proc. ISSRE 2006, Raleigh, NC, 2006.

[4] L. Briand, Measurement and Modeling in Software Engineering, presented at Foundations of Empirical Software Engineering Workshop The Legacy of Victor Basili, International Conference Software Engineering, 2005. [5] M. Chapman, P. Callis, and W. Jackson, Metrics Data Program, NASA IV and V Facility, http://mdp.ivv.nasa.gov/, 2004. [6] Stefan Lessmann, (2008). Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings, IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 34 (4), pp. 485-496. [7] M.H. Halstead, (1977). Elements of Software Science. Elsevier. [8] S.M. Fakhrahmad, M.H. sadreddini, M. Zolghadri jahromi, (2008). AD Miner: A new incremental method for discovery of minimal approximate dependencies using logical operations, Intelligent Data Analysis 12, pp. 1 13. [9] P. A. Flach and I. Savnik. (1999). Database dependency discovery: a machine learning approach, AI communications, 12 (3). pp. 139 160. [10] Y. Huhtala, J. Kärkkäinen, P. Porkka and H. Toivonen. (1999) TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies. The Computer Journal. 42(2). pp. 100 111. [11] Y. Huhtala, J. Kärkkäinen, P. Porkka and H. Toivonen, Efficient Discovery of Functional and Approximate Dependencies Using Partitions, In: Proc. the Fourteenth International Conference on Data Engineering, (February 1998), pp. 392-401. [12] S. Lopes, J.M. Petit, L. Lakhal, Efficient Discovery of Functional Dependencies and Armstrong Relations, in: Proc. ICDT 2000, the 7th International Conference on Extending Database Technology: Advances in Database Technology, vol 1777, pp. 350 364, 2000. [13] S.L. Wang, J.S. Tsai, B.C. Chang, "Mining Approximate Dependencies using partitions on Similarity-Relation-based Fuzzy databases", in : Proc. IEEE SMC'99, Vol. 6, pp. 871 875, 1999. [14] Wang L-X, Mendel JM (1992) Fuzzy basis functions, universal approximation, and orthogonal least squares learning. IEEE Trans. Neural Network 3, 807 814.