PREDICTING SIGNIFICANT DATASETS USING DECISION TREE TECHNIQUES FOR SOFTWARE DEFECT ANALYSIS


E. Suganthi (1), Dr. S. Prakasam (2)
(1) M.Phil. (Research Scholar), Department of Computer Science Application, SCSVMV University, Enathur
(2) Associate Professor, Department of Computer Science Application, SCSVMV University, Enathur

ABSTRACT

Data mining techniques can serve as underlying methods in software engineering, since they affect both preprocessing and post-analysis. Software defect predictors are useful for maintaining high software quality effectively. This research aims to establish a method for identifying software defects using preprocessing, clustering, and decision tree methods. In this work we used the NASA and PROMISE datasets as software metrics data. Decision trees are one of the most popular approaches for both classification and prediction. They are generated from specific rules; a decision tree is a classifier in a tree structure. Predicting which software modules are defective can help software developers use the available datasets to deliver high-quality software products. Software defect prediction is the process of finding as many defective software modules as possible without affecting overall performance. Collecting software engineering data (NASA and PROMISE) and mining it has emerged as a successful approach to software defect prediction across multiple datasets. The NASA and PROMISE datasets contain varied quantities of correlated instances and attributes, which are useful for checking data integrity. Additionally, domain-specific expertise can be used to validate data integrity. The significance of the results has been tested via decision tree analysis performed using preprocessing and clustering algorithms.
Software defect prediction analysis, carried out in a continuous and disciplined way, brings many benefits, such as accurate dataset results, classification through the decision tree mechanism, and improved software prediction and process quality.

Keywords: Software Engineering, Decision Tree, Preprocessing, Cluster, WEKA Tool, Software Defect Prediction, Dataset.

1. INTRODUCTION

As software engineering generates huge amounts of data, it is important to use that data properly so that problems in the software development cycle can be solved efficiently. In this paper, we focus on finding the best dataset for each type of software engineering data [1] and the specific data mining techniques that can solve those problems. Classification can be described as a function that maps (classifies) a data item into one of several predefined classes [3]. The goal is to induce a model that can be used to classify future data items of unknown class into one of these classes. In the software development process, the performance of a classifier depends on the type and class of the data. The software systems that we work with are inherently complex and difficult to conceptualize. This complexity leads to faults [4][9] and defects and, as a result, increases the cost of software. Software metric datasets have long been a standard tool for assessing the quality of software systems and the processes that produce them. Cluster analysis is a group of multivariate techniques whose primary purpose is to group entities based on their attributes.

2. LITERATURE SURVEY

N. Fenton, M. Neil (1999), A Critique of Software Defect Prediction Models. Many organizations want to predict the number of defects (faults) in software systems before they are deployed, to gauge the likely delivered quality and maintenance effort. To help in this, numerous software metrics and statistical models have been developed, with a correspondingly large literature.
The authors provide a critical review of this literature and the state of the art.

Volume 5, Issue 7, July 2017 Page 73

Y. Freund, L. Mason (1999), The Alternating Decision Tree Learning Algorithm. An alternating decision tree (ADTree) consists of decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes contain a single number. ADTrees always have prediction nodes as both root and leaves. An ADTree classifies an instance by following all paths for which all decision nodes are true and summing the prediction nodes that are traversed.

Y. Brun (2003), Software Fault Identification via Dynamic Analysis and Machine Learning. This work proposes a technique that identifies program properties that may indicate errors. The technique generates machine learning models of runtime program properties known to expose faults, and applies these models to program properties of user-written code to classify and rank properties that may lead the user to errors.

S. Henry and D. Kafura (1999), The Evaluation of Software Systems' Structure Using Quantitative Software Metrics. The design and analysis of the structure of software systems has typically been based on purely qualitative grounds. The authors report positive experience with a set of quantitative measures of software structure. These metrics are based on the number of possible paths of information flow through a given component.

H. L. Larsen (1999), An Approach to Flexible Information Access Systems Using Soft Computing. The authors present a scheme for modeling expert-like flexibility in query answering by extending information bases with a knowledge-based soft computing layer for query processing. The query answer is a subset of the most satisfying objects. An envelope calibration method is proposed for fast retrieval of these objects from the information base.

J. MacQueen (1967), Some Methods for Classification and Analysis of Multivariate Observations. The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample.
The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if $p$ is the probability mass function for the population, $S = \{S_1, S_2, \ldots, S_k\}$ is a partition of $E_N$, and $u_i$, $i = 1, 2, \ldots, k$, is the conditional mean of $p$ over the set $S_i$, then $W^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$ tends to be low for the partitions $S$ generated by the method.

R. R. Papalkar and G. Chanidel (2013), Clustering in Web Text Mining and Its Application in IEEE Abstract Classification. Text mining, a branch of computer science, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Text mining is seen as an increasingly important tool by modern business to transform data into business intelligence, giving an informational advantage.

Ansaf Salleb and Christel Vrain (2000), An Application of Association Knowledge Discovery and Data Mining. The rapidly emerging field of knowledge discovery in databases (KDD) has grown significantly in the past few years. This growth is driven by a mix of daunting practical needs and strong research interest. The technology for computing and storage has enabled people to collect and store information from a wide range of sources at rates that were, only a few years ago, considered unimaginable.

3. METHODOLOGY & TOOLS

The WEKA tool supports many standard data mining tasks on software engineering datasets, such as data preprocessing, classification, clustering, decision trees (J48), data visualization, and feature selection. The basic premise is to use a computer application that can be trained on a dataset to perform clustering and derive useful information in the form of trends and patterns.
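To make the clustering task concrete, the following is a minimal sketch of k-means and of the within-class variance that the k-means process described in the literature survey tends to keep low. It is an illustration only: it uses the batch (assign, then update) variant rather than MacQueen's original online update, and it is not the WEKA implementation. The function names, the toy module metrics, and the fixed seed are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iterations=100, seed=0):
    """Batch k-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def within_class_variance(centroids, clusters):
    """Sum of squared distances of every point to its cluster mean."""
    return sum(squared_distance(p, c)
               for c, cluster in zip(centroids, clusters)
               for p in cluster)


if __name__ == "__main__":
    # Hypothetical (lines of code, cyclomatic complexity) pairs.
    modules = [(10, 1), (12, 2), (11, 1), (200, 30), (210, 28), (190, 33)]
    centroids, clusters = kmeans(modules, k=2)
    print(centroids)
    print(within_class_variance(centroids, clusters))
```

On this toy data the small and large modules end up in separate clusters, and the within-class variance is far lower than it would be for a partition that mixes the two groups.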
In software defect prediction, the task is to predict continuous or ordered values for a given dataset input. However, as we have seen in [9], some classification techniques, such as preprocessing and cluster classification rule algorithms, can be adapted for prediction to find the best dataset using the WEKA tool.

Decision Tree

In data mining with the WEKA tool, a decision tree is a decision support tool that uses a tree-like graph of decisions and their possible consequences, including event outcomes, resource costs, integrity, results, and utility. A decision tree, or cluster classification decision tree (J48), is used to learn a classification rule function that determines the value of a dependent attribute (variable) given the values of the independent (input) attributes (variables). This is a problem known as supervised classification, because the dependent attribute and the number of classes (values) are given [2].

Fuzzy clustering (also referred to as soft clustering) is a form of clustering in which each data point can belong to more than one cluster. Clustering, or cluster analysis, involves assigning data points to clusters (also called buckets, bins, or classes) such that items in the same cluster are as similar as possible, while items belonging to different clusters are as dissimilar as possible. Clusters are identified via similarity measures. These similarity measures include distance, connectivity, and intensity. Different similarity measures may be chosen based on the data or the application.

4. IMPLEMENTATION

WEKA is a suite of tools for performing data mining, including implementations of classification rules. It is a collection of clustering and classification algorithms for data mining tasks, which can be applied directly to a dataset or called from your own Java code. The suite contains tools for dataset preprocessing, classification, regression, clustering, association rules, and visualization, and it can also be extended with new preprocessing schemes. A classifier model is an arbitrarily complex mapping from all-but-one dataset attributes to the class attribute. The specific form and creation of this mapping, or model, for the NASA and PROMISE datasets differs from classifier to classifier. For example [14], the model of ZeroR (weka.classifiers.rules.ZeroR) consists of just a single value: the most common class or, when predicting a numeric value (regression learning), the mean of all numeric values.
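The ZeroR idea can be sketched in a few lines. This is an illustrative Python baseline for a nominal class only, not WEKA's Java implementation; the class name `ZeroRBaseline`, the `accuracy` helper, and the toy module data are assumptions made for this example.

```python
from collections import Counter


class ZeroRBaseline:
    """Majority-class baseline: ignores every input attribute."""

    def fit(self, rows, labels):
        if not labels:
            raise ValueError("at least one training label is required")
        # The whole "model" is one value: the most frequent class label.
        self.prediction_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.prediction_ for _ in rows]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


if __name__ == "__main__":
    # Hypothetical module metrics (lines of code, complexity) and labels.
    rows = [(120, 14), (15, 2), (40, 3), (300, 25), (22, 1)]
    labels = ["defective", "clean", "clean", "defective", "clean"]
    model = ZeroRBaseline().fit(rows, labels)
    predicted = model.predict(rows)
    print(predicted[0], accuracy(predicted, labels))
```

A baseline like this is mainly useful as a point of comparison: a decision tree that cannot beat the majority-class accuracy on a defect dataset has learned little from the attributes.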
The performance of the learners on the MDP data was assessed using receiver operating characteristic (ROC) curves. Formally, a defect predictor hunts for a signal that a software module is defect prone. Signal detection theory [17] offers ROC curves as an analysis method for assessing different predictors. Any learning algorithm in WEKA is derived from the abstract weka.classifiers.AbstractClassifier class [18]. This, in turn, implements weka.classifiers.Classifier.

Top-down induction of decision trees (TDIDT, an older approach known from pattern recognition):
(1) Select an attribute for the root node and create a branch for each possible attribute value.
(2) Split the instances into subsets (one for each branch extending from the node).
(3) Repeat the procedure recursively for each branch, using only the instances that reach the branch (those that satisfy the conditions along the path from the root to the branch).
(4) Stop if all instances have the same class.

ID3, C4.5, J48 (WEKA): Select the attribute that minimizes the class entropy in the split.

Figure 1 J48 Decision tree classifier
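The top-down induction procedure and the entropy-based attribute choice described above can be sketched as follows. This is a simplified ID3-style illustration for nominal attributes, not the J48 code in WEKA (C4.5 and J48 additionally use the gain ratio, handle numeric attributes, and prune the tree). The helper names and the toy defect data are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split_entropy(rows, labels, attribute):
    """Weighted class entropy after splitting on one nominal attribute."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attribute]].append(label)
    total = len(labels)
    return sum(len(s) / total * entropy(s) for s in subsets.values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the lowest class entropy."""
    return min(attributes, key=lambda a: split_entropy(rows, labels, a))


def build_tree(rows, labels, attributes):
    """Top-down induction: returns a class label or (attribute, branches)."""
    if len(set(labels)) == 1 or not attributes:
        # Stop: pure node, or nothing left to split on (use majority class).
        return Counter(labels).most_common(1)[0][0]
    attribute = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attribute]
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        groups[row[attribute]][0].append(row)
        groups[row[attribute]][1].append(label)
    branches = {value: build_tree(sub_rows, sub_labels, remaining)
                for value, (sub_rows, sub_labels) in groups.items()}
    return (attribute, branches)


if __name__ == "__main__":
    # Hypothetical discretised module metrics.
    rows = [
        {"size": "large", "complexity": "high"},
        {"size": "large", "complexity": "low"},
        {"size": "small", "complexity": "high"},
        {"size": "small", "complexity": "low"},
    ]
    labels = ["defective", "clean", "defective", "clean"]
    print(build_tree(rows, labels, ["size", "complexity"]))
```

In this toy data, splitting on `size` leaves the classes evenly mixed in both branches, while splitting on `complexity` separates them completely, so `complexity` is chosen for the root and both branches become leaves.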

Visualize MergeCurve:

Figure 2 Visualizing merge curve

Different classifiers can also be compared on one dataset by plotting metric curves, not just via accuracy, correlation coefficient, and so on. In the Explorer this is not possible for several classifiers at once; it is only possible in the Knowledge Flow. The actual clustering for this algorithm is shown as one instance for each cluster, representing the cluster centroid.

Figure 3 Cluster Analysis

Cluster analysis groups objects (observations, events) based on the information found in the data describing the objects or their relationships. The goal is that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the likeness (or homogeneity) within a group, and the greater the disparity between groups, the better or more distinct the clustering.
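The idea that a good clustering has high likeness within groups and high disparity between groups can be quantified in several ways. The sketch below compares the average distance between points in the same cluster (cohesion) with the average distance between points in different clusters (separation). The function names and the toy clusterings are assumptions made for this example, this simple ratio is only one of many possible quality measures, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations


def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def cohesion(clusters):
    """Average distance between pairs of points in the same cluster."""
    return mean(math.dist(a, b)
                for cluster in clusters
                for a, b in combinations(cluster, 2))


def separation(clusters):
    """Average distance between pairs of points in different clusters."""
    return mean(math.dist(a, b)
                for first, second in combinations(clusters, 2)
                for a in first
                for b in second)


if __name__ == "__main__":
    # Two hypothetical groupings of the same six points.
    good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
    mixed = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]
    for name, clusters in (("good", good), ("mixed", mixed)):
        # A lower ratio means tighter, better separated clusters.
        print(name, cohesion(clusters) / separation(clusters))
```

A lower cohesion-to-separation ratio indicates a more distinct clustering; the grouping that keeps nearby points together scores much lower than the one that mixes the two regions.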

Cluster Output:

Figure 4 Cluster Analysis Output

5. CONCLUSION

This research work has proposed a new approach for efficiently predicting the best dataset, and the datasets described above were used for experimental purposes. The WEKA data mining tool was used to generate the modified J48 model classifiers. Experimental results have shown a significant improvement over the existing J48 algorithm, and they indicate that the proposed algorithm can achieve accurate predictions. The decision tree algorithm also generates rules during the classification process. These rules are used to decide which branches to follow towards the leaf nodes of the tree. Data mining is only as good as the results it produces, so the quality and quantity of the available data and the computational cost determine the success of data mining in the software development process with the WEKA tool.

FUTURE ENHANCEMENT

As future work, different clustering algorithms, or improved versions of the clustering and machine learning algorithms used here, may be included in the experiments. The algorithms used in our evaluation experiments are the simplest forms of some widely used methods. This model can also be applied to other risk assessment procedures that can be supplied as input to the system. These risk issues must have quantitative representations to be considered as input for our system. We have identified reasons why software engineering is a good fit for data mining techniques, including the inherent complexity of development, the pitfalls of raw metrics, and the difficulty of understanding software processes.

REFERENCES

[1] Q. Taylor and C. Giraud-Carrier, "Applications of data mining in software engineering," Int. J. Data Analysis Techniques and Strategies, 2010.
[2] Z. Li and M. Reformat, "A practical method for the software fault prediction," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI), 2007.
[3] M. Shtern and V. Tzerpos, "Clustering Methodologies for Software Engineering," Review Article, Advances in Software Engineering, 2012.
[4] Y. Brun, "Software fault identification via dynamic analysis and machine learning," Master's thesis, MIT Dept. of EECS, Aug. 16, 2003.
[5] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[6] J. Demšar, "Statistical Comparisons of Classifiers over Multiple Data Sets," J. Machine Learning Research, vol. 7, pp. 1-30.
[7] K. El-Emam, S. Benlarbi, N. Goel, and S. N. Rai, "Comparing Case-Based Reasoning Classifiers for Predicting High-Risk Software Components," J. Systems and Software, vol. 55, no. 3, pp.
[8] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining association rules between sets of items in large databases," in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.
[9] H. Lu and B. Cukic, "An adaptive approach with active learning in software fault prediction," in Proceedings of the 8th International Conference on Predictive Models in Software Engineering, PROMISE '12, pages 79-88, New York, NY, USA, ACM.
[10] S. Henry and D. Kafura, "The Evaluation of Software Systems' Structure Using Quantitative Software Metrics," Software Practice and Experience, vol. 14, no. 6, pp.
[11] Y. Brun and M. D. Ernst, "Finding latent code errors via machine learning over program executions," Proceedings of the 26th International Conference on Software Engineering (Edinburgh, Scotland).
[12] M. Elsner, E. Charniak, and M. Johnson, "Structured generative models for unsupervised named-entity clustering," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL '09). Stroudsburg, PA: Association for Computational Linguistics, 2009, pp.
[13] N. Fenton, P. Krause, and M. Neil, "A probabilistic model for software defect prediction," IEEE Trans. Software Eng., 2001.
[14] G. Boetticher, T. Menzies, and T. Ostrand, "PROMISE Repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2007.
[15] N. E. Fenton and M. Neil, "A critique of software defect prediction models," IEEE Transactions on Software Engineering, 25(5), 1999, pp.
[16] T. M. Khoshgoftaar and N. Seliya, "Fault Prediction Modeling for Software Quality Estimation: Comparing Commonly Used Techniques," Empirical Software Engineering, 8(3), 2003, pp.
[17] T. Menzies, J. DiStefano, A. Orrego, and R. Chapman, "Assessing Predictors of Software Defects," in Proceedings of Workshop Predictive Software Models.
[18] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Transactions on Software Engineering, 33(1), 2007, pp.
[19] J. Munson and T. M. Khoshgoftaar, "The Detection of Fault-Prone Programs," IEEE Transactions on Software Engineering, 18(5), 1992, pp.
[20] F. Padberg, T. Ragg, and R. Schoknecht, "Using machine learning for estimating the defect content after an inspection," IEEE Transactions on Software Engineering, 30(1), 2004, pp.
