Analysis of Data Mining Techniques for Software Effort Estimation

2014 11th International Conference on Information Technology: New Generations (DOI 10.1109/ITNG.2014.116)

Sumeet Kaur Sehra, Assistant Professor, GNDEC, Ludhiana, India, sumeetksehra@gmail.com
Jasneet Kaur, Senior Faculty, NIELIT, Chandigarh, India, jasneetsalaria88@gmail.com
Yadwinder Singh Brar, Professor, GNDEC, Ludhiana, India, braryadwinder@yahoo.co.in
Navdeep Kaur, Professor, SGGSWU, Fatehgarh Sahib, India, nrwsingh@yahoo.com

Abstract - Software effort estimation requires high accuracy, but accurate estimates are difficult to achieve. Increasingly, data mining is used to improve an organization's software process quality, e.g. the accuracy of effort estimation. Because a large number of method combinations exist for software effort estimation, selecting the most suitable combination is the subject of this paper. In this study, data preprocessing is applied and effort is calculated using the COCOMO model. The data mining techniques OLS regression and K-means clustering are then applied to the preprocessed data and the results are compared; K-means clustering applied to preprocessed data proves to be more accurate than the OLS regression technique.

Keywords - Software Effort Estimation, Kilo Lines of Code (KLOC), COCOMO Model, Data Preprocessing, K Means Clustering, OLS Regression

I. INTRODUCTION

Since the beginning of software engineering as a research area more than three decades ago, several development effort estimation methods and processes have been proposed. Accurate estimation keeps a project from overrunning its actual schedule [1]. Various models have been derived from the effort of a large number of completed software projects, drawn from various organizations and applications, to explore how project size maps to project effort [2]. Increasingly, data mining is used to improve an organization's software process quality, e.g. the accuracy of effort estimation. In the data mining process, data is collected from projects and data miners are used to discover beneficial knowledge. Data preprocessing is an important and critical step in the data mining process and has a large impact on the success of a data mining project [3]. Raw data is highly susceptible to noise, missing values and inconsistency, and the quality of the data affects the data mining results. To improve the quality of the data and, consequently, of the mining results, the raw data is preprocessed so as to improve the efficiency and ease of the mining process. Data preprocessing includes cleaning, extraction, selection, etc.; its product is the final training set [4]. In this paper, data preprocessing is applied to the data set. Once the data has been preprocessed, the data mining techniques K-means clustering and OLS regression are applied.

II. BASIC COCOMO MODEL

The Constructive Cost Model (COCOMO) is a software cost estimation model developed by Boehm. The model uses a basic regression formula with parameters derived from historical project data and current project characteristics [4]. Basic COCOMO computes software development effort (and cost) as a function of program size, where program size is expressed in estimated thousands of lines of code (KLOC).
COCOMO applies to three classes of software projects [5]:
a) Organic projects - "small" teams with "good" experience working with "less than rigid" requirements.
b) Semi-detached projects - "medium" teams with mixed experience working with a mix of rigid and less-than-rigid requirements.
c) Embedded projects - developed within a set of "tight" constraints (hardware, software, operational, ...); a combination of organic and semi-detached characteristics.

The basic COCOMO equation for calculating effort takes the form

Effort Applied (E) = a * (KLOC)^b    (1)

where KLOC is the estimated number of delivered lines of code for the project (expressed in thousands). The coefficients a and b are given in Table 1.

TABLE 1. Basic COCOMO Model
Software project | a   | b
Organic          | 2.4 | 1.05
Semi-detached    | 3.0 | 1.12
Embedded         | 3.6 | 1.21
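As a minimal illustration of equation (1) and the coefficients of Table 1, the following Python sketch (not part of the original paper; the project class and KLOC value used in the example call are assumed for demonstration only) computes basic COCOMO effort in person-months:

```python
# Minimal sketch of the basic COCOMO effort equation (1) with the
# coefficients of Table 1. The example project class and KLOC value are
# illustrative, not taken from the paper's data set.

COCOMO_COEFFICIENTS = {
    "organic":       (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded":      (3.6, 1.21),
}

def basic_cocomo_effort(kloc: float, project_class: str = "organic") -> float:
    """Return effort in person-months: E = a * (KLOC) ** b."""
    a, b = COCOMO_COEFFICIENTS[project_class]
    return a * kloc ** b

if __name__ == "__main__":
    # e.g. a hypothetical 32 KLOC organic project
    print(round(basic_cocomo_effort(32.0, "organic"), 2))
```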

III. DATA PRE-PROCESSING

Data pre-processing is an important step in the data mining process. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data must be addressed before running any analysis. Data pre-processing includes data cleaning, data integration, data transformation and data reduction [6]. In this study we investigate three preprocessors [7]:

a. None technique: None is the simplest preprocessor; all values are left unchanged, and the effort is calculated directly from the raw KLOC with the COCOMO model using (1).

b. Norm technique: With the norm preprocessor (max-min normalization), numeric values are mapped to the interval [0, 1] using

Normalized value = (Actual value - min(all values)) / (max(all values) - min(all values))    (2)

where Actual value is the current value to be mapped, and min and max are the minimum and maximum values of the column being normalized in the given data set.

c. Log technique: With the log preprocessor, every numeric value is replaced by its logarithm. This logging procedure minimizes the effect of the occasional very large numeric value:

Log value = log2(Actual value)    (3)

IV. OLS REGRESSION

Ordinary least squares (OLS) regression is one of the oldest techniques for software effort estimation. The technique [6] fits a linear regression function to a data set containing a dependent variable, effort, and multiple independent variables; this type of regression is also referred to as multiple regression. OLS regression assumes the following linear model of the data:

e_i = x_i' * beta + b_0 + eps_i    (4)

where x_i is the row vector containing the values of the i-th observation, x_i(1) to x_i(n); beta is the column vector of slope parameters estimated by the regression; b_0 is the intercept scalar; and eps_i is the error associated with each observation. The slope parameters are estimated using

beta = (X'X)^(-1) (X'e)    (5)

where e is the column vector containing the effort values and X is the N x n matrix of observations.

V. K MEANS CLUSTERING

K-means clustering [9] is a well-known partitioning method. Objects are classified as belonging to one of K groups; the result of the partitioning is a set of K clusters, with each object of the data set belonging to exactly one cluster, and each cluster represented by a centroid or cluster representative. The algorithm consists of the following steps:

1. Initialization: the data set, the number of clusters and an initial centroid for each cluster are defined.
2. Classification: the distance of each data point from every centroid is calculated, and the data point is assigned to the cluster whose centroid is nearest.
3. Centroid recalculation: for the clusters generated in the previous step, the centroids are recalculated.
4. Convergence condition: the algorithm may stop (4.1) when a given or defined number of iterations has been reached, (4.2) when no data points are exchanged between clusters, or (4.3) when a threshold value is achieved.
5. If none of the above conditions is satisfied, the process returns to step 2 and repeats until a convergence condition is met.
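To make the preprocessors of Section III concrete, the sketch below (an illustration only; the sample KLOC values are invented, not taken from the paper's data set) applies the norm preprocessor of equation (2) and the log preprocessor of equation (3) to a column of values:

```python
import math

# Illustrative sketch of the norm (2) and log (3) preprocessors.
# The sample KLOC values below are invented for demonstration only.

def norm_preprocess(values):
    """Max-min normalization: map each value to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_preprocess(values):
    """Replace each numeric value with its base-2 logarithm."""
    return [math.log2(v) for v in values]

kloc = [12.5, 20.0, 7.8, 34.0, 16.9]          # hypothetical project sizes
print(norm_preprocess(kloc))                   # values in [0, 1]
print(log_preprocess(kloc))                    # dampens very large values
```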
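Similarly, for Section IV, the following sketch estimates the intercept and slope through the normal-equation form of equation (5), beta = (X'X)^(-1) X'e. The small KLOC/effort sample is hypothetical and NumPy is assumed to be available:

```python
import numpy as np

# Sketch of OLS regression via the normal equations (5): beta = (X'X)^-1 X'e.
# The KLOC/effort pairs are hypothetical, not the paper's IVR data set.

kloc = np.array([12.5, 20.0, 7.8, 34.0, 16.9])
effort = np.array([44.7, 76.5, 26.9, 135.2, 63.0])   # person-months (invented)

# Design matrix with an intercept column; X is N x 2 here.
X = np.column_stack([np.ones_like(kloc), kloc])

# Solve (X'X) beta = X'e rather than inverting X'X explicitly.
beta = np.linalg.solve(X.T @ X, X.T @ effort)
b0, b1 = beta
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")

# Predicted effort for the observed projects:
print(X @ beta)
```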
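Finally, the steps of the K-means algorithm in Section V can be sketched as below. This is a self-contained one-dimensional illustration with invented data, not the paper's clustering of the preprocessed project data:

```python
import random

# Minimal one-dimensional K-means following the steps of Section V.
# The data points and K are illustrative only.

def k_means(points, k, max_iter=100):
    # 1. Initialization: pick k distinct points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iter):                      # 4.1 iteration limit
        # 2. Classification: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # 3. Centroid recalculation.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # 4.2 Convergence: stop when no centroid moves.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

data = [2.7, 3.8, 4.1, 4.9, 5.3, 5.9, 6.3, 7.0]    # e.g. log-preprocessed KLOC
centroids, clusters = k_means(data, k=2)
print(centroids)
print(clusters)
```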
VI. DATA COLLECTION

The data for the IVR application [10] was collected through a survey of the BPO / software industry in which IVR applications are developed, using questionnaires given directly to the companies' project managers and senior software development professionals. Due to company security and policy, the project names cannot be shown; each project is instead described by its code metrics, person-months and duration in months.

VII. EXPERIMENTAL RESULTS

A. Data Preprocessing Results

Data preprocessing is applied to the data set. For the norm and log techniques the KLOC is transformed using (2) and (3) respectively, whereas for the none technique the KLOC value is unchanged. The effort is then calculated using (1); the results are shown in Table 2.

TABLE 2. Data Preprocessing - estimated effort
None Technique | Norm Technique | Log Technique
44.6891 | 2.2334 | 7.0353
13.9357 | 0.4768 | 4.1256
20.1867 | 0.8311 | 5.0427
12.1875 | 0.3784 | 3.7961
7.873   | 0.1388 | 2.7322
13.5524 | 0.4552 | 4.0569
17.9616 | 0.7047 | 4.7528
16.8539 | 0.6418 | 4.5951
19.0726 | 0.7678 | 4.9017
14.1002 | 0.4861 | 4.1545
22.7039 | 0.9745 | 5.3353
20.7449 | 0.8629 | 5.1106
34.0382 | 1.6224 | 6.3492
28.0606 | 1.2803 | 5.8646
25.5165 | 1.135  | 5.6268
8.6749  | 0.1828 | 2.9669
17.9616 | 0.7047 | 4.7528
15.1988 | 0.5481 | 4.3395
19.6293 | 0.7994 | 4.9731
19.0726 | 0.7678 | 4.9017
22.9845 | 0.9905 | 5.3659
16.8539 | 0.6418 | 4.5951
28.6274 | 1.3127 | 5.9147
16.5775 | 0.6262 | 4.5541

B. OLS Regression Results

OLS regression is applied to the preprocessed data, the slope parameters and intercept scalar (beta, b_0) are obtained, and the effort is then calculated using (1); the results are shown in Table 3.

[TABLE 3. OLS Regression - estimated effort for the None, Norm and Log techniques; the numeric values are not legible in this transcription and are omitted.]

C. K Means Clustering Results

The K-means clustering technique is applied to the preprocessed data and the clusters and their centroids are obtained; the effort is then calculated from them using (1). The results are shown in Table 4.

[TABLE 4. K Means Clustering - estimated effort for the None, Norm and Log techniques; the numeric values are not legible in this transcription and are omitted.]

VIII. MAGNITUDE OF RELATIVE ERROR (MRE)

The magnitude of relative error is calculated using the formula

MRE = |Actual effort - Estimated effort| / Actual effort    (6)

A. MRE in OLS Regression

TABLE 5. MRE in OLS Regression
Data Preprocessing Technique | MRE Value
None | 60.5670
Norm | 1.0159
Log  | 1.4703

B. MRE in K Means Clustering

TABLE 6. MRE in K Means Clustering
Data Preprocessing Technique | MRE Value
None | 1.8799
Norm | 0.7397
Log  | 15.1369

IX. GRAPHICAL REPRESENTATION OF DATA PREPROCESSING TECHNIQUES

A. None Data Preprocessing Technique

Fig. 3 shows that the estimated effort varies from 5.5 to 45, with a minimum estimated effort of 5.5 and a maximum of 45.

Fig.3. Data preprocessing in None Technique

B. Norm Data Preprocessing Technique

Fig. 4 shows that the estimated effort varies from 0.2 to 2.3, with a minimum estimated effort of 0.2 and a maximum of 2.3.

Fig.4. Data preprocessing in Norm Technique

C. Log Data Preprocessing Technique

Fig. 5 shows that the estimated effort varies from 2.9 to 7, with a minimum estimated effort of 2.9 and a maximum of 7.

Fig.5. Data preprocessing in Log Technique
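The MRE of equation (6), averaged over a set of projects, can be computed with a short sketch such as the following (illustrative only; the actual and estimated effort values are invented, not the paper's results):

```python
# Sketch of the magnitude of relative error (6) and its mean over a project set.
# The actual/estimated effort values are invented for illustration.

def mre(actual: float, estimated: float) -> float:
    """Magnitude of relative error for a single project."""
    return abs(actual - estimated) / actual

actual_effort =    [44.7, 13.9, 20.2, 12.2]   # person-months (hypothetical)
estimated_effort = [40.1, 15.3, 18.8, 14.0]

errors = [mre(a, e) for a, e in zip(actual_effort, estimated_effort)]
mean_mre = sum(errors) / len(errors)
print(f"per-project MRE: {[round(x, 3) for x in errors]}")
print(f"mean MRE: {mean_mre:.3f}")
```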

X. GRAPHICAL REPRESENTATION OF OLS REGRESSION

A. OLS Regression in None Data Preprocessing Technique

Fig. 6 shows that the estimated effort obtained is fairly uniform, with values varying from 0 to -0.1.

Fig.6. OLS Regression in None technique

B. OLS Regression in Norm Data Preprocessing Technique

Fig. 7 shows that the estimated effort varies from 0 to -0.1, with a minimum estimated effort of -0.03 and a maximum of 0.

Fig.7. OLS Regression in Norm technique

C. OLS Regression in Log Data Preprocessing Technique

Fig. 8 shows that the estimated effort varies from 0 to -0.1, with the estimated effort showing uniform results.

Fig.8. OLS Regression in Log technique

XI. GRAPHICAL REPRESENTATION OF K MEANS CLUSTERING

A. K Means Clustering in None Data Preprocessing Technique

Fig. 9 shows that the estimated effort varies from 0 to 10, with the estimated effort showing uniform results.

Fig.9. K Means Clustering in None technique

B. K Means Clustering in Norm Data Preprocessing Technique

Fig. 10 shows that the estimated effort varies from 0 to 10, with a minimum estimated effort of 0.1 and a maximum of 0.7.

Fig.10. K Means Clustering in Norm technique

C. K Means Clustering in Log Data Preprocessing Technique

Fig. 11 shows that the estimated effort varies from 0 to 10, with a minimum estimated effort of 5 and a maximum of 10.

Fig.11. K Means Clustering in Log technique

Fig. 12 compares the two techniques under the norm preprocessor: the estimated effort obtained with K-means clustering gives more accurate values than that obtained with OLS regression, and the OLS regression estimates are more scattered, resulting in instability compared to the estimated effort obtained with K-means clustering.

Fig.12. OLS and K Means Clustering in Norm technique (normalized effort for OLS Regression and K Means Clustering)

XII. CONCLUSION

A. Comparison of Data Preprocessing Techniques

Data preprocessing can make a valuable contribution to the set of software effort estimation techniques. If an effort estimate deviates from the actual value by more than 25 percent, the estimate cannot be considered accurate. In this study, the effort obtained using the none and log techniques deviates by more than 25 percent, whereas the effort estimated after normalizing the data deviates by less than 25 percent; i.e., the results obtained using the norm data preprocessing technique are more accurate. Graphically, the estimated effort calculated with the norm data preprocessing technique also gives more stable values than the none and log data preprocessing techniques.

B. Comparison of Data Mining Techniques

The effort obtained by using K-means clustering with the norm data preprocessing technique gives more accurate results than the OLS regression technique. The results also show that the MRE (Magnitude of Relative Error) obtained by applying K-means clustering with the norm data preprocessing technique is substantially lower than the MRE obtained by applying OLS regression with the norm data preprocessing technique.

XIII. REFERENCES

[1] Ruotsalainen, L., "Data Mining Tools for Technology and Competitive Intelligence", Espoo, VTT, 2008.
[2] Devesh, K.S., Durg, S.C., Raghuraj, S., "VRS Model: A Model for Estimation of Efforts and Time Duration in Development of IVR Software System", International Journal of Software Engineering (IJSE), vol. 5, pp. 27-46, 2012.
[3] Edelstein, H., "Introduction to Data Mining and Knowledge Discovery", 3rd ed., Two Crows Corporation, Potomac, Maryland, 1999.
[4] A.A., P.K.M., S.M., et al., "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets", Division of Computer Applications, Indian Agricultural Statistics Research Institute (ICAR), New Delhi: Pusa.
[5] http://en.wikipedia.org/wiki/cocomo (n.d.).
[6] Dejaeger, K., Verbeke, W., Martens, D., Baesens, B., "Data Mining Techniques for Software Effort Estimation: A Comparative Study", IEEE Transactions on Software Engineering, vol. 38, pp. 375-397, 2012.
[7] Keung, J., Kocaguneli, E., Menzies, T., "A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation", Springer Science+Business Media, pp. 1-19, 2011.
[8] Navjot, K., Jaspreet, K.S., Navneet, K., "Efficient K-Means Clustering Algorithm Using Ranking Method In Data Mining", International Journal of Advanced Research in Computer Engineering & Technology, vol. 1, pp. 85-91, 2012.
[9] Devesh, K.S., Durg, S.C., Raghuraj, S., "VRS Model: A Model for Estimation of Efforts and Time Duration in Development of IVR Software System", International Journal of Software Engineering (IJSE), vol. 5, pp. 27-46, 2012.