2014 11th International Conference on Information Technology: New Generations

Analysis of Data Mining Techniques for Software Effort Estimation

Sumeet Kaur Sehra, Assistant Professor, GNDEC, Ludhiana, India, sumeetksehra@gmail.com
Jasneet Kaur, Senior Faculty, NIELIT, Chandigarh, India, jasneetsalaria88@gmail.com
Yadwinder Singh Brar, Professor, GNDEC, Ludhiana, India, braryadwinder@yahoo.co.in
Navdeep Kaur, Professor, SGGSWU, Fatehgarh Sahib, India, nrwsingh@yahoo.com

Abstract - Software effort estimation requires high accuracy, but accurate estimations are difficult to achieve. Increasingly, data mining is used to improve an organization's software process quality, e.g. the accuracy of effort estimations. Since a large number of different method combinations exist for software effort estimation, selecting the most suitable combination is the subject of research in this paper. In this study, data preprocessing is implemented and effort is calculated using the COCOMO model. The data mining techniques OLS Regression and K Means Clustering are then applied to the preprocessed data and the results obtained are compared; K Means Clustering applied to preprocessed data proves to be more accurate than the OLS Regression technique.

Keywords - Software Effort Estimation, Kilo Lines of Code (KLOC), COCOMO Model, Data Preprocessing, K Means Clustering, OLS Regression

I. INTRODUCTION

Since the beginning of software engineering as a research area more than three decades ago, several development effort estimation methods and processes have been proposed. Accurate estimation prevents a project from overrunning its actual schedule [1]. Various models have been derived from the effort data of a large number of completed software projects, across organizations and applications, to explore how project size maps onto project effort [2]. Increasingly, data mining is used to improve an organization's software process quality, e.g. the accuracy of effort estimations.
In the data mining process, data is collected from projects and data miners are used to discover beneficial knowledge. Data pre-processing is an important and critical step in the data mining process, and it has a huge impact on the success of a data mining project [3]. Raw data is highly susceptible to noise, missing values, and inconsistency, and the quality of the data affects the data mining results. In order to improve the quality of the data and, consequently, of the mining results, raw data is pre-processed so as to improve the efficiency and ease of the mining process. Data pre-processing includes cleaning, extraction, selection, etc.; the product of data pre-processing is the final training set [4]. In this paper, data preprocessing is implemented on the data set. Once the data has been preprocessed, the data mining techniques K Means Clustering and OLS Regression are applied.

II. BASIC COCOMO MODEL

The Constructive Cost Model (COCOMO) is a software cost estimation model developed by Boehm. The model uses a basic regression formula with parameters that are derived from historical project data and current project characteristics [4]. Basic COCOMO computes software development effort (and cost) as a function of program size, expressed in estimated thousands of lines of code (KLOC). COCOMO applies to three classes of software projects [5]:

a) Organic projects - "small" teams with "good" experience working with "less than rigid" requirements

b) Semi-detached projects - "medium" teams with mixed experience working with a mix of rigid and less-than-rigid requirements

c) Embedded projects - developed within a set of "tight" constraints (hardware, software, operational, ...); a combination of organic and semi-detached characteristics

The basic COCOMO equation for calculating effort takes the form

Effort Applied (E) = a * (KLOC)^b    (1)

where KLOC is the estimated number of delivered lines of code for the project (expressed in thousands).
The coefficients a and b are given in the following table:

TABLE 1. Basic COCOMO Model

Software project | a   | b
Organic          | 2.4 | 1.05
Semi-detached    | 3.0 | 1.12
Embedded         | 3.6 | 1.21

978-1-4799-3187-3/14 $31.00 (c) 2014 IEEE, DOI 10.1109/ITNG.2014.116
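The basic COCOMO calculation of equation (1) with the Table 1 coefficients can be sketched as follows; the function and dictionary names are illustrative, not from the paper:

```python
# Minimal sketch of basic COCOMO, equation (1), with Table 1 coefficients.
COCOMO_COEFFICIENTS = {
    "organic":       (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded":      (3.6, 1.21),
}

def cocomo_effort(kloc, project_class="organic"):
    """Effort Applied (E) = a * (KLOC) ** b, in person-months."""
    a, b = COCOMO_COEFFICIENTS[project_class]
    return a * kloc ** b
```

For example, an organic 10 KLOC project gives roughly 27 person-months, while the same size estimated as embedded gives a noticeably larger effort, reflecting the larger exponent b.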
III. DATA PRE-PROCESSING

Data pre-processing is an important step in the data mining process. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data come first and foremost before running an analysis. Data pre-processing includes Data Cleaning, Data Integration, Data Transformation and Data Reduction [6]. In this study we investigate three preprocessors [7]:

a. None technique: None is the simplest preprocessor; all values are unchanged, and the effort is calculated directly from the COCOMO model using (1).

b. Norm technique: With the norm preprocessor (max-min normalization), numeric values are normalized to the 0-1 interval using the equation

Normalized Value = (Actual Value - min(All Values)) / (max(All Values) - min(All Values))    (2)

where Actual Value is the current value to be mapped, min is the minimum value of the particular column to be normalized in the given data set, and max is the maximum value of that column.

c. Log technique: With the log preprocessor, all numeric values are replaced with their logarithm. This logging procedure minimizes the effect of the occasional very large numeric value:

Log Value = log_2(Actual Value)    (3)

IV. OLS REGRESSION

OLS Regression is one of the oldest techniques for software effort estimation. The technique [6] fits a linear regression function to a data set containing a dependent variable, e, and multiple independent variables, x_1 to x_n. This type of regression is also referred to as multiple regression. OLS Regression assumes the following linear model of the data:

e_i = x'_i * beta + b_0 + epsilon_i    (4)

where x'_i represents the row vector containing the values of the i-th observation, x_i(1) to x_i(n); beta is the column vector containing the slope parameters that are estimated by the regression; and b_0 is the intercept scalar.
The term epsilon_i is the error associated with each observation. This error term is used to estimate beta using the equation

beta = (X'X)^-1 (X'e)    (5)

where e is the column vector containing the efforts and X is the N x n matrix of observations.

V. K MEANS CLUSTERING

K-means clustering [9] is a well-known partitioning method, in which objects are classified as belonging to one of K groups. The result of the partitioning method is a set of K clusters, with each object of the data set belonging to exactly one cluster. Each cluster may be represented by a centroid or a cluster representative. The algorithm consists of four steps:

1. Initialization: the data set, the number of clusters, and an initial centroid for each cluster are defined.

2. Classification: the distance of each data point from each centroid is calculated, and the data point is assigned to the cluster whose centroid is nearest.

3. Centroid recalculation: for the clusters generated in the previous step, the centroids are recalculated.

4. Convergence condition: some possible convergence conditions are:
   4.1 stopping when a given or defined number of iterations is reached;
   4.2 stopping when there is no exchange of data points between the clusters;
   4.3 stopping when a threshold value is achieved.

5. If none of the above conditions is satisfied, go to step 2 and repeat the whole process until one of the conditions is satisfied.

VI. DATA COLLECTION

The data of an IVR application [10] was collected through a survey of the BPO / software industry where the IVR application was developed, with questionnaires administered directly to the companies' project managers and senior software development professionals. Due to company security and policy, the authors cannot disclose the project names; the projects are indicated only by code metrics, person-months, and months.

VII. EXPERIMENTAL RESULTS

A. Data Preprocessing Results

The data preprocessing is implemented on the data set.
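The experimental pipeline of Sections III-V can be sketched as follows. This is a minimal illustration, assuming NumPy is available; all function names are illustrative, and the K-means initialization (first k points as centroids) is one simple choice, not the paper's:

```python
import math
import numpy as np  # assumed available

def preprocess(values, technique="none"):
    """Apply the none / norm / log preprocessor to a list of values."""
    if technique == "none":
        return list(values)                      # values unchanged
    if technique == "norm":                      # equation (2)
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    if technique == "log":                       # equation (3)
        return [math.log2(v) for v in values]
    raise ValueError(f"unknown technique: {technique}")

def ols_fit(X, e):
    """Closed-form OLS estimate beta = (X'X)^-1 (X'e), equation (5).
    X is an N x n matrix of observations (prepend a column of ones
    to estimate the intercept b0); e is the effort column vector."""
    X, e = np.asarray(X, float), np.asarray(e, float)
    return np.linalg.solve(X.T @ X, X.T @ e)

def kmeans(points, k, max_iter=100):
    """Plain 1-D K-means following steps 1-4 of Section V."""
    centroids = list(points[:k])                 # step 1: initialization
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]        # step 2: classification
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        new = [sum(c) / len(c) if c else centroids[i]   # step 3: recalc
               for i, c in enumerate(clusters)]
        if new == centroids:                     # step 4.2: no change, stop
            break
        centroids = new
    return centroids, clusters
```

For instance, `preprocess([2, 4, 6], "norm")` maps the column onto [0.0, 0.5, 1.0], after which the transformed values can be fed to either `ols_fit` or `kmeans`.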
In the case of the norm and log techniques, the KLOC values are first transformed using (2) and (3) respectively, whereas in the case of the none technique the KLOC values are unchanged. The effort is then calculated using (1), as shown below:

TABLE 2. Data Preprocessing - Estimated Effort

None Technique | Norm Technique | Log Technique
44.6891 | 2.2334 | 7.0353
13.9357 | 0.4768 | 4.1256
20.1867 | 0.8311 | 5.0427
12.1875 | 0.3784 | 3.7961
7.8730  | 0.1388 | 2.7322
13.5524 | 0.4552 | 4.0569
17.9616 | 0.7047 | 4.7528
16.8539 | 0.6418 | 4.5951
19.0726 | 0.7678 | 4.9017
14.1002 | 0.4861 | 4.1545
22.7039 | 0.9745 | 5.3353
20.7449 | 0.8629 | 5.1106
34.0382 | 1.6224 | 6.3492
28.0606 | 1.2803 | 5.8646
25.5165 | 1.1350 | 5.6268
8.6749  | 0.1828 | 2.9669
17.9616 | 0.7047 | 4.7528
15.1988 | 0.5481 | 4.3395
19.6293 | 0.7994 | 4.9731
19.0726 | 0.7678 | 4.9017
22.9845 | 0.9905 | 5.3659
16.8539 | 0.6418 | 4.5951
28.6274 | 1.3127 | 5.9147
16.5775 | 0.6262 | 4.5541

B. OLS Regression Results

OLS Regression is implemented on the preprocessed data, the slope parameters and the intercept scalar are obtained, and the effort is calculated as shown below:

TABLE 3. OLS Regression - Estimated Effort

None Technique | Norm Technique | Log Technique
-25.67690408 | -5.99148E-05 | -0.317539577
-109.6590243 | -0.000919078 | -0.845813798
-232.8308684 | -0.002509046 | -1.251957154
-196.4007426 | -0.002024379 | -1.151568025
-273.3461179 | -0.003053481 | -1.3514965
-121.9122533 | -0.001068171 | -0.897062508
-435.549873  | -0.005281288 | -1.667440466
-342.1997705 | -0.003995086 | -1.498663167
-1285.686626 | -0.017039686 | -2.565966903
-767.2647852 | -0.009891619 | -2.107781396
-595.1338265 | -0.007499389 | -1.90234958
-33.27665538 | -0.000112956 | -0.389500263
-232.8308684 | -0.002509046 | -1.251957154
-148.9879492 | -0.001407917 | -0.999411308
-295.1959946 | -0.003351718 | -1.400848144
-273.3461179 | -0.003053481 | -1.3514965
-450.0850583 | -0.00548497  | -1.691263241
-196.4007426 | -0.002024379 | -1.151568025
-809.4010523 | -0.010474406 | -2.152804165
-187.9087107 | -0.001911388 | -1.126238716
-2661.850033 | -0.035518244 | -3.308907484
-118.1483984 | -0.001022753 | -0.881643463
-318.139286  | -0.003663925 | -1.449805546
-82.56858661 | -0.000601373 | -0.717409237

C. K Means Clustering Results

The K Means Clustering technique is implemented on the preprocessed data and clusters are obtained; after obtaining the clusters, the effort is calculated using (1). The results obtained are shown below:

TABLE 4. K Means Clustering - Estimated Effort

[None-technique column illegible in source]

Norm Technique | Log Technique
0.267207884 | 0.532189733
0.105729637 | 0.386258665
0.14758052  | 0.435736608
0.092028661 | 0.367437965
0.050368053 | 0.301584628
0.10281712  | 0.382394825
0.133676061 | 0.420516616
0.126383874 | 0.412085061
0.140715856 | 0.428385193
0.106937084 | 0.387884886
0.162378746 | 0.450757418
0.150955988 | 0.439250345
0.22053736  | 0.50038687
0.191316964 | 0.477095536
0.177962586 | 0.465386139
0.059446144 | 0.316887889
0.133676061 | 0.420516616
0.114940589 | 0.398174116
0.144185086 | 0.432124515
0.140715856 | 0.428385193
0.16399261  | 0.452309598
0.126383874 | 0.412085061
0.194200841 | 0.479545285
0.124501025 | 0.409870101

VIII. MAGNITUDE OF RELATIVE ERROR (MRE)

The magnitude of relative error is calculated using the formula

MRE = |Actual Effort - Estimated Effort| / Actual Effort    (6)

A. MRE in OLS Regression

TABLE 5. MRE in OLS Regression

Data Preprocessing Technique | MRE Value
None | 60.5670
Norm | 1.0159
Log  | 1.4703

B. MRE in K Means Clustering

TABLE 6. MRE in K Means Clustering

Data Preprocessing Technique | MRE Value
None | 1.8799
Norm | 0.7397
Log  | 15.1369

IX. GRAPHICAL REPRESENTATION OF DATA PREPROCESSING TECHNIQUES

A. None Data Preprocessing Technique

Fig. 3 shows that the estimated effort varies from 5.5 to 45, with a minimum value of 5.5 and a maximum value of 45.

Fig.3. Data preprocessing in None technique

B. Norm Data Preprocessing Technique

Fig. 4 shows that the estimated effort varies from 0.2 to 2.3, with a minimum value of 0.2 and a maximum value of 2.3.

Fig.4. Data preprocessing in Norm technique

C. Log Data Preprocessing Technique

Fig. 5 shows that the estimated effort varies from 2.9 to 7, with a minimum value of 2.9 and a maximum value of 7.
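The Magnitude of Relative Error of equation (6) in Section VIII can be sketched as follows; the `mmre` helper (mean MRE over all projects) is an illustrative addition, not defined in the paper:

```python
# Minimal sketch of the Magnitude of Relative Error, equation (6).
def mre(actual_effort, estimated_effort):
    """MRE = |actual - estimated| / actual."""
    return abs(actual_effort - estimated_effort) / actual_effort

def mmre(actuals, estimates):
    """Mean MRE over a set of projects (illustrative helper)."""
    return sum(mre(a, e) for a, e in zip(actuals, estimates)) / len(actuals)
```

For example, an actual effort of 100 person-months estimated as 75 gives an MRE of 0.25, exactly the 25-percent accuracy threshold discussed in the conclusion.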
Fig.5. Data preprocessing in Log technique

X. GRAPHICAL REPRESENTATION OF OLS REGRESSION

A. OLS Regression in None Data Preprocessing Technique

Fig. 6 shows that the effort obtained gives uniform results, with the values of estimated effort varying from 0 to -0.1.

Fig.6. OLS Regression in None technique

B. OLS Regression in Norm Data Preprocessing Technique

Fig. 7 shows that the estimated effort varies from 0 to -0.1, with a minimum value of -0.03 and a maximum value of 0.

Fig.7. OLS Regression in Norm technique

C. OLS Regression in Log Data Preprocessing Technique

Fig. 8 shows that the estimated effort varies from 0 to -0.1, with the estimated effort showing uniform results.

Fig.8. OLS Regression in Log technique

XI. GRAPHICAL REPRESENTATION OF K MEANS CLUSTERING

A. K Means Clustering in None Data Preprocessing Technique

Fig. 9 shows that the estimated effort varies from 0 to 10, with the estimated effort showing uniform results.

Fig.9. K Means Clustering in None technique

B. K Means Clustering in Norm Data Preprocessing Technique

Fig. 10 shows that the estimated effort varies from 0 to 10, with a minimum value of 0.1 and a maximum value of 0.7.

Fig.10. K Means Clustering in Norm technique

C. K Means Clustering in Log Data Preprocessing Technique

Fig. 11 shows that the estimated effort varies from 0 to 10, with a minimum value of 5 and a maximum value of 10.
Fig.11. K Means Clustering in Log technique

XII. CONCLUSION

A. Comparison of Data Preprocessing Techniques

The data preprocessing techniques can make a valuable contribution to the set of software effort estimation techniques. If an effort estimate is far from the actual value, i.e. by more than 25 percent, the estimate cannot be considered accurate. As per our estimation, the error of the effort obtained using the none and log techniques is more than 25 percent, whereas the error of the effort estimated after normalizing the data is less than 25 percent; i.e., the results obtained using the norm data preprocessing technique are more accurate. Graphically, the estimated effort calculated using the norm data preprocessing technique gives more stable values compared to the none and log data preprocessing techniques.

B. Comparison of Data Mining Techniques

The effort obtained by using K Means Clustering with the norm data preprocessing technique gives more accurate results than the OLS Regression technique. The results also show that the MRE (Magnitude of Relative Error) obtained by applying K Means Clustering with norm data preprocessing is substantially lower than the MRE obtained by applying OLS Regression with norm data preprocessing.

XIII. REFERENCES

[1]. Ruotsalainen.L, "Data Mining Tools for Technology and Competitive Intelligence". Espoo, VTT, 2008.

[2]. Devesh.K.S, Durg.S.C, Raghuraj.S, "VRS Model: A Model for Estimation of Efforts and Time Duration in Development of IVR Software System". International Journal of Software Engineering (IJSE), vol. 5, pp. 27-46, 2012.

[3]. Edelstein.H, "Introduction to Data Mining and Knowledge Discovery", 3rd ed., Two Crows Corporation, Potomac, Maryland, 1999.

[4]. Dr.A.A, Dr.P.K.M, Dr.S.M, et al., "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets".
Division of Computer Applications, Indian Agricultural Statistics Research Institute (ICAR), Pusa, New Delhi.

[5]. http://en.wikipedia.org/wiki/cocomo (n.d.).

[6]. Karel.D, Wouter.V, David.M, Bart.B, "Data Mining Techniques for Software Effort Estimation: A Comparative Study". IEEE Transactions on Software Engineering, vol. 38, pp. 375-397, 2012.

[7]. Jacky.K, Ekrem.K, Tim.M, "A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation". Springer Science+Business Media, pp. 1-19, 2011.

[8]. Navjot.K, Jaspreet.K.S, Navneet.K, "Efficient K-Means Clustering Algorithm Using Ranking Method in Data Mining". International Journal of Advanced Research in Computer Engineering & Technology, vol. 1, pp. 85-91, 2012.

[9]. Devesh.K.S, Durg.S.C, Raghuraj.S, "VRS Model: A Model for Estimation of Efforts and Time Duration in Development of IVR Software System". International Journal of Software Engineering (IJSE), vol. 5, pp. 27-46, 2012.

Fig.12. OLS and K Means Clustering in Norm technique (legend: OLS Regression Normalized Effort; K Means Clustering Normalized Effort)

Fig. 12 shows that the estimated effort obtained using K Means Clustering gives more accurate values compared to OLS Regression. The estimated effort obtained using OLS Regression is also more scattered, resulting in instability compared to the estimated effort obtained using K Means Clustering.