On the Value of Ranked Voting Methods for Estimation by Analogy


Mohammad Azzeh, Marwan Alseid
Department of Software Engineering, Applied Science University, Amman, Jordan, P.O. Box 166

Abstract. Background: One long-standing issue in Estimation by Analogy (EBA) is finding the closest analogies. Prior studies revealed that existing similarity measures are easily influenced by extreme values and irrelevant features. Aims: Instead of identifying the closest projects based on aggregated similarity degrees, we propose to use Ranked Voting Methods, which rank projects per feature and then aggregate those ranks over all features using vote-counting rules. The project(s) with the highest score are the winners and form the new estimate for the target project. This also enables us to automatically arrive at the preferred number of analogies for each target project, since the winner set may contain more than a single winner. Method: An empirical evaluation with a Jack-knifing procedure has been carried out in which nine datasets from two repositories (PROMISE & ISBSG) were used for benchmarking. The proposed models are compared to some well-known estimation methods: regular K-based EBA, Stepwise Regression, Ordinary Least Squares regression, and regression trees (CART). Results & Conclusions: The performance figures of the proposed models were promising. The use of voting methods presents some useful advantages: (1) it saves the time spent finding an appropriate number of analogies K for each individual project, (2) no project pruning is needed, and (3) no data standardization is required.

1. INTRODUCTION

Estimation by Analogy (EBA) has become a very popular and commercially successful software effort estimation method [1, 4, 7, 9]. It is based on the assumption that the effort of a new project can be estimated efficiently by reusing effort information about similar, already estimated projects documented in a dataset [18]. In order to estimate a new project, one first has to identify the projects that are most useful for prediction. Since the utility of a project cannot be evaluated directly a priori, similarity between project descriptions is used as a heuristic [2, 4]. The closest projects are therefore identified based on aggregated similarity degrees, not based on which project is most preferred across all features. Many researchers [1, 14, 18] argue that, in theory, the similarity measures used in EBA are easily influenced by irrelevant features and by extreme values present in some features. Azzeh et al. [2] and Shepperd & Schofield [18] demonstrated in prior studies that Euclidean distance often fails to retrieve the exact closest projects because it is easily influenced by the abovementioned challenges. For instance, a project with Condorcet wins (i.e. ranked first as the closest project over all features) might not be selected, while a project with fewer wins is selected because it has some extreme values. A remarkable observation from previous studies is that identifying the closest projects is sensitive to the choice of similarity measure. This should not be a surprising result; Mendes et al. [14] compared different types of distance metrics in analogy-based software estimation and revealed that different distance metrics yield different results.

Another important issue in EBA is specifying a priori the appropriate number of analogies required to produce the effort estimate. The current approach starts with a single analogy and increases this number depending on the overall performance over the whole dataset, then uses the global K value that produces the overall best performance [12]. However, a fixed K value that produces the overall best performance does not necessarily provide the best performance for individual projects and may not be suitable for other datasets. Therefore the K value should be allowed to differ for each individual project. This paper presents a new model based on Ranked Voting Methods (RVM) to identify the closest analogies and automatically arrive at the preferred number of analogies for each target project. An RVM is a winner-election method in which voters rank candidates in order of preference [5, 11]. The winner of an election is usually determined by giving each candidate a certain number of points corresponding to the position in which it is ranked by each voter. Once all votes have been counted, the candidate with the highest score is the winner. The voters in EBA are represented by the features; thus features and voters will be used interchangeably hereafter. The voting method is integrated into the EBA process to rank source projects per feature according to their closeness to the target project, and then aggregate those rankings to identify the winner set. Hence, a voting method may identify not just a single winner but a set of winners with indifferences between them. This eventually helps us save the time spent finding an appropriate number of analogies. This kind of integration is defined by Kocaguneli et al. [21] as ensemble learning. Ensemble means building multiple predictors rather than using a single method, and then aggregating the estimates that come from the different learners. The results of [21] showed that ensemble methods perform better than single methods, which forms another motivation for this work.

The rest of the article is organized as follows: Section 2 presents the related work. Section 3 introduces Ranked Voting Methods. Section 4 presents the methodology used in this article. Section 5 presents the results we obtained. Section 6 presents the threats to validity of this study. Lastly, Section 7 summarizes our conclusions and future work.

2. RELATED WORKS

In the literature there is no study focused on the use of RVM in EBA, but two studies can be considered related to our research area. The first study was conducted by Miranda [16] based on the Analytic Hierarchy Process (AHP), a decision-making theory. The AHP was mainly used for the problem of size estimation based on comparisons between software project components on a limited verbal scale (equal; slightly/much/extremely smaller/bigger). The outcome of this process is a vector of weights that can be used to deliver size estimates if at least one reference point is known. The similarity to our work lies mostly in the fact that both models use a ranking approach to arrive at a good estimate. The difference lies mostly in the fact that the AHP method is geared towards estimating a single attribute for several entities based on a single reference point, while our method, as described later, uses an arbitrary number of features in order to form an overall picture of the new estimate. The second study was conducted by Koch and Mitlöhner [13] based on social choice.
They proposed a new estimation method that resembles AHP in its use of comparisons, with the difference that the social-choice approach uses an arbitrary list of variables and aggregates the resulting rankings to place the target project in the proper position. The mean effort value of the nearby projects is then calculated to produce the new estimate. This method handles only a single new project at a time; and since only weak preference assertions are made, there is no need to translate verbal scales into numeric values as in AHP. Both previous studies have been

validated over a very limited number of datasets, which is insufficient to judge the credibility and reliability of such methods.

3. RANKED VOTING THEORY

Ranked Voting theory (also known as preferential voting) aggregates individual interests and preferences towards a collective decision [5, 11]. In other words, voters rank the possible candidates in order of preference, and aggregation rules are then used to find a winner or a set of winners among the various candidates. A voting problem is usually described by a set of m voters offering preference rankings over n candidates. Once the candidates have been assigned ranks by the voters, an RVM can be used to aggregate the collective decision. An aggregation resulting from the application of an RVM may contain indifferences; therefore the winner set may contain more than one candidate. However, two important properties should be fulfilled by an RVM: first, if a candidate x beats all other candidates in pairwise comparisons, then x is a Condorcet winner and should be ranked first in the collective decision; second, the aggregate relation should not contain any cycles and should represent a complete (possibly weak) order of the candidates. Although various RVMs exist in the literature, we limited our choice to the most common methods that satisfy these two properties: the Borda, Copeland and Maximin rules. In these methods the candidate(s) with the highest score form the winner set. Since the aggregate relations of the Maximin, Copeland and Borda methods are based on numeric scores, they can always be expressed as complete weak preference orders, i.e. they can contain indifferences, but not cycles or intransitivity [13].

Borda, Copeland and Maximin counts can all be computed by first constructing the majority margins matrix (MM). The Borda count of a candidate is the sum of the votes across that candidate's row [19]. The Copeland count sums only the signs of the pairwise margins, which gives the difference between the number of candidates a candidate wins against and the number it loses against [5]. The Maximin count is the smallest number in a candidate's row. Finally, the winner is the candidate with the highest score [19]. To illustrate how this works, we provide a hypothetical example with 4 voters (A, B, C, D) and 5 candidates (x, y, z, w, u), as shown in Table 1.

Table 1 Hypothetical example of voting methods

Rank  A  B  C  D
1     y  x  y  x
2     x  w  w  w
3     w  y  x  u
4     z  z  u  y
5     u  u  z  z

The complete profile for this example is described formally by the following notation: ((y ≻ x ≻ w ≻ z ≻ u), (x ≻ w ≻ y ≻ z ≻ u), (y ≻ w ≻ x ≻ u ≻ z), (x ≻ w ≻ u ≻ y ≻ z)), where y ≻ x denotes that y is preferred to x, i.e. y defeats (precedes) x. A sketch that reproduces this example in code follows.
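To make the three counts concrete, here is a minimal Python sketch (our reconstruction of the rules as defined above, not code from the paper) that builds the majority margin matrix for the Table 1 profile and derives the Borda, Copeland and Maximin scores:

```python
# Hypothetical profile from Table 1: each voter lists the candidates
# from most to least preferred.
profile = [
    ["y", "x", "w", "z", "u"],  # voter A
    ["x", "w", "y", "z", "u"],  # voter B
    ["y", "w", "x", "u", "z"],  # voter C
    ["x", "w", "u", "y", "z"],  # voter D
]
candidates = ["x", "y", "z", "w", "u"]

# Majority margin matrix: MM[a][b] = number of voters ranking a above b.
MM = {a: {b: 0 for b in candidates} for a in candidates}
for ranking in profile:
    for i, a in enumerate(ranking):
        for b in ranking[i + 1:]:
            MM[a][b] += 1

# Borda: sum of votes across the candidate's row.
borda = {a: sum(MM[a][b] for b in candidates if b != a) for a in candidates}

# Copeland: sum of the signs of the pairwise margins (wins minus losses).
sign = lambda v: (v > 0) - (v < 0)
copeland = {a: sum(sign(MM[a][b] - MM[b][a]) for b in candidates if b != a)
            for a in candidates}

# Maximin: smallest entry in the candidate's row.
maximin = {a: min(MM[a][b] for b in candidates if b != a) for a in candidates}

for name, score in [("BO", borda), ("CO", copeland), ("MA", maximin)]:
    best = max(score.values())
    winners = [c for c, s in score.items() if s == best]
    print(name, score, "winner set:", winners)
# BO and CO elect x alone; MA elects {x, y}, matching the rankings below.
```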

Table 2 shows the MM for the candidates in Table 1. The value in every entry represents how many times the candidate in the corresponding row beats the candidate in the opponent's column; e.g., the first row and third column tells us that MM(x,z) = 4, which indicates that candidate x beats candidate z by a margin of four.

Table 2 Majority margin matrix for the profile in Table 1

      x  y  z  w  u | BO  CO  MA
x     -  2  4  3  4 | 13   3   2
y     2  -  4  2  3 | 11   2   2
z     0  0  -  0  2 |  2  -3   0
w     1  2  4  -  4 | 11   1   1
u     0  1  2  0  - |  3  -3   0

The resulting aggregated scores for every candidate are shown in the last three columns of Table 2. Therefore, for the above profile we obtain the following rankings:

BO: x ≻ (y~w) ≻ u ≻ z
CO: x ≻ y ≻ w ≻ (u~z)
MA: (x~y) ≻ w ≻ (u~z)

where candidates on the left-hand side are ranked higher and the ~ symbol means indifference between two candidates (i.e. they have the same rank). From the final rankings we can notice that both BO and CO suggest that candidate x is the winner, while MA suggests that both x and y form the winner set. We can also notice that the CO and MA methods rank the Condorcet winner highest whenever such a winner exists. Furthermore, it is evident that different RVMs often yield different orders of preference for the same profile.

4. METHODOLOGY

4.1 EBA and Voting Rules

The notion of RVM can be integrated efficiently into EBA, especially at the project-retrieval stage. The RVM is mainly used to retrieve the closest projects by ranking the source projects per feature and then aggregating those ranks. The voters in EBA are represented by the features, whereas the source projects are the possible candidates. The ranking per feature is often assumed to consist of strict orderings only, but may contain tied ranks when some candidates have the same values. This is very important in EBA when a particular feature is described by categorical or ordinal values. The proposed method (EBAV) proceeds as follows (a compact sketch is given after the list):

1. Find the distance between the test project and all source projects for each feature individually using Eq. (1):

$$d(p_i, p_j) = \begin{cases} |p_i - p_j| & \text{if the feature is continuous} \\ 0 & \text{if the feature is categorical and } p_i = p_j \\ 1 & \text{if the feature is categorical and } p_i \neq p_j \end{cases} \quad (1)$$

2. Rank the source projects per feature according to their closeness to the test project. For categorical features, the projects that have the same category as the target project are ranked first and the remaining projects are ranked second.
3. Calculate the BO, CO and MA counts by aggregating the ranks of every source project over all features, taking into account the number of voters for every preference (i.e. the weights discussed later).
4. The project(s) with the highest score are ranked first and the mean of their effort values is used to produce the new estimate.
5. Steps 1 to 4 are repeated for all test projects.
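The sketch below is our reading of steps 1-4, not the authors' code. In particular, realizing the feature weights of Section 4.2 as voter multiplicities inside a weighted majority-margin matrix, and the handling of tied per-feature ranks, are our assumptions:

```python
import numpy as np
from scipy.stats import rankdata  # average ranks handle tied projects

def ebav_estimate(target, sources, efforts, categorical, weights, rule="MA"):
    # target: (m,) test-project features; sources: (n, m) source projects;
    # efforts: (n,) known efforts; categorical: (m,) boolean mask;
    # weights: (m,) voter multiplicities per feature (Section 4.2).
    n, m = sources.shape
    efforts = np.asarray(efforts, dtype=float)
    ranks = np.empty((n, m))
    for f in range(m):                        # step 1: Eq. (1) per feature
        if categorical[f]:
            d = (sources[:, f] != target[f]).astype(float)
        else:
            d = np.abs(sources[:, f].astype(float) - float(target[f]))
        ranks[:, f] = rankdata(d)             # step 2: rank 1 = closest

    # Step 3: weighted majority margins, where MM[a, b] is the weighted
    # number of features (voters) ranking project a strictly above b.
    MM = np.zeros((n, n))
    for f in range(m):
        MM += weights[f] * (ranks[:, f][:, None] < ranks[:, f][None, :])

    if rule == "BO":                          # Borda: row sums
        score = MM.sum(axis=1)
    elif rule == "CO":                        # Copeland: signs of margins
        score = np.sign(MM - MM.T).sum(axis=1)
    else:                                     # Maximin: worst row entry
        score = (MM + np.diag(np.full(n, np.inf))).min(axis=1)

    winners = np.flatnonzero(score == score.max())  # step 4: winner set
    return efforts[winners].mean(), winners
```

Because the winner set may contain several tied projects, the returned estimate is the mean effort of all winners, which is how EBAV arrives at a per-project number of analogies without a fixed K.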

4.2 Feature Weighting

Ranked voting methods usually allow several voters to cast the same ranking. In EBA, this is represented by assigning a weight to every feature (voter), which translates into having more voters than features, with several voters giving their ranking according to a single feature. For example, we might include ten voters who prefer a ranking based on function points, while only two voters prefer one based on the development mode. To obtain such weights we used the Artificial Bees algorithm (AB) to optimize the weight values for each dataset. The Bees algorithm performs a kind of neighbourhood search combined with random search [17]. As with any search algorithm, the use of AB requires initializing its parameters: the problem size (m), the number of scout bees (n), the number of sites selected out of the n visited sites (k), the number of best sites out of the k selected sites (e), the number of bees recruited for the best e sites (nep), the number of bees recruited for the other (k-e) selected sites (nsp), and the initial size of the patches (ngh), in addition to a stopping criterion [17]. The algorithm starts with an initial population (pop) of scout bees placed randomly in the initial search space. Each scout bee (i.e. each row) represents a potential solution as a set of weight values, as shown in Figure 1, where n is the number of solutions and m is the dimension of a solution, which equals the number of dataset features.

$$pop = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nm} \end{bmatrix}$$

Figure 1. Initial population

The fitness value of each solution in pop is evaluated by running the proposed model and computing its MMRE. The solutions are then reordered by fitness from lowest to highest. Based on the values of k and e, the best k solutions are selected for neighbourhood search. For example, if k=20 and e=10, we select the top 10 solutions (i.e. solutions #1 to #10 in the ordered pop) as elite sites to visit and recruit a number of bees (nep) to search the neighbourhood of each elite solution for possible improvements, forming a new patch. In other words, for each elite solution there will be nep generated solutions searching the neighbourhood for other, possibly better solutions. Similarly, for the best solutions from #11 to #20, a number of bees (nsp) are recruited for each solution to search the neighbourhood and form new patches. It is important to note that nsp should be less than nep, to reflect the fitness of the solutions. The area of neighbourhood search is determined by the radius of the search area around the best solution, which is used to update the k solutions declared in the previous step. This is important because there might be better solutions than the original solution in its neighbourhood. The best solution in each patch replaces the old best solution in that patch. The remaining bees in the population (i.e. solutions #21 to #100) are replaced randomly with other solutions. The algorithm continues searching the neighbourhoods of the selected sites, recruiting more bees to search near the best sites, which may hold promising solutions. These steps are repeated until the stopping criterion (lowest MMRE) is met or the iteration budget is exhausted. In this paper we used the following AB parameters: n=100, k=20, e=10, nep=30, nsp=20, ngh=0.05. A minimal sketch of this search loop is given below.
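The following Python sketch is our illustration of the loop described above, not the authors' implementation; the uniform perturbation within a patch of radius ngh and the clipping of weights to [0, 1] are assumptions:

```python
import random

def bees_search(fitness, m, n=100, k=20, e=10, nep=30, nsp=20,
                ngh=0.05, iterations=50):
    # fitness(w) -> MMRE of EBAV under weight vector w (lower is better);
    # m is the number of features. Weights are assumed to lie in [0, 1].
    pop = [[random.random() for _ in range(m)] for _ in range(n)]

    def neighbour(w):
        # Search within a patch of radius ngh around a site (assumed rule).
        return [min(1.0, max(0.0, x + random.uniform(-ngh, ngh))) for x in w]

    for _ in range(iterations):
        pop.sort(key=fitness)                    # best (lowest MMRE) first
        next_pop = []
        for i, site in enumerate(pop[:k]):       # the k selected sites
            bees = nep if i < e else nsp         # more bees on elite sites
            patch = [neighbour(site) for _ in range(bees)] + [site]
            next_pop.append(min(patch, key=fitness))   # best bee kept
        # the remaining n - k scouts keep searching randomly
        next_pop += [[random.random() for _ in range(m)] for _ in range(n - k)]
        pop = next_pop
    return min(pop, key=fitness)
```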
4.3 Experimental Design

The proposed method has been validated over nine datasets coming from two repositories, PROMISE [3] and ISBSG [6]. Using a sufficient number of datasets increases the credibility and reliability of the proposed method. PROMISE is an online, publicly available data repository which consists of

datasets donated by various researchers around the world [12]. The datasets from this source are: Albrecht, Kemerer, Desharnais, COCOMO, Maxwell, China, Telecom, and NASA93. The remaining dataset comes from the ISBSG data repository (release 10), a large repository of more than 4000 projects collected from different types of projects around the world. Since many projects have missing values, only 500 projects with quality rating A were considered, and 14 useful features were selected, 8 of which are numerical and 6 of which are categorical. The descriptive statistics of these datasets are summarized in Table 3.

Table 3 Statistical properties of the employed datasets: number of cases and minimum, maximum and mean effort for ISBSG, Desharnais, COCOMO, Kemerer, Albrecht, Maxwell, NASA, China and Telecom. [Numeric entries not preserved in this transcription.]

For each dataset we follow the same testing strategy: we use Leave-one-out cross-validation to identify the test and training projects such that, in each run, we select one project as the test set and the remaining projects as the training set. This procedure is repeated until every project in the dataset has been used as a test project. In each run, the prediction accuracy of the different techniques is assessed using various performance measures.

4.4 Performance Measures

Three performance measures have been used to evaluate and compare the different estimation models. The most common measure is the Magnitude of Relative Error (MRE), which calculates the absolute relative error between the actual and predicted project effort values, as shown in Eq. (2). A summary of MRE is the Mean Magnitude of Relative Error (MMRE), shown in Eq. (3). pred(0.25) is an alternative performance measure that counts the percentage of MREs falling within 0.25 of the actual values, as shown in Eq. (4).

$$MRE_i = \frac{|x_i - \hat{x}_i|}{x_i} \quad (2)$$

$$MMRE = \frac{1}{N} \sum_{i=1}^{N} MRE_i \quad (3)$$

$$pred(\varepsilon) = \frac{100}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } MRE_i \leq \varepsilon \\ 0 & \text{otherwise} \end{cases} \quad (4)$$

where $x_i$ and $\hat{x}_i$ are the actual and predicted values of the i-th project, and N is the number of observations.
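These measures are straightforward to implement; the sketch below is ours and assumes the per-project MREs are gathered across the Leave-one-out runs:

```python
import numpy as np

def mre(actual, predicted):
    """Magnitude of Relative Error per project (Eq. 2)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(actual - predicted) / actual

def mmre(actual, predicted):
    """Mean MRE over all N observations (Eq. 3)."""
    return mre(actual, predicted).mean()

def pred(actual, predicted, eps=0.25):
    """Percentage of MREs falling within eps of the actuals (Eq. 4)."""
    return 100.0 * (mre(actual, predicted) <= eps).mean()

# e.g. pred([100, 200, 300], [110, 260, 310]) -> 66.7 (two MREs <= 0.25)
```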

In addition, we used the win-tie-loss algorithm of [21] to compare the performance of EBAV to the other estimation methods, as shown in Figure 2. To do so, we first check whether two methods M_i and M_j are statistically different according to the Wilcoxon test; if not, we increase tie_i and tie_j. If the distributions are statistically different, we update win_i, win_j and loss_i, loss_j after checking which one is better according to the performance measure at hand, E. The performance measures used here are MRE, MMRE, the median of MRE (MdMRE) and pred(0.25).

    win_i = 0; tie_i = 0; loss_i = 0
    win_j = 0; tie_j = 0; loss_j = 0
    if Wilcoxon(MRE(M_i), MRE(M_j), 95) says they are the same then
        tie_i = tie_i + 1
        tie_j = tie_j + 1
    else
        if better(E(M_i), E(M_j)) then
            win_i = win_i + 1
            loss_j = loss_j + 1
        else
            win_j = win_j + 1
            loss_i = loss_i + 1
        end if
    end if

Figure 2. Pseudocode for the win-tie-loss calculation between methods M_i and M_j based on performance measure E [21].
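A runnable counterpart to Figure 2 could look as follows; using a two-sided signed-rank test at the 95% level and the mean of the MRE samples as the measure E are our assumptions, and [21] repeats this comparison over several measures and datasets:

```python
import numpy as np
from scipy.stats import wilcoxon

def update_win_tie_loss(mre_i, mre_j, alpha=0.05):
    # mre_i, mre_j: paired per-project MRE samples of methods Mi and Mj.
    # Returns (win, tie, loss) increments from Mi's point of view.
    _, p = wilcoxon(mre_i, mre_j)          # paired Wilcoxon signed-rank test
    if p >= alpha:                         # same distribution -> tie
        return 0, 1, 0
    if np.mean(mre_i) < np.mean(mre_j):    # better(E(Mi), E(Mj)), lower MRE
        return 1, 0, 0                     # Mi wins, Mj loses
    return 0, 0, 1
```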

5. RESULTS AND DISCUSSIONS

This section describes the results obtained when comparing EBAV with the regular K-EBA methods. The K values used in this study range from 1 to 5, as these values are used intensively for comparison in previous studies [8, 10]. Tables 4 and 5 summarize the predictive performance of the variants of EBAV and EBA in terms of MMRE and pred. Looking at the MMRE values, we can notice that across all nine datasets the EBAV variants are never outperformed by any variant of EBA, obtaining the lowest MMRE values and the largest pred. This suggests that choosing the closest projects based on their rankings, while taking into consideration the appropriate number of voters for each ranking, is more efficient than calculating aggregated similarity degrees. Azzeh et al. [2] and Shepperd & Schofield [18] have drawn attention to the problems of some similarity measures used in EBA, namely that they are easily influenced by outliers and irrelevant features. The similarity degree is therefore distorted when a project with extreme values is assessed against the target project, and that source project ends up excluded from the similarity order even though its effort value is the better predictor.

Table 4 MMRE results of the EBAV and EBA variants (BO, CO, MA against EBA1-EBA5 over the nine datasets). [Numeric entries not preserved in this transcription.]

Three aspects of these results are worth commenting on:

1. The EBAV variants are at nearly the same predictive accuracy level, with very slight differences, but still produce better accuracy than the EBA variants. This enables us to conclude that no single EBAV variant produces superior results over all employed datasets, but we can draw a guideline from the obtained results: (1) BO is more suitable for moderate-size datasets such as Desharnais, Maxwell and COCOMO that have a large number of features. (2) MA is suitable for small datasets such as NASA and Telecom that have a small number of features. (3) CO is suitable for large datasets such as China and ISBSG. We can also conclude that Borda works well with datasets that have a large number of categorical features, such as Maxwell and COCOMO.

2. Among all RVMs, MA is the method best able to identify more than one project in the winner set, which allows us to automatically arrive at the best set of K nearest neighbours for every test project without any intervention from experts. This is a very important issue in estimation by analogy, since there is no reliable method that can discover the optimum number of analogies for every individual project. Previous studies that use K-EBA rely on expert intuition, where the K value is determined as the value that minimizes the overall MMRE, not the MRE of every test project.

3. Using voting methods, there is no need to pre-process the data as in basic EBA. It is well recognized that the use of EBA requires data standardization or transformation to give all features the same degree of influence. This step is not required in EBAV because the basic process depends on the ranking of projects, not on the distance values.

Table 5 Pred results of the EBAV and EBA variants over the nine datasets. [Numeric entries not preserved in this transcription.]

To identify the top methods among the EBA and EBAV variants over all datasets, we ran the win-tie-loss algorithm. This algorithm ranks the different methods by comparing them in terms of several performance measures over all employed datasets. The overall results are recorded in Table 6. There is reasonable evidence that RVMs combined with EBA are never outperformed by regular K-based EBA, which confirms the significant improvement brought to basic EBA. From these results we can notice that the number of wins suggests that BO is the best performer, with 100 wins, but the win-loss values suggest that all EBAV variants are remarkably better than conventional EBA. CO still produces results comparable to MA, with a slight difference in win-loss.

Table 6 Win-tie-loss results of the EBAV and EBA variants (win, tie, loss and win-loss per variant). [Numeric entries not preserved in this transcription.]

The performance of the EBAV variants is also compared against the most common regression methods: Classification and Regression Trees (CART), Ordinary Least Squares regression (OLS) and Stepwise Regression (SR), using the same Leave-one-out cross-validation procedure. These prediction methods were chosen because of the different strategies they use to make an estimate. The notable difference between SR and OLS is that SR builds the regression model from the significant features only, whereas OLS builds it from all employed features. For SR and OLS it is important to ensure that skewed numerical variables are transformed using log transforms, where needed, so that they more closely resemble a normal distribution [20]. The logarithmic transformation also ensures that the resulting model goes through the origin on the raw data scale. Further, all categorical attributes were converted into appropriate dummy variables [20]. Moreover, all necessary prerequisite tests, such as normality tests, were performed once before running the empirical validation, which resulted in a general regression model. Then, in each validation iteration, a regression model that resembles the general model in structure is built from the training set, and the prediction for the test project is made from it. Table 7 presents a sample of the SR regression models.

Table 7 General SR regression models. [Coefficients not preserved in this transcription; model forms and R^2 shown.]

Dataset     SR model (form)                       R^2
Albrecht    Effort ~ RawFP                        0.90
Kemerer     Ln(Effort) ~ Ln(AdjFP)                0.67
Desharnais  Ln(Effort) ~ Ln(AdjFP) + L1 + L2      -
COCOMO      Ln(Effort) ~ PCAP + TURN              0.18
Maxwell     Ln(Effort) ~ Size                     0.71
China       Ln(Effort) ~ Ln(AFP) + Ln(PDR_AFP)    0.48
ISBSG       Ln(Effort) ~ Ln(AFP) + Ln(ADD)        0.21
Telecom     Effort ~ changes                      0.53
Nasa        Ln(Effort) ~ Ln(KDLOC) + ME           0.90

The R^2 of the SR models for COCOMO and ISBSG suggest that these models were very poor, with only 18-21% of the variation in effort explained by variation in the selected significant features; this, however, does not necessarily lead to poor predictive performance. The SR model for the Desharnais dataset uses L1 and L2 as dummy variables in place of the categorical variable (Dev.mode). For the OLS technique we followed the same procedure used in [20], in which the log transformation is applied to skewed dependent and independent features so that the residuals of the regression model become more homoscedastic and follow a normal distribution more closely [20].
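As a hypothetical miniature of the preparation just described (log transform of skewed variables, dummy coding of categorical ones, back-transformation of predictions), consider the following sketch; the data values and column names are invented for illustration and do not come from the repositories:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented Desharnais-like sample: AdjFP is skewed, Dev_mode is categorical.
train = pd.DataFrame({
    "AdjFP": [120, 340, 89, 510, 230],
    "Dev_mode": ["L1", "L2", "L1", "L3", "L2"],
    "Effort": [2520, 7100, 1600, 11200, 4500],
})

X = pd.get_dummies(train[["Dev_mode"]], drop_first=True)  # dummy variables
X["ln_AdjFP"] = np.log(train["AdjFP"])                    # log transform
y = np.log(train["Effort"])                               # log-effort target

model = LinearRegression().fit(X, y)

new = pd.DataFrame({"AdjFP": [200], "Dev_mode": ["L2"]})
Xn = pd.get_dummies(new[["Dev_mode"]]).reindex(columns=X.columns, fill_value=0)
Xn["ln_AdjFP"] = np.log(new["AdjFP"])
print(np.exp(model.predict(Xn)))   # back-transform to the raw effort scale
```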

Tables 8 and 9 present the results obtained from applying SR, CART and OLS over all datasets. The results reveal that the EBAV variants still produce better accuracy than the regression models, with the exception of the China, NASA and COCOMO datasets. However, the differences in MMRE and pred values on the China and NASA datasets are not prominent, which suggests that the proposed methods have the potential to deliver good estimates. With reference to the comparison of Dejaeger et al. [20], who reported that OLS with log transformation was the superior technique over most of the employed datasets: this paper uses four of the datasets from that article (Desharnais, NASA, COCOMO and Maxwell), and the results obtained for these datasets in terms of MMRE and pred suggest that the proposed techniques beat OLS over three of them, whereas OLS performs better over the COCOMO dataset. This leads to the conclusion that the EBAV variants have the potential to deliver more accurate estimates than regression models.

Table 8 MMRE results of the EBAV variants against the regression models (BO, CO, MA, SR, CART, OLS over the nine datasets). [Numeric entries not preserved in this transcription.]

Table 9 Pred results of the EBAV variants against the regression models. [Numeric entries not preserved in this transcription.]

Table 10 shows the sum of the win, tie and loss values for all methods used in this paper. Every method is compared to 11 other methods, over 4 error measures and 9 datasets, so the maximum value that any one of the win, tie and loss statistics can attain is 11 x 4 x 9 = 396. Notice that in Table 10 (j) the tie values fall within a narrow range; they are therefore not informative enough to differentiate the methods, so we consult the win and loss statistics. There is a considerable difference between the best and the worst methods in terms of wins and losses. The results reveal that BO is the top-ranked method with win-loss = 115, followed by MA in second place with win-loss = 92. Interestingly, both MA and CO have the minimum number of losses over all datasets, which is lower than the losses of BO. Further analysis shows that BO was the winner over 6 datasets, whereas MA, CO and SR were the winners over 2 datasets each, and CART was the winner over only one dataset. Interestingly, the conventional EBA methods were not the top winners over any dataset. Also, the regression models developed by SR and OLS perform better than CART and conventional EBA, but not better than the EBAV variants, as confirmed in Table 10 (j).

Table 10 Win-tie-loss values (win, tie, loss, win-loss) for all estimation methods: sub-tables (a) Albrecht and (b) Kemerer,

(c) Desharnais, (d) COCOMO81, (e) Maxwell, (f) China, (g) ISBSG, (h) Nasa, (i) Telecom, and (j) cumulative win-tie-loss results. [Numeric entries of the sub-tables not preserved in this transcription.]

6. THREATS TO VALIDITY

Internal validity is the degree to which conclusions can be drawn with regard to the configuration of the AB algorithm, including (1) the determination of the initial parameter values and (2) the identification of the initial solutions. Currently there is no reasonable approach to perfectly initialize the parameter values of the AB algorithm, but some studies suggest choosing different values for every problem based on the size and complexity of the datasets. There is also no reliable method to choose the initial solutions, so we used a random selection procedure. We believe these decisions were reasonable, even though they may affect the computational cost of our EBAV method.

Construct validity assures that we are measuring what we actually intended to measure. This paper used Jack-knife (leave-one-out) validation to assess the different methods, though some authors favour n-fold cross-validation. The principal reason for this selection is that the Jack-knife is a deterministic procedure that can be exactly repeated by any other researcher with access to a particular dataset. According to previous studies, the Jack-knife generates lower-bias estimates than n-fold CV, since in n-fold CV the methods must learn from fewer examples; it also generates higher-variance estimates than n-fold CV, since the Jack-knife conducts more tests.

External validity assesses the ability to generalize the obtained findings of our comparative study; we used nine datasets from two different sources to ensure the generalizability of the obtained results. The employed datasets contain a wide diversity of projects in terms of their sources, their domains and the time periods in which they were developed. We also believe that reproducibility of results is an important factor for external validity; therefore we purposely selected publicly available datasets. We acknowledge that some of these datasets are rather old for software cost estimation, as they represent different software development approaches and technologies; we nevertheless used them because they are publicly available and still widely used for benchmarking purposes. An ideal case would be to acquire new datasets with specific properties that best suit the experimental concern and that support random sampling of the data in experimentation.

7. CONCLUSIONS

This paper presented a new approach to identifying and choosing the closest projects in the EBA method using Ranked Voting Methods, namely Borda, Copeland and Maximin. The use of voting rules with EBA has four distinct advantages: (1) it alleviates the effect of extreme values, since no numeric values are used and only a ranking within an existing set of projects is necessary; (2) the aggregated ranks may nominate more than a single project as the closest analogies, which saves the time spent finding an appropriate number of analogies; (3) the procedure followed is transparent to practitioners and easy to understand; and (4) in line with Kocaguneli et al. [21], this paper shows that the ensemble of RVM and EBA works better than single EBA methods. The proposed EBAV variants have been benchmarked against some well-known estimation methods, including regular K-based EBA, SR, OLS and CART. The top-ranked method is BO, as confirmed by the cumulative win-tie-loss results for each compared method in Table 10 (j); BO was never outperformed in seven out of nine datasets.
Future work is required to investigate the impact of feature subset selection [15] on the predictive performance, in addition to the impact of ensembles of voting methods rather than a single voting method.

8. ACKNOWLEDGEMENTS

The authors are grateful to the Applied Science University, Amman, Jordan, for the financial support granted to cover the publication fee of this research article.

9. REFERENCES

1. Azzeh, M., A replicated assessment and comparison of adaptation techniques for analogy-based effort estimation. Empirical Software Engineering 17(1-2).
2. Azzeh, M., Neagu, D., Cowling, P., Fuzzy grey relational analysis for software effort estimation. Empirical Software Engineering 15.
3. Boetticher, G., Menzies, T., Ostrand, T., PROMISE repository of empirical software engineering data. West Virginia University, Department of Computer Science.
4. Chiu, N.H., Huang, S.J., The adjusted analogy-based software effort estimation based on similarity distances. Journal of Systems and Software 80.
5. Fishburn, P.C., Condorcet social choice functions. SIAM Journal on Applied Mathematics 33 (1977).
6. ISBSG, International Software Benchmarking Standards Group, Data CD Release 10.
7. Jorgensen, M., Indahl, U., Sjoberg, D., Software effort estimation by analogy and regression toward the mean. Journal of Systems and Software 68.
8. Kadoda, G., Cartwright, M., Chen, L., Shepperd, M., Experiences using case-based reasoning to predict software project effort. In: Proceedings of EASE: Evaluation and Assessment in Software Engineering Conference, Keele, UK.
9. Keung, J., Kitchenham, B., Jeffery, D.R., Analogy-X: providing statistical inference to analogy-based software cost estimation. IEEE Transactions on Software Engineering 34(4).
10. Kirsopp, C., Mendes, E., Premraj, R., Shepperd, M., An empirical analysis of linear adaptation techniques for case-based prediction. In: International Conference on Case-Based Reasoning (2003).
11. Klamler, C., On the closeness aspect of three voting rules: Borda, Copeland, Maximin. Group Decision and Negotiation 14(3) (2005).
12. Kocaguneli, E., Menzies, T., Bener, A., Keung, J., Exploiting the essential assumptions of analogy-based effort estimation. IEEE Transactions on Software Engineering.
13. Koch, S., Mitlöhner, J., Software project effort estimation with voting rules. Decision Support Systems 46 (2009).
14. Mendes, E., Watson, I., Triggs, C., Mosley, N., Counsell, S., A comparative study of cost estimation models for web hypermedia applications. Empirical Software Engineering 8.
15. Menzies, T., Chen, Z., Hihn, J., Lum, K., Selecting best practices for effort estimation. IEEE Transactions on Software Engineering 32.
16. Miranda, E., Improving subjective estimates using paired comparisons. IEEE Software 18(1) (2001).
17. Pham, D.T., Ghanbarzadeh, A., Koç, E., Otri, S., Rahim, S., Zaidi, M., The Bees Algorithm: a novel tool for complex optimisation problems. In: 2nd Virtual International Conference on Intelligent Production Machines and Systems (I*PROMS-06), Cardiff, UK (2006).
18. Shepperd, M., Schofield, C., Estimating software project effort using analogies. IEEE Transactions on Software Engineering 23.
19. Eckert, D., Klamler, C., Mitlöhner, J., Schlötterer, C., A distance-based comparison of basic voting rules. Central European Journal of Operations Research 14(4) (2006).
20. Dejaeger, K., Verbeke, W., Martens, D., Baesens, B., Data mining techniques for software effort estimation: a comparative study. IEEE Transactions on Software Engineering 38(2) (2012).
21. Kocaguneli, E., Menzies, T., Keung, J., On the value of ensemble effort estimation. IEEE Transactions on Software Engineering (2011).


More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Probabilistic Models of Software Function Point Elements

Probabilistic Models of Software Function Point Elements Probabilistic Models of Software Function Point Elements Masood Uzzafer Amity university Dubai Dubai, U.A.E. Email: muzzafer [AT] amityuniversity.ae Abstract Probabilistic models of software function point

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

International Journal of Scientific & Engineering Research, Volume 6, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 6, Issue 10, October ISSN International Journal of Scientific & Engineering Research, Volume 6, Issue 10, October-2015 726 Performance Validation of the Modified K- Means Clustering Algorithm Clusters Data S. Govinda Rao Associate

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Fundamentals of Operations Research. Prof. G. Srinivasan. Department of Management Studies. Indian Institute of Technology, Madras. Lecture No.

Fundamentals of Operations Research. Prof. G. Srinivasan. Department of Management Studies. Indian Institute of Technology, Madras. Lecture No. Fundamentals of Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras Lecture No. # 13 Transportation Problem, Methods for Initial Basic Feasible

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

SELECTION OF A MULTIVARIATE CALIBRATION METHOD

SELECTION OF A MULTIVARIATE CALIBRATION METHOD SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Evaluation of Direct and Indirect Blockmodeling of Regular Equivalence in Valued Networks by Simulations

Evaluation of Direct and Indirect Blockmodeling of Regular Equivalence in Valued Networks by Simulations Metodološki zvezki, Vol. 6, No. 2, 2009, 99-134 Evaluation of Direct and Indirect Blockmodeling of Regular Equivalence in Valued Networks by Simulations Aleš Žiberna 1 Abstract The aim of the paper is

More information

1) Give decision trees to represent the following Boolean functions:

1) Give decision trees to represent the following Boolean functions: 1) Give decision trees to represent the following Boolean functions: 1) A B 2) A [B C] 3) A XOR B 4) [A B] [C Dl Answer: 1) A B 2) A [B C] 1 3) A XOR B = (A B) ( A B) 4) [A B] [C D] 2 2) Consider the following

More information

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A

More information

Math 130 Final Exam Study Guide. 1. Voting

Math 130 Final Exam Study Guide. 1. Voting 1 Math 130 Final Exam Study Guide 1. Voting (a) Be able to interpret a top choice ballot, preference ballot and preference schedule (b) Given a preference schedule, be able to: i. find the winner of an

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART Classification and Regression Trees Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART CART stands for Classification And Regression Trees.

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification

An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification Flora Yu-Hui Yeh and Marcus Gallagher School of Information Technology and Electrical Engineering University

More information

Information Retrieval Rank aggregation. Luca Bondi

Information Retrieval Rank aggregation. Luca Bondi Rank aggregation Luca Bondi Motivations 2 Metasearch For a given query, combine the results from different search engines Combining ranking functions Text, links, anchor text, page title, etc. Comparing

More information

Further Thoughts on Precision

Further Thoughts on Precision Further Thoughts on Precision David Gray, David Bowes, Neil Davey, Yi Sun and Bruce Christianson Abstract Background: There has been much discussion amongst automated software defect prediction researchers

More information

Affymetrix GeneChip DNA Analysis Software

Affymetrix GeneChip DNA Analysis Software Affymetrix GeneChip DNA Analysis Software User s Guide Version 3.0 For Research Use Only. Not for use in diagnostic procedures. P/N 701454 Rev. 3 Trademarks Affymetrix, GeneChip, EASI,,,, HuSNP, GenFlex,

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION

SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION Kamil Zakwan Mohd Azmi, Zuwairie Ibrahim and Dwi Pebrianti Faculty of Electrical

More information

An Empirical Comparison of Spectral Learning Methods for Classification

An Empirical Comparison of Spectral Learning Methods for Classification An Empirical Comparison of Spectral Learning Methods for Classification Adam Drake and Dan Ventura Computer Science Department Brigham Young University, Provo, UT 84602 USA Email: adam drake1@yahoo.com,

More information

Possibilities of Voting

Possibilities of Voting Possibilities of Voting MATH 100, Survey of Mathematical Ideas J. Robert Buchanan Department of Mathematics Summer 2018 Introduction When choosing between just two alternatives, the results of voting are

More information

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing 1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

Stochastic propositionalization of relational data using aggregates

Stochastic propositionalization of relational data using aggregates Stochastic propositionalization of relational data using aggregates Valentin Gjorgjioski and Sašo Dzeroski Jožef Stefan Institute Abstract. The fact that data is already stored in relational databases

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

Regression. Dr. G. Bharadwaja Kumar VIT Chennai

Regression. Dr. G. Bharadwaja Kumar VIT Chennai Regression Dr. G. Bharadwaja Kumar VIT Chennai Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called

More information