On the Value of Ranked Voting Methods for Estimation by Analogy


Mohammad Azzeh, Marwan Alseid
Department of Software Engineering, Applied Science University, Amman, Jordan, P.O. Box 166

Abstract. Background: One long-standing issue in Estimation by Analogy (EBA) is finding the closest analogies. Prior studies revealed that existing similarity measures are easily influenced by extreme values and irrelevant features. Aims: Instead of identifying the closest projects based on aggregated similarity degrees, we propose to use Ranked Voting Methods, which rank projects per feature and then aggregate those ranks over all features using vote-counting rules. The project(s) with the highest score are the winners and form the new estimate for the target project. This also enables us to automatically arrive at the preferred number of analogies for each target project, since the winner set may contain more than a single winner. Method: An empirical evaluation with a Jack-knifing procedure has been carried out in which nine datasets from two repositories (PROMISE & ISBSG) were used for benchmarking. The proposed models are compared to some well-known estimation methods: regular K-based EBA, Stepwise Regression, Ordinary Least Squares regression, and regression trees (CART). Results & Conclusions: The performance figures of the proposed models were promising. The use of voting methods presents some useful advantages: (1) it saves the time spent finding an appropriate number of analogies K for each individual project, (2) no project pruning is needed, and (3) no data standardization is required.

1. INTRODUCTION

Estimation by Analogy (EBA) has become a very popular and commercially successful software effort estimation method [1, 4, 7, 9]. It is based on the assumption that the effort of a new project can be estimated efficiently by reusing effort information about similar, already estimated projects documented in a dataset [18]. In order to estimate a new project, one first has to identify the projects that are most useful for prediction. Since the utility of a project cannot be evaluated directly a priori, similarity between project descriptions is used as a heuristic [2, 4]. The closest projects are therefore identified based on aggregated similarity degrees, not based on which project is most preferred across all features. Many researchers [1, 14, 18] argue that, in theory, the similarity measures used in EBA are easily influenced by irrelevant features and by extreme values present in some features. Azzeh et al. [2] and Shepperd & Schofield [18] demonstrated in prior studies that Euclidean distance often fails to retrieve the exact closest projects because it is easily influenced by the abovementioned challenges. For instance, a project with Condorcet wins (i.e. ranked first as the closest project over all features) might not be selected, while a project with fewer wins is selected because it has some extreme values. A remarkable observation from previous studies is that identifying the closest projects is sensitive to the choice of similarity measure. This should not be a surprising result; Mendes et al. [14] compared different types of distance metrics in analogy-based software estimation and revealed that different distance metrics yield different results.

Another important issue in EBA is specifying a priori the appropriate number of analogies required to produce the effort estimate. The current approach starts with a single analogy and increases this number depending on the overall performance over the whole dataset, then uses the global K value that produces the overall best performance [12]. However, a fixed K value that produces the overall best performance does not necessarily provide the best performance for individual projects and may not be suitable for other datasets. Therefore the K value should be allowed to differ for each individual project. This paper presents a new model based on Ranked Voting Methods (RVM) to identify the closest analogies and automatically arrive at the preferred number of analogies for each target project. An RVM is a winner-election method in which voters rank candidates in order of preference [5, 11]. The winner of an election is usually determined by giving each candidate a certain number of points corresponding to the position in which it is ranked by each voter. Once all votes have been counted, the candidate with the highest score is the winner. The voters in EBA are represented by the features; thus features and voters will be used interchangeably hereafter. The voting method is integrated into the EBA process to rank source projects per feature according to their closeness to the target project, and then aggregate those rankings to identify the winner set. Hence, a voting method may identify not just a single winner but a set of winners with indifferences between them. This eventually helps us save the time spent finding an appropriate number of analogies. This kind of integration is defined by Kocaguneli et al. [21] as ensemble learning. Ensemble means building multiple predictors rather than using a single method, and then aggregating the estimates that come from the different learners. The results of [21] showed that ensemble methods perform better than single methods, which forms another motivation for this work.

The rest of the article is organized as follows: Section 2 presents the related work. Section 3 introduces Ranked Voting Methods. Section 4 presents the methodology used in this article. Section 5 presents the results we obtained. Section 6 presents the threats to validity of this study. Lastly, Section 7 summarizes our conclusions and future work.

2. RELATED WORKS

In the literature there is no study focused on the use of RVM in EBA, but two studies can be considered related to our research area. The first study was conducted by Miranda [16] based on the Analytic Hierarchy Process (AHP), a decision-making theory. The AHP was mainly used for the problem of size estimation based on comparisons between software project components on a limited verbal scale (equal; slightly/much/extremely smaller/bigger). The outcome of this process is a vector of weights that can be used to deliver size estimates if at least one reference point is known. The similarity to our work lies mostly in the fact that both models use a ranking approach to arrive at a good estimate. The difference lies mostly in the fact that the AHP method is geared towards estimating a single attribute for several entities based on a single reference point, while our method, as described later, uses an arbitrary number of features in order to form an overall picture of the new estimate. The second study was conducted by Koch and Mitlöhner [13] based on social choice.
They proposed a new estimation method that resembles AHP in its use of comparisons, with the difference that the social-choice approach uses an arbitrary list of variables and aggregates the resulting rankings to place the target project in the proper position. The mean effort value of the nearby projects is then calculated to produce the new estimate. This method handles only a single new project at a time; and since only weak preference assertions are made, there is no need to translate verbal scales into numeric values as in AHP. Both previous studies have been

validated over a very limited number of datasets, which is insufficient to judge the credibility and reliability of such methods.

3. RANKED VOTING THEORY

Ranked Voting theory (also known as preferential voting) aggregates individual interests and preferences towards a collective decision [5, 11]. In other words, voters rank the possible candidates in order of preference, and aggregation rules are then used to find a winner or a set of winners among the various candidates. A voting problem is usually described by a set of m voters offering preference rankings over n candidates. Once the candidates have been assigned ranks by the voters, an RVM can be used to aggregate the collective decision. An aggregation resulting from the application of an RVM may contain indifferences; therefore the winner set may contain more than one candidate. However, two important properties should be fulfilled by an RVM: first, if a candidate x beats all other candidates in pairwise comparisons, then x is a Condorcet winner and should be ranked first in the collective decision; second, the aggregate relation should not contain any cycles and should represent a complete (possibly weak) order of the candidates. Although various RVMs exist in the literature, we limited our choice to the most common methods that satisfy these two properties: the Borda, Copeland and Maximin rules. In these methods the candidate(s) with the highest score form the winner set. Since the aggregate relations of the Maximin, Copeland and Borda methods are based on numeric scores, they can always be expressed as complete weak preference orders, i.e. they can contain indifferences, but not cycles or intransitivity [13].

Borda, Copeland and Maximin counts can all be computed by first constructing the majority margins matrix (MM). The Borda count of a candidate is the sum of the votes across that candidate's row [19]. The Copeland count sums only the signs of the pairwise margins, which gives the difference between the number of candidates a candidate wins against and the number it loses against [5]. The Maximin count is the smallest number in a candidate's row. Finally, the winner is the candidate with the highest score [19]. To illustrate how this works, we provide a hypothetical example with 4 voters (A, B, C, D) and 5 candidates (x, y, z, w, u), as shown in Table 1.

Table 1 Hypothetical example of voting methods

Rank  A  B  C  D
1     y  x  y  x
2     x  w  w  w
3     w  y  x  u
4     z  z  u  y
5     u  u  z  z

The complete profile for this example is described formally by the following notation: ((y ≻ x ≻ w ≻ z ≻ u), (x ≻ w ≻ y ≻ z ≻ u), (y ≻ w ≻ x ≻ u ≻ z), (x ≻ w ≻ u ≻ y ≻ z)), where y ≻ x denotes that y is preferred to x, i.e. y defeats (precedes) x. A sketch that reproduces this example in code follows.
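To make the three counts concrete, here is a minimal Python sketch (our reconstruction of the rules as defined above, not code from the paper) that builds the majority margin matrix for the Table 1 profile and derives the Borda, Copeland and Maximin scores:

```python
# Hypothetical profile from Table 1: each voter lists the candidates
# from most to least preferred.
profile = [
    ["y", "x", "w", "z", "u"],  # voter A
    ["x", "w", "y", "z", "u"],  # voter B
    ["y", "w", "x", "u", "z"],  # voter C
    ["x", "w", "u", "y", "z"],  # voter D
]
candidates = ["x", "y", "z", "w", "u"]

# Majority margin matrix: MM[a][b] = number of voters ranking a above b.
MM = {a: {b: 0 for b in candidates} for a in candidates}
for ranking in profile:
    for i, a in enumerate(ranking):
        for b in ranking[i + 1:]:
            MM[a][b] += 1

# Borda: sum of votes across the candidate's row.
borda = {a: sum(MM[a][b] for b in candidates if b != a) for a in candidates}

# Copeland: sum of the signs of the pairwise margins (wins minus losses).
sign = lambda v: (v > 0) - (v < 0)
copeland = {a: sum(sign(MM[a][b] - MM[b][a]) for b in candidates if b != a)
            for a in candidates}

# Maximin: smallest entry in the candidate's row.
maximin = {a: min(MM[a][b] for b in candidates if b != a) for a in candidates}

for name, score in [("BO", borda), ("CO", copeland), ("MA", maximin)]:
    best = max(score.values())
    winners = [c for c, s in score.items() if s == best]
    print(name, score, "winner set:", winners)
# BO and CO elect x alone; MA elects {x, y}, matching the rankings below.
```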

Table 2 shows the MM for the candidates in Table 1. The value in every entry represents how many times the candidate in the corresponding row beats the candidate in the opponent's column; e.g., the first row and third column tells us that MM(x,z) = 4, which indicates that candidate x beats candidate z by a margin of four.

Table 2 Majority margin matrix for the profile in Table 1

      x  y  z  w  u | BO  CO  MA
x     -  2  4  3  4 | 13   3   2
y     2  -  4  2  3 | 11   2   2
z     0  0  -  0  2 |  2  -3   0
w     1  2  4  -  4 | 11   1   1
u     0  1  2  0  - |  3  -3   0

The resulting aggregated scores for every candidate are shown in the last three columns of Table 2. Therefore, for the above profile we obtain the following rankings:

BO: x ≻ (y~w) ≻ u ≻ z
CO: x ≻ y ≻ w ≻ (u~z)
MA: (x~y) ≻ w ≻ (u~z)

where candidates on the left-hand side are ranked higher and the ~ symbol means indifference between two candidates (i.e. they have the same rank). From the final rankings we can notice that both BO and CO suggest that candidate x is the winner, while MA suggests that both x and y form the winner set. We can also notice that the CO and MA methods rank the Condorcet winner highest whenever such a winner exists. Furthermore, it is evident that different RVMs often yield different orders of preference for the same profile.

4. METHODOLOGY

4.1 EBA and Voting Rules

The notion of RVM can be integrated efficiently into EBA, especially at the project-retrieval stage. The RVM is mainly used to retrieve the closest projects by ranking the source projects per feature and then aggregating those ranks. The voters in EBA are represented by the features, whereas the source projects are the possible candidates. The ranking per feature is often assumed to consist of strict orderings only, but may contain tied ranks when some candidates have the same values. This is very important in EBA when a particular feature is described by categorical or ordinal values. The proposed method (EBAV) proceeds as follows (a compact sketch is given after the list):

1. Find the distance between the test project and all source projects for each feature individually using Eq. (1):

$$d(p_i, p_j) = \begin{cases} |p_i - p_j| & \text{if the feature is continuous} \\ 0 & \text{if the feature is categorical and } p_i = p_j \\ 1 & \text{if the feature is categorical and } p_i \neq p_j \end{cases} \quad (1)$$

2. Rank the source projects per feature according to their closeness to the test project. For categorical features, the projects that have the same category as the target project are ranked first and the remaining projects are ranked second.
3. Calculate the BO, CO and MA counts by aggregating the ranks of every source project over all features, taking into account the number of voters for every preference (i.e. the weights discussed later).
4. The project(s) with the highest score are ranked first and the mean of their effort values is used to produce the new estimate.
5. Steps 1 to 4 are repeated for all test projects.
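The sketch below is our reading of steps 1-4, not the authors' code. In particular, realizing the feature weights of Section 4.2 as voter multiplicities inside a weighted majority-margin matrix, and the handling of tied per-feature ranks, are our assumptions:

```python
import numpy as np
from scipy.stats import rankdata  # average ranks handle tied projects

def ebav_estimate(target, sources, efforts, categorical, weights, rule="MA"):
    # target: (m,) test-project features; sources: (n, m) source projects;
    # efforts: (n,) known efforts; categorical: (m,) boolean mask;
    # weights: (m,) voter multiplicities per feature (Section 4.2).
    n, m = sources.shape
    efforts = np.asarray(efforts, dtype=float)
    ranks = np.empty((n, m))
    for f in range(m):                        # step 1: Eq. (1) per feature
        if categorical[f]:
            d = (sources[:, f] != target[f]).astype(float)
        else:
            d = np.abs(sources[:, f].astype(float) - float(target[f]))
        ranks[:, f] = rankdata(d)             # step 2: rank 1 = closest

    # Step 3: weighted majority margins, where MM[a, b] is the weighted
    # number of features (voters) ranking project a strictly above b.
    MM = np.zeros((n, n))
    for f in range(m):
        MM += weights[f] * (ranks[:, f][:, None] < ranks[:, f][None, :])

    if rule == "BO":                          # Borda: row sums
        score = MM.sum(axis=1)
    elif rule == "CO":                        # Copeland: signs of margins
        score = np.sign(MM - MM.T).sum(axis=1)
    else:                                     # Maximin: worst row entry
        score = (MM + np.diag(np.full(n, np.inf))).min(axis=1)

    winners = np.flatnonzero(score == score.max())  # step 4: winner set
    return efforts[winners].mean(), winners
```

Because the winner set may contain several tied projects, the returned estimate is the mean effort of all winners, which is how EBAV arrives at a per-project number of analogies without a fixed K.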

4.2 Feature Weighting

Ranked voting methods usually allow several voters to cast the same ranking. In EBA, this is represented by assigning a weight to every feature (voter), which translates into having more voters than features, with several voters giving their ranking according to a single feature. For example, we might include ten voters who prefer a ranking based on function points, while only two voters prefer one based on the development mode. To obtain such weights we used the Artificial Bees algorithm (AB) to optimize the weight values for each dataset. The Bees algorithm performs a kind of neighbourhood search combined with random search [17]. As with any search algorithm, the use of AB requires initializing its parameters: the problem size (m), the number of scout bees (n), the number of sites selected out of the n visited sites (k), the number of best sites out of the k selected sites (e), the number of bees recruited for the best e sites (nep), the number of bees recruited for the other (k-e) selected sites (nsp), and the initial size of the patches (ngh), in addition to a stopping criterion [17]. The algorithm starts with an initial population (pop) of scout bees placed randomly in the initial search space. Each scout bee (i.e. each row) represents a potential solution as a set of weight values, as shown in Figure 1, where n is the number of solutions and m is the dimension of a solution, which equals the number of dataset features.

$$pop = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nm} \end{bmatrix}$$

Figure 1. Initial population

The fitness value of each solution in pop is evaluated by running the proposed model and computing its MMRE. The solutions are then reordered by fitness from lowest to highest. Based on the values of k and e, the best k solutions are selected for neighbourhood search. For example, if k=20 and e=10, we select the top 10 solutions (i.e. solutions #1 to #10 in the ordered pop) as elite sites to visit and recruit a number of bees (nep) to search the neighbourhood of each elite solution for possible improvements, forming a new patch. In other words, for each elite solution there will be nep generated solutions searching the neighbourhood for other, possibly better solutions. Similarly, for the best solutions from #11 to #20, a number of bees (nsp) are recruited for each solution to search the neighbourhood and form new patches. It is important to note that nsp should be less than nep, to reflect the fitness of the solutions. The area of neighbourhood search is determined by the radius of the search area around the best solution, which is used to update the k solutions declared in the previous step. This is important because there might be better solutions than the original solution in its neighbourhood. The best solution in each patch replaces the old best solution in that patch. The remaining bees in the population (i.e. solutions #21 to #100) are replaced randomly with other solutions. The algorithm continues searching the neighbourhoods of the selected sites, recruiting more bees to search near the best sites, which may hold promising solutions. These steps are repeated until the stopping criterion (lowest MMRE) is met or the iteration budget is exhausted. In this paper we used the following AB parameters: n=100, k=20, e=10, nep=30, nsp=20, ngh=0.05. A minimal sketch of this search loop is given below.
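The following Python sketch is our illustration of the loop described above, not the authors' implementation; the uniform perturbation within a patch of radius ngh and the clipping of weights to [0, 1] are assumptions:

```python
import random

def bees_search(fitness, m, n=100, k=20, e=10, nep=30, nsp=20,
                ngh=0.05, iterations=50):
    # fitness(w) -> MMRE of EBAV under weight vector w (lower is better);
    # m is the number of features. Weights are assumed to lie in [0, 1].
    pop = [[random.random() for _ in range(m)] for _ in range(n)]

    def neighbour(w):
        # Search within a patch of radius ngh around a site (assumed rule).
        return [min(1.0, max(0.0, x + random.uniform(-ngh, ngh))) for x in w]

    for _ in range(iterations):
        pop.sort(key=fitness)                    # best (lowest MMRE) first
        next_pop = []
        for i, site in enumerate(pop[:k]):       # the k selected sites
            bees = nep if i < e else nsp         # more bees on elite sites
            patch = [neighbour(site) for _ in range(bees)] + [site]
            next_pop.append(min(patch, key=fitness))   # best bee kept
        # the remaining n - k scouts keep searching randomly
        next_pop += [[random.random() for _ in range(m)] for _ in range(n - k)]
        pop = next_pop
    return min(pop, key=fitness)
```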
4.3 Experimental Design

The proposed method has been validated over nine datasets coming from two repositories, PROMISE [3] and ISBSG [6]. Using a sufficient number of datasets increases the credibility and reliability of the proposed method. PROMISE is an online, publicly available data repository which consists of

datasets donated by various researchers around the world [12]. The datasets from this source are: Albrecht, Kemerer, Desharnais, COCOMO, Maxwell, China, Telecom, and NASA93. The remaining dataset comes from the ISBSG data repository (release 10), a large repository of more than 4000 projects collected from different types of projects around the world. Since many projects have missing values, only 500 projects with quality rating A were considered, and 14 useful features were selected, 8 of which are numerical and 6 of which are categorical. The descriptive statistics of these datasets are summarized in Table 3.

Table 3 Statistical properties of the employed datasets: number of cases and minimum, maximum and mean effort for ISBSG, Desharnais, COCOMO, Kemerer, Albrecht, Maxwell, NASA, China and Telecom. [Numeric entries not preserved in this transcription.]

For each dataset we follow the same testing strategy: we use Leave-one-out cross-validation to identify the test and training projects such that, in each run, we select one project as the test set and the remaining projects as the training set. This procedure is repeated until every project in the dataset has been used as a test project. In each run, the prediction accuracy of the different techniques is assessed using various performance measures.

4.4 Performance Measures

Three performance measures have been used to evaluate and compare the different estimation models. The most common measure is the Magnitude of Relative Error (MRE), which calculates the absolute relative error between the actual and predicted project effort values, as shown in Eq. (2). A summary of MRE is the Mean Magnitude of Relative Error (MMRE), shown in Eq. (3). pred(0.25) is an alternative performance measure that counts the percentage of MREs falling within 0.25 of the actual values, as shown in Eq. (4).

$$MRE_i = \frac{|x_i - \hat{x}_i|}{x_i} \quad (2)$$

$$MMRE = \frac{1}{N} \sum_{i=1}^{N} MRE_i \quad (3)$$

$$pred(\varepsilon) = \frac{100}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } MRE_i \leq \varepsilon \\ 0 & \text{otherwise} \end{cases} \quad (4)$$

where $x_i$ and $\hat{x}_i$ are the actual and predicted values of the i-th project, and N is the number of observations.
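These measures are straightforward to implement; the sketch below is ours and assumes the per-project MREs are gathered across the Leave-one-out runs:

```python
import numpy as np

def mre(actual, predicted):
    """Magnitude of Relative Error per project (Eq. 2)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(actual - predicted) / actual

def mmre(actual, predicted):
    """Mean MRE over all N observations (Eq. 3)."""
    return mre(actual, predicted).mean()

def pred(actual, predicted, eps=0.25):
    """Percentage of MREs falling within eps of the actuals (Eq. 4)."""
    return 100.0 * (mre(actual, predicted) <= eps).mean()

# e.g. pred([100, 200, 300], [110, 260, 310]) -> 66.7 (two MREs <= 0.25)
```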

In addition, we used the win-tie-loss algorithm of [21] to compare the performance of EBAV to the other estimation methods, as shown in Figure 2. To do so, we first check whether two methods M_i and M_j are statistically different according to the Wilcoxon test; if not, we increase tie_i and tie_j. If the distributions are statistically different, we update win_i, win_j and loss_i, loss_j after checking which one is better according to the performance measure at hand, E. The performance measures used here are MRE, MMRE, the median of MRE (MdMRE) and pred(0.25).

    win_i = 0; tie_i = 0; loss_i = 0
    win_j = 0; tie_j = 0; loss_j = 0
    if Wilcoxon(MRE(M_i), MRE(M_j), 95) says they are the same then
        tie_i = tie_i + 1
        tie_j = tie_j + 1
    else
        if better(E(M_i), E(M_j)) then
            win_i = win_i + 1
            loss_j = loss_j + 1
        else
            win_j = win_j + 1
            loss_i = loss_i + 1
        end if
    end if

Figure 2. Pseudocode for the win-tie-loss calculation between methods M_i and M_j based on performance measure E [21].
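A runnable counterpart to Figure 2 could look as follows; using a two-sided signed-rank test at the 95% level and the mean of the MRE samples as the measure E are our assumptions, and [21] repeats this comparison over several measures and datasets:

```python
import numpy as np
from scipy.stats import wilcoxon

def update_win_tie_loss(mre_i, mre_j, alpha=0.05):
    # mre_i, mre_j: paired per-project MRE samples of methods Mi and Mj.
    # Returns (win, tie, loss) increments from Mi's point of view.
    _, p = wilcoxon(mre_i, mre_j)          # paired Wilcoxon signed-rank test
    if p >= alpha:                         # same distribution -> tie
        return 0, 1, 0
    if np.mean(mre_i) < np.mean(mre_j):    # better(E(Mi), E(Mj)), lower MRE
        return 1, 0, 0                     # Mi wins, Mj loses
    return 0, 0, 1
```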

5. RESULTS AND DISCUSSIONS

This section describes the results obtained when comparing EBAV with the regular K-EBA methods. The K values used in this study range from 1 to 5, as these values are used intensively for comparison in previous studies [8, 10]. Tables 4 and 5 summarize the predictive performance of the variants of EBAV and EBA in terms of MMRE and pred. Looking at the MMRE values, we can notice that across all nine datasets the EBAV variants are never outperformed by any variant of EBA, obtaining the lowest MMRE values and the largest pred. This suggests that choosing the closest projects based on their rankings, while taking into consideration the appropriate number of voters for each ranking, is more efficient than calculating aggregated similarity degrees. Azzeh et al. [2] and Shepperd & Schofield [18] have drawn attention to the problems of some similarity measures used in EBA, namely that they are easily influenced by outliers and irrelevant features. The similarity degree is therefore distorted when a project with extreme values is assessed against the target project, and that source project ends up excluded from the similarity order even though its effort value is the better predictor.

Table 4 MMRE results of the EBAV and EBA variants (BO, CO, MA against EBA1-EBA5 over the nine datasets). [Numeric entries not preserved in this transcription.]

Three aspects of these results are worth commenting on:

1. The EBAV variants are at nearly the same predictive accuracy level, with very slight differences, but still produce better accuracy than the EBA variants. This enables us to conclude that no single EBAV variant produces superior results over all employed datasets, but we can draw a guideline from the obtained results: (1) BO is more suitable for moderate-size datasets such as Desharnais, Maxwell and COCOMO that have a large number of features. (2) MA is suitable for small datasets such as NASA and Telecom that have a small number of features. (3) CO is suitable for large datasets such as China and ISBSG. We can also conclude that Borda works well with datasets that have a large number of categorical features, such as Maxwell and COCOMO.

2. Among all RVMs, MA is the method best able to identify more than one project in the winner set, which allows us to automatically arrive at the best set of K nearest neighbours for every test project without any intervention from experts. This is a very important issue in estimation by analogy, since there is no reliable method that can discover the optimum number of analogies for every individual project. Previous studies that use K-EBA rely on expert intuition, where the K value is determined as the value that minimizes the overall MMRE, not the MRE of every test project.

3. Using voting methods, there is no need to pre-process the data as in basic EBA. It is well recognized that the use of EBA requires data standardization or transformation to give all features the same degree of influence. This step is not required in EBAV because the basic process depends on the ranking of projects, not on the distance values.

Table 5 Pred results of the EBAV and EBA variants over the nine datasets. [Numeric entries not preserved in this transcription.]

To identify the top methods among the EBA and EBAV variants over all datasets, we ran the win-tie-loss algorithm. This algorithm ranks the different methods by comparing them in terms of several performance measures over all employed datasets. The overall results are recorded in Table 6. There is reasonable evidence that RVMs combined with EBA are never outperformed by regular K-based EBA, which confirms the significant improvement brought to basic EBA. From these results we can notice that the number of wins suggests that BO is the best performer, with 100 wins, but the win-loss values suggest that all EBAV variants are remarkably better than conventional EBA. CO still produces results comparable to MA, with a slight difference in win-loss.

Table 6 Win-tie-loss results of the EBAV and EBA variants (win, tie, loss and win-loss per variant). [Numeric entries not preserved in this transcription.]

The performance of the EBAV variants is also compared against the most common regression methods: Classification and Regression Trees (CART), Ordinary Least Squares regression (OLS) and Stepwise Regression (SR), using the same Leave-one-out cross-validation procedure. These prediction methods were chosen because of the different strategies they use to make an estimate. The notable difference between SR and OLS is that SR builds the regression model from the significant features only, whereas OLS builds it from all employed features. For SR and OLS it is important to ensure that skewed numerical variables are transformed using log transforms, where needed, so that they more closely resemble a normal distribution [20]. The logarithmic transformation also ensures that the resulting model goes through the origin on the raw data scale. Further, all categorical attributes were converted into appropriate dummy variables [20]. Moreover, all necessary prerequisite tests, such as normality tests, were performed once before running the empirical validation, which resulted in a general regression model. Then, in each validation iteration, a regression model that resembles the general model in structure is built from the training set, and the prediction for the test project is made from it. Table 7 presents a sample of the SR regression models.

Table 7 General SR regression models. [Coefficients not preserved in this transcription; model forms and R^2 shown.]

Dataset     SR model (form)                       R^2
Albrecht    Effort ~ RawFP                        0.90
Kemerer     Ln(Effort) ~ Ln(AdjFP)                0.67
Desharnais  Ln(Effort) ~ Ln(AdjFP) + L1 + L2      -
COCOMO      Ln(Effort) ~ PCAP + TURN              0.18
Maxwell     Ln(Effort) ~ Size                     0.71
China       Ln(Effort) ~ Ln(AFP) + Ln(PDR_AFP)    0.48
ISBSG       Ln(Effort) ~ Ln(AFP) + Ln(ADD)        0.21
Telecom     Effort ~ changes                      0.53
Nasa        Ln(Effort) ~ Ln(KDLOC) + ME           0.90

The R^2 of the SR models for COCOMO and ISBSG suggest that these models were very poor, with only 18-21% of the variation in effort explained by variation in the selected significant features; this, however, does not necessarily lead to poor predictive performance. The SR model for the Desharnais dataset uses L1 and L2 as dummy variables in place of the categorical variable (Dev.mode). For the OLS technique we followed the same procedure used in [20], in which the log transformation is applied to skewed dependent and independent features so that the residuals of the regression model become more homoscedastic and follow a normal distribution more closely [20].
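As a hypothetical miniature of the preparation just described (log transform of skewed variables, dummy coding of categorical ones, back-transformation of predictions), consider the following sketch; the data values and column names are invented for illustration and do not come from the repositories:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented Desharnais-like sample: AdjFP is skewed, Dev_mode is categorical.
train = pd.DataFrame({
    "AdjFP": [120, 340, 89, 510, 230],
    "Dev_mode": ["L1", "L2", "L1", "L3", "L2"],
    "Effort": [2520, 7100, 1600, 11200, 4500],
})

X = pd.get_dummies(train[["Dev_mode"]], drop_first=True)  # dummy variables
X["ln_AdjFP"] = np.log(train["AdjFP"])                    # log transform
y = np.log(train["Effort"])                               # log-effort target

model = LinearRegression().fit(X, y)

new = pd.DataFrame({"AdjFP": [200], "Dev_mode": ["L2"]})
Xn = pd.get_dummies(new[["Dev_mode"]]).reindex(columns=X.columns, fill_value=0)
Xn["ln_AdjFP"] = np.log(new["AdjFP"])
print(np.exp(model.predict(Xn)))   # back-transform to the raw effort scale
```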

Tables 8 and 9 present the results obtained from applying SR, CART and OLS over all datasets. The results reveal that the EBAV variants still produce better accuracy than the regression models, with the exception of the China, NASA and COCOMO datasets. However, the differences in MMRE and pred values on the China and NASA datasets are not prominent, which suggests that the proposed methods have the potential to deliver good estimates. With reference to the comparison of Dejaeger et al. [20], who reported that OLS with log transformation was the superior technique over most of the employed datasets: this paper uses four of the datasets from that article (Desharnais, NASA, COCOMO and Maxwell), and the results obtained for these datasets in terms of MMRE and pred suggest that the proposed techniques beat OLS over three of them, whereas OLS performs better over the COCOMO dataset. This leads to the conclusion that the EBAV variants have the potential to deliver more accurate estimates than regression models.

Table 8 MMRE results of the EBAV variants against the regression models (BO, CO, MA, SR, CART, OLS over the nine datasets). [Numeric entries not preserved in this transcription.]

Table 9 Pred results of the EBAV variants against the regression models. [Numeric entries not preserved in this transcription.]

Table 10 shows the sum of the win, tie and loss values for all methods used in this paper. Every method is compared to 11 other methods, over 4 error measures and 9 datasets, so the maximum value that any one of the win, tie and loss statistics can attain is 11 x 4 x 9 = 396. Notice that in Table 10 (j) the tie values fall within a narrow range; they are therefore not informative enough to differentiate the methods, so we consult the win and loss statistics. There is a considerable difference between the best and the worst methods in terms of wins and losses. The results reveal that BO is the top-ranked method with win-loss = 115, followed by MA in second place with win-loss = 92. Interestingly, both MA and CO have the minimum number of losses over all datasets, which is lower than the losses of BO. Further analysis shows that BO was the winner over 6 datasets, whereas MA, CO and SR were the winners over 2 datasets each, and CART was the winner over only one dataset. Interestingly, the conventional EBA methods were not the top winners over any dataset. Also, the regression models developed by SR and OLS perform better than CART and conventional EBA, but not better than the EBAV variants, as confirmed in Table 10 (j).

Table 10 Win-tie-loss values (win, tie, loss, win-loss) for all estimation methods: sub-tables (a) Albrecht and (b) Kemerer,

(c) Desharnais, (d) COCOMO81, (e) Maxwell, (f) China, (g) ISBSG, (h) Nasa, (i) Telecom, and (j) cumulative win-tie-loss results. [Numeric entries of the sub-tables not preserved in this transcription.]

6. THREATS TO VALIDITY

Internal validity is the degree to which conclusions can be drawn with regard to the configuration of the AB algorithm, including (1) the determination of the initial parameter values and (2) the identification of the initial solutions. Currently there is no reasonable approach to perfectly initialize the parameter values of the AB algorithm, but some studies suggest choosing different values for every problem based on the size and complexity of the datasets. There is also no reliable method to choose the initial solutions, so we used a random selection procedure. We believe these decisions were reasonable, even though they may affect the computational cost of our EBAV method.

Construct validity assures that we are measuring what we actually intended to measure. This paper used Jack-knife (leave-one-out) validation to assess the different methods, though some authors favour n-fold cross-validation. The principal reason for this selection is that the Jack-knife is a deterministic procedure that can be exactly repeated by any other researcher with access to a particular dataset. According to previous studies, the Jack-knife generates lower-bias estimates than n-fold CV, since in n-fold CV the methods must learn from fewer examples; it also generates higher-variance estimates than n-fold CV, since the Jack-knife conducts more tests.

External validity assesses the ability to generalize the obtained findings of our comparative study; we used nine datasets from two different sources to ensure the generalizability of the obtained results. The employed datasets contain a wide diversity of projects in terms of their sources, their domains and the time periods in which they were developed. We also believe that reproducibility of results is an important factor for external validity; therefore we purposely selected publicly available datasets. We acknowledge that some of these datasets are rather old for software cost estimation, as they represent different software development approaches and technologies; we nevertheless used them because they are publicly available and still widely used for benchmarking purposes. An ideal case would be to acquire new datasets with specific properties that best suit the experimental concern and that support random sampling of the data in experimentation.

7. CONCLUSIONS

This paper presented a new approach to identifying and choosing the closest projects in the EBA method using Ranked Voting Methods, namely Borda, Copeland and Maximin. The use of voting rules with EBA has four distinct advantages: (1) it alleviates the effect of extreme values, since no numeric values are used and only a ranking within an existing set of projects is necessary; (2) the aggregated ranks may nominate more than a single project as the closest analogies, which saves the time spent finding an appropriate number of analogies; (3) the procedure followed is transparent to practitioners and easy to understand; and (4) in line with Kocaguneli et al. [21], this paper shows that the ensemble of RVM and EBA works better than single EBA methods. The proposed EBAV variants have been benchmarked against some well-known estimation methods, including regular K-based EBA, SR, OLS and CART. The top-ranked method is BO, as confirmed by the cumulative win-tie-loss results for each compared method in Table 10 (j); BO was never outperformed in seven out of nine datasets.
Future work is required to investigate the impact of feature subset selection [15] on the predictive performance, in addition to the impact of ensembles of voting methods rather than a single voting method.

8. ACKNOWLEDGEMENTS

The authors are grateful to the Applied Science University, Amman, Jordan, for the financial support granted to cover the publication fee of this research article.

9. REFERENCES

1. Azzeh, M., A replicated assessment and comparison of adaptation techniques for analogy-based effort estimation. Empirical Software Engineering 17(1-2).
2. Azzeh, M., Neagu, D., Cowling, P., Fuzzy grey relational analysis for software effort estimation. Empirical Software Engineering 15.
3. Boetticher, G., Menzies, T., Ostrand, T., PROMISE repository of empirical software engineering data. West Virginia University, Department of Computer Science.
4. Chiu, N.H., Huang, S.J., The adjusted analogy-based software effort estimation based on similarity distances. Journal of Systems and Software 80.
5. Fishburn, P.C., Condorcet social choice functions. SIAM Journal on Applied Mathematics 33 (1977).
6. ISBSG, International Software Benchmarking Standards Group, Data CD Release 10.
7. Jorgensen, M., Indahl, U., Sjoberg, D., Software effort estimation by analogy and regression toward the mean. Journal of Systems and Software 68.
8. Kadoda, G., Cartwright, M., Chen, L., Shepperd, M., Experiences using case-based reasoning to predict software project effort. In: Proceedings of EASE: Evaluation and Assessment in Software Engineering Conference, Keele, UK.
9. Keung, J., Kitchenham, B., Jeffery, D.R., Analogy-X: providing statistical inference to analogy-based software cost estimation. IEEE Transactions on Software Engineering 34(4).
10. Kirsopp, C., Mendes, E., Premraj, R., Shepperd, M., An empirical analysis of linear adaptation techniques for case-based prediction. In: International Conference on Case-Based Reasoning (2003).
11. Klamler, C., On the closeness aspect of three voting rules: Borda, Copeland, Maximin. Group Decision and Negotiation 14(3) (2005).
12. Kocaguneli, E., Menzies, T., Bener, A., Keung, J., Exploiting the essential assumptions of analogy-based effort estimation. IEEE Transactions on Software Engineering.
13. Koch, S., Mitlöhner, J., Software project effort estimation with voting rules. Decision Support Systems 46 (2009).
14. Mendes, E., Watson, I., Triggs, C., Mosley, N., Counsell, S., A comparative study of cost estimation models for web hypermedia applications. Empirical Software Engineering 8.
15. Menzies, T., Chen, Z., Hihn, J., Lum, K., Selecting best practices for effort estimation. IEEE Transactions on Software Engineering 32.
16. Miranda, E., Improving subjective estimates using paired comparisons. IEEE Software 18(1) (2001).
17. Pham, D.T., Ghanbarzadeh, A., Koç, E., Otri, S., Rahim, S., Zaidi, M., The Bees Algorithm: a novel tool for complex optimisation problems. In: 2nd Virtual International Conference on Intelligent Production Machines and Systems (I*PROMS-06), Cardiff, UK (2006).
18. Shepperd, M., Schofield, C., Estimating software project effort using analogies. IEEE Transactions on Software Engineering 23.
19. Eckert, D., Klamler, C., Mitlöhner, J., Schlötterer, C., A distance-based comparison of basic voting rules. Central European Journal of Operations Research 14(4) (2006).
20. Dejaeger, K., Verbeke, W., Martens, D., Baesens, B., Data mining techniques for software effort estimation: a comparative study. IEEE Transactions on Software Engineering 38(2) (2012).
21. Kocaguneli, E., Menzies, T., Keung, J., On the value of ensemble effort estimation. IEEE Transactions on Software Engineering (2011).


More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Probabilistic Models of Software Function Point Elements

Probabilistic Models of Software Function Point Elements Probabilistic Models of Software Function Point Elements Masood Uzzafer Amity university Dubai Dubai, U.A.E. Email: muzzafer [AT] amityuniversity.ae Abstract Probabilistic models of software function point

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

International Journal of Scientific & Engineering Research, Volume 6, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 6, Issue 10, October ISSN International Journal of Scientific & Engineering Research, Volume 6, Issue 10, October-2015 726 Performance Validation of the Modified K- Means Clustering Algorithm Clusters Data S. Govinda Rao Associate

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Fundamentals of Operations Research. Prof. G. Srinivasan. Department of Management Studies. Indian Institute of Technology, Madras. Lecture No.

Fundamentals of Operations Research. Prof. G. Srinivasan. Department of Management Studies. Indian Institute of Technology, Madras. Lecture No. Fundamentals of Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras Lecture No. # 13 Transportation Problem, Methods for Initial Basic Feasible

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

SELECTION OF A MULTIVARIATE CALIBRATION METHOD

SELECTION OF A MULTIVARIATE CALIBRATION METHOD SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Evaluation of Direct and Indirect Blockmodeling of Regular Equivalence in Valued Networks by Simulations

Evaluation of Direct and Indirect Blockmodeling of Regular Equivalence in Valued Networks by Simulations Metodološki zvezki, Vol. 6, No. 2, 2009, 99-134 Evaluation of Direct and Indirect Blockmodeling of Regular Equivalence in Valued Networks by Simulations Aleš Žiberna 1 Abstract The aim of the paper is

More information

1) Give decision trees to represent the following Boolean functions:

1) Give decision trees to represent the following Boolean functions: 1) Give decision trees to represent the following Boolean functions: 1) A B 2) A [B C] 3) A XOR B 4) [A B] [C Dl Answer: 1) A B 2) A [B C] 1 3) A XOR B = (A B) ( A B) 4) [A B] [C D] 2 2) Consider the following

More information

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A

More information

Math 130 Final Exam Study Guide. 1. Voting

Math 130 Final Exam Study Guide. 1. Voting 1 Math 130 Final Exam Study Guide 1. Voting (a) Be able to interpret a top choice ballot, preference ballot and preference schedule (b) Given a preference schedule, be able to: i. find the winner of an

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART Classification and Regression Trees Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART CART stands for Classification And Regression Trees.

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification

An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification Flora Yu-Hui Yeh and Marcus Gallagher School of Information Technology and Electrical Engineering University

More information

Information Retrieval Rank aggregation. Luca Bondi

Information Retrieval Rank aggregation. Luca Bondi Rank aggregation Luca Bondi Motivations 2 Metasearch For a given query, combine the results from different search engines Combining ranking functions Text, links, anchor text, page title, etc. Comparing

More information

Further Thoughts on Precision

Further Thoughts on Precision Further Thoughts on Precision David Gray, David Bowes, Neil Davey, Yi Sun and Bruce Christianson Abstract Background: There has been much discussion amongst automated software defect prediction researchers

More information

Affymetrix GeneChip DNA Analysis Software

Affymetrix GeneChip DNA Analysis Software Affymetrix GeneChip DNA Analysis Software User s Guide Version 3.0 For Research Use Only. Not for use in diagnostic procedures. P/N 701454 Rev. 3 Trademarks Affymetrix, GeneChip, EASI,,,, HuSNP, GenFlex,

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION

SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION Kamil Zakwan Mohd Azmi, Zuwairie Ibrahim and Dwi Pebrianti Faculty of Electrical

More information

An Empirical Comparison of Spectral Learning Methods for Classification

An Empirical Comparison of Spectral Learning Methods for Classification An Empirical Comparison of Spectral Learning Methods for Classification Adam Drake and Dan Ventura Computer Science Department Brigham Young University, Provo, UT 84602 USA Email: adam drake1@yahoo.com,

More information

Possibilities of Voting

Possibilities of Voting Possibilities of Voting MATH 100, Survey of Mathematical Ideas J. Robert Buchanan Department of Mathematics Summer 2018 Introduction When choosing between just two alternatives, the results of voting are

More information

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing 1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

Stochastic propositionalization of relational data using aggregates

Stochastic propositionalization of relational data using aggregates Stochastic propositionalization of relational data using aggregates Valentin Gjorgjioski and Sašo Dzeroski Jožef Stefan Institute Abstract. The fact that data is already stored in relational databases

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

Regression. Dr. G. Bharadwaja Kumar VIT Chennai

Regression. Dr. G. Bharadwaja Kumar VIT Chennai Regression Dr. G. Bharadwaja Kumar VIT Chennai Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called

More information