A General Method for Statistical Performance Evaluation


Longzhuang Li, Dept. of Computing and Mathematical Sciences, Texas A&M University, Corpus Christi, TX 78412
Wei Zhang, Dept. of Computer Engineering & Computer Science, University of Missouri, Columbia, MO 65211
Yi Shang, Dept. of Computer Engineering & Computer Science, University of Missouri, Columbia, MO 65211
Hongchi Shi, Dept. of Computer Engineering & Computer Science, University of Missouri, Columbia, MO 65211

Abstract

In this paper, we propose a general method for statistical performance evaluation. The method incorporates various statistical metrics and automatically selects an appropriate statistical metric according to the problem parameters. Empirically, we compare the performance of five representative statistical metrics under different conditions through simulation: expected loss, the Friedman statistic, interval-based selection, probability of win, and probably approximately correct. In the experiments, expected loss is the best for small means, like 1 or 2, and probably approximately correct is the best in all the other cases. We also apply the general method to compare the performance of HITS-based algorithms that combine four relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set of broad topic queries. Among the four relevance scoring methods, CDR is statistically the best when it is combined with a HITS-based algorithm.

1. Introduction

Performance evaluation has many real-world applications. For example, when a customer wants to buy a computer, he needs to compare prices, CPU speed, memory, pre-installed software, etc., among multiple choices before deciding which one to buy. In information retrieval on the Web, we may wonder which search engine will return the most relevant information for a given set of queries [3]. In performance evaluation, hypotheses are selected or ranked based on a comparison of their performance on sample data.
Research supported in part by the National Science Foundation under grants DUE and EIA. Among the real-world applications of statistical performance evaluation, many solutions or hypotheses may exist, and the ones performing best in terms of predetermined measurements are sought. For example, in image compression, it is critical to design and choose the best filter banks for the quality of the reconstructed images [2]. In evolutionary algorithms, the individuals to be propagated to future generations are often selected with likelihood proportionate to their rank in the current generation [7]. The performance measurements of hypotheses are numerical values that must be estimated from sample data and may contain noise. In addition, due to the time and resource constraints of real applications, it is often impractical or even impossible to evaluate all hypotheses. Thus, statistical metrics are used to evaluate the performance of hypotheses efficiently using a limited amount of sample data. Many statistical metrics are available, and their results depend on many factors, such as the size of the sample data and the distribution of the performance measurements of the hypotheses. Selecting the most appropriate statistical metric is a challenging task. In this paper, a general, effective method to evaluate hypothesis performance is developed. The method incorporates various statistical metrics and automatically selects an appropriate one based on the parameters of the application. We have considered the following important parameters: the number of hypotheses, the size of the sample data for each hypothesis, the distribution of the performance measurements, and the distribution of the noise in the performance measurements.
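To make the selection step concrete, the following is a minimal Python sketch of such an automatic choice, assuming only the rule reported in this paper's experiments (expected loss for small sample means, probably approximately correct otherwise); the function name, the dictionary layout, and the threshold of 2 are illustrative assumptions, not part of the method's specification.

```python
def select_metric(metrics, samples, small_mean_threshold=2.0):
    """Choose a statistical metric based on one problem parameter:
    the magnitude of the sample means. `metrics` maps metric names
    to comparison functions; the threshold is an assumption."""
    means = [sum(s) / len(s) for s in samples]
    if max(means) <= small_mean_threshold:
        return metrics["EL"]   # expected loss: best for small means
    return metrics["PAC"]      # probably approximately correct otherwise
```

A full implementation would also examine the other parameters listed above, such as the number of hypotheses, the sample size, and the noise distribution.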
Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS'03). © 2002 IEEE.

Also, the general method is applied to evaluate the performance of the combination of HITS-based algorithms [12, 2] with one of four relevance scoring methods: cover density ranking (CDR) [6], Okapi similarity measurement (Okapi) [9], vector space model (VSM) [6], and the three-level scoring method (TLS) [4], using a set of broad topic

queries. In the experiments, we study the performance of five representative statistical metrics using sample data with four different types of distributions. The five statistical metrics are expected loss (EL) [4], the Friedman statistic [7], interval-based selection [4], probability of win (PW) [11], and probably approximately correct (PAC) [9]. The four distributions of the sample data are the chi-square, exponential, normal, and Poisson distributions. This paper is organized as follows. In Section 2, we briefly review statistical metrics for performance evaluation. In Section 3, we propose a general method for statistical performance evaluation and apply it to evaluate the performance of HITS-based algorithms. In Section 4, we describe the criteria used to compare the performance of different statistical metrics. In Section 5, we show our experimental results. In Section 6, we summarize the paper.

2. Statistical Metrics for Performance Evaluation

In this section, we briefly review statistical metrics for evaluating the performance of different hypotheses. Performance evaluation consists of two kinds of problems: hypothesis selection problems and hypothesis ranking problems [1, 4, 5]. A hypothesis selection problem arises when we select the best one from a set of hypotheses, given their performance over some sample data. In hypothesis ranking problems, a set of hypotheses is ranked by expected performance. Hypothesis ranking problems are an extension of hypothesis selection problems [5]. Generally, statistical metrics for hypothesis selection problems can also be applied to hypothesis ranking problems. The distinction between the two is that hypothesis selection returns a single best hypothesis, whereas hypothesis ranking returns an ordering of all the hypotheses. Many metrics have been developed to solve hypothesis selection problems (see Figure 1).
They can be classified into two groups: one for problems with a small number of hypotheses, and the other for problems with a large number of hypotheses. The statistical metrics for a small number of hypotheses include interval-based selection [4], the COMPOSER system [8], the Turnbull and Weiss algorithm [18], the probably approximately correct (PAC) model [9], the expected loss (EL) approach [4], the Friedman statistic [7], and the probability of win [11]. For a large or infinite number of hypotheses, generate-and-test search strategies are usually adopted to find the best hypothesis. As defined by Mitchell [15], these strategies can broadly

Figure 1. Statistical hypothesis selection metrics. (A taxonomy: for a small number of hypotheses, interval-based selection, COMPOSER, the Turnbull & Weiss algorithm, the Friedman statistic, and probability of win; for a large or infinite number of hypotheses, data-driven strategies (depth-first, breadth-first, version-space, decision-theoretic) and knowledge-driven strategies (explanation-based).)

be classified as data-driven and knowledge-driven. The difference lies in the amount of testing performed: data-driven metrics do not rely on domain knowledge and often require extensive tests on the hypotheses under consideration, whereas knowledge-driven metrics depend on domain knowledge and one or a few tests to deduce new hypotheses. In this paper, we focus on the statistical metrics for a small number of hypotheses. To find the best one among a small set of hypotheses, we compare the hypotheses pairwise.

3. A General Method for Statistical Performance Evaluation

Because a statistical metric may only be suitable for certain situations, in this section we first propose a general method for statistical performance evaluation, and then apply it to a real-world application.

3.1. A General Method

The general method consists of the following major steps (see Figure 2):

1. Select a set of sample data.
At this step, we must choose sample data carefully when the hypotheses are too expensive to be tested extensively and we have a large, possibly infinite, amount of data. On the other hand, when the size of the sample data is limited and the cost of information is high, it is very important to minimize the cost of acquiring additional samples while achieving the desired evaluation quality.

2. Test the performance of each hypothesis on the sample data. Sometimes, the performance of hypotheses depends on the measures and algorithms we use to test the sample data.

3. Select an appropriate statistical metric according to the problem parameters. For example, those parameters may include the number of hypotheses, the size of the sample data for each hypothesis, the distribution of the performance measurements, and the distribution of the noise in the performance measurements.

4. Select or rank the hypotheses based on the performance measurements using the chosen statistical metric. The chosen hypothesis is the one with the best statistical value. Also, we expect the selected hypothesis to be generalizable; that is, it must perform well not only on the sample data but also on data not seen during evaluation.

The general method incorporates various statistical metrics, automatically selects an appropriate one based on the parameters of the application, and can be adapted to different applications under time and resource constraints.

3.2. An Application: Performance Evaluation of HITS-based Algorithms

Kleinberg's hypertext-induced topic selection (HITS) algorithm [12] is a very popular and effective algorithm for ranking documents based on the link information among a set of documents. The algorithm presumes that a good hub is a document that points to many others, and a good authority is a document that many documents point to. Hubs and authorities exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs. In the context of Web search, a HITS-based algorithm first collects a base document set for each query. Then it recursively calculates the hub and authority values for each document. To gather the base document set I, first a root set R matching the query is fetched from a search engine; then, for each document r in R, the set of documents that point to r and the set of documents that are pointed to by r are added to the set I as r's neighborhood.
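The base-set construction just described can be sketched as follows; this is a simplified illustration in which `out_links_of` and `in_links_of` stand in for search-engine and connectivity lookups, and the in-link cap mirrors the kind of limit used in practice. All names here are assumptions of the sketch.

```python
def build_base_set(root, out_links_of, in_links_of, max_in=50):
    """Grow the HITS base set I from the root set R by adding, for each
    root document, the documents it points to and (a capped number of)
    the documents pointing to it. The cap `max_in` is an assumption."""
    base = set(root)
    for r in root:
        base.update(out_links_of(r))          # documents r points to
        base.update(in_links_of(r)[:max_in])  # documents pointing to r
    return base
```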
For a document i in I, let a_i and h_i be the authority and hub values, respectively. To begin the algorithm, each a_i and h_i is initialized to 1. While the values have not converged, the algorithm iteratively proceeds as follows:

1. For each i, set a_i to the sum of h_j over all documents j in I that point to i.
2. For each i, set h_i to the sum of a_j over all documents j in I that are pointed to by i.
3. Normalize the a_i and h_i values so that the a_i sum to 1 and the h_i sum to 1.

Kleinberg showed that the algorithm eventually converges, but the bound on the number of iterations is unknown. In practice, the algorithm converges quickly. Because the HITS algorithm ranks documents depending only on the in-degree and out-degree of links, it causes problems in some cases. For example, Bharat [2] identified two problems: mutually reinforcing relationships between hosts, and topic drift. Both problems can be solved or alleviated by adding weights to documents. In Bharat's improved HITS algorithm (BHITS), to solve the first problem, a document is given an authority weight of 1/k if the document is one of a group of k documents on a first host that link to a single document on a second host, and a hub weight of 1/l if there are l links from the document on a first host to a set of documents on a second host [2]. The second problem can be alleviated by adding weights to edges based on the text in the documents or their anchors [2, 3]. BHITS achieved remarkably better results than Kleinberg's HITS algorithm through a simple modification addressing the first problem, while further precision was obtained by adding content analysis for the second problem. Disregarding the time it may take, combining connectivity and content analysis has proved useful in improving precision.
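The iteration above can be sketched in Python as follows, with the link structure given as a mapping from each document to the documents it points to; the normalization follows step 3, and the fixed iteration count is an assumption standing in for a convergence test.

```python
def hits(out_links, iterations=50):
    """Compute authority and hub values by the HITS iteration.
    `out_links` maps each document id to the ids it points to."""
    docs = set(out_links)
    for targets in out_links.values():
        docs.update(targets)
    auth = {d: 1.0 for d in docs}  # a_i initialized to 1
    hub = {d: 1.0 for d in docs}   # h_i initialized to 1
    for _ in range(iterations):
        # step 1: a_i is the sum of h_j over documents j pointing to i
        new_auth = {d: 0.0 for d in docs}
        for j, targets in out_links.items():
            for i in targets:
                new_auth[i] += hub[j]
        # step 2: h_j is the sum of a_i over documents i that j points to
        new_hub = {j: sum(new_auth[i] for i in out_links.get(j, ()))
                   for j in docs}
        # step 3: normalize so the authority and hub values each sum to 1
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {d: v / a_total for d, v in new_auth.items()}
        hub = {d: v / h_total for d, v in new_hub.items()}
    return auth, hub
```

For example, in a graph where two documents both point to a third, the third emerges as the top authority and the two pointers as equal hubs.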
But the similarity measure currently used is the vector space model [2] or just a simple occurrence frequency of the query terms in the text around the anchors [3], which may not be the best method to evaluate the relevance of Web documents, because most queries submitted to search engines are short, consisting of three terms or fewer [6]. Although we can expand short queries by adding more related words, the expansion itself can cause topic drift. In this paper, we statistically compare the performance of four relevance scoring methods when they are combined with Bharat's improved HITS algorithm. Three of them are variations of methods widely used in the traditional information retrieval field: cover density ranking (CDR) [6], Okapi similarity measurement (Okapi) [9], and the vector space model (VSM) [6]. The fourth is the three-level scoring method (TLS) [4], which mimics commonly used manual similarity measuring approaches.

4. Performance Comparison of Statistical Metrics

In this paper, we compare the performance of five representative statistical metrics under different conditions through simulation. The five statistical metrics are expected loss (EL), the Friedman statistic, interval-based selection, probability of win (PW), and probably approximately correct (PAC). The four distributions used in our

Figure 2. A general method for statistical performance evaluation. (The flowchart runs from the sample data, through measuring performance and selecting an appropriate statistical comparison metric, to the final statistical comparison results.)

experiments are the chi-square, exponential, normal, and Poisson distributions. When we compare the performance of different statistical metrics, we need criteria to judge which metrics are better at identifying that one distribution X is better (or worse) than another distribution Y. There are two ways to do this. One is to fix the size of the sample data from X and Y and find the correct ratios of the statistical metrics. This can be done through simulation. We randomly generate a fixed number of samples, such as 20, from each distribution, apply each statistical metric to the sample data, and see whether the result is correct or not. This test can be repeated many times, such as 1,000 times, to get an average correctness, which is more accurate than a single run. The other way is to find the smallest sample size that each statistical metric needs in order to identify the correct result, i.e., that X is better (or worse) than Y, with a high confidence, such as 99%. This can also be done through experiments. In our experiments, we take the first approach. In our controlled experiments, we generate the sample data from a known distribution, such as the chi-square distribution, so we know ahead of time which data set is better. Then we can check how well the statistical metrics work by comparing their answers with the right ones. We take a simple approach by assuming that the distribution with the larger mean is better, although this may be subjective in some real situations. Although most of the statistical metrics we test assume that the sample data are normally distributed, we do not need to change the formulas when we test the metrics on sample data with other distributions.
In other words, we are testing the robustness of these metrics in situations where the sample data are not normally distributed. Of the five statistical metrics, the Friedman statistic has two control parameters: a sample size n and a significance level α. The interval-based techniques have three control parameters: a sample size n, a desired confidence of correct ranking γ, and an indifference setting ε. The EL technique has two control parameters: a sample size n and a loss threshold l. The PAC techniques have three control parameters: a sample size n, a desired confidence of correct ranking γ, and an indifference setting ε. The probability of win has two control parameters: a sample size n and a desired confidence of correct ranking p. In our experiments, we set γ and p to 0.95, α and l to 0.05, and ε to 10% of the standard deviation of the sample data.

5. Experiments

In the experiments, we first analyze the correct ratios of the five statistical metrics through simulation based on data with and without noise, and draw some rules on how to select statistical metrics under different conditions. Then we apply the general method to compare the performance of HITS-based algorithms using the chosen statistical metrics.

5.1. Simulation on Data Without Noise

In this section, without considering noise, we run two groups of experiments on data with different distributions and different sample sizes. In the first group of experiments, for each run, we generate two sample sets using Matlab and use all the data in both sample sets to test the different statistical metrics. In the second group of experiments, we first generate two data sets of size 1,000, then randomly pick a subset from each data set for each run. In both groups of experiments, the two compared data sets have means 1 and 2, 1 and 5, 4 and 5, and 10 and 20, respectively.
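As one concrete example of a pairwise metric, the probability of win is commonly computed from a normal approximation of the difference of the two sample means; the sketch below follows that standard formulation (the exact formula used in [11] may differ in detail), declaring X better than Y when the value reaches the confidence p.

```python
import math

def probability_of_win(xs, ys):
    """Approximate P(true mean of X > true mean of Y) by a normal
    approximation of the difference of the two sample means."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((v - mx) ** 2 for v in xs) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
```

With p = 0.95, X would be selected over Y when `probability_of_win(xs, ys) >= 0.95`.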
If the difference between the two means is less than or equal to 1, we consider the difference small; if it is greater than or equal to 4, we consider it big. We set the variance of each data set equal to its mean for the normal distribution. In the first group of experiments, to compare two random variables X and Y of the same distribution (e.g., chi-square) with different means and variances:

for i = 1 to 1000 (the number of random runs) {
  1. randomly generate a sample set of size k (e.g., 30) for X using Matlab and a sample set of size k for Y.
  2. compare these two sample sets using each statistical metric.
  3. check the results against the correct answer.
}
compute the correct ratio of each statistical metric for the sample size k based on the 1000 random runs.

In the second group of experiments, to compare two random variables X and Y of the same distribution (e.g., chi-square) with different means and variances:

for i = 1 to 20 (the number of random runs for a sample set of size k, e.g., 30) {
  generate a data set D1 for X and a data set D2 for Y, each of size 1,000.
  for j = 1 to 1000 (the number of test runs for the correct ratio) {
    1. randomly pick a sample set of size k for X from D1 and a sample set of size k for Y from D2.
    2. compare these two sample sets using each statistical metric.
    3. check the results against the correct answer.
  }
  compute the correct ratio of the test runs for the sample size k.
}
calculate the average correct ratio of the 20 random runs for the sample size k.

The results of the first group of experiments are shown in Figures 3 and 4. The results of the second group are not shown here because they are similar to those of the first group. Figures 3 and 4 show that as the size of the sample data increases, the correct ratios of all five statistical metrics also increase. There are some exceptions, however, where the correct ratios become smaller when the data set is enlarged. For example, in the top-left plot, labeled chisquare 1 2 in Figure 3, the performance of one metric drops as the size of the sample set increases from 5 to 20. The reason may be that the sample sets do not represent the underlying distribution well enough. Comparing all the plots, we can easily see that EL and PAC are the two best metrics under all conditions. In particular, EL is the best for small means, like 1 or 2, and PAC is the best in all the other cases. If the difference m between two sample means µx and µy is big, e.g., m = 4 with µx = 1 and µy = 5 (see the bottom half of Figure 3), a sample set of size 10 can almost guarantee correct ratios above 90% for three of the metrics, and a sample set of size 50 is good enough for all five metrics to get correct ratios of almost 100%. If m is small, e.g.,
m = 1 with µx = 1 and µy = 2 (see the top half of Figure 3), a sample set of size 30 is large enough to assure correct ratios above 80% for EL and PAC; with a sample size of 40, correct ratios above 90% can be achieved by EL and PAC; and a sample set of size 200 yields 100% correct ratios for all five metrics. As µx and µy become larger, more samples are needed to achieve high correct ratios (see Figure 4). Both groups of experiments tell us that the size of the data set does not matter much as long as it is large enough to represent the underlying distribution sufficiently; the experiments show that a data size of 1,000 is good enough. Next, we study how the performance of the statistical metrics degrades as the problem becomes harder. There are two ways to make the problem harder: the first is to make the difference between the means of the two compared data sets smaller and smaller; the second is to fix the difference of the two means and make the means of both data sets larger. For distributions like chi-square, exponential, and Poisson, the variance grows as the mean grows. In our experiments, we choose the second way to make the problem more difficult. We have run experiments with data sets of chi-square, exponential, Poisson, and normal distributions. The experimental results can be found in Figure 5, which shows the correct ratios of the five statistical metrics for different mean pairs when the data set is of chi-square, exponential, normal, and Poisson distribution, respectively. In Figure 5, each data point of a curve is the average performance over sub-sample data sets drawn from a data set of size 1,000 generated by Matlab; the sizes of the sub-sample data sets are 5, 10, 15, 20, 30, 40, 50, 100, 200, 500, 800, and 1,000, respectively. In Figure 5, x = 1 represents 1-2, which denotes that the two compared data sets have means 1 and 2, and x = 2 represents 4-5, which denotes that the two compared data sets have means 4 and 5.
Similarly, x = 3 stands for 10-12, x = 4 for 20-22, x = 5 for 30-32, x = 6 for 40-42, x = 7 for 50-52, and so on up to 80-82; after the first two pairs, the means increase in steps of 10. In Figure 5, for each distribution, the variance of each data set is set equal to its mean. Figure 5 shows that the correct ratios of the five statistical metrics drop quickly when the means are increased from 1-2 (x = 1) to 10-12 (x = 3), but after that the correct ratios change only within a small range. When the means are small, e.g., x = 1, EL is the best, but PAC is the best for larger means (x = 3 and beyond). Figure 5 also shows that the correct ratios of the statistical metrics degrade at roughly similar rates over the first few mean pairs, but after that the degradation rates vary little.

5.2. Simulation on Noisy Data

We use white noise in our experiments. White noise is randomly (uniformly) distributed in a certain range. We determine the noise range as a percentage of the standard deviation of the base distribution, for example 10% for small noise and 100% for large noise. In our experiments, we test the performance of the five statistical metrics on data sets contaminated by small or large white noise.

Figure 3. The correct ratios of the five statistical metrics for different numbers of sample data when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal 1 2 denotes that the data set is of normal distribution and that the two compared data sets have means 1 and 2, respectively. The others are analogously defined.

Figure 4. The correct ratios of the five statistical metrics for different numbers of sample data when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal 4 5 denotes that the data set is of normal distribution and that the two compared data sets have means 4 and 5, respectively. The others are analogously defined.

Figure 5. The correct ratios of the five statistical metrics for different mean pairs when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal denotes that the data set is of normal distribution. The others are analogously defined.

We compare four pairs of data sets for each distribution. These data sets have means 1 and 2, 4 and 5, 10 and 20, and 1 and 5 for the chi-square, exponential, and Poisson distributions, and means and variances (10,10) and (11,10), (4,10) and (5,10), (10,30) and (11,30), and (4,30) and (5,30) for the normal distribution. The data sets also have different sizes: 5, 10, 15, 20, 30, 40, 50, and 500, respectively. From the experimental results, we find that small or large noise has little or no effect on the correct ratios of the statistical metrics. We also find that for the normal distribution, if the data sets have the same means but larger variances, the correct ratios decrease. For example, the correct ratios for sample data pairs with means and variances (4,30) and (5,30) are less than those for pairs with means and variances (4,10) and (5,10).

5.3. Guidelines for Choosing Statistical Metrics

In this section, we summarize the results of all the previous experiments as follows.

The size of the data set does not matter much as long as it is large enough to represent the underlying distribution sufficiently. The experiments show that data sets of size 1,000 are good enough.

EL is the best for small means, like 1 or 2, and PAC is the best in all the other cases (see Figure 5).

(a) If the difference m between two means µx and µy is big, e.g.,
m = 4 with µx = 1 and µy = 5, a sample set of size 10 can almost guarantee correct ratios above 90% for three of the metrics, and a sample set of size 50 is good enough for all five metrics to get correct ratios of almost 100% (see the bottom half of Figure 3). As µx and µy become larger, more samples are needed to achieve high correct ratios. (b) If m is small, e.g., m = 1 with µx = 1 and µy = 2, a sample set of size 40 is large enough to assure correct ratios above 90% for EL and PAC, and a sample set of size 200 yields 100% correct ratios for all five metrics (see the top half of Figure 3). As µx and µy become larger, more samples are needed to achieve high correct ratios (see Figure 4). For the normal distribution, when the variance is large, more

sample data are needed to get a high correct ratio.

White noise whose range is about 10% of the standard deviation has little effect on the correct ratios.

5.4. Experiments on the Performance of HITS-based Algorithms

In our experiments, 28 broad topic queries and five search engines are used. The queries are exactly the same as those used in [2, 3]; one example is vintage car. For each query, to build a base set, we start five threads simultaneously to collect the top 20 hits and their neighborhoods from five search engines: AltaVista, Fast, Google, HotBot, and NorthernLight, respectively. The combination of these top hits and their neighborhoods forms the base set. For each document in the root set, we limit it to at most 50 in-links and collect all its out-links. The default search mode of the five search engines and lower-case queries were used. The way we construct the base set is different from previous works [2, 12], which usually build the base set from only one search engine, e.g., AltaVista. Combining the top 20 hits and their neighborhoods from five search engines gives us a more relevant base set. Running on a Sun Ultra workstation with a 300 MHz UltraSPARC-IIi processor connected to the Internet by 100 Mbps fast Ethernet, the Java program took about three to five minutes to gather the base set for each query. After we remove duplicate links, intra-domain links, broken links, and irrelevant links, the numbers of distinct links in these 28 base sets range upward from 65. Table 1 lists the four HITS-based algorithms used in the experiments. The four algorithms are BHITS and its combinations with the four relevance scoring methods.

Table 1. The four HITS-based algorithms
Algorithm  Description
CB         Combination of BHITS and CDR
OB         Combination of BHITS and Okapi
TB         Combination of BHITS and TLS
VB         Combination of BHITS and VSM
To compare the performance of the four algorithms mentioned above, we first use the pooling method [10] to build, for each query, a pool formed by the top authority links and the top hub links generated by each of the four algorithms. We then recruited three graduate students to personally visit all the documents in each query pool and manually score them on a scale from 0 to 10, with 0 representing the most irrelevant and 10 the most relevant. A Web page receives a high score if it contains both useful and comprehensive information about the query. A page may also be given a high score if it has many links that lead to relevant material, because we encouraged the three evaluators to follow outgoing links and browse a page's neighborhood. We did not score pages written in languages we do not understand, and we did not tell the evaluators which algorithm a set of links was derived from. We take the average score of the three evaluators for a link, and the average score of the top 20 links (the top authority links and the top hub links) as the final score for a query.

Table 2. Average improvement (%) of relevance scores between two algorithms. Each number in the table is the improvement of the method in the first column (CB, OB, or TB) over the method in the first row (OB, TB, or VB).

Table 2 presents the average improvement (%) of relevance scores between pairs of algorithms. It shows that the combinations of the BHITS algorithm with any of the four scoring methods have comparable performance, with CDR the best and VSM the worst. The best algorithm, CB, improves on OB, TB, and VB by only small margins (2.9% over VB). According to the guidelines for selecting statistical metrics, EL and PAC are the two best metrics under various testing conditions, so we choose both of them to compare the performance of the four algorithms. We set the confidence level to 0.95 for PAC and the loss threshold to 0.05 for EL. The results are shown in Table 3.
Both metrics give consistent results: CB is the best and VB is the worst among the four algorithms combined with relevance scoring methods, and OB is better than TB, although the two values are slightly below the confidence level.

6. Conclusion

In this paper, we propose a general method for statistical performance evaluation. The method incorporates various statistical metrics and automatically selects an appropriate one based on the parameters of a specific application. By combining relevance scoring methods with a HITS-based algorithm, we statistically compare the performance of four relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set of broad topic queries, with CDR the best and VSM the worst, although the performance differences among the four relevance scoring methods are not significant.

Table 3. Statistical performance comparison of different algorithms. CBvOB means the statistical comparison of CB over OB, and the others are similarly defined. (Rows: statistical method; columns: CBvOB, CBvTB, CBvVB, OBvTB, OBvVB, TBvVB.)

References

[1] R. E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 25(1):16-39, 1954.
[2] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104-111, 1998.
[3] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. Computer Networks and ISDN Systems, 30:65-74, 1998.
[4] S. Chien, J. Gratch, and M. Burl. On the efficient allocation of resources for hypothesis evaluation: A statistical approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), July 1995.
[5] S. Chien, A. Stechert, and D. Mutz. Efficient heuristic hypothesis ranking. Journal of Artificial Intelligence Research, 1999.
[6] C. L. A. Clarke, G. V. Cormack, and E. A. Tudhope. Relevance ranking for one to three term queries. Information Processing & Management, 36:291-311, 2000.
[7] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Pub. Co., 1989.
[8] J. Gratch and G. DeJong. Composer: A probabilistic solution to the utility problem in speedup learning. In Proceedings of the 10th National Conference on Artificial Intelligence, 1992.
[9] D. Hawking, P. Bailey, and N. Craswell. ACSys TREC-8 experiments. In Proceedings of TREC-8, 1999.
[10] D. Hawking, N. Craswell, and P. Thistlewaite. Overview of the TREC-7 very large collection track. In Proceedings of TREC-7, 1998.
[11] A. Ieumwananonthachai and B. W. Wah. Statistical generalization of performance-related heuristics for knowledge-lean applications. Int'l J. of Artificial Intelligence Tools, 5(2):61-79, June 1996.
[12] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, ACM Press, New York, 1998.
[13] H. V. Leighton and J. Srivastava. First 20 precision among world wide web search services (search engines). J. of the American Society for Information Science, 50(10):870-881, 1999.
[14] L. Li and Y. Shang. A new statistical method for evaluating search engines. In Proc. IEEE 12th Int'l Conf. on Tools with Artificial Intelligence, 2000.
[15] T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203-226, 1982.
[16] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., 1989.
[17] S. Siegel and N. J. Castellan, Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 1988.
[18] B. W. Turnbull and L. I. Weiss. A class of sequential procedures for k-sample problems concerning normal means with unknown equal variances. In T. J. Santner and A. C. Tamhane, editors, Design of Experiments: Ranking and Selection. Marcel Dekker, 1984.
[19] D. Vanderbilt and S. G. Louie. A Monte Carlo simulated annealing approach to optimization over continuous variables. Journal of Computational Physics, 56:259-271, 1984.
[20] J. D. Villasenor, B. Belzer, and J. Liao. Wavelet filter evaluation for image compression. IEEE Transactions on Image Processing, 4:1053-1060, 1995.


More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

A SIMULATED ANNEALING ALGORITHM FOR SOME CLASS OF DISCRETE-CONTINUOUS SCHEDULING PROBLEMS. Joanna Józefowska, Marek Mika and Jan Węglarz

A SIMULATED ANNEALING ALGORITHM FOR SOME CLASS OF DISCRETE-CONTINUOUS SCHEDULING PROBLEMS. Joanna Józefowska, Marek Mika and Jan Węglarz A SIMULATED ANNEALING ALGORITHM FOR SOME CLASS OF DISCRETE-CONTINUOUS SCHEDULING PROBLEMS Joanna Józefowska, Marek Mika and Jan Węglarz Poznań University of Technology, Institute of Computing Science,

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Complementary Graph Coloring

Complementary Graph Coloring International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Complementary Graph Coloring Mohamed Al-Ibrahim a*,

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

Estimating the Information Rate of Noisy Two-Dimensional Constrained Channels

Estimating the Information Rate of Noisy Two-Dimensional Constrained Channels Estimating the Information Rate of Noisy Two-Dimensional Constrained Channels Mehdi Molkaraie and Hans-Andrea Loeliger Dept. of Information Technology and Electrical Engineering ETH Zurich, Switzerland

More information

New Results on Simple Stochastic Games

New Results on Simple Stochastic Games New Results on Simple Stochastic Games Decheng Dai 1 and Rong Ge 2 1 Tsinghua University, ddc02@mails.tsinghua.edu.cn 2 Princeton University, rongge@cs.princeton.edu Abstract. We study the problem of solving

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information