A General Method for Statistical Performance Evaluation


Longzhuang Li, Dept. of Computing and Mathematical Sciences, Texas A&M University, Corpus Christi, TX 78412
Wei Zhang, Dept. of Computer Engineering & Computer Science, University of Missouri, Columbia, MO 65211
Yi Shang, Dept. of Computer Engineering & Computer Science, University of Missouri, Columbia, MO 65211
Hongchi Shi, Dept. of Computer Engineering & Computer Science, University of Missouri, Columbia, MO 65211

Abstract

In this paper, we propose a general method for statistical performance evaluation. The method incorporates various statistical metrics and automatically selects an appropriate statistical metric according to the problem parameters. Empirically, we compare the performance of five representative statistical metrics under different conditions through simulation: expected loss, the Friedman statistic, interval-based selection, probability of win, and probably approximately correct. In the experiments, expected loss is the best for small means, like 1 or 2, and probably approximately correct is the best in all the other cases. We also apply the general method to compare the performance of HITS-based algorithms that combine four relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set of broad topic queries. Among the four relevance scoring methods, CDR is statistically the best when it is combined with a HITS-based algorithm.

1. Introduction

Performance evaluation has many real-world applications. For example, when a customer wants to buy a computer, he needs to compare prices, CPU speed, memory, pre-installed software, etc., among multiple choices before deciding which one to buy. In information retrieval on the Web, we may wonder which search engine will return the most relevant information for a given set of queries [3]. In performance evaluation, hypotheses are selected or ranked based on a comparison of their performance on sample data.
Research supported in part by the National Science Foundation under grants DUE and EIA. Among the real-world applications of statistical performance evaluation, many solutions or hypotheses may exist, and the ones performing best in terms of predetermined measurements are sought. For example, in image compression, it is critical to design and choose the best filter banks for the quality of the reconstructed images [2]. In evolutionary algorithms, the individuals to be propagated to future generations are often selected with likelihood proportionate to their rank in the current generation [7]. The performance measurements of hypotheses are numerical values that must be estimated from sample data and may contain noise. In addition, due to the time and resource constraints of real applications, it is often impractical or even impossible to evaluate all hypotheses. Thus, statistical metrics are used to evaluate the performance of hypotheses efficiently using a limited amount of sample data. Many statistical metrics are available, and their results depend on many factors, such as the size of the sample data and the distribution of the performance measurements of the hypotheses. Selecting the most appropriate statistical metric is a challenging task. In this paper, a general, effective method to evaluate hypothesis performance is developed. The method incorporates various statistical metrics and automatically selects an appropriate one based on the parameters of the application. We have considered the following important parameters: the number of hypotheses, the size of the sample data for each hypothesis, the distribution of the performance measurements, and the distribution of the noise in the performance measurements.
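To make the selection step concrete, the following is a minimal Python sketch of such an automatic choice, assuming only the rule reported in this paper's experiments (expected loss for small sample means, probably approximately correct otherwise); the function name, the dictionary layout, and the threshold of 2 are illustrative assumptions, not part of the method's specification.

```python
def select_metric(metrics, samples, small_mean_threshold=2.0):
    """Choose a statistical metric based on one problem parameter:
    the magnitude of the sample means. `metrics` maps metric names
    to comparison functions; the threshold is an assumption."""
    means = [sum(s) / len(s) for s in samples]
    if max(means) <= small_mean_threshold:
        return metrics["EL"]   # expected loss: best for small means
    return metrics["PAC"]      # probably approximately correct otherwise
```

A full implementation would also examine the other parameters listed above, such as the number of hypotheses, the sample size, and the noise distribution.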
Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS'03). © 2002 IEEE.

Also, the general method is applied to evaluate the performance of the combination of HITS-based algorithms [12, 2] with one of four relevance scoring methods: cover density ranking (CDR) [6], Okapi similarity measurement (Okapi) [9], vector space model (VSM) [6], and the three-level scoring method (TLS) [4], using a set of broad topic

queries. In the experiments, we study the performance of five representative statistical metrics using sample data with four different types of distributions. The five statistical metrics are expected loss (EL) [4], the Friedman statistic [7], interval-based selection [4], probability of win (PW) [11], and probably approximately correct (PAC) [9]. The four distributions of the sample data are the chi-square, exponential, normal, and Poisson distributions. This paper is organized as follows. In Section 2, we briefly review statistical metrics for performance evaluation. In Section 3, we propose a general method for statistical performance evaluation and apply it to evaluate the performance of HITS-based algorithms. In Section 4, we describe the criteria used to compare the performance of different statistical metrics. In Section 5, we show our experimental results. In Section 6, we summarize the paper.

2. Statistical Metrics for Performance Evaluation

In this section, we briefly review statistical metrics for evaluating the performance of different hypotheses. Performance evaluation consists of two kinds of problems: hypothesis selection problems and hypothesis ranking problems [1, 4, 5]. A hypothesis selection problem arises when we select the best one from a set of hypotheses, given their performance over some sample data. In hypothesis ranking problems, a set of hypotheses is ranked by expected performance. Hypothesis ranking problems are an extension of hypothesis selection problems [5]. Generally, statistical metrics for hypothesis selection problems can also be applied to hypothesis ranking problems. The distinction between the two is that hypothesis selection returns a single best hypothesis, whereas hypothesis ranking returns an ordering of all the hypotheses. Many metrics have been developed to solve hypothesis selection problems (see Figure 1).
They can be classified into two groups: one for problems with a small number of hypotheses, and the other for problems with a large number of hypotheses. The statistical metrics for a small number of hypotheses include interval-based selection [4], the COMPOSER system [8], the Turnbull and Weiss algorithm [18], the probably approximately correct (PAC) model [9], the expected loss (EL) approach [4], the Friedman statistic [7], and the probability of win [11]. For a large or infinite number of hypotheses, generate-and-test search strategies are usually adopted to find the best hypothesis. As defined by Mitchell [15], these strategies can broadly

Figure 1. Statistical hypothesis selection metrics. (A taxonomy: for a small number of hypotheses, interval-based selection, COMPOSER, the Turnbull & Weiss algorithm, the Friedman statistic, and probability of win; for a large or infinite number of hypotheses, data-driven strategies (depth-first, breadth-first, version-space, decision-theoretic) and knowledge-driven strategies (explanation-based).)

be classified as data-driven and knowledge-driven. The difference lies in the amount of testing performed: data-driven metrics do not rely on domain knowledge and often require extensive tests on the hypotheses under consideration, whereas knowledge-driven metrics depend on domain knowledge and one or a few tests to deduce new hypotheses. In this paper, we focus on the statistical metrics for a small number of hypotheses. To find the best one among a small set of hypotheses, we compare the hypotheses pairwise.

3. A General Method for Statistical Performance Evaluation

Because a statistical metric may only be suitable for certain situations, in this section we first propose a general method for statistical performance evaluation, and then apply it to a real-world application.

3.1. A General Method

The general method consists of the following major steps (see Figure 2):

1. Select a set of sample data.
At this step, we must choose sample data carefully when the hypotheses are too expensive to be tested extensively and we have a large, possibly infinite, amount of data. On the other hand, when the size of the sample data is limited and the cost of information is high, it is very important to minimize the cost of acquiring additional samples while achieving the desired evaluation quality.

2. Test the performance of each hypothesis on the sample data. Sometimes, the performance of hypotheses depends on the measures and algorithms we use to test the sample data.

3. Select an appropriate statistical metric according to the problem parameters. For example, those parameters may include the number of hypotheses, the size of the sample data for each hypothesis, the distribution of the performance measurements, and the distribution of the noise in the performance measurements.

4. Select or rank the hypotheses based on the performance measurements using the chosen statistical metric. The chosen hypothesis is the one with the best statistical value. Also, we expect the selected hypothesis to be generalizable; that is, it must perform well not only on the sample data but also on data not seen during evaluation.

The general method incorporates various statistical metrics, automatically selects an appropriate one based on the parameters of the application, and can be adapted to different applications under time and resource constraints.

3.2. An Application: Performance Evaluation of HITS-based Algorithms

Kleinberg's hypertext-induced topic selection (HITS) algorithm [12] is a very popular and effective algorithm for ranking documents based on the link information among a set of documents. The algorithm presumes that a good hub is a document that points to many others, and a good authority is a document that many documents point to. Hubs and authorities exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs. In the context of Web search, a HITS-based algorithm first collects a base document set for each query. Then it recursively calculates the hub and authority values for each document. To gather the base document set I, first a root set R matching the query is fetched from a search engine; then, for each document r in R, the set of documents that point to r and the set of documents that are pointed to by r are added to the set I as r's neighborhood.
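The base-set construction just described can be sketched as follows; this is a simplified illustration in which `out_links_of` and `in_links_of` stand in for search-engine and connectivity lookups, and the in-link cap mirrors the kind of limit used in practice. All names here are assumptions of the sketch.

```python
def build_base_set(root, out_links_of, in_links_of, max_in=50):
    """Grow the HITS base set I from the root set R by adding, for each
    root document, the documents it points to and (a capped number of)
    the documents pointing to it. The cap `max_in` is an assumption."""
    base = set(root)
    for r in root:
        base.update(out_links_of(r))          # documents r points to
        base.update(in_links_of(r)[:max_in])  # documents pointing to r
    return base
```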
For a document i in I, let a_i and h_i be the authority and hub values, respectively. To begin the algorithm, each a_i and h_i is initialized to 1. While the values have not converged, the algorithm iteratively proceeds as follows:

1. For each i, set a_i to the sum of h_j over all documents j in I that point to i.
2. For each i, set h_i to the sum of a_j over all documents j in I that are pointed to by i.
3. Normalize the a_i and h_i values so that the a_i sum to 1 and the h_i sum to 1.

Kleinberg showed that the algorithm eventually converges, but the bound on the number of iterations is unknown. In practice, the algorithm converges quickly. Because the HITS algorithm ranks documents depending only on the in-degree and out-degree of links, it causes problems in some cases. For example, Bharat [2] identified two problems: mutually reinforcing relationships between hosts, and topic drift. Both problems can be solved or alleviated by adding weights to documents. In Bharat's improved HITS algorithm (BHITS), to solve the first problem, a document is given an authority weight of 1/k if the document is one of a group of k documents on a first host that link to a single document on a second host, and a hub weight of 1/l if there are l links from the document on a first host to a set of documents on a second host [2]. The second problem can be alleviated by adding weights to edges based on the text in the documents or their anchors [2, 3]. BHITS achieved remarkably better results than Kleinberg's HITS algorithm through a simple modification addressing the first problem, while further precision was obtained by adding content analysis for the second problem. Disregarding the time it may take, combining connectivity and content analysis has proved useful in improving precision.
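The iteration above can be sketched in Python as follows, with the link structure given as a mapping from each document to the documents it points to; the normalization follows step 3, and the fixed iteration count is an assumption standing in for a convergence test.

```python
def hits(out_links, iterations=50):
    """Compute authority and hub values by the HITS iteration.
    `out_links` maps each document id to the ids it points to."""
    docs = set(out_links)
    for targets in out_links.values():
        docs.update(targets)
    auth = {d: 1.0 for d in docs}  # a_i initialized to 1
    hub = {d: 1.0 for d in docs}   # h_i initialized to 1
    for _ in range(iterations):
        # step 1: a_i is the sum of h_j over documents j pointing to i
        new_auth = {d: 0.0 for d in docs}
        for j, targets in out_links.items():
            for i in targets:
                new_auth[i] += hub[j]
        # step 2: h_j is the sum of a_i over documents i that j points to
        new_hub = {j: sum(new_auth[i] for i in out_links.get(j, ()))
                   for j in docs}
        # step 3: normalize so the authority and hub values each sum to 1
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {d: v / a_total for d, v in new_auth.items()}
        hub = {d: v / h_total for d, v in new_hub.items()}
    return auth, hub
```

For example, in a graph where two documents both point to a third, the third emerges as the top authority and the two pointers as equal hubs.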
But the similarity measure currently used is the vector space model [2] or just a simple occurrence frequency of the query terms in the text around the anchors [3], which may not be the best method to evaluate the relevance of Web documents, because most queries submitted to search engines are short, consisting of three terms or fewer [6]. Although we can expand short queries by adding more related words, the expansion itself can cause topic drift. In this paper, we statistically compare the performance of four relevance scoring methods when they are combined with Bharat's improved HITS algorithm. Three of them are variations of methods widely used in the traditional information retrieval field: cover density ranking (CDR) [6], Okapi similarity measurement (Okapi) [9], and the vector space model (VSM) [6]. The fourth is the three-level scoring method (TLS) [4], which mimics commonly used manual similarity measuring approaches.

4. Performance Comparison of Statistical Metrics

In this paper, we compare the performance of five representative statistical metrics under different conditions through simulation. The five statistical metrics are expected loss (EL), the Friedman statistic, interval-based selection, probability of win (PW), and probably approximately correct (PAC). The four distributions used in our

Figure 2. A general method for statistical performance evaluation. (The flowchart runs from the sample data, through measuring performance and selecting an appropriate statistical comparison metric, to the final statistical comparison results.)

experiments are the chi-square, exponential, normal, and Poisson distributions. When we compare the performance of different statistical metrics, we need criteria to judge which metrics are better at identifying that one distribution X is better (or worse) than another distribution Y. There are two ways to do this. One is to fix the size of the sample data from X and Y and find the correct ratios of the statistical metrics. This can be done through simulation. We randomly generate a fixed number of samples, such as 20, from each distribution, apply each statistical metric to the sample data, and see whether the result is correct or not. This test can be repeated many times, such as 1,000 times, to get an average correctness, which is more accurate than a single run. The other way is to find the smallest sample size that each statistical metric needs in order to identify the correct result, i.e., that X is better (or worse) than Y, with a high confidence, such as 99%. This can also be done through experiments. In our experiments, we take the first approach. In our controlled experiments, we generate the sample data from a known distribution, such as the chi-square distribution, so we know ahead of time which data set is better. Then we can check how well the statistical metrics work by comparing their answers with the right ones. We take a simple approach by assuming that the distribution with the larger mean is better, although this may be subjective in some real situations. Although most of the statistical metrics we test assume that the sample data are normally distributed, we do not need to change the formulas when we test the metrics on sample data with other distributions.
In other words, we are testing the robustness of these metrics in situations where the sample data are not normally distributed. Of the five statistical metrics, the Friedman statistic has two control parameters: a sample size n and a significance level α. The interval-based techniques have three control parameters: a sample size n, a desired confidence of correct ranking γ, and an indifference setting ε. The EL technique has two control parameters: a sample size n and a loss threshold l. The PAC techniques have three control parameters: a sample size n, a desired confidence of correct ranking γ, and an indifference setting ε. The probability of win has two control parameters: a sample size n and a desired confidence of correct ranking p. In our experiments, we set γ and p to 0.95, α and l to 0.05, and ε to 10% of the standard deviation of the sample data.

5. Experiments

In the experiments, we first analyze the correct ratios of the five statistical metrics through simulation based on data with and without noise, and draw some rules on how to select statistical metrics under different conditions. Then we apply the general method to compare the performance of HITS-based algorithms using the chosen statistical metrics.

5.1. Simulation on Data Without Noise

In this section, without considering noise, we run two groups of experiments on data with different distributions and different sample sizes. In the first group of experiments, for each run, we generate two sample sets using Matlab and use all the data in both sample sets to test the different statistical metrics. In the second group of experiments, we first generate two data sets of size 1,000, then randomly pick a subset from each data set for each run. In both groups of experiments, the two compared data sets have means 1 and 2, 1 and 5, 4 and 5, and 10 and 20, respectively.
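As one concrete example of a pairwise metric, the probability of win is commonly computed from a normal approximation of the difference of the two sample means; the sketch below follows that standard formulation (the exact formula used in [11] may differ in detail), declaring X better than Y when the value reaches the confidence p.

```python
import math

def probability_of_win(xs, ys):
    """Approximate P(true mean of X > true mean of Y) by a normal
    approximation of the difference of the two sample means."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((v - mx) ** 2 for v in xs) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
```

With p = 0.95, X would be selected over Y when `probability_of_win(xs, ys) >= 0.95`.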
If the difference between the two means is less than or equal to 1, we consider the difference small; if it is greater than or equal to 4, we consider it big. We set the variance of each data set equal to its mean for the normal distribution. In the first group of experiments, to compare two random variables X and Y of the same distribution (e.g., chi-square) with different means and variances:

for i = 1 to 1000 (the number of random runs) {
  1. randomly generate a sample set of size k (e.g., 30) for X using Matlab and a sample set of size k for Y.
  2. compare these two sample sets using each statistical metric.
  3. check the results against the correct answer.
}
compute the correct ratio of each statistical metric for the sample size k based on the 1000 random runs.

In the second group of experiments, to compare two random variables X and Y of the same distribution (e.g., chi-square) with different means and variances:

for i = 1 to 20 (the number of random runs for a sample set of size k, e.g., 30) {
  generate a data set D1 for X and a data set D2 for Y, each of size 1,000.
  for j = 1 to 1000 (the number of test runs for the correct ratio) {
    1. randomly pick a sample set of size k for X from D1 and a sample set of size k for Y from D2.
    2. compare these two sample sets using each statistical metric.
    3. check the results against the correct answer.
  }
  compute the correct ratio of the test runs for the sample size k.
}
calculate the average correct ratio of the 20 random runs for the sample size k.

The results of the first group of experiments are shown in Figures 3 and 4. The results of the second group are not shown here because they are similar to those of the first group. Figures 3 and 4 show that as the size of the sample data increases, the correct ratios of all five statistical metrics also increase. There are some exceptions, however, where the correct ratios become smaller when the data set is enlarged. For example, in the top-left plot, labeled chisquare 1 2 in Figure 3, the performance of one metric drops as the size of the sample set increases from 5 to 20. The reason may be that the sample sets do not represent the underlying distribution well enough. Comparing all the plots, we can easily see that EL and PAC are the two best metrics under all conditions. In particular, EL is the best for small means, like 1 or 2, and PAC is the best in all the other cases. If the difference m between two sample means µx and µy is big, e.g., m = 4 with µx = 1 and µy = 5 (see the bottom half of Figure 3), a sample set of size 10 can almost guarantee correct ratios above 90% for three of the metrics, and a sample set of size 50 is good enough for all five metrics to get correct ratios of almost 100%. If m is small, e.g.,
m = 1 with µx = 1 and µy = 2 (see the top half of Figure 3), a sample set of size 30 is large enough to assure correct ratios above 80% for EL and PAC; with a sample size of 40, correct ratios above 90% can be achieved by EL and PAC; and a sample set of size 200 yields 100% correct ratios for all five metrics. As µx and µy become larger, more samples are needed to achieve high correct ratios (see Figure 4). Both groups of experiments tell us that the size of the data set does not matter much as long as it is large enough to represent the underlying distribution sufficiently; the experiments show that a data size of 1,000 is good enough. Next, we study how the performance of the statistical metrics degrades as the problem becomes harder. There are two ways to make the problem harder: the first is to make the difference between the means of the two compared data sets smaller and smaller; the second is to fix the difference of the two means and make the means of both data sets larger. For distributions like chi-square, exponential, and Poisson, the variance grows as the mean grows. In our experiments, we choose the second way to make the problem more difficult. We have run experiments with data sets of chi-square, exponential, Poisson, and normal distributions. The experimental results can be found in Figure 5, which shows the correct ratios of the five statistical metrics for different mean pairs when the data set is of chi-square, exponential, normal, and Poisson distribution, respectively. In Figure 5, each data point of a curve is the average performance over sub-sample data sets drawn from a data set of size 1,000 generated by Matlab; the sizes of the sub-sample data sets are 5, 10, 15, 20, 30, 40, 50, 100, 200, 500, 800, and 1,000, respectively. In Figure 5, x = 1 represents 1-2, which denotes that the two compared data sets have means 1 and 2, and x = 2 represents 4-5, which denotes that the two compared data sets have means 4 and 5.
Similarly, x = 3 stands for 10-12, x = 4 for 20-22, x = 5 for 30-32, x = 6 for 40-42, x = 7 for 50-52, and so on up to 80-82; after the first two pairs, the means increase in steps of 10. In Figure 5, for each distribution, the variance of each data set is set equal to its mean. Figure 5 shows that the correct ratios of the five statistical metrics drop quickly when the means are increased from 1-2 (x = 1) to 10-12 (x = 3), but after that the correct ratios change only within a small range. When the means are small, e.g., x = 1, EL is the best, but PAC is the best for larger means (x = 3 and beyond). Figure 5 also shows that the correct ratios of the statistical metrics degrade at roughly similar rates over the first few mean pairs, but after that the degradation rates vary little.

5.2. Simulation on Noisy Data

We use white noise in our experiments. White noise is randomly (uniformly) distributed in a certain range. We determine the noise range as a percentage of the standard deviation of the base distribution, for example 10% for small noise and 100% for large noise. In our experiments, we test the performance of the five statistical metrics on data sets contaminated by small or large white noise.

Figure 3. The correct ratios of the five statistical metrics for different numbers of sample data when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal 1 2 denotes that the data set is of normal distribution and that the two compared data sets have means 1 and 2, respectively. The others are analogously defined.

Figure 4. The correct ratios of the five statistical metrics for different numbers of sample data when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal 4 5 denotes that the data set is of normal distribution and that the two compared data sets have means 4 and 5, respectively. The others are analogously defined.

Figure 5. The correct ratios of the five statistical metrics for different mean pairs when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal denotes that the data set is of normal distribution. The others are analogously defined.

We compare four pairs of data sets for each distribution. These data sets have means 1 and 2, 4 and 5, 10 and 20, and 1 and 5 for the chi-square, exponential, and Poisson distributions, and means and variances (10,10) and (11,10), (4,10) and (5,10), (10,30) and (11,30), and (4,30) and (5,30) for the normal distribution. The data sets also have different sizes: 5, 10, 15, 20, 30, 40, 50, and 500, respectively. From the experimental results, we find that small or large noise has little or no effect on the correct ratios of the statistical metrics. We also find that for the normal distribution, if the data sets have the same means but larger variances, the correct ratios decrease. For example, the correct ratios for sample data pairs with means and variances (4,30) and (5,30) are less than those for pairs with means and variances (4,10) and (5,10).

5.3. Guidelines for Choosing Statistical Metrics

In this section, we summarize the results of all the previous experiments as follows.

The size of the data set does not matter much as long as it is large enough to represent the underlying distribution sufficiently. The experiments show that data sets of size 1,000 are good enough.

EL is the best for small means, like 1 or 2, and PAC is the best in all the other cases (see Figure 5).

(a) If the difference m between two means µx and µy is big, e.g.,
m = 4 with µx = 1 and µy = 5, a sample set of size 10 can almost guarantee correct ratios above 90% for three of the metrics, and a sample set of size 50 is good enough for all five metrics to get correct ratios of almost 100% (see the bottom half of Figure 3). As µx and µy become larger, more samples are needed to achieve high correct ratios. (b) If m is small, e.g., m = 1 with µx = 1 and µy = 2, a sample set of size 40 is large enough to assure correct ratios above 90% for EL and PAC, and a sample set of size 200 yields 100% correct ratios for all five metrics (see the top half of Figure 3). As µx and µy become larger, more samples are needed to achieve high correct ratios (see Figure 4). For the normal distribution, when the variance is large, more

sample data are needed to get a high correct ratio.

White noise whose range is about 10% of the standard deviation has little effect on the correct ratios.

5.4. Experiments on the Performance of HITS-based Algorithms

In our experiments, 28 broad topic queries and five search engines are used. The queries are exactly the same as those used in [2, 3]; one example is vintage car. For each query, to build a base set, we start five threads simultaneously to collect the top 20 hits and their neighborhoods from five search engines: AltaVista, Fast, Google, HotBot, and NorthernLight, respectively. The combination of these top hits and their neighborhoods forms the base set. For each document in the root set, we limit it to at most 50 in-links and collect all its out-links. The default search mode of the five search engines and lower-case queries were used. The way we construct the base set is different from previous works [2, 12], which usually build the base set from only one search engine, e.g., AltaVista. Combining the top 20 hits and their neighborhoods from five search engines gives us a more relevant base set. Running on a Sun Ultra workstation with a 300 MHz UltraSPARC-IIi processor connected to the Internet by 100 Mbps fast Ethernet, the Java program took about three to five minutes to gather the base set for each query. After we remove duplicate links, intra-domain links, broken links, and irrelevant links, the numbers of distinct links in these 28 base sets range upward from 65. Table 1 lists the four HITS-based algorithms used in the experiments. The four algorithms are BHITS and its combinations with the four relevance scoring methods.

Table 1. The four HITS-based algorithms
Algorithm  Description
CB         Combination of BHITS and CDR
OB         Combination of BHITS and Okapi
TB         Combination of BHITS and TLS
VB         Combination of BHITS and VSM
To compare the performance of the four algorithms mentioned above, we first use the pooling method [10] to build, for each query, a pool formed by the top authority links and the top hub links generated by each of the four algorithms. We then recruited three graduate students to personally visit all the documents in each query pool and manually score them on a scale from 0 to 10, with 0 representing the most irrelevant and 10 the most relevant. A Web page receives a high score if it contains both useful and comprehensive information about the query. A page may also be given a high score if it has many links that lead to relevant material, because we encouraged the three evaluators to follow outgoing links and browse a page's neighborhood. We did not score pages written in languages we do not understand, and we did not tell the evaluators which algorithm a set of links was derived from. We take the average score of the three evaluators for a link, and the average score of the top 20 links (the top authority links and the top hub links) as the final score for a query.

Table 2. Average improvement (%) of relevance scores between two algorithms. Each number in the table is the improvement of the method in the first column (CB, OB, or TB) over the method in the first row (OB, TB, or VB).

Table 2 presents the average improvement (%) of relevance scores between pairs of algorithms. It shows that the combinations of the BHITS algorithm with any of the four scoring methods have comparable performance, with CDR the best and VSM the worst. The best algorithm, CB, improves on OB, TB, and VB by only small margins (2.9% over VB). According to the guidelines for selecting statistical metrics, EL and PAC are the two best metrics under various testing conditions, so we choose both of them to compare the performance of the four algorithms. We set the confidence level to 0.95 for PAC and the loss threshold to 0.05 for EL. The results are shown in Table 3.
Both metrics give consistent results: CB is the best and VB is the worst among the four algorithms combined with relevance scoring methods, and OB is better than TB, although the two values are slightly below the confidence level.

6. Conclusion

In this paper, we propose a general method for statistical performance evaluation. The method incorporates various statistical metrics and automatically selects an appropriate one based on the parameters of a specific application. By combining relevance scoring methods with a HITS-based algorithm, we statistically compare the performance of four relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set of broad topic queries, with CDR the best and VSM the worst, although the performance differences among the four relevance scoring methods are not significant.

Table 3. Statistical performance comparison of different algorithms. CBvOB means the statistical comparison of CB over OB, and the others are similarly defined. (Rows: statistical method; columns: CBvOB, CBvTB, CBvVB, OBvTB, OBvVB, TBvVB.)

References

[1] R. E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 25(1):16-39, 1954.
[2] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104-111, 1998.
[3] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. Computer Networks and ISDN Systems, 30:65-74, 1998.
[4] S. Chien, J. Gratch, and M. Burl. On the efficient allocation of resources for hypothesis evaluation: A statistical approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), July 1995.
[5] S. Chien, A. Stechert, and D. Mutz. Efficient heuristic hypothesis ranking. Journal of Artificial Intelligence Research, 1999.
[6] C. L. A. Clarke, G. V. Cormack, and E. A. Tudhope. Relevance ranking for one to three term queries. Information Processing & Management, 36:291-311, 2000.
[7] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Pub. Co., 1989.
[8] J. Gratch and G. DeJong. Composer: A probabilistic solution to the utility problem in speedup learning. In Proceedings of the 10th National Conference on Artificial Intelligence, 1992.
[9] D. Hawking, P. Bailey, and N. Craswell. ACSys TREC-8 experiments. In Proceedings of TREC-8, 1999.
[10] D. Hawking, N. Craswell, and P. Thistlewaite. Overview of the TREC-7 very large collection track. In Proceedings of TREC-7, 1998.
[11] A. Ieumwananonthachai and B. W. Wah. Statistical generalization of performance-related heuristics for knowledge-lean applications. Int'l J. of Artificial Intelligence Tools, 5(2):61-79, June 1996.
[12] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, ACM Press, New York, 1998.
[13] H. V. Leighton and J. Srivastava. First 20 precision among world wide web search services (search engines). J. of the American Society for Information Science, 50(10):870-881, 1999.
[14] L. Li and Y. Shang. A new statistical method for evaluating search engines. In Proc. IEEE 12th Int'l Conf. on Tools with Artificial Intelligence, 2000.
[15] T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203-226, 1982.
[16] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., 1989.
[17] S. Siegel and N. J. Castellan, Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 1988.
[18] B. W. Turnbull and L. I. Weiss. A class of sequential procedures for k-sample problems concerning normal means with unknown equal variances. In T. J. Santner and A. C. Tamhane, editors, Design of Experiments: Ranking and Selection. Marcel Dekker, 1984.
[19] D. Vanderbilt and S. G. Louie. A Monte Carlo simulated annealing approach to optimization over continuous variables. Journal of Computational Physics, 56:259-271, 1984.
[20] J. D. Villasenor, B. Belzer, and J. Liao. Wavelet filter evaluation for image compression. IEEE Transactions on Image Processing, 4:1053-1060, 1995.


More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

A SIMULATED ANNEALING ALGORITHM FOR SOME CLASS OF DISCRETE-CONTINUOUS SCHEDULING PROBLEMS. Joanna Józefowska, Marek Mika and Jan Węglarz

A SIMULATED ANNEALING ALGORITHM FOR SOME CLASS OF DISCRETE-CONTINUOUS SCHEDULING PROBLEMS. Joanna Józefowska, Marek Mika and Jan Węglarz A SIMULATED ANNEALING ALGORITHM FOR SOME CLASS OF DISCRETE-CONTINUOUS SCHEDULING PROBLEMS Joanna Józefowska, Marek Mika and Jan Węglarz Poznań University of Technology, Institute of Computing Science,

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Complementary Graph Coloring

Complementary Graph Coloring International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Complementary Graph Coloring Mohamed Al-Ibrahim a*,

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

Estimating the Information Rate of Noisy Two-Dimensional Constrained Channels

Estimating the Information Rate of Noisy Two-Dimensional Constrained Channels Estimating the Information Rate of Noisy Two-Dimensional Constrained Channels Mehdi Molkaraie and Hans-Andrea Loeliger Dept. of Information Technology and Electrical Engineering ETH Zurich, Switzerland

More information

New Results on Simple Stochastic Games

New Results on Simple Stochastic Games New Results on Simple Stochastic Games Decheng Dai 1 and Rong Ge 2 1 Tsinghua University, ddc02@mails.tsinghua.edu.cn 2 Princeton University, rongge@cs.princeton.edu Abstract. We study the problem of solving

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information