A Probabilistic Relevance Propagation Model for Hypertext Retrieval

Size: px
Start display at page:

Download "A Probabilistic Relevance Propagation Model for Hypertext Retrieval"

Transcription

1 A Probabilistic Relevance Propagation Model for Hypertext Retrieval Azadeh Shakery Department of Computer Science University of Illinois at Urbana-Champaign Illinois 6181 ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign Illinois 6181 ABSTRACT A major challenge in developing models for hypertext retrieval is to effectively combine content information with the link structure available in hypertext collections. Although several link-based ranking methods have been developed to improve retrieval results, none of them can fully exploit the discrimination power of contents as well as fully exploit all useful link structures. In this paper, we propose a general relevance propagation framework for combining content and link information. The framework gives a probabilistic score to each document defined based on a probabilistic surfing model. Two main characteristics of our framework are our probabilistic view on the relevance propagation model and propagation through multiple sets of neighbors. We compare eight different models derived from the probabilistic relevance propagation framework on two standard TREC Web test collections. Our results show that all the eight relevance propagation models can outperform the baseline content only ranking method for a wide range of parameter values, indicating that the relevance propagation framework provides a general, effective and robust way of exploiting link information. Our experiments also show that using multiple neighbor sets outperforms using just one type of neighbors significantly and taking a probabilistic view of propagation provides guidance on setting propagation parameters. Categories and Subject Descriptors H.3.3. [Information Storage and Retrieval]: Information Search and Retrieval Retrieval Models General Terms Algorithms, Experimentation Keywords Content and link ranking, hypertext retrieval model, probabilistic relevance propagation, web information retrieval Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM 6, November 5 11, 26, Arlington, Virginia, USA. Copyright 26 ACM /6/11...$ INTRODUCTION Hypertext Retrieval, the task of searching for information in a hypertext collection, has been around for a while. A key characteristic that distinguishes the search task in a hypertext collection from a traditional retrieval task is the existence of link information in the former one. Although the primary goal of creating links is to guide a user to other parts of the collection, the link information can also be exploited to improve the search accuracy. The existence of this extra information makes it inappropriate to use traditional information retrieval methods, which do the retrieval task based on the content only, to do the search task. The early works on the hypertext retrieval task were more on the literature side. Some researchers have used bibliographic citation methods to determine relationships among documents in scientific papers [32, 16, 37]. They have used different citation methods in this direction, namely direct citation, bibliographic coupling - the sharing of one or more references by two documents - and co-citation. Modha and Spangler [25] have proposed a clustering algorithm that clusters hypertext documents using words, out-links and in-links, Chakrabrti et al. [6] have developed a technique called spectral filtering for discovering high-quality topical resources in hyperlinked corpora and Ray Larson [22] has applied co-citation analysis methods to the World Wide Web to produce clusterings of the WWW sites that have topical similarities. Currently with the fast growth and popularity of the World Wide Web, the search task on this huge collection of hypertext data has gained much attention. The problem of hypertext retrieval on the Web has been studied extensively. Several link-based ranking methods have been developed to improve retrieval results [2, 27, 2, 5, 3, 26, 18, 3, 14, 36, 4, 39, 24, 38, 41, 43, 19, 4, 1, 29]. Although these algorithms have been shown to improve the performance over some baseline approaches, it remains a challenging research question what is the best way to exploit the content information and the link information to maximize search accuracy. These works appear to have adopted five strategies for combining content and link information: (1) Using the query as a filter to select documents and rank them according to link-based scores (e.g., PageRank [27] and HITS [2]); (2) Computing a weighted combination of topic-specific PageRank scores, where the weights are determined by the query (Topic-sensitive PageRank [18]); (3) Using the query to compute the relevance value of each document and regulating the influence of nodes in HITS using these value (e.g. ARC [5], Bharat and Henzinger[2]); (4) Using the query to compute the relevance value of each document and propagate these values through links (Intelligent Surfer [3], [36], [29]). (5) Using sitemap links to propagate term frequencies ([38], [29]). Unfortunately, none of these combination methods can fully exploit

2 the discrimination power of contents as well as fully exploit all useful link structures. Despite the importance of link information, the contents of documents are clearly the most direct evidence regarding whether a document is relevant to a user s interest. Thus presumably, contents of the documents should be the main basis for ranking them. In this sense, among the five strategies, only the last two are close to fully exploiting the content information to improve ranking. However, the intelligent surfer only considers the in-links of a document, the relevance propagation method only considers direct in-links or out-links and the term propagation method only considers parent-child links in a sitemap. Each of these methods only considers one type of explicit neighbors and none of them fully take advantage of all the available link information. But intuitively, all neighbors can be potentially exploited; for example, both out-links and in-links may be useful as we will show in our experiments. Besides, for the propagation methods, there exist no principled framework to do the propagation. For example, the content scores can be transformed using any monotonic function without affecting the ranking, but such transformation would presumably affect the propagation. How should we transform the scores to achieve the best propagation results? In this paper, we propose a general probabilistic relevance propagation framework for combining content and link information, which can fully take advantage of content information and the link structure in a principled way and can unify most existing link-based ranking algorithms. The basic idea of probabilistic relevance propagation is to first compute a content-based relevance probability score for each document using the query, and then propagate the probabilities through different groups of neighbors. We exploit the content information as a basis for finding the probability of the relevance of a document to a query and use the link structure to define different groups of neighbors to propagate the probabilities through. After propagation, unlike [29], our model gives us a probabilistic score for each document defined based on a probabilistic surfing model. Moreover, our model supports using multiple types of neighbors, which is shown to outperform the results of using a single type of neighbor. On the other hand, the probabilistic interpretation of the model suggests that we should transfer the contentbased retrieval scores to probabilities of relevance, which is shown to be beneficial in our experiments. The probabilistic interpretation also provides guidance on how to set various parameters in the propagation model. In some sense, our work resembles previous work on spreading activation [7, 33, 12, 13, 34, 15, 28, 35, 23, 11] as both involve propagating values through a network/graph. The main difference, however, is that in these spreading activation methods, the number of steps for propagating the weights is predefined and is a small value in most of the cases, while our framework is an iterative process which iterates until the ranks converge to a limit. We derive several special instances of the general probabilistic relevance propagation framework and show that probabilistic relevance propagation is a very general mechanism that allows us to recover most of the major existing algorithms as special cases. Moreover, it also naturally suggests several new algorithms that can combine content and link information. In our experiments, we evaluated several propagation algorithms and the experiment results show that: (1) Using relevance propagation to combine link information and content information for scoring can improve retrieval accuracy over using only content for scoring. (2) Using multiple sets of neighbors for propagation outperforms using a single neighbor set. (3) Using probabilities to control the effect of different groups of neighbors helps. (4) Using probabilities to control the influence of each document in a neighbor set helps. The rest of the paper is organized as follows: We present our relevance propagation framework and derive several special cases in section 2. We discuss the experiment results in section 3 and conclude in section 4 finally. 2. A PROBABILISTIC RELEVANCE PROPAGATION FRAMEWORK Given a query, intuitively, a good result document is one whose content is related to the query topic and which is surrounded by other good documents; i.e. located in the center of a subset of the collection relevant to the query. Thus in order to maximize ranking accuracy, we need to consider the relevance of the document to the query as well as the relevance of its neighbors. In this section, we propose a general probabilistic relevance propagation framework for combining relevance values of different groups of neighbors in a principled way. The basic idea of probabilistic relevance propagation is to first use the query to compute a contentbased self relevance probability score for each document and then propagate the scores through the neighbors. 2.1 Probabilistic Relevance Propagation Framework In this framework, we allow different types of neighbors to influence the quality score of a document. Figure 1 shows a sample... Nk d N2 Figure 1: A typical documents d and its neighbors document d surrounded by k different groups of neighbors. The neighbor sets are not necessarily mutually exclusive. In-links, Outlinks, the single document itself and the whole set of documents are a few examples of potential neighbor sets. Think of a random surfer surfing the Web looking for documents related to a given query q. At each step, the surfer being in a document, selects a group of neighbors surrounding the document and jumps to a document in that group. The surfer keeps doing this iteratively, jumping to neighbor documents looking for documents relevant to query q. The final score of each page is equal to the stationary probability of the surfer visiting the page. Formally, the probability of the surfer being in each document is defined as: p(x) = k i=1 k α i = 1, i=1 α i d D x D N N1 p(d)p i(d x) p i(d x) = 1 where D is the set of all documents, α i indicates the probability of choosing a particular type of neighbor set when leaving the current

3 document and p i is the probability of visiting a particular page in the chosen neighbor set. Note that p i(a b) is positive only if b is a neighbor of a, otherwise p i(a b) =. In order to compute the scores, for each neighbor set, we construct a matrix M i where M i(m, n) = p i(d m d n). The probability scores can be computed using matrix multiplication: P = M T P where P is the vector of the probability values and M = Σ k i=1α im i. The probability values are computed iteratively through matrix multiplications in a very similar way as any of the existing link-based scoring algorithms. Clearly, efficient matrix multiplication methods can be used to further speed up the scoring. The final scores will be the values of the stationary probability distribution. To ensure reachability to each document, we generally would include the whole set of documents as one special set of neighbors in our propagation framework. Thus by the Ergodicity theorem for Markov chains [17], we know that the Markov chain defined by such a transition matrix M must have a unique stationary probability distribution. In this framework, we identify three groups of probabilistic parameters: Hyper-Relevance Probability p(d): Defined for each document indicating the probability of visiting the document. Neighbor Set Selection Probability α: Defined for each group of neighbors indicating the probability of choosing a particular type of neighbor set when leaving the current document. Navigation Probability p i(d x): Defined for each document in a specific group indicating the probability of visiting a particular page in the chosen neighbor set. 2.2 Special Cases By setting α is to different values and instantiating p i s with specific functions, we can easily obtain many special cases of our general relevance propagation framework. In particular, the framework can recover most existing link-based algorithms. Table 1 shows a family of relevance propagation algorithms which are covered by our general framework. As can be seen from the table, PageRank and its extensions are special cases of the framework. The HITS algorithm is not directly a special case, since it does not satisfy the probability property. But with minor changes, i.e. normalization of the weights, it will be a special case. In this table, we include the normalized version of HITS as well as the normalized version of its extensions. 2.3 Parameter Estimation In our framework, we have identified three groups of probabilistic parameters: content relevance probabilities, neighbor set selection probabilities and navigation probabilities. The content relevance probability of a document can be estimated based on its relevance score given by any content-based retrieval method. Neighbor set selection probabilities and navigation probabilities can be either set to uniform or estimated based on content-based relevance scores. In this subsection, we show how to estimate the parameters. We compare different estimation methods in the following section Content Relevance Probabilities In our probabilistic framework we should convert the original content scores to probabilities. The specific conversion method is inevitably dependent on the specific content scoring method, but with some training data, we may use techniques such as logistic regression to do the conversion. If the original retrieval model is a probabilistic model, we have some natural analytical way to transform the scores. As an example of this transformation, here we show how we compute relevance probabilities from Okapi scores and from language model(lm) scores. Okapi Having the Okapi scores, our goal is to normalize the scores to find the relevance probabilities of the documents to the query. We use logistic regression to normalize the scores [31]: The Okapi score is a X + b, where X is the log odds of relevance, i.e., log(p(rel)/(1 p(rel)). So, to recover the probability p(rel), we have p(rel) = exp(x)/(1 + exp(x)). Given a score s, we have s = ax + b, or X = (s b)/a. Thus, the normalization formula should be p(rel) = exp((s b)/a)/(1+exp((s b)/a)). In order to set a and b, we assume that the minimum score min corresponds to a very small probability δ. We also assume that the maximum score max corresponds to p(rel) =. Solving these equations will give us values for a and b: min max a = log( δ ) log( ) 1 δ 1 max log δ b = log 1 δ min log δ 1 δ log Language Modeling Approach 1 1 In the language modeling approach, we score a document D w.r.t. a query Q by s = log p(q D) [42]. Thus we can do an exponential transformation to recover the probability of relevance. That is, p(rel) p(q D)p(D) exp(s) (assuming uniform p(d)) Neighbor Set Selection Probabilities The easiest way to estimate neighbor set selection probabilities is uniform estimation, counting all the neighbor sets to be equal, i.e. α i = 1 k. But obviously this is not the best we can do. Our framework suggests to use relevance scores for Neighbor set probability estimation. We get our intuition for defining neighbor set selection probabilities from the surfer model. In the surfer model, in each step, the surfer should decide on the neighbor set he wants to jump to. Intuitively, the surfer will select the neighbor set based on the average relevance of the documents in the neighbor set, the higher the average relevance, the more probable the surfer will select that group. Using this intuition, we set α i using: α i 1 N i X Ni rel(x), α i = Navigation Probabilities Like neighbor set selection probabilities, the navigation probabilities are most easily estimated through uniform estimation. But intuitively, estimating the probabilities using relevance values should give better results. We define navigation probabilities based on the content relevance probabilities of target pages. The higher the probability of the relevance of the target page, the higher the probability of navigating the link: p(d x) p(x). 2.4 Summary The proposed framework provides a general probabilistic interpretation of relevance-based propagation through multiple sets of neighbors. It can unify most existing link-based ranking algorithms, making it possible to compare the assumptions made in each specific algorithm. It also makes it possible to systematically explore

4 k D d k Table 1: Probabilistic Relevance Propagation Algorithms Method k Neighbors α i s p i s N : Set of all documents α > const. P (d x) = 1 N 1 PageRank[27] 2 N I : Set of In-links α I > const. P I (d x) = OUT (d) 1 Topic-Sensitive PR[18] 2 N : Set of all documents α > const. P (d x) = C j if d in ODPC1 (c j ) o.w. 1 N I : Set of In-links α I > const. P I (d x) = OUT (d) Rel(x) Intelligent Surfer[3] 2 N : Set of all documents α > const. P (d x) = Rel(k) Rel(x) N I : Set of In-links α I > const. P I (d x) = Rel(k) Authorities 1 N CC : Set of Co-Citations α CC = 1 P CC (d x) IN(d) if d = x Normalized HITS[2] #Common Parents o.w. Hubs 1 N CR : Set of Co-References α CR = 1 P CR (d x) OUT(d) if d = x #Common Children o.w. Authorities 1 N CC : Set of Co-Citations α CC = 1 P CC (d x) Normalized Rel(x) IN(d) if d = x #Common Parents o.w. Weighted HITS[2] Hubs 1 N CR : Set of Co-References α CR = 1 P CR (d x) OUT(d) if d = x Rel(x) #Common Children o.w. Authorities 1 N CC : Set of Co-Citations α CC = 1 P CC (d x) IN(d) if d = x Normalized ARC[5] Rel(anchor(x)) #Common Parents o.w. Hubs 1 N CR : Set of Co-References α CR = 1 P CR (d x) OUT(d) if d = x Rel(anchor(x)) #Common Children o.w. Authorities 2 N : Set of all documents α > const. P (d x) = 1 N IN(d) if d = x Normalized N CC : Set of Co-Citations α CC > const. P CC (d x) #Common Parents o.w. Randomized HITS[26] Hubs 2 N : Set of all documents α > const. P (d x) = 1 N OUT(d) if d = x N CR : Set of Co-References α CR > const. P CR (d x) #Common Children o.w. the algorithm space and compare different components of algorithms. Moreover, taking a strict probabilistic view of propagation provides guidance on how to normalize content scores and how to set other propagation parameters to optimize retrieval accuracy, as will be shown later in the paper. 3. COMPARISON OF RELEVANCE PROPAGATION ALGORITHMS We have done some experiments to evaluate the performance of our proposed models. In this section, we present our experiment results. 3.1 Experiment Design Data Set and Methods As our data set, we used the.gov test collection, which is an 18 gigabyte, 1.25 million document 22 partial crawl of the.gov domain used in TREC-22, TREC-23 and TREC-24 experiments for topic distillation [8, 9, 1]. We used two sets of queries in our experiments: (1) 5 topic distillation topics created by NIST for TREC-23 and (2) 75 topic distillation topics created by NIST for TREC-24. The topics are keyword queries for which key resources exist within the.gov collection. An important advantage of using this set of data is that it is created carefully for the purpose of evaluating Web retrieval algorithms with a significant number of judgments available for quantitatively comparing different methods. In our experiments, we used two baseline methods: Okapi and Language Modeling approach. Since our exploration is orthogonal to the use of anchor text and many other heuristics which are known to improve the performance, we preferred not to enter these heuristics in our baseline. Despite this, we already have a very strong baseline compared to the reported results in TREC23 [9] and TREC24 [1]. We expect the performance to be further improved when we use other heuristics on top of our method Neighbor Sets In our experiments, we compare the performance of using two types of neighbors: The set of documents which have links to the document(in) and the set of documents which are linked from the document(out). There also exist a universal neighbor set N which contains all the documents in the collection. Selecting this universal neighbor set to jump to is equivalent to jumping to a random page Content Relevance Probabilities The probabilistic relevance propagation framework allows us to use any content-based retrieval algorithm from which we can com-

5 pute the relevance probabilities. In our experiments, we try two baseline methods: Okapi and language modeling approach and compute the relevance probabilities from these relevance scores Neighbor Set Selection and Navigation Probabilities As mentioned earlier, α is are the parameters which indicate the probability of choosing a particular type of neighbor when leaving the current document. In our experiments, we follow one of the two approaches: either manually set α i to different values from to 1 or automatically set α i using neighbor set selection probabilities. Navigation probabilities on the other hand indicate the probability of visiting a particular page in a group. In our experiments, we use two different estimations of these probabilities: Uniform estimation ( Uni ) and relevance based estimation ( Wt ). 3.2 Result Analysis Effectiveness of Exploiting Link Information The first research question we want to answer is whether applying probabilistic relevance propagation on top of a content-based retrieval method would improve the performance. Most existing studies of link-based scoring algorithms focus on comparing different link-based algorithms without comparing link-based algorithms with scoring using only contents. The Web Track of TREC has seen some evaluation of effectiveness of exploiting link information to improve content-based scoring, but the results are not quite conclusive due to the many uncontrolled factors. To answer the first research question, we compare the performance of using two types of neighbors: in-links and out-links. (Note that we also have the universal neighbor set). We will have: p(x) = α Σ d D p(d)p (d x) + α IΣ d IN p(d)p I(d x) + α OΣ d OUT p(d)p O(d x) where α is the probability of randomly jumping to a page, α I is the probability of jumping to an in-link and α O is the probability of jumping to an out-link. Jumping probabilities can either be uniform (considering all the members to be equal) or weighted based on relevance probabilities. We also consider the combination of the two types of neighbors. This gives us eight combinations, which we compare with the content-only baseline in tables 2 and 3. The numbers in parenthesis show the percent of improvement over the baseline methods. We use precision at 1 and average precision for comparison. The shown results are the best performances achieved by these methods through tuning the parameter α manually in the probabilistic relevance propagation model; we will analyze the sensitivity later. From tables 2 and 3, we can make the following observations: 1. On both query sets, both types of neighbors can outperform the baseline significantly. 2. Weighted propagation of probabilities outperform uniform propagation. 3. The combination of different types of neighbors outperform using any single neighbor set. 4. We get significant improvement using both Okapi and LM baselines. Overall, we see that the probabilistic relevance propagation framework is reasonable and all these specific derived algorithms can help improve search results. Table 2: Combining Link and Content - Okapi Method TREC-23 Prec@1 Avg. Prec 8 1 Uni-IN 18(9.3%) 5(2%) Wt-IN 8(18.5%) 4(19%) Uni-OUT 18(9.3%) 51(24.8%) Wt-OUT 2(13%) 63(34.7%) Uni-IN Uni-OUT 6(16.7%) 68(38.8%) Uni-IN Wt-OUT 32(22.2%) 66(37.2%) Wt-IN Uni-OUT 38(27.8%) 73(43%) Wt-IN Wt-OUT 38(27.8%) 79(47.9%) TREC-24 Prec@1 Avg. Prec 9.93 Uni-IN 8(39.5%) 5(34.4%) Wt-IN 81(4.3%) 5(34.4%) Uni-OUT 56(2.9%) 12(2.4%) Wt-OUT 57(21.7%) 13(21.5%) Uni-IN Uni-OUT 79(38.8%) 4(33.3%) Uni-IN Wt-OUT 84(42.6%) 7(36.6%) Wt-IN Uni-OUT 8(39.5%) 5(34.4%) Wt-IN Wt-OUT 88(45.7%) 7(36.6%) Effectiveness of Combining Different Groups of Neighbors In tables 4 and 5, we compare the results of using only one type of neighbor with the results when we consider multiple groups of neighbors. We did a Wilcoxon signed rank test to see if the improvement on Average Precision is statistically significant. In these tables we compare the best results for each type of neighbor. Statistically significant improvements are distinguished by a star( ). As the tables show, combining different groups of neighbors improves the performance over using a single set of neighbors. Potentially, we can improve the performance by adding new types of neighbors, e.g. co-citations (documents which have at least one common parent with the document) and co-references (documents which have at least one common child with the document) Content Score Transformation The probabilistic framework suggests that we should convert the original content scores to probabilities. In our experiments, we use Okapi and LM methods as our baseline and transform the scores to probabilities using logistic regression and exponential transformation respectively. Figures 2 and 3 compare the performance of probabilistic transformation with the performance of the original raw score propagation as done in all the previous work. As can be seen, the performance is much better when we use probabilistic transformation Comparison of Estimation Methods Relevance-Based Estimate of α Improves over Uniform Estimate. In one set of experiments, we tried to set αs automatically based on the average relevance values of neighbors. Table 6 compares the results of relevance-based estimation of α with uniform estimation. As the table shows, in most of the cases relevance-based estimation gives better results. Note that these results are completely automatic; i.e. we do not have to tune any parameters. Thus these improvements are very encouraging. These results also confirm that using multiple neighbor sets improves over using just a single neighbor set.

6 Table 3: Combining Link and Content - LM Method TREC-23 Prec@1 Avg. Prec Uni-IN 18(28.3%) 35(36.4%) Wt-IN 18(28.3%) 2(43.4%) Uni-OUT 6(15.2%) 9(3.3%) Wt-OUT 1(19.6%) 35(36.3%) Uni-IN Uni-OUT 18(28.3%) 5(51.5%) Uni-IN Wt-OUT 2(32.6%) 2(43.4%) Wt-IN Uni-OUT 6(37%) 5(46.5%) Wt-IN Wt-OUT 8(39.1%) 4(45.5%) TREC-24 Prec@1 Avg. Prec 9.95 Uni-IN 65(27.9%) 13(18.9%) Wt-IN 67(29.5%) 15(21.1%) Uni-OUT 1(9.3%) 7(12.6%) Wt-OUT 4(11.6%) 1(15.8%) Uni-IN Uni-OUT 6(24%) 15(2.8%) Uni-IN Wt-OUT 63(26.4%) 16(21.1%) Wt-IN Uni-OUT 64(27.1%) 17(23.2%) Wt-IN Wt-OUT 67(29.5%) 19(25.3%) Relevance-Based Estimate of p i(d x) Improves over Uniform Estimate. In table 7, we compare the results of uniformly setting the navigation weights vs. estimating them based on relevance scores. As the table shows, relevance based estimation improves the performance in most of the cases Sensitivity Analysis We have so far only looked at the best performance using each method. We now turn to the question about how sensitive each method is to the setting of the parameter α, which controls the amount of influence from the neighbors. To answer this research question, we compute an optimal range of parameter values for each method, which is defined as the interval of parameter values for which a method outperforms the baseline. Table 8 shows the optimal ranges for four of our algorithms. We see that, in general, the optimal range is wide for most methods, indicating that exploiting these groups of neighbors for relevance propagation is useful. The uniform methods are generally more sensitive to the setting of α, which indicates that using weighted methods is more robust. In figures 2 and 3, we show the complete picture of the sensitivity of these methods with prec@1 and average precision. 4. CONCLUSIONS In this paper, we proposed a general probabilistic relevance propagation framework for combining content and link information in a principled manner to fully take advantage of query-based content scoring and link structures. The framework can unify most existing link-based ranking algorithms and can also suggest several interesting new algorithms through different propagation strategies. Following the probabilistic relevance propagation framework, we systematically compared eight specific relevance propagation models on two TREC test collections for Web retrieval. Our results show that all the eight relevance propagation models that we tested can outperform the baseline content only ranking method for a wide range of parameter values, indicating that Table 4: Using Multiple Neighbors vs. a Single Neighbor Set- TREC-23 Multi Neighbor Sets Single Neighbor Set Improvement Okapi Uni-IN Uni-OUT Uni-IN % * 68 Uni-OUT % Uni-IN Wt-OUT Uni-IN % * 66 Wt-OUT % Wt-IN Uni-OUT Wt-IN 4 2% * 73 Uni-OUT % Wt-IN Wt-OUT Wt-IN % * 79 Wt-OUT % LM Uni-IN Uni-OUT Uni-IN % * 5 Uni-OUT % Uni-IN Wt-OUT Uni-IN % * 2 Wt-OUT % Wt-IN Uni-OUT Wt-IN 2 2.1% 5 Uni-OUT % * Wt-IN Wt-OUT Wt-IN 2 1.4% 4 Wt-OUT % * Table 5: Using Multiple Neighbors vs. a Single Neighbor Set- TREC-24 Multi Neighbor Sets Single Neighbor Set Improvement Okapi Uni-IN Uni-OUT Uni-IN 5-4 Uni-OUT % * Uni-IN Wt-OUT Uni-IN 5.8% 6 Wt-OUT % Wt-IN Uni-OUT Wt-IN 5-5 Uni-OUT % * Wt-IN Wt-OUT Wt-IN 5 1.6% 7 Wt-OUT % * LM Uni-IN Uni-OUT Uni-IN % 15 Uni-OUT 7 7.5% * Uni-IN Wt-OUT Uni-IN % 16 Wt-OUT 1 5.5% * Wt-IN Uni-OUT Wt-IN % 17 Uni-OUT 7 9.3% * Wt-IN Wt-OUT Wt-IN % 19 Wt-OUT 1 8.1% * the relevance propagation framework provides a general, effective and robust way of exploiting link information to improve hypertext search accuracy. While the previous work all uses just one type of neighbor for propagation, we have shown that using multiple neighbor sets outperforms using just one type of neighbors significantly. We have also shown that taking a probabilistic view of propagation provides guidance on setting propagation parameters, that using content scores to estimate the probabilities of relevance improves the performance and that relevance based estimation of the parameters helps us improve the results. There are several interesting directions for further research: 1. Our framework naturally accommodates the use of anchor text through estimating navigation parameters based on anchor text. It is interesting to see how this estimation compares with our current estimation methods. 2. We have shown that in-links and out-links are useful for relevance propagation and can outperform the content only

7 Precision at 1 Precision at Average Precision Average Precision (a) Results using relevance probabilities (b) Results using raw content scores Figure 2: TREC-23 results with Okapi baseline Table 6: Effectiveness of Neighbor Set Selection Probability Estimation (α) Method Uniform Estimate Relevance Estimate Prec@1 Avg Prec Prec@1 Avg Prec Uni-IN Wt-IN Uni-OUT Wt-OUT Uni-IN Uni-OUT Uni-IN Wt-OUT Wt-IN Uni-OUT Wt-IN Wt-OUT Table 7: Navigation Probability Estimation - TREC-23 Okapi Neighbor Uniform Estimate Relevance Estimate (Impr.) Set Prec@1 Avg Prec Prec@1 Avg Prec IN (8.5%) 4(-) OUT (3.4%) 63(8.1%) IN & OUT (9.5%) 79(6.5%) LM Neighbor Uniform Estimate Relevance Estimate (Impr.) Set Prec@1 Avg Prec Prec@1 Avg Prec IN (-) 2(5.2%) OUT 6 9 1(3.8%) 35(4.7%) IN & OUT (8.5%) 4(-) baseline. It would be interesting to try other kinds of neighbors, e.g. co-citations and co-references to see if they can further improve the performance. 3. Other than the neighbor sets derived from the explicit link structure of the Web, we can also define other types of neighbors. In general, the framework allows us to define any set of documents with a specific characteristic as a neighbor set. As an example, we can define the set of pages with simi- lar content as a neighbor set [21]. It is interesting to see if exploiting these types of neighbors can further improve the retrieval accuracy. 4. The probabilistic relevance propagation framework is a general hypertext retrieval framework that can be applicable to any hypertext retrieval environment. For example, we may apply the algorithms we studied here to literature search where the links represent citations.

8 Precision at 1 Precision at Average Precision Average Precision (a) Results using relevance probabilities (b) Results using raw content scores Figure 3: TREC-23 results with LM baseline Table 8: Ranges of α Values for Improving s Okapi Method TREC-23 TREC-24 1 Avg. Prec 1 Avg. Prec Uni-IN [.6,.9] [.6,.9] [.3,.9] [.3,.9] Wt-IN [.3,.9] [.3,.9] [.3,.9] [.2,.9] Uni-OUT [.4,.9] [.4,.9] [.6,.9] [.4,.9] Wt-OUT [.2,.9] [.2,.9] [.3,.9] [.2,.9] Language Model Uni-IN [.5,.9] [.6,.9] [.6,.9] [.6,.9] Wt-IN [.3,.9] [.2,.9] [.3,.9] [.4,.9] Uni-OUT [.6,.9] [.3,.9] [.7,.9] [.6,.9] Wt-OUT [.5,.9] [,.9] [.6,.9] [.3,.9] 5. ACKNOWLEDGMENTS This work is in part supported by the National Science Foundation under award numbers , , and REFERENCES [1] V. Anh and A. Moffat. Melbourne university 24: Terabyte and web tracks. In Proceedings of the TREC Conference, 24. [2] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In SIGIR 98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages , New York, NY, USA, ACM Press. [3] K. Bharat and G. A. Mihaila. When experts agree: using non-affiliated experts to rank popular topics. ACM Trans. Inf. Syst., 2(1):47 58, 22. [4] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Link analysis ranking: algorithms, theory, and experiments. ACM Trans. Inter. Tech., 5(1): , 25. [5] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th International World Wide Web Conference, [6] S. Chakrabarti, B. Dom, D. Gibson, S. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. In ACM SIGIR workshop on Hypertext Information Retrieval on the Web, [7] P. R. Cohen and R. Kjeldsen. Information retrieval by constrained spreading activation in semantic networks. Inf. Process. Manage., 23(4): , 1987.

9 [8] N. Craswell and D. Hawking. Overview of the trec-22 web track. In Proceedings of the TREC Conference, 22. [9] N. Craswell and D. Hawking. Overview of the trec-23 web track. In Proceedings of the TREC Conference, 23. [1] N. Craswell and D. Hawking. Overview of the trec-24 web track. In Proceedings of the TREC Conference, 24. [11] F. Crestani and P. L. Lee. Searching the web by constrained spreading activation. Inf. Process. Manage., 36(4):585 65, 2. [12] W. B. Croft, T. J. Lucia, and P. R. Cohen. Retrieving documents by plausible inference: a priliminary study. In SIGIR 88: Proceedings of the 11th annual international ACM SIGIR conference on Research and development in information retrieval, pages , New York, NY, USA, ACM Press. [13] W. B. Croft, T. J. Lucia, J. Cringean, and P. Willett. Retrieving documents by plausible inference: an experimental study. Inf. Process. Manage., 25(6): , [14] B. D. Davison. Toward a unification of text and link analysis. In SIGIR 3: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages , New York, NY, USA, 23. ACM Press. [15] H. P. Frei and D. Stieger. The use of semantic links in hypertext information retrieval. Inf. Process. Manage., 31(1):1 13, [16] E. Garfield. Citation indexes for science. Science, 129, [17] G. Grimmett and D. Stirzaker. Probability and random processes. In Oxford University Press, [18] T. H. Haveliwala. Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4): , 23. [19] J. Kamps, G. Mishne, and M. de Rijke. Language models for searching in web corpora. In Proceedings of the TREC Conference, 24. [2] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):64 632, [21] O. Kurland and L. Lee. Pagerank without hyperlinks: structural re-ranking using links induced by language models. In Proceedings of ACM SIGIR 25, pages , 25. [22] R. Larson. Bibliometrics of the world wide web: An exploratory analysis of the intellectual structure of cyberspace. In Annual Meeting of the American Society of Information Science, [23] M. Marchiori. The quest for correct information on the Web: Hyper search engines. Computer Networks and ISDN Systems, 29(8 13): , [24] F. Mathieu and M. Bouklit. The effect of the back button is a random walk: Application for pagerank. In Proceedings of the 13th International World Wide Web Conference, 24. [25] D. S. Modha and W. S. Spangler. Clustering hypertext with applications to web searching. In HYPERTEXT : Proceedings of the eleventh ACM on Hypertext and hypermedia, pages , New York, NY, USA, 2. ACM Press. [26] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. In SIGIR 1: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages , New York, NY, USA, 21. ACM Press. [27] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library, [28] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow s ear: Extracting usable structures from the web. In Proc. ACM Conf. Human Factors in Computing Systems, CHI. ACM Press, [29] T. Qin, T.-Y. Liu, X.-D. Zhang, Z. Chen, and W.-Y. Ma. A study of relevance propagation for web search. In SIGIR 5: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages , New York, NY, USA, 25. ACM Press. [3] M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in pagerank. In Advances in Neural Information Processing Systems, 22. [31] S. Robertson. Threshold setting and performance optimization in adaptive filtering. Information Retrieval, 5(2-3): , 22. [32] G. Salton. Associative document retrieval techniques using bibliographic information. J. ACM, 1(4):44 457, [33] G. Salton and C. Buckley. On the use of spreading activation methods in automatic information retrieval. Technical report, Ithaca, NY, USA, [34] J. Savoy. Bayesian inference networks and spreading activation in hypertext systems. Inf. Process. Manage., 28(3):389 46, [35] J. Savoy. Ranking schemes in hybrid boolean systems: a new approach. J. Am. Soc. Inf. Sci., 48(3): , [36] A. Shakery and C. Zhai. Relevance propagation for topic distillation uiuc trec 23 web track experiments. In Proceedings of the TREC Conference, 23. [37] H. Small. Co-citation in the scientific literature: a new measure of the relationship between two documents. The American Society of Information Science, 24, [38] R. Song, J. R. Wen, S. M. Shi, T. Y. Xin, G. M. abd Liu, T. Qin, X. Zheng, J. Y. Zhang, G. R. Xue, and W. Y. Ma. Microsoft research asia at web track and terabyte track of trec 24. In Proceedings of the TREC Conference, 24. [39] M. Sydow. Random surfer with back step. In Proceedings of the 13th International World Wide Web Conference, 24. [4] T. Tomiyama, K. Karoji, T. Kondo, Y. Kakuta, and T. Takagi. Meiji university web, novelty and genomics track experiments. In Proceedings of the TREC Conference, 24. [41] H. Zaragoza, N. Craswell, M. Taylor, S. Saria, and S. Robertson. Microsoft cambridge at trec 13: Web and hard tracks. In Proceedings of the TREC Conference, 24. [42] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM 1: Proceedings of the tenth international conference on Information and knowledge management, pages 43 41, New York, NY, USA, 21. ACM Press. [43] Z. Zhou, Y. Guo, B. Wang, X. Cheng, H. Xu, and G. Zhang. Trec 24 web track experiments at cas-ict. In Proceedings of the TREC Conference, 24.

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

An Improved Computation of the PageRank Algorithm 1

An Improved Computation of the PageRank Algorithm 1 An Improved Computation of the PageRank Algorithm Sung Jin Kim, Sang Ho Lee School of Computing, Soongsil University, Korea ace@nowuri.net, shlee@computing.ssu.ac.kr http://orion.soongsil.ac.kr/ Abstract.

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM Masahito Yamamoto, Hidenori Kawamura and Azuma Ohuchi Graduate School of Information Science and Technology, Hokkaido University, Japan

More information

Focused Retrieval Using Topical Language and Structure

Focused Retrieval Using Topical Language and Structure Focused Retrieval Using Topical Language and Structure A.M. Kaptein Archives and Information Studies, University of Amsterdam Turfdraagsterpad 9, 1012 XT Amsterdam, The Netherlands a.m.kaptein@uva.nl Abstract

More information

The application of Randomized HITS algorithm in the fund trading network

The application of Randomized HITS algorithm in the fund trading network The application of Randomized HITS algorithm in the fund trading network Xingyu Xu 1, Zhen Wang 1,Chunhe Tao 1,Haifeng He 1 1 The Third Research Institute of Ministry of Public Security,China Abstract.

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information

Link Analysis in Web Information Retrieval

Link Analysis in Web Information Retrieval Link Analysis in Web Information Retrieval Monika Henzinger Google Incorporated Mountain View, California monika@google.com Abstract The analysis of the hyperlink structure of the web has led to significant

More information

Learning to Rank Networked Entities

Learning to Rank Networked Entities Learning to Rank Networked Entities Alekh Agarwal Soumen Chakrabarti Sunny Aggarwal Presented by Dong Wang 11/29/2006 We've all heard that a million monkeys banging on a million typewriters will eventually

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Lecture 17 November 7

Lecture 17 November 7 CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has

More information

University of Delaware at Diversity Task of Web Track 2010

University of Delaware at Diversity Task of Web Track 2010 University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments

More information

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW ISSN: 9 694 (ONLINE) ICTACT JOURNAL ON COMMUNICATION TECHNOLOGY, MARCH, VOL:, ISSUE: WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW V Lakshmi Praba and T Vasantha Department of Computer

More information

TREC-10 Web Track Experiments at MSRA

TREC-10 Web Track Experiments at MSRA TREC-10 Web Track Experiments at MSRA Jianfeng Gao*, Guihong Cao #, Hongzhao He #, Min Zhang ##, Jian-Yun Nie**, Stephen Walker*, Stephen Robertson* * Microsoft Research, {jfgao,sw,ser}@microsoft.com **

More information

Abstract. 1. Introduction

Abstract. 1. Introduction A Visualization System using Data Mining Techniques for Identifying Information Sources on the Web Richard H. Fowler, Tarkan Karadayi, Zhixiang Chen, Xiaodong Meng, Wendy A. L. Fowler Department of Computer

More information

Dynamic Visualization of Hubs and Authorities during Web Search

Dynamic Visualization of Hubs and Authorities during Web Search Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

CS Search Engine Technology

CS Search Engine Technology CS236620 - Search Engine Technology Ronny Lempel Winter 2008/9 The course consists of 14 2-hour meetings, divided into 4 main parts. It aims to cover both engineering and theoretical aspects of search

More information

COMP 4601 Hubs and Authorities

COMP 4601 Hubs and Authorities COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one

More information

E-Business s Page Ranking with Ant Colony Algorithm

E-Business s Page Ranking with Ant Colony Algorithm E-Business s Page Ranking with Ant Colony Algorithm Asst. Prof. Chonawat Srisa-an, Ph.D. Faculty of Information Technology, Rangsit University 52/347 Phaholyothin Rd. Lakok Pathumthani, 12000 chonawat@rangsit.rsu.ac.th,

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Block-level Link Analysis

Block-level Link Analysis Block-level Link Analysis Deng Cai 1* Xiaofei He 2* Ji-Rong Wen * Wei-Ying Ma * * Microsoft Research Asia Beijing, China {jrwen, wyma}@microsoft.com 1 Tsinghua University Beijing, China cai_deng@yahoo.com

More information

Topic-Level Random Walk through Probabilistic Model

Topic-Level Random Walk through Probabilistic Model Topic-Level Random Walk through Probabilistic Model Zi Yang, Jie Tang, Jing Zhang, Juanzi Li, and Bo Gao Department of Computer Science & Technology, Tsinghua University, China Abstract. In this paper,

More information

Link Analysis. Hongning Wang

Link Analysis. Hongning Wang Link Analysis Hongning Wang CS@UVa Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

On Finding Power Method in Spreading Activation Search

On Finding Power Method in Spreading Activation Search On Finding Power Method in Spreading Activation Search Ján Suchal Slovak University of Technology Faculty of Informatics and Information Technologies Institute of Informatics and Software Engineering Ilkovičova

More information

Welcome to the class of Web and Information Retrieval! Min Zhang

Welcome to the class of Web and Information Retrieval! Min Zhang Welcome to the class of Web and Information Retrieval! Min Zhang z-m@tsinghua.edu.cn Coffee Time The Sixth Sense By 费伦 Min Zhang z-m@tsinghua.edu.cn III. Key Techniques of Web IR Using web specific page

More information

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Effective Latent Space Graph-based Re-ranking Model with Global Consistency Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

Term Frequency Normalisation Tuning for BM25 and DFR Models

Term Frequency Normalisation Tuning for BM25 and DFR Models Term Frequency Normalisation Tuning for BM25 and DFR Models Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom Abstract. The term frequency normalisation parameter

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

Link Analysis Ranking Algorithms, Theory, and Experiments

Link Analysis Ranking Algorithms, Theory, and Experiments Link Analysis Ranking Algorithms, Theory, and Experiments Allan Borodin Gareth O. Roberts Jeffrey S. Rosenthal Panayiotis Tsaparas June 28, 2004 Abstract The explosive growth and the widespread accessibility

More information

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION Evgeny Kharitonov *, ***, Anton Slesarev *, ***, Ilya Muchnik **, ***, Fedor Romanenko ***, Dmitry Belyaev ***, Dmitry Kotlyarov *** * Moscow Institute

More information

Using Non-Linear Dynamical Systems for Web Searching and Ranking

Using Non-Linear Dynamical Systems for Web Searching and Ranking Using Non-Linear Dynamical Systems for Web Searching and Ranking Panayiotis Tsaparas Dipartmento di Informatica e Systemistica Universita di Roma, La Sapienza tsap@dis.uniroma.it ABSTRACT In the recent

More information

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Pagerank Scoring. Imagine a browser doing a random walk on web pages: Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably

More information

A Study of Pattern-based Subtopic Discovery and Integration in the Web Track

A Study of Pattern-based Subtopic Discovery and Integration in the Web Track A Study of Pattern-based Subtopic Discovery and Integration in the Web Track Wei Zheng and Hui Fang Department of ECE, University of Delaware Abstract We report our systems and experiments in the diversity

More information

The University of Amsterdam at the CLEF 2008 Domain Specific Track

The University of Amsterdam at the CLEF 2008 Domain Specific Track The University of Amsterdam at the CLEF 2008 Domain Specific Track Parsimonious Relevance and Concept Models Edgar Meij emeij@science.uva.nl ISLA, University of Amsterdam Maarten de Rijke mdr@science.uva.nl

More information

Assessment of WWW-Based Ranking Systems for Smaller Web Sites

Assessment of WWW-Based Ranking Systems for Smaller Web Sites Assessment of WWW-Based Ranking Systems for Smaller Web Sites OLA ÅGREN Department of Computing Science, Umeå University, SE-91 87 Umeå, SWEDEN ola.agren@cs.umu.se Abstract. A comparison between a number

More information

Automatic Query Type Identification Based on Click Through Information

Automatic Query Type Identification Based on Click Through Information Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China

More information

Personalizing PageRank Based on Domain Profiles

Personalizing PageRank Based on Domain Profiles Personalizing PageRank Based on Domain Profiles Mehmet S. Aktas, Mehmet A. Nacar, and Filippo Menczer Computer Science Department Indiana University Bloomington, IN 47405 USA {maktas,mnacar,fil}@indiana.edu

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

Learning Web Page Scores by Error Back-Propagation

Learning Web Page Scores by Error Back-Propagation Learning Web Page Scores by Error Back-Propagation Michelangelo Diligenti, Marco Gori, Marco Maggini Dipartimento di Ingegneria dell Informazione Università di Siena Via Roma 56, I-53100 Siena (Italy)

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris Manning at Stanford U.) The Web as a Directed Graph

More information

Survey on Web Structure Mining

Survey on Web Structure Mining Survey on Web Structure Mining Hiep T. Nguyen Tri, Nam Hoai Nguyen Department of Electronics and Computer Engineering Chonnam National University Republic of Korea Email: tuanhiep1232@gmail.com Abstract

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

Relevance Weighting for Query Independent Evidence

Relevance Weighting for Query Independent Evidence Relevance Weighting for Query Independent Evidence Nick Craswell, Stephen Robertson, Hugo Zaragoza and Michael Taylor Microsoft Research Cambridge, U.K. {nickcr,ser,hugoz,mitaylor}@microsoft.com ABSTRACT

More information

An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments

An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments Hui Fang ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign Abstract In this paper, we report

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Risk Minimization and Language Modeling in Text Retrieval Thesis Summary

Risk Minimization and Language Modeling in Text Retrieval Thesis Summary Risk Minimization and Language Modeling in Text Retrieval Thesis Summary ChengXiang Zhai Language Technologies Institute School of Computer Science Carnegie Mellon University July 21, 2002 Abstract This

More information

Microsoft Research Asia at the Web Track of TREC 2009

Microsoft Research Asia at the Web Track of TREC 2009 Microsoft Research Asia at the Web Track of TREC 2009 Zhicheng Dou, Kun Chen, Ruihua Song, Yunxiao Ma, Shuming Shi, and Ji-Rong Wen Microsoft Research Asia, Xi an Jiongtong University {zhichdou, rsong,

More information

Spectral filtering for resource discovery

Spectral filtering for resource discovery Spectral filtering for resource discovery Soumen Chakrabarti Byron E. Dom David Gibson Ravi Kumar Prabhakar Raghavan Sridhar Rajagopalan Andrew Tomkins IBM Almaden Research Center 650 Harry Road San Jose,

More information

Re-ranking Documents Based on Query-Independent Document Specificity

Re-ranking Documents Based on Query-Independent Document Specificity Re-ranking Documents Based on Query-Independent Document Specificity Lei Zheng and Ingemar J. Cox Department of Computer Science University College London London, WC1E 6BT, United Kingdom lei.zheng@ucl.ac.uk,

More information

Entity Ranking on Graphs: Studies on Expert Finding

Entity Ranking on Graphs: Studies on Expert Finding Entity Ranking on Graphs: Studies on Expert Finding Henning Rode 1, Pavel Serdyukov 1, Djoerd Hiemstra 1, and Hugo Zaragoza 2 1 CTIT, University of Twente, The Netherlands {h.rode, p.serdyukov, d.hiemstra}@cs.utwente.nl

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. 1 Contents Introduction Network properties Social network analysis Co-citation

More information

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) ' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm Markov Cluster Algorithm Web Web Web Kleinberg HITS Web Web HITS Web Markov Cluster Algorithm ( ) Web The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm Kazutami KATO and Hiroshi

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

Local Methods for Estimating PageRank Values

Local Methods for Estimating PageRank Values Local Methods for Estimating PageRank Values Yen-Yu Chen Qingqing Gan Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201 yenyu, qq gan, suel @photon.poly.edu Abstract The Google search

More information

SimRank : A Measure of Structural-Context Similarity

SimRank : A Measure of Structural-Context Similarity SimRank : A Measure of Structural-Context Similarity Glen Jeh and Jennifer Widom 1.Co-Founder at FriendDash 2.Stanford, compter Science, department chair SIGKDD 2002, Citation : 506 (Google Scholar) 1

More information

An Improved PageRank Method based on Genetic Algorithm for Web Search

An Improved PageRank Method based on Genetic Algorithm for Web Search Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 2983 2987 Advanced in Control Engineeringand Information Science An Improved PageRank Method based on Genetic Algorithm for Web

More information

A Unified Framework to Integrate Supervision and Metric Learning into Clustering

A Unified Framework to Integrate Supervision and Metric Learning into Clustering A Unified Framework to Integrate Supervision and Metric Learning into Clustering Xin Li and Dan Roth Department of Computer Science University of Illinois, Urbana, IL 61801 (xli1,danr)@uiuc.edu December

More information

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META

More information

LET:Towards More Precise Clustering of Search Results

LET:Towards More Precise Clustering of Search Results LET:Towards More Precise Clustering of Search Results Yi Zhang, Lidong Bing,Yexin Wang, Yan Zhang State Key Laboratory on Machine Perception Peking University,100871 Beijing, China {zhangyi, bingld,wangyx,zhy}@cis.pku.edu.cn

More information

RMIT University at TREC 2006: Terabyte Track

RMIT University at TREC 2006: Terabyte Track RMIT University at TREC 2006: Terabyte Track Steven Garcia Falk Scholer Nicholas Lester Milad Shokouhi School of Computer Science and IT RMIT University, GPO Box 2476V Melbourne 3001, Australia 1 Introduction

More information

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE Bohar Singh 1, Gursewak Singh 2 1, 2 Computer Science and Application, Govt College Sri Muktsar sahib Abstract The World Wide Web is a popular

More information

An Improvement of Search Results Access by Designing a Search Engine Result Page with a Clustering Technique

An Improvement of Search Results Access by Designing a Search Engine Result Page with a Clustering Technique An Improvement of Search Results Access by Designing a Search Engine Result Page with a Clustering Technique 60 2 Within-Subjects Design Counter Balancing Learning Effect 1 [1 [2www.worldwidewebsize.com

More information

on the WorldWideWeb Abstract. The pages and hyperlinks of the World Wide Web may be

on the WorldWideWeb Abstract. The pages and hyperlinks of the World Wide Web may be Average-clicks: A New Measure of Distance on the WorldWideWeb Yutaka Matsuo 12,Yukio Ohsawa 23, and Mitsuru Ishizuka 1 1 University oftokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-8656, JAPAN, matsuo@miv.t.u-tokyo.ac.jp,

More information

Finding Neighbor Communities in the Web using Inter-Site Graph

Finding Neighbor Communities in the Web using Inter-Site Graph Finding Neighbor Communities in the Web using Inter-Site Graph Yasuhito Asano 1, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 1 Graduate School of Information Sciences, Tohoku University

More information

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies Large-Scale Networks PageRank Dr Vincent Gramoli Lecturer School of Information Technologies Introduction Last week we talked about: - Hubs whose scores depend on the authority of the nodes they point

More information

Social Information Retrieval

Social Information Retrieval Social Information Retrieval Sebastian Marius Kirsch kirschs@informatik.uni-bonn.de th November 00 Format of this talk about my diploma thesis advised by Prof. Dr. Armin B. Cremers inspired by research

More information

University of Glasgow at the Web Track: Dynamic Application of Hyperlink Analysis using the Query Scope

University of Glasgow at the Web Track: Dynamic Application of Hyperlink Analysis using the Query Scope University of Glasgow at the Web Track: Dynamic Application of yperlink Analysis using the uery Scope Vassilis Plachouras 1, Fidel Cacheda 2, Iadh Ounis 1, and Cornelis Joost van Risbergen 1 1 University

More information

International Journal of Advance Engineering and Research Development. A Review Paper On Various Web Page Ranking Algorithms In Web Mining

International Journal of Advance Engineering and Research Development. A Review Paper On Various Web Page Ranking Algorithms In Web Mining Scientific Journal of Impact Factor (SJIF): 4.14 International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 e-issn (O): 2348-4470 p-issn (P): 2348-6406 A Review

More information

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages An Enhanced Page Ranking Algorithm Based on eights and Third level Ranking of the ebpages Prahlad Kumar Sharma* 1, Sanjay Tiwari #2 M.Tech Scholar, Department of C.S.E, A.I.E.T Jaipur Raj.(India) Asst.

More information

HIPRank: Ranking Nodes by Influence Propagation based on authority and hub

HIPRank: Ranking Nodes by Influence Propagation based on authority and hub HIPRank: Ranking Nodes by Influence Propagation based on authority and hub Wen Zhang, Song Wang, GuangLe Han, Ye Yang, Qing Wang Laboratory for Internet Software Technologies Institute of Software, Chinese

More information

Collaborative filtering based on a random walk model on a graph

Collaborative filtering based on a random walk model on a graph Collaborative filtering based on a random walk model on a graph Marco Saerens, Francois Fouss, Alain Pirotte, Luh Yen, Pierre Dupont (UCL) Jean-Michel Renders (Xerox Research Europe) Some recent methods:

More information

Popularity Weighted Ranking for Academic Digital Libraries

Popularity Weighted Ranking for Academic Digital Libraries Popularity Weighted Ranking for Academic Digital Libraries Yang Sun and C. Lee Giles Information Sciences and Technology The Pennsylvania State University University Park, PA, 16801, USA Abstract. We propose

More information

A Connectivity Analysis Approach to Increasing Precision in Retrieval from Hyperlinked Documents

A Connectivity Analysis Approach to Increasing Precision in Retrieval from Hyperlinked Documents A Connectivity Analysis Approach to Increasing Precision in Retrieval from Hyperlinked Documents Cathal Gurrin & Alan F. Smeaton School of Computer Applications Dublin City University Ireland cgurrin@compapp.dcu.ie

More information

arxiv:cs/ v1 [cs.ir] 26 Apr 2002

arxiv:cs/ v1 [cs.ir] 26 Apr 2002 Navigating the Small World Web by Textual Cues arxiv:cs/0204054v1 [cs.ir] 26 Apr 2002 Filippo Menczer Department of Management Sciences The University of Iowa Iowa City, IA 52242 Phone: (319) 335-0884

More information

Window Extraction for Information Retrieval

Window Extraction for Information Retrieval Window Extraction for Information Retrieval Samuel Huston Center for Intelligent Information Retrieval University of Massachusetts Amherst Amherst, MA, 01002, USA sjh@cs.umass.edu W. Bruce Croft Center

More information

Factors Affecting Web Page Similarity

Factors Affecting Web Page Similarity Factors Affecting Web Page Similarity Anastasios Tombros and Zeeshan Ali Department of Computer Science, Queen Mary University of London, London, U.K tassos@dcs.qmul.ac.uk, zeeshan@odl.qmul.ac.uk Abstract.

More information

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks CS224W Final Report Emergence of Global Status Hierarchy in Social Networks Group 0: Yue Chen, Jia Ji, Yizheng Liao December 0, 202 Introduction Social network analysis provides insights into a wide range

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 2321-3469 COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

More information

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey

More information

Object-Level Ranking: Bringing Order to Web Objects

Object-Level Ranking: Bringing Order to Web Objects Object-Level Ranking: Bringing Order to Web Objects Zaiqing Nie 1 Yuanzhi Zhang 2 Ji-Rong Wen 1 Wei-Ying Ma 1 1 Microsoft Research Asia 2 Peking University Beijing, P. R. China Beijing, P. R. China {t-znie,jrwen,wyma}@microsoft.com

More information