Automatic Identification of User Interest For Personalized Search

Feng Qiu
University of California, Los Angeles, CA

Junghoo Cho
University of California, Los Angeles, CA

ABSTRACT
One hundred users, one hundred needs. As more and more topics are discussed on the web while our vocabulary remains relatively stable, it is increasingly difficult to let the search engine know what we want. Coping with ambiguous queries has long been an important part of research on Information Retrieval, but it remains a challenging task. Personalized search has recently received significant attention in the web search community as a way to address this challenge, based on the premise that a user's general preference may help the search engine disambiguate the true intention of a query. However, studies have shown that users are reluctant to provide any explicit input on their personal preference. In this paper, we study how a search engine can learn a user's preference automatically based on her past click history and how it can use that preference to personalize search results. Our experiments show that a user's preference can be learned accurately even from little click-history data and that personalized search based on user preference yields significant improvements over the best existing ranking mechanism in the literature.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - search process

General Terms
Algorithms, Human Factors, Experimentation

Keywords
Web search, Personalized search, User profile, User search behavior

1. INTRODUCTION
A number of studies have shown that the vast majority of queries to search engines are short and under-specified [1] and that users may have completely different intentions for the same query [2]. For example, a real-estate agent may issue the query "office" to look for a vacant office space, while an IT specialist may issue the same query to look for popular Microsoft productivity software.
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2006, May 23-26, 2006, Edinburgh, Scotland. ACM /06/0005.

To address these differences among users, there has been a recent surge of interest in personalized search, which customizes search results based on a user's interest. Given the large and growing importance of search engines, personalized search has the potential to significantly improve the user experience. For example, according to recent statistics [3], if effective personalization could reduce the time users spend searching for results on Google by a mere 1%, over 187,000 person-hours (21 years!) would be saved each month. Unfortunately, studies have also shown that the vast majority of users are reluctant to provide any explicit feedback on search results and their interest [4]. Therefore, a personalized search engine intended for a large audience has to learn the user's preference automatically, without any explicit input from the users.

In this paper, we study the problem of how we can learn a user's interest automatically based on her past click history and how we can use the learned interest to personalize search results for future queries. To realize this goal, a number of important technical questions must be addressed. First, we need to develop a reasonable user model that captures how a user's click history is related to her interest; a user's interest can be learned from her click history only if the two are correlated. Second, based on this model, we need to design an effective learning method that identifies the user's interest by analyzing her click history. Finally, we need to develop an effective ranking mechanism that considers the learned interest of the user in generating search results.
Our work, particularly our ranking mechanism, is largely based on recent work by Haveliwala on Topic-Sensitive PageRank [5]. In that work, instead of computing a single global PageRank value for every page, the search engine computes multiple Topic-Sensitive PageRank values, one for each topic listed in the Open Directory. Then at query time, the search engine picks the most suitable Topic-Sensitive PageRank value for the given query and user, hoping that this customized version of PageRank will be more relevant than the global PageRank. In fact, in a small-scale user study, Haveliwala showed that this approach leads to notable improvements in search result quality if the appropriate Topic-Sensitive PageRank value can be selected by the search engine, but he posed the problem of automatically learning user interest as an open research issue. In this paper, we try to plug in this final missing piece of an automatic personalized search system and make the following contributions:

We provide a formal framework to investigate the problem of learning a user's interest based on her past click history. As part of this framework, we propose a simple yet reasonable model of how we can succinctly represent a user's interest and how that interest affects her web click behavior.

Based on the formal user model, we develop a method to estimate a user's hidden interest automatically from her observable past click behavior. We provide theoretical and experimental justification of our estimation method.

Finally, we describe a ranking mechanism that considers a user's hidden interest in ranking pages for a query, based on the work in [5]. We conduct a user survey to evaluate how much the search quality improves through this personalization. While preliminary, our survey results indicate significant improvement in search quality (we observe about 25% improvement over the best existing method), demonstrating the potential of our approach to personalizing web search.

The rest of this paper is organized as follows. We first provide an overview of Topic-Sensitive PageRank in Section 2, on which our work is mainly based. In Section 3 we describe our models and methods. Experimental results are presented in Section 4. We finally review related work in Section 5 and conclude the paper in Section 6.

2. BACKGROUND
In this section we briefly present some basic background for our work. We start by presenting the PageRank algorithm in Section 2.1, then proceed to Topic-Sensitive PageRank in Section 2.2.

2.1 PageRank
The key idea behind PageRank is that, when a page p_0 links to a page p, it is probably because the author of page p_0 thinks that page p is important. Thus, this link adds to the importance score of page p. How much score should be added for each link?
Intuitively, if a page itself is very important, then its author's opinion on the importance of other pages is more reliable; and if a page links to many pages, the importance score it confers to each of them is decreased. This simple intuition leads to the following formula for computing PageRank: for each page p, let A_p denote the set of pages linking to p, l_{p_0} denote the number of out-links on page p_0, and PR(p) denote the PageRank of page p. Then

    PR(p) = \sum_{p_0 \in A_p} PR(p_0) / l_{p_0}    (1)

Another intuitive explanation of PageRank is based on the random surfer model [6], which models a user doing a random walk on the web graph. In this model, a user starts from a random page on the web, and at each step she randomly chooses an out-link to follow. The probability of the user visiting each page is then equivalent to the PageRank of that page computed using the above formula. However, in real life a user does not follow links all the time while browsing; she sometimes types a URL directly and visits a new page. To reflect this fact, the random surfer model is modified so that at each step the user follows one of the out-links with probability d, while with the remaining probability 1 - d she gets bored and jumps to a new page. Under the standard PageRank, the probability of this jump is uniform across all pages; when the user jumps, she is equally likely to land on any page. Mathematically, this is described as E(p) = 1/n for all p, where E(p) is the probability of jumping to page p when she gets bored and n is the total number of web pages. We call such a jump a random jump, and the vector E = [E(1), ..., E(n)] the random jump probability vector. Thus we have the following modified formula:

    PR(p) = d \sum_{p_0 \in A_p} PR(p_0) / l_{p_0} + (1 - d) E(p)    (2)

The computed PageRank values are used by search engines during query processing, such that pages with high PageRank values are generally placed at the top of search results.
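As an illustration, Equation 2 can be computed by power iteration on a toy graph. This is only a sketch: the graph, damping factor, and iteration count are arbitrary choices, and dangling pages are ignored for brevity, as in the paper's discussion.

```python
import numpy as np

def pagerank(out_links, n, d=0.85, iters=100):
    """Power iteration for Equation 2 with a uniform random-jump vector E.

    out_links: dict mapping each page id to the list of pages it links to.
    (Dangling pages are ignored here for brevity.)
    """
    pr = np.full(n, 1.0 / n)          # start from the uniform distribution
    E = np.full(n, 1.0 / n)           # uniform random-jump probability vector
    for _ in range(iters):
        new = (1 - d) * E
        for p0, targets in out_links.items():
            for p in targets:         # each out-link passes d * PR(p0)/l_p0
                new[p] += d * pr[p0] / len(targets)
        pr = new
    return pr

# Toy 4-page graph: page 3 is linked to by every other page,
# so it should receive the highest PageRank.
links = {0: [3], 1: [3], 2: [3], 3: [0, 1, 2]}
pr = pagerank(links, 4)
```

Since every page here has at least one out-link, the scores remain a probability distribution over pages, matching the random surfer interpretation.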
Note that under the standard PageRank, every page is assigned a single global score, independent of any query. PageRank can therefore be deployed very efficiently for online query processing, because it can be precomputed offline, before any query arrives. However, because all pages are given a single global rank, standard PageRank cannot capture the fact that the relative importance of pages may change depending on the query.

2.2 Topic-Sensitive PageRank
The Topic-Sensitive PageRank scheme (TSPR) proposed in [5] is an interesting extension of PageRank that can provide different rankings for different queries, while essentially retaining the efficiency advantage of standard PageRank. In the TSPR scheme, multiple scores, instead of just one, are computed for each page, one for every topic that we consider. More precisely, to compute TSPR with respect to topic t, we first define a biased random jump probability vector with respect to topic t, E_t = [E_t(1), ..., E_t(n)], as

    E_t(p) = 1/n_t  if page p is related to topic t
    E_t(p) = 0      otherwise,    (3)

where n_t is the total number of pages related to topic t. The set of pages considered related to topic t (i.e., the pages whose E_t(p) value is non-zero) is referred to as the bias set of topic t. The TSPR score of page p with respect to t is then defined as

    TSPR_t(p) = d \sum_{p_0 \in A_p} TSPR_t(p_0) / l_{p_0} + (1 - d) E_t(p)    (4)

Note the similarity of Equations 2 and 4. Under the random surfer interpretation, the biased vector E_t means that when a user jumps, she jumps only to pages related to topic t. TSPR_t(p) is then equivalent to the probability of the user arriving at p when her random jumps are biased this way. Assuming that we consider m topics, m TSPR scores are computed for each page, which can be done offline. During online query processing, given a query, the search engine figures out the most appropriate TSPR score and uses it to rank pages.
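The biased jump vector of Equation 3 plugs into the same power iteration as before; the toy graph and the topic bias sets below are purely illustrative.

```python
import numpy as np

def tspr(out_links, n, topic_pages, d=0.85, iters=100):
    """Topic-Sensitive PageRank: Equation 4 with the biased jump vector E_t
    of Equation 3 (uniform over the topic's bias set, zero elsewhere)."""
    E = np.zeros(n)
    E[list(topic_pages)] = 1.0 / len(topic_pages)   # Equation 3
    score = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - d) * E
        for p0, targets in out_links.items():
            for p in targets:
                new[p] += d * score[p0] / len(targets)
        score = new
    return score

# One TSPR vector per topic; the bias sets here are invented for the example.
links = {0: [1], 1: [2], 2: [0], 3: [0, 2]}
tspr_computers = tspr(links, 4, topic_pages=[0, 1])
tspr_news = tspr(links, 4, topic_pages=[2, 3])
```

Page 3 has no in-links, so it can only receive score through random jumps: its score is zero under the "Computers" bias set but positive under "News", illustrating how the bias set shifts the ranking.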
(We skip the discussion of how dangling pages are handled, for brevity; see [6] for more details.)

In [5], Haveliwala computed TSPR values for the first-level topics listed in the Open Directory and showed that TSPR provides notable performance improvement over standard PageRank if the search engine can effectively estimate the topic to which the user-issued query belongs. In the paper, the author also posed the automatic identification of user preference, which can help the search engine pick the best TSPR for a given query and user, as an open research issue.

3. PERSONALIZED SEARCH BASED ON USER PREFERENCE
We now discuss how we build upon TSPR to personalize search results. In particular, we describe our framework for learning a user's personal preference automatically from her past click history and for using that preference at query time to rank search results for the particular user. In Section 3.1 we first describe our representation of user preferences. In Section 3.2 we propose our user model, which captures the relationship between users' preferences and their click history. In Section 3.3 we explain how we learn user preferences. Finally, in Section 3.4 we describe how to use this preference information in ranking search results.

3.1 User Preference Representation
Given the billions of pages available on the web and their diverse subject areas, it is reasonable to assume that an average web user is interested in a limited subset of web pages. In addition, we often observe that a user typically has a small number of topics that she is primarily interested in, and her preference for a page is often affected by her general interest in the topic of the page. For example, a physicist who is mainly interested in topics such as science may find a page on video games not very interesting, even if the page is considered to be of high quality by a video-game enthusiast.
Given these observations, we may represent a user's preference at the granularity of either topics or individual web pages, as follows:

Definition 1 (Topic Preference Vector) A user's topic preference vector is an m-tuple T = [T(1), ..., T(m)], in which m is the number of topics in consideration and T(i) represents the user's degree of interest in the i-th topic (say, "Computers"). The vector T is normalized such that \sum_{i=1}^m T(i) = 1.

Example 1 Suppose there are only two topics, "Computers" and "News", and a user is interested in Computers three times as much as she is interested in News. The topic preference vector of the user is then [0.75, 0.25].

Alternatively, we may represent a user's interest at the level of web pages.

Definition 2 (Page Preference Vector) A user's page preference vector is an n-tuple P = [P(1), ..., P(n)], in which n is the total number of web pages and P(i) represents the user's degree of interest in the i-th page. The vector P is normalized such that \sum_{i=1}^n P(i) = 1.

In principle, the page preference vector may capture a user's interest better than the topic preference vector, because her interest is represented in more detail. However, our goal is to learn the user's interest through the analysis of her past click history. Given the billions of pages available on the web, a user can click on only a very small fraction of them (at most hundreds of thousands), making the task of learning the page preference vector very difficult: we would have to learn the values of a billion-dimension vector from hundreds of thousands of data points, which is bound to be inaccurate. For this practical reason, we use the topic preference vector as our representation of user interest in the rest of this paper. We note that this choice of preference representation is valid only if a user's interest in a page is mainly driven by the topic of the page.
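Definition 1 and Example 1 amount to normalizing raw interest weights into a distribution; a minimal sketch (the weights are illustrative):

```python
import numpy as np

# Example 1 as code: the user is interested in "Computers" three times
# as much as in "News". Normalizing the raw weights yields the topic
# preference vector of Definition 1.
raw_interest = {"Computers": 3.0, "News": 1.0}   # illustrative weights
total = sum(raw_interest.values())
T = np.array([w / total for w in raw_interest.values()])
```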
We will check the validity of this assumption, albeit indirectly, later in the experiment section by measuring the effectiveness of our search personalization method based on topic preference vectors.

In Table 1, we summarize the symbols that we use throughout this paper. The meaning of some of the symbols will become clear as we introduce our user model.

    Symbol       Meaning
    n            The total number of web pages
    m            The number of topics in consideration
    T(i)         A user's topic preference for the i-th topic
    P(i)         A user's page preference for the i-th page
    V(p)         A user's probability of visiting page p
    E_t          Biased random jump probability vector with respect to topic t
    PR(p)        PageRank of page p
    TSPR_t(p)    Topic-Sensitive PageRank of page p on topic t
    PPR_T(p)     Personalized PageRank of page p for the user whose topic preference vector is T

Table 1: Symbols used throughout this paper and their meanings

3.2 User Model
To learn the topic preference vector of a user from her past click history, we need to understand how the user's clicks are related to her preference. In this section, we describe our user model, which captures this relationship. As a starting point, we first describe the topic-driven random surfer model.

Definition 3 (Topic-Driven Random Surfer Model) Consider a user with topic preference vector T. Under the topic-driven random surfer model, the user browses the web in a two-step process. First, the user chooses a topic of interest t for the ensuing sequence of random walks with probability T(t) (i.e., her degree of interest in topic t). Then, with equal probability, she jumps to one of the pages on topic t (i.e., pages whose E_t(p) values are non-zero).
Starting from this page, the user then performs a random walk: at each step, with probability d she randomly follows an out-link on the current page; with the remaining probability 1 - d she gets bored, picks a new topic of interest for the next sequence of random walks based on T, and jumps to a page on the chosen topic. This process repeats forever.
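The two-step process of Definition 3 can be simulated directly as a Monte-Carlo walk. The sketch below uses a made-up three-page web and topic assignment; the observed visit frequencies should approximate the visit probabilities the model defines.

```python
import random

def simulate_sessions(T, topic_pages, out_links, steps=10000, d=0.85, seed=0):
    """Monte-Carlo walk under the topic-driven random surfer model:
    pick a topic according to T, jump to a uniform page of that topic,
    then follow out-links with probability d, re-picking a topic and
    jumping with probability 1 - d."""
    rng = random.Random(seed)
    topics = list(T)
    visits = {}

    def jump():
        t = rng.choices(topics, weights=[T[t] for t in topics])[0]
        return rng.choice(topic_pages[t])

    page = jump()
    for _ in range(steps):
        visits[page] = visits.get(page, 0) + 1
        if rng.random() < d and out_links.get(page):
            page = rng.choice(out_links[page])   # follow a random out-link
        else:
            page = jump()                        # bored: new topic, new start
    return visits

# Illustrative setup: pages 0,1 are "Computers" pages, page 2 is "News".
T = {"Computers": 0.7, "News": 0.3}
topic_pages = {"Computers": [0, 1], "News": [2]}
links = {0: [1], 1: [0], 2: [2]}
visits = simulate_sessions(T, topic_pages, links)
```

With the preference vector [0.7, 0.3], roughly 70% of the visits land on the computer-related pages, matching Example 2's reading of the model.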

Example 2 Suppose there are only two topics, "Computers" and "News", and a user's topic preference vector is [0.7, 0.3]. Under the topic-driven random surfer model, this means that 70% of the user's random-walk sessions (the sequences of pages that the user visits by following links) start from computer-related pages and 30% start from news-related pages.

Note the difference between this model and the standard random surfer model, where the user's random walk may start from any page with equal probability. Given this model and the discussion in Section 2.2, we can see that TSPR_t(p), the Topic-Sensitive PageRank of page p with respect to topic t, is equivalent to the probability of a user visiting page p when the user is interested only in topic t (i.e., T(i) is 1 for i = t and 0 otherwise). More generally, we can derive the relationship between a user's visits to pages and her topic preference vector T. To help the derivation, we first define a user's visit probability vector.

Definition 4 (Visit Probability Vector) A user's visit probability vector is an n-tuple V = [V(1), ..., V(n)], in which n is the total number of pages and V(i) represents the user's probability of visiting page i. We assume the vector V is normalized such that \sum_{i=1}^n V(i) = 1.

Given this definition, it is straightforward to derive the following relationship between a user's visit probability vector V and her topic preference vector T, based on the linearity property of TSPR [5]:

Lemma 1 Consider a user with topic preference vector T = [T(1), ..., T(m)] and visit probability vector V = [V(1), ..., V(n)]. Under the topic-driven random surfer model, V(p), the probability of visiting page p, is given by

    V(p) = \sum_{i=1}^m T(i) TSPR_i(p)    (5)

Due to space constraints, we provide all detailed proofs in the extended version of this paper [7].
Note that Equation 5 gives the relationship between a user's visits and her topic preference vector if the user follows the topic-driven random surfer model. In practice, however, the search engine can only observe the user's clicks on its search results, not the general web surfing behavior of the user. That is, the user clicks that the search engine observes are not generated by the topic-driven random surfer model; instead, the user's clicks are heavily affected by the rankings of search results. To address this discrepancy, we extend the topic-driven random surfer model as follows:

Definition 5 (Topic-Driven Searcher Model) Consider a user with topic preference vector T. Under the topic-driven searcher model, the user always visits web pages through a search engine, in a two-step process. First, the user chooses a topic of interest t with probability T(t). Then the user goes to the search engine and issues a query on the chosen topic t. The search engine returns pages ranked by TSPR_t(p), on which the user clicks.

To derive the relationship between a user's visit probability vector V and her topic preference vector T under this new model, we draw upon the result of a recent study by Cho and Roy [8]. In [8], the authors have shown that when a user's clicks are affected by search results ranked by PR(p), the user's visit probability for page p, V(p), is proportional to PR(p)^{9/4}, as opposed to PR(p) as predicted by the random surfer model. Given this new relationship, it is relatively straightforward to prove the following:

Theorem 1 Consider a user with topic preference vector T and visit probability vector V. Under the topic-driven searcher model, V(p) is given by

    V(p) = \sum_{i=1}^m T(i) [TSPR_i(p)]^{9/4}    (6)

Note that Equation 6 is identical to Equation 5 except that the term TSPR_i(p) is replaced with [TSPR_i(p)]^{9/4}, a replacement coming from the result in [8].
In the rest of this paper, we will assume the above topic-driven searcher model as our main user model and use Equation 6 as the relationship between a user's visit probability vector and her topic preference vector.

3.3 Learning the Topic Preference Vector
We now turn our attention to how we can learn a user's topic preference vector from her past click history on search results. Roughly, this learning can be done based on our user model, in particular Equation 6, which describes the relationship between a user's visits, her topic preference, and the Topic-Sensitive PageRank values of the pages. Among the variables in the equation, we can measure the V(p) values experimentally by observing the user's clicks on search results, as we illustrate in the following example:

Example 3 Suppose there exist 10 pages on the web and that, through her past searches, the user has visited the first page twice, the second page once, and none of the other pages. Given this click history, the user's visit probability vector can be estimated as V = [2/3, 1/3, 0, ..., 0].

Also, the TSPR_i(p) values in Equation 6 can be computed from Equation 4 based on the hyperlink structure of the web and a reasonable choice of the bias set for each topic. The only unknown variables in Equation 6 are the T(i) values, which can be derived from the other variables in the equation. Formally, this learning process can be formulated as the following problem:

Problem 1 Given the user visit probability vector V = [V(1), ..., V(n)] and m Topic-Sensitive PageRank vectors TSPR_i = [TSPR_i(1), ..., TSPR_i(n)] for i = 1, ..., m, find the user's topic preference vector T = [T(1), ..., T(m)] that satisfies

    V = \sum_{i=1}^m T(i) TSPR_i^{9/4}    (7)

Here, we use the notation A^k to denote a vector whose elements are the elements of A raised to the k-th power. There exist a number of ways to approach this problem.
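Example 3's estimate of V from observed clicks is a simple frequency count; a minimal sketch:

```python
import numpy as np

# Example 3 as code: 10 pages, with clicks observed on page 0 (twice)
# and page 1 (once). The empirical visit probability vector is the
# normalized click-count vector.
n = 10
clicks = [0, 0, 1]                  # sequence of clicked page ids
counts = np.bincount(clicks, minlength=n)
V_hat = counts / counts.sum()       # estimated visit probability vector
```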
In particular, regression-based and maximum-likelihood methods are two popular techniques used in this context:

Footnote 3: Even if the TSPR_i(j) values are normalized such that \sum_{j=1}^n TSPR_i(j) = 1, the TSPR_i(j)^{9/4} values may not satisfy \sum_{j=1}^n TSPR_i(j)^{9/4} = 1. In Equation 6, we assume that we have renormalized the TSPR_i(j) values such that \sum_{j=1}^n TSPR_i(j)^{9/4} = 1 for every i.

1. Linear regression: Because the topic-driven searcher model is only an approximation of what users do, and because a search engine can observe a relatively small number of user clicks, the V vector measured from user clicks will not be identical to the one prescribed by the model. Linear regression is a popular method in this scenario: it picks the unknown parameter values so that the difference between what we observe and what the model prescribes is minimized. In our context, it determines the T(i) values that minimize the squared error

    || V - \sum_{i=1}^m T(i) TSPR_i^{9/4} ||^2

Let TSPR be the n x m matrix whose (i, j) entry is TSPR_j(i)^{9/4}. The linear regression method asserts that this error is minimized when the topic preference vector is

    T = (TSPR^t TSPR)^{-1} TSPR^t V    (8)

Here, TSPR^t is the transpose of TSPR.

2. Maximum-likelihood estimator: The core principle behind a maximum-likelihood estimator is that the unknown parameters of the model should be chosen such that the observed set of events is the most likely outcome of the model. To apply this method, we introduce some new notation. We assume that the user has visited k pages and use p_i to denote the i-th visited page. We also define V_T(p) = \sum_{i=1}^m T(i) [TSPR_i(p)]^{9/4}, the probability that the user visits page p under the topic-driven searcher model when her topic preference vector is T = [T(1), ..., T(m)]. Assuming that all user visits are independent, the probability that the user visits the pages p_1, ..., p_k is \prod_{i=1}^k V_T(p_i). The vector T is then chosen such that this probability is maximized. That is, under the maximum-likelihood estimator,

    T = arg max_T \prod_{i=1}^k V_T(p_i)    (9)

In our experiments, we have applied both methods to the learning task and found that the maximum-likelihood estimator performs significantly better than the linear regression method. This is in large part due to, again, the sparsity of the observed click history.
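Both estimators can be sketched on synthetic data satisfying Equation 7. This is not the paper's implementation: the TSPR-like matrix below is randomly generated, the regression solution of Equation 8 is computed with a least-squares solver and clipped back onto the simplex, and the maximum-likelihood estimate of Equation 9 is obtained with EM-style mixture-weight updates, one standard way to maximize this kind of mixture likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 4, 500                      # pages, topics, observed clicks

# Synthetic stand-in for the renormalized TSPR^{9/4} matrix: column i is
# topic i's probability distribution over pages.
M = rng.random((n, m)) ** 4
M = M ** (9 / 4)
M /= M.sum(axis=0)

T_true = np.array([0.6, 0.3, 0.1, 0.0])    # hidden topic preference vector
V_model = M @ T_true                       # Equation 7
clicks = rng.choice(n, size=k, p=V_model)  # simulated click history
V_hat = np.bincount(clicks, minlength=n) / k

# 1. Linear regression (Equation 8), then projected back onto the simplex.
T_lr, *_ = np.linalg.lstsq(M, V_hat, rcond=None)
T_lr = np.clip(T_lr, 0, None)
T_lr /= T_lr.sum()

# 2. Maximum likelihood (Equation 9) via EM updates for mixture weights:
# each click is softly assigned to topics, and weights are re-averaged.
T_ml = np.full(m, 1.0 / m)
for _ in range(200):
    resp = M[clicks] * T_ml                # topic "responsibility" per click
    resp /= resp.sum(axis=1, keepdims=True)
    T_ml = resp.mean(axis=0)
```

Note that the EM update only touches the k clicked pages, which mirrors the paper's observation that the maximum-likelihood estimator is insensitive to the many zero entries of the visit probability vector.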
Since the user does not visit the vast majority of pages, most of the entries in the visit probability vector are zero. Therefore, even if two users visit completely different sets of pages, the difference between their visit probability vectors is small, because the two vectors share the same zero values for most entries. This problem turns out to be much less significant for the maximum-likelihood method, because it considers only the pages that the user actually visited during estimation, as we can see from Equation 9. In our experiment section, therefore, we report only the results from the maximum-likelihood estimator.

3.4 Ranking Search Results Using Topic Preference Vectors
In this section we describe how we rank search results based on the user's learned topic preference vector. Our approach is based on the framework proposed in [5]: given a query q, we identify the most likely topic of q and use the TSPR values corresponding to the identified topic. More precisely, given q, we estimate Pr(T(i)|q), the probability that q is related to topic i, and compute the ranking score of page p as

    \sum_{i=1}^m Pr(T(i)|q) TSPR_i(p)    (10)

That is, we compute the sum of TSPR values weighted by the likelihood that the query is related to each topic. Note that this ranking degenerates into TSPR_t(p) when the topic of the query q is clearly t (i.e., Pr(T(i)|q) is 1 for i = t and 0 otherwise). In deciding the likely topic of q, we may rely on two potential sources of information: the user's preference and the query itself.

Example 4 For simplicity, suppose we consider only two topics, "Business" and "Computers". For the query "C++ programming", it is clear from the query itself that the topic is Computers, not Business. For another query, "office", the topic is difficult to judge from the query alone. It is only when we consider the preference of the user (say, the user is an IT specialist who is mostly interested in Computers) that the likely topic of the query becomes clear.
We can design a mechanism that considers both sources of information in identifying the likely topic of a query, based on the Bayesian framework [9, 5]. According to Bayes' theorem, Pr(T(i)|q) can be rewritten as

    Pr(T(i)|q) = Pr(q, T(i)) / Pr(q) = Pr(T(i)) Pr(q|T(i)) / Pr(q) ∝ Pr(T(i)) Pr(q|T(i))    (11)

Here, Pr(T(i)) is the probability that the user is interested in topic i, which is captured by the i-th entry of the user's topic preference vector T = [T(1), ..., T(m)]. That is, Pr(T(i)) = T(i). Pr(q|T(i)) is the probability that the user issues the query q when she is interested in topic i. As in [5], this probability may be computed by counting the occurrences of the terms of query q in the web pages listed under topic i in the Open Directory. Given Equations 10 and 11, we can then compute PPR_{q,T}(p), the personalized ranking score of page p for query q issued by a user with topic preference vector T, as follows:

    PPR_{q,T}(p) = \sum_{i=1}^m T(i) Pr(q|T(i)) TSPR_i(p)    (12)

In the above equation, the term T(i) is the personalization factor based on the user's preference, while Pr(q|T(i)) identifies the topic based on the query itself. Later, in our experiment section, we will evaluate the effectiveness of these two individual terms by using only one of them during ranking.

3.5 Evaluation Metrics
How can we evaluate the effectiveness of our proposed methods? Given that our primary goal is to learn the user's topic preference vector from her past click history and to use this vector to personalize search rankings, we may consider one of the following evaluation metrics:

1. Accuracy of topic preference vector: One natural way to evaluate our learning method is to directly measure
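Equation 12 is a weighted sum of per-topic TSPR scores; a minimal sketch with made-up numbers for two topics and three candidate pages:

```python
import numpy as np

# Illustrative numbers only: 2 topics (say Computers, News), 3 pages.
T = np.array([0.8, 0.2])              # learned topic preference vector
Pq_given_t = np.array([0.05, 0.001])  # Pr(q | T(i)), e.g. from term counts
TSPR = np.array([[0.5, 0.1],          # row p, column i: TSPR_i(p)
                 [0.3, 0.2],
                 [0.2, 0.7]])

# Equation 12: per-page personalized score, then rank pages descending.
scores = TSPR @ (T * Pq_given_t)
ranking = np.argsort(-scores)
```

Here the user's strong preference for the first topic dominates, so the page that scores highest under that topic's TSPR vector is ranked first even though the third page scores much higher under the second topic.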

the difference between the user's actual topic preference vector and the learned vector. For this purpose, we may use the relative error between the two:

    E(T_e) = || T_e - T || / || T ||    (13)

Here T denotes the user's actual topic preference vector and T_e denotes our estimate based on the user's click history.

2. Accuracy of personalized ranking: Our final goal is to personalize search results based on the user's preference. Therefore, we may use the accuracy of the final personalized ranking (as opposed to the accuracy of the learned topic preference vector) as our evaluation metric. We illustrate why this metric may be more useful with the following example.

Example 5 Consider only two topics and 5 pages. Suppose that the two topics are closely related and their TSPR values are identical, say TSPR_1 = TSPR_2 = [0.8, 0.1, 0, 0, 0.1]. Now consider two users whose topic preference vectors are T_1 = [1, 0] and T_2 = [0, 1], respectively. In this scenario, even though the users' topic preference vectors are completely different, their visit probability vectors are identical:

    V_{T_1} = 1·TSPR_1 + 0·TSPR_2 = 0·TSPR_1 + 1·TSPR_2 = V_{T_2}

Therefore, it is not possible to learn the users' exact topic preference vectors from their visit histories. Fortunately, in this case the difference in the weights of the topic preference vectors does not matter; the personalized ranking for the two users is always the same:

    Pr(T(1)|q)·TSPR_1 + Pr(T(2)|q)·TSPR_2 = TSPR_1

The above example illustrates that even if our estimated topic preference vector differs from the user's actual vector, the final personalized ranking can still be accurate, making our learning method useful. To evaluate how well we can compute the user's personalized ranking, we use the Kendall τ distance [10] between our estimated ranking and the ideal ranking. That is, let σ_k be the sorted list of the top-k pages based on the estimated personalized ranking scores.
Let σ'_k be the sorted top-k list based on the personalized ranking computed from the user's true preference vector, and let S be the union of σ_k and σ'_k. The Kendall τ distance between σ_k and σ'_k is then computed as

    τ(σ_k, σ'_k) = |{(i, j) : i, j ∈ S, σ_k(i) < σ_k(j), σ'_k(i) > σ'_k(j)}| / (|S| (|S| - 1) / 2)    (14)

The value of the Kendall τ distance ranges between 0 and 1, taking the value 0 when the two rankings are identical. Given that most users never look beyond the top 20 entries in search results [1], a k value between 10 and 100 may be a good choice for comparing the difference in the rankings perceived by the users.

3. Improvement in search quality: The ultimate success of a personalized ranking scheme is measured by the quality of the search results as judged by the users. Given that an effective ranking scheme should place relevant pages close to the top of the search results, we may use the average rank of relevant pages in the search result as a measure of its quality, if we know which pages users deem relevant to each query.

In the next section, we evaluate the effectiveness of our personalized search framework using all three metrics described above.

4. EXPERIMENTS
In this section we discuss the experiments we have conducted to evaluate our proposed methods and present the results. We first describe our experimental setup in Section 4.1. Then, in Section 4.2, we describe a simulation-based experiment that measures the accuracy of our learning method. Finally, in Section 4.3, we present the results of a user survey that measures the perceived quality of our personalized ranking method.

4.1 Experimental Setup
In order to apply the three evaluation metrics described in Section 3.5, we need the following three datasets: (1) users' click histories, (2) the set of pages deemed relevant to the queries that the users issue, and (3) the Topic-Sensitive PageRank values for each page.
To collect these data, we contacted 10 subjects in the UCLA Computer Science Department and collected 6 months of their search history by recording all the queries they issued to Google and the search results that they clicked on. Table 2 shows some high-level statistics on this query trace.

Table 2: Statistics on our dataset
  # of subjects                  10
  Collection period              04/ /2004
  Avg # of queries per subject
  Avg # of clicks per query      0.91

To identify the set of pages that are relevant to queries, we carried out a human survey. In this survey, we first picked the 10 most frequent queries in our query trace, and for each query, each of the 10 subjects was shown 10 randomly selected pages in our repository that match the query. The subjects were then asked to select the pages they found relevant to their own information need for that query. On average, 3.1 (out of 10) search results were considered relevant to each query by each user in our survey. Finally, we computed TSPR values for 500 million web pages collected in a large-scale crawl. That is, based on the link structure captured in the snapshot, we computed the original PageRank and the Topic-Sensitive PageRank values for each of the 16 first-level topics listed in the Open Directory. The computation was performed on a workstation equipped with a 2.4GHz Pentium 4 CPU and 1GB of RAM; computing the 500 million TSPR values took roughly 10 hours per topic.
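For reference, Topic-Sensitive PageRank values such as those above can be computed by power iteration in which the random jump is restricted to the pages of a single topic, in the spirit of [5]. The sketch below is ours, on a hypothetical 3-page graph with an assumed damping factor of 0.85; it omits practical concerns such as dangling pages and sparse-matrix storage.

```python
def topic_sensitive_pagerank(links, topic_pages, d=0.85, iters=50):
    """Power iteration where the teleport jump lands only on pages
    of the biasing topic set, yielding one TSPR vector per topic."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    # teleport distribution: uniform over the topic's pages, 0 elsewhere
    bias = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
            for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # mass flowing into p from every page q that links to it
            in_sum = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) * bias[p] + d * in_sum
        rank = new
    return rank

# toy 3-page web; the jump is biased toward page "a"
toy_links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
tspr = topic_sensitive_pagerank(toy_links, topic_pages={"a"})
```

In the paper's setting this computation is repeated once per Open Directory topic, giving the 16 TSPR vectors combined later by Equation 12.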

4.2 Accuracy of Learning Method

In this section we first measure the accuracy of our learning method. Here, we are concerned both with the accuracy of the method itself and with the size of the click history necessary for accurate estimation: even if a user's preference can be learned accurately in principle, it may not be learnable in practice if doing so requires a sample significantly larger than what we can actually collect. The most direct way of measuring the accuracy of our method would be to estimate the users' topic preferences from the real-life data we have collected and then ask the users how accurate our results are. The problem with this approach is that, although users can tell which topics they are most interested in, it tends to be very difficult for them to assign an accurate weight to each of those topics. For example, if a user is interested in Computers and News, is her topic preference vector [0.5, 0.5] or [0.4, 0.6]? This innate inaccuracy in users' own topic preference estimates makes it difficult to evaluate the accuracy of our method using real-life data. Thus, we use a synthetic dataset generated by simulation based on our topic-driven searcher model:

1. Generation of topic preference vector: In our implementation, the number of topics the user is interested in is fixed to K as an experimental parameter. We randomly choose K topics and assign a random weight to each selected topic; the weights of all other topics are set to 0. The vector is then normalized to sum to one.

2. Generation of click history:
Once we generate a user's topic preference vector, we generate a sequence of L clicks made by the user, following the visit probability distribution dictated by our model (Equation 6 in Section 3.2).

4.2.1 Accuracy of estimated topic preference vector

To measure the accuracy of the estimated topic preference vector, we generate synthetic user click traces as described above and measure the relative error of our estimated vector with respect to the true preference vector (Equation 13). Figure 1 shows the results of this experiment; in the graph, the height of a bar corresponds to the average relative error at the given K and L values.

Figure 1: Relative errors in estimated weights (axes: Number of Topics (K), Sample Size (L), Relative Error)

From the figure, we see that for a fixed K value, the relative error of our method generally decreases as the sample size L increases. For instance, when K = 4, the relative error at L = 10 is 0.66, and the error goes down to 0.46 as L grows. For most K values, we also observe that beyond the sample size L = 100, the decrease in relative error becomes much less significant. For example, when K = 4, the relative error at L = 100 is 0.4, and the error actually gets slightly larger, 0.41, at L = 500. An interesting data point is K = 3 and L = 100, because this is the setting that we expect to see in practical scenarios: we were able to collect an average of 200 clicks per user in 7 months, and the analysis of this trace indicates that a typical user is interested in 3 to 4 topics out of 16. At this parameter setting (K = 3, L = 100), the average relative error of our estimation is 0.3, indicating that we can learn the user's topic preference vector with 70% accuracy. The graph also shows that as the user gets interested in more topics (i.e., as K increases), the accuracy of our method generally gets worse.
This trend is expected, because when K is large the user visits pages on diverse topics; we therefore need to collect a larger number of overall clicks to obtain a similar number of clicks on the pages of any particular topic.

In Figure 2, we check whether our estimation is meaningful at all by comparing it against a baseline estimation that assumes an equal weight for every topic (i.e., T(i) = 1/16 for i = 1,...,16). The graph for our method is for L = 100.

Figure 2: Comparison of relative errors in estimated weights at sample size 100 (baseline estimation vs. our method, by number of topics)

From the graph, we can see that when the user is interested in relatively few topics (1 ≤ K ≤ 6), our estimate is meaningful: its relative error is significantly smaller than that of the baseline estimation, indicating that we do learn the user's general preference. However, when the user interest is very diverse, the graph shows that we need to collect much more user-click-history data in order to obtain a meaningful estimate for the user preference.

4.2.2 Accuracy of Personalized PageRank

We now investigate the accuracy of our personalized ranking. To measure this accuracy, we again generate synthetic user click data based on our model and estimate the user's topic preference vector. We then compute the Kendall τ distance between the ranking computed from the estimated preference vector and the ranking computed from the true preference vector. Figure 3 shows the results when the distance is computed for the top 20 pages. (We also computed the distance for top-k pages at larger k values; the results were similar.)

Figure 3: Differences in rankings of top 20 pages (axes: Number of Topics (K), Sample Size (L), Difference)

We can see from the figure that, in spite of the relatively large relative errors reported in the previous section, our method produces quite good rankings. For example, when K = 3 and L = 100, the distance is 0.05, meaning that only 5% of the page pairs in our ranking are reversed compared to the true ranking.

We now compare our personalized ranking against two other baseline rankings: (1) a ranking based on general PageRank and (2) a ranking based on Equation 12, but assuming that all topic weights are equal (i.e., T(i) = 1/16 for i = 1,...,16). This comparison indicates how much improvement we get from our topic preference estimation. In Figure 4 we compare the Kendall τ distance of the three ranking methods on our synthetic data. From the graph, we can see that our method consistently produces much smaller distances than the two baseline rankings, indicating the effectiveness of our learning method.

Figure 4: Comparison of differences in rankings of top 20 pages at a fixed sample size (General PageRank vs. Naive Personalized PageRank vs. our method, by number of topics)

4.3 Quality of Personalized Search

We now try to measure how much our personalization method improves the overall quality of search results, based on our user survey.
To measure this improvement we compare the following four ranking mechanisms:

General PageRank: Given a query, we rank the pages that match the query based on their global PageRank values.

Topic-Sensitive PageRank: We rank pages assuming that the user is interested in all topics. That is, we rank pages based on PPR_T(p) of Equation 12, but assuming that T(i) = 1/16 for i = 1,...,16. This represents a ranking method that does not take the user's preference into account.

Personalized Topic-Sensitive PageRank: We rank pages based on Equation 12 using the estimated topic preference vector, but excluding the second term Pr(q|T(i)). That is, we rank pages by the sum over i of T(i)*TSPR_i(p). This represents a ranking method that uses the user preference, but not the query, in identifying the likely topic of the query.

Query-Biased Personalized Topic-Sensitive PageRank: We rank pages based on Equation 12 without omitting any terms. This represents a ranking method that uses both the user preference and the query to identify the likely topic of the query.

In order to measure the quality of a ranking method, we use the data collected from the user survey described in Section 4.1. In the survey, for each of the 10 most common queries from our query trace, we asked our subjects to select the pages that they consider relevant to their own information need among 10 random pages in our repository that match the query. Considering that the queries chosen by us might not be the ones the users would issue in real life (and thus users may not care about the performance of our method on these queries), we also asked each subject to give the probability of her issuing each query in real life.
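The three TSPR-based mechanisms above differ only in which topic-weight vector is combined with the per-topic TSPR scores of Equation 12. A minimal sketch of the three scores (all function names and inputs are ours, for illustration):

```python
NUM_TOPICS = 16

def tspr_score(page_tspr, topic_weights):
    """Combine a page's per-topic TSPR values with topic weights,
    i.e. the sum over i of T(i) * TSPR_i(p) from Equation 12."""
    return sum(w * s for w, s in zip(topic_weights, page_tspr))

def topic_sensitive_score(page_tspr):
    # uniform weights T(i) = 1/16: no user preference used
    return tspr_score(page_tspr, [1.0 / NUM_TOPICS] * NUM_TOPICS)

def personalized_score(page_tspr, user_pref):
    # estimated topic preference vector; query term Pr(q|T(i)) omitted
    return tspr_score(page_tspr, user_pref)

def query_biased_score(page_tspr, user_pref, pr_q_given_topic):
    # weights proportional to T(i) * Pr(q|T(i)): full Equation 12
    weights = [t * pq for t, pq in zip(user_pref, pr_q_given_topic)]
    return tspr_score(page_tspr, weights)
```

For instance, a user whose preference vector puts all weight on one topic is scored purely by that topic's TSPR vector under the personalized mechanisms.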
Then for each of our subjects and queries, we compute the weighted average rank of the selected pages under each ranking mechanism using the following formula:

AvgRank(u, q) = Σ_{p ∈ S} Pr(q|u) * R(p)   (15)

Here S denotes the set of pages the user u selected for query q, Pr(q|u) denotes the probability that user u issues query q, and R(p) denotes the rank of page p under the given ranking mechanism. Smaller AvgRank values indicate better placement of relevant results, i.e., better result quality.
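Equation 15 is then straightforward to evaluate for each (user, query) pair; a small sketch with hypothetical values (the page names and numbers are ours):

```python
def avg_rank(selected_pages, pr_q_given_u, rank_of):
    """Weighted average rank of Equation 15:
    AvgRank(u, q) = sum over p in S of Pr(q|u) * R(p)."""
    return sum(pr_q_given_u * rank_of[p] for p in selected_pages)

# hypothetical user: selected two pages ranked 2nd and 5th,
# and issues this query with probability 0.5
ranks = {"p1": 2, "p2": 5}
print(avg_rank(["p1", "p2"], 0.5, ranks))  # prints 3.5
```

A mechanism that moves the selected pages toward the top of the result list lowers every R(p) and hence lowers AvgRank.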

In Figure 5 we aggregate the results by query to show how well our methods perform for different queries. The horizontal dotted lines in the graph show the average AvgRank over all queries. We can see that our personalization scheme is very effective and achieves significant performance improvements over the traditional methods for most queries. The average improvements of Personalized PageRank and Query-Biased Personalized PageRank over Topic-Sensitive PageRank, over all queries, are 32.8% and 24.9%, respectively. We note that, in our experimental results, although Query-Biased Personalized PageRank takes more information (both the user's topic preference and the query she issues) into consideration, it performs worse than Personalized PageRank in most cases. This result is quite surprising, but the difference is too small to draw a meaningful conclusion. It will be interesting future work to see whether Personalized PageRank indeed performs better than Query-Biased Personalized PageRank in general and, if it does, to investigate the reason behind it.

Figure 5: Comparison of weighted average rankings of selected pages, by query (lower is better; the 10 queries are: computer vision, dictionary, Athens 2004, Jens, java, music, distributed event system, calendar, shopping, movies)

In Figure 6 we aggregate the results by participant to show the overall effectiveness of our methods for each user. The horizontal dotted lines in the graph show the averages over all participants. We can see that the two personalization-based methods outperform the traditional methods in all cases. For example, the Personalized PageRank method outperforms Topic-Sensitive PageRank by more than 70% for the 2nd participant, while for the 6th participant the improvement is only around 7%. This demonstrates that the effectiveness of personalization depends on the user's search habits.
For example, if a user is not likely to issue ambiguous queries, then Topic-Sensitive PageRank should be able to capture her search intentions fairly well, decreasing the improvement our methods can achieve. Nevertheless, our methods still improve greatly over the traditional ones overall. The average improvements of Personalized PageRank and Query-Biased Personalized PageRank over Topic-Sensitive PageRank, over all participants, are 24.2% and 15.2%, respectively. (These improvements are statistically significant under the two-sample t-test at the 0.05 level.)

Figure 6: Comparison of weighted average rankings of selected pages, by participant (lower is better)

5. RELATED WORK

After the original PageRank paper [11] was published, there has been considerable research on Personalized PageRank [5, 12, 13, 14, 15]. A large body of this work studies scalability and performance issues, because computing Personalized PageRank for every user may not scale to billions of users. For example, [5, 13, 14] provide frameworks that limit the bias-vector space during the computation of PageRank, so that acceptable performance can be achieved. Beyond the scalability studies, [12] tailors PageRank vectors based on query terms (but not by individual users), and in [15] Personalized PageRanks are computed based on user profiles explicitly specified by the users. Our work differs from this body of work in that we focus on developing an automatic learning mechanism for user preferences, which can then be used to compute Personalized PageRank. Researchers have also proposed ways to personalize web search based on ideas other than PageRank [16, 17, 18]. For example, [16] extends the well-known HITS algorithm by artificially increasing the authority and hub scores of the pages marked relevant by the user in previous searches.
[17] explores ways to consider the topic category of a page during ranking, using user-specified topics of interest. [18] performs a sophisticated analysis of the correlations among users, their queries, and the search results they clicked in order to model user preferences, but due to the complexity of the analysis, we believe this method is difficult to scale to general search engines. There is also much research on learning a user's preference from the pages she has visited [19, 20, 21]. This body of work, however, mainly relies on content analysis of the visited pages, unlike our work. In [19], for example, multiple TF-IDF vectors are generated, each representing the user's interests in one area. In [20], pages visited by the user are categorized by their similarity to a set of pre-categorized pages, and user preferences are represented by the topic categories of the pages in her browsing history. In [21]

the user's preference is learned from both the pages she visited and those visited by users similar to her (collaborative filtering). Our work differs from these studies in that pages are characterized by their Topic-Sensitive PageRanks, which are based on the web link structure. It will be interesting future work to develop an effective mechanism that combines both the content and the web link structure for personalized search. Finally, Google has started beta-testing a new personalized search service, which seems to estimate a searcher's interests from her past queries. Unfortunately, the details of the algorithm are not known at this point.

6. CONCLUSION

In this paper we have proposed a framework for personalizing web search based on users' past search histories, without requiring any effort from the users. In particular, we first proposed a user model to formalize users' interests in web pages and to correlate those interests with users' clicks on search results. Then, based on this correlation, we described an intuitive algorithm to learn users' interests. Finally, we proposed two different methods, based on different assumptions on user behavior, to rank search results according to the interests we have learned. We have conducted both simulation-based and real-life experiments to evaluate our approach. In the experiment on synthetic data, we found that for a reasonably small user search trace (containing 100 past clicks on results), the user interests estimated by our learning algorithm can be used to predict her view on the importance of web pages (expressed by her Personalized PageRank) quite accurately, showing that our method is effective and readily applicable to real-life search engines. In the real-life experiment, we applied our method to learn the interests of the 10 subjects we contacted, and compared the effectiveness of our method in ranking future search results for them against traditional PageRank and Topic-Sensitive PageRank.
The results showed that, on average, our method performed between 25% and 33% better than Topic-Sensitive PageRank, which in turn performed much better than PageRank. In the future we plan to expand our framework to take more user-specific information into consideration, for example users' browsing behaviors and other user information. The difficulties in doing so include the integration of different information sources, the modeling of the correlation between the various information and the user's search behavior, and efficiency concerns. We also plan to design more sophisticated learning and ranking algorithms to further improve the performance of our system.

7. REFERENCES

[1] B.J. Jansen, A. Spink, and T. Saracevic. Real life, real users, and real needs: A study and analysis of user queries on the Web. Information Processing and Management, 36(2).
[2] R. Krovetz and W.B. Croft. Lexical ambiguity and information retrieval. Information Systems, 10(2).
[3] Nielsen NetRatings search engine ratings report.
[4] J. Carroll and M. Rosson. The paradox of the active user. In Interfacing Thought: Cognitive Aspects of Human-Computer Interaction.
[5] T. Haveliwala. Topic-Sensitive PageRank. In Proceedings of the Eleventh Intl. World Wide Web Conf.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proc. of WWW 98.
[7] F. Qiu and J. Cho. Automatic identification of user preferences for personalized search. Technical report, UCLA Computer Science Department.
[8] J. Cho and S. Roy. Impact of Web search engines on page popularity. In Proc. of WWW 04.
[9] D.P. Bertsekas and J.N. Tsitsiklis. Introduction to Probability. Athena Scientific.
[10] M. Kendall and J. Gibbons. Rank Correlation Methods. Edward Arnold, London.
[11] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.
[12] M. Richardson and P. Domingos. The Intelligent Surfer: Probabilistic combination of link and content information in PageRank. In Advances in Neural Information Processing Systems 14. MIT Press.
[13] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the Twelfth Intl. World Wide Web Conf.
[14] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Exploiting the block structure of the web for computing PageRank. Technical report, Stanford University.
[15] M. Aktas, M. Nacar, and F. Menczer. Personalizing PageRank based on domain profiles. In Proc. of WebKDD 2004: KDD Workshop on Web Mining and Web Usage Analysis.
[16] F. Tanudjaja and L. Mui. Persona: A contextualized and personalized web search. In Proc. of the 35th Annual Hawaii International Conference on System Sciences.
[17] P. Chirita, W. Nejdl, R. Paiu, and C. Kohlschuetter. Using ODP metadata to personalize search. In Proceedings of ACM SIGIR 05.
[18] J. Sun, H. Zeng, H. Liu, Y. Lu, and Z. Chen. CubeSVD: A novel approach to personalized web search. In Proceedings of the Fourteenth Intl. World Wide Web Conf.
[19] L. Chen and K. Sycara. WebMate: A personal agent for browsing and searching. In Proc. 2nd Intl. Conf. on Autonomous Agents and Multiagent Systems.
[20] A. Pretschner and S. Gauch. Ontology based personalized search. In ICTAI.
[21] K. Sugiyama, K. Hatano, and M. Yoshikawa. Adaptive web search based on user profile constructed without any effort from users. In Proceedings of the Thirteenth Intl. World Wide Web Conf., 2004.


More information

Personalized Information Retrieval

Personalized Information Retrieval Personalized Information Retrieval Shihn Yuarn Chen Traditional Information Retrieval Content based approaches Statistical and natural language techniques Results that contain a specific set of words or

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

Verification of Multiple Agent Knowledge-based Systems

Verification of Multiple Agent Knowledge-based Systems Verification of Multiple Agent Knowledge-based Systems From: AAAI Technical Report WS-97-01. Compilation copyright 1997, AAAI (www.aaai.org). All rights reserved. Daniel E. O Leary University of Southern

More information

Automatic Identification of User Goals in Web Search [WWW 05]

Automatic Identification of User Goals in Web Search [WWW 05] Automatic Identification of User Goals in Web Search [WWW 05] UichinLee @ UCLA ZhenyuLiu @ UCLA JunghooCho @ UCLA Presenter: Emiran Curtmola@ UC San Diego CSE 291 4/29/2008 Need to improve the quality

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information

A New Technique for Ranking Web Pages and Adwords

A New Technique for Ranking Web Pages and Adwords A New Technique for Ranking Web Pages and Adwords K. P. Shyam Sharath Jagannathan Maheswari Rajavel, Ph.D ABSTRACT Web mining is an active research area which mainly deals with the application on data

More information

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies Large-Scale Networks PageRank Dr Vincent Gramoli Lecturer School of Information Technologies Introduction Last week we talked about: - Hubs whose scores depend on the authority of the nodes they point

More information

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW ISSN: 9 694 (ONLINE) ICTACT JOURNAL ON COMMUNICATION TECHNOLOGY, MARCH, VOL:, ISSUE: WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW V Lakshmi Praba and T Vasantha Department of Computer

More information

Lecture 17 November 7

Lecture 17 November 7 CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has

More information

Towards a hybrid approach to Netflix Challenge

Towards a hybrid approach to Netflix Challenge Towards a hybrid approach to Netflix Challenge Abhishek Gupta, Abhijeet Mohapatra, Tejaswi Tenneti March 12, 2009 1 Introduction Today Recommendation systems [3] have become indispensible because of the

More information

Frontiers in Web Data Management

Frontiers in Web Data Management Frontiers in Web Data Management Junghoo John Cho UCLA Computer Science Department Los Angeles, CA 90095 cho@cs.ucla.edu Abstract In the last decade, the Web has become a primary source of information

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

Using Text Learning to help Web browsing

Using Text Learning to help Web browsing Using Text Learning to help Web browsing Dunja Mladenić J.Stefan Institute, Ljubljana, Slovenia Carnegie Mellon University, Pittsburgh, PA, USA Dunja.Mladenic@{ijs.si, cs.cmu.edu} Abstract Web browsing

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

Statistical Machine Translation Models for Personalized Search

Statistical Machine Translation Models for Personalized Search Statistical Machine Translation Models for Personalized Search Rohini U AOL India R& D Bangalore, India Rohini.uppuluri@corp.aol.com Vamshi Ambati Language Technologies Institute Carnegie Mellon University

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

A Recommendation Model Based on Site Semantics and Usage Mining

A Recommendation Model Based on Site Semantics and Usage Mining A Recommendation Model Based on Site Semantics and Usage Mining Sofia Stamou Lefteris Kozanidis Paraskevi Tzekou Nikos Zotos Computer Engineering and Informatics Department, Patras University 26500 GREECE

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Citation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness

Citation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness UvA-DARE (Digital Academic Repository) Exploring topic structure: Coherence, diversity and relatedness He, J. Link to publication Citation for published version (APA): He, J. (211). Exploring topic structure:

More information

SUPPORTING PRIVACY PROTECTION IN PERSONALIZED WEB SEARCH- A REVIEW Neha Dewangan 1, Rugraj 2

SUPPORTING PRIVACY PROTECTION IN PERSONALIZED WEB SEARCH- A REVIEW Neha Dewangan 1, Rugraj 2 SUPPORTING PRIVACY PROTECTION IN PERSONALIZED WEB SEARCH- A REVIEW Neha Dewangan 1, Rugraj 2 1 PG student, Department of Computer Engineering, Alard College of Engineering and Management 2 Department of

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

A Novel Approach for Supporting Privacy Protection in Personalized Web Search by Using Data Mining

A Novel Approach for Supporting Privacy Protection in Personalized Web Search by Using Data Mining A Novel Approach for Supporting Privacy Protection in Personalized Web Search by Using Data Mining Kurcheti Hareesh Babu 1, K Koteswara Rao 2 1 PG Scholar,Dept of CSE, Rao & Naidu Engineering College,

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Weighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD. Abstract

Weighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD. Abstract Weighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD Abstract There are two common main approaches to ML recommender systems, feedback-based systems and content-based systems.

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

Automatic Query Type Identification Based on Click Through Information

Automatic Query Type Identification Based on Click Through Information Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China

More information

Link Analysis. Hongning Wang

Link Analysis. Hongning Wang Link Analysis Hongning Wang CS@UVa Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query

More information

A STUDY ON THE EVOLUTION OF THE WEB

A STUDY ON THE EVOLUTION OF THE WEB A STUDY ON THE EVOLUTION OF THE WEB Alexandros Ntoulas, Junghoo Cho, Hyun Kyu Cho 2, Hyeonsung Cho 2, and Young-Jo Cho 2 Summary We seek to gain improved insight into how Web search engines should cope

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul 1 CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey

More information

10601 Machine Learning. Model and feature selection

10601 Machine Learning. Model and feature selection 10601 Machine Learning Model and feature selection Model selection issues We have seen some of this before Selecting features (or basis functions) Logistic regression SVMs Selecting parameter value Prior

More information

Personalized Web Search

Personalized Web Search Personalized Web Search Dhanraj Mavilodan (dhanrajm@stanford.edu), Kapil Jaisinghani (kjaising@stanford.edu), Radhika Bansal (radhika3@stanford.edu) Abstract: With the increase in the diversity of contents

More information

ihits: Extending HITS for Personal Interests Profiling

ihits: Extending HITS for Personal Interests Profiling ihits: Extending HITS for Personal Interests Profiling Ziming Zhuang School of Information Sciences and Technology The Pennsylvania State University zzhuang@ist.psu.edu Abstract Ever since the boom of

More information

Impact of Search Engines on Page Popularity

Impact of Search Engines on Page Popularity Introduction October 10, 2007 Outline Introduction Introduction 1 Introduction Introduction 2 Number of incoming links 3 Random-surfer model Case study 4 Search-dominant model 5 Summary and conclusion

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Youtube Graph Network Model and Analysis Yonghyun Ro, Han Lee, Dennis Won

Youtube Graph Network Model and Analysis Yonghyun Ro, Han Lee, Dennis Won Youtube Graph Network Model and Analysis Yonghyun Ro, Han Lee, Dennis Won Introduction A countless number of contents gets posted on the YouTube everyday. YouTube keeps its competitiveness by maximizing

More information

Query-Sensitive Similarity Measure for Content-Based Image Retrieval

Query-Sensitive Similarity Measure for Content-Based Image Retrieval Query-Sensitive Similarity Measure for Content-Based Image Retrieval Zhi-Hua Zhou Hong-Bin Dai National Laboratory for Novel Software Technology Nanjing University, Nanjing 2193, China {zhouzh, daihb}@lamda.nju.edu.cn

More information

PWS Using Learned User Profiles by Greedy DP and IL

PWS Using Learned User Profiles by Greedy DP and IL PWS Using Learned User Profiles by Greedy DP and IL D Parashanthi 1 & K.Venkateswra Rao 2 1 M-Tech Dept. of CSE Sree Vahini Institute of Science and Technology Tiruvuru Andhra Pradesh 2 Asst. professor

More information

Mathematical Methods and Computational Algorithms for Complex Networks. Benard Abola

Mathematical Methods and Computational Algorithms for Complex Networks. Benard Abola Mathematical Methods and Computational Algorithms for Complex Networks Benard Abola Division of Applied Mathematics, Mälardalen University Department of Mathematics, Makerere University Second Network

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information