PageRank, learning to rank, and context-sensitive search. Aristides Gionis Yahoo! Research, Barcelona

1 PageRank, learning to rank, and context-sensitive search Aristides Gionis Yahoo! Research, Barcelona

2 PageRank

3 random walk
graph $G = (V, E)$, with $d(u)$ the degree of node $u$
model a process of moving on the nodes of $G$ in discrete time steps $t = 0, 1, 2, \ldots$
assume the process is at node $u$ at time $t$
if $(u, v) \in E$ then the process is at node $v$ at time $t + 1$ with probability $1/d(u)$

4 random walk
random variable $X_t$ denotes the position at time $t$
$\Pr[X_{t+1} = v \mid X_t = u] = \frac{1}{d(u)}$
1st-order Markov chain
information summarized in the matrix $P$, with
$P_{uv} = \begin{cases} \frac{1}{d(u)} & \text{if } (u, v) \in E \\ 0 & \text{otherwise} \end{cases}$
called the stochastic matrix of the Markov chain
if $A$ is the adjacency matrix of $G$ then $P = D^{-1} A$, where $D$ is the diagonal matrix with $D_{uu} = d(u)$
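
The relation $P = D^{-1} A$ amounts to dividing each row of the adjacency matrix by the degree of its node. A minimal sketch, assuming a dense list-of-lists adjacency matrix; the function name and the 3-node example graph are illustrative and not part of the slides:

```python
def transition_matrix(adj):
    """Row-normalize an adjacency matrix: P[u][v] = A[u][v] / d(u)."""
    P = []
    for row in adj:
        degree = sum(row)
        if degree == 0:
            # Assumption: every node has at least one link, so 1/d(u) is defined.
            raise ValueError("node with no links: 1/d(u) is undefined")
        P.append([a / degree for a in row])
    return P


# Hypothetical undirected path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
P = transition_matrix(A)
```

Each row of `P` sums to 1, which is what makes it a stochastic matrix.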

5 random walk
define $q^{(t)} = (q^{(t)}_1, \ldots, q^{(t)}_n)$, the state probability vector
$q^{(0)}$ is the initial distribution
then $q^{(t+1)} = q^{(t)} P$, so by induction $q^{(t)} = q^{(0)} P^t$
if $\pi = \pi P$ then $\pi$ is called a stationary distribution
let $N(u, t)$ be the number of times the random walk visits node $u$ in $t$ steps; then
$\lim_{t \to \infty} \frac{N(u, t)}{t} = \pi_u$
(assuming that the graph is irreducible, finite and aperiodic)

6 random walk
for undirected graphs it is $\pi_u = \frac{d(u)}{2m}$, where $m$ is the number of edges
for directed graphs there is no closed-form solution
need to solve the system $\pi = \pi P$
eigenvalue problem, solved by the power-iteration method
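
Power iteration repeatedly applies $q \leftarrow qP$ until the vector stops changing. The sketch below is one possible implementation, not the lecture's own code; it runs on a small hypothetical undirected graph (a triangle with one pendant node, chosen so that the walk is aperiodic) and compares the result with $d(u)/2m$. The tolerance and iteration cap are arbitrary choices.

```python
def power_iteration(P, tol=1e-12, max_iter=10_000):
    """Iterate q <- q P from the uniform distribution until the L1 change is below tol."""
    n = len(P)
    q = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(q[u] * P[u][v] for u in range(n)) for v in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, q)) < tol:
            return nxt
        q = nxt
    return q


# Hypothetical undirected graph: triangle 0-1-2 plus the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n, m = 4, len(edges)
degree = [0] * n
A = [[0.0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1.0
    degree[u] += 1
    degree[v] += 1

P = [[A[u][v] / degree[u] for v in range(n)] for u in range(n)]
pi = power_iteration(P)
expected = [degree[u] / (2 * m) for u in range(n)]  # d(u) / 2m
```

This only checks the closed form numerically on one graph; it is not a proof.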

7 homework #1
show that for undirected graphs the stationary distribution at each node is proportional to the degree of the node: $\pi_u = \frac{d(u)}{2m}$

8 PageRank [Page et al., 1998]
algorithm suggested for ranking results in web search
an authority score is assigned to each Web page
authority scores are independent of the query
authority scores correspond to the stationary distribution of a random walk on the graph, such that
with probability $\alpha$ follow a link in the graph
with probability $1 - \alpha$ go to a node chosen uniformly at random (teleportation)
PageRank is also known as the random surfer model

9 PageRank
as before, the PageRank vector is given by $\pi = \pi R$, where $R = \alpha P + (1 - \alpha) \frac{1}{n} \mathbf{1}$ and $\mathbf{1}$ is the $n \times n$ matrix with all entries equal to 1
the parameter $\alpha$ is in $[0, 1]$, most typically $\alpha = 0.85$
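
The fixed point $\pi = \pi R$ can be approximated without materializing $R$: each step spreads a fraction $\alpha$ of every node's score over its out-links and adds $(1 - \alpha)/n$ to every node. A minimal sketch, assuming every node has at least one out-link (dangling nodes are not handled); the function name and the 3-node directed graph are illustrative only.

```python
def pagerank(out_links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for pi = pi (alpha P + (1 - alpha)/n * ones)."""
    n = len(out_links)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # Teleportation: since pi sums to 1, every node receives (1 - alpha) / n.
        nxt = [(1.0 - alpha) / n] * n
        # Link following: node u passes alpha * pi[u] in equal shares to its targets.
        for u, targets in enumerate(out_links):
            share = alpha * pi[u] / len(targets)  # assumes len(targets) > 0
            for v in targets:
                nxt[v] += share
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical directed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
out_links = [[1, 2], [2], [0]]
pr = pagerank(out_links)
```

In this example node 2 collects links from both other nodes and ends up with the highest score, followed by node 0 and then node 1.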

10 PageRank variants and enhancements
personalized PageRank: teleportation to a set of pages defining the preferences of a particular user
TrustRank [Gyöngyi et al., 2004]: teleportation to trustworthy pages
topic-sensitive PageRank [Haveliwala, 2002]: teleportation to a set of pages defining a particular topic

11 topic-sensitive PageRank
fix a set of $k$ topics (arts, sports, science, ...)
each topic $t_j$, $j = 1, \ldots, k$, is represented by a set of pages $T_j$
for each topic $t_j$ compute the topic-sensitive PageRank of page $u$ as $\mathrm{PR}(u, j)$ = PageRank of $u$ with teleportation to the set $T_j$
for each query $q$ compute the score of page $u$ as
$\mathrm{score}(q) = \sum_{j=1}^{k} \Pr(t_j \mid q) \, \mathrm{PR}(u, j)$
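
As a rough illustration of the two stages, the sketch below first computes one PageRank vector per topic (teleporting uniformly into $T_j$) and then mixes the vectors with the query's topic probabilities. The graph, the topic sets and the values of $\Pr(t_j \mid q)$ are made up for the example, and every node is assumed to have at least one out-link.

```python
def pagerank_with_teleport_set(out_links, teleport_set, alpha=0.85, iters=200):
    """PageRank where teleportation lands uniformly on the pages in teleport_set."""
    n = len(out_links)
    jump = [1.0 / len(teleport_set) if u in teleport_set else 0.0 for u in range(n)]
    pi = jump[:]
    for _ in range(iters):
        nxt = [(1.0 - alpha) * j for j in jump]
        for u, targets in enumerate(out_links):
            share = alpha * pi[u] / len(targets)  # assumes no dangling nodes
            for v in targets:
                nxt[v] += share
        pi = nxt
    return pi


def topic_sensitive_score(page, topic_probs, topic_vectors):
    """sum_j Pr(t_j | q) * PR(page, j)."""
    return sum(p * vec[page] for p, vec in zip(topic_probs, topic_vectors))


# Hypothetical graph with two topic sets T_1 = {0} and T_2 = {2}.
out_links = [[1], [0, 2], [3], [2, 0]]
topic_vectors = [pagerank_with_teleport_set(out_links, {0}),
                 pagerank_with_teleport_set(out_links, {2})]

# Hypothetical topic distribution for a query that is mostly about topic 1.
topic_probs = [0.9, 0.1]
scores = [topic_sensitive_score(u, topic_probs, topic_vectors) for u in range(4)]
```

The per-topic vectors depend only on the graph, so they can be computed offline; only the weighted sum has to happen at query time.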

12 homework #2
think of ways to compute $\Pr(t_j \mid q)$
remember that $t_j$, $j = 1, \ldots, k$, is a small set of topics (arts, sports, science, ...)

13 learning to rank

14 how to rank relevant information?
tf-idf
BM25
PageRank
HITS
topic-sensitive PageRank
personalized PageRank
spam score
other ad hoc rules?
header or boldface matches should weigh more (?)
...

15 how to rank relevant information? Q: which method to use? A: ALL! Q: how to combine the methods? A: machine learning

16 machine learning classification

17 classification techniques
decision trees
rule-based methods
memory-based reasoning
neural networks
naïve Bayes and Bayesian belief networks
support-vector machines (SVMs)

18 classification with decision trees

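19 classification with linear functions
id    Attributes                                        Value
1     $x_{11}$  $x_{12}$  $\ldots$  $x_{1k}$            $y_1$
2     $x_{21}$  $x_{22}$  $\ldots$  $x_{2k}$            $y_2$
...
n     $x_{n1}$  $x_{n2}$  $\ldots$  $x_{nk}$            $y_n$
find a vector $w$ such that $w \cdot x_i \approx y_i$, or $\sum_{j=1}^{k} w_j x_{ij} \approx y_i$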

20 linear separation

21 support vector machines projecting to higher dimensions, maximizing the margin

22 learning to rank
given query $q$ and document $d$, learn a function $f(q, d)$
for query $q$ and a set of candidate documents $d_1, d_2, \ldots, d_m$, produce the ranking such that $f(q, d_1) \geq f(q, d_2) \geq \ldots \geq f(q, d_m)$
represent the input $(q, d)$ as a vector $x_{q,d}$:
$(q, d) \mapsto x_{q,d} = (x^{(1)}_{q,d}, x^{(2)}_{q,d}, \ldots, z^{(1)}_d, z^{(2)}_d, \ldots)$
learn $f(q, d) = w \cdot x_{q,d}$

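23 learning to rank
training: query $q$ and two documents $d_1$ and $d_2$ with $d_1 \succ d_2$
$d_1 \succ d_2 \Leftrightarrow f(q, d_1) \geq f(q, d_2) \Leftrightarrow w \cdot x_{q,d_1} \geq w \cdot x_{q,d_2} \Leftrightarrow w \cdot (x_{q,d_1} - x_{q,d_2}) \geq 0$
transform to binary classification: $(x_{q,d_1} - x_{q,d_2}, \; y)$ with
$y = \begin{cases} +1 & \text{if } d_1 \succ d_2 \\ -1 & \text{if } d_2 \succ d_1 \end{cases}$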
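
The reduction to binary classification can be sketched as a small data-preparation step: each preference pair becomes a difference vector with a label. Emitting the mirrored example with label $-1$ as well is a choice made here to give the classifier both classes, not something the slide prescribes; the document names and feature values are hypothetical.

```python
def pairwise_examples(features, preferences):
    """Turn preferences (better, worse) into (difference vector, label) examples."""
    examples = []
    for better, worse in preferences:
        diff = [a - b for a, b in zip(features[better], features[worse])]
        examples.append((diff, +1))                # x_better - x_worse, y = +1
        examples.append(([-x for x in diff], -1))  # mirrored pair, y = -1
    return examples


# Hypothetical feature vectors x_{q,d} for two documents under one query.
features = {"d1": [3.0, 1.0], "d2": [1.0, 2.0]}
examples = pairwise_examples(features, [("d1", "d2")])
```

Any linear classifier trained on such examples yields a weight vector $w$ that can be used directly as $f(q, d) = w \cdot x_{q,d}$.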

24 context-sensitive search

25 searching in graphs I. context-sensitive search

26 searching in graphs I. context-sensitive search "chilly peppers"

27 searching in graphs I. context-sensitive search "chilly peppers" RHCP mexican cuisine

28 searching in graphs I. context-sensitive search "chilly peppers" food RHCP mexican cuisine

29 searching in graphs I. context-sensitive search "chilly peppers" music RHCP mexican cuisine

30 searching in graphs I. context-sensitive search
customize search results to the user's current page or the recent history of pages they have visited
increasing relevance of answers
disambiguation
suggesting links to Wikipedia editors

31 yahoo! contextual shortcuts

32 searching in graphs II. social search

33 searching in graphs II. social search

34 searching in graphs II. social search

35 searching in graphs II. social search
consider more information than just contacts: preferences, geographical information, comments, favorites, tags, etc.

36 context-sensitive search
keyword-based search
query: $q = \langle t_1, t_2, \ldots, t_n \rangle$
goal: find a ranked list of documents from a given collection
issues: ambiguity, incomplete information
enhancing with context information
query: $\langle q, c \rangle = \langle t_1, t_2, \ldots, t_n, c \rangle$, where $c$ is the context or querying node
goal: increase relevance

37 challenges
challenge #1: use the query context appropriately in order to obtain relevant results; design an appropriate ranking function
challenge #2: develop algorithms that perform the context-sensitive search efficiently

38 machine-learning approach
learn a ranking function that combines a large number of features
content-based features:
tf-idf, BM25, etc., as in traditional IR and web search
content similarity between the querying node and a target node
link-based features:
PageRank
shortest-path distance from the querying node to a target node
spectral distance from the querying node to a target node
graph-based similarity measures
context-specific PageRank

39 the ranking function
combine all features with a linear function
given a query pair $\langle q, c \rangle$ and a candidate set $C \subseteq V$, we compute a feature vector $x_{d, \langle q,c \rangle}$ for each $d \in C$
the final score for a target node $d$ is $w \cdot x_{d, \langle q,c \rangle}$, where $w$ is a vector of weights
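
Scoring with a linear function reduces to one dot product per candidate followed by a sort. A minimal sketch; the three features, their values and the weights are invented for illustration and are not results from the slides.

```python
def rank_candidates(weights, feature_vectors):
    """Sort candidate nodes by the linear score w . x, highest first."""
    def score(x):
        return sum(w * xi for w, xi in zip(weights, x))
    return sorted(feature_vectors, key=lambda d: score(feature_vectors[d]), reverse=True)


# Hypothetical features per candidate: [BM25, content similarity, context-specific PageRank].
w = [0.5, 1.0, 2.0]
x = {
    "Jaguar (car)": [2.0, 0.1, 0.05],
    "Jaguar": [1.8, 0.6, 0.30],
    "Atari Jaguar": [1.5, 0.2, 0.02],
}
ranking = rank_candidates(w, x)
```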

40 the candidate set
the set of documents that contain the query terms
additionally, for many queries most of the correct target nodes are within a short distance of the querying node
intuitively, documents corresponding to nodes close to $c$ in terms of graph distance are more likely to be good answers to an ambiguous query

41 computing the candidate set
1. use a pre-computed index to obtain all documents that contain all query terms
2. run breadth-first search (BFS) starting from the querying node, following forward links up to depth $h$
3. let $C$ be the intersection of the documents containing $q$ and the nodes visited by BFS
why apply this hard constraint? experimentally found to improve relevance
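
The three steps above can be sketched directly, ignoring efficiency. The inverted index is modeled as a plain dictionary from term to document set and the graph as a dictionary of forward links; both structures and all names in the example are assumptions made for illustration.

```python
from collections import deque


def bfs_within_depth(out_links, source, h):
    """Nodes reachable from source by following at most h forward links."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if depth[u] == h:
            continue  # do not expand beyond depth h
        for v in out_links.get(u, ()):
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return set(depth)


def candidate_set(out_links, term_index, query_terms, c, h):
    """Documents containing all query terms, restricted to the BFS neighborhood of c."""
    postings = [term_index.get(t, set()) for t in query_terms]
    matching = set.intersection(*postings) if postings else set()
    return matching & bfs_within_depth(out_links, c, h)


# Hypothetical link graph and inverted index.
out_links = {"c": ["a", "b"], "a": ["x"], "b": [], "x": ["far"]}
term_index = {"jaguar": {"a", "x", "far", "z"}, "car": {"a", "far"}}

one_term = candidate_set(out_links, term_index, ["jaguar"], "c", 2)
two_terms = candidate_set(out_links, term_index, ["jaguar", "car"], "c", 2)
```

Here the node `far` matches the query but lies three links away from `c`, so the depth-2 constraint excludes it.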

42 computing the candidate set
BFS from $c$ is too inefficient to perform at query time
however, we can approximate it at query time using landmarks
alternatively, for each node $c$ pre-compute the nodes at distance $h$ and keep them in an index
if performed in a batch fashion, this can be $O(n^2 + m)$ instead of $O(nm)$ (and even faster in practice)

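43 content-based features
BM25: assume query $q = q_1, \ldots, q_m$ and target document $d$:
$\mathrm{BM25}(q, d) = \sum_{i=1}^{m} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, d) \, (k_1 + 1)}{f(q_i, d) + k_1 \left(1 - b + b \frac{|d|}{\bar{d}}\right)}$,
where $f(q_i, d)$ is the frequency of $q_i$ in $d$, $|d|$ is the length of $d$, $\bar{d}$ is the average document length, and $\mathrm{IDF}(q_i)$ is the inverse document frequency of the query term $q_i$:
$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$
with $N$ the number of documents and $n(q_i)$ the number of documents containing $q_i$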
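
A small sketch of BM25 scoring over a tokenized corpus, written from the standard Okapi form with $\mathrm{IDF} = \log\frac{N - n + 0.5}{n + 0.5}$. The defaults $k_1 = 1.2$ and $b = 0.75$ are common choices rather than values given in the slides, and the five-document corpus is invented.

```python
import math


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """BM25 score of every document (a list of tokens) for a list of query terms."""
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N

    def idf(term):
        n = sum(1 for d in docs if term in d)  # documents containing the term
        return math.log((N - n + 0.5) / (n + 0.5))

    scores = []
    for d in docs:
        norm = k1 * (1 - b + b * len(d) / avg_len)  # length normalization
        score = 0.0
        for t in query:
            f = d.count(t)  # term frequency f(t, d)
            score += idf(t) * f * (k1 + 1) / (f + norm)
        scores.append(score)
    return scores


# Hypothetical tokenized corpus.
docs = [
    ["jaguar", "speed", "car"],
    ["jaguar", "cat", "jungle", "animal"],
    ["apple", "pie"],
    ["banana", "bread"],
    ["train", "station"],
]
scores = bm25_scores(["jaguar", "car"], docs)
```

With this IDF, a term that occurs in more than half of the documents gets a negative weight; implementations often clamp or smooth it, which is not done here.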

44 content-based features
content similarity:
$T(c)$: set of terms in the querying node $c$
$T(d)$: set of terms in the target node $d$
define content similarity as the Jaccard coefficient
$\mathrm{cs}(c, d) = J(T(c), T(d)) = \frac{|T(c) \cap T(d)|}{|T(c) \cup T(d)|}$
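
The Jaccard coefficient is a one-liner over Python sets. Returning 0 when both sets are empty is a convention assumed here, since the formula is undefined in that case; the term sets are illustrative.

```python
def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|, with 0.0 for two empty sets (an assumed convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical term sets T(c) and T(d).
T_c = {"red", "hot", "chili", "peppers"}
T_d = {"chili", "peppers", "recipe"}
cs = jaccard(T_c, T_d)  # 2 shared terms out of 5 distinct terms
```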

45 link-based features: context-specific PageRank
idea inspired by personalized PageRank [Haveliwala, 2002]
given $\langle q, c \rangle$, the main idea is to perform a random walk, such that
with probability $\alpha$ follow a link in the graph
with probability $1 - \alpha - \epsilon$ return to page $c$
with probability $\epsilon$ jump to any node in the graph
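
The three-way walk can be iterated in the same way as ordinary PageRank, with the teleportation mass split between the querying node $c$ and a uniform jump. A minimal sketch under the assumptions that every node has at least one out-link and that $\alpha + \epsilon \leq 1$; the value $\epsilon = 0.05$ and the example graph are arbitrary.

```python
def context_pagerank(out_links, c, alpha=0.85, eps=0.05, iters=200):
    """Random walk: follow a link (alpha), return to c (1 - alpha - eps), jump anywhere (eps)."""
    n = len(out_links)
    back = 1.0 - alpha - eps
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [eps / n] * n   # uniform jump to any node
        nxt[c] += back        # return to the querying node c
        for u, targets in enumerate(out_links):
            share = alpha * pi[u] / len(targets)  # assumes no dangling nodes
            for v in targets:
                nxt[v] += share
        pi = nxt
    return pi


# Hypothetical directed graph, queried from two different context nodes.
out_links = [[1], [0, 2], [3], [2, 0]]
from_node_0 = context_pagerank(out_links, c=0)
from_node_2 = context_pagerank(out_links, c=2)
```

The same graph yields different rankings depending on the context node, which is the effect the feature is meant to capture.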

46 efficiency issues of context-specific PageRank
offline vs. online
offline: need to pre-compute and store one PageRank vector for each node $c$ as a potential querying node
expensive pre-computation
$O(n^2)$ storage, since vectors are not (necessarily) sparse
online: PageRank computation at query time is too slow

47 approximations to context-specific PageRank
idea #1: landmarks
select a set of landmark nodes
compute context-specific PageRank only for those landmarks
given a querying node $c$, use the PageRank vector of the closest landmark node
idea #2: clustering
cluster the graph $G$
compute cluster-specific PageRank vectors (teleport to a random page in the cluster)
given a querying node $c$, use the PageRank vector of the cluster it belongs to

48 other link-based features
shortest-path distance between querying node $c$ and target $d$
predecessor and successor similarity:
let $P(v) = \{u \in V \mid (u, v) \in E\}$
predecessor similarity: $\mathrm{pred}(c, d) = J(P(c), P(d))$
let $S(v) = \{u \in V \mid (v, u) \in E\}$
successor similarity: $\mathrm{succ}(c, d) = J(S(c), S(d))$
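
Predecessor and successor similarity only need the in- and out-neighbor sets of the two nodes. A sketch over a directed edge list; the edges are hypothetical, and the Jaccard coefficient of two empty sets is taken to be 0 by convention.

```python
def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|, with 0.0 for two empty sets (an assumed convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predecessors(edges, v):
    """P(v) = {u : (u, v) in E}."""
    return {u for u, w in edges if w == v}


def successors(edges, v):
    """S(v) = {u : (v, u) in E}."""
    return {w for u, w in edges if u == v}


# Hypothetical directed edge list.
edges = [(1, 3), (2, 3), (1, 4), (2, 4), (5, 4), (3, 6), (4, 6), (4, 7)]

pred_sim = jaccard(predecessors(edges, 3), predecessors(edges, 4))  # {1,2} vs {1,2,5}
succ_sim = jaccard(successors(edges, 3), successors(edges, 4))      # {6} vs {6,7}
```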

49 other link-based features: spectral distance
graph $G = (V, E)$
an embedding $\varphi : V \to \mathbb{R}^m$ from the nodes of the graph to a low-dimensional Euclidean space
spectral distance $\mathrm{sd}(c, d) = L_2(\varphi(c), \varphi(d))$ captures closeness of $c$ and $d$ in the graph $G$
Laplacian matrix: $L_G = D - A$, where $A$ is the adjacency matrix of $G$ and $D$ is the diagonal matrix with $D_{ii} = d_i$
projection $\varphi : V \to \mathbb{R}^m$ defined by the $(m + 1)$ eigenvectors of $L_G$

50 learning the ranking function
given a query $\langle q, c \rangle$, we define the score of a target node $d$ as $\mathrm{score}(d) = w \cdot x_{d, \langle q,c \rangle}$
we use RankSVM [Joachims, 2002], a supervised method that learns the weight vector $w$
proof-of-concept implementation on Wikipedia
training set: ambiguous queries in Wikipedia

51 obtaining a training set for Wikipedia
an illustration of the use of Wikipedia disambiguation pages for finding ambiguous query terms ("bach") for different contexts ("American literature" or "Classical music")

52 individual features
[table; numeric values not recoverable] columns: success at k for k=1, k=5, k=10; mean; median. rows: BM, TEXT, SUCC, PRED, SD, each without BFS and with BFS

53 context-sensitive search results
[table; numeric values not recoverable] columns: success at k for k=1, k=5, k=10; mean; median. rows: BFS+TPR, BFS+CPR, BFS+LPR, BFS+NPR, BFS+GBF
TPR, CPR and LPR stand for true, cluster and landmark PageRank, respectively.

54 context-sensitive search results
[table; numeric values not recoverable] columns: success at k for k=1, k=5, k=10; mean; median. rows: NOBFS+TPR, NOBFS+CPR, NOBFS+LPR, NOBFS+NPR, BFS+TPR, BFS+CPR, BFS+LPR, BFS+NPR

55 qualitative results: jaguar
Context 1: Animal    Context 2: Computer Science     Context 3: Automobile
Jaguar               Atari Jaguar                    Jaguar XJ
Jaguar (car)         Mac OS X v10.2                  Jaguar (car)
European Jaguar      Daimler Motor Company           Jaguar XK
Fender Jaguar        SEPECAT Jaguar                  Jaguar XJS
Black panther        Mac OS X                        Jaguar S-Type
Fender Bass VI       Jaguar X-Type                   Jaguar E-Type
Pantherinae          Polymorphism (biology)          Jaguar X-Type
Panthera             British Leyland Motor Corp.     Jaguar XJ220
Jaguar XJ220         Royal Air Force                 Daimler Motor Comp.
South America        HMS Jaguar (F34)                Jaguar AJ-V8 engine

56 qualitative results: cuckoo
Context 1: Animal    Context 2: Computer Science     Context 3: Film
Cuckoo               Sopwith Cuckoo                  One Flew over...
Common Cuckoo        The Cuckoo's Egg                Jack Nicholson
Cuculus              Robert Morris (cryptogr.)       Ken Kesey
Coccyzus             Oberon-2                        Academy Award
Yellow Cuckoo        Clifford Stoll                  Louise Fletcher
Sumatran cuckoo      Bernardo Pasquini               Cuckoo clock
Clamator             Dictionary attack               1975 in film
Striped Cuckoo       Mike Muuss                      1970s
Dideric Cuckoo       Computer insecurity             48th Academy Awards
Crotophagidae        Louise Fletcher                 Milos Forman

57 Sandbox demo

58 references I
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada. Morgan Kaufmann.
Haveliwala, T. (2002). Topic-sensitive PageRank. In Proceedings of the 11th WWW Conference.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD).

59 references II
Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.
