Social Networks and Ranking

Generalized Social Networks
Represent relationships between entities
- Directed graph:
  - paper cites paper
  - HTML page links to HTML page
  - A supervises B
- Undirected graph:
  - A and B are friends
  - papers share an author
  - A and B are co-workers

Hypertext
- A document, or part of a document, links to other parts or to other documents
  - construct documents from interrelated pieces
  - relate documents to each other
- Pre-dates the Web
- The Web is its "killer app"

How to use links to improve information search?
- Use structure to compute a score for ranking
- Include more objects to rank
  - redefines what "satisfying" a query means?
  - adds to the content of a document
  - can deal with objects of mixed types: images, PDF, ...

Scoring using structure
Ideas:
1. A link to an object suggests it is a valuable object
2. Distance between objects in the graph represents degree of relatedness
   - reachable by all in 2 links
[Figure: small graph with labeled "edge" and "node"]

Pursuing linking and value
- Intuition: when a Web page points to another Web page, it confers status/authority/popularity on that page
- Find a measure that captures this intuition
- Not just Web linking
  - citations in books, articles
  - others?

Indegree
- Indegree = number of links into a node
- Most obvious idea: higher indegree => better node
- Doesn't work well
- Need some feedback in the system
- Leads us to Page and Brin's PageRank

PageRank
- Algorithm that gave Google the leap in quality
  - link structure is the centerpiece of scoring
- Framework:
  - Given a directed graph with n nodes
  - Assign each node a score that represents its importance in the structure: PageRank pr(node)
[Figure: graph in which each node carries a score pr(node)]
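
The indegree idea above can be sketched in a few lines. This is only an illustration of the "most obvious" score that the slides go on to reject; the edge list and the function name `indegree` are made up for the example.

```python
from collections import Counter

def indegree(edges):
    """Count incoming links per node for a list of (source, target) edges."""
    counts = Counter(target for _, target in edges)
    nodes = {node for edge in edges for node in edge}
    return {node: counts.get(node, 0) for node in nodes}

# Hypothetical 4-node link graph (not taken from the slides).
edges = [(1, 2), (1, 3), (2, 3), (4, 3), (3, 1)]
scores = indegree(edges)

# Rank by indegree, highest first; ties broken by node number.
ranking = sorted(scores, key=lambda node: (-scores[node], node))
print(ranking)
```

Note that this score treats every incoming link as equally valuable, which is the missing "feedback" that PageRank adds.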

Conferring importance
Core ideas:
- A node should confer some of its importance to the nodes to which it points
  - If a node is important, the nodes it links to should be important
- A node should not transfer more importance than it has

Attempt 1
- Refer to nodes by numbers 1, ..., n (arbitrary numbering)
- Let $t_i$ denote the number of edges out of node i (outdegree)
- Node i transfers $1/t_i$ of its importance on each edge out of it
- Define $pr_{new}(k) = \sum_{i \,:\, i \to k} \frac{pr(i)}{t_i}$, where $i \to k$ means there is an edge from i to k
- Iterate until converges
Problems:
- Sinks (nodes with no edges out)
- Cyclic behavior
[Figure: example graph on nodes 1-4; edge labels pr(4) and 1/2 pr(2) (twice); node 3 is a sink]

Attempt 2: random walk model
- Attempt 1 gives movement from a node to a linked neighbor with probability 1/outdegree
- Add a random jump to any node:
  $pr_{new}(k) = \frac{\alpha}{n} + (1-\alpha) \sum_{i \,:\, i \to k} \frac{pr(i)}{t_i}$
- α is a parameter chosen empirically
- Breaks cycles
- Escapes from sinks
[Figure: same example graph on nodes 1-4 with node 3 a sink]

Normalized?
Would like $\sum_{k=1}^{n} pr(k) = 1$.
Consider
$\sum_{k=1}^{n} pr_{new}(k) = \sum_{k=1}^{n} \left( \frac{\alpha}{n} + (1-\alpha) \sum_{i \,:\, i \to k} \frac{pr(i)}{t_i} \right)$   (1)
$= \sum_{k=1}^{n} \frac{\alpha}{n} + \sum_{k=1}^{n} (1-\alpha) \sum_{i \,:\, i \to k} \frac{pr(i)}{t_i}$   (2)*
$= \alpha + (1-\alpha) \sum_{k=1}^{n} \sum_{i \,:\, i \to k} \frac{pr(i)}{t_i}$   (3)
$= \alpha + (1-\alpha) \sum_{i=1}^{n} \sum_{k \,:\, i \to k} \frac{pr(i)}{t_i}$   (4)**
$= \alpha + (1-\alpha) \sum_{i=1}^{n} pr(i) \sum_{k \,:\, i \to k} \frac{1}{t_i}$   (5)**
$= \alpha + (1-\alpha) \sum_{i \text{ with an edge from } i} pr(i)$   (6)
* inner sum is over i, the incoming edges for one k
** inner sum is over k, the outgoing edges for one i; for a node i with $t_i > 0$ it has $t_i$ terms of $1/t_i$ and so equals 1
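
The normalization argument above can be checked numerically. The sketch below applies one Attempt 2 update to a small graph with a sink and compares the total with step (6). The graph, the name `random_walk_step`, and the dictionary representation are assumptions made for illustration.

```python
def random_walk_step(pr, out_edges, alpha):
    """One update: pr_new(k) = alpha/n + (1 - alpha) * sum over i->k of pr(i)/t_i."""
    n = len(pr)
    pr_new = {k: alpha / n for k in pr}
    for i, targets in out_edges.items():
        for k in targets:
            # Node i passes 1/t_i of its score along each outgoing edge.
            pr_new[k] += (1 - alpha) * pr[i] / len(targets)
    return pr_new

# Hypothetical graph: node 3 is a sink (no outgoing edges).
out_edges = {1: [2], 2: [1, 3], 3: [], 4: [1]}
alpha = 0.15
pr = {k: 1 / len(out_edges) for k in out_edges}

pr_new = random_walk_step(pr, out_edges, alpha)
total = sum(pr_new.values())

# Step (6): alpha + (1 - alpha) * (sum of pr(i) over nodes with an outgoing edge).
expected = alpha + (1 - alpha) * sum(pr[i] for i in out_edges if out_edges[i])
print(total, expected)
```

Both values fall short of 1 by exactly the (1 - α) share of the sink's score, which is the gap the sink fix is meant to close.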

Problem for desired normalization
- Have $\sum_{k=1}^{n} pr_{new}(k) = \alpha + (1-\alpha) \sum_{i \text{ with an edge from } i} pr(i)$
- Missing pr(i) for nodes with no edges from them: sinks!
- Solution: add n edges out of every sink
  - an edge to every node, including itself
  - gives a 1/n contribution to every node
- Gives the desired normalization: if $\sum_{k=1}^{n} pr_{initial}(k) = 1$ then $\sum_{k=1}^{n} pr(k) = 1$
[Figure: example graph on nodes 1-4 with node 3 a sink]

Matrix formulation
- Let E be the n by n adjacency matrix:
  E(i,k) = 1 if there is an edge from node i to node k
         = 0 otherwise
- Define a new matrix L. For each row i of E ($1 \le i \le n$):
  - if row i contains $t_i > 0$ ones, $L(i,k) = (1/t_i)\,E(i,k)$ for $1 \le k \le n$
  - if row i contains 0 ones, $L(i,k) = 1/n$ for $1 \le k \le n$
- The vector pr of PageRank values is defined by
  $pr = (\alpha/n, \alpha/n, \ldots, \alpha/n)^T + (1-\alpha) L^T pr$
  which has a solution representing the steady-state values pr(k)

Eigenvector formulation
$pr = (\alpha/n, \alpha/n, \ldots, \alpha/n)^T + (1-\alpha) L^T pr$
$= (\alpha/n)\, J\, pr + (1-\alpha) L^T pr$
$= \left( (\alpha/n) J + (1-\alpha) L^T \right) pr$
$= M\, pr$
- J is the matrix of all 1's
- $J\,pr = (1, 1, \ldots, 1)^T$ because $\sum_{k=1}^{n} pr(k) = 1$
- pr is the principal eigenvector of M: $Av = \lambda v$ with $\lambda = 1$

Calculation choices
1. pr = M pr: find the principal eigenvector of M
   - solves n simultaneous equations: $pr(k) = \frac{\alpha}{n} + (1-\alpha) \sum_{1 \le i \le n} L(i,k)\, pr(i)$
2. Use iterative calculation: the power method (where we started)
   Initialize pr_initial(k) = 1/n for each node k
   Until converges {
     For each node k:
       pr_new(k) = α/n + (1-α) Σ_{1≤i≤n} L(i,k) pr(i)
     For each node k:
       pr(k) = pr_new(k)
   }
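
As a minimal sketch of the matrix L and the power method described above, the code below builds L from an adjacency matrix (including the 1/n row for sinks) and iterates until the largest change falls below a threshold. The example matrix `E`, the function names, and the defaults `eps` and `max_iter` are assumptions chosen for illustration, not values from the slides.

```python
def build_L(E):
    """Row-normalize E; a row with no ones (a sink) becomes all 1/n."""
    n = len(E)
    L = []
    for row in E:
        t = sum(row)  # outdegree t_i
        if t > 0:
            L.append([x / t for x in row])
        else:
            L.append([1.0 / n] * n)
    return L

def pagerank(E, alpha=0.15, eps=1e-10, max_iter=1000):
    """Power method: pr_new(k) = alpha/n + (1 - alpha) * sum_i L(i,k) * pr(i)."""
    n = len(E)
    L = build_L(E)
    pr = [1.0 / n] * n  # pr_initial(k) = 1/n
    for _ in range(max_iter):
        pr_new = [
            alpha / n + (1 - alpha) * sum(L[i][k] * pr[i] for i in range(n))
            for k in range(n)
        ]
        change = max(abs(pr_new[k] - pr[k]) for k in range(n))
        pr = pr_new
        if change < eps:
            break
    return pr

# Hypothetical 4-node graph; row i lists the edges out of node i+1.
# The third row is all zeros, so node 3 is a sink.
E = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
pr = pagerank(E)
print(pr, sum(pr))
```

Because every row of L sums to 1, the scores keep summing to 1 at each iteration, as the normalization argument predicts.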

Power method
Convergence
- In practice choose a convergence criterion
  - e.g. stop iterating when $\max_{1 \le k \le n} |pr_{new}(k) - pr(k)| < \varepsilon$
  - $\varepsilon = 10^{-3}$? $10^{-4}$? $10^{-5}$?
Choice of α
- No single best value
- 1-α determines the rate of convergence (second eigenvalue)
- α = 0.15 is common
  - gives $10^{-4}$ accuracy in about 60 iterations regardless of the size of the graph [Chu; Wu]

PageRank observations
- Can be calculated for any directed graph
- Google calculates it on the entire Web graph
  - query-independent scoring
- Huge calculation for the Web graph
  - precomputed
  - 1998 Google published:
    - 52 iterations for 322 million links
    - 45 iterations for 161 million links
- PageRank must be combined with query-based scoring for the final ranking
- Many variations
  - what Google does exactly is secret
  - can make some guesses from results

HITS: Hyperlink Induced Topic Search
- Second well-known algorithm
- By Jon Kleinberg while at IBM Almaden Research Center
- Same general goal as PageRank
- Distinguishes 2 kinds of nodes:
  - Hubs: resource pages
    - point to many authorities
  - Authorities: good information pages
    - pointed to by many hubs

Mutual reinforcement
- Authority weight of node j: a(j); vector of weights a
- Hub weight of node j: h(j); vector of weights h
- Update:
  $a_{new}(k) = \sum_{i \,:\, i \to k} h(i)$
  $h_{new}(k) = \sum_{j \,:\, k \to j} a(j)$
[Figure: node i points to node k, which points to node j]
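
One round of the mutual reinforcement update above might look like the following sketch. The edge list and the function name `hits_round` are hypothetical; both new vectors are computed from the old ones, as the update rule is written.

```python
def hits_round(edges, a, h):
    """One update: a_new(k) = sum of h(i) over i->k, h_new(k) = sum of a(j) over k->j."""
    a_new = {k: 0.0 for k in a}
    h_new = {k: 0.0 for k in h}
    for source, target in edges:
        a_new[target] += h[source]  # target gains authority from the hub pointing to it
        h_new[source] += a[target]  # source gains hub weight from the authority it points to
    return a_new, h_new

# Hypothetical graph: nodes 1 and 2 act as hubs, nodes 3 and 4 as authorities.
edges = [(1, 3), (2, 3), (1, 4)]
a = {k: 1.0 for k in (1, 2, 3, 4)}
h = {k: 1.0 for k in (1, 2, 3, 4)}

a_new, h_new = hits_round(edges, a, h)
print(a_new, h_new)
```

After a single round, node 3 already has the largest authority weight and node 1 the largest hub weight.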

Mutual reinforcement (continued)
- Authority weight of node j: a(j); vector of weights a
- Hub weight of node j: h(j); vector of weights h
- Update:
  $a_{new}(k) = \sum_{i \,:\, i \to k} h(i)$
  $h_{new}(k) = \sum_{j \,:\, k \to j} a(j)$
[Figure: node k receives h(i) from each node i that points to it, and a(j) from each node j that it points to]

Matrix formulation
Steady state:
  $a = E^T h$, so $a = E^T E\, a$
  $h = E a$, so $h = E E^T h$
Interpretation?

Look inside
For $E^T E$:
- $E^T(i,k) = 1$ where $k \to i$
- $E(k,j) = 1$ where $k \to j$
- Row i of $E^T$: 1's where $s \to i$
- Column j of $E$: 1's where $s \to j$
- $E^T E(i,j)$ is the number of nodes pointing to both i and j
For $E E^T$:
- $E(i,k) = 1$ where $i \to k$
- $E^T(k,j) = 1$ where $j \to k$
- Row i of $E$: 1's where $i \to s$
- Column j of $E^T$: 1's where $j \to s$
- $E E^T(i,j)$ is the number of nodes pointed to by both i and j
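
To make the two products concrete, the sketch below computes $E^T E$ and $E E^T$ for a tiny adjacency matrix using plain lists. The matrix and the helper names `transpose` and `matmul` are assumptions for illustration only.

```python
def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*M)]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(A[i][s] * B[s][j] for s in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

# Hypothetical graph with edges 1->3, 1->4, 2->3 (row i is node i+1).
E = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

ETE = matmul(transpose(E), E)  # entry (i, j): nodes pointing to both i and j
EET = matmul(E, transpose(E))  # entry (i, j): nodes pointed to by both i and j

print(ETE[2][3])  # nodes pointing to both node 3 and node 4
print(EET[0][1])  # nodes pointed to by both node 1 and node 2
```

The diagonal entries are the indegrees (for $E^T E$) and outdegrees (for $E E^T$) of the nodes.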

Matrix formulation
Steady state:
  $a = E^T h$, so $a = E^T E\, a$
  $h = E a$, so $h = E E^T h$
Interpretation:
- $E^T E(i,j)$: number of nodes that point to both node i and node j
  - co-citation
- $E E^T(i,j)$: number of nodes pointed to by both node i and node j
  - bibliographic coupling

Iterative calculation
  a = h = (1, ..., 1)^T
  While (not converged) {
    a_new = E^T h
    h_new = E a
    a = a_new / ||a_new||    normalize to unit vector
    h = h_new / ||h_new||    normalize to unit vector
  }
Provable convergence by linear algebra

Use of HITS
Original use: after finding Web pages satisfying a query
1. Retrieve documents that satisfy the query and rank them by term-based techniques
2. Keep the top c documents: the root set of nodes
   - c is a chosen constant, tunable
3. Make the base set:
   a) root set
   b) plus nodes pointed to by nodes of the root set
   c) plus nodes pointing to nodes of the root set
   Using links to expand matches!
4. Make the base graph: the base set plus the edges from the Web graph between these nodes
5. Apply HITS to the base graph

Results using HITS
- Documents are ranked by authority score a(doc) and hub score h(doc)
  - the authority score is the primary score for search results
- Heuristics:
  - delete all links between pages in the same domain
  - keep only a pre-determined number of pages linking into the root set (~200)
- Findings (original paper):
  - number of iterations in original tests ~50
  - the most authoritative pages do not contain the initial query terms
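
The base-set construction and the iterative calculation above can be sketched together. This is a simplified illustration: it runs a fixed number of iterations rather than testing convergence, and it omits the domain and ~200-page heuristics. The link data and the names `base_set` and `hits` are hypothetical.

```python
import math

def base_set(root, web_edges):
    """Root set plus nodes it points to plus nodes pointing into it."""
    base = set(root)
    for source, target in web_edges:
        if source in root:
            base.add(target)
        if target in root:
            base.add(source)
    return base

def hits(nodes, edges, iterations=50):
    """Iterate a = E^T h, h = E a, normalizing each to a unit vector."""
    a = {k: 1.0 for k in nodes}
    h = {k: 1.0 for k in nodes}
    for _ in range(iterations):
        a_new = {k: 0.0 for k in nodes}
        h_new = {k: 0.0 for k in nodes}
        for source, target in edges:
            a_new[target] += h[source]
            h_new[source] += a[target]
        # Guard against division by zero when the graph has no edges.
        a_norm = math.sqrt(sum(v * v for v in a_new.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in h_new.values())) or 1.0
        a = {k: v / a_norm for k, v in a_new.items()}
        h = {k: v / h_norm for k, v in h_new.items()}
    return a, h

# Hypothetical Web graph; "r1" and "r2" play the role of the root set.
web_edges = [("a", "r1"), ("b", "r1"), ("b", "r2"),
             ("r1", "c"), ("r2", "c"), ("x", "y")]
root = {"r1", "r2"}

base = base_set(root, web_edges)
base_edges = [(s, t) for s, t in web_edges if s in base and t in base]
a, h = hits(base, base_edges)

top_authority = max(a, key=a.get)
top_hub = max(h, key=h.get)
print(top_authority, top_hub)
```

Pages `x` and `y` never enter the base graph because they have no link to or from the root set.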

Observations
- HITS can be applied to any directed graph
- The base graph is much smaller than the Web graph
- Kleinberg identified bad phenomena
  - Topic diffusion: the topic is generalized when the root set is expanded to the base graph
  - example: want compilers, generalized to programming

PageRank and HITS
- designed independently around 1997
  - indicates the time was ripe for this kind of analysis
- lots of embellishments by others

Revisit: How to use links in ranking documents?
- use structure to compute a score for ranking
  - PageRank, HITS
- include more objects to rank
  - saw this in the use of HITS
- use anchor text
  - (HTML) anchor text labels a link
  - include anchor text as text of the document pointed to

Anchor text
HTML text:
  All assignments will be made available on <a href="https://piazza.com/princeton/spring2017/cos435/home">the Piazza course account</a>.
Renders as:
  All assignments will be made available on the Piazza course account.
Anchor text:
  "the Piazza course account" is the anchor text
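
Pulling anchor text out of HTML, as in the example above, can be sketched with Python's standard `html.parser` module. The class name `AnchorTextParser` is made up for this illustration, and the sketch ignores nested or malformed anchors.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # completed (href, anchor text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

html = ('All assignments will be made available on '
        '<a href="https://piazza.com/princeton/spring2017/cos435/home">'
        'the Piazza course account</a>.')

parser = AnchorTextParser()
parser.feed(html)
parser.close()
print(parser.links)
```

An indexer could then add each anchor text to the terms recorded for the page named by the matching `href`.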

Using anchor text
[Figure: doc a contains a link with anchor text "homework" that points to doc b, whose title is "Problem Set"]
- "homework" may not occur in the content of doc b
- Terms in doc b for building the index:
  - homework: anchor
  - problem: title 1
  - set: title 2

Summary
- Link analysis is a principal component of ranking by modern Web search engines
  - must be combined with content analysis
- Extend document content with link info
  - anchor text
  - text of URLs, e.g. princeton.edu, aardvarksportsshop.com
- Expand the set of satisfying docs using links
  - less often used

Ranking documents w.r.t. query
[Figure: inputs (link analysis + doc. features; anchor text; query; personal information; historic information; words in doc + word features) feed a "secret recipe" that outputs scores of documents for the query, which are used to rank]

General Framework
- Have a set of n features (aka signals) to use in determining the ranking score
- Features depend on the query: vector $\Psi(d_i, q)$ of feature values $f_1, \ldots, f_n$ for doc $d_i$ and query q
  - e.g. the tf.idf score is a feature
- Features are conditioned to be comparable
- Have a parameterized function to combine signals
  - simple: linear, $\alpha_0 + \sum_{i=1}^{n} \alpha_i f_i$
  - the $\alpha_i$ are adjustable weights: how to choose them?
    - intuition
    - experimentation
    - machine learning
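
The linear combination of signals above can be sketched as follows. The feature values, the weights, and the meaning assigned to each feature are invented for illustration; real systems would condition the features to comparable ranges first.

```python
def linear_score(features, weights, bias=0.0):
    """Compute alpha_0 + sum of alpha_i * f_i for one document and query."""
    if len(features) != len(weights):
        raise ValueError("need one weight per feature")
    return bias + sum(w * f for w, f in zip(weights, features))

# Hypothetical feature vectors Psi(d, q): [tf.idf, PageRank, anchor-text match].
docs = {
    "d1": [0.8, 0.1, 0.3],
    "d2": [0.4, 0.9, 0.2],
}
weights = [0.5, 0.3, 0.2]  # assumed alpha_1..alpha_3

scores = {doc: linear_score(f, weights) for doc, f in docs.items()}
ranking = sorted(docs, key=lambda doc: scores[doc], reverse=True)
print(scores, ranking)
```

Changing the weights changes the ranking, which is why the choice of the $\alpha_i$ matters.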

Machine Learning
- Many possibilities; overview of one: the Ordinal Regression Model
- Goal: get the comparison of docs correct
- Capture the goal:
  - let ω represent the vector $(\alpha_1, \ldots, \alpha_n)$
  - want $\omega^T \Psi(d_i, q) - \omega^T \Psi(d_j, q) > 0$ if and only if $d_i$ is more relevant than $d_j$ for query q
- Find an ω that works
  - techniques train on known correct data: humans rank a set of documents for various queries
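
One simple way to look for an ω that satisfies the pairwise condition above is a perceptron-style update on feature differences. This technique is swapped in here purely for illustration and is not necessarily the training method the lecture has in mind; the training pairs and the name `train_pairwise` are hypothetical.

```python
def train_pairwise(pairs, n_features, epochs=100):
    """Find w with w . (better - worse) > 0 for each pair, if one exists."""
    w = [0.0] * n_features
    for _ in range(epochs):
        mistakes = 0
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                # Pair is ordered wrongly: move w toward the difference vector.
                w = [wi + di for wi, di in zip(w, diff)]
                mistakes += 1
        if mistakes == 0:
            break  # every training pair is ordered correctly
    return w

# Hypothetical human judgments: (features of the more relevant doc, features of the less relevant doc).
pairs = [
    ([0.9, 0.2], [0.3, 0.4]),
    ([0.6, 0.1], [0.5, 0.7]),
    ([0.8, 0.5], [0.2, 0.6]),
]
w = train_pairwise(pairs, n_features=2)

def score(features):
    """Ranking score w . Psi(d, q) for one document."""
    return sum(wi * fi for wi, fi in zip(w, features))

print(w, all(score(better) > score(worse) for better, worse in pairs))
```

The constant $\alpha_0$ does not appear because it cancels in the difference of two scores.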