Link Analysis
Information Retrieval
Evangelos Kanoulas, e.kanoulas@uva.nl
How Search Works
[Diagram: crawling (quality, freshness, spaminess), text processing & indexing, query understanding, ranking algorithm, logging (clicks, context), content]
Friendship Networks, Citation Networks, ...
Link analysis studies the relationships (e.g., friendship, citation) between objects (e.g., people, publications) to find out about their characteristics (e.g., popularity, impact).
Social Network Analysis (e.g., on a friendship network): betweenness centrality of a person v is the fraction of shortest paths between any two persons (u, w) that pass through v.
Bibliometrics (e.g., on a citation network): co-citation measures how many papers cite both u and v; co-reference measures how many common papers both u and v refer to.
..., and the Web?
The World Wide Web can be seen as a directed graph G(V, E): web pages correspond to vertices (or nodes) V, and hyperlinks between them correspond to edges E.
Link analysis on the Web graph can give us clues about
which web pages are important and should thus be ranked higher
which pairs of web pages are similar to each other
which web pages are probably spam and should be ignored
Link Analysis (Outline)
PageRank: Random Surfer Model, Markov Chains
HITS: Hyperlink-Induced Topic Search
Topic-Specific and Personalized PageRank: Biased Random Jumps, Linearity of PageRank
Similarity Search: SimRank, Random Walk with Restarts
Spam Detection: Link Spam, TrustRank, SpamRank
Social Networks: SocialPageRank, TunkRank
PageRank
Hyperlinks can be interpreted as endorsements of the target web page. In-degree as a measure of the importance/authority/popularity of a web page v is easy to manipulate and does not consider the importance of the source web pages.
PageRank (Larry Page & Sergey Brin) considers a web page v important if many important web pages link to it.
Random surfer model: follows a uniform random outgoing link with probability (1-ε); jumps to a uniform random web page with probability ε.
Intuition: important web pages are the ones that are visited often.
Stochastic Processes & Markov Chains
A discrete stochastic process is a family of random variables {X_t | t ∈ T} with T = {0, 1, 2, ...} as discrete time domain.
A stochastic process is a Markov chain if P[X_t = x | X_{t-1} = w, ..., X_0 = a] = P[X_t = x | X_{t-1} = w] holds, i.e., it is memoryless.
A Markov chain is time-homogeneous if for all times t, P[X_{t+1} = x | X_t = w] = P[X_t = x | X_{t-1} = w] holds, i.e., transition probabilities do not depend on time.
State Space & Transition Probability Matrix
The state space of a Markov chain {X_t | t ∈ T} is the countable set S of all values that X_t can assume (X_t : Ω → S). The Markov chain is in state s at time t if X_t = s. A Markov chain is finite if it has a finite state space.
If a Markov chain is finite and time-homogeneous, its transition probabilities can be described as a matrix P = (p_ij) with p_ij = P[X_t = j | X_{t-1} = i].
For |S| = n, the transition probability matrix P is an n-by-n right-stochastic matrix, i.e., its rows sum up to 1: ∀i : Σ_j p_ij = 1.
Properties of Markov Chains
State j is reachable from state i if there exists an n ≥ 0 such that (P^n)_ij > 0. States i and j communicate if each is reachable from the other. A Markov chain is irreducible if all states i, j ∈ S communicate.
A Markov chain is positive recurrent if the recurrence probability is 1 and the mean recurrence time is finite for every state i.
Properties of Markov Chains (cont'd)
A Markov chain is aperiodic if every state i has period 1. A Markov chain is ergodic if it is time-homogeneous, irreducible, positive recurrent, and aperiodic.
The 1-by-n vector π is the stationary state distribution of the Markov chain described by P if π ≥ 0, Σ_i π_i = 1, and πP = π.
π_i is the limit probability that the Markov chain is in state i; 1/π_i reflects the average time until the Markov chain returns to state i.
Theorem: If a Markov chain is finite and ergodic, then there exists a unique stationary state distribution.
PageRank as a Markov Chain Random surfer model Follows a uniform random outgoing link with probability (1- ε) Jumps to a uniform random web page with probability ε Let A be the adjacency matrix of the Web graph, matrix T captures following of a uniform random outgoing link T ij = ( 1/out(i) (i, j) 2 E 0 otherwise 1 2 3 4 5 0 1 0 1 0 A = 0 0 1 1 0 1 0 0 0 0 T = 0 0 0 0 1 0 0 1 0 0 0 1/2 0 1/2 0 0 0 1 1/2 0 1/1 0 0 0 0 0 0 0 0 1/1 0 0 1/1 0 0
PageRank as a Markov Chain (cont'd)
Random surfer model: follows a uniform random outgoing link with probability (1-ε); jumps to a uniform random web page with probability ε.
The vector j captures jumping to a uniform random web page: j_i = 1/|V|.
The transition probability matrix of the Markov chain is then obtained as
P = (1-ε) T + ε [1 ... 1]^T j
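The construction of P can be sketched in a few lines of Python. This is an illustrative sketch, not a production implementation; the value of ε and the uniform-jump handling of dangling nodes (pages without outgoing links, which the slide's example does not contain) are assumptions:

```python
def transition_matrix(adj, eps=0.15):
    """Build P = (1-eps)*T + eps*(1/n) from a 0/1 adjacency matrix.

    T follows a uniform random outgoing link; the eps term is the
    uniform random jump. Dangling rows (no outgoing links) are treated
    as jumping uniformly (an assumption, not from the slides)."""
    n = len(adj)
    P = []
    for i in range(n):
        out = sum(adj[i])
        row = []
        for j in range(n):
            follow = adj[i][j] / out if out > 0 else 1.0 / n
            row.append((1 - eps) * follow + eps / n)
        P.append(row)
    return P

# The five-page example graph from the previous slide
A = [[0, 1, 0, 1, 0],
     [0, 0, 1, 1, 0],
     [1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 1, 0, 0]]
P = transition_matrix(A)
# Every row of P sums to 1, i.e., P is right-stochastic
```

With ε = 0.15, entry P[0][1] is (1-0.15)·1/2 + 0.15/5 = 0.455: most of the mass follows the link 1→2, a small part comes from the random jump.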
Computing π
The stationary state distribution can be cast into a system of linear equations; a solution can be found, e.g., using Gaussian elimination.
The stationary state distribution is the left eigenvector of the transition probability matrix P for eigenvalue λ = 1: πP = π, i.e., π(P - I) = 0. It can be computed using the characteristic polynomial det(P - λI) = 0.
Computing π (in practice)
The stationary state distribution is the limit distribution. Idea: compute k-step state probabilities π^(k) until they converge.
Power (iteration) method: select an arbitrary initial state probability distribution π^(0); compute π^(k) = π^(k-1) P until convergence (e.g., ||π^(k) - π^(k-1)|| < ε).
The power (iteration) method basically simulates the Markov chain and is the method of choice in practice when dealing with huge state spaces, exploiting that matrix-vector multiplication is easy to parallelize.
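The power method above can be sketched in pure Python. A minimal sketch; the function name, tolerance, and the two-state example chain are illustrative:

```python
def power_iteration(P, tol=1e-10, max_iter=1000):
    """Compute the stationary distribution by iterating pi_k = pi_{k-1} * P.

    Starts from the uniform distribution (any distribution works for an
    ergodic chain) and stops when the L1 change drops below tol."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Two-state chain whose stationary distribution is (5/6, 1/6):
# 0.1*pi_1 = 0.5*pi_2 and pi_1 + pi_2 = 1
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = power_iteration(P)
```

Each iteration is one matrix-vector multiplication, which is exactly the step that parallelizes well on huge state spaces.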
Topic-Specific and Personalized PageRank
PageRank produces a one-size-fits-all ranking, determined assuming uniform following of links and uniform random jumps. How can we obtain topic-specific (e.g., for Sports) or personalized (e.g., based on my bookmarks) rankings?
bias random jump probabilities (i.e., modify the vector j)
bias link-following probabilities (i.e., modify the matrix T)
What if we do not have hyperlinks between documents? Construct an implicit-link graph from user behavior or document contents.
Topic-Specific PageRank
Input: a set of topics C (e.g., Sports, Politics, Food, ...) and a set of web pages S_c for each topic c (e.g., from dmoz.org).
Idea: compute a topic-specific ranking for c by biasing the random jump in PageRank toward the web pages S_c of that topic:
P_c = (1-ε) T + ε [1 ... 1]^T j_c   with   j_ci = 1/|S_c| if i ∈ S_c, 0 otherwise
Method: precompute topic-specific PageRank vectors π_c; classify a user query q to obtain topic probabilities P[c | q]; the final importance score is obtained as the linear combination π = Σ_{c ∈ C} P[c | q] π_c.
T. H. Haveliwala: Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search, TKDE 15(4):784-796, 2003
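Biasing the jump vector can be sketched as follows. The toy matrix T, the topic set, and ε are illustrative; note that the same construction also yields personalized PageRank (jump to a favorite set F) and, later in these slides, TrustRank (jump to a trusted set):

```python
def biased_transition(T, target_pages, eps=0.15):
    """P = (1-eps)*T + eps*[1..1]^T j, with the jump vector j uniform
    over target_pages (the topic set S_c; a favorite set F or a trusted
    set works identically)."""
    n = len(T)
    targets = set(target_pages)
    jump = [1.0 / len(targets) if i in targets else 0.0 for i in range(n)]
    return [[(1 - eps) * T[i][k] + eps * jump[k] for k in range(n)]
            for i in range(n)]

# Toy example: 3 pages, only page 2 belongs to the topic
T = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
Pc = biased_transition(T, [2])
# Random jumps now land only on page 2; P_c stays right-stochastic
```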
Personalized PageRank
Idea: provide every user with a personalized ranking based on her favorite web pages F (e.g., from bookmarks or likes):
P_F = (1-ε) T + ε [1 ... 1]^T j_F   with   j_Fi = 1/|F| if i ∈ F, 0 otherwise
Problem: computing and storing a personalized PageRank vector for every single user is too expensive.
G. Jeh and J. Widom: Scaling Personalized Web Search, WWW 2003
PageRank in Practice
PageRank-style methods are used to some extent in ranking; at Google, they were probably more important in the early days than today.
Use case: index selection (which pages to include in the index, which tier of the index to place them in, which pages to update more often).
Use case: spam fighting (which sites to exclude from search).
Similarity Search
How can we use the links between objects to figure out which objects are similar to each other? This is not limited to the Web graph but also applicable to
k-partite graphs derived from relational databases (students, lectures, etc.)
implicit graphs derived from observed user behavior
word co-occurrence graphs
Applications: identification of similar pairs of objects (e.g., documents or queries); recommendation of similar objects (e.g., documents based on a query).
SimRank
Intuition: two objects are similar if similar objects point to them.
s(u, v) = C / (|I(u)| |I(v)|) · Σ_{i=1}^{|I(u)|} Σ_{j=1}^{|I(v)|} s(I_i(u), I_j(v))
with confidence constant C < 1, in-neighbor sets I(u) and I(v), and I_i(u) and I_j(v) as the i-th and j-th in-neighbor of u and v.
Example (universities, professors, students: U1, P1, P2, S1, S2), with C = 0.8:
s(P1, P2) = 0.414, s(S1, S2) = 0.331, s(U1, P2) = 0.132, s(P1, S2) = 0.106, s(P2, S2) = 0.088, s(P2, S1) = 0.042, s(U1, S2) = 0.034
SimRank (cont'd)
Initialization: s^(0)(u, v) = 1 if u = v, 0 otherwise.
Repeat until convergence, for u ≠ v:
s^(k+1)(u, v) = C / (|I(u)| |I(v)|) · Σ_{i=1}^{|I(u)|} Σ_{j=1}^{|I(v)|} s^(k)(I_i(u), I_j(v))
The SimRank score s(u, v) can be interpreted in terms of the expected number of steps it takes two random surfers to meet if they start at nodes u and v and walk the graph backwards in lock step (i.e., their steps are synchronous).
G. Jeh and J. Widom: SimRank: A Measure of Structural-Context Similarity, KDD 2002
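The iteration above can be sketched directly. A toy sketch: the graph and the number of iterations are illustrative (not the slide's university example), and nodes without in-neighbors are given score 0 as in the formula:

```python
def simrank(in_neighbors, C=0.8, iters=10):
    """in_neighbors: dict node -> list of in-neighbors.
    Returns all pairwise SimRank scores after `iters` iterations."""
    nodes = list(in_neighbors)
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        nxt = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    nxt[(u, v)] = 1.0
                elif in_neighbors[u] and in_neighbors[v]:
                    total = sum(s[(i, j)]
                                for i in in_neighbors[u]
                                for j in in_neighbors[v])
                    nxt[(u, v)] = (C * total /
                                   (len(in_neighbors[u]) * len(in_neighbors[v])))
                else:
                    nxt[(u, v)] = 0.0  # no in-neighbors: score stays 0
        s = nxt
    return s

# a -> b and a -> c: b and c share their single in-neighbor a,
# so s(b, c) converges to C * s(a, a) = 0.8
g = {'a': [], 'b': ['a'], 'c': ['a']}
scores = simrank(g)
```

The naive iteration is O(n^2 d^2) per round; real implementations prune node pairs, but the sketch matches the slide's recurrence one-to-one.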
Random Walks on the Click Graph
Consider a bipartite click graph with queries and documents as vertices and weighted edges (q, d) indicating users' tendency to click on document d for query q (e.g., queries "giant panda", "panda bear", "fiat panda" connected with click-based weights to documents d1, d2, d3).
Perform a PageRank-style random walk with link-following probabilities proportional to edge weights and a random jump to a single query or document.
Applications: query-to-document search, query-to-query suggestion, document-by-query annotation, document-to-document suggestion.
[Figure: annotating a dog image via a random walk; top queries such as "boxer dog puppies", "boxer puppy pics", and "boxer pups", each with its walk probability P and its distance in the graph]
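A random walk with restart on such a bipartite graph can be sketched as follows. The node names and edge weights are illustrative stand-ins for real click counts, and the restart probability is an assumption:

```python
def walk_with_restart(edges, start, alpha=0.15, iters=200):
    """edges: dict node -> dict of neighbor -> edge weight (click count).
    Follows an edge with probability proportional to its weight and
    restarts at `start` with probability alpha; returns the limit
    distribution, which scores all nodes relative to `start`."""
    pi = {u: (1.0 if u == start else 0.0) for u in edges}
    for _ in range(iters):
        nxt = {u: (alpha if u == start else 0.0) for u in edges}
        for u, nbrs in edges.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += (1 - alpha) * pi[u] * w / total
        pi = nxt
    return pi

# Hypothetical click graph: queries q1..q3, documents d1..d3; edges are
# stored in both directions so the walk alternates queries and documents
clicks = {'q1': {'d1': 2.0, 'd2': 1.0}, 'q2': {'d2': 2.0}, 'q3': {'d3': 1.0},
          'd1': {'q1': 2.0}, 'd2': {'q1': 1.0, 'q2': 2.0}, 'd3': {'q3': 1.0}}
pi = walk_with_restart(clicks, 'q1')
# Documents reachable from q1 (d1, d2) get positive scores; the
# disconnected d3 gets none
```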
Random Walks on the Query-Flow Graph
Consider a query-flow graph with queries as vertices and weighted edges (q, q') reflecting how often q' is issued after q (e.g., "giant panda" followed by "panda bear" or "endangered species").
Recommend related queries by performing a PageRank-style random walk on the query-flow graph with link-following probabilities proportional to edge weights and a random jump to the current query (or the last few queries).
[Figure: excerpt of a query-flow graph around queries such as "banana apple", "beatles apple", and "banana recipe", illustrating chains of query reformulations]
P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, and S. Vigna: The Query-flow Graph: Model and Applications, CIKM 2008
Spam Detection
The discoverability of web pages often has a direct impact on the commercial success of the business behind them. Search Engine Optimization (SEO) seeks to optimize web pages to make them easier to discover for potential customers:
white hat (optimizes for the user and adheres to search engine policies)
black hat (manipulates search results by web spamming techniques)
Web spamming techniques and search engines evolved in parallel: initially only content spam, then link spam, now social media spam.
Content Spam vs. Link Spam vs. Content Hiding
Content spam: keyword stuffing (add unrelated but often-sought keywords to a page); invisible content (unrelated content invisible to users, like this).
Link spam: link farms (collections of pages aiming to manipulate PageRank); honeypots (create valuable web pages with links to spam pages); link hijacking (leave comments on reputable web pages or blogs).
Content hiding: cloaking (show different content to the search engine's crawler and to users).
TrustRank, BadRank
Idea: pages linked to by trustworthy pages tend to be trustworthy. TrustRank performs a PageRank-style random walk with random jumps only to an explicitly selected set of trusted pages T.
Idea: pages linking to spam pages tend to be spam themselves. BadRank performs a PageRank-style backwards random walk (i.e., following incoming links) with random jumps only to an explicitly selected set of blacklisted pages B.
Problems: sets of trusted and blacklisted pages are difficult to maintain; TrustRank and BadRank scores are hard to interpret and combine.
Spam, Damn Spam, and Statistics
Idea: look for statistical deviation.
Content spam: compare a page's word frequency distribution to the distribution in good hosts.
Link spam: identify outliers in the out-degree and in-degree distributions and inspect their intersection.
[Figure: log-log plots of the number of pages by out-degree and by in-degree, showing power-law-like distributions with outliers]
SpamMass
Idea: measure spam mass as the amount of PageRank score that a web page receives from web pages known to be spam.
Assume that web pages are partitioned into good pages V+ and bad pages V-, and that a good core C ⊆ V+ is known.
The absolute spam mass of page p is then estimated as SM(p) = π(p) - π_C(p), with π(p) as its PageRank score and π_C(p) as its PageRank score with random jumps only to pages in the good core.
The relative spam mass of page p is rsm(p) = SM(p) / π(p).
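Given the two PageRank vectors, the spam mass computation is a one-liner per page. A sketch with hypothetical scores (the two vectors would come from ordinary PageRank and from a run with jumps restricted to the good core, as on the TrustRank slide):

```python
def spam_mass(pi, pi_core):
    """Absolute spam mass SM(p) = pi(p) - pi_core(p) and relative spam
    mass rsm(p) = SM(p) / pi(p), given the full PageRank scores pi and
    the good-core-biased scores pi_core (both dicts: page -> score)."""
    sm = {p: pi[p] - pi_core[p] for p in pi}
    rsm = {p: (sm[p] / pi[p] if pi[p] > 0 else 0.0) for p in pi}
    return sm, rsm

# Hypothetical scores: page 'x' receives most of its PageRank from
# outside the good core, page 'y' almost none
pi = {'x': 0.4, 'y': 0.6}
pi_core = {'x': 0.05, 'y': 0.55}
sm, rsm = spam_mass(pi, pi_core)
# rsm['x'] is high (suspicious), rsm['y'] is low
```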
Social Networks
Social networks: diverse relations (e.g., friendship, liking, check-in, following) between diverse types of objects (e.g., people, entities, posts, images, videos).
Folksonomies (~ folk + taxonomy) allow users to organize objects by tagging; there is no centrally controlled vocabulary.
Link analysis methods give insights into the importance and similarity of objects (e.g., for ranking or recommendation).
TunkRank
Idea: measure a Twitter user's influence as the expected number of people who will read a tweet (including re-tweets) by the user.
Considers Twitter's follower graph G(V, E) consisting of users as vertices V and directed edges E, with edge (i, j) indicating that user i follows user j.
Assumptions: if i follows j, then i reads a tweet by j with probability 1/out(i); constant re-tweeting probability p.
r(j) = Σ_{(i,j) ∈ E} (1 + p · r(i)) / out(i)
D. Tunkelang: A Twitter Analog to PageRank, 2009, http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/
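The recursive definition can be solved by fixed-point iteration; a minimal sketch, where the toy graph, the retweet probability p, and the iteration count are illustrative:

```python
def tunkrank(followers, out_degree, p=0.05, iters=100):
    """followers[j]: list of users i who follow j.
    out_degree[i]: number of accounts user i follows, i.e., out(i).
    Iterates r(j) = sum over followers i of (1 + p*r(i)) / out(i)."""
    r = {u: 0.0 for u in followers}
    for _ in range(iters):
        r = {j: sum((1 + p * r[i]) / out_degree[i] for i in followers[j])
             for j in followers}
    return r

# Toy follower graph: a and b each follow only c; nobody follows a or b.
# Each follower reads c's tweet with probability 1/out = 1, so r(c) = 2.
followers = {'a': [], 'b': [], 'c': ['a', 'b']}
out_degree = {'a': 1, 'b': 1, 'c': 0}
r = tunkrank(followers, out_degree)
```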
TwitterRank
Considers Twitter's follower graph G(V, E) consisting of users as vertices V and directed edges E, with edge (i, j) indicating that user i follows user j.
PageRank-style random walk with link-following probabilities
T_ij = ( N_j / Σ_{(i,k) ∈ E} N_k ) · sim(i, j) if (i, j) ∈ E, 0 otherwise
with N_i as the number of tweets published by user i and sim(i, j) reflecting the similarity between the tweets by i and j.
J. Weng, E.-P. Lim, J. Jiang, and Q. He: TwitterRank: Finding Topic-Sensitive Influential Twitterers, WSDM 2010
Take-Home Message
Link analysis methods can be used to measure importance and similarity, with applications in search and recommendation. PageRank is performed on the basis of the theory of finite and ergodic Markov chains, and its computations are easily parallelizable, e.g., with MapReduce.