INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5)
Introduction Motivation: accurate web search. Spammers want you to land on their pages. Google's PageRank and variants: TrustRank, Hubs and Authorities (HITS). Analyze the structure of a very large graph (the Web).
PageRank
Early SE and Term Spam Early search engines invented term search: crawl the Web, extract terms (e.g., words) from each page, and create an inverted index (which terms appear in which pages). Query processing: find all pages with the query terms, then rank pages according to importance/relevance (e.g., a term in the title of a page is more important). Spammers invented term spam: add fake terms (in an invisible font); run a popular query, see which page comes first, and copy it.
Google Innovation PageRank: simulate a random surfer starting from a random page and following random out-links. Important pages have a large chance to be on the simulated random path. Both page importance and terms are used for ranking. Terms around the link: the relevance of a page is judged by terms within the page and terms around links to this page.
Definition of PageRank A function that assigns a real number to each page; more important pages get a higher PageRank. Web as a directed graph (nodes = pages, edges = links).
Transition Matrix $m_{ij}$ is the probability of moving from node j to node i. Assume equal probability over out-links (k out-links, probability 1/k each). PageRank is a column vector: component i is the probability of being at node i (a small sketch follows).
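A minimal sketch (Python, not from the slides) that builds the column-stochastic transition matrix M from a toy adjacency list; the graph and names are illustrative assumptions.

    import numpy as np

    # Toy graph: node -> list of out-links (hypothetical example).
    links = {0: [1, 2], 1: [2], 2: [0, 1, 3], 3: [2]}
    n = len(links)

    M = np.zeros((n, n))
    for j, outs in links.items():       # column j describes page j's out-links
        for i in outs:
            M[i, j] = 1.0 / len(outs)   # k out-links -> probability 1/k each

    print(M.sum(axis=0))                # every column sums to 1 (stochastic)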
Stable Distribution Assume the initial probability of being at each state is the vector $v_0 = (1/n, 1/n, \ldots, 1/n)^T$, with transition matrix M. What is the probability distribution after a single step? $x = M v_0$, with $x_i = \sum_j m_{ij} (v_0)_j$. After k steps: $x_k = M^k v_0 = M M \cdots M v_0$ (a sketch follows).
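A minimal sketch of this computation, reusing the matrix M built above; 50 steps is an assumed iteration count, not prescribed here.

    import numpy as np

    def power_iterate(M, steps=50):
        n = M.shape[0]
        v = np.full(n, 1.0 / n)         # v_0 = (1/n, ..., 1/n)
        for _ in range(steps):
            v = M @ v                   # one step: x = M v
        return v                        # approximates x_k = M^k v_0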
Markov Process The distribution over nodes at step k depends only on the distribution at step k-1. A limiting distribution $v = Mv$ exists provided the graph is strongly connected (possible to get from any node to any node) and there are no dead ends (nodes that have no arcs out). The limiting distribution is an eigenvector of M.
Principal Eigenvector The transition matrix M is stochastic (each column adds up to 1). The limiting distribution is the principal eigenvector (associated with the largest eigenvalue, $\lambda = 1$): $Mv = \lambda v = v$. Computation: iterate by multiplying by the matrix M until there is no significant change; 50-75 iterations suffice for the Web.
Example Given a transition matrix, successive multiplications by M converge toward the limiting distribution.
Structure of the Web In practice, the Web is not a strongly connected graph.
Structure of the Web Large strongly connected component (SCC). In-component: can reach the SCC but is not reachable from it. Out-component: reachable from the SCC but unable to reach it. Two types of tendrils: pages reachable from the in-component but unable to reach the SCC, and pages that can reach the out-component but are not reachable from the SCC. Tubes: from the in-component to the out-component. Isolated components.
Two general problems Dead ends: pages with no links out. Spider traps: groups of pages that have no links to any pages outside the group, while each page has out-links within the group.
Avoiding Dead Ends With dead ends, the transition matrix is not stochastic (it has an all-zero column). Substochastic matrix: column sums are at most 1. Increasing powers of M lead to some/all elements of v going to zero. Example
Dropping dead ends Drop dead ends and their incoming arcs from the graph; other nodes may then become dead ends, so drop recursively until no dead ends remain. Compute PageRank on the remaining graph. Restore the graph by adding nodes back in reverse order of deletion. Computing PageRank for restored nodes: each parent with PageRank p and k out-links contributes p/k to the restored node (a sketch of this procedure follows).
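A hedged sketch of the drop-and-restore procedure, assuming the graph is a dict mapping each node to the set of its out-links and that a PageRank routine for the reduced graph (such as the power iteration above) is available; the helper names are mine.

    def drop_dead_ends(graph):
        g = {u: set(vs) for u, vs in graph.items()}
        removed = []                            # deletion order
        dead = [u for u in g if not g[u]]
        while dead:
            d = dead.pop()
            removed.append(d)
            del g[d]
            for u in g:                         # drop incoming arcs of d
                g[u].discard(d)
                if not g[u] and u not in dead:  # u became a dead end
                    dead.append(u)
        return g, removed

    def restore(pr, graph, removed):
        # Reverse deletion order; each parent u with PageRank pr[u] and
        # k out-links in the original graph contributes pr[u]/k.
        for d in reversed(removed):
            pr[d] = sum(pr[u] / len(graph[u])
                        for u in graph if d in graph[u] and u in pr)
        return pr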
Example Drop dead ends; compute PageRank on the reduced graph. Restore C, then restore E (single parent, so E gets the same PageRank). The result is not a distribution (does not sum up to 1).
Spider Traps and Taxation Example: once a random surfer enters a spider trap, it never leaves, so the trap accumulates all of the PageRank.
Teleporting A random surfer has a small probability of jumping from any page to any page: $v' = \beta M v + (1-\beta)\,e/n$, where e is a vector of all 1's and $1-\beta$ is a small teleport probability (around 0.15-0.2). Even from dead ends and spider traps, there is always a probability to get out.
Example Assume β = 0.8
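A minimal sketch of the taxed iteration with β = 0.8 as in the example, assuming the column-stochastic M from the earlier sketch.

    import numpy as np

    def pagerank(M, beta=0.8, iters=50):
        n = M.shape[0]
        v = np.full(n, 1.0 / n)
        e = np.ones(n)
        for _ in range(iters):
            v = beta * (M @ v) + (1 - beta) * e / n   # v' = βMv + (1-β)e/n
        return v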
Using PageRank in a SE A secret formula for ranking pages in response to a query: term relevance, PageRank, and some 250 other properties of pages (Google).
Efficient Computation of PageRank
PageRank for a large graph Around 50 iterations of matrix-vector multiplication. MapReduce method: the transition matrix M is very sparse, so represent only the non-zero elements. Modify the MapReduce striping approach to reduce the amount of data passed from Map tasks to Reduce tasks.
Representing Transition Matrices 10B pages, about 10 links per page: only 1 in 10^9 entries is non-zero. Naively, a non-zero entry costs 4 bytes per coordinate index and 8 bytes for the value, 16 bytes in total. Better: list all non-zero entries by column, with a single integer for the number of non-zeros (the out-degree, which makes the value 1/k implicit) and 4 bytes for the row number of each non-zero entry.
Example Transition Matrix Representation
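As a rough illustration of this layout (a sketch, reusing the toy adjacency list from before): for each column, store the out-degree once and then only the row numbers of the non-zeros; the value 1/degree is implicit, so each entry costs about 4 bytes instead of 16.

    links = {0: [1, 2], 1: [2], 2: [0, 1, 3], 3: [2]}   # toy adjacency list

    columns = {j: (len(rows), rows) for j, rows in links.items()}
    # columns[2] == (3, [0, 1, 3]): out-degree 3, non-zero rows 0, 1, 3,
    # each standing for the implicit value 1/3 in column 2 of M.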
PageRank Iteration Using MapReduce One iteration: for small n, store the vector v in the main memory of each node. Map: $(i, j, m_{ij}) \to (i, m_{ij} v_j)$. Reduce: $(i, [m_{i1} v_1, \ldots, m_{in} v_n]) \to \sum_j m_{ij} v_j$. For large n: break M into vertical stripes and v into horizontal stripes, or break M into blocks and v into stripes (a toy simulation follows).
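The toy simulation below mimics one iteration in the Map/Reduce style using plain Python dicts in place of a real framework; the entry format and function name are mine.

    from collections import defaultdict

    def mapreduce_iteration(entries, v):
        # entries: (i, j, m_ij) triples for the non-zero matrix elements
        mapped = [(i, m_ij * v[j]) for (i, j, m_ij) in entries]   # Map
        acc = defaultdict(float)
        for i, term in mapped:                                    # Reduce
            acc[i] += term                    # x_i = sum_j m_ij * v_j
        return acc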
Topic-Sensitive PageRank
Motivation Search "jaguar": animal, automobile, Mac OS, ancient game console. If the SE can guess the topic, it can return more relevant results. Select a small number of topics and create a PageRank vector for each topic (e.g., the 16 top-level DMOZ categories). Detect the user's interest with respect to one of these topics.
Biased Random Walk Assume random surfers start only from a random sports page: the teleport set S is the set of sports pages. Usage: decide on topics; select a teleport set for each topic; find a way to decide which topic(s) are relevant to a query; use the appropriate PageRank vector (a sketch follows).
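The sketch below adapts the earlier taxed iteration so teleporting lands only in the topic's teleport set S; this follows the standard biased-teleport formulation, with names of my choosing.

    import numpy as np

    def topic_sensitive_pagerank(M, S, beta=0.8, iters=50):
        n = M.shape[0]
        e_S = np.zeros(n)
        e_S[list(S)] = 1.0              # indicator vector of teleport set S
        v = e_S / len(S)                # start surfers inside S
        for _ in range(iters):
            v = beta * (M @ v) + (1 - beta) * e_S / len(S)
        return v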
Link Spam
Architecture of a Spam Farm Spammers constantly try to improve the PageRank of their pages. The Web from the point of view of a spammer: inaccessible pages (e.g., amazon), accessible pages (e.g., blogs with comments), and the spammer's own pages (the spam farm).
Spam Farm Single target page and m supporting pages
Analysis of a Spam Farm Let x be the PageRank contributed by accessible pages: $x = \beta \sum_i p_i / k_i$, where $p_i$ is the PageRank of accessible page i and $k_i$ its number of out-links. Let y be the unknown PageRank of the target page. The PageRank of each supporting page is $\beta \frac{y}{m} + \frac{1-\beta}{n}$.
PageRank of Target Page The contribution x from outside, plus a contribution of $\beta\left(\beta \frac{y}{m} + \frac{1-\beta}{n}\right)$ from each of the m supporting pages, plus the contribution $\frac{1-\beta}{n}$ from teleporting surfers (which we ignore). Total: $y = x + \beta^2 y + \beta(1-\beta)\frac{m}{n}$. Solving: $y = \frac{x}{1-\beta^2} + c\,\frac{m}{n}$, where $c = \frac{\beta}{1+\beta}$.
Example Assume β = 0.85, so c ≈ 0.46 and $y \approx 3.6\,x + 0.46\,\frac{m}{n}$. The farm amplifies x, the contribution from outer pages, by 360%, and adds 46% of the fraction m/n of the Web that the farm occupies.
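A quick numeric check of these constants (a sketch; the rounding matches the example):

    beta = 0.85
    amplifier = 1 / (1 - beta**2)       # ~3.6: external contribution x amplified
    c = beta / (1 + beta)               # ~0.46: coefficient of m/n
    print(round(amplifier, 2), round(c, 2))   # prints: 3.6 0.46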
Combating Link Spam A battle between SEs, which detect spam-farm-like structures, and spammers, who invent new ones. TrustRank: a variation of topic-sensitive PageRank designed to lower the score of spam pages. Spam mass: identifies pages that are likely to be spam.
TrustRank Let the teleport set S be a set of pages that are considered trustworthy: spammers can't inject links into them (e.g., no talkbacks). Selecting trustworthy pages: human-selected pages, or pages from specific domains (.edu, .mil, .gov).
Spam Mass Measure the fraction of a page's PageRank that comes from spam. Compute its PageRank r and its TrustRank t; the spam mass is $\frac{r - t}{r}$. Not spam: negative or small positive spam mass. Spam: spam mass close to one (t is almost zero). A small sketch follows.
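The sketch below computes spam mass from hypothetical PageRank and TrustRank vectors; the numbers are made up for illustration.

    import numpy as np

    def spam_mass(r, t):
        return (r - t) / r              # close to 1 => likely spam

    r = np.array([0.30, 0.25, 0.45])    # hypothetical PageRank
    t = np.array([0.28, 0.02, 0.40])    # hypothetical TrustRank
    print(spam_mass(r, t))              # the middle page stands out (~0.92)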
Example Trustworthy pages B and D No spam pages
Hubs and Authorities
HITS Hyperlink-Induced Topic Search (HITS), probably used by the Ask.com search engine. Originally intended to help rank query results at query time, not as a pre-processing step like PageRank; here we apply it to the entire Web.
The Intuition Behind HITS Authorities: certain pages are valuable because they provide information about a topic. Hubs: other pages are valuable because they point to good pages about that topic. Example: the faculty homepage is a hub; the homepage of each course is an authority. Recursive definition: a page is a good hub if it links to good authorities, and a good authority if it is linked to by good hubs.
Formalizing Hubbiness and Authority Link matrix of the Web: $L_{ij} = 1$ if there is a link from page i to page j. Transpose $L^T$: $L^T_{ij} = 1$ if there is a link from j to i. $L^T$ is similar to the transition matrix M (but M has probabilities 1/k instead of 1's).
Scores Let h and a be score vectors for hubbiness and authority, respectively; scale each vector so its components sum to 1. Computation: $h = \lambda L a$ and $a = \mu L^T h$, with scaling constants λ and μ. Substituting: $h = \lambda \mu L L^T h$ and $a = \lambda \mu L^T L a$.
Computing $L L^T$ and $L^T L$ are much less sparse than L, so it is better to compute h and a by a true mutual recursion. Algorithm: compute $a = \mu L^T h$ and scale; compute $h = \lambda L a$ and scale; repeat until changes are small (a sketch follows).
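A minimal sketch of this mutual recursion on a toy link matrix, scaling each vector so its components sum to 1 (one common choice of scaling):

    import numpy as np

    def hits(L, iters=50):
        h = np.ones(L.shape[0])         # start with all hub scores equal
        for _ in range(iters):
            a = L.T @ h                 # authority from in-linking hubs
            a /= a.sum()                # scale a
            h = L @ a                   # hubbiness from linked authorities
            h /= h.sum()                # scale h
        return h, a

    L = np.array([[0, 1, 1],            # toy link matrix: L[i][j] = 1
                  [1, 0, 1],            # iff page i links to page j
                  [0, 1, 0]], dtype=float)
    hubs, auths = hits(L)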
Summary
Summary Term spam: inject terms and copy pages. PageRank and the transition matrix: page importance defined by a random surfer. Dead ends and spider traps: taxation/teleporting and removal of dead ends. Combating spam farms: TrustRank and spam mass. Topic-sensitive PageRank: teleport sets. Hubs and authorities: mutually recursive definition.