INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)


Introduction
Motivation: accurate web search
Spammers want you to land on their pages
Topics: Google's PageRank and its variants, TrustRank, Hubs and Authorities (HITS)
Link analysis: analyze the structure of a very large graph (the Web)

PageRank

Early Search Engines and Term Spam
Early search engines invented term search
  Crawl the Web
  Extract terms (e.g. words) from each page
  Create an inverted index (which terms appear in which pages)
Query processing
  Find all pages containing the query terms
  Rank pages according to importance/relevance
  E.g. a term in the title of a page is more important
Spammers invented term spam
  Add fake terms (in an invisible font)
  Run a popular query, see which page comes first, and copy it

Google Innovation
PageRank
  Simulate a random surfer who starts from a random page and follows random out-links
  Important pages have a large chance of being on the simulated random path
Both page importance and terms are used for ranking
  Terms around the link count as well
  The relevance of a page depends on the terms within the page and the terms around links to this page

Definition of PageRank
A function that assigns a real number to each page
More important pages get a higher PageRank
The Web is modeled as a directed graph (nodes are pages, edges are links)

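Transition Matrix
Entry $m_{ij}$ of the transition matrix $M$ is the probability of jumping from node $j$ to node $i$
Assume equal probability for each out-link (a page with $k$ out-links follows each one with probability $1/k$)
PageRank is a column vector $v$
  Component $v_i$ is the probability of being at node $i$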
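The construction above can be sketched in a few lines of Python. This is only an illustration: the four-page graph and the function name `transition_matrix` are assumptions, not taken from the slides, and the sketch assumes every link target is itself a key of the dictionary.

```python
def transition_matrix(links):
    """Return (nodes, M), where M[i][j] is the probability of moving from node j to node i."""
    nodes = sorted(links)
    index = {name: pos for pos, name in enumerate(nodes)}
    n = len(nodes)
    M = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            # A page with k out-links follows each of them with probability 1/k.
            M[index[target]][index[source]] = 1.0 / len(targets)
    return nodes, M


# Hypothetical four-page graph (an assumption for illustration).
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
nodes, M = transition_matrix(links)
for name, row in zip(nodes, M):
    print(name, [round(x, 3) for x in row])
```

Each column of `M` describes where a surfer on one page goes next, so every column of a graph without dead ends sums to 1.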

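Stable Distribution
Assume the initial probability of being at each node is the vector $v_0 = (1/n, 1/n, \dots, 1/n)^T$
Transition matrix $M$
What is the probability distribution after a single step?
  $x = M v_0$, that is, $x_i = \sum_j m_{ij} v_j$, where $v_j$ is the $j$-th component of $v_0$
After $k$ steps
  $x_k = M^k v_0 = M M \cdots M v_0$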

Markov Process
The distribution of being at node $i$ at step $k$ depends only on the distribution over the nodes at step $k-1$
A limiting distribution $v = Mv$ exists provided
  The graph is strongly connected (it is possible to get from any node to any node)
  There are no dead ends (nodes that have no arcs out)
The limiting distribution is an eigenvector of $M$

Principal Eigenvector
The transition matrix $M$ is stochastic (each column adds up to 1)
The limiting distribution is the principal eigenvector (associated with the largest eigenvalue, $\lambda = 1$)
  $Mv = \lambda v$
Computation: iterate by multiplying by the matrix $M$ until there is no significant change
  50-75 iterations for the Web
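A minimal sketch of this iteration in plain Python follows. The matrix is for a hypothetical strongly connected four-page graph (A links to B, C, D; B to A, D; C to A; D to B, C), and the tolerance and iteration cap are arbitrary assumptions.

```python
def pagerank_power_iteration(M, tol=1e-12, max_iter=1000):
    """Multiply by M repeatedly, starting from the uniform vector, until the change is tiny."""
    n = len(M)
    v = [1.0 / n] * n
    for step in range(1, max_iter + 1):
        nxt = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, v))
        v = nxt
        if change < tol:
            break
    return v, step


# Column-stochastic matrix of the hypothetical graph; rows and columns are ordered A, B, C, D.
M = [
    [0.0, 1 / 2, 1.0, 0.0],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 1 / 2, 0.0, 0.0],
]
v, steps = pagerank_power_iteration(M)
print([round(x, 4) for x in v], "after", steps, "iterations")
```

For this graph the vector approaches (3/9, 2/9, 2/9, 2/9), which satisfies $v = Mv$ and sums to 1.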

Example
[Slide figure: a transition matrix and the vectors produced by successive multiplications]

Structure of the Web
In practice, the Web is not a strongly connected graph

Structure of the Web
Large strongly connected component (SCC)
In-component
  Pages that can reach the SCC but are not reachable from the SCC
Out-component
  Pages reachable from the SCC but unable to reach the SCC
Two types of tendrils
  Tendrils leading out of the in-component
  Tendrils leading into the out-component
Tubes from the in-component to the out-component
Isolated components

Two General Problems
Dead ends
  Pages with no links out
Spider traps
  Groups of pages that have no links to any page outside the group
  Each page has out-links, but only within the group

Avoiding Dead Ends
With dead ends the transition matrix is not stochastic (a dead end gives an all-zero column)
  It is a substochastic matrix: column sums are at most 1
Increasing powers of $M$ lead to some or all elements of $M^k v$ going to zero
Example
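The leak can be observed numerically. The sketch below uses a hypothetical four-page graph in which page C is a dead end (the graph and the names are assumptions) and tracks how much total probability survives each step.

```python
def step(links, v):
    """One multiplication by the (substochastic) transition matrix of `links`."""
    nxt = dict.fromkeys(v, 0.0)
    for node, targets in links.items():
        for target in targets:
            nxt[target] += v[node] / len(targets)
    return nxt


# Hypothetical graph: C has no out-links, so its column of M is all zeros.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": [], "D": ["B", "C"]}
v = {node: 1.0 / len(links) for node in links}
totals = [sum(v.values())]
for _ in range(50):
    v = step(links, v)
    totals.append(sum(v.values()))

print("total probability after 1, 5 and 50 steps:", totals[1], totals[5], totals[50])
```

The probability sitting on C is lost at every step, so the total shrinks toward zero instead of staying at 1.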

Dropping Dead Ends
Drop dead ends and their incoming arcs from the graph
  Other nodes may become dead ends
  Drop recursively until no dead ends remain
Compute PageRank on the remaining graph
Restore the graph by adding the nodes back in the reverse order of removal
Computing PageRank for restored nodes
  Each parent with PageRank $p$ and $k$ out-links contributes $p/k$ to the restored node

Example
Drop dead ends
Compute PageRank on the reduced graph
Restore C:
Restore E: single parent, same PageRank
The result is not a distribution (it does not sum up to 1)
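The drop-and-restore procedure can be sketched as below. The five-page graph is an assumption chosen so that E is a dead end and C becomes one once E is removed; the function names and the fixed iteration count are also illustrative.

```python
def drop_dead_ends(links):
    """Repeatedly remove nodes without out-links; return (reduced graph, removal order)."""
    graph = {node: set(targets) for node, targets in links.items()}
    removed = []
    while True:
        dead = sorted(node for node, targets in graph.items() if not targets)
        if not dead:
            return graph, removed
        for node in dead:
            del graph[node]
            removed.append(node)
        for targets in graph.values():
            targets.difference_update(dead)  # drop arcs into removed nodes


def pagerank(graph, iterations=200):
    """Plain power iteration on a graph that has no dead ends."""
    rank = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        nxt = dict.fromkeys(graph, 0.0)
        for node, targets in graph.items():
            for target in targets:
                nxt[target] += rank[node] / len(targets)
        rank = nxt
    return rank


def restore(links, rank, removed):
    """Add removed nodes back in reverse order; each parent contributes p/k."""
    rank = dict(rank)
    for node in reversed(removed):
        # k is the parent's out-degree in the ORIGINAL graph.
        rank[node] = sum(rank[parent] / len(targets)
                         for parent, targets in links.items() if node in targets)
    return rank


# Hypothetical graph: E is a dead end, and C becomes one after E is dropped.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["E"], "D": ["B", "C"], "E": []}
reduced, removed = drop_dead_ends(links)
full = restore(links, pagerank(reduced), removed)
print(removed, {node: round(value, 4) for node, value in sorted(full.items())})
```

In this sketch the restored values are added on top of a distribution that already sums to 1, so the final numbers sum to more than 1, which matches the remark that the result is not a distribution.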

Spider Traps and Taxation Example

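Teleporting
A random surfer has a small probability of jumping from any page to any other page
  $v' = \beta M v + (1 - \beta) e / n$
  $e$ is a vector of all 1's, $n$ is the number of pages, and $1 - \beta$ is the small teleport probability (e.g. 0.15)
For dead ends
  There is always a probability of getting out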

Example Assume β = 0.8
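A small sketch of PageRank with teleporting, assuming the usual update $v' = \beta M v + (1 - \beta) e / n$ with $\beta = 0.8$ as the probability of following a link. The graph is hypothetical: page C links only to itself and so forms a one-page spider trap.

```python
def taxed_pagerank(links, beta=0.8, iterations=200):
    """Iterate v' = beta * M v + (1 - beta) * e / n starting from the uniform vector."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page first receives its share of the teleporting surfers.
        nxt = {node: (1.0 - beta) / n for node in nodes}
        for node, targets in links.items():
            for target in targets:
                nxt[target] += beta * rank[node] / len(targets)
        rank = nxt
    return rank


# Hypothetical graph with a spider trap at C.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["C"], "D": ["B", "C"]}
rank = taxed_pagerank(links)
print({node: round(value, 4) for node, value in rank.items()})
```

Without teleporting, C would eventually absorb all of the PageRank. With $\beta = 0.8$ it still ends up with the largest share (95/148), but the other pages keep a non-zero score.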

Using PageRank in a Search Engine
A secret formula ranks pages in response to a query
  Term relevance
  PageRank
  About 250 other properties of pages (Google)

Efficient Computation of PageRank

PageRank for a Large Graph
About 50 iterations of matrix-vector multiplication
MapReduce method
  The transition matrix $M$ is very sparse
  Represent only the non-zero elements
  Modify the MapReduce striping approach to reduce the amount of data passed from Map tasks to Reduce tasks

Representing Transition Matrices
10B pages with 10 links per page: only 1 in every 1B entries is non-zero
Coordinate list: 4 bytes per coordinate index and 8 bytes for the value
  Total of 16 bytes per non-zero entry
List all non-zero entries by column
  A single integer for the number of non-zero entries in the column
  4 bytes for the row number of each non-zero entry
  The value need not be stored: it is 1 divided by the number of non-zero entries in the column

Example Transition Matrix Representation
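One way to picture the column-wise representation is the sketch below, which stores only the out-degree and the destination rows for each column and multiplies a vector by the implied matrix. The graph and the function names are assumptions for illustration.

```python
def compact_columns(links):
    """Map each source page to (out-degree, destination rows); the value 1/out-degree is implied."""
    return {source: (len(targets), list(targets))
            for source, targets in links.items() if targets}


def multiply(columns, v):
    """Compute M v using only the compact column representation."""
    result = dict.fromkeys(v, 0.0)
    for source, (degree, rows) in columns.items():
        share = v[source] / degree  # every non-zero entry of this column equals 1/degree
        for row in rows:
            result[row] += share
    return result


# Hypothetical four-page graph.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
columns = compact_columns(links)
v = {node: 0.25 for node in links}
print(columns["A"])
print(multiply(columns, v))
```

Compared with the coordinate list, nothing is stored for the values themselves, which is where the space saving comes from.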

PageRank Iteration Using MapReduce
Each iteration is a matrix-vector multiplication
For small $n$, store the vector $v$ in the main memory of each node
  Map: $(i, j, m_{ij}) \to (i, m_{ij} v_j)$
  Reduce: $(i, [m_{i1} v_1, \dots, m_{in} v_n]) \to (i, \sum_j m_{ij} v_j)$
For large $n$
  Break $M$ into vertical stripes and $v$ into horizontal stripes
  Or break $M$ into blocks and $v$ into stripes
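The Map and Reduce steps for the small-$n$ case can be imitated on a single machine as follows. This is a toy simulation, not a real MapReduce job: the function names and the four-page matrix are assumptions, and the shuffle is replaced by a dictionary.

```python
from collections import defaultdict


def map_phase(entries, v):
    """Map: each non-zero entry (i, j, m_ij) emits the pair (i, m_ij * v_j)."""
    for i, j, m_ij in entries:
        yield i, m_ij * v[j]


def reduce_phase(pairs, n):
    """Reduce: sum all the terms that share the same row key i."""
    sums = defaultdict(float)
    for i, term in pairs:
        sums[i] += term
    return [sums[i] for i in range(n)]


# Non-zero entries (row, column, value) of a hypothetical 4x4 transition matrix.
entries = [
    (1, 0, 1 / 3), (2, 0, 1 / 3), (3, 0, 1 / 3),
    (0, 1, 1 / 2), (3, 1, 1 / 2),
    (0, 2, 1.0),
    (1, 3, 1 / 2), (2, 3, 1 / 2),
]
v = [0.25, 0.25, 0.25, 0.25]
new_v = reduce_phase(map_phase(entries, v), n=4)
print(new_v)
```

Here `v` is available to every Map call, which mirrors the assumption that the vector fits in main memory.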

Topic-Sensitive PageRank

Motivation
A search for "jaguar" may mean the animal, the automobile, a version of Mac OS, or an old game console
If the search engine can guess the topic, it can return more relevant results
Select a small number of topics
Create a PageRank vector for each topic (e.g. the 16 top-level DMOZ categories)
Detect the user's interest with respect to one of these topics

Biased Random Walk
Assume random surfers start only from a random sports page
  Teleport set $S$ of sports pages
Usage
  Decide on the topics
  Select a teleport set for each topic
  Find a way to decide on the topic(s) relevant to the query
  Use the appropriate PageRank vector
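A biased walk differs from ordinary teleporting only in where the teleporting surfers land. The sketch below assumes the update $v' = \beta M v + (1 - \beta) e_S / |S|$, where $e_S$ has 1's only for pages in the teleport set; the graph, the choice $S = \{B, D\}$ and $\beta = 0.8$ are illustrative assumptions.

```python
def topic_sensitive_pagerank(links, teleport_set, beta=0.8, iterations=200):
    """PageRank in which teleporting surfers land only on pages of `teleport_set`."""
    nodes = sorted(links)
    share = 1.0 / len(teleport_set)
    rank = {node: (share if node in teleport_set else 0.0) for node in nodes}
    for _ in range(iterations):
        # Teleport mass goes to the topic pages only.
        nxt = {node: ((1.0 - beta) * share if node in teleport_set else 0.0)
               for node in nodes}
        for node, targets in links.items():
            for target in targets:
                nxt[target] += beta * rank[node] / len(targets)
        rank = nxt
    return rank


# Hypothetical graph; B and D stand in for the pages of one topic.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
rank = topic_sensitive_pagerank(links, teleport_set={"B", "D"})
print({node: round(value, 4) for node, value in rank.items()})
```

Pages in the teleport set, and pages they link to, end up with higher scores than they would under uniform teleporting.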

Link Spam

Architecture of a Spam Farm
Spammers constantly try to improve the PageRank of their pages
The Web from the point of view of a spammer
  Inaccessible pages (e.g. Amazon)
  Accessible pages (e.g. blogs)
  Own pages (spam)

Spam Farm
A single target page and $m$ supporting pages

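Analysis of a Spam Farm
$x$: PageRank contributed by accessible pages
  $x = \sum_i \beta p_i / k_i$, where $p_i$ is the PageRank and $k_i$ the number of out-links of accessible page $i$
$y$: unknown PageRank of the target page
PageRank of each supporting page is
  $\beta y / m + (1 - \beta) / n$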

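PageRank of Target Page
Contribution $x$ from outside
Contribution of every supporting page: $\beta (\beta y / m + (1 - \beta) / n)$
Contribution from teleported surfers: $(1 - \beta) / n$ (negligible, ignore)
Total
  $y = x + \beta m (\beta y / m + (1 - \beta) / n) = x + \beta^2 y + \beta (1 - \beta) m / n$
Solve
  $y = x / (1 - \beta^2) + c \, m / n$, where $c = \beta (1 - \beta) / (1 - \beta^2) = \beta / (1 + \beta)$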

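Example
Assume $\beta = 0.85$, so $1 / (1 - \beta^2) = 3.6$ and $c = \beta / (1 + \beta) = 0.46$
  $y = 3.6 x + 0.46 \, m / n$
The farm amplifies $x$, the contribution of the outside pages, by 360%
It also gains 46% of $m / n$, the fraction of the Web that is in the spam farm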
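The two coefficients can be checked with a short calculation. The sketch assumes the closed form $y = x / (1 - \beta^2) + c \, m / n$ with $c = \beta / (1 + \beta)$; the function names and the sample values of $x$, $m$ and $n$ are made up for illustration.

```python
def spam_farm_coefficients(beta):
    """Return (amplification of x, coefficient c of m/n) for a spam farm."""
    amplification = 1.0 / (1.0 - beta ** 2)
    c = beta / (1.0 + beta)
    return amplification, c


def target_pagerank(x, m, n, beta=0.85):
    """PageRank y of the target page, ignoring its own tiny teleport share."""
    amplification, c = spam_farm_coefficients(beta)
    return amplification * x + c * m / n


amplification, c = spam_farm_coefficients(0.85)
print(round(amplification, 1), round(c, 2))
# Made-up inputs: external contribution x, m supporting pages, n pages on the Web.
print(target_pagerank(x=1e-9, m=1_000_000, n=10_000_000_000))
```

With $\beta = 0.85$ the rounded coefficients are 3.6 and 0.46, the values used in the example.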

Combating Link Spam
An ongoing battle: search engines detect spam-farm-like structures, and spammers invent new ones
TrustRank: a variation of topic-sensitive PageRank designed to lower the score of spam pages
Spam mass: identifies pages that are likely to be spam

TrustRank
Let the teleport set $S$ be a set of pages that are considered trustworthy
  Spammers can't inject spam links into them (e.g. pages with no talkbacks)
Selecting trustworthy pages
  Human-selected pages
  Pages from specific domains (.edu, .mil, .gov)

Spam Mass
Measures the fraction of a page's PageRank that comes from spam
  Compute the PageRank $r$
  Compute the TrustRank $t$
  The spam mass is $(r - t) / r$
Not spam: negative or small positive spam mass
Spam: spam mass close to one ($t$ is almost zero)

Example
Trustworthy pages: B and D
No spam pages
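The spam mass formula $(r - t) / r$ is simple enough to sketch directly. The PageRank and TrustRank values below are assumed numbers for a hypothetical four-page graph with trusted pages B and D; they are not read off the slide.

```python
def spam_mass(pagerank, trustrank):
    """Spam mass (r - t) / r for every page."""
    return {page: (pagerank[page] - trustrank[page]) / pagerank[page]
            for page in pagerank}


# Assumed scores for a hypothetical graph (illustrative only).
r = {"A": 3 / 9, "B": 2 / 9, "C": 2 / 9, "D": 2 / 9}
t = {"A": 54 / 210, "B": 59 / 210, "C": 38 / 210, "D": 59 / 210}
mass = spam_mass(r, t)
print({page: round(value, 3) for page, value in mass.items()})
```

The trusted pages get a negative spam mass and the other two get a small positive one, so under the rule above none of them would be flagged as spam.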

Hubs and Authorities

HITS
Hyperlink-induced topic search (HITS)
Probably used by the Ask.com search engine
Originally intended to help rank the results of a query
  Not a pre-processing step, as PageRank is
Here we apply it to the entire Web

The Intuition Behind HITS
Authorities: certain pages are valuable because they provide information about a topic
Hubs: other pages are valuable because they point to good pages about that topic
Example
  The homepage of a faculty is a hub
  The homepage of each course is an authority
Recursive definition
  A page is a good hub if it links to good authorities
  A page is a good authority if it is linked to by good hubs

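Formalizing Hubbiness and Authority
Link matrix of the Web $L$
  $L_{ij} = 1$ if there is a link from page $i$ to page $j$, and 0 otherwise
Transpose $L^T$
  $(L^T)_{ij} = 1$ if there is a link from page $j$ to page $i$
$L^T$ is similar to the transition matrix $M$ ($M$ has probabilities where $L^T$ has 1's)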

Scores
Let $h$ and $a$ be the score vectors for hubbiness and authority, respectively
  Scale each vector to sum to 1
Computation
  $h = \lambda L a$, $a = \mu L^T h$, with scaling constants $\lambda$ and $\mu$
Substitute
  $h = \lambda L \mu L^T h = \lambda \mu L L^T h$
  $a = \mu L^T \lambda L a = \lambda \mu L^T L a$

Computing
$L L^T$ and $L^T L$ are much less sparse than $L$
It is better to compute $h$ and $a$ by a true mutual recursion
Algorithm
  Compute $a = \mu L^T h$ and scale
  Compute $h = \lambda L a$ and scale
  Repeat until the changes are small
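The mutual recursion can be sketched as follows. The five-page graph, the fixed iteration count, and the choice to scale each vector so that it sums to 1 are assumptions made for this illustration.

```python
def scale(scores):
    """Scale a score vector so that its components sum to 1."""
    total = sum(scores.values())
    return {node: value / total for node, value in scores.items()}


def hits(links, iterations=100):
    """Alternate a = L^T h and h = L a, scaling after every step."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {node: 1.0 for node in nodes}
    auth = dict.fromkeys(nodes, 0.0)
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = dict.fromkeys(nodes, 0.0)
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = scale(auth)
        # Hubbiness of a page: sum of the authority scores of the pages it links to.
        hub = scale({node: sum(auth[t] for t in links.get(node, ()))
                     for node in nodes})
    return hub, auth


# Hypothetical graph with at least one link.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["E"], "D": ["B", "C"], "E": []}
hub, auth = hits(links)
print("hubs:       ", {n: round(s, 4) for n, s in hub.items()})
print("authorities:", {n: round(s, 4) for n, s in auth.items()})
```

Only sparse passes over the link lists are needed, so the denser products $L L^T$ and $L^T L$ are never formed.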

Summary

Summary
Term spam: inject terms and copy pages
PageRank and the transition matrix
  Page importance is defined by a random surfer
Dead ends and spider traps
  Taxation/teleporting and removal of dead ends
Combating spam farms
  TrustRank and spam mass
Topic-sensitive PageRank
  Teleport sets
Hubs and authorities
  Mutually recursive definition