Link Analysis Informa0on Retrieval. Evangelos Kanoulas

Similar documents
CS60092: Informa0on Retrieval

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Part 1: Link Analysis & Page Rank

Big Data Analytics CSCI 4030

Informa(on Retrieval

Big Data Analytics CSCI 4030

Today s lecture hypertext and links

Link Analysis and Web Search

Information Retrieval. Lecture 11 - Link analysis

Link Structure Analysis

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Link Analysis. Hongning Wang

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

CS425: Algorithms for Web Scale Data

Slides based on those in:

Lecture 8: Linkage algorithms and web search

PPI Network Alignment Advanced Topics in Computa8onal Genomics

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

Information Networks: PageRank

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Analysis of Large Graphs: TrustRank and WebSpam

An applica)on of Markov Chains: PageRank. Finding relevant informa)on on the Web

CS6200 Information Retreival. The WebGraph. July 13, 2015

Introduction to Data Mining

Lecture #3: PageRank Algorithm The Mathematics of Google Search

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture 27: Learning from relational data

World Wide Web has specific challenges and opportunities

How to organize the Web?

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Mining Web Data. Lijun Zhang

COMP Page Rank

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Search Engines. Informa1on Retrieval in Prac1ce. Annota1ons by Michael L. Nelson

CS47300 Web Information Search and Management

Social Networks Measures. Single- node Measures: Based on some proper7es of specific nodes Graph- based measures: Based on the graph- structure

Mining Web Data. Lijun Zhang

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

Unit VIII. Chapter 9. Link Analysis

Brief (non-technical) history

Introduction to Information Retrieval

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Informa(on Retrieval

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

CSCI 599 Class Presenta/on. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates

COMP 4601 Hubs and Authorities

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Introduction to Data Mining

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Link Analysis in Web Mining

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Recent Researches on Web Page Ranking

CS395T Visual Recogni5on and Search. Gautam S. Muralidhar

Query Sugges*ons. Debapriyo Majumdar Information Retrieval Spring 2015 Indian Statistical Institute Kolkata

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

SEMINAR: GRAPH-BASED METHODS FOR NLP

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

Advanced Computer Architecture: A Google Search Engine

TODAY S LECTURE HYPERTEXT AND

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Social Dynamics of Informa0on Kris0na Lerman USC Informa0on Sciences Ins0tute

Jeffrey D. Ullman Stanford University/Infolab

Bruno Martins. 1 st Semester 2012/2013

Information Networks: Hubs and Authorities

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Social Network Analysis

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

About the Course. Reading List. Assignments and Examina5on

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

Exploring both Content and Link Quality for Anti-Spamming

Searching the Web What is this Page Known for? Luis De Alba

Recommender Systems Collabora2ve Filtering and Matrix Factoriza2on

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Lec 8: Adaptive Information Retrieval 2

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck

Jeffrey D. Ullman Stanford University

Graph-Parallel Problems. ML in the Context of Parallel Architectures

Link Analysis. Paolo Boldi DSI LAW (Laboratory for Web Algorithmics) Università degli Studi di Milan

Oral Exams Dates. Distributed Data Management Summer Semester 2013 TU Kaiserslautern. Recap: Map and Reduce. (Equi) Join of 3 Rela9ons

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

Machine Learning Crash Course: Part I

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Corso di Biblioteche Digitali

Motivation. Motivation

Corso di Biblioteche Digitali

A Survey of Google's PageRank

CS 6604: Data Mining Large Networks and Time-Series

Transcription:

Link Analysis Informa0on Retrieval Evangelos Kanoulas e.kanoulas@uva.nl

How Search Works Logging Clicks Context Crawling Quality Freshness Spaminess Text processing & Indexing Ranking Algorithm Content Query understanding

Friendship Networks, Cita0on Networks, Link analysis studies the rela0onships (e.g., friendship, cita0on) between objects (e.g., people, publica0ons) to find out about their characteris0cs (e.g., popularity, impact) Social Network Analysis (e.g., on a friendship network) Closeness centrality of a person v is the frac0on of shortest paths between any two persons (u, w) that pass through v Bibliometrics (e.g., on a cita0on network) Co- cita0on measures how many papers cite both u and v Co- reference measures how many common papers both u and v refer to

, and the Web? World Wide Web can be seen as directed graph G(V, E) web pages correspond to ver0ces (or, nodes) V hyperlinks between them correspond to edges E Link analysis on the Web graph can give us clues about which web pages are important and should thus be ranked higher which pairs of web pages are similar to each other which web pages are probably spam and should be ignored

Link Analysis PageRank HITS Random Surfer Model, Markov Chains Hyperlinked- Induced Topic Search Topic- Specific and Personalized PageRank Biased Random Jumps, Linearity of PageRank Similarity Search SimRank, Random Walk with Restarts Spam Detec0on LinkSpam, TrustRank, SpamRank Social Networks SocialPageRank, TunkRank

PageRank Hyperlinks can be interpreted as endorsements for the target web page In- degree as a measure of the importance/ authority/popularity of a web page v is easy to manipulate and does not consider the importance of the source web pages Larry Page & Sergey Brin PageRank considers a web page v important if many important web pages link to it Random surfer model follows a uniform random outgoing link with probability (1-ε) jumps to a uniform random web page with probability ε Intuition: Important web pages are the ones that are visited often

Stochas0c Processes & Markov Chains Discrete stochas0c process is a family of random variables {X t t 2 T } with T = {0, 1, 2, } as discrete 0me domain Stochas0c process is a Markov chain if P [X t = x X t 1 = w,..., X 0 = a] = P [X t = x X t 1 = w] holds, i.e., it is memoryless Markov chain is 0me- homogeneous if for all 0mes t P [X t+1 = x X t = w] =P [X t = x X t 1 = w] holds, i.e., transi0on probabili0es do not depend on 0me

State Space & Transi0on Probability Matrix State space of a Markov chain { X t t T} is the countable set S of all values that X t can assume X t : Ω S Markov chain is in state s at 0me t if X t = s Markov chain is finite if it has a finite state space If a Markov chain is finite and 0me- homogeneous, its transi0on probabili0es can be described as a matrix P = (p ij ) p ij = P [X t = j X t 1 = i] For S = n the transi0on probability matrix P is an n- by- n right- stochas0c matrix (i.e., its rows sum up to 1) 8i : X j p ij =1

Proper0es of Markov Chains State i is reachable from state j if there exists a n 0 such that (P n ) ij > 0 States i and j communicate if i is reachable from j and vice versa Markov chain is irreducible if all states i,j in S communicate Markov chain is posi0ve recurrent if the recurrence probability is 1 and the mean recurrence 0me is finite for every state i

Proper0es of Markov Chains Markov chain is aperiodic if every state I has period 1 Markov chain is ergodic if it is 0me- homogeneous, irreducible, posi0ve recurrent, and aperiodic The 1- by- n vector is the sta0onary state distribu0on of the Markov chain described by P if i 0 X i =1 P = π i is the limit probability that Markov chain is in state i 1/π i reflects the average 0me un0l the Markov chain returns to state i Theorem: If a Markov chain is finite and ergodic, then there exists a unique sta0onary state distribu0on

PageRank as a Markov Chain Random surfer model Follows a uniform random outgoing link with probability (1- ε) Jumps to a uniform random web page with probability ε Let A be the adjacency matrix of the Web graph, matrix T captures following of a uniform random outgoing link T ij = ( 1/out(i) (i, j) 2 E 0 otherwise 1 2 3 4 5 0 1 0 1 0 A = 0 0 1 1 0 1 0 0 0 0 T = 0 0 0 0 1 0 0 1 0 0 0 1/2 0 1/2 0 0 0 1 1/2 0 1/1 0 0 0 0 0 0 0 0 1/1 0 0 1/1 0 0

PageRank as a Markov Chain Random surfer model Follows a uniform random outgoing link with probability (1- ε) Jumps to a uniform random web page with probability ε Vector j captures jumping to a uniform random web page j i =1/ V Transi0on probability matrix of Markov chain then obtained as P =(1 )T + [1...1] T j

Compu0ng π Markov Chain Monte Carlo Sta0onary state distribu0on can be cast into a system of linear equa0ons Solu0on can be found, e.g., using Gauss elimina0on Sta0onary state probability distribu0on is the led eigenvector of the transi0on probability matrix P for eigenvalue λ = 1 P = Can be computed using the characteris0c polynomial (P I) =0

Compu0ng π (in prac0ce) Sta0onary state distribu0on is the limit distribu0on Idea: Compute k- step state probabili0es π k un0l they converge Power (itera0on) method Select arbitrarily ini0al state probability distribu0on π 0 Compute π k = π k- 1 P un0l they converge (e.g. π k - π k- 1 < ε) Power (itera0on) method basically simulates the Markov chain and is the method of choice in prac0ce when dealing with huge state spaces, exploi0ng that matrix- vector mul0plica0on is easy to parallelize

Link Analysis PageRank HITS Random Surfer Model, Markov Chains Hyperlinked- Induced Topic Search Topic- Specific and Personalized PageRank Biased Random Jumps, Linearity of PageRank Similarity Search SimRank, Random Walk with Restarts Spam Detec0on LinkSpam, TrustRank, SpamRank Social Networks SocialPageRank, TunkRank

Topic- specific and personalized PageRank PageRank produces one- size- fits- all ranking determined assuming uniform following of links and random jumps How can we obtain topic- specific (e.g., for Sports) or personalized (e.g., based on my bookmarks) rankings? bias random jump probabili0es (i.e., modify the vector j) bias link- following probabili0es (i.e., modify the matrix T) What if we do not have hyperlinks between documents? construct implicit- link graph from user behavior or document contents

Topic- specific PageRank Input: Set of topics C (e.g., Sports, Poli0cs, Food, ) Set of web pages S c for each topic c (e.g., from dmoz.org) Idea: Compute a topic- specific ranking for c by biasing the random jump in PageRank toward web pages S c of that topic P c =(1 )T + [1...1] T j c with jci = ( 1/ S c Method: Precompute topic- specific PageRank vectors π c Classify user query q to obtain topic probabili0es P[c q] Final importance score obtained as linear combina0on = X P [c q] c c2c i 2 S c 0 i/2 S c T. H. Haveliwala: Topic- Sensi,ve PageRank: A Context- Sensi,ve Ranking Algorithm for Web Search, TKDE 15(4):784-796, 2003

Personalized PageRank Idea: Provide every user with a personalized ranking based on her favorite web pages F (e.g., from bookmarks or likes) ( 1/ F i 2 F P F =(1 )T + [1...1] T j F with j Fi = 0 i/2 F Problem: Compu0ng and storing a personalized PageRank vector for every single user is too expensive G. Jeh and J. Widom: Scaling Personalized Web Search, KDD 2003

PageRank in prac0ce PageRank- style methods are used somewhat in ranking Google: probably more important in the early days than today Use case: index selec0on which pages to include in the index which pages to include in which 0er of the index which pages to update more oden Use case: spam figh0ng which sites to exclude from search

Link Analysis PageRank HITS Random Surfer Model, Markov Chains Hyperlinked- Induced Topic Search Topic- Specific and Personalized PageRank Biased Random Jumps, Linearity of PageRank Similarity Search SimRank, Random Walk with Restarts Spam Detec0on LinkSpam, TrustRank, SpamRank Social Networks SocialPageRank, TunkRank

Similarity Search How can we use the links between objects to figure out which objects are similar to each other? Not limited to the Web graph but also applicable to k- par0te graphs derived from rela0onal database (students, lecture, etc.) implicit graphs derived from observed user behavior word co- occurrence graphs Applica0ons: Iden0fica0on of similar pairs of objects (e.g., documents or queries) Recommenda0on of similar objects (e.g., documents based on a query)

SimRank Intui0on: Two objects are similar if similar objects point to them s(u, v) = C I(u) I(v) I(u) X i=1 I(v) X j=1 s(i i (u),i j (v)) with confidence constant C < 1, in- neighbors I(u) and I(v), and I i (u) and I j (v) as the i- th and j- th in- neighbor of u and v Example: Universi0es, Professors, Students U1 P1 P2 S1 S2 With C = 0.8: s(p1, P2) = 0.414 s(s1, S2) = 0.331 s(u1, P2) = 0.132 s(p1, S2) = 0.106 s(p2, S2) = 0.088 s(p2, S1) = 0.042 s(u1, S2) = 0.034

SimRank s (0) (u, v) = ( 1 u = v 0 otherwise Repeat un0l convergence: s (k+1) (u, v) = C I(u) I(v) I(u) X i=1 I(v) X j=1 s (k) (I i (u),i j (v)) u 6= v SimRank score s(u, v) can be interpreted as the expected number of steps that it takes two random surfers to meet if they start at nodes u and v walk the graph backwards in lock step (i.e., their steps are synchronous) G. Jeh and J. Widom: SimRank: A Measure of Structural- Contextual Similarity, KDD 2002

Random Walks on Click Graph Consider bi- par0te click graph with queries and documents as ver0ces and weighted edges (q, d) indica0ng users tendency to click on document d for query q giant panda panda bear 2.0 1.0 1.0 d1 d2 2.0 Perform PageRank- style random walk with link- following probabili0es propor0onal to edge weights and random jump to single query or document fiat panda 1.0 d3 k= Applica0ons: query- to- document search query- to- query sugges0on document- by- query annota0on document- to- document sugges0on Annota0on using a random walk: P Query 0.075 boxer dog puppies 0.066 boxer puppy pics 0.060 boxer puppies puppy 0.056 boxer 0.056 boxer puppy pictures 0.049 boxer pups 0.049 boxer puppy 0.038 puppy boxers 0.034 boxer pup 0.030 baby boxer Distance 3 3 1 3 3 3 3 5 3 3

Random Walks on Query Flow Graph Consider query- flow graph with queries as ver0ces and weighted edges (q, q ) reflec0ng how oden q is issued ader q Recommend related queries by performing PageRank- style random walk on the query- flow graph with link- following probabili0es propor0onal to edge weights and random jump to current query (or last few queries) 2.0 endangered species 1.0 panda bear 3.0 giant panda banana apple banana beatles apple beatles banana apple usb no banana cs giant chocolate bar where is the seed in anut banana shoe fruit banana banana cloths eating bugs banana eating bugs banana holiday opening a banana banana shoe fruit banana recipe 22 feb 08 banana jules oliver banana cs banana cloths beatles apple apple ipod scarring srg peppers artwork ill get you bashles dundee folk songs the beatles love album place lyrics beatles beatles scarring paul mcartney yarns from ireland statutory instrument A55 silver beatles tribute band beatles mp3 GHOST S ill get you fugees triger finger remix P. Boldi, F. Bonchi, C. CasYllo, D. Donato, A. Gionis, and S. Vigna: The Query- flow Graph: Model and ApplicaTons, CIKM 2008

Link Analysis PageRank HITS Random Surfer Model, Markov Chains Hyperlinked- Induced Topic Search Topic- Specific and Personalized PageRank Biased Random Jumps, Linearity of PageRank Similarity Search SimRank, Random Walk with Restarts Spam Detec0on LinkSpam, TrustRank, SpamRank Social Networks SocialPageRank, TunkRank

Spam Detec0on Discoverability of web pages has oden a direct impact on the commercial success of the business behind them Search Engine Op0miza0on (SEO) seeks to op0mize web pages to make them easier to discover for poten0al customers white hat (op0mizes for the user and respects to search engine policies) black hat (manipulates search results by web spamming techniques) Web spamming techniques and search engines evolved in parallel ini0ally: only content spam, then: link spam, now: social media spam

Content Spam vs. Link Spam vs. Content Hiding Content spam keyword stuffing add unrelated but oden- sought keywords to page invisible content unrelated content invisible to users (like this) Link spam link farms collec0on of pages aiming to manipulate PageRank honeypots create valuable web pages with links to spam page link hijacking leave comments on reputable web pages or blogs Content hiding cloaking show different content to search engine s crawler and users

TrustRank, BadRank Idea: Pages linked to by trustworthy pages tend to be trustworthy TrustRank performs PageRank- style random walk with random jumps only to an explicitly selected set of trusted pages T Idea: Pages linking to spam pages tend to be spam themselves BadRank performs PageRank- style backwards random walk (i.e., following incoming links) with random jumps only to an explicitly selected set of blacklisted pages B Problems: Sets of trusted and blacklisted pages are difficult to maintain TrustRank and BadRank scores are hard to interpret and combine

Spam, Damn Spam and Sta0s0cs Idea: Look for sta0s0cal devia0on Content spam: Compare word frequency distribu0on to distribu0on in good hosts Link spam: Iden0fy outliers in out- degree and in- degree distribu0ons and inspect intersec0on 1E+8 1E+9 1E+7 1E+8 1E+6 1E+7 1E+6 Number of pages 1E+5 1E+4 1E+3 Number of pages 1E+5 1E+4 1E+3 1E+2 1E+2 1E+1 1E+1 1E+0 1E+0 1E+1 1E+2 1E+3 Out-degree 1E+4 1E+5 1E+6 1E+0 1E+0 1E+1 1E+2 1E+3 1E+4 1E+5 In-degree 1E+6 1E+7 1E+8

SpamMass Idea: Measure spam mass as the amount of PageRank score that a web page receives from web pages known to be spam Assume that web pages are parti0oned into good pages V + and bad pages V - and that a good core C V + is known Absolute spam mass of page p is then es0mated as SM (p) = π(p) π C (p) with π(p) as its PageRank score and πc (p) as its PageRank score with random jumps only to pages in the good core Rela0ve spam mass of page p is rsm (p) = SM (p)/π(p)

Link Analysis PageRank HITS Random Surfer Model, Markov Chains Hyperlinked- Induced Topic Search Topic- Specific and Personalized PageRank Biased Random Jumps, Linearity of PageRank Similarity Search SimRank, Random Walk with Restarts Spam Detec0on LinkSpam, TrustRank, SpamRank Social Networks SocialPageRank, TunkRank

Social Networks Social networks diverse rela0ons (e.g., friendship, liking, check- in, following) between diverse types of objects (e.g., people, en00es, posts, images, videos) Folksonomies (~ folk + taxonomy) allow users to organize objects by tagging no centrally controlled vocabulary Link analysis methods give insights into importance and similarity of objects (e.g., for ranking or recommenda0on)

TunkRank Idea: Measure a Twi}er user s influence as the expected number of people who will read a tweet (including re- tweets) by the user Considers Twi}er s follower graph G(V, E) consis0ng of users as ver0ces V and directed edges E with edge (i, j) indica0ng that user i follows user j Assump0ons: if i follows j, i reads tweet by j with probability 1 / out(i) constant re- twee0ng probability p r(j) = X (i,j)2e (1 + p r(i)) out(i) D. Tunkelang: A TwiGer Analog to PageRank, 2009 h\p://thenoisychannel.com/2009/01/13/a- twi\er- analog- to- pagerank/

Twi}erRank Considers Twi}er s follower graph G(V, E) consis0ng of users as ver0ces V and directed edges E with edge (i, j) indica0ng that user i follows user j PageRank- style random walk with link- following probabili0es T ij = ( N j P(i,k)2E N k sim(i, j) (i, j) 2 E 0 otherwise with N i as the number of tweets published by user i and sim(i, j) reflec0ng similarity between tweets by i and j J. Weng, E.- P. Lim, J. Jiang, and Q. He: TwiGerRank: finding topic- sensi,ve influen,al twigerers, WSDM 2010

Take- Home message Link analysis methods can be used to measure importance and similarity with applica0ons in search and recommenda0on is performed on the basis of finite and ergodic Markov Chains theory with computa0ons easily parallelizable MapReduce