Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Similar documents
10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

COMP5331: Knowledge Discovery and Data Mining

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Link Analysis and Web Search

Information Retrieval. Lecture 11 - Link analysis

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Extracting Information from Complex Networks

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

TODAY S LECTURE HYPERTEXT AND

Social Network Analysis

Recent Researches on Web Page Ranking

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Bruno Martins. 1 st Semester 2012/2013

Lecture 8: Linkage algorithms and web search

CSI 445/660 Part 10 (Link Analysis and Web Search)

Lec 8: Adaptive Information Retrieval 2

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

CS6200 Information Retreival. The WebGraph. July 13, 2015

Authoritative Sources in a Hyperlinked Environment

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

Link Analysis. Hongning Wang

Learning to Rank Networked Entities

COMP Page Rank

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Information Networks: PageRank

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Searching the Web [Arasu 01]

Information Retrieval and Web Search Engines

PageRank and related algorithms

Web Structure Mining using Link Analysis Algorithms

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

COMP 4601 Hubs and Authorities

Bibliometrics: Citation Analysis

Discovering Latent Graphs with Positive and Negative Links to Eliminate Spam in Adversarial Information Retrieval

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

How to organize the Web?

Social Network Analysis

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Multimedia Content Management: Link Analysis. Ralf Moeller Hamburg Univ. of Technology

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

On Finding Power Method in Spreading Activation Search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity

Social Network Analysis

On Page Rank. 1 Introduction

Collaborative filtering based on a random walk model on a graph

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

An Improved Computation of the PageRank Algorithm 1

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Mining Web Data. Lijun Zhang

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

Information Retrieval and Web Search

Information Networks: Hubs and Authorities

Brief (non-technical) history

Link analysis in web IR CE-324: Modern Information Retrieval Sharif University of Technology

Ranking on Data Manifolds

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Information retrieval. Lecture 9

Link Analysis Ranking Algorithms, Theory, and Experiments

A ew Algorithm for Community Identification in Linked Data

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Link Analysis in Web Mining

Motivation. Motivation

Experimental study of Web Page Ranking Algorithms

CS60092: Informa0on Retrieval

Big Data Analytics CSCI 4030

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

The application of Randomized HITS algorithm in the fund trading network

Introduction to Data Mining

Information retrieval

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Personalized Information Retrieval

Degree Distribution: The case of Citation Networks

Lecture 17 November 7

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Link Structure Analysis

CS425: Algorithms for Web Scale Data

Introduction to Information Retrieval

Temporal Graphs KRISHNAN PANAMALAI MURALI

Link analysis in web IR CE-324: Modern Information Retrieval Sharif University of Technology

Using Non-Linear Dynamical Systems for Web Searching and Ranking

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture November, 2002

Measuring Similarity to Detect

CS6322: Information Retrieval Sanda Harabagiu. Lecture 10: Link analysis

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Part 1: Link Analysis & Page Rank

Transcription:

Generalized Social Networs Social Networs and Raning Represent relationship between entities paper cites paper html page lins to html page A supervises B directed graph A and B are friends papers share an author A and B are co-worers undirected graph 1 2 Hypertext document or part of document lins to other parts or other documents construct documents of interrelated pieces relate documents to each other pre-dates Web Web iller app. 3 How use lins to improve information search? use structure to compute score for raning include more objects to ran redefines satisfying of query? add to the content of a document ² can deal with objects of mixed types images, PDF, 4 1

Scoring using structure Ideas 1. lin to object suggests it valuable object 2. distance between objects in graph represents degree of relatedness reachable by all in 2 lins Pursuing lining and value Intuition: when Web page points to another Web page, it confers status/authority/ popularity to that page Find a measure that captures intuition Not just web lining Citations in boos, articles others? edge node 5 6 Indegree Indegree = number of lins into a node Most obvious idea: higher indegree => better node Doesn t wor well Need some feedbac in system Leads us to Page and Brin s PageRan PageRan Algorithm that gave Google the leap in quality lin structure centerpiece of scoring Framewor Given a directed graph with n nodes Assign each node a score that represents its importance in structure: PageRan: pr(node) edge pr(node) node 7 8 2

Conferring importance Core ideas: Ø A node should confer some of its importance to the nodes to which it points If a node is important, the nodes it lins to should be important Ø A node should not transfer more importance than it has 9 Attempt 1 Refer to nodes by numbers 1,, n (arbitrary numbering) Let t i denote the number of edges out of node i (outdegree) Node i transfers 1/t i of its importance on each edge out of it Define pr new () = i with edge from i to (pr(i) / t i ) Iterate until converges Problems Sins (nodes with no edges out) Cyclic behavior 4 pr(4) 1 1/2pr(2) 2 1/2pr(2) 3 sin 10 Attempt 2 Random wal model Attempt 1 gives movement from node to lined neighbor with probability 1/outdegree Add random jump to any node pr new () = α/n + (1-α) i with edge from i to (pr(i) / t i ) α parameter chosen empirically Brea cycles Escape from sins 4 pr(4) 1 1/2pr(2) 2 1/2pr(2) 3 sin 11 Normalized? Would lie 1 n (pr ()) = 1 Consider 1 n (pr new ()) = 1 n ( α/n + (1-α) i with edge from i to (pr(i) / t i ) ) (1) = 1 n ( α/n)+ 1 n ((1-α) i with edge from i to (pr(i) / t i )) * (2) = α + (1-α) 1 n i with edge from i to (pr(i) / t i ) (3) = α + (1-α) 1 i n with edge from i to (pr(i) / t i ) * (4) α + (1-α) 1 i n pr(i) with edge from i to (1/ t i ) * (5) = α + (1-α) i with edge from i pr(i) (6) *inner sum i over incoming edges for one *inner sum over outgoing edges for one i i 12 3

Problem for desired normalization Have 1 n (pr new ()) = α + (1-α) i with edge from i pr(i)) Missing pr(i) for nodes with no edges from them sins! Solution: add n edges out of every sin Edge to every node including self Gives 1/n contribution to every node Gives desired normalization: If 1 n (pr initial ()) = 1 then 1 n (pr()) = 1 4 1 3 2 sin 13 Matrix formulation Let E be the n by n adjacency matrix E(i,) = 1 if there is an edge from node i to node = 0 otherwise Define new matrix L: For each row i of E (1 i n) If row i contains t i >0 ones, L(i,)=(1/ t i ) E(i,), 1 n If row i contains 0 ones, L(i,) = 1/n, 1 n Vector pr of PageRan values defined by pr = (α/n, α/n, α/n) T +(1- α) L T pr has a solution representing the steady-state values pr() 14 Eigenvector Formulation pr = (α/n, α/n, α/n) T +(1- α) L T pr = (α/n) Jpr +(1- α) L T pr = ( (α/n) J + (1- α) L T ) pr = ( M ) pr Calculation Choices 1. pr = M pr : Find principle eigenvector of M solves n simultaneous equations: pr() = α/n + (1-α) 1 i n L(i,)pr(i) 2. Use iterative calculation - power method (where we started) Initialize pr initial () = 1/n for each node Until converges { J is the matrix of all 1 s For each node Jpr = (1, 1, 1) T because 1 n (pr()) = 1 pr new () = α/n + (1-α) 1 i n L(i,)pr(i) For each node pr is the principal eigenvector of M pr() = pr new () Av = λv, λ=1 15 } 16 4

Power method Convergence In practice choose convergence criterion e.g. stop iteration when Max n =1 ( pr_new() pr() ) < ε ε = 10-3? 10-4? 10-5? Choice α No single best value 1-α determines rate of convergence Second eigenvalue α = 0.15 common gives 10-4 accuracy in about 60 iterations regardless of size of graph [Chu; Wu] 17 PageRan Observations Can be calculated for any directed graph Google calculates on entire Web graph query independent scoring Huge calculation for Web graph precomputed 1998 Google published: 52 iterations for 322 million lins 45 iterations for 161 million lins PageRan must be combined with querybased scoring for final raning Many variations What Google exactly does secret Can mae some guesses by results 18 HITS Hyperlin Induced Topic Search Second well-nown algorithm By Jon Kleinberg while at IBM Almaden Research Center Same general goal as PageRan Distinguishes 2 inds of nodes Hubs: resource pages Point to many authorities Authorities: good information pages Pointed to by many hubs 19 Mutual reinforcement Authority weight node j: a(j) Vector of weights a Hub weight node j: h(j) Vector of weights h Update: a new () = i with edge from i to (h(i)) h new () = j with edge from to j (a(j)) i j 20 5

Mutual reinforcement Mutual reinforcement Authority weight node j: a(j) Vector of weights a Hub weight node j: h(j) Vector of weights h Update: a new () = i with edge from i to (h(i)) i h(i) Authority weight node j: a(j) Vector of weights a Hub weight node j: h(j) Vector of weights h Update: a new () = i with edge from i to (h(i)) i h new () = j with edge from to j (a(j)) j h new () = j with edge from to j (a(j)) a(j) j 21 22 Matrix formulation Steady state: a = E T h a = E T Ea h = Ea h = EE T h Interpretation? 23 E T (i,) 1 where è i E(,j) 1 where è j Loo inside E(i,) 1 where iè E T (,j) 1 where jè Row i of E T : Row i of E: 1 s where sè i 1 s where iè s Column j of E: Column j of E T 1 s where sè j 1 s where jè s E T E(i,j) is number EE T (i,j) is number of nodes pointing of nodes pointed to both i and j to by both i and j 24 6

Steady state: a = E T h h = Ea Matrix formulation a = E T Ea h = EE T h Interpretation: E T E(i,j): number nodes point to both node i and node j Co-citation EE T (i,j): number nodes pointed to by both node i and node j Bibliographic coupling Iterative Calculation a = h = (1,, 1) T While (not converged) { a new = E t h h new = Ea a = a new / a new normalize to unit vector h = h new / h new normalize to unit vector } Provable convergence by linear algebra 25 26 Use of HITS original use after find Web pages satisfying query: 1. Retrieve documents satisfy query and ran by termbased techniques 2. Keep top c documents: root set of nodes c a chosen constant - tunable 3. Mae base set: a) Root set b) Plus nodes pointed to by nodes of root set c) Plus nodes pointing to nodes of root set using lins to expand matches! 4. Mae base graph: base set plus edges from Web graph between these nodes 5. Apply HITS to base graph 27 Results using HITS Documents raned by authority score a(doc) and hub score h(doc) Authority score primary score for search results Heuristics: delete all lins between pages in same domain Keep only pre-determined number of pages lining into root set ( ~200) Findings (original paper) Number iterations in original tests ~50 most authoritative pages do not contain initial query terms 28 7

Observations HITS can be applied to any directed graph Base graph much smaller than Web graph Kleinberg identified bad phenomena Topic diffusion: generalizes topic when expand root graph to base graph example: want compilers - generalized to programming PageRan and HITS designed independently around 1997 indicates time was ripe for this ind of analysis lots of embellishments by others 29 30 Revisit: How use lins in raning documents? use structure to compute score for raning PageRan, HITS include more objects to ran saw in use of HITS use anchor text (HTML) anchor text labels lin include anchor text as text of document pointed to Anchor text HTML text: All assignments will be made available on <a href="https://piazza.com/princeton/spring2017/ cos435/home">the Piazza course account</a>. Renders as: All assignments will be made available on the Piazza course account. Anchor text: the Piazza course account is anchor text 31 32 8

Using anchor text homewor may not occur in content of doc b doc a homewor doc b Problem Set terms in doc b for building index: homewor: anchor problem: title 1 set: title 2 33 Summary Lin analysis a principal component of raning by modern Web search engines must be combined with content analysis Extend document content with lin info anchor text text of URLs e.g. princeton.edu, aardvarsportsshop.com Expand set of satisfying docs using lins less often used 34 Raning documents w.r.t. query lin analysis + doc. features Secret recipe anchor text query personal information historic information words in doc + word features scores of documents for query - use to ran 35 General Framewor Have set of n features (aa signals) to use in determining raning score Features depend on query: vector Ψ(d i,q) of feature values f for doc d i, query q eg tf.idf score is feature Features are conditioned to be comparable Have parameterized function to combine signals simple: linear α 0 + i=1 α i *(f i ) α i are adjustable weights - how choose? intuition experimentation machine learning n 36 9

Machine Learning Raning documents w.r.t. query Many possibilities overview of one Ordinal Regression Model Goal: get comparison of doc.s correct capture goal Let ω represent vector (α 1,, α n ) want ω T Ψ(d i,q) - ω T Ψ(d j,q) > 0 if and only if lin analysis + doc. features anchor text query personal information historic information words in doc + word features d i more relevant than d j for query q find ω that wors techniques train on nown correct data: humans ran a set of documents for various queries 37 Secret recipe scores of documents for query - use to ran 38 10