Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Similar documents
Link Analysis and Web Search

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

How to organize the Web?

Introduction to Data Mining

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Slides based on those in:

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Mining Web Data. Lijun Zhang

Unit VIII. Chapter 9. Link Analysis

Mining Web Data. Lijun Zhang

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Searching the Web What is this Page Known for? Luis De Alba

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Popularity of Twitter Accounts: PageRank on a Social Network

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Introduction to Data Mining

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

COMP 4601 Hubs and Authorities

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Information Networks: PageRank

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

COMP5331: Knowledge Discovery and Data Mining

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Link Analysis in Web Mining

CSE 494: Information Retrieval, Mining and Integration on the Internet

Data-Intensive Computing with MapReduce

Bruno Martins. 1 st Semester 2012/2013

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

An Improved Computation of the PageRank Algorithm 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CSI 445/660 Part 10 (Link Analysis and Web Search)

Adaptive methods for the computation of PageRank

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Administrative. Web crawlers. Web Crawlers and Link Analysis!

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS425: Algorithms for Web Scale Data

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

2.3 Algorithms Using Map-Reduce

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Information Retrieval Spring Web retrieval

Midterm Examination CSE 455 / CIS 555 Internet and Web Systems Spring 2009 Zachary Ives

Recent Researches on Web Page Ranking

CS 137 Part 4. Structures and Page Rank Algorithm

Link Structure Analysis

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

The PageRank Citation Ranking

Social-Network Graphs

Distributed computing: index building and use

Lecture 8: Linkage algorithms and web search

Information Retrieval. Lecture 11 - Link analysis

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Searching the Web for Information

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Part 1: Link Analysis & Page Rank

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Motivation. Motivation

Graph Algorithms. Revised based on the slides by Ruoming Kent State

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Efficient query processing

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Design of Alta Vista. Course Overview. Google System Anatomy

Measuring Similarity to Detect

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

University of Maryland. Tuesday, March 2, 2010

COMP Page Rank

PageRank and related algorithms

Link Analysis. Hongning Wang

I/O-Efficient Techniques for Computing Pagerank

Final Exam Search Engines ( / ) December 8, 2014

Google Scale Data Management

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Searching the Web [Arasu 01]

Information Retrieval May 15. Web retrieval

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

Graphs. Data Structures and Algorithms CSE 373 SU 18 BEN JONES 1

Web Structure Mining using Link Analysis Algorithms

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Lecture 17 November 7

Calculating Web Page Authority Using the PageRank Algorithm. Math 45, Fall 2005 Levi Gill and Jacob Miles Prystowsky

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Predictive Indexing for Fast Search

Advanced Database Systems

Knowledge Discovery and Data Mining 1 (VO) ( )

Python & Web Mining. Lecture Old Dominion University. Department of Computer Science CS 495 Fall 2012

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Transcription:

Link Analysis CSE 454 Advanced Internet Systems University of Washington 1/26/12 16:36 1

Ranking Search Results TF / IDF or BM25 Tag Information Title, headers Font Size / Capitalization Anchor Text on Other Pages Classifier Predictions Spam, Adult, Review, Celebrity, Link Analysis HITS (Hubs and Authorities) PageRank 1/26/12 16:36 2

Pagerank Intuition Think of Web as a big graph. Suppose surfer keeps randomly clicking on the links. Importance of a page = probability of being on the page Derive transition matrix from adjacency matrix Suppose N forward links from page P Then the probability that surfer clicks on any one is 1/N 1/26/12 16:36 3

Matrix Representation Let M be an N N matrix m uv = 1/N v if page v has a link to page u m uv = 0 if there is no link from v to u Let R 0 be the initial rank vector Let R i be the N 1 rank vector for i th iteration Then R i+1 = M R i A D B C A B C D M A B C D 0 0 0 ½ 0 0 0 ½ 1 1 0 0 0 0 1 0 R 0 ¼ ¼ ¼ ¼ 1/26/12 16:36 4

Problem: Page Sinks. Sink = node (or set of nodes) with no out-edges. Why is this a problem? A B C 1/26/12 16:36 5

Solution to Sink Nodes Let: (1-c) = chance of random transition from a sink. N = the number of pages K = 1 / N M * = cm + (1-c)K R i+1 = M * R i 1/26/12 16:36 6

Computing PageRank - Example M = A B C D A B C D 0 0 0 ½ 0 0 0 ½ 1 1 0 0 0 0 1 0 A D B C M * = 0.05 0.05 0.05 0.45 0.05 0.05 0.05 0.45 0.85 0.85 0.05 0.05 0.05 0.05 0.85 0.05 R 0 R 30 ¼ ¼ ¼ ¼ 0.176 0.176 0.332 0.316 1/26/12 16:36 7

What About Sparsity? M * = M* = cm + (1-c)K Ooops 0.05 0.05 0.05 0.45 0.05 0.05 0.05 0.45 0.85 0.85 0.05 0.05 0.05 0.05 0.85 0.05 K = 1 / N

Authority and Hub Pages (1) page is a good authority (with respect to a given query) if it is pointed to by many good hubs (with respect to the query). page is a good hub page (with respect to a given query) if it points to many good authorities (for the query). Good authorities & hubs reinforce 1/26/12 16:36 9

Authority and Hub Pages (2) Authorities and hubs for a query tend to form a bipartite subgraph of the web graph. hubs authorities (A page can be a good authority and a good hub) 1/26/12 16:36 10

Linear Algebraic Interpretation PageRank = principle eigenvector of M * in limit HITS = principle eigenvector of M * (M * ) T Where [ ] T denotes transpose 1 2 3 4 1 3 2 4 Stability Small changes to graph small changes to weights. Can prove PageRank is stable And HITS isn t T = 1/26/12 16:36 11

Stability Analysis (Empirical) Make 5 subsets by deleting 30% randomly 1 1 3 1 1 1 2 2 5 3 3 2 3 3 12 6 6 3 4 4 52 20 23 4 5 5 171 119 99 5 6 6 135 56 40 8 7 10 179 159 100 7 8 8 316 141 170 6 9 9 257 107 72 9 10 13 170 80 69 18 PageRank much more stable 1/26/12 16:36 12

Practicality Challenges M no longer sparse (don t represent explicitly!) Data too big for memory (be sneaky about disk usage) Stanford Version of Google : 24 million documents in crawl 147GB documents 259 million links Computing pagerank few hours on single 1997 workstation But How? Next discussion from Haveliwala paper 1/26/12 16:36 13

Efficient Computation: Preprocess Remove dangling nodes Pages w/ no children Then repeat process Since now more danglers Stanford WebBase 25 M pages 81 M URLs in the link graph After two prune iterations: 19 M nodes 1/26/12 16:36 14

Representing Links Table Stored on disk in binary format Source node (32 bit integer) 0 1 2 Outdegree (16 bit int) 4 3 5 Destination nodes (32 bit integers) 12, 26, 58, 94 5, 56, 69 1, 9, 10, 36, 78 Size for Stanford WebBase: 1.01 GB Assumed to exceed main memory (But source & dest assumed to fit) 1/26/12 16:36 15

Algorithm 1 dest node source node = dest links (sparse) source s Source[s] = 1/N while residual >τ { d Dest[d] = 0 while not Links.eof() { Links.read(source, n, dest 1, dest n ) for j = 1 n Dest[dest j ] = Dest[dest j ]+Source[source]/n } d Dest[d] = (1-c) * Dest[d] + c/n /* dampening c= 1/N */ residual = Source Dest /* recompute every few iterations */ Source = Dest } 1/26/12 16:36 16

Analysis If memory can hold both source & dest IO cost per iteration is Links Fine for a crawl of 24 M pages But web > 8 B pages in 2005 [Google] Increase from 320 M pages in 1997 [NEC study] If memory only big enough to hold just dest? Sort Links on source field Read Source sequentially during rank propagation step Write Dest to disk to serve as Source for next iteration IO cost per iteration is Source + Dest + Links But What if memory can t even hold dest? Random access pattern will make working set = Dest Thrash!!!.???? = 1/26/12 16:36 17 dest node source node dest links source

Block-Based Algorithm Partition Dest into B blocks of D pages each If memory = P physical pages D < P-2 since need input buffers for Source & Links Partition (sorted) Links into B files Links i only has some of the dest nodes for each source Specifically, Links i only has dest nodes such that DD*i <= dest < DD*(i+1) Where DD = number of 32 bit integers that fit in D pages source node dest node = dest links (sparse) source 1/26/12 16:36 18

Partitioned Link File Source node (32 bit int) 0 1 2 Outdegr (16 bit) 4 3 5 Num out (16 bit) 2 1 3 Destination nodes (32 bit integer) 12, 26 5 1, 9, 10 Buckets 0-31 0 1 2 4 3 5 1 1 1 58 56 36 Buckets 32-63 0 1 2 4 3 5 1 1 1 94 69 78 Buckets 64-95 1/26/12 16:36 19

Analysis of Block Algorithm IO Cost per iteration = B* Source + Dest + Links *(1+e) e is factor by which Links increased in size Typically 0.1-0.3 Depends on number of blocks Algorithm ~ nested-loops join 1/26/12 16:36 20

Comparing the Algorithms 1/26/12 16:36 21

Comparing the Algorithms 1/26/12 16:36 22

Adding PageRank to a SearchEngine Weighted sum of importance+similarity with query Score(q, d) = w*sim(q, p) + (1-w) * R(p), if sim(q, p) > 0 = 0, otherwise Where 0 < w < 1 sim(q, p), R(p) must be normalized to [0, 1]. 1/26/12 16:36 23

Summary of Key Points PageRank Iterative Algorithm Sink Pages Efficiency of computation Memory! Don t represent M* explicitly. Minimize IO Cost. Break arrays into Blocks. Single precision numbers ok. Number of iterations of PageRank. Weighting of PageRank vs. doc similarity. 1/26/12 16:36 24