Computing PageRank in Google with a Las Vegas Algorithm, and Reflections on Randomized Methods for Complex Systems

1 Computing PageRank in Google with a Las Vegas Algorithm, and Reflections on Randomized Methods for Complex Systems
Roberto Tempo, CNR-IEIIT, Consiglio Nazionale delle Ricerche, Politecnico di Torino, tempo@polito.it

2 List of Topics
o Randomization in CS and Math Finance
o The PageRank Problem in Google
o Monte Carlo and Las Vegas (Distributed) Algorithms
o PageRank Computation
o Ranking Scientific Journals
o Conclusions

4 Randomized Algorithm
Randomized Algorithm (RA): An algorithm that makes random choices during its execution to produce a result
Example: Matlab code
    set_r = 1:0.01:3;
    for k = 1:length(set_r)
        r = set_r(k);                       % assumed: r is the k-th value of set_r
        if (FRA(r) > 0)                     % FRA is not defined on the slide
            a_opt(k) = rand;                % the random choice
            a_lin(k) = (exp(1)/(exp(1)-1))*r;
            a_sqr(k) = r + sqrt(2*(r-1));
            a = a_sqr(k);
            a_sub(k) = (a/(a-1))*(r + log(a) - 1);
        end
    end

5 Randomized Algorithms (RAs) Randomized algorithms are frequently used in many areas of engineering, computer science, physics, finance and optimization, but their appearance in systems and control is mostly limited to Monte Carlo simulations

6 Randomized Algorithms (RAs)
o Computer science (RQS for sorting)
o Mathematics of finance (expected value computation)
o Distributed algorithms (PageRank in Google)
o Robotics (motion and path planning problems)
o Bioinformatics (string matching problems)
o Computer vision (computational geometry)

7 A Success Story: Randomization in Computer Science

8 A Success Story in CS Problem: Sorting N real numbers Algorithm: RandQuickSort (RQS) RQS is implemented in a C library of Linux for sorting numbers C.A.R. Hoare (1962) D.E. Knuth (1998)

9 A Success Story in CS
Problem: Sorting N real numbers
Algorithm: RandQuickSort (RQS)
RQS is implemented in a C library of Linux for sorting numbers
Sorting Problem: given N real numbers $x_1, x_2, x_3, x_4, x_5, x_6$ (the set $S_1$), sort them in increasing order

10 RandQuickSort (RQS) The idea is to divide the original set $S_1$ into two sets having (approximately) the same cardinality This requires finding the median of $S_1$ (which may be difficult) This operation is performed using randomization

11 RandQuickSort (RQS)
RQS is a recursive algorithm consisting of two phases
1. randomly select a number $x_i$ (e.g. $x_4$)
2. deterministic comparisons between $x_i$ and the other (N-1) numbers
The comparisons split $S_1$ into $S_2$ (numbers smaller than $x_4$) and $S_3$ (numbers larger than $x_4$)

12 RQS: Binary Tree Structure We use randomization at each step of the (binary) tree
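The two phases of RQS can be sketched in a few lines of Python. This is a minimal illustration and not the C library implementation mentioned on the slides; the function name `rand_quick_sort` and the sample numbers are made up for the example.

```python
import random

def rand_quick_sort(values, rng=random):
    """Sort real numbers with a randomly chosen pivot (the result is always correct)."""
    if len(values) <= 1:
        return list(values)
    pivot = rng.choice(values)                    # phase 1: random selection of x_i
    smaller = [v for v in values if v < pivot]    # phase 2: deterministic comparisons
    equal = [v for v in values if v == pivot]
    larger = [v for v in values if v > pivot]
    # Each recursive call is one node of the binary tree built by the algorithm.
    return rand_quick_sort(smaller, rng) + equal + rand_quick_sort(larger, rng)

numbers = [3.2, -1.0, 7.5, 0.4, 7.5, 2.1]         # hypothetical input
sorted_numbers = rand_quick_sort(numbers)
```

Only the running time depends on the random pivots; the sorted output is the same in every run.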

13 Running Time of RQS Because of randomization, running time may be different from one run of the algorithm to the next one RQS is very fast: Average running time is O(N log N) Major improvement compared to brute force approach O(N^2) when N = 2 M Average running time holds for every input with probability at least 1-1/N (i.e. it is highly probable)

14 (Quasi) Monte Carlo Methods for Computational Finance
QMC methods to estimate the price of collateralized mortgage obligations
The problem is to approximate the average mortgage
\[ \int_{[0,1]^n} f(u)\, du \]
with its empirical mean (based on random samples $u_i$)
\[ \frac{1}{N} \sum_{i=1}^{N} f(u_i) \]
Curse of dimensionality: n = 360!
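The empirical mean can be sketched as follows with plain pseudo-random (not quasi-random) samples. The integrand `f` below is a hypothetical stand-in chosen because its exact integral over the unit cube is 1/2; it is not the mortgage model of the slide.

```python
import random

def monte_carlo_mean(f, n, num_samples, seed=0):
    """Empirical mean (1/N) * sum_i f(u_i) with u_i uniform in [0,1]^n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        u = [rng.random() for _ in range(n)]   # one sample in the n-dimensional unit cube
        total += f(u)
    return total / num_samples

def f(u):
    # Hypothetical integrand: the average of the coordinates (exact integral = 1/2).
    return sum(u) / len(u)

estimate = monte_carlo_mean(f, n=360, num_samples=2000)
```

The cost grows linearly with the dimension n for a fixed number of samples, which is why sampling remains practical for n = 360 where grid-based integration is not.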

16 The PageRank Problem A Systems and Control Viewpoint The PageRank Problem is useful for developing novel ideas within systems and control

17 The PageRank Problem A Systems and Control Viewpoint Technical Tool: Theory of Stochastic Matrices

18 PageRank
PageRank is used in search engines to indicate the importance of the page currently visited
PageRank has broader utility than search engines
Used in various areas for ranking other objects
o scientific journals (Eigenfactor)
o e-commerce and e-consultancy
o cancer biology and protein detection
o ranking top tennis players (Jimmy Connors #1)
o ...

19 Website of Umberto Eco Looking for Umberto Eco in Google we find immediately the website

20 PageRank for Umberto Eco Using a PageRank checker we compute the PageRank of this website PageRank is Google's view of the importance of this page PageRank is a numerical value in the interval (0,1] which indicates the importance of the page you are visiting

21 Graphs Directed graph with nodes and links [Figure: directed graph with nodes 1 to 5]

22 Random Web Surfer Model Consider a set of pages (nodes) connected by directed communication links Web surfer moves along randomly following the hyperlink structure When arriving at a page with several outgoing links, one is chosen at random, then the random surfer moves to a new page, and so on

23 Random Web Surfer Model Web representation with incoming and outgoing links

25 Random Web Surfer Model Pick an outgoing link at random

26 Random Web Surfer Model Arriving at a new web page

27 Random Web Surfer Model Pick another outgoing link at random

31 Random Web Surfer Model If a page is important then it is visited more often... The time the random surfer spends on a page is a measure of its importance If important pages point to your page, then your page becomes important because it is visited often What is the probability that your page is visited? Need to rank the pages in order of importance for facilitating the web search
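The surfer model can be simulated directly by counting how often each page is visited. The sketch below assumes a hypothetical 4-page web in which every page has at least one outgoing link and every page can be reached from every other page; the dictionary `out_links` and the function name are illustrative only.

```python
import random

# Hypothetical web: page -> pages it links to.
out_links = {1: [2], 2: [3, 4], 3: [2, 4], 4: [1, 2, 3]}

def visit_frequencies(out_links, num_steps, seed=0):
    """Fraction of time a random surfer spends on each page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(out_links))        # start from a random page
    counts = {p: 0 for p in out_links}
    for _ in range(num_steps):
        page = rng.choice(out_links[page])      # pick an outgoing link at random
        counts[page] += 1
    return {p: c / num_steps for p, c in counts.items()}

freq = visit_frequencies(out_links, num_steps=200_000)
```

In this example page 1 is pointed to only by page 4, so it is visited least often, while page 2 receives links from all the other pages and is visited most often.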

32 Graph Representation Directed graph with nodes (pages) and links representing the web Graph is not necessarily strongly connected (from 5 you cannot reach other pages) Graph is constructed using crawlers, spiders, sniffers moving continuously along the web

33 Hyperlink Matrix For each node we count the number of outgoing links and normalize their sum to 1 Hyperlink matrix is a nonnegative (column) substochastic matrix
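As a minimal sketch of this construction, the code below builds the hyperlink matrix from a list of outgoing links and checks the column sums. The 5-page link structure is hypothetical (it is not the graph drawn on the slides); page 5 is given no outgoing links to show why the matrix is in general only substochastic.

```python
def hyperlink_matrix(out_links, n):
    """Entry (i, j) is 1/(number of outgoing links of page j) if page j links to page i."""
    A = [[0.0] * n for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            A[i - 1][j - 1] = 1.0 / len(targets)
    return A

def column_sums(A):
    n = len(A)
    return [sum(A[i][j] for i in range(n)) for j in range(n)]

# Hypothetical 5-page web; page 5 has no outgoing links.
out_links = {1: [2, 3], 2: [3], 3: [1, 4, 5], 4: [5], 5: []}
A = hyperlink_matrix(out_links, n=5)
sums = column_sums(A)
pages_without_links = [j + 1 for j, s in enumerate(sums) if s == 0.0]
```

Columns of pages with outgoing links sum to one; the column of a page without outgoing links is zero, so the matrix is substochastic rather than stochastic.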

34 Hyperlink Matrix [Matrix: hyperlink matrix A of the 5-page example graph. Entries not recoverable from the transcription.]

35 PageRank: Bringing Order to the Web
We rank n web pages in order of importance
PageRank x* of the hyperlink matrix A is defined as $x^* = A x^*$ where $x^* \in [0,1]^n$ and $\sum_i x_i^* = 1$
x* is a nonnegative unit eigenvector corresponding to the eigenvalue 1 of A
S. Brin, L. Page (1998); S. Brin, L. Page, R. Motwani, T. Winograd (1999)

36 PageRank: Bringing Order to the Web
We rank n web pages in order of importance
PageRank x* of the hyperlink matrix A is defined as $x^* = A x^*$ where $x^* \in [0,1]^n$ and $\sum_i x_i^* = 1$
x* is a nonnegative unit eigenvector corresponding to the eigenvalue 1 of A
The question is when x* exists and is unique

37 Issue of Black Holes First issue: We have black holes (pages having no outgoing link) Random surfer gets stuck when visiting a pdf file In this case the back button of the browser is used Mathematically, the hyperlink matrix is nonnegative and (column) substochastic Easy fix: Add new links to make the matrix stochastic

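38 Page 5 is a Black Hole [Matrix: hyperlink matrix A of the 5-page example, in which the column of page 5 is zero. Entries not recoverable from the transcription.]

39 Easy Fix: Add New Link We add a new outgoing link from page 5 to another page [Matrix: A with the added link. Entries not recoverable from the transcription.]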

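40 In General the Fix is not so Easy Page 5 has two incoming links [Figure: directed graph with pages 1 to 5]

41 In General the Fix is not so Easy We add an outgoing link from 5 to … [Matrix: A with the added link. Entries not recoverable from the transcription.]

42 In General the Fix is not so Easy or we add an outgoing link from 5 to 4? [Matrix: A with the alternative link. Entries not recoverable from the transcription.]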

43 Modified Hyperlink Matrix A solution may be to break page 5 into two pages 5a and 5b This artificially changes the number of pages (not only the number of links) The topology of the network changes [Figure: graph with page 5 split into 5a and 5b]

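44 Modified Hyperlink Matrix [Matrix: hyperlink matrix A of the modified graph with pages 1, 2, 3, 4, 5a, 5b. Entries not recoverable from the transcription.]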

45 Assumption: No Black Holes This is a web modeling problem Assume that there are no black holes This implies that A is a nonnegative stochastic matrix (instead of substochastic) having at least one eigenvalue equal to one Second issue: This eigenvalue is not necessarily unique

46 Teleportation Matrix Teleportation: After a while the random surfer gets bored and decides to jump over a long distance to another page not directly connected to the one currently visited The new page may be geographically located far away

47 Recall the Random Web Surfer Model Web representation with incoming and outgoing links

52 Teleportation Model We are teleported to a web page located far away

53 Random Web Surfer Model Again Pick another outgoing link at random

55 Teleportation Model Again We are teleported to another web page located far away

56 Convex Combination of Matrices
Teleportation model is represented as a convex combination of matrices A and S/n
$S = \mathbf{1}\mathbf{1}^T$ is a rank-one matrix with all entries equal to one, where $\mathbf{1}$ is the vector with all entries equal to one
Consider a matrix M defined as
\[ M = (1-m) A + \frac{m}{n} S, \qquad m \in (0,1) \]
where n is the number of pages
The value m = 0.15 is used at Google

57 Matrix M M is a convex combination of two nonnegative stochastic matrices and $m \in (0,1)$, therefore M is a positive stochastic matrix
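A small numerical check of this statement, under the assumption that A is a 4-page column-stochastic link matrix (the matrix below is used only as an illustration) and with m = 0.15; the function name `teleportation_matrix` is made up for the example.

```python
def teleportation_matrix(A, m=0.15):
    """M = (1 - m) * A + (m / n) * S, where S is the all-ones matrix."""
    n = len(A)
    return [[(1.0 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

# Illustrative 4-page stochastic link matrix (every column sums to one).
A = [[0, 0,   0,   1/3],
     [1, 0,   1/2, 1/3],
     [0, 1/2, 0,   1/3],
     [0, 1/2, 1/2, 0]]

M = teleportation_matrix(A)
is_positive = all(entry > 0 for row in M for entry in row)
col_sums = [sum(M[i][j] for i in range(len(M))) for j in range(len(M))]
```

Every entry of M is at least m/n, so M is positive even though A contains zeros, and each column still sums to one because both A and S/n are column stochastic.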

58 Properties of Matrix M and Perron Theorem Matrix M is primitive ($M^k$ is positive for some k) M is irreducible and the corresponding graph is strongly connected (every page is connected to every page) The eigenvalue 1 is a simple eigenvalue of maximum modulus The corresponding eigenvector is positive

60 Monte Carlo and Las Vegas Monte Carlo was invented by Metropolis, Ulam, von Neumann, Fermi, ... (Manhattan project) [Photos: Metropolis, Fermi, Ulam, Feynman, von Neumann] Las Vegas first appeared in computer science in the late seventies

61 Randomized Algorithm
Randomized Algorithm (RA): An algorithm that makes random choices during its execution to produce a result
Example: Matlab code
    set_r = 1:0.01:3;
    for k = 1:length(set_r)
        r = set_r(k);                       % assumed: r is the k-th value of set_r
        if (FRA(r) > 0)                     % FRA is not defined on the slide
            a_opt(k) = rand;                % the random choice
            a_lin(k) = (exp(1)/(exp(1)-1))*r;
            a_sqr(k) = r + sqrt(2*(r-1));
            a = a_sqr(k);
            a_sub(k) = (a/(a-1))*(r + log(a) - 1);
        end
    end

62 Monte Carlo Randomized Algorithm
Monte Carlo Randomized Algorithm (MCRA): A randomized algorithm that may produce incorrect results, but with bounded probability of error
\[ \mathrm{Prob}\{\text{error} > \epsilon\} < 2 e^{-2N\epsilon^2} \] (Hoeffding inequality)
where $\epsilon$ is the probabilistic accuracy of the estimate, N is the sample size (sample complexity) and e is the Euler number
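Solving the Hoeffding bound for N gives the sample size needed for a prescribed accuracy. The sketch below introduces a confidence parameter `delta` (an assumption of this example, not a symbol on the slide) as the largest acceptable probability of error, so that 2 exp(-2 N epsilon^2) <= delta whenever N >= ln(2/delta) / (2 epsilon^2). The bound applies to the empirical mean of samples taking values in [0,1].

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest N such that 2 * exp(-2 * N * epsilon**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

epsilon = 0.01     # probabilistic accuracy
delta = 0.001      # hypothetical confidence parameter
N = hoeffding_sample_size(epsilon, delta)
bound = 2.0 * math.exp(-2.0 * N * epsilon ** 2)
```

The required N grows as 1/epsilon^2 but only logarithmically in 1/delta, and it does not depend on the dimension of the problem.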

63 Example of Monte Carlo: Area/Volume Estimation Estimate the area of the red region: Generate N samples uniformly in the rectangle (hint: use Matlab rand) Count how many fall within the red region (M); estimated area = (M/N) x area of the rectangle
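A minimal sketch of this estimate follows. The red region of the slide is not available in the transcript, so a quarter disc of radius one (exact area pi/4) inside the rectangle [0,2] x [0,1] is used as a hypothetical stand-in.

```python
import random

def estimate_area(inside, x_range, y_range, num_samples, seed=0):
    """Monte Carlo area estimate: (hits / samples) * area of the sampling rectangle."""
    rng = random.Random(seed)
    (x0, x1), (y0, y1) = x_range, y_range
    hits = 0
    for _ in range(num_samples):
        x = rng.uniform(x0, x1)
        y = rng.uniform(y0, y1)
        if inside(x, y):
            hits += 1
    rectangle_area = (x1 - x0) * (y1 - y0)
    return rectangle_area * hits / num_samples

def in_quarter_disc(x, y):
    # Hypothetical region: quarter disc of radius 1 centred at the origin.
    return x * x + y * y <= 1.0

area = estimate_area(in_quarter_disc, (0.0, 2.0), (0.0, 1.0), num_samples=100_000)
```

The ratio of hits to samples estimates the fraction of the rectangle covered by the region, so it has to be scaled by the rectangle area unless that area equals one.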

64 Homework Using rand we can generate N samples in a rectangle Question: how can we generate N samples in a sphere? 1 step RACT: Randomized Algorithms Control Toolbox for Matlab

65 Las Vegas Randomized Algorithm Las Vegas Randomized Algorithm (LVRA): A randomized algorithm that always produces correct results, the only variation from one run to another is the running time Example: Randomized Quick Sort (RQS)

66 Example of Las Vegas: Discrete Random Variables Consider discrete random variables $q_1, q_2, \ldots, q_{10}$

69 Parallel and Distributed Simulations Random samples are independent identically distributed (i.i.d.) This approach leads to parallel and distributed simulations [Photos: IBM Blue Gene; Cray-1 vector processor]

71 PageRank Computation with Power Method
PageRank is computed with the power method
\[ x(k+1) = M x(k) \]
If $\sum_i x_i(0) = 1$, convergence is guaranteed by Perron Theorem because M is a primitive matrix: $x(k) \to x^*$ for $k \to \infty$
Remark: PageRank computation may be interpreted as finding the stationary distribution of a Markov Chain (steady-state probability that page i is visited is $x_i^*$)

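72 PageRank Computation with Power Method
\[ A = \begin{bmatrix} 0 & 0 & 0 & 1/3 \\ 1 & 0 & 1/2 & 1/3 \\ 0 & 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/2 & 0 \end{bmatrix} \]
[The slide also lists the value of m, the matrix M and the vector x*; their entries are not recoverable from the transcription.]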
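The power method can be sketched as follows for a small example. The 4-page link matrix and the value m = 0.15 are assumptions of this sketch (the numerical values shown on the slide did not survive transcription), and the stopping tolerance is an arbitrary choice.

```python
def power_method(M, tol=1e-12, max_iter=10_000):
    """Iterate x(k+1) = M x(k) from the uniform vector until the update is below tol."""
    n = len(M)
    x = [1.0 / n] * n                          # entries sum to one
    for _ in range(max_iter):
        x_next = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return x_next
        x = x_next
    return x

# Assumed 4-page stochastic link matrix and teleportation parameter.
A = [[0, 0,   0,   1/3],
     [1, 0,   1/2, 1/3],
     [0, 1/2, 0,   1/3],
     [0, 1/2, 1/2, 0]]
m = 0.15
n = len(A)
M = [[(1.0 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

x_star = power_method(M)
```

Since M is column stochastic, the entries of x(k) keep summing to one at every iteration, so no renormalization step is needed.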

73 Size of the Web The size of M is more than 8 billion (and it is increasing)! The PageRank computation requires … iterations This computation takes about a week and it is performed centrally at Google once a month More and more computing power is needed

74 Columbia River, The Dalles, Oregon Land is cheap Electricity is cheap Water of the river may be used as a cooling device

75 Randomized Decentralized Algorithms

76 Randomized Decentralized Algorithms for PageRank Computation Main idea: Develop Las Vegas randomized decentralized algorithms for computing PageRank H. Ishii, R. Tempo, Distributed Randomized Algorithms for the PageRank Computation, IEEE TAC, 2010

77 Decentralized Communication Protocol Gossip communication protocol: at time k randomly select page i (i=4) for PageRank update [Figure: graph with pages 1 to 4]

80 Decentralized Communication Protocol
Gossip communication protocol: at time k randomly select page i (i=4) for PageRank update
1. send PageRank value of page i to the outgoing pages that are linked to page i
2. request PageRank values from incoming pages that are linked to page i

81 Las Vegas Randomized Approach The pages taking action are determined via a stochastic process $\theta(k) \in \{1, \ldots, n\}$ $\theta(k)$ is assumed to be i.i.d. with uniform probability $\mathrm{Prob}\{\theta(k) = i\} = 1/n$ If at time k $\theta(k) = i$ then page i initiates PageRank update

82 Randomized Update Scheme
Consider the randomized update scheme
\[ x(k+1) = A_{\theta(k)}\, x(k) \]
where $A_{\theta(k)}$ are the decentralized link matrices (example next)
Study average properties of $A_{\theta(k)}$
Define the time average
\[ y(k) = \frac{1}{k+1} \sum_{i=0}^{k} x(i) \]

83-86 Decentralized Link Matrices (1 to 4)
The four slides build up the decentralized link matrices of the example step by step
\[ A = \begin{bmatrix} 0 & 0 & 0 & 1/3 \\ 1 & 0 & 1/2 & 1/3 \\ 0 & 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/2 & 0 \end{bmatrix} \]
\[ A_3 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1/2 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/3 \\ 0 & 0 & 1/2 & 2/3 \end{bmatrix} \qquad A_4 = \begin{bmatrix} 1 & 0 & 0 & 1/3 \\ 0 & 1/2 & 0 & 1/3 \\ 0 & 0 & 1/2 & 1/3 \\ 0 & 1/2 & 1/2 & 0 \end{bmatrix} \]
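A sketch of how such matrices can be generated, assuming the following construction rule: A_i keeps the i-th row and i-th column of A, the remaining diagonal entries are chosen so that every column sums to one, and all other entries are zero. The rule is an assumption of this sketch, and the 4-page matrix A is the example assumed throughout.

```python
def decentralized_link_matrix(A, i):
    """A_i for page index i (0-based): i-th row and column of A, diagonal 1 - A[i][l]."""
    n = len(A)
    Ai = [[0.0] * n for _ in range(n)]
    for l in range(n):
        Ai[i][l] = A[i][l]               # values requested from incoming pages
        Ai[l][i] = A[l][i]               # value sent to outgoing pages
        if l != i:
            Ai[l][l] = 1.0 - A[i][l]     # keeps column l stochastic
    return Ai

A = [[0, 0,   0,   1/3],
     [1, 0,   1/2, 1/3],
     [0, 1/2, 0,   1/3],
     [0, 1/2, 1/2, 0]]
n = len(A)
link_matrices = [decentralized_link_matrix(A, i) for i in range(n)]

# Average of the A_i under the uniform selection probability 1/n.
average = [[sum(Ai[p][q] for Ai in link_matrices) / n for q in range(n)] for p in range(n)]
```

Under this rule the average matrix equals (2/n) A + (1 - 2/n) I, which has the same eigenvector for the eigenvalue 1 as A; this is what makes the average properties of the scheme informative about PageRank.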

87 Modified Randomized Update Scheme
Recall: Need to work with positive stochastic matrices (teleportation model)
Consider the modified randomized update scheme
\[ x(k+1) = M_{\theta(k)}\, x(k) \]
where $M_{\theta(k)}$ are the modified decentralized link matrices computed as
\[ M_i = (1-r) A_i + \frac{r}{n} S, \qquad i = 1, 2, \ldots, n \]
and $r \in (0,1)$ is a design parameter
\[ r = \frac{2m}{n - mn + 2m} \]

88 Convergence Properties
Theorem (convergence properties): The time average y(k) of the modified randomized update scheme converges to PageRank x* in the mean-square error (MSE) sense
\[ E\left[ \| y(k) - x^* \|^2 \right] \to 0 \quad \text{for } k \to \infty \]
provided that $\sum_i x_i(0) = 1$
Proof: Based on the theory of ergodic matrices
H. Ishii, R. Tempo (2010)
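The convergence statement can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses an assumed 4-page link matrix, m = 0.15, decentralized link matrices built by keeping the i-th row and column of A with the remaining diagonal entries 1 - A[i][l], and a reference PageRank obtained with the power method.

```python
import random

A = [[0, 0,   0,   1/3],
     [1, 0,   1/2, 1/3],
     [0, 1/2, 0,   1/3],
     [0, 1/2, 1/2, 0]]
m = 0.15

def mat_vec(M, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]

def decentralized_link_matrix(A, i):
    # Assumed rule: i-th row and column of A, remaining diagonal 1 - A[i][l].
    n = len(A)
    Ai = [[0.0] * n for _ in range(n)]
    for l in range(n):
        Ai[i][l] = A[i][l]
        Ai[l][i] = A[l][i]
        if l != i:
            Ai[l][l] = 1.0 - A[i][l]
    return Ai

def pagerank_power(A, m, iters=200):
    # Reference value: x(k+1) = (1-m) A x(k) + (m/n) 1, valid because sum(x) = 1.
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = [(1.0 - m) * a + m / n for a in mat_vec(A, x)]
    return x

def randomized_time_average(A, m, num_steps, seed=0):
    n = len(A)
    r = 2.0 * m / (n - m * n + 2.0 * m)            # design parameter
    link_matrices = [decentralized_link_matrix(A, i) for i in range(n)]
    rng = random.Random(seed)
    x = [1.0 / n] * n                              # x(0), entries sum to one
    y = list(x)                                    # y(0) = x(0)
    for k in range(1, num_steps + 1):
        theta = rng.randrange(n)                   # page initiating the update
        x = [(1.0 - r) * a + r / n for a in mat_vec(link_matrices[theta], x)]
        y = [yj + (xj - yj) / (k + 1) for yj, xj in zip(y, x)]   # recursive average
    return y

x_star = pagerank_power(A, m)
y = randomized_time_average(A, m, num_steps=100_000)
max_error = max(abs(a - b) for a, b in zip(y, x_star))
```

The state x(k) itself keeps fluctuating with the random page selection; only the time average y(k) settles near x*, and different seeds give slightly different errors for the same number of steps.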

89 Comments
Time average y(k) can be computed recursively as a function of y(k-1)
Because of averaging, convergence rate is 1/k
Sparsity of the matrix A is preserved because
\[ x(k+1) = M_{\theta(k)}\, x(k) = (1-r) A_{\theta(k)}\, x(k) + \frac{r}{n} \mathbf{1} \]
where $\mathbf{1}$ is a vector with all entries equal to one
[Matrices A and M shown on the slide; entries not recoverable from the transcription.]

90 Extensions Stopping criteria to compute approximately PageRank Extensions to simultaneous update of multiple web pages Robustness for fragile links Convergence results for update schemes based on outgoing links (not incoming)

92 ISI Web of Knowledge

93 Ranking Journals: Impact Factor
Impact Factor IF
\[ IF_{2010} = \frac{\text{number of citations in 2010 of articles published in …}}{\text{number of articles published in …}} \]
Remark: Impact Factor is a flat criterion (it does not take into account where the citations come from)

94 ISI Web of Knowledge

95 Ranking Journals: Eigenfactor Eigenfactor EF Ranking journals using ideas from PageRank In Eigenfactor journals are considered influential if they are cited often by other influential journals What is the probability that a journal is cited? C. T. Bergstrom (2007)

96 Article Vector
Article vector v
\[ v_i = \frac{\text{number of articles published by journal } i \text{ in …}}{\text{number of articles published by all journals in …}} \]
$v_i$ is the fraction of all published articles coming from journal i during the period …
Article vector v is a stochastic vector
Total number of journals = 7611

97 Cross-citation Matrix - 1
$A_{ij}$ = number of citations in 2010 from journal j to articles published in journal i in …
Self-citations are omitted, so the diagonal entries of A are zero
Normalization to obtain a column substochastic matrix
\[ A_{ij} \leftarrow \frac{A_{ij}}{\sum_k A_{kj}} \] (note: if 0/0 we set $A_{ij} = 0$)

98 Cross-citation Matrix - 2
Black holes: journals that do not cite any other journal
Columns with all entries equal to zero: A is a substochastic matrix
Replace A with a stochastic matrix by introducing the article vector v in place of the zero columns

99 Journal Influence Vector
Consider the Eigenfactor equation (same form as the PageRank equation)
\[ M = (1-m) A + m\, v \mathbf{1}^T \]
x* is the journal influence vector (defined as PageRank for M) which provides the weights
Eigenfactor EF is defined as
\[ EF = 100\, \frac{A x^*}{\sum_i (A x^*)_i} \]

100 Comments - 1
In the Eigenfactor equation we replace the rank-one matrix $\frac{1}{n} S = \frac{1}{n} \mathbf{1}\mathbf{1}^T$ with the rank-one weighted matrix $v \mathbf{1}^T$
All entries of $\frac{1}{n} S$ are equal to 1/n, while every column of $v \mathbf{1}^T$ is equal to v
M is a positive stochastic matrix
Journal influence vector x* exists and is unique
It is the eigenvector corresponding to the simple eigenvalue 1 (from Perron Theorem)

101 Comments - 2
Journal influence vector x* is the steady-state fraction of time spent visiting each journal in M
Eigenfactor $EF_j$ of journal j is the percentage of the total weighted citations that journal j receives from all 7611 journals
\[ EF = 100\, \frac{A x^*}{\sum_i (A x^*)_i} \]
Weights given by x* (with no weights we obtain IF)
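The Eigenfactor steps can be put together in a short sketch. The citation counts and article counts below are invented for three hypothetical journals, m = 0.15 is assumed as in the PageRank case, and the final score is computed with the normalized cross-citation matrix before the black-hole fix; this last choice is an assumption, since the slides use the same symbol A for both matrices.

```python
def eigenfactor(citations, articles, m=0.15, iters=500):
    """citations[i][j]: citations from journal j to journal i. Returns (EF, x*)."""
    n = len(citations)
    total_articles = sum(articles)
    v = [a / total_articles for a in articles]                  # article vector
    # Cross-citation matrix: omit self-citations, normalize columns (0/0 -> 0).
    C = [[0.0 if i == j else float(citations[i][j]) for j in range(n)] for i in range(n)]
    col = [sum(C[i][j] for i in range(n)) for j in range(n)]
    A = [[C[i][j] / col[j] if col[j] > 0 else 0.0 for j in range(n)] for i in range(n)]
    # Black holes: zero columns are replaced by the article vector v.
    A_st = [[A[i][j] if col[j] > 0 else v[i] for j in range(n)] for i in range(n)]
    # Power method for M = (1-m) A_st + m v 1^T, using sum(x) = 1.
    x = list(v)
    for _ in range(iters):
        Ax = [sum(A_st[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [(1.0 - m) * Ax[i] + m * v[i] for i in range(n)]
    weighted = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    total = sum(weighted)
    return [100.0 * w / total for w in weighted], x

# Hypothetical data: journal 3 cites only itself, so it is a black hole.
citations = [[5, 10, 0],
             [20, 3, 0],
             [4, 6, 2]]
articles = [100, 50, 50]
ef, influence = eigenfactor(citations, articles)
```

The scores in `ef` are percentages that add up to 100, and `influence` is the journal influence vector used as the weights.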

102 PageRank and Twitter Followers

103 PageRank and Twitter Followers

104 References

105 PageRank Journal References
o H. Ishii, R. Tempo, E.-W. Bai, PageRank and Web Aggregation, IEEE TAC, summer 2012
o O. Fercoq, M. Akian, M. Bouhtou, S. Gaubert, PageRank, Markov Chains and Ergodic Control, IEEE TAC, to appear 2012
o W.-X. Zhao, H.-F. Chen, H.T. Fang, Almost-Sure Convergence of the Randomized Scheme, 2012
o C. de Kerchove, L. Ninove, P. Van Dooren, Influence of the Outlinks in PageRank, LAA, 2008
o A.V. Nazin, B.T. Polyak, Randomized Algorithm to Determine the Eigenvector of a Stochastic Matrix, Automation and Remote Control, 2011

106 References on Randomized Algorithms
o G. Calafiore, F. Dabbene, R. Tempo, Research on Probabilistic Design Methods, Automatica, 2011 [1304 downloads during March-December 2011, #2 in the ranking]
o F. Dabbene, R. Tempo, Probabilistic and Randomized Tools for Control Design, The Control Handbook (W. S. Levine Ed.), Taylor & Francis, 2010
o R. Tempo, G. Calafiore, F. Dabbene, Randomized Algorithms for Analysis and Control of Uncertain Systems, Springer-Verlag, London, 2005 (second edition to appear in 2012)

108 Where is the Beef? Where is the controller? Question that the late Peter Dorato posed several times when I was studying control problems without a controller

109 Consensus and PageRank Consider a set of agents each having a numerical value transmitted to the neighboring agents iteratively Consensus: All agents eventually reach a common value Consensus of multi-agent systems and PageRank computation share striking similarities

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems Roberto Tempo IEIIT-CNR Politecnico di Torino tempo@polito.it This talk The objective of this talk is to discuss

More information

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems Roberto Tempo IEIIT-CNR Politecnico di Torino tempo@polito.it This talk The objective of this talk is to discuss

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page Agenda Math 104 1 Google PageRank algorithm 2 Developing a formula for ranking web pages 3 Interpretation 4 Computing the score of each page Google: background Mid nineties: many search engines often times

More information

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) ' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy CENTRALITIES Carlo PICCARDI DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy email carlo.piccardi@polimi.it http://home.deib.polimi.it/piccardi Carlo Piccardi

More information

Mathematical Analysis of Google PageRank

Mathematical Analysis of Google PageRank INRIA Sophia Antipolis, France Ranking Answers to User Query Ranking Answers to User Query How a search engine should sort the retrieved answers? Possible solutions: (a) use the frequency of the searched

More information

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank Hans De Sterck Department of Applied Mathematics University of Waterloo, Ontario, Canada joint work with Steve McCormick,

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview

More information

PageRank and related algorithms

PageRank and related algorithms PageRank and related algorithms PageRank and HITS Jacob Kogan Department of Mathematics and Statistics University of Maryland, Baltimore County Baltimore, Maryland 21250 kogan@umbc.edu May 15, 2006 Basic

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 1, Ver. III (Jan.-Feb. 2017), PP 01-07 www.iosrjournals.org PageRank Algorithm Albi Dode 1, Silvester

More information

COMP Page Rank

COMP Page Rank COMP 4601 Page Rank 1 Motivation Remember, we were interested in giving back the most relevant documents to a user. Importance is measured by reference as well as content. Think of this like academic paper

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

Social Network Analysis

Social Network Analysis Social Network Analysis Giri Iyengar Cornell University gi43@cornell.edu March 14, 2018 Giri Iyengar (Cornell Tech) Social Network Analysis March 14, 2018 1 / 24 Overview 1 Social Networks 2 HITS 3 Page

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Pagerank Scoring. Imagine a browser doing a random walk on web pages: Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably

More information

Stochastic Simulation: Algorithms and Analysis

Stochastic Simulation: Algorithms and Analysis Soren Asmussen Peter W. Glynn Stochastic Simulation: Algorithms and Analysis et Springer Contents Preface Notation v xii I What This Book Is About 1 1 An Illustrative Example: The Single-Server Queue 1

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris Manning at Stanford U.) The Web as a Directed Graph

More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

Part 1: Link Analysis & Page Rank
