CS 6604: Data Mining Large Networks and Time-Series



1 CS 6604: Data Mining Large Networks and Time-Series
Soumya Vundekode
Lecture #12: Centrality Metrics
Prof. B Aditya Prakash

2 Agenda
- Link Analysis and Web Search
  - Searching the Web: The Problem of Ranking
  - Link Analysis using Hubs and Authorities
  - PageRank
- Block models and personalized PageRank (Isabel M. Kloumann, Johan Ugander, and Jon Kleinberg)
Vundekode 2017 CS 6604: DM Large Networks & Time-Series

3 SEARCHING THE WEB

4 Problem of Ranking
- No external database
- Ranking methods look at the Web itself

5 Search is a hard problem!
- In any setting, not just on the Web
- Keyword queries: the list is short and inexpressive
- Synonymy, polysemy
- Authoring style and vocabulary

6 Search on the Web
- Everyone is an author. Everyone is a searcher.
- New problems?
  - Diversity in authoring styles: no common criterion to rank
  - Diversity in searchers: specific category?
  - Dynamic Web content: NEWS!

7 Problem transformed!
- Scarcity -> Abundance
- Finding the most relevant results
- Solution? Ranking
- Understanding the network structure of Web pages

8 LINK ANALYSIS
- Essential for Ranking

9 Start from the right perspective!
- There is no point in looking inside a Web page to see how relevant it is to the query.
- The number of links to a page from other relevant pages reflects its relevance to the query better!
  - Shows the authority of a page on the topic
  - Links serve as implicit endorsements

10 Voting by In-Links
- Collect a large sample of pages relevant to the query
- Let them vote through their links
- Pick the page with the highest number of votes (in-degree)
- Does this work for all kinds of queries?
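A minimal Python sketch of voting by in-links, assuming the sample is given as a list of (source, target) link pairs. The page names and the `rank_by_in_links` helper are hypothetical, made up for illustration.

```python
from collections import Counter

def rank_by_in_links(links):
    """Count in-links (votes) per page; `links` is a list of (source, target) pairs."""
    votes = Counter(target for _source, target in links)
    # most_common() sorts pages by vote count, highest first
    return votes.most_common()

# Hypothetical sample of pages relevant to the query "newspapers"
sample_links = [
    ("list1", "NYTimes"), ("list2", "NYTimes"), ("list3", "NYTimes"),
    ("list1", "WSJ"), ("list2", "WSJ"),
    ("list1", "Yahoo"), ("list2", "Yahoo"), ("list3", "Yahoo"),
    ("list3", "USAToday"),
]
ranking = rank_by_in_links(sample_links)
print(ranking)
```

In this toy sample a general-purpose page collects as many votes as the top newspaper, which is why plain in-degree is not enough on its own.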

11 One-word query: newspapers
- Results: a mix of prominent newspapers AND pages that receive many in-links irrespective of the query (Yahoo!, Amazon, Facebook, ...)

12 In-Links Network for newspapers
- Results wanted

13 Lists of Links: New Approach
- Look for pages that compile lists of resources relevant to the topic
- A page's value as a list = sum of the votes received by all pages that it voted for
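The list value can be sketched in a few lines of Python, assuming the same (source, target) link-pair representation. The `list_values` helper and the page names are hypothetical.

```python
from collections import Counter

def list_values(links):
    """Value of each page as a list: sum of in-link votes of the pages it links to."""
    in_votes = Counter(target for _source, target in links)
    values = Counter()
    for source, target in links:
        values[source] += in_votes[target]
    return dict(values)

# Hypothetical links: hubA lists X and Y, hubB lists X, hubC lists Z
links = [("hubA", "X"), ("hubA", "Y"), ("hubB", "X"), ("hubC", "Z")]
values = list_values(links)
print(values)
```

Here X has two votes and Y and Z have one each, so hubA (which lists X and Y) scores highest.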

14 List-Finding Technique
- Better lists (a better sense of where the good results are)

15 Principle of Repeated Improvement
- Better lists: weigh their votes more heavily
- Re-compute votes
  - Weight of each page's vote = its value as a list
- Improves the scores of relevant results

16 Refined scores for newspapers

17 Hubs and Authorities
- Hubs for the query: the high-value lists
- Authorities for the query: the highly endorsed answers
- For each page p in the network, estimate its value as a potential hub and as a potential authority
  - Calculate auth(p) and hub(p) (initial values for both = 1)

18 Rules
- Authority Update Rule: for each page p, update auth(p) to be the sum of the hub scores of all the pages that point to it.
- Hub Update Rule: for each page p, update hub(p) to be the sum of the authority scores of all the pages that it points to.
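Written as formulas, the two rules read as follows. The arrow notation q -> p for "page q links to page p" is an assumption introduced here, not notation from the slides.

```latex
\mathrm{auth}(p) \leftarrow \sum_{q \,:\, q \to p} \mathrm{hub}(q),
\qquad
\mathrm{hub}(p) \leftarrow \sum_{q \,:\, p \to q} \mathrm{auth}(q)
```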

19 Note
- 1 application of the Authority Update Rule = voting by in-links
- 1 application of the Authority Update Rule + 1 application of the Hub Update Rule = the original list-finding technique

20 Principle of Repeated Improvement
- Start with all hub scores and authority scores equal to 1
- Choose the number of steps k
- Perform a sequence of k hub-authority updates
  - First apply the Authority Update Rule to the current set of scores
  - Then apply the Hub Update Rule to the resulting set of scores
- Normalize the scores
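The procedure above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the `hits` function name, the link-pair input format, and the choice to normalize so that each score set sums to 1 are assumptions, and the graph is assumed to contain at least one link.

```python
def hits(links, k):
    """Run k hub-authority updates on a list of (source, target) links."""
    pages = sorted({page for edge in links for page in edge})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority Update Rule: sum the hub scores of the pages pointing to p
        new_auth = {p: 0.0 for p in pages}
        for source, target in links:
            new_auth[target] += hub[source]
        # Hub Update Rule: sum the (freshly updated) authority scores of pages p points to
        new_hub = {p: 0.0 for p in pages}
        for source, target in links:
            new_hub[source] += new_auth[target]
        auth, hub = new_auth, new_hub
    # Normalize at the end so that each set of scores sums to 1
    auth_total = sum(auth.values())
    hub_total = sum(hub.values())
    auth = {p: v / auth_total for p, v in auth.items()}
    hub = {p: v / hub_total for p, v in hub.items()}
    return auth, hub

# Hypothetical example: h1 lists a and b, h2 lists only a
auth, hub = hits([("h1", "a"), ("h2", "a"), ("h1", "b")], k=1)
print(auth, hub)
```

For large k the raw scores grow quickly, so a practical variant would normalize after every round; because the updates are linear, that yields the same normalized result.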

21 Normalized and Re-weighted votes

22 k?
- Normalized values converge to limits as k grows
  - Skipping proof (Section 14.6, if interested)
- It is also proved that the limiting hub and authority values are a property purely of the link structure
- These limiting values correspond to a kind of equilibrium
  - Balance between hubs and authorities
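One way to see why the limits depend only on the link structure is the standard matrix form of the updates. The symbols below are notation assumed here, not taken from the slides: M is the adjacency matrix (M_ij = 1 if page i links to page j), and a^(k), h^(k) are the authority and hub vectors after k rounds, before normalization.

```latex
a^{(k)} = M^{\top} h^{(k-1)}, \qquad h^{(k)} = M\, a^{(k)}
\quad\Longrightarrow\quad
h^{(k)} = (M M^{\top})^{k}\, h^{(0)}, \qquad
a^{(k)} = (M^{\top} M)^{k-1} M^{\top} h^{(0)}
```

Repeated multiplication by a fixed symmetric matrix is what drives the normalized vectors toward a limit; under the conditions of the skipped proof, the limiting hub and authority vectors are principal eigenvectors of M M^T and M^T M, respectively.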

23 Limiting values for newspapers

24 PAGERANK

25 Intuition behind Hubs-Authorities
- Are auth and hub scores sufficient for all kinds of queries?
  - No! Only for queries with a commercial aspect.
- Why?
  - Competing firms don't link to each other.
  - The only way to connect them is through a set of hub pages that link to them all.
- Hubs play a powerful endorsement role without themselves being heavily endorsed.

26 Intuition behind PageRank
- Endorsement passes directly from one prominent page to another
- Nodes repeatedly pass endorsements across their outgoing links, with weights based on their current estimates of PageRank
- Endorsements eventually pool at the most relevant nodes

27 PageRank Update Rule
- Each page divides its current PageRank equally across its outgoing links and passes these equal shares to the pages it points to.
  - If a page has no outgoing links, it passes all its current PageRank to itself.
- Each page updates its new PageRank to be the sum of the shares it receives.
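A minimal Python sketch of one application of this rule, assuming the graph is a dict that maps each page to the list of pages it links to. The `basic_pagerank_step` name and the three-page graph are hypothetical; exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction

def basic_pagerank_step(out_links, rank):
    """Apply the PageRank Update Rule once and return the new values."""
    new_rank = {page: Fraction(0) for page in out_links}
    for page, targets in out_links.items():
        if not targets:
            # No outgoing links: the page passes all its PageRank to itself
            new_rank[page] += rank[page]
        else:
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
    return new_rank

# Hypothetical graph: A -> B, C; B -> C; C has no outgoing links
out_links = {"A": ["B", "C"], "B": ["C"], "C": []}
start = {page: Fraction(1, 3) for page in out_links}
after_one = basic_pagerank_step(out_links, start)
print(after_one)
```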

28 PageRank
- Network of n nodes: initial PageRank of each node = 1/n
- Choose the number of steps k
- Perform a sequence of k updates to the PageRank values using the PageRank Update Rule
- Note: total PageRank in the network = 1

29 Example

30 Initialize PageRank values
- 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8

31 Step 1
- 1/2, 1/16, 1/16, 1/16, 1/16, 1/16, 1/16, 1/8

32 Step 2
- 5/16, 1/4, 1/4, 1/32, 1/32, 1/32, 1/32, 1/16
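The example figure is not preserved in this text, so the eight-node graph below is an ASSUMED reconstruction (nodes A to H), chosen because it reproduces the step-1 values listed above. With that caveat, the steps can be replayed exactly in Python:

```python
from fractions import Fraction

# ASSUMED graph: not taken from the slides, picked to match the step-1 values
out_links = {
    "A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
    "D": ["A", "H"], "E": ["A", "H"],
    "F": ["A"], "G": ["A"], "H": ["A"],
}

def step(rank):
    """One application of the basic update rule (every node here has out-links)."""
    new_rank = {page: Fraction(0) for page in out_links}
    for page, targets in out_links.items():
        for target in targets:
            new_rank[target] += rank[page] / len(targets)
    return new_rank

rank0 = {page: Fraction(1, 8) for page in out_links}
rank1 = step(rank0)
rank2 = step(rank1)
for name, rank in (("step 1", rank1), ("step 2", rank2)):
    print(name, {page: str(value) for page, value in rank.items()})
```

On this assumed graph the values sum to 1 after every step, as the note on total PageRank requires, and node A holds 5/16 after step 2.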

33 k?
- PageRank values converge to limiting values as k grows
- These limiting values exhibit a kind of equilibrium
  - Values remain the same on applying one step of the PageRank Update Rule
- Unique set of equilibrium values for strongly connected networks
- Skipping proofs

34 Equilibrium PageRank Values

35 Problem?
- Slow leak: the wrong nodes might end up with all the PageRank!

36 Solution
- Scaling factor s, with 0 < s < 1
- Scaled PageRank Update Rule:
  - First apply the Basic PageRank Update Rule.
  - Then scale down all PageRank values by a factor of s. The total PageRank of the network is now s.
  - Divide the residual 1 - s equally over all nodes ((1 - s)/n to each node).
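A minimal Python sketch of the Scaled PageRank Update Rule, assuming a dict-of-lists graph. The `scaled_pagerank` name, the defaults s = 0.85 and k = 50, and the small "leaky" graph are illustrative assumptions.

```python
def scaled_pagerank(out_links, s=0.85, k=50):
    """Apply the Scaled PageRank Update Rule k times, starting from 1/n everywhere."""
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}
    for _ in range(k):
        # Basic PageRank Update Rule
        new_rank = {page: 0.0 for page in out_links}
        for page, targets in out_links.items():
            if targets:
                for target in targets:
                    new_rank[target] += rank[page] / len(targets)
            else:
                new_rank[page] += rank[page]
        # Scale by s, then spread the residual (1 - s) equally over all nodes
        rank = {page: s * value + (1 - s) / n for page, value in new_rank.items()}
    return rank

# Hypothetical graph with a slow leak: PageRank drains from A, B into the F-G cycle
leaky = {"A": ["B"], "B": ["A", "F"], "F": ["G"], "G": ["F"]}
unscaled = scaled_pagerank(leaky, s=1.0, k=200)  # s = 1 reduces to the basic rule
scaled = scaled_pagerank(leaky, s=0.85, k=200)
print(unscaled["A"], scaled["A"])
```

With s = 1 nodes A and B end up with essentially no PageRank, while with s = 0.85 every node keeps at least (1 - s)/n.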

37 k?
- PageRank values again converge to limiting values as k grows
- These limiting values exhibit a kind of equilibrium
  - Values remain the same on applying 1 step of the Scaled PageRank Update Rule
- Unique set of equilibrium values for every s, for any network
- In practice, s is usually chosen between 0.8 and 0.9
- The slow leak is more prominent on larger networks

38 Random Walks
- Claim: the probability of being at a page X after k steps of a random walk (start at a page chosen uniformly at random, then repeatedly follow a random outgoing link) is precisely the PageRank of X after k applications of the Basic PageRank Update Rule.
  - Skipping proof (Section 14.6, if interested)
- Try to think intuitively about the earlier network in which nodes F and G form a cycle.
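The claim can be checked empirically with a small simulation. This is a sketch under stated assumptions: the three-page graph, the function names, the number of walks, and the random seed are all hypothetical choices, and every page in the graph has at least one outgoing link.

```python
import random

# Hypothetical graph in which every page has at least one outgoing link
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(out_links)

def pagerank_after(k):
    """PageRank after k applications of the Basic PageRank Update Rule."""
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(k):
        new_rank = {page: 0.0 for page in pages}
        for page, targets in out_links.items():
            for target in targets:
                new_rank[target] += rank[page] / len(targets)
        rank = new_rank
    return rank

def walk_frequencies(k, walks=20000, seed=0):
    """Fraction of k-step random walks that end at each page."""
    rng = random.Random(seed)
    counts = {page: 0 for page in pages}
    for _ in range(walks):
        page = rng.choice(pages)  # start at a uniformly random page
        for _step in range(k):
            page = rng.choice(out_links[page])  # follow a random outgoing link
        counts[page] += 1
    return {page: count / walks for page, count in counts.items()}

exact = pagerank_after(3)
estimate = walk_frequencies(3)
print(exact)
print(estimate)
```

The simulated frequencies should agree with the update-rule values up to sampling noise.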

39 Questions on PageRank?

40 Block Models and Personalized PageRank
- Isabel M. Kloumann, Johan Ugander, and Jon Kleinberg

41 We will discuss
- PageRank for community detection
- Personalized PageRank
- Seed Set Expansion Problem
- Evaluation of Ranking Methods: the authors developed a framework by studying seed set expansion applied to the stochastic block model

42 Random Walks
- Given a graph, a random walk is an iterative process that starts from a random vertex and, at each step, either follows a random outgoing edge of the current vertex or jumps to a random vertex.
- Given some seeds in a community in a graph, can we find the rest of the community?
  - Using random walks rooted at the seeds

43 Personalized PageRank
- PageRank (PR) measures the stationary distribution of one specific kind of random walk: it starts from a random vertex and, in each iteration, jumps to a random vertex with a predefined probability p and follows a random outgoing edge of the current vertex with probability 1 - p.
- Personalized PageRank (PPR) is the same as PR, except that the jumps go back to one of a given set of starting vertices.
- In a way, the walk in PPR is biased towards (or personalized for) this set of starting vertices and is more localized than the random walk performed in PR.
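A minimal power-iteration sketch of PPR in Python, following the description above with jump probability p. The function name, the defaults p = 0.15 and 100 iterations, the handling of nodes without outgoing edges (their mass is sent back to the seeds), and the two-triangle graph are all assumptions made for illustration.

```python
def personalized_pagerank(out_links, seeds, p=0.15, iterations=100):
    """Approximate the PPR vector for a set of seed vertices by power iteration."""
    nodes = list(out_links)
    # Jumps land on a seed chosen uniformly at random
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {v: p * restart[v] for v in nodes}
        for v, targets in out_links.items():
            if targets:
                for target in targets:
                    new_rank[target] += (1 - p) * rank[v] / len(targets)
            else:
                # Assumption: a node without outgoing edges jumps back to the seeds
                for u in nodes:
                    new_rank[u] += (1 - p) * rank[v] * restart[u]
        rank = new_rank
    return rank

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
ppr = personalized_pagerank(graph, seeds={0})
print(ppr)
```

Nodes in the seed's own triangle score higher than their counterparts in the other triangle, which is the localization the slide describes.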

44 Seed Set Expansion Problem
- Given a graph G and a subset of nodes S known to belong to a community, find the rest of the community
- Common techniques
  - Personalized PageRank
  - Heat kernel method

45 Stochastic Block Model
- A distribution over graphs that generalizes the Erdős-Rényi (ER) random graph model to include a planted block structure.
- Partition the n nodes into C disjoint sets $V_1, V_2, \dots, V_C$, where $|V_i| = \pi_i n$
- Create a $C \times C$ matrix $P$ whose entry $p_{ij}$ is the probability that a node in $V_i$ and a node in $V_j$ are connected
- So an SBM is described as $G(n, \pi, P)$, where $\pi = (\pi_1, \pi_2, \dots, \pi_C)$
- Two-block SBM: seed set and remainder of the graph
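Sampling a graph from G(n, π, P) can be sketched as below. The `sample_sbm` name, the rounding of block sizes, and the undirected-edge convention are assumptions for illustration; the fractions in `pi` are assumed to give block sizes that sum to n.

```python
import random

def sample_sbm(n, pi, P, seed=0):
    """Sample an undirected graph from a stochastic block model G(n, pi, P)."""
    rng = random.Random(seed)
    # Block sizes |V_i| = pi_i * n, rounded to integers
    sizes = [round(fraction * n) for fraction in pi]
    # block[u] is the index of the block that node u belongs to
    block = [i for i, size in enumerate(sizes) for _ in range(size)]
    edges = []
    for u in range(len(block)):
        for v in range(u + 1, len(block)):
            # Connect u and v with the probability for their pair of blocks
            if rng.random() < P[block[u]][block[v]]:
                edges.append((u, v))
    return block, edges

# Degenerate check: edges only inside blocks, and always present there
block, edges = sample_sbm(10, [0.3, 0.7], [[1.0, 0.0], [0.0, 1.0]])
print(len(edges))
```

With these extreme probabilities the sample is two disjoint cliques of sizes 3 and 7; a planted partition in the usual sense uses a larger within-block than between-block probability, for example 0.5 versus 0.05.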

46 Approach
- For each node v in the graph and each step count k, ranking methods use the landing probability of v: the probability that a k-step random walk ends at v, starting from a particular seed node in S (or from a node chosen uniformly at random from S).
- Geometrically, these rankings amount to sweeps through the space of landing probabilities with hyperplanes normal to some vector; personalized PageRank and the heat kernel correspond to different choices of this vector.
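A small Python sketch of landing probabilities and of a ranking score formed as a weighted sum of them. The function names, the two-triangle graph, and the geometric weights 0.5, 0.25, 0.125 are hypothetical illustrations rather than the weights used in the paper; every node is assumed to have at least one neighbor.

```python
def landing_probabilities(neighbors, seed_node, K):
    """Return [r_1, ..., r_K], where r_k[v] is the probability that a
    k-step random walk started at seed_node is at node v."""
    current = {v: 0.0 for v in neighbors}
    current[seed_node] = 1.0
    result = []
    for _ in range(K):
        nxt = {v: 0.0 for v in neighbors}
        for v, nbrs in neighbors.items():
            for u in nbrs:
                nxt[u] += current[v] / len(nbrs)
        current = nxt
        result.append(current)
    return result

def sweep_scores(r, weights):
    """Score of each node: inner product of its landing probabilities with the weights."""
    return {v: sum(w * r_k[v] for w, r_k in zip(weights, r)) for v in r[0]}

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
r = landing_probabilities(graph, seed_node=0, K=3)
scores = sweep_scores(r, weights=[0.5, 0.25, 0.125])
print(sorted(scores, key=scores.get, reverse=True))
```

Sorting nodes by score and cutting the list at some threshold is the sweep; changing the weight vector changes the direction of the sweeping hyperplane.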

47 Method
- Derive the centroids, for each block, in the space of landing probabilities
- Observation: for large graphs, the optimal hyperplane for performing a linear sweep between the two centroids is asymptotically concentrated on the weights of personalized PageRank (for a specific choice of the PageRank parameter that depends on the parameters of the SBM)

48 2-block SBM
- We have 2 classes and a distribution of landing probabilities
- We can use discriminant functions to classify the points into the two blocks: the community and the remainder of the graph
- Geometric discriminant functions: a linear sweep through the feature space
- Fisherian discriminant functions: a descriptive model using multivariate Gaussians in the feature space

49 Weight vectors
- For PageRank, the weight vector is: [equation not preserved]
- For the heat kernel method, it is: [equation not preserved]

50 2-block SBM (Geometric)
- The authors theoretically proved an asymptotic equivalence between personalized PageRank and geometric classification of SBMs in the space of landing probabilities.
- They showed that: [equation not preserved]

51 Note
- The proof assumes that the SBM is dense; it might not hold for a sparse block model.
- The entire derivation works even if the intra-block connectivity is lower than the inter-block connectivity.
- $\alpha$ close to 0 is best for identifying very strong planted partitions, $p_{in} \gg p_{out}$, whereas $\alpha$ close to 1 is best when the planted partition is very weak and the difference $p_{in} - p_{out}$ is small.

52 2-block SBM (Fisherian)
- The classes are described by multivariate Gaussians $N(a, \Sigma_a)$ and $N(b, \Sigma_b)$ for the in-class and out-class, respectively.
- Resulting functions: [equations not preserved]

53 Results
- The quadratic discriminant function has considerably better recall than ordinary personalized PageRank.
- The linear SBM method, which assumes a common covariance matrix for the two classes, exhibits recall nearly identical to that of the quadratic method.

54 Summarizing
- Personalized PageRank is shown to be the optimal geometric discriminant function in the space of landing probabilities for classifying nodes in a hidden seed community in an SBM.
- Building on this connection between SBMs and personalized PageRank, more complex covariance-adjusted linear and quadratic approaches to classification in the space of landing probabilities were developed and evaluated.
- These classifiers dramatically outperform personalized PageRank and heat kernel methods for recovering seed sets in SBMs.
- The connection between personalized PageRank and SBMs is surprising, and it points toward considerable scope for further research.
