CS 6604: Data Mining Large Networks and Time-Series

Size: px
Start display at page:

Download "CS 6604: Data Mining Large Networks and Time-Series"

Transcription

1 CS 6604: Data Mining Large Networks and Time-Series Soumya Vundekode Lecture #12: Centrality Metrics Prof. B Aditya Prakash

2 Agenda Link Analysis and Web Search Searching the Web: The Problem of Ranking Link Analysis using Hubs and Authorities PageRank Block models and personalized PageRank Isabel M. Kloumann, Johan Ugander, and Jon Kleinberg Vundekode 2017 CS 6604: DM Large Networks & Time-Series 2

3 SEARCHING THE WEB Vundekode 2017 CS 6604: DM Large Networks & Time-Series 3

4 Problem of Ranking No external database Ranking methods look at the Web itself Vundekode 2017 CS 6604: DM Large Networks & Time-Series 4

5 Search is a hard problem! Any setting Not just on the Web Keyword queries List is short and inexpressive Synonymy, Polysemy Authoring style and vocabulary Vundekode 2017 CS 6604: DM Large Networks & Time-Series 5

6 Search on the Web Everyone is an author. Everyone is a searcher. New problems? Diversity in authoring styles no common criterion to rank Diversity in searchers specific category? Dynamic Web content NEWS! Vundekode 2017 CS 6604: DM Large Networks & Time-Series 6

7 Problem transformed! Scarcity Abundance Finding the most relevant results Solution? Ranking Understanding network structure of Web pages Vundekode 2017 CS 6604: DM Large Networks & Time-Series 7

8 LINK ANALYSIS Essential for Ranking Vundekode 2017 CS 6604: DM Large Networks & Time-Series 8

9 Start from the right perspective! There is no point in looking inside a Web page to see how relevant it is to the query. Number of links to a page from other relevant pages reflects its relevance to the query better! Shows the authority of a page on the topic Links serve as implicit endorsements Vundekode 2017 CS 6604: DM Large Networks & Time-Series 9

10 Voting by In-Links Collect a large sample of pages relevant to the query Let them vote through their links Pick the page with highest number of votes In-degree Does this work for all kinds of queries?? Vundekode 2017 CS 6604: DM Large Networks & Time-Series 10

11 One-word query : newspapers Results: Mix of prominent newspapers AND Pages that receive high in-links irrespective of the query Yahoo!, Amazon, Facebook, Vundekode 2017 CS 6604: DM Large Networks & Time-Series 11

12 In-Links Network for newspapers Results wanted Vundekode 2017 CS 6604: DM Large Networks & Time-Series 12

13 Lists of Links New Approach Pages that compile lists of resources relevant to the topic A page s value as a list = sum of votes received by all pages that it voted for. Vundekode 2017 CS 6604: DM Large Networks & Time-Series 13

14 List-Finding Technique Better lists (Better sense of where good results are) Vundekode 2017 CS 6604: DM Large Networks & Time-Series 14

15 Principle of Repeated Improvement Better lists Weigh their votes heavily Re-compute votes Weight of each page s vote = its value as a list Improves scores of relevant results Vundekode 2017 CS 6604: DM Large Networks & Time-Series 15

16 Refined scores for newspapers Vundekode 2017 CS 6604: DM Large Networks & Time-Series 16

17 Hubs and Authorities Hubs for the query the high-value lists Authorities for the query the highly endorsed answers For each page (p) in the network, estimate its value as a potential hub and a potential authority calculate auth(p), hub(p) (Initial values for both = 1) Vundekode 2017 CS 6604: DM Large Networks & Time-Series 17

18 Rules Authority Update Rule For each page p, update auth(p) to be the sum of the hub scores of all the pages that point to it. Hub Update Rule For each page p, update hub(p) to be the sum of the authority scores of all the pages that it point to. Vundekode 2017 CS 6604: DM Large Networks & Time-Series 18

19 Note 1 application of Authority Update Rule voting by in-links 1 application of Authority Update Rule + 1 application of Hub Update Rule original list-finding technique Vundekode 2017 CS 6604: DM Large Networks & Time-Series 19

20 Principle of Repeated Improvement Start with all hub scores and authority scores equal to 1 Choose number of steps k Perform a sequence of k hub-authority updates First apply the Authority Update Rule to current set of scores Then apply the Hub Update Rule to the resulting set of scores Normalize the scores Vundekode 2017 CS 6604: DM Large Networks & Time-Series 20

21 Normalized and Re-weighted votes Vundekode 2017 CS 6604: DM Large Networks & Time-Series 21

22 k? Normalized values converge to limits Skipping proof (Section 14.6, if interested) Also proved that the limiting hub and authority values are a property purely of the link structure These limiting values correspond to a kind of an equilibrium Balance between hub and authorities Vundekode 2017 CS 6604: DM Large Networks & Time-Series 22

23 Limiting values for newspapers Vundekode 2017 CS 6604: DM Large Networks & Time-Series 23

24 PAGERANK Vundekode 2017 CS 6604: DM Large Networks & Time-Series 24

25 Intuition behind Hubs-Authorities Are auth and hub scores sufficient for all kinds of queries? No!! Only the ones with a commercial aspect. Why? Competing firms don t link to each other. Only way is to get a set of hub pages that link to them all. Hubs play a powerful endorsement role without themselves being heavily endorsed. Vundekode 2017 CS 6604: DM Large Networks & Time-Series 25

26 Intuition behind PageRank Endorsement passing directly from one prominent page to another Nodes repeatedly pass endorsements across their outgoing links with weights based on its current estimate of PageRank Endorsements eventually pool at the most relevant nodes Vundekode 2017 CS 6604: DM Large Networks & Time-Series 26

27 PageRank Update Rule Each page divides its current PageRank equally across its out-going links, and passes these equal shares to the pages it points to. If a page has no out-going links, it passes all its current PageRank to itself. Each page updates its new PageRank to be the sum of the shares it receives. Vundekode 2017 CS 6604: DM Large Networks & Time-Series 27

28 PageRank Network of n nodes initial PageRank = 1/n Choose number of steps = k Perform a sequence of k updates to PageRank values using the PageRank Update Rule Note: Total PageRank in the network = 1 Vundekode 2017 CS 6604: DM Large Networks & Time-Series 28

29 Example Vundekode 2017 CS 6604: DM Large Networks & Time-Series 29

30 Initialize PageRank values 1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8 Vundekode 2017 CS 6604: DM Large Networks & Time-Series 30

31 Step 1 1/2 1/16 1/16 1/16 1/16 1/16 1/16 1/8 Vundekode 2017 CS 6604: DM Large Networks & Time-Series 31

32 Step 2 3/16 1/4 1/4 1/32 1/32 1/32 1/32 1/16 Vundekode 2017 CS 6604: DM Large Networks & Time-Series 32

33 k? PageRank values converge to limiting values These limiting values exhibit kind of an equilibrium Values remain same on applying one step of the PageRank Update Rule Unique set of equilibrium values for strongly connected networks Skipping proofs Vundekode 2017 CS 6604: DM Large Networks & Time-Series 33

34 Equilibrium PageRank Values Vundekode 2017 CS 6604: DM Large Networks & Time-Series 34

35 Problem? Slow-leak: Wrong nodes might end up with all the PageRank! Vundekode 2017 CS 6604: DM Large Networks & Time-Series 35

36 Solution Scaling Factor : 0<s<1 Scaled PageRank Update Rule: First apply the Basic PageRank Update Rule. Then scale down all PageRank values by a factor of s. Total PageRank of the network now is s. Divide the residual (1-s) equally over all nodes ((1-s)/n to each node) Vundekode 2017 CS 6604: DM Large Networks & Time-Series 36

37 k? PageRank values again converge to limiting values These limiting values exhibit kind of an equilibrium Values remain same on applying 1 step of Scaled PageRank Update Rule Unique set of equilibrium values for every s for any network Optimal s value between 0.8 to 0.9 Slow-leak prominent on larger networks Vundekode 2017 CS 6604: DM Large Networks & Time-Series 37

38 Random Walks Claim: The probability of being at a page X after k steps of random walk is precisely the PageRank of X after k applications of the Basic PageRank Update Rule. Skipping proof (Section 14.6, if interested) Try to intuitively think of the earlier network with F and G nodes being in a cycle. Vundekode 2017 CS 6604: DM Large Networks & Time-Series 38

39 Questions on PageRank? Vundekode 2017 CS 6604: DM Large Networks & Time-Series 39

40 Block Models and Personalized PageRank Isabel M. Kloumann, Johan Ugander, and Jon Kleinberg Vundekode 2017 CS 6604: DM Large Networks & Time-Series 40

41 We will discuss PageRank for community detection Personalized PageRank Seed Set Expansion Problem Evaluation of Ranking Methods - Developed a framework by studying seed set expansion applied to the stochastic block model Vundekode 2017 CS 6604: DM Large Networks & Time-Series 41

42 Random Walks Given a graph, a random walk is an iterative process that starts from a random vertex, and at each step, either follows a random outgoing edge of the current vertex or jumps to a random vertex. Given some seeds in a community in a graph, can we find the rest of the community? Using random walks rooting at the seeds Vundekode 2017 CS 6604: DM Large Networks & Time-Series 42

43 Personalized PageRank Page Rank (PR) measures stationary distribution of one specific kind of random walk that starts from a random vertex and in each iteration, with a predefined probability p, jumps to a random vertex, and with probability1-p follows a random outgoing edge of the current vertex. Personalized Page Rank (PPR) is the same as PR other than the fact that jumps are back to one of a given set of starting vertices. In a way, the walk in PPR is biased towards (or personalized for) this set of starting vertices and is more localized compared to the random walk performed in PR. Vundekode 2017 CS 6604: DM Large Networks & Time-Series 43

44 Seed Set Expansion Problem Given graph G and subset of nodes S, known to be present in a community, Find the rest of the community Common techniques Personalized PageRank Heat Kernel method Vundekode 2017 CS 6604: DM Large Networks & Time-Series 44

45 Stochastic Block Model A distribution over graphs that generalizes the ER random graph model to include a planted block structure. Partition of nodes into C disjoint sets V 1,V 2, V c where V i = π i n Create C x C matrix P where entry p ij = prob(v i and V j are connected) So SBM is described as G(n, π, P) where π = (π 1, π 2,, π C ) Two-block SBM : Seed set and remainder of the graph Vundekode 2017 CS 6604: DM Large Networks & Time-Series 45

46 Approach For each node v in the graph, and each k, ranking methods use: landing probabilities of node, starting from a particular seed node in S (or a node chosen uniformly at random from S). Geometrically, these rankings amount to sweeps through the space of landing probabilities with hyperplanes normal to some vector, where personalized PageRank and the heat kernel correspond to different choices of vectors Vundekode 2017 CS 6604: DM Large Networks & Time-Series 46

47 Method Derive centroids, for each block, in the space of landing probabilities Observation: The optimal hyperplane for performing a linear sweep between the two centroids is asymptotically concentrated for large graphs on the weights of personalized PageRank (for a specific choice of the PageRank parameter corresponding to parameters of the SBM) Vundekode 2017 CS 6604: DM Large Networks & Time-Series 47

48 2-block SBM We have 2 classes and a distribution of landing probabilities We can use discriminant functions to classify the points into the two blocks community and remainder of the graph. Geometric discriminant functions linear sweep through feature space Fisherian discriminant functions descriptive model using multivariate Gaussians feature space Vundekode 2017 CS 6604: DM Large Networks & Time-Series 48

49 Weight vectors For PageRank, the weight vector is For heat kernel method, it is Vundekode 2017 CS 6604: DM Large Networks & Time-Series 49

50 2-block SBM (Geometric) They theoretically proved an asymptotic equivalence between personalized PageRank and geometric classification of SBMs in the space of landing probabilities. They showed that: Vundekode 2017 CS 6604: DM Large Networks & Time-Series 50

51 Note It is assumed for that proof that the SBM is dense. It might not hold good for a sparse block model. The entire derivation works even if the intra-connectivity is lower than inter-connectivity among the blocks. α close to 0 is best for identifying very strong planted partitions, p in p out, whereas α close to 1 is best when the planted partition is very weak and the difference p in p out is small. Vundekode 2017 CS 6604: DM Large Networks & Time-Series 51

52 2-block SBM (Fisherian) The classes are described by multivariate Gaussians N(a, Σ a ) and N(b, Σ b ) for the in-class and out-class, respectively. Resulting functions: Vundekode 2017 CS 6604: DM Large Networks & Time-Series 52

53 Results Quadratic discriminant function has considerably improved recall over ordinary personalized PageRank The linear SBM method, assuming a common covariance matrix for the two classes, exhibits a recall nearly identical to the quadratic method. Vundekode 2017 CS 6604: DM Large Networks & Time-Series 53

54 Summarizing Personalized PageRank is shown as the optimal geometric discriminant function in the space of landing probabilities for classifying nodes in a hidden seed community in an SBM. Building on this connection between SBMs and personalized PageRank, more complex covariance-adjusted linear and quadratic approaches to classification in the space of landing probabilities were developed and evaluated. These classifiers dramatically outperform personalized PageRank and heat kernel methods for recovering seed sets in SBMs. The connection between personalized PageRank and SBMs is surprising, and it points toward a huge research scope Vundekode 2017 CS 6604: DM Large Networks & Time-Series 54

CSI 445/660 Part 10 (Link Analysis and Web Search)

CSI 445/660 Part 10 (Link Analysis and Web Search) CSI 445/660 Part 10 (Link Analysis and Web Search) Ref: Chapter 14 of [EK] text. 10 1 / 27 Searching the Web Ranking Web Pages Suppose you type UAlbany to Google. The web page for UAlbany is among the

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

Information Networks: Hubs and Authorities

Information Networks: Hubs and Authorities Information Networks: Hubs and Authorities Web Science (VU) (706.716) Elisabeth Lex KTI, TU Graz June 11, 2018 Elisabeth Lex (KTI, TU Graz) Links June 11, 2018 1 / 61 Repetition Opinion Dynamics Culture

More information

Link Analysis: Web Structure and Search

Link Analysis: Web Structure and Search Link Analysis: Web Structure and Search Web Science (VU) (706716) Elisabeth Lex ISDS, TU Graz June 12, 2017 Elisabeth Lex (ISDS, TU Graz) Links June 12, 2017 1 / 69 Outline 1 Information Networks 2 Paths

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Social Networks 2015 Lecture 10: The structure of the web and link analysis 04198250 Social Networks 2015 Lecture 10: The structure of the web and link analysis The structure of the web Information networks Nodes: pieces of information Links: different relations between information

More information

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies Large-Scale Networks PageRank Dr Vincent Gramoli Lecturer School of Information Technologies Introduction Last week we talked about: - Hubs whose scores depend on the authority of the nodes they point

More information

Social and Technological Network Analysis. Lecture 5: Web Search and Random Walks. Dr. Cecilia Mascolo

Social and Technological Network Analysis. Lecture 5: Web Search and Random Walks. Dr. Cecilia Mascolo Social and Technological Network Analysis Lecture 5: Web Search and Random Walks Dr. Cecilia Mascolo In This Lecture We describe the concept of search in a network. We describe powerful techniques to enhance

More information

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo Social and Technological Network Data Analytics Lecture 5: Structure of the Web, Search and Power Laws Prof Cecilia Mascolo In This Lecture We describe power law networks and their properties and show

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

COMP 4601 Hubs and Authorities

COMP 4601 Hubs and Authorities COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org J.

More information

Link Analysis. Hongning Wang

Link Analysis. Hongning Wang Link Analysis Hongning Wang CS@UVa Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Algorithms, Games, and Networks February 21, Lecture 12

Algorithms, Games, and Networks February 21, Lecture 12 Algorithms, Games, and Networks February, 03 Lecturer: Ariel Procaccia Lecture Scribe: Sercan Yıldız Overview In this lecture, we introduce the axiomatic approach to social choice theory. In particular,

More information

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today 3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

More information

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena CSE 190 Lecture 16 Data Mining and Predictive Analytics Small-world phenomena Another famous study Stanley Milgram wanted to test the (already popular) hypothesis that people in social networks are separated

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #11: Link Analysis 3 Seoul National University 1 In This Lecture WebSpam: definition and method of attacks TrustRank: how to combat WebSpam HITS algorithm: another algorithm

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Pagerank Scoring. Imagine a browser doing a random walk on web pages: Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably

More information

Authoritative Sources in a Hyperlinked Environment

Authoritative Sources in a Hyperlinked Environment Authoritative Sources in a Hyperlinked Environment Journal of the ACM 46(1999) Jon Kleinberg, Dept. of Computer Science, Cornell University Introduction Searching on the web is defined as the process of

More information

Machine Learning / Jan 27, 2010

Machine Learning / Jan 27, 2010 Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns

MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns This is a closed-book exam. You should have no material on your desk other than the exam itself and a pencil or pen.

More information

Similarity Ranking in Large- Scale Bipartite Graphs

Similarity Ranking in Large- Scale Bipartite Graphs Similarity Ranking in Large- Scale Bipartite Graphs Alessandro Epasto Brown University - 20 th March 2014 1 Joint work with J. Feldman, S. Lattanzi, S. Leonardi, V. Mirrokni [WWW, 2014] 2 AdWords Ads Ads

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #10: Link Analysis-2 Seoul National University 1 In This Lecture Pagerank: Google formulation Make the solution to converge Computing Pagerank for very large graphs

More information

Motivation. Motivation

Motivation. Motivation COMS11 Motivation PageRank Department of Computer Science, University of Bristol Bristol, UK 1 November 1 The World-Wide Web was invented by Tim Berners-Lee circa 1991. By the late 199s, the amount of

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) ' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search

More information

Analysis of Large Graphs: TrustRank and WebSpam

Analysis of Large Graphs: TrustRank and WebSpam Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

PageRank and related algorithms

PageRank and related algorithms PageRank and related algorithms PageRank and HITS Jacob Kogan Department of Mathematics and Statistics University of Maryland, Baltimore County Baltimore, Maryland 21250 kogan@umbc.edu May 15, 2006 Basic

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Graph and Link Mining

Graph and Link Mining Graph and Link Mining Graphs - Basics A graph is a powerful abstraction for modeling entities and their pairwise relationships. G = (V,E) Set of nodes V = v,, v 5 Set of edges E = { v, v 2, v 4, v 5 }

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

CS2 Algorithms and Data Structures Note 10. Depth-First Search and Topological Sorting

CS2 Algorithms and Data Structures Note 10. Depth-First Search and Topological Sorting CS2 Algorithms and Data Structures Note 10 Depth-First Search and Topological Sorting In this lecture, we will analyse the running time of DFS and discuss a few applications. 10.1 A recursive implementation

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Single link clustering: 11/7: Lecture 18. Clustering Heuristics 1

Single link clustering: 11/7: Lecture 18. Clustering Heuristics 1 Graphs and Networks Page /7: Lecture 8. Clustering Heuristics Wednesday, November 8, 26 8:49 AM Today we will talk about clustering and partitioning in graphs, and sometimes in data sets. Partitioning

More information

Problem Definition. Clustering nonlinearly separable data:

Problem Definition. Clustering nonlinearly separable data: Outlines Weighted Graph Cuts without Eigenvectors: A Multilevel Approach (PAMI 2007) User-Guided Large Attributed Graph Clustering with Multiple Sparse Annotations (PAKDD 2016) Problem Definition Clustering

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

Social Network Analysis

Social Network Analysis Social Network Analysis Giri Iyengar Cornell University gi43@cornell.edu March 14, 2018 Giri Iyengar (Cornell Tech) Social Network Analysis March 14, 2018 1 / 24 Overview 1 Social Networks 2 HITS 3 Page

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

Diffusion and Clustering on Large Graphs

Diffusion and Clustering on Large Graphs Diffusion and Clustering on Large Graphs Alexander Tsiatas Final Defense 17 May 2012 Introduction Graphs are omnipresent in the real world both natural and man-made Examples of large graphs: The World

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

Local Community Detection in Dynamic Graphs Using Personalized Centrality

Local Community Detection in Dynamic Graphs Using Personalized Centrality algorithms Article Local Community Detection in Dynamic Graphs Using Personalized Centrality Eisha Nathan, Anita Zakrzewska, Jason Riedy and David A. Bader * School of Computational Science and Engineering,

More information

Semi-Supervised Learning: Lecture Notes

Semi-Supervised Learning: Lecture Notes Semi-Supervised Learning: Lecture Notes William W. Cohen March 30, 2018 1 What is Semi-Supervised Learning? In supervised learning, a learner is given a dataset of m labeled examples {(x 1, y 1 ),...,

More information

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science CS-C3160 - Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Contents of this chapter Story

More information

Lecture 27: Learning from relational data

Lecture 27: Learning from relational data Lecture 27: Learning from relational data STATS 202: Data mining and analysis December 2, 2017 1 / 12 Announcements Kaggle deadline is this Thursday (Dec 7) at 4pm. If you haven t already, make a submission

More information

MCL. (and other clustering algorithms) 858L

MCL. (and other clustering algorithms) 858L MCL (and other clustering algorithms) 858L Comparing Clustering Algorithms Brohee and van Helden (2006) compared 4 graph clustering algorithms for the task of finding protein complexes: MCODE RNSC Restricted

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS

CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS 8.1 Introduction The recognition systems developed so far were for simple characters comprising of consonants and vowels. But there is one

More information

Lecture 27: Fast Laplacian Solvers

Lecture 27: Fast Laplacian Solvers Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall

More information

Lecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering

Lecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering CSE 291: Unsupervised learning Spring 2008 Lecture 5 Finding meaningful clusters in data So far we ve been in the vector quantization mindset, where we want to approximate a data set by a small number

More information

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION Evgeny Kharitonov *, ***, Anton Slesarev *, ***, Ilya Muchnik **, ***, Fedor Romanenko ***, Dmitry Belyaev ***, Dmitry Kotlyarov *** * Moscow Institute

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,

More information

Lecture 17 November 7

Lecture 17 November 7 CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/6/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 High dim. data Graph data Infinite data Machine

More information

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. 1 Contents Introduction Network properties Social network analysis Co-citation

More information

Alpha-Beta Community

Alpha-Beta Community Alpha-Beta Community Supasorn Suwajanakorn March 15, 2010 Supasorn Suwajanakorn () Alpha-Beta Community March 15, 2010 1 / 19 Definition from Nina s paper Definition 1. Given a graph G = (V, E), where

More information

Collaborative filtering based on a random walk model on a graph

Collaborative filtering based on a random walk model on a graph Collaborative filtering based on a random walk model on a graph Marco Saerens, Francois Fouss, Alain Pirotte, Luh Yen, Pierre Dupont (UCL) Jean-Michel Renders (Xerox Research Europe) Some recent methods:

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds MAE 298, Lecture 9 April 30, 2007 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks CS224W Final Report Emergence of Global Status Hierarchy in Social Networks Group 0: Yue Chen, Jia Ji, Yizheng Liao December 0, 202 Introduction Social network analysis provides insights into a wide range

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #6: Mining Data Streams Seoul National University 1 Outline Overview Sampling From Data Stream Queries Over Sliding Window 2 Data Streams In many data mining situations,

More information