CS 6604: Data Mining Large Networks and Time-Series
1 CS 6604: Data Mining Large Networks and Time-Series Soumya Vundekode Lecture #12: Centrality Metrics Prof. B Aditya Prakash
2 Agenda: Link Analysis and Web Search (Searching the Web: The Problem of Ranking; Link Analysis using Hubs and Authorities; PageRank). Block Models and Personalized PageRank (Isabel M. Kloumann, Johan Ugander, and Jon Kleinberg). Vundekode 2017 CS 6604: DM Large Networks & Time-Series 2
3 SEARCHING THE WEB
4 Problem of Ranking: No external database. Ranking methods look at the Web itself.
5 Search is a hard problem! In any setting, not just on the Web. Keyword queries: the list is short and inexpressive. Synonymy, polysemy. Authoring style and vocabulary.
6 Search on the Web: Everyone is an author. Everyone is a searcher. New problems? Diversity in authoring styles: no common criterion to rank. Diversity in searchers: specific category? Dynamic Web content: NEWS!
7 Problem transformed! From scarcity to abundance. Finding the most relevant results. Solution? Ranking. Understanding the network structure of Web pages.
8 LINK ANALYSIS: Essential for Ranking
9 Start from the right perspective! There is no point in looking inside a Web page to see how relevant it is to the query. The number of links to a page from other relevant pages reflects its relevance to the query better! It shows the authority of a page on the topic. Links serve as implicit endorsements.
10 Voting by In-Links: Collect a large sample of pages relevant to the query. Let them vote through their links. Pick the page with the highest number of votes (in-degree). Does this work for all kinds of queries?
11 One-word query: newspapers. Results: a mix of prominent newspapers AND pages that receive many in-links irrespective of the query, such as Yahoo!, Amazon, and Facebook.
12 In-Links Network for newspapers. Results wanted.
13 Lists of Links: New approach. Pages that compile lists of resources relevant to the topic. A page's value as a list = sum of votes received by all pages that it voted for.
14 List-Finding Technique: Better lists (a better sense of where the good results are).
15 Principle of Repeated Improvement: Better lists: weigh their votes more heavily. Re-compute votes: weight of each page's vote = its value as a list. Improves scores of relevant results.
16 Refined scores for newspapers
17 Hubs and Authorities: Hubs for the query: the high-value lists. Authorities for the query: the highly endorsed answers. For each page p in the network, estimate its value as a potential hub and as a potential authority: calculate auth(p) and hub(p) (initial values for both = 1).
18 Rules. Authority Update Rule: For each page p, update auth(p) to be the sum of the hub scores of all the pages that point to it. Hub Update Rule: For each page p, update hub(p) to be the sum of the authority scores of all the pages that it points to.
19 Note: 1 application of the Authority Update Rule = voting by in-links. 1 application of the Authority Update Rule + 1 application of the Hub Update Rule = the original list-finding technique.
20 Principle of Repeated Improvement: Start with all hub scores and authority scores equal to 1. Choose a number of steps k. Perform a sequence of k hub-authority updates: first apply the Authority Update Rule to the current set of scores, then apply the Hub Update Rule to the resulting set of scores. Normalize the scores.
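The k-step hub-authority procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names `hits` and `normalize` and the toy edge list are assumptions, and scores are normalized to sum to 1 after every round (rather than once at the end), which only rescales the values and gives the same normalized result.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (left at 0 if every score is 0)."""
    total = sum(scores.values())
    return {page: (s / total if total else 0.0) for page, s in scores.items()}


def hits(edges, k):
    """Run k hub-authority updates on a directed graph given as (src, dst) pairs."""
    nodes = sorted({page for edge in edges for page in edge})
    auth = {page: 1.0 for page in nodes}
    hub = {page: 1.0 for page in nodes}
    for _ in range(k):
        # Authority Update Rule: sum the hub scores of pages pointing to p.
        new_auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub Update Rule: sum the (new) authority scores of pages p points to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: two list pages (h1, h2) linking to two answers (a1, a2).
TOY_EDGES = [("h1", "a1"), ("h2", "a1"), ("h2", "a2")]

if __name__ == "__main__":
    auth, hub = hits(TOY_EDGES, k=20)
    print(sorted(auth.items(), key=lambda item: -item[1]))
    print(sorted(hub.items(), key=lambda item: -item[1]))
```

With k = 1 the authority scores are just normalized in-degrees, which matches the note on slide 19 that one Authority Update is voting by in-links.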
21 Normalized and Re-weighted votes
22 k? Normalized values converge to limits as k grows. Skipping proof (Section 14.6, if interested). It is also proved that the limiting hub and authority values are a property purely of the link structure. These limiting values correspond to a kind of equilibrium: a balance between hubs and authorities.
23 Limiting values for newspapers
24 PAGERANK
25 Intuition behind Hubs-Authorities: Are auth and hub scores sufficient for all kinds of queries? No!! Only the ones with a commercial aspect. Why? Competing firms don't link to each other. The only way is to get a set of hub pages that link to them all. Hubs play a powerful endorsement role without themselves being heavily endorsed.
26 Intuition behind PageRank: Endorsement passes directly from one prominent page to another. Nodes repeatedly pass endorsements across their outgoing links, with weights based on their current estimates of PageRank. Endorsements eventually pool at the most relevant nodes.
27 PageRank Update Rule: Each page divides its current PageRank equally across its outgoing links, and passes these equal shares to the pages it points to. If a page has no outgoing links, it passes all its current PageRank to itself. Each page updates its new PageRank to be the sum of the shares it receives.
28 PageRank: Network of n nodes; initial PageRank of each node = 1/n. Choose a number of steps k. Perform a sequence of k updates to the PageRank values using the PageRank Update Rule. Note: total PageRank in the network = 1.
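A minimal sketch of the Basic PageRank Update Rule, using exact fractions so the arithmetic can be checked by hand. The 8-page graph `EXAMPLE` below is an assumption: the lecture's figure is not part of this transcript, so the wiring was chosen only so that one update from 1/8 everywhere gives 1/2 for one page, 1/16 for six pages, and 1/8 for the last. The function name `basic_pagerank` is likewise illustrative.

```python
from fractions import Fraction


def basic_pagerank(out_links, k):
    """Apply the Basic PageRank Update Rule k times.

    out_links maps each page to the list of pages it links to.
    """
    n = len(out_links)
    rank = {page: Fraction(1, n) for page in out_links}
    for _ in range(k):
        new_rank = {page: Fraction(0) for page in out_links}
        for page, targets in out_links.items():
            if not targets:
                # A page with no outgoing links passes its PageRank to itself.
                new_rank[page] += rank[page]
            else:
                share = rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Assumed 8-page example graph (hypothetical wiring, see the note above).
EXAMPLE = {
    "A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
    "D": ["A", "H"], "E": ["A", "H"],
    "F": ["A"], "G": ["A"], "H": ["A"],
}

if __name__ == "__main__":
    for step in range(3):
        ranks = basic_pagerank(EXAMPLE, step)
        print(step, {page: str(r) for page, r in ranks.items()}, "total:", sum(ranks.values()))
```

Because every page hands on all of its PageRank at each step, the printed total stays at 1 for every k, as the note on the slide states.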
29 Example
30 Initialize PageRank values: 1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
31 Step 1: 1/2 1/16 1/16 1/16 1/16 1/16 1/16 1/8
32 Step 2: 5/16 1/4 1/4 1/32 1/32 1/32 1/32 1/16
33 k? PageRank values converge to limiting values. These limiting values exhibit a kind of equilibrium: the values remain the same on applying one step of the PageRank Update Rule. There is a unique set of equilibrium values for strongly connected networks. Skipping proofs.
34 Equilibrium PageRank Values
35 Problem? Slow leak: the wrong nodes might end up with all the PageRank!
36 Solution: Scaling factor s, with 0 < s < 1. Scaled PageRank Update Rule: First apply the Basic PageRank Update Rule. Then scale down all PageRank values by a factor of s; the total PageRank of the network is now s. Divide the residual (1-s) equally over all nodes, giving (1-s)/n to each node.
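The effect of the scaling step can be seen on a small graph with a two-page trap. The sketch below is illustrative only: the graph `LEAKY`, the function names, and the defaults s = 0.85 and k = 100 are assumptions, not values from the lecture.

```python
def basic_step(out_links, rank):
    """One application of the Basic PageRank Update Rule."""
    new_rank = dict.fromkeys(out_links, 0.0)
    for page, targets in out_links.items():
        if not targets:
            new_rank[page] += rank[page]  # no out-links: keep own PageRank
        else:
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
    return new_rank


def scaled_pagerank(out_links, s=0.85, k=100):
    """Apply the Scaled PageRank Update Rule k times with scaling factor s."""
    n = len(out_links)
    rank = dict.fromkeys(out_links, 1.0 / n)
    for _ in range(k):
        rank = basic_step(out_links, rank)
        # Scale everything down by s, then spread the residual (1 - s) evenly.
        rank = {page: s * r + (1.0 - s) / n for page, r in rank.items()}
    return rank


# Hypothetical graph: F and G link only to each other, so PageRank that
# reaches them under the basic rule never gets back to A or B.
LEAKY = {"A": ["B"], "B": ["A", "F"], "F": ["G"], "G": ["F"]}

if __name__ == "__main__":
    basic = dict.fromkeys(LEAKY, 0.25)
    for _ in range(200):
        basic = basic_step(LEAKY, basic)
    print("basic :", basic)
    print("scaled:", scaled_pagerank(LEAKY))
```

Under the basic rule A and B drain towards zero, while under the scaled rule every page keeps at least (1-s)/n.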
37 k? PageRank values again converge to limiting values. These limiting values exhibit a kind of equilibrium: the values remain the same on applying one step of the Scaled PageRank Update Rule. There is a unique set of equilibrium values for every s, for any network. A typical choice of s is between 0.8 and 0.9. The slow leak is more prominent on larger networks.
38 Random Walks. Claim: The probability of being at a page X after k steps of a random walk (starting at a page chosen uniformly at random and following a uniformly random outgoing link at each step) is precisely the PageRank of X after k applications of the Basic PageRank Update Rule. Skipping proof (Section 14.6, if interested). Try to think intuitively about the earlier network with nodes F and G in a cycle.
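The claim can be checked empirically on a small graph by comparing simulated walks with the update rule. This is a rough sketch under assumptions: the 4-page graph `SMALL`, the function names, the number of walks, and the fixed random seed are all illustrative choices, and a walker on a page with no outgoing links is assumed to stay put (mirroring the rule that such a page passes its PageRank to itself).

```python
import random


def pagerank_after(out_links, k):
    """PageRank values after k applications of the Basic PageRank Update Rule."""
    rank = dict.fromkeys(out_links, 1.0 / len(out_links))
    for _ in range(k):
        new_rank = dict.fromkeys(out_links, 0.0)
        for page, targets in out_links.items():
            if not targets:
                new_rank[page] += rank[page]
            else:
                for target in targets:
                    new_rank[target] += rank[page] / len(targets)
        rank = new_rank
    return rank


def simulate_walks(out_links, k, walks=20000, seed=0):
    """Estimate where a k-step random walk ends, from many independent walks."""
    rng = random.Random(seed)
    pages = list(out_links)
    counts = dict.fromkeys(pages, 0)
    for _ in range(walks):
        page = rng.choice(pages)  # start at a uniformly random page
        for _ in range(k):
            if out_links[page]:
                page = rng.choice(out_links[page])  # follow a random out-link
        counts[page] += 1
    return {page: count / walks for page, count in counts.items()}


# Hypothetical 4-page graph used only for this comparison.
SMALL = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

if __name__ == "__main__":
    exact = pagerank_after(SMALL, 3)
    estimate = simulate_walks(SMALL, 3)
    for page in SMALL:
        print(page, round(exact[page], 4), round(estimate[page], 4))
```

The two columns should agree up to sampling noise, which shrinks as the number of walks grows.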
39 Questions on PageRank?
40 Block Models and Personalized PageRank: Isabel M. Kloumann, Johan Ugander, and Jon Kleinberg
41 We will discuss: PageRank for community detection. Personalized PageRank. Seed Set Expansion Problem. Evaluation of ranking methods: the authors developed a framework by studying seed set expansion applied to the stochastic block model.
42 Random Walks: Given a graph, a random walk is an iterative process that starts from a random vertex and, at each step, either follows a random outgoing edge of the current vertex or jumps to a random vertex. Given some seeds in a community in a graph, can we find the rest of the community? Yes, using random walks rooted at the seeds.
43 Personalized PageRank: PageRank (PR) measures the stationary distribution of one specific kind of random walk that starts from a random vertex and, in each iteration, jumps to a random vertex with a predefined probability p, and follows a random outgoing edge of the current vertex with probability 1-p. Personalized PageRank (PPR) is the same as PR except that the jumps go back to one of a given set of starting vertices. In a way, the walk in PPR is biased towards (or personalized for) this set of starting vertices and is more localized than the random walk performed in PR.
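A small power-iteration sketch of personalized PageRank as described above, with jump probability p and jumps landing uniformly on the seed set. The function name, the default p = 0.15, the two-triangle test graph, and the handling of pages without outgoing edges (the walker is sent back to the seeds) are assumptions made for illustration.

```python
def personalized_pagerank(out_links, seeds, p=0.15, k=100):
    """Approximate the PPR stationary distribution with k power-iteration steps."""
    seeds = list(seeds)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in out_links}
    rank = dict(restart)  # the walk starts on the seed set
    for _ in range(k):
        # With probability p the walker jumps back to a random seed.
        new_rank = {v: p * restart[v] for v in out_links}
        for v, targets in out_links.items():
            if targets:
                # With probability 1 - p it follows a random outgoing edge.
                share = (1.0 - p) * rank[v] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a vertex with no out-edges returns to the seeds.
                for seed in seeds:
                    new_rank[seed] += (1.0 - p) * rank[v] / len(seeds)
        rank = new_rank
    return rank


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
TWO_TRIANGLES = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}

if __name__ == "__main__":
    ppr = personalized_pagerank(TWO_TRIANGLES, seeds=[0])
    for vertex in sorted(ppr, key=ppr.get, reverse=True):
        print(vertex, round(ppr[vertex], 4))
```

Seeding at vertex 0 ranks the seed's own triangle above the other one, which is the localization effect the slide describes.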
44 Seed Set Expansion Problem: Given a graph G and a subset of nodes S known to belong to a community, find the rest of the community. Common techniques: Personalized PageRank, heat kernel method.
45 Stochastic Block Model (SBM): A distribution over graphs that generalizes the Erdős-Rényi (ER) random graph model to include a planted block structure. Partition the nodes into C disjoint sets V_1, V_2, ..., V_C, where |V_i| = π_i n. Create a C x C matrix P whose entry p_ij = Prob(a node in V_i and a node in V_j are connected). So an SBM is described as G(n, π, P), where π = (π_1, π_2, ..., π_C). Two-block SBM: seed set and remainder of the graph.
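Sampling from G(n, π, P) takes only a few lines. The sketch below is an illustration under assumptions: the function name `sample_sbm`, the rounding of π_i n to whole block sizes, the undirected simple-graph convention, and the example parameters are not from the lecture.

```python
import random


def sample_sbm(n, pi, P, seed=0):
    """Sample an undirected graph from a stochastic block model G(n, pi, P).

    Returns (block, edges): block[i] is the block index of node i, and
    edges is a set of pairs (i, j) with i < j.
    """
    rng = random.Random(seed)
    # Assign consecutive node ids to blocks of size round(pi_c * n).
    block = []
    for c, fraction in enumerate(pi):
        block.extend([c] * round(fraction * n))
    edges = set()
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            # Connect i and j independently with probability p_{block(i), block(j)}.
            if rng.random() < P[block[i]][block[j]]:
                edges.add((i, j))
    return block, edges


if __name__ == "__main__":
    # Two equal blocks, dense inside (0.3) and sparse across (0.02).
    block, edges = sample_sbm(200, [0.5, 0.5], [[0.3, 0.02], [0.02, 0.3]], seed=1)
    within = sum(1 for i, j in edges if block[i] == block[j])
    print("within-block edges:", within, "between-block edges:", len(edges) - within)
```

Setting every entry of P to the same value recovers the ER model that the SBM generalizes.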
46 Approach: For each node v in the graph and each k, ranking methods use the landing probability of v: the probability that a random walk of k steps lands on v, starting from a particular seed node in S (or from a node chosen uniformly at random from S). Geometrically, these rankings amount to sweeps through the space of landing probabilities with hyperplanes normal to some vector, where personalized PageRank and the heat kernel correspond to different choices of vector.
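To make the geometric picture concrete, the sketch below computes landing probabilities for k = 1..K from a single seed and scores each node by a weighted sum of them; ranking nodes by that score is a sweep with a hyperplane normal to the weight vector. The function names, the graph, K = 4, and the two example weight vectors (geometric weights 0.85^k and heat-kernel-style weights t^k/k! with t = 1) are assumptions chosen for illustration, not the paper's parameters.

```python
from math import factorial


def landing_probabilities(adj, seed, K):
    """Return rows[k-1][v] = probability that a k-step walk from seed is at v."""
    dist = dict.fromkeys(adj, 0.0)
    dist[seed] = 1.0
    rows = []
    for _ in range(K):
        nxt = dict.fromkeys(adj, 0.0)
        for v, mass in dist.items():
            if adj[v]:
                for u in adj[v]:
                    nxt[u] += mass / len(adj[v])
            else:
                nxt[v] += mass  # assumption: a walker with no edge stays put
        dist = nxt
        rows.append(dist)
    return rows


def sweep_scores(rows, weights):
    """Score each node by the inner product of its landing probabilities with weights."""
    return {v: sum(w * row[v] for w, row in zip(weights, rows)) for v in rows[0]}


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
GRAPH = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}

if __name__ == "__main__":
    K = 4
    rows = landing_probabilities(GRAPH, seed=0, K=K)
    geometric = [0.85 ** k for k in range(1, K + 1)]
    heat = [1.0 ** k / factorial(k) for k in range(1, K + 1)]
    for name, weights in (("geometric", geometric), ("heat kernel", heat)):
        scores = sweep_scores(rows, weights)
        print(name, sorted(scores, key=scores.get, reverse=True))
```

Changing only the weight vector changes the ranking method while the landing-probability features stay the same.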
47 Method: Derive centroids, for each block, in the space of landing probabilities. Observation: the optimal hyperplane for performing a linear sweep between the two centroids is asymptotically concentrated, for large graphs, on the weights of personalized PageRank (for a specific choice of the PageRank parameter corresponding to the parameters of the SBM).
48 2-block SBM: We have 2 classes and a distribution of landing probabilities. We can use discriminant functions to classify the points into the two blocks: community and remainder of the graph. Geometric discriminant functions: a linear sweep through feature space. Fisherian discriminant functions: a descriptive model of the feature space using multivariate Gaussians.
49 Weight vectors: For PageRank, the weight vector is [formula not preserved]. For the heat kernel method, it is [formula not preserved].
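For orientation, the standard series forms of the two methods are sketched below. These are the usual textbook expressions, not a reconstruction of the slide, whose notation may differ: r_k(v) denotes the k-step landing probability of node v, p is the jump probability from the Personalized PageRank slide, and t is an assumed heat kernel "time" parameter.

```latex
% Personalized PageRank with jump probability p (assumed notation):
\mathrm{PPR}(v) \;=\; p \sum_{k \ge 0} (1-p)^{k}\, r_k(v)
\qquad\Longrightarrow\qquad w_k^{\mathrm{PPR}} \;\propto\; (1-p)^{k}

% Heat kernel with parameter t > 0 (assumed notation):
\mathrm{HK}(v) \;=\; e^{-t} \sum_{k \ge 0} \frac{t^{k}}{k!}\, r_k(v)
\qquad\Longrightarrow\qquad w_k^{\mathrm{HK}} \;\propto\; \frac{t^{k}}{k!}
```

The first form follows from unrolling the stationarity condition of the walk with restarts; both methods rank nodes by a weighted sum of the same landing probabilities and differ only in how fast the weights decay with k.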
50 2-block SBM (Geometric): The authors theoretically proved an asymptotic equivalence between personalized PageRank and geometric classification of SBMs in the space of landing probabilities. They showed that: [formula not preserved]
51 Note: The proof assumes that the SBM is dense; it might not hold for a sparse block model. The entire derivation works even if the intra-block connectivity is lower than the inter-block connectivity. α close to 0 is best for identifying very strong planted partitions, p_in ≫ p_out, whereas α close to 1 is best when the planted partition is very weak and the difference p_in − p_out is small.
52 2-block SBM (Fisherian): The classes are described by multivariate Gaussians N(a, Σ_a) and N(b, Σ_b) for the in-class and out-class, respectively. Resulting functions: [formula not preserved]
53 Results: The quadratic discriminant function has considerably improved recall over ordinary personalized PageRank. The linear SBM method, assuming a common covariance matrix for the two classes, exhibits recall nearly identical to the quadratic method.
54 Summarizing: Personalized PageRank is shown to be the optimal geometric discriminant function in the space of landing probabilities for classifying nodes in a hidden seed community in an SBM. Building on this connection between SBMs and personalized PageRank, more complex covariance-adjusted linear and quadratic approaches to classification in the space of landing probabilities were developed and evaluated. These classifiers dramatically outperform personalized PageRank and heat kernel methods for recovering seed sets in SBMs. The connection between personalized PageRank and SBMs is surprising, and it points toward a large scope for further research.
More informationLocal Community Detection in Dynamic Graphs Using Personalized Centrality
algorithms Article Local Community Detection in Dynamic Graphs Using Personalized Centrality Eisha Nathan, Anita Zakrzewska, Jason Riedy and David A. Bader * School of Computational Science and Engineering,
More informationSemi-Supervised Learning: Lecture Notes
Semi-Supervised Learning: Lecture Notes William W. Cohen March 30, 2018 1 What is Semi-Supervised Learning? In supervised learning, a learner is given a dataset of m labeled examples {(x 1, y 1 ),...,
More informationCS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science
CS-C3160 - Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Contents of this chapter Story
More informationLecture 27: Learning from relational data
Lecture 27: Learning from relational data STATS 202: Data mining and analysis December 2, 2017 1 / 12 Announcements Kaggle deadline is this Thursday (Dec 7) at 4pm. If you haven t already, make a submission
More informationMCL. (and other clustering algorithms) 858L
MCL (and other clustering algorithms) 858L Comparing Clustering Algorithms Brohee and van Helden (2006) compared 4 graph clustering algorithms for the task of finding protein complexes: MCODE RNSC Restricted
More informationSemantic text features from small world graphs
Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationCS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University
CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document
More informationBruno Martins. 1 st Semester 2012/2013
Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4
More informationCHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS
CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS 8.1 Introduction The recognition systems developed so far were for simple characters comprising of consonants and vowels. But there is one
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More informationLecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering
CSE 291: Unsupervised learning Spring 2008 Lecture 5 Finding meaningful clusters in data So far we ve been in the vector quantization mindset, where we want to approximate a data set by a small number
More informationLINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION
LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION Evgeny Kharitonov *, ***, Anton Slesarev *, ***, Ilya Muchnik **, ***, Fedor Romanenko ***, Dmitry Belyaev ***, Dmitry Kotlyarov *** * Moscow Institute
More informationText Analytics (Text Mining)
CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,
More informationLecture 17 November 7
CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/6/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 High dim. data Graph data Infinite data Machine
More informationLink Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.
Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. 1 Contents Introduction Network properties Social network analysis Co-citation
More informationAlpha-Beta Community
Alpha-Beta Community Supasorn Suwajanakorn March 15, 2010 Supasorn Suwajanakorn () Alpha-Beta Community March 15, 2010 1 / 19 Definition from Nina s paper Definition 1. Given a graph G = (V, E), where
More informationCollaborative filtering based on a random walk model on a graph
Collaborative filtering based on a random walk model on a graph Marco Saerens, Francois Fouss, Alain Pirotte, Luh Yen, Pierre Dupont (UCL) Jean-Michel Renders (Xerox Research Europe) Some recent methods:
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationFeature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.
CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationMAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds
MAE 298, Lecture 9 April 30, 2007 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in
More informationECG782: Multidimensional Digital Signal Processing
ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting
More informationClustering: Overview and K-means algorithm
Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin
More informationCS224W Final Report Emergence of Global Status Hierarchy in Social Networks
CS224W Final Report Emergence of Global Status Hierarchy in Social Networks Group 0: Yue Chen, Jia Ji, Yizheng Liao December 0, 202 Introduction Social network analysis provides insights into a wide range
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #6: Mining Data Streams Seoul National University 1 Outline Overview Sampling From Data Stream Queries Over Sliding Window 2 Data Streams In many data mining situations,
More information