Estimating Sizes of Social Networks via Biased Sampling


1 Estimating Sizes of Social Networks via Biased Sampling Liran Katzir, Edo Liberty, and Oren Somekh Yahoo! Labs, Haifa, Israel International World Wide Web Conference, 28th March - 1st April 2011, Hyderabad, India Yahoo! Labs: WWW / 20

2 Social Network size estimation
Goal: obtain estimates of the sizes of (sub)populations in a social network.
Why:
  Advertisement: estimate of market share.
  Business development: merger/acquisition or asset valuation.

3 The Problem
Difficulties:
  Social networks have become very large: Facebook (650,000,000), Qzone (200,000,000), Twitter (175,000,000), ...
  There is no public API for population size queries:
    What is the total number of registered users?
    What is the number of registered users of a given (self-declared) age range living in New York?
  Even if a public API is provided, an independent estimate is needed.
  An exhaustive crawl is time-, space-, and communication-intensive and violates politeness.

4 Population size estimation
Population sizes can be estimated efficiently using the birthday paradox.
The birthday paradox: given $r$ uniform samples from a set of $n$ elements, the expected number of collisions is $\frac{r(r-1)}{2n}$.
A collision is a pair of identical samples.
Example: the samples $X = (d, b, b, a, b, e)$ contain 3 collisions in total: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.
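
The collision count in the example can be checked with a few lines of Python. This is only a minimal sketch: the function names `count_collisions` and `expected_collisions` are illustrative and do not come from the slides.

```python
from collections import Counter


def count_collisions(samples):
    """Count unordered pairs (i, j), i < j, with samples[i] == samples[j]."""
    counts = Counter(samples)
    # A value observed k times contributes k * (k - 1) / 2 colliding pairs.
    return sum(k * (k - 1) // 2 for k in counts.values())


def expected_collisions(r, n):
    """Expected number of collisions for r uniform samples from n elements."""
    return r * (r - 1) / (2 * n)


# The sample sequence from the slide: b appears three times, giving 3 pairs.
X = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(X))
```

Counting per-value multiplicities avoids comparing all pairs, which matters once the number of samples grows.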

5 Population size estimation
Using the birthday paradox inversely: when observing $C$ collisions, the population can be estimated by $n \approx \frac{r^2}{2C}$.
If $r = \text{const} \cdot \sqrt{n}$, this gives a rather good estimator.
This is similar to mark-and-recapture, which counts collisions between two sample sets (but is essentially equivalent).
Newer versions of mark-and-recapture also handle non-uniform but a priori known distributions [Chao, 1987].
Social network size estimation: [Ye and Wu, 2010].
Alas, we cannot sample users uniformly from most social networks...
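
A rough sketch of the inverse use of the birthday paradox under uniform sampling follows. The helper names `estimate_uniform` and `simulate`, and the choice to return `None` when no collision has been observed, are assumptions made for illustration only.

```python
import random
from collections import Counter


def estimate_uniform(samples):
    """Estimate the population size as r^2 / (2C) from uniform samples."""
    r = len(samples)
    collisions = sum(k * (k - 1) // 2 for k in Counter(samples).values())
    if collisions == 0:
        # Without any collision the estimator is undefined.
        return None
    return r * r / (2 * collisions)


def simulate(n, r, seed=0):
    """Draw r uniform samples from {0, ..., n-1} and estimate n from them."""
    rng = random.Random(seed)
    return estimate_uniform([rng.randrange(n) for _ in range(r)])


# With r well above sqrt(n), several collisions are expected.
print(simulate(n=10_000, r=1_000))
```

With $r$ far below $\sqrt{n}$ the sample will usually contain no collision at all, which is why the sample size has to scale with $\sqrt{n}$ in the uniform setting.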

6 Uniform distribution on graphs
Social networks can be viewed as undirected graphs, which we can traverse using their public APIs.
Special random walks can generate close-to-uniform samples:
  1. Bipartite query-web page graph [Bharat and Broder, 1998] [Bar-Yossef and Gurevich, 2007].
  2. Social network [Gjoka et al., 2010].
This uses only $r = \text{const} \cdot \sqrt{n}$ samples, but obtaining each sample might be hard.

7 Graph size estimation
It is possible to estimate the size of some graphs directly:
  1. Estimate the size of a tree [Knuth, 1974].
  2. Estimate the size of a directed acyclic graph [Pitt, 1987].
We give an estimator for the size of undirected graphs (and subgraphs) which:
  1. Counts collisions but uses the graph's stationary distribution (does not require a uniform sample).
  2. Requires asymptotically fewer than $\sqrt{n}$ samples to converge.
  3. Obtains samples efficiently (provably small number of random walk steps).

8 Assumptions
The graph can be traversed from nodes to neighboring nodes.
We can perform a random walk on the graph:
  Start at any node.
  In each step, proceed to one of the neighbors chosen uniformly at random.
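
The walk described above can be sketched as follows. The `neighbors` callback stands in for whatever public API call returns a user's friend list; it, the toy graph `TOY_GRAPH`, and the function name `random_walk` are hypothetical and only serve the illustration.

```python
import random

# Hypothetical toy graph (undirected, stored as adjacency lists).
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}


def random_walk(neighbors, start, steps, rng=None):
    """Walk `steps` steps, moving to a uniformly chosen neighbor each time."""
    if rng is None:
        rng = random.Random()
    node = start
    path = [node]
    for _ in range(steps):
        # `neighbors(node)` is assumed to return a non-empty list.
        node = rng.choice(neighbors(node))
        path.append(node)
    return path


path = random_walk(lambda v: TOY_GRAPH[v], "a", steps=10, rng=random.Random(1))
print(path)
```

Only neighbor lookups are needed, which matches the assumption that the graph is reachable solely through node-to-neighbor traversal.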

9 Facts about random walks
This random walk yields the stationary distribution:
  1. The probability of getting the $i$-th node is $d_i / D$.
  2. $d_i$ is the $i$-th node's degree.
  3. $D = \sum_{i=1}^{n} d_i$.
Taking a few steps between samples (or running several walks) ensures independence between two consecutive samples.
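
That $d_i / D$ is indeed stationary can be verified exactly on a small example. The sketch below uses a hypothetical four-node graph and exact fractions; `stationary` and `one_step` are illustrative names.

```python
from fractions import Fraction

# Hypothetical undirected toy graph with degrees 2, 2, 3, 1 (D = 8).
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}


def stationary(graph):
    """Return the distribution that assigns d_i / D to node i."""
    total = sum(len(nbrs) for nbrs in graph.values())
    return {v: Fraction(len(nbrs), total) for v, nbrs in graph.items()}


def one_step(graph, dist):
    """Push a distribution through one step of the simple random walk."""
    nxt = {v: Fraction(0) for v in graph}
    for v, p in dist.items():
        for u in graph[v]:
            # Each neighbor receives an equal share of v's probability mass.
            nxt[u] += p / len(graph[v])
    return nxt


pi = stationary(GRAPH)
print(one_step(GRAPH, pi) == pi)
```

Each node $v$ sends mass $\frac{d_v}{D} \cdot \frac{1}{d_v} = \frac{1}{D}$ along every edge, so node $u$ receives $d_u / D$ in total, which is why the distribution is unchanged by a step.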

10 Algorithm Outline
  1. Sample $r$ users using a random walk.
  2. $C$: the number of collisions.
  3. $\Psi_1$: the sum of the sampled nodes' degrees.
  4. $\Psi_{-1}$: the sum of the inverses of the sampled nodes' degrees.
The estimated number of nodes: $\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C}$.

11-47 Worked example (one sample is added per step; blank cells are values missing from the transcript)
Sample | Sampled node | Sampled node degree | C | $\Psi_1$ | $\Psi_{-1}$ | $\hat{n}$
1 | d | 3 | 0 | 3 | 1/3 |
2 | f | 2 | 0 | 5 | 5/6 |
3 | f | | | | 16/12 | 4
4 | c | | | | 19/12 | 8
5 | c | | | | 22/12 | 6
6 | d | | | | 26/12 |
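
The running quantities of the worked example can be recomputed step by step. Only the degrees of the first two samples (3 and 2) appear in the transcript; the degrees used below for the remaining samples (2, 4, 4, 3) are inferred from the increments of $\Psi_{-1}$ and should be read as an assumption. `TRACE` and `running_estimates` are illustrative names.

```python
from fractions import Fraction

# (node, degree) per sample; degrees after the second sample are inferred
# from the increments of the inverse-degree sum, not read off the slides.
TRACE = [("d", 3), ("f", 2), ("f", 2), ("c", 4), ("c", 4), ("d", 3)]


def running_estimates(trace):
    """Return (node, C, psi_1, psi_minus_1, estimate) after each sample."""
    seen = {}
    collisions = 0
    psi_1 = 0
    psi_minus_1 = Fraction(0)
    rows = []
    for node, degree in trace:
        # A new sample collides with every earlier sample of the same node.
        collisions += seen.get(node, 0)
        seen[node] = seen.get(node, 0) + 1
        psi_1 += degree
        psi_minus_1 += Fraction(1, degree)
        estimate = psi_1 * psi_minus_1 / (2 * collisions) if collisions else None
        rows.append((node, collisions, psi_1, psi_minus_1, estimate))
    return rows


for row in running_estimates(TRACE):
    print(row)
```

Under these assumed degrees the inverse-degree sums match the transcript exactly (1/3, 5/6, 16/12, 19/12, 22/12, 26/12), and the integer parts of the estimates after the third, fourth, and fifth samples agree with the values 4, 8, and 6 shown there.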

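48 Proof Intuition
Notation: $n$ is the graph size, $d_i$ is the degree of node $i$, $r$ is the number of samples, and $D = \sum_{i=1}^{n} d_i$.
Expectations:
  $E[\Psi_1] = r D \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$
  $E[\Psi_{-1}] = \frac{r n}{D}$
  $E[C] = \binom{r}{2} \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$
Therefore
  $\frac{E[\Psi_1] E[\Psi_{-1}]}{2 E[C]} = n \cdot \frac{r}{r-1} \approx n$
and
  $\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C} \approx \frac{E[\Psi_1] E[\Psi_{-1}]}{2 E[C]} \approx n$.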
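
The ratio of expectations can be checked exactly for any small degree sequence. The sketch below assumes independent samples drawn with probability $d_i / D$; the function name `expected_ratio` and the degree list in the example are made up for illustration.

```python
from fractions import Fraction
from math import comb


def expected_ratio(degrees, r):
    """Compute E[psi_1] * E[psi_-1] / (2 * E[C]) exactly."""
    n = len(degrees)
    D = sum(degrees)
    # Sum of squared stationary probabilities (d_i / D)^2.
    s = sum(Fraction(d, D) ** 2 for d in degrees)
    e_psi_1 = r * D * s
    e_psi_minus_1 = Fraction(r * n, D)
    e_c = comb(r, 2) * s
    return e_psi_1 * e_psi_minus_1 / (2 * e_c)


# Five nodes, ten samples: the ratio should equal n * r / (r - 1) = 50/9.
print(expected_ratio([3, 2, 4, 1, 2], r=10))
```

The sum of squared probabilities cancels between numerator and denominator, so the result does not depend on the degree sequence, only on $n$ and $r$.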

49 Analytic Results
Main statement: using $r(n, \epsilon, \delta)$ samples,
  $\Pr[n(1 - \epsilon) \le \hat{n} \le n(1 + \epsilon)] \ge 1 - \delta$
Uniform vs. biased, example: for $n = 10^9$, $\sqrt{n} \approx 30{,}000$ and $\sqrt[4]{n} \log n \approx 6{,}000$.
Sampling method | Number of samples
Any graph, uniform | $O(\sqrt{n})$
Synthetic graph, Zipfian degree distribution $\alpha = 2$, $d_m = \sqrt{n}$, random walk | $O(\sqrt[4]{n} \log n)$
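
The orders of magnitude in the example can be reproduced with a short calculation. The slide does not state the base of the logarithm or the hidden constants, so the sketch below evaluates both base 2 and the natural logarithm and should be read as an order-of-magnitude check only; `sample_budgets` is an illustrative name.

```python
import math


def sample_budgets(n):
    """Return sqrt(n) and n^(1/4) * log(n) for two choices of log base."""
    quarter_root = n ** 0.25
    return {
        "uniform_sqrt_n": math.sqrt(n),
        "biased_log2": quarter_root * math.log2(n),
        "biased_ln": quarter_root * math.log(n),
    }


for name, value in sample_budgets(10**9).items():
    print(f"{name}: {value:,.0f}")
```

Either choice of base gives a figure several times smaller than $\sqrt{n}$, which is the point of the comparison.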

50 Setup
Networks of known sizes:
Network | Size | Edges
Synthetic | 1,000,000 | Zipfian $\alpha = 2$, $d_m = 1000$
DBLP | 845,211 | co-authorship
IMDB | 1,955,508 | co-casting

51 A Synthetic Network, Degree Zipfian $\alpha = 2$, $d_m = 1000$
[Figure: synthetic network. Size estimation (relative to network size) vs. number of samples (percentage of network size). Curves: 5% and 95% confidence interval bounds for uniform-distribution and degree-distribution sampling, non-unique samples.]

52 DBLP - The Digital Bibliography and Library Project
[Figure: DBLP network. Size estimation (relative to network size) vs. number of samples (percentage of network size). Curves: 5% and 95% confidence interval bounds for uniform-distribution and degree-distribution sampling, non-unique samples.]

53 IMDB - The Internet Movie Database
[Figure: IMDB network. Size estimation (relative to network size) vs. number of samples (percentage of network size). Curves: 5% and 95% confidence interval bounds for uniform-distribution and degree-distribution sampling, non-unique samples.]

54 Facebook
Date | April 2009 | October 2010
Sampling method | uniform | random walk
Number of samples | |
Collision estimator | |
Facebook report | |
Thanks to Minas Gjoka for the Facebook crawls.

55 Conclusions
An efficient algorithm for estimating the size of a social network through its public API was presented.
Its effectiveness was demonstrated on synthetic and real-world networks.
The algorithm outperforms prior-art methods by using biased sampling.
The algorithm also applies to sub-populations.

56 Thanks!


More information

Using Non-Linear Dynamical Systems for Web Searching and Ranking

Using Non-Linear Dynamical Systems for Web Searching and Ranking Using Non-Linear Dynamical Systems for Web Searching and Ranking Panayiotis Tsaparas Dipartmento di Informatica e Systemistica Universita di Roma, La Sapienza tsap@dis.uniroma.it ABSTRACT In the recent

More information

Choosing a Random Peer

Choosing a Random Peer Choosing a Random Peer Jared Saia University of New Mexico Joint Work with Valerie King University of Victoria and Scott Lewis University of New Mexico P2P problems Easy problems on small networks become

More information

Graph Data Processing with MapReduce

Graph Data Processing with MapReduce Distributed data processing on the Cloud Lecture 5 Graph Data Processing with MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, 2015 (licensed under Creation Commons Attribution

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Graph Theory for Network Science

Graph Theory for Network Science Graph Theory for Network Science Dr. Natarajan Meghanathan Professor Department of Computer Science Jackson State University, Jackson, MS E-mail: natarajan.meghanathan@jsums.edu Networks or Graphs We typically

More information

Biological Networks Analysis

Biological Networks Analysis Biological Networks Analysis Introduction and Dijkstra s algorithm Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein The clustering problem: partition genes into distinct

More information

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming

More information

1.1 Our Solution: Random Walks for Uniform Sampling In order to estimate the results of aggregate queries or the fraction of all web pages that would

1.1 Our Solution: Random Walks for Uniform Sampling In order to estimate the results of aggregate queries or the fraction of all web pages that would Approximating Aggregate Queries about Web Pages via Random Walks Λ Ziv Bar-Yossef y Alexander Berg Steve Chien z Jittat Fakcharoenphol x Dror Weitz Computer Science Division University of California at

More information

Outline. Last 3 Weeks. Today. General background. web characterization ( web archaeology ) size and shape of the web

Outline. Last 3 Weeks. Today. General background. web characterization ( web archaeology ) size and shape of the web Web Structures Outline Last 3 Weeks General background Today web characterization ( web archaeology ) size and shape of the web What is the size of the web? Issues The web is really infinite Dynamic content,

More information

Local Partitioning using PageRank

Local Partitioning using PageRank Local Partitioning using PageRank Reid Andersen Fan Chung Kevin Lang UCSD, UCSD, Yahoo! What is a local partitioning algorithm? An algorithm for dividing a graph into two pieces. Instead of searching for

More information

ASAP: Fast, Approximate Graph Pattern Mining at Scale

ASAP: Fast, Approximate Graph Pattern Mining at Scale ASAP: Fast, Approximate Graph Pattern Mining at Scale Anand Padmanabha Iyer, UC Berkeley; Zaoxing Liu and Xin Jin, Johns Hopkins University; Shivaram Venkataraman, Microsoft Research / University of Wisconsin;

More information

Gary Viray Founder, Search Opt Media Inc. Search.Rank.Convert.

Gary Viray Founder, Search Opt Media Inc. Search.Rank.Convert. SEARCH + SOCIAL Gary Viray Founder, Search Opt Media Inc. Goo gol Google Algorithm Change Google Toolbar December 2000 Birth of Toolbar Pagerank They move the toilet mid stream. 404P Pages are ranking

More information

On Asymptotic Cost of Triangle Listing in Random Graphs

On Asymptotic Cost of Triangle Listing in Random Graphs On Asymptotic Cost of Triangle Listing in Random Graphs Di Xiao, Yi Cui, Daren B.H. Cline, Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M University May

More information

SybilLimit: A Near-Optimal Social Network Defense against Sybil Attacks

SybilLimit: A Near-Optimal Social Network Defense against Sybil Attacks 2008 IEEE Symposium on Security and Privacy SybilLimit: A Near-Optimal Social Network Defense against Sybil Attacks Haifeng Yu National University of Singapore haifeng@comp.nus.edu.sg Michael Kaminsky

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

Maximizing the Spread of Influence through a Social Network

Maximizing the Spread of Influence through a Social Network Maximizing the Spread of Influence through a Social Network By David Kempe, Jon Kleinberg, Eva Tardos Report by Joe Abrams Social Networks Infectious disease networks Viral Marketing Viral Marketing Example:

More information

Diffusion and Clustering on Large Graphs

Diffusion and Clustering on Large Graphs Diffusion and Clustering on Large Graphs Alexander Tsiatas Final Defense 17 May 2012 Introduction Graphs are omnipresent in the real world both natural and man-made Examples of large graphs: The World

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Raquel Urtasun and Tamir Hazan TTI Chicago April 25, 2011 Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 25, 2011 1 / 17 Clique Trees Today we are going to

More information

A quick review. Which molecular processes/functions are involved in a certain phenotype (e.g., disease, stress response, etc.)

A quick review. Which molecular processes/functions are involved in a certain phenotype (e.g., disease, stress response, etc.) Gene expression profiling A quick review Which molecular processes/functions are involved in a certain phenotype (e.g., disease, stress response, etc.) The Gene Ontology (GO) Project Provides shared vocabulary/annotation

More information

Inferring Coarse Views of Connectivity in Very Large Graphs

Inferring Coarse Views of Connectivity in Very Large Graphs Inferring Coarse Views of Connectivity in Very Large Graphs Reza Motamedi, Reza Rejaie, Walter Willinger, Daniel Lowd, Roberto Gonzalez http://onrg.cs.uoregon.edu/walkabout 10/8/14 1 Introduction! Large-scale

More information

Scalable Network Analysis

Scalable Network Analysis Inderjit S. Dhillon University of Texas at Austin COMAD, Ahmedabad, India Dec 20, 2013 Outline Unstructured Data - Scale & Diversity Evolving Networks Machine Learning Problems arising in Networks Recommender

More information

Lecture 5: Graphs. Rajat Mittal. IIT Kanpur

Lecture 5: Graphs. Rajat Mittal. IIT Kanpur Lecture : Graphs Rajat Mittal IIT Kanpur Combinatorial graphs provide a natural way to model connections between different objects. They are very useful in depicting communication networks, social networks

More information

Performance and cost effectiveness of caching in mobile access networks

Performance and cost effectiveness of caching in mobile access networks Performance and cost effectiveness of caching in mobile access networks Jim Roberts (IRT-SystemX) joint work with Salah Eddine Elayoubi (Orange Labs) ICN 2015 October 2015 The memory-bandwidth tradeoff

More information

Social Network Analysis

Social Network Analysis Social Network Analysis Mathematics of Networks Manar Mohaisen Department of EEC Engineering Adjacency matrix Network types Edge list Adjacency list Graph representation 2 Adjacency matrix Adjacency matrix

More information

Efficient Identification of Starters and Followers in Social Media

Efficient Identification of Starters and Followers in Social Media Efficient Identification of Starters and Followers in Social Media Michael Mathioudakis Department of Computer Science University of Toronto mathiou@cs.toronto.edu Nick Koudas Department of Computer Science

More information

Getafix: Workload-aware Distributed Interactive Analytics

Getafix: Workload-aware Distributed Interactive Analytics Getafix: Workload-aware Distributed Interactive Analytics Presenter: Mainak Ghosh Collaborators: Le Xu, Xiaoyao Qian, Thomas Kao, Indranil Gupta, Himanshu Gupta Data Analytics 2 Picture borrowed from https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/51640

More information

Think before You Discard: Accurate Triangle Counting in Graph Streams with Deletions

Think before You Discard: Accurate Triangle Counting in Graph Streams with Deletions Think before You Discard: Accurate Triangle Counting in Graph Streams with Deletions Kijung Shin 1( ), Jisu Kim 2, Bryan Hooi 2, and Christos Faloutsos 1 School of Computer Science, Carnegie Mellon University,

More information

On Dimensionality Reduction of Massive Graphs for Indexing and Retrieval

On Dimensionality Reduction of Massive Graphs for Indexing and Retrieval On Dimensionality Reduction of Massive Graphs for Indexing and Retrieval Charu C. Aggarwal 1, Haixun Wang # IBM T. J. Watson Research Center Hawthorne, NY 153, USA 1 charu@us.ibm.com # Microsoft Research

More information

US Patent 6,658,423. William Pugh

US Patent 6,658,423. William Pugh US Patent 6,658,423 William Pugh Detecting duplicate and near - duplicate files Worked on this problem at Google in summer of 2000 I have no information whether this is currently being used I know that

More information

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS Overview of Networks Instructor: Yizhou Sun yzsun@cs.ucla.edu January 10, 2017 Overview of Information Network Analysis Network Representation Network

More information