Estimating Sizes of Social Networks via Biased Sampling

1 Estimating Sizes of Social Networks via Biased Sampling
Liran Katzir, Edo Liberty, and Oren Somekh
Yahoo! Labs, Haifa, Israel
International World Wide Web Conference, 28th March - 1st April 2011, Hyderabad, India
Yahoo! Labs: WWW / 20

2 Social Network size estimation
Goal: obtain estimates for the sizes of (sub)populations in a social network.
Why:
- Advertisement: estimate of market share.
- Business development: merger/acquisition or asset valuation.

3 The Problem
Difficulties:
- Social networks have become very large: Facebook (650,000,000), Qzone (200,000,000), Twitter (175,000,000), ...
- No public API for population size queries:
  - What is the total number of registered users?
  - What is the number of registered users of a given (self-declared) age living in New York?
- Even if a public API is provided, an independent estimate is needed.
- An exhaustive crawl is time, space, and communication intensive, and violates politeness.

4 Population size estimation
Population sizes can be estimated efficiently using the birthday paradox.
The birthday paradox: given $r$ uniform samples from a set of $n$ elements, the expected number of collisions is $\frac{r(r-1)}{2n}$.
A collision is a pair of identical samples.
Example: the samples $X = (d, b, b, a, b, e)$ contain 3 collisions in total: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.
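The collision count on this slide can be reproduced with a short sketch. The function names `count_collisions` and `expected_collisions` are illustrative and do not come from the talk. A value that appears k times contributes k(k-1)/2 colliding pairs.

```python
from collections import Counter


def count_collisions(samples):
    """Number of unordered pairs (i, j), i < j, with samples[i] == samples[j]."""
    counts = Counter(samples)
    # A value seen k times forms k * (k - 1) / 2 identical pairs.
    return sum(k * (k - 1) // 2 for k in counts.values())


def expected_collisions(r, n):
    """Expected collisions among r uniform samples from n elements."""
    return r * (r - 1) / (2 * n)


# The slide's example: b appears three times, so there are three pairs.
print(count_collisions(["d", "b", "b", "a", "b", "e"]))  # prints 3
```

Counting per value rather than comparing all pairs keeps the work linear in the number of samples.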

5 Population size estimation
Using the birthday paradox inversely: when observing $C$ collisions, the population can be estimated by $n \approx \frac{r^2}{2C}$.
- If $r = \text{const} \cdot \sqrt{n}$, this gives a rather good estimator.
- Similar to mark-and-recapture, which counts collisions between two sample sets (but is essentially equivalent).
- Newer versions of mark-and-recapture also handle non-uniform but a priori known distributions [Chao, 1987].
- Social network size estimation [Ye and Wu, 2010].
Alas, we cannot sample users uniformly from most social networks...
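A minimal sketch of the inverse birthday estimator under uniform sampling follows. The population size, the constant 20 in front of the square root, and the fixed seed are arbitrary choices for illustration, and `birthday_estimate` is a hypothetical name.

```python
import math
import random
from collections import Counter


def birthday_estimate(samples):
    """Estimate the population size as r^2 / (2C); None if no collision was seen."""
    r = len(samples)
    counts = Counter(samples)
    collisions = sum(k * (k - 1) // 2 for k in counts.values())
    if collisions == 0:
        return None  # the estimator is undefined without collisions
    return r * r / (2 * collisions)


# Simulated uniform sampling from a population of known size (illustrative values).
rng = random.Random(0)
n = 10_000
r = 20 * math.isqrt(n)  # r = const * sqrt(n), here 2000 samples
samples = [rng.randrange(n) for _ in range(r)]
estimate = birthday_estimate(samples)
print(estimate)
```

With r proportional to the square root of n, the expected number of collisions is a constant (about 200 here), which is what keeps the estimate stable.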

6 Uniform distribution on graphs
A social network can be viewed as an undirected graph that we can traverse using its public API.
Special random walks can generate close-to-uniform samples:
1. Bipartite query-web page graph [Bharat and Broder, 1998], [Bar-Yossef and Gurevich, 2007].
2. Social network [Gjoka et al., 2010].
This uses only $r = \text{const} \cdot \sqrt{n}$ samples, but obtaining each sample might be hard.

7 Graph size estimation
It is possible to estimate the size of some graphs directly:
1. Estimate the size of a tree [Knuth, 1974].
2. Estimate the size of a directed acyclic graph [Pitt, 1987].
We give an estimator for the size of undirected graphs (and subgraphs) which:
1. Counts collisions but uses the graph's stationary distribution (does not require a uniform sample).
2. Requires asymptotically fewer than $\sqrt{n}$ samples to converge.
3. Obtains samples efficiently (provably small number of random walk steps).

8 Assumptions
- The graph can be traversed from nodes to neighboring nodes.
- We can perform a random walk on the graph:
  - Start at any node.
  - In each step, proceed to one of the neighbors uniformly at random.
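The walk described above can be sketched as follows. The graph is a small made-up adjacency dictionary; in a real setting the lookup `neighbors[node]` would be replaced by a call to the network's public API, which is not modeled here.

```python
import random


def random_walk(neighbors, start, steps, rng):
    """Simple random walk: from each node, move to a uniformly chosen neighbor."""
    node = start
    path = [node]
    for _ in range(steps):
        node = rng.choice(neighbors[node])
        path.append(node)
    return path


# Hypothetical undirected graph with edges a-b, b-c, b-d, c-d.
graph = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b", "d"],
    "d": ["b", "c"],
}
path = random_walk(graph, "a", 10, random.Random(1))
print(path)
```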

9 Facts about random walks
This random walk yields the stationary distribution:
1. The probability of getting the $i$-th node is $\frac{d_i}{D}$.
2. $d_i$ is the $i$-th node's degree.
3. $D = \sum_{i=1}^{n} d_i$.
Taking a few steps between samples (or running several walks) makes two consecutive samples approximately independent.
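That the degree-proportional distribution is stationary can be checked exactly on a small graph. The sketch below uses a made-up four-node graph and exact fractions; `stationary_distribution` and `one_step` are illustrative names.

```python
from fractions import Fraction


def stationary_distribution(neighbors):
    """pi_i = d_i / D, where D is the sum of all degrees."""
    total_degree = sum(len(adj) for adj in neighbors.values())
    return {v: Fraction(len(adj), total_degree) for v, adj in neighbors.items()}


def one_step(neighbors, dist):
    """Push a distribution through one step of the simple random walk."""
    nxt = {v: Fraction(0) for v in neighbors}
    for v, adj in neighbors.items():
        share = dist[v] / len(adj)  # mass at v splits evenly among its neighbors
        for u in adj:
            nxt[u] += share
    return nxt


# Hypothetical undirected graph with edges a-b, b-c, b-d, c-d (D = 8).
graph = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b", "d"],
    "d": ["b", "c"],
}
pi = stationary_distribution(graph)
print(pi["b"])                    # prints 3/8
print(one_step(graph, pi) == pi)  # prints True
```

The check confirms only that the distribution is unchanged by a step of the walk. How quickly a walk approaches it depends on the graph and is not addressed by this sketch.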

10 Algorithm Outline
1. Sample $r$ users using a random walk.
2. $C$: the number of collisions.
3. $\Psi_1$: the sum of the sampled nodes' degrees.
4. $\Psi_{-1}$: the sum of the inverses of the sampled nodes' degrees.
The estimated number of nodes: $\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C}$.
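The outline translates into a few lines of code. This is a sketch that assumes the samples are already available as (node id, degree) pairs; `estimate_size` is a hypothetical name, and exact fractions are used only to keep the arithmetic transparent.

```python
from collections import Counter
from fractions import Fraction


def estimate_size(samples):
    """Return Psi_1 * Psi_{-1} / (2C) for a list of (node_id, degree) samples."""
    counts = Counter(node for node, _ in samples)
    collisions = sum(k * (k - 1) // 2 for k in counts.values())
    if collisions == 0:
        return None  # no collisions yet, so the estimate is undefined
    psi_1 = sum(degree for _, degree in samples)
    psi_minus_1 = sum(Fraction(1, degree) for _, degree in samples)
    return psi_1 * psi_minus_1 / (2 * collisions)


# Sanity check on made-up samples from a 4-regular graph.
regular_samples = [("u", 4), ("v", 4), ("u", 4), ("w", 4)]
print(estimate_size(regular_samples))  # prints 8
```

When every sampled node has the same degree, the product of the two sums equals r squared, so the estimator reduces to the uniform birthday estimate (here 16 / 2 = 8).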

11-47 Example run (animation frames of a random walk on a small graph; the graph drawing is not preserved)
Sampled nodes:        d     f     f      c      c      d
Sampled node degree:  3     2     2      4      4      3
$C$:                  0     0     1      1      2      3
$\Psi_1$:             3     5     7      11     15     18
$\Psi_{-1}$:          1/3   5/6   16/12  19/12  22/12  26/12
$\hat{n}$:            -     -     4      8      6
$\hat{n}$ is defined only once $C > 0$, and the displayed values are rounded down.
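The example run on slides 11-47 can be replayed with a streaming version of the estimator that updates its statistics one sample at a time. In this sketch the degrees of d and f are the ones shown on the slides, while the degree of c (4) is inferred from the step from 16/12 to 19/12 in the inverse-degree sum. `running_estimates` is an illustrative name.

```python
from collections import Counter
from fractions import Fraction


def running_estimates(samples):
    """Yield the size estimate after each (node, degree) sample, or None while C == 0."""
    seen = Counter()
    collisions = 0
    psi_1 = 0
    psi_minus_1 = Fraction(0)
    estimates = []
    for node, degree in samples:
        collisions += seen[node]  # the new sample collides with every earlier copy
        seen[node] += 1
        psi_1 += degree
        psi_minus_1 += Fraction(1, degree)
        if collisions:
            estimates.append(psi_1 * psi_minus_1 / (2 * collisions))
        else:
            estimates.append(None)
    return estimates


degree = {"d": 3, "f": 2, "c": 4}  # c = 4 is inferred, see the note above
walk = ["d", "f", "f", "c", "c", "d"]
estimates = running_estimates([(v, degree[v]) for v in walk])
print(estimates)
```

Rounding the third to fifth estimates down gives 4, 8, and 6, matching the values displayed on the slides.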

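48 Proof Intuition
Notation: $n$ is the graph size, $d_i$ is the degree of node $i$, $r$ is the number of samples, and $D = \sum_{i=1}^{n} d_i$.
Expectations:
$E[\Psi_1] = rD \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$, $E[C] = \binom{r}{2} \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$, $E[\Psi_{-1}] = \frac{rn}{D}$.
$\frac{E[\Psi_1] E[\Psi_{-1}]}{2E[C]} = n \frac{r}{r-1} \approx n$.
$\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C} \approx \frac{E[\Psi_1] E[\Psi_{-1}]}{2E[C]} \approx n$.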
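The identity between the three expectations can be verified exactly for any concrete degree sequence. The sketch below does so for a made-up sequence of four degrees and r = 5 samples, assuming independent samples drawn with probability proportional to degree; `expected_ratio` is an illustrative name.

```python
from fractions import Fraction
from math import comb


def expected_ratio(degrees, r):
    """E[Psi_1] * E[Psi_{-1}] / (2 * E[C]) for r degree-proportional samples."""
    n = len(degrees)
    total = sum(degrees)  # D
    sum_sq = sum(Fraction(d, total) ** 2 for d in degrees)  # sum of (d_i / D)^2
    e_psi_1 = r * total * sum_sq
    e_psi_minus_1 = Fraction(r * n, total)
    e_collisions = comb(r, 2) * sum_sq
    return e_psi_1 * e_psi_minus_1 / (2 * e_collisions)


degrees = [1, 3, 2, 2]  # hypothetical graph with n = 4 nodes
r = 5
print(expected_ratio(degrees, r))  # prints 5, which is n * r / (r - 1)
```

The sum of squared probabilities cancels between numerator and denominator, which is why the ratio does not depend on the degree sequence.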

49 Analytic Results
Main statement: using $r(n, \epsilon, \delta)$ samples, $\Pr[n(1-\epsilon) \le \hat{n} \le n(1+\epsilon)] \ge 1 - \delta$.
Uniform vs. biased, example: for $n = 10^9$, $\sqrt{n} \approx 30{,}000$ and $\sqrt[4]{n} \log n \approx 6{,}000$.
Sampling method                                                                            Number of samples
Any graph, uniform                                                                         $O(\sqrt{n})$
Synthetic graph, Zipfian degree distribution $\alpha = 2$, $d_m = \sqrt{n}$, random walk   $O(\sqrt[4]{n} \log n)$

50 Setup
Networks of known sizes:
Network     Size        Edges
Synthetic   1,000,000   Zipfian $\alpha = 2$, $d_m = 1000$
DBLP        845,211     co-authorship
IMDB        1,955,508   co-casting
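The slides do not say how the synthetic network was generated. As a rough illustration only, a Zipfian degree sequence with parameter alpha and maximum degree d_m can be drawn as below, assuming the usual definition in which the probability of degree d is proportional to d^(-alpha). The function name and the reduced size of 10,000 nodes are assumptions, and turning a degree sequence into an actual graph is not covered.

```python
import random


def zipf_degree_sequence(n, alpha, d_max, rng):
    """Draw n degrees from P(d) proportional to d^(-alpha), for 1 <= d <= d_max."""
    degrees = range(1, d_max + 1)
    weights = [d ** -alpha for d in degrees]
    return rng.choices(degrees, weights=weights, k=n)


# Smaller than the 1,000,000-node network in the talk, to keep the example fast.
sequence = zipf_degree_sequence(10_000, 2, 1000, random.Random(0))
print(max(sequence), sum(sequence) / len(sequence))
```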

51 A Synthetic Network, Degree Zipfian $\alpha = 2$, $d_m = 1000$
[Figure: size estimation (relative to network size) versus number of samples (percentage of network size) for the synthetic network. Confidence interval curves: Unif. dist. non-unique 95% and 5%; Deg. dist. non-unique 95% and 5%.]

52 DBLP - The Digital Bibliography and Library Project
[Figure: size estimation (relative to network size) versus number of samples (percentage of network size) for the DBLP network. Confidence interval curves: Unif. dist. non-unique 95% and 5%; Deg. dist. non-unique 95% and 5%.]

53 IMDB - The Internet Movie Database
[Figure: size estimation (relative to network size) versus number of samples (percentage of network size) for the IMDB network. Confidence interval curves: Unif. dist. non-unique 95% and 5%; Deg. dist. non-unique 95% and 5%.]

54 Facebook
Date:                 April 2009    October 2010
Sampling method:      uniform       random walk
Number of samples:
Collision estimator:
Facebook report:
Thanks to Minas Gjoka for the Facebook crawls.

55 Conclusions
- An efficient algorithm for estimating the size of a social network through its public API was presented.
- Its effectiveness was demonstrated on synthetic and real-world networks.
- The algorithm outperforms prior methods by using biased sampling.
- The algorithm also applies to sub-populations.

56 Thanks!
