DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li

Similar documents
DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li

DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li

A Walk in Facebook: Uniform Sampling of Users in Online Social Networks

A Walk in Facebook: Uniform Sampling of Users in Online Social Networks

Unbiased Sampling of Facebook

Walking in Facebook: A Case Study of Unbiased Sampling of OSNs

How to explore big networks? Question: Perform a random walk on G. What is the average node degree among visited nodes, if avg degree in G is 200?

Making social interactions accessible in online social networks

Empirical Studies on Methods of Crawling Directed Networks

Part 1: Link Analysis & Page Rank

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030

Sampling Large Graphs: Algorithms and Applications

Sampling Large Graphs: Algorithms and Applications

Multigraph Sampling of Online Social Networks

Mining and Analyzing Online Social Networks

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

A two-phase Sampling based Algorithm for Social Networks

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

Empirical Characterization of P2P Systems

Supplementary file for SybilDefender: A Defense Mechanism for Sybil Attacks in Large Social Networks

DS504/CS586: Big Data Analytics Big Data Clustering II

Probabilistic Models in Social Network Analysis

Mining Social Network Graphs

Estimating Sizes of Social Networks via Biased Sampling

Modeling and Estimating the Structure of D2D-Based Mobile Social Networks

OSN: when multiple autonomous users disclose another individual s information

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

Lecture 27: Learning from relational data

DS504/CS586: Big Data Analytics Big Data Clustering II

DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li

CS224W: Analysis of Network Jure Leskovec, Stanford University

A Short History of Markov Chain Monte Carlo

Crawling Online Social Graphs

Markov Chain Monte Carlo (part 1)

Extracting Information from Complex Networks

Graph Data Management

Video Sharing Websites Study

This is the published version of a paper presented at Network Intelligence Conference (ENIC), 2015 Second European.

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Data mining --- mining graphs

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Cycles in Random Graphs

Lecture 5: Graphs. Rajat Mittal. IIT Kanpur

Data-Intensive Distributed Computing

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Graph Exploration: How to do better than the random walk? Adrian Kosowski. INRIA Bordeaux Sud-Ouest.

1 Methods for Posterior Simulation

Introduction To Graphs and Networks. Fall 2013 Carola Wenk

MCMC Methods for data modeling

CS281 Section 9: Graph Models and Practical MCMC

Supervised Random Walks

DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li

COMP 4601 Hubs and Authorities

Topic mash II: assortativity, resilience, link prediction CS224W

Estimating Deep Web Properties by Random Walk

Theory of Computing. Lecture 10 MAS 714 Hartmut Klauck

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

Chapter 1. Social Media and Social Computing. October 2012 Youn-Hee Han

Graph similarity. Laura Zager and George Verghese EECS, MIT. March 2005

Framework and Algorithms for Network Bucket Testing

Online Social Networks and Media. Community detection

DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li

MCMC Diagnostics. Yingbo Li MATH Clemson University. Yingbo Li (Clemson) MCMC Diagnostics MATH / 24

Quantitative Biology II!

Overview of the Stateof-the-Art. Networks. Evolution of social network studies

Degree Ranking Using Local Information

ST440/540: Applied Bayesian Analysis. (5) Multi-parameter models - Initial values and convergence diagn

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Graph and Link Mining

Algorithms and Applications in Social Networks. 2017/2018, Semester B Slava Novgorodov

: Semantic Web (2013 Fall)

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Graph Data Processing with MapReduce

Link Analysis and Web Search

COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

Optimizing Capacity-Heterogeneous Unstructured P2P Networks for Random-Walk Traffic

Aiding the Detection of Fake Accounts in Large Scale Social Online Services

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Popularity of Twitter Accounts: PageRank on a Social Network

Issues in MCMC use for Bayesian model fitting. Practical Considerations for WinBUGS Users

11/26/17. directed graphs. CS 220: Discrete Structures and their Applications. graphs zybooks chapter 10. undirected graphs

Big Data Management and NoSQL Databases

Enhancing Stratified Graph Sampling Algorithms based on Approximate Degree Distribution

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

CS4800: Algorithms & Data Jonathan Ullman

Introduction to Algorithms. Lecture 11. Prof. Patrick Jaillet

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS6200 Information Retreival. The WebGraph. July 13, 2015

Graph Theory. Many problems are mapped to graphs. Problems. traffic VLSI circuits social network communication networks web pages relationship

Graph Data Management Systems in New Applications Domains. Mikko Halin

Lecture 26: Graphs: Traversal (Part 1)

Link Analysis in the Cloud

Biological Networks Analysis

Algorithmic and Economic Aspects of Networks. Nicole Immorlica

Covert and Anomalous Network Discovery and Detection (CANDiD)

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

Transcription:

Welcome to DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: AK232 Fall 2016

Graph Data: Social Networks Facebook social graph 4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011] J. Leskovec, A. Rajaraman, J. Ullman: 2 Mining of Massive Datasets, http:// www.mmds.org

Graph Data: Media Networks Connections between political blogs Polarization of the network [Adamic-Glance, 2005] J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3

Graph Data: Information Nets Citation networks and Maps of science [Börner et al., 2012] J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4

Graph Data: Communication Nets domain2 domain1 router domain3 Internet J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5

Graph Data: Topological Networks Seven Bridges of Königsberg [Euler, 1735] Return to the starting point by traveling each link of the graph once and only once. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6

Graph representation of networks Friendship Resistance Co-authorship Following One-way road Wireless channel Undirected links Directed links Friend & foe Trust & distrust + - + + Multi-relational links Hyperlinks Repulsion & cohesion Signed links

Mining in Big Graphs v Network Statistic Analysis (this lecture) Network Size Degree distribution. v Node Ranking (Next lecture) Identifying most influential nodes Viral Marketing, resource allocation

Graph Data: Social Networks Facebook social graph 4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011] J. Leskovec, A. Rajaraman, J. Ullman: 9 Mining of Massive Datasets, http:// www.mmds.org

Sampling graphs Random sampling (uniform & independent) crawling } vertex sampling } BFS sampling } edge sampling } random walk sampling 10 10

Random Walks on Graphs Random walk sampling Random Walk Routing Molecule in liquid Influence diffusion

Undirected Graphs Undirected!! 2 3 1 6 4 5

Random Walk v Adjacency matrix! # A = # # # " 0 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 $ & & & & %! # D = # # # " 3 0 0 0 0 2 0 0 0 0 3 0 0 0 0 2 v Transition Probability Matrix P ij = 1 k i v E : number of links v Stationary Distribution $ & & & & % Symmetric 1 " $ P = A D 1 = $ $ $ # 2 4 3 Undirected 0 1/ 3 1/ 3 1/ 3 1/ 2 0 1/ 2 0 1/ 3 1/ 3 0 1/ 3 1/ 2 0 1/ 2 0 % ' ' ' ' & π i = d i 2 E

Metropolis-Hastings Random Walk v Adjacency matrix! # A = # # # " 0 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 $ & & & & %! # D = # # # " 3 0 0 0 0 2 0 0 0 0 3 0 0 0 0 2 v Transition Probability Matrix 1 k min(1, υ ) if w neighbor of υ MH k k P υ w υ, w = MH 1 Pυ, y if w= υ y υ v E : number of links v Stationary Distribution π υ = 1 V $ & & & & % Symmetric 1 " $ P = A D 1 = $ $ $ # 2 4 3 Undirected 0 1/ 3 1/ 3 1/ 3 1/ 3 1/ 3 1/ 3 0 1/ 3 1/ 3 0 1/ 3 1/ 3 0 1/ 3 1/ 3 % ' ' ' ' &

Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant, Carter Butts, Athina Markopoulou UC Irvine, EPFL Walking in Facebook 15

Outline v Motivation and Problem Statement v Sampling Methodology v Data Analysis v Conclusion Walking in Facebook 16

Online Social Networks (OSNs) v A network of declared friendships between users v Allows users to maintain relationships G F E H C D B v Many popular OSNs with different focus Facebook, LinkedIn, Flickr, A Social Graph Walking in Facebook 17

Why Sample OSNs? v Representative samples desirable study properties test algorithms v Obtaining complete dataset difficult companies usually unwilling to share data tremendous overhead to measure all (~100TB for Facebook) Walking in Facebook 18

Problem statement v Obtain a representative sample of users in a given OSN by exploration of the social graph. in this work we sample Facebook (FB) explore graph using various crawling techniques Walking in Facebook 19

Related Work v Graph traversal (BFS) A. Mislove et al, IMC 2007 Y. Ahn et al, WWW 2007 C. Wilson, Eurosys 2009 v Random walks (MHRW, RDS) M. Henzinger et al, WWW 2000 D. Stutbach et al, IMC 2006 A. Rasti et al, Mini Infocom 2009 Walking in Facebook 20

Outline v Motivation and Problem Statement v Sampling Methodology crawling methods data collection convergence evaluation method comparisons v Data Analysis v Conclusion Walking in Facebook 21

(1) Breadth-First-Search (BFS) v Starting from a seed, explores all neighbor nodes. Process continues iteratively without replacement. G F E H C v BFS leads to bias towards high degree nodes Lee et al, Statistical properties of Sampled Networks, Phys Review E, 2006 D A B Unexplored v Early measurement studies of OSNs use BFS as primary sampling technique i.e [Mislove et al], [Ahn et al], [Wilson et al.] Explored Visited Walking in Facebook 22

(2) Random Walk (RW) Explores graph one node at a time with replacement P Degree of node υ RW υ, w In the stationary distribution π υ = Number of edges = 1 k k 2 υ υ E H G D 1/3 F E C 1/3 A 1/3 Next candidate Current node B Walking in Facebook 23

(3) Re-Weighted Random Walk (RWRW) Hansen-Hurwitz estimator v Corrects for degree bias at the end of collection v Without re-weighting, the probability distribution for node property A is: pa ( ) i u A1 A i i = = 1 V u V v Re-Weighted probability distribution : Subset of sampled nodes with value i All sampled nodes pa ( ) i u A = u V i 1/ 1/ k k u u Degree of node u Walking in Facebook 24

(4) Metropolis-Hastings Random Walk (MHRW) v Explore graph one node at a time with replacement H G C F E 1 k min(1, υ ) if w neighbor of υ MH k k P υ w υ, w = MH 1 Pυ, y if w= υ y υ v In the stationary distribution π υ = 1 V MH P AA D 1/3 1/5 A 2/15 1/3 Walking in Facebook 25 B Next candidate Current node 1 1 1 = 1 ( + + ) = 3 3 5 MH 1 3 P AC = = 3 5 2 15 1 5

Uniform userid Sampling (UNI) v As a basis for comparison, we collect a uniform sample of Facebook userids (UNI) rejection sampling on the 32-bit userid space v UNI not a general solution for sampling OSNs userid space must not be sparse names instead of numbers Walking in Facebook 26

Summary of Datasets Sampling method MHRW RW BFS UNI #Valid Users 28x81K 28x81K 28x81K 984K # Unique Users 957K 2.19M 2.20M 984K Egonets for a subsample of MHRW - local properties of nodes Datasets available at: http://odysseas.calit2.uci.edu/research/osn.html Walking in Facebook 27

Data Collection Basic Node Information v What information do we collect for each sampled node u? UserID Name Networks Privacy settings Friend List UserID Name Networks Privacy Settings u UserID Name Networks Privacy settings 1111 UserID Name Networks Privacy settings Regional School/Workplace Send Message View Friends Profile Photo Add as Friend Walking in Facebook 28

Detecting Convergence Number of samples (iterations) to loose dependence from starting points? Walking in Facebook 29

Online Convergence Diagnostics Geweke v Detects convergence for a single walk. Let X be a sequence of samples for metric of interest. Xa Xb EX ( a) EX ( b) z = Var( X ) Var( X ) a b J. Geweke, Evaluating the accuracy of sampling based approaches to calculate posterior moments in Bayesian Statistics 4, 1992 Walking in Facebook 30

Online Convergence Diagnostics Gelman-Rubin v Detects convergence for m>1 walks Walk 1 Between walks variance Walk 2 Walk 3 R n 1 m+ 1 B = + n mn W Within walks variance A. Gelman, D. Rubin, Inference from iterative simulation using multiple sequences in Statistical Science Volume 7, 1992 Walking in Facebook 31

When do we reach equilibrium? Node Degree Burn-in determined to be 3K Walking in Facebook 32

Methods Comparison Node Degree v Poor performance for BFS, RW 28 crawls v MHRW, RWRW produce good estimates per chain overall Walking in Facebook 33

Sampling Bias BFS v Low degree nodes underrepresented by two orders of magnitude v BFS is biased Walking in Facebook 34

Sampling Bias MHRW, RW, RWRW v v Degree distribution identical to UNI (MHRW,RWRW) RW as biased as BFS but with smaller variance in each walk Walking in Facebook 35

Practical Recommendations for Sampling Methods v Use MHRW or RWRW. Do not use BFS, RW. v Use formal convergence diagnostics assess convergence online use multiple parallel walks v MHRW vs RWRW RWRW slightly better performance MHRW provides a ready-to-use sample Walking in Facebook 36

Outline v Motivation and Problem Statement v Sampling Methodology v Data Analysis v Conclusion Walking in Facebook 37

FB Social Graph Degree Distribution, Power Law Distribution f(k)=b*k -a a 1 =1.32 a 2 =3.38 v Degree distribution not a power law Walking in Facebook 38

Any Comments & Critiques?

Next Class: Graph Mining (II) v Do assigned readings before class v Submit reviews/critiques v Attend in-class discussions 40