Jure Leskovec, Computer Science Department, Cornell University / Stanford University
Large online systems have detailed records of human activity
Online communities: Facebook (64 million users, billion-dollar business), MySpace (300 million users)
Communication: Instant Messenger (~1 billion users)
News and social media: Blogging (250 million blogs worldwide; presidential candidates run blogs)
Online worlds: World of Warcraft (internal economy of 1 billion USD), Second Life (GDP of 700 million USD in 2007)
[Figure panels: a) World Wide Web; b) Internet (AS); c) Social networks; d) Communication; e) Citations; f) Protein interactions]
We know lots about the network structure:
Properties: Scale-free [Barabasi 99], 6 degrees of separation [Milgram 67], Small-world [Watts Strogatz 98], Bipartite cores [Kumar et al. 99], Network motifs [Milo et al. 02], Communities [Newman 99], Hubs and authorities [Page et al. 98, Kleinberg 99]
Models: Preferential attachment [Barabasi 99], Small-world [Watts Strogatz 98], Copying model [Kleinberg et al. 99], Heuristically optimized tradeoffs [Fabrikant et al. 02], Latent space models [Raftery et al. 02], Searchability [Kleinberg 00], Bowtie [Broder et al. 00], Exponential random graphs [Frank Strauss 86], Transit-stub [Zegura 97], Jellyfish [Tauro et al. 01]
We know much less about processes and dynamics of networks
Network dynamics:
Network evolution: How does network structure change as the network grows and evolves?
Diffusion and cascading behavior: How do rumors and diseases spread over networks?
We need massive network data for the patterns to emerge:
MSN Messenger network [WWW 08] (the largest social network ever analyzed): 240M people, 255B messages, 4.5 TB data
Product recommendations [EC 06]: 4M people, 16M recommendations
Blogosphere [work in progress]: 60M posts, 120M links
Diffusion & Cascades
  Patterns & Observations: Q1: What do cascades look like? Q2: How likely are people to get influenced?
  Models & Algorithms: Q3: How do we find influential nodes? Q4: How do we quickly detect epidemics?
Network Evolution
  Patterns & Observations: Q5: How does network structure change as the network grows/evolves?
  Models & Algorithms: Q6: How do we generate realistic looking and evolving networks?
Behavior that cascades from node to node like an epidemic:
News, opinions, rumors
Word of mouth in marketing
Infectious diseases
As activations spread through the network they leave a trace: a cascade
[Figure: a network and the cascade (propagation graph) traced over it]
Network diffusion has been extensively studied:
Human behavior [Granovetter 78]
Diseases and epidemics [Bailey 75]
Innovations [Rogers 95]
On the web [Gruhl et al. 04]
Organizations [Burt 04, Aral Brynjolfsson van Alstyne 07]
For marketing purposes [Richardson Domingos 02, Hill Provost Volinsky 06]
Trading behaviors [Hirshleifer et al. 94]
Decision making [Bikhchandani 98, Surowiecki 05]
We know much less about individual cascading events that lead to diffusion
[w/ Adamic Huberman, EC 06] People send and receive product recommendations and purchase products
[Figure labels: 10% credit, 10% off]
Data: large online retailer: 4 million people, 16 million recommendations, 500k products
[w/ Glance Hurst et al., SDM 07] Bloggers write posts and refer (link) to other posts, and the information propagates
Data: 10.5 million posts, 16 million links
[w/ Kleinberg Singh, PAKDD 06] What does propagation look like? Are cascades stars? Chains? Trees?
Information cascades (blogosphere): [figure: cascade shapes, ordered by frequency]
Viral marketing (DVD recommendations): [figure: cascade shapes, ordered by frequency]
Viral marketing cascades are more social: collisions (no summarizers), richer non-tree structures
Prob. of adoption depends on the number of friends who have adopted [Bass 69, Schelling 78]
What is the shape? The distinction has consequences for models and algorithms
[Two candidate plots of prob. of adoption vs. k = number of friends adopting: Diminishing returns? Critical mass?]
To find the answer we need lots of data
[w/ Adamic Huberman, EC 06] DVD recommendations (8.2 million observations)
[Plot: probability of purchasing (0 to 0.1) vs. # recommendations received (0 to 40)]
Adoption curve follows diminishing returns. Can we exploit this?
Later, similar findings were made for group membership [Backstrom Huttenlocher Kleinberg 06] and probability of communication [Kossinets Watts 06]
Diffusion & Cascades
  Patterns & Observations: A1: Cascade shapes. A2: Human adoption follows diminishing returns.
  Models & Algorithms: Q3: How do we find influential nodes? Q4: How do we quickly detect epidemics?
Network Evolution
  Patterns & Observations: Q5: How does network structure change as the network grows/evolves?
  Models & Algorithms: Q6: How do we generate realistic looking and evolving networks?
Blogs (information epidemics): Which are the influential/infectious blogs?
Viral marketing: Who are the trendsetters? Influential people?
Disease spreading: Where to place monitoring stations to detect epidemics?
[w/ Krause Guestrin et al., KDD 07] How to quickly detect epidemics as they spread?
[Figure: cascades c_1, c_2, c_3 spreading over a network]
[w/ Krause Guestrin et al., KDD 07] Cost: cost of monitoring is node dependent
Reward: minimize the number of affected nodes. If A are the monitored nodes, let R(A) denote the number of nodes we save
We also consider other rewards: minimize time to detection, maximize number of detected outbreaks
[Figure: monitored set A and the saved region R(A)]
Given: graph $G(V,E)$, budget $M$, and data on how cascades $C_1, \dots, C_i, \dots, C_K$ spread over time
Select a set of nodes $A \subseteq V$ maximizing the reward $R(A) = \sum_i R_i(A)$ subject to $\mathrm{cost}(A) \le M$, where $R_i(A)$ is the reward for detecting cascade $i$
Solving the problem exactly is NP-hard (max cover [Khuller et al. 99])
[w/ Krause Guestrin et al., KDD 07] Problem structure: submodularity of the reward functions (think of it as "concavity")
CELF algorithm with approximation guarantee
Speed-up: lazy evaluation
[w/ Krause Guestrin et al., KDD 07] [Figure: existing sensors S_1, S_2, S_3, S_4 and a new monitored node S]
Placement A = {S_1, S_2}: adding S helps a lot
Placement B = {S_1, S_2, S_3, S_4}: adding S helps very little
Gain of adding a node to a small set is larger than gain of adding a node to a large set
Submodularity: diminishing returns (think of it as "concavity")
We must show R is submodular: for all $A \subseteq B$,
$R(A \cup \{u\}) - R(A) \ge R(B \cup \{u\}) - R(B)$
(left side: gain of adding a node to a small set; right side: gain of adding a node to a large set)
Natural example: sets $A_1, A_2, \dots, A_n$; $R(A)$ = size of the union of the $A_i$ (size of the covered area)
If $R_1, \dots, R_K$ are submodular, then $\sum_i R_i$ is submodular
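The diminishing-returns inequality can be checked by brute force on the "size of union" example. The sketch below is illustrative only: the function names and the toy sets are assumptions, not part of the original slides, and the exhaustive check is feasible only for a handful of sets.

```python
from itertools import combinations

def coverage(sets, chosen):
    """R(A): size of the union of the sets whose keys are in `chosen`."""
    covered = set()
    for key in chosen:
        covered |= sets[key]
    return len(covered)

def is_submodular(sets):
    """Check R(A + u) - R(A) >= R(B + u) - R(B) for every A subset of B and u not in B."""
    keys = list(sets)
    for size_b in range(len(keys) + 1):
        for B in combinations(keys, size_b):
            for size_a in range(size_b + 1):
                for A in combinations(B, size_a):
                    for u in keys:
                        if u in B:
                            continue
                        gain_small = coverage(sets, A + (u,)) - coverage(sets, A)
                        gain_large = coverage(sets, B + (u,)) - coverage(sets, B)
                        if gain_small < gain_large:
                            return False
    return True

# Hypothetical toy instance: each key covers a few elements.
toy_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
print(is_submodular(toy_sets))
```

In this toy instance, adding "d" to the empty set gains 2 elements, while adding it to {"a", "c"} gains nothing, which is the diminishing-returns behavior the slide describes.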
[w/ Krause Guestrin et al., KDD 07] Theorem: the reward function is submodular
Consider cascade $i$: $R_i(u_k)$ = set of nodes saved from $u_k$
$R_i(A) = \left| \bigcup_{u_k \in A} R_i(u_k) \right|$ (size of the union), so $R_i$ is submodular
Global optimization: $R(A) = \sum_i R_i(A)$, so $R$ is submodular
[Figure: cascade $i$ with nodes $u_1$, $u_2$ and their saved regions $R_i(u_1)$, $R_i(u_2)$]
[w/ Krause Guestrin et al., KDD 07] We develop the CELF algorithm:
Two independent runs of a modified greedy
Solution set $A'$: ignore cost, greedily optimize reward
Solution set $A''$: greedily optimize the reward/cost ratio
Pick the best of the two: $\arg\max(R(A'), R(A''))$
Theorem: if $R$ is submodular then CELF is near optimal: CELF achieves a $\tfrac{1}{2}(1 - 1/e)$ factor approximation
[Figure: lazy evaluation over nodes a, b, c, d, e sorted by marginal reward; current solution grows {} → {a} → {a, c}]
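A minimal sketch of the two greedy runs with lazy evaluation, assuming a submodular `reward` function over lists of nodes and a per-node `cost` dictionary. The names `lazy_greedy`, `celf`, `saved`, and the toy data are hypothetical, and this simplified version omits details of the published algorithm such as its online bounds.

```python
import heapq

def lazy_greedy(nodes, reward, cost, budget, use_ratio):
    """Greedy selection under a budget, re-evaluating marginal gains lazily."""
    selected, spent = [], 0.0
    current = reward(selected)
    heap = []
    for u in nodes:
        gain = reward([u]) - current
        score = gain / cost[u] if use_ratio else gain
        heap.append((-score, u, 0))  # (negated score, node, round it was computed in)
    heapq.heapify(heap)
    round_id = 0
    while heap:
        neg_score, u, evaluated_in = heapq.heappop(heap)
        if spent + cost[u] > budget:
            continue  # can never be afforded again, since spending only grows
        if evaluated_in == round_id:
            if neg_score >= 0:
                break  # best remaining node adds nothing
            # Score is up to date; by submodularity no stale entry can beat it.
            selected.append(u)
            spent += cost[u]
            current = reward(selected)
            round_id += 1
        else:
            # Stale upper bound: recompute the marginal gain and push back.
            gain = reward(selected + [u]) - current
            score = gain / cost[u] if use_ratio else gain
            heapq.heappush(heap, (-score, u, round_id))
    return selected

def celf(nodes, reward, cost, budget):
    """Run both greedy variants and return the solution with the larger reward."""
    by_reward = lazy_greedy(nodes, reward, cost, budget, use_ratio=False)
    by_ratio = lazy_greedy(nodes, reward, cost, budget, use_ratio=True)
    return max(by_reward, by_ratio, key=reward)

# Hypothetical toy data: saved[u] is the set of nodes saved by monitoring u.
saved = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6, 7}, "d": {1, 2, 3, 4, 5, 6, 7, 8}}
cost = {"a": 1, "b": 1, "c": 1, "d": 3}

def reward(chosen):
    return len(set().union(*(saved[u] for u in chosen)))

print(celf(list(saved), reward, cost, budget=3))
```

On this toy instance the ratio-based run spends the budget on three cheap nodes (reward 7) while the cost-blind run picks the single expensive node (reward 8), which illustrates why taking the better of the two runs matters.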
Question: Which blogs should one read to be most up to date?
Idea: select blogs to cover the blogosphere
[Figure: each dot is a blog; proximity is based on the number of common cascades]
[w/ Krause Guestrin et al., KDD 07] Which blogs should one read to catch big stories?
[Plot: reward (higher is better) vs. number of selected blogs (sensors); curves: CELF, in-links, out-links, # posts, random; one baseline is annotated "used by Technorati"]
For more info see our website: www.blogcascade.org
[w/ Krause et al., J. of Water Resource Planning] Given: a real city water distribution network and data on how contaminants spread over time
Place sensors (to save lives)
Problem posed by the US Environmental Protection Agency
[Figure: sensor S and contaminations c_1, c_2 in the water network]
[w/ Ostfeld et al., J. of Water Resource Planning] Our approach performed best at the Battle of Water Sensor Networks competition
[Plot: population saved (higher is better) vs. number of placed sensors; curves: CELF, Degree, Random, Population, Flow]

Author              Score
CMU (CELF)          26
Sandia              21
U Exeter            20
Bentley Systems     19
Technion (1)        14
Bordeaux            12
U Cyprus            11
U Guelph            7
U Michigan          4
Michigan Tech U     3
Malcolm             2
Proteo              2
Technion (2)        1
Diffusion & Cascades
  Patterns & Observations: A1: Cascade shapes. A2: Human adoption follows diminishing returns.
  Models & Algorithms: A3, A4: CELF algorithm for detecting cascades and outbreaks
Network Evolution
  Patterns & Observations: Q5: How does network structure change as the network grows/evolves?
  Models & Algorithms: Q6: How do we generate realistic looking and evolving networks?
Empirical findings on real graphs led to new network models
Model: preferential attachment. Explains: power-law degree distribution [plot: log prob. vs. log degree]
Such models make assumptions/predictions about other network properties
What about network evolution?
[w/ Kleinberg Faloutsos, KDD 05] What is the relation between the number of nodes and the number of edges over time?
Prior work assumes: constant average degree over time
Networks become denser over time
Densification Power Law: $E(t) \propto N(t)^a$, where $a$ is the densification exponent ($1 \le a \le 2$)
[Plots of E(t) vs. N(t): Internet, a = 1.2; Citations, a = 1.6]
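Since the densification power law has the form E(t) ∝ N(t)^a, the exponent a is the slope of log E(t) against log N(t). The sketch below estimates it by ordinary least squares; the function name and the synthetic snapshots are assumptions for illustration, not data from the talk.

```python
import math

def densification_exponent(num_nodes, num_edges):
    """Least-squares slope of log E(t) against log N(t) over graph snapshots."""
    xs = [math.log(n) for n in num_nodes]
    ys = [math.log(e) for e in num_edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic snapshots generated with exponent 1.2 (hypothetical data).
nodes_over_time = [100, 200, 400, 800, 1600]
edges_over_time = [2.0 * n ** 1.2 for n in nodes_over_time]
print(round(densification_exponent(nodes_over_time, edges_over_time), 3))
```

An exponent of 1 corresponds to constant average degree, and an exponent of 2 to a graph whose edge count grows quadratically in the number of nodes.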
[w/ Kleinberg Faloutsos, KDD 05] Prior models and intuition say that the network diameter slowly grows (like log N, log log N)
Diameter shrinks over time: as the network grows, the distances between the nodes slowly decrease
[Plots: diameter vs. time / size of the graph, for Internet and Citations]
What individual node behaviors are causing such patterns?
[w/ Backstrom Kumar Tomkins, KDD 08] We directly observe atomic events of network evolution (and not only network snapshots)
[Figure: a sequence of individual edge arrivals, and so on for millions]
We observe evolution at the finest scale:
Test individual edge attachment
Directly observe events leading to network properties
Compare network models by likelihood (and not just by summary network statistics)
[w/ Backstrom Kumar Tomkins, KDD 08] Network datasets: full temporal information from the first edge onwards
LinkedIn (N=7m, E=30m), Flickr (N=600k, E=3m), Delicious (N=200k, E=430k), Answers (N=600k, E=2m)
We study 3 processes that control the evolution:
P1) Node arrival: node enters the network
P2) Edge initiation: node wakes up, initiates an edge, goes to sleep
P3) Edge destination: where to attach a new edge
Are edges more likely to attach to high-degree nodes? Are edges more likely to attach to nodes that are close?
[w/ Backstrom Kumar Tomkins, KDD 08] Are edges more likely to connect to higher-degree nodes?
Edge attachment probability by degree: $p_e(k) \propto k^{\tau}$
[Plots of $p_e(k)$: G_np, PA, Flickr]

Network      τ
G_np         0
PA           1
Flickr       1
Delicious    1
Answers      0.9
LinkedIn     0.6

First direct proof of preferential attachment!
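One way to measure how edge attachment depends on degree from a time-ordered edge stream is sketched below. The estimator used here is an assumption made for illustration: the number of edges whose destination had degree k at arrival time, divided by the number of degree-k nodes present, summed over all edge events. The function name and the toy stream are hypothetical.

```python
from collections import Counter

def attachment_probability(edges):
    """Estimate p_e(k) from a time-ordered list of (source, destination) edges."""
    degree = Counter()             # current degree of every node seen so far
    nodes_with_degree = Counter()  # number of nodes currently at each degree
    hits = Counter()               # edges whose destination had degree k on arrival
    exposure = Counter()           # degree-k nodes present, summed over edge events

    def register(node):
        if node not in degree:
            degree[node] = 0
            nodes_with_degree[0] += 1

    def bump(node):
        nodes_with_degree[degree[node]] -= 1
        degree[node] += 1
        nodes_with_degree[degree[node]] += 1

    for source, destination in edges:
        register(source)
        register(destination)
        hits[degree[destination]] += 1
        for k, count in nodes_with_degree.items():
            exposure[k] += count
        bump(source)
        bump(destination)

    return {k: hits[k] / exposure[k] for k in hits}

# Hypothetical toy stream: every new node attaches to node 2.
stream = [(1, 2), (3, 2), (4, 2)]
print(attachment_probability(stream))
```

Given such estimates, τ could then be read off as the slope of log p_e(k) against log k for k ≥ 1.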
[w/ Backstrom Kumar Tomkins, KDD 08] Just before the edge (u,w) is placed, how many hops are there between u and w?
[Plots: G_np, PA, Flickr]
Fraction of triad-closing edges:

Network      % Δ
Flickr       66%
Delicious    28%
Answers      23%
LinkedIn     50%

Real edges are local. Most of them close triangles!
[Figure: triangle u, v, w closed by the new edge (u,w)]
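The fraction of triad-closing edges can be computed from a time-ordered edge stream by checking, just before each edge (u, w) is placed, whether u and w already share a neighbor (that is, are two hops apart). This is a minimal sketch for undirected graphs; the function name and the toy stream are assumptions.

```python
from collections import defaultdict

def triad_closing_fraction(edges):
    """Fraction of new edges whose endpoints already share a neighbor."""
    neighbors = defaultdict(set)
    closing = 0
    total = 0
    for u, w in edges:
        if u == w or w in neighbors[u]:
            continue  # skip self-loops and repeated edges
        total += 1
        if neighbors[u] & neighbors[w]:
            closing += 1  # a common neighbor v exists, so (u, w) closes a triangle
        neighbors[u].add(w)
        neighbors[w].add(u)
    return closing / total if total else 0.0

# Hypothetical toy stream: edges (1, 3) and (2, 4) each close a triangle.
toy_stream = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
print(triad_closing_fraction(toy_stream))
```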
Want to generate realistic networks: given a real network, generate a synthetic network
Compare graph properties, e.g., degree distribution
Why synthetic graphs? Anomaly detection, simulations, predictions, null model, sharing privacy-sensitive graphs, ...
Q: Which network properties do we care about?
Q: What is a good model and how do we fit it?
[w/ Chakrabarti Kleinberg Faloutsos, PKDD 05] Kronecker product of graph adjacency matrices (actually, there is also a nice social interpretation of the model)
[Figure: initiator (3x3) and its Kronecker powers (9x9), (27x27); entries are edge probabilities $p_{ij}$]
We prove Kronecker graphs mimic real graphs: power-law degree distribution, densification, shrinking/stabilizing diameter, spectral properties
Given a real graph, how to estimate the initiator $G_1$?
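The Kronecker construction can be sketched in a few lines: the k-th Kronecker power of an initiator matrix gives a matrix of edge probabilities $p_{ij}$ for a graph on $n^k$ nodes. The function names and the 2x2 initiator values below are illustrative assumptions, not fitted parameters.

```python
def kronecker_product(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]

def kronecker_power(initiator, k):
    """k-th Kronecker power of the initiator (k >= 1)."""
    result = initiator
    for _ in range(k - 1):
        result = kronecker_product(result, initiator)
    return result

# Hypothetical 2x2 initiator of edge probabilities.
g1 = [[0.9, 0.5], [0.5, 0.1]]
p = kronecker_power(g1, 3)          # 8x8 matrix of edge probabilities
expected_edges = sum(map(sum, p))   # equals (sum of initiator entries) ** 3
print(len(p), round(expected_edges, 6))
```

A stochastic graph is then obtained by including each edge (i, j) independently with probability p[i][j]; the expected number of edges is the sum of all entries.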
[w/ Faloutsos, ICML 07] Maximum likelihood estimation: $\arg\max_{G_1} P(G \mid \text{Kronecker } G_1)$, where $G$ is the given real graph
Naïve estimation takes $O(N!\,N^2)$:
$N!$ for different node labelings. Our solution: Metropolis sampling, $N! \to$ (big) const
$N^2$ for traversing the graph adjacency matrix. Our solution: Kronecker product ($E \ll N^2$), $N^2 \to E$
Do stochastic gradient descent
We estimate the model in $O(E)$
Initiator: $G_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$
[w/ Faloutsos, ICML 07] We search the space of $\sim 10^{1{,}000{,}000}$ permutations
Fitting takes 2 hours
Real and Kronecker are very close: $G_1 = \begin{pmatrix} 0.99 & 0.54 \\ 0.49 & 0.13 \end{pmatrix}$
[Plots: degree distribution (probability vs. node degree); path lengths (# reachable pairs vs. number of hops); network values (network value vs. rank)]
[w/ Dasgupta Lang Mahoney, WWW 08] Fitting Epinions we obtained $G_1 = \begin{pmatrix} 0.91 & 0.54 \\ 0.49 & 0.13 \end{pmatrix}$
What does this tell about the network structure?
Core-periphery: no communities, no good cuts
[Diagram: core (0.9 edges), periphery (0.1 edges), 0.5 edges between core and periphery]
As opposed to $\begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}$, which gives a hierarchy
Small and large networks are fundamentally different
Scientific collaborations (N=397, E=914): $\begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}$
Collaboration network (N=4,158, E=13,422): $\begin{pmatrix} 0.91 & 0.54 \\ 0.49 & 0.13 \end{pmatrix}$
Why are networks the way they are?
Only recently have basic properties been observed on a large scale
Confirms some social science intuitions; calls others into question
Benefits of working with large data:
What patterns do we observe in massive networks?
What microscopic mechanisms cause them?
Social network of the whole world?
[w/ Horvitz, WWW 08, Nature 08] Small world experiment [Milgram 67]: people send letters from Nebraska to Boston. How many steps does it take?
Messenger social network, the largest network analyzed: 240M people, 30B conversations, 4.5 TB data
[Plots: Milgram's small world experiment; MSN Messenger network; x-axis: number of steps between pairs of people (i.e., hops + 1)]
Predictive modeling of information diffusion:
When, where and what information will create a cascade?
Where should one tap the network to get the effect they want?
Social media marketing
How do news and information spread?
New ranking and influence measures
Sentiment analysis from cascade structure
How to introduce incentives?
Observations: data analysis
Models: predictions
Algorithms: applications
Actively influencing the network