On Application-Aware Data Extraction for Big Data in Social Networks


1 On Application-Aware Data Extraction for Big Data in Social Networks. Ming-Syan Chen, Research Center for Information Tech. Innovation, Academia Sinica; EE Department, National Taiwan Univ.

2 Fast Increasing of Social Network Activities. Example social networks: Twitter, Facebook, Flickr, MSN, Wikipedia, Amazon.com. Such a network is very large and cannot easily be analyzed. M.-S. Chen 2


3 The Amount of Information is Huge! Twitter: 150+ million members, 50 million tweets per day. Facebook: 800+ million users. Amazon co-purchasing network: half a million product nodes, several million recommendation links. Web pages (Yahoo!): over one billion Web pages. (From twitter.com, Amazon; from SNSP.)

4 Example of Big Data and Social Network. Volume: thousands of people! Velocity: accumulating fast!! Variety: eating different food!!!

5 Example of Big Data and Social Network. For some gossip on this occasion, Veracity is an issue and the information Value could be low: "Mrs. Chang just did a face lift!" "Mr. Lin won the lottery!"

6 Information Extraction for Big Data in Social Networks. Goal: extract important information from large social network graphs, to allow data analysts to mine the information in large social networks, to enable scalable storage and querying, and to facilitate the development of real-world applications.

7 Outline. Graph reduction: summarization, sampling, and extraction. Information extraction on social network graphs: capturing key parameters (parameter extraction); guide query (information extraction); decomposing SN graphs (structure extraction).

8 Graph Reduction. Graph summarization (going through all data): e.g., NTU has 32K students; 20% are sushi lovers, 25% prefer steak, 15% are artists, 20% are engineers, etc. Graph sampling (going through a subset): getting a small representative set of NTU students (which preferably fits the statistics). Graph extraction: application/goal-oriented data extraction, e.g., picking only good eaters for a feast contest.


9 Graph Extraction. To handle complicated things with simple skills (執簡御繁). Application/goal-oriented data extraction. Three levels of information extraction from SN graphs: parameter extraction (e.g., company stat.), via fast calculation of closeness centrality; information extraction (e.g., company biz.), via guide query; structure extraction (e.g., company org.), via decomposing SN graphs.

10 [Figure: illustration of parameter extraction, structure extraction (weapon), and information extraction (regarding capability).]

11 Outline. Graph reduction. Information extraction on social network graphs: capturing key parameters (parameter extraction); guide query (information extraction); decomposing SN graphs (structure extraction).

12 Closeness Centrality. SN graphs have several interesting quantities, including closeness centrality, network diameter, and degree distribution. The closeness centrality of node v, $C_c(v)$, is the inverse of the average shortest-path distance from v to every other node in the network. If $C_c(v)$ is large, v is near the center, as it needs only a few hops to reach the others.

13 Response to Dynamic Changes. Edge insertions and deletions are frequent in a social network, so it is desirable to quickly update the closeness centrality of every node in response to them. Example use: pick a number of people (the nodes with high closeness centrality) who can maximize advertisement effectiveness.

14 Example of Closeness Centrality. $C_c(v)$ is the inverse of the average shortest-path distance from v to the other nodes:
$$C_c(v) = \frac{|V| - 1}{\sum_{u \in V} p(v,u)}$$
where $p(v,u)$ here denotes the shortest-path distance between v and u. [Figure: an unweighted and undirected graph G with 14 nodes and 18 edges, with $C_c(v)$ and $C_c(w)$ evaluated using $|V| - 1 = 14 - 1$.] Thus, node w is closer to all other nodes than node v is.

15 Calculating Closeness Centrality. One can calculate the closeness centralities of all vertices by solving the All-Pairs Shortest Paths (APSP) problem, which takes O(n(m+n)) time with breadth-first search (BFS) on an undirected graph, where n and m are the numbers of nodes and edges in the graph. In a dynamic graph, re-solving APSP after each edge insertion or deletion is inefficient. Note that only some shortest paths are affected by a given edge change; identifying these unstable node pairs allows fast calculation of closeness centrality (CC).
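
The BFS-based calculation described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the adjacency-list representation, the function names, and the three-node path graph are assumptions made for the example, and the graph is assumed to be connected, unweighted, and undirected.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closeness_centrality(adj):
    """One BFS per node, i.e. O(n(m+n)) overall (assumes a connected graph)."""
    n = len(adj)
    result = {}
    for v in adj:
        total = sum(bfs_distances(adj, v).values())
        # Inverse of the average distance: (n - 1) / sum of distances.
        result[v] = (n - 1) / total if total > 0 else 0.0
    return result


# Hypothetical path graph a - b - c (not the 14-node graph from the slides).
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
cc = closeness_centrality(adj)
```

In this toy graph the middle node b has the highest closeness centrality, since it reaches both other nodes in one hop.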

16 Example. Consider the addition of edge (a,b). Unchanged shortest paths: p(b,v), p(c,t), p(r,h), etc. Changed shortest paths, before edge insertion: p(a,b)={a,d,w,b}, p(a,c)={a,d,w,r,c}, p(u,v)={u,l,o,d,w,r,s,v}, etc. After edge insertion (we then call these nodes unstable): p(a,b)={a,b}, p(a,c)={a,b,c}, p(u,v)={u,x,a,b,c,v}, etc. [Figure: (a) the original unweighted and undirected graph G; (b) $G' = G \cup e(a,b)$.]


17 Illustration of Unstable Node Pairs. Goal: find $V_u$, the u-unstable node set, i.e., the nodes whose shortest paths to u change after the edge addition. If we perform BFS at node u in G and G' to obtain $G_u$ and $G'_u$, we find that only the shortest paths p(u,b), p(u,c), p(u,h), p(u,v), and p(u,t) changed. The unstable node pairs are (u,b), (u,c), (u,h), (u,v), and (u,t), so $V_u = \{b,c,h,v,t\}$. [Figure: $G_u$ and $G'_u$.]

18 (Main Theorem) After the addition of edge (a,b), every unstable node pair {v,u} (one whose shortest path changed) has $v \in V_a$ and $u \in V_b$. Only the shortest paths between $V_a$ and $V_b$ change after the edge addition and need to be recalculated.
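
The role of the unstable node sets can be illustrated with a small sketch. It is not CENDY itself: the helper names and the six-node path graph are hypothetical, and "shortest path changed" is approximated here by "shortest-path distance changed", which is the quantity that affects closeness centrality.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source in an unweighted, undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def add_edge(adj, a, b):
    """Return a copy of adj with the undirected edge (a, b) added."""
    new_adj = {v: list(nbrs) for v, nbrs in adj.items()}
    new_adj[a].append(b)
    new_adj[b].append(a)
    return new_adj


def unstable_set(adj_before, adj_after, u):
    """Nodes whose distance to u differs between the two graphs (V_u)."""
    before = bfs_distances(adj_before, u)
    after = bfs_distances(adj_after, u)
    return {v for v in adj_after if before.get(v) != after.get(v)}


# Hypothetical path graph x - a - d - w - b - c, then insert edge (a, b).
g = {"x": ["a"], "a": ["x", "d"], "d": ["a", "w"],
     "w": ["d", "b"], "b": ["w", "c"], "c": ["b"]}
g_new = add_edge(g, "a", "b")
v_a = unstable_set(g, g_new, "a")  # nodes whose distance to a changed
v_b = unstable_set(g, g_new, "b")  # nodes whose distance to b changed
```

In this toy graph the only node pairs whose distance changes are (a,b), (a,c), (x,b), and (x,c), each with one endpoint in `v_b` and the other in `v_a`, so only nodes in these two sets would need a fresh BFS.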

19 Concurrent Calculation of CC in SN. Perform BFS at nodes a and b in parallel to obtain $V_a = \{b,c,h,v,t\}$ and $V_b = \{a,x,l,u\}$ simultaneously: one process calculates $G_a$, then $G'_a$ and $V_a$, while the other calculates $G_b$, then $G'_b$ and $V_b$. Then inform the nodes in these unstable pairs to recalculate their shortest paths to others and their CC: perform BFS starting at $a \in V_b$, at $x \in V_b$, at $l \in V_b$, and at $u \in V_b$.

20 Experiments. To evaluate CENDY, we conducted experiments on six real unit-weighted graph datasets of different types. Edge deletion can be handled similarly (in light of a companion theorem).

21 Experiments: Evaluation on Edge Insertion. From the table, we can see that the closeness centralities of all vertices and the average path length (APL) can be updated with only a few BFS processes. For example, DBLP contains 460,4 nodes. The naive approach requires 460K BFS processes to update closeness centrality and APL, whereas CENDY requires only 4K BFS processes to finish the task.

22 Remark. In response to the fast changes in SN, CENDY is devised to efficiently update the closeness centrality of each node in the social network. New algorithms are called for to efficiently calculate other key parameters in fast-changing social networks.

23 Outline. Graph reduction. Information extraction on social network graphs: capturing key parameters (parameter extraction); guide query (information extraction); decomposing SN graphs (structure extraction).

24 Motivation of Guide Query. Several works address information finding in social networks. Expert finding [Deng 08][Lappas 09]: to find the experts based on some given requirement. Gateway finding [Koren 06][Wang 10]: to find the gateways between the source group and the target group. Active Friending [Wu ]: to explore social networks to improve friend finding. Guide query [Lin ]: to explore social networks to improve friend finding. References: [Deng 08] ICDM; [Lappas 09] KDD; [Koren 06] KDD; [Wang 10] KDD; [Wu ] KDD20.; [Lin ] WAIM 20

25 Motivation of Guide Query (Cont'd). With expert finding, the answer is a list of experts ranked by their expertise. With the guide query, the answer is a list of informative friends of the querier, ranked by their ability to gather information from experts. This explores social relationships and takes the probabilities of getting help into consideration.

26 Guide Query: Graph Extraction Based on Your Friends. "I want to know information about Company A or B." "These two friends are who I should ask for information." "This friend is also who I should ask since she can collect information from her friends." [Figure: the querier's friend network with nodes labeled A to E.]

27 Guide Query. Guide query [Lin ]: for a user initiating the query, the answer is the user's neighbors that are informative about user-assigned attributes. An informative neighbor should either have the attributes itself or know some other friends that have the attributes. [Lin ] Y.-C. Lin, P. S. Yu, M.-S. Chen, Guide Query in Social Networks, WAIM

28 Problem Definition. Given a query node q and a set of keywords $W = \{w_1, w_2, \ldots, w_{|W|}\}$, the guide query is to find the top-k informative neighbors of q considering W. [Figure: example network with $q = N_0$ and $W = \{A, B\}$; nodes carry keyword sets such as {A}, {C}, {D}, and {A, B} and are marked as targets or candidates.]

29 Problem (Cont'd). In the model, an edge is labeled with the probability that a node successfully spreads the request to the linked node. We rank the candidates based on how informative they are, which is evaluated by the proposed InfScore and DivScore. [Figure: the example network with edge probabilities P=0.6, P=0.7, P=0.3, P=0.2, and P=0.8.]

30 InfScore. InfScore is the informative level of a candidate node, i.e., its ability to spread the request to targets. It is modeled by the expected number of targets to which a candidate is able to spread the request. [Figure: the example network.]

31 InfScore. InfRatio is defined as the probability that a specific candidate successfully spreads the request to a certain target, e.g., the InfRatio from $N_1$ to N is 0.25. [Figure: the example network with three entries marked P=0.25.]

32 InfScore (Intensity). The InfScore is the weighted sum of InfRatio values. In the example, InfScore($N_1$) = 1.5, summed over the targets $N_{11}$, $N_{12}$, and N (with a factor of 2 on one term), and InfScore($N_4$) = 1.5, summed over $N_4$ and $N_{41}$. [Figure: the example network with an InfScore table for the candidates and entries marked P=0.25.]

33 DivScore (Diversity). The DivScore is an entropy-like measure that evaluates the diversity of the possibly accessible target nodes. For each node, the target vector $X_T$ holds one item per accessible target; each item is a normalized InfScore value, so $X_T$ describes a probability distribution over the targets. With the target vector, the DivScore takes the entropy form used by the example on the next slide:
$$\mathrm{DivScore} = -\sum_{x \in X_T} x \log_2 x$$

34 DivScore. We design the DivScore over the probability distribution across the possibly accessible targets. Example: the distribution of $N_3$ is [0.5/1.5, 0.5/1.5, 0.25/1.5, 0.25/1.5] = [1/3, 1/3, 1/6, 1/6], so DivScore($N_3$) = $2\,[-(1/3)\log_2(1/3)] + 2\,[-(1/6)\log_2(1/6)] \approx 1.918$. [Figure: the example network with a DivScore table for the candidates.]
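
The $N_3$ example can be checked with a short sketch. The function names are hypothetical, and unit weights are assumed in the InfScore sum because the slide's actual weights are not recoverable; the DivScore is computed as the base-2 entropy of the normalized values, matching the worked example.

```python
import math


def inf_score(inf_ratios, weights=None):
    """Weighted sum of InfRatio values (unit weights assumed by default)."""
    if weights is None:
        weights = [1.0] * len(inf_ratios)
    return sum(w * r for w, r in zip(weights, inf_ratios))


def div_score(inf_ratios):
    """Base-2 entropy of the InfRatio values normalized to sum to 1."""
    total = sum(inf_ratios)
    probs = [r / total for r in inf_ratios if r > 0]
    return -sum(p * math.log2(p) for p in probs)


# InfRatio values for candidate N3, read off the slide's distribution.
n3_ratios = [0.5, 0.5, 0.25, 0.25]
n3_inf = inf_score(n3_ratios)   # 1.5, the normalizing constant on the slide
n3_div = div_score(n3_ratios)   # about 1.918
```

A candidate that reaches many targets with similar probability gets a higher DivScore than one whose reach is concentrated on a single target.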

35 Experimental Setup. Dataset: DBLP [DBLP], a co-authorship network. Edge probability: based on the WC (weighted cascade) model, $p(N_i \to N_j) = 1 / d(N_j)$, where $d(N_j)$ is the in-degree of $N_j$. Node attribute: the conference names of an author's publications. [DBLP] [Chen 10] W. Chen, et al., Scalable Influence Maximization for Prevalent Viral Marketing in Large-Scale Social Networks, KDD
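
The weighted cascade rule above is simple enough to sketch directly. The function name and the toy edge list are assumptions for illustration; an undirected co-authorship link is represented here as two directed edges.

```python
def weighted_cascade_probabilities(edges):
    """Assign p(src -> dst) = 1 / in-degree(dst) to every directed edge."""
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1
    return {(src, dst): 1.0 / in_degree[dst] for src, dst in edges}


# Hypothetical co-authorship links, each listed in both directions.
edges = [
    ("N0", "N1"), ("N1", "N0"),
    ("N0", "N2"), ("N2", "N0"),
    ("N1", "N2"), ("N2", "N1"),
    ("N0", "N3"), ("N3", "N0"),
]
probs = weighted_cascade_probabilities(edges)
```

Here N0 has three incoming edges, so each of its neighbors reaches it with probability 1/3, while N3 has a single incoming edge and is reached from N0 with probability 1.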


36 Experimental Results. Suppose Ming-Syan Chen wants to talk with people who have published papers in KDD, SDM, CIKM, ICDM, or PKDD. Which coauthors should he connect to first? (That is, either coauthors who have papers in these conferences or coauthors who have coauthored with people who do.) Query input: q = Ming-Syan Chen, k = 10, W = [KDD, SDM, CIKM, ICDM, PKDD].

37 Remark. The key notion is to guide the query to the right candidates in the social network. For each candidate, a combination of expertise and social relationship (SR) with the person initiating the query is considered. As with group formation (KDD-12) and this expert finding problem (WAIM-), more applications and tools can be enhanced by taking SR into account.

38 Outline. Graph reduction. Information extraction on social network graphs: capturing key parameters (parameter extraction); guide query (information extraction); decomposing SN graphs (structure extraction).

39 Diffusion Analysis in Social Networks. Diffusion of information can be used to model the interaction among nodes in a network, e.g., viruses spreading over the Internet, diseases spreading in a community, and rumors or news spreading among people.

40 Example Diffusion. Information diffusion can happen in social networks such as Facebook and Twitter. [Figure: an underlying network and a path of infection.]

41 The Network is Hidden. In some situations, the underlying network is not known (due to cost or privacy issues). The network inference problem (NIP) is studied to discover the underlying network, i.e., to infer the network from what happened. [Figure: nodes annotated with observed infection times.]

42 Network Inference Problem. Assume there is an underlying information network. NIP is to infer the information network given a set of cascades. A cascade $t^s = [t_1^s, \ldots, t_N^s]$ is the time record of information s spreading over the network (N is the number of nodes), i.e., node $n_i$ gets s (is infected) at time $t_i^s$. If node $n_i$ is never infected with s, set $t_i^s = \infty$. Example: $t^s = [\infty, \infty, 2, \infty, 0, 1]$. [Figure: the example cascade on nodes $n_1$ to $n_6$.]

43 Clustering Cascades. Traditionally, NIP assumes there is one underlying network, which may not always be true in reality; e.g., sports news, political news, and entertainment news are likely to spread in different ways. Hence, we would like to cluster cascades so that the cascades in each cluster spread in the same pattern. An SN graph is hence decomposed into application-specific ones.

44 Example Cascades. Cascade a (Lakers news), cascade b (49ers news), cascade c (Redskins news), cascade d (Heats news), cascade e (Jets news), cascade f (Celtics news). [Figure: each cascade drawn on nodes $n_1$ to $n_6$ with infection times.]

45 To Model Inference Network. Modeling method: if two nodes are always infected within a short time of each other, the weight between them is large:
$$w_{ij} = \frac{1}{|\{s : t_i^s < t_j^s\}|} \sum_{s:\, t_i^s < t_j^s} \frac{1}{t_j^s - t_i^s}$$
Consider $w_{12}$ as an example: $\{s : t_1^s < t_2^s\} = \{b, c, e\}$, so $w_{12} = \frac{1}{3}(\cdots)$.
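
A minimal sketch of this weight, under stated assumptions: never-infected nodes carry an infinite timestamp, so a cascade in which node i is infected and node j never is contributes 1/∞ = 0 to the average. The function name is hypothetical, node indices are zero-based in the code, and only cascades a, b, and c are used because their time vectors appear in the K-means slide.

```python
import math

INF = math.inf


def inferred_weight(cascades, i, j):
    """Average of 1 / (t_j - t_i) over cascades where i is infected before j."""
    gaps = [t[j] - t[i] for t in cascades if t[i] < t[j]]
    if not gaps:
        return 0.0  # i never precedes j in any cascade
    return sum(1.0 / g for g in gaps) / len(gaps)


# Cascades a, b, c as time vectors over nodes n1..n6.
t_a = [INF, INF, INF, INF, 0, 1]
t_b = [0, INF, INF, 1, INF, INF]
t_c = [0, 1, 2, INF, INF, INF]
cascades = [t_a, t_b, t_c]

w_12 = inferred_weight(cascades, 0, 1)  # n1 -> n2, over cascades b and c only
w_56 = inferred_weight(cascades, 4, 5)  # n5 -> n6, over cascade a only
```

With only these three cascades, `w_12` differs from the slide's value, which also averages over cascade e.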

46 Example Inference Network. [Figure: the inferred weighted network on nodes $n_1$ to $n_6$.]

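47 To Cluster Cascades by K-Means. Transform each cascade t into an N-dimensional indicator vector based on whether each node is infected or not. Examples: $t^a = [\infty, \infty, \infty, \infty, 0, 1] \to [0,0,0,0,1,1]$; $t^b = [0, \infty, \infty, 1, \infty, \infty] \to [1,0,0,1,0,0]$; $t^c = [0, 1, 2, \infty, \infty, \infty] \to [1,1,1,0,0,0]$. Run K-means to get the clustering result: (a, d, f) and (b, c, e).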
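
The transformation and clustering step can be sketched as below. This is a plain K-means written for illustration with a deterministic farthest-first initialization, which is an assumption and not necessarily what the authors used; only cascades a, b, and c are clustered because the other time vectors are not given in the text.

```python
import math

INF = math.inf


def to_indicator(cascade):
    """1 if the node was infected (finite time), else 0."""
    return [0 if math.isinf(t) else 1 for t in cascade]


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, k, max_iter=100):
    """Plain K-means with farthest-first initialization; returns labels."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


t_a = [INF, INF, INF, INF, 0, 1]
t_b = [0, INF, INF, 1, INF, INF]
t_c = [0, 1, 2, INF, INF, INF]
indicators = [to_indicator(t) for t in (t_a, t_b, t_c)]
labels = kmeans(indicators, k=2)
```

On these three vectors, cascades b and c land in the same cluster and cascade a in the other, in line with the grouping reported on the slide.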

48 Graph Decomposition. By considering cascades {a, d, f} and cascades {b, c, e} independently (based on which nodes are infected), the original SN graph is decomposed in accordance with the information carried. [Figure: the inferred weighted networks for cascades {a, d, f} (NBA) and cascades {b, c, e} (NFL).]

49 Remark. Traditionally, NIP results in a dense and complex network, from which it is difficult to capture knowledge. By properly clustering cascades, we obtain a few concise networks that carry clearer information. These resulting networks match the corresponding cascades better than a single dense network does.

50 Conclusion. Information extraction is an application/goal-oriented process to capture the key ingredients (parameters, information, structure, etc.) in a huge SN. The procedure of information extraction can be integrated into related processes for better efficiency in practice.

51 Thank you!

52 Graph Summarization. Condense the original graph into a more compact form. Both lossless and lossy methods exist. Summarization requires examining the entire network. [Figure: a graph G and its summary $G_s$ with supernodes $S_a=\{2,3\}$, $S_b=\{1,9\}$, $S_c=\{7,8,10\}$, $S_d=\{4,5,6\}$ and node pairs {5, 10} and {6, 10}; a revised example from S. Navlakha et al., Graph Summarization with Bounded Error, SIGMOD 08.]


53 Graph Sampling. Select a subset of the original data such that the characteristics of the original graph are preserved. Only a proportion of the nodes in the network are visited. [Figure: sampling illustration, plotted by NodeXL, an Excel template created by the NodeXL team at Microsoft Research.]

54 A Running Example of CENDY. Originally, we have the closeness centralities of all nodes and the average path length of the graph. [Figure: an unweighted and undirected graph G with 14 nodes and 18 edges; a matrix A indexed by nodes a, b, c, d, h, l, o, r, s, t, u, v, w, x; $C_c(x)$ and $L_G$ evaluated with the factor (14 - 1).]

55 Example (Cont'd). For the insertion of edge e(a,b), we perform BFS at node a in G and G' to obtain $G_a$ and $G'_a$, and then have $V_a = \{b,c,h,v,t\}$. [Figure: $G_a$ and $G'_a$.]

56 Example (Cont'd). Also, we perform BFS at node b in G and G' to obtain $G_b$ and $G'_b$, and then have $V_b = \{a,x,l,u\}$. [Figure: $G_b$ and $G'_b$.]

57 Example (Cont'd). Then, in light of the main theorem, we recalculate the paths between $V_a$ and $V_b$. For example, for node $x \in V_b$, we calculate the reduction in each affected distance:
(1) p(x,t) - p'(x,t) = 7 - (1+1+3) = 2
(2) p(x,h) - p'(x,h) = 6 - 4 = 2
(3) p(x,v) - p'(x,v) = 6 - 4 = 2
(4) p(x,c) - p'(x,c) = 5 - 3 = 2
(5) p(x,b) - p'(x,b) = 4 - 2 = 2
and then update its closeness centrality by subtracting these reductions from 47, taken here to be the previous total distance from x:
$$C'_c(x) = \frac{14 - 1}{47 - [(1) + (2) + (3) + (4) + (5)]} = \frac{13}{37}$$
[Figure: $G_x$ and $G'_x$.]

58 Example (Cont'd). Finally, we update the closeness centralities of the referenced nodes and recalculate the APL. [Figure: the updated matrix A indexed by nodes a, b, c, d, h, l, o, r, s, t, u, v, w, x, and $L_{G'}$ evaluated with the factor (14 - 1).]

59 Example Scenario. $N_0$ is initiating a query to find a job in company A or company B. Which friend should $N_0$ ask for information? [Figure: the example network around $N_0$, with nodes carrying keyword sets such as {A}, {C}, {D}, and {A, B}.]

60 New Contributions. Given M. Gomez-Rodriguez, J. Leskovec, and A. Krause, Inferring Networks of Diffusion and Influence, in KDD 10, our work is unique in that: 1. We assume there could be many underlying networks (rather than only one). 2. We model and learn a weighted graph (rather than an unweighted one).
