Node similarity and classification
1 Node similarity and classification. Davide Mottin, Konstantina Lazaridou. Hasso Plattner Institute, Graph Mining course, Winter Semester 2016
2 Acknowledgements. Some parts of this lecture are taken from: Other adapted content is from Social Network Data Analytics (Springer), Ed. Charu Aggarwal, March 2011.
3 Correlations in Networks. Individual behaviors are correlated in a network environment: homophily, influence, confounding.
4 Classification with Network Data. How can we leverage this correlation observed in networks to help predict user attributes or interests? Predict the labels for the unlabeled nodes.
5 Why node classification? Suggesting new connections or contacts; recommendation systems to suggest objects (music, movies, activities); automatically understanding roles in a network (hubs, activators, influencing nodes, ...); identifying experts for question-answering systems; targeted advertising; studying communities (key individuals, group starters, ...); studying diseases and cures; identifying unusual behaviors or behavioral changes; finding similar nodes and outliers.
6 Why is node classification useful? Not all nodes have labels (users are not willing to provide explanations). Automatic discovery of roles allows analysis of big graphs. Labels provided by the users can be misleading. Labels are sparse (some categories might be missing or incomplete).
7 Node classification problem. Given: a graph G = (V, E, W) with vertices V, edges E, and weight matrix W; labeled nodes V_L ⊆ V and unlabeled nodes V_U = V \ V_L; Y, the set of m possible labels; Y_L = {y_1, ..., y_l}, the initial labels on the nodes in V_L. Problem: infer the labels Y for all nodes in V. (Figure: an example graph where the nodes in V_L carry labels from Y = {1, 2} and the nodes in V_U are marked with "?".)
8 Node classification problem (2). Can be generalized to multi-label and multiclass classification: in multiclass classification, assume that each labeled node has a probability distribution over the labels. Can work on generalized graph structures: hypergraphs; graphs with weighted, labeled, or timestamped edges; multigraphs; probabilistic graphs; and so on.
9 The importance of the graph structure. The graph structure encodes important information for node classification. Two important concepts from the social sciences: Homophily, or "birds of a feather": similar individuals are connected with similar people (friends of friends can easily become friends). Co-citation regularity: if two people share a link, they are most probably similar in other respects (e.g., music tastes). So it is reasonable to think that labels propagate in the network following the links; methods designed for points in a vector space perform poorly on graphs. The label propagates (up to some extent)!
10 Node features. We use the term "features" both for graph indexes and for node classification; however, we distinguish: Graph features: small subgraphs (paths, trees, structures) that are frequent or informative. Node features (what we will use in this lecture): measurable characteristics of the nodes that help discriminating one node from another, or stating its similarity with other nodes. Examples of features: in/out degree of the node; number of l-labeled edges from that node; number of paths that go through the node; number of triangles; degree and number of within-egonet edges.
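To make the feature list concrete, here is a small sketch (the graph, function name, and values are illustrative, not from the lecture) that computes three of the listed features from an adjacency-set representation of an undirected graph:

```python
# Illustrative sketch: compute a few node features -- degree, triangle
# count, and within-egonet edges -- from adjacency sets.
def node_features(adj, v):
    nbrs = adj[v]
    degree = len(nbrs)
    # triangles through v: pairs of neighbors that are themselves connected
    triangles = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    # egonet = v plus its neighbors; count the edges inside it
    ego = nbrs | {v}
    ego_edges = sum(1 for u in ego for w in ego if u < w and w in adj[u])
    return {"degree": degree, "triangles": triangles, "egonet_edges": ego_edges}

adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(node_features(adj, "a"))  # {'degree': 3, 'triangles': 1, 'egonet_edges': 4}
```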
11 Node classification approaches. Similarity based: find nodes that share the same characteristics with other nodes. Iterative learning: learn a set of labels and propagate the information to similar nodes. Label propagation: labeled nodes propagate the information to the neighbors with some probability.
12 Lecture road. Similarity based. Iterative classification. Label propagation.
13 Real-world Applications
14 Movie recommendations
15 Search Engines (IR). (Figure: a bipartite graph of queries and URLs grouped into topical sessions; similar queries such as "popular music videos" and "yahoo music" link to overlapping URLs.)
16 Similarity based approaches. Equivalences in terms of structure: structural, automorphic, and regular. Role extraction methods: RolX. Recursive similarities: paths, max-flow, SimRank.
17 Structural Equivalence. Two nodes u and v are structurally equivalent if they have the same relationships to all other nodes. Homophily hypothesis: structurally equivalent nodes are likely to be similar in other ways. Solely based on the structure of the network → ignores weights and time. Rarely appears in real-world networks. Lorrain, F. and White, H.C. Structural equivalence of individuals in social networks. The Journal of Mathematical Sociology.
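A structural-equivalence check can be sketched directly from the definition: compare the two nodes' relationships to every *other* node. The graph and function name below are illustrative, not from the lecture.

```python
# Sketch: u and v are structurally equivalent iff they connect to
# exactly the same set of nodes, ignoring the pair (u, v) itself.
def structurally_equivalent(adj, u, v):
    """adj: dict mapping each node to its set of neighbors."""
    others = set(adj) - {u, v}
    return all((w in adj[u]) == (w in adj[v]) for w in others)

adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
}
# a and b have identical relationships to every other node
print(structurally_equivalent(adj, "a", "b"))  # True
print(structurally_equivalent(adj, "a", "c"))  # False
```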
18 Automorphic equivalence. Two nodes u and v are automorphically equivalent if all the nodes can be relabeled to form an isomorphic graph with the labels of u and v interchanged (just change the node ids). Swapping u and v (possibly along with their neighbors) does not change graph distances. Two nodes that are automorphically equivalent share the same edge/node-label sequences and structures. Borgatti, S.P. and Everett, M.G. Notions of position in social network analysis. Sociological Methodology.
19 Regular equivalence. Two nodes u and v are regularly equivalent if they are equally related to equivalent others. Less restrictive than structural and automorphic equivalence; assumes a similarity between sets of nodes. (Example: the students Billy and John are similar because they are both connected to a professor; the same holds for Prof. Einstein and Prof. Hilbert, both connected to students.) Regular equivalence does not care about which specific connections a node has, but to which set/group it is connected. Borgatti, S.P. and Everett, M.G. Regular blockmodels of multiway, multimode matrices. Social Networks.
20 Relation among equivalences. What is the relation among the three equivalences? They are nested: structurally equivalent nodes are also automorphically equivalent, and automorphically equivalent nodes are also regularly equivalent (Structural ⊆ Automorphic ⊆ Regular).
21 RolX: Role extraction algorithm. Pipeline: Input: adjacency matrix (n × n) → Recursive Feature Extraction → node × feature matrix (each node embedded in an f-dimensional feature space) → Role Extraction via non-negative matrix factorization (NMF) → node × role matrix and role × feature matrix (r-dimensional role space) → Output. Henderson, K., Gallagher, B., Eliassi-Rad, T., Tong, H., Basu, S., Akoglu, L., Koutra, D., Faloutsos, C. and Li, L. RolX: structural role extraction & mining in large graphs. SIGKDD.
22 Recursive Feature Extraction (ReFeX). Transforms the network connectivity into recursive structural features. Technically, it embeds each node into an f-dimensional space, where the f features include degree, self-loops, average edge weight, and number of edges in the egonet. Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H. and Faloutsos, C. It's who you know: graph mining using recursive structural features. SIGKDD.
23 ReFeX: mining features. Local: measures of the node degree. Neighborhood (egonet): the egonet (or ego-network) of a node is the node itself, the adjacent nodes, and the graph induced by those nodes; features are computed on each node's ego network, e.g., # of within-egonet edges, # of edges entering and leaving the egonet. Recursive (regional): some aggregate (mean, sum, max, min, ...) of another feature over a node's neighbors; the aggregation can be computed over any real-valued feature, including other recursive features (this process might not stop if uncontrolled!).
24 ReFeX (2). The number of possible recursive features is infinite, so ReFeX prunes them. Feature values are mapped to small integers via vertical logarithmic binning. Log binning discretizes the features using non-uniform (logarithmic) bins: the p|V| nodes with the lowest feature value are assigned to bin 0, then from the remaining |V| − p|V| nodes the first p(|V| − p|V|) go to the next bin, and so on. Logarithmic binning increases the chance that two features agree. Then look for pairs of features whose values never disagree at any node by more than a threshold s, connect such pairs in a feature graph, and keep one feature per connected component. A graph-based approach (motivated by power-law distributions); the threshold is set automatically.
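Vertical logarithmic binning as described above can be sketched as follows; the fraction p = 0.5 and the function name are our assumptions for illustration:

```python
# Sketch of vertical logarithmic binning: repeatedly assign the
# fraction p of the remaining nodes with the smallest feature values
# to the next bin, so bin sizes shrink geometrically.
def log_bin(values, p=0.5):
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    bin_id, start = 0, 0
    remaining = len(values)
    while remaining > 0:
        take = max(1, int(p * remaining))
        for i in order[start:start + take]:
            bins[i] = bin_id
        start += take
        remaining -= take
        bin_id += 1
    return bins

degrees = [1, 2, 2, 3, 5, 8, 13, 40]
print(log_bin(degrees))  # low-degree nodes share bin 0; the hub gets its own top bin
```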
25 RolX: Role extraction algorithm (pipeline recap). Input: adjacency matrix (n × n) → Recursive Feature Extraction → node × feature matrix → Role Extraction via non-negative matrix factorization (NMF) → node × role matrix and role × feature matrix → Output. Henderson, K. et al. RolX: structural role extraction & mining in large graphs. SIGKDD.
26 Role extraction: feature grouping. Find r overlapping clusters in the feature space; each node can have multiple roles at the same time. Generate a rank-r approximation of the node × feature matrix V using non-negative matrix factorization: V ≈ GF. The G matrix assigns nodes to roles; the F matrix represents how the features explain the roles.
27 A (very brief) glimpse at matrix factorization. If V is a matrix, it is possible to approximate it by the product of two (lower-rank) matrices. In particular, we want to find two matrices W, H such that V ≈ WH. However, an exact factorization is not always possible! Idea: find W ≥ 0, H ≥ 0 that attain argmin_{W,H} ‖V − WH‖_F, where ‖A‖_F (the Frobenius norm) is the square root of the sum of the squared elements of the matrix A. Intuitively: minimize the element-wise difference between the matrix V and the product WH.
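A minimal NMF sketch using the classic multiplicative-update rules of Lee and Seung for the Frobenius objective; the tiny matrix, the seed, and the iteration count are illustrative, and this is not the RolX implementation:

```python
# Sketch: NMF via multiplicative updates, minimizing ||V - WH||_F
# subject to W >= 0, H >= 0.
import numpy as np

def nmf(V, r, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, f = V.shape
    W = rng.random((n, r)) + 0.1   # positive random init
    H = rng.random((r, f)) + 0.1
    eps = 1e-9                     # guard against division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
W, H = nmf(V, r=2)
print(np.round(W @ H, 2))  # close to V, since V has an exact rank-2 factorization
```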
28 Selecting the right number of roles. Roles summarize the behaviour or, alternatively, they compress the feature matrix V (a lower-dimensional description). What is the best model? The best model is the one that makes fewer errors and requires less space. Idea: use the Minimum Description Length (MDL) paradigm to select the number of roles that results in the best compression. L: description length; M: # of bits to describe the model; E: cost of describing the errors in V ≈ GF. Find the r that minimizes L = M + E. Assuming any value requires b bits (the role values), then the number of bits for the model is M = br(n + f). Why? Think about the dimensions of the matrices G (n × r) and F (r × f). What about E? E is the amount of error. However, since V − GF is not normally distributed, RolX uses the (generalized) KL divergence: E = Σ_{i,j} ( V_{i,j} log ( V_{i,j} / (GF)_{i,j} ) − V_{i,j} + (GF)_{i,j} ).
29 Node proximity. Intuitively, a good measure of node proximity should reward many short heavy paths.
30 Graph-theoretic Approaches. Idea: decide which pairs of nodes are more similar than others using simple metrics: number of hops, or sum of the weights along the hops. H. Tong, Y. Koren and C. Faloutsos. Fast direction-aware proximity for graph mining. SIGKDD.
31 Graph-theoretic approaches do not always capture meaningful relationships. (Example: two graphs where dist(s, t) = 2; in one of them, s and t are linked via only one path and are probably unrelated.) Problem: the shortest-path distance does not take multiple paths into account. H. Tong, Y. Koren and C. Faloutsos. Fast direction-aware proximity for graph mining. SIGKDD.
32 Max-flow Min-cut Approach. Idea: (1) assign a limited capacity to each edge, proportional to its weight; (2) compute the maximal number of simultaneous units delivered from s to t. Problem: the number of hops does not count! (Example: two s–t configurations with the same maximal flow d(s, t) = 1, although in one of them s and t are much closer.)
33 SimRank. Idea: two objects are similar if they are referenced by similar objects. SimRank operates on the node-pair graph G²(V², E²) derived from G(V, E), capturing structural context. Glen Jeh and Jennifer Widom. SimRank: a measure of structural-context similarity. SIGKDD, 2002.
34 SimRank. Structural context: s(a, b) = (C / (|I(a)| · |I(b)|)) · Σ_{i ∈ I(a)} Σ_{j ∈ I(b)} s(i, j), with s(a, a) = 1, where C ∈ [0, 1] is a decay factor, I(a) denotes the in-neighbors of a, |I(a)| · |I(b)| is the total number of in-neighbor pairs, and the double sum averages the similarity between the in-neighbors of a and the in-neighbors of b.
35 SimRank for Bipartite Graphs. On a bipartite graph G(V, E) (e.g., queries such as "music videos" and "music" linked to URLs): the similarity of two queries A and B is the average similarity between the out-neighbors of A and the out-neighbors of B; the similarity of two URLs c and d is the average similarity between the in-neighbors of c and the in-neighbors of d. [Jeh, Widom 02; improvements: Antonellis+ 08 (SimRank++), C. Li+, Han+ 10, Y. Zhang+ 13, P. Li+ 14]
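The recursive SimRank definition above can be sketched with a naive iterative implementation; the decay factor C = 0.8 and the toy graph are illustrative assumptions:

```python
# Sketch of iterative SimRank (in-neighbor formulation): start from
# s(u, u) = 1, s(u, v) = 0, and repeatedly apply the recursive equation.
def simrank(in_nbrs, C=0.8, iters=20):
    nodes = list(in_nbrs)
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif in_nbrs[u] and in_nbrs[v]:
                    total = sum(s[(i, j)] for i in in_nbrs[u] for j in in_nbrs[v])
                    new[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
                else:
                    new[(u, v)] = 0.0  # a node without in-neighbors
        s = new
    return s

# Toy graph: both c and d are pointed to by a and b.
in_nbrs = {"a": [], "b": [], "c": ["a", "b"], "d": ["a", "b"]}
s = simrank(in_nbrs)
print(round(s[("c", "d")], 3))  # 0.4
```

With C = 0.8, c and d share both in-neighbor pairs (a, a) and (b, b), each with similarity 1, out of 4 pairs, giving 0.8 · 2/4 = 0.4.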
36 Lecture road. Similarity based. Iterative classification. Label propagation.
37 Iterative classification methods. Idea: use features that take the neighboring nodes into account, and repeat the classification several times until nothing changes. Suppose for each node we have two features: (1) the number of neighbors with class A, and (2) the number of neighbors without a class. Learn a classifier on the labeled nodes and apply it to the unlabeled ones; then recompute the labels and features on the nodes, and repeat until convergence. Neville, J. and Jensen, D. Iterative classification in relational data. AAAI.
38 Iterative Classification Algorithm (ICA). Train a classifier using the labeled instances. Until convergence: apply the classifier to the unlabeled nodes; update the feature vectors for the unlabeled nodes. Return the labels for the unlabeled nodes. Neville, J. and Jensen, D. Iterative classification in relational data. AAAI.
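A minimal ICA sketch, using a deliberately simple relational "classifier" (majority vote over already-labeled neighbors) in place of a trained one; the graph and seed labels are illustrative:

```python
# Sketch of ICA: repeatedly relabel unlabeled nodes from their
# neighbors' current labels until nothing changes.
from collections import Counter

def ica(adj, seed_labels, iters=10):
    labels = dict(seed_labels)          # node -> label for the seed nodes
    unlabeled = [v for v in adj if v not in labels]
    for _ in range(iters):
        changed = False
        for v in unlabeled:
            votes = Counter(labels[u] for u in adj[v] if u in labels)
            if votes:
                best = votes.most_common(1)[0][0]
                if labels.get(v) != best:
                    labels[v] = best
                    changed = True
        if not changed:                  # convergence
            break
    return labels

# A path 1-2-3-4-5 with labels known only at the two ends.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
seed = {1: "A", 5: "B"}
print(ica(adj, seed))
```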
39 Extension to multi-label. Each node has a distribution over the labels. To avoid noise, keep only the top-k labels for each unlabeled node, sorted in descending order. Intuition: remove the less confident labels. Lu, Q. and Getoor, L. Link-based classification. ICML.
40 Lecture road. Similarity based. Iterative classification. Label propagation.
41 Guilt-by-Association Techniques. Given: a graph and a few labeled nodes. Find: the class (red/green) for the rest of the nodes. Assuming: a network effect (homophily/heterophily). (Example figure: a network of fraudster, accomplice, and honest nodes.)
42 Personalized Random Walk with Restarts (RWR). Idea: propagate labels from a set of nodes to the rest of the graph. [Brin+ 98; Haveliwala 03; Tong+ 06; Minkov, Cohen 07]
43 PageRank: a kind of random walk. (Example: John is pointed to by Laura, Andrew, and Bill.) The importance of John is high if Laura, Andrew, and Bill are also important: PRank(John) = PRank(Laura)/outdeg(Laura) + PRank(Andrew)/outdeg(Andrew) + PRank(Bill)/outdeg(Bill).
44 Random Walk. In a random walk you do the opposite: you assume that the walker moves randomly, choosing one of the neighbors to visit. (Example: 1. Tom chooses Andrew or Laura, each with probability 1/2. 2. Once he chooses one, the number of times he visited that node is increased. 3. The process continues until nothing changes anymore, at a probabilistic level.) John will receive many visits, since many nodes are connected to him.
45 Personalized RWR. Now assume that with probability c you perform another move and with probability (1 − c) you jump back to Tom. Therefore the probability for the walker of being at Tom's place is Prob(Tom, t) = c · ( Prob(Andrew, t − 1)/deg(Andrew) + Prob(Laura, t − 1)/deg(Laura) ) + (1 − c), where Prob(Andrew, t − 1) is the probability of visiting Andrew at time (t − 1), the denominators are the neighbors' degrees, and (1 − c) is the probability of jumping back to Tom.
46 Comparing two nodes. Start two random walks, one from each of the two nodes you want to compare. Compare the final score vectors you obtain over all nodes in the graph using some vector comparison (e.g., cosine similarity, KL-divergence). Example over the nodes [Tom, Andrew, Alice, John, ..., Paul]: vector for Tom = [0.2, 0.2, 0.1, 0.3, ..., 0.01]; vector for John = [0.05, 0.15, 0.2, 0.2, ..., 0.2]; compare the two vectors (e.g., subtract them).
47 Personalized RWR. Now assume that with probability c you perform another move and with probability (1 − c) you jump back to Tom: [I − c A D⁻¹] x = (1 − c) y, where A D⁻¹ encodes the graph structure (column-normalized adjacency matrix, with entries such as ½), x is the relevance vector, y the starting vector, and (1 − c) the restart probability.
48 Personalized RWR. Personalized RWR is defined as the probability that a random surfer reaches a node n after wandering in the graph for a long time. After some time the value for n (and for all the other nodes) does not change anymore; technically speaking, the random walk converges to a stationary distribution. Let A be the adjacency matrix of the graph G = (V, E): x = c D⁻¹ A x + (1 − c) y, where D is a diagonal matrix with the node degrees on the diagonal, y is a vector containing the probability of starting from each of the nodes, and x is the probability of reaching each node in the graph; dim(x) = dim(y) = |V|. What is D⁻¹A? Recall that the inverse of a diagonal matrix is a diagonal matrix containing the reciprocals of the elements on the diagonal. We need to find x: I x = c D⁻¹ A x + (1 − c) y ⇒ I x − c D⁻¹ A x = (1 − c) y ⇒ (I − c D⁻¹ A) x = (1 − c) y ⇒ x = (I − c D⁻¹ A)⁻¹ (1 − c) y.
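The fixed point above can be computed by simple power iteration instead of a matrix inverse; this sketch uses a toy undirected graph, a restart at node 0, and c = 0.85, all illustrative assumptions:

```python
# Sketch: personalized RWR by power iteration on x = c*W x + (1-c)*y,
# with W the degree-normalized adjacency matrix.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=0)
W = A / deg                     # column-normalize: each column sums to 1
c = 0.85                        # probability of continuing the walk
y = np.array([1.0, 0.0, 0.0, 0.0])  # restart distribution: always node 0

x = np.full(4, 0.25)            # start from the uniform distribution
for _ in range(200):
    x = c * W @ x + (1 - c) * y
print(np.round(x, 3))  # most mass near the restart node and its neighbors
```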
49 Another way of looking at the RW. x = c D⁻¹ A x + (1 − c) y. Suppose the surfer starts from the beginning: he chooses one node at random among those in y. At each step he will either choose, with probability (1 − c), another starting node, or, with probability c, move to one neighbor. Why? Think about the process without restart, setting W = D⁻¹A: x_1 = W x_0, x_2 = W x_1 = W W x_0 = W² x_0, ..., x_n = Wⁿ x_0. That means that (Wⁿ)_{i,j} contains the probability of reaching j starting from i in n steps.
50 Semi-Supervised Learning (SSL). Idea: if you have few labeled nodes, exploit the structure and the homophily (similarity) between nodes. Graph-based: few labeled nodes; edges encode the similarity between nodes. Inference: exploit neighborhood information, step by step. Ji, M., Sun, Y., Danilevsky, M., Han, J. and Gao, J. Graph regularized transductive classification on heterogeneous information networks. ECML/PKDD.
51 SSL Equation. [I + a(D − A)] x = y, where D − A is the graph Laplacian (encoding the graph structure), x holds the final labels, y the known labels, and a is the homophily strength between neighbors (analogous to the stiffness of a spring).
52 SSL Equation. What does it compute? [I + a(D − A)] x = y. Hard? Let's unroll it: x = −a(D − A) x + y = a A x − a D x + y, i.e., per node, x_i = a Σ_j A_{ij} x_j + (y_i − a D_{ii} x_i): the first term sums the labels from the neighbors; the second is the difference between the learned label at the node and the input label.
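For a small graph the SSL system can be solved directly; the path graph, the homophily strength a = 0.5, and the ±1 seed labels below are illustrative assumptions:

```python
# Sketch: solve the SSL system [I + a(D - A)] x = y on a toy path graph
# whose two end nodes carry known labels +1 and -1.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
a = 0.5                               # homophily strength
y = np.array([1.0, 0.0, 0.0, -1.0])  # known labels; 0 = unlabeled

x = np.linalg.solve(np.eye(4) + a * (D - A), y)
print(np.round(x, 3))  # scores fade from positive to negative along the path
```

The sign of each entry of x then gives the predicted class; the magnitude reflects how close the node is to a labeled seed.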
53 Belief Propagation. Iterative message-based method. A propagation matrix specifies, for each class of the sender and each class of the receiver, how compatible the two classes are across an edge (here: homophily). Messages are exchanged round after round until a stop criterion is fulfilled. Pearl, J. Reverend Bayes on inference engines: a distributed hierarchical approach. AAAI. Yedidia, J.S., Freeman, W.T. and Weiss, Y. Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium.
54 Belief Propagation. Iterative message-based method. The propagation matrix, indexed by the class of the sender and the class of the receiver, can encode either homophily (neighbors tend to share a class) or heterophily (neighbors tend to have different classes).
55 Belief Propagation Equations. The message from node i to node j combines i's current belief with the homophily strength: message(i → j) ∝ belief(i) × homophily strength.
56 Belief Propagation Equations. Belief of i: b_i(x_i) ∝ φ_i(x_i) · Π_{j ∈ N(i)} m_ji(x_i), i.e., the prior belief φ_i(x_i) times the messages from the neighbors j ∈ N(i).
57 Fast Belief Propagation. Original Belief Propagation [Yedidia+] (non-linear): m_ij(x_j) = Σ_{x_i} φ_i(x_i) ψ_ij(x_i, x_j) Π_{n ∈ N(i)\j} m_ni(x_i); b_i(x_i) = η φ_i(x_i) Π_{j ∈ N(i)} m_ji(x_i). FaBP [Koutra+] (linearized BP): BP is approximated by the linear system [I + a D − c′ A] b_h = φ_h, where φ_h are the prior beliefs and b_h the final beliefs. Koutra, D., Ke, T.Y., Kang, U., Chau, D.H.P., Pao, H.K.K. and Faloutsos, C. Unifying guilt-by-association approaches: theorems and fast algorithms. PKDD.
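The FaBP linear system can likewise be solved directly for a small graph. The parameter values a and c′ and the priors below are illustrative assumptions (in FaBP they are derived from the homophily factor; here they are just plausible small values):

```python
# Sketch: solve the linearized-BP system [I + a*D - c'*A] b = phi
# on a toy graph with two opposing, centered prior beliefs.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
a, c_prime = 0.04, 0.2                  # illustrative parameter values
phi = np.array([0.5, 0.0, 0.0, -0.5])   # centered prior beliefs

b = np.linalg.solve(np.eye(4) + a * D - c_prime * A, phi)
print(np.round(b, 3))  # final beliefs for each node
```

Under homophily, node 1 (a neighbor of the positive seed) ends up with a positive belief even though its prior was neutral.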
58 Qualitative Comparison of GBA methods. RWR: no heterophily, scalable, convergence guaranteed. SSL: no heterophily, scalable, convergence guaranteed. BP: supports heterophily; scalability and convergence not guaranteed in general. FABP: supports heterophily, scalable, convergence guaranteed.
59 Correspondence of Methods (RWR = Random Walk with Restarts, SSL = Semi-supervised Learning, BP = Belief Propagation; x and b_h unknown, y and φ_h known):
RWR: [I − c A D⁻¹] x = (1 − c) y
SSL: [I + a (D − A)] x = y
FABP: [I + a D − c′ A] b_h = φ_h
60 In the next episode: Link prediction. Community detection. And much more.
61 Questions?
62 References
- Lorrain, F. and White, H.C. Structural equivalence of individuals in social networks. The Journal of Mathematical Sociology.
- Borgatti, S.P. and Everett, M.G. Notions of position in social network analysis. Sociological Methodology.
- Borgatti, S.P. and Everett, M.G. Regular blockmodels of multiway, multimode matrices. Social Networks.
- Henderson, K., Gallagher, B., Eliassi-Rad, T., Tong, H., Basu, S., Akoglu, L., Koutra, D., Faloutsos, C. and Li, L. RolX: structural role extraction & mining in large graphs. SIGKDD.
- Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H. and Faloutsos, C. It's who you know: graph mining using recursive structural features. SIGKDD.
- Tong, H., Koren, Y. and Faloutsos, C. Fast direction-aware proximity for graph mining. SIGKDD.
- Jeh, G. and Widom, J. SimRank: a measure of structural-context similarity. SIGKDD, 2002.
- Ji, M., Sun, Y., Danilevsky, M., Han, J. and Gao, J. Graph regularized transductive classification on heterogeneous information networks. ECML/PKDD.
- Pearl, J. Reverend Bayes on inference engines: a distributed hierarchical approach. AAAI.
- Yedidia, J.S., Freeman, W.T. and Weiss, Y. Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium.
- Koutra, D., Ke, T.Y., Kang, U., Chau, D.H.P., Pao, H.K.K. and Faloutsos, C. Unifying guilt-by-association approaches: theorems and fast algorithms. PKDD.
- Neville, J. and Jensen, D. Iterative classification in relational data. AAAI.
- Lu, Q. and Getoor, L. Link-based classification. ICML.
A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham
More informationModeling Dynamic Behavior in Large Evolving Graphs
Modeling Dynamic Behavior in Large Evolving Graphs R. Rossi, J. Neville, B. Gallagher, and K. Henderson Presented by: Doaa Altarawy 1 Outline - Motivation - Proposed Model - Definitions - Modeling dynamic
More informationGraphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech
CSE 6242/ CX 4242 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John
More informationFinding and Visualizing Graph Clusters Using PageRank Optimization. Fan Chung and Alexander Tsiatas, UCSD WAW 2010
Finding and Visualizing Graph Clusters Using PageRank Optimization Fan Chung and Alexander Tsiatas, UCSD WAW 2010 What is graph clustering? The division of a graph into several partitions. Clusters should
More informationWeb search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)
' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second
More informationCSE 494: Information Retrieval, Mining and Integration on the Internet
CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford
More informationSlides based on those in:
Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]
More informationEfficient Iterative Semi-supervised Classification on Manifold
. Efficient Iterative Semi-supervised Classification on Manifold... M. Farajtabar, H. R. Rabiee, A. Shaban, A. Soltani-Farani Sharif University of Technology, Tehran, Iran. Presented by Pooria Joulani
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability
More informationSocial Network Analysis
Social Network Analysis Mathematics of Networks Manar Mohaisen Department of EEC Engineering Adjacency matrix Network types Edge list Adjacency list Graph representation 2 Adjacency matrix Adjacency matrix
More information1 Case study of SVM (Rob)
DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how
More informationSimRank : A Measure of Structural-Context Similarity
SimRank : A Measure of Structural-Context Similarity Glen Jeh and Jennifer Widom 1.Co-Founder at FriendDash 2.Stanford, compter Science, department chair SIGKDD 2002, Citation : 506 (Google Scholar) 1
More informationBruno Martins. 1 st Semester 2012/2013
Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationMarkov Networks in Computer Vision
Markov Networks in Computer Vision Sargur Srihari srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Some applications: 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction
More informationSemi-supervised learning and active learning
Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners
More informationGeneral Instructions. Questions
CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These
More informationRelational Classification for Personalized Tag Recommendation
Relational Classification for Personalized Tag Recommendation Leandro Balby Marinho, Christine Preisach, and Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Samelsonplatz 1, University
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationMarkov Networks in Computer Vision. Sargur Srihari
Markov Networks in Computer Vision Sargur srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Important application area for MNs 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction
More informationD-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.
D-Separation Say: A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either a) the arrows on the path meet either
More informationMean Field and Variational Methods finishing off
Readings: K&F: 10.1, 10.5 Mean Field and Variational Methods finishing off Graphical Models 10708 Carlos Guestrin Carnegie Mellon University November 5 th, 2008 10-708 Carlos Guestrin 2006-2008 1 10-708
More informationClustering: Classic Methods and Modern Views
Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering
More informationSupervised Belief Propagation: Scalable Supervised Inference on Attributed Networks
Supervised Belief Propagation: Scalable Supervised Inference on Attributed Networks Jaemin Yoo Seoul National University Seoul, Republic of Korea jaeminyoo@snu.ac.kr Saehan Jo Seoul National University
More informationThanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman
Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman http://www.mmds.org Overview of Recommender Systems Content-based Systems Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive
More informationCOMP 4601 Hubs and Authorities
COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank
More informationA Constrained Spreading Activation Approach to Collaborative Filtering
A Constrained Spreading Activation Approach to Collaborative Filtering Josephine Griffith 1, Colm O Riordan 1, and Humphrey Sorensen 2 1 Dept. of Information Technology, National University of Ireland,
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationLink Prediction for Social Network
Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationCentralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge
Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More informationFall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points)
Fall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points) Name: This exam is open book and notes. You can use a calculator but no laptops, cell phones, nor other electronic devices are allowed.
More informationMachine Learning Classifiers and Boosting
Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.
More informationJing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei i Han 1 University of Illinois, IBM TJ Watson.
Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei i Han 1 University of Illinois, IBM TJ Watson Debapriya Basu Determine outliers in information networks Compare various algorithms
More informationCSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies.
CSE 547: Machine Learning for Big Data Spring 2019 Problem Set 2 Please read the homework submission policies. 1 Principal Component Analysis and Reconstruction (25 points) Let s do PCA and reconstruct
More informationMore on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization
More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationSpectral Clustering X I AO ZE N G + E L HA M TA BA S SI CS E CL A S S P R ESENTATION MA RCH 1 6,
Spectral Clustering XIAO ZENG + ELHAM TABASSI CSE 902 CLASS PRESENTATION MARCH 16, 2017 1 Presentation based on 1. Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and computing 17.4
More information10-701/15-781, Fall 2006, Final
-7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly
More information3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today
3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationDS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand
More informationIntroduction to Machine Learning. Xiaojin Zhu
Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
More informationHome Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit
Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing
More information