Section 7.12: Similarity By: Ralucca Gera, NPS
Motivation
We talked about global properties: average degree, average clustering, average path length. We talked about local properties: some node centralities. Next we'll look at pairwise properties of nodes: node equivalence and similarity, and correlation between pairs of nodes. Why care? How do items get suggested to users? Based on the past history of the user, and based on the similarity of the user with other people.
Similarity
In a complex network, one can measure similarity between vertices or between networks. We consider the similarity between vertices in the same network. How can this similarity be quantified? Differentiating vertices helps tease apart the types and relationships of vertices. This is useful for "click here for pages/movies/books similar to this one." We determine similarities based on the network structure. There are two types of similarity: structural equivalence (the main measure being the Pearson correlation coefficient) and regular equivalence.
Structural vs. Regular Equivalent Nodes
Defn: Two vertices are structurally equivalent if they share many neighbors, such as two math students who know the same professors.
Defn: Two vertices are regularly equivalent if they have many neighbors that are also equivalent, such as two deans who have similar ties: provost, president, department chairs (some of which could actually be common).
Structural Equivalence
Structurally Equivalent Nodes
Two vertices are structurally equivalent if they share many neighbors. How can we measure this similarity?
Structural similarity
1. Count of common neighbors: n_ij = Σ_k A_ik A_kj.
2. Normalized common neighbors (divide by a constant): by the number of vertices, n_ij / n, or by the maximum possible number of common neighbors, n_ij / min(k_i, k_j). Jaccard similarity is the quotient of the number of common neighbors to the union of all neighbors: J_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|.
These are fast to compute (thus popular), as they just count neighbors.
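A minimal sketch of measures 1 and 2; the adjacency list below is an assumed toy example, not a graph from the slides:

```python
# Toy undirected graph as an adjacency list (hypothetical example).
G = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}

def common_neighbors(G, i, j):
    """Raw count of shared neighbors: n_ij = |N(i) ∩ N(j)|."""
    return len(G[i] & G[j])

def jaccard(G, i, j):
    """Common neighbors divided by the union of all neighbors of i and j."""
    union = G[i] | G[j]
    return len(G[i] & G[j]) / len(union) if union else 0.0

print(common_neighbors(G, 1, 4))  # 1 and 4 share neighbors {2, 3} -> 2
print(jaccard(G, 1, 4))           # 2 shared out of union {2, 3, 5} -> 2/3
```

NetworkX also ships a built-in `jaccard_coefficient` that computes the same quantity on a `Graph`.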
Structural similarity (2)
3. Cosine similarity (divide by a variable): let x be the row corresponding to vertex i in A, and let y be the column corresponding to vertex j in A. Then
cos θ = (x · y) / (|x| |y|),
which for a 0-1 adjacency matrix simplifies to
σ_ij = Σ_k A_ik A_kj / sqrt(k_i k_j) = n_ij / sqrt(k_i k_j).
Convention: if deg i = 0 (the denominator is 0), then cos θ = 0. Range: 0 ≤ σ_ij ≤ 1.
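A sketch of the cosine measure computed directly from an adjacency matrix; the matrix below is an assumed example, not from the slides:

```python
import numpy as np

# Adjacency matrix of a small assumed example graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])

def cosine_sim(A, i, j):
    """sigma_ij = n_ij / sqrt(k_i * k_j); 0 by convention if a degree is 0."""
    n_ij = A[i] @ A[j]                 # dot product of rows = common neighbors
    k_i, k_j = A[i].sum(), A[j].sum()
    return 0.0 if k_i == 0 or k_j == 0 else n_ij / np.sqrt(k_i * k_j)

print(cosine_sim(A, 0, 3))  # identical neighborhoods -> 1.0
```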
Cosine similarity
Cosine similarity is an example of a technique used in information retrieval, text analysis, or any comparison of x to y, where each of x and y can be vectorized based on its components. For example: to find the closest document(s)/website(s) to another document or to a query.
Technique used in text analysis
A collection of n documents (D_1, D_2, ..., D_n) can be represented in the vector space model by a term-document matrix, given a list of t terms of interest: T_1, T_2, ..., T_t. An entry in the matrix corresponds to the weight of a term in a document; zero means the term has no significance in that document, or simply doesn't exist in it.

        T_1   T_2   ...  T_t
D_1     w_11  w_21  ...  w_t1
D_2     w_12  w_22  ...  w_t2
...
D_n     w_1n  w_2n  ...  w_tn

Source: Raymond J. Mooney, University of Texas at Austin
Graphic Representation
Example (documents and a query as vectors over terms T_1, T_2, T_3):
D_1 = 2T_1 + 3T_2 + 5T_3
D_2 = 3T_1 + 7T_2 + 1T_3
Q = 0T_1 + 0T_2 + 2T_3
Is D_1 or D_2 more similar to Q? How do we measure the degree of similarity: distance, angle, or projection?
Source: Raymond J. Mooney, University of Texas at Austin
Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.
CosSim(d_j, q) = (d_j · q) / (|d_j| |q|) = Σ_{i=1}^{t} w_ij w_iq / sqrt( (Σ_{i=1}^{t} w_ij²) (Σ_{i=1}^{t} w_iq²) )
D_1 = 2T_1 + 3T_2 + 5T_3: CosSim(D_1, Q) = 10 / sqrt((4+9+25)(0+0+4)) ≈ 0.81
D_2 = 3T_1 + 7T_2 + 1T_3: CosSim(D_2, Q) = 2 / sqrt((9+49+1)(0+0+4)) ≈ 0.13
Q = 0T_1 + 0T_2 + 2T_3
D_1 is about 6 times more similar to Q than D_2 is, using cosine similarity.
Source: Raymond J. Mooney, University of Texas at Austin
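The arithmetic in this example can be checked with a few lines of Python (a sketch; the vectors are the ones from the example):

```python
import math

def cos_sim(d, q):
    """Inner product of d and q, normalized by the two vector lengths."""
    dot = sum(a * b for a, b in zip(d, q))
    return dot / (math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q)))

D1 = [2, 3, 5]  # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]  # D2 = 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]  # Q  = 0T1 + 0T2 + 2T3

print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```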
Illustration of 3 Nearest Neighbors for Text
Nearest-neighbor retrieval works the same with rows of the adjacency matrix, where the terms are just the neighbors, and similarity is how many common neighbors there are.
Source: Raymond J. Mooney, University of Texas at Austin
So far: Structural similarity
1. Count of common neighbors: n_ij = Σ_k A_ik A_kj.
2. Normalized common neighbors, dividing by: the number of vertices; the maximum count of common neighbors possible; the exact count of all distinct neighbors of i and j (Jaccard similarity).
3. Cosine similarity: σ_ij = n_ij / sqrt(k_i k_j).
Pearson Correlation Coeff (1)
4. Pearson correlation coefficient (compare the count of common neighbors to the expected count in a network in which vertices choose their neighbors at random): Let i and j be two vertices of degrees k_i and k_j. The probability that a randomly chosen neighbor of j is also a neighbor of i is k_i / n (or k_i / (n − 1) if there are no loops in G). Since j has k_j neighbors, the expected number of common neighbors of i and j is k_i k_j / n.
Pearson Correlation Coeff (2)
4. Pearson correlation coefficient (continued): the covariance is the actual number of common neighbors minus the expected number of common neighbors, and will then be normalized:
cov(A_i, A_j) = Σ_k A_ik A_jk − (k_i k_j) / n
             = Σ_k A_ik A_jk − n <A_i> <A_j>
             = Σ_k (A_ik − <A_i>)(A_jk − <A_j>)
The last equality can be checked by FOILing and simplifying, using the fact that the mean of row i is <A_i> = (1/n) Σ_k A_ik = k_i / n, and that <A_i> and <A_j> are constants with respect to k.
Pearson Correlation Coeff (3)
4. Pearson correlation coefficient (continued): cov(A_i, A_j) = Σ_k (A_ik − <A_i>)(A_jk − <A_j>) is the actual number of common neighbors minus the expected number of common neighbors (still to be normalized):
= 0 if the # of common neighbors is what is expected by chance
< 0 if the # of common neighbors is less than what is expected by chance
> 0 if the # of common neighbors is more than what is expected by chance
Pearson Correlation Coeff (4)
4. Pearson correlation coefficient (continued): the covariance normalized by the standard deviations of rows i and j:
r_ij = cov(A_i, A_j) / (σ_i σ_j) = Σ_k (A_ik − <A_i>)(A_jk − <A_j>) / ( sqrt(Σ_k (A_ik − <A_i>)²) sqrt(Σ_k (A_jk − <A_j>)²) )
r_ij = 0 if the # of common neighbors is what is expected by chance
−1 ≤ r_ij < 0 if the # of common neighbors is less than what is expected by chance
0 < r_ij ≤ 1 if the # of common neighbors is more than what is expected by chance
r_ij shows whether two vertices have more or fewer common neighbors than vertices choosing their neighbors at random.
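A sketch of the normalized coefficient applied to rows of an adjacency matrix (the matrix is an assumed example); it agrees with NumPy's built-in `np.corrcoef`:

```python
import numpy as np

# Adjacency matrix of a small assumed example graph.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 1, 0],
              [0, 1, 1, 0, 1],
              [0, 0, 0, 1, 0]])

def pearson(A, i, j):
    """Covariance of rows i and j divided by the product of their std devs."""
    xi = A[i] - A[i].mean()
    xj = A[j] - A[j].mean()
    return (xi @ xj) / np.sqrt((xi @ xi) * (xj @ xj))

r = pearson(A, 0, 3)
print(round(r, 3))  # 0.667: more common neighbors than expected by chance
```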
More measures
There are more measures of structural equivalence; the Pearson correlation coefficient is the one commonly used. Another is the Euclidean distance, d_ij = Σ_k (A_ik − A_jk)², which for a 0-1 adjacency matrix equals the cardinality of the symmetric difference between the neighbors of i and the neighbors of j. All of these are measures of structural equivalence in a network: they measure different ways of counting common neighbors.
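A sketch of the Euclidean distance on adjacency rows (the matrix is an assumed example):

```python
import numpy as np

# Adjacency matrix of a small assumed example graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])

def euclidean(A, i, j):
    """d_ij = sum_k (A_ik - A_jk)^2; for a 0-1 matrix this is the size of
    the symmetric difference of the neighborhoods of i and j."""
    return int(((A[i] - A[j]) ** 2).sum())

print(euclidean(A, 0, 3))  # identical neighborhoods -> 0
print(euclidean(A, 0, 1))  # neighborhoods differ at vertices 0, 1, and 3 -> 3
```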
Overview updated: Structural similarity
1. Count of common neighbors: n_ij = Σ_k A_ik A_kj.
2. Normalized common neighbors, dividing by: the number of vertices; the maximum count of common neighbors possible; the exact count of all distinct neighbors of i and j (Jaccard similarity).
3. Cosine similarity: σ_ij = n_ij / sqrt(k_i k_j).
4. Pearson correlation coefficient (exists in NetworkX):
r_ij = Σ_k (A_ik − <A_i>)(A_jk − <A_j>) / ( sqrt(Σ_k (A_ik − <A_i>)²) sqrt(Σ_k (A_jk − <A_j>)²) )
Regular Equivalence
Structural vs. Regular Equivalent Nodes (recall)
Defn: Two vertices are structurally equivalent if they share many neighbors, such as two math students who know the same professors.
Defn: Two vertices are regularly equivalent if they have many neighbors that are also equivalent, such as two deans who have similar ties: provost, president, department chairs (some of which could actually be common).
Defn of Regular Equivalence
Defn: Two nodes are regularly equivalent if they are equally related to equivalent nodes (i.e., their neighbors need to be equivalent as well). We capture equivalence through the color pattern of adjacencies: nodes with the same role get the same color, and nodes adjacent to different color patterns need different colors. (White and Reitz, 1983; Everett and Borgatti, 1991)
Structural vs. Regular Equivalent Color Classes
Structural equivalence: nodes with connectivity to the same nodes (common information about the network).
Regular equivalence: nodes with similar roles in the network (similar positions in the network).
(Figure: Leonid E. Zhukov)
Regular Equivalence (1)
Regular equivalence: nodes don't have to be adjacent to be regularly equivalent. REGE is an algorithm developed in the 1970s to discover regular equivalences, but the operations are very involved and the results are not easy to interpret (Newman). Newer methods:
METHOD 1: develop a score σ_ij that is high if the neighbors k of i and l of j are themselves (regularly) equivalent:
σ_ij = α Σ_{k,l} A_ik A_jl σ_kl
This counts pairs of neighbors of i and j, weighted by the neighbors' own regular similarity σ_kl; it is a recurrence relation in terms of σ, much like the eigenvector centrality equation.
Regular Equivalence (2)
In matrix form the recurrence is σ = α A σ A, an eigenvector-like equation in which σ plays the role of the eigenvector. But:
It may not give high self-similarity σ_ii, since σ_ii depends heavily on the similarity of the neighbors of i.
It doesn't give high similarity to pairs of vertices with lots of common neighbors (because it looks at the neighbors' values σ_kl rather than at the common neighbors directly).
It needs to be altered, and we'll see two methods next.
Regular Equivalence (3)
Fix: add a diagonal term so every vertex is similar to itself:
σ_ij = α Σ_{k,l} A_ik A_jl σ_kl + δ_ij, i.e. σ = α A σ A + I
Calculate by iterating, starting with the initial value σ^(0) = 0:
σ^(1) = I, σ^(2) = α A² + I, σ^(3) = α² A⁴ + α A² + I, ...
Note: σ is a weighted sum of the even powers of A (walks of even length), but we want walks of all lengths.
Regular Equivalence (4)
METHOD 2: keep only one sum over neighbors:
σ_ij = α Σ_k A_ik σ_kj + δ_ij, i.e. σ = α A σ + I
Iterating from σ^(0) = 0: σ^(1) = I, σ^(2) = α A + I, σ^(3) = α² A² + α A + I, ..., so
σ = Σ_m α^m A^m = (I − α A)^(−1)
σ_ij is a weighted count of all walks between i and j (with α < 1/κ₁ so that longer walks weigh less and the sum converges). (Leicht, Holme, and Newman, 2006)
Regular Equivalence (5)
METHOD 2: σ = α A σ + I, i.e. σ_ij = α Σ_k A_ik σ_kj + δ_ij.
This looks a lot like Katz centrality (recall x = α A x + β 1, where β is a constant initial weight given to each vertex so that its degree matters). In fact, the Katz centrality of vertex i (with β = 1) is the row sum of this similarity: x_i = Σ_j σ_ij.
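A sketch of METHOD 2 in matrix form, σ = (I − αA)^(−1), with α chosen below the reciprocal of the leading eigenvalue so the series converges (the matrix is an assumed example):

```python
import numpy as np

# Adjacency matrix of a small assumed example graph.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])

kappa1 = max(abs(np.linalg.eigvals(A)))  # leading eigenvalue of A
alpha = 0.9 / kappa1                     # must satisfy alpha < 1/kappa1

# sigma = I + alpha*A + alpha^2*A^2 + ... = (I - alpha*A)^{-1}
sigma = np.linalg.inv(np.eye(4) - alpha * A)

# Row sums of sigma give Katz centrality (with beta = 1):
katz = sigma.sum(axis=1)
print(np.round(katz, 3))
```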
Regular Equivalence (6)
The regular equivalence score tends to be high for vertices of high degree (more chances of their neighbors being similar). However, some low-degree vertices could be similar as well, so the formula can be modified if desired, e.g. by discounting the endpoint degrees:
σ'_ij = σ_ij / (k_i k_j), which is equivalent to σ' = D^(−1) (I − α A)^(−1) D^(−1),
where D is the diagonal matrix of vertex degrees.
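One way to discount degree, sketched under the assumption that the modification simply divides the similarity by the product of the endpoint degrees (the matrix below is an assumed example):

```python
import numpy as np

# Adjacency matrix of a small assumed example graph.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])

k = A.sum(axis=1)                             # vertex degrees
alpha = 0.9 / max(abs(np.linalg.eigvals(A)))  # keep the series convergent
sigma = np.linalg.inv(np.eye(4) - alpha * A)  # regular-equivalence similarity

# Degree-discounted similarity: sigma'_ij = sigma_ij / (k_i * k_j)
sigma_prime = sigma / np.outer(k, k)
```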
The connection between structural and regular equivalence
Equivalence (1)
σ = (I − α A)^(−1) = Σ_m α^m A^m is a weighted count of all walks between i and j:
σ_ij = δ_ij + α A_ij + α² (A²)_ij + α³ (A³)_ij + ...
Recall that structural equivalence attempts to count the # of common neighbors of i and j in different ways. What is the correlation between the number of common neighbors of vertices 4 and 5 (in the example graph on the slide) and the number of paths of length 2 between 4 and 5?
Equivalence (2)
σ = (I − α A)^(−1) = Σ_m α^m A^m is a weighted count of all walks between i and j:
σ_ij = δ_ij + α A_ij + α² (A²)_ij + ...
Recall that structural equivalence attempts to count the # of common neighbors in different ways. So regular equivalence is a generalization of structural equivalence: the term (A²)_ij counts just the paths of length 2, i.e. exactly the common neighbors of i and j.
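The claim that (A²)_ij counts exactly the common neighbors of i and j can be checked directly (assumed example matrix):

```python
import numpy as np

# Adjacency matrix of a small assumed example graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])

A2 = A @ A  # (A^2)_ij = number of walks of length 2 from i to j

# Common neighbors of 0 and 3, counted directly from the neighborhoods:
common = len(set(np.flatnonzero(A[0])) & set(np.flatnonzero(A[3])))
print(A2[0, 3], common)  # both equal 2: the common neighbors are 1 and 2
```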