Report on the paper "Summarization-based Mining Bipartite Graphs"
Annika Glauser, ETH Zürich, Spring 2015
Extract from the paper [1]:
Introduction

The paper Summarization-based Mining Bipartite Graphs introduces a new algorithm, SCMiner (Summarization-Compression Miner), that combines graph summarization, graph clustering, link prediction and the discovery of the hidden structure of a graph. The objective of graph summarization is to produce a compressed representation of the input graph. The aim of graph clustering is to group similar nodes of the graph together in clusters. Link prediction tries to predict missing or eventual future links of the graph, and by trying to discover the hidden structure of a graph we want to say something about the structure of the data. So the algorithm converts a large bipartite graph into a highly compact smaller graph, which should give an idea of the structure of the data, abstract the original graph, cluster its nodes and predict missing or future links. This makes the algorithm a useful tool for data mining.

For an illustrative example, look at the graphs below. The left graph is an input graph that corresponds to (a very small part of) data from an online movie rating site. The users are denoted by squares while the movies are represented by circles. If a user liked a movie, he or she is connected to it by an edge. In the graph on the right, the nodes that could be in the same cluster are circled together. And as the other users that liked the movie Pitch Perfect also liked The Devil Wears Prada, it might be predicted that Anna likes it too (denoted by the bold edge).

Figure 0: A bipartite input graph
Figure 1: Graph with possible clusters and predicted edges

The hidden structure of the data then might look like the graph in figure 2. The problem is that the clusters and predicted edges in figure 1 and the structure in figure 2 are just possible solutions for the problem, because a bipartite graph can be represented by a lot of different - not necessarily good - summarizations.
As it can be proven that finding the globally optimal summarization is NP-hard, the algorithm SCMiner follows a heuristic approach and searches for a local optimum.

Figure 2: A possible hidden structure
Model

To be more formal than in the previous section, the input of the algorithm is a large bipartite graph G = (V1, V2, E), where V1 and V2 are node sets of type 1 and type 2 respectively, and E is the set of edges between them. The first part of the output is a summary graph G_S = (S1, S2, E_S) that contains two sets S1 and S2 of clusters of nodes of the corresponding type - called super nodes - and a set E_S of edges between the super nodes. The second part of the output is an additional graph G_A = (V1, V2, E_A) that contains the original node sets V1 and V2 and a set E_A of added or deleted edges between them that would be needed to restore the original graph G. The edges in E_A with a (+) sign have to be added to G_S in order to obtain G from it, and vice versa for the edges marked with a (-) sign.

To go with the previous example: the original graph is denoted by G = (V1, V2, E), where V1 = {A, B, C, D, E} (the users) and V2 = {P, Td, S, T} (the movies). The summary graph G_S = (S1, S2, E_S) consists of the four super nodes S11 = {A, B, C}, S12 = {D, E}, S21 = {P, Td} and S22 = {S, T}. The additional graph G_A consists of the deleted edge (C, S) (marked with a (+) sign) and the added edge (A, Td) (marked with a (-) sign).

Figure 3: An example for the model

In fact, the minus and plus signs in G_A can be omitted, as they can be derived by comparing G_A with G_S: if an edge of G_A is in G_S, the edge was added by the algorithm and is not part of the original graph G. Vice versa, an edge in G_A that doesn't appear in G_S means the original edge was deleted.
Data Compression

As already mentioned, there are a lot of different summarizations for one bipartite graph. Naturally, for our example graph we could just look at the different summarizations and choose the best one. But as the input normally is a lot bigger than the graph in figure 0, the algorithm tries to improve the summarization stepwise. But how to measure the goodness of a summarization?

The Minimum Description Length (MDL) principle states that the more we can compress the data (graph), the more we learn about its underlying structure. Therefore the goodness of a summarization is measurable in the shortness of its description length. Inspired by this principle, the authors propose the following coding scheme: they measure the coding cost CC(H) of a graph H = (V1, V2, E) by the lower bound of the coding cost of its compressed adjacency matrix A in {0,1}^(|V1| x |V2|) with a_ij = 1 if (V1i, V2j) in E. Which is (see footnote 1):

    CC(H) = |V1| · |V2| · H(A)    (1)

where H(A) = p0(A) · log2(1/p0(A)) + p1(A) · log2(1/p1(A)) - the entropy of A - and p0 and p1 are the probabilities of finding 0- and 1-entries in the adjacency matrix A of H.

The additional graph G_A from the previously introduced model can be represented by a simple adjacency matrix A_GA in {0,1}^(|V1| x |V2|); for G_S, however, we need in addition to the adjacency matrix A_GS in {0,1}^(|S1| x |S2|) a list of the nodes and their corresponding super nodes. The coding cost of this list is:

    CC(list) = sum_{i=1}^{2} sum_{j=1}^{Ni} |Sij| · log2(|Vi| / |Sij|)    (2)

where Ni is the number of super nodes of type i, |Sij| the number of nodes in super node Sij and |Vi| the number of nodes of type i. With this, the coding cost of a summarization G in the previously introduced model is (see footnote 2):

    CC(G) = CC(G_S) + CC(G_A) + CC(list)    (3)

The goal of the algorithm is to find a summarization that minimizes (3), because according to the MDL principle the solution should be optimal when the description length is minimal.
About (2): The information content of a certain event E can be measured by the function I(E) = I(p(E)) = log2(1/p(E)), where p(E) is the probability of E. The unit of measure is bits. So in fact, I(E) tells us how many bits we need to encode the event E. In (2) this event is: v in Vi : v in Sij. The probability of this event (when picking v at random) is |Sij| / |Vi| and therefore the information content is: I(v in Sij) = log2(|Vi| / |Sij|). To encode the list, we omit the names of the nodes and just concatenate the codes that correspond to the super nodes to which the nodes belong. (The order of the nodes is given by the order of the nodes in the adjacency matrix of G_A.) As there are |Sij| nodes in Sij, there are |Sij| strings of log2(|Vi| / |Sij|) bits in the coding of the list. By summing this over all super nodes Sij in the summarization, we get the number of required bits to encode the whole list.

About (1): The entropy H(A) is the average information content of an entry of the adjacency matrix A, and therefore (1) is the total information content of the whole adjacency matrix.

Footnote 1: The corresponding formula in the paper has a minus sign, which is a typo - as I verified with the authors.
Footnote 2: The paper is rather inexact in this equation. In the model, the list is integrated in G_S, therefore the equation should be CC(G) = CC(G_S) + CC(G_A) with CC(G_S) = CC(A_GS) + CC(list).
This was all very theoretic, so let's look at our previous example graph from figure 0, denoted by G. To make it comparable to other summarizations, we need to represent G by G_S, G_A and a list. As we haven't changed anything yet, A_GS (the adjacency matrix of G_S) is in {0,1}^(|V1| x |V2|) and A_GA in {0,1}^(|V1| x |V2|) is the zero matrix. No nodes were grouped yet, so each node is in its own super node:

    A_GS =
           P  Td S  T
       A   1  0  0  0
       B   1  1  0  0
       C   1  1  1  0
       D   0  0  1  1
       E   0  0  1  1

    A_GA = the 5 x 4 zero matrix

    list: A:A, B:B, C:C, D:D, E:E, P:P, Td:Td, S:S, T:T

For the coding costs we have:

    CC(G_S) = |S1| · |S2| · H(A_GS) = 5 · 4 · (1/2 · log2(2) + 1/2 · log2(2)) = 5 · 4 · 1 = 20
    CC(G_A) = |V1| · |V2| · H(A_GA) = 5 · 4 · 0 = 0    (a matrix of only 0-entries has entropy 0)
    CC(list) = 5 · 1 · log2(5) + 4 · 1 · log2(4) ≈ 19.6

Therefore CC(G) = CC(G_S) + CC(G_A) + CC(list) = 20 + 0 + 19.6 = 39.6.

Now let's say the output of the algorithm are the graphs on the right side of figure 3. Then we have:

    A_GS =
            S21 S22
       S11   1   0
       S12   0   1

    A_GA =
           P  Td S  T
       A   0  1  0  0
       B   0  0  0  0
       C   0  0  1  0
       D   0  0  0  0
       E   0  0  0  0

    list: A:S11, B:S11, C:S11, D:S12, E:S12, P:S21, Td:S21, S:S22, T:S22

Then the coding cost changes to:

    CC(G_S) = |S1| · |S2| · H(A_GS) = 2 · 2 · (1/2 · log2(2) + 1/2 · log2(2)) = 2 · 2 · 1 = 4
    CC(G_A) = |V1| · |V2| · H(A_GA) = 5 · 4 · (1/10 · log2(10) + 9/10 · log2(10/9)) ≈ 9.38
    CC(list) = 3 · log2(5/3) + 2 · log2(5/2) + 2 · log2(4/2) + 2 · log2(4/2) ≈ 8.85

And CC(G) = CC(G_S) + CC(G_A) + CC(list) = 4 + 9.38 + 8.85 = 22.23.

This tells us that the graph in figure 3 is a better summarization of the input graph G than G itself - which should have already been clear.
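The two coding costs above can be checked with a short Python sketch of equations (1)-(3). The helper names entropy, cc_matrix and cc_list are my own, not from the paper; the matrices are only described by their dimensions and number of 1-entries, which is all the entropy formula needs:

```python
import math

def entropy(ones, total):
    """Binary entropy H(A) of a 0/1 matrix with the given number of 1-entries."""
    if ones == 0 or ones == total:
        return 0.0
    p1 = ones / total
    p0 = 1.0 - p1
    return p0 * math.log2(1 / p0) + p1 * math.log2(1 / p1)

def cc_matrix(ones, rows, cols):
    """Coding cost of an adjacency matrix, Eq. (1): rows * cols * H(A)."""
    return rows * cols * entropy(ones, rows * cols)

def cc_list(sizes, n_type):
    """Coding cost of the super-node list for one node type, Eq. (2):
    sizes are the |Sij| of that type, n_type is |Vi|."""
    return sum(s * math.log2(n_type / s) for s in sizes)

# Trivial summarization (G_S = G): a 5x4 matrix with 10 one-entries,
# an empty G_A and 9 singleton super nodes.
cc_trivial = (cc_matrix(10, 5, 4)                           # CC(G_S) = 20
              + cc_matrix(0, 5, 4)                          # CC(G_A) = 0
              + cc_list([1] * 5, 5) + cc_list([1] * 4, 4))  # CC(list) ~ 19.6

# Summarization from figure 3: a 2x2 summary matrix with 2 one-entries,
# 2 changed edges in G_A, super nodes of sizes 3, 2 and 2, 2.
cc_summary = (cc_matrix(2, 2, 2)                            # CC(G_S) = 4
              + cc_matrix(2, 5, 4)                          # CC(G_A) ~ 9.38
              + cc_list([3, 2], 5) + cc_list([2, 2], 4))    # CC(list) ~ 8.85
```

Running this reproduces the costs 39.6 and 22.23 from the example.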
Edge Modification

Imagine we have a group of nodes that we'd like to merge. For this, all nodes in the group need to have exactly the same link pattern. If this is not the case, we need to change their patterns to match each other. Let's look at the graph in figure 0 again and assume we want to merge A, B and C: they have the common neighbour P, which means that we don't have to change any link patterns to this node. But Td is only connected to B and C, and S has only a connection to C. Now for each not common neighbour of the group there is the question: do we want to make the node into a common neighbour of the group - and therefore connect it to all nodes in the group to which it's not already connected? Or do we want to cut all ties between the node and the group - and therefore delete all edges between them? The answer depends on the cost of the operation, which is the same as the number of edges that need to be added or deleted:

    Cost_remove = sum_{i=1}^{p} ( |S1i| · |S2k|  if S1i links to S2k;  0 otherwise )
    Cost_add    = sum_{i=1}^{p} ( 0  if S1i links to S2k;  |S1i| · |S2k| otherwise )

where S2k is the not common neighbour in question and S11, ..., S1p are the super nodes of the group that should be merged. In figure 4 we would either need to add the edge (A, Td) for Td or delete the edges (B, Td) and (C, Td). So:

    Cost_remove(Td) = 2 > 1 = Cost_add(Td)

Figure 4: Group of nodes to merge and its neighbours
Figure 5: Group with changed edges to its neighbours

and we add the edge (A, Td), as it is cheaper. For S it's the other way round, and deleting (C, S) is cheaper than adding edges from S to A and B. The result is shown in figure 5. A special case would be if the adding and the removing cost were the same. Then it would be necessary to look further into the properties of the node to decide if it should be added to the common neighbours or not.
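The cost comparison can be sketched in a few lines of Python. modify_cost is a hypothetical helper name of my own; the group sizes and link flags below correspond to figure 4, where all super nodes are still singletons:

```python
def modify_cost(group_sizes, links, size_k=1):
    """Cost of turning S2k into a common neighbour of the group (add)
    versus cutting all its ties to the group (remove).
    group_sizes: |S1i| for the super nodes S11..S1p of the group,
    links: parallel booleans, True if S1i already links to S2k,
    size_k: |S2k|, the size of the neighbour in question."""
    cost_remove = sum(s * size_k for s, linked in zip(group_sizes, links) if linked)
    cost_add = sum(s * size_k for s, linked in zip(group_sizes, links) if not linked)
    return cost_remove, cost_add

# Group {A, B, C} against Td: only B and C link to Td.
td = modify_cost([1, 1, 1], [False, True, True])   # (2, 1) -> adding is cheaper
# Group {A, B, C} against S: only C links to S.
s = modify_cost([1, 1, 1], [False, False, True])   # (1, 2) -> removing is cheaper
```

So for Td we add (A, Td), and for S we delete (C, S), exactly as in figure 5.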
The routine ModifyEdge(group, G_S, G_A) takes as input such a group of nodes that should be merged, computes their not common neighbours and then removes or adds edges between each of the not common neighbours and the group according to the above cost functions. This is done by changing entries in A_GS and adding a 1-entry for each changed edge to A_GA.
The Algorithm

So far we know how to model a summarization, how to calculate its cost and - given a group - how to change the edges and merge the group (the merging is a simple relabeling and shrinking of A_GS and a change of some names in the list). What's still missing is how to find these groups.

Let's look again at the example of the online movie rating site: the aim is to group similar users and similar movies together. Two users are similar if they like the same movies, so their similarity could be defined as the number of movies they both like. But as some users might have liked only 5 movies, two of them that have 5 movies in common are very similar, while for users that have liked about 100 movies, 5 in common are not that much. Therefore the similarity of two users is the fraction of their liked movies that both of them liked. More formally:

    sim(S1i, S1j) = ( sum_{k=1}^{n} |S2k| ) / ( sum_{k=1}^{n+m} |S2k| )    (4)

where S21, ..., S2n denote the common neighbours of S1i and S1j, and S2(n+1), ..., S2(n+m) are the super nodes that are only connected to one of them (not common neighbours). Analogously for super nodes of type 2. The similarity ranges from 0 to 1. If S1i and S1j have exactly the same neighbours, m is equal to zero, which makes the numerator and the denominator of the above equation the same and the resulting similarity equal to one. On the other hand, if they don't share any neighbours, n is equal to zero and therefore the numerator is zero too, which makes the similarity equal to zero. As the similarity is only non-zero if the nodes have a common neighbour and therefore are hop-2-neighbours of each other - meaning reachable from each other in two hops - the above similarity is called hop-2-similarity (hop2sim) in the paper.
For the example graph in figure 0 the similarities are as follows:

    sim(A, B) = 1/2     sim(P, Td) = 2/3
    sim(A, C) = 1/3     sim(P, S)  = 1/5
    sim(B, C) = 2/3     sim(Td, S) = 1/4
    sim(C, D) = 1/4     sim(S, T)  = 2/3
    sim(C, E) = 1/4
    sim(D, E) = 1

Figure 6: The input graph from figure 0

Now, to specify which nodes should be merged, the algorithm uses a threshold th for the similarities. It starts at 1.0, so in figure 6 we would merge D and E. After doing this, the similarities look like in figure 8. As there are no nodes with similarity 1.0 anymore, the threshold has to be reduced by a reduction step ε in order to get other groups that can be merged. Combining all the seen steps results in the algorithm on the next page:

Figure 7: (top) the hop2sim's from the original graph
Figure 8: (bottom) the hop2sim's after merging E and D into super node S12

    sim(A, B) = 1/2     sim(P, Td) = 2/3
    sim(A, C) = 1/3     sim(P, S)  = 1/5
    sim(B, C) = 2/3     sim(Td, S) = 1/4
    sim(C, S12) = 1/4   sim(S, T)  = 2/3
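The table from figure 7 can be reproduced with a small Python sketch of Eq. (4). The function name hop2sim follows the paper; the dict-based graph representation is my own assumption, with all super nodes still singletons:

```python
def hop2sim(adj, sizes, u, v):
    """Hop-2 similarity of two super nodes of the same type, Eq. (4):
    total size of the common neighbours divided by the total size of
    all (common and not common) neighbours."""
    common = adj[u] & adj[v]
    all_nb = adj[u] | adj[v]
    if not all_nb:
        return 0.0
    return sum(sizes[k] for k in common) / sum(sizes[k] for k in all_nb)

# The example graph from figure 0.
edges = [('A', 'P'), ('B', 'P'), ('B', 'Td'), ('C', 'P'), ('C', 'Td'),
         ('C', 'S'), ('D', 'S'), ('D', 'T'), ('E', 'S'), ('E', 'T')]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
sizes = {n: 1 for n in adj}   # every super node is still a single node
```

For example, hop2sim(adj, sizes, 'A', 'B') gives 1/2 and hop2sim(adj, sizes, 'D', 'E') gives 1, matching the table.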
Algorithm SCMiner, extracted from [1], p. 1252

Input: bipartite graph G = (V, E), reduction step ε
Output: summary graph G_S, additional graph G_A

    //Initialization
    G_S = G, G_A = (V, ∅);
    Compute mincc using Eq. (3) with G_S and G_A;
    bestG_S = G_S, bestG_A = G_A;
    Compute hop2sim for each S ∈ G_S using Eq. (4);
    th = 1.0;
    //Search for the best summarization
    while th > 0 do
        for each node S ∈ G_S do
            Get SN with S' ∈ SN and hop2sim(S, S') ≥ th;
        end for
        Combine SN and get non-overlapping groups allgroup;
        for each group ∈ allgroup do
            ModifyEdge(group, G_S, G_A);
            Merge nodes of G_S with same link pattern;
            Compute cc using Eq. (3) with G_S and G_A;
            Record bestG_S, bestG_A and mincc if cc < mincc;
        end for
        if allgroup == ∅ then
            th = th - ε;
        else
            th = 1.0;
        end if
    end while
    return bestG_S, bestG_A;

The inputs are a bipartite graph G = (V1, V2, E), as stated before, and the step size ε. The output is the summarization of the graph G, represented by the summary graph G_S = (S1, S2, E_S) and the additional graph G_A = (V1, V2, E_A). This summarization has the minimum coding cost subject to the proposed coding scheme.

The algorithm first initializes G_S as G and G_A as empty, and sets them as the best solution (as long as no better one is found, it is the best). It then computes the minimum coding cost mincc and the hop2sim for each (super) node S ∈ G_S. Then it searches iteratively for groups with similarities above a certain threshold th. For that it collects, for each node S, all hop-2-neighbours that have a similarity above the threshold and saves them in the set SN. After doing that for all S ∈ G_S, it combines overlapping sets. The result is a set of non-overlapping groups. For each of these groups, edges possibly have to be added or removed with the ModifyEdge method. At the end of the ModifyEdge method the hop2sim of all not common neighbours has to be recomputed, as their edges and therefore their similarities might have changed.
After the nodes of the group have changed their link patterns to exactly the same, they can be merged into one super node. If the coding cost of this augmented graph is lower than that of the currently best summarization, it gets recorded as the best. The threshold starts at 1.0, and if there is no group for this threshold it gets iteratively reduced by ε. This makes sure that the nodes with more similarity get merged first. After a group has been found for a threshold th, the threshold gets set back to 1.0, to merge the following groups again similarity-wise. Once the threshold reaches zero, the algorithm stops. The number of necessary iterations depends on the reduction step ε. For a large ε more groups of nodes get merged per iteration step than for a smaller ε, and therefore the threshold reaches zero faster. But on the other hand, a larger ε might lead to a less exact result. According to the authors, the best value for ε lies in [0.01, 0.1].
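The "Combine SN" step - turning all pairs above the threshold into non-overlapping groups - can be sketched with a simple union-find. combine_groups is a hypothetical name of my own; the similarities below are the ones from figure 8, after D and E were merged into S12:

```python
def combine_groups(nodes, sims, th):
    """Union all pairs with similarity >= th and return the resulting
    non-overlapping groups with more than one member (a union-find
    sketch of the 'Combine SN' step of SCMiner)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (u, v), s in sims.items():
        if s >= th:
            parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return [g for g in groups.values() if len(g) > 1]

# Similarities from figure 8.
sims = {('A', 'B'): 1/2, ('A', 'C'): 1/3, ('B', 'C'): 2/3, ('C', 'S12'): 1/4,
        ('P', 'Td'): 2/3, ('P', 'S'): 1/5, ('Td', 'S'): 1/4, ('S', 'T'): 2/3}
nodes = ['A', 'B', 'C', 'S12', 'P', 'Td', 'S', 'T']
```

At th = 0.5 this yields the three groups {A, B, C}, {P, Td} and {S, T}; at th = 1.0 it yields no group at all, so the threshold has to be reduced.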
Example Execution of the Algorithm SCMiner

Let's take the example graph of figure 0 as input graph and set ε = 0.5. Then the model and its representation look like this after the initialization:

Iteration #0, th = 1.0, mincc = 39.6

    A_GS =
           P  Td S  T
       A   1  0  0  0
       B   1  1  0  0
       C   1  1  1  0
       D   0  0  1  1
       E   0  0  1  1

    A_GA = the 5 x 4 zero matrix

    list: A:A, B:B, C:C, D:D, E:E, P:P, Td:Td, S:S, T:T

    sim(A, B) = 1/2     sim(P, Td) = 2/3
    sim(A, C) = 1/3     sim(P, S)  = 1/5
    sim(B, C) = 2/3     sim(Td, S) = 1/4
    sim(C, D) = 1/4     sim(S, T)  = 2/3
    sim(C, E) = 1/4
    sim(D, E) = 1

Iteration #1, th = 1.0, mincc = 39.6

As the threshold is 1.0, the only nodes whose similarity satisfies it are D and E. For merging them into the super node S12 we neither need to modify any edges nor recalculate any similarities, but just substitute the names. The augmented model looks as follows:

    A_GS =
            P  Td S  T
       A    1  0  0  0
       B    1  1  0  0
       C    1  1  1  0
       S12  0  0  1  1

    A_GA = the 5 x 4 zero matrix

    list: A:A, B:B, C:C, D:S12, E:S12, P:P, Td:Td, S:S, T:T

    sim(A, B) = 1/2     sim(P, Td) = 2/3
    sim(A, C) = 1/3     sim(P, S)  = 1/5
    sim(B, C) = 2/3     sim(Td, S) = 1/4
    sim(C, S12) = 1/4   sim(S, T)  = 2/3

The cost for this is 33.6, so we record G_S, G_A and mincc. As we found a group in this iteration, the threshold gets set to 1.0 (where it already is).

Iteration #2, th = 1.0, mincc = 33.6

As there are no nodes with similarity one, allgroup is the empty set and we have nothing to merge. At the end of the iteration we reduce the threshold by the reduction step ε.
Iteration #3, th = 0.5, mincc = 33.6

For the threshold 0.5 we find some node pairs that have a greater or equal similarity: (A, B), (B, C), (P, Td) and (S, T). Combining them gives us the three groups {A, B, C}, {P, Td} and {S, T}, as illustrated in figure 9.

Figure 9: G_S with marked pairs (left) and G_S with the previous pairs combined into non-overlapping groups (right)

Iteration #3, group = {A, B, C}, mincc = 33.6

We start with the group {A, B, C}. As A, B and C don't have the same link pattern, we call the routine ModifyEdge({A, B, C}, G_S, G_A) and change the edges of the group as in the section Edge Modification. After this we need to update the hop2sim's, and then we can merge the nodes into the super node S11. The cost cc = 30.2 is smaller than mincc, therefore we record G_S, G_A and mincc again. Because there are still other groups, we don't change the threshold yet.

    A_GS =
            P  Td S  T
       S11  1  1  0  0
       S12  0  0  1  1

    A_GA =
           P  Td S  T
       A   0  1  0  0
       B   0  0  0  0
       C   0  0  1  0
       D   0  0  0  0
       E   0  0  0  0

    list: A:S11, B:S11, C:S11, D:S12, E:S12, P:P, Td:Td, S:S, T:T

    sim(S11, S12) = 0/4 = 0     sim(P, Td) = 3/3 = 1
    sim(P, S)  = 0/5 = 0        sim(Td, S) = 0/5 = 0
    sim(S, T)  = 2/2 = 1
Iteration #3, group = {P, Td}, mincc = 30.2

Next we look at the group {P, Td}. Because of the edges we changed while processing the previous group, these two nodes now have the same link pattern and can be directly merged into the super node S21. Therefore we only need to relabel and record G_S, G_A and mincc again, as the cost is cc = 26.2.

    A_GS =
            S21 S  T
       S11   1  0  0
       S12   0  1  1

    A_GA =
           P  Td S  T
       A   0  1  0  0
       B   0  0  0  0
       C   0  0  1  0
       D   0  0  0  0
       E   0  0  0  0

    list: A:S11, B:S11, C:S11, D:S12, E:S12, P:S21, Td:S21, S:S, T:T

    sim(S11, S12) = 0     sim(S21, S) = 0     sim(S, T) = 2/2 = 1

Iteration #3, group = {S, T}, mincc = 26.2

The group {S, T} can be processed analogously to the previous group: same link pattern, so we merge them directly into the super node S22, then relabel and record G_S, G_A and mincc, because cc = 22.2.

    A_GS =
            S21 S22
       S11   1   0
       S12   0   1

    A_GA =
           P  Td S  T
       A   0  1  0  0
       B   0  0  0  0
       C   0  0  1  0
       D   0  0  0  0
       E   0  0  0  0

    list: A:S11, B:S11, C:S11, D:S12, E:S12, P:S21, Td:S21, S:S22, T:S22

    sim(S11, S12) = 0     sim(S21, S22) = 0

After this, we set the threshold to 1.0 because we found a group in this iteration.

Iteration #4, th = 1.0, mincc = 22.2

As all similarities between the super nodes are zero, allgroup is the empty set, and we reduce the threshold.

Iteration #5, th = 0.5, mincc = 22.2

allgroup is still the empty set, and we reduce the threshold again.
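The three coding costs recorded during iteration #3 can be verified with a short, self-contained sketch of equations (1)-(3). The helper names are my own; each state is described by its summary-matrix shape, its number of 1-entries and the super-node sizes, and G_A always stays a 5 x 4 matrix with the 2 changed edges:

```python
import math

def entropy(ones, total):
    """Binary entropy of a 0/1 matrix with the given number of 1-entries."""
    if ones in (0, total):
        return 0.0
    p1 = ones / total
    return (1 - p1) * math.log2(1 / (1 - p1)) + p1 * math.log2(1 / p1)

def cc(ones_gs, rows_gs, cols_gs, ones_ga, sizes1, sizes2):
    """CC(G) = CC(G_S) + CC(G_A) + CC(list) for the running example,
    where |V1| = 5 users and |V2| = 4 movies (so A_GA is always 5x4)."""
    cc_gs = rows_gs * cols_gs * entropy(ones_gs, rows_gs * cols_gs)
    cc_ga = 5 * 4 * entropy(ones_ga, 20)
    cc_list = (sum(s * math.log2(5 / s) for s in sizes1)
               + sum(s * math.log2(4 / s) for s in sizes2))
    return cc_gs + cc_ga + cc_list

# After merging {A, B, C}: A_GS is 2x4 with 4 ones, A_GA has 2 ones.
cc_abc = cc(4, 2, 4, 2, [3, 2], [1, 1, 1, 1])
# After also merging {P, Td}: A_GS shrinks to 2x3 with 3 ones.
cc_ptd = cc(3, 2, 3, 2, [3, 2], [2, 1, 1])
# After also merging {S, T}: A_GS shrinks to 2x2 with 2 ones.
cc_st = cc(2, 2, 2, 2, [3, 2], [2, 2])
```

Rounded to one decimal place this gives 30.2, 26.2 and 22.2, the values recorded above.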
Iteration #6, th = 0, mincc = 22.2

The threshold is zero, so we don't enter the while loop, but return the recorded bestG_S and bestG_A.

Analysis

With N nodes and an average degree d_av of each vertex, the runtime complexity to compute the hop-2 similarity of each node is O(N · d_av^3): each of the N nodes has on average d_av neighbours of the other type, each of which has on average d_av neighbours of the own type. For each of these hop-2-neighbours, the information about the common and not common neighbours has to be accessed. That is a minimum of d_av entries (if all neighbours are the same) and a maximum of 2 · d_av - 1 on average, and therefore needs time O(d_av). So:

    N (nodes) · d_av (neighbour nodes of the other type) · d_av (neighbour nodes of the own type) · O(d_av) (time for computing one similarity) = O(N · d_av^3)

During the ModifyEdge method in SCMiner, not all similarities change and need to be recomputed, but only the ones of the nodes from which edges got deleted or to which edges got added. These are on average d_av, so the N above can be replaced by d_av, making the complexity of one merging step roughly O(d_av^4). The number of merging steps is affected by ε but is O(N) on average, making the whole runtime complexity O(N · d_av^4).

As the runtime of the algorithm depends heavily on the (re)computation of the similarities, the best case for the above algorithm is when there is no noise in the input graph. That doesn't mean that the input data is faulty; noise here means edges that are in the input graph but need to be deleted or added to get the output graph. If there are no unnecessary edges, no edges have to be modified and therefore no similarities have to be recomputed. Additionally, the similarities would all be either one or zero, and all merges that need to be done would be done in the first iteration. The worst case, on the opposite end, would be if each group to merge consisted of the minimal two (super) nodes.
This is the case if there are a lot of non-uniformly distributed noisy edges, which leads to merging the small groups of more similar nodes first and then gradually building the large super nodes of the output from these. This needs a lot of merging steps and therefore an awful lot of similarity computations. (Naturally this depends on the number of nodes that end up in one super node in the output. If there are two nodes per super node, this is not as bad as having a single super node containing all nodes in the output.)

Real World Examples

One type of example for the usage of the algorithm mentioned in the paper are websites - to rate movies or jokes, or to join newsgroups - from which the providers want to collect data. Two other examples also mentioned are the data set WorldCities, which consists of the distribution of global service firms in the top world cities, and the reactions of proteins to drugs.
Sources

[1] Jing Feng, Xiao He, Bettina Konte, Christian Böhm, Claudia Plant. Summarization-based Mining Bipartite Graphs. In KDD, pages 1249-1257, 2012.
[2] Hamed Hassani. Information Theory (lecture), ETH Zürich, spring semester 2015.