Report on the paper "Summarization-based Mining Bipartite Graphs"
Annika Glauser, ETH Zürich, Spring 2015

Extract from the paper [1]:

Introduction

The paper Summarization-based Mining Bipartite Graphs introduces a new algorithm, SCMiner (Summarization-Compression Miner), which combines graph summarization, graph clustering, link prediction and the discovery of hidden structure. The objective of graph summarization is to produce a compressed representation of the input graph. The aim of graph clustering is to group similar nodes of the graph together in clusters. Link prediction tries to predict missing or possible future links of the graph, and by discovering the hidden structure of a graph we want to say something about the structure of the underlying data.

So the algorithm converts a large bipartite graph into a highly compact smaller graph, which should give an idea of the structure of the data, abstract the original graph, cluster its nodes and predict missing or future links. This makes the algorithm a useful tool for data mining.

For an illustrative example, look at the graphs below. The left graph is an input graph that corresponds to (a very small part of) data from an online movie rating site. The users are denoted by squares while the movies are represented by circles. If a user liked a movie, he or she is connected to it by an edge. In the graph on the right, the nodes that could be in the same cluster are circled together. And as the other users that liked the movie Pitch Perfect also liked The Devil Wears Prada, it might be predicted that Anna likes it too (denoted by the bold edge).

Figure 0: A bipartite input graph
Figure 1: Graph with possible clusters and predicted edges

The hidden structure of the data might then look like the graph in figure 2. The problem is that the clusters and predicted edges in figure 1 and the structure in figure 2 are just possible solutions, because a bipartite graph can be represented by a lot of different - not necessarily good - summarizations. As it can be proven that finding the globally optimal summarization is NP-hard, the algorithm SCMiner follows a heuristic approach and searches for a local optimum.

Figure 2: A possible hidden structure

Model

To be more formal than in the previous section: the input of the algorithm is a large bipartite graph G = (V_1, V_2, E), where V_1 and V_2 are the node sets of type 1 and type 2 respectively, and E is the set of edges between them. The first part of the output is a summary graph G_S = (S_1, S_2, E') that contains two sets S_1 and S_2 of clusters of nodes of the corresponding type - called super nodes - and a set E' of edges between the super nodes. The second part of the output is an additional graph G_A = (V_1, V_2, E'') that contains the original node sets V_1 and V_2 and a set E'' of added or deleted edges between them that would be needed to restore the original graph G. The edges in E'' marked with a (+) sign have to be added to G_S in order to obtain G from it, and vice versa for the edges marked with a (-) sign.

To go with the previous example: the original graph is denoted by G = (V_1, V_2, E) where V_1 = {A, B, C, D, E} (the users) and V_2 = {P, Td, S, T} (the movies). The summary graph G_S = (S_1, S_2, E') consists of the four super nodes S_11 = {A, B, C}, S_12 = {D, E}, S_21 = {P, Td} and S_22 = {S, T}. The additional graph G_A consists of the deleted edge (C, S) (marked with a (+) sign) and the added edge (A, Td) (marked with a (-) sign).

Figure 3: An example for the model

In fact, the minus and plus signs in G_A can be omitted, as they can be derived by comparing G_A with G_S: if an edge of G_A is in G_S, the edge was added by the algorithm and is not part of the original graph G. Vice versa, an edge in G_A that does not appear in G_S means that the original edge was deleted.
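To make the model concrete, the following is a minimal Python sketch (my own choice of representation, not prescribed by the paper) of the input graph and the two output graphs of the running example. The node and super node names follow the example above; the dictionary/set encoding is an assumption for illustration.

    # Input bipartite graph G = (V1, V2, E)
    V1 = {"A", "B", "C", "D", "E"}          # users (type 1)
    V2 = {"P", "Td", "S", "T"}              # movies (type 2)
    E = {("A", "P"), ("B", "P"), ("B", "Td"), ("C", "P"), ("C", "Td"),
         ("C", "S"), ("D", "S"), ("D", "T"), ("E", "S"), ("E", "T")}

    # Summary graph G_S = (S1, S2, E'): super nodes and the edges between them
    S1 = {"S11": {"A", "B", "C"}, "S12": {"D", "E"}}
    S2 = {"S21": {"P", "Td"}, "S22": {"S", "T"}}
    E_S = {("S11", "S21"), ("S12", "S22")}

    # Additional graph G_A = (V1, V2, E''): corrections needed to restore G from G_S
    # (+): edge must be added back, (-): edge must be removed
    E_A = {("C", "S"): "+", ("A", "Td"): "-"}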

Data Compression

As already mentioned, there are a lot of different summarizations for one bipartite graph. Naturally, for our example graph we could just look at the different summarizations and choose the best one. But as the input is normally a lot bigger than the graph in figure 0, the algorithm tries to improve the summarization step by step. But how do we measure the goodness of a summarization?

The Minimum Description Length (MDL) principle states that the more we can compress the data (the graph), the more we learn about its underlying structure. Therefore the goodness of a summarization is measurable by the shortness of its description length. Inspired by this principle, the authors propose the following coding scheme: they measure the coding cost CC(H) of a graph H = (V_1, V_2, E) by the lower bound of the coding cost of its compressed adjacency matrix A ∈ {0,1}^(|V_1| x |V_2|) with a_ij = 1 if (V_1i, V_2j) ∈ E, which is (footnote 1):

    CC(H) = |V_1| * |V_2| * H(A)                                            (1)

where H(A) = p_0(A) * log2(1/p_0(A)) + p_1(A) * log2(1/p_1(A)) is the entropy of A, and p_0 and p_1 are the probabilities of finding 0 and 1 entries in the adjacency matrix A of H.

The additional graph G_A from the previously introduced model can be represented by a simple adjacency matrix A_GA ∈ {0,1}^(|V_1| x |V_2|); for G_S, however, we need in addition to the adjacency matrix A_GS ∈ {0,1}^(|S_1| x |S_2|) a list of the nodes and their corresponding super nodes. The coding cost of this list is

    CC(list) = sum_{i=1}^{2} sum_{j=1}^{N_i} |S_ij| * log2(|V_i| / |S_ij|)   (2)

where N_i is the number of super nodes of type i, |S_ij| the number of nodes in super node S_ij and |V_i| the number of nodes of type i. With this, the coding cost of a summarization G in the previously introduced model is (footnote 2):

    CC(G) = CC(G_S) + CC(G_A) + CC(list)                                     (3)

The goal of the algorithm is to find a summarization that minimizes (3), because according to the MDL principle the solution should be optimal when the description length is minimal.

About (2): The information content of an event E can be measured by the function I(E) = I(p(E)) = log2(1/p(E)), where p(E) is the probability of E. The unit of measure is bits, so I(E) tells us how many bits we need to encode the event E. In (2) this event is: a node v ∈ V_i belongs to S_ij. The probability of this event (when picking v at random) is |S_ij| / |V_i|, and therefore the information content is I(v ∈ S_ij) = log2(|V_i| / |S_ij|). To encode the list, we omit the names of the nodes and just concatenate the codes that correspond to the super nodes to which the nodes belong (the order of the nodes is given by the order of the nodes in the adjacency matrix of G_A). As there are |S_ij| nodes in S_ij, there are |S_ij| strings of log2(|V_i| / |S_ij|) bits in the coding of the list. Summing this over all super nodes S_ij in the summarization gives the number of bits required to encode the whole list.

About (1): The entropy H(A) is the average information content of an entry of the adjacency matrix A, and therefore (1) gives the information content of the whole adjacency matrix.

Footnote 1: The corresponding formula in the paper has a minus sign, which is a typo - as I verified with the authors.
Footnote 2: The paper is rather inexact in this equation. In the model, the list is integrated in G_S, therefore the equation should be CC(G) = CC(G_S) + CC(G_A) with CC(G_S) = CC(A_GS) + CC(list).
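The coding scheme is small enough to express directly in code. The following Python sketch (my own illustration, not from the paper; all function names are hypothetical) computes the costs (1), (2) and (3) from 0/1 adjacency matrices given as lists of rows and from the super node sizes.

    import math

    def entropy(matrix):
        # Binary entropy H(A) of a 0/1 adjacency matrix (list of rows).
        entries = [x for row in matrix for x in row]
        p1 = sum(entries) / len(entries)
        p0 = 1.0 - p1
        h = 0.0
        for p in (p0, p1):
            if p > 0:                       # the term for p = 0 is treated as 0
                h += p * math.log2(1.0 / p)
        return h

    def cc_matrix(matrix):
        # Coding cost (1): (number of rows) * (number of columns) * H(A).
        return len(matrix) * len(matrix[0]) * entropy(matrix)

    def cc_list(super_node_sizes, n_nodes):
        # Coding cost (2) of the node-to-super-node list for one node type:
        # super_node_sizes are the |S_ij| of that type, n_nodes is |V_i|.
        return sum(s * math.log2(n_nodes / s) for s in super_node_sizes)

    def cc_total(a_gs, a_ga, sizes_type1, sizes_type2, n1, n2):
        # Total coding cost (3): CC(G_S) + CC(G_A) + CC(list).
        return (cc_matrix(a_gs) + cc_matrix(a_ga)
                + cc_list(sizes_type1, n1) + cc_list(sizes_type2, n2))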

This was all very theoretical, so let's look at our previous example graph from figure 0, denoted by G. To make it comparable to other summarizations, we need to represent G by G_S, G_A and a list. As we haven't changed anything yet, A_GS (the adjacency matrix of G_S) lies in {0,1}^(|V_1| x |V_2|) and A_GA ∈ {0,1}^(|V_1| x |V_2|) is the zero matrix. No nodes have been grouped yet, so each node is in its own super node:

A_GS (rows A..E, columns P, Td, S, T):
    A: 1 0 0 0
    B: 1 1 0 0
    C: 1 1 1 0
    D: 0 0 1 1
    E: 0 0 1 1

A_GA (rows A..E, columns P, Td, S, T): the zero matrix

list: A:A, B:B, C:C, D:D, E:E, P:P, Td:Td, S:S, T:T

For the coding costs we have:

    CC(G_S)  = |S_1| * |S_2| * H(A_GS) = 5 * 4 * (1/2 * log2(2) + 1/2 * log2(2)) = 5 * 4 * 1 = 20
    CC(G_A)  = |V_1| * |V_2| * H(A_GA) = 5 * 4 * 0 = 0        (A_GA contains only zeros, so its entropy is 0)
    CC(list) = 5 * 1 * log2(5) + 4 * 1 * log2(4) = 19.6

Therefore CC(G) = CC(G_S) + CC(G_A) + CC(list) = 20 + 0 + 19.6 = 39.6.

Now let's say the output of the algorithm is the pair of graphs on the right side of figure 3. Then we have:

A_GS (rows S_11, S_12, columns S_21, S_22):
    S_11: 1 0
    S_12: 0 1

A_GA (rows A..E, columns P, Td, S, T):
    A: 0 1 0 0
    B: 0 0 0 0
    C: 0 0 1 0
    D: 0 0 0 0
    E: 0 0 0 0

list: A:S_11, B:S_11, C:S_11, D:S_12, E:S_12, P:S_21, Td:S_21, S:S_22, T:S_22

Then the coding cost changes to:

    CC(G_S)  = |S_1| * |S_2| * H(A_GS) = 2 * 2 * (1/2 * log2(2) + 1/2 * log2(2)) = 2 * 2 * 1 = 4
    CC(G_A)  = |V_1| * |V_2| * H(A_GA) = 5 * 4 * (1/10 * log2(10) + 9/10 * log2(10/9)) = 9.38
    CC(list) = 3 * log2(5/3) + 2 * log2(5/2) + 2 * 2 * log2(4/2) = 8.85

And CC(G) = CC(G_S) + CC(G_A) + CC(list) = 4 + 9.38 + 8.85 = 22.23.

This tells us that the summarization in figure 3 is a better summarization of the input graph G than G itself, which should have already been clear.
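As a sanity check, these two numbers can be reproduced with the cost functions from the sketch above (again my own illustration, assuming those functions are defined in the same script):

    # Trivial summarization: every node is its own super node.
    a_gs0 = [[1,0,0,0],[1,1,0,0],[1,1,1,0],[0,0,1,1],[0,0,1,1]]
    a_ga0 = [[0]*4 for _ in range(5)]
    print(cc_total(a_gs0, a_ga0, [1]*5, [1]*4, 5, 4))    # about 39.6

    # Summarization from figure 3: S_11={A,B,C}, S_12={D,E}, S_21={P,Td}, S_22={S,T}.
    a_gs1 = [[1, 0], [0, 1]]
    a_ga1 = [[0,1,0,0],[0,0,0,0],[0,0,1,0],[0,0,0,0],[0,0,0,0]]
    print(cc_total(a_gs1, a_ga1, [3, 2], [2, 2], 5, 4))   # about 22.2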

Edge Modification

Imagine we have a group of nodes that we would like to merge. For this, all nodes in the group need to have exactly the same link pattern. If this is not the case, we need to change their patterns to match each other.

Let's look at the graph in figure 0 again and assume we want to merge A, B and C: they have the common neighbour P, which means that we don't have to change any link patterns towards this node. But Td is only connected to B and C, and S has only a connection to C. Now, for each non-common neighbour of the group, there is the question: do we want to make the node into a common neighbour of the group - and therefore connect it to all nodes in the group to which it is not already connected? Or do we want to cut all ties between the node and the group - and therefore delete all edges between them? The answer depends on the cost of the operation, which is the number of edges that need to be added or deleted:

    Cost_remove = sum_{i=1}^{p} ( |S_1i| * |S_2k|  if S_1i links to S_2k, 0 otherwise )
    Cost_add    = sum_{i=1}^{p} ( 0  if S_1i links to S_2k, |S_1i| * |S_2k| otherwise )

where S_2k is the non-common neighbour in question and S_11, ..., S_1p are the super nodes of the group that should be merged. In figure 4 we would either need to add the edge (A, Td) for Td or delete the edges (B, Td) and (C, Td). So Cost_remove(Td) = 2 > 1 = Cost_add(Td), and we add the edge (A, Td), as this is cheaper. For S it is the other way round, and deleting (C, S) is cheaper than adding edges from S to A and B. The result is shown in figure 5.

Figure 4: Group of nodes to merge and its neighbours
Figure 5: Group with changed edges to its neighbours

A special case occurs if the adding and the removing cost are the same. Then it would be necessary to look further into the properties of the node to decide whether it should be added to the common neighbours or not.

The routine ModifyEdge(group, G_S, G_A) takes as input such a group of nodes that should be merged, computes their non-common neighbours and then removes or adds edges between each of the non-common neighbours and the group according to the above cost function. This is done by changing entries in A_GS and adding a 1 entry for each changed edge to A_GA.
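A possible implementation of this decision rule, written against the set-based representation from the earlier sketch, could look as follows. ModifyEdge is the routine named in the paper, but this concrete code is only my own sketch: adj and size are assumed dictionaries, and ties between the two costs are broken towards removal for simplicity (the paper inspects the node further in that case).

    def modify_edges(group, adj, size):
        # Make all super nodes in `group` share the same link pattern (sketch of ModifyEdge).
        # group: list of type-1 super node names to be merged
        # adj:   dict mapping every super node to the set of its neighbouring super nodes
        # size:  dict mapping every super node to the number of original nodes it contains
        # Returns the set of changed edges, so the caller can record them in G_A.
        all_neigh = set().union(*(adj[s] for s in group))
        common = set.intersection(*(adj[s] for s in group))
        changed = set()
        for n in all_neigh - common:
            linked = [s for s in group if n in adj[s]]
            not_linked = [s for s in group if n not in adj[s]]
            cost_remove = sum(size[s] * size[n] for s in linked)
            cost_add = sum(size[s] * size[n] for s in not_linked)
            if cost_add < cost_remove:
                # Cheaper to make n a common neighbour: add the missing edges.
                for s in not_linked:
                    adj[s].add(n); adj[n].add(s); changed.add((s, n))
            else:
                # Cheaper to cut all ties between n and the group: delete the existing edges.
                for s in linked:
                    adj[s].discard(n); adj[n].discard(s); changed.add((s, n))
        return changed

For the group {A, B, C} of figure 4 with all super node sizes equal to 1, this rule adds the edge (A, Td) and removes the edge (C, S), matching figure 5.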

The Algorithm

So far we know how to model a summarization, how to calculate its cost and - given a group - how to change the edges and merge the group (the merging is a simple relabeling and shrinking of A_GS and a renaming of some entries in the list). What is still missing is how to find these groups.

Let's look again at the example of the online movie rating site: the aim is to group similar users and similar movies together. Two users are similar if they like the same movies, so their similarity could be defined as the number of movies they both like. But as some users might have liked only 5 movies, two of them that have 5 movies in common are very similar, whereas for users that have liked about 100 movies, 5 movies in common are not that much. Therefore the similarity of two users is defined as the fraction of their liked movies that both of them liked. More formally:

    sim(S_1i, S_1j) = ( sum_{k=1}^{n} |S_2k| ) / ( sum_{k=1}^{n+m} |S_2k| )      (4)

where S_21, ..., S_2n denote the common neighbours of S_1i and S_1j, and S_2(n+1), ..., S_2(n+m) are the super nodes that are only connected to one of them (non-common neighbours). The definition for super nodes of type 2 is analogous.

The similarity ranges from 0 to 1. If S_1i and S_1j have exactly the same neighbours, m is equal to zero, which makes the numerator and the denominator of the above equation the same and the resulting similarity equal to one. On the other hand, if they don't share any neighbours, n is equal to zero and therefore the numerator is zero too, which makes the similarity equal to zero. As the similarity is only non-zero if the nodes have a common neighbour and therefore are hop-2-neighbours of each other - meaning reachable from each other in two hops - this similarity is called hop-2-similarity (hop2sim) in the paper.

For the example graph in figure 0 the similarities are as follows:

    sim(A, B) = 1/2     sim(P, Td) = 2/3
    sim(A, C) = 1/3     sim(P, S)  = 1/5
    sim(B, C) = 2/3     sim(Td, S) = 1/4
    sim(C, D) = 1/4     sim(S, T)  = 2/3
    sim(C, E) = 1/4     sim(D, E)  = 1

Figure 6: The input graph from figure 0

Now, to specify which nodes should be merged, the algorithm uses a threshold th for the similarities. It starts at 1.0, so in figure 6 we would merge D and E. After doing this, the similarities look as in figure 8. As there are no nodes with similarity 1.0 anymore, the threshold has to be reduced by a reduction step ɛ in order to get other groups that can be merged. Combining all the seen steps results in the algorithm below.

Figure 7: The hop2sim's of the original graph (the list above)
Figure 8: The hop2sim's after merging D and E into super node S_12:

    sim(A, B) = 1/2       sim(P, Td) = 2/3
    sim(A, C) = 1/3       sim(P, S)  = 1/5
    sim(B, C) = 2/3       sim(Td, S) = 1/4
    sim(C, S_12) = 1/4    sim(S, T)  = 2/3
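Before turning to the full algorithm, equation (4) can be expressed directly in code. The following sketch (my own, with the hypothetical function name hop2sim) reuses the adj/size representation from the ModifyEdge sketch above.

    def hop2sim(a, b, adj, size):
        # Hop-2-similarity (4) between two super nodes a and b of the same type.
        # adj maps a super node to the set of its neighbouring super nodes,
        # size maps a super node to the number of original nodes it contains.
        common = adj[a] & adj[b]
        all_neigh = adj[a] | adj[b]
        if not all_neigh:
            return 0.0
        return sum(size[n] for n in common) / sum(size[n] for n in all_neigh)

With adj built from figure 0 and all sizes equal to 1, hop2sim("A", "B", adj, size) evaluates to 0.5 and hop2sim("D", "E", adj, size) to 1.0, matching the table above.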

Algorithm SCMiner, extracted from [1], p. 1252

Input: Bipartite graph G = (V, E), reduce step ɛ
Output: Summary graph G_S, additional graph G_A

// Initialization
G_S = G, G_A = (V, ∅);
Compute mincc using Eq.(3) with G_S and G_A;
bestG_S = G_S, bestG_A = G_A;
Compute hop2sim for each S ∈ G_S using Eq.(4);
// Search for best summarization
while th > 0 do
    for each node S ∈ G_S do
        Get SN with S' ∈ SN and hop2sim(S, S') > th;
    end for
    Combine SN and get non-overlapped groups allgroup;
    for each group ∈ allgroup do
        ModifyEdge(group, G_S, G_A);
        Merge nodes of G_S with same link pattern;
        Compute cc using Eq.(3) with G_S and G_A;
        Record bestG_S, bestG_A, and mincc if cc < mincc;
    end for
    if allgroup == ∅ then
        th = th - ɛ;
    else
        th = 1.0;
    end if
end while
return bestG_S, bestG_A;

The inputs are a bipartite graph G = (V_1, V_2, E), as stated before, and the step size ɛ. The output is the summarization of the graph G, represented by the summary graph G_S = (S_1, S_2, E') and the additional graph G_A = (V_1, V_2, E''). This summarization has the minimum coding cost subject to the proposed coding scheme.

The algorithm first initializes G_S as G and G_A as empty, and sets this as the best solution (if no better one is found, it is the best). It then computes the minimum coding cost mincc and the hop2sim for each (super) node S ∈ G_S. Then it searches iteratively for groups with similarities above a certain threshold th. For that, it collects for each node S all hop-two-neighbours that have a similarity above the threshold and saves them in the set SN. After doing that for all S ∈ G_S, it merges these sets; the result is a set of non-overlapping groups. For each of these groups, edges possibly have to be added or removed with the ModifyEdge method. At the end of the ModifyEdge method, the hop2sim of all non-common neighbours has to be recomputed, as their edges and therefore their similarities might have changed. After the nodes of the group have been changed to exactly the same link pattern, they can be merged into one super node. If the coding cost of this updated graph is lower than that of the currently best summarization, it is recorded as the new best.

The threshold starts at 1.0, and if there is no group for this threshold it is iteratively reduced by ɛ. This makes sure that the nodes with higher similarity get merged first. After a group has been found for a threshold th, the threshold is set back to 1.0, so that the following groups are again merged in order of similarity. Once the threshold reaches zero, the algorithm stops.

The number of necessary iterations depends on the reduction step ɛ. For a large ɛ, more groups of nodes get merged per iteration step than for a smaller ɛ, and therefore the threshold reaches zero faster. On the other hand, a larger ɛ might lead to a less exact result. According to the authors, the best value for ɛ lies in [0.01, 0.1].
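One step of the pseudocode that is easy to get wrong is "Combine SN and get non-overlapped groups": the candidate sets collected for different nodes may overlap and have to be unioned into disjoint groups. A small sketch of one way to do this (my own interpretation, not code from the paper) is the following.

    def combine_groups(candidate_sets):
        # Union overlapping candidate sets into non-overlapping groups.
        # candidate_sets: iterable of sets of node names
        # (one per node: its hop-2-neighbours above the threshold).
        groups = []
        for cand in candidate_sets:
            cand = set(cand)
            if not cand:
                continue
            # Absorb every existing group that overlaps with the new candidate set.
            merged, rest = set(cand), []
            for g in groups:
                if g & merged:
                    merged |= g
                else:
                    rest.append(g)
            groups = rest + [merged]
        return groups

    # Example: the pairs found at threshold 0.5 in the example execution below
    print(combine_groups([{"A", "B"}, {"B", "C"}, {"P", "Td"}, {"S", "T"}]))
    # -> three disjoint groups: {A, B, C}, {P, Td}, {S, T}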

Example Execution of the Algorithm SCMiner

Let's take the example graph of figure 0 as input graph and set ɛ = 0.5. Then the model and its representation look like this after the initialization:

Iteration #0, th = 1.0, mincc = 39.6

A_GS (rows A..E, columns P, Td, S, T):
    A: 1 0 0 0
    B: 1 1 0 0
    C: 1 1 1 0
    D: 0 0 1 1
    E: 0 0 1 1

A_GA (rows A..E, columns P, Td, S, T): the zero matrix

list: A:A, B:B, C:C, D:D, E:E, P:P, Td:Td, S:S, T:T

    sim(A, B) = 1/2     sim(P, Td) = 2/3
    sim(A, C) = 1/3     sim(P, S)  = 1/5
    sim(B, C) = 2/3     sim(Td, S) = 1/4
    sim(C, D) = 1/4     sim(S, T)  = 2/3
    sim(C, E) = 1/4     sim(D, E)  = 1

Iteration #1, th = 1.0, mincc = 39.6

As the threshold is 1.0, the only nodes whose similarity satisfies it are D and E. For merging them into the super node S_12 we don't need to modify any edges nor recalculate any similarities, but just substitute the name. The updated model looks as follows:

A_GS (rows A, B, C, S_12, columns P, Td, S, T):
    A:    1 0 0 0
    B:    1 1 0 0
    C:    1 1 1 0
    S_12: 0 0 1 1

A_GA (rows A..E, columns P, Td, S, T): the zero matrix

list: A:A, B:B, C:C, D:S_12, E:S_12, P:P, Td:Td, S:S, T:T

    sim(A, B) = 1/2      sim(P, Td) = 2/3
    sim(A, C) = 1/3      sim(P, S)  = 1/5
    sim(B, C) = 2/3      sim(Td, S) = 1/4
    sim(C, S_12) = 1/4   sim(S, T)  = 2/3

The cost for this is 33.6, so we record G_S, G_A and mincc. As we found a group in this iteration, the threshold gets set to 1.0 (where it already is).

Iteration #2, th = 1.0, mincc = 33.6

As there are no nodes with similarity one anymore, allgroup is the empty set and we have nothing to merge. At the end of the iteration we reduce the threshold by the reduction step ɛ.

Iteration #3, th = 0.5, mincc = 33.6

For the threshold 0.5 we find some node pairs that have a greater or equal similarity: (A, B), (B, C), (P, Td) and (S, T). Combining them gives us the three groups {A, B, C}, {P, Td} and {S, T}, as illustrated in figure 9.

Figure 9: G_S with marked pairs (left) and G_S with the previous pairs combined into non-overlapping groups (right)

Iteration #3, group = {A, B, C}, mincc = 33.6

We start with the group {A, B, C}. As A, B and C don't have the same link pattern, we call the routine ModifyEdge({A, B, C}, G_S, G_A) and change the edges of the group as in the section Edge Modification. After this we need to update the hop2sim's, and then we can merge the nodes into the super node S_11. The cost cc = 30.2 is smaller than mincc, therefore we record G_S, G_A and mincc again. Because there are still other groups, we don't change the threshold yet.

A_GS (rows S_11, S_12, columns P, Td, S, T):
    S_11: 1 1 0 0
    S_12: 0 0 1 1

A_GA (rows A..E, columns P, Td, S, T):
    A: 0 1 0 0
    B: 0 0 0 0
    C: 0 0 1 0
    D: 0 0 0 0
    E: 0 0 0 0

list: A:S_11, B:S_11, C:S_11, D:S_12, E:S_12, P:P, Td:Td, S:S, T:T

    sim(S_11, S_12) = 0/4 = 0    sim(P, Td) = 3/3 = 1
    sim(P, S)  = 0/5 = 0         sim(Td, S) = 0/5 = 0
    sim(S, T)  = 2/2 = 1

Iteration #3, group = {P, Td}, mincc = 30.2

Next we look at the group {P, Td}. Because of the edges we changed while processing the previous group, these two nodes now have the same link pattern and can be merged directly into the super node S_21. Therefore we only need to relabel and record G_S, G_A and mincc again, as the cost is cc = 26.2.

A_GS (rows S_11, S_12, columns S_21, S, T):
    S_11: 1 0 0
    S_12: 0 1 1

A_GA (rows A..E, columns P, Td, S, T):
    A: 0 1 0 0
    B: 0 0 0 0
    C: 0 0 1 0
    D: 0 0 0 0
    E: 0 0 0 0

list: A:S_11, B:S_11, C:S_11, D:S_12, E:S_12, P:S_21, Td:S_21, S:S, T:T

    sim(S_11, S_12) = 0    sim(S_21, S) = 0    sim(S, T) = 2/2 = 1

Iteration #3, group = {S, T}, mincc = 26.2

Group {S, T} can be processed analogously to the previous group: same link pattern, so merge them directly into the super node S_22, then relabel and record G_S, G_A and mincc, because cc = 22.2.

A_GS (rows S_11, S_12, columns S_21, S_22):
    S_11: 1 0
    S_12: 0 1

A_GA (rows A..E, columns P, Td, S, T):
    A: 0 1 0 0
    B: 0 0 0 0
    C: 0 0 1 0
    D: 0 0 0 0
    E: 0 0 0 0

list: A:S_11, B:S_11, C:S_11, D:S_12, E:S_12, P:S_21, Td:S_21, S:S_22, T:S_22

    sim(S_11, S_12) = 0    sim(S_21, S_22) = 0

After this, we set the threshold to 1.0 because we found a group in this iteration.

Iteration #4, th = 1.0, mincc = 22.2

As all similarities between the super nodes are zero, allgroup is the empty set, and we reduce the threshold.

Iteration #5, th = 0.5, mincc = 22.2

allgroup is still the empty set, and we reduce the threshold again.

Iteration #6, th = 0, mincc = 22.2

The threshold is zero, so we don't enter the while loop, but return the recorded bestG_S and bestG_A.

Analysis

With N nodes and an average degree d_av of each vertex, the runtime complexity of computing the hop-2-similarity of each node is O(N * d_av^3): each of the N nodes has on average d_av neighbours of the other type, of which each has on average d_av neighbours. For each of these hop-2-neighbours the information about the common and non-common neighbours has to be accessed; that is a minimum of d_av (if all neighbours are the same) and a maximum of 2*d_av - 1 on average, and therefore needs time O(d_av). So:

    N (nodes) * d_av (neighbour nodes of the other type) * d_av (neighbour nodes of the own type) * O(d_av) (time for computing one similarity) = O(N * d_av^3)

During the ModifyEdge method in SCMiner, not all similarities change and need to be recomputed, but only those of the nodes whose edges got deleted or added. These are on average d_av nodes, so the N above can be replaced by d_av, making the cost of one such recomputation roughly O(d_av^4). The number of merging steps is affected by ɛ but is on the order of N on average, which makes the whole runtime complexity O(N * d_av^4).

As the runtime of the algorithm depends heavily on the (re)computation of the similarities, the best case for the above algorithm is when there is no noise in the input graph. This doesn't mean that the input data is faulty; noise is meant in the sense of edges that are in the input graph but need to be deleted or added to obtain the output graph. If there are no such edges, no edges have to be modified and therefore no similarities have to be recomputed. Additionally, the similarities would all be either one or zero, and all necessary merges would be done in the first iteration.

The worst case, on the opposite end, is when each group to merge consists of the minimal two (super) nodes. This is the case if there are a lot of non-uniformly distributed noisy edges, which leads to merging the small groups of more similar nodes first and then gradually building the large super nodes of the output from these. This needs a lot of merging steps and therefore an awful lot of similarity computations. (Naturally this depends on the number of nodes that end up in one super node in the output. If there are two nodes per super node, this is not as bad as having a single super node containing all nodes in the output.)

Real World Examples

One type of example for the usage of the algorithm mentioned in the paper are websites - to rate movies or jokes, or to join newsgroups - from which the providers want to collect data. Two other examples also mentioned are the data set WorldCities, which contains the distribution of global service firms over the top world cities, and the reactions of proteins to drugs.

Sources

[1] Jing Feng, Xiao He, Bettina Konte, Christian Böhm, Claudia Plant. Summarization-based Mining Bipartite Graphs. In KDD, pages 1249-1257, 2012.
[2] Information Theory (lecture), ETH Zürich, Hamed Hassani, spring semester 2015.