Iterative Graph Summarization based on Grouping

Sirui Li
Supervisor: Dr. Qing Wang
COMP4560: Advanced Computing Project
Australian National University
Semester 1, 2017
May 26, 2017

Acknowledgements

First and foremost, I would like to express my gratitude to my supervisor, Dr. Qing Wang, who gave me the opportunity to work on this project, in which I am genuinely interested. Without her assistance and guidance, the project and this report could not have been completed. Secondly, I would like to thank Dr. Weifa Liang, the course convener, for providing helpful assistance with academic writing and presentation. Most importantly, my thanks go to my family and friends; without their understanding and support, the project could not have been finalized.

Abstract

Analysing large datasets to discover useful insights is an important task in many application domains. To find this information, graphs are widely used to model the entities (as nodes) and their relationships (as edges) in the datasets. These graphs may contain millions of nodes and edges. Graph summarization is a critical technique for reducing the complexity of large graphs and understanding the information they contain. There are two major families of graph summarization methods: statistical methods and grouping-based methods. Statistical methods, such as degree distributions, are useful, but it is hard to control their resolution [9]. Grouping-based methods, also called aggregation-based methods, produce summaries of large graphs by grouping nodes based on attributes and relationships. Existing grouping-based methods, such as SNAP and k-snap, allow users to select the node attributes and relationships used in the summaries, but they require a high degree of user interaction: users need to know details about the nodes and edges of the summaries before the analysis. This project proposes an algorithm that summarizes graphs iteratively until a user-specified resolution is reached. The algorithm only requires users to specify some entity sets and the resolution; it then finds the intermediate nodes that appear during the search and groups these nodes.

Keywords: Graph Summarization, Statistical Methods, Grouping-based Methods, Aggregation-based Methods

Contents

Acknowledgements
Abstract
1 Introduction
  1.1 Objectives
  1.2 Contributions
  1.3 Outline
2 Related Work
  2.1 SNAP and k-snap Algorithms
  2.2 ESPRESSO Algorithm and REX Algorithm
3 Methodology
  3.1 Problem Definition
  3.2 Algorithm Design
    3.2.1 Path-based Algorithm
    3.2.2 Ripple Algorithm
4 Evaluation
  4.1 Experiment Design
  4.2 Experimental Results
    4.2.1 Path-based Algorithm
    4.2.2 Ripple Algorithm
5 Conclusion and Future Work
Appendices
  A Independent Study Contract
  B Project Description
  C Software Description
  D README

Chapter 1 Introduction

Modelling real-life entities and the relationships between them as graphs is a common way to extract the underlying information in large datasets. One typical application is social network analysis: mining social networks can help analysts understand how people interact with each other. However, extracting useful information directly from large graphs is difficult. Graph summarization is therefore needed to condense large graphs into small graphs that can be easily understood.

Graph summarization is the process of transforming large graphs into concise forms [9]. Generally, two kinds of methods are used in graph summarization: statistical methods and grouping-based methods. Statistical methods [9] include degree distributions, hop-plots and so on. The major downside of these methods is that users cannot control the resolution of the summaries. Grouping-based methods, also known as aggregation-based methods [9], produce summary graphs by grouping nodes based on selected attributes of nodes and edges.

Figure 1.1, which is taken from [9], presents an overall picture of the grouping-based graph summarization process. The graph on the left has 7445 vertices and 19,971 edges. Given some conditions, the summary graph shown on the right is produced. In the summary graph, each vertex is a set of vertices from the original graph, called a supernode or group; each edge represents the connection between two sets of vertices, called a superedge or group relationship. The summary graph on the right helps in understanding the relationships among different sets of vertices in the original graph.

Figure 1.1: A summary graph (right) is generated for the original graph (left), which is taken from [9]

Two operations are proposed for grouping-based methods: SNAP and k-snap [9]. These two operations produce a summary graph based on user-selected vertex attributes and relationships. Both of them require attribute homogeneity, that is, vertices in the same group must have the same value for each user-selected attribute [9]. The SNAP operation also requires relationship homogeneity, that is, if there exists a relationship between two groups, then every node in one group has to connect to at least one node in the other group [9]. Nonetheless, relationship homogeneity is hard to achieve in real-life applications because the data may contain too much noise. The k-snap operation therefore relaxes this requirement and allows users to control the size of the summary graph.

Even though the SNAP and k-snap operations can produce high-quality summary graphs, they still have drawbacks. In particular, they require a high degree of user interaction: users are expected to know details about the summary graphs before building them. This is impractical in real-life applications, as most users lack knowledge about which relationships are useful.

1.1 Objectives

The goal of my project is to propose an iterative algorithm for grouping-based methods based on (1) the entity sets that users are interested in and (2) a user-specified resolution. It can answer questions of the form "Have people at ANU collaborated with people at Peking University?" or "Have people at ANU collaborated with people at Peking University in 27th international ACM conference?". The specific objectives are:

1. Develop an iterative algorithm for grouping-based graph summarization.
2. Rank the summary graph.
3. Evaluate the effectiveness and efficiency of the algorithm.

Note that two algorithms (Path-based Algorithm and Ripple Algorithm) are introduced in this report. Path-based Algorithm can only be applied when there are exactly 2 user-specified entity sets. Ripple Algorithm improves on Path-based Algorithm so that users can specify 2 or more entity sets to study.

1.2 Contributions

My work addresses the limitations of previous work and provides the following novel features:

- Users can specify an arbitrary number of entity sets of interest.
- Users can specify the threshold of path lengths between the user-specified entity sets.
- Entities that appear on one or more paths between user-specified entity sets can be grouped in the summary graph. How these intermediate entities are grouped is discussed in Chapter 3.
- Users can control the resolution of the summary graph, i.e. they can specify how many supernodes the summary graph contains.

1.3 Outline

This project report is structured as follows: Chapter 2 presents related work on graph summarization and discusses its limitations. Chapter 3 presents the notions I define for describing the algorithms and explains how the proposed algorithms (Path-based Algorithm and Ripple Algorithm) work. Chapter 4 evaluates the performance of the two algorithms in terms of participant ratio and running time. Chapter 5 concludes this report and discusses future work.

Chapter 2 Related Work

Graph summarization is widely used in many application domains. Various graph summarization techniques have been proposed to help users understand the characteristics of large graphs [6], [10]. A recent survey by Yike Liu [5] gives a comprehensive overview of the state-of-the-art methods for graph summarization.

2.1 SNAP and k-snap Algorithms

Grouping-based methods are among the most popular techniques for graph summarization. These methods aggregate the vertices of an original graph into supernodes and connect them with superedges, producing a summary graph [5]. Several recent studies have investigated how to build a summary graph by applying grouping-based methods. Summarization by Grouping Nodes on Attributes and Pairwise Relationships (SNAP) and k-snap are two database-style algorithms for summarizing graphs [9]. They can only deal with categorical node attributes [10]. The SNAP algorithm produces a summary graph by grouping nodes based on user-selected node attributes and relationships. The k-snap algorithm extends SNAP: it also lets users customize the summary graph [10], and in addition lets them control the resolution of the summary graph through drill-down and roll-up operations.

Although the SNAP and k-snap algorithms can produce a summary graph based on user-selected node attributes and relationships, they still have some drawbacks. In many real-life cases, users do not know which relationships between their selected nodes are useful, nor whether other entities exist between their selected nodes. Users might therefore prefer to specify only the entity sets they are interested in, and expect an algorithm to return a summary graph that discovers the other entities lying between those entity sets.
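The attribute homogeneity requirement shared by SNAP and k-snap can be illustrated with a minimal Python sketch. It covers only the attribute side (vertices with identical values for every user-selected attribute fall into the same group); the further refinement by relationships that SNAP performs is not shown. The function name `group_by_attributes` and the toy records are hypothetical and not taken from [9].

```python
from collections import defaultdict

def group_by_attributes(nodes, selected):
    """Group vertex ids so that every group agrees on all selected attributes."""
    groups = defaultdict(set)
    for node_id, attrs in nodes.items():
        key = tuple(attrs[a] for a in selected)  # one key per value combination
        groups[key].add(node_id)
    return dict(groups)

# Hypothetical toy data: three authors with categorical attributes.
nodes = {
    1: {"affiliation": "ANU", "gender": "F"},
    2: {"affiliation": "ANU", "gender": "M"},
    3: {"affiliation": "PKU", "gender": "F"},
}
by_affiliation = group_by_attributes(nodes, ["affiliation"])
by_both = group_by_attributes(nodes, ["affiliation", "gender"])
```

Selecting more attributes can only refine the grouping, which is why the user has to know in advance which attributes are worth selecting.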

2.2 ESPRESSO Algorithm and REX Algorithm

Many previous works focus on extracting relationships between input entities from large graphs. For example, [7] proposed a connection subgraph discovery algorithm over RDF graphs. The notion of a connection subgraph, proposed by [2], is to extract a connected subgraph for a pair of query vertices, where the number of vertices in the subgraph can be controlled. [4] addressed the extraction of subgraphs based on a set of query vertices in entity-relationship graphs. [1] partitioned graphs with respect to the context of a vertex. The ESPRESSO algorithm proposed by [8] and the REX algorithm proposed by [3] build on these previous works and focus on producing a concise summary graph that explains the relationship between two sets of entities in a knowledge graph. The limitation of these two algorithms is that users can specify only two entity sets of interest, so explanations are produced only for the relationship between those two sets.

Chapter 3 Methodology

In this chapter, I present my algorithms for producing summary graphs from a given data graph. For clarity, I first define the notions of data graph, summary graph, path, path length, and the resolution of a summary graph. Then I introduce each algorithm in detail. Note that each algorithm introduced in this report can be used on both directed and undirected graphs. For simplicity, I only use undirected graphs as examples when presenting the algorithms.

3.1 Problem Definition

Figure 3.1: A summary graph G (b) is generated from a data graph D (a), which is taken from [9]

A data graph, denoted by $D = (V_D, E_D, T, A, \tau, \psi)$, is a graph with a set $V_D$ of entities as vertices, a set $E_D$ of relationships as edges, a set $A$ of attributes, a set $T$ of entity types, and two assignment functions $\tau : V_D \to T$ and $\psi : V_D \to A$ that assign to each vertex an entity type (from the set $T$ of possible types) and a set of attributes (from the set $A$ of possible attribute values). Figure 3.1(a) presents a data graph $D$. Each vertex in $D$ has an entity type $t \in T$ and is associated with a set of attributes in $A$.

A summary graph, denoted by $G = (V_G, E_G, T, A, \tau, \psi)$, is a graph with a set $V_G$ of entity sets as vertices (supernodes), a set $E_G$ of relationships as edges (superedges), and two assignment functions $\tau : V_G \to T$ and $\psi : V_G \to A$ that assign to each vertex an entity type (from the set $T$ of possible types) and a set of attribute values (from the set $A$ of possible attributes). Here, each supernode in $G$ is a subset of the vertices in $V_D$ and each superedge in $G$ is a subset of the edges in $E_D$. Figure 3.1(b) presents a summary graph $G$ with 4 supernodes (S1, S2, S3 and S4) and 4 superedges (S1-S2, S2-S3, S2-S4, S3-S4).

Following the definition of a path in Wikipedia, a path in a graph is "a finite or infinite sequence of edges which connect a sequence of vertices which, by most definitions, are all distinct from one another". I use $P$ to refer to a path in this report. The path length, denoted by $L$, is the number of edges occurring in a path. For example, the paths from S1 to S3 in Figure 3.1(b) and their lengths are:

path (P)                     length (L)
P_1 = <S1, S2, S3>           2
P_2 = <S1, S2, S4, S3>       3

The resolution of a graph is the size of the graph, i.e. the number of vertices it contains. In this report, I use $r$ to refer to the user-specified resolution. In Figure 3.1(b), the resolution is 4.

3.2 Algorithm Design

In this section, I introduce the two algorithms for graph summarization proposed in this report: Path-based Algorithm and Ripple Algorithm.

In Path-based Algorithm, users provide a pair of entity sets and a threshold of path lengths as input. Given this input, the algorithm finds the intermediate entities that appear on the related paths and groups them by entity type. After grouping, these entity sets are presented as supernodes in the summary graph $G$. If there is an edge between node A and node B in the data graph, then there must be a superedge between the supernodes that contain node A and node B, respectively.
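The behaviour described in the paragraph above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the names `find_paths` and `path_based_summary` and the toy edge list are hypothetical, a path is assumed to stop at the first vertex of the second entity set, and paths are assumed not to pass through other vertices of the first entity set.

```python
from collections import defaultdict

def find_paths(adj, sources, targets, max_len):
    """Enumerate simple paths with at most max_len edges from a source to a target."""
    paths = []

    def extend(path):
        last = path[-1]
        if last in targets:          # assumption: stop at the first target vertex
            paths.append(list(path))
            return
        if len(path) - 1 == max_len:
            return
        for nxt in sorted(adj[last]):
            if nxt not in path and nxt not in sources:
                extend(path + [nxt])

    for s in sorted(sources):
        extend([s])
    return paths

def path_based_summary(edges, node_type, v1, v2, max_len):
    adj = defaultdict(set)
    for u, v in edges:               # undirected data graph
        adj[u].add(v)
        adj[v].add(u)
    paths = find_paths(adj, v1, v2, max_len)
    group_of = {v: "S1" for v in v1}
    group_of.update({v: "S2" for v in v2})
    for path in paths:               # intermediate vertices are grouped by entity type
        for v in path:
            group_of.setdefault(v, node_type[v])
    supernodes = defaultdict(set)
    for v, g in group_of.items():
        supernodes[g].add(v)
    superedges = {tuple(sorted((group_of[a], group_of[b])))
                  for path in paths for a, b in zip(path, path[1:])}
    return dict(supernodes), superedges, paths

# Hypothetical toy graph: authors 1-4, papers 5-6, a proceeding 10 and a journal 11.
edges = [(1, 5), (2, 5), (2, 6), (3, 5), (4, 6), (5, 10), (6, 11)]
node_type = {1: "author", 2: "author", 3: "author", 4: "author",
             5: "paper", 6: "paper", 10: "proceeding", 11: "journal"}
supernodes, superedges, paths = path_based_summary(edges, node_type, {1, 2}, {3, 4}, 2)
```

With a threshold of 2, only the two papers lie on a qualifying path, so the proceeding and the journal do not appear in the summary.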

In Ripple Algorithm, users provide a number of entity sets of interest and a threshold of path lengths as input. Intermediate entities are found by Breadth-First Search (BFS). After these intermediate entities are grouped, an initial summary graph is returned. The algorithm further allows users to specify the resolution $r$ of the final summary graph $G$, and the initial summary graph is then split until its resolution meets the user requirement.

The major difference between the two algorithms is that Path-based Algorithm can only be used to find relationships between a pair of entity sets, whereas Ripple Algorithm can be used to find relationships among two or more entity sets. In principle, Ripple Algorithm places no limit on the number of entity sets given as input.

3.2.1 Path-based Algorithm

In this section I discuss the details of Path-based Algorithm. A high-level description is provided in Figure 3.2.

Input:
  - A data graph: D
  - A pair of entity sets (V_1, V_2)
  - Threshold of path lengths: L
Output:
  - A summary graph G

Path-based Algorithm
  1: S_1 = V_1, S_2 = V_2    // initialize supernode 1 (S_1) and supernode 2 (S_2)
  2: find all paths whose lengths are not greater than L between S_1 and S_2
  3: group vertices on related paths by entity types, each group is a supernode
  4: if there exists an edge between node A and node B do:
  5:     add a superedge between S_i and S_j which contain node A and node B, respectively
  6: return G

Figure 3.2: Path-based Algorithm

Let $S_i$ be a supernode in a summary graph $G$. There are 4 phases in Path-based Algorithm:

1. Initialization (line 1): find the vertices that meet the input requirements for the entity sets ($V_1$ and $V_2$), and save them in the corresponding supernodes ($S_1$ and $S_2$).
2. Find paths (line 2): find the paths between the two supernodes $S_1$ and $S_2$ whose length is not greater than $L$.

3. Group vertices (line 3): group the vertices on the paths found in the previous phase based on entity types.
4. Build superedges (lines 4 to 5): if there is an edge between node A and node B on the paths, add a superedge between the supernodes that contain node A and node B, respectively.

Finally, Path-based Algorithm returns a summary graph.

Figure 3.3: A data graph example (vertex types: author, paper, proceeding, journal; edge labels include writes and in_pr)

In the following, I illustrate how Path-based Algorithm works with an example.

Example 1

Assume Figure 3.3 is the input data graph $D$ and users ask the question: "Have people at ANU collaborated with people at Peking University?"

Phase 1: Initialization
Assume that people at ANU are grouped into $V_1$ and people at Peking University are grouped into $V_2$. That is, $V_1 = \{1, 2\}$ and $V_2 = \{3, 4\}$. Hence, $S_1 = V_1 = \{1, 2\}$ and $S_2 = V_2 = \{3, 4\}$.

Phase 2: Find paths
Suppose that the user-specified threshold is $L = 2$; the algorithm therefore finds all paths whose lengths are not greater than 2. The related paths are:

P_1: <1, 5, 3>

P_2: <2, 5, 3>
P_3: <2, 6, 4>

Phase 3: Group vertices
The vertices on the above 3 paths are grouped based on entity types. The grouping result is:

entity type    vertices
S_1            {1, 2}
S_2            {3, 4}
paper          {5, 6}

Phase 4: Build superedges
Take P_1 as an example: there is an edge between node 1 and node 5. Because node 1 is in S_1 and node 5 is in paper, there should be a superedge between S_1 and paper. Superedges are added in the same way for each edge on each path. After all superedges are built, the summary graph of this example is shown in Figure 3.4. The answer to the question above is therefore that people at ANU have co-written 2 papers (5, 6) with people at Peking University.

Figure 3.4: A summary graph G, in which S1 and S2 are each connected to S3 = {5, 6} by a writes superedge

However, if users specify $n$ supernodes ($n > 2$), the algorithm has to repeat phase 2 (Find paths) $n(n-1)/2$ times to find all paths between every pair of input supernodes, which is inefficient. I therefore developed Ripple Algorithm to address this limitation and to allow users to control the resolution.

3.2.2 Ripple Algorithm

Path-based Algorithm produces a summary graph based on a pair of input entity sets and the path length threshold. Unfortunately, studying only two entity sets is in many cases not sufficient for real-life datasets, as most real-life data graphs are complex. Users might want to study the relationships among many entity sets to see how they influence each other. Ripple Algorithm is introduced to overcome this limitation of Path-based Algorithm and further allows

users to specify the resolution $r$ of the final summary graph $G$. Ripple Algorithm iteratively splits supernodes until the resolution of the summary graph meets the user-specified resolution $r$. Before introducing Ripple Algorithm, I define two ratios: the splitting ratio and the participant ratio.

Splitting ratio

The splitting ratio helps Ripple Algorithm decide which group to split and how to split it. Given a summary graph $G$ with $n$ supernodes, a splitting ratio is calculated for each attribute. The attribute with the minimum splitting ratio is picked as the splitting attribute, and the supernode that has the splitting attribute is chosen as the splitting supernode. The splitting ratio $s$ is defined as follows:

$$ s = \frac{|M(G_i, A)|}{|G_i|} \qquad (3.1) $$

$$ M(G_i, A) = \{\, u.a \mid u \in G_i \,\} $$

where $u.a$ refers to the value of attribute $A$ of vertex $u$.

Participant ratio

To determine whether the relationship between two supernodes is strong, I define a ratio $p$ that reflects it statistically. Given the relationship $E$ between two groups $(G_i, G_j)$, the following equation gives the strength of this group relationship:

$$ p = \frac{|N(G_i, G_j)|}{|G_i|} \qquad (3.2) $$

where

$$ N(G_i, G_j) = \{\, u \in G_i \mid \exists v \in G_j,\ (u, v) \in E \,\} $$

The larger the participant ratio, the stronger the relationship.

A high-level description of Ripple Algorithm is provided in Figure 3.5.
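The two ratios defined above can be computed directly from their set definitions. The following Python sketch assumes that a group is a set of vertex ids, that attribute values are held in a dictionary, and that edges are stored as undirected pairs; the function names and the toy values are hypothetical.

```python
def splitting_ratio(group, attr_value):
    """s = |M(G_i, A)| / |G_i|: distinct values of one attribute over the group size."""
    distinct_values = {attr_value[u] for u in group}
    return len(distinct_values) / float(len(group))

def participant_ratio(g_i, g_j, edges):
    """p = |N(G_i, G_j)| / |G_i|: share of G_i with at least one neighbour in G_j."""
    participants = {u for u in g_i
                    if any((u, v) in edges or (v, u) in edges for v in g_j)}
    return len(participants) / float(len(g_i))

# Toy values: four papers and the journal each one appears in.
s = splitting_ratio({5, 6, 7, 8}, {5: "j1", 6: "j1", 7: "j2", 8: "j1"})

# Toy values: authors 1 and 2 wrote papers in {5, 6}; author 3 did not.
p = participant_ratio({1, 2, 3}, {5, 6}, {(1, 5), (2, 5), (2, 6)})
```

Here `s` is 0.5 (two distinct journals among four papers) and `p` is 2/3 (two of the three authors participate in the relationship). Note that `p` is not symmetric: swapping the two groups changes the denominator.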

Input:
  - A data graph: D
  - User-specified entity sets: (V_1, V_2, ..., V_n)
  - Threshold of path lengths: L
  - Resolution: r
Output:
  - A summary graph G

Ripple Algorithm
   1: S_1 = V_1, S_2 = V_2, ..., S_n = V_n    // initialize supernode 1 to supernode n
   2: step = 1, SearchVertices = ∅, G_i = ∅    // initialize step, SearchVertices and G_i
   3: for S_i in (S_1, S_2, ..., S_n) do:
   4:     G_i = {vertices in S_i}
   5:     SearchVertices = vertices in S_i
   6:     while step < L do:
   7:         SearchVertices_iter = ∅
   8:         for each supernode S in G_i do:
   9:             if S has vertices in SearchVertices do:
  10:                 apply BFS on such vertices to find neighbors
  11:                 if neighbor n has not been searched do:
  12:                     SearchVertices_iter.append(n)
  13:                 group neighbors
  14:                 add new groups to G_i based on the result on line 13
  15:         SearchVertices = SearchVertices_iter
  16:         merge groups in G_i who are overlapping
  17:         merge same-type groups who have the same neighbors and relationships
  18:         step += 1
  19: merge all G_i into one summary graph G
  20: merge groups who are overlapping
  21: merge same-type groups who have the same neighbors and relationships
  22: while size(G) < r do:
  23:     find splitting attribute and splitting group
  24:     split G
  25: return G

Figure 3.5: Ripple Algorithm

There are 4 phases in Ripple Algorithm:

1. Initialization (lines 1 to 2): find the vertices that meet the input requirements for the entity sets $\{V_1, \ldots, V_n\}$, and save them in the corresponding supernodes $\{S_1, \ldots, S_n\}$. SearchVertices is a set that holds the vertices that have been found but not yet searched; they will be searched in the next step. The summary graph obtained by searching from $S_i$ is stored in a set named $G_i$.
2. Iterative search for each input supernode $S_i$ (lines 3 to 18): for each input supernode $S_i$, perform an $L$-step BFS and group the neighboring vertices that are found. The grouping result is stored in $G_i$.
3. Merge each $G_i$ (lines 19 to 21): merge all $G_i$ into one summary graph $G$.

4. Resolution control (lines 22 to 25): users specify the resolution $r$. The splitting supernode and the splitting attribute are chosen based on the splitting ratio. The iterative group splitting continues until the resolution of the current summary graph is equal to or greater than $r$.

Because each input supernode is searched for $L$ steps, the length of a path found between any pair of input entity sets lies in the range $[1, 2L]$.

Figure 3.6: High-level overview of Ripple Algorithm with 2 input entity sets

In the following, I illustrate how Ripple Algorithm works with an example.

Example 2

Consider the data graph in Figure 3.3. Ripple Algorithm will answer the question: "Have people at ANU collaborated with people at Peking University in 27th international ACM conference?"

Phase 1: Initialization
$S_1 = V_1 = \{1, 2\}$ contains people at ANU, $S_2 = V_2 = \{3, 4\}$ contains people at Peking University and $S_3 = V_3 = \{10\}$ is the conference proceeding. Assume that users specify the threshold of path lengths as 2.

Phase 2: Iterative search for each input supernode $S_i$
For each input supernode $S_i$, perform a 2-step BFS and obtain the summary graph $G_i$.

Phase 3: Merge each $G_i$
After all $G_i$ are merged, the summary graph of this example is shown in Figure 3.7. The answer to the question is that papers in

$S_4$ are collected in some journals ($S_5$) and in the conference proceeding ($S_3$). These papers ($S_4$) are written by people at ANU and people at Peking University.

Figure 3.7: Summary graph G, with S4 = {5, 6, 7, 8} and S5 = {11}

Phase 4: Resolution control
Assume that users specify $r = 10$. The algorithm will yield a summary graph whose resolution is equal to or greater than 10.
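The search and merge phases of Ripple Algorithm can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the report's implementation: reached vertices are grouped by entity type only (not by direction and edge type), groups found from different input supernodes are merged by name, each input supernode is expanded for `steps` BFS levels, and the edge list is hypothetical because the exact edges of Figure 3.3 are not reproduced here. The function names are also hypothetical.

```python
from collections import defaultdict

def ripple_from(adj, label, seeds, steps):
    """Expand `steps` BFS levels from one input supernode; group reached vertices by label."""
    visited, frontier = set(seeds), set(seeds)
    groups = defaultdict(set)
    for _ in range(steps):
        next_frontier = set()
        for u in frontier:
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    next_frontier.add(v)
                    groups[label[v]].add(v)
        frontier = next_frontier
    return groups

def ripple_summary(edges, node_type, input_sets, steps):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Vertices of an input set keep that set's name; all others use their entity type.
    label = dict(node_type)
    for name, members in input_sets.items():
        for v in members:
            label[v] = name
    supernodes = {name: set(members) for name, members in input_sets.items()}
    for members in input_sets.values():
        for name, found in ripple_from(adj, label, members, steps).items():
            supernodes.setdefault(name, set()).update(found)  # merge same-name groups
    kept = set().union(*supernodes.values())
    superedges = {tuple(sorted((label[u], label[v])))
                  for u, v in edges if u in kept and v in kept}
    return supernodes, superedges

# Hypothetical edges: authors 1-4, papers 5-8, proceeding 10, journal 11.
edges = [(1, 5), (2, 5), (2, 6), (3, 5), (4, 6), (3, 7), (4, 8),
         (5, 10), (7, 10), (6, 11), (8, 11)]
node_type = {1: "author", 2: "author", 3: "author", 4: "author",
             5: "paper", 6: "paper", 7: "paper", 8: "paper",
             10: "proceeding", 11: "journal"}
input_sets = {"S1": {1, 2}, "S2": {3, 4}, "S3": {10}}
supernodes, superedges = ripple_summary(edges, node_type, input_sets, steps=2)
```

On this toy input the groups named `paper` and `journal` play the role of $S_4$ and $S_5$ in Example 2. Resolution control (phase 4) is not part of this sketch.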

Chapter 4 Evaluation

This chapter presents the experiments on my proposed algorithms, covering the experiment design and the experimental results. Section 4.1 introduces the evaluation dataset and the experimental environment. Section 4.2 discusses the performance of the two algorithms in terms of running time. The summary graphs produced by Ripple Algorithm are also discussed based on the participant ratio.

4.1 Experiment Design

The algorithms were implemented using Python. All experiments were performed on a MacBook Pro with a 2.2 GHz Intel Core i7 CPU and 16 GB of RAM. I used one dataset in my experiments: the ACM bibliographical network (ACM network), which has 5 entity types and 6 relationship types. In the ACM network, each paper is written by at least one author and each paper is collected in a conference proceeding or a journal. A paper can cite other papers or be cited by other papers. Each conference proceeding or journal is published by a publisher. The files in the ACM network are in CSV format, so I constructed the data graph of the ACM network based on the given relation schemas. A detailed diagram of the relation schemas of this dataset is shown in Figure 4.1.
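Constructing a data graph from CSV relations can be sketched as follows. The column names (`author_id`, `paper_id`) and the in-memory sample are assumptions made for illustration; the actual ACM relation schemas are those of Figure 4.1. The sketch uses a plain adjacency dictionary rather than the networkx graph listed among the prerequisites in the README, and the function name `load_relation` is hypothetical.

```python
import csv
import io
from collections import defaultdict

def load_relation(csv_file, src_col, dst_col, src_type, dst_type, adj, node_type):
    """Add one relation table to an undirected adjacency dictionary."""
    for row in csv.DictReader(csv_file):
        # Prefix ids with the entity type so ids from different tables cannot clash.
        u = (src_type, row[src_col])
        v = (dst_type, row[dst_col])
        node_type[u], node_type[v] = src_type, dst_type
        adj[u].add(v)
        adj[v].add(u)

# Hypothetical 'writes' relation standing in for one of the CSV files.
sample = io.StringIO("author_id,paper_id\n1,5\n2,5\n2,6\n")
adj, node_type = defaultdict(set), {}
load_relation(sample, "author_id", "paper_id", "author", "paper", adj, node_type)
```

Calling `load_relation` once per relation file would accumulate all entity types and relationship types in the same `adj` and `node_type` structures.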

Figure 4.1: The Relation Schemas of ACM Bibliographical Network

Table 4.1: Size of ACM network

Table Name    # Records
author        5500
paper         5579
proceeding    6421
journal       128
publisher

4.2 Experimental Results

In this section, Path-based Algorithm and Ripple Algorithm are tested individually. For Path-based Algorithm, I tested the running time under different conditions. The experimental results of Ripple Algorithm are presented from 2 aspects. First, I tested Ripple Algorithm without iterative splitting. The second evaluation tested the resolution control of Ripple Algorithm.

4.2.1 Path-based Algorithm

To test the running time, Path-based Algorithm was applied to 3 test cases (shown in the following table). Each test case was run 3 times with different thresholds of path lengths (L = 3, 4, 5). The running times of the 3 test cases with different thresholds are shown in Figure 4.2(a). We can draw the following conclusion: for a given pair of entity sets, the running time of Path-based Algorithm increases as the threshold of path lengths increases.

Figure 4.2: Running time of two algorithms. (a) Path-based Algorithm; (b) Ripple Algorithm

4.2.2 Ripple Algorithm

First, Ripple Algorithm was applied to 3 test cases (shown in the following table). For each test case, the algorithm was run 3 times with different thresholds of path lengths (L = 2, 3, 4). The running times of the 3 test cases are shown in Figure 4.2(b). We can draw the following conclusion: for given entity sets, the running time of Ripple Algorithm increases as the threshold increases. Figure 4.3(a) is the summary graph of TC1 (L = 2) with participant ratios and

Figure 4.3(b) is the summary graph of TC1 (L = 3) with participant ratios. Comparing these two summary graphs shows that, for a given number of entity sets, most participant ratios decrease slightly as the threshold of path lengths increases.

Figure 4.3: Summary graphs. (a) TC1 (L = 2); (b) TC1 (L = 3); (c) TC1 (L = 2, r = 15)

To test the resolution control of Ripple Algorithm, I used the summary graph of TC1 (L = 2) and assumed a user-specified resolution of r = 15. After calculating the splitting ratio for each attribute of the summary graph in Figure 4.3(a), the algorithm finds that the splitting supernode is paper 1 and splits paper 1 based on the joid attribute.

Figure 4.3(c) shows the splitting result.
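The splitting step used for resolution control can be sketched in Python as follows. This is a minimal illustration under stated assumptions: attributes with a single value inside a supernode are skipped (splitting on them would change nothing), the loop stops early when no supernode can be split further, and the supernode names, attribute names and values are toy data. The function names are hypothetical.

```python
def split_once(supernodes, attrs):
    """Split the supernode/attribute pair with the smallest splitting ratio."""
    best = None
    for name, members in supernodes.items():
        attr_names = sorted({a for u in members for a in attrs[u]})
        for attr in attr_names:
            values = {attrs[u].get(attr) for u in members}
            if len(values) < 2:
                continue  # assumption: a single value gives nothing to split on
            ratio = len(values) / float(len(members))
            if best is None or ratio < best[0]:
                best = (ratio, name, attr)
    if best is None:
        return False
    _, name, attr = best
    for u in supernodes.pop(name):
        key = "{}[{}={}]".format(name, attr, attrs[u].get(attr))
        supernodes.setdefault(key, set()).add(u)
    return True

def control_resolution(supernodes, attrs, r):
    """Split until the summary has at least r supernodes or nothing can be split."""
    result = {name: set(members) for name, members in supernodes.items()}
    while len(result) < r and split_once(result, attrs):
        pass
    return result

# Toy data: two authors with one affiliation and four papers in two journals.
attrs = {1: {"affiliation": "ANU"}, 2: {"affiliation": "ANU"},
         5: {"joid": "j1", "year": 2015}, 6: {"joid": "j1", "year": 2016},
         7: {"joid": "j2", "year": 2017}, 8: {"joid": "j1", "year": 2014}}
summary = {"S1": {1, 2}, "paper": {5, 6, 7, 8}}
three = control_resolution(summary, attrs, 3)
```

In this toy run the `paper` supernode is split on `joid` (ratio 0.5) rather than on `year` (ratio 1.0), and one of the resulting supernodes contains a single vertex.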

Chapter 5 Conclusion and Future Work

In this project, I have developed 2 algorithms for grouping-based summarization: Path-based Algorithm and Ripple Algorithm. Path-based Algorithm can summarize a data graph based on a pair of input entity sets and a threshold of path lengths. However, Path-based Algorithm is inefficient if users specify more than 2 entity sets, because path finding can only be carried out between two entity sets at a time. That is, if users input $n$ supernodes ($n > 2$), Path-based Algorithm has to find paths $n(n-1)/2$ times.

Ripple Algorithm allows users to select two or more entity sets; it finds all intermediate vertices between the input entity sets and discovers their relationships. Users can also control the resolution of the summary graph. However, the algorithm is susceptible to noise. If there is too much noise in the data graph, it is more likely to produce many supernodes in the summary graph, and some supernodes may contain just one vertex.

The major future work for Ripple Algorithm is to propose another splitting measure. The current splitting method of Ripple Algorithm is the splitting ratio measure, which chooses the attribute with the minimum splitting ratio as the splitting attribute. However, it does not work well in some cases, and it is likely to produce supernodes that contain just one vertex.

Appendices

Appendix A Independent Study Contract

Appendix B Project Description

Appendix C Software Description

Software Brief

There are two directories in this software folder: the "Path-based" directory and the "Ripple" directory.

Path-based directory

This directory implements the Path-based Algorithm. The folder named "files" holds the test dataset (ACM network). There is only one Python file in this directory, named "path-based.py".

Directly used modules

Six modules from the Python library are used directly.

Functions

Six functions are used in this program. One function, "RepresentsInt(s)", was obtained from an external source. Details about each function can be found in the README.

1. RepresentsInt(s)
2. find_selectedsn()
3. find_allpaths(l)
4. cate_type(listofpaths)

5. build_adlist(paths)
6. build_sg(nodeset)

Ripple directory

This directory implements the Ripple Algorithm. The folder named "files" holds the test dataset (ACM network). There are 3 Python files in this directory: "Dataset_To_Graph_1.py", "Search_Each_Supernode_2.py" and "Resolution_Control_3.py".

Directly used modules

Seven modules from the Python library are used directly.

Functions

Thirteen functions are used in this program. The RepresentsInt(s) function and the to_graph(combinedgroup) + to_edge(l) functions are the work of others and were obtained from external sources. Details about each function can be found in the README.

1. RepresentsInt(s)
2. buildadlist(node1, node2)
3. find_selectedsn()
4. typeequalsupernode() + typenotequalsupernode()
5. find_out_neighs(node, grouptype)
6. find_in_neighs(node, grouptype)
7. to_graph(combinedgroup) + to_edge(l)
8. comb_groups(gg)
9. find_finaledges(nodeset)
10. NodesetMapAttr()
11. AttrMapNum()
12. FindSplitAttr() + FindMinNum()
13. find_relationship(groups)

Test Environment

All experiments were performed on a MacBook Pro with a 2.2 GHz Intel Core i7 CPU and 16 GB of RAM. The test dataset, "ACM bibliographical network", was provided by my supervisor, Qing Wang. The implementation was written in Python.

Process

To test the performance and correctness of my proposed algorithms, each algorithm was applied to 3 test cases. To compare running times, each test case was run 3 times with different thresholds of path lengths. Details can be found in the Evaluation chapter of the report. How to run the software is described in the README, with command examples.

Appendix D README

Iterative Graph Summarization based on Grouping

This project proposes two algorithms (Path-based Algorithm and Ripple Algorithm). Both produce a summary graph from user-identified supernodes by finding the intermediate supernodes between them. Ripple Algorithm further allows users to control the resolution of the summary graph.

Getting Started

This project was written and tested on macOS Sierra, so the commands shown in this README are in Mac style.

Prerequisites

1. Python 2.x. The algorithms were implemented using Python.
2. There should be a "files" folder in both the "Path-based" and "Ripple" directories, containing the test dataset (ACM network).
3. The Python modules networkx, csv, Counter, copy, groupby, timeit and itemgetter are required. networkx can be installed by running "pip install networkx"; the others are part of the Python standard library (Counter, groupby and itemgetter come from the collections, itertools and operator modules, respectively).
4. There should be 1 Python file in the "Path-based" directory and 3 Python files in the "Ripple" directory.

Running the tests

This section describes how to run each of the two algorithms.

Path-based Algorithm

1. Go into the "Path-based" directory in the terminal.
2. Run: python path-based.py
3. Give input as the program requires. For each attribute value, you can just give a keyword. The attributes you can choose from are listed in the D.add_node() statement.

Example

> cd ~/Path-based
> python path-based.py
> the number of attrs for supernode 1: 2
> the attribute 1 for supernode 1, use = to assign value: node_type=author
> the attribute 2 for supernode 1, use = to assign value: affiliation=duke
> the number of attrs for supernode 2: 2
> the attribute 1 for supernode 2, use = to assign value: node_type=author
> the attribute 2 for supernode 2, use = to assign value: affiliation=ibm
> threshold of path lengths: 3
The details of the summary graph are then shown.

Ripple Algorithm
1. Go into the "Ripple" directory in the terminal.
2. Run: python Resolution_Control_3.py
3. Give input as the program requests. For each attribute value, a keyword is enough. The attributes you can choose from are those in the D.add_node() statement in "Dataset_To_Graph_1.py".

Example
> cd ~/Ripple
> python Resolution_Control_3.py
> the number of input supernodes: 3
> the number of attrs for supernode 1: 2
> the attribute 1 for supernode 1, use = to assign value: node_type=author
> the attribute 2 for supernode 1, use = to assign value: affiliation=duke
> the number of attrs for supernode 2: 2
> the attribute 1 for supernode 2, use = to assign value: node_type=author
> the attribute 2 for supernode 2, use = to assign value: affiliation=ibm
> the number of attrs for supernode 3: 2
> the attribute 1 for supernode 3, use = to assign value: node_type=paper
> the attribute 2 for supernode 3, use = to assign value: title=algorithms
> threshold of path lengths: 2
The details of the summary graph are then shown.
> specify the resolution:15
The details of the split summary graph are then shown.

Project Structure
The functions in each program are described in this section.
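As a rough illustration of what a selection function such as find_selectedsn() has to do with the attribute input shown in the examples above, the sketch below parses "attribute=value" strings and picks the matching vertices. It is not the project's code: `parse_assignment`, `select_nodes` and the sample `nodes` dictionary are hypothetical, and it assumes that a keyword matches when it is a case-sensitive substring of the stored attribute value.

```python
def parse_assignment(text):
    """Split 'attribute=value' into (attribute, value); spaces are rejected."""
    if " " in text or text.count("=") != 1:
        raise ValueError("expected attribute=value without spaces: %r" % text)
    attribute, value = text.split("=")
    if not attribute or not value:
        raise ValueError("attribute and value must be non-empty: %r" % text)
    return attribute, value


def select_nodes(nodes, conditions):
    """Return sorted ids of nodes whose attributes contain every keyword.

    Matching is an assumed case-sensitive substring test.
    """
    selected = []
    for node_id, attrs in nodes.items():
        if all(keyword in str(attrs.get(attr, "")) for attr, keyword in conditions):
            selected.append(node_id)
    return sorted(selected)


# Hypothetical sample data; the real attributes come from the dataset.
nodes = {
    "a1": {"node_type": "author", "affiliation": "duke university"},
    "a2": {"node_type": "author", "affiliation": "ibm research"},
    "p1": {"node_type": "paper", "title": "graph algorithms"},
}

conditions = [parse_assignment("node_type=author"),
              parse_assignment("affiliation=duke")]
supernode_1 = select_nodes(nodes, conditions)
```

Under these assumptions, a keyword with different capitalization (for example "Duke") selects nothing, which corresponds to the typo warning mentioned under Known Issues.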

Path-based Algorithm
The code in the "Path-based" directory implements the Path-based Algorithm. There is only one Python file in this directory, path-based.py. Its functions are:
1. RepresentsInt(s): check whether a string "s" represents an integer. The code for this function was obtained from here.
2. find_selectedsn(): find the vertices that meet the input requirements for the entity sets.
3. find_allpaths(l): find paths (of length not greater than L) between the 2 input supernodes.
4. cate_type(listofpaths): group the vertices on the paths by entity type.
5. build_adlist(paths): build adjacency lists from the found paths.
6. build_sg(nodeset): build a summary graph.

Ripple Algorithm
The code in the "Ripple" directory implements the Ripple Algorithm. There are three Python files in this directory: "Dataset_To_Graph_1.py", "Search_Each_Supernode_2.py" and "Resolution_Control_3.py".

Dataset_To_Graph_1.py
This Python file builds the dataset into a data graph D. Its functions are:
1. RepresentsInt(s): check whether a string "s" represents an integer. The code for this function was obtained from here.
2. buildadlist(node1, node2): build adjacency lists if there is an edge between node1 and node2 in the data graph D.
3. find_selectedsn(): find the vertices that meet the input requirements for the entity sets.
4. typeequalsupernode() + typenotequalsupernode(): these two functions group the vertices of the data graph D by entity type. Here, user-identified supernodes are regarded as new entity types (Si).

Search_Each_Supernode_2.py
This Python file performs a breadth-first search from each input supernode and merges groups. Its functions are:
1. find_out_neighs(node, grouptype): find all "out" neighboring vertices of a group and group them by direction, entity type and edge type.
2. find_in_neighs(node, grouptype): find all "in" neighboring vertices of a group and group them by direction, entity type and edge type.
3. to_graph(combinedgroup) + to_edge(l): construct graphs from the groups in "CombinedGroup". Connected components are then used to find overlapping groups. The code for these functions was obtained from here.
4. comb_groups(gg): (1) find groups in "GG" that have the same type and the same neighbors; (2) for each shared neighbor, check whether the groups are connected to it by the same edge type. Groups that meet both (1) and (2) are merged into one group.

5. find_finaledges(nodeset): find the relationships between the groups in "nodeset".

Resolution_Control_3.py
This Python file lets users control the resolution of the summary graph built in "Search_Each_Supernode_2.py". Its functions are:
1. NodesetMapAttr(): find the distinct attribute values for each attribute.
2. AttrMapNum(): calculate the number of distinct values for each attribute.
3. FindSplitAttr() + FindMinNum(): find the splitting attribute and the splitting group.
4. find_relationship(groups): find the relationships between the groups in "Groups".

Known Issues
1. To find all paths between two input entity sets, the Path-based Algorithm does not take edge direction into account, so it may take a long time if users specify a large threshold. The workaround is to restart the program with a smaller threshold value.
2. Do not use spaces when assigning values to attributes (e.g. node_type=author).
3. Capitalization of attribute values matters. Make sure the attribute values you choose follow the dataset in the "files" folder. If the program says "There may exist typo in your input!!", check the input and restart the program with correct input.
4. The result of the Ripple Algorithm may contain supernodes whose names start with "M". These refer to nodes that have the same entity type (node_type) as an input supernode but do not belong to it. For example, if "S1" is a supernode containing authors who work for Duke University, then "M1" is a supernode containing authors who do not work for Duke University.
5. If some supernode names end with a number, there are other supernodes with the same entity type (node_type) but different superedges. For example, the result may contain "paper" and "paper1". Both supernodes contain papers, but they are connected by different superedges, so they cannot be merged into one group.

Author
The code and this README were written by Sirui Li (u ).
Mail: u @anu.edu.au

Acknowledgments
Two functions in this project are the work of others, and they are cited in this README. I greatly appreciate the guidance of my supervisor, Qing Wang.

