Iterative Graph Summarization based on Grouping


Sirui Li
Supervisor: Dr. Qing Wang
COMP4560: Advanced Computing Project
Australian National University
Semester 1, 2017
May 26, 2017

Acknowledgements

First and foremost, I would like to express my gratitude to my supervisor, Dr. Qing Wang, who gave me the golden opportunity to do this project, in which I am really interested. Without her assistance and guidance, the project and the report could not have been accomplished. Secondly, I would like to thank Dr. Weifa Liang, the course convener, for providing helpful assistance on academic writing and presentation. Most importantly, my thanks go to my family and friends; without their understanding and support, the project could not have been finalized.

Abstract

Analysing large datasets to discover useful insights is an important task in many application domains. To find this information, graphs are widely used to model entities (as nodes) and their relationships (as edges) in the datasets. These graphs may contain millions of nodes and edges. Graph summarization is a critical technique for reducing the complexity of large graphs and understanding their underlying information. There are two major families of methods in graph summarization: statistical methods and grouping-based methods. Statistical methods, such as degree distributions, are useful, but it is hard to control their resolution [9]. Grouping-based methods, also called aggregation-based methods, produce summaries from large graphs by grouping nodes based on attributes and relationships. Existing grouping-based methods, such as SNAP and k-snap, allow users to select the node attributes and relationships of the summaries, but they require high user interaction: users need to know the details of the nodes and edges of the summaries before the analysis. This project proposes an algorithm that summarizes graphs iteratively until it reaches a resolution specified by the user. The algorithm only requires users to specify some entity sets and the resolution; it then finds the intermediate nodes that appear during the search and groups them.

Keywords: Graph Summarization, Statistical Methods, Grouping-based Methods, Aggregation-based Methods

Contents

Acknowledgements
Abstract
1 Introduction
  1.1 Objectives
  1.2 Contributions
  1.3 Outline
2 Related Work
  2.1 SNAP and k-snap Algorithms
  2.2 ESPRESSO Algorithm and REX Algorithm
3 Methodology
  3.1 Problem Definition
  3.2 Algorithm Design
    3.2.1 Path-based Algorithm
    3.2.2 Ripple Algorithm
4 Evaluation
  4.1 Experiment Design
  4.2 Experimental Results
    4.2.1 Path-based Algorithm
    4.2.2 Ripple Algorithm
5 Conclusion and Future Work
Appendices
  A Independent Study Contract
  B Project Description
  C Software Description
  D README

Chapter 1 Introduction

Modelling real-life entities and the relationships between them as graphs is a common way to extract underlying information from large datasets. One typical application is social network analysis: mining social networks can help analysts understand how people interact with each other. However, decoding useful information directly from large graphs is difficult. Hence, graph summarization is needed to condense large graphs into small graphs that can be easily understood.

Graph summarization is the process of transforming large graphs into concise forms [9]. Generally, two kinds of methods are used in graph summarization: statistical methods and grouping-based methods. Statistical methods [9] include degree distributions, hop-plots and so on. The major downside of these methods is that users cannot control the resolution of the summaries. Grouping-based methods, also known as aggregation-based methods [9], produce summary graphs by grouping nodes based on selected attributes of nodes and edges.

Figure 1.1, which is taken from [9], presents an overall picture of the grouping-based graph summarization process. The graph on the left has 7445 vertices and 19,971 edges. Given some conditions, the summary graph shown on the right is produced. In the summary graph, each vertex is a set of vertices from the original graph, called a supernode or group; each edge represents the connection between two sets of vertices, called a superedge or group relationship. The summary graph on the right helps to understand the relationships among different sets of vertices in the original graph.

Figure 1.1: A summary graph (right) is generated for the original graph (left), which is taken from [9]

Two operations are proposed among the grouping-based methods: SNAP and k-snap [9]. These two operations produce a summary graph based on user-selected vertex attributes and relationships. Both of them require attribute homogeneity, that is, vertices in the same group must have the same value for each user-selected attribute [9]. The SNAP operation also requires relationship homogeneity, that is, if there exists a relationship between two groups, then every node in one group has to connect to at least one node in the other group [9]. Nonetheless, relationship homogeneity is hard to achieve in real-life applications because data may contain too much noise. Thus, the k-snap operation relaxes this requirement and allows users to control the size of the summary graph.

Even though the SNAP and k-snap operations can produce high-quality summary graphs, they still have some drawbacks. For example, they require high user interaction, which means users are expected to know the details of a summary graph before building it. This is impractical in real-life applications, as most users lack knowledge about which relationships are useful.

1.1 Objectives

The goal of my project is to propose an iterative algorithm for grouping-based summarization based on (1) entity sets the user is interested in and (2) a user-specified resolution. It can answer questions of the form "Have people at ANU collaborated with people at Peking University?" or "Have people at ANU collaborated with people at Peking University in the 27th international ACM conference?". The specific objectives are:

1. Develop an iterative algorithm for grouping-based graph summarization.
2. Rank the summary graph.
3. Evaluate the effectiveness and efficiency of the algorithm.

Note that two algorithms (Path-based Algorithm and Ripple Algorithm) are introduced in this report. Path-based Algorithm can only be applied when there are exactly 2 user-specified entity sets. Ripple Algorithm improves on Path-based Algorithm so that users can specify 2 or more entity sets to study.

1.2 Contributions

My work addresses the limitations of previous work and provides the following novel features:

- Users can specify an arbitrary number of entity sets of interest.
- Users can specify the threshold of path lengths between the user-specified entity sets.
- Entities that appear on one or more paths between user-specified entity sets are grouped in the summary graph. How these intermediate entities are grouped is discussed in Chapter 3.
- Users can control the resolution of the summary graph, i.e. how many supernodes the summary graph contains.

1.3 Outline

This project report is structured as follows: Chapter 2 presents related work on graph summarization and discusses its limitations. Chapter 3 presents the notions I define for describing the algorithms and explains how the proposed algorithms (Path-based Algorithm and Ripple Algorithm) work. Chapter 4 evaluates the performance of these two algorithms in terms of participant ratio and running time. Chapter 5 concludes this report and discusses future work.

Chapter 2 Related Work

Graph summarization is widely used in many application domains. Various graph summarization techniques have been proposed to help users understand the characteristics of large graphs [6], [10]. A recent survey by Yike Liu [5] provides a comprehensive overview of the state-of-the-art methods for graph summarization.

2.1 SNAP and k-snap Algorithms

Grouping-based methods are among the most popular techniques for graph summarization. These methods aggregate vertices of an original graph into supernodes and connect them with superedges, producing a summary graph [5]. Several recent studies have investigated how to build a summary graph by applying grouping-based methods.

Summarization by Grouping Nodes on Attributes and Pairwise Relationships (SNAP) and k-snap are two database-style algorithms for summarizing graphs [9]. They can only deal with categorical node attributes [10]. The SNAP algorithm produces a summary graph by grouping nodes based on user-selected node attributes and relationships. The k-snap algorithm also allows users to customize the summary graph [10]; it extends the SNAP algorithm by letting users control the resolution of the summary graph through drill-down and roll-up operations.

Although the SNAP and k-snap algorithms can produce a summary graph based on user-selected node attributes and relationships, they still have some drawbacks. In many real-life cases, users do not know which relationships between their selected nodes are useful, nor whether other entities exist between their selected nodes. Thus, users might prefer to specify only the entity sets they are interested in and expect an algorithm to return a summary graph that reveals the other entities lying between those entity sets.

2.2 ESPRESSO Algorithm and REX Algorithm

Many previous works focus on extracting relationships between input entities from large graphs. For example, [7] proposed a connection subgraph discovery algorithm over RDF graphs. The notion of a connection subgraph, proposed by [2], is to extract a connected subgraph based on a pair of query vertices, where the number of vertices in the subgraph can be controlled. [4] addressed the extraction of subgraphs based on a set of query vertices in entity-relationship graphs. [1] partitioned graphs with respect to the context of a vertex. The ESPRESSO algorithm proposed by [8] and the REX algorithm proposed by [3] build on these previous works and focus on producing a concise summary graph that explains the relationship between two sets of entities in a knowledge graph. The limitation of these two algorithms is that users can only specify two entity sets of interest for which explanations are produced.

Chapter 3 Methodology

In this chapter, I present my algorithms for producing summary graphs from a given data graph. For clarity, I first define the notions of data graph, summary graph, path, path length and the resolution of a summary graph. Then, I introduce each algorithm in detail. Note that each algorithm introduced in this report can be used on both directed and undirected graphs. For simplicity, I only use undirected graphs as examples when presenting the algorithms.

3.1 Problem Definition

Figure 3.1: A summary graph G (b) is generated from a data graph D (a), which is taken from [9]

A data graph, denoted by $D = (V_D, E_D, T, A, \tau, \psi)$, is a graph with a set $V_D$ of entities as vertices, a set $E_D$ of relationships as edges, a set $A$ of attributes, a set $T$ of entity types, and two assignment functions $\tau : V_D \to T$ and $\psi : V_D \to A$ that assign to each vertex an entity type (from the set $T$ of possible types) and a set of attributes (from the set $A$ of possible attribute values). Figure 3.1(a) presents a data graph D. Each vertex in D has an entity type $t \in T$ and is associated with a set of attributes in $A$.

A summary graph, denoted by $G = (V_G, E_G, T, A, \tau, \psi)$, is a graph with a set $V_G$ of entity sets as vertices (supernodes), a set $E_G$ of relationships as edges (superedges), and two assignment functions $\tau : V_G \to T$ and $\psi : V_G \to A$ that assign to each vertex an entity type (from the set $T$ of possible types) and a set of attribute values (from the set $A$ of possible attributes). Here, each supernode in G is a subset of the vertices in $V_D$ and each superedge in G is a subset of the edges in $E_D$. Figure 3.1(b) presents a summary graph G with 4 supernodes (S1, S2, S3 and S4) and 4 superedges (S1-S2, S2-S3, S2-S4, S3-S4).

Following the definition of a path in Wikipedia, "a path in a graph is a finite or infinite sequence of edges which connect a sequence of vertices which, by most definitions, are all distinct from one another". I use P to refer to a path in this report. The path length, denoted as L, is the number of edges occurring in a path. For example, the paths from S1 to S3 in Figure 3.1(b) and their lengths are:

path (P)                      length (L)
P_1 = <S1, S2, S3>            2
P_2 = <S1, S2, S4, S3>        3

The resolution of a graph means the size of the graph, i.e. how many vertices the graph has. In this report, I use r to refer to the user-specified resolution. In Figure 3.1(b), the resolution is 4.

3.2 Algorithm Design

In this section, I introduce the two algorithms for graph summarization proposed in this report: Path-based Algorithm and Ripple Algorithm.

In Path-based Algorithm, users provide a pair of entity sets and a threshold of path lengths as input. Given this input, the algorithm finds the intermediate entities appearing on the related paths and groups them by entity type. After grouping, these entity sets are presented as supernodes in the summary graph G. If there is an edge between node A and node B in the data graph, then there must exist a superedge between the supernodes that contain node A and node B, respectively.
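The path-based idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's implementation (which, according to the README, builds the data graph with networkx): the data graph is assumed to be undirected and given as a plain edge list, `node_type` is a hypothetical dictionary from vertex to entity type, and the helper names `simple_paths` and `path_based_summary` are invented for this example. The toy edge list is also an assumption, chosen so that the short paths between {1, 2} and {3, 4} run through papers 5 and 6.

```python
from collections import defaultdict


def simple_paths(adj, source, targets, max_len):
    """Yield simple paths from source to any vertex in targets with at most max_len edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node in targets and len(path) > 1:
            yield path              # assumption: a path ends at the first target vertex
            continue
        if len(path) - 1 == max_len:
            continue                # length threshold reached, do not extend further
        for nxt in adj[node]:
            if nxt not in path:     # keep the path simple (no repeated vertex)
                stack.append((nxt, path + [nxt]))


def path_based_summary(edges, node_type, v1, v2, max_len):
    """Group vertices on short v1-v2 paths by entity type and lift path edges to superedges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # The two input entity sets become supernodes S1 and S2.
    group_of = {v: "S1" for v in v1}
    group_of.update({v: "S2" for v in v2})

    path_edges = set()
    for s in v1:
        for path in simple_paths(adj, s, set(v2), max_len):
            for v in path:
                group_of.setdefault(v, node_type[v])   # intermediate vertices: group by type
            path_edges.update(zip(path, path[1:]))

    supernodes = defaultdict(set)
    for v, g in group_of.items():
        supernodes[g].add(v)
    superedges = {
        frozenset((group_of[u], group_of[v]))
        for u, v in path_edges
        if group_of[u] != group_of[v]
    }
    return dict(supernodes), superedges


# Hypothetical toy data graph: authors 1-4, papers 5-7, proceeding 10.
edges = [(1, 5), (2, 5), (3, 5), (2, 6), (4, 6), (1, 7), (7, 10)]
node_type = {1: "author", 2: "author", 3: "author", 4: "author",
             5: "paper", 6: "paper", 7: "paper", 10: "proceeding"}
supernodes, superedges = path_based_summary(edges, node_type, {1, 2}, {3, 4}, 2)
```

On this toy input with a threshold of 2, only papers 5 and 6 lie on a qualifying path, so the summary has the supernodes S1, S2 and paper, linked by two superedges. Paper 7 and proceeding 10 are left out because no path of length at most 2 through them reaches the second entity set.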

In Ripple Algorithm, users provide a number of entity sets of interest and a threshold of path lengths as input. Intermediate entities are found by Breadth-First Search (BFS). After grouping these intermediate entities, an initial summary graph is returned. This algorithm further allows users to specify the resolution r of the final summary graph G, and the initial summary graph is then split until its resolution meets the user requirement.

The major difference between the two algorithms is that Path-based Algorithm can only be used to find relationships between a pair of entity sets, whereas Ripple Algorithm can be used to find relationships among two or more entity sets. In principle, Ripple Algorithm places no limit on the number of input entity sets.

3.2.1 Path-based Algorithm

In this section I discuss the details of Path-based Algorithm. A high-level description is provided in Figure 3.2.

Input:
  - A data graph: D
  - A pair of entity sets (V_1, V_2)
  - Threshold of path lengths: L
Output:
  - A summary graph G

Path-based Algorithm
1: S_1 = V_1, S_2 = V_2  \\ initialize supernode 1 (S_1) and supernode 2 (S_2)
2: find all paths between S_1 and S_2 whose lengths are not greater than L
3: group vertices on these paths by entity type; each group is a supernode
4: if there exists an edge between node A and node B do:
5:     add a superedge between S_i and S_j, which contain node A and node B, respectively
6: return G

Figure 3.2: Path-based Algorithm

Let S_i be a supernode in a summary graph G. There are 4 phases in Path-based Algorithm:

1. Initialization (line 1): find the vertices that meet the input requirements for the entity sets (V_1 and V_2) and save them in the corresponding supernodes (S_1 and S_2).
2. Find paths (line 2): find the paths between the two supernodes S_1 and S_2 whose length is not greater than L.
3. Group vertices (line 3): group the vertices on these paths by entity type.
4. Build superedges (line 4 to line 5): if there is an edge between node A and node B on the paths, add a superedge between the supernodes that contain node A and node B, respectively.

Finally, Path-based Algorithm returns a summary graph.

Figure 3.3: A data graph example (entity types: author, paper, proceeding, journal; edge labels include writes and in_pr)

In the following, I illustrate how Path-based Algorithm works with an example.

Example 1. Assume Figure 3.3 is the input data graph D and users ask the question: "Have people at ANU collaborated with people at Peking University?"

Phase 1: Initialization. Assume that people at ANU are grouped into V_1 and people at Peking University are grouped into V_2. That is, V_1 = {1, 2} and V_2 = {3, 4}. Hence, S_1 = V_1 = {1, 2} and S_2 = V_2 = {3, 4}.

Phase 2: Find paths. Suppose that the user-specified threshold is L = 2; the algorithm therefore finds all paths whose length is not greater than 2. The related paths are:

P_1: <1, 5, 3>
P_2: <2, 5, 3>
P_3: <2, 6, 4>

Phase 3: Group vertices. The vertices on the above 3 paths are grouped by entity type. The grouping result is:

entity type   vertices
S_1           {1, 2}
S_2           {3, 4}
paper         {5, 6}

Phase 4: Build superedges. Take P_1 as an example: there is an edge between node 1 and node 5. Because node 1 is in S_1 and node 5 is in paper, there should be a superedge between S_1 and paper. Superedges are added in the same way for each path. After building all superedges, the summary graph of this example is as shown in Figure 3.4. So the answer to the previous question is that people at ANU co-wrote 2 papers (5, 6) with people at Peking University.

Figure 3.4: A summary graph G, with S3 = {5, 6} and superedges S1 -writes- S3 -writes- S2

However, if users specify n supernodes (n > 2), the algorithm has to repeat phase 2 (Find paths) $n(n-1)/2$ times to find all paths between every pair of input supernodes, which is inefficient. I thus developed Ripple Algorithm to address this limitation and to allow users to control the resolution.

3.2.2 Ripple Algorithm

Path-based Algorithm produces a summary graph based on a pair of input entity sets and the path length threshold. Unfortunately, studying only two entity sets is often not sufficient for real-life datasets, as most real-life data graphs are complex. Users might want to study the relationships among many entity sets to see how they influence each other. Ripple Algorithm is introduced to overcome this limitation of Path-based Algorithm, and it further allows users to specify the resolution r of the final summary graph G. Ripple Algorithm iteratively splits supernodes until the resolution of the summary graph meets the user-specified resolution r. Before introducing Ripple Algorithm, I define two ratios: the splitting ratio and the participant ratio.

Splitting ratio. The splitting ratio helps Ripple Algorithm decide which group to split and how to split it. Given a summary graph G with n supernodes, a splitting ratio is calculated for each attribute. The attribute with the minimum splitting ratio is picked as the splitting attribute, and the supernode that has the splitting attribute is chosen as the splitting supernode. The splitting ratio s is defined as follows:

$$s = \frac{|M(G_i, A)|}{|G_i|}, \qquad M(G_i, A) = \{\, u.a \mid u \in G_i \,\} \tag{3.1}$$

where $u.a$ refers to the value of attribute $A$ of vertex $u$.

Participant ratio. To quantify whether the relationship between two supernodes is strong or not, I define a ratio p. Given the relationship E between two groups $(G_i, G_j)$, the following equation gives the strength of this group relationship:

$$p = \frac{|N(G_i, G_j)|}{|G_i|}, \qquad N(G_i, G_j) = \{\, u \mid u \in G_i \wedge \exists v \in G_j : (u, v) \in E \,\} \tag{3.2}$$

The larger the participant ratio is, the stronger the relationship is. A high-level description of Ripple Algorithm is provided in Figure 3.5.
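The two ratios defined in Equations (3.1) and (3.2) can be computed directly from sets. The sketch below is an illustration only: the function names, the `attr_of` dictionary (vertex to attribute value) and the toy data are assumptions made for this example, and the edge list is treated as undirected.

```python
def splitting_ratio(group, attr_of):
    """s = |{u.a : u in G_i}| / |G_i| for one attribute (Equation 3.1)."""
    distinct_values = {attr_of[u] for u in group}
    return len(distinct_values) / len(group)


def participant_ratio(g_i, g_j, edges):
    """p = |{u in G_i : some v in G_j has (u, v) in E}| / |G_i| (Equation 3.2)."""
    linked = set()
    for u, v in edges:
        # Undirected reading: an edge counts whichever endpoint lies in G_i.
        if u in g_i and v in g_j:
            linked.add(u)
        if v in g_i and u in g_j:
            linked.add(v)
    return len(linked) / len(g_i)


# Hypothetical toy data: four papers with a journal attribute, three authors.
journal_of = {5: "j1", 6: "j1", 7: "j2", 8: "j1"}
writes = [(1, 5), (2, 5), (2, 6)]

s = splitting_ratio({5, 6, 7, 8}, journal_of)            # 2 distinct values / 4 vertices
p_authors = participant_ratio({1, 2, 9}, {5, 6}, writes)  # authors 1 and 2 of 3 take part
p_papers = participant_ratio({5, 6, 7}, {1, 2, 9}, writes)
```

Note that p is normalised by the size of the first group only, so `participant_ratio(g_i, g_j, ...)` and `participant_ratio(g_j, g_i, ...)` generally differ.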

Input:
  - A data graph: D
  - User-specified entity sets: (V_1, V_2, ..., V_n)
  - Threshold of path lengths: L
  - Resolution: r
Output:
  - A summary graph G

Ripple Algorithm
1: S_1 = V_1, S_2 = V_2, ..., S_n = V_n  \\ initialize supernode 1 to supernode n
2: step = 1, SearchVertices = ∅, G_i = ∅  \\ initialize step, SearchVertices and G_i
3: for S_i in (S_1, S_2, ..., S_n) do:
4:     G_i = {vertices in S_i}
5:     SearchVertices = vertices in S_i
6:     while step <= L do:
7:         SearchVertices_iter = ∅
8:         for each supernode S in G_i do:
9:             if S has vertices in SearchVertices do:
10:                apply BFS on such vertices to find neighbors
11:                if neighbor n has not been searched do:
12:                    SearchVertices_iter.append(n)
13:                group neighbors
14:                add new groups to G_i based on the result on line 13
15:        SearchVertices = SearchVertices_iter
16:        merge groups in G_i that overlap
17:        merge same-type groups that have the same neighbors and relationships
18:        step += 1
19: merge all G_i into one summary graph G
20: merge groups that overlap
21: merge same-type groups that have the same neighbors and relationships
22: while size(G) < r do:
23:     find splitting attribute and splitting group
24:     split G
25: return G

Figure 3.5: Ripple Algorithm

There are 4 phases in Ripple Algorithm:

1. Initialization (line 1 to line 2): find the vertices that meet the input requirements for the entity sets {V_1, ..., V_n} and save them in the corresponding supernodes {S_1, ..., S_n}. SearchVertices is a set holding all vertices that have been found but not yet searched; they will be searched in the next step. The summary graph obtained by searching from S_i is saved in a set named G_i.
2. Iterative search for each input supernode S_i (line 3 to line 18): for each input supernode S_i, do an L-step BFS and group the neighboring vertices that are found. The grouping result is saved in G_i.
3. Merge each G_i (line 19 to line 21): merge all G_i into one summary graph G.
4. Resolution control (line 22 to line 25): users specify the resolution r. The choice of the splitting supernode and splitting attribute is based on the splitting ratio. Note that the iterative group splitting continues until the resolution of the current summary graph is equal to or greater than r.

Because each input supernode is searched for L steps, the length of a discovered path between any pair of input entity sets lies in the range [1, 2L].

Figure 3.6: High-level overview of Ripple Algorithm with 2 input entity sets

In the following, I illustrate how Ripple Algorithm works with an example.

Example 2. Consider the data graph in Figure 3.3. Ripple Algorithm will answer the question: "Have people at ANU collaborated with people at Peking University in the 27th international ACM conference?"

Phase 1: Initialization. S_1 = V_1 = {1, 2} contains people at ANU, S_2 = V_2 = {3, 4} contains people at Peking University and S_3 = V_3 = {10} is the conference proceeding. Assume that users specify the threshold of path lengths as 2.

Phase 2: Iterative search for each input supernode S_i. For each input supernode S_i, do a 2-step BFS and obtain the summary graph G_i.

Phase 3: Merge each G_i. After merging the G_i, the summary graph of this example is as shown in Figure 3.7. So the answer to the question is that the papers in S_4 are collected in some journals (S_5) and in the conference proceeding (S_3). These papers (S_4) are written by people at ANU and people at Peking University.

Figure 3.7: Summary graph G, with S4 = {5, 6, 7, 8} and S5 = {11}

Phase 4: Resolution control. Assume that users specify r = 10. The algorithm will yield a summary graph whose resolution is equal to or greater than 10.
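The search and merge phases of Ripple Algorithm can be approximated by a short, heavily simplified Python sketch. It is not the project's code: it assumes an undirected edge list, groups the reached vertices by entity type only (the implementation described in the appendix also uses direction and edge type), and omits the resolution-control phase. The function name `ripple_summary`, the `node_type` dictionary and the toy edges are all assumptions chosen to resemble Example 2.

```python
from collections import defaultdict


def ripple_summary(edges, node_type, entity_sets, steps):
    """Expand each entity set by a bounded BFS, then group all reached vertices."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Input entity sets become supernodes S1..Sn.
    group_of = {}
    for i, members in enumerate(entity_sets, start=1):
        for v in members:
            group_of[v] = "S%d" % i

    # One "ripple" per entity set: level-by-level BFS for the given number of steps.
    reached = set()
    for members in entity_sets:
        seen = set(members)
        frontier = set(members)
        for _ in range(steps):
            frontier = {n for v in frontier for n in adj[v]} - seen
            seen |= frontier
        reached |= seen

    # Merge: vertices outside the input sets are grouped by entity type.
    for v in reached:
        group_of.setdefault(v, node_type[v])

    supernodes = defaultdict(set)
    for v, g in group_of.items():
        supernodes[g].add(v)
    superedges = {
        frozenset((group_of[u], group_of[v]))
        for u, v in edges
        if u in reached and v in reached and group_of[u] != group_of[v]
    }
    return dict(supernodes), superedges


# Hypothetical toy data graph: authors 1-4, papers 5-8, proceeding 10, journal 11.
edges = [(1, 5), (2, 5), (3, 5), (2, 6), (4, 6), (1, 7), (3, 8),
         (5, 10), (7, 10), (6, 11), (8, 11)]
node_type = {1: "author", 2: "author", 3: "author", 4: "author",
             5: "paper", 6: "paper", 7: "paper", 8: "paper",
             10: "proceeding", 11: "journal"}
supernodes, superedges = ripple_summary(edges, node_type, [{1, 2}, {3, 4}, {10}], 2)
```

With these assumed edges and a threshold of 2, the sketch yields five supernodes (the three input sets, one paper group {5, 6, 7, 8} and one journal group {11}), which has the same shape as the summary graph described for Example 2.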

Chapter 4 Evaluation

This chapter presents the experiments on my proposed algorithms, covering the experiment design and the experimental results. Section 4.1 introduces the evaluation dataset and the experimental environment. Section 4.2 discusses the performance of the two algorithms in terms of running time. The summary graphs of Ripple Algorithm are also discussed based on the participant ratio.

4.1 Experiment Design

The algorithms were implemented in Python. All experiments were performed on a MacBook Pro with a 2.2 GHz Intel Core i7 CPU and 16 GB of RAM.

I used one dataset in my experiments: the ACM bibliographical network (ACM network), which has 5 entity types and 6 relationship types. In the ACM network, each paper is written by at least one author and each paper is collected in a conference proceeding or a journal. One paper can cite other papers or be cited by other papers. Each conference proceeding or journal is published by a publisher. The files in the ACM network are in CSV format, so I constructed the data graph of the ACM network based on the given relation schemas. A detailed diagram of the relation schemas of this dataset is shown in Figure 4.1.

Figure 4.1: The Relation Schemas of ACM Bibliographical Network

Table 4.1: Size of ACM network

Table Name    # Records
author        5500
paper         5579
proceeding    6421
journal       128
publisher

4.2 Experimental Results

In this section, Path-based Algorithm and Ripple Algorithm are tested individually. For Path-based Algorithm, I tested the running time under different conditions. The experimental results of Ripple Algorithm are presented from 2 aspects. First, I tested Ripple Algorithm without iterative splitting. The other evaluation tested the resolution control of Ripple Algorithm.

4.2.1 Path-based Algorithm

To test the running time, Path-based Algorithm was applied to 3 test cases (shown in the following table). Each test case was run 3 times with different thresholds of path lengths (L = 3, 4, 5). The running times of the 3 test cases with different thresholds are shown in Figure 4.2(a). We can draw the conclusion that, for a given pair of entity sets, the running time of Path-based Algorithm increases as the threshold of path lengths increases.

Figure 4.2: Running time of two algorithms. (a) Path-based Algorithm; (b) Ripple Algorithm

4.2.2 Ripple Algorithm

First, Ripple Algorithm was applied to 3 test cases (shown in the following table). For each test case, the algorithm was run 3 times with different thresholds of path lengths (L = 2, 3, 4). The running times of the 3 test cases are shown in Figure 4.2(b). We can draw the conclusion that, for given entity sets, the running time of Ripple Algorithm increases as the threshold increases.

Figure 4.3(a) is the summary graph of TC1 (L = 2) with participant ratios and Figure 4.3(b) is the summary graph of TC1 (L = 3) with participant ratios. Comparing these two summary graphs shows that, for a given number of entity sets, most participant ratios decrease slightly as the threshold of path lengths increases.

Figure 4.3: Summary graphs. (a) TC1 (L = 2); (b) TC1 (L = 3); (c) TC1 (L = 2, r = 15)

To test the resolution control of Ripple Algorithm, I used the summary graph of TC1 (L = 2) and assumed a user-specified resolution r = 15. After calculating the splitting ratio for each attribute of the summary graph in Figure 4.3(a), the algorithm finds that the splitting supernode is paper 1 and splits paper 1 based on the joid attribute. Figure 4.3(c) is the splitting result.
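A running-time comparison of this kind can be scripted with the standard `timeit` module. The sketch below is an assumption about how such a measurement could be set up, not a record of how the reported figures were produced: `time_thresholds` is a hypothetical helper, and the lambda passed to it is a stand-in workload rather than either algorithm.

```python
import timeit


def time_thresholds(run, thresholds, repeat=3):
    """Return {L: best wall-clock seconds} for calling run(L) once per repetition."""
    timings = {}
    for threshold in thresholds:
        samples = timeit.repeat(lambda: run(threshold), number=1, repeat=repeat)
        timings[threshold] = min(samples)   # the minimum is the least noisy sample
    return timings


# Stand-in workload whose cost grows with the threshold (replace with a real call).
timings = time_thresholds(lambda L: sum(range(10 ** L)), [2, 3, 4])
for threshold, seconds in timings.items():
    print("L = %d: %.6f s" % (threshold, seconds))
```

Taking the minimum over a few repetitions is a common way to reduce the influence of background load on a laptop; reporting the mean instead is an equally reasonable choice.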

Chapter 5 Conclusion and Future Work

In this project, I have developed 2 algorithms for grouping-based summarization: Path-based Algorithm and Ripple Algorithm.

Path-based Algorithm can summarize a data graph based on a pair of input entity sets and a threshold of path lengths. However, Path-based Algorithm is inefficient if users specify more than 2 entity sets, as path finding can only be carried out between two supernodes at a time. That is, if users input n supernodes (n > 2), Path-based Algorithm has to find paths $n(n-1)/2$ times.

Ripple Algorithm allows users to select two or more entity sets; it finds all intermediate vertices between the input entity sets and discovers their relationships. Users can also control the resolution of the summary graph. However, the algorithm is susceptible to noise. If there is too much noise in the data graph, it is likely to produce many supernodes in the summary graph, some of which may contain just one vertex.

The major future work for Ripple Algorithm is to propose another splitting measure. The current splitting measure is the splitting ratio, which chooses the attribute with the minimum splitting ratio as the splitting attribute. However, it does not work well in some cases, and it tends to produce supernodes that contain just one vertex.

Appendices

Appendix A Independent Study Contract

Appendix B Project Description

Appendix C Software Description

Software Brief

There are two directories in the software folder: the "Path-based" directory and the "Ripple" directory.

Path-based directory

This directory implements Path-based Algorithm. The folder named "files" holds the test dataset (ACM network). There is only one Python file in this directory, named "path-based.py".

Directly used modules

Six Python modules are used directly.

Functions

Six functions are used in this program. One function, "RepresentsInt(s)", was obtained from an external source. Details about each function can be found in the README.

1. RepresentsInt(s)
2. find_selectedsn()
3. find_allpaths(l)
4. cate_type(listofpaths)
5. build_adlist(paths)
6. build_sg(nodeset)

Ripple directory

This directory implements Ripple Algorithm. The folder named "files" holds the test dataset (ACM network). There are 3 Python files in this directory: "Dataset_To_Graph_1.py", "Search_Each_Supernode_2.py" and "Resolution_Control_3.py".

Directly used modules

Seven Python modules are used directly.

Functions

Thirteen functions are used in this program. The RepresentsInt(s) function and the to_graph(combinedgroup) + to_edge(l) functions are others' work, obtained from external sources. Details about each function can be found in the README.

1. RepresentsInt(s)
2. buildadlist(node1, node2)
3. find_selectedsn()
4. typeequalsupernode() + typenotequalsupernode()
5. find_out_neighs(node, grouptype)
6. find_in_neighs(node, grouptype)
7. to_graph(combinedgroup) + to_edge(l)
8. comb_groups(gg)
9. find_finaledges(nodeset)
10. NodesetMapAttr()
11. AttrMapNum()
12. FindSplitAttr() + FindMinNum()
13. find_relationship(groups)

Test Environment

All experiments were performed on a MacBook Pro with a 2.2 GHz Intel Core i7 CPU and 16 GB of RAM. The test dataset, "ACM bibliographical network", was provided by my supervisor, Qing Wang. The implementation was written in Python.

Process

To test the performance and correctness of my proposed algorithms, each algorithm was applied to 3 test cases. To compare running times, each test case was run 3 times with different thresholds of path lengths. Details can be found in the Evaluation chapter of the report. How to run the software is explained in the README with command examples.

37 Appendix D README 36

38 Iterative Graph Summarization based on Grouping This project proposes two algorithms (Path-based Algorithm and Ripple Algorithm). Both of them produce a summary graph based on user-identified supernodes, finding intermediate supernodes between them. Ripple Algorithm further allows users to control the resolution of a summary graph. Getting Started This project was written and tested in macos Sierra. So the commands shown in this Readme are in mac style. Prerequisites 1. Python 2.x version. The algorithms were implemented using Python There should be a "files" folder in "Path-based" and "Ripple" directory and test dataset (ACM network) in the folder. 3. networkx, csv, Counter, copy, groupby, timeit as well as itemgetter python modules are required. They can be installed by running "pip install module name". 4. There should be 1 python file in Path-based directory, 3 python files in Ripple directory. Running the tests This section will introduce how to run these two algorithms, respectively. Path-based Algorithm 1. Go into the "Path-based" directory in the terminal. 2. Run: python path-based.py 3. Give input as the program requires. For each attribute value, you can just give the keyword. The attribute you can choose is in the D.add_node() statement. Example

> cd ~/Path-based
> python path-based.py
> the number of attrs for supernode 1: 2
> the attribute 1 for supernode 1, use = to assign value: node_type=author
> the attribute 2 for supernode 1, use = to assign value: affiliation=duke
> the number of attrs for supernode 2: 2
> the attribute 1 for supernode 2, use = to assign value: node_type=author
> the attribute 2 for supernode 2, use = to assign value: affiliation=ibm
> threshold of path lengths: 3

The details of the summary graph will then be shown.

Ripple Algorithm

1. Go into the "Ripple" directory in the terminal.
2. Run: python Resolution_Control_3.py
3. Give input as the program requests. For each attribute value, a keyword is enough. The attributes you can choose from are listed in the D.add_node() statement in "Dataset_To_Graph_1.py".

Example

> cd ~/Ripple
> python Resolution_Control_3.py
> the number of input supernodes: 3
> the number of attrs for supernode 1: 2
> the attribute 1 for supernode 1, use = to assign value: node_type=author
> the attribute 2 for supernode 1, use = to assign value: affiliation=duke
> the number of attrs for supernode 2: 2
> the attribute 1 for supernode 2, use = to assign value: node_type=author
> the attribute 2 for supernode 2, use = to assign value: affiliation=ibm
> the number of attrs for supernode 3: 2
> the attribute 1 for supernode 3, use = to assign value: node_type=paper
> the attribute 2 for supernode 3, use = to assign value: title=algorithms
> threshold of path lengths: 2

The details of the summary graph will then be shown.

> specify the resolution:15

The details of the split summary graph will then be shown.

Project Structure

The functions in each program are introduced in this section.
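The prompts in the examples above take attribute assignments such as node_type=author, where the value may be just a keyword. As a rough illustration of how such input could be parsed and matched against node attributes, here is a minimal sketch. It is not the project's code: `parse_assignment` and `select_supernode` are hypothetical names, the sample nodes are invented, and treating a keyword as a case-sensitive substring is an assumption.

```python
def parse_assignment(text):
    """Split 'attribute=value' into an (attribute, value) pair.

    Raises ValueError when the input contains spaces or lacks '=',
    since the prompts expect assignments written without spaces.
    """
    if " " in text or "=" not in text:
        raise ValueError("expected attribute=value without spaces: %r" % text)
    attribute, value = text.split("=", 1)
    if not attribute or not value:
        raise ValueError("attribute and value must both be non-empty: %r" % text)
    return attribute, value


def select_supernode(nodes, conditions):
    """Return the ids of nodes whose attributes contain every keyword.

    nodes: dict mapping node id -> dict of attribute name -> string value.
    conditions: list of (attribute, keyword) pairs.
    Matching is a case-sensitive substring test (an assumption).
    """
    selected = set()
    for node_id, attrs in nodes.items():
        if all(keyword in attrs.get(attribute, "")
               for attribute, keyword in conditions):
            selected.add(node_id)
    return selected


if __name__ == "__main__":
    # Invented sample data, loosely shaped like a bibliographic network.
    sample_nodes = {
        "a1": {"node_type": "author", "affiliation": "duke university"},
        "a2": {"node_type": "author", "affiliation": "ibm research"},
        "p1": {"node_type": "paper", "title": "graph algorithms"},
    }
    wanted = [parse_assignment("node_type=author"),
              parse_assignment("affiliation=duke")]
    print(select_supernode(sample_nodes, wanted))
```

Because the match is case-sensitive, a keyword only selects nodes when its capitalization follows the dataset.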

Path-based Algorithm

The code in the "Path-based" directory implements the Path-based Algorithm. There is only one Python file in this directory, path-based.py. Its functions are:

1. RepresentsInt(s): check whether a string "s" represents an integer. The code for this function was obtained from an external source.
2. find_selectedsn(): find the vertices that meet the input requirements for the entity sets.
3. find_allpaths(l): find paths of length at most L between the 2 input supernodes.
4. cate_type(listofpaths): group the vertices on the paths by entity type.
5. build_adlist(paths): build an adjacency list from the found paths.
6. build_sg(nodeset): build a summary graph.

Ripple Algorithm

The code in the "Ripple" directory implements the Ripple Algorithm. There are three Python files in this directory: "Dataset_To_Graph_1.py", "Search_Each_Supernode_2.py" and "Resolution_Control_3.py".

Dataset_To_Graph_1.py

This Python file builds the dataset into a data graph D. Its functions are:

1. RepresentsInt(s): check whether a string "s" represents an integer. The code for this function was obtained from an external source.
2. buildadlist(node1, node2): build adjacency lists if there is an edge between node1 and node2 in the data graph D.
3. find_selectedsn(): find the vertices that meet the input requirements for the entity sets.
4. typeequalsupernode() + typenotequalsupernode(): these two functions group the vertices in the data graph D by entity type. Here, user-identified supernodes are regarded as new entity types (Si).

Search_Each_Supernode_2.py

This Python file performs a breadth-first search from each input supernode and merges groups. Its functions are:

1. find_out_neighs(node, grouptype): find all "out" neighboring vertices of a group and group them by direction, entity type and edge type.
2. find_in_neighs(node, grouptype): find all "in" neighboring vertices of a group and group them by direction, entity type and edge type.
3. to_graph(combinedgroup) + to_edge(l): build a graph from the groups in "CombinedGroup" and use its connected components to find overlapping groups. The code for these functions was obtained from an external source.
4. comb_groups(gg): (1) find groups in "GG" that have the same type and the same neighbors; (2) for each shared neighbor, check whether the groups are connected to it by the same edge type. Groups that meet both (1) and (2) are merged into one group.

5. find_finaledges(nodeset): find the relationships between the groups in "nodeset".

Resolution_Control_3.py

This Python file lets users control the resolution of the summary graph built in "Search_Each_Supernode_2.py". Its functions are:

1. NodesetMapAttr(): find the distinct attribute values for each attribute.
2. AttrMapNum(): calculate the number of distinct values for each attribute.
3. FindSplitAttr() + FindMinNum(): find the splitting attribute and the splitting group.
4. find_relationship(groups): find the relationships between the groups in "Groups".

Known Issues

1. When finding all paths between two input entity sets, the Path-based Algorithm does not take edge direction into account, so it may take a long time if users specify a large threshold. The workaround is to restart the program with a smaller threshold.
2. Do not use spaces when assigning values to attributes (for example, write node_type=author).
3. Capitalization of attribute values matters. Please make sure the attribute value you choose follows the dataset in the "files" folder. If the program prints "There may exist typo in your input!!", check the input and restart the program with the correct input.
4. The result of the Ripple Algorithm may contain supernodes whose names start with "M". These supernodes contain nodes that have the same entity type (node_type) as an input supernode but do not belong to it. For example, if "S1" is a supernode containing authors who work for Duke University, then "M1" is a supernode containing authors who do not work for Duke University.
5. A supernode whose name ends with a number indicates that another supernode has the same entity type (node_type) but different superedges. For example, the result may contain both "paper" and "paper1". Both supernodes contain papers, but they are connected by different superedges, so they cannot be merged into one group.

Author

The code and this README were written by Sirui Li (u ).
Mail: u @anu.edu.au

Acknowledgments

Two functions in this project are the work of others and have been cited in this README. I really appreciate the guidance from my supervisor, Qing Wang.
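Known issue 1 above and the descriptions of find_allpaths and cate_type suggest a bounded path search that ignores edge direction, followed by grouping the intermediate vertices by entity type. The following sketch illustrates that idea only. It is not the project's implementation: the function names, the toy graph and the use of plain dictionaries instead of networkx are all assumptions made for the example.

```python
from collections import defaultdict


def build_undirected_adjacency(edges):
    """Build an adjacency map that ignores edge direction."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def find_bounded_paths(adjacency, sources, targets, max_length):
    """Enumerate simple paths of at most max_length edges from a source to a target.

    A path stops as soon as it reaches a target and never passes
    through another source vertex (both are simplifying assumptions).
    """
    paths = []

    def extend(path):
        last = path[-1]
        if last in targets:
            paths.append(list(path))
            return
        if len(path) - 1 == max_length:
            return
        for neighbour in sorted(adjacency.get(last, ())):
            if neighbour not in path and neighbour not in sources:
                path.append(neighbour)
                extend(path)
                path.pop()

    for source in sorted(sources):
        extend([source])
    return paths


def group_intermediates(paths, node_type):
    """Group the inner vertices of each path by their entity type."""
    groups = defaultdict(set)
    for path in paths:
        for vertex in path[1:-1]:
            groups[node_type[vertex]].add(vertex)
    return dict(groups)


# Invented toy graph: two authors linked through papers and a venue.
EDGES = [("a1", "p1"), ("a2", "p1"), ("a1", "p2"),
         ("p2", "v1"), ("v1", "p3"), ("a2", "p3")]
NODE_TYPE = {"a1": "author", "a2": "author", "p1": "paper",
             "p2": "paper", "p3": "paper", "v1": "venue"}

if __name__ == "__main__":
    adjacency = build_undirected_adjacency(EDGES)
    found = find_bounded_paths(adjacency, {"a1"}, {"a2"}, 4)
    print(found)
    print(group_intermediates(found, NODE_TYPE))
```

The number of candidate paths can grow quickly with the threshold, which is consistent with the advice above to restart with a smaller threshold when a run takes too long.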

Bibliography

[1] J. Cheng, Y. Ke, W. Ng, and J. X. Yu. Context-aware object connection discovery in large graphs. In Data Engineering, ICDE 09. IEEE 25th International Conference on, pages IEEE,

[2] C. Faloutsos, K. S. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages ACM,

[3] L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. Rex: explaining relationships between entity pairs. Proceedings of the VLDB Endowment, 5(3): ,

[4] G. Kasneci, S. Elbassuoni, and G. Weikum. Ming: mining informative entity relationship subgraphs. In Proceedings of the 18th ACM conference on Information and knowledge management, pages ACM,

[5] Y. Liu, A. Dighe, T. Safavi, and D. Koutra. A graph summarization: A survey. arXiv preprint arXiv: ,

[6] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages ACM,

[7] C. Ramakrishnan, W. H. Milnor, M. Perry, and A. P. Sheth. Discovering informative connection subgraphs in multi-relational graphs. ACM SIGKDD Explorations Newsletter, 7(2):56-63,

[8] S. Seufert, K. Berberich, S. J. Bedathur, S. K. Kondreddi, P. Ernst, and G. Weikum. Espresso: Explaining relationships between entity sets. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages ACM,

[9] Y. Tian and J. M. Patel. Interactive graph summarization. In Link Mining: Models, Algorithms, and Applications, pages Springer,

[10] N. Zhang, Y. Tian, and J. M. Patel. Discovery-driven graph summarization. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages IEEE, 2010.


More information

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection CSE 258 Lecture 6 Web Mining and Recommender Systems Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:

More information

The Structure and Properties of Clique Graphs of Regular Graphs

The Structure and Properties of Clique Graphs of Regular Graphs The University of Southern Mississippi The Aquila Digital Community Master's Theses 1-014 The Structure and Properties of Clique Graphs of Regular Graphs Jan Burmeister University of Southern Mississippi

More information

Collaborative Filtering using a Spreading Activation Approach

Collaborative Filtering using a Spreading Activation Approach Collaborative Filtering using a Spreading Activation Approach Josephine Griffith *, Colm O Riordan *, Humphrey Sorensen ** * Department of Information Technology, NUI, Galway ** Computer Science Department,

More information

princeton univ. F 15 cos 521: Advanced Algorithm Design Lecture 2: Karger s Min Cut Algorithm

princeton univ. F 15 cos 521: Advanced Algorithm Design Lecture 2: Karger s Min Cut Algorithm princeton univ. F 5 cos 5: Advanced Algorithm Design Lecture : Karger s Min Cut Algorithm Lecturer: Pravesh Kothari Scribe:Pravesh (These notes are a slightly modified version of notes from previous offerings

More information

Tie strength, social capital, betweenness and homophily. Rik Sarkar

Tie strength, social capital, betweenness and homophily. Rik Sarkar Tie strength, social capital, betweenness and homophily Rik Sarkar Networks Position of a node in a network determines its role/importance Structure of a network determines its properties 2 Today Notion

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Analysis and Extensions of Popular Clustering Algorithms

Analysis and Extensions of Popular Clustering Algorithms Analysis and Extensions of Popular Clustering Algorithms Renáta Iváncsy, Attila Babos, Csaba Legány Department of Automation and Applied Informatics and HAS-BUTE Control Research Group Budapest University

More information

Recommendation with Differential Context Weighting

Recommendation with Differential Context Weighting Recommendation with Differential Context Weighting Yong Zheng Robin Burke Bamshad Mobasher Center for Web Intelligence DePaul University Chicago, IL USA Conference on UMAP June 12, 2013 Overview Introduction

More information

Graph Theory using Sage

Graph Theory using Sage Introduction Student Projects My Projects Seattle, August 2009 Introduction Student Projects My Projects 1 Introduction Background 2 Student Projects Conference Graphs The Matching Polynomial 3 My Projects

More information

Event Detection using Archived Smart House Sensor Data obtained using Symbolic Aggregate Approximation

Event Detection using Archived Smart House Sensor Data obtained using Symbolic Aggregate Approximation Event Detection using Archived Smart House Sensor Data obtained using Symbolic Aggregate Approximation Ayaka ONISHI 1, and Chiemi WATANABE 2 1,2 Graduate School of Humanities and Sciences, Ochanomizu University,

More information

Online k-taxi Problem

Online k-taxi Problem Distributed Computing Online k-taxi Problem Theoretical Patrick Stäuble patricst@ethz.ch Distributed Computing Group Computer Engineering and Networks Laboratory ETH Zürich Supervisors: Georg Bachmeier,

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

A two-stage strategy for solving the connection subgraph problem

A two-stage strategy for solving the connection subgraph problem Graduate Theses and Dissertations Graduate College 2012 A two-stage strategy for solving the connection subgraph problem Heyong Wang Iowa State University Follow this and additional works at: http://lib.dr.iastate.edu/etd

More information

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton

More information

The Game Chromatic Number of Some Classes of Graphs

The Game Chromatic Number of Some Classes of Graphs The Game Chromatic Number of Some Classes of Graphs Casper Joseph Destacamento 1, Andre Dominic Rodriguez 1 and Leonor Aquino-Ruivivar 1,* 1 Mathematics Department, De La Salle University *leonorruivivar@dlsueduph

More information

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection CSE 158 Lecture 6 Web Mining and Recommender Systems Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:

More information

Online Graph Exploration

Online Graph Exploration Distributed Computing Online Graph Exploration Semester thesis Simon Hungerbühler simonhu@ethz.ch Distributed Computing Group Computer Engineering and Networks Laboratory ETH Zürich Supervisors: Sebastian

More information

Lecture 17 November 7

Lecture 17 November 7 CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has

More information

EULER S FORMULA AND THE FIVE COLOR THEOREM

EULER S FORMULA AND THE FIVE COLOR THEOREM EULER S FORMULA AND THE FIVE COLOR THEOREM MIN JAE SONG Abstract. In this paper, we will define the necessary concepts to formulate map coloring problems. Then, we will prove Euler s formula and apply

More information

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 2321-3469 COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

More information

Module 11. Directed Graphs. Contents

Module 11. Directed Graphs. Contents Module 11 Directed Graphs Contents 11.1 Basic concepts......................... 256 Underlying graph of a digraph................ 257 Out-degrees and in-degrees.................. 258 Isomorphism..........................

More information

Efficiently Mining Positive Correlation Rules

Efficiently Mining Positive Correlation Rules Applied Mathematics & Information Sciences An International Journal 2011 NSP 5 (2) (2011), 39S-44S Efficiently Mining Positive Correlation Rules Zhongmei Zhou Department of Computer Science & Engineering,

More information

An Enhanced Algorithm to Find Dominating Set Nodes in Ad Hoc Wireless Networks

An Enhanced Algorithm to Find Dominating Set Nodes in Ad Hoc Wireless Networks Georgia State University ScholarWorks @ Georgia State University Computer Science Theses Department of Computer Science 12-4-2006 An Enhanced Algorithm to Find Dominating Set Nodes in Ad Hoc Wireless Networks

More information

KNOWLEDGE GRAPHS. Lecture 1: Introduction and Motivation. TU Dresden, 16th Oct Markus Krötzsch Knowledge-Based Systems

KNOWLEDGE GRAPHS. Lecture 1: Introduction and Motivation. TU Dresden, 16th Oct Markus Krötzsch Knowledge-Based Systems KNOWLEDGE GRAPHS Lecture 1: Introduction and Motivation Markus Krötzsch Knowledge-Based Systems TU Dresden, 16th Oct 2018 Introduction and Organisation Markus Krötzsch, 16th Oct 2018 Knowledge Graphs slide

More information

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING 1 SONALI SONKUSARE, 2 JAYESH SURANA 1,2 Information Technology, R.G.P.V., Bhopal Shri Vaishnav Institute

More information

How do people tag pictures? A study with Facebook application

How do people tag pictures? A study with Facebook application COMP3750 Final Report How do people tag pictures? A study with Facebook application Author: Victor Hartanto Wibisono, U4644427 Supervisor: Dr. Lexing Xie 1 June 2012 Acknowledgements This research would

More information