Iterative Graph Summarization based on Grouping
Sirui Li
Supervisor: Dr. Qing Wang
COMP4560: Advanced Computing Project
Australian National University
Semester 1, 2017
May 26, 2017
Acknowledgements

First and foremost, I would like to express my gratitude to my supervisor, Dr. Qing Wang, who gave me the opportunity to work on this project, which I am really interested in. Without her assistance and guidance, the project and the report could not have been accomplished. Secondly, I would like to thank Dr. Weifa Liang, the course convener, for providing helpful assistance on academic writing and presentation. Most importantly, my thanks go to my family and friends; without their understanding and support, the project could not have been finalized.
Abstract

Analysing large datasets to discover useful insights is an important task in many application domains. To find this information, graphs are widely used to model the entities (as nodes) and their relationships (as edges) in the datasets. These graphs may contain millions of nodes and edges. Graph summarization is a critical technique for reducing the complexity of large graphs and understanding the information they contain. There are two major families of graph summarization methods: statistical methods and grouping-based methods. Statistical methods, such as degree distributions, are useful, but it is hard to control the resolution of their output [9]. Grouping-based methods, also called aggregation-based methods, produce summaries of large graphs by grouping nodes based on attributes and relationships. Existing grouping-based methods, such as SNAP and k-snap, allow users to select the node attributes and relationships of the summaries, but they require a high degree of user interaction: users need to know details about the nodes and edges of the summaries before the analysis. This project proposes an algorithm that summarizes graphs iteratively until a resolution specified by the user is reached. The algorithm only requires users to specify some entity sets and the resolution; it then finds the intermediate nodes that appear during the search and groups these nodes.

Keywords: Graph Summarization, Statistical Methods, Grouping-based Methods, Aggregation-based Methods
Contents

Acknowledgements
Abstract
1 Introduction
  1.1 Objectives
  1.2 Contributions
  1.3 Outline
2 Related Work
  2.1 SNAP and k-snap Algorithms
  2.2 ESPRESSO Algorithm and REX Algorithm
3 Methodology
  3.1 Problem Definition
  3.2 Algorithm Design
    3.2.1 Path-based Algorithm
    3.2.2 Ripple Algorithm
4 Evaluation
  4.1 Experiment Design
  4.2 Experimental Results
    4.2.1 Path-based Algorithm
    4.2.2 Ripple Algorithm
5 Conclusion and Future Work
Appendices
A Independent Study Contract
B Project Description
C Software Description
D README
Chapter 1 Introduction

Modelling real-life entities and the relationships between them as graphs is a common way to extract the underlying information in large datasets. One typical application is social network analysis: mining social networks can help analysts understand how people interact with each other. However, extracting useful information directly from large graphs is difficult. Hence, graph summarization is needed to condense large graphs into small graphs that can be easily understood.

Graph summarization is the process of transforming large graphs into concise forms [9]. Generally, two kinds of methods are used in graph summarization: statistical methods and grouping-based methods. Statistical methods [9] include degree distributions, hop-plots and so on. The major downside of these methods is that the resolution of the summaries cannot be controlled by users. Grouping-based methods, also known as aggregation-based methods [9], produce summary graphs by grouping nodes based on selected attributes of nodes and edges.

Figure 1.1, which is taken from [9], presents an overall picture of the grouping-based graph summarization process. The graph on the left has 7,445 vertices and 19,971 edges. Given some conditions, the summary graph shown on the right is produced. In the summary graph, each vertex is a set of vertices from the original graph, called a supernode or group; each edge represents the connection between two sets of vertices, called a superedge or group relationship. The summary graph on the right helps readers understand the relationships among different sets of vertices in the original graph.
Figure 1.1: A summary graph (right) is generated for the original graph (left), which is taken from [9]

Two operations have been proposed for grouping-based methods: SNAP and k-snap [9]. These two operations produce a summary graph based on user-selected vertex attributes and relationships. Both of them require attribute homogeneity, that is, vertices in the same group must have the same value for each user-selected attribute [9]. The SNAP operation also requires relationship homogeneity, that is, if there exists a relationship between two groups, then every node in one group has to connect to at least one node in the other group [9]. Nonetheless, relationship homogeneity is hard to achieve in real-life applications because the data may contain too much noise. Thus, the k-snap operation relaxes this requirement and allows users to control the size of the summary graph.

Although the SNAP and k-snap operations can produce high-quality summary graphs, they still have drawbacks. In particular, they require a high degree of user interaction: users are expected to know details about the summary graphs before building them. This is impractical in real-life applications, as most users lack knowledge about which relationships are useful.

1.1 Objectives

The goal of my project is to propose an iterative algorithm for grouping-based summarization based on (1) the entity sets a user is interested in and (2) a user-specified resolution. It can answer questions of the form "Have people at ANU collaborated with people at Peking University?" or "Have people at ANU collaborated with people at Peking University in 27th international ACM conference?". The specific objectives are:

1. Develop an iterative algorithm for grouping-based graph summarization.
2. Rank the summary graph.
3. Evaluate the effectiveness and efficiency of the algorithm.

Note that two algorithms (Path-based Algorithm and Ripple Algorithm) are introduced in this report. Path-based Algorithm can only be applied when there are exactly 2 user-specified entity sets. Ripple Algorithm improves on Path-based Algorithm so that users can specify 2 or more entity sets to study.

1.2 Contributions

My work addresses the limitations of previous work and provides the following novel features:

- Users can specify an arbitrary number of entity sets of interest.
- Users can specify the threshold of path lengths between user-specified entity sets.
- Entities that appear on one or more paths between user-specified entity sets can be grouped in the summary graph. How these intermediate entities are grouped is discussed further in Chapter 3.
- Users can control the resolution of the summary graph, i.e. users can specify how many supernodes the summary graph contains.

1.3 Outline

This project report is structured as follows:

- Chapter 2 presents related work on graph summarization and discusses its limitations.
- Chapter 3 presents the notions I define for describing the algorithms and explains how the proposed algorithms (Path-based Algorithm and Ripple Algorithm) work.
- Chapter 4 evaluates the performance of these two algorithms in terms of participant ratio and running time.
- Chapter 5 concludes this report and discusses future work.
Chapter 2 Related Work

Graph summarization is widely used in many application domains. Various graph summarization techniques have been proposed to help users understand the characteristics of large graphs [6], [10]. A recent survey by Yike Liu [5] provides a comprehensive overview of the state-of-the-art methods for graph summarization.

2.1 SNAP and k-snap Algorithms

Grouping-based methods are among the most popular techniques for graph summarization. These methods aggregate the vertices of an original graph into supernodes and connect them with superedges, producing a summary graph [5]. Several recent studies have investigated how to build a summary graph by applying grouping-based methods. Summarization by Grouping Nodes on Attributes and Pairwise Relationships (SNAP) and k-snap are two database-style algorithms for summarizing graphs [9]. They can only deal with categorical node attributes [10]. The SNAP algorithm produces a summary graph by grouping nodes based on user-selected node attributes and relationships. The k-snap algorithm also allows users to customize the summary graph [10]; it extends the SNAP algorithm by allowing users to control the resolution of the summary graph through drill-down and roll-up operations.

Although the SNAP and k-snap algorithms can produce a summary graph based on user-selected node attributes and relationships, they still have drawbacks. In many real-life cases, users do not know which relationships between their selected nodes are useful, and they do not know whether other entities exist between their selected nodes. Thus, users might prefer to specify only the entity sets they are interested in, and then expect an algorithm to return a summary graph that discovers the other entities lying between the specified entity sets.
2.2 ESPRESSO Algorithm and REX Algorithm

Many previous works focus on extracting relationships between input entities from large graphs. For example, [7] proposed a connection subgraph discovery algorithm over RDF graphs. The notion of a connection subgraph, proposed by [2], is to extract a connected subgraph based on a pair of query vertices, where the number of vertices in the subgraph can be controlled. [4] addressed the extraction of subgraphs based on a set of query vertices in entity-relationship graphs. [1] partitioned graphs with respect to the context of a vertex. The ESPRESSO algorithm proposed by [8] and the REX algorithm proposed by [3] build on these previous works and focus on producing a concise summary graph that explains the relationship between two sets of entities in a knowledge graph. The limitation of these two algorithms is that users can only specify two entity sets of interest, for which the algorithms then produce related explanations.
Chapter 3 Methodology

In this chapter, I present my algorithms for producing summary graphs from a given data graph. For clarity, I first define the notions of data graph, summary graph, path, path length and the resolution of a summary graph. Then I introduce each algorithm in detail. Note that each algorithm introduced in this report can be used on both directed and undirected graphs. For simplicity, I only use undirected graphs as examples when presenting the algorithms.

3.1 Problem Definition

Figure 3.1: A summary graph G (b) is generated from a data graph D (a), which is taken from [9]

A data graph, denoted by $D = (V_D, E_D, T, A, \tau, \psi)$, is a graph with a set $V_D$ of entities as vertices, a set $E_D$ of relationships as edges, a set $A$ of attributes, a set $T$ of entity types, and two assignment functions $\tau : V_D \to T$ and $\psi : V_D \to A$ that assign to each vertex an entity type (from the set $T$ of possible types) and a set of attributes (from the set $A$ of possible attribute values). Figure 3.1(a) presents a data graph $D$. Each vertex in $D$ has an entity type $t \in T$ and is associated with a set of attributes in $A$.

A summary graph, denoted by $G = (V_G, E_G, T, A, \tau, \psi)$, is a graph with a set $V_G$ of entity sets as vertices (supernodes), a set $E_G$ of relationships as edges (superedges), and two assignment functions $\tau : V_G \to T$ and $\psi : V_G \to A$ that assign to each vertex an entity type (from the set $T$ of possible types) and a set of attribute values (from the set $A$ of possible attributes). Here, each supernode in $G$ is a subset of the vertices in $V_D$ and each superedge in $G$ is a subset of the edges in $E_D$. Figure 3.1(b) presents a summary graph $G$ with 4 supernodes (S1, S2, S3 and S4) and 4 superedges (S1-S2, S2-S3, S2-S4, S3-S4).

Following the definition in Wikipedia, a path in a graph is "a finite or infinite sequence of edges which connect a sequence of vertices which, by most definitions, are all distinct from one another". I use $P$ to refer to a path in this report. The path length, denoted by $L$, is the number of edges occurring in a path. For example, the paths from S1 to S3 in Figure 3.1(b) and their lengths are:

path ($P$) | length ($L$)
$P_1 = \langle S1, S2, S3 \rangle$ | 2
$P_2 = \langle S1, S2, S4, S3 \rangle$ | 3

The resolution of a graph is the size of the graph, i.e. the number of vertices it contains. In this report, I use $r$ to refer to the user-specified resolution. In Figure 3.1(b), the resolution is 4.

3.2 Algorithm Design

In this section, I introduce the two algorithms for graph summarization proposed in this report, i.e. Path-based Algorithm and Ripple Algorithm.

In Path-based Algorithm, users provide a pair of entity sets and a threshold of path lengths as input.
Given this input, the algorithm finds the intermediate entities that appear on the related paths and groups them by entity type. After grouping, these entity sets are presented as supernodes in the summary graph $G$. If there is an edge between node A and node B in the data graph, then there must exist a superedge between the supernodes that contain node A and node B, respectively.
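The overview of Path-based Algorithm above can be made concrete with a minimal Python 3 sketch. It is not the project's own implementation: the names `simple_paths` and `path_based_summary`, the adjacency-dictionary representation, and the toy graph (four authors and two papers) are all assumptions chosen for illustration, and the sketch treats the graph as undirected and ignores edge labels.

```python
from collections import defaultdict


def simple_paths(adj, source, targets, max_len):
    """Yield paths without repeated vertices from source to a vertex in targets,
    using at most max_len edges (depth-first search with an explicit stack)."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node in targets and node != source:
            yield path
            continue  # stop at the first vertex of the other entity set
        if len(path) - 1 == max_len:
            continue  # the length threshold L has been reached
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def path_based_summary(adj, node_type, v1, v2, max_len):
    """Group the vertices on short paths between v1 and v2 by entity type
    and derive superedges from the edges on those paths."""
    def label(v):
        # Input entity sets keep their own supernodes; others use their type.
        if v in v1:
            return "S1"
        if v in v2:
            return "S2"
        return node_type[v]

    supernodes = defaultdict(set)
    supernodes["S1"] = set(v1)
    supernodes["S2"] = set(v2)
    superedges = set()
    for source in v1:
        for path in simple_paths(adj, source, v2, max_len):
            for v in path:
                supernodes[label(v)].add(v)
            for a, b in zip(path, path[1:]):
                superedges.add(tuple(sorted((label(a), label(b)))))
    return dict(supernodes), superedges


# Hypothetical toy data graph: authors 1-4 and papers 5-6.
edges = [(1, 5), (2, 5), (3, 5), (2, 6), (4, 6)]
node_type = {1: "author", 2: "author", 3: "author", 4: "author",
             5: "paper", 6: "paper"}
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

supernodes, superedges = path_based_summary(adj, node_type, {1, 2}, {3, 4}, max_len=2)
print(supernodes)
print(sorted(superedges))  # [('S1', 'paper'), ('S2', 'paper')]
```

With a threshold of 2, the only intermediate vertices are the two papers, so the sketch yields one intermediate supernode connected to both input supernodes.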
In Ripple Algorithm, users provide a number of entity sets of interest and a threshold of path lengths as input. Intermediate entities are found by Breadth-First Search (BFS). After grouping these intermediate entities, an initial summary graph is returned. This algorithm further allows users to specify the resolution $r$ of the final summary graph $G$, and the initial summary graph is split until its resolution meets the user requirement.

The major difference between these two algorithms is that Path-based Algorithm can only be used to find relationships between a pair of entity sets, whereas Ripple Algorithm can be used to find relationships among two or more entity sets. In principle, Ripple Algorithm places no limit on the number of entity sets given as input.

3.2.1 Path-based Algorithm

In this section I discuss the details of Path-based Algorithm. A high-level description is provided in Figure 3.2.

    Input:
      - A data graph: D
      - A pair of entity sets (V_1, V_2)
      - Threshold of path lengths: L
    Output:
      - A summary graph G

    Path-based Algorithm
    1: S_1 = V_1, S_2 = V_2    \\ initialize supernode 1 (S_1) and supernode 2 (S_2)
    2: find all paths whose lengths are not greater than L between S_1 and S_2
    3: group vertices on related paths by entity types, each group is a supernode
    4: if there exists an edge between node A and node B do:
    5:     add a superedge between S_i and S_j which contain node A and node B, respectively
    6: return G

Figure 3.2: Path-based Algorithm

Let $S_i$ be a supernode in a summary graph $G$. There are 4 phases in Path-based Algorithm:

1. Initialization (line 1): find the vertices that meet the input requirements for the entity sets ($V_1$ and $V_2$), and save them in the corresponding supernodes ($S_1$ and $S_2$).
2. Find paths (line 2): find the paths between the two supernodes $S_1$ and $S_2$ whose lengths are not greater than $L$.
3. Group vertices (line 3): group the vertices on these paths based on entity types.
4. Build superedges (line 4 to line 5): if there is an edge between node A and node B on the paths, add a superedge between the supernodes that contain node A and node B, respectively.

Finally, Path-based Algorithm returns a summary graph.

Figure 3.3: A data graph example (entity types: author, paper, proceeding, journal; edge labels include writes and in_pr)

In the following, I illustrate how Path-based Algorithm works with an example.

Example 1. Assume Figure 3.3 is the input data graph $D$ and users ask the question: "Have people at ANU collaborated with people at Peking University?"

Phase 1: Initialization. Assume that people at ANU are grouped into $V_1$ and people at Peking University are grouped into $V_2$. That is, $V_1 = \{1, 2\}$ and $V_2 = \{3, 4\}$. Hence, $S_1 = V_1 = \{1, 2\}$ and $S_2 = V_2 = \{3, 4\}$.

Phase 2: Find paths. Suppose that the user-specified threshold is $L = 2$; the algorithm therefore finds all paths whose lengths are not greater than 2. The related paths are:

$P_1$: $\langle 1, 5, 3 \rangle$
$P_2$: $\langle 2, 5, 3 \rangle$
$P_3$: $\langle 2, 6, 4 \rangle$

Phase 3: Group vertices. The vertices on the above 3 paths are grouped based on entity types. The grouping result is:

entity type | vertices
$S_1$ | {1, 2}
$S_2$ | {3, 4}
paper | {5, 6}

Phase 4: Build superedges. Take $P_1$ as an example: there is an edge between node 1 and node 5. Because node 1 is in $S_1$ and node 5 is in paper, there should be a superedge between $S_1$ and paper. Superedges are added in the same way for each edge on each path. After all superedges are built, the summary graph of this example is shown in Figure 3.4. So the answer to the previous question is that people at ANU co-wrote 2 papers (5 and 6) with people at Peking University.

Figure 3.4: A summary graph G, with S3 = {5, 6} and superedges S1 -writes- S3 -writes- S2

However, if users specify $n$ supernodes ($n > 2$), the algorithm has to repeat phase 2 (Find paths) $\frac{n(n-1)}{2}$ times to find all paths between every pair of input supernodes, which is inefficient. I thus developed Ripple Algorithm to address this limitation and to allow users to control the resolution.

3.2.2 Ripple Algorithm

Path-based Algorithm produces a summary graph based on a pair of input entity sets and the path length threshold. Unfortunately, studying only two entity sets is often not sufficient for real-life datasets, as most real-life data graphs are complex. Users might want to study the relationships among many entity sets to see how they influence each other. Ripple Algorithm is introduced to remove this limitation of Path-based Algorithm and further allows users to specify the resolution $r$ of the final summary graph $G$. Ripple Algorithm iteratively splits supernodes until the resolution of the summary graph meets the user-specified resolution $r$. Before introducing Ripple Algorithm, I define two ratios: the splitting ratio and the participant ratio.

Splitting ratio. The splitting ratio helps Ripple Algorithm decide which group to split and how to split it. Given a summary graph $G$ with $n$ supernodes, a splitting ratio is calculated for each attribute. The attribute with the minimum splitting ratio is picked as the splitting attribute, and the supernode that has the splitting attribute is chosen as the splitting supernode. The splitting ratio $s$ is defined as follows:

$$s = \frac{|M(G_i, A)|}{|G_i|} \qquad (3.1)$$

where $M(G_i, A) = \{u.a \mid u \in G_i\}$ and $u.a$ refers to the value of attribute $A$ of vertex $u$.

Participant ratio. To determine whether the relationship between two supernodes is strong or not, I define a ratio $p$ that reflects it statistically. Given the relationship $E$ between two groups $(G_i, G_j)$, the following equation is used to calculate the strength of this group relationship:

$$p = \frac{|N(G_i, G_j)|}{|G_i|} \qquad (3.2)$$

where $N(G_i, G_j) = \{u \mid u \in G_i \wedge \exists v \in G_j : (u, v) \in E\}$. The larger the participant ratio is, the stronger the relationship is.

A high-level description of Ripple Algorithm is provided in Figure 3.5.
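As a worked illustration of equations (3.1) and (3.2), the following Python 3 sketch computes both ratios on small hand-made sets. The function names, the nested-dictionary layout of the attributes, the attribute names `venue` and `affiliation`, and all data values are assumptions made for this example; edges are treated as undirected.

```python
def splitting_ratio(group, attr_of):
    """Equation (3.1): distinct attribute values in the group over the group size."""
    values = {attr_of[u] for u in group}
    return len(values) / len(group)


def participant_ratio(group_i, group_j, edges):
    """Equation (3.2): share of vertices in group_i with an edge to group_j."""
    linked = {
        u for u in group_i
        if any((u, v) in edges or (v, u) in edges for v in group_j)
    }
    return len(linked) / len(group_i)


def pick_split(supernodes, attributes):
    """Return (ratio, supernode, attribute) for the smallest splitting ratio.

    attributes maps supernode name -> attribute name -> {vertex: value}.
    """
    candidates = [
        (splitting_ratio(members, values), name, attr)
        for name, members in supernodes.items()
        for attr, values in attributes.get(name, {}).items()
    ]
    return min(candidates)


# Hypothetical data: four papers, two authors, three "writes" edges.
papers, authors = {5, 6, 7, 8}, {1, 2}
writes = {(1, 5), (2, 5), (2, 6)}
venue = {5: "j1", 6: "j1", 7: "j2", 8: "j1"}
affiliation = {1: "ANU", 2: "PKU"}

s = splitting_ratio(papers, venue)             # 2 distinct venues / 4 papers
p = participant_ratio(papers, authors, writes)  # papers 5 and 6 are linked
best = pick_split(
    {"paper": papers, "author": authors},
    {"paper": {"venue": venue}, "author": {"affiliation": affiliation}},
)
print(s, p, best)  # 0.5 0.5 (0.5, 'paper', 'venue')
```

Note that $p$ is not symmetric: computed from the authors' side, both authors are linked to a paper, so `participant_ratio(authors, papers, writes)` is 1.0.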
    Input:
      - A data graph: D
      - User-specified entity sets: (V_1, V_2, ..., V_n)
      - Threshold of path lengths: L
      - Resolution: r
    Output:
      - A summary graph G

    Ripple Algorithm
    1: S_1 = V_1, S_2 = V_2, ..., S_n = V_n    \\ initialize supernode 1 to supernode n
    2: step = 1, SearchVertices = ∅, G_i = ∅    \\ initialize step, SearchVertices and G_i
    3: for S_i in (S_1, S_2, ..., S_n) do:
    4:     G_i = {vertices in S_i}
    5:     SearchVertices = vertices in S_i
    6:     while step < L do:
    7:         SearchVertices_iter = ∅
    8:         for each supernode S in G_i do:
    9:             if S has vertices in SearchVertices do:
    10:                apply BFS on such vertices to find neighbors
    11:                if neighbor n has not been searched do:
    12:                    SearchVertices_iter.append(n)
    13:                group neighbors
    14:                add new groups to G_i based on the result on line 13
    15:        SearchVertices = SearchVertices_iter
    16:        merge groups in G_i who are overlapping
    17:        merge same-type groups who have the same neighbors and relationships
    18:        step += 1
    19: merge all G_i into one summary graph G
    20: merge groups who are overlapping
    21: merge same-type groups who have the same neighbors and relationships
    22: while size(G) < r do:
    23:     find splitting attribute and splitting group
    24:     split G
    25: return G

Figure 3.5: Ripple Algorithm

There are 4 phases in Ripple Algorithm:

1. Initialization (line 1 to line 2): find the vertices that meet the input requirements for the entity sets $\{V_1, \ldots, V_n\}$, and save them in the corresponding supernodes $\{S_1, \ldots, S_n\}$. SearchVertices is a set that holds the vertices that have been found but not yet searched; they will be searched in the next step. The summary graph obtained by searching from $S_i$ is saved in a set named $G_i$.
2. Iterative search for each input supernode $S_i$ (line 3 to line 18): for each input supernode $S_i$, perform an $L$-step BFS and group the neighboring vertices that are found. The grouping result is saved in $G_i$.
3. Merge each $G_i$ (line 19 to line 21): merge all $G_i$ into one summary graph $G$.
4. Resolution control (line 22 to line 25): users specify the resolution $r$. The choice of the splitting supernode and the splitting attribute is based on the splitting ratio. Note that the iterative group splitting continues until the resolution of the current summary graph is equal to or greater than $r$.

Because each input entity set is searched for $L$ steps in the above phases, the length of a path discovered between any pair of input entity sets lies in $[1, 2L]$.

Figure 3.6: High-level overview of Ripple Algorithm with 2 input entity sets

In the following, I illustrate how Ripple Algorithm works with an example.

Example 2. Consider the data graph in Figure 3.3, and suppose Ripple Algorithm is used to answer the question: "Have people at ANU collaborated with people at Peking University in 27th international ACM conference?"

Phase 1: Initialization. $S_1 = V_1 = \{1, 2\}$ contains people at ANU, $S_2 = V_2 = \{3, 4\}$ contains people at Peking University and $S_3 = V_3 = \{10\}$ is the conference proceeding. Assume that users specify the threshold of path lengths as 2.

Phase 2: Iterative search for each input supernode $S_i$. For each input supernode $S_i$, perform a 2-step BFS and obtain the summary graph $G_i$.

Phase 3: Merge each $G_i$. After merging all $G_i$, the summary graph of this example is shown in Figure 3.7. So the answer to the question is that the papers in $S_4$ are collected in some journals ($S_5$) and in the conference proceeding ($S_3$). These papers ($S_4$) are written by people at ANU and people at Peking University.

Figure 3.7: Summary graph G, with S4 = {5, 6, 7, 8} and S5 = {11}; edge labels include cites and in_jo

Phase 4: Resolution control. Assume that users specify $r = 10$. The algorithm will yield a summary graph whose resolution is equal to or greater than 10.
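Phases 1 to 3 of Ripple Algorithm can be sketched in Python 3 as follows. This is a deliberately simplified illustration and not the project's code: it groups intermediate vertices by entity type only (the merge rules on lines 16, 17, 20 and 21 of Figure 3.5 are reduced to a union of same-type groups), it performs $L$ search steps per input supernode as stated in the phase description, it omits resolution control, and the edge list is a hypothetical graph chosen to be consistent with Example 2 rather than a copy of Figure 3.3.

```python
from collections import defaultdict


def ripple_search(adj, node_type, seeds, steps):
    """BFS for `steps` layers from one input supernode; group new vertices by type."""
    seen = set(seeds)
    frontier = set(seeds)
    groups = defaultdict(set)
    for _ in range(steps):
        next_frontier = set()
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    next_frontier.add(v)
                    groups[node_type[v]].add(v)
        frontier = next_frontier
    return groups


def ripple_summary(adj, node_type, entity_sets, steps):
    """Search from every input entity set, then merge the results into one summary."""
    supernodes = {"S%d" % i: set(vs) for i, vs in enumerate(entity_sets, 1)}
    label = {v: name for name, vs in supernodes.items() for v in vs}
    inputs = set(label)
    for vs in entity_sets:
        for type_name, members in ripple_search(adj, node_type, vs, steps).items():
            for v in members - inputs:  # keep only intermediate vertices
                supernodes.setdefault(type_name, set()).add(v)
                label[v] = type_name
    superedges = set()
    for u in label:
        for v in adj[u]:
            if v in label:
                superedges.add(tuple(sorted((label[u], label[v]))))
    return supernodes, superedges


# Hypothetical edges: writes, in_pr (to proceeding 10), in_jo (to journal 11), cites.
edges = [(1, 5), (2, 5), (3, 5), (2, 6), (4, 6), (1, 7), (4, 8),
         (5, 10), (7, 10), (6, 11), (8, 11), (7, 5)]
node_type = {1: "author", 2: "author", 3: "author", 4: "author",
             5: "paper", 6: "paper", 7: "paper", 8: "paper",
             10: "proceeding", 11: "journal"}
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

supernodes, superedges = ripple_summary(adj, node_type, [{1, 2}, {3, 4}, {10}], steps=2)
print(supernodes)
print(sorted(superedges))
```

On this toy input the merged summary has five supernodes: the three input sets, one group of papers and one journal group, which mirrors the shape of the summary graph described in Example 2.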
Chapter 4 Evaluation

This chapter presents the experiments on my proposed algorithms. It covers the experiment design and the experimental results. Section 4.1 introduces the evaluation dataset and the experimental environment. Section 4.2 discusses the performance of the two algorithms in terms of running time. The summary graphs produced by Ripple Algorithm are also discussed based on the participant ratio.

4.1 Experiment Design

The algorithms were implemented in Python. All experiments were performed on a MacBook Pro with a 2.2 GHz Intel Core i7 CPU and 16 GB of RAM. I used one dataset in my experiments, the ACM bibliographical network (ACM network), which has 5 entity types and 6 relationship types. In the ACM network, each paper is written by at least one author and each paper is collected in a conference proceeding or a journal. A paper can cite other papers or be cited by other papers. Each conference proceeding or journal is published by a publisher. The files in the ACM network are in CSV format, so I constructed the data graph of the ACM network based on the given relation schemas. A detailed diagram of the relation schemas of this dataset is shown in Figure 4.1.
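Building a data graph from relation files of this kind can be sketched as below. The column names `author_id` and `paper_id`, the sample rows, and the helper `load_relation` are hypothetical, since the actual schemas of the ACM network files are not reproduced in this report; the sketch uses only the Python 3 standard library and stores the graph as an undirected adjacency dictionary.

```python
import csv
import io
from collections import defaultdict

# Hypothetical contents of a "writes" relation file.
WRITES_CSV = """author_id,paper_id
a1,p1
a2,p1
a2,p2
"""


def load_relation(handle, src_col, src_type, dst_col, dst_type, adj, node_type):
    """Add one relation table to the graph: each row becomes an undirected edge."""
    for row in csv.DictReader(handle):
        u, v = row[src_col], row[dst_col]
        node_type[u] = src_type
        node_type[v] = dst_type
        adj[u].add(v)
        adj[v].add(u)


adj = defaultdict(set)
node_type = {}
load_relation(io.StringIO(WRITES_CSV), "author_id", "author",
              "paper_id", "paper", adj, node_type)
print(dict(adj))
print(node_type)
```

For a real file, `io.StringIO(WRITES_CSV)` would be replaced by an open file handle, and `load_relation` would be called once per relation table.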
Figure 4.1: The Relation Schemas of ACM Bibliographical Network

Table 4.1: Size of ACM network

Table Name | # Records
author | 5500
paper | 5579
proceeding | 6421
journal | 128
publisher |

4.2 Experimental Results

In this section, Path-based Algorithm and Ripple Algorithm are tested individually. For Path-based Algorithm, I tested the running time under different conditions. The experimental results of Ripple Algorithm are presented from 2 aspects. First, I tested Ripple Algorithm without iterative splitting. The second evaluation tested the resolution control of Ripple Algorithm.
4.2.1 Path-based Algorithm

To test the running time, Path-based Algorithm was applied to 3 test cases (shown in the following table). Each test case was run 3 times, once for each threshold of path lengths ($L = 3, 4, 5$). The running times of the 3 test cases with the different thresholds are shown in Figure 4.2(a). We can draw the following conclusion: for a given pair of entity sets, the running time of Path-based Algorithm increases as the threshold of path lengths increases.

Figure 4.2: Running time of two algorithms. (a) Path-based Algorithm; (b) Ripple Algorithm

4.2.2 Ripple Algorithm

First, Ripple Algorithm was applied to 3 test cases (shown in the following table). For each test case, the algorithm was run 3 times, once for each threshold of path lengths ($L = 2, 3, 4$). The running times of the 3 test cases are shown in Figure 4.2(b). We can draw the following conclusion: for given entity sets, the running time of Ripple Algorithm increases as the threshold increases.

Figure 4.3(a) is the summary graph of TC1 ($L = 2$) with participant ratios and Figure 4.3(b) is the summary graph of TC1 ($L = 3$) with participant ratios. Comparing these two summary graphs shows that, for a given number of entity sets, most participant ratios decrease slightly as the threshold of path lengths increases.

Figure 4.3: Summary graphs. (a) TC1 (L = 2); (b) TC1 (L = 3); (c) TC1 (L = 2, r = 15)

To test the resolution control of Ripple Algorithm, I used the summary graph of TC1 ($L = 2$) and assumed a user-specified resolution of $r = 15$. After calculating the splitting ratio for each attribute of the summary graph in Figure 4.3(a), the algorithm finds that the splitting supernode is paper 1 and splits paper 1 based on the joid attribute. Figure 4.3(c) is the splitting result.
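Running times of the kind reported in this chapter can be collected with a small harness such as the following Python 3 sketch. The helper `time_runs` and the stand-in workload `dummy_run` are assumptions for illustration; in a real measurement, `dummy_run` would be replaced by a call to one of the algorithms with the given threshold.

```python
import time


def time_runs(run, thresholds, repeats=1):
    """Return {threshold: average wall-clock seconds} for calling run(threshold)."""
    timings = {}
    for threshold in thresholds:
        start = time.perf_counter()
        for _ in range(repeats):
            run(threshold)
        timings[threshold] = (time.perf_counter() - start) / repeats
    return timings


def dummy_run(threshold):
    """Stand-in workload whose cost grows with the threshold."""
    return sum(i * i for i in range(10 ** threshold))


timings = time_runs(dummy_run, [2, 3, 4])
for threshold, seconds in sorted(timings.items()):
    print("L = %d: %.6f s" % (threshold, seconds))
```

Averaging over several repeats reduces the influence of one-off delays on short runs.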
Chapter 5 Conclusion and Future Work

In this project, I have developed 2 algorithms for grouping-based summarization: Path-based Algorithm and Ripple Algorithm.

Path-based Algorithm can summarize a data graph based on a pair of input entity sets and a threshold of path lengths. However, Path-based Algorithm is inefficient if users specify more than 2 entity sets, as path finding can only be carried out between two entity sets at a time. That is, if users input $n$ supernodes ($n > 2$), Path-based Algorithm has to find paths $\frac{n(n-1)}{2}$ times.

Ripple Algorithm allows users to select two or more entity sets; it finds all intermediate vertices between the input entity sets and discovers their relationships. Users can also control the resolution of the summary graph. However, the algorithm is sensitive to noise. If there is too much noise in the data graph, it is likely to produce many supernodes in the summary graph, some of which may contain just one vertex.

The major future work for Ripple Algorithm is to propose another splitting measure. The current splitting method of Ripple Algorithm is the splitting ratio measure, which chooses the attribute with the minimum splitting ratio as the splitting attribute. However, it does not work well in some cases, and it tends to produce supernodes that contain just one vertex.
Appendices

Appendix A Independent Study Contract
Appendix B Project Description
Appendix C Software Description
Software Brief

There are two directories in this software folder, i.e. the "Path-based" directory and the "Ripple" directory.

Path-based directory

This directory implements Path-based Algorithm. The folder named "files" holds the test dataset (ACM network). There is only one Python file, named "path-based.py", in this directory.

Directly used modules: six modules from the Python library are used directly.

Functions: six functions are used in this program. One function, "RepresentsInt(s)", was obtained from an external source. Details about each function can be found in the README.

1. RepresentsInt(s)
2. find_selectedsn()
3. find_allpaths(l)
4. cate_type(listofpaths)
5. build_adlist(paths)
6. build_sg(nodeset)

Ripple directory

This directory implements Ripple Algorithm. The folder named "files" holds the test dataset (ACM network). There are 3 Python files in this directory: "Dataset_To_Graph_1.py", "Search_Each_Supernode_2.py" and "Resolution_Control_3.py".

Directly used modules: seven modules from the Python library are used directly.

Functions: thirteen functions are used in this program. The functions RepresentsInt(s) and to_graph(combinedgroup) + to_edge(l) are the work of others, obtained from external sources. Details about each function can be found in the README.

1. RepresentsInt(s)
2. buildadlist(node1, node2)
3. find_selectedsn()
4. typeequalsupernode() + typenotequalsupernode()
5. find_out_neighs(node, grouptype)
6. find_in_neighs(node, grouptype)
7. to_graph(combinedgroup) + to_edge(l)
8. comb_groups(gg)
9. find_finaledges(nodeset)
10. NodesetMapAttr()
11. AttrMapNum()
12. FindSplitAttr() + FindMinNum()
13. find_relationship(groups)

Test Environment

All experiments were performed on a MacBook Pro with a 2.2 GHz Intel Core i7 CPU and 16 GB of RAM. The single test dataset, named "ACM bibliographical network", was provided by my supervisor, Qing Wang. The implementation was written in Python.

Process

To test the performance and correctness of my proposed algorithms, each algorithm was applied to 3 test cases. To compare the running times, each test case was run 3 times with different thresholds of path lengths. Details can be found in the Evaluation chapter of the report. How to run my software is explained in the README with command examples.
Appendix D README
Iterative Graph Summarization based on Grouping

This project proposes two algorithms (Path-based Algorithm and Ripple Algorithm). Both of them produce a summary graph based on user-identified supernodes by finding the intermediate supernodes between them. Ripple Algorithm further allows users to control the resolution of the summary graph.

Getting Started

This project was written and tested on macOS Sierra, so the commands shown in this README are in macOS style.

Prerequisites

1. Python 2.x. The algorithms were implemented using Python.
2. There should be a "files" folder in both the "Path-based" and "Ripple" directories, with the test dataset (ACM network) in that folder.
3. The networkx, csv, Counter, copy, groupby, timeit and itemgetter Python modules are required. Of these, only networkx is a third-party package and needs to be installed, by running "pip install networkx". csv, copy and timeit are standard-library modules, and Counter, groupby and itemgetter are provided by the standard-library modules collections, itertools and operator.
4. There should be 1 Python file in the "Path-based" directory and 3 Python files in the "Ripple" directory.

Running the tests

This section explains how to run each of the two algorithms.

Path-based Algorithm

1. Go into the "Path-based" directory in the terminal.
2. Run: python path-based.py
3. Give input as the program requires. For each attribute value, you can just give a keyword. The attributes you can choose from are listed in the D.add_node() statement.

Example
39 > cd ~/Path-based > python path-based.py > the number of attrs for supernode 1: 2 > the attribute 1 for supernode 1, use = to assign value: node_type=author > the attribute 2 for supernode 1, use = to assign value: affiliation=duke > the number of attrs for supernode 2: 2 > the attribute 1 for supernode 2, use = to assign value: node_type=author > the attribute 2 for supernode 2, use = to assign value: affiliation=ibm > threshold of path lengths: 3 The detail about the summary graph will be shown. Ripple Algorithm 1. Go into the "Ripple" directory in the terminal. 2. Run: python Resolution_Control_3.py 3. Give input as the program requires. For each attribute value, you can just give the keyword. The attribute you can choose is in the D.add_node() statement in "Dataset_To_Graph_1.py". Example > cd ~/Ripple > python Resolution_Control_3.py > the number of input supernodes: 3 > the number of attrs for supernode 1: 2 > the attribute 1 for supernode 1, use = to assign value: node_type=author > the attribute 2 for supernode 1, use = to assign value: affiliation=duke > the number of attrs for supernode 2: 2 > the attribute 1 for supernode 2, use = to assign value: node_type=author > the attribute 2 for supernode 2, use = to assign value: affiliation=ibm > the number of attrs for supernode 3: 2 > the attribute 1 for supernode 3, use = to assign value: node_type=paper > the attribute 2 for supernode 3, use = to assign value: title=algorithms > threshold of path lengths: 2 The detail about the summary graph will be shown. > specify the resolution:15 The detail about the split summary graph will be shown. Project Structure Functions in each algorithm program are introduced in this section.
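As an illustration of the vertex selection that find_selectedsn() performs in both programs, the attr=value inputs from the examples might be parsed and matched against vertices as sketched below. This is not the project's code: `parse_assignment` and `select_vertices` are hypothetical names, the toy vertices are invented, and keyword matching is assumed to be a case-sensitive substring test.

```python
def parse_assignment(text):
    """Split 'attr=value' into (attr, value); whitespace is not stripped."""
    attr, sep, value = text.partition("=")
    if not sep or not attr or not value:
        raise ValueError("expected attr=value, got %r" % text)
    return attr, value


def select_vertices(vertices, conditions):
    """Return the ids of vertices whose attributes contain every keyword.

    vertices: dict mapping vertex id -> dict of attribute name -> string
    conditions: list of (attr, keyword) pairs, matched case-sensitively
    """
    selected = []
    for vid, attrs in sorted(vertices.items()):
        if all(keyword in attrs.get(attr, "") for attr, keyword in conditions):
            selected.append(vid)
    return selected


# Toy data invented for illustration (not taken from the ACM dataset).
vertices = {
    "a1": {"node_type": "author", "affiliation": "duke university"},
    "a2": {"node_type": "author", "affiliation": "ibm research"},
    "p1": {"node_type": "paper", "title": "graph algorithms"},
}
conditions = [parse_assignment(s)
              for s in ["node_type=author", "affiliation=duke"]]
print(select_vertices(vertices, conditions))
```

Because the parser does not strip whitespace, an input such as "node_type = author" would yield the attribute name "node_type " and match nothing, which is one plausible reason for entering assignments without spaces.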
Path-based Algorithm
The code in the "Path-based" directory implements the Path-based Algorithm. There is only one Python file in this directory, path-based.py. Its functions are:
1. RepresentsInt(s): check whether a string "s" represents an integer. The code for this function was obtained from an external source.
2. find_selectedsn(): find the vertices that meet the input requirements for the entity sets.
3. find_allpaths(l): find paths (of length not greater than L) between the 2 input supernodes.
4. cate_type(listofpaths): group the vertices on the paths by entity type.
5. build_adlist(paths): build an adjacency list from the paths found.
6. build_sg(nodeset): build a summary graph.

Ripple Algorithm
The code in the "Ripple" directory implements the Ripple Algorithm. There are three Python files in this directory: "Dataset_To_Graph_1.py", "Search_Each_Supernode_2.py" and "Resolution_Control_3.py".

Dataset_To_Graph_1.py
This file builds the dataset into a data graph D. Its functions are:
1. RepresentsInt(s): check whether a string "s" represents an integer. The code for this function was obtained from an external source.
2. buildadlist(node1, node2): build adjacency lists if there is an edge between node1 and node2 in the data graph D.
3. find_selectedsn(): find the vertices that meet the input requirements for the entity sets.
4. typeequalsupernode() + typenotequalsupernode(): these two functions group the vertices in the data graph D by entity type. Here, user-identified supernodes are regarded as new entity types (Si).

Search_Each_Supernode_2.py
This file performs a breadth-first search from each input supernode and merges groups. Its functions are:
1. find_out_neighs(node, grouptype): find all "out" neighboring vertices of a group and group them by direction, entity type and edge type.
2. find_in_neighs(node, grouptype): find all "in" neighboring vertices of a group and group them by direction, entity type and edge type.
3. to_graph(combinedgroup) + to_edge(l): build a graph from the groups in "CombinedGroup" and use its connected components to find overlapping groups. The code for these functions was obtained from an external source.
4. comb_groups(gg): (1) find groups in "GG" that have the same type and the same neighbors; (2) for each shared neighbor, check whether the groups are connected to it by the same edge type. Groups that meet both (1) and (2) are merged into one group.
5. find_finaledges(nodeset): find the relationships between the groups in "nodeset".

Resolution_Control_3.py
This file lets users control the resolution of the summary graph built in "Search_Each_Supernode_2.py". Its functions are:
1. NodesetMapAttr(): find the distinct attribute values for each attribute.
2. AttrMapNum(): calculate the number of distinct values for each attribute.
3. FindSplitAttr() + FindMinNum(): find the splitting attribute and the splitting group.
4. find_relationship(groups): find the relationships between the groups in "Groups".

Known Issues
1. When finding all paths between two input entity sets, the Path-based Algorithm does not take edge direction into account, so it may take a long time if users specify a large threshold. The workaround is to restart the program with a smaller threshold value.
2. Do not put spaces around "=" when assigning values to attributes.
3. Capitalization of attribute values matters. Make sure the attribute value you choose matches the dataset in the "files" folder. If the program says "There may exist typo in your input!!", check the input and restart the program with the correct input.
4. The result of the Ripple Algorithm may contain supernodes whose names start with "M". These supernodes contain nodes that have the same entity type (node_type) as an input supernode but do not belong to it. For example, if "S1" is a supernode containing authors who work for Duke University, then "M1" is a supernode containing authors who do not work for Duke University.
5. If some supernode names end with a number, there are other supernodes with the same entity type (node_type) but different superedges. For example, the result may contain both "paper" and "paper1". Both supernodes consist of papers, but they are connected by different superedges, so they cannot be merged into one group.

Author
The code and this README were written by Sirui Li (u ).
Mail: u @anu.edu.au

Acknowledgments
Two functions in this project are the work of others, and they have been cited in this README. I really appreciate the guidance of my supervisor, Qing Wang.
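The length-bounded, direction-agnostic path search described for find_allpaths(l), followed by the grouping step of cate_type(listofpaths), can be sketched as below. This is a simplified illustration under stated assumptions, not the project's implementation: `find_paths` and `group_by_type` are hypothetical names, the graph is a plain undirected adjacency dictionary, and the toy vertices are invented.

```python
def find_paths(adj, sources, targets, max_len):
    """Enumerate simple paths with at most max_len edges from any source
    to any target, treating adj as an undirected adjacency map."""
    targets = set(targets)
    paths = []

    def extend(path, visited):
        last = path[-1]
        if last in targets and len(path) > 1:
            paths.append(list(path))
        if len(path) - 1 == max_len:
            return  # threshold reached; do not grow this path further
        for nxt in sorted(adj.get(last, ())):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for s in sorted(sources):
        extend([s], {s})
    return paths


def group_by_type(paths, node_type, endpoints):
    """Group the intermediate vertices on the paths by entity type."""
    groups = {}
    for path in paths:
        for v in path:
            if v not in endpoints:
                groups.setdefault(node_type[v], set()).add(v)
    return groups


# Toy undirected graph: two authors linked through papers.
adj = {
    "a1": {"p1", "p2"}, "a2": {"p1", "p3"},
    "p1": {"a1", "a2"}, "p2": {"a1", "p3"}, "p3": {"p2", "a2"},
}
node_type = {"a1": "author", "a2": "author",
             "p1": "paper", "p2": "paper", "p3": "paper"}

paths = find_paths(adj, {"a1"}, {"a2"}, 3)
print(paths)
print(group_by_type(paths, node_type, {"a1", "a2"}))
```

The number of simple paths can grow very quickly with the threshold, which is consistent with the long running times noted under Known Issues for large threshold values.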
Bibliography
[1] J. Cheng, Y. Ke, W. Ng, and J. X. Yu. Context-aware object connection discovery in large graphs. In Data Engineering, ICDE 09. IEEE 25th International Conference on, pages IEEE,
[2] C. Faloutsos, K. S. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages ACM,
[3] L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. Rex: explaining relationships between entity pairs. Proceedings of the VLDB Endowment, 5(3): ,
[4] G. Kasneci, S. Elbassuoni, and G. Weikum. Ming: mining informative entity relationship subgraphs. In Proceedings of the 18th ACM conference on Information and knowledge management, pages ACM,
[5] Y. Liu, A. Dighe, T. Safavi, and D. Koutra. A graph summarization: A survey. arXiv preprint arXiv: ,
[6] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages ACM,
[7] C. Ramakrishnan, W. H. Milnor, M. Perry, and A. P. Sheth. Discovering informative connection subgraphs in multi-relational graphs. ACM SIGKDD Explorations Newsletter, 7(2):56-63,
[8] S. Seufert, K. Berberich, S. J. Bedathur, S. K. Kondreddi, P. Ernst, and G. Weikum. Espresso: Explaining relationships between entity sets. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages ACM,
[9] Y. Tian and J. M. Patel. Interactive graph summarization. In Link Mining: Models, Algorithms, and Applications, pages Springer,
[10] N. Zhang, Y. Tian, and J. M. Patel. Discovery-driven graph summarization. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages IEEE, 2010.
CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has
More informationEULER S FORMULA AND THE FIVE COLOR THEOREM
EULER S FORMULA AND THE FIVE COLOR THEOREM MIN JAE SONG Abstract. In this paper, we will define the necessary concepts to formulate map coloring problems. Then, we will prove Euler s formula and apply
More informationCOMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION
International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 2321-3469 COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION
More informationModule 11. Directed Graphs. Contents
Module 11 Directed Graphs Contents 11.1 Basic concepts......................... 256 Underlying graph of a digraph................ 257 Out-degrees and in-degrees.................. 258 Isomorphism..........................
More informationEfficiently Mining Positive Correlation Rules
Applied Mathematics & Information Sciences An International Journal 2011 NSP 5 (2) (2011), 39S-44S Efficiently Mining Positive Correlation Rules Zhongmei Zhou Department of Computer Science & Engineering,
More informationAn Enhanced Algorithm to Find Dominating Set Nodes in Ad Hoc Wireless Networks
Georgia State University ScholarWorks @ Georgia State University Computer Science Theses Department of Computer Science 12-4-2006 An Enhanced Algorithm to Find Dominating Set Nodes in Ad Hoc Wireless Networks
More informationKNOWLEDGE GRAPHS. Lecture 1: Introduction and Motivation. TU Dresden, 16th Oct Markus Krötzsch Knowledge-Based Systems
KNOWLEDGE GRAPHS Lecture 1: Introduction and Motivation Markus Krötzsch Knowledge-Based Systems TU Dresden, 16th Oct 2018 Introduction and Organisation Markus Krötzsch, 16th Oct 2018 Knowledge Graphs slide
More informationIMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING
IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING 1 SONALI SONKUSARE, 2 JAYESH SURANA 1,2 Information Technology, R.G.P.V., Bhopal Shri Vaishnav Institute
More informationHow do people tag pictures? A study with Facebook application
COMP3750 Final Report How do people tag pictures? A study with Facebook application Author: Victor Hartanto Wibisono, U4644427 Supervisor: Dr. Lexing Xie 1 June 2012 Acknowledgements This research would
More information