Distributed Data Structures and Algorithms for Disjoint Sets in Computing Connected Components of Huge Network

Similar documents
Data Structures for Disjoint Sets

CS F-14 Disjoint Sets 1

Disjoint set (Union-Find)

CS 161: Design and Analysis of Algorithms

Solutions to Midterm 2 - Monday, July 11th, 2009

Lecture 13. Reading: Weiss, Ch. 9, Ch 8 CSE 100, UCSD: LEC 13. Page 1 of 29

CS170{Spring, 1999 Notes on Union-Find March 16, This data structure maintains a collection of disjoint sets and supports the following three

CSC 505, Fall 2000: Week 7. In which our discussion of Minimum Spanning Tree Algorithms and their data Structures is brought to a happy Conclusion.

Decreasing a key FIB-HEAP-DECREASE-KEY(,, ) 3.. NIL. 2. error new key is greater than current key 6. CASCADING-CUT(, )

Lecture Overview Register Allocation

CS S-14 Disjoint Sets 1

22 Elementary Graph Algorithms. There are two standard ways to represent a

CMSC 341 Lecture 20 Disjointed Sets

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Disjoint Sets 3/22/2010

Advanced algorithms. topological ordering, minimum spanning tree, Union-Find problem. Jiří Vyskočil, Radek Mařík 2012

22 Elementary Graph Algorithms. There are two standard ways to represent a

Dynamic Data Structures. John Ross Wallrabenstein

Minimum-Spanning-Tree problem. Minimum Spanning Trees (Forests) Minimum-Spanning-Tree problem

Randomized Graph Algorithms

COMP Analysis of Algorithms & Data Structures

Thomas H. Cormen Charles E. Leiserson Ronald L. Rivest. Introduction to Algorithms

A Framework for Space and Time Efficient Scheduling of Parallelism

CSE 373 MAY 10 TH SPANNING TREES AND UNION FIND

Introduction to Algorithms Third Edition

Electrical & Computer Engineering University of Waterloo Canada March 8, 2007

CPS222 Lecture: Sets. 1. Projectable of random maze creation example 2. Handout of union/find code from program that does this

Minimum Spanning Trees My T. UF

Slides for Faculty Oxford University Press All rights reserved.

Graph Theory: Introduction

( ) n 3. n 2 ( ) D. Ο

Binary Relations McGraw-Hill Education

Evaluating find a path reachability queries

Union Find 11/2/2009. Union Find. Union Find via Linked Lists. Union Find via Linked Lists (cntd) Weighted-Union Heuristic. Weighted-Union Heuristic

CS301 - Data Structures Glossary By

UML CS Algorithms Qualifying Exam Fall, 2003 ALGORITHMS QUALIFYING EXAM

CMSC 341 Lecture 20 Disjointed Sets

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY

Data Structure. IBPS SO (IT- Officer) Exam 2017

COP 4531 Complexity & Analysis of Data Structures & Algorithms

Graph algorithms based on infinite automata: logical descriptions and usable constructions

CSE 100: GRAPH ALGORITHMS

DS UNIT 4. Matoshri College of Engineering and Research Center Nasik Department of Computer Engineering Discrete Structutre UNIT - IV

Minimum Spanning Trees and Union Find. CSE 101: Design and Analysis of Algorithms Lecture 7

CMSC 341 Lecture 20 Disjointed Sets

Provably Efficient Non-Preemptive Task Scheduling with Cilk

UML CS Algorithms Qualifying Exam Spring, 2004 ALGORITHMS QUALIFYING EXAM

( D. Θ n. ( ) f n ( ) D. Ο%

looking ahead to see the optimum

Minimal Equation Sets for Output Computation in Object-Oriented Models

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

Chapter 4. Relations & Graphs. 4.1 Relations. Exercises For each of the relations specified below:

Final Examination CSE 100 UCSD (Practice)

Sparse Hypercube 3-Spanners

Search trees, tree, B+tree Marko Berezovský Radek Mařík PAL 2012

END-TERM EXAMINATION

Number Theory and Graph Theory

Disjoint Sets and the Union/Find Problem

Algorithms and Theory of Computation. Lecture 5: Minimum Spanning Tree

2.1 Greedy Algorithms. 2.2 Minimum Spanning Trees. CS125 Lecture 2 Fall 2016

The Geometry of Carpentry and Joinery

Algorithms and Theory of Computation. Lecture 5: Minimum Spanning Tree

A graph is finite if its vertex set and edge set are finite. We call a graph with just one vertex trivial and all other graphs nontrivial.

Cpt S 223 Fall Cpt S 223. School of EECS, WSU

Bipartite Roots of Graphs

A Scalable Parallel Union-Find Algorithm for Distributed Memory Computers

Solutions to Exam Data structures (X and NV)

Byzantine Consensus in Directed Graphs

DESIGN AND ANALYSIS OF ALGORITHMS

On Algebraic Expressions of Generalized Fibonacci Graphs

These notes present some properties of chordal graphs, a set of undirected graphs that are important for undirected graphical models.

CS2 Algorithms and Data Structures Note 10. Depth-First Search and Topological Sorting

Distributed minimum spanning tree problem

Lecture 4: Graph Algorithms

CSci 231 Final Review

Course Review for Finals. Cpt S 223 Fall 2008

Minimum Spanning Trees

Joint Entity Resolution

Building a network. Properties of the optimal solutions. Trees. A greedy approach. Lemma (1) Lemma (2) Lemma (3) Lemma (4)

Chapter 5. Greedy algorithms

CHAPTER 8. Copyright Cengage Learning. All rights reserved.

Theory of 3-4 Heap. Examining Committee Prof. Tadao Takaoka Supervisor

Homework Assignment #3 Graph

Theorem 2.9: nearest addition algorithm

Indexing and Hashing

The Cheapest Way to Obtain Solution by Graph-Search Algorithms

Disjoint Sets. Based on slides from previous iterations of this course.

) $ f ( n) " %( g( n)

CS 310 Advanced Data Structures and Algorithms

Department of Information Technology

arxiv: v3 [cs.ds] 18 Apr 2011

The convex hull of a set Q of points is the smallest convex polygon P for which each point in Q is either on the boundary of P or in its interior.

DOWNLOAD PDF LINKED LIST PROGRAMS IN DATA STRUCTURE

Midterm Examination CS 540-2: Introduction to Artificial Intelligence

Disjoint-set data structure: Union-Find. Lecture 20

Fundamental Properties of Graphs

MAT 7003 : Mathematical Foundations. (for Software Engineering) J Paul Gibson, A207.

CISC 621 Algorithms, Midterm exam March 24, 2016

Minimum spanning trees

Transcription:

Distributed Data Structures and Algorithms for Disjoint Sets in Computing Connected Components of Huge Network Wing Ning Li, CSCE Dept. University of Arkansas, Fayetteville, AR 72701 wingning@uark.edu Thomas A. J. Schweiger, Acxiom Corporation, Faytevill, AR 72701 Tom.Schweiger@acxiom.com Abstract A network or graph consists of a set of nodes joined by edges. A set of nodes are form a connected component when every vertex has a path to every other node in the set. The connected components of a network can be computed efficiently using depth-first search or disjoint-sets data structures. The disjoint sets approach has the advantages of easily answering queries whether two nodes are in the same connected component and efficiently maintaining the connected components as new edges are added. For huge network with hundreds of millions or billions of nodes, the disjoint set data structures may be too large to be held in memory by a single processor. To solve this problem, we proposes a scheme to partition, distribute, and manipulate the disjoint set data structures among multiple processors in a computing grid. 1. Introduction A set of nodes are form a connected component when every vertex has a path to every other node in the set. The connected components of a network can be computed efficiently using depth-first search or disjoint-sets data structures [1]. The disjoint sets approach has the advantages of easily answering queries whether two nodes are in the same connected component and efficiently maintaining the connected components as new edges are added. This latter feature allows the network view to be dynamic, and simplifies maintenance of the disjoint set as the network evolves over time. For huge network with hundreds of millions or billions of nodes, the disjoint set data structures may be too large to be held in memory by a single processor. For instance, many 32-bit based processor systems have the maximum size of virtual memory of four gigabytes (4GB) and user applications may use up to two gigabytes of memory space for all heap, stack, and global stores. Programs that require more than 4GB memory space cannot be compiled and run. To find the connected components for a network with two billions nodes, storing the disjoint set as an array of four-byte integers, requires eight gigabytes of memory. To address this problem, we propose a scheme to partition and distribute the disjoint set in a computing grid so that each processor holds a manageable subset of the data structures. Reconsidering the above example, one could partition the 8 GB integer array into eight subsets. Each array subset is allocated to a processor in a computing grid with adequate pooled resources to manage the problem in memory, say eight processors. In this case, each processor requires 1GB of memory, well below architecture limits. In general, the fewer the sub arrays the better our scheme performs. A sophisticated partitioning algorithm should take into account the configuration of the available resources. A processor with larger main memory should get a larger piece of the array than a processor with fewer resources. Three basic operations are required to determine the connected components of a network: MAKE-SET(x), UNION(x,y), and FIND-SET(x) [1], where x and y denote set elements or objects (these operations will be described in detail later). If, by chance, the nodes in all the connected components happen to be allocated to the same partition (that is, no edge connects nodes allocated to different partitions), then a trivial solution is to let each processor construct the disjoint set for its sub array, and only these three operations are needed. In practice, however, many edges will connect nodes allocated to different partitions. Therefore, in addition to the MAKE-SET(x), UNION(x,y), and FIND- SET(x), additional operations are needed which we will introduce. The remainder of our paper is organized as follows. In section two we review disjoint sets data structures and algorithms and we introduce basic definitions and terminology. In section three we provide examples to illustrate the key ideas and principle operations of our distributed algorithm. In section four we present our

algorithm. Finally we offer concluding remarks and suggest future work. 2. Preliminaries A network is an undirected graph of nodes which for our purposes are numbered sequentially from 1 to n. Once n is given, the set of nodes is implicitly understood. An edge is a connection between a pair of nodes, written as (x, y), where x and y are integers between 1 and n that denote the connected nodes. Thus a network is uniquely specified by n and its set of edges. Two nodes may not be directly connected by an edge, but may be connected by an indirect path. In graph theoretic term, the path from node u to node v is the sequence of nodes starting at u and ending at v for which each pair of consecutive nodes in the sequence is connected by an edge. A node u is said to be reachable from v if the there exists a path between them. For undirected graphs this property is reflexive; if v is reachable from u, u is also reachable from v. A node is always reachable from itself. An graph is said to be fully connected if every pair of nodes is reachable. If a graph is not fully connected, it is said to be comprised of connected components. Connected components are fully-connected, non-overlapping subgraphs of the larger network. We find connected components using a disjoint-set data strucure -- a collection of disjoint dynamic sets. Each disjoint set in the collection represents a member of the graph. Disjoint-set data structures support the operations MAKE-SET(x), UNION(x,y), and FIND-SET(x). MAKE- SET(x) creates a new set whose only member is x. UNION(x,y) unites the dynamic sets that contain x and y respectively into a new set that is their union (a standard set operation). FIND-SET(x) returns a pointer to the representative of the unique set that contains x. In a most efficient implementation of disjoint sets, rooted trees represent sets, with each tree node containing one member and each tree representing one set. By using two heuristics- union by rank and path compression -we can achieve the asymptotically fastest disjoint-set data structure known [1]. In [1], the following pseudocode is presented to implements disjoint-set data structure. The parent of x is designated as p[x] and the rank of x as rank[x] in the code. MAKE-SET(x) 1. p[x] =x 2. rank[x] =0 UNION (x,y) 1. LINK(FIND-SET(x), FIND-SET(y)) LINK(x,y) 1. if rank[x] > rank[y] then 2. p[y] =x 3. else 4. p[x]=y 5. if rank[x] = rank[y] then 6. rank[y]=rank[y]+1 FIND-SET(x) 1. if x!= p[x] then 2. p[x]=find-set(p[x]) 3. return p[x] Note that indentation is used to indicate block structure in the pseudo-code. The reader is referred to [1] for in depth discussion and amortized analysis of the algorithms. For a disjoint set whose elements are integers from 1 to n, rooted trees may be implemented by an integer array p[1..n], where p[k] contains the index of the array at which the parent of k is located. By using cleaver programming and the observation that rank values are needed only by the roots of the trees, we can store both the parent and rank information in the p[] array and eliminate the rank[] fields or array completely. To use a disjoint-set data structure to compute the connected components of a given network sequentially, we initially place each node in its own set using the MAKE- SET operation. Thus each x is initially a connected component. Conceptually, this is a graph with nodes but no edged. Next we begin to add edges to the graph. For each edge (x,y), we test if x and y belong to different sets using the return values of FIND-SET(x) and FIND-SET(y). If so, the two sets are united to form a larger connected component using the UNION(x,y). Once all the edges are processed we construct the final connected components by determining which nodes are in the same set. The reader is referred to [1] for a thorough discussion of the algorithm. As we explained in the introduction section, for huge network with billions of nodes this sequential approach becomes intractable due to the memory limitation imposed by processor architecture. To address this issue we use distributed multiple processors in a computing grid to gain access to a larger amount of memory. This requires a different view of the disjoint set data structure array. It must be partitioned and distributed among multiple processors in a computing grid in such a way that each processor can hold a subset of the data structure internally. To begin, we partition the array p[1..n] into k sub arrays p[n 0..n 1 ], p[n 1 +1..n 2 ], p[n 2 +1..n 3 ],, p[n k-1 +1..n k ], where n 0 =1 and n k =n. Within this scheme we recognize

two kinds of edges: local edges and global edges. An edge (x,y) is a local edge if x and y are in the same partition; that is, for some j, x and y are in the range n j-1 +1..n j. If the edge spans two partitions it is a global edge. As we mentioned earlier, If all edges are local edges, then no connected components span partitions. In this case the solution is a trivial parallel and distributed algorithm and no additional operations are required. In practice, however, the network will partition such that there are many global edges. One solution for finding connected components is to implement UNION and FIND-SET operations over distributed sub array with parent indices pointing back and forth between different processors. This solution is sequential in nature and very inefficient. Instead, we propose first processing just the local edges. This gives the connected components that are found in the range n j-1 +1..n j for each partition j. We call these connected components the local components of partition j. Local components are computed using the sequential algorithm on local edges. Two local components on different partitions will belong to the same connected component if and only if a global edge connects a node in one local component to a node in the other local component. The next step of our algorithm examines the global edges so that we can merge local components on different partitions Note that our initial view of local edges may be insufficient to determine all the local components. For example, two local components on the same partition may belong to the same connected component, but insufficient information is available at this time to join them. circles are edges. Lines connecting small circles within a larger circle are local edges. By inspection one can see that the network is fully connected -- there is only one connected component in the network. The eight nodes in each partition should form a local component. However, if we consider only local edges we derive one local component for the lower-left partition and four local components, each of which has two nodes, for the remaining partitions. To see how local components on the same partition may be merged by considering global edges, consider the global edges of lower-left partition. It has four global edges, two of which connect to nodes in the upper-left partition and two to nodes in the lower-right partition. The global edges to the upper-left partition generate a path joining two local components in that partition the second and fourth local components are joined via the path through global edges to the single local component in the lower left partition. Similarly, the global edges to the lower-right partition generate a path joining two local edges in that partition. Such a pathway has the same effect as adding a new local edge in the partition. In general, k global edges from a current local component to another partition generate paths equivalent to k-1 new local edges in the second partition. Global edges not only generate paths that join local edges, they also generate new paths between local partitions on different partitions, and these in turn may generate additional paths between local edges in the next iteration. As we saw, the global edges from the local component in lower-left partition connect to the upper-left and lower-right partitions. Note how this generates a new pathway between the upper-left and lower-right partition that has the same effect as adding a new global edge in the network. In general, for k partitions reached by global edges leaving a current local component, k(k-1)/2 new global edges are generated for each pair of partitions. Figure 1: A network example and it 4-partition Consider the example of in Figure 1. The larger circles represent partitions or processors. Small circles represent nodes in the network. Lines connecting small Figure 2: New local and global edges generated from the local component of lower-left partition.

Figure 2 shows symbolically the new local and global edges generated by the pathways created by the global edges the lower left partition. The blue colored edges symbolize new local edges and the red colored edge symbolizes a new global edge. Recognizing that adding such new edges creates path shortcuts is an important step that we use extensively. Our distributed algorithm has three steps that all run in parallel: 1. Each processor computes its current local components using disjoint-set operations. 2. Each processor computes new local edges and new global edges for other partitions based on its current local components and existing global edges 3. Each processor transmits information about new local and global edges to the interested processors. generates new local edges for the top partitions: (A, B) via F, (B, C) via G, and (C, D) and (D, E) via I. For step 3 the partitions transmit the new local edges. Returing to step 1, the partitions process the new local edges l using the disjoint set algorithm. The effect of the new local edges is to join the vertices of each partition into a single local component. A global edge is stored from each local component to any other local component, indicating they belong to the same connected component though residing in different partitions. Now consider again the network shown in Figure 1. The results of completing the first iteration of the three steps are shown in Figure 4, with new local edges drawn in blue and new global edges drawn in red. Note that the topmost local component in the upper-right partition generates a new local edge for the upper-left partition and a new global edge between the first local component of the upper-left partition and the third local component of the lower-right partition. The three steps are repeated until no new local or global edges are created. The maximum number of repetitions is determined by the number of partitions. 3. Illustrative Examples. Let us first consider the example network shown in Figure 3. The larger circles represent partitions, each a process on a grid host, and the smaller circles represent nodes in the network. Lines connecting small circles are edges. In this example all the edges are initially global edges; each vertex forms a locally connected component by itself. This completes step 1. A B C D E Figure 4: New edges generated after iteration one. F G H I Figure 3: An example network and its 2-partition. For step 2, the global edges radiating from the top partition generates new local edges for the bottom partition: (F, G) via B, and (G, H) and (H, I) via C. Simultaneously, the global edges radiating from the bottom partition We indicate by the bold line those global edges that need to be stored by the partitions to record which local components are connected to make connected components. Figure 5 shows the local components after processing the new local edges generated in first iteration. Nodes having the same color belong to the same local component. Figure 5 also shows the new global edges generated by the first iteration. The heavy black and red lines are stored global edges. The heavy blue lines are new local edges.

are needed; and so on. The final set of global edges link the connected components that are partitioned among the grid processors. A MPI implementation of the proposed algorithm is reported in [2]. The problems for which the proposed distributed algorithm solves are considered in [3,4,5]. Figure 5: Local components and global edges after iteration 2. The new edges after iteration two are shown as light blue colored lines in Figure 6. At this point, as expected, each processor has a single local component after processing the new local edges. 5. Conclusions A fast distributed algorithm was developed for finding the connected components of massive networks. The distributed disjoint set allows us to maintain the connected components information in memory. This property will be investigated and explored further to implement a connected components service and a transactional connected components computation where connected components information is updated as new edge pairs are introduced. Acknowledgments This research was supported in part by Acxiom Corporation through the Acxiom Laboratory for Applied Research. References Figure 6: New local edges after iteration two. 3. The Basic Algorithm. Without loss of generality, we subdivide an array of size n into 2 k consecutive partitions of which each is assigned to a grid resource. Based on this partitioning, edges are classified into two types: the local edges and global edges. The disjoint-set FIND-SET and UNION algorithms are used as before to process local edges. Global edges are used to generate new local edges for other partitions. Global edges are also used to generate new global edges for the next iteration of the same process. We have shown that after at most k iterations all local components are found. Hence, if the array is partitioned in two, one iteration is needed. If the array is partitioned in four, two iterations are needed; if the array is partitioned in eight, three iterations [1] Cormen, T., Leiserson, C., Rivest, R., Stein, C. Introduction To Algorithms Second Edition, McGraw-Hill Higher Education. [2] R. Bheemavaram, Parallel and Distributed Grouping Algorithms for Finding Related Records of Huge Data Sets on Cluster Grid, MS thesis, University of Akansas, 2006 [3] R. Bheemavaram, J. Zhang, and W. Li A Parallel and distributed approach for finding transitive closures of data records: A proposal: Proc. Acxiom Laboratory for Applied Research (ALAR) 2005 conference on Applied Research in Information Technology, pp. 61-70 March. 2006. [4] W. Li, J. Zhang, and R. Bheemavaram Efficient Algorithms for Grouping Data to Improve Data Quality Proc. 2006 International Conference on information and knowledge engineering, pp. 149-154, June 2006. [5] J. Zhang, R. Bheemavaram, and W. Li Transitive Closure of Data Records: Application and Computation Proc. Acxiom Laboratory for Applied Research (ALAR) 2006 conference on Applied Research in Information Technology, pp. 71-81 March. 2006.