Keyword Search in External Memory Graph Representations of Data


B. Tech. Seminar Report
Submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology

by
Avin Mittal
Roll No:

under the guidance of
Prof. S. Sudarshan

Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai

Acknowledgements

I would like to thank my guide, Prof. S. Sudarshan, for the constant feedback and encouragement he has given to my work, and without whose motivation this work would not have been possible.

Avin Mittal

Abstract

Keyword search over relational and XML data has grown in popularity since the advent of Web search engines. Keyword search over relational data is significantly different from Web search because the required information is often split across multiple tables as a result of normalization. The algorithms and techniques applied to databases therefore produce answer trees from the data graph, as opposed to the answer nodes produced by Web search engines. BANKS and several other systems enable keyword-based search over relational databases. Although the algorithms and heuristics used in these systems are efficient and have been tuned to give good results, most of them assume that the entire data graph is present in memory, which may not be the case for typical databases. Moreover, the search algorithms used by such systems may explore a large portion of the graph before finding an answer. In this report, we study external memory techniques that can be applied to such systems to improve the time and space complexity of the algorithms, with particular attention to restricting the number of I/O operations.

Contents

1 Introduction
2 Keyword Search Systems
  2.1 BANKS
    2.1.1 Graph Model for Representing Data
    2.1.2 Query and Answer Model
    2.1.3 Backward Expanding Search Algorithm
    2.1.4 Bidirectional Search Algorithm
  2.2 DBXplorer
    2.2.1 Publish
    2.2.2 Search
  2.3 ObjectRank
    2.3.1 Authority Transfer Graph
    2.3.2 Score of a Node w.r.t. a Query
3 External Memory Graph Algorithms
  3.1 Blocking
    3.1.1 Model
    3.1.2 Paging Strategies
  3.2 Graph Partitioning
    3.2.1 Mathematical Model
    3.2.2 Multilevel k-way Partitioning
  3.3 S-Node Representation
    3.3.1 Advantages of S-Node Representation
    3.3.2 Structure of an S-Node Representation
    3.3.3 Partitioning Desiderata
    3.3.4 Iterative Partitioning Algorithm
  3.4 Compression
    3.4.1 Huffman Encoding
    3.4.2 Reference Encoding
    3.4.3 Additional Improvements
4 Indexing
  4.1 Types of Indexes
  4.2 HOPI - A Connection Index
    4.2.1 2-Hop Cover
    4.2.2 Incremental Algorithm for Computation of 2-Hop Cover
    4.2.3 Divide and Conquer Algorithm for Computation of 2-Hop Cover
    4.2.4 Distance Aware 2-Hop Cover
5 Conclusions and Future Work

1 Introduction

Search engines on the Web have popularized the keyword search paradigm for fetching information. With huge amounts of online data being stored in the form of relational and XML data, keyword search over such data has grown in importance, particularly for casual users who have no knowledge of the underlying schema. Standard query languages such as SQL and XPath are too complex for casual users, so systems that support keyword querying over databases are often the only way for such users to get information from relational data. The key difference from keyword search on the Web is that the data may be split across several tables due to normalization; any system must take this into consideration and produce answers constructed from on-the-fly joins of database tables. Many systems, such as BANKS, DBXplorer, DISCOVER, ObjectRank and XRank, have been proposed and developed to support keyword queries over databases. We discuss some of these systems and their key features in Section 2.

In BANKS and some of the above-mentioned systems, a graph data model is used to represent the data in the database. In this model, a node is created for each tuple, and a directed edge is created for each foreign key link or other kind of link between tuples. The kind of searches that keyword queries entail, together with the effects of normalization, make the graph model the most natural representation for efficient keyword search. The answer to a query is a rooted directed tree with the keyword nodes at the leaves. In BANKS, two algorithms have been proposed to compute answer trees given a set of keyword nodes: (i) backward search, which starts from the keyword nodes and searches backwards for a common root, and (ii) bidirectional search, which additionally allows forward searches from potential roots towards the leaves.

Though keyword search systems such as BANKS work well in practice, they suffer from a potential memory explosion: the run-time memory requirements of the algorithms can be unreasonably large for many large-scale databases, since the search may explore a large number of nodes before producing an answer. Moreover, the algorithms used in these systems assume that the required data is present in memory. In practice this may not hold for huge graphs, and the running time may then be dominated by the number of I/O operations performed during execution. The purpose of this report is to study external memory techniques that have been described over the years for efficient storage and querying of disk-resident data, and to see how these techniques can be applied to the systems mentioned above.

In Section 3, we look at measures that have been suggested for efficient traversal and compression of disk-based data graphs: two-level representations of graphs, clustering to partition a graph, and compression techniques such as Huffman encoding and reference encoding. Many of these techniques were originally described and studied for Web graphs but can be applied to other graphs as well. We also study some basic properties of disk-based algorithms and the paging strategies a system can use to minimize I/O operations while executing an algorithm.

Indexing is an important aspect of efficient storage and traversal of data graphs, as it enables us to locate and query the relevant sections of the graph quickly. There are many types of indexes, depending on the type of queries they are intended to answer.

A connection index is an index that stores only the connectivity information of a graph, i.e. it can be used to answer ancestor and descendant queries. One such index is HOPI, which is based on computing the 2-hop cover of a graph. We study the basic properties, computation and other features of the HOPI index in Section 4.

2 Keyword Search Systems

Since the growth of the World Wide Web, there has been a rapid increase in the number of users who wish to browse and search the contents of online databases without necessarily caring about the underlying schema or a formal query language such as SQL. Many systems have been proposed over the past few years that address keyword-based searching over relational and XML data. This section describes some of the techniques and features of these keyword-based search systems.

2.1 BANKS

BANKS ([BHN+02]), an acronym for Browsing ANd Keyword Searching, is a system that enables data and schema browsing along with keyword search over relational databases. Keyword search in BANKS uses proximity-based ranking based on foreign key and other types of links. The database is modelled as a directed graph, with nodes representing tuples and edges representing cross-references. BANKS allows query keywords to match data (information present in the tuples) as well as metadata (column/table names). BANKS also enables near zero-effort Web publishing of relational data which would otherwise remain invisible to the Web.

2.1.1 Graph Model for Representing Data

The database is modelled as a directed weighted graph, with each tuple in the database corresponding to a node in the graph. Each foreign key link is a directed edge from the referencing node to the referenced node. An answer is a rooted directed tree having a path from the root to each of the keyword nodes. The answer trees are scored based on a combination of edge weights and node weights, and the scores are used to rank the answer trees.

Edge weights are assigned in inverse proportion to the importance of the edge. For example, in a bibliographic database, an edge from Paper to Writes would have a lower weight than one from Paper to Cites. The weight of a tree is proportional to the sum of its edge weights, and the relevance of a tree is inversely proportional to its weight. In some cases we may need to traverse edges in the backward direction; for example, finding a path from Paper to Author requires traversing the foreign key edge from Writes to Paper backwards. Thus, for each edge (u, v), a backward edge (v, u) is created with a different edge weight. The weight of (v, u) is directly proportional to the number of links to v from nodes of the same type as u.

[Figure 1: Backward Edges]
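The construction just described can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a unit forward weight and a linear rule for the backward weight), with hypothetical relation names and helper; it is not the BANKS implementation.

    # A minimal sketch of building the weighted data graph with explicit backward
    # edges. Relation names, the base weight and the backward-weight rule are
    # illustrative assumptions, not the BANKS edge-weight model itself.
    from collections import defaultdict

    def build_data_graph(tuples, fk_links, forward_weight=1.0):
        """tuples: dict node_id -> relation name.
        fk_links: list of (referencing_node, referenced_node) pairs."""
        adj = defaultdict(list)                 # node -> list of (neighbour, weight)

        # Count, for each referenced node v, how many nodes of each relation refer to it.
        in_by_type = defaultdict(int)           # (v, relation of u) -> count
        for u, v in fk_links:
            in_by_type[(v, tuples[u])] += 1

        for u, v in fk_links:
            adj[u].append((v, forward_weight))  # forward edge u -> v
            # Backward edge v -> u: weight grows with the number of links to v
            # from nodes of the same type as u, so heavily referenced nodes are
            # costlier to expand through in reverse.
            back_w = forward_weight * in_by_type[(v, tuples[u])]
            adj[v].append((u, back_w))
        return adj

    # Tiny bibliographic example (hypothetical ids).
    tuples = {"p1": "Paper", "w1": "Writes", "w2": "Writes", "a1": "Author"}
    links = [("w1", "p1"), ("w2", "p1"), ("w1", "a1")]
    print(dict(build_data_graph(tuples, links)))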

Node weights, inspired by prestige rankings such as Google's ([BP98]), are also incorporated in the model. Nodes having multiple pointers to them have a higher weight; a higher node weight implies higher prestige. Node weights and tree weights are combined to get an overall relevance score. A similarity measure s(R_1, R_2) between relations R_1 and R_2, where R_1 is the referencing relation and R_2 is the referenced relation, is generally asymmetric and depends on the type of link from R_1 to R_2. s(R_1, R_2) is set to infinity if R_1 does not refer to R_2.

2.1.2 Query and Answer Model

A query consists of n terms, say t_1, t_2, ..., t_n. The first step is to locate the nodes matching the query terms. A node is said to be relevant to a query term if it contains the term as part of an attribute value or metadata. For each term t_i, a set S_i of nodes relevant to t_i is thus obtained. The answer to a query is a rooted directed tree containing at least one node from each S_i. Each answer tree is assigned a relevance score based on the edge weights and the node weights, and the answers are presented in decreasing order of score.

2.1.3 Backward Expanding Search Algorithm

The answer tree described above may also contain nodes that are in none of the S_i, and is thus a Steiner tree. Steiner tree computation is NP-hard, and is further complicated by node weight considerations. A search algorithm should produce not only the tree with the highest relevance score but also other trees with high relevance scores. The backward expanding search algorithm ([BHN+02]) offers a heuristic for incrementally computing the query results. Given a set of keywords, the algorithm first finds the set of relevant nodes S_i for each keyword t_i using disk-resident indices on keywords. Let S = ∪_i S_i. The backward search algorithm runs |S| concurrent copies of Dijkstra's single source shortest path algorithm, one for each keyword node in S. Each copy of Dijkstra's algorithm traverses the graph edges in the reverse direction. The goal is to find a common vertex from which a forward path exists to at least one node in each set S_i. The paths thus obtained define a rooted directed tree with the common vertex at the root and the keyword nodes at the leaves. The connection trees generated by the algorithm may not be in exact decreasing order of relevance score, since node weights are not considered while growing the trees. Thus, a heap of the top-k answers seen so far is maintained (for some k determined by experimentation), and whenever the heap becomes full the top answer is output. (A simplified sketch of this search appears at the end of Section 2.1.)

2.1.4 Bidirectional Search Algorithm

The backward search algorithm presented above explores an unnecessarily large number of nodes in the following cases:

- The query contains a frequently occurring term. In the backward search algorithm one iterator is associated with every matching node, so the algorithm generates a large number of iterators if a keyword matches a large number of nodes. This can happen for frequently occurring terms or if a keyword matches a relation name (which matches all tuples belonging to that relation).

- One of the iterators reaches a node having a large in-degree. In this case, the iterator will need to explore a large number of nodes.

In these scenarios, the backward search algorithm explores a large portion of the graph before finding relevant answers, which may result in long search times. The authors of [KPC+05] propose a new algorithm wherein iterators are created only for the nodes matching less frequent keywords, and paths are explored from potential roots towards the more frequent keywords. Every node reached by an iterator in the backward search algorithm is a potential root. If a forward path is followed from these nodes, the frequent keywords may be reached sooner and the algorithm terminates faster.

[Figure 2: Motivation for Bidirectional Search]

The two main features of the bidirectional search algorithm are:

- Starting forward searches from the potential roots.
- A spreading activation model to prioritize nodes on the iterator fringe, ensuring that an iterator with a small fringe gets a higher priority, and that among nodes within a single iterator, those in a less bushy subtree get a higher priority. Avoiding backward search from large fringes avoids wasteful expansion, while the corresponding keyword nodes can still be connected by searching forward from nodes with high activation values.

The key differences of the bidirectional search algorithm from the backward search algorithm are:

- All the single source shortest path iterators of the backward search algorithm are merged into a single incoming iterator.
- Spreading activation is used to prioritize the search. For the incoming iterator, the node expanded next is the one with the highest activation. Activation spreads from the keyword nodes, and edge weights are taken into account while spreading, so the activation reflects the edge weights as well as the branching of the search fringe.
- Another iterator, called the outgoing iterator, is run concurrently and follows outgoing edges from the nodes explored by the incoming iterator.
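As noted in Section 2.1.3, the core of backward expanding search can be sketched as follows. The sketch interleaves one reverse-Dijkstra copy per keyword node through a single priority queue and reports candidate answer roots; answer-tree construction, node weights, the top-k heap and the bidirectional refinements above are all omitted, and the graph encoding and function name are assumptions rather than BANKS code.

    # Much-simplified backward expanding search: expand backwards from every
    # keyword node, in increasing distance order, until some vertex has been
    # reached from at least one node of every keyword set S_i.
    import heapq
    from collections import defaultdict

    def backward_expanding_search(radj, keyword_sets, max_answers=10):
        """radj: reverse adjacency, v -> list of (u, w) meaning a forward edge u->v of weight w.
        keyword_sets: list of sets S_i of keyword-matching nodes."""
        pq = []                                   # (distance, set index, origin, node)
        best = defaultdict(dict)                  # (set index, origin) -> {node: distance}
        reached_from = defaultdict(set)           # node -> keyword-set indices that reached it

        for i, s in enumerate(keyword_sets):      # one reverse-Dijkstra copy per keyword node
            for origin in s:
                heapq.heappush(pq, (0.0, i, origin, origin))
                best[(i, origin)][origin] = 0.0

        found = 0
        while pq and found < max_answers:
            d, i, origin, v = heapq.heappop(pq)
            if d > best[(i, origin)].get(v, float("inf")):
                continue                          # stale queue entry
            if i not in reached_from[v]:
                reached_from[v].add(i)
                if len(reached_from[v]) == len(keyword_sets):
                    yield v                       # forward paths exist from v to every S_i
                    found += 1
            for u, w in radj.get(v, ()):          # expand backwards along the edge u -> v
                nd = d + w
                if nd < best[(i, origin)].get(u, float("inf")):
                    best[(i, origin)][u] = nd
                    heapq.heappush(pq, (nd, i, origin, u))

    # Toy graph: forward edges c->a and c->b (weight 1 each); keywords match {a} and {b}.
    radj = {"a": [("c", 1.0)], "b": [("c", 1.0)]}
    print(list(backward_expanding_search(radj, [{"a"}, {"b"}])))   # ['c']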

2.2 DBXplorer

DBXplorer ([ACD02]) is another system for keyword search over relational databases. The system has been implemented using a commercial relational database and a web server, and allows users to interact via a browser front-end. The distinguishing feature of DBXplorer compared with BANKS is that DBXplorer uses SQL, and SQL's query optimization over join trees, to output single rows obtained by joining several tables. Given a set of query words, DBXplorer returns all rows (either from single tables, or obtained by joining tables connected by foreign key links) such that each row contains all keywords. The two key components of the system are (a) a pre-processing step called Publish that builds the structures associated with the database and prepares it for keyword searching, and (b) a Search step that obtains the matching rows from the published database.

[Figure 3: Architecture of DBXplorer]

2.2.1 Publish

The given database (or some specific part of it) is prepared for keyword searching through the following steps:

- A database is identified, along with the set of rows and columns within it to be published.
- Auxiliary tables are created to support keyword queries over the database. The most important of these is the symbol table, which is used to identify the location of each keyword in the database (i.e. the rows, columns and tables in which it occurs).

2.2.2 Search

When a query (consisting of a set of keywords) is given to the system, it is answered as follows:

- The symbol table is looked up to find the tables, rows and columns of the database that contain the query keywords.
- All potential subsets of tables that, if joined (i.e. connected by foreign key links), may contain rows having all the keywords are identified and enumerated. A subset of tables can be joined only if they are connected in the schema through foreign key links.

[Figure 4: Join trees]

- For each enumerated join tree, an SQL statement is created (and executed) that joins the tables in the tree and selects the rows containing all keywords. The rows thus obtained are output to the user.

DISCOVER: Though DBXplorer is free from the limitations of universal relations, it does not consider solutions that have more than one tuple from a single relation. DBXplorer produces join trees from the schema graph, thereby restricting itself to at most one tuple from each relation (see Figure 4). An improvement over DBXplorer is DISCOVER ([HP02]), which produces candidate networks of tuples, i.e. sets of tuples related through primary key-foreign key relationships that together contain all the keywords of the query. DISCOVER operates in two steps: the Candidate Network Generator enumerates all possible candidate networks of tuples, and the Plan Generator plans the efficient evaluation of the set of candidate networks, exploiting opportunities to reuse common subexpressions. Both DISCOVER and DBXplorer rank their results on the basis of the number of joins involved in producing them. The more sophisticated notions of node weights and similarity measures between relations are thus ignored, resulting in a naive ranking scheme. Another difference between these systems and BANKS is that they apply only to relational databases, whereas BANKS applies equally well to XML data. DBXplorer and DISCOVER use the underlying DBMS for efficient storage and rely on SQL query optimization for efficient querying.

2.3 ObjectRank

ObjectRank ([BHP04]) is a system that applies authority-based ranking to keyword search in relational databases modelled as directed graphs. Authority is a measure of the importance of a node in the graph. Intuitively, each keyword node has some authority, and this authority is transferred along the links of the data graph. PageRank ([BP98]) is an excellent tool for measuring the global importance of nodes; however, for keyword search over databases we need a query-specific relevance score for each node. For this purpose, a global authority calculation is done, very similar to Google's PageRank computation, along with a keyword-specific authority calculation that assigns each node an authority value for each keyword. At run time, the ObjectRank system computes the query-specific authorities of the nodes by combining the individual keyword-based authorities computed earlier.

The final score of a node for a given query is obtained by combining the global authority and the query-specific authority scores of the node.

The link structure of a database graph is quite different from the link structure of the Web. Unlike the Web, where all edges are hyperlinks and links are largely indistinguishable from one another, in relational databases there are various types of edges. Edges between different types of nodes can carry different semantics, and thus the amount of authority flowing to each out-neighbour of a node need not be identical.

2.3.1 Authority Transfer Graph

The database is modelled as a directed graph in which each tuple of the database is represented as a node. Each node represents an object of the database and may have some attributes. Associated with the database is also a schema graph, in which each relation of the database is modelled as a node and relations connected by foreign key links are joined by edges. From the schema graph, an authority transfer graph is created to reflect the flow of authority through the edges of the graph. For each edge (u, v) in the schema graph, we create one forward and one backward edge in the authority transfer graph; the idea of the backward edge is that authority can flow in the backward direction as well. Both forward and backward edges are annotated with authority transfer rates, which reflect the fraction of a node's authority flowing through the edge. The authority transfer rates in the two directions need not be the same.

[Figure 5: Authority Transfer Graph]

2.3.2 Score of a Node w.r.t. a Query

The score of a node v w.r.t. a particular keyword query w is obtained by combining the global ObjectRank r_G(v) of the node and the keyword-specific ObjectRank r_w(v) of the node. The combination function used in ObjectRank is r_{w,G}(v) = r_w(v) · (r_G(v))^g, where g is the global ObjectRank weight and can be tuned for varying degrees of generality in the results.

Keyword-Specific ObjectRank. Given a single keyword query, ObjectRank finds the set of nodes containing the keyword, S(w) (called the base set for that keyword), and assigns an ObjectRank r_w(v_i) to each node v_i of the graph by resolving the equation

r_w = d · A · r_w + ((1 − d) / |S(w)|) · s

where A is the authority transfer matrix, d controls the importance of the base set, and s is the base set vector for S(w). The damping factor d determines the portion of ObjectRank that an object transfers to its neighbours; by decreasing d, the nodes containing the keywords are favoured. (A small sketch of this iteration appears at the end of this section.)

Global ObjectRank. The global ObjectRank is calculated in a fashion identical to Google's PageRank computation. The system applies the random surfer model, and the base set includes all nodes of the graph. Note that this method assumes all nodes have the same initial authority value (as in PageRank). However, in many applications the domain expert may not want this; for example, in a complaints database we may wish to assign complaints from regular customers a higher initial authority.

Multiple Keyword Queries. The ObjectRank for a multiple keyword query w_1, w_2, ..., w_m is obtained by extending the random surfer model to introduce m surfers, where the i-th surfer starts from the base set S(w_i). For AND semantics, the ObjectRank of v is the probability that, at any given time, all the random surfers are at v:

r_AND(v) = r_{w_1}(v) · r_{w_2}(v) · ... · r_{w_m}(v)

For OR semantics, the ObjectRank of v is the probability that, at any given time, at least one surfer is at v:

r_OR(v) = 1 − (1 − r_{w_1}(v)) · (1 − r_{w_2}(v)) · ... · (1 − r_{w_m}(v))

The main advantage of ObjectRank lies in the flexibility of the system: it can be tuned, and various parameters can be varied to customize it to the application domain. Some of the tunable parameters and their effects are:

- The base authorities of different nodes need not be the same for all domains. The domain expert may decide to differentiate between nodes of the same base set based on attribute values of the objects in the set.
- The damping factor in the calculation of keyword-specific ObjectRank can be decreased to favour the base set.
- The authority transfer rates are set by a domain expert and reflect the amount of authority flowing across various types of links.
- The global ObjectRank weight can be varied to assign different degrees of weight to the global importance of nodes as opposed to their relevance to the current query.
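To make the fixed-point computation concrete, the following sketch solves the keyword-specific ObjectRank equation by simple power iteration on a hand-built authority transfer matrix. The matrix, damping value and function name are illustrative assumptions, not the ObjectRank implementation.

    # Power iteration for r_w = d*A*r_w + ((1-d)/|S(w)|)*s on a tiny example.

    def keyword_objectrank(A, base_set, d=0.85, iters=100):
        """A: n x n list of lists, A[i][j] = authority transfer rate from node j to node i.
        base_set: indices of the nodes containing the keyword (the set S(w))."""
        n = len(A)
        s = [1.0 if i in base_set else 0.0 for i in range(n)]   # base set vector
        r = [1.0 / n] * n                                       # arbitrary starting point
        for _ in range(iters):
            r = [d * sum(A[i][j] * r[j] for j in range(n))
                 + (1.0 - d) * s[i] / len(base_set)
                 for i in range(n)]
        return r

    # Three nodes; node 0 passes authority to nodes 1 and 2 with rate 0.5 each.
    A = [[0.0, 0.0, 0.0],
         [0.5, 0.0, 0.0],
         [0.5, 0.0, 0.0]]
    print(keyword_objectrank(A, base_set={0}))   # node 0 scores highest; 1 and 2 are equal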

3 External Memory Graph Algorithms

Typical graphs constructed from databases as described in Section 2 are huge, and consequently most of the graph resides on disk. In-memory algorithms such as Dijkstra's single source shortest path algorithm, which assume the presence of all the graph data in memory, do not scale well for such graphs, as I/O operations come to dominate the running time. In this section we look at some external memory storage techniques and algorithms that can be used to overcome these difficulties.

3.1 Blocking

The authors of [NGV93] discuss the problem of using disk blocking efficiently for searching graphs that are too large to fit in memory. A key feature of their model is that a vertex may be stored multiple times in order to take advantage of redundancy.

3.1.1 Model

The key assumptions of the model for a graph G = (V, E) are:

1. Data is associated only with the vertices of the graph, each of which can be represented in a fixed amount of space.
2. Data for vertices is stored in blocks, each of which can hold data for at most B vertices.
3. Data for a single vertex may be present in more than one block. Only one such block needs to be in memory to satisfy a request.
4. The internal memory can hold data for at most M vertices, after which B vertices have to be flushed out to make room for a new block.
5. The total number of vertices in the graph is n = ρM, where ρ >> 1.

Assumption 3 is reasonable if the data is not updated too often, which is indeed the case for most databases. If the total amount of storage required for the graph is S, the storage blow-up is defined as s = S/(n/B). Intuitively, s is the average number of blocks in which a vertex is present. If a vertex is present in memory at some time, it is said to be covered at that time. The term page fault refers to an event in which the path being traced extends to an uncovered vertex, so that at least one block must be read from secondary memory. Algorithms in this model thus have two parts:

- an assignment of the vertices of the graph to blocks, known as a blocking, and
- a paging algorithm that specifies which vertices/blocks are in memory at any time.

3.1.2 Paging Strategies

When a page fault causes a block to be brought into memory, it may be necessary to overwrite data already in memory so as not to exceed the memory size. If data is always flushed out in whole blocks, we say we are using a weak paging strategy. A strong paging strategy allows any B copies of vertices to be flushed, independent of whether they were originally in the same block.

We can also distinguish paging algorithms by whether they are on-line or off-line. An off-line paging strategy may consider the entire path before deciding which blocks need to be brought in. An on-line paging strategy must base its decision about which block to bring in at any point only on the previous history of the path. A paging algorithm is called lazy if it only brings into memory a block that services an immediately preceding page fault.

Blocking using BALL-COVER: A ball of radius r around a vertex v is defined as the set of vertices {v' ∈ V | d(v, v') < r}, where d denotes the minimum distance between v and v'. The problem BALL-COVER is to pack a minimum number of balls of radius r into the graph such that each vertex belongs to at least one ball. The output of BALL-COVER is a set of vertices V' such that for every vertex v ∈ V, there exists a vertex v' ∈ V' with d(v, v') < r. A k-compact neighbourhood N_v(k) of v is defined as a k-element subset of V containing v such that the minimum distance from v to a vertex not in N_v(k) (the radius of the neighbourhood) is maximum over all such subsets. Let r*(B) be the minimum, over all vertices, of the radius of the compact neighbourhoods of size B, where B is the block size. BALL-COVER is computed for the graph with radius r = r*(B)/2.

[Figure 6: Distance from v to w is less than r/2]

The blocking for the given graph consists of the set {N_v(B) | v ∈ V}, where B is the block size. Now, when a page fault occurs at vertex v, we know (by the definition of BALL-COVER) that there is at least one vertex v' ∈ V' such that d(v, v') < r. The representative block corresponding to the vertex v' is then brought into memory. Since the distance from v to v' is less than r = r*(B)/2, and the distance from v' to any vertex not in N_{v'}(B) is at least r*(B) (by the definition of the radius of a compact neighbourhood), it can be shown that the distance from v to any vertex not in N_{v'}(B) is at least r*(B)/2.

The above analysis shows that if redundancy is permitted in the blocking, we can upper bound the number of I/O operations that take place during the execution of an algorithm; redundancy thus yields excellent algorithmic performance. The trade-off is between storage blow-up and efficiency: if we allow the storage requirements to grow, we can produce highly I/O-efficient algorithms. Graph partitioning techniques (described in the following sections), on the other hand, do not incur storage overheads, but the efficiency of graph traversal may be reduced.
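The two ingredients of this blocking scheme can be sketched as follows, under simple assumptions: the B-compact neighbourhood of a vertex is taken to be its B closest vertices, found by a truncated Dijkstra run, and r*(B) is the minimum neighbourhood radius. The helper names and graph encoding are hypothetical, and the actual packing of balls from [NGV93] is omitted.

    # Sketch of N_v(B) and r*(B); not the full [NGV93] blocking construction.
    import heapq

    def compact_neighbourhood(adj, v, B):
        """Return (members, radius): the B nearest vertices to v and the distance
        to the closest vertex left outside (inf if nothing is left outside)."""
        dist = {v: 0.0}
        pq = [(0.0, v)]
        members = []
        while pq and len(members) < B:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue                      # stale entry
            members.append(u)
            for w, wt in adj.get(u, ()):
                nd = d + wt
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(pq, (nd, w))
        # Radius = distance from v to the nearest vertex not taken into N_v(B).
        radius = min((d for d, u in pq if u not in set(members)), default=float("inf"))
        return members, radius

    def blocking(adj, B):
        blocks = {v: compact_neighbourhood(adj, v, B) for v in adj}
        r_star = min(radius for _, radius in blocks.values())
        return blocks, r_star                 # balls of radius r_star/2 would then be packed

    adj = {"a": [("b", 1.0), ("c", 2.0)], "b": [("c", 1.0)], "c": []}
    print(blocking(adj, B=2))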

3.2 Graph Partitioning

Graph partitioning (or clustering) is important in external memory graph algorithms, as it leads to a smaller, more manageable way of storing the graph on disk and also reduces the time required for graph traversal. Informally speaking, clustering is the process of grouping nodes of a graph such that intra-cluster similarity is maximized and inter-cluster similarity is minimized.

3.2.1 Mathematical Model

The k-way graph partitioning problem is defined as follows: given a directed, weighted graph G = (V, E) with |V| = n, partition V into k subsets V_1, V_2, ..., V_k such that V_i ∩ V_j = ∅ for i ≠ j, |V_i| = n/k, and ∪_i V_i = V, while minimizing the sum of the weights of the edges whose incident vertices belong to different subsets. A k-way partitioning of V is commonly represented by a partitioning vector P of length n such that for every vertex v ∈ V, P[v] is an integer between 1 and k indicating the partition to which v belongs. Given a partitioning P, the sum of the weights of the edges whose incident vertices belong to different partitions is called the edge-cut of the partitioning.

3.2.2 Multilevel k-way Partitioning

In [KK98], the authors propose a multilevel approach to graph partitioning and show that it is much more efficient than the recursive bisection algorithm. They also present a high-quality and efficient refinement algorithm that can improve upon the initial k-way partitioning. The basic structure of the multilevel k-way partitioning algorithm is as follows: the graph G = (V, E) is first coarsened down to a small number of vertices, a k-way partitioning of this small graph is computed, and the partitioning is then projected back towards the original graph by successively refining it at each intermediate level.

[Figure 7: Multilevel k-way partitioning (from [KK98])]
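As a concrete reading of the edge-cut objective defined in Section 3.2.1, here is a minimal sketch (hypothetical encoding, not METIS code) that evaluates the edge-cut of a given partitioning vector.

    # Edge-cut of a k-way partitioning: sum the weights of edges whose
    # endpoints lie in different partitions.

    def edge_cut(edges, P):
        """edges: iterable of (u, v, weight); P: dict node -> partition id (1..k)."""
        return sum(w for u, v, w in edges if P[u] != P[v])

    edges = [("a", "b", 2.0), ("b", "c", 1.0), ("c", "a", 3.0)]
    P = {"a": 1, "b": 1, "c": 2}
    print(edge_cut(edges, P))   # 1.0 + 3.0 = 4.0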

The phases of multilevel partitioning are described in more detail below.

Coarsening Phase. During the coarsening phase, a sequence of smaller graphs G_i = (V_i, E_i) is constructed from the original graph G_0 = (V_0, E_0) such that |V_i| < |V_{i-1}|. In coarsening, a set of vertices of G_i (found by computing a maximal matching), say V_i^v, is combined to form a single vertex v of the next coarser graph G_{i+1}. The weight of v is the sum of the weights of the vertices in V_i^v, and the edges of v are the union of the edges of the vertices in V_i^v. If more than one edge of V_i^v is incident on a vertex u, the weight of the new edge (from v to u) is the sum of the weights of all such edges. This coarsening method ensures that (i) the edge-cut of a partitioning of the coarse graph is equal to the edge-cut of the same partitioning in the finer graph, and (ii) a balanced partitioning of the coarser graph leads to a balanced partitioning of the finer graph. The coarsening phase ends when the number of vertices falls below a certain level or the reduction in size between successive coarse graphs becomes too small.

Initial Partitioning Phase. The second phase of multilevel partitioning computes a k-way partition of the coarsened graph obtained in the previous phase, such that each partition contains roughly |V_0|/k of the vertex weight of the original graph.

Uncoarsening Phase. In the uncoarsening phase, the partition of the coarse graph obtained in the previous phase is successively projected back to the next finer graphs, with some modifications at each stage to further reduce the edge-cut or to improve the balance of the partitioning. This process is repeated until the original graph is reached. The first step in uncoarsening is to project the partition of G_i onto G_{i-1}: if a vertex v of G_i belongs to cluster C_k, then all vertices of G_{i-1} that were combined to form v are also assigned to cluster C_k. After the projection step, the clustering obtained may be refined by moving vertices from one cluster to another (provided the balance of the clustering is maintained), thus further reducing the edge-cut of the partitioning. This refinement is done using the Kernighan-Lin (KL) partitioning algorithm and its variants.

Refinement algorithm. The vertices of the graph are visited in random order. If a vertex can be moved to a different cluster (i.e. at least one of its neighbours belongs to a different cluster), the gain associated with moving the vertex to each of its neighbouring partitions is computed. The vertex is moved to the partition giving the maximum gain, provided the balancing condition (defined below) is not violated; if no move decreases the edge-cut, a move that improves the balance of the partitioning may be made instead.

Balancing Condition. Let W_i be a vector of k elements such that W_i[a] is the weight of partition a of graph G_i, and let W_min and W_max be the minimum and maximum permissible cluster weights respectively. A vertex v of weight w(v) can be moved from partition a to partition b only if

W_i[b] + w(v) ≤ W_max  and  W_i[a] − w(v) ≥ W_min
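A single refinement pass of the kind described above can be sketched as follows; this is an illustration of the gain computation and the balancing check under simple assumptions (an undirected adjacency list and one sweep in a fixed order), not the refinement code of [KK98].

    # One simplified KL-style refinement sweep: move a vertex to the neighbouring
    # partition with the largest positive gain if the balancing condition holds.
    from collections import defaultdict

    def refine_once(adj, P, node_w, w_min, w_max):
        """adj: node -> list of (neighbour, edge weight); P: node -> partition id."""
        part_w = defaultdict(float)
        for v, wv in node_w.items():
            part_w[P[v]] += wv

        for v in list(P):
            ext = defaultdict(float)              # weight of edges into each other partition
            internal = 0.0
            for u, w in adj.get(v, ()):
                if P[u] == P[v]:
                    internal += w
                else:
                    ext[P[u]] += w
            if not ext:
                continue                          # all neighbours are in v's own partition
            target, ext_w = max(ext.items(), key=lambda kv: kv[1])
            gain = ext_w - internal               # reduction in edge-cut if v moves
            balanced = (part_w[target] + node_w[v] <= w_max and
                        part_w[P[v]] - node_w[v] >= w_min)
            if gain > 0 and balanced:
                part_w[P[v]] -= node_w[v]
                part_w[target] += node_w[v]
                P[v] = target
        return P

    adj = {"a": [("b", 1.0), ("c", 5.0)], "b": [("a", 1.0)], "c": [("a", 5.0)]}
    P = {"a": 1, "b": 1, "c": 2}
    node_w = {"a": 1.0, "b": 1.0, "c": 1.0}
    print(refine_once(adj, P, node_w, w_min=1.0, w_max=2.0))   # 'a' moves to partition 2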

3.3 S-Node Representation

The authors of [RGM03] propose a technique called the S-Node representation for efficient querying and storage of large data graphs. Though their work is specific to Web repository graphs, some of the ideas presented therein are quite general and can be applied to other classes of graphs. The S-Node representation is a two-level representation in which small lower-level graphs encode the interconnections within a small subset of pages, while the top-level directed graph, which consists of supernodes and superedges, contains pointers to these smaller graphs. Such a representation is highly space efficient and enables in-memory querying of very large graphs.

[Figure 8: S-Node representation]

3.3.1 Advantages of S-Node Representation

1. It compresses the graph so that large portions of it can fit into reasonable amounts of memory, and thus in-memory algorithms can be used instead of I/O-intensive disk-based algorithms.
2. It provides a natural way of exploring the graph by exploring local areas that are relevant to the query. The top-level graph can also serve as an index to the lower-level graphs, so that the relevant lower-level graphs can be located quickly.

3.3.2 Structure of an S-Node Representation

Let the directed graph be represented by G, and let V(G) and E(G) refer to its vertex set and edge set respectively. Let P = {N_1, N_2, ..., N_n} be a partition of the vertex set of G. The following types of directed graphs are then defined:

1. Supernode graph. A supernode graph contains n vertices called supernodes, one for each element of the partition. The supernodes are connected to each other using directed edges called superedges. A superedge is created from N_i to N_j iff there is at least one edge from a vertex in N_i to a vertex in N_j.
2. Intranode graph. Each element N_i of the partition is associated with an intranode graph, which represents the interconnections between the pages that belong to that element.

3. Positive superedge graph. A positive superedge graph SEdgePos_{i,j} is a directed bipartite graph that represents all the links that point from N_i to N_j.
4. Negative superedge graph. A negative superedge graph SEdgeNeg_{i,j} is a directed bipartite graph that represents, among all possible links that could point from pages in N_i to pages in N_j, those that do not exist in the actual graph.

[Figure 9: Partitioning the graph]

Given a partition P on the vertex set of G, we can construct an S-Node representation of G, denoted SNode(G, P), using a supernode graph that points to a set of intranode graphs and a set of positive or negative superedge graphs. Each superedge E_{i,j} points to either the corresponding positive superedge graph SEdgePos_{i,j} or the corresponding negative superedge graph SEdgeNeg_{i,j}, depending on which of the two has the smaller number of edges. This choice between positive and negative superedge graphs allows us to compactly encode both dense and sparse interconnections between pages belonging to two different supernodes. (A small sketch of this choice appears below, after the partitioning desiderata.) The S-Node representation preserves all the linkage information of the original graph, except that the adjacency lists are partitioned across multiple smaller graphs. For the right choice of partition, this representation is highly compact and well suited for local as well as global access tasks.

3.3.3 Partitioning Desiderata

To build an S-Node representation that efficiently supports global and local access to graphs, the following requirements must be met:

- Pages with similar adjacency lists are grouped together as much as possible, so that a compression technique called reference encoding can be used to achieve significant compression of the intranode and superedge graphs. This kind of grouping has the additional benefit of assigning related pages to the same partition.
- Nodes assigned to a given cluster belong to the same domain or have some lexicographic similarity. Such nodes tend to share a significant percentage of their links, and thus might be traversed within a short span of time.
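The superedge encoding choice mentioned above can be sketched as follows; the function and names are hypothetical, not the [RGM03] code.

    # Store the positive superedge graph when the cross links from N_i to N_j are
    # sparse, otherwise store their complement (the negative superedge graph).
    from itertools import product

    def encode_superedge(N_i, N_j, edges):
        """N_i, N_j: lists of page ids; edges: set of (u, v) links of the full graph."""
        positive = {(u, v) for u, v in product(N_i, N_j) if (u, v) in edges}
        negative = {(u, v) for u, v in product(N_i, N_j) if (u, v) not in edges}
        if len(positive) <= len(negative):
            return ("SEdgePos", positive)     # sparse cross links: store them directly
        return ("SEdgeNeg", negative)         # dense cross links: store the missing ones

    N_i, N_j = ["a", "b"], ["x", "y"]
    edges = {("a", "x"), ("a", "y"), ("b", "x")}  # 3 of the 4 possible links exist
    print(encode_superedge(N_i, N_j, edges))      # ('SEdgeNeg', {('b', 'y')})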

3.3.4 Iterative Partitioning Algorithm

Beginning with an initial coarse-grained partition P_0 = {N_01, N_02, ..., N_0n}, we can continuously refine it during subsequent iterations, generating a sequence of partitions P_1, P_2, ..., P_f. Refinement is the process of taking an element, say N_ij, of the partition P_i = {N_i1, N_i2, ..., N_ik} and partitioning it further into smaller sets, say {A_1, A_2, ..., A_m}, to obtain the next-level partition P_{i+1} = {N_i1, N_i2, ..., N_{i,j-1}, N_{i,j+1}, ..., N_ik} ∪ {A_1, A_2, ..., A_m}. The initial partition P_0 groups pages by the domain to which they belong (or some other measure of lexicographic similarity), so all pages belonging to the same domain are mapped to the same element of the partition. Since the final partition P_f is a refinement of P_0, the second desideratum of Section 3.3.3 is satisfied. At each iteration, an element N_ij can be split using one of two methods: URL split (computationally inexpensive, used in earlier iterations) or clustered split (computationally expensive, so used in later iterations, when the individual partition elements are small).

URL split. This method partitions the pages in N_ij based on their URL patterns. Pages with similar URL prefixes are grouped together and kept separate from pages with different URL prefixes. Every application of URL split on a partition element uses a URL prefix that is one level longer than the prefix that was used to generate that element. URL split attempts to exploit the inherent directory structure encoded in the URLs to group related pages together.

Clustered split. This technique splits the pages in N_ij by using a clustering algorithm, such as k-means, to identify pages with similar adjacency lists.

[Figure 10: Clustered Split]

To apply clustered split, the supernode graph for the current partition is constructed (see Figure 10). A bit vector adj(p) is associated with each page p of N_ij. The size of the bit vector is equal to the outdegree of the supernode associated with the partition element. The bits of adj(p) are set depending on the supernodes to which p points. After constructing such a bit vector for each page in N_ij, k-means clustering is applied to these vectors, and the resulting clusters are used to partition N_ij.
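The bit-vector construction and the k-means step can be sketched as follows; the data layout, the choice of k and the use of scikit-learn's KMeans are assumptions made only for illustration.

    # Clustered split sketch: one adjacency bit vector per page, over the
    # out-neighbour supernodes of the element being split, clustered by k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    def clustered_split(pages, page_links, partition_of, neighbour_supernodes, k=2):
        """pages: pages of the element N_ij being split.
        page_links: page -> set of pages it points to.
        partition_of: page -> id of the supernode (partition element) containing it.
        neighbour_supernodes: ordered list of supernodes that N_ij's supernode points to."""
        index = {s: i for i, s in enumerate(neighbour_supernodes)}
        vectors = np.zeros((len(pages), len(neighbour_supernodes)))
        for r, p in enumerate(pages):
            for q in page_links.get(p, ()):
                s = partition_of[q]
                if s in index:
                    vectors[r, index[s]] = 1      # bit set: p points into supernode s
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        groups = [[] for _ in range(k)]
        for p, lab in zip(pages, labels):
            groups[lab].append(p)
        return groups                              # the new sets A_1, ..., A_k

    pages = ["p1", "p2", "p3", "p4"]
    page_links = {"p1": {"x"}, "p2": {"x"}, "p3": {"y"}, "p4": {"y"}}
    partition_of = {"x": "S1", "y": "S2"}
    print(clustered_split(pages, page_links, partition_of, ["S1", "S2"], k=2))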

3.4 Compression

Compression is important for large graphs, as it allows more efficient storage and transfer, and may improve the performance of many algorithms by allowing computations to be performed in faster levels of the memory hierarchy. The authors of [AM01] describe how two compression techniques, (i) Huffman encoding and (ii) reference encoding, can be applied for efficient compression of the Web graph, and thus achieve the aforementioned advantages. Though their discussion is limited to the Web graph and the copying model, the assumptions they make apply to graphs in general, and the techniques described can be used to good effect for compressing database graphs as well.

3.4.1 Huffman Encoding

It has been determined experimentally that the degrees in typical graphs follow a Zipfian distribution, i.e. the number of nodes with degree j is proportional to 1/j^α, where α is a fixed constant. Given this variation in node degrees, Huffman encoding can be used to compress the graph, where the Huffman codeword of a node is assigned based on its degree. A special stop symbol is used to separate the outedges of each node. The encoding scheme can use either indegree or outdegree, whichever is better. Huffman encoding provides a natural and simple way of efficiently compressing any given graph and can be used in any system that wants efficient computation on the compressed form of the graph. However, it ignores the natural clustering structure present in many graphs.

3.4.2 Reference Encoding

Reference encoding works particularly well for graphs generated using the copying model, since it represents the adjacency list of a node in terms of some other node's adjacency list when the two nodes share many outlinks. When node i is compressed in this way using node j, node j is said to be a reference for node i. If node j is labelled as the reference of node i, a 0/1 bit vector indicates which outedges of j are also outedges of i. The remaining outedges of i are then listed separately, using say log n bits each in an n-node graph. Let N(i) and N(j) represent the sets of outedges of nodes i and j respectively. The cost of compressing node i using node j as a reference with this scheme is then

cost(i, j) = outdeg(j) + log n · (|N(i) \ N(j)| + 1)

[Figure 11: Reference Encoding]

Given a graph in this compressed format, consider the problem of reconstructing the adjacency list of i. This could require us to traverse the adjacency list of the reference node of i, say j, which in turn might be encoded in terms of some other reference node. The chain continues until the adjacency list of i is completely determined. This can lead to large reconstruction times, and is a potential drawback of the reference encoding scheme. Cycles among references must also be avoided: if i is encoded in terms of j and j is encoded in terms of k, then care must be taken to ensure that k is not encoded in terms of i.
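The cost function transcribes directly into code. In the sketch below, the no-reference cost used for comparison is an assumed form of the root-edge weight that appears in the affinity graph construction described next.

    # Reference-encoding cost: one bit per outedge of j for the copy bit vector,
    # plus about log2(n) bits for each outedge of i not shared with j (the "+1"
    # term is kept as in the formula above).
    import math

    def reference_cost(N_i, N_j, n):
        """N_i, N_j: sets of out-neighbours of nodes i and j; n: number of nodes."""
        return len(N_j) + math.log2(n) * (len(N_i - N_j) + 1)

    def plain_cost(N_i, n):
        """Assumed cost of storing i's adjacency list without any reference."""
        return math.log2(n) * (len(N_i) + 1)

    n = 1024
    N_i = {1, 2, 3, 4, 5}
    N_j = {1, 2, 3, 4, 9}
    print(reference_cost(N_i, N_j, n))   # 5 + 10 * 2 = 25.0 bits
    print(plain_cost(N_i, n))            # 10 * 6 = 60.0 bits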

Affinity Graph. The affinity graph G_S is used to determine the reference node of each node of the graph G_W while avoiding the cycle problem described above. The nodes of G_S are the same as the nodes of G_W. The weight of the edge from i to j in G_S is set to the cost of encoding i using j as a reference. A root node r is added to the affinity graph; every other node has a directed edge to r, and r has no outgoing edges. The weight of the edge from i to r is the cost of storing i without using any reference node. Node i has a directed edge to node j if and only if w(i, j) < w(i, r).

FIND-REFERENCE algorithm. The algorithm first computes the affinity graph for the given graph and then finds an optimal set of references such that (i) each node has at most one reference, and (ii) there are no cycles among references. The problem of finding the optimal reference assignment subject to these restrictions is equivalent to finding a minimum weight directed spanning tree rooted at r in the affinity graph.

3.4.3 Additional Improvements

Various improvements can be implemented after the references have been found by the above algorithm. For example, additional references can be found for a node: we can remove the edges covered by references from the original graph and rerun the algorithm. Though this is not optimal, since better compression could be obtained if the first run were made with the later stages in mind, it nevertheless gives an efficient heuristic for further compressing the graph using multiple references, a problem that is NP-hard in general. We can also use Huffman encoding to compress the edges not covered by references. Again, the set of references obtained may not be ideal, since we are invalidating the cost function that was used to compute them; however, until the references are chosen we cannot determine the cost of the edges they do not cover, so it is difficult to take this into account properly in the cost function. Other possible improvements include using different compressed representations; for example, the bit vector used to store which links a node has in common with its reference can itself be Huffman or run-length encoded.

4 Indexing

Indexing is required for keyword search in databases to facilitate efficient storage as well as to locate and query the relevant portions of information in a timely and organized manner. In the case of graph representations of data, keyword searching involves navigating paths through the graph structure to obtain the answer to the query. XML, along with relational databases, is the primary data storage format used in most applications. Keyword search over XML data is similar to keyword search over relational databases once the graph model has been created for the XML document.

4.1 Types of Indexes

Three types of indexes are defined in [STW04] for XML data, classified according to the XPath navigational axes they support:

- Structure indexes: This kind of index is limited to trees and cannot be applied to general graphs. Structure indexes consider the XML data as a rooted tree and encode the tree using a pre- and post-order numbering scheme.
- Path indexes: Path indexing is based on structural summaries of XML graphs. Path indexes represent all paths starting from the document root, or some pre-defined subset of such paths. Most path indexes are not limited to trees and can be applied to, or extended to handle, arbitrary graphs.
- Connection indexes: Connection indexes are labelling schemes that support efficient ancestor and descendant queries over the XML data, i.e. they are used to answer queries such as "Are u and v connected in the graph?"

4.2 HOPI - A Connection Index

HOPI ([STW04]) is a connection index for querying XML data, constructed from the 2-hop cover of a graph. HOPI is a compact representation of the reachability and distance information of a graph. It can handle path expressions efficiently and supports efficient evaluation of queries with path wildcards.

4.2.1 2-Hop Cover

A 2-hop cover of a graph is a compact representation of the connections of the graph. It is computed by choosing some node w on every path from u to v in the graph, and adding w to a set L_out(u) of descendants of u and to a set L_in(v) of ancestors of v. If it is then required to check the existence of a path from u to v in the graph, this can be done efficiently by checking whether L_out(u) ∩ L_in(v) ≠ ∅.

2-Hop Label. Let G = (V, E) be a directed graph. Each vertex v of G is assigned a 2-hop label L(v) = (L_in(v), L_out(v)), where L_in(v), L_out(v) ⊆ V, such that for every node x in L_in(v) there exists a path from x to v in G, and for every node y in L_out(v) there is a path from v to y.

For a directed graph G = (V, E), let u and v be two nodes with 2-hop labels L(u) and L(v) respectively. Then there exists a path from u to v if there is a node w ∈ V such that w ∈ L_out(u) ∩ L_in(v). A 2-hop labelling of a graph assigns a 2-hop label to each node of G.

2-Hop Cover. A 2-hop cover of a graph G = (V, E) is a 2-hop labelling of G such that whenever there is a path from u to v in G, L_out(u) ∩ L_in(v) ≠ ∅. The size of a 2-hop cover is defined as the sum of the sizes of all node labels: Σ_{v ∈ V} (|L_in(v)| + |L_out(v)|).

4.2.2 Incremental Algorithm for Computation of 2-Hop Cover

The set cover problem can be reduced to the problem of finding the minimum 2-hop cover, and thus computing the minimum 2-hop cover of a graph is NP-hard. The algorithm proposed to compute the 2-hop cover chooses a center node w for each path (u, v) in G, and adds w to L_in(v) and to L_out(u). The algorithm maintains the set of not yet covered connections and at each step picks a center node so as to cover as many of the uncovered connections as possible; such a set of connections is obtained by computing the densest subgraph of the center graph of w. This algorithm runs in polynomial time and computes a 2-hop cover whose size is within a factor of O(log |V|) of the optimum.

4.2.3 Divide and Conquer Algorithm for Computation of 2-Hop Cover

Computing the transitive closure, as required by the incremental algorithm, is memory intensive, so a divide and conquer technique is proposed for computing the 2-hop cover:

- Partition the graph such that the transitive closures of the partitions fit in memory, so that the 2-hop cover computation for each partition can be carried out with memory-based structures.
- Compute the transitive closure of each partition and from it the 2-hop cover of each partition.
- Merge the 2-hop covers of partitions having at least one cross-partition edge, and thus obtain a 2-hop cover for the entire graph.

4.2.4 Distance Aware 2-Hop Cover

The above algorithm for building the 2-hop cover can be modified to include distance information in the HOPI index ([STW05]). Each entry in the label of a node is augmented with distance information; e.g. the entries in L_in(v) become pairs (u, d(u, v)), where d(u, v) is the distance between u and v. The main modification needed is that a node w can be a center node for a path from u to v only if it lies on a shortest path, since otherwise it cannot reflect the correct distance information. This additional restriction is added to the construction of the center graph of w, where we add the edge (u, v) only if the distance from u to v equals the sum of the distances from u to w and from w to v.
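To close the section, here is a minimal sketch of how 2-hop labels answer reachability queries. The labels are written by hand for a three-node chain and are purely illustrative; they are not produced by the HOPI construction algorithms above.

    # Reachability test via 2-hop labels: u reaches v iff L_out(u) and L_in(v)
    # share a common center node.

    def reachable(labels, u, v):
        """labels: node -> (L_in, L_out). Nodes are conventionally members of their
        own labels so that trivial and one-hop connections are covered too."""
        _, l_out_u = labels[u]
        l_in_v, _ = labels[v]
        return bool(l_out_u & l_in_v)

    # Chain a -> b -> c, with b chosen as the center node for every connection.
    labels = {
        "a": ({"a"}, {"a", "b"}),      # b is a descendant recorded in L_out(a)
        "b": ({"b"}, {"b"}),
        "c": ({"b", "c"}, {"c"}),      # b is an ancestor recorded in L_in(c)
    }
    print(reachable(labels, "a", "c"))   # True:  L_out(a) and L_in(c) share {b}
    print(reachable(labels, "c", "a"))   # False: no common center node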


Multilevel Algorithms for Multi-Constraint Hypergraph Partitioning Multilevel Algorithms for Multi-Constraint Hypergraph Partitioning George Karypis University of Minnesota, Department of Computer Science / Army HPC Research Center Minneapolis, MN 55455 Technical Report

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

Parallel Graph Algorithms

Parallel Graph Algorithms Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050 Spring 202 Part I Introduction Overview Graphsdenitions, properties, representation Minimal spanning tree Prim's algorithm Shortest

More information

CS521 \ Notes for the Final Exam

CS521 \ Notes for the Final Exam CS521 \ Notes for final exam 1 Ariel Stolerman Asymptotic Notations: CS521 \ Notes for the Final Exam Notation Definition Limit Big-O ( ) Small-o ( ) Big- ( ) Small- ( ) Big- ( ) Notes: ( ) ( ) ( ) ( )

More information

Lesson 2 7 Graph Partitioning

Lesson 2 7 Graph Partitioning Lesson 2 7 Graph Partitioning The Graph Partitioning Problem Look at the problem from a different angle: Let s multiply a sparse matrix A by a vector X. Recall the duality between matrices and graphs:

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Introduction graph theory useful in practice represent many real-life problems can be if not careful with data structures Chapter 9 Graph s 2 Definitions Definitions an undirected graph is a finite set

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms. Lecturer: Shi Li

CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms. Lecturer: Shi Li CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms Lecturer: Shi Li Department of Computer Science and Engineering University at Buffalo Main Goal of Algorithm Design Design fast

More information

Reference Sheet for CO142.2 Discrete Mathematics II

Reference Sheet for CO142.2 Discrete Mathematics II Reference Sheet for CO14. Discrete Mathematics II Spring 017 1 Graphs Defintions 1. Graph: set of N nodes and A arcs such that each a A is associated with an unordered pair of nodes.. Simple graph: no

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Graphs: Introduction Ulf Leser This Course Introduction 2 Abstract Data Types 1 Complexity analysis 1 Styles of algorithms 1 Lists, stacks, queues 2 Sorting (lists) 3 Searching

More information

Algorithms for Data Science

Algorithms for Data Science Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Thursday, October 1, 2015 Outline 1 Recap 2 Shortest paths in graphs with non-negative edge weights (Dijkstra

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Chapter 9 Graph Algorithms 2 Introduction graph theory useful in practice represent many real-life problems can be if not careful with data structures 3 Definitions an undirected graph G = (V, E) is a

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions... Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing

More information

Adaptive-Mesh-Refinement Pattern

Adaptive-Mesh-Refinement Pattern Adaptive-Mesh-Refinement Pattern I. Problem Data-parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Graphs and Network Flows ISE 411. Lecture 7. Dr. Ted Ralphs

Graphs and Network Flows ISE 411. Lecture 7. Dr. Ted Ralphs Graphs and Network Flows ISE 411 Lecture 7 Dr. Ted Ralphs ISE 411 Lecture 7 1 References for Today s Lecture Required reading Chapter 20 References AMO Chapter 13 CLRS Chapter 23 ISE 411 Lecture 7 2 Minimum

More information

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey

More information

CS 341: Algorithms. Douglas R. Stinson. David R. Cheriton School of Computer Science University of Waterloo. February 26, 2019

CS 341: Algorithms. Douglas R. Stinson. David R. Cheriton School of Computer Science University of Waterloo. February 26, 2019 CS 341: Algorithms Douglas R. Stinson David R. Cheriton School of Computer Science University of Waterloo February 26, 2019 D.R. Stinson (SCS) CS 341 February 26, 2019 1 / 296 1 Course Information 2 Introduction

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

22 Elementary Graph Algorithms. There are two standard ways to represent a

22 Elementary Graph Algorithms. There are two standard ways to represent a VI Graph Algorithms Elementary Graph Algorithms Minimum Spanning Trees Single-Source Shortest Paths All-Pairs Shortest Paths 22 Elementary Graph Algorithms There are two standard ways to represent a graph

More information

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Algorithms for Integer Programming

Algorithms for Integer Programming Algorithms for Integer Programming Laura Galli November 9, 2016 Unlike linear programming problems, integer programming problems are very difficult to solve. In fact, no efficient general algorithm is

More information

Chapter 9. Greedy Technique. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

Chapter 9. Greedy Technique. Copyright 2007 Pearson Addison-Wesley. All rights reserved. Chapter 9 Greedy Technique Copyright 2007 Pearson Addison-Wesley. All rights reserved. Greedy Technique Constructs a solution to an optimization problem piece by piece through a sequence of choices that

More information

Minimum-Spanning-Tree problem. Minimum Spanning Trees (Forests) Minimum-Spanning-Tree problem

Minimum-Spanning-Tree problem. Minimum Spanning Trees (Forests) Minimum-Spanning-Tree problem Minimum Spanning Trees (Forests) Given an undirected graph G=(V,E) with each edge e having a weight w(e) : Find a subgraph T of G of minimum total weight s.t. every pair of vertices connected in G are

More information

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing" Database System Concepts, 6 th Ed.! Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use " Chapter 11: Indexing and Hashing" Basic Concepts!

More information

Keyword search in relational databases. By SO Tsz Yan Amanda & HON Ka Lam Ethan

Keyword search in relational databases. By SO Tsz Yan Amanda & HON Ka Lam Ethan Keyword search in relational databases By SO Tsz Yan Amanda & HON Ka Lam Ethan 1 Introduction Ubiquitous relational databases Need to know SQL and database structure Hard to define an object 2 Query representation

More information

Highway Dimension and Provably Efficient Shortest Paths Algorithms

Highway Dimension and Provably Efficient Shortest Paths Algorithms Highway Dimension and Provably Efficient Shortest Paths Algorithms Andrew V. Goldberg Microsoft Research Silicon Valley www.research.microsoft.com/ goldberg/ Joint with Ittai Abraham, Amos Fiat, and Renato

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

9/24/ Hash functions

9/24/ Hash functions 11.3 Hash functions A good hash function satis es (approximately) the assumption of SUH: each key is equally likely to hash to any of the slots, independently of the other keys We typically have no way

More information

Lecture and notes by: Nate Chenette, Brent Myers, Hari Prasad November 8, Property Testing

Lecture and notes by: Nate Chenette, Brent Myers, Hari Prasad November 8, Property Testing Property Testing 1 Introduction Broadly, property testing is the study of the following class of problems: Given the ability to perform (local) queries concerning a particular object (e.g., a function,

More information

Seminar on Algorithms and Data Structures: Multiple-Edge-Fault-Tolerant Approximate Shortest-Path Trees [1]

Seminar on Algorithms and Data Structures: Multiple-Edge-Fault-Tolerant Approximate Shortest-Path Trees [1] Seminar on Algorithms and Data Structures: Multiple-Edge-Fault-Tolerant Approximate Shortest-Path Trees [1] Philipp Rimle April 14, 2018 1 Introduction Given a graph with positively real-weighted undirected

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

Introduction III. Graphs. Motivations I. Introduction IV

Introduction III. Graphs. Motivations I. Introduction IV Introduction I Graphs Computer Science & Engineering 235: Discrete Mathematics Christopher M. Bourke cbourke@cse.unl.edu Graph theory was introduced in the 18th century by Leonhard Euler via the Königsberg

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Chapter 9 Graph Algorithms 2 Introduction graph theory useful in practice represent many real-life problems can be slow if not careful with data structures 3 Definitions an undirected graph G = (V, E)

More information

Lecture 6 Basic Graph Algorithms

Lecture 6 Basic Graph Algorithms CS 491 CAP Intro to Competitive Algorithmic Programming Lecture 6 Basic Graph Algorithms Uttam Thakore University of Illinois at Urbana-Champaign September 30, 2015 Updates ICPC Regionals teams will be

More information

Graph and Digraph Glossary

Graph and Digraph Glossary 1 of 15 31.1.2004 14:45 Graph and Digraph Glossary A B C D E F G H I-J K L M N O P-Q R S T U V W-Z Acyclic Graph A graph is acyclic if it contains no cycles. Adjacency Matrix A 0-1 square matrix whose

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Introduction to Mathematical Programming IE406. Lecture 16. Dr. Ted Ralphs

Introduction to Mathematical Programming IE406. Lecture 16. Dr. Ted Ralphs Introduction to Mathematical Programming IE406 Lecture 16 Dr. Ted Ralphs IE406 Lecture 16 1 Reading for This Lecture Bertsimas 7.1-7.3 IE406 Lecture 16 2 Network Flow Problems Networks are used to model

More information

CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL

CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL 5.1 INTRODUCTION The survey presented in Chapter 1 has shown that Model based testing approach for automatic generation of test

More information

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch. Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible

More information

Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization

Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization IEEE TRANSACTIONS ON COMPUTER AIDED DESIGN, VOL XX, NO. XX, 2005 1 Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization Navaratnasothie Selvakkumaran and

More information

11/22/2016. Chapter 9 Graph Algorithms. Introduction. Definitions. Definitions. Definitions. Definitions

11/22/2016. Chapter 9 Graph Algorithms. Introduction. Definitions. Definitions. Definitions. Definitions Introduction Chapter 9 Graph Algorithms graph theory useful in practice represent many real-life problems can be slow if not careful with data structures 2 Definitions an undirected graph G = (V, E) is

More information

On the Minimum k-connectivity Repair in Wireless Sensor Networks

On the Minimum k-connectivity Repair in Wireless Sensor Networks On the Minimum k-connectivity epair in Wireless Sensor Networks Hisham M. Almasaeid and Ahmed E. Kamal Dept. of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011 Email:{hisham,kamal}@iastate.edu

More information

Let the dynamic table support the operations TABLE-INSERT and TABLE-DELETE It is convenient to use the load factor ( )

Let the dynamic table support the operations TABLE-INSERT and TABLE-DELETE It is convenient to use the load factor ( ) 17.4 Dynamic tables Let us now study the problem of dynamically expanding and contracting a table We show that the amortized cost of insertion/ deletion is only (1) Though the actual cost of an operation

More information

A project report submitted to Indiana University

A project report submitted to Indiana University Page Rank Algorithm Using MPI Indiana University, Bloomington Fall-2012 A project report submitted to Indiana University By Shubhada Karavinkoppa and Jayesh Kawli Under supervision of Prof. Judy Qiu 1

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

Parallel Query Optimisation

Parallel Query Optimisation Parallel Query Optimisation Contents Objectives of parallel query optimisation Parallel query optimisation Two-Phase optimisation One-Phase optimisation Inter-operator parallelism oriented optimisation

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

Chapter 17 Indexing Structures for Files and Physical Database Design

Chapter 17 Indexing Structures for Files and Physical Database Design Chapter 17 Indexing Structures for Files and Physical Database Design We assume that a file already exists with some primary organization unordered, ordered or hash. The index provides alternate ways to

More information

22 Elementary Graph Algorithms. There are two standard ways to represent a

22 Elementary Graph Algorithms. There are two standard ways to represent a VI Graph Algorithms Elementary Graph Algorithms Minimum Spanning Trees Single-Source Shortest Paths All-Pairs Shortest Paths 22 Elementary Graph Algorithms There are two standard ways to represent a graph

More information

CS200: Graphs. Prichard Ch. 14 Rosen Ch. 10. CS200 - Graphs 1

CS200: Graphs. Prichard Ch. 14 Rosen Ch. 10. CS200 - Graphs 1 CS200: Graphs Prichard Ch. 14 Rosen Ch. 10 CS200 - Graphs 1 Graphs A collection of nodes and edges What can this represent? n A computer network n Abstraction of a map n Social network CS200 - Graphs 2

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

CS261: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem

CS261: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem CS61: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem Tim Roughgarden February 5, 016 1 The Traveling Salesman Problem (TSP) In this lecture we study a famous computational problem,

More information

Greedy Approach: Intro

Greedy Approach: Intro Greedy Approach: Intro Applies to optimization problems only Problem solving consists of a series of actions/steps Each action must be 1. Feasible 2. Locally optimal 3. Irrevocable Motivation: If always

More information

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

Implementation of Near Optimal Algorithm for Integrated Cellular and Ad-Hoc Multicast (ICAM)

Implementation of Near Optimal Algorithm for Integrated Cellular and Ad-Hoc Multicast (ICAM) CS230: DISTRIBUTED SYSTEMS Project Report on Implementation of Near Optimal Algorithm for Integrated Cellular and Ad-Hoc Multicast (ICAM) Prof. Nalini Venkatasubramanian Project Champion: Ngoc Do Vimal

More information

PERFECT MATCHING THE CENTRALIZED DEPLOYMENT MOBILE SENSORS THE PROBLEM SECOND PART: WIRELESS NETWORKS 2.B. SENSOR NETWORKS OF MOBILE SENSORS

PERFECT MATCHING THE CENTRALIZED DEPLOYMENT MOBILE SENSORS THE PROBLEM SECOND PART: WIRELESS NETWORKS 2.B. SENSOR NETWORKS OF MOBILE SENSORS SECOND PART: WIRELESS NETWORKS 2.B. SENSOR NETWORKS THE CENTRALIZED DEPLOYMENT OF MOBILE SENSORS I.E. THE MINIMUM WEIGHT PERFECT MATCHING 1 2 ON BIPARTITE GRAPHS Prof. Tiziana Calamoneri Network Algorithms

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PAUL BALISTER Abstract It has been shown [Balister, 2001] that if n is odd and m 1,, m t are integers with m i 3 and t i=1 m i = E(K n) then K n can be decomposed

More information