Keyword Search in External Memory Graph Representations of Data


B. Tech. Seminar Report
Submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology

by
Avin Mittal
Roll No:

under the guidance of
Prof. S. Sudarshan

Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai

Acknowledgements

I would like to thank my guide, Prof. S. Sudarshan, for the constant feedback and encouragement he has given to my work, and without whose motivation this work would not have been possible.

Avin Mittal

Abstract

Keyword search over relational and XML data has grown in popularity since the advent of Web search engines. Keyword search over relational data is significantly different from Web search because the required information is often split across multiple tables as a result of normalization. The algorithms and techniques applied to databases therefore produce answer trees from the data graph, as opposed to the answer nodes produced by Web search engines. BANKS and several other systems enable keyword-based search over relational databases. Although the algorithms and heuristics used in these systems are efficient and have been tuned to give good results, most of them assume that the entire data graph is present in memory, which may not be the case for typical databases. Moreover, the search algorithms used by such systems may explore a large portion of the graph before finding an answer. In this report, we study external memory techniques that can be applied to such systems to improve the time and space complexity of the algorithms, with particular attention to restricting the number of I/O operations.

Contents

1 Introduction
2 Keyword Search Systems
  2.1 BANKS
    2.1.1 Graph Model for Representing Data
    2.1.2 Query and Answer Model
    2.1.3 Backward Expanding Search Algorithm
    2.1.4 Bidirectional Search Algorithm
  2.2 DBXplorer
    2.2.1 Publish
    2.2.2 Search
  2.3 ObjectRank
    2.3.1 Authority Transfer Graph
    2.3.2 Score of a Node w.r.t. a Query
3 External Memory Graph Algorithms
  3.1 Blocking
    3.1.1 Model
    3.1.2 Paging Strategies
  3.2 Graph Partitioning
    3.2.1 Mathematical Model
    3.2.2 Multilevel k-way Partitioning
  3.3 S-Node Representation
    3.3.1 Advantages of S-Node Representation
    3.3.2 Structure of an S-Node Representation
    3.3.3 Partitioning Desiderata
    3.3.4 Iterative Partitioning Algorithm
  3.4 Compression
    3.4.1 Huffman Encoding
    3.4.2 Reference Encoding
    3.4.3 Additional Improvements
4 Indexing
  4.1 Types of Indexes
  4.2 HOPI - A Connection Index
    4.2.1 2-Hop Cover
    4.2.2 Incremental Algorithm for Computation of 2-Hop Cover
    4.2.3 Divide and Conquer Algorithm for Computation of 2-Hop Cover
    4.2.4 Distance Aware 2-Hop Cover
5 Conclusions and Future Work

1 Introduction

Search engines on the Web have popularized the keyword search paradigm for fetching information. With huge amounts of online data being stored in the form of relational and XML data, keyword search over such data has grown in importance, particularly for casual users who have no knowledge of the underlying schema. Standard query languages such as SQL and XPath are too complex for casual users, so systems that support keyword querying over databases are often the only way for such users to get information from relational data. The key difference from keyword search on the Web is that the data may be split across several tables due to normalization; any system must take this into consideration and produce answers constructed from on-the-fly joins of database tables. Many systems, such as BANKS, DBXplorer, DISCOVER, ObjectRank and XRank, have been proposed and developed to support keyword queries over databases. We discuss some of these systems and their key features in Section 2.

In BANKS and some of the above-mentioned systems, a graph data model is used to represent the data in the database. In this model, a node is created for each tuple, and a directed edge is created for each foreign key link or other kind of link between tuples. The kind of searches that keyword queries entail, together with the effects of normalization, make the graph model the most natural representation for efficient keyword search. The answer to a query is a rooted directed tree with the keyword nodes at the leaves. In BANKS, two algorithms have been proposed to compute answer trees given a set of keyword nodes: (i) backward search, which starts from the keyword nodes and searches backwards for a common root, and (ii) bidirectional search, which additionally allows forward searches from potential roots towards the leaves.

Though keyword search systems such as BANKS work well in practice, they suffer from a potential memory explosion: the run-time memory requirements of the algorithms can be unreasonably large for many large-scale databases, since the search may explore a large number of nodes before producing an answer. Moreover, the algorithms used in these systems assume that the required data is present in memory. In practice this may not hold for huge graphs, and the running time may then be dominated by the number of I/O operations performed during execution. The purpose of this report is to study external memory techniques that have been described over the years for efficient storage and querying of disk-resident data, and to see how these techniques can be applied to the systems mentioned above.

In Section 3, we look at measures that have been suggested for efficient traversal and compression of disk-based data graphs: two-level representations of graphs, clustering to partition a graph, and compression techniques such as Huffman encoding and reference encoding. Many of these techniques were originally described and studied for Web graphs but can be applied to other graphs as well. We also study some basic properties of disk-based algorithms and the paging strategies a system can use to minimize I/O operations while executing an algorithm.

Indexing is an important aspect of efficient storage and traversal of data graphs, as it enables us to locate and query the relevant sections of the graph quickly. There are many types of indexes, depending on the type of queries they are intended to answer.

A connection index is an index that stores only the connectivity information of a graph, i.e. it can be used to answer ancestor and descendant queries. One such index is HOPI, which is based on computing the 2-hop cover of a graph. We study the basic properties, computation and other features of the HOPI index in Section 4.

2 Keyword Search Systems

Since the growth of the World Wide Web, there has been a rapid increase in the number of users who wish to browse and search the contents of online databases without necessarily caring about the underlying schema or a formal query language such as SQL. Many systems have been proposed over the past few years that address keyword-based searching over relational and XML data. This section describes some of the techniques and features of these keyword-based search systems.

2.1 BANKS

BANKS ([BHN+02]), an acronym for Browsing ANd Keyword Searching, is a system that enables data and schema browsing along with keyword search over relational databases. Keyword search in BANKS uses proximity-based ranking based on foreign key and other types of links. The database is modelled as a directed graph, with nodes representing tuples and edges representing cross-references. BANKS allows query keywords to match data (information present in the tuples) as well as metadata (column/table names). BANKS also enables near zero-effort Web publishing of relational data which would otherwise remain invisible to the Web.

2.1.1 Graph Model for Representing Data

The database is modelled as a directed weighted graph, with each tuple in the database corresponding to a node in the graph. Each foreign key link is a directed edge from the referencing node to the referenced node. An answer is a rooted directed tree having a path from the root to each of the keyword nodes. The answer trees are scored based on a combination of edge weights and node weights, and the scores are used to rank the answer trees.

Edge weights are assigned in inverse proportion to the importance of the edge. For example, in a bibliographic database, an edge from Paper to Writes would have a lower weight than one from Paper to Cites. The weight of a tree is proportional to the sum of its edge weights, and the relevance of a tree is inversely proportional to its weight. In some cases we may need to traverse edges in the backward direction; for example, finding a path from Paper to Author requires traversing the foreign key edge from Writes to Paper backwards. Thus, for each edge (u, v), a backward edge (v, u) is created with a different edge weight. The weight of (v, u) is directly proportional to the number of links to v from nodes of the same type as u.

[Figure 1: Backward Edges]
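The construction just described can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a unit forward weight and a linear rule for the backward weight), with hypothetical relation names and helper; it is not the BANKS implementation.

    # A minimal sketch of building the weighted data graph with explicit backward
    # edges. Relation names, the base weight and the backward-weight rule are
    # illustrative assumptions, not the BANKS edge-weight model itself.
    from collections import defaultdict

    def build_data_graph(tuples, fk_links, forward_weight=1.0):
        """tuples: dict node_id -> relation name.
        fk_links: list of (referencing_node, referenced_node) pairs."""
        adj = defaultdict(list)                 # node -> list of (neighbour, weight)

        # Count, for each referenced node v, how many nodes of each relation refer to it.
        in_by_type = defaultdict(int)           # (v, relation of u) -> count
        for u, v in fk_links:
            in_by_type[(v, tuples[u])] += 1

        for u, v in fk_links:
            adj[u].append((v, forward_weight))  # forward edge u -> v
            # Backward edge v -> u: weight grows with the number of links to v
            # from nodes of the same type as u, so heavily referenced nodes are
            # costlier to expand through in reverse.
            back_w = forward_weight * in_by_type[(v, tuples[u])]
            adj[v].append((u, back_w))
        return adj

    # Tiny bibliographic example (hypothetical ids).
    tuples = {"p1": "Paper", "w1": "Writes", "w2": "Writes", "a1": "Author"}
    links = [("w1", "p1"), ("w2", "p1"), ("w1", "a1")]
    print(dict(build_data_graph(tuples, links)))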

Node weights, inspired by prestige rankings such as Google's ([BP98]), are also incorporated in the model. Nodes having multiple pointers to them have a higher weight; a higher node weight implies higher prestige. Node weights and tree weights are combined to get an overall relevance score. A similarity measure s(R_1, R_2) between relations R_1 and R_2, where R_1 is the referencing relation and R_2 is the referenced relation, is generally asymmetric and depends on the type of link from R_1 to R_2. s(R_1, R_2) is set to infinity if R_1 does not refer to R_2.

2.1.2 Query and Answer Model

A query consists of n terms, say t_1, t_2, ..., t_n. The first step is to locate the nodes matching the query terms. A node is said to be relevant to a query term if it contains the term as part of an attribute value or metadata. For each term t_i, a set S_i of nodes relevant to t_i is thus obtained. The answer to a query is a rooted directed tree containing at least one node from each S_i. Each answer tree is assigned a relevance score based on the edge weights and the node weights, and the answers are presented in decreasing order of score.

2.1.3 Backward Expanding Search Algorithm

The answer tree described above may also contain nodes that are in none of the S_i, and is thus a Steiner tree. Steiner tree computation is NP-hard, and is further complicated by node weight considerations. A search algorithm should produce not only the tree with the highest relevance score but also other trees with high relevance scores. The backward expanding search algorithm ([BHN+02]) offers a heuristic for incrementally computing the query results. Given a set of keywords, the algorithm first finds the set of relevant nodes S_i for each keyword t_i using disk-resident indices on keywords. Let S = ∪_i S_i. The backward search algorithm runs |S| concurrent copies of Dijkstra's single source shortest path algorithm, one for each keyword node in S. Each copy of Dijkstra's algorithm traverses the graph edges in the reverse direction. The goal is to find a common vertex from which a forward path exists to at least one node in each set S_i. The paths thus obtained define a rooted directed tree with the common vertex at the root and the keyword nodes at the leaves. The connection trees generated by the algorithm may not be in exact decreasing order of relevance score, since node weights are not considered while growing the trees. Thus, a heap of the top-k answers seen so far is maintained (for some k determined by experimentation), and whenever the heap becomes full the top answer is output. (A simplified sketch of this search appears at the end of Section 2.1.)

2.1.4 Bidirectional Search Algorithm

The backward search algorithm presented above explores an unnecessarily large number of nodes in the following cases:

- The query contains a frequently occurring term. In the backward search algorithm one iterator is associated with every matching node, so the algorithm generates a large number of iterators if a keyword matches a large number of nodes. This can happen for frequently occurring terms or if a keyword matches a relation name (which matches all tuples belonging to that relation).

- One of the iterators reaches a node having a large in-degree. In this case, the iterator will need to explore a large number of nodes.

In these scenarios, the backward search algorithm explores a large portion of the graph before finding relevant answers, which may result in long search times. The authors of [KPC+05] propose a new algorithm wherein iterators are created only for the nodes matching less frequent keywords, and paths are explored from potential roots towards the more frequent keywords. Every node reached by an iterator in the backward search algorithm is a potential root. If a forward path is followed from these nodes, the frequent keywords may be reached sooner and the algorithm terminates faster.

[Figure 2: Motivation for Bidirectional Search]

The two main features of the bidirectional search algorithm are:

- Starting forward searches from the potential roots.
- A spreading activation model to prioritize nodes on the iterator fringe, ensuring that an iterator with a small fringe gets a higher priority, and that among nodes within a single iterator, those in a less bushy subtree get a higher priority. Avoiding backward search from large fringes avoids wasteful expansion, while the corresponding keyword nodes can still be connected by searching forward from nodes with high activation values.

The key differences of the bidirectional search algorithm from the backward search algorithm are:

- All the single source shortest path iterators of the backward search algorithm are merged into a single incoming iterator.
- Spreading activation is used to prioritize the search. For the incoming iterator, the node expanded next is the one with the highest activation. Activation spreads from the keyword nodes, and edge weights are taken into account while spreading, so the activation reflects the edge weights as well as the branching of the search fringe.
- Another iterator, called the outgoing iterator, is run concurrently and follows outgoing edges from the nodes explored by the incoming iterator.
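As noted in Section 2.1.3, the core of backward expanding search can be sketched as follows. The sketch interleaves one reverse-Dijkstra copy per keyword node through a single priority queue and reports candidate answer roots; answer-tree construction, node weights, the top-k heap and the bidirectional refinements above are all omitted, and the graph encoding and function name are assumptions rather than BANKS code.

    # Much-simplified backward expanding search: expand backwards from every
    # keyword node, in increasing distance order, until some vertex has been
    # reached from at least one node of every keyword set S_i.
    import heapq
    from collections import defaultdict

    def backward_expanding_search(radj, keyword_sets, max_answers=10):
        """radj: reverse adjacency, v -> list of (u, w) meaning a forward edge u->v of weight w.
        keyword_sets: list of sets S_i of keyword-matching nodes."""
        pq = []                                   # (distance, set index, origin, node)
        best = defaultdict(dict)                  # (set index, origin) -> {node: distance}
        reached_from = defaultdict(set)           # node -> keyword-set indices that reached it

        for i, s in enumerate(keyword_sets):      # one reverse-Dijkstra copy per keyword node
            for origin in s:
                heapq.heappush(pq, (0.0, i, origin, origin))
                best[(i, origin)][origin] = 0.0

        found = 0
        while pq and found < max_answers:
            d, i, origin, v = heapq.heappop(pq)
            if d > best[(i, origin)].get(v, float("inf")):
                continue                          # stale queue entry
            if i not in reached_from[v]:
                reached_from[v].add(i)
                if len(reached_from[v]) == len(keyword_sets):
                    yield v                       # forward paths exist from v to every S_i
                    found += 1
            for u, w in radj.get(v, ()):          # expand backwards along the edge u -> v
                nd = d + w
                if nd < best[(i, origin)].get(u, float("inf")):
                    best[(i, origin)][u] = nd
                    heapq.heappush(pq, (nd, i, origin, u))

    # Toy graph: forward edges c->a and c->b (weight 1 each); keywords match {a} and {b}.
    radj = {"a": [("c", 1.0)], "b": [("c", 1.0)]}
    print(list(backward_expanding_search(radj, [{"a"}, {"b"}])))   # ['c']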

2.2 DBXplorer

DBXplorer ([ACD02]) is another system for keyword search over relational databases. The system has been implemented using a commercial relational database and a web server, and allows users to interact via a browser front-end. The distinguishing feature of DBXplorer compared with BANKS is that DBXplorer uses SQL, and SQL's query optimization over join trees, to output single rows obtained by joining several tables. Given a set of query words, DBXplorer returns all rows (either from single tables, or obtained by joining tables connected by foreign key links) such that each row contains all keywords. The two key components of the system are (a) a pre-processing step called Publish that builds the structures associated with the database and prepares it for keyword searching, and (b) a Search step that obtains the matching rows from the published database.

[Figure 3: Architecture of DBXplorer]

2.2.1 Publish

The given database (or some specific part of it) is prepared for keyword searching through the following steps:

- A database is identified, along with the set of rows and columns within it to be published.
- Auxiliary tables are created to support keyword queries over the database. The most important of these is the symbol table, which is used to identify the location of each keyword in the database (i.e. the rows, columns and tables in which it occurs).

2.2.2 Search

When a query (consisting of a set of keywords) is given to the system, it is answered as follows:

- The symbol table is looked up to find the tables, rows and columns of the database that contain the query keywords.
- All potential subsets of tables that, if joined (i.e. connected by foreign key links), may contain rows having all the keywords are identified and enumerated. A subset of tables can be joined only if they are connected in the schema through foreign key links.

[Figure 4: Join trees]

- For each enumerated join tree, an SQL statement is created (and executed) that joins the tables in the tree and selects the rows containing all keywords. The rows thus obtained are output to the user.

DISCOVER: Though DBXplorer is free from the limitations of universal relations, it does not consider solutions that have more than one tuple from a single relation. DBXplorer produces join trees from the schema graph, thereby restricting itself to at most one tuple from each relation (see Figure 4). An improvement over DBXplorer is DISCOVER ([HP02]), which produces candidate networks of tuples, i.e. sets of tuples related through primary key-foreign key relationships that together contain all the keywords of the query. DISCOVER operates in two steps: the Candidate Network Generator enumerates all possible candidate networks of tuples, and the Plan Generator plans the efficient evaluation of the set of candidate networks, exploiting opportunities to reuse common subexpressions. Both DISCOVER and DBXplorer rank their results on the basis of the number of joins involved in producing them. The more sophisticated notions of node weights and similarity measures between relations are thus ignored, resulting in a naive ranking scheme. Another difference between these systems and BANKS is that they apply only to relational databases, whereas BANKS applies equally well to XML data. DBXplorer and DISCOVER use the underlying DBMS for efficient storage and rely on SQL query optimization for efficient querying.

2.3 ObjectRank

ObjectRank ([BHP04]) is a system that applies authority-based ranking to keyword search in relational databases modelled as directed graphs. Authority is a measure of the importance of a node in the graph. Intuitively, each keyword node has some authority, and this authority is transferred along the links of the data graph. PageRank ([BP98]) is an excellent tool for measuring the global importance of nodes; however, for keyword search over databases we need a query-specific relevance score for each node. For this purpose, a global authority calculation is done, very similar to Google's PageRank computation, along with a keyword-specific authority calculation that assigns each node an authority value for each keyword. At run time, the ObjectRank system computes the query-specific authorities of the nodes by combining the individual keyword-based authorities computed earlier.

The final score of a node for a given query is obtained by combining the global authority and the query-specific authority scores of the node.

The link structure of a database graph is quite different from the link structure of the Web. Unlike the Web, where all edges are hyperlinks and links are largely indistinguishable from one another, in relational databases there are various types of edges. Edges between different types of nodes can carry different semantics, and thus the amount of authority flowing to each out-neighbour of a node need not be identical.

2.3.1 Authority Transfer Graph

The database is modelled as a directed graph in which each tuple of the database is represented as a node. Each node represents an object of the database and may have some attributes. Associated with the database is also a schema graph, in which each relation of the database is modelled as a node and relations connected by foreign key links are joined by edges. From the schema graph, an authority transfer graph is created to reflect the flow of authority through the edges of the graph. For each edge (u, v) in the schema graph, we create one forward and one backward edge in the authority transfer graph; the idea of the backward edge is that authority can flow in the backward direction as well. Both forward and backward edges are annotated with authority transfer rates, which reflect the fraction of a node's authority flowing through the edge. The authority transfer rates in the two directions need not be the same.

[Figure 5: Authority Transfer Graph]

2.3.2 Score of a Node w.r.t. a Query

The score of a node v w.r.t. a particular keyword query w is obtained by combining the global ObjectRank r_G(v) of the node and the keyword-specific ObjectRank r_w(v) of the node. The combination function used in ObjectRank is r_{w,G}(v) = r_w(v) · (r_G(v))^g, where g is the global ObjectRank weight and can be tuned for varying degrees of generality in the results.

Keyword-Specific ObjectRank. Given a single keyword query, ObjectRank finds the set of nodes containing the keyword, S(w) (called the base set for that keyword), and assigns an ObjectRank r_w(v_i) to each node v_i of the graph by resolving the equation

r_w = d · A · r_w + ((1 − d) / |S(w)|) · s

where A is the authority transfer matrix, d controls the importance of the base set, and s is the base set vector for S(w). The damping factor d determines the portion of ObjectRank that an object transfers to its neighbours; by decreasing d, the nodes containing the keywords are favoured. (A small sketch of this iteration appears at the end of this section.)

Global ObjectRank. The global ObjectRank is calculated in a fashion identical to Google's PageRank computation. The system applies the random surfer model, and the base set includes all nodes of the graph. Note that this method assumes all nodes have the same initial authority value (as in PageRank). However, in many applications the domain expert may not want this; for example, in a complaints database we may wish to assign complaints from regular customers a higher initial authority.

Multiple Keyword Queries. The ObjectRank for a multiple keyword query w_1, w_2, ..., w_m is obtained by extending the random surfer model to introduce m surfers, where the i-th surfer starts from the base set S(w_i). For AND semantics, the ObjectRank of v is the probability that, at any given time, all the random surfers are at v:

r_AND(v) = r_{w_1}(v) · r_{w_2}(v) · ... · r_{w_m}(v)

For OR semantics, the ObjectRank of v is the probability that, at any given time, at least one surfer is at v:

r_OR(v) = 1 − (1 − r_{w_1}(v)) · (1 − r_{w_2}(v)) · ... · (1 − r_{w_m}(v))

The main advantage of ObjectRank lies in the flexibility of the system: it can be tuned, and various parameters can be varied to customize it to the application domain. Some of the tunable parameters and their effects are:

- The base authorities of different nodes need not be the same for all domains. The domain expert may decide to differentiate between nodes of the same base set based on attribute values of the objects in the set.
- The damping factor in the calculation of keyword-specific ObjectRank can be decreased to favour the base set.
- The authority transfer rates are set by a domain expert and reflect the amount of authority flowing across various types of links.
- The global ObjectRank weight can be varied to assign different degrees of weight to the global importance of nodes as opposed to their relevance to the current query.
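To make the fixed-point computation concrete, the following sketch solves the keyword-specific ObjectRank equation by simple power iteration on a hand-built authority transfer matrix. The matrix, damping value and function name are illustrative assumptions, not the ObjectRank implementation.

    # Power iteration for r_w = d*A*r_w + ((1-d)/|S(w)|)*s on a tiny example.

    def keyword_objectrank(A, base_set, d=0.85, iters=100):
        """A: n x n list of lists, A[i][j] = authority transfer rate from node j to node i.
        base_set: indices of the nodes containing the keyword (the set S(w))."""
        n = len(A)
        s = [1.0 if i in base_set else 0.0 for i in range(n)]   # base set vector
        r = [1.0 / n] * n                                       # arbitrary starting point
        for _ in range(iters):
            r = [d * sum(A[i][j] * r[j] for j in range(n))
                 + (1.0 - d) * s[i] / len(base_set)
                 for i in range(n)]
        return r

    # Three nodes; node 0 passes authority to nodes 1 and 2 with rate 0.5 each.
    A = [[0.0, 0.0, 0.0],
         [0.5, 0.0, 0.0],
         [0.5, 0.0, 0.0]]
    print(keyword_objectrank(A, base_set={0}))   # node 0 scores highest; 1 and 2 are equal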

3 External Memory Graph Algorithms

Typical graphs constructed from databases as described in Section 2 are huge, and consequently most of the graph resides on disk. In-memory algorithms such as Dijkstra's single source shortest path algorithm, which assume the presence of all the graph data in memory, do not scale well for such graphs, as I/O operations come to dominate the running time. In this section we look at some external memory storage techniques and algorithms that can be used to overcome these difficulties.

3.1 Blocking

The authors of [NGV93] discuss the problem of using disk blocking efficiently for searching graphs that are too large to fit in memory. A key feature of their model is that a vertex may be stored multiple times in order to take advantage of redundancy.

3.1.1 Model

The key assumptions of the model for a graph G = (V, E) are:

1. Data is associated only with the vertices of the graph, each of which can be represented in a fixed amount of space.
2. Data for vertices is stored in blocks, each of which can hold data for at most B vertices.
3. Data for a single vertex may be present in more than one block. Only one such block needs to be in memory to satisfy a request.
4. The internal memory can hold data for at most M vertices, after which B vertices have to be flushed out to make room for a new block.
5. The total number of vertices in the graph is n = ρM, where ρ >> 1.

Assumption 3 is reasonable if the data is not updated too often, which is indeed the case for most databases. If the total amount of storage required for the graph is S, the storage blow-up is defined as s = S/(n/B). Intuitively, s is the average number of blocks in which a vertex is present. If a vertex is present in memory at some time, it is said to be covered at that time. The term page fault refers to an event in which the path being traced extends to an uncovered vertex, so that at least one block must be read from secondary memory. Algorithms in this model thus have two parts:

- an assignment of the vertices of the graph to blocks, known as a blocking, and
- a paging algorithm that specifies which vertices/blocks are in memory at any time.

3.1.2 Paging Strategies

When a page fault causes a block to be brought into memory, it may be necessary to overwrite data already in memory so as not to exceed the memory size. If data is always flushed out in whole blocks, we say we are using a weak paging strategy. A strong paging strategy allows any B copies of vertices to be flushed, independent of whether they were originally in the same block.

We can also distinguish paging algorithms by whether they are on-line or off-line. An off-line paging strategy may consider the entire path before deciding which blocks need to be brought in. An on-line paging strategy must base its decision about which block to bring in at any point only on the previous history of the path. A paging algorithm is called lazy if it only brings into memory a block that services an immediately preceding page fault.

Blocking using BALL-COVER: A ball of radius r around a vertex v is defined as the set of vertices {v' ∈ V | d(v, v') < r}, where d denotes the minimum distance between v and v'. The problem BALL-COVER is to pack a minimum number of balls of radius r into the graph such that each vertex belongs to at least one ball. The output of BALL-COVER is a set of vertices V' such that for every vertex v ∈ V, there exists a vertex v' ∈ V' with d(v, v') < r. A k-compact neighbourhood N_v(k) of v is defined as a k-element subset of V containing v such that the minimum distance from v to a vertex not in N_v(k) (the radius of the neighbourhood) is maximum over all such subsets. Let r*(B) be the minimum, over all vertices, of the radius of the compact neighbourhoods of size B, where B is the block size. BALL-COVER is computed for the graph with radius r = r*(B)/2.

[Figure 6: Distance from v to w is less than r/2]

The blocking for the given graph consists of the set {N_v(B) | v ∈ V}, where B is the block size. Now, when a page fault occurs at vertex v, we know (by the definition of BALL-COVER) that there is at least one vertex v' ∈ V' such that d(v, v') < r. The representative block corresponding to the vertex v' is then brought into memory. Since the distance from v to v' is less than r = r*(B)/2, and the distance from v' to any vertex not in N_{v'}(B) is at least r*(B) (by the definition of the radius of a compact neighbourhood), it can be shown that the distance from v to any vertex not in N_{v'}(B) is at least r*(B)/2.

The above analysis shows that if redundancy is permitted in the blocking, we can upper bound the number of I/O operations that take place during the execution of an algorithm; redundancy thus yields excellent algorithmic performance. The trade-off is between storage blow-up and efficiency: if we allow the storage requirements to grow, we can produce highly I/O-efficient algorithms. Graph partitioning techniques (described in the following sections), on the other hand, do not incur storage overheads, but the efficiency of graph traversal may be reduced.
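The two ingredients of this blocking scheme can be sketched as follows, under simple assumptions: the B-compact neighbourhood of a vertex is taken to be its B closest vertices, found by a truncated Dijkstra run, and r*(B) is the minimum neighbourhood radius. The helper names and graph encoding are hypothetical, and the actual packing of balls from [NGV93] is omitted.

    # Sketch of N_v(B) and r*(B); not the full [NGV93] blocking construction.
    import heapq

    def compact_neighbourhood(adj, v, B):
        """Return (members, radius): the B nearest vertices to v and the distance
        to the closest vertex left outside (inf if nothing is left outside)."""
        dist = {v: 0.0}
        pq = [(0.0, v)]
        members = []
        while pq and len(members) < B:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue                      # stale entry
            members.append(u)
            for w, wt in adj.get(u, ()):
                nd = d + wt
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(pq, (nd, w))
        # Radius = distance from v to the nearest vertex not taken into N_v(B).
        radius = min((d for d, u in pq if u not in set(members)), default=float("inf"))
        return members, radius

    def blocking(adj, B):
        blocks = {v: compact_neighbourhood(adj, v, B) for v in adj}
        r_star = min(radius for _, radius in blocks.values())
        return blocks, r_star                 # balls of radius r_star/2 would then be packed

    adj = {"a": [("b", 1.0), ("c", 2.0)], "b": [("c", 1.0)], "c": []}
    print(blocking(adj, B=2))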

3.2 Graph Partitioning

Graph partitioning (or clustering) is important in external memory graph algorithms, as it leads to a smaller, more manageable way of storing the graph on disk and also reduces the time required for graph traversal. Informally speaking, clustering is the process of grouping nodes of a graph such that intra-cluster similarity is maximized and inter-cluster similarity is minimized.

3.2.1 Mathematical Model

The k-way graph partitioning problem is defined as follows: given a directed, weighted graph G = (V, E) with |V| = n, partition V into k subsets V_1, V_2, ..., V_k such that V_i ∩ V_j = ∅ for i ≠ j, |V_i| = n/k, and ∪_i V_i = V, while minimizing the sum of the weights of the edges whose incident vertices belong to different subsets. A k-way partitioning of V is commonly represented by a partitioning vector P of length n such that for every vertex v ∈ V, P[v] is an integer between 1 and k indicating the partition to which v belongs. Given a partitioning P, the sum of the weights of the edges whose incident vertices belong to different partitions is called the edge-cut of the partitioning.

3.2.2 Multilevel k-way Partitioning

In [KK98], the authors propose a multilevel approach to graph partitioning and show that it is much more efficient than the recursive bisection algorithm. They also present a high-quality and efficient refinement algorithm that can improve upon the initial k-way partitioning. The basic structure of the multilevel k-way partitioning algorithm is as follows: the graph G = (V, E) is first coarsened down to a small number of vertices, a k-way partitioning of this small graph is computed, and the partitioning is then projected back towards the original graph by successively refining it at each intermediate level.

[Figure 7: Multilevel k-way partitioning (from [KK98])]
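As a concrete reading of the edge-cut objective defined in Section 3.2.1, here is a minimal sketch (hypothetical encoding, not METIS code) that evaluates the edge-cut of a given partitioning vector.

    # Edge-cut of a k-way partitioning: sum the weights of edges whose
    # endpoints lie in different partitions.

    def edge_cut(edges, P):
        """edges: iterable of (u, v, weight); P: dict node -> partition id (1..k)."""
        return sum(w for u, v, w in edges if P[u] != P[v])

    edges = [("a", "b", 2.0), ("b", "c", 1.0), ("c", "a", 3.0)]
    P = {"a": 1, "b": 1, "c": 2}
    print(edge_cut(edges, P))   # 1.0 + 3.0 = 4.0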

The phases of multilevel partitioning are described in more detail below.

Coarsening Phase. During the coarsening phase, a sequence of smaller graphs G_i = (V_i, E_i) is constructed from the original graph G_0 = (V_0, E_0) such that |V_i| < |V_{i-1}|. In coarsening, a set of vertices of G_i (found by computing a maximal matching), say V_i^v, is combined to form a single vertex v of the next coarser graph G_{i+1}. The weight of v is the sum of the weights of the vertices in V_i^v, and the edges of v are the union of the edges of the vertices in V_i^v. If more than one edge of V_i^v is incident on a vertex u, the weight of the new edge (from v to u) is the sum of the weights of all such edges. This coarsening method ensures that (i) the edge-cut of a partitioning of the coarse graph is equal to the edge-cut of the same partitioning in the finer graph, and (ii) a balanced partitioning of the coarser graph leads to a balanced partitioning of the finer graph. The coarsening phase ends when the number of vertices falls below a certain level or the reduction in size between successive coarse graphs becomes too small.

Initial Partitioning Phase. The second phase of multilevel partitioning computes a k-way partition of the coarsened graph obtained in the previous phase, such that each partition contains roughly |V_0|/k of the vertex weight of the original graph.

Uncoarsening Phase. In the uncoarsening phase, the partition of the coarse graph obtained in the previous phase is successively projected back to the next finer graphs, with some modifications at each stage to further reduce the edge-cut or to improve the balance of the partitioning. This process is repeated until the original graph is reached. The first step in uncoarsening is to project the partition of G_i onto G_{i-1}: if a vertex v of G_i belongs to cluster C_k, then all vertices of G_{i-1} that were combined to form v are also assigned to cluster C_k. After the projection step, the clustering obtained may be refined by moving vertices from one cluster to another (provided the balance of the clustering is maintained), thus further reducing the edge-cut of the partitioning. This refinement is done using the Kernighan-Lin (KL) partitioning algorithm and its variants.

Refinement algorithm. The vertices of the graph are visited in random order. If a vertex can be moved to a different cluster (i.e. at least one of its neighbours belongs to a different cluster), the gain associated with moving the vertex to each of its neighbouring partitions is computed. The vertex is moved to the partition giving the maximum gain, provided the balancing condition (defined below) is not violated; if no move decreases the edge-cut, a move that improves the balance of the partitioning may be made instead.

Balancing Condition. Let W_i be a vector of k elements such that W_i[a] is the weight of partition a of graph G_i, and let W_min and W_max be the minimum and maximum permissible cluster weights respectively. A vertex v of weight w(v) can be moved from partition a to partition b only if

W_i[b] + w(v) ≤ W_max  and  W_i[a] − w(v) ≥ W_min
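A single refinement pass of the kind described above can be sketched as follows; this is an illustration of the gain computation and the balancing check under simple assumptions (an undirected adjacency list and one sweep in a fixed order), not the refinement code of [KK98].

    # One simplified KL-style refinement sweep: move a vertex to the neighbouring
    # partition with the largest positive gain if the balancing condition holds.
    from collections import defaultdict

    def refine_once(adj, P, node_w, w_min, w_max):
        """adj: node -> list of (neighbour, edge weight); P: node -> partition id."""
        part_w = defaultdict(float)
        for v, wv in node_w.items():
            part_w[P[v]] += wv

        for v in list(P):
            ext = defaultdict(float)              # weight of edges into each other partition
            internal = 0.0
            for u, w in adj.get(v, ()):
                if P[u] == P[v]:
                    internal += w
                else:
                    ext[P[u]] += w
            if not ext:
                continue                          # all neighbours are in v's own partition
            target, ext_w = max(ext.items(), key=lambda kv: kv[1])
            gain = ext_w - internal               # reduction in edge-cut if v moves
            balanced = (part_w[target] + node_w[v] <= w_max and
                        part_w[P[v]] - node_w[v] >= w_min)
            if gain > 0 and balanced:
                part_w[P[v]] -= node_w[v]
                part_w[target] += node_w[v]
                P[v] = target
        return P

    adj = {"a": [("b", 1.0), ("c", 5.0)], "b": [("a", 1.0)], "c": [("a", 5.0)]}
    P = {"a": 1, "b": 1, "c": 2}
    node_w = {"a": 1.0, "b": 1.0, "c": 1.0}
    print(refine_once(adj, P, node_w, w_min=1.0, w_max=2.0))   # 'a' moves to partition 2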

3.3 S-Node Representation

The authors of [RGM03] propose a technique called the S-Node representation for efficient querying and storage of large data graphs. Though their work is specific to Web repository graphs, some of the ideas presented therein are quite general and can be applied to other classes of graphs. The S-Node representation is a two-level representation in which small lower-level graphs encode the interconnections within a small subset of pages, while the top-level directed graph, which consists of supernodes and superedges, contains pointers to these smaller graphs. Such a representation is highly space efficient and enables in-memory querying of very large graphs.

[Figure 8: S-Node representation]

3.3.1 Advantages of S-Node Representation

1. It compresses the graph so that large portions of it can fit into reasonable amounts of memory, and thus in-memory algorithms can be used instead of I/O-intensive disk-based algorithms.
2. It provides a natural way of exploring the graph by exploring local areas that are relevant to the query. The top-level graph can also serve as an index to the lower-level graphs, so that the relevant lower-level graphs can be located quickly.

3.3.2 Structure of an S-Node Representation

Let the directed graph be represented by G, and let V(G) and E(G) refer to its vertex set and edge set respectively. Let P = {N_1, N_2, ..., N_n} be a partition of the vertex set of G. The following types of directed graphs are then defined:

1. Supernode graph. A supernode graph contains n vertices called supernodes, one for each element of the partition. The supernodes are connected to each other using directed edges called superedges. A superedge is created from N_i to N_j iff there is at least one edge from a vertex in N_i to a vertex in N_j.
2. Intranode graph. Each element N_i of the partition is associated with an intranode graph, which represents the interconnections between the pages that belong to that element.

3. Positive superedge graph. A positive superedge graph SEdgePos_{i,j} is a directed bipartite graph that represents all the links that point from N_i to N_j.
4. Negative superedge graph. A negative superedge graph SEdgeNeg_{i,j} is a directed bipartite graph that represents, among all possible links that could point from pages in N_i to pages in N_j, those that do not exist in the actual graph.

[Figure 9: Partitioning the graph]

Given a partition P on the vertex set of G, we can construct an S-Node representation of G, denoted SNode(G, P), using a supernode graph that points to a set of intranode graphs and a set of positive or negative superedge graphs. Each superedge E_{i,j} points to either the corresponding positive superedge graph SEdgePos_{i,j} or the corresponding negative superedge graph SEdgeNeg_{i,j}, depending on which of the two has the smaller number of edges. This choice between positive and negative superedge graphs allows us to compactly encode both dense and sparse interconnections between pages belonging to two different supernodes. (A small sketch of this choice appears below, after the partitioning desiderata.) The S-Node representation preserves all the linkage information of the original graph, except that the adjacency lists are partitioned across multiple smaller graphs. For the right choice of partition, this representation is highly compact and well suited for local as well as global access tasks.

3.3.3 Partitioning Desiderata

To build an S-Node representation that efficiently supports global and local access to graphs, the following requirements must be met:

- Pages with similar adjacency lists are grouped together as much as possible, so that a compression technique called reference encoding can be used to achieve significant compression of the intranode and superedge graphs. This kind of grouping has the additional benefit of assigning related pages to the same partition.
- Nodes assigned to a given cluster belong to the same domain or have some lexicographic similarity. Such nodes tend to share a significant percentage of their links, and thus might be traversed within a short span of time.
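The superedge encoding choice mentioned above can be sketched as follows; the function and names are hypothetical, not the [RGM03] code.

    # Store the positive superedge graph when the cross links from N_i to N_j are
    # sparse, otherwise store their complement (the negative superedge graph).
    from itertools import product

    def encode_superedge(N_i, N_j, edges):
        """N_i, N_j: lists of page ids; edges: set of (u, v) links of the full graph."""
        positive = {(u, v) for u, v in product(N_i, N_j) if (u, v) in edges}
        negative = {(u, v) for u, v in product(N_i, N_j) if (u, v) not in edges}
        if len(positive) <= len(negative):
            return ("SEdgePos", positive)     # sparse cross links: store them directly
        return ("SEdgeNeg", negative)         # dense cross links: store the missing ones

    N_i, N_j = ["a", "b"], ["x", "y"]
    edges = {("a", "x"), ("a", "y"), ("b", "x")}  # 3 of the 4 possible links exist
    print(encode_superedge(N_i, N_j, edges))      # ('SEdgeNeg', {('b', 'y')})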

3.3.4 Iterative Partitioning Algorithm

Beginning with an initial coarse-grained partition P_0 = {N_01, N_02, ..., N_0n}, we can continuously refine it during subsequent iterations, generating a sequence of partitions P_1, P_2, ..., P_f. Refinement is the process of taking an element, say N_ij, of the partition P_i = {N_i1, N_i2, ..., N_ik} and partitioning it further into smaller sets, say {A_1, A_2, ..., A_m}, to obtain the next-level partition P_{i+1} = {N_i1, N_i2, ..., N_{i,j-1}, N_{i,j+1}, ..., N_ik} ∪ {A_1, A_2, ..., A_m}. The initial partition P_0 groups pages by the domain to which they belong (or some other measure of lexicographic similarity), so all pages belonging to the same domain are mapped to the same element of the partition. Since the final partition P_f is a refinement of P_0, the second desideratum of Section 3.3.3 is satisfied. At each iteration, an element N_ij can be split using one of two methods: URL split (computationally inexpensive, used in earlier iterations) or clustered split (computationally expensive, so used in later iterations, when the individual partition elements are small).

URL split. This method partitions the pages in N_ij based on their URL patterns. Pages with similar URL prefixes are grouped together and kept separate from pages with different URL prefixes. Every application of URL split on a partition element uses a URL prefix that is one level longer than the prefix that was used to generate that element. URL split attempts to exploit the inherent directory structure encoded in the URLs to group related pages together.

Clustered split. This technique splits the pages in N_ij by using a clustering algorithm, such as k-means, to identify pages with similar adjacency lists.

[Figure 10: Clustered Split]

To apply clustered split, the supernode graph for the current partition is constructed (see Figure 10). A bit vector adj(p) is associated with each page p of N_ij. The size of the bit vector is equal to the outdegree of the supernode associated with the partition element. The bits of adj(p) are set depending on the supernodes to which p points. After constructing such a bit vector for each page in N_ij, k-means clustering is applied to these vectors, and the resulting clusters are used to partition N_ij.
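The bit-vector construction and the k-means step can be sketched as follows; the data layout, the choice of k and the use of scikit-learn's KMeans are assumptions made only for illustration.

    # Clustered split sketch: one adjacency bit vector per page, over the
    # out-neighbour supernodes of the element being split, clustered by k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    def clustered_split(pages, page_links, partition_of, neighbour_supernodes, k=2):
        """pages: pages of the element N_ij being split.
        page_links: page -> set of pages it points to.
        partition_of: page -> id of the supernode (partition element) containing it.
        neighbour_supernodes: ordered list of supernodes that N_ij's supernode points to."""
        index = {s: i for i, s in enumerate(neighbour_supernodes)}
        vectors = np.zeros((len(pages), len(neighbour_supernodes)))
        for r, p in enumerate(pages):
            for q in page_links.get(p, ()):
                s = partition_of[q]
                if s in index:
                    vectors[r, index[s]] = 1      # bit set: p points into supernode s
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        groups = [[] for _ in range(k)]
        for p, lab in zip(pages, labels):
            groups[lab].append(p)
        return groups                              # the new sets A_1, ..., A_k

    pages = ["p1", "p2", "p3", "p4"]
    page_links = {"p1": {"x"}, "p2": {"x"}, "p3": {"y"}, "p4": {"y"}}
    partition_of = {"x": "S1", "y": "S2"}
    print(clustered_split(pages, page_links, partition_of, ["S1", "S2"], k=2))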

3.4 Compression

Compression is important for large graphs, as it allows more efficient storage and transfer, and may improve the performance of many algorithms by allowing computations to be performed in faster levels of the memory hierarchy. The authors of [AM01] describe how two compression techniques, (i) Huffman encoding and (ii) reference encoding, can be applied for efficient compression of the Web graph, and thus achieve the aforementioned advantages. Though their discussion is limited to the Web graph and the copying model, the assumptions they make apply to graphs in general, and the techniques described can be used to good effect for compressing database graphs as well.

3.4.1 Huffman Encoding

It has been determined experimentally that the degrees in typical graphs follow a Zipfian distribution, i.e. the number of nodes with degree j is proportional to 1/j^α, where α is a fixed constant. Given this variation in node degrees, Huffman encoding can be used to compress the graph, where the Huffman codeword of a node is assigned based on its degree. A special stop symbol is used to separate the outedges of each node. The encoding scheme can use either indegree or outdegree, whichever is better. Huffman encoding provides a natural and simple way of efficiently compressing any given graph and can be used in any system that wants efficient computation on the compressed form of the graph. However, it ignores the natural clustering structure present in many graphs.

3.4.2 Reference Encoding

Reference encoding works particularly well for graphs generated using the copying model, since it represents the adjacency list of a node in terms of some other node's adjacency list when the two nodes share many outlinks. When node i is compressed in this way using node j, node j is said to be a reference for node i. If node j is labelled as the reference of node i, a 0/1 bit vector indicates which outedges of j are also outedges of i. The remaining outedges of i are then listed separately, using say log n bits each in an n-node graph. Let N(i) and N(j) represent the sets of outedges of nodes i and j respectively. The cost of compressing node i using node j as a reference with this scheme is then

cost(i, j) = outdeg(j) + log n · (|N(i) \ N(j)| + 1)

[Figure 11: Reference Encoding]

Given a graph in this compressed format, consider the problem of reconstructing the adjacency list of i. This could require us to traverse the adjacency list of the reference node of i, say j, which in turn might be encoded in terms of some other reference node. The chain continues until the adjacency list of i is completely determined. This can lead to large reconstruction times, and is a potential drawback of the reference encoding scheme. Cycles among references must also be avoided: if i is encoded in terms of j and j is encoded in terms of k, then care must be taken to ensure that k is not encoded in terms of i.
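The cost function transcribes directly into code. In the sketch below, the no-reference cost used for comparison is an assumed form of the root-edge weight that appears in the affinity graph construction described next.

    # Reference-encoding cost: one bit per outedge of j for the copy bit vector,
    # plus about log2(n) bits for each outedge of i not shared with j (the "+1"
    # term is kept as in the formula above).
    import math

    def reference_cost(N_i, N_j, n):
        """N_i, N_j: sets of out-neighbours of nodes i and j; n: number of nodes."""
        return len(N_j) + math.log2(n) * (len(N_i - N_j) + 1)

    def plain_cost(N_i, n):
        """Assumed cost of storing i's adjacency list without any reference."""
        return math.log2(n) * (len(N_i) + 1)

    n = 1024
    N_i = {1, 2, 3, 4, 5}
    N_j = {1, 2, 3, 4, 9}
    print(reference_cost(N_i, N_j, n))   # 5 + 10 * 2 = 25.0 bits
    print(plain_cost(N_i, n))            # 10 * 6 = 60.0 bits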

Affinity Graph. The affinity graph G_S is used to determine the reference node of each node of the graph G_W while avoiding the cycle problem described above. The nodes of G_S are the same as the nodes of G_W. The weight of the edge from i to j in G_S is set to the cost of encoding i using j as a reference. A root node r is added to the affinity graph; every other node has a directed edge to r, and r has no outgoing edges. The weight of the edge from i to r is the cost of storing i without using any reference node. Node i has a directed edge to node j if and only if w(i, j) < w(i, r).

FIND-REFERENCE algorithm. The algorithm first computes the affinity graph for the given graph and then finds an optimal set of references such that (i) each node has at most one reference, and (ii) there are no cycles among references. The problem of finding the optimal reference assignment subject to these restrictions is equivalent to finding a minimum weight directed spanning tree rooted at r in the affinity graph.

3.4.3 Additional Improvements

Various improvements can be implemented after the references have been found by the above algorithm. For example, additional references can be found for a node: we can remove the edges covered by references from the original graph and rerun the algorithm. Though this is not optimal, since better compression could be obtained if the first run were made with the later stages in mind, it nevertheless gives an efficient heuristic for further compressing the graph using multiple references, a problem that is NP-hard in general. We can also use Huffman encoding to compress the edges not covered by references. Again, the set of references obtained may not be ideal, since we are invalidating the cost function that was used to compute them; however, until the references are chosen we cannot determine the cost of the edges they do not cover, so it is difficult to take this into account properly in the cost function. Other possible improvements include using different compressed representations; for example, the bit vector used to store which links a node has in common with its reference can itself be Huffman or run-length encoded.

4 Indexing

Indexing is required for keyword search in databases to facilitate efficient storage as well as to locate and query the relevant portions of information in a timely and organized manner. In the case of graph representations of data, keyword searching involves navigating paths through the graph structure to obtain the answer to the query. XML, along with relational databases, is the primary data storage format used in most applications. Keyword search over XML data is similar to keyword search over relational databases once the graph model has been created for the XML document.

4.1 Types of Indexes

Three types of indexes are defined in [STW04] for XML data, classified according to the XPath navigational axes they support:

- Structure indexes: This kind of index is limited to trees and cannot be applied to general graphs. Structure indexes consider the XML data as a rooted tree and encode the tree using a pre- and post-order numbering scheme.
- Path indexes: Path indexing is based on structural summaries of XML graphs. Path indexes represent all paths starting from the document root, or some pre-defined subset of such paths. Most path indexes are not limited to trees and can be applied to, or extended to handle, arbitrary graphs.
- Connection indexes: Connection indexes are labelling schemes that support efficient ancestor and descendant queries over the XML data, i.e. they are used to answer queries such as "Are u and v connected in the graph?"

4.2 HOPI - A Connection Index

HOPI ([STW04]) is a connection index for querying XML data, constructed from the 2-hop cover of a graph. HOPI is a compact representation of the reachability and distance information of a graph. It can handle path expressions efficiently and supports efficient evaluation of queries with path wildcards.

4.2.1 2-Hop Cover

A 2-hop cover of a graph is a compact representation of the connections of the graph. It is computed by choosing some node w on every path from u to v in the graph, and adding w to a set L_out(u) of descendants of u and to a set L_in(v) of ancestors of v. If it is then required to check the existence of a path from u to v in the graph, this can be done efficiently by checking whether L_out(u) ∩ L_in(v) ≠ ∅.

2-Hop Label. Let G = (V, E) be a directed graph. Each vertex v of G is assigned a 2-hop label L(v) = (L_in(v), L_out(v)), where L_in(v), L_out(v) ⊆ V, such that for every node x in L_in(v) there exists a path from x to v in G, and for every node y in L_out(v) there is a path from v to y.

For a directed graph G = (V, E), let u and v be two nodes with 2-hop labels L(u) and L(v) respectively. Then there exists a path from u to v if there is a node w ∈ V such that w ∈ L_out(u) ∩ L_in(v). A 2-hop labelling of a graph assigns a 2-hop label to each node of G.

2-Hop Cover. A 2-hop cover of a graph G = (V, E) is a 2-hop labelling of G such that whenever there is a path from u to v in G, L_out(u) ∩ L_in(v) ≠ ∅. The size of a 2-hop cover is defined as the sum of the sizes of all node labels: Σ_{v ∈ V} (|L_in(v)| + |L_out(v)|).

4.2.2 Incremental Algorithm for Computation of 2-Hop Cover

The set cover problem can be reduced to the problem of finding the minimum 2-hop cover, and thus computing the minimum 2-hop cover of a graph is NP-hard. The algorithm proposed to compute the 2-hop cover chooses a center node w for each path (u, v) in G, and adds w to L_in(v) and to L_out(u). The algorithm maintains the set of not yet covered connections and at each step picks a center node so as to cover as many of the uncovered connections as possible; such a set of connections is obtained by computing the densest subgraph of the center graph of w. This algorithm runs in polynomial time and computes a 2-hop cover whose size is within a factor of O(log |V|) of the optimum.

4.2.3 Divide and Conquer Algorithm for Computation of 2-Hop Cover

Computing the transitive closure, as required by the incremental algorithm, is memory intensive, so a divide and conquer technique is proposed for computing the 2-hop cover:

- Partition the graph such that the transitive closures of the partitions fit in memory, so that the 2-hop cover computation for each partition can be carried out with memory-based structures.
- Compute the transitive closure of each partition and from it the 2-hop cover of each partition.
- Merge the 2-hop covers of partitions having at least one cross-partition edge, and thus obtain a 2-hop cover for the entire graph.

4.2.4 Distance Aware 2-Hop Cover

The above algorithm for building the 2-hop cover can be modified to include distance information in the HOPI index ([STW05]). Each entry in the label of a node is augmented with distance information; e.g. the entries in L_in(v) become pairs (u, d(u, v)), where d(u, v) is the distance between u and v. The main modification needed is that a node w can be a center node for a path from u to v only if it lies on a shortest path, since otherwise it cannot reflect the correct distance information. This additional restriction is added to the construction of the center graph of w, where we add the edge (u, v) only if the distance from u to v equals the sum of the distances from u to w and from w to v.
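To close the section, here is a minimal sketch of how 2-hop labels answer reachability queries. The labels are written by hand for a three-node chain and are purely illustrative; they are not produced by the HOPI construction algorithms above.

    # Reachability test via 2-hop labels: u reaches v iff L_out(u) and L_in(v)
    # share a common center node.

    def reachable(labels, u, v):
        """labels: node -> (L_in, L_out). Nodes are conventionally members of their
        own labels so that trivial and one-hop connections are covered too."""
        _, l_out_u = labels[u]
        l_in_v, _ = labels[v]
        return bool(l_out_u & l_in_v)

    # Chain a -> b -> c, with b chosen as the center node for every connection.
    labels = {
        "a": ({"a"}, {"a", "b"}),      # b is a descendant recorded in L_out(a)
        "b": ({"b"}, {"b"}),
        "c": ({"b", "c"}, {"c"}),      # b is an ancestor recorded in L_in(c)
    }
    print(reachable(labels, "a", "c"))   # True:  L_out(a) and L_in(c) share {b}
    print(reachable(labels, "c", "a"))   # False: no common center node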


Multilevel Algorithms for Multi-Constraint Hypergraph Partitioning Multilevel Algorithms for Multi-Constraint Hypergraph Partitioning George Karypis University of Minnesota, Department of Computer Science / Army HPC Research Center Minneapolis, MN 55455 Technical Report

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

Parallel Graph Algorithms

Parallel Graph Algorithms Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050 Spring 202 Part I Introduction Overview Graphsdenitions, properties, representation Minimal spanning tree Prim's algorithm Shortest

More information

CS521 \ Notes for the Final Exam

CS521 \ Notes for the Final Exam CS521 \ Notes for final exam 1 Ariel Stolerman Asymptotic Notations: CS521 \ Notes for the Final Exam Notation Definition Limit Big-O ( ) Small-o ( ) Big- ( ) Small- ( ) Big- ( ) Notes: ( ) ( ) ( ) ( )

More information

Lesson 2 7 Graph Partitioning

Lesson 2 7 Graph Partitioning Lesson 2 7 Graph Partitioning The Graph Partitioning Problem Look at the problem from a different angle: Let s multiply a sparse matrix A by a vector X. Recall the duality between matrices and graphs:

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Introduction graph theory useful in practice represent many real-life problems can be if not careful with data structures Chapter 9 Graph s 2 Definitions Definitions an undirected graph is a finite set

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms. Lecturer: Shi Li

CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms. Lecturer: Shi Li CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms Lecturer: Shi Li Department of Computer Science and Engineering University at Buffalo Main Goal of Algorithm Design Design fast

More information

Reference Sheet for CO142.2 Discrete Mathematics II

Reference Sheet for CO142.2 Discrete Mathematics II Reference Sheet for CO14. Discrete Mathematics II Spring 017 1 Graphs Defintions 1. Graph: set of N nodes and A arcs such that each a A is associated with an unordered pair of nodes.. Simple graph: no

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Graphs: Introduction Ulf Leser This Course Introduction 2 Abstract Data Types 1 Complexity analysis 1 Styles of algorithms 1 Lists, stacks, queues 2 Sorting (lists) 3 Searching

More information

Algorithms for Data Science

Algorithms for Data Science Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Thursday, October 1, 2015 Outline 1 Recap 2 Shortest paths in graphs with non-negative edge weights (Dijkstra

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Chapter 9 Graph Algorithms 2 Introduction graph theory useful in practice represent many real-life problems can be if not careful with data structures 3 Definitions an undirected graph G = (V, E) is a

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions... Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing

More information

Adaptive-Mesh-Refinement Pattern

Adaptive-Mesh-Refinement Pattern Adaptive-Mesh-Refinement Pattern I. Problem Data-parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Graphs and Network Flows ISE 411. Lecture 7. Dr. Ted Ralphs

Graphs and Network Flows ISE 411. Lecture 7. Dr. Ted Ralphs Graphs and Network Flows ISE 411 Lecture 7 Dr. Ted Ralphs ISE 411 Lecture 7 1 References for Today s Lecture Required reading Chapter 20 References AMO Chapter 13 CLRS Chapter 23 ISE 411 Lecture 7 2 Minimum

More information

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey

More information

CS 341: Algorithms. Douglas R. Stinson. David R. Cheriton School of Computer Science University of Waterloo. February 26, 2019

CS 341: Algorithms. Douglas R. Stinson. David R. Cheriton School of Computer Science University of Waterloo. February 26, 2019 CS 341: Algorithms Douglas R. Stinson David R. Cheriton School of Computer Science University of Waterloo February 26, 2019 D.R. Stinson (SCS) CS 341 February 26, 2019 1 / 296 1 Course Information 2 Introduction

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

22 Elementary Graph Algorithms. There are two standard ways to represent a

22 Elementary Graph Algorithms. There are two standard ways to represent a VI Graph Algorithms Elementary Graph Algorithms Minimum Spanning Trees Single-Source Shortest Paths All-Pairs Shortest Paths 22 Elementary Graph Algorithms There are two standard ways to represent a graph

More information

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Algorithms for Integer Programming

Algorithms for Integer Programming Algorithms for Integer Programming Laura Galli November 9, 2016 Unlike linear programming problems, integer programming problems are very difficult to solve. In fact, no efficient general algorithm is

More information

Chapter 9. Greedy Technique. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

Chapter 9. Greedy Technique. Copyright 2007 Pearson Addison-Wesley. All rights reserved. Chapter 9 Greedy Technique Copyright 2007 Pearson Addison-Wesley. All rights reserved. Greedy Technique Constructs a solution to an optimization problem piece by piece through a sequence of choices that

More information

Minimum-Spanning-Tree problem. Minimum Spanning Trees (Forests) Minimum-Spanning-Tree problem

Minimum-Spanning-Tree problem. Minimum Spanning Trees (Forests) Minimum-Spanning-Tree problem Minimum Spanning Trees (Forests) Given an undirected graph G=(V,E) with each edge e having a weight w(e) : Find a subgraph T of G of minimum total weight s.t. every pair of vertices connected in G are

More information

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing" Database System Concepts, 6 th Ed.! Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use " Chapter 11: Indexing and Hashing" Basic Concepts!

More information

Keyword search in relational databases. By SO Tsz Yan Amanda & HON Ka Lam Ethan

Keyword search in relational databases. By SO Tsz Yan Amanda & HON Ka Lam Ethan Keyword search in relational databases By SO Tsz Yan Amanda & HON Ka Lam Ethan 1 Introduction Ubiquitous relational databases Need to know SQL and database structure Hard to define an object 2 Query representation

More information

Highway Dimension and Provably Efficient Shortest Paths Algorithms

Highway Dimension and Provably Efficient Shortest Paths Algorithms Highway Dimension and Provably Efficient Shortest Paths Algorithms Andrew V. Goldberg Microsoft Research Silicon Valley www.research.microsoft.com/ goldberg/ Joint with Ittai Abraham, Amos Fiat, and Renato

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

9/24/ Hash functions

9/24/ Hash functions 11.3 Hash functions A good hash function satis es (approximately) the assumption of SUH: each key is equally likely to hash to any of the slots, independently of the other keys We typically have no way

More information

Lecture and notes by: Nate Chenette, Brent Myers, Hari Prasad November 8, Property Testing

Lecture and notes by: Nate Chenette, Brent Myers, Hari Prasad November 8, Property Testing Property Testing 1 Introduction Broadly, property testing is the study of the following class of problems: Given the ability to perform (local) queries concerning a particular object (e.g., a function,

More information

Seminar on Algorithms and Data Structures: Multiple-Edge-Fault-Tolerant Approximate Shortest-Path Trees [1]

Seminar on Algorithms and Data Structures: Multiple-Edge-Fault-Tolerant Approximate Shortest-Path Trees [1] Seminar on Algorithms and Data Structures: Multiple-Edge-Fault-Tolerant Approximate Shortest-Path Trees [1] Philipp Rimle April 14, 2018 1 Introduction Given a graph with positively real-weighted undirected

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

Introduction III. Graphs. Motivations I. Introduction IV

Introduction III. Graphs. Motivations I. Introduction IV Introduction I Graphs Computer Science & Engineering 235: Discrete Mathematics Christopher M. Bourke cbourke@cse.unl.edu Graph theory was introduced in the 18th century by Leonhard Euler via the Königsberg

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Chapter 9 Graph Algorithms 2 Introduction graph theory useful in practice represent many real-life problems can be slow if not careful with data structures 3 Definitions an undirected graph G = (V, E)

More information

Lecture 6 Basic Graph Algorithms

Lecture 6 Basic Graph Algorithms CS 491 CAP Intro to Competitive Algorithmic Programming Lecture 6 Basic Graph Algorithms Uttam Thakore University of Illinois at Urbana-Champaign September 30, 2015 Updates ICPC Regionals teams will be

More information

Graph and Digraph Glossary

Graph and Digraph Glossary 1 of 15 31.1.2004 14:45 Graph and Digraph Glossary A B C D E F G H I-J K L M N O P-Q R S T U V W-Z Acyclic Graph A graph is acyclic if it contains no cycles. Adjacency Matrix A 0-1 square matrix whose

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Introduction to Mathematical Programming IE406. Lecture 16. Dr. Ted Ralphs

Introduction to Mathematical Programming IE406. Lecture 16. Dr. Ted Ralphs Introduction to Mathematical Programming IE406 Lecture 16 Dr. Ted Ralphs IE406 Lecture 16 1 Reading for This Lecture Bertsimas 7.1-7.3 IE406 Lecture 16 2 Network Flow Problems Networks are used to model

More information

CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL

CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL 5.1 INTRODUCTION The survey presented in Chapter 1 has shown that Model based testing approach for automatic generation of test

More information

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch. Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible

More information

Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization

Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization IEEE TRANSACTIONS ON COMPUTER AIDED DESIGN, VOL XX, NO. XX, 2005 1 Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization Navaratnasothie Selvakkumaran and

More information

11/22/2016. Chapter 9 Graph Algorithms. Introduction. Definitions. Definitions. Definitions. Definitions

11/22/2016. Chapter 9 Graph Algorithms. Introduction. Definitions. Definitions. Definitions. Definitions Introduction Chapter 9 Graph Algorithms graph theory useful in practice represent many real-life problems can be slow if not careful with data structures 2 Definitions an undirected graph G = (V, E) is

More information

On the Minimum k-connectivity Repair in Wireless Sensor Networks

On the Minimum k-connectivity Repair in Wireless Sensor Networks On the Minimum k-connectivity epair in Wireless Sensor Networks Hisham M. Almasaeid and Ahmed E. Kamal Dept. of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011 Email:{hisham,kamal}@iastate.edu

More information

Let the dynamic table support the operations TABLE-INSERT and TABLE-DELETE It is convenient to use the load factor ( )

Let the dynamic table support the operations TABLE-INSERT and TABLE-DELETE It is convenient to use the load factor ( ) 17.4 Dynamic tables Let us now study the problem of dynamically expanding and contracting a table We show that the amortized cost of insertion/ deletion is only (1) Though the actual cost of an operation

More information

A project report submitted to Indiana University

A project report submitted to Indiana University Page Rank Algorithm Using MPI Indiana University, Bloomington Fall-2012 A project report submitted to Indiana University By Shubhada Karavinkoppa and Jayesh Kawli Under supervision of Prof. Judy Qiu 1

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

Parallel Query Optimisation

Parallel Query Optimisation Parallel Query Optimisation Contents Objectives of parallel query optimisation Parallel query optimisation Two-Phase optimisation One-Phase optimisation Inter-operator parallelism oriented optimisation

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

Chapter 17 Indexing Structures for Files and Physical Database Design

Chapter 17 Indexing Structures for Files and Physical Database Design Chapter 17 Indexing Structures for Files and Physical Database Design We assume that a file already exists with some primary organization unordered, ordered or hash. The index provides alternate ways to

More information

22 Elementary Graph Algorithms. There are two standard ways to represent a

22 Elementary Graph Algorithms. There are two standard ways to represent a VI Graph Algorithms Elementary Graph Algorithms Minimum Spanning Trees Single-Source Shortest Paths All-Pairs Shortest Paths 22 Elementary Graph Algorithms There are two standard ways to represent a graph

More information

CS200: Graphs. Prichard Ch. 14 Rosen Ch. 10. CS200 - Graphs 1

CS200: Graphs. Prichard Ch. 14 Rosen Ch. 10. CS200 - Graphs 1 CS200: Graphs Prichard Ch. 14 Rosen Ch. 10 CS200 - Graphs 1 Graphs A collection of nodes and edges What can this represent? n A computer network n Abstraction of a map n Social network CS200 - Graphs 2

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

CS261: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem

CS261: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem CS61: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem Tim Roughgarden February 5, 016 1 The Traveling Salesman Problem (TSP) In this lecture we study a famous computational problem,

More information

Greedy Approach: Intro

Greedy Approach: Intro Greedy Approach: Intro Applies to optimization problems only Problem solving consists of a series of actions/steps Each action must be 1. Feasible 2. Locally optimal 3. Irrevocable Motivation: If always

More information

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

Implementation of Near Optimal Algorithm for Integrated Cellular and Ad-Hoc Multicast (ICAM)

Implementation of Near Optimal Algorithm for Integrated Cellular and Ad-Hoc Multicast (ICAM) CS230: DISTRIBUTED SYSTEMS Project Report on Implementation of Near Optimal Algorithm for Integrated Cellular and Ad-Hoc Multicast (ICAM) Prof. Nalini Venkatasubramanian Project Champion: Ngoc Do Vimal

More information

PERFECT MATCHING THE CENTRALIZED DEPLOYMENT MOBILE SENSORS THE PROBLEM SECOND PART: WIRELESS NETWORKS 2.B. SENSOR NETWORKS OF MOBILE SENSORS

PERFECT MATCHING THE CENTRALIZED DEPLOYMENT MOBILE SENSORS THE PROBLEM SECOND PART: WIRELESS NETWORKS 2.B. SENSOR NETWORKS OF MOBILE SENSORS SECOND PART: WIRELESS NETWORKS 2.B. SENSOR NETWORKS THE CENTRALIZED DEPLOYMENT OF MOBILE SENSORS I.E. THE MINIMUM WEIGHT PERFECT MATCHING 1 2 ON BIPARTITE GRAPHS Prof. Tiziana Calamoneri Network Algorithms

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PAUL BALISTER Abstract It has been shown [Balister, 2001] that if n is odd and m 1,, m t are integers with m i 3 and t i=1 m i = E(K n) then K n can be decomposed

More information