
Vertical Fragmentation and Allocation in Distributed Deductive Database Systems

Seung-Jin Lim, Yiu-Kai Ng
Department of Computer Science, Brigham Young University, Provo, Utah 80, U.S.A.

Abstract

Although approaches for vertical fragmentation and data allocation have been proposed [1, 1], algorithms for vertical fragmentation and allocation of data and rules in distributed deductive database systems (DDDBSs) are lacking. In this paper, we present different approaches for vertical fragmentation of relations that are referenced by rules and an allocation strategy for rules and fragments in a DDDBS. The potential advantages of the proposed fragmentation and allocation scheme include maximal locality of query evaluation and minimization of communication cost in a distributed system, in addition to the desirable properties of (vertical) fragmentation and rule allocation as discussed in the literature [11, 1]. We also formulate the mathematical interpretation of the proposed vertical fragmentation and allocation algorithms.

Keywords: rules, fragmentation, allocation, replication, deductive databases, distributed systems

1 Introduction

Deductive database systems enhance the expressive power of conventional relational database systems by adopting logic programming as a query language that allows recursion. Distributed database systems offer many advantages over centralized database systems, including enhanced reliability and availability of the involved databases, improved overall system performance through parallel execution of transactions, and minimized contention for system resources [1]. The integration of these two kinds of database systems appears to provide a promising, potentially more powerful and reliable database system for information processing. Two of the main design activities in distributed systems are fragmentation and data allocation.
Fragmentation allows parallel execution of a single query, reduces the amount of irrelevant data access and unnecessary data transfer, and increases the level of concurrency and therefore the system throughput in a distributed database system [, 1]. Vertical fragmentation further enhances the performance of database transactions by closely matching fragments to the requirements of transactions [1]. Our design goals for the fragmentation and allocation of rules and data are to maximize concurrent rule execution, reduce replication of rules and data, minimize communication cost during query evaluation, and decrease the query response time.

Different approaches for (vertical) fragmentation and rule allocation in distributed (deductive) database systems have been proposed. [1] introduces a vertical partitioning algorithm using a graphical technique that provides an improvement over the previous work on vertical partitioning [1]. [10] presents a theory of fragmentation and studies the completeness and update problems of overlapping fragments. [] develops a fragmentation technique for class objects in a distributed object based system. [11] discusses the rule allocation problem in a distributed database system and proposes a rule partitioning method. [1, 1] construct algorithms for dynamic data allocation in distributed systems. [1] proposes a distributed data allocation algorithm that utilizes actual query processing schedules. That method integrates the problems of distributed query optimization and optimal data allocation statically, by sequentially optimizing query strategies and then data allocation, and determines the (horizontal) fragments of relations to be allocated so that the total transmission cost of processing user queries and updates is minimized. The same problem has been addressed by [], which includes an iterative method for integrating query optimization and data allocation methods in distributed database design. An optimization heuristic is adopted in [] which iteratively determines the minimum cost query strategies and minimum cost data allocation until a local minimum for the combined problem is found.
All of these approaches, however, either address the problem of data allocation [1, 1] and optimal query processing [1, ], focus on the partitioning problem of rules [11], or deal with the fragmentation of relations or class objects in a distributed system [, 10, 1, 1]. Algorithms for vertical fragmentation of relations referenced by rules and allocation of rules and corresponding fragments in a distributed deductive database system (DDDBS) are lacking. In this paper, we present different approaches for vertical fragmentation of relations and allocation of rules and fragments. Data and rules are distributed across different sites in a network to meet the operational needs and to handle future information processing at each site. The proposed fragmentation and allocation strategy maximizes locality of query evaluation while minimizing communication cost and execution time during query processing.

We proceed to present our results as follows. In Section 2 we provide our basic definitions for dependencies among rule expressions and base relations. In Section 3 we propose four different algorithms: RCA for rule clustering, OVF for computing overlapping vertical fragmentation, DVF for generating disjoint vertical fragmentation, and CAA for allocating rules and corresponding fragments. We then give the mathematical interpretation of the proposed algorithms and formulate the communication costs of distributing rules and fragments and of query evaluation. Finally, we present the proofs of correctness and the complexity analysis of the proposed algorithms, followed by concluding remarks.

2 Basic Definitions

We consider each Datalog rule r in a DDDBS of the form

$p(X_1, \ldots, X_n) \;\text{:-}\; q_1(Y_1, \ldots, Y_m), \ldots, q_t(Z_1, \ldots, Z_s)$

where p is the head (predicate) of r and is either a derived (intensional) or mixed predicate (relation) (see footnote 1). $q_i$ ($1 \le i \le t$) is either a derived, mixed, or base (extensional) predicate (relation), and $q_1, \ldots, q_t$ form the body of r. An argument of a predicate is either a variable or a constant. A rule with an empty body is a base relation, i.e., an extensional predicate, which contains a set of facts, and a rule without the head predicate is a query. r is recursive if at least one of the predicates in the body of r is p []. Two predicates p and q are mutually recursive if p and q are (in)directly dependent on each other, i.e., in order to compute p, we need to compute q, and vice versa, and we call any two rules with p and q as head predicates, respectively, mutually recursive rules. r can be extended to handle complex data structures as in higher-order logic database languages [9], and our proposed solutions for the fragmentation and allocation problems are independent of the constructs of r, i.e., whether r is a Datalog rule or is extended as a rule in a higher-order logic database language.

We apply some basic principles of graph theory to the proposed fragmentation and rule and data allocation algorithms, and use matrices to capture the dependency relationships among rules and base relations.

Definition 1. A directed graph (digraph for short) G(N, E) consists of two sets, the nonempty set N (or N(G)) of nodes and the set E (or E(G)) of edges. Each node in N represents either a rule or a base relation, whereas each edge $(n_1, n_2)$ in E denotes that $n_1$ and $n_2$ are rules, and the head predicate of $n_2$ (which can be a base relation) appears in the body of $n_1$.

Definition 2. In a digraph G, node $n_j$ is reachable from node $n_i$ if there exists a path from $n_i$ to $n_j$.
Node $n_j$ is directly reachable from node $n_i$ if there exists a path of length 1 from $n_i$ to $n_j$.

Definition 3. Given a digraph G, let A be the Boolean adjacency matrix of G. Then

$A^{(m)}[i,j] = \begin{cases} 1 & \text{if there exists a path of length } m \text{ from node } n_i \text{ to node } n_j \\ 0 & \text{otherwise} \end{cases}$

In particular, $A^{(1)}[i,j]$ is called a direct dependency matrix of G.

Definition 4. In a digraph G, a reachability matrix R is defined as

$R = A^{(1)} \lor A^{(2)} \lor \cdots \lor A^{(n)}$

where an entry of R is computed by applying Boolean addition ($\lor$) to the corresponding entries in $A^{(1)}, \ldots, A^{(n)}$.

Footnote 1: A predicate p is mixed if there is a set of ground facts for p and p appears as the head predicate of some rules [].
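As an illustration of Definitions 3 and 4, the following is a minimal sketch of how a reachability matrix could be computed from a Boolean adjacency matrix by OR-ing successive Boolean powers. The function names `bool_matmul` and `reachability` and the three-node example graph are assumptions made for this sketch; they do not come from the paper.

```python
def bool_matmul(a, b):
    """Boolean product of two square 0/1 matrices (AND for product, OR for sum)."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n))) for j in range(n)]
            for i in range(n)]


def reachability(adj):
    """Return R = A^(1) OR A^(2) OR ... OR A^(n) for an n x n adjacency matrix."""
    n = len(adj)
    power = [row[:] for row in adj]   # current Boolean power A^(m), starting at m = 1
    reach = [row[:] for row in adj]   # running OR of all powers computed so far
    for _ in range(n - 1):
        power = bool_matmul(power, adj)
        reach = [[reach[i][j] | power[i][j] for j in range(n)] for i in range(n)]
    return reach


# Hypothetical digraph: n0 -> n1 -> n2
adj = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
R = reachability(adj)
print(R)  # n2 is reachable from n0 only indirectly, through n1
```

In this example the entry R[0][2] is 1 although adj[0][2] is 0, which is the distinction between direct and indirect reachability that the definitions draw.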

There exist direct dependency matrices that capture the rule-to-rule and rule-to-relation relationships, respectively, in DDDBSs.

Definition 5. In a digraph G with $N_r$ distinct rules, an $N_r \times N_r$ direct rule-to-rule dependency matrix, rr, is defined as

$rr[i,j] = \begin{cases} 1 & \text{if rule } r_j \text{ is directly reachable from rule } r_i \text{, i.e., the head predicate of } r_j \text{ appears in the body of } r_i \\ 0 & \text{otherwise} \end{cases}$

Definition 6. In a digraph G with $N_r$ distinct rules and $N_R$ base relations, an $N_r \times N_R$ direct rule-to-relation dependency matrix, rr, is defined as

$rr[i,j] = \begin{cases} 1 & \text{if base relation } R_j \text{ is directly reachable from rule } r_i \text{, i.e., } R_j \text{ appears in the body of } r_i \\ 0 & \text{otherwise} \end{cases}$

Definition 7. An $N_R \times 1$ table-size matrix TS is defined as TS[i] = n, where i denotes base relation $R_i$, and n denotes the size (in bytes) of $R_i$.

Definition 8. A network topology matrix T, which is symmetric relative to the principal diagonal, is defined as T[i, j] = w, where w denotes the total weight of the shortest path (measured by the physical distance) from site $n_i$ to site $n_j$ in a network, and w = 0 if i = j. The connection weight of a site $S_i$ in a network with p other sites is defined accordingly as $\sum_{k=1}^{p} T[i,k]$, which is the sum of the total weights of the shortest paths from $S_i$ to each of the other sites in the network.

Example 1. Consider a distributed deductive database (DDDB) D consisting of nine rules and four base relations whose relationships are captured by a direct rule-to-rule dependency matrix rr and a direct rule-to-relation dependency matrix rr. Further assume that a table-size matrix TS for the base relations in D and a network topology matrix T are given along with rr and rr.

[Matrices rr, TS, T, and rr: entries missing in the source.]
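To make Definition 8 concrete, the sketch below builds a network topology matrix from a list of weighted links and then derives each site's connection weight as a row sum. The use of the Floyd-Warshall all-pairs shortest-path procedure, the function names, and the three-site network are assumptions chosen for illustration; the paper takes T as a given input and does not say how it is obtained.

```python
INF = float("inf")


def topology_matrix(n_sites, links):
    """Build T, where T[i][j] is the shortest-path weight between sites i and j.

    links: iterable of (site_a, site_b, weight) for undirected network links.
    """
    T = [[0 if i == j else INF for j in range(n_sites)] for i in range(n_sites)]
    for a, b, w in links:
        T[a][b] = min(T[a][b], w)
        T[b][a] = min(T[b][a], w)   # undirected links keep T symmetric
    # Floyd-Warshall relaxation over every intermediate site k
    for k in range(n_sites):
        for i in range(n_sites):
            for j in range(n_sites):
                if T[i][k] + T[k][j] < T[i][j]:
                    T[i][j] = T[i][k] + T[k][j]
    return T


def connection_weights(T):
    """Connection weight of site i: the sum of row i of T (T[i][i] is 0)."""
    return [sum(row) for row in T]


# Hypothetical three-site network; the direct 0-2 link is longer than going via site 1.
links = [(0, 1, 2), (1, 2, 3), (0, 2, 7)]
T = topology_matrix(3, links)
print(T)
print(connection_weights(T))
```

A site with a small connection weight is, on this measure, close to the rest of the network.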

Figure 1: The dependency graph of a DDDB and a network. (a) A given DDDB and its digraph G. (b) A network and its topology matrix.

Figure 1(a) depicts D, in which rules are shown without arguments for simplicity of presentation, and relationships among rules and base relations are captured by a digraph G. Figure 1(b) shows a network with labeled edges, which denote the weights of the edges, and its network topology matrix T.

3 Fragmentation and Allocation Algorithms

In this section, we present a fragmentation and allocation algorithm (FAA). FAA, as defined, consists of three subalgorithms: a rule clustering algorithm (RCA), a data clustering algorithm (DCA), and a rule and fragment allocation algorithm (CAA). FAA aims to maximize the locality of query evaluation and thus to minimize communication cost and search space during query processing. In order to minimize the communication cost of transmitting data or partial answers to a query, rules with the same head predicate and, if possible, fragments of the base relations on which they depend (either directly or indirectly) are allocated to the same site. Otherwise, the computation of partial answers to a query Q involving rules with the same head predicate is performed at different sites, since none of these sites has all of these rules locally. As a result, these sites must communicate with one another in order to generate all the answers to Q, which adds to the communication cost. Furthermore, our clustering and allocation algorithms assign mutually recursive rules to the same site. Otherwise, the participating sites where these rules are stored have to communicate at each intermediate step of executing a query Q, which increases the communication cost and execution time of Q, since processors at different sites may spend most of their time waiting for one another or transmitting data across sites in the network [11].
RCA, which allows replicated rules in a DDDBS, is designed so that most, if not all, of the rules used by a query in a DDDBS are executed locally, which reduces communication overheads. Since knowledge and integrity constraints represented in a deductive database, which are captured by rule expressions, are much less time-variant than data [11], the effect of updates on replicated rules is reduced. DCA, on the other hand, provides two alternatives: either replication or partition of fragments of base relations referenced by rules. Replication is a desirable feature in a static DDDBS since it increases the locality of query processing and the availability and reliability of a DDDBS [].

Footnote: When a query Q of the form ?- $q_1(V_1), \ldots, q_n(V_n)$, where $V_i$ ($1 \le i \le n$) denotes a vector of arguments of $q_i$, is submitted to a site S where (a subset of) the rules for computing the answers to Q do not reside, (subqueries in) Q must be remotely executed and its answers are transmitted to S.

It is assumed that two direct dependency matrices, rr and rr, a table-size matrix TS, and a network topology matrix T are given as inputs (i.e., they are predetermined) to FAA. It is further assumed that each site has its local distributed data directory (which is called a knowledge directory in [8]) that contains the information of "which site has which rules and fragments of base relations," and that all the rules and fragments of base relations to be allocated are originally stored at a particular site, called the primary site, in the network.

3.1 RCA

We consider two kinds of rules in a DDDB: (i) a rule $r_1$ that is directly dependent on another rule r, i.e., the head predicate of r appears in the body of $r_1$, and (ii) a rule $r_1$ that is indirectly dependent on another rule r, i.e., the head predicate of r appears in the body of $r_1$ through a number of intermediate rules. For example, in Figure 1(a), rules r and r are directly dependent on rule r, but rules $r_1$ and r are indirectly dependent on rule r through r and on base relation R through rules r and r. RCA first constructs a digraph (DG) G using rr (G represents both direct and indirect dependency relationships among rules), and then computes each distinct subgraph (subdg) of G. Distinct subgraphs of G are used by DCA to compute fragments of base relations on which rules in each distinct subgraph depend (either directly or indirectly). Sections 3.1.1 and 3.1.2
include the steps of RCA.

3.1.1 Computing Prospective Distinct Subgraphs

We construct each prospective distinct subgraph of rules (that are not base relations) in G such that it either consists of a single rule that does not reach other rules (that are not base relations) in G, or exactly one of its rules, r, can directly reach the other rules in it, i.e., the head of every rule (except r) in the prospective distinct subgraph appears in the body of r.

Footnote: A subgraph SG in G is called a distinct subgraph if SG has no outgoing edges to other nodes, which denote rules, not base relations, in G. All the rules in SG are eventually distributed to a particular site in the network.

Footnote: A prospective distinct subgraph {r1, r} of a DG G is a subgraph of G such that r is directly reachable from r1. Each prospective distinct subgraph eventually becomes (a portion of) a distinct subgraph of G. If there exists no node which is directly reachable from r1, then there is no r in the prospective distinct subgraph.

Example 2. Given the DDDB and its dependency graph G in Figure 1(a), the sets of rules {r 1, r}, {r, r 1, r}, {r}, {r, r}, {r, r}, {r, r, r}, {r, r}, {r 8}, and {r 9}, with their corresponding edges as shown in Figure 2(b), form different prospective distinct subgraphs of G in Figure 1(a).

Figure 2: (Prospective) distinct subgraphs of a DDDB. (a) The dependency graph G of a given DDDB. (b) Prospective distinct subgraphs. (c) Attach {r1, r}, {r}, and {r, r} to {r, r1, r}. (d) Attach {r}, {r, r}, and {r, r} to {r, r, r}, and discard embedded subgraphs.

To simplify the discussion, from now on we denote a subdg of a DG by a set of nodes, assuming that the edges connecting the nodes in the set are implicitly represented.

3.1.2 Generating Distinct Subgraphs

The next step of RCA expands, if possible, a prospective distinct subgraph iteratively to generate a distinct subgraph by adding indirectly dependent rules to it, as follows: if every node in one prospective distinct subgraph (the j-th, say) is directly or indirectly reachable from a node in another prospective distinct subgraph (the i-th), expand the i-th by attaching the j-th to it, and retain the j-th. Repeat this process until no further changes can be made. Each subgraph of G which is embedded within another subgraph of G is then discarded. This yields a set of distinct subgraphs of G.

Example 3. Consider the set of prospective distinct subgraphs generated in Example 2. By applying the steps in this subsection, we obtain five distinct subgraphs: subdg 1 = {r 1, r, r, r}, subdg 2 = {r, r}, subdg 3 = {r, r, r, r}, subdg 4 = {r 8}, and subdg 5 = {r 9}. Figures 2(c) and 2(d) illustrate the process of merging the prospective distinct subgraphs shown in Figure 2(b).
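The two RCA steps above can be sketched compactly under one reading of the text: attaching every reachable prospective subgraph to a rule amounts to taking that rule's reachability closure, and discarding embedded subgraphs amounts to keeping only the closures that are not proper subsets of another. This equivalence, the function name `distinct_subgraphs`, and the five-rule matrix are assumptions made for the sketch, not the authors' stated procedure.

```python
def distinct_subgraphs(rr):
    """Return the maximal rule closures of a rule-to-rule dependency matrix.

    rr[i][j] == 1 means the head of rule j appears in the body of rule i.
    Each result is a sorted list of rule indices with no outgoing edge to a
    rule outside the list.
    """
    n = len(rr)
    closures = set()
    for start in range(n):
        seen = {start}
        stack = [start]
        while stack:                      # depth-first walk over dependencies
            i = stack.pop()
            for j in range(n):
                if rr[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        closures.add(frozenset(seen))     # mutually recursive rules collapse here
    # Discard every closure embedded in (a proper subset of) another closure.
    maximal = [c for c in closures if not any(c < other for other in closures)]
    return sorted(sorted(c) for c in maximal)


# Hypothetical dependencies: r0 -> r1 -> r2, r3 -> r2, and r4 isolated.
rr = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(distinct_subgraphs(rr))
```

Rule 2 ends up in two distinct subgraphs here, which matches the earlier remark that RCA allows replicated rules.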

3.2 DCA

In this section, we propose a data clustering algorithm (DCA) which generates vertical fragments of base relations that are referenced by rules in a distinct subgraph computed by RCA. These fragments are attached to the distinct subgraphs and are allocated along with the rules that (in)directly depend on the fragments, at chosen sites in a DDDBS, according to the cluster allocation algorithm introduced later.

Conventional techniques for developing fragments have been tailored to the needs of user applications (queries). Our vertical fragmentation technique, however, is strictly based on the given set of rule expressions represented in a DG, as well as the access frequency of queries in one of the fragmentation algorithms. The uniqueness of our approach is two-fold. First, no information about access patterns on base relations is needed by our fragmentation approach, and hence our approach eliminates extra inputs required by conventional fragmentation methods [1, 1, 1]. Second, attributes clustered in a vertical fragment are often determined by using an attribute affinity matrix (constructed by using an attribute usage matrix and transactions) in the conventional approaches [1, 1]; fragments generated by our vertical fragmentation approaches, however, are strictly based on the rule-to-attribute dependency matrices.

We are motivated to investigate the vertical fragmentation problem in DDDBSs since it is inherently more complicated than horizontal fragmentation, due to the total number of alternatives that are available in the vertical case [1]. More importantly, our vertical fragmentation methods generate an "optimal" clustering scheme of database relations that are referenced by rules in a DDDB.
The resultant fragmentation scheme is optimal in the sense that only relevant rules and essential data that are needed for processing a particular query in a DDDB are clustered together, to enhance efficiency and minimize overheads (in terms of communication costs) during query processing.

Since disjoint fragmentation can be more easily handled by a distributed system than more sophisticated overlapping fragmentation [1], in this paper we present a disjoint vertical fragmentation algorithm, called DVF, as one of the two subalgorithms of DCA. DVF disallows distinct fragments of a base relation R from containing common attributes of R, and these fragments are not replicated over the network. On the other hand, since disjoint fragmentation is impractical in some real-world applications due to the constraints that it imposes on database design [10], we also consider an alternative fragmentation scheme, overlapping vertical fragmentation, called OVF, as the other subalgorithm of DCA. OVF allows overlapped fragments of a base relation to be replicated and distributed over the network. These two vertical fragmentation approaches are based on the notion of direct and indirect rule-to-attribute dependencies.

In a DDDBS where response time and communication cost are the primary design issues and the involved databases are static, i.e., most of the data processing activities are retrievals (reads) rather than modifications (updates), OVF is preferable to DVF. On the other hand, if communication cost is not a major concern, as in DDDBSs built on local area networks, and database updates occur frequently, then DVF is preferable to OVF.

It is assumed that there exists a tuple identifier attribute TID [10] for each fragmented relation R such that TID is allocated with each fragment of R to the site where the fragment resides. (Tuple identifier attributes ensure the lossless-join decomposition of various vertical fragments of a base relation, and this concept is well understood in the literature [10, 1].) Prior to the introduction of the two fragmentation approaches, we give a few definitions that are used in the two proposed vertical fragmentation algorithms.

Definition 9. An $N_r \times N_{A_{R_k}}$ rule-to-attribute dependency matrix $AD_k$ of base relation $R_k$, where $N_r$ is the number of distinct rules and $N_{A_{R_k}}$ is the number of attributes in $R_k$, in a DDDBS is defined as

$AD_k[i,j] = \begin{cases} 1 & \text{if the } j\text{th attribute of } R_k \text{ is used by rule } r_i \\ 0 & \text{otherwise} \end{cases}$

It is assumed that given a rule r that references a base relation p, all the attributes of p that are not used by r are replaced by the "don't care" symbol [] in p. For example, given the rule r: q(V) :- ..., p(A 1, , A), ..., where p is a base relation with attributes A 1, A, and A, r uses only attributes A 1 and A of p, since A of p is replaced by the "don't care" symbol in r.

Definition 10. Given a base relation $R_k$ and a distinct subgraph $subdg_i$, the minimal set of attributes for $subdg_i$ and $R_k$ is the subset of attributes SB of $R_k$ such that each attribute in SB is referred to by at least one rule in $subdg_i$.

Definition 11. Let $A_{k,i}$ denote the ith attribute of a base relation $R_k$. A vertical fragment of a base relation $R_k$, denoted $F_{i,k} = \{A_{k_1}, \ldots, A_{k_j}\}$, is a subset of attributes of $R_k$ that are referenced by a rule in $subdg_i$.

3.2.1 Overlapping Vertical Fragmentation

Recall that in Section 3.1.2 subsets of rules are clustered into distinct subgraphs to be allocated to different sites in a DDDBS. In this subsection, we propose a strategy for clustering vertical fragments of base relations that are referenced by a set of rules $S_r$ in a distinct subgraph. These fragments are allocated along with $S_r$ to a chosen site in a DDDBS.

Algorithm OVF: Overlapping Vertical Fragmentation Algorithm

INPUT: Distinct subgraphs subdgs and the set of all rule-to-attribute dependency matrices $AD_1, \ldots, AD_n$, where n denotes the number of base relations in a DDDB.
OUTPUT: A set of overlapping vertical fragments FS. $F_{i,k}$ in FS is a fragment of base relation $R_k$ that is referenced by a rule in $subdg_i$.

For each distinct subgraph $subdg_i$, determine the minimal set of attributes of each base relation $R_k$ that are referenced by some rules in $subdg_i$.

Footnote: A rule-to-attribute dependency matrix is similar to the attribute usage matrix as defined in [1, 1]; the former is based on rule expressions, while the latter is based on past query history.

Footnote: The minimal set of attributes for $subdg_i$ and $R_k$ coincides with $F_{i,k}$ in OVF, but not necessarily in DVF.

In OVF, two attributes A and B in a base relation are assigned to the same fragment if A and B are referenced by a rule r to be stored at the same site, regardless of their degree of affinity [1, 1]. Hence, a query that uses r can process r at a single site. The vertical fragmentation algorithms in [1, 1], on the other hand, allocate A and B to different fragments if the affinity of A and B is negligible. As a result, maximal locality of query evaluation is guaranteed in a DDDBS using our fragmentation and allocation approaches; this clustering approach is not adopted by the conventional vertical fragmentation approaches.

Example 4. Consider Example 3 again and let AD 1, AD, AD, and AD be the given rule-to-attribute dependency matrices of the base relations R 1, R, R, and R in D of Example 1, respectively.

[Matrices AD 1, AD, AD, and AD: entries missing in the source.]

OVF yields the minimal sets of attributes for subdg 1, ..., subdg 5 as follows:

1;1 = {A 1;1, A 1;, A 1;}; 1; = 1; = {}; 1; = {A ;1, A ;, A ;, A ;};
;1 = ; = ; = {}; ; = {A ;1, A ;, A ;, A ;};
;1 = {}; ; = {A ;, A ;}; ; = {A ;1, A ;, A ;, A ;}; ; = {A ;1, A ;};
;1 = ; = ; = {}; ; = {A ;, A ;, A ;};
;1 = ; = ; = {}; ; = {A ;1, A ;, A ;, A ;}

Hence, we obtain F i;k for each distinct subgraph subdg i in Example 3. Appending each F i;k to the corresponding subdg yields

subdg 1 = {r 1, r, r, r, F 1;1 = {A 1;1, A 1;, A 1;}, F 1; = {A ;1, A ;, A ;, A ;}};
subdg 2 = {r, r, F ; = {A ;1, A ;, A ;, A ;}};
subdg 3 = {r, r, r, r, F ; = {A ;, A ;}, F ; = {A ;1, A ;, A ;, A ;}, F ; = {A ;1, A ;}};
subdg 4 = {r 8, F ; = {A ;, A ;, A ;}};
subdg 5 = {r 9, F ; = {A ;1, A ;, A ;, A ;}}
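The OVF step can be sketched as a union over rows of each rule-to-attribute dependency matrix: the fragment for a subgraph and a relation collects every attribute used by at least one rule of that subgraph. The function name `ovf_fragments`, the dictionary layout, and the small matrix below are hypothetical and serve only to illustrate the idea under that reading.

```python
def ovf_fragments(subdgs, AD):
    """Compute overlapping vertical fragments.

    subdgs: list of sets of rule indices, one set per distinct subgraph.
    AD:     dict mapping a relation name to its rule-to-attribute matrix,
            where AD[name][r][j] == 1 if rule r uses attribute j.
    Returns a dict {(subgraph index, relation name): set of attribute indices};
    empty fragments are omitted.
    """
    fragments = {}
    for i, rules in enumerate(subdgs):
        for name, matrix in AD.items():
            attrs = {j for r in rules
                     for j, used in enumerate(matrix[r]) if used}
            if attrs:
                fragments[(i, name)] = attrs
    return fragments


# Hypothetical input: three rules, one relation R1 with three attributes.
subdgs = [{0, 1}, {2}]
AD = {"R1": [[1, 0, 0],    # rule 0 uses attribute 0
             [0, 1, 0],    # rule 1 uses attribute 1
             [0, 1, 1]]}   # rule 2 uses attributes 1 and 2
print(ovf_fragments(subdgs, AD))
```

Attribute 1 of R1 lands in both fragments, which is the overlap (and hence replication) that OVF permits and DVF forbids.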

3.2.2 Disjoint Vertical Fragmentation

In this subsection, we present the approach of DVF, which is used as an alternative to OVF in conjunction with RCA. It is common in deductive databases that different rules reference the (same set of attributes of the) same base relation. It is also likely that different distinct subgraphs, as discussed in Section 3.1, include the same subset of rules that are to be distributed over different sites in a DDDBS. Hence, different distinct subgraphs may be extended to include the (same set of attributes of the) same base relation. If we do not allow replication of a subset of attributes A of a base relation, for various reasons, we must decide which distinct subgraph should include A. Therefore, one of the design issues of our disjoint vertical fragmentation approach is to determine the distinct subgraph to which A should be assigned when several different distinct subgraphs depend on A. Our primary goal in determining the allocation of A is to minimize data transfer across the network during the query evaluation process that involves A.

Given a set of attributes A in base relation R and a distinct subgraph, there are two cases to be considered for the allocation of A to that subgraph: (i) the subgraph depends on A, or (ii) it does not depend on A. Obviously, we prefer to assign A to the subgraph if it depends on A, to minimize data transfer during the query evaluation process involving rules in the subgraph that depend on A. We also need to consider the situation when two or more distinct subgraphs depend on A. Our clustering strategy is based on the query access frequency of rules, such that a more frequently referenced rule should be given priority on clustering with the data on which it depends. Before we discuss the approach of our DVF for assigning A to one of these distinct subgraphs, we give the following definitions.
Definition 12. Given a DDDBS D with $N_Q$ distinct queries, the query-access-frequency vector of D, denoted $freq_Q$, is defined as

$freq_Q[i] = n$

where i denotes query $Q_i$ ($1 \le i \le N_Q$) and n denotes the access frequency of query $Q_i$ during a given period of time in D [1].

Rule $r_j$ is said to be referenced by query $Q_i$, or $Q_i$ depends on $r_j$, if the head predicate of $r_j$ appears in $Q_i$. Furthermore, if the k-th distinct subgraph includes $r_j$, which is referenced by $Q_i$, then we say that this subgraph is referenced by $Q_i$, or $Q_i$ depends on it.

Definition 13. Given a DDDBS D with $N_Q$ distinct queries and $N_r$ distinct rules, the $N_Q \times N_r$ query-access-rule matrix Qr of D is defined as

$Qr[i,j] = \begin{cases} 1 & \text{if rule } r_j \text{ is referenced by query } Q_i \\ 0 & \text{otherwise} \end{cases}$

where $1 \le i \le N_Q$ and $1 \le j \le N_r$.

Footnote: A rule r depends on (a subset of attributes A of) base relation R, or (A in) R is referenced by r, if (attributes in A that appear as arguments of) R is in the body of r. Any distinct subgraph that includes r is said to depend on (A in) R, or (A in) R is said to be referenced by that subgraph.

Definition 14. Given a DDDBS D, its query-access-frequency vector $freq_Q$ and query-access-rule matrix Qr, the rule-access-frequency vector $freq_r$ of D is defined as

$freq_r = freq_Q \times Qr$

where $freq_r[i]$ denotes the access frequency of rule $r_i$ according to the given set of queries during a given period of time in D.

Definition 15. Given a DDDBS D and its rule-access-frequency vector $freq_r$, the subgraph-access-frequency vector of a set of distinct subgraphs of D, denoted freq, for the given set of queries during a given period of time in D is defined as

$freq[i] = \sum_{k} freq_r[k]$

where the sum ranges over every rule $r_k$ that is included in the i-th distinct subgraph.

Given the definitions above, we now define the most frequently referenced rules and distinct subgraphs as follows.

Definition 16. Given a rule-access-frequency vector $freq_r$, rule $r_i$ is the most frequently referenced rule among the set of $N_r$ rules denoted in $freq_r$ if $\max(freq_r[1], \ldots, freq_r[N_r]) = freq_r[i]$.

With respect to the reference frequency of rules according to a given set of queries, we can define the most frequently referenced distinct subgraph among the given set of distinct subgraphs. This is done by replacing $r_i$, $N_r$, and $freq_r$ in Definition 16 by the i-th distinct subgraph, the number of distinct subgraphs, and the subgraph-access-frequency vector freq, respectively.

Example 5. Consider a query-access-frequency vector $freq_Q$ and a query-access-rule matrix Qr.

[Vector freq Q and matrix Qr: entries missing in the source.]

The rule-access-frequency vector is calculated as freq r = freq Q × Qr [entries missing in the source], which indicates that r is the most frequently referenced rule by the given set of seven queries. Subsequently, using the distinct subgraphs generated in Example 3, we compute the subgraph-access-frequency vector as follows:

freq [1] = freq r [1] + freq r [] + freq r [] + freq r [] = = 8
freq [] = freq r [] + freq r [] = + 10 =
freq [] = freq r [] + freq r [] + freq r [] + freq r [] = =
freq [] = freq r [8] = 8
freq [] = freq r [9] = 19

Hence, subdg is the most frequently referenced distinct subgraph among all the distinct subgraphs computed in Example 3, followed by subdg 1, subdg, subdg, and subdg.

With the above definitions, we now present our approach for assigning a subset of attributes A of base relation R to one of the distinct subgraphs that depend on A. Given a set of queries, if a set S of two or more distinct subgraphs depends on A, we assign A to the most frequently referenced distinct subgraph in S, so that data transfer during query processing involving A can be minimized. Hence, the following criteria are used in DVF.

[Assignment Criteria] Distinct subgraph $subdg_i$ in S is assigned a subset of attributes A of base relation R if

Criterion 1. $subdg_i$ depends on A, whereas no other distinct subgraph does, or

Criterion 2. $subdg_i$ is the most frequently referenced distinct subgraph in S. If there exists more than one most frequently referenced distinct subgraph in S, then A is assigned to $subdg_i$ if $subdg_i$ is the first to be considered for A.

Let us consider Criterion 2 for a set of attributes A for which more than one distinct subgraph competes. Suppose that distinct subgraph subdg 1 depends on A, subdg depends on another subset of attributes B of R, subdg depends on subset C of R, and so forth. Furthermore, assume that the intersection $A \cap B \cap C \cap \cdots$ is nonempty. If subdg 1 is the most frequently referenced distinct subgraph among all the distinct subgraphs that depend on this intersection, then the intersection is assigned to subdg 1. We apply the same criteria to determine the assignment of the remaining attributes of A, of B, and so forth.

Algorithm DVF: Disjoint Vertical Fragmentation Algorithm

INPUT: Distinct subgraphs $subdg_1, \ldots, subdg_N$, direct rule-to-rule dependency matrix rr, rule-to-attribute dependency matrix $AD_k$ for each base relation $R_k$, query-access-frequency vector $freq_Q$, and query-access-rule matrix Qr.

OUTPUT: A set of disjoint vertical fragments FS.
$F_{i,k}$ in FS is a fragment of base relation $R_k$ that is referenced by $subdg_i$, $1 \le i \le N$.

Step 1. For each distinct subgraph $subdg_i$, identify all the attributes of each base relation $R_k$ that are referenced by (a rule in) $subdg_i$ using $AD_k$.

Step 2. For each subset of attributes A of base relation $R_k$ that is referenced by only $subdg_i$, apply the first criterion of the Assignment Criteria and let $F_{i,k} = A$.

Step 3. For each subset of attributes A of base relation $R_k$ that is referenced by more than one distinct subgraph, apply the second criterion of the Assignment Criteria to determine the assignment of A and let $F_{i,k} = A$.
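The frequency definitions and the three DVF steps can be sketched together as follows. The sketch decides the assignment attribute by attribute, which is a simplification of the subset-based wording above, and it relies on Python's `max` returning the first maximal item to mirror the "first to be considered" tie-break. All function names and the sample data are hypothetical.

```python
def rule_frequencies(freq_q, Qr):
    """freq_r = freq_Q x Qr: access frequency of each rule over all queries."""
    return [sum(freq_q[i] * Qr[i][j] for i in range(len(freq_q)))
            for j in range(len(Qr[0]))]


def subgraph_frequencies(freq_r, subdgs):
    """Sum the rule frequencies over the rules of each distinct subgraph."""
    return [sum(freq_r[r] for r in rules) for rules in subdgs]


def dvf_fragments(subdgs, AD, freq_sub):
    """Assign each referenced attribute to exactly one distinct subgraph."""
    fragments = {}
    for name, matrix in AD.items():
        for j in range(len(matrix[0])):
            # Step 1: which subgraphs reference attribute j of this relation?
            users = [i for i, rules in enumerate(subdgs)
                     if any(matrix[r][j] for r in rules)]
            if not users:
                continue  # unreferenced attribute: left to the allocation step
            # Steps 2 and 3: a sole user wins outright; otherwise the most
            # frequently referenced subgraph wins (first one on a tie).
            winner = max(users, key=lambda i: freq_sub[i])
            fragments.setdefault((winner, name), set()).add(j)
    return fragments


# Hypothetical data: two queries, three rules, one relation with four attributes.
freq_q = [3, 1]
Qr = [[1, 0, 0],
      [0, 1, 1]]
subdgs = [{0}, {1, 2}]
AD = {"R1": [[1, 1, 0, 0],
             [0, 1, 0, 0],
             [0, 0, 1, 0]]}
freq_r = rule_frequencies(freq_q, Qr)
freq_sub = subgraph_frequencies(freq_r, subdgs)
print(freq_r, freq_sub, dvf_fragments(subdgs, AD, freq_sub))
```

Attribute 1 is referenced by both subgraphs and goes to the more frequently referenced one, while attribute 3 is referenced by no rule and therefore receives no fragment.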

Example Consider base relation R of the DDDB in Example 1 and the subgraph-access-frequency vector computed in Example. Suppose that the rule-to-attribute dependency matrices are as given in Example. AD in Example indicates that rule r depends on attributes A ;1, A ;, A ; and A ;, whereas rule r 8 depends on attributes A ;, A ; and A ;. As a result, subdg 1, subdg, and subdg, which include r, are competing for A ;1, A ;, A ; and A ;. SubDG, which includes r 8, is competing with subdg 1, subdg, and subdg for A ; and A ;. Using the subdgs and matrices in Examples and, respectively, Step 1 of DVF identifies the sets of dependent attributes as shown in Table 1.

subdg     rules          dependent attributes
subdg 1   r 1, r, r, r   {A 1;1, A 1;, A 1;}, {A ;1, A ;, A ;, A ;}
subdg     r, r           {A ;1, A ;, A ;, A ;}
subdg     r, r, r, r     {A ;, A ;}, {A ;1, A ;}, {A ;1, A ;, A ;, A ;}
subdg     r 8            {A ;, A ;, A ;}
subdg     r 9            {A ;1, A ;, A ;, A ;}

Table 1: Dependencies between subgraphs and attributes

Note that subdg 1, subdg, and subdg are competing for the same subset of attributes in R, i.e., {A ;1, A ;, A ;, A ;}, whereas subdg is competing for a different subset of attributes of R, i.e., {A ;, A ;, A ;}. Since no other distinct subgraph competes for attribute A ; with subdg, by Step 2 of DVF, A ; is assigned to subdg. We now consider the assignment of A ;1, A ;, A ;, and A ; to either subdg 1, subdg, subdg, or partly to subdg. SubDG 1, subdg, and subdg compete for the common attributes A ;1 and A ;, whereas subdg 1, subdg, subdg and subdg compete for A ; and A ;. According to the subgraph-access-frequency vector computed in Example and the Assignment Criteria, we assign A ;1 and A ; to subdg since subdg is the most frequently referenced subgraph among the three distinct subgraphs. Likewise, we assign A ; and A ; to subdg since it is the most frequently referenced among the four distinct subgraphs.
Hence, appending the fragmentation of R to the subdgs in Example yields

subdg 1 {r 1, r, r, r, F 1; = ∅} = {r 1, r, r, r};
subdg {r, r, F ; = ∅} = {r, r};
subdg {r, r, r, r, F ; = {A ;1, A ;, A ;, A ;}};
subdg {r 8, F ; = {A ;}};
subdg {r 9}

Note that we have yet to assign attribute A ; to any of the distinct subgraphs. This happens because no distinct subgraph depends on A ;. We discuss the strategy to allocate A ; in the Cluster Allocation Algorithm in the next section.

Cluster Allocation

Having included in each distinct subgraph a set of rules $S_r$ with the corresponding set of vertical fragments of base relations (which are necessary to evaluate the rules in $S_r$) using RCA and either OVF or DVF, FAA proceeds to choose network sites for the allocation of the clusters by using the Cluster Allocation Algorithm (CAA). Each cluster contains the rules and all the corresponding vertical fragments of base relations in a particular distinct subgraph. This section describes the steps in CAA.

Our strategy for the allocation of clusters of rules and data is consistent with the strategy that we use for clustering rules and data discussed in the preceding sections, i.e., our primary concern in the allocation of the clusters over the network is to minimize data transfer across the network during the query evaluation process. For this purpose, we consider the access frequency of a query at a particular network site, which is shown in the query-access-frequency-at-site matrix.

Definition 17 Given a DDDBS with $N_S$ network sites and $N_Q$ queries, an $N_S \times N_Q$ query-access-frequency-at-site matrix $\mathit{freq}_{SQ}$ is defined as follows:

$$\mathit{freq}_{SQ}[i, j] = n,$$

where $1 \le i \le N_S$, $1 \le j \le N_Q$, and n denotes the number of times the jth query is initiated at site i during a given period of time.

Furthermore, the sum of the jth column of $\mathit{freq}_{SQ}$ denotes the access frequency of the jth query across the different sites. Thus, the sum of the jth column in $\mathit{freq}_{SQ}$ has to be the same as $\mathit{freq}_Q[j]$ as given in Definition 1. Subsequently, we can derive $\mathit{freq}_Q$ using $\mathit{freq}_{SQ}$ as follows:

$$\mathit{freq}_Q[j] = \sum_{i=1}^{N_S} \mathit{freq}_{SQ}[i, j], \quad 1 \le j \le N_Q.$$

Example 7 Suppose that we are given the following query-access-frequency-at-site matrix for the four network sites in Example 1 and the seven queries mentioned in Example: $\mathit{freq}_{SQ} = $ [matrix entries missing]. The summation of each column of $\mathit{freq}_{SQ}$ yields $\sum_{i=1}^{4} \mathit{freq}_{SQ}[i, j] = $ [vector entries missing], which is $\mathit{freq}_Q$ in Example.
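The column-sum relationship between the query-access-frequency-at-site matrix and the query-access-frequency vector can be checked with a few lines of Python. The helper name `column_sums` and the counts below are hypothetical and serve only as an illustration; they are not the values of the example above.

```python
def column_sums(matrix):
    """Sum each column of a row-major matrix given as a list of equal-length rows."""
    return [sum(col) for col in zip(*matrix)]


# Hypothetical counts: 3 sites (rows) by 4 queries (columns).
freq_sq = [
    [2, 0, 1, 4],
    [0, 3, 1, 0],
    [5, 1, 0, 2],
]

# freq_q[j] is the total number of times query j is initiated over all sites.
freq_q = column_sums(freq_sq)
print(freq_q)
```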
Using the query-access-frequency-at-site matrix and the query-access-rule matrix of a DDDBS, we determine the access frequency of a rule at a network site, represented by the rule-access-frequency-at-site matrix.

Definition 18 Given a query-access-frequency-at-site matrix $\mathit{freq}_{SQ}$ and a query-access-rule matrix $Qr$, an $N_S \times N_r$ rule-access-frequency-at-site matrix $\mathit{freq}_{Sr}$ can be computed as

$$\mathit{freq}_{Sr} = \mathit{freq}_{SQ} \cdot Qr,$$

where $\mathit{freq}_{Sr}[i, j] = n$ denotes that site i accesses the jth rule n times during a given period of time.

Furthermore, the sum of the jth column of $\mathit{freq}_{Sr}$ denotes the access frequency of rule j across the different sites. Thus, the sum has to be the same as $\mathit{freq}_r[j]$ in Definition 1. Subsequently, we can derive $\mathit{freq}_r$ using $\mathit{freq}_{Sr}$ as follows:

$$\mathit{freq}_r[j] = \sum_{i=1}^{N_S} \mathit{freq}_{Sr}[i, j], \quad 1 \le j \le N_r.$$

Example 8 Consider $\mathit{freq}_{SQ}$ in Example and $Qr$ in Example again. We obtain the access frequency of a rule at a network site by computing the rule-access-frequency-at-site matrix $\mathit{freq}_{Sr}$ as follows: $\mathit{freq}_{Sr} = \mathit{freq}_{SQ} \cdot Qr = $ [matrix entries missing]. The summation of each column of $\mathit{freq}_{Sr}$ yields $\sum_{i=1}^{4} \mathit{freq}_{Sr}[i, j] = $ [vector entries missing], which is $\mathit{freq}_r$ in Example.

Based on the notion of the access frequency of a rule at different network sites, we now define the access frequency of a distinct subgraph at a network site, represented by the subgraph-access-frequency-at-site matrix, using the rule-access-frequency-at-site matrix.

Definition 19 Given a rule-access-frequency-at-site matrix $\mathit{freq}_{Sr}$, an $N_S \times N$ subgraph-access-frequency-at-site matrix of the given set of sites and distinct subgraphs, denoted $\mathit{freq}_S$, is defined as follows:

$$\mathit{freq}_S[i, j] = \sum_k \mathit{freq}_{Sr}[i, k],$$

where rule $r_k$ is included in distinct subgraph j and j is referenced by a query at site i.
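Definitions 18 and 19 can be illustrated with plain Python lists. The sketch below assumes that $Qr$ is a 0/1 matrix (query by rule) and that subgraph membership is available as a 0/1 matrix `segment_rule` (subgraph by rule); both encodings, the helper `matmul`, and all numbers are assumptions made for this illustration.

```python
def matmul(a, b):
    """Ordinary matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


freq_sq = [[2, 0, 1],            # 2 sites x 3 queries (hypothetical counts)
           [0, 3, 1]]
qr = [[1, 1, 0, 0],              # 3 queries x 4 rules; 1 = query accesses rule
      [0, 1, 1, 0],
      [0, 0, 0, 1]]

# Definition 18: how often each site accesses each rule.
freq_sr = matmul(freq_sq, qr)

# Definition 19: sum the rule frequencies over the rules of each subgraph.
segment_rule = [[1, 1, 0, 0],    # 2 subgraphs x 4 rules; rule 1 is shared
                [0, 1, 1, 1]]
segment_rule_t = [list(col) for col in zip(*segment_rule)]
freq_s = matmul(freq_sr, segment_rule_t)

print(freq_sr)
print(freq_s)
```

A site that never accesses any rule of a subgraph simply gets a zero entry, which matches the condition that the subgraph be referenced by a query at that site.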

We now define the site which accesses distinct subgraph j most frequently among the given $N_S$ sites in a network as follows:

Definition 20 Given the subgraph-access-frequency-at-site matrix $\mathit{freq}_S$ of a DDDBS with $N_S$ sites, site $S_i$ is the site which accesses distinct subgraph j ($1 \le j \le N$) most frequently if $\max(\mathit{freq}_S[1, j], \ldots, \mathit{freq}_S[N_S, j]) = \mathit{freq}_S[i, j]$.

Example 9 Consider $\mathit{freq}_{Sr}$ in Example 8 and the five distinct subgraphs generated in Example again. We obtain the subgraph-access-frequency-at-site matrix as follows: $\mathit{freq}_S = $ [matrix entries missing]. Using $\mathit{freq}_S$, we can determine which site accesses subgraph i, $1 \le i \le 5$, most frequently among all the network sites. As computed, site is the one accessing 1 and most frequently; sites and are the sites most frequently accessing, site for, and site for.

We now propose a strategy for allocating each cluster of rules and vertical fragments of base relations, computed by either OVF or DVF, to a network site.

Algorithm CAA: Cluster Allocation Algorithm

INPUT: A set of clusters $C_1, \ldots, C_N$, a set of network sites $S_1, \ldots, S_{N_S}$ with the associated subgraph-access-frequency-at-site matrix $\mathit{freq}_S$, the network topology matrix T, and E, which is a set of attributes not referenced by any distinct subgraph.

OUTPUT: Allocation of each cluster $C_i$, $1 \le i \le N$, and E to a network site.

Step 1. For each cluster $C_i$, identify the site S M i which accesses $C_i$ most frequently among all of the sites in the network using $\mathit{freq}_S$.

Step 2. If there exists only one site S M i which is the most frequent access site of $C_i$, then allocate $C_i$ to S M i.

Step 3. If a number of sites $S_1, \ldots, S_n$ are the most frequent access sites of $C_i$, choose the one of these sites whose connection weight is the smallest using T. (If more than one site has the smallest connection weight, arbitrarily choose one of these sites.) Allocate $C_i$ to the chosen site.

Step 4. Allocate E to the site in the network whose connection weight is the smallest. (If there exists more than one such site, arbitrarily choose one of them.)

Note that we consider the connection weight in the steps above. The connection weight $CW_i$ of site $S_i$ indicates the data transfer cost between $S_i$ and all other sites in the network. (It is assumed that the physical distance between two network sites is proportional to the data transfer cost.) When two or more sites access a cluster C at the same frequency, it is reasonable to allocate C to the site whose connection weight is smaller than the others in order to minimize data transfer cost during the query evaluation process in general.

Example 10 Consider the subgraph-access-frequency-at-site matrix $\mathit{freq}_S$ in Example 9 and the network topology matrix T given in Example 1 again. Using $\mathit{freq}_S$, we identify the site S M i ($1 \le i \le 5$) that accesses cluster $C_i$ most frequently as follows: S M 1 = S, S M = S, S M = S ; S, S M = S, S M = S. Hence, we allocate C 1 and C to site, C to site, and C to site. In the case of C, site and site access it at the same frequency, thus we consider the connection weight of these sites. The connection weight $CW_i$ of each site, $1 \le i \le 4$, is computed as follows using T: CW 1 = = 8, CW = =, CW = = 1, CW = = 10. Hence, C is allocated to site since CW < CW. Furthermore, by Step 4 of Algorithm CAA, A ; in base relation R is allocated to site since A ; is not referenced by any rule, and hence not by any distinct subgraph.

Mathematical Interpretation of FAA

In this section, we present the mathematical implication of the proposed algorithms.

Mathematical Implication of RCA

We construct the direct and indirect rule-to-rule dependency matrix $R_{rr}$ and the segment-to-rule dependency matrix $R'_{rr}$, which correspond to the steps of constructing (prospective) distinct subgraphs as discussed in Section.1. This can be done by first computing the reachability matrix $R_{rule}$, which captures all direct and indirect rule-to-rule dependencies among different rules, from the given direct rule-to-rule dependency matrix rr, and then

performing a Boolean addition on $R_{rule}$ and an $N_r \times N_r$ identity matrix I, where $N_r$ is the number of distinct rules in a given database. ($R_{rule} \lor I$ retains all the rules that are not reachable from other rules in the database.) Hence,

$$R_{rule} = rr^{(1)} \lor rr^{(2)} \lor \cdots \lor rr^{(N_r)} \qquad (1)$$

$$R_{rr} = R_{rule} \lor I \qquad (2)$$

Hereafter, we proceed to generate distinct subgraphs as computed in Section.1. using $R_{rr}$ by extracting each row of $R_{rr}$ that is not included in any other row of $R_{rr}$. (Row i of $R_{rr}$ includes row j of $R_{rr}$ if, for all k with $1 \le k \le N_r$, $R_{rr}[j, k] = 1$ implies $R_{rr}[i, k] = 1$, where $1 \le i, j \le N_r$ and $N_r$ denotes the number of distinct rules in a database.) The resultant matrix is $R'_{rr}$, the segment-to-rule dependency matrix, where each row is called a segment, which is a vector representation of a distinct subgraph computed in Section.1..

Example 11 Consider the direct rule-to-rule dependency matrix rr in Example 1. Hence,

$$R_{rule} = rr^{1} \lor rr^{2} \lor \cdots \lor rr^{9} = [\text{matrix entries missing}]$$

$R_{rr} = R_{rule} \lor I = $ [matrix entries missing]; $R'_{rr} = $ [matrix entries missing]

The ith row of $R'_{rr}$, $1 \le i \le 5$, corresponds to the distinct subgraph subdg i in Example.

Mathematical Implication of OVF

Each row of the segment-to-rule matrix $R'_{rr}$ is a segment which includes rules that are referenced by the corresponding distinct subgraph. Since the given rule-to-attribute dependency matrix $AD_k$ of base relation $R_k$ captures the information of which rules reference which attributes in base relation $R_k$, using $R'_{rr}$ and $AD_k$, we can determine which segment references which attributes in $R_k$. Hence, the set of minimal overlapping vertical fragments $F_k$ of base relation $R_k$ can be computed as follows:

$$F_k = R'_{rr} \cdot AD_k$$

Note that the '$\cdot$' operation in the above formula is not a Boolean multiplication. Instead, it is a normal matrix multiplication. The ith row of $F_k$ ($1 \le i \le N$) with at least one non-zero entry yields a minimal vertical fragment of base relation $R_k$ for the ith segment, which represents the ith distinct subgraph, i.e., subdg i, computed by RCA. (The ith row in $F_k$ is a minimal overlapping vertical fragment of $R_k$ since the ith row includes only the attributes of $R_k$ that are referenced by subdg i.)

Example 12 Consider the segments captured in $R'_{rr}$, where $R'_{rr}$ is computed in Example 11, and the rule-to-attribute dependency matrices $AD_1$, $AD_2$, $AD_3$, and $AD_4$ of base relations $R_1$, $R_2$, $R_3$, and $R_4$, respectively, given in Example. Then,

$F_1 = R'_{rr} \cdot AD_1 = $ [matrix entries missing]; $F_2 = R'_{rr} \cdot AD_2 = $ [matrix entries missing]; $F_3 = R'_{rr} \cdot AD_3 = $ [matrix entries missing]; $F_4 = R'_{rr} \cdot AD_4 = $ [matrix entries missing]

We see that the ith row of $F_k$ ($1 \le i \le 5$, $1 \le k \le 4$) corresponds to the fragment indexed i,k in Example. Using the ith row of each $F_k$ and the ith row of $R'_{rr}$, we can construct the corresponding distinct subgraph subdg i in Example.
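The matrix formulation of RCA and OVF can be traced on a small input. The following sketch is only an illustration under stated assumptions: the helper names are hypothetical, `rr[i][j] = 1` is taken to mean that rule i directly depends on rule j, identical rows of $R_{rr}$ are merged before the inclusion test, and the 4-rule, 3-attribute data is a toy input rather than the paper's example.

```python
def bool_matmul(a, b):
    """Boolean matrix product: 1 if some k has a[i][k] = 1 and b[k][j] = 1."""
    return [[int(any(x and y for x, y in zip(row, col))) for col in zip(*b)] for row in a]


def bool_or(a, b):
    """Entry-wise Boolean addition of two 0/1 matrices."""
    return [[int(x or y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def reachability(rr):
    """R_rule = rr^(1) OR rr^(2) OR ... OR rr^(N_r), as in equation (1)."""
    power, closure = rr, rr
    for _ in range(len(rr) - 1):
        power = bool_matmul(power, rr)
        closure = bool_or(closure, power)
    return closure


def segments(r_rr):
    """Keep each row of R_rr that is not included in any other, different row."""
    def includes(big, small):
        return all(b >= s for b, s in zip(big, small))
    unique = []
    for row in r_rr:
        if row not in unique:          # assumption: identical rows are merged
            unique.append(row)
    return [row for row in unique
            if not any(other != row and includes(other, row) for other in unique)]


def matmul(a, b):
    """Ordinary (non-Boolean) matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


rr = [[0, 1, 0, 0],   # rule 0 depends on rule 1
      [0, 0, 1, 0],   # rule 1 depends on rule 2
      [0, 0, 0, 0],
      [0, 0, 1, 0]]   # rule 3 depends on rule 2
identity = [[int(i == j) for j in range(len(rr))] for i in range(len(rr))]

r_rr = bool_or(reachability(rr), identity)   # equation (2)
r_seg = segments(r_rr)                       # segment-to-rule matrix
ad = [[1, 0, 0],      # 4 rules x 3 attributes of one base relation
      [0, 1, 0],
      [0, 1, 1],
      [0, 0, 0]]
f_k = matmul(r_seg, ad)
print(r_rr, r_seg, f_k)
```

Because the product is an ordinary matrix multiplication, an entry of `f_k` counts how many rules of the segment reference the attribute, so any non-zero entry marks the attribute as part of the overlapping fragment.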

Mathematical Implication of DVF

As discussed in Section.., the major task of DVF is to determine the distinct subgraph to which a subset of attributes A of a base relation should be assigned when more than one distinct subgraph references A. This task is accomplished by Algorithm DVF using the Assignment Criteria as discussed in Section... We now show that the task performed by Algorithm DVF can be accomplished by manipulating the segment-to-rule matrix $R'_{rr}$, the rule-to-attribute dependency matrix $AD_k$ for each base relation $R_k$, the query-access-frequency vector $\mathit{freq}_Q$, which is derived from the query-access-frequency-at-site matrix $\mathit{freq}_{SQ}$, and the query-access-rule matrix $Qr$ as follows, given N, the number of distinct subgraphs, and $N_{A_k}$, the number of attributes in $R_k$:

[Step 1 of DVF] For each distinct subgraph subdg i, $1 \le i \le N$, the subset of attributes A of base relation $R_k$ that is referenced by subdg i can be identified by using $R'_{rr}$ and $AD_k$ as follows:

$$F_k = R'_{rr} \cdot AD_k$$

where '$\cdot$' is the matrix multiplication operation, and for any j ($1 \le j \le N_{A_k}$), $F_k[i, j] = 1$ denotes that the jth attribute of $R_k$ is referenced by subdg i, and

$$A = \bigcup_{1 \le j \le N_{A_k}} \mathit{attr}(F_k[i, j]), \quad \text{where } \mathit{attr}(F_k[i, j]) = \begin{cases} \{A_{k,j}\} & \text{if } F_k[i, j] = 1 \\ \emptyset & \text{otherwise.} \end{cases}$$

[Step 2 of DVF] The subset of attributes $B_i$ of base relation $R_k$ that is referenced by only one distinct subgraph subdg i can be determined by using $F_k$ computed above. If there exists only one i ($1 \le i \le N$) such that $F_k[i, j] = 1$, for any j ($1 \le j \le N_{A_k}$), then the jth attribute of $R_k$ is referenced only by subdg i. Hence,

$$B_i = \bigcup_{1 \le j \le N_{A_k}} \mathit{attr}(F_k[i, j]), \quad \text{where } \mathit{attr}(F_k[i, j]) = \begin{cases} \{A_{k,j}\} & \text{if } F_k[i, j] = 1 \text{ and } F_k[i', j] = 0 \text{ for all } i' \ne i,\ 1 \le i' \le N \\ \emptyset & \text{otherwise,} \end{cases}$$

which indicates that $B_i$ should be assigned to subdg i according to the first criterion of the Assignment Criteria.
[Step 3 of DVF] The jth attribute $C_{k,j}$ of base relation $R_k$ that is referenced by more than one distinct subgraph can also be determined by using $F_k$. If there exists more than one i ($1 \le i \le N$) such that $F_k[i, j] = 1$, for any j ($1 \le j \le N_{A_k}$), then $C_{k,j}$ is referenced by more than one distinct subgraph. Hence,

$$C_{k,j} = \mathit{attr}(F_k[i, j]), \quad \text{where } \mathit{attr}(F_k[i, j]) = \begin{cases} \{A_{k,j}\} & \text{if } F_k[i, j] = 1,\ F_k[i', j] = 1, \text{ and } i \ne i' \text{ for some } i'\ (1 \le i' \le N) \\ \emptyset & \text{otherwise.} \end{cases}$$

To simulate the second criterion of the Assignment Criteria, we first compute the rule-access-frequency vector $\mathit{freq}_r = \mathit{freq}_Q \cdot Qr$ and the subgraph-access-frequency vector $\mathit{freq}[i] = \sum_k \mathit{freq}_r[k]$, for any rule $r_k$ that is included in subgraph i. (Note that $\mathit{freq}$ can be computed by the matrix multiplication of $R'_{rr}$ and $\mathit{freq}'_r$, i.e., $\mathit{freq} = R'_{rr} \cdot \mathit{freq}'_r$, where $\mathit{freq}'_r$ is the column vector representation of $\mathit{freq}_r$.) Using $\mathit{freq}$, we determine the most frequently referenced distinct subgraph M from the set of distinct subgraphs $S_g$ in which each distinct subgraph depends on $C_{k,j}$. ($S_g$ can be determined by extracting all the corresponding entries in the jth column of $F_k$: if $F_k[i, j] = 1$ ($1 \le i \le N$), then subdg i depends on the jth attribute of base relation $R_k$, as discussed in Step 1 of DVF.) M is the distinct subgraph in $S_g$ whose entry in $\mathit{freq}$ is larger than or equal to the entry of every other distinct subgraph in $S_g$, i.e., $\mathit{freq}[M] = \max(\mathit{freq}[i_1], \mathit{freq}[i_2], \ldots, \mathit{freq}[M], \ldots, \mathit{freq}[i_n])$, where the distinct subgraphs corresponding to $\mathit{freq}[i_1], \ldots, \mathit{freq}[i_n]$ are the distinct subgraphs in $S_g$. Then, we assign $C_{k,j}$ to M.

Example 13 Consider the resultant matrices $F_1$, $F_2$, $F_3$, and $F_4$ in Example 12. The value 1 in each matrix indicates that the corresponding attribute is referenced by a distinct subgraph. It is not difficult to see that the third column of the ith row of Table 1 includes all the attributes whose corresponding entries are set to 1 in the ith row of $F_1$, $F_2$, $F_3$, and $F_4$, respectively, which is Step 1 of DVF. Let us consider F. Note that F [; ] denotes the th attribute of R, A ;, which is referenced only by subdg. Hence, we assign A ; to subdg, which is Step 2 of DVF.

On the other hand, A ;1 and A ; are referenced by subdg 1, subdg and subdg, whereas A ; and A ; are referenced by subdg 1, subdg, subdg, and subdg. Therefore, we apply the second criterion of the Assignment Criteria to determine the distinct subgraph to which the set of attributes is to be assigned by the following matrix manipulation, where $\mathit{freq}_Q$ and $Qr$ are given in Example, and $R'_{rr}$ is given in Example 11: $\mathit{freq}_r = \mathit{freq}_Q \cdot Qr = $ [vector entries missing]
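The matrix manipulation behind the second criterion can be sketched as below. All names and numbers are hypothetical and chosen only for illustration; `None` marks an attribute that no segment references, and ties are resolved in favour of the segment with the lowest index, which is one possible reading of the "first to be considered" rule.

```python
def vecmat(v, m):
    """Row vector times matrix."""
    return [sum(x * y for x, y in zip(v, col)) for col in zip(*m)]


def matvec(m, v):
    """Matrix times column vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def assign_columns(f_k, freq_seg):
    """For each attribute (column of F_k), pick the competing segment with the highest frequency."""
    owners = []
    for j in range(len(f_k[0])):
        competitors = [i for i in range(len(f_k)) if f_k[i][j] != 0]   # the set S_g
        if not competitors:
            owners.append(None)        # unreferenced attribute
        else:
            owners.append(max(competitors, key=lambda i: freq_seg[i]))
    return owners


freq_q = [1, 1, 6]                     # hypothetical query frequencies
qr = [[1, 1, 0, 0],                    # 3 queries x 4 rules
      [0, 1, 1, 0],
      [0, 0, 0, 1]]
r_seg = [[1, 1, 1, 0],                 # 2 segments x 4 rules
         [0, 0, 1, 1]]
f_k = [[1, 2, 1, 0],                   # 2 segments x 4 attributes
       [0, 1, 1, 0]]

freq_r = vecmat(freq_q, qr)            # rule-access-frequency vector
freq_seg = matvec(r_seg, freq_r)       # subgraph-access-frequency vector
owners = assign_columns(f_k, freq_seg)
print(freq_r, freq_seg, owners)
```

Here the first attribute has a single competitor, the next two go to the more frequently accessed second segment, and the last attribute stays unassigned, which corresponds to the set E handled by CAA.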
