Attribute-Pair Range Rules


Lecture Notes in Computer Science

Jerome Robinson, Barry G. T. Lowden
Department of Computer Science, University of Essex
Colchester, Essex, CO4 3SQ, U.K.
{robij, lowdb}@essex.ac.uk

Abstract. This paper examines the properties of metadata in the form of IF-THEN rules which contain two predicates on attributes of a relational database table. For example: a(15..30) → d(243..271), which means "if the value of attribute 'a' in a tuple is in the range 15 to 30 then the value of attribute 'd' will be in the range 243 to 271." Metadata of this kind is useful in Semantic Query Optimisation and Remote Cache Management. The two predicates (antecedent and consequent) in each rule are Selection Conditions or constraints of the type found in database queries. Each condition therefore denotes a subset of a database table. Rules can be cascaded, using subrange containment as the link between successive rules. The set of rules can therefore be regarded as a set of edges in a Condition Dependency Graph, and using the rule set amounts to path discovery in the graph. The purpose of the current paper is to introduce some of the properties of attribute-pair range rules.

1 Introduction

If data servers took more interest in their data they could be more helpful in their response to client queries. Metadata in the form of pairs of range conditions from different attributes can be easily derived from the data, either by induction triggered by queries [4] or by systematic analysis [6]. The resulting attribute-pair rules can be used for Semantic Query Optimisation and cache management in remote clients. Each Attribute Pair (AP) Rule is a subset descriptor for a database table, comprising a set selector (the antecedent) and a set descriptor (the consequent).
Methods for choosing appropriate subsets include deriving rules by systematic analysis and set reduction as in [6], associating a rule with each bar of a histogram describing the table as in [7], or recognizing subsets that are frequently used during database access.

Some properties of AP Rules are now specified. Each rule has the simple form A → B, where antecedent A and consequent B are range constraints or conditions applied to attributes a and b respectively. Range conditions include equality conditions, such as (a = n). Range conditions have two forms:

i) single-test conditions: (a θ n) where θ ∈ { <, ≤, =, ≥, > }, 'a' is an attribute name, and n is a value of the same type as the attribute, and
ii) double-test conditions: (n ≤ a ≤ m) where n and m are values of attribute a's type. a(n..m) is an abbreviation for (n ≤ a ≤ m).

We assume a(n..m) denotes a closed interval (i.e. the values n and m are included in the range).

Rules are used by matching (comparing) condition A or B with another range condition such as a query condition or a condition in another rule. Two range conditions on the same database table are comparable if they have the same attribute name. E.g. b(5..81), (b > 63) and (b = 549) can all be compared because they all denote ranges for attribute b. Comparison is a test for subrange containment.

Set(A) is a subset of Set(B) in the rule A → B

The set selected by applying selection condition A is a subset of the tuples obtained by condition B. This is a necessary consequence of the assertion A → B, which can be read as "all As are Bs". Furthermore, a rule partitions its database table into three disjoint subsets, corresponding to selection conditions: A ∧ B, ¬A ∧ B, ¬A ∧ ¬B. The significance of this table partition is discussed in [7].

Attribute Pair Rules are used as subset selector/descriptor pairs, and they are an exact description of the data rather than the probabilistic knowledge usually produced in KDD. A large amount of such metadata is desirable, for a full description of the data, but the rule set must be structured for fast access. Examination of the Hasse Diagram for the ordered set of attribute ranges (e.g. Figs. 2 and 3) identifies an efficient data structure, access strategy, and systematic rule discovery algorithm for arbitrary sets of range rules.

The structure of the rest of the paper is as follows. Section 2 introduces Remote Cache Management as an example of the use of attribute-pair rules. Section 3 describes the rule set as a Condition Dependency Graph (CD Graph) and mentions Semantic Query Optimisation [4, 8, 9, 10, 11] as a second application requiring these simple subset descriptors.
Section 4 provides a data structure for range rule sets which allows rapid range matching, since fast access to the rule base is important to its applications.

2 Remote Cache Management

A query-processing issue of current research interest [e.g. 1, 2, 3, 5] concerns remote access, via wide area network, to the data server. Mediators, for example, are a local interface for clients to multiple remote data servers. They cache queries in order to reuse cached values to answer later queries. This provides faster query answering, achieves some degree of independence from network delays and breakdowns, and helps reduce internet traffic congestion. Their problem is to identify queries that can be answered with locally-cached data. Query result containment by previous result sets, and overlap (partial containment), must be recognized. Existing attempts to solve this problem by query analysis suffer the same drawback as Conventional Query Optimisers: they are syntactic rather than semantic operations. Knowledge about the data can help.

For example: A query, Q1, selects all tuples in a table where attribute b is greater than 27. The result set is cached. A new query whose selection condition is (40 ≤ b ≤ 53) can obviously be answered from set Q1. But query Q3, whose selection condition is (21 ≤ e ≤ 33), has no apparent connection with the cached set, since it refers to a different attribute. No amount of syntactic analysis will reveal any connection. But the data server may know that (21 ≤ e ≤ 33) → (b > 27), i.e. all tuples in that table with e column values in the range [21..33] have a value of the b attribute that is > 27. Therefore the result set for Q3 is contained in the cached set from Q1.

Two ways to use this knowledge of the data to assist remote cache managers are:

1. If the Mediator sent query Q3 to the data server, the server would recognize the connection with previous query Q1 and reply with a selection condition to apply to the specified cache set to obtain the new results. This short reply is a small single-packet message able to pass rapidly through the store-and-forward network, faster than the result set.
2. In the case of a static data repository (such as a Data Warehouse, Data Archive, or just stable data) the server could send relevant rules to the remote cache manager, so that it can supplement its decisions with knowledge of the data. Each query result set dispatched to a registered cache manager can be supplemented by a set of rules: N-1 rules for an N-ary database relation, one for each attribute other than the one in the current query.

For example: A cached result is for the range condition a(15..30) on attribute a. For each other attribute x_i in the table, produce a rule: range(x_i) → a(15..30). This means "all tuples with x_i value within the specified range(x_i) will have values of a in the range [15..30] and are therefore in the local cache". Such rules can be obtained from the existing rule set, as follows.
Merging Rules

Each rule to be dispatched to the remote cache manager is obtained by merging cache-relevant rules from the server's rule set. There are 2 steps:

Step 1. Rules, in general, are classified according to the pair of attributes they contain. From the rules whose consequent condition refers to attribute a, and whose antecedent attribute is d, say, extract those with consequent range within the cached range [15..30]. E.g.:

R1  d( ) → a(16..25)
R2  d( ) → a( )
R3  d( ) → a(21..30)

R1 and R2 are nested rules, so R2 can be deleted from the set: it defines characteristics of a subset of the tuples covered by R1. R1 and R3 are overlap rules. These are the sort to merge, in step 2.

Step 2. From the selected set of d(n_i..m_i) → a(p_i..q_i) rules, obtain s = min(n_i), t = max(m_i), v = min(p_i), w = max(q_i). The merged rule is d(s..t) → a(v..w), i.e. the union of antecedent conditions implies the union of consequents. E.g. from the rules above, d( ) → a(16..30). This means all tuples selected by condition d( ) will have attribute a values in the range [16..30].
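The two merging steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Rule`, `select_for_cache` and `merge_overlapping` are hypothetical, the antecedent ranges in the usage lines are invented for illustration, and only the consequent ranges a(16..25) and a(21..30) and the cached range [15..30] come from the paper's example. The sketch also assumes integer bounds and checks explicitly that the antecedents overlap, since otherwise their union would not be a single range.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    """d(n..m) -> a(p..q); all four bounds are inclusive."""
    n: int
    m: int
    p: int
    q: int


def select_for_cache(rules: List[Rule], lo: int, hi: int) -> List[Rule]:
    """Step 1: keep rules whose consequent lies inside the cached range
    [lo..hi], then drop any rule whose antecedent is strictly nested
    inside another selected rule's antecedent."""
    inside = [r for r in rules if lo <= r.p and r.q <= hi]

    def nested(r: Rule) -> bool:
        return any(
            o.n <= r.n and r.m <= o.m and (o.n, o.m) != (r.n, r.m)
            for o in inside
        )

    return [r for r in inside if not nested(r)]


def merge_overlapping(rules: List[Rule]) -> Optional[Rule]:
    """Step 2: d(s..t) -> a(v..w) with s=min(n), t=max(m), v=min(p), w=max(q).
    Raises if the antecedents do not form one overlapping chain."""
    if not rules:
        return None
    ordered = sorted(rules)            # ascending lower antecedent limit
    reach = ordered[0].m               # upper limit of the union so far
    for r in ordered[1:]:
        if r.n > reach:
            raise ValueError("antecedent ranges do not overlap")
        reach = max(reach, r.m)
    return Rule(
        min(r.n for r in rules), max(r.m for r in rules),
        min(r.p for r in rules), max(r.q for r in rules),
    )


# Hypothetical antecedent ranges; consequents follow the example in the text.
r1 = Rule(40, 60, 16, 25)
r2 = Rule(45, 55, 17, 22)   # nested inside r1, so it is dropped in step 1
r3 = Rule(55, 80, 21, 30)   # overlaps r1, so it is merged in step 2
kept = select_for_cache([r1, r2, r3], lo=15, hi=30)
merged = merge_overlapping(kept)
print(merged)   # Rule(n=40, m=80, p=16, q=30)
```

With these made-up antecedents the merged rule reads d(40..80) → a(16..30): the consequent matches the paper's merged range, while the antecedent is purely illustrative.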

Explanation: From R1, all tuples in R1's antecedent range have a(16..25); and from R3, all tuples in R3's antecedent range have a(21..30). Unioning the set selected by R1's antecedent with that selected by R3's antecedent produces the set selected by the merged antecedent d(s..t). This set inherits the consequents from the rules describing the smaller sets. Since no new tuples were added, all tuples selected by the merged antecedent have (a(16..25) OR a(21..30)), i.e. all have a(16..30). Consequent ranges must overlap if antecedents overlap [7]. Note that the consequent does not quite fill the cache range a(15..30). If more rules had been available for merging, the consequent range might have been widened, in which case the antecedent range would necessarily increase and would then be able to capture a greater number of future queries.

2.1 Partial Containment by Cache

Current research on remote cache management [e.g. 1, 5] is interested in whether some of the answers to a current query are contained in the local cache. The user may not require more than the sample of results contained in the cache. But if the full set is needed, a request for a smaller set is made to the remote server, thereby reducing network traffic and also providing immediate access to some of the data. The smaller set complements the cached set to complete the query result set. The server can support partial containment by the cache as well as full containment. Full containment is expressed by a rule A → B, where A is a new query and B the old, cached, query condition. Partial containment can be indicated by rules in a number of ways. For example, a rule B → A means cached set B is a subset of new query result A. Partial overlap in condition ranges also represents partial containment in the cache. E.g. new query: a( ), rule: a( ) → B, where B is the cached set.

2.2 Conjunctive Conditions

The cached query may be the result of a conjunction of selection conditions on the table, e.g. (A & C), where A and C are constraints on two different database attributes.
Two rules, B → A and B → C, mean B → (A & C). In practice, two rules with exactly the same antecedent, B, may not exist, so the intersection of antecedent ranges B′ and B″ from two rules B′ → A′ and B″ → C″ is used, where range(A) contains range(A′) and range(C) contains range(C″). The intersection of B′ and B″ provides the value of B in the rule B → (A & C). For example:

cached query: (c(15..30) AND a(12..20))
rules: b(3..8) → c( ), b(4..9) → a( )

The first rule states: if attribute b has a value in the range [3..8] then attribute c will have a value in [ ], which is therefore in the wider range [15..30] used in the cache descriptor. Similarly, b(4..9) implies a is in [ ] and therefore in [12..20]. The rules therefore become: b(3..8) → c(15..30) and b(4..9) → a(12..20).

(A rule means that all tuples matching the antecedent constraint will also obey the consequent constraint. It does not imply that values exist at all points on the number line interval denoted by the consequent range.) From these rules it follows that b(4..8) → (c(15..30) AND a(12..20)), because if both rule antecedents are true (i.e. b lies in the intersection of their ranges) then both consequents are true. This rule does not claim to provide a complete description of the cached set, just some information rapidly available from currently held rules. The antecedent range [4..8] could no doubt be increased if the data itself, in the base relation, were examined.

2.3 Query Simplification

The remote cache manager labels each query result set with the query expression which produced the set [e.g. 2, 3]. The data server can use AP rules for semantic query reformulation, e.g. to eliminate redundant terms. For example, a query expression A & B is reduced to B by rule B → A. (A & B ≡ B if B → A, because set(B) ∩ set(A) = set(B) when B → A.) Eliminating terms from cache descriptor expressions allows them to subsume more queries in syntactic analysis, so cache management benefits from knowledge-based simplification of query expressions, using AP rules.

3 The Condition Dependency Graph (CD Graph)

It is useful to regard the set of rules as edges in a Condition Dependency Graph. Each rule is an edge, and rule composition produces paths of two or more edges which denote transitive rules. The first and last nodes in any path are antecedent and consequent in a rule, and can be linked directly by a single arc. These transitive rules identified by paths are no different in character from rules produced directly from data, e.g. by induction [4] or systematic analysis [6]. They denote a relationship between data values in two attributes. So although a path is a chain of deduction, it reveals a rule which is then independent of the inference path.
Intermediate rules in the path could be deleted from the rule set without affecting the validity of the transitive rule.

Paths branch because a consequent range can imply many antecedent ranges in other rules, and because one antecedent condition can appear in several rules with consequents on different attributes. Branching is useful because transitive paths from a common antecedent can be intersected to produce a new, more specific rule.

Fig. 1. Part of a Condition Dependency Graph including query conditions A, F and E.
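Rule composition along paths can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than the paper's algorithm: the names `Cond`, `Rule`, `implies` and `transitive_rules` are hypothetical, ranges are closed integer intervals, and a path is cut off when it would return to its starting attribute. The first two example rules use ranges that appear in the paper's Section 3.3 example; the third rule (on attributes d and e) is invented for illustration.

```python
from typing import Iterable, NamedTuple, Set


class Cond(NamedTuple):
    """Range condition attr(lo..hi) on a closed interval."""
    attr: str
    lo: int
    hi: int


class Rule(NamedTuple):
    """One edge of the graph: ante -> cons."""
    ante: Cond
    cons: Cond


def implies(x: Cond, y: Cond) -> bool:
    """Subrange containment: x implies y when y's range encloses x's."""
    return x.attr == y.attr and y.lo <= x.lo and x.hi <= y.hi


def transitive_rules(rules: Iterable[Rule], max_edges: int) -> Set[Rule]:
    """Derive one rule per path of 2..max_edges edges.

    Each path is held only as (start condition, end condition), because
    the derived rule is independent of the route that produced it."""
    base = set(rules)
    frontier = set(base)               # paths of exactly one edge
    derived: Set[Rule] = set()
    for _ in range(max_edges - 1):
        longer = set()
        for path in frontier:
            for r in base:
                if r.cons.attr == path.ante.attr:
                    continue           # assumption: skip paths back to the start attribute
                if implies(path.cons, r.ante):
                    longer.add(Rule(path.ante, r.cons))
        derived |= longer
        frontier = longer
    return derived - base


rules = [
    Rule(Cond("a", 15, 23), Cond("b", 12, 29)),
    Rule(Cond("b", 10, 31), Cond("d", 15, 44)),
    Rule(Cond("d", 10, 50), Cond("e", 20, 27)),   # hypothetical rule
]
derived = transitive_rules(rules, max_edges=4)
for rule in sorted(derived):
    print(rule.ante, "->", rule.cons)
```

Here b(12..29) lies inside b(10..31), so the first two edges cascade to a(15..23) → d(15..44), and the invented third edge extends that path to attribute e.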

Fig. 1 shows a second practical application of Attribute Pair rules, namely Semantic Query Optimisation (SQO), which involves path discovery in the Condition Dependency Graph. A query is a collection of conditions which query-result tuples must satisfy. The collection of query conditions is structured as a sum-of-products expression (i.e. a disjunction of conjunctive sub-queries). SQO rewrites the sub-queries in order to produce a query that can be processed faster. In Fig. 1, three conditions A, F and E in a conjunctive query have been matched with conditions in rules. The path from A to E means that condition E can be deleted from the query, because A → E, so any tuples which satisfy condition A will also satisfy E, without being tested. The cycle containing condition F means F can be replaced in the query expression by K, if equivalent condition K is a faster test to apply to tuples. Equivalent conditions select the same set of tuples. If any of the conditions implied by A or E contradict condition F then the query will produce no results and can be answered immediately with the empty set without consulting any data (or, if the conjunct A & E & F is a subquery, it can be deleted from the larger query). For example, if F is the condition (25 ≤ d ≤ 34) and a rule (i.e. path) A → (45 ≤ d ≤ 63) exists, then no tuple can satisfy both conditions A and F as required by the query, because all tuples meeting condition A have attribute d values outside the range required by condition F.

3.1 Condition Matching

Matching conditions in the CD graph (e.g. query conditions to rule conditions, or consequent to antecedent when cascading rules) need not involve exact match if range conditions are involved. In fact, matching is an inference process using subrange containment as the rule of inference. A range implies its super-ranges: [n..m] → [n′..m′] whenever n′ ≤ n and m ≤ m′, e.g. b(15..30) → b(12..40), meaning if b is in the interval [15..30] it is also in [12..40].
Conditions denote sets of tuples, so set(b(15..30)) is part of the set of tuples with values in the range [12..40].

Query Condition/Rule Antecedent Matching

For comparable conditions, a query condition matches a rule antecedent if:
range(rule antecedent condition) contains range(query condition)
Reason: The query condition must imply the rule antecedent. Implication is by subrange containment: Set(query condition) ⊆ Set(rule antecedent condition).

Rule Consequent/Query Condition Matching

For comparable conditions, a rule consequent matches a query condition if:
range(query condition) contains range(rule consequent condition)
Reason: The consequent condition must imply that the query condition is true. (This ends a chain of inference from one query condition to another, through one or more rules.)

Semantic Query Optimisation rewrites a query into a form which a Data Server can answer more quickly. SQO is a pre-processing stage between user and server, which intercepts and rewrites queries. It must therefore be fast, to avoid delaying the query and so counteracting the benefits of the faster query it produces. Therefore it is important to pre-process the CD Graph before a query arrives, so that transitive rules are ready to apply without having to build paths at query time. It is therefore useful to know (as discussed in the next section) whether there is a limit to the amount of pre-processing to be done, since full transitive closure is a significant workload.

3.2 Maximum Pathlength in the CD Graph

The maximum pathlength to be examined in the CD graph when deriving transitive rules is N edges, where N is the number of columns in the database table. Paths in the graph can be longer than N, but rule derivation need only examine paths to depth N. A path is a sequence of pairs of attribute conditions, formed by cascading attribute-pair rules. The longest useful path is one in which every attribute appears exactly once.

When a path reaches a condition on the attribute from which it started, the node at the end of the path must denote a superset of the tuples denoted by the start-of-path condition. E.g. a(15..20) → a(12..93). Paths such as a(15..20) → a(23..32) or a(15..20) → a(18..19) cannot be derived, since these rules are self-contradictory. A range can only imply a super-range of itself.

Consider the path: A1 → B1 → C1 → A2 → C2 → E1, where A_i denotes condition i on attribute a, B_i condition i on attribute b, etc.

Case 1: Nested Paths with Shared Consequent

Three rules linking attributes a and c are shown:
A1 → C1
A2 → C2
A1 → C2
But the second subsumes the third, since A1 is a subrange of A2 (because A1 → A2). For example, if A1 = a( ) and A2 = a( ), the information in the rule a( ) → C2 is only part of the information in the rule a( ) → C2.

Case 2: Absence of Consequent Attribute from the First Cycle

When a path reaches an attribute for the second time it starts a second cycle. E.g. A1 and A2 in the path above start different cycles through the database table's attribute set.
Two rules link attributes a and e in the path above: A1 → E1 and A2 → E1. Since condition node A1 does not imply any other condition on attribute e in the path, the rule A1 → E1 is the best rule in which condition A1 implies a value range for attribute e. But it is a redundant rule, since it is subsumed by A2 → E1, because range(A2) contains range(A1), so set(A1) ⊆ set(A2).

Therefore, although paths longer than N can be generated, they are not useful. A second cycle can be treated as a SEPARATE path, providing information about a different subset of the database table. This is a superset of the tuples described by the first cycle, since the start node of the second path is a super-range of the first path's start node. (The start node is the antecedent of the rules in that section of the path, and selects the set of tuples described by the rules.)

3.3 Merging Antecedent Ranges

The CD graph can therefore be seen as a collection of separate, but maybe concatenated, paths. So the maximum number of incoming edges to any node is N-1, representing each of the other columns of the database table. More than N-1 would denote redundant rules, since two rules would then have the same consequent and the same antecedent column, but different antecedent ranges. If those ranges are nested then the subrange rules are discarded. But if ranges overlap, they are merged during graph processing to produce a wider (more useful) range antecedent, as follows.

Consider two incoming edges, a(15..20) → b(12..23) and a(17..23) → b(16..29), to the antecedent node of rule b(10..31) → d(15..44). The two original rules are separate subset descriptors, with overlapping antecedent ranges. They can be combined, by unioning, into a single rule a(15..23) → b(12..29), whose consequent assertion satisfies the antecedent constraint of b(10..31) → d(15..44) and therefore produces a new, transitive, rule: a(15..23) → d(15..44). A new antecedent node is thus added to the CD graph, for the condition a(15..23), as the start of transitive rules.

3.4 Maximum Out-degree of Graph Vertices

When producing new edges to represent transitive paths, the number of outgoing edges from any node must be limited to N-1 (for an N-ary database relation) so that a node is only connected to the narrowest available range for each consequent attribute. Sequential and parallel paths can provide consequents.

Nested consequent ranges are produced by successive cycles in sequential paths, producing rules such as: a(15..23) → e(20..27), a(15..23) → e(10..49), a(15..23) → e(3..182). The first rule subsumes the others. The redundant rules are not produced if transitive pathlength is limited as discussed in section 3.2.

Parallel paths produce overlapping consequent ranges, such as: a(15..23) → e(20..27), a(15..23) → e(24..31). Rule intersection provides an improved transitive rule, a(15..23) → e(24..27). Both original rules are true, but less informative than their intersection rule. Each consequent assertion is improved (made more specific) by information in another consequent, since only values common to all the overlapping consequent ranges can actually exist in the subset data they describe.
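The range arithmetic in Sections 3.3 and 3.4 can be checked with a short Python sketch. It is an illustration only: the helper names `contains`, `union_overlapping` and `intersect` are hypothetical, ranges are modelled as closed integer pairs, and returning `None` for an empty intersection is an assumption of this sketch (the paper does not discuss that case here). All numeric ranges are taken from the examples above.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]          # closed interval (lo, hi)


def contains(outer: Range, inner: Range) -> bool:
    """True when inner is a subrange of outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def union_overlapping(x: Range, y: Range) -> Range:
    """Union of two overlapping ranges; disjoint ranges are rejected
    because their union is not a single range."""
    if x[0] > y[1] or y[0] > x[1]:
        raise ValueError("ranges do not overlap")
    return (min(x[0], y[0]), max(x[1], y[1]))


def intersect(ranges: List[Range]) -> Optional[Range]:
    """Narrowest range implied by several consequents for one antecedent.
    None means the ranges share no value, i.e. the common antecedent
    cannot select any tuples if all the rules hold."""
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    return (lo, hi) if lo <= hi else None


# Section 3.3: union two incoming edges, then cascade into b(10..31) -> d(15..44).
ante = union_overlapping((15, 20), (17, 23))    # a(15..23)
cons = union_overlapping((12, 23), (16, 29))    # b(12..29)
cascades = contains((10, 31), cons)             # b(12..29) lies inside b(10..31)

# Section 3.4: parallel paths intersect; nested ranges reduce to the narrowest.
parallel = intersect([(20, 27), (24, 31)])
nested = intersect([(20, 27), (10, 49), (3, 182)])

print(ante, cons, cascades)   # (15, 23) (12, 29) True
print(parallel, nested)       # (24, 27) (20, 27)
```

Intersection handles both cases of Section 3.4 with one operation: for nested consequents it simply returns the narrowest range, and for overlapping consequents it returns their common part.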
4 Rule or Condition Match Algorithms

The condition match problem is one that must be solved in any rule system which has a large number of rules. In a system that automatically increases the size of the rule set by discovery, there is a tendency for the rule system's performance to deteriorate as the set grows. The match phase in production systems (such as Expert Database Systems or C-A Rule support in Active Databases) is a very time-consuming component of rule use, in the continuous match-select-act cycle. It requires the use of discrimination networks such as RETE, TREAT or GATOR, which store partial match results in order to improve performance.

However, attribute-pair range-condition rules are very different. Their structure and semantics allow rapid matching, by simple lookup algorithms. The algorithms, and the storage structure, are derived from the Hasse Diagram shown in Fig. 2, which denotes the ordered set of possible range conditions on an integer attribute whose extreme values are 10 and 20. The diagram nevertheless reveals the structure of any set of range conditions on any orderable attribute type. It is the position of certain zones on the diagram (those shown in Fig. 3) which suggests a suitable data structure and search algorithm for attribute-pair rules.
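The ordered set behind the Hasse diagram can be enumerated in a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and because the figures are not reproduced here, treating the super-ranges and subranges of a node as two of the "significant areas" is an interpretation rather than something the text states. The extreme values 10 and 20 and the node 12-17 are taken from the text and the Fig. 3 caption.

```python
from typing import List, Tuple

Range = Tuple[int, int]          # closed interval (lower, upper)


def hasse_nodes(lo: int, hi: int) -> List[Range]:
    """Every range condition n..m with lo <= n <= m <= hi."""
    return [(n, m) for n in range(lo, hi + 1) for m in range(n, hi + 1)]


def immediate_super_ranges(node: Range, lo: int, hi: int) -> List[Range]:
    """One step up the containment order: widen by one value at either end."""
    n, m = node
    ups = []
    if n > lo:
        ups.append((n - 1, m))
    if m < hi:
        ups.append((n, m + 1))
    return ups


def super_ranges(node: Range, nodes: List[Range]) -> List[Range]:
    """All other ranges that contain node (conditions implied by node)."""
    n, m = node
    return [(p, q) for (p, q) in nodes if p <= n and m <= q and (p, q) != node]


def sub_ranges(node: Range, nodes: List[Range]) -> List[Range]:
    """All other ranges contained in node (conditions that imply node)."""
    n, m = node
    return [(p, q) for (p, q) in nodes if n <= p and q <= m and (p, q) != node]


nodes = hasse_nodes(10, 20)
node = (12, 17)
print(len(nodes))                                   # 66
print(immediate_super_ranges(node, 10, 20))         # [(11, 17), (12, 18)]
print(len(super_ranges(node, nodes)), len(sub_ranges(node, nodes)))   # 11 20
```

Every super-range of 12-17 has a lower limit of at most 12 and an upper limit of at least 17, so a scan over ranges ordered by their limits can stop as soon as either bound is passed.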

Fig. 2. Hasse Diagram for Range Conditions

Fig. 3. Significant areas for range node 12-17

Fig. 4a. A Data Structure

Fig. 4b. Possible Version in Practice

Fig. 4a shows a tree data structure derived from the Hasse diagram: a sequence of lists of nodes (range conditions). All nodes in a diagonal list have the same lower limit for their ranges (as shown in Fig. 2). This common value for a list is the listvalue. The set of lists is sorted by listvalue, so the sequence of diagonals in Fig. 4a is in ascending order of lower range limits. The nodes within each list are sorted in descending order of upper range limit. This structure corresponds to the diagonal rows in the Hasse diagram, and so allows early-terminating search algorithms to be easily specified for range match in any of the 'significant areas' in Fig. 3. In practice, nodes may be missing from the structure shown in Fig. 2, as indicated in Fig. 4b, but since the order of nodes is maintained the same search algorithm still applies, and works equally well with non-integer numeric types.

To support antecedent lookup in the rule set, as required for SQO, a Hasse Diagram whose nodes correspond to antecedent ranges is used. The structure represents the AP rule set associated with one pair of attributes from a specific table. The rule set can be stored as a sorted set of lists of rules, as shown in Fig. 4b. Each rule has the form a(m..n) → b(p..q). All rules in a list have the same m value, the listvalue for that list, which is used to sort the lists into ascending order. Within each list each node contains three values, <n, p, q>, where n is called the nodevalue and is used to sort nodes in the list into descending order. The attribute names, a and b, are implicit since the data structure contains only rules for one specific attribute pair.

5 Conclusions

This paper introduced some of the properties of Attribute-Pair range-condition rules, which can be derived automatically from data and constitute a description of certain features of the data.
Two practical uses for these rules were identified, namely Remote Cache Management and Semantic Query Optimisation. These simple rules avoid many of the practical difficulties associated with more elaborate rule bases, which render them impractical for strictly time-constrained applications. More elaborate rule structures are more expressive, but lack benefits such as (i) fast access through a simple regular data structure, (ii) graph representation providing a map to guide rule application, and allowing full pre-computation of transitive inferences, (iii) modular rulebase structure, from rule classification by attribute pair and database relation, allowing efficient rule set management (such as access only to currently relevant subsets), and (iv) easily derived rules which are therefore easily discarded, rather than accumulating continuously.

The CD graph represents dependencies between subsets of data in a database table. Each path represents a sequence of monotonically increasing nested sets of tuples. SQO uses forward paths. Cache management requires backward paths from a specified node. Graph pre-processing builds transitive paths before they are urgently needed. Parallel paths represent sets of transitive rules which can be merged to a single rule by intersection, or to another rule by unioning corresponding ranges in rules. Unioning is useful in cache expression implication, and in broadening the antecedent range to enclose a given assertion range. Intersection narrows ranges to produce a more specific subset descriptor.

Previous work in SQO has used rule sets which are an arbitrary collection of rule structures, not amenable to the graph representation for inference closure, and with inherently lower utility per rule in the applications we consider.

References

1. Adali, S., Candan, K. S., Papakonstantinou, Y., Subrahmanian, V. S.: Query Caching and Optimization in Distributed Mediator Systems. ACM SIGMOD Conf. (1996)
2. Dar, S., Franklin, M. J., Jonsson, B. T., Srivastava, D., Tan, M.: Semantic Data Caching and Replacement. Proc. 22nd VLDB Conference (1996)
3. Keller, A. M., Basu, J.: A Predicate-based Caching Scheme for Client-Server Database Architectures. VLDB Journal 5(1) (1996)
4. Lowden, B. G. T., Robinson, J., Lim, K. Y.: A Semantic Query Optimiser using Automatic Rule Derivation. WITS 95, 5th Intl. Workshop on Information Technologies and Systems (1995)
5. Qian, X.: Query Folding. 12th IEEE Intl. Conf. on Data Engineering (1996)
6. Robinson, J., Lowden, B. G. T.: Data Analysis for Query Processing. 2nd Intl. Symposium on Intelligent Data Analysis (1997) (LNCS 1280)
7. Robinson, J., Lowden, B. G. T.: Semantic Query Optimisation and Rule Graphs. KRDB'98, 5th International Workshop on Knowledge Representation meets Data Bases (1998)
8. Shekhar, S., et al.: A Formal Trade-off between Optimization and Execution Costs in Semantic Query Optimization. Proc. 14th VLDB Conference (1988)
9. Shenoy, S. T., Ozsoyoglu, Z. M.: Design and Implementation of a Semantic Query Optimizer. IEEE Trans. Knowledge and Data Engineering 1(3) (1989)
10. Siegel, M., et al.: A Method for Automatic Rule Derivation to Support Semantic Query Optimization. ACM Trans. Database Systems 17(4) (1992)
11. Yu, C., Sun, W.: Automatic Knowledge Acquisition and Maintenance for Semantic Query Optimization. IEEE Trans. Knowledge and Data Engineering 1(3) (1989)
