Distributing the Derivation and Maintenance of Subset Descriptor Rules
Jerome Robinson, Barry G. T. Lowden, Mohammed Al Haddad
Department of Computer Science, University of Essex, Colchester, Essex, CO4 3SQ, U.K.

Abstract

This draft paper describes a solution to the rule maintenance problem for data descriptor rules derived from data that may subsequently change. The method utilises any available computers in the local area network to derive and maintain rule sets.

1. Introduction

Database query processing involves the selection and manipulation of data subsets specified by the query or by the query processor. Descriptors for data subsets are useful in optimising the query processing task. For example, histograms are simple subset descriptors: each bar in the histogram describes a data subset by specifying the number of data items in that subset. This descriptor information is used in conventional query optimisation to schedule the order of operations on intermediate data sets in the query execution plan. Attribute Pair rules [14] are subset descriptors which state dependencies between data values within subsets. Rules of this kind are the basis of semantic query optimisation [13, 17, 18] and can also be used to support data caching in remote clients of a database management system [20].

The problem for applications using subset descriptors is that any change to the data may require a corresponding change to the description of one or more of the data subsets. This implies the need for fast derivation and maintenance of subset descriptors, in a way that does not add workload to the database server. We investigate the use of multiple workstations, in the same local area network as the data server, to handle the work of descriptor derivation and maintenance. Tasks are distributed to these computers by a particular workstation (the master in a master-slave configuration), and run on each slave workstation as background programs.
2. Background Information

A subset descriptor is a (selector, descriptor) pair. The selector is a data-value constraint (a selection condition) which identifies a subset of data items in the database; the descriptor provides information about the data items in that subset. The selector uses a collection of values and ranges for specified attributes as a selection constraint to include tuples in the subset. It may be the Boolean expression in the WHERE clause of an SQL query, for example.

Attribute Pair (AP) rules [14] have been identified as a particularly useful form of subset descriptor for semantic query optimisation (SQO) and remote cache management. An AP rule has the form A → B, where A and B both have the structure of query selection conditions, but the consequent B also describes the tuples selected by condition A. For example, c(10..20) → d(27..36) means: in the set of tuples which satisfy the condition (10 ≤ c < 20) for attribute c, all tuples have attribute d values in the range 27..36.

This apparently simple rule structure hides detail, since an AP rule set refers to a particular database table, which may be a virtual relation, containing the pair of attributes. For example, the rule:

    ship_class(class, type, draught, _, _) ∧ ship(_, class, type, status, _, _) ∧ (draught < 50) → (status = 'Active')
is the AP rule (draught < 50) → (status = 'Active') on a virtual relation which is the natural join of the base relations ship_class and ship. Since a whole set of such AP rules is associated with the table, it is inappropriate to repeat the join information in every rule.

Sets of AP rule subset descriptors are derived from the data in preparation for query processing. Each AP rule is an ordered pair of conditions, which allows rules to be used as directed edges in a graph [13] whose paths provide transitive inferences: further descriptions of the subset selected by the condition at the start of the sub-path. AP rules can be extended to multi-consequent rules [13] without losing this graph-edge semantics. E.g. the descriptor:

    c(10..20) → d(27..36), (f ∈ {'OMG', 'TPC'}), h(…)

represents three AP rules, all with the same antecedent condition, c(10..20). A single antecedent look-up in the rule set thus provides multiple assertions about the selected subset. Reducing look-up time for descriptors is important in query optimisation, since the goal is reduced query execution time. The consequent is a vector of assertions, so that descriptors for different subsets are easily compared or combined by specific vector element. For example, a database query selects data items with a(75..90) AND c(10..15), and two relevant descriptors exist which describe subsets containing the required data items:

    a(70..95) → d(18..43), (f ∈ {'ODMG', 'OMG'}), h(13..71)
    c(10..20) → d(27..36), (f ∈ {'OMG', 'TPC'}), h(…)

The query conditions are sub-ranges of the antecedent conditions, so each antecedent selects a superset of the set selected by the corresponding query condition. Furthermore, the conjunction in the query specifies the intersection of the two sets described by the two rules. Pairwise comparison of vector elements in the two rule consequents shows incompatible values for attribute h. This indicates that the two query conditions select disjoint subsets of data items.
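The element-wise consequent comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation, and the h-range of the second descriptor is an assumed value, since the source elides it:

```python
def ranges_disjoint(r1, r2):
    """True if two closed ranges (lo, hi) cannot share any value."""
    return r1[1] < r2[0] or r2[1] < r1[0]

def assertions_incompatible(a, b):
    # Numeric range vs numeric range: incompatible when disjoint
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ranges_disjoint(a, b)
    # Set-valued assertions: incompatible when they share no member
    if isinstance(a, set) and isinstance(b, set):
        return not (a & b)
    return False

def query_result_empty(cons1, cons2):
    """Compare consequent vectors element-wise; any incompatible pair
    proves the two query conditions select disjoint tuple sets."""
    shared = set(cons1) & set(cons2)
    return any(assertions_incompatible(cons1[k], cons2[k]) for k in shared)

# The two descriptors from the text (h-range of the second rule assumed):
rule_a = {"d": (18, 43), "f": {"ODMG", "OMG"}, "h": (13, 71)}
rule_c = {"d": (27, 36), "f": {"OMG", "TPC"}, "h": (80, 95)}  # assumed h-range

print(query_result_empty(rule_a, rule_c))  # True: the h-ranges are disjoint
```

Only one incompatible vector element is needed: the overlapping d-ranges and the shared f-member do not help, but the disjoint h-ranges settle the question.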
No tuples can satisfy both query conditions, so the result set will be empty, and the empty answer can be returned immediately, without consulting the database.

In 'Associative Caching' [4, 5, 8] each client computer keeps a copy of each of its query result sets in its own local database. The purpose is to reduce the size and frequency of queries to the remote data server accessed by wide-area network. This can reduce query cost factors based on access restrictions imposed by the server, such as authorisation delays, payment charges for data, and server breakdown or workload delay, as well as internet delays. For each new query the client tries to find some or all of the required data in its local collection of result sets. Usually this is done by syntactic comparison of the new query with each previous query [e.g. 2, 11] to detect overlapping data sets. Attribute-Pair Range Rules, which the server derives from its data for its own use in query optimisation, can be further utilised to provide descriptors for each query result set. This new information adds to the limited description currently available to a client in the form of the previous query expression. It enables clients to recognise data overlap for new queries which refer to attributes not mentioned in the previous query, so that local data can now be exploited for syntactically unrelated queries [15].

Subset descriptors are a form of knowledge about the data, derived directly from the data itself. But unlike many forms of KDD it must be exact [10] rather than probability-based. This means it cannot use only samples of the data: it must process all tuples in the subset it describes. The use of subset descriptors can therefore introduce a significant processing workload. But the data server should not be required to do extra work of this kind, since it may delay current queries: creating metadata to make future queries faster would make current queries slower.
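Returning to the caching use: the client-side test for answering a new query from cached result sets can be sketched as below. This shows only the simple full-containment case (every query range lying inside a cached descriptor's antecedent), and all names and data are illustrative:

```python
def subrange(query_cond, cached_cond):
    """(lo, hi) query range lies entirely inside the cached range."""
    return cached_cond[0] <= query_cond[0] and query_cond[1] <= cached_cond[1]

def answerable_from_cache(query, cached_descriptors):
    """Return the cached result sets whose descriptor antecedents cover
    every range condition of the new query."""
    hits = []
    for name, antecedents in cached_descriptors.items():
        if all(attr in antecedents and subrange(rng, antecedents[attr])
               for attr, rng in query.items()):
            hits.append(name)
    return hits

# Two cached result sets, described by ranges on their antecedent attributes:
cache = {
    "result_1": {"a": (70, 95), "c": (10, 20)},
    "result_2": {"a": (0, 50)},
}
print(answerable_from_cache({"a": (75, 90), "c": (10, 15)}, cache))  # ['result_1']
```

A production client would also handle partial overlap, fetching only the missing remainder from the server; this sketch covers the fully answerable case only.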
Furthermore, data may change (in environments other than the static data warehouse or data archive), and this requires a corresponding change to descriptors. So it would be useful to find an existing hardware resource that can be used to do this work instead of the data server.

3. Creating a Parallel Virtual Machine from Networked Computers

PVM [6] is a well-established software system which enables a group of workstations linked by local-area network to work together as a Virtual Machine. Modern workstations have more computing power than they use, and successive generations of workstations increase on the computing power and capacity of previous generations. The amount of spare computing capability in a network of desktop machines is therefore steadily increasing. This is a resource that can be used to analyse and summarise data sets.

PVM allows networked workstations to be used by spawning new background programs on the machines. The programs can accept messages from a main process/program on a particular computer telling them what to do, and can return the specified results. Data sets can be transferred directly between machines, or via the Network File System. This allows the data server to be treated as one component in a multi-workstation machine. The processing and memory resources of the machine can expand and contract dynamically by varying the number of computers being used. The task of deriving subset descriptors from data can be distributed to multiple workstations in the local area network, as follows.

    Main Process:
      Identify table and relevant attributes
      Retrieve the database table data
      Establish the PVM machine
      Send the same amount of data to each slave
      Wait for the rules
      Receive and merge the rule subsets

    Sub-processes on different computers:
      Receive the workload from the main process
      Sort the records according to a specific attribute
      Derive the rules
      Send the derived rules to the main process

Fig. 1 Using multiple workstations to create a set of subset descriptor rules
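PVM itself is a C/Fortran message-passing library, so the sketch below (not the authors' code) stands in a Python thread pool for the spawned slave programs. It mimics the flow of Fig. 1: the master scatters equal shares of the table, each slave sorts its share on the antecedent attribute and derives one rule per sub-range, and the master merges the returned rule subsets. The derivation and merge steps are detailed in section 4; all names and sample data are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def slave(rows, lo, hi, n_rules):
    """Slave-side work: sort on the antecedent (first field), then scan
    once, widening the consequent range of each sub-range's rule."""
    width = (hi - lo) / n_rules
    rules = {}
    for a, c in sorted(rows):
        i = min(int((a - lo) / width), n_rules - 1)    # sub-range index
        key = (lo + i * width, lo + (i + 1) * width)
        mn, mx = rules.get(key, (c, c))
        rules[key] = (min(mn, c), max(mx, c))          # widen only if needed
    return rules

def master(table, n_slaves, n_rules):
    lo = min(a for a, _ in table)
    hi = max(a for a, _ in table) + 1                  # MIN..MAX, broadcast to all
    shares = [table[i::n_slaves] for i in range(n_slaves)]  # equal shares
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partial = list(pool.map(lambda s: slave(s, lo, hi, n_rules), shares))
    merged = {}
    for rules in partial:                              # merge corresponding rules
        for key, (mn, mx) in rules.items():
            cur = merged.get(key, (mn, mx))
            merged[key] = (min(cur[0], mn), max(cur[1], mx))
    return merged

table = [(12, 26), (30, 55), (14, 31), (44, 61), (18, 29), (33, 58)]
print(master(table, n_slaves=2, n_rules=2))
# {(12.0, 28.5): (26, 31), (28.5, 45.0): (55, 61)}
```

The merged rule set is identical to what a single machine would derive from the whole table, which is what makes the partitioning transparent to rule consumers.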
4. The Rule Derivation Algorithm

1. The Main Process chooses an attribute to be the antecedent for the current set of rules, and identifies the MIN and MAX values if it is numeric. It broadcasts those values to all computers, and the message also specifies the number of rules required in the set.

2. Each computer then receives from the Main Process a subset of the database table to be described, and sorts it on the attribute specified as antecedent.

3. After sorting, each computer divides the MIN..MAX range into the specified number of sub-ranges. This is the number of rules required, since each sub-range produces a rule.

4. Each computer divides its sorted table (part of the original database table) into disjoint subsets, using the sub-ranges to select tuples. The ordered sequence of tuples is scanned, building each rule incrementally. For example, if the next sub-range is 10 ≤ a < 25 for the antecedent attribute named a, then all tuples in the relevant sub-sequence of tuples will contribute to the rule. If the first tuple in the sub-set has 26 as the value of attribute c, then the rule so far is (10 ≤ a < 25) → (c = 26). Descriptors for other consequent attributes are added in the same way. The next tuple in the ordered sequence has c = 31, so the rule describing all tuples encountered so far becomes (10 ≤ a < 25) → (26 ≤ c ≤ 31). If the next tuple has c = 29 then the rule remains unchanged, because it correctly describes the set of three tuples which includes this new tuple. Thus each new tuple encountered during the scan through the ordered table will either extend the consequent range or leave it unchanged, so that when no more tuples satisfy the selection condition (10 ≤ a < 25) the rule describes all tuples in that sub-set. The next tuple in the sorted data sequence starts a new descriptor for the next sub-set, with antecedent (25 ≤ a < 40), for example. When the end of the sorted table is reached, the computer has produced the specified number of sub-range descriptor rules.

5. Each computer returns its set of rules to the Main Process, which merges corresponding rules from all the separate computers to create a single rule set with the specified number of rules. This rule set describes the whole database table. Corresponding rules are rules with the same antecedent condition, produced on separate computers. Rule merging is just another stage of incremental rule generation. For example, the rules (40 ≤ a < 55) → (61 ≤ c ≤ 83) and (40 ≤ a < 55) → (68 ≤ c ≤ 74) are provided by two computers. The combined rule is (40 ≤ a < 55) → (61 ≤ c ≤ 83), since this describes both sets. If a further computer provides the rule (40 ≤ a < 55) → (75 ≤ c ≤ 85), the descriptor for the union of the three tuple sub-sets is (40 ≤ a < 55) → (61 ≤ c ≤ 85). Another computer returns (40 ≤ a < 55) → no tuples, so the rule remains (40 ≤ a < 55) → (61 ≤ c ≤ 85).

5. Performance of the Multi-computer Rule Derivation Algorithm

The elapsed time for multi-computer rule derivation has been measured in experiments. The following graph shows a typical example of the experimental results obtained. It shows the measured times to derive rules from a table of 112-byte rows, distributed to varying numbers of networked workstations. The attribute used as antecedent for the derived rules was of character string type, which is much slower to sort than a numeric attribute. Although measured times for numeric antecedent attributes are much shorter, the shape of the graph is very similar, indicating a rapid reduction in time as the number of computers used increases. This is the time needed to derive a set of rules from a database table. The rule set is like a histogram, with an Attribute Pair rule or multi-consequent rule describing the subset represented by each bar of the histogram.
Fig. 2 Observed performance of multi-computer rule derivation: measured time (in seconds) to derive rules from tuples whose antecedent attribute is of string type, plotted against the number of computers used in the local network, with the expected time 625/H shown for comparison

Total time is significantly reduced by working with multiple computers. But the time reduction is also remarkable in being better than one might predict. Dividing work between three workers can at best divide the total time by three; indeed the additional work to distribute data and synchronise the workers may prevent even the theoretical speedup of T/H, where T is the time for a single worker and H is the number of workers. The graph plots values of T/H for comparison with the measured times; T was 625 seconds. For two or more computers, the elapsed time was found to be shorter than T/H. This 14 MByte example is typical of results from experiments on data sets of various sizes and data types: better than T/H performance was observed for all of them.

Several factors contribute to this speedup. One is the N·logN complexity of the Quicksort algorithm, which consumes most of the elapsed time in the rule derivation process. If the elapsed time T to sort a set of N data items is proportional to N·logN, then T = (1/k)·N·logN, where 1/k is the constant of proportionality. Fitting this to the measured single-machine time of T = 625 seconds gives k ≈ 1066, so values of T can be predicted as (1/1066)·N·logN, where N is the number of tuples per machine, i.e. the table size divided by H, the number of workstations. But the observed times are still significantly faster than these predicted times. The following graph indicates the connection between T/H and N·logN as the size of the data sub-set in each machine decreases, as the data set is partitioned between more computers.
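The fitted model can be reproduced numerically. The sketch below assumes a total row count, since the exact value is elided in the source, and calibrates k so that the model matches the single-machine measurement; it then shows that (1/k)·N·logN shrinks faster than the linear T/H line, which is the superlinear effect described above.

```python
import math

T = 625.0           # measured single-machine elapsed time, seconds (from the text)
N_total = 100_000   # ASSUMED total row count; the exact value is elided in the source

# Calibrate T = (1/k) * N * log N against the single-machine measurement
k = N_total * math.log2(N_total) / T

def predicted_sort_time(H):
    """Model time for the N_total/H tuples sorted on each of H machines."""
    n = N_total / H
    return n * math.log2(n) / k

for H in (1, 2, 4, 8):
    # N log N shrinks faster than linearly, so the model beats T/H for H > 1
    print(H, round(T / H, 1), round(predicted_sort_time(H), 1))
```

Whatever row count is assumed, the ratio predicted/(T/H) equals log(N/H)/log(N), which is below 1 for H > 1: the model always lies under the T/H line.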
Fig. 3 Comparison of N·logN values with T/H, as the number of computers H increases (N is the number of data items per computer, i.e. the table size divided by H)

A second factor, which contributes to the large speedup when distributing the sort algorithm, is the amount of paging required as the size of the data set to be sorted increases. The proportion of pages which cause page faults, requiring swapping from disk, increases with the amount by which the data set exceeds the available main memory space. Each disk access is a severe time penalty, so the smaller data sets produced by division among more machines reduce the number of these delays.

A third factor, related to available main memory space and paging, is the data transfer time when sending large data subsets to the computers that sort them. Message passing is used between computers. The receive buffer in PVM message passing is limited by the amount of main memory available to use dynamically as buffer space. Blocking send is used to transfer data reliably, so delays can occur when the amount of data exceeds the amount of physical memory space: paging to virtual memory must occur before physical memory frames become available as buffer space to accept more data. This delay does not occur when the number of computers used is great enough (depending on the size of the whole data set).

6. Rule Maintenance

If the data changes, rules describing the data may need to change. Insert, Delete and Update are the database operations that can change the data.

A tuple INSERT has the same effect on descriptors as a new tuple encountered during the table scan described in section 4. The numeric or string value of the antecedent attribute in the new tuple maps to the relevant rule. Assertions in that rule describing other attributes may need to be extended by values in the new tuple. If several sets of rules exist, each with a different antecedent attribute, then the new tuple maps to one rule in each set.
Deleting a tuple does not require any change to range assertion rules, since deletion does not falsify them: any remaining data values are still within the consequent-specified range limits. However, choosing to create new descriptors for any rule whose antecedent includes attribute values in the deleted tuple may provide narrower ranges as consequent assertions. This is beneficial because narrower consequent ranges can match more query conditions.
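The effect of Insert and Delete on a single range rule can be sketched as follows (a minimal illustration, not the authors' code; an Update is then handled as the delete step followed by the insert step, as described next). Values are illustrative:

```python
def apply_insert(rule, c_value):
    """INSERT: widen the consequent (lo, hi) just enough to cover the
    new tuple's value; if the value already fits, nothing changes."""
    lo, hi = rule
    return (min(lo, c_value), max(hi, c_value))

def rederive_after_delete(remaining_c_values):
    """DELETE never falsifies the rule, but optional re-derivation from
    the remaining tuples may yield a narrower (more useful) range."""
    return (min(remaining_c_values), max(remaining_c_values))

rule = (29, 71)                # current merged rule a(15..20) -> c(29..71)
rule = apply_insert(rule, 91)  # a slave reports a new tuple with c = 91
print(rule)                    # (29, 91)

print(rederive_after_delete([34, 40, 52]))  # possibly narrower: (34, 52)
```

Note the asymmetry: an insert must be applied immediately (the old range may now be false), while re-derivation after a delete is optional and can be deferred.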
Updating a single tuple changes the value of one or more fields in an existing tuple, and is equivalent to reading and Deleting the tuple before Inserting the new version. Rule maintenance actions are therefore the same as for a Delete followed by an Insert. However, if an Update changes a field in all tuples in the table, the server will disable all assertions about that column of the table until they can all be revised. This makes one of the elements in some rule consequent vectors temporarily unavailable.

When a new tuple is Inserted into the database table, it is also sent to one of the computers, to add to its data subset. As a result of this new data the computer may notify the Main Process that one of its subset descriptor rules has changed. For example, a(15..20) → c(63..91) is a revised rule produced by the computer. To merge this with the existing rule set, the current rule a(15..20) → c(29..71), which was previously produced by merging results from all computers, becomes a(15..20) → c(29..91).

If a tuple is Updated in the central database table, the old version of the tuple is broadcast to all computers, so that the machine holding a matching tuple can delete it before Inserting the new version. After the delete, n of the rules can be revised in the affected computer, where n attributes were updated. It then notifies the Main Process that an improved version of that particular subset descriptor is available, and the Main Process examines the corresponding rule from all other computers in order to create a new merged descriptor for that subset. The master computer retains all the rule sub-sets created in all the slaves, for use in this incremental rule maintenance process.

7. Conclusions

Converting a database table to a set of subset descriptor rules is a data reduction process, because the rule set is much smaller than its data set: the descriptors provide a summary of the data.
Partitioning a data set and then merging the rule sets derived from the partitions is found to be an effective way to speed up the creation of rule sets. A sorting algorithm was used to get the data subsets into a structure (a set of sorted sequences) which can be used as a look-up table, both to derive rules rapidly and to update those rules when the data changes. Merging rule sets from a collection of workstations is very fast, much faster than merging the sorted data subsets themselves.

The configuration of a 'master' workstation with a set of 'slave' workstations in a local area network provides an effective way to solve the problem of maintaining derived descriptor rules as the data changes. The master workstation is also the user interface to the database, accepting queries and data updates from networked users. It sends all data changes to the slave workstations as well as to the data server, and the slaves respond with any changes caused to their rule subsets. The workload of rule derivation and maintenance does not affect the data server, because it is done on different computers.

Workstations in a local network are commonly underutilised. Their computing capacity is rarely used to its full extent, because modern desktop computers are powerful machines but typical application programs have a use profile that leaves the machines virtually idle for most of the time, with occasional bursts of activity. We utilise such networked workstations as a distributed computing resource, to derive and maintain data descriptor rules by means of background programs on the workstations.

The master workstation uses the derived rules for semantic query optimisation [13, 16, 18] and for remote client cache management [20], but it can also answer queries from the sorted data in the slave workstations as well as from the data server. This method of query optimisation, generating query execution plans that use the workstation data sets as well as the database data server, is the subject of our current research.
References

1. S. Abiteboul, R. Hull and V. Vianu: Foundations of Databases. Addison-Wesley, 1995.
2. S. Adali, K. S. Candan, Y. Papakonstantinou, V. S. Subrahmanian: Query Caching and Optimization in Distributed Mediator Systems. Proc. ACM SIGMOD Conf. (1996).
3. J. Basu, M. Poess, A. M. Keller: Performance Analysis of an Associative Caching Scheme for Client-Server Databases. Technical Note STAN-CS-TN-97-61, Stanford University, Computer Science Dept., September 1997.
4. J. Basu, M. Poess, A. M. Keller: High Performance and Scalability Through Associative Client-Side Caching. 7th International Workshop on High Performance Transaction Systems, Pacific Grove, CA, September 1997.
5. S. Dar, M. J. Franklin, B. T. Jonsson, D. Srivastava, M. Tan: Semantic Data Caching and Replacement. Proc. 22nd VLDB Conference (1996).
6. A. Geist et al.: PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
7. P. Godfrey, J. Gryz: Semantic Query Caching for Heterogeneous Databases. KRDB'97, 4th International Workshop on Knowledge Representation meets Data Bases (1997).
8. A. M. Keller, J. Basu: A Predicate-based Caching Scheme for Client-Server Database Architectures. VLDB Journal 5(1), 1996.
10. G. Piatetsky-Shapiro: Discovery, Analysis and Presentation of Strong Rules. In Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. J. Frawley, MIT Press (1991).
11. X. Qian: Query Folding. 12th IEEE Intl. Conference on Data Engineering (1996).
12. J. Robinson, B. G. T. Lowden: Data Analysis for Query Processing. 2nd Intl. Symposium on Intelligent Data Analysis (1997) (LNCS 1280).
13. J. Robinson, B. G. T. Lowden: Semantic Query Optimisation and Rule Graphs. KRDB'98, 5th International Workshop on Knowledge Representation meets Data Bases (1998).
14. J. Robinson, B. G. T. Lowden: Attribute-Pair Range Rules. Proc. DEXA'98, 9th Intl. Conference on Database and Expert Systems Applications (1998) (LNCS 1460).
16. S. Shekhar, B. Hamidzadeh, A. Kohli, M. Coyle: Learning transformation rules for semantic query optimization: A data-driven approach. IEEE Transactions on Knowledge and Data Engineering 5(6), 1993.
17. S. T. Shenoy, Z. M. Ozsoyoglu: A System for Semantic Query Optimization. Proc. ACM SIGMOD Conference, 1987.
18. M. Siegel, E. Sciore, S. Salveter: A Method for Automatic Rule Derivation to Support Semantic Query Optimization. ACM TODS 17(4), 1992.
19. D. Srivastava, S. Dar, H. V. Jagadish, A. Y. Levy: Answering Queries with Aggregation Using Views. Proc. 22nd VLDB Conference (1996).
20. J. Robinson, B. G. T. Lowden: Extending the Re-use of Query Results at Remote Client Sites. Proc. DEXA'00, 11th Intl. Conf. on Database and Expert Systems Applications, Springer, 2000 (LNCS 1873).
More informationA Fast Method for Ensuring the Consistency of Integrity Constraints
A Fast Method for Ensuring the Consistency of Integrity Constraints Barry G. T. Lowden and Jerome Robinson Department of Computer Science, The University of Essex, Wivenhoe Park, Colchester CO4 3SQ, Essex,
More informationDesigning Views to Answer Queries under Set, Bag,and BagSet Semantics
Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Rada Chirkova Department of Computer Science, North Carolina State University Raleigh, NC 27695-7535 chirkova@csc.ncsu.edu Foto Afrati
More informationScalability via Parallelization of OWL Reasoning
Scalability via Parallelization of OWL Reasoning Thorsten Liebig, Andreas Steigmiller, and Olaf Noppens Institute for Artificial Intelligence, Ulm University 89069 Ulm, Germany firstname.lastname@uni-ulm.de
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationChapter 13: Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationFinal Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23
Final Exam Review 2 Kathleen Durant CS 3200 Northeastern University Lecture 23 QUERY EVALUATION PLAN Representation of a SQL Command SELECT {DISTINCT} FROM {WHERE
More informationQUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION
E0 261 Jayant Haritsa Computer Science and Automation Indian Institute of Science JAN 2014 Slide 1 Database Engines Main Components Query Processing Transaction Processing Access Methods JAN 2014 Slide
More informationHorizontal Aggregations for Mining Relational Databases
Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,
More informationAn Oracle White Paper April 2010
An Oracle White Paper April 2010 In October 2009, NEC Corporation ( NEC ) established development guidelines and a roadmap for IT platform products to realize a next-generation IT infrastructures suited
More informationA FRAMEWORK FOR EFFICIENT DATA SEARCH THROUGH XML TREE PATTERNS
A FRAMEWORK FOR EFFICIENT DATA SEARCH THROUGH XML TREE PATTERNS SRIVANI SARIKONDA 1 PG Scholar Department of CSE P.SANDEEP REDDY 2 Associate professor Department of CSE DR.M.V.SIVA PRASAD 3 Principal Abstract:
More informationColumn Stores vs. Row Stores How Different Are They Really?
Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background
More informationData Access Paths for Frequent Itemsets Discovery
Data Access Paths for Frequent Itemsets Discovery Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science {marekw, mzakrz}@cs.put.poznan.pl Abstract. A number
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationScalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX
Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX ABSTRACT Symmetric multiprocessor (SMP) computers can increase performance by reducing the time required to analyze large volumes
More informationData integration supports seamless access to autonomous, heterogeneous information
Using Constraints to Describe Source Contents in Data Integration Systems Chen Li, University of California, Irvine Data integration supports seamless access to autonomous, heterogeneous information sources
More informationMore on Conjunctive Selection Condition and Branch Prediction
More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused
More informationMapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1
MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data
More informationSimilarity Joins of Text with Incomplete Information Formats
Similarity Joins of Text with Incomplete Information Formats Shaoxu Song and Lei Chen Department of Computer Science Hong Kong University of Science and Technology {sshaoxu,leichen}@cs.ust.hk Abstract.
More informationISSUES IN SPATIAL DATABASES AND GEOGRAPHICAL INFORMATION SYSTEMS (GIS) HANAN SAMET
zk0 ISSUES IN SPATIAL DATABASES AND GEOGRAPHICAL INFORMATION SYSTEMS (GIS) HANAN SAMET COMPUTER SCIENCE DEPARTMENT AND CENTER FOR AUTOMATION RESEARCH AND INSTITUTE FOR ADVANCED COMPUTER STUDIES UNIVERSITY
More informationEvaluation of Parallel Programs by Measurement of Its Granularity
Evaluation of Parallel Programs by Measurement of Its Granularity Jan Kwiatkowski Computer Science Department, Wroclaw University of Technology 50-370 Wroclaw, Wybrzeze Wyspianskiego 27, Poland kwiatkowski@ci-1.ci.pwr.wroc.pl
More informationTPC-DI. The First Industry Benchmark for Data Integration
The First Industry Benchmark for Data Integration Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, Brian Caufield VLDB 2014, Hangzhou, China, September 4 Data Integration Data Integration (DI) covers a
More informationData Warehousing and Decision Support
Data Warehousing and Decision Support Chapter 23, Part A Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1 Introduction Increasingly, organizations are analyzing current and historical
More informationMining Distributed Frequent Itemset with Hadoop
Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario
More informationPivoting M-tree: A Metric Access Method for Efficient Similarity Search
Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Tomáš Skopal Department of Computer Science, VŠB Technical University of Ostrava, tř. 17. listopadu 15, Ostrava, Czech Republic tomas.skopal@vsb.cz
More informationOn Multiple Query Optimization in Data Mining
On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl
More information2.3 Algorithms Using Map-Reduce
28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure
More informationUpdates through Views
1 of 6 15 giu 2010 00:16 Encyclopedia of Database Systems Springer Science+Business Media, LLC 2009 10.1007/978-0-387-39940-9_847 LING LIU and M. TAMER ÖZSU Updates through Views Yannis Velegrakis 1 (1)
More informationNew Join Operator Definitions for Sensor Network Databases *
Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 41 New Join Operator Definitions for Sensor Network Databases * Seungjae
More informationCAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1
CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query Sub-System Queries Select * From Blah B Where B.blah = blah Query Parser Query Optimizer Plan Generator Plan Cost
More informationIncreasing Database Performance through Optimizing Structure Query Language Join Statement
Journal of Computer Science 6 (5): 585-590, 2010 ISSN 1549-3636 2010 Science Publications Increasing Database Performance through Optimizing Structure Query Language Join Statement 1 Ossama K. Muslih and
More informationImpala Intro. MingLi xunzhang
Impala Intro MingLi xunzhang Overview MPP SQL Query Engine for Hadoop Environment Designed for great performance BI Connected(ODBC/JDBC, Kerberos, LDAP, ANSI SQL) Hadoop Components HDFS, HBase, Metastore,
More informationB.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2
Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,
More informationQuery optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.
Database Management Systems DBMS Architecture SQL INSTRUCTION OPTIMIZER MANAGEMENT OF ACCESS METHODS CONCURRENCY CONTROL BUFFER MANAGER RELIABILITY MANAGEMENT Index Files Data Files System Catalog DATABASE
More informationWeb-based Energy-efficient Cache Invalidation in Wireless Mobile Environment
Web-based Energy-efficient Cache Invalidation in Wireless Mobile Environment Y.-K. Chang, M.-H. Hong, and Y.-W. Ting Dept. of Computer Science & Information Engineering, National Cheng Kung University
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationAn Information-Theoretic Approach to the Prepruning of Classification Rules
An Information-Theoretic Approach to the Prepruning of Classification Rules Max Bramer University of Portsmouth, Portsmouth, UK Abstract: Keywords: The automatic induction of classification rules from
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs
More informationPerformance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads
Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi
More informationTraining. Data Modelling. Framework Manager Projects (2 days) Contents
We aim to provide you with the right training, at the right time and at the right price'. A cost effective solution to your business objectives. Our trainers are experts in IBM Cognos applications and
More informationA Fast and High Throughput SQL Query System for Big Data
A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190
More informationAnalyzing Dshield Logs Using Fully Automatic Cross-Associations
Analyzing Dshield Logs Using Fully Automatic Cross-Associations Anh Le 1 1 Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA, 92697, USA anh.le@uci.edu
More informationPerformance Optimization for Informatica Data Services ( Hotfix 3)
Performance Optimization for Informatica Data Services (9.5.0-9.6.1 Hotfix 3) 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,
More informationHash-Based Indexing 165
Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19
More informationThe Design and Optimization of Database
Journal of Physics: Conference Series PAPER OPEN ACCESS The Design and Optimization of Database To cite this article: Guo Feng 2018 J. Phys.: Conf. Ser. 1087 032006 View the article online for updates
More informationUniversity of Waterloo Midterm Examination Sample Solution
1. (4 total marks) University of Waterloo Midterm Examination Sample Solution Winter, 2012 Suppose that a relational database contains the following large relation: Track(ReleaseID, TrackNum, Title, Length,
More information1. Attempt any two of the following: 10 a. State and justify the characteristics of a Data Warehouse with suitable examples.
Instructions to the Examiners: 1. May the Examiners not look for exact words from the text book in the Answers. 2. May any valid example be accepted - example may or may not be from the text book 1. Attempt
More informationJob Re-Packing for Enhancing the Performance of Gang Scheduling
Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationBuilt for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations
Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Table of contents Faster Visualizations from Data Warehouses 3 The Plan 4 The Criteria 4 Learning
More informationCombining Distributed Memory and Shared Memory Parallelization for Data Mining Algorithms
Combining Distributed Memory and Shared Memory Parallelization for Data Mining Algorithms Ruoming Jin Department of Computer and Information Sciences Ohio State University, Columbus OH 4321 jinr@cis.ohio-state.edu
More informationData Warehousing and Decision Support
Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 4320 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business
More informationJoin (SQL) - Wikipedia, the free encyclopedia
페이지 1 / 7 Sample tables All subsequent explanations on join types in this article make use of the following two tables. The rows in these tables serve to illustrate the effect of different types of joins
More informationA Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 A Real Time GIS Approximation Approach for Multiphase
More informationSupporting Fuzzy Keyword Search in Databases
I J C T A, 9(24), 2016, pp. 385-391 International Science Press Supporting Fuzzy Keyword Search in Databases Jayavarthini C.* and Priya S. ABSTRACT An efficient keyword search system computes answers as
More informationHYRISE In-Memory Storage Engine
HYRISE In-Memory Storage Engine Martin Grund 1, Jens Krueger 1, Philippe Cudre-Mauroux 3, Samuel Madden 2 Alexander Zeier 1, Hasso Plattner 1 1 Hasso-Plattner-Institute, Germany 2 MIT CSAIL, USA 3 University
More informationOptimizing System Performance
243 CHAPTER 19 Optimizing System Performance Definitions 243 Collecting and Interpreting Performance Statistics 244 Using the FULLSTIMER and STIMER System Options 244 Interpreting FULLSTIMER and STIMER
More informationOptimising Mediator Queries to Distributed Engineering Systems
Optimising Mediator Queries to Distributed Engineering Systems Mattias Nyström 1 and Tore Risch 2 1 Luleå University of Technology, S-971 87 Luleå, Sweden Mattias.Nystrom@cad.luth.se 2 Uppsala University,
More informationAnalysis of Basic Data Reordering Techniques
Analysis of Basic Data Reordering Techniques Tan Apaydin 1, Ali Şaman Tosun 2, and Hakan Ferhatosmanoglu 1 1 The Ohio State University, Computer Science and Engineering apaydin,hakan@cse.ohio-state.edu
More informationA New Online Clustering Approach for Data in Arbitrary Shaped Clusters
A New Online Clustering Approach for Data in Arbitrary Shaped Clusters Richard Hyde, Plamen Angelov Data Science Group, School of Computing and Communications Lancaster University Lancaster, LA1 4WA, UK
More informationData Warehousing and Decision Support. Introduction. Three Complementary Trends. [R&G] Chapter 23, Part A
Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 432 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business
More informationA 12-STEP SORTING NETWORK FOR 22 ELEMENTS
A 12-STEP SORTING NETWORK FOR 22 ELEMENTS SHERENAZ W. AL-HAJ BADDAR Department of Computer Science, Kent State University Kent, Ohio 44240, USA KENNETH E. BATCHER Department of Computer Science, Kent State
More informationDistributed File Systems. CS 537 Lecture 15. Distributed File Systems. Transfer Model. Naming transparency 3/27/09
Distributed File Systems CS 537 Lecture 15 Distributed File Systems Michael Swift Goal: view a distributed system as a file system Storage is distributed Web tries to make world a collection of hyperlinked
More information