A MODEL FOR ADVANCED QUERY CAPABILITY DESCRIPTION IN MEDIATOR SYSTEMS

Size: px

Start display at page:

Download "A MODEL FOR ADVANCED QUERY CAPABILITY DESCRIPTION IN MEDIATOR SYSTEMS"

Beverly Anthony Norris
5 years ago
Views:

1 A MODEL FOR ADVANCED QUERY CAPABILITY DESCRIPTION IN MEDIATOR SYSTEMS Alberto Pan, Paula Montoto and Anastasio Molano Denodo Technologies, Almirante Fco. Moreno 5 B, Madrid, Spain apan@denodo.com, pmontoto@denodo.com, amolano@denodo.com Manuel Álvarez, Juan Raposo and Ángel Viña Information and Communication Technologies Department, University of A Coruña, A Coruña, Spain mad@udc.es, jrs@udc.es, avc@udc.es Keywords: Abstract: Mediator systems, sources query capability, semi-structured data integration. Mediator systems aim to provide an unified global data schema over distributed heterogeneous structured and semi-structured data sources. These systems must deal with limitations on the query capabilities of the sources. This paper introduces a new framework for representing source query capability along with the algorithms needed to compute the query capabilities of the global schema from sources. Our approach for computing query capabilities is able to support a richer capabilities representation framework than the ones previously presented in the literature. We show that those approaches are insufficient to properly represent many real sources, and how our approach can solve those limitations. 1. INTRODUCTION The world today is characterised by the abundance of information sources available through media such as the WWW, databases, semi-structured files, etc. Nevertheless, this information is usually scattered, heterogeneous and weakly structured (semi-structured data). Mediator systems (Wiederhold, 1992), provide users with an unified global schema over distributed heterogeneous structured and semi-structured data sources. In the context in which the mediators operate, sources generally have limitations in the way in which they can be queried. For example, in Web sources (one of the most significant types of mediator sources) query possibilities are often limited to those provided by some type of HTML form. The mediator systems presented up to date, such as (García-Molina, 1995)(Halevy, 1996)(Tomasic, 1998), used simple query capability description frameworks in order to detect when a given query over the mediated global schema cannot be answered due to source limitations. But most of these frameworks fail to properly represent the query capabilities of many real sources. Besides, these systems do not compute the query capabilities of global schema relations from source capabilities. This has the following important disadvantages: Users are forced to write queries by a trial & error process, trying to figure out which queries can in fact be processed by the mediator. It is not possible to use mediators as source for new ones (achieving this can be very convenient in large data integration projects). Nevertheless its importance the issue of computing global schema query capabilities, as far as we are aware, has only been dealt with in (Yerneni, 1999). But this work was based on a simple query capability description framework unable to deal with the features encountered in many real sources.

2 In section 2 of this paper we present a query capability description framework able to properly model a much wider range of real sources. We also present an algorithm for propagating query capability from sources to global relations supported by our framework. In section 3, we provide an overview of a method for executing queries in a mediator using the former techniques. Those ideas have been implemented in a commercial mediator system which has been used for the construction of various industrial data integration applications. In section 4, we include some quantitative data from them, which is relevant for our discussion. Finally, Section 5 discusses related and future work. 2. QUERY CAPABILITY FRAMEWORK 2.1 Relation Model In our model each source exports a combination of relations, called base relations. Each base relation is composed of a combination of attributes, each one belonging to a particular data type. Besides of the usual atomic data-types (strings of characters, integers, money, dates, etc.) structs and arrays are also supported. Each relation will represent its query capability through what we term as search-methods. Each search-method is comprised of three elements: : indicate the restrictions which must be satisfied by the queries sent to the relation through this particular searchmethod. : Indicate the query capabilities offered by the relation, once the negative capabilities has been satisfied. Output of attributes: Set of attributes which appear in the response of the queries sent against the relation. This set is needed since some sources allow being queried by attributes which are not included in query answers. For instance, some web bookshops can be queried by the book subject, but such an attribute does not appear in the output. Both negative and positive capabilities are represented through a set of 4-uples with the format: (attribute, operator, multiplicity, possible_values) where: attribute is an attribute of the relation. operator is an operator valid for the attribute data type. multiplicity indicates how many query conditions for the given pair attribute/operator can execute the relation in the same query. + indicates a number of values above 0 but with no upper limit. Possible_values is the list of values by which the attribute can be consulted. If it contains the value Any it means that the search range is not limited (in the range associated to the attribute data type). EXAMPLE 1: We consider the example of an on-line bookshop whose search form is shown in figure 1. The form forces the user to specify a value for the TITLE attribute (i.e. it is a mandatory field) and optionally the user can specify a value for the AUTHOR attribute, for the FORMAT attribute (restricted to a list of possible values) and a range of values for the PRICE attribute. Queries by title and author are searches by keyword (operator Contains). An arbitrary number of words can be used to fill in the boxes and the source will perform the logical AND operation between them. For instance, filling the title box with the words java and xml would return all the books containing both words in its title. In addition to the former attributes, we assumed that the shop also returns a PUBLISHER attribute. Title Mandatory Author Format (select one) Price All formats Hardcover ebooks Paperback min max Figure 1. Bookshop search form We model this source as the following relation: R={TITLE:String, AUTHOR:String, FORMAT:Enumerated(Hardcover,

3 Paperback, ebooks), PUBLISHER: String, PRICE:Money} Which capabilities can be modelled with the following search-method: { NEGATIVE { (TITLE,Contains,1,ANY) } POSITIVE{ (TITLE,Contains,+,ANY) (AUTHOR,Contains,+,ANY) (FORMAT,=,1,{ Hardcover, Paperback, ebooks }) (PRICE,<=,1,ANY) (PRICE,>=,1,ANY) } OUTPUT {TITLE,AUTHOR,FORMAT,PUBLISHER,PRICE} } Not all the relations can be modelled with only one search-method. When a relation presents mutually excluding options, more than a searchmethod is needed. For instance, it could happen that the source, instead of forcing the user to specify a value for the TITLE attribute, only forced to specify a value either for the TITLE attribute or for the AUTHOR attribute. This could be represented by adding a second search-method identical to the former one but substituting its negative 4-uple into: (AUTHOR,CONTAINS,1,ANY). 2.2 Query Model We define a base query as a query which involves only one relation. A base query in our model is of the form (using SQL syntax): Select * from R where (QueryCondition1) and (QueryCondition2)and and (QueryConditionN) where a query condition is a Boolean expression combining 3-uples of the form (Attribute Operator Value). For instance, a valid base query over R would be: EXAMPLE 2: A base query over R. select * from R where (title contains java )and(title contains xml ) and (price <= 5000) and (publisher contains Addison-Wesley ) It is interesting to point out the differences between our base query model and the one supported in systems like (Yerneni, 1999) (which is the only previous system which computed mediator query capabilities). The main differences are that in those simpler models the only operator permitted in query conditions is the = operator; and only one query condition per attribute is permitted. Those models cannot properly represent the query capabilities of many sources. Let us consider Example 1: It is not possible to represent the query capabilities for the PRICE attribute, since the source allows to be queried using <= and >= operators, instead of =. It is not possible to properly represent the source query capabilities for TITLE and AUTHOR attributes, since they use the CONTAINS operator (which only could be poorly "emulated" by the = operator), and also the source allows to be queried with more than one query condition for those attributes (as in the previous example query, where there was two query conditions for the TITLE attribute). Of course, the system is not restricted to execute only base queries. As will become more evident below, queries can also combine different relations by using joins, unions, aggregation operations, etc. 2.3 Matching queries with searchmethods In order to execute a base query against a relation through a certain search-method, the following conditions are required: 1) All the negative 4-uples must be satisfied by the query conditions of the query. A negative 4-uple (A,OP,MUL,VALUES) is satisfied if the query contains at least a number of MUL query conditions of the form (A OP a1)where a1 VALUES. For instance the first base query in Example 2 would satisfy restrictions on the search-method from Example 1, since there is

4 one query condition involving the TITLE attribute and the CONTAINS operator. 2) The query conditions of the query are permitted by the positive 4-uples. For verifying that the conditions of a given query are permitted by a search-method, we iterate over all those conditions. Each condition of the form (A OP a1)is permitted if exists one positive 4-uple (A,OP2,MUL,VALUES) verifying: 1) OP=OP2, 2) MUL>=1, and 3) a1 VALUES. Each time a positive 4-uple having a multiplicity value different from '+' is used for permitting a query condition, we decrease its multiplicity by 1. When it reaches 0, the 4-uple cannot be used to permit more query conditions during the process. For instance, the first base query in Example 2 would not be directly answerable by the source since there is not 4-uple which permits querying the relation by the PUBLISHER attribute 2.4 Post-processing techniques Query conditions involving those attributes present in the output set can always be performed by the mediator, by applying post-processing techniques. EXAMPLE 3: The first base query showed in the Example 2 can, in fact, be executed over the relation R from the Example 1, by executing the query: select * from R where (title contains java ) and (price<=5000) on the source and then (since the PUBLISHER attribute is present in the output), post-processing in the view the obtained results by filtering out those tuples not verifying the additional condition: (publisher contains Addison-Wesley ) As a consequence of these techniques, we can add a 4-uple of the form (A,ANY,+,ANY)for each attribute A in the output set (and there will be no need of other 4-uples for these attributes, since this 4-uple represents the capability for performing any query condition). Any other boolean condition involving A could also be applied (even using other boolean operators like or, not, etc.) Another post-processing technique can be applied for further improving source query capability. Let us define the containment relation between operators. We will say that an operator OP2 is contained in an operator OP1, if OP1 can emulate OP2 by post-processing techniques. For instance, a query condition using = can be emulated by using the same query condition with the Contains operator, and then filtering out the non exact-match results. So we will say that = operator is contained in the Contains operator. This only can be done if the attribute of the condition is included in the output set. That means that every negative 4-uple (A,OP,MUL,VALUES), where A is included in the output set, can also be satisfied by conditions of the form (A OP2 a), where a VALUES and OP2 is contained in OP. In the same way, a positive 4-uple (A,OP,MUL,VALUES) where A is included in the output set, also permits a query condition (A OP2 a), where a VALUES and OP2 is contained in OP. Since the queries executed against a base relation by post-processing techniques, can be significantly less efficient (because more data must be transmitted over the network), it is useful to differentiate between the native and the extended capabilities of the source. In query execution this should be taken into account in order to use post-processing only when needed. 2.5 Global schema relations Once the base relations have been created, each relation of the global schema is defined by a query involving the base relations, in a similar way to the definition of views in a conventional database. This approach is known in literature as Global As View (Florescu, 1998). A view is defined through a relational algebra expression involving relations. The algebraic operators permitted are: union, join, selection and projection. Orderby and aggregation operations are also supported (though not described here for the sake of shorteness).

5 EXAMPLE 4: Consider the three base relations: A ={TITLE:String,AUTHOR:String, PRICE:Money} B={TITLE:String,AUTHOR:String, PRICE:Money} C={TITLE:String,AUTHOR:String, RELEVANCE:Integer} A and B represent two Internet book merchant sites. C represents a web source in which its users assess the books relevance, and the system enables searches for the average relevance of a certain book. Consider that we want to obtain a global schema relation: R={TITLE,AUTHOR,PRICE,RELEVANCE} which contains all the books from A and B (only one tuple for each book), together with their average relevance according to C, and the minimum price for each book found within both e-shops. The definition of R would be: SELECT TITLE,AUTHOR,RELEVANCE, MIN(PRICE) FROM (I SELECT TITLE,AUTHOR,PRICE FROM A UNION SELECT TITLE,AUTHOR,PRICE FROM B),C WHERE I.TITLE=C.TITLE AND I.AUTHOR=C.AUTHOR GROUPBY TITLE,AUTHOR,RELEVANCE Computing query capability for global schema relations When the global schema relations are defined, the mediator is capable of computing its query capabilities from the sources. The query capability of a relation depends on the definition of the view through which such relation was defined. Given the tree of the relational algebra expression defining the view, we will provide a set of rules for calculating the search-methods of a given node in function of its node-type (that is, union, join, selection or projection), and the searchmethods of its children. Given these rules, the search-methods for the new relation can be easily obtained by traversing the tree from leaf nodes to root node, and calculating the search-methods for each node. The search-methods for the root node will be the ones representing the query capability of the relation. The way these rules should be applied differs depending on the number of child nodes. When the node has only one child (this will happen for projection and selection nodes), for each search-method in the relation represented by the child node, we will create a new search-method in the parent node. So, we will provide the rules for obtaining a new search-method from a given searchmethod from the child node (we will term it as a base search-method) When the node has many children (union and join operations), we only need to consider the case when there are only two children. If there were n, it would suffice to apply (n-1) times the process for two nodes. For instance, if we have an union node with four children (F1,F2,F3,F4), we could compute: R1=(F1 F2) R2=(F3 R1) R3=(F4 R2) and R3 would be the desired result. With two children, a new search-method in the parent node will be generated for each pair in the cross product of the two sets of search-methods from the children. The rules for obtaining a new searchmethod in function of each pair (we will term them as base search-methods) will be given for each operation (join and union). Once the way rules should be used has been introduced, the following sections provide them for each operation Union nodes Output Set An attribute will be in the output set of the new search-method if it is in the output set of both base search-methods. In order to correctly understand the following steps, it is convenient to have in mind the process

6 followed by a mediator for executing a query over an union node: it will simply execute the query over both children; and then will perform the union of both results. A first consequence of this is that the negative capabilities of one base search-method must be permitted by the positive capabilities of the other one. In other case, no query could be executed by the mediator over that node since: both search-methods must execute the same queries and one of the base search-methods forces to be queried by conditions that the other base searchmethod specifically excludes. Given that the former condition is satisfied, then the new search-method can be created by including in it the negative capabilities of both base searchmethods (since queries will have to satisfy the query restrictions of both children). This process can lead to redundant restrictions. We will say that a 4-uple X=(A,OP1,MUL1,VALUES1) is less restrictive than another 4-uple Y=(A,OP2,MUL2,VALUES2) (and thus, X can be removed ) if: OP1=OP2 MUL1>=MUL2. ( '+' is considered as the higher value). VALUES2 VALUES1. ('ANY' is considered as a set containing all possible values). Due to post-processing techniques, we can add a 4-uple (A,ANY,+,ANY)for each attribute A in the output set. On the other side, query conditions by those attributes not present in the output set are only possible if they are permitted by both base searchmethods. Thus, in order to calculate the new 4-uples for those attributes, the following process is needed: For each attribute, we compute the cross product of the 4-uples for the attribute in both base searchmethods. For each pair (X,Y) in the calculated cross product where: X=(A,OP,MUL1,VALUES1) Y=(A,OP,MUL2,VALUES2) (note that in order to generate a new 4-uple, both base 4-uples must share the same operator) we will create a 4-uple in the new search-method with the following form: (A,OP,MIN(MUL1,MUL2),VALUES1 VALUES2) Join Nodes While for executing queries over union nodes the mediator had only one relevant strategy, for join nodes, several ones are possible. As a consequence, the query capabilities computed will differ depending on the execution method used. So, we must compute the query capabilities for each possible case. In execution time, the mediator will be responsible of consider both the query capabilities and the cost of each execution method in order to choose the better one for each particular query. Here we will consider two different join execution strategies: conventional-join and nestedjoin. Intuitively, a query over a join node using conventional strategy, is executed by sending a subquery over both child relations (each sub-query will contain the query conditions concerning its attributes) and then, joining the sub-results. A query using nested strategy is executed by first sending a sub-query against the first child relation, and then for each unique value for the join attribute(s) returned in the sub-result, sending a query against the second relation. EXAMPLE 5: Nested join Let us consider two relations R1(A,B,C) and R2(A,D,E). Let us also suppose a relation R3(A,B,C,D,E)defined by R1 A R2. If we receive the query: Select * from R3 where (A op1 value1) and (B op2 value2) and (D op3 value3) Then the nested join strategy is implemented in the following way: 1) The following query is sent against R1, obtaining the result Q1 :

7 Q1 = Select * from R1 where (A op1 value1) and (B op2 value2) 2) Then, for each unique value a i for the attribute A obtained in Q1, we send against R2 the query: Q2 i = Select * from R2 where (A = a i ) and (D op3 value3) 3) Now we can compute the union of all the partial results obtained from R2: Q2 = Q2 1 Q2 2 Q2 n 4) Now we can compute Q1 A Q2 for obtaining the final result Conventional join strategy Output Set In order to allow this strategy to be executed, it is necessary that the join attribute be in the output set of both base search-methods. In other case, the new search-method could not be created. Given this, the output set is obtained by simply making the union of the two base sets. The new 4-uples for the join attributes are obtained in the same manner which has been already described for the attributes of union nodes. The new 4-uples for non-join attributes are obtained by simply copying the ones from the base search-methods. As in the case of union nodes, we can add a 4- uple of the form (A,ANY,+,ANY)for each attribute A in the output set. About the attributes not present in the output set, the 4-uples for non-join attributes can be obtained by simply copying the ones from the base searchmethods. As we have already seen, join attributes must be present in the output set, so we do not need to consider this case Nested join strategy As we have seen in Example 5, the order of the child nodes is relevant. So, we need to repeat the process twice, considering the two cases where each child node is chosen as first. Output Set In order to allow this strategy to be executed, it is necessary that the join attribute(s) be in the output set of the first relation. In other case, the new searchmethod could not be created. Note that in a nestedjoin, it is not required that the join attributes be in the output of the second relation, since its value is always known in the tuples retrieved from that relation (recall step two from Example 5). Then, the output set is obtained by including the join attributes and the non-join attributes which are in the output set of its base search-method. The difference with the conventional join strategy is that, for join attributes, the 4-uples of the second relation having the = operator, should be not copied into the new search-method, because they will always be satisfied. This is due to the fact that nested strategy always uses the values returned by the first relation to query the second for the join attributes and with the = operator (again, recall step two from Example 5). A first consideration is that this join strategy is only possible if the second base search-method allows querying the relation by the join attributes with the operator =. If this is satisfied, we can add a 4-uple of the form (A,ANY,+,ANY)for each attribute A in the output set. About the attributes not present in the output set, the 4-uples for non-join attributes can be obtained by simply copying the ones from the base searchmethods. As we have already seen, join attributes must be present in the output set, so we do not need to consider this case Projection nodes As is well known, a projection operation over a relation consists in removing some of their attributes (which are called hidden attributes). Output Set

8 The attributes in the output set of the new search-method will be the same as the ones present in the base search-method, excluding the hidden attributes. If there exists in the base search-method a 4-uple for a hidden attribute, then the new search-method could not be created. In other case, the 4-uples of the new searchmethod are obtained by copying the 4-uples of the base search-method. Again, we can add a 4-uple of the form (A,ANY, +,ANY)for each attribute A present in the output set. For those attributes not present in the output set, the 4-uples of the base search-method are copied in the new one. The 4-uples of the hidden attributes are removed Selection Nodes Output attributes The attributes in the output set of the new search-method will be the same as the ones present in the base one. We can also add those attributes not appearing in the output set of the base searchmethod, but which verifies at least one of the two following conditions: to appear in a selection condition with the '=' operator. Let us suppose a relation R=(A,B,C) where the attribute C does not appear in the output. Suppose we define a selection view R' over R where the selection condition is C=c1. We can consider that C is present in the output of R', since it is guaranteed that its value will be always known (it will be c1). to appear with the '=' operator in a 4-uple from the negative capabilities. Suppose a relation R=(A,B,C) having a negative capability of the form (C,=,MUL,VALUES). Since this guarantees that all the valid queries over R will include a query condition of the form (C=c1), we can assume that attribute C is in the output set, having the value c1 for all answer tuples. The 4-uples of the new search-method are obtained by copying the 4-uples of the base searchmethod, and removing those ones which are satisfied by the selection conditions (since these restrictions are already satisfied by the selection view operation, they need not to be satisfied by the query conditions). In order to be able of creating a new search method, the positive capabilities of the base searchmethod must allow the execution of the selection view operation. The algorithm showed in section 2.3 is used to determine it. For those attributes not present in the output set, the new 4-uples will be the ones obtained after applying the mentioned algorithm (multiplicity of 4- uples can be decreased as a result of this process, since the query conditions of the selection operation spend some of the available query capability). For the output attributes, we just add a 4-uple (A,ANY,+,ANY). 3. QUERY EXECUTION Once the global relations have been created and their capabilities generated, the mediator is able to accept queries over them. A query can be considered as a definition of a view over the global schema relations. Therefore, that view can be built by following an analogous process to that used to define the global schema relations, so a set of search-methods are obtained for the query. Those search methods including at least one negative 4-uple will be discarded, since that implies that query do not satisfy all the sources query restrictions (or negative capabilities). At this point, each search-method for the query can be seen as an relational algebra operation tree whose leaves are source relations and the "trace" of the search-method says how to execute it. So, each remaining search-method can, in fact, be considered as a feasible query execution plan. The optimiser module of the mediator should now obtain the cost of each plan in order to select the best one. Then the execution engine travels the chosen plan tree from the roots up to the leaves, obtaining a set of sub-queries over sources. Each one of these sub-queries is sent to the source wrapper and

9 executed by it. The execution engine then combines the returned results in accordance with what is specified by the plan tree, thus obtaining the final response. 4. QUANTITATIVE EXPERIENCE The techniques presented have been implemented in a commercial mediator system, and has been successfully used in several industrial applications for modelling more than 700 sources such as Web sites (aprox. 75% of total), databases, spreadsheets and XML documents. Some relevant for our discussion quantitative data referred to source query capabilities follows: 1) The percentage of sources which use some operator different from = exceeds 80%. 2) The most frequent operators are (in descending order): CONTAINS, =, <= and >=. 3) The percentage of sources which can execute multiple query conditions for some attributes exceeds 62%. 5. RELATED WORK AND CONCLUSIONS In recent years, a significant amount of academic mediator systems has been developed, such as Tsimmis (Garcia-Molina,1995), Information Manifold (Halevy,1996), or Disco (Tomasic, 1998). Some specific aspects in the construction of mediators, which are relevant to our discussion, have also been studied, such as query rewriting (Halevy, 2000), source query capabilities description and capabilities-based query rewriting (Papakonstantinou, 1996), and computation of mediator capabilities (Yerneni,1999). Nevertheless, as has already been discussed, most of these works failed to properly represent some advanced query features which are present in many real sources. (Papakonstantinou, 1996) is an exception for this, since it presents a more powerful capability description framework than the other systems. Nevertheless, it still lacks of some desirable features such as explicit representation of mandatory and optional attributes, which leads to less concise representations. Besides, this work does not address the important issue of computing mediator query capabilities. We claim that computing the query capabilities of the global schema relations is a must-have feature for mediators aiming to gain industrial support: 1) It is needed in order to let users know the queries supported. In our experience, industrial users do not find trial&error query processes acceptable. 2) It makes possible to use mediators as sources for other mediators, easily enabling incremental data integration processes. This feature is very desirable for large data integration projects. This issue had been dealt with in (Yerneni, 1999) but the algorithms presented in that work assumed a very limited source capability representation framework. In turn, the algorithm we present here is able of managing the richer capabilities representation models we have described. 6. REFERENCES Florescu, D., Levy, A., Mendelzon, A., Database Techniques for the World-Wide Web: A Survey. In SIGMOD RECORD, vol. 27 nº3. García-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman A., Sagiv, Y., Ullman, J., Widow, J., The TSIMMIS approach to mediation: Data models and languages. En Proceedings of NGITS (Next Generation Information Technologies and Systems). Halevy, A., Theory of Answering Queries Using Views. In ACM SIGMOD Record vol. 29, nº 4. Halevy, A., Rajaraman, A., Ordille, J., Query answering algorithms for information agents. In Proceedings of the Thirteenth National Conference on Artificial Intelligence Papakonstantinou, Y., Gupta, A., Hass, L., Capabilities-Based Query Rewriting in Mediator Systems. En Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems Tomasic, A., Raschid, L., Valduriez, P., Scaling access to distributed heterogeneous data sources with Disco. En IEEE Transactions On Knowledge and Data Engineering. Wiederhold, G., Mediators in the architecture of future information systems. Computer, 25(3) Yerneni, R., Li, C., García-Molina, H., and Ullman, J., Computing Capabilities of Mediators. In Proccedings of the ACM SIGMOD Conference.

Integrated Usage of Heterogeneous Databases for Novice Users

International Journal of Networked and Distributed Computing, Vol. 3, No. 2 (April 2015), 109-118 Integrated Usage of Heterogeneous Databases for Novice Users Ayano Terakawa Dept. of Information Science,