Element Algebra

M. G. Manukyan
Yerevan State University, Yerevan, 0025
mgm@ysu.am

Abstract. An element algebra supporting the element calculus is proposed. The input and output of our algebra are xdm-elements. A formal definition of the element algebra is offered. We consider algebraic expressions as mappings. A reduction of the element calculus to the element algebra is suggested.

1 Introduction

XML databases currently attract considerable interest among database researchers for the following reasons:
- DTD is a compromise between the strict-schema models, such as the relational or object models, and the completely schemaless world of semi-structured data;
- in contrast to the semi-structured data model, the concept of a database schema in the sense of conventional data models is supported;
- in contrast to the strict schemas of conventional data models, it is possible to define more flexible database schemas (DTDs often allow optional or missing fields, for instance) [1].

A major disadvantage of DTD is that it offers no means of expressing type information and integrity constraints. An important step in this direction is XML Schema [2, 3], a formalism that restricts the structure of XML documents and extends XML with data types. An XML query data model [4] based on the XML Schema type system has been developed. This XML query data model is the foundation of the XML query algebra [5]. In the context of the XML query data model and the XML query algebra, an XML query language [6] has been suggested. Notice that XML Schema = XML + data types. Here data type has a non-classical interpretation: the value set of a data type is defined without corresponding operations [7]. Therefore, at the level of XML Schema we cannot define the dynamics of application domain objects. XQuery [6] is a query language for XML that allows queries to be posed across all these kinds of data, whether physically stored in XML or viewed as XML via middleware.
XQuery is not a declarative query language (see [7] for details). In contrast to declarative languages, its procedural character makes it impossible to create an effective optimizer for XQuery. In [8] we suggested an extensible data model (xdm) in order to:
- extend the semantics of the XML data model to support the database concept;
- create a declarative query language.
It is common in database theory to translate a query language into an algebra, since an algebra is a language of the execution level. Thus the algebra is used to:
- give a semantics for the query language;
- support query optimization.

The requirements above presume a formal definition of the algebra. We have developed an element algebra to support the element calculus (a declarative query language for xdm) [8]. The input and output of our algebra are xdm-elements. The considered algebra supports the standard algebraic operations. In the standard algebra the operands of algebraic operations are relations; in our case the operands are xdm-elements. Thus, to apply the standard algebraic operations directly to xdm-elements we need to:
- formalize the xdm-element in compliance with the theory of relational databases;
- define the inference rules for the resulting schemas of algebraic expressions;
- prove the equivalence of the element calculus and the element algebra.

2 Related Work

Many XML algebras are considered in the literature, for example [9-17]. Some of them [9, 12-14] have been developed to support XQuery. The XML data model can be either a tree or a graph. A forest can be transformed into a single tree by simply adding a root node as a common parent for all trees. The basic unit of information is an individual member of a collection fed to the operators. Recall that in the relational algebra an operator takes relations as input and produces a relation as output, and that a relation is composed of tuples, which are the basic units of information in the relational algebra.

In [17] an XML algebra (called XAL) for data mining has been offered. In XAL, each XML document is represented as a rooted directed graph with a partial order relation defined on its edges. The basic unit is a vertex representing either an element or an attribute. An operator receives a set of vertices as input and produces a set of vertices as output.
XAL provides a set of equivalence rules. Based on these rules, a heuristic algorithm that transforms a query tree into an optimized tree has been suggested.

In Niagara [16] the XML document is also represented as a rooted directed graph with elements and attributes as vertices. The basic unit is a bag of vertices, so the operators operate on collections of bags of vertices. This approach to XML algebra assumes implementation-independent optimization by rewriting with equivalences.

TAX [13, 14] treats an XML document as a forest of labeled rooted trees. TAX takes a labeled rooted tree as its basic unit and introduces the notions of pattern tree and witness tree. A pattern tree is a pair $P = (T, E)$, where $T$ is a node-labeled and edge-labeled tree and $E$ is a formula with value-based predicates applicable to tree nodes. Each node in $T$ is labeled by a unique integer, whereas each edge of $T$ is labeled by either pc (parent-child) or ad (ancestor-descendant). A witness tree is an instance of the data trees matching the pattern tree. All operations of this algebra take collections of trees as input and produce a collection of trees as output.

In [9] an XML algebra (called IBM) to support XQuery is considered. The basic unit is a vertex that represents an element, an attribute, or a reference. An operator receives a collection of vertices as input and produces a collection of vertices as output. This model is a logical model, and nothing is specified about the underlying storage representation or physical operators. In addition to the standard operations, a new reshaping operation that creates a new XML document from fragments of selected XML documents is offered.

The YATL algebra [11] has been developed for an XML-based integration system which integrates data from different sources. Only two new operations, bind and Tree, are suggested. The bind operation is used to extract relevant information from different sources and produces a structure called Tab, which is practically a 1NF relation [18]. Tree is the inverse operation to bind and generates a new XML document. All other operations are standard operations of the relational algebra.

An algebra is considered in [15] for a DBMS designed specifically for managing semi-structured data. The distinguishing features of this approach are cost-based query optimization and the manipulation of dynamic data structures. Each query is transformed into a logical query plan using logical operations such as select, project, name, etc., which can be considered algebra operations. A cost-based approach is used to select the best physical plan from the generated physical plans.

Another XML algebra, called AT&T, is considered in [12]. The AT&T algebra is powerful enough to capture the semantics of many XML query languages, and several optimization rules have been specified for it. In this algebra most of the operations are based on the iteration operation. With its well-defined list of operations, AT&T is distinctive in its ability to detect errors at query compile time.

A tree-based algebra (called TA) is considered in [10]. The basic unit is a tree that is used to model XML data.
In this algebra, operations take trees as input and produce a tree as output.

While the IBM, Niagara, TAX, XAL, AT&T and TA algebras were proposed as standalone XML algebras, Lore [15] and YATL were developed for a semi-structured database system and an integration system, respectively. The Niagara, TAX, XAL, AT&T and TA algebras support the standard algebraic operations.

3 Formal Definition of Element Algebra

Definition 1. We say that $S$ is an xdm-element schema if
1. $S = \langle \mathit{name}, \mathit{atomictype}, f \rangle$, where $f \in \{?, *, +, \bot\}$ $^{1}$, or
2. $S = \langle \mathit{name}, \mathit{typeop}(S_1, S_2, \ldots, S_n), f \rangle$, where $\mathit{typeop} \in \{\mathit{sequence}, \mathit{choice}, \mathit{all}\}$ and $S_i$ is an xdm-element schema $^{2}$, $1 \le i \le n$.

Definition 2. The xdm-element $s$ of schema $S$ is a finite collection of mappings $S \to \mathit{domain}(\mathit{firstcomp}(S)) \times \mathit{domain}(\mathit{secondcomp}(S))$; if $\mathit{secondcomp}(S) = \mathit{typeop}(S_1, S_2, \ldots, S_n)$, then the following constraint should hold for all $e \in s$: $e[S_i] \in \mathit{domain}(S_i)$, $1 \le i \le n$.

$^{1}$ A $\bot$ following an xdm-element means that the xdm-element occurs exactly once.
$^{2}$ The xdm-attributes are not considered, for simplicity.
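Definition 1 can be read as a recursive data structure. The following is a minimal Python sketch under stated assumptions, not the paper's implementation: the class names `Schema` and `Composite`, the function `is_schema`, and the spelling `"1"` for the exactly-once indicator are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Occurrence indicators; "1" is an assumed spelling for "exactly once".
INDICATORS = {"?", "*", "+", "1"}
TYPEOPS = {"sequence", "choice", "all"}


@dataclass(frozen=True)
class Composite:
    """Second component of case 2: typeop(S_1, ..., S_n)."""
    typeop: str
    parts: Tuple["Schema", ...]


@dataclass(frozen=True)
class Schema:
    """An xdm-element schema <name, type, f>."""
    name: str
    type: Union[str, Composite]  # atomic type name, or a composite type
    f: str = "1"


def is_schema(s) -> bool:
    """Check the two cases of Definition 1 recursively."""
    if not isinstance(s, Schema) or s.f not in INDICATORS:
        return False
    if isinstance(s.type, str):          # case 1: atomic type
        return True
    return (                              # case 2: typeop over sub-schemas
        isinstance(s.type, Composite)
        and s.type.typeop in TYPEOPS
        and len(s.type.parts) >= 1
        and all(is_schema(p) for p in s.type.parts)
    )


book = Schema(
    "book",
    Composite("sequence", (Schema("title", "string"),
                           Schema("author", "string", "+"))),
    "*",
)
```

Here `book` models a repeatable element whose content is a sequence of one `title` and one or more `author` children; the element and type names are invented for the example.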
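The $\mathit{firstcomp}$, $\mathit{secondcomp}$ and $\mathit{domain}$ functions used in the previous definition have the obvious semantics. Notice that

$$\mathit{domain}(\mathit{secondcomp}(S)) = \begin{cases} \mathit{valset}(\mathit{atomictype}), & \text{if } \mathit{secondcomp}(S) = \mathit{atomictype} \\ \bigcup_{i=1}^{n} \mathit{domain}(S_i), & \text{otherwise} \end{cases}$$

Definition 3. Let $R$ and $Q$ be xdm-element schemas. We say that $R$ and $Q$ are similar if
1. $\mathit{secondcomp}(R) = \mathit{atomictype}_1$, $\mathit{secondcomp}(Q) = \mathit{atomictype}_2$, and $\mathit{atomictype}_1 = \mathit{atomictype}_2$, or
2. $\mathit{secondcomp}(R) = \mathit{typeop}(R_1, R_2, \ldots, R_n)$, $\mathit{secondcomp}(Q) = \mathit{typeop}(Q_1, Q_2, \ldots, Q_n)$, and $R_i$, $Q_i$ are similar, $1 \le i \le n$.

Definition 4. Let $R$ and $Q$ be xdm-element schemas. We say that $R$ is a subschema of $Q$ ($R \subseteq Q$) if
1. $\mathit{firstcomp}(R) = \mathit{name}_1$, $\mathit{secondcomp}(R) = \mathit{atomictype}_1$, $\mathit{firstcomp}(Q) = \mathit{name}_2$, $\mathit{secondcomp}(Q) = \mathit{atomictype}_2$, and $\mathit{name}_1 = \mathit{name}_2$, $\mathit{atomictype}_1 = \mathit{atomictype}_2$, or
2. $\mathit{firstcomp}(R) = \mathit{name}_1$, $\mathit{secondcomp}(R) = \mathit{typeop}(R_1, R_2, \ldots, R_k)$, $\mathit{firstcomp}(Q) = \mathit{name}_2$, $\mathit{secondcomp}(Q) = \mathit{typeop}(Q_1, Q_2, \ldots, Q_m)$, $\mathit{name}_1 = \mathit{name}_2$, and $\forall i \in [1, k]\ \exists j \in [1, m]$ such that $R_i \subseteq Q_j$.

Definition 5. Let $r$ and $q$ be xdm-elements with similar schemas $R$ and $Q$, respectively $^{3}$. We say that $r$ and $q$ are equal if
1. $\mathit{secondcomp}(R) = \mathit{secondcomp}(Q) = \mathit{atomictype}$, and $\mathit{content}(r) = \mathit{content}(q)$, or
2. $\mathit{secondcomp}(R) = \mathit{typeop}(R_1, R_2, \ldots, R_n)$, $\mathit{secondcomp}(Q) = \mathit{typeop}(Q_1, Q_2, \ldots, Q_n)$, and:
   a) $\mathit{typeop} = \mathit{sequence}$, $\forall i \in [1, n]$ $\mathit{firstcomp}(R_i) = \mathit{firstcomp}(Q_i)$, and $r_i$ and $q_i$ are equal xdm-elements with similar schemas $R_i$ and $Q_i$, respectively;
   b) $\mathit{typeop} = \mathit{all}$, $\forall i \in [1, n]\ \exists j \in [1, n]$ $\mathit{firstcomp}(R_i) = \mathit{firstcomp}(Q_j)$, and $r_i$ and $q_j$ are equal xdm-elements with similar schemas $R_i$ and $Q_j$, respectively;
   c) $\mathit{typeop} = \mathit{choice}$, and there is a unique $i \in [1, n]$ such that the following holds for some unique $j \in [1, n]$: $\mathit{firstcomp}(R_i) = \mathit{firstcomp}(Q_j)$, and $r_i$ and $q_j$ are equal xdm-elements with similar schemas $R_i$ and $Q_j$, respectively.

Concatenation. The concatenation of the xdm-elements $r = \langle \mathit{name}_r, r_1, r_2, \ldots, r_k \rangle$ and $q = \langle \mathit{name}_q, q_1, q_2, \ldots, q_m \rangle$ is an xdm-element defined as follows:

$$rq = \langle rq, r_1, r_2, \ldots, r_k, q_1, q_2, \ldots, q_m \rangle$$

Set-theoretic operations.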
In the definition of the set-theoretic operations union, intersection and difference it is assumed that the schemas of the operands are similar. Let $r$ and $q$ be xdm-elements with similar schemas $R$ and $Q$, respectively $^{4}$. The union, intersection and difference of the xdm-elements $r$ and $q$ are the xdm-elements defined as follows:

$$r \cup q = \langle r \cup q, \{\!\{\, t \mid t \in r \lor t \in q \,\}\!\} \rangle$$
$$r \cap q = \langle r \cap q, \{\!\{\, t \mid t \in r \land t \in q \,\}\!\} \rangle$$
$$r - q = \langle r - q, \{\!\{\, t \mid t \in r \land t \notin q \,\}\!\} \rangle$$

Notice that union and intersection are commutative and associative operations.

$^{3}$ Without loss of generality it is assumed that an xdm-element schema is a pair of the form $\langle \mathit{name}, \mathit{type} \rangle$.
$^{4}$ We will use $\{\!\{\,\}\!\}$ to signify a multiset, $\{\,\}$ to denote a set, while $[\,]$ symbolizes a list.

Cartesian Product. Let $r$ and $q$ be xdm-elements with schemas $R$ and $Q$, respectively. The Cartesian product of the xdm-elements $r$ and $q$ is an xdm-element defined as follows:

$$r \times q = \langle r \times q, \{\!\{\, ts \mid t \in r \land s \in q \,\}\!\} \rangle$$

Selection. Let $r$ be an xdm-element with schema $R$, and let $P$ be a predicate. The result of selecting from $r$ by $P$ is an xdm-element defined as follows:

$$\sigma_P(r) = \langle \sigma_P(r), \{\!\{\, t \mid t \in r \land P(t) \,\}\!\} \rangle$$

Projection. Let $r$ be an xdm-element with schema $R$ and let $\pi_L(r)$ be a projection operation, where $L$ is a list of elements. For simplicity let us assume $L = [A, E \to Z, X \to Y]$ ($A, X \in R$); then the result of the projection operation is an xdm-element defined as follows:

$$\pi_L(r) = \langle \pi_L(r), \{\!\{\, \langle \mathit{name}, t[A]\, Z\, t[Y] \rangle \mid t \in r \land Y = X \land Z := E \,\}\!\} \rangle$$

Natural Joins. Let $r$ and $q$ be xdm-elements with schemas $R$ and $Q$, respectively, such that $R \not\subseteq Q$, $Q \not\subseteq R$ and $R \cap Q \neq \emptyset$. The natural join of the xdm-elements $r$ and $q$ is an xdm-element defined as follows:

$$r \bowtie q = \langle r \bowtie q, \{\!\{\, t\,s[\bar{L}] \mid t \in r \land s \in q \land t[L] = s[L] \,\}\!\} \rangle, \quad \text{where } L = R \cap Q,\ \bar{L} = Q - L$$

Grouping. Let $r$ be an xdm-element with schema $R$ and let $\gamma_L(r)$ be a grouping operation, where $L$ is a list of elements. For simplicity let us assume $L = [A, f(B) \to C]$ ($A, B \in R$, $f \in \{\mathit{min}, \mathit{max}, \mathit{sum}, \mathit{count}, \mathit{average}\}$); then the result of the grouping operation is an xdm-element defined as follows:

$$\gamma_L(r) = \langle \gamma_L(r), \{\, t[A]\,s \mid t \in r \land s = \langle C, f(\pi_B(\sigma_{A = t[A]}(r))) \rangle \,\} \rangle$$

Notice that our algebra also includes the conventional theta joins, duplicate elimination, division, renaming and sorting operations, and aggregate functions.

4 Algebraic Expressions as Mappings

We will use $\mathit{Exp}$ and $\mathit{schema}(\mathit{Exp})$ to signify an algebraic expression and its schema, respectively. Let us define the following three operations on the occurrence indicators:

[Tables defining the three operations on the occurrence indicators $?$, $*$, $+$, $\bot$; the operator symbols and table entries are not recoverable from the source text.]
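The algebraic operations defined in Section 3 can be tried out on a flat toy model. The sketch below is illustrative only and rests on several assumptions that are not in the paper: an xdm-element is modeled as a pair (name, multiset of rows), a row is a tuple of (field, value) pairs, multisets are `collections.Counter` objects, and the usual bag semantics is assumed (multiplicities add under union, take the minimum under intersection, and subtract under difference). All function names are hypothetical.

```python
from collections import Counter


def element(name, rows):
    """Toy xdm-element: a name plus a multiset of rows given as dicts."""
    return (name, Counter(tuple(row.items()) for row in rows))


def union(r, q):
    return (f"({r[0]} ∪ {q[0]})", r[1] + q[1])


def intersection(r, q):
    return (f"({r[0]} ∩ {q[0]})", r[1] & q[1])


def difference(r, q):
    return (f"({r[0]} - {q[0]})", r[1] - q[1])


def product(r, q):
    """Cartesian product: concatenate every row of r with every row of q."""
    out = Counter()
    for t, m in r[1].items():
        for s, k in q[1].items():
            out[t + s] += m * k
    return (f"({r[0]} × {q[0]})", out)


def select(pred, r):
    """Selection: keep the rows t for which pred(t) holds."""
    kept = Counter({t: m for t, m in r[1].items() if pred(dict(t))})
    return (f"σ({r[0]})", kept)


def natural_join(r, q):
    """Natural join: rows agreeing on the common fields L, extended by Q - L."""
    out = Counter()
    for t, m in r[1].items():
        for s, k in q[1].items():
            td, sd = dict(t), dict(s)
            common = td.keys() & sd.keys()
            if all(td[a] == sd[a] for a in common):
                rest = tuple((a, v) for a, v in s if a not in common)
                out[t + rest] += m * k
    return (f"({r[0]} ⋈ {q[0]})", out)


r = element("r", [{"a": 1, "b": "x"}, {"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
q = element("q", [{"a": 1, "b": "x"}])
s = element("s", [{"a": 1, "c": 10}, {"a": 3, "c": 30}])
```

Every operation returns a pair whose first component is built from the operand names, mirroring the way the definitions above use the expression itself as the name of the result. The sketch does not check the side conditions of the definitions (similar schemas, non-empty common part for the join).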
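The following recursive rules are used to define $\mathit{schema}(\mathit{Exp})$:

r1. If $\mathit{Exp} = r$, where $r$ is an xdm-element with schema $R$, then $\mathit{schema}(\mathit{Exp}) = \langle \mathit{Exp}, \mathit{secondcomp}(R), \mathit{thirdcomp}(R) \rangle$;

r2. If $\mathit{Exp} = \mathit{Exp}_1 \cup \mathit{Exp}_2$, or $\mathit{Exp} = \mathit{Exp}_1 \cap \mathit{Exp}_2$, or $\mathit{Exp} = \mathit{Exp}_1 - \mathit{Exp}_2$, then
   a) if $\mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_1)) = \mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_2)) = \mathit{atomictype}$, then $\mathit{schema}(\mathit{Exp}) = \langle \mathit{Exp}, \mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_1)), \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_1))\ \mathit{Op}\ \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_2)) \rangle$, where $\mathit{Op}$ is the operation on occurrence indicators defined above that corresponds to $\cup$, $\cap$ or $-$, according to the form of $\mathit{Exp}$;
   b) if $\mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_1)) = \mathit{typeop}(\mathit{schema}(\mathit{Exp}_1^1), \mathit{schema}(\mathit{Exp}_1^2), \ldots, \mathit{schema}(\mathit{Exp}_1^n))$ and $\mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_2)) = \mathit{typeop}(\mathit{schema}(\mathit{Exp}_2^1), \mathit{schema}(\mathit{Exp}_2^2), \ldots, \mathit{schema}(\mathit{Exp}_2^n))$, then $\mathit{schema}(\mathit{Exp}) = \langle \mathit{Exp}, \mathit{typeop}(\mathit{schema}(\mathit{Exp}_3^1), \mathit{schema}(\mathit{Exp}_3^2), \ldots, \mathit{schema}(\mathit{Exp}_3^n)), \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_1))\ \mathit{Op}\ \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_2)) \rangle$, where $\forall i \in [1, n]$ $\mathit{schema}(\mathit{Exp}_3^i) = \langle \mathit{Exp}_3^i, \mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_1^i)), \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_1^i))\ \mathit{Op}\ \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_2^i)) \rangle$;

r3. If $\mathit{Exp} = \mathit{Exp}_1 \times \mathit{Exp}_2$, then $\mathit{schema}(\mathit{Exp}) = \langle \mathit{Exp}, \mathit{sequence}(\mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_1)), \mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_2))), \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_1)) \circ \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_2)) \rangle$, where $\circ$ is the corresponding operation on occurrence indicators defined above;

r4. If $\mathit{Exp} = \sigma_P(\mathit{Exp}_1)$, then $\mathit{schema}(\mathit{Exp}) = \langle \mathit{Exp}, \mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_1)), \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_1)) \rangle$;

r5. If $\mathit{Exp} = \pi_L(\mathit{Exp}_1)$ (in the general case $L = \mathit{sequence}(L_1, L_2, L_3)$, where $L_1 \subseteq \mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_1))$, $L_2$ is a list of renamed xdm-elements, and $L_3$ is a list of derived xdm-elements), then $\mathit{schema}(\mathit{Exp}) = \langle \mathit{Exp}, L, \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_1)) \rangle$;

r6. If $\mathit{Exp} = \mathit{Exp}_1 \bowtie \mathit{Exp}_2$, then $\mathit{schema}(\mathit{Exp}) = \langle \mathit{Exp}, \mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_1)) \cup \mathit{secondcomp}(\mathit{schema}(\mathit{Exp}_2)), (\mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_1)) \circ \mathit{thirdcomp}(\mathit{schema}(\mathit{Exp}_2))) \rangle$, where $\circ$ is the corresponding operation on occurrence indicators defined above.

If the xdm-elements $r_1, r_2, \ldots, r_n$ with schemas $R_1, R_2, \ldots, R_n$ are used in $\mathit{Exp}$, then $\mathit{Exp}$ is defined by the following mapping:

$$\mathit{Exp} : \mathit{Coll}(R_1) \times \mathit{Coll}(R_2) \times \ldots \times \mathit{Coll}(R_n) \to \mathit{Coll}(\mathit{schema}(\mathit{Exp})),$$

where $\mathit{Coll}(R)$ is the collection of all xdm-elements with schema $R$.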
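The recursive shape of these rules can be sketched for the cases in which the third component is inherited unchanged, namely r1, r4 and r5. The Python fragment below is a minimal illustration under stated assumptions: expressions are encoded as nested tuples, schemas as triples (name, type, indicator), and the function name `schema_of` and the tuple tags are hypothetical. Rules r2, r3 and r6 additionally need the operations on occurrence indicators and are deliberately left out.

```python
def schema_of(exp):
    """Infer the result schema of a toy expression (rules r1, r4, r5 only)."""
    kind = exp[0]
    if kind == "elem":
        # r1: ("elem", name, type, f) is an xdm-element with a known schema.
        _, name, typ, f = exp
        return (name, typ, f)
    if kind == "select":
        # r4: ("select", predicate_text, sub); type and indicator are inherited.
        _, pred, sub = exp
        sub_name, sub_type, sub_f = schema_of(sub)
        return (f"σ[{pred}]({sub_name})", sub_type, sub_f)
    if kind == "project":
        # r5: ("project", L, sub); the type becomes the list L, indicator inherited.
        _, fields, sub = exp
        sub_name, _, sub_f = schema_of(sub)
        return (f"π{list(fields)}({sub_name})", ("sequence", tuple(fields)), sub_f)
    raise NotImplementedError(f"rule for {kind!r} is not sketched here")


books = ("elem", "books", ("sequence", ("title", "price")), "*")
cheap = ("select", "price<10", books)
titles = ("project", ("title",), cheap)
```

Calling `schema_of(titles)` walks the expression bottom-up, which is the same order in which the rules above would be applied by hand.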
5 Element Calculus Reduction to Element Algebra

An expression in the element calculus has the following form $^{5}$: $\{x_1 x_2 \ldots x_k \mid \psi(x_1, x_2, \ldots, x_k)\}$, where $\psi$ is a formula with free variables $x_1, x_2, \ldots, x_k$. Let $s$ be an xdm-element of schema $S$, and let $E(S)$ be defined as follows:

$$E(S) = \pi_1(s) \cup \pi_2(s) \cup \ldots \cup \pi_n(s).$$

If $s_1, s_2, \ldots, s_n$ occur in $\psi$, then

$$\mathit{DOM}(\psi) = E(S_1) \cup E(S_2) \cup \ldots \cup E(S_n) \cup \{\alpha_1, \alpha_2, \ldots, \alpha_n\},$$

where $\forall i \in [1, n]$ $\alpha_i$ is a constant of $\psi$. Let $E$ be an expression of the element algebra defined as follows: $E : \mathit{DOM}(\psi) \to \mathit{DOM}(\psi)$.

$^{5}$ This section is based on similar facts and techniques of the theory of relational databases [19, 20].

Let us prove that for each safe expression of the element calculus an equivalent expression of the element algebra exists. To this end, for each subformula $\omega$ of $\psi$ we recursively define the expression of the element algebra equivalent to $\{y_1 y_2 \ldots y_m \mid \omega(y_1, y_2, \ldots, y_m)\}$. Notice that the safety of an expression of the element calculus does not imply the safety of the subformulas of its formula. Therefore we search for the expression of the element algebra equivalent to $(\mathit{DOM}(\psi))^m \cap \{y_1 y_2 \ldots y_m \mid \omega(y_1, y_2, \ldots, y_m)\}$, where $\omega$ is a subformula of $\psi$ and $D^m$ means $D \times D \times \ldots \times D$ ($m$ times).

Case 1. If the subformula $\omega$ is an atom of type $x \theta y$, $x \theta x$, $x \theta c$ or $c \theta x$ (where $x$ and $y$ are element calculus variables, $c$ is a constant, and $\theta \in \{=, \neq, >, \geq, <, \leq\}$), then $\sigma_{x \theta y}(E \times E)$, $\sigma_{x \theta x}(E)$, $\sigma_{x \theta c}(E)$ and $\sigma_{c \theta x}(E)$, respectively, are equivalent expressions of the element algebra.

Case 2. The subformula $\omega$ is an atom of type $(pe)(x_1, x_2, \ldots, x_l)$, where $\forall i \in [1, l]$ $x_i$ is an element calculus variable, and $(pe)$ is the result of the path expression [21] $pe$ converted to a multiset. Notice that a path expression is a sequence of steps defined as follows: $[/ \mid //]\mathit{Step}_1 [/ \mid //]\mathit{Step}_2 [/ \mid //] \ldots [/ \mid //]\mathit{Step}_n$. The equivalent expression of the element algebra for the path expression is created by the following recurrence relation:

$$\mathit{Step}_{i+1} = \begin{cases} \pi_{L_i}(\mathit{Step}_i), & \text{if no condition is given} \\ \pi_{L_i}(\sigma_P(\mathit{Step}_i)), & \text{if } P \text{ is a predicate} \\ \pi_{L_i}(\sigma_P(\gamma_{L^{1}}(\mathit{Step}_i))), & \text{if the condition is given by an aggregate function} \end{cases}$$

Here $i = 0, 1, \ldots, n-1$, $\mathit{Step}_0$ is the initial xdm-element, $L_i$ is the resulting list of both the xdm-elements and the attributes in step $i+1$, and $L^{1}$ = grouping xdm-element/attribute + aggregate function xdm-element. If $ae$ is the equivalent expression of our algebra for the path expression, then $\pi_L(ae)$ is an equivalent expression of the element algebra for $(pe)(x_1, x_2, \ldots, x_l)$, where $L = [x_1, x_2, \ldots, x_l]$.

Case 3. $\omega(y_1, y_2, \ldots, y_m) = \neg\omega_1(y_1, y_2, \ldots, y_m)$.
If $E_1$ is an equivalent expression of the element algebra for $(\mathit{DOM}(\psi))^m \cap \{y_1 y_2 \ldots y_m \mid \omega_1(y_1, y_2, \ldots, y_m)\}$, then $E^m - E_1$ is an equivalent expression of the element algebra for $(\mathit{DOM}(\psi))^m - \{y_1 y_2 \ldots y_m \mid \omega_1(y_1, y_2, \ldots, y_m)\}$, which is equivalent to $(\mathit{DOM}(\psi))^m \cap \{y_1 y_2 \ldots y_m \mid \neg\omega_1(y_1, y_2, \ldots, y_m)\}$.

Case 4. $\omega(y_1, y_2, \ldots, y_m) = \omega_1(u_1, u_2, \ldots, u_n) \lor \omega_2(v_1, v_2, \ldots, v_l)$. Let $E_\omega$ be an equivalent expression of the element algebra for $(\mathit{DOM}(\psi))^m \cap \{y_1 y_2 \ldots y_m \mid \omega(y_1, y_2, \ldots, y_m)\}$. If $E_{\omega_1}$ and $E_{\omega_2}$ are equivalent expressions of the element algebra for $(\mathit{DOM}(\psi))^n \cap \{u_1 u_2 \ldots u_n \mid \omega_1(u_1, u_2, \ldots, u_n)\}$ and $(\mathit{DOM}(\psi))^l \cap \{v_1 v_2 \ldots v_l \mid \omega_2(v_1, v_2, \ldots, v_l)\}$, respectively, then

$$E_\omega = \pi_{y_1, y_2, \ldots, y_m}(E_{\omega_1} \times E^{m-n}) \cup \pi_{y_1, y_2, \ldots, y_m}(E_{\omega_2} \times E^{m-l}).$$

Case 5. $\omega(y_1, y_2, \ldots, y_m) = (\exists y_{m+1})\omega_1(y_1, y_2, \ldots, y_{m+1})$. Let $E_1$ be an equivalent expression of the element algebra for $(\mathit{DOM}(\psi))^{m+1} \cap \{y_1 y_2 \ldots y_{m+1} \mid \omega_1(y_1, y_2, \ldots, y_{m+1})\}$. Since $\psi$ is a safe formula, $y_{m+1} \in \mathit{DOM}(\psi)$. Thus $\pi_{y_1, y_2, \ldots, y_m}(E_1)$ is an equivalent expression of the element algebra for $(\mathit{DOM}(\psi))^m \cap \{y_1 y_2 \ldots y_m \mid (\exists y_{m+1})\omega_1(y_1, y_2, \ldots, y_{m+1})\}$.

It is easy to see that equivalent algebraic expressions for the formulas not considered here can be created analogously.
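The case analysis above can be exercised on a flat toy model. The following Python sketch is an illustration under stated assumptions rather than the construction for xdm-elements: values are plain integers, $\mathit{DOM}(\psi)$ is a finite Python set `dom`, results are ordinary sets of tuples instead of multisets, formulas are nested tuples, variables are strings and constants are non-strings. The names `fv`, `pad` and `translate` are hypothetical. Path-expression atoms (Case 2) are not covered.

```python
import operator
from itertools import product

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def fv(f):
    """Free variables of a formula, in first-occurrence order."""
    kind = f[0]
    if kind == "atom":                      # ("atom", a, op, b)
        out = []
        for a in (f[1], f[3]):
            if isinstance(a, str) and a not in out:
                out.append(a)
        return tuple(out)
    if kind == "not":                       # ("not", f1)
        return fv(f[1])
    if kind == "or":                        # ("or", f1, f2)
        left = fv(f[1])
        return left + tuple(v for v in fv(f[2]) if v not in left)
    if kind == "exists":                    # ("exists", var, f1)
        return tuple(v for v in fv(f[2]) if v != f[1])
    raise ValueError(kind)


def pad(rel, us, ys, dom):
    """Case 4 helper: project (rel x dom^(m-n)) onto the variable order ys."""
    extra = [y for y in ys if y not in us]
    out = set()
    for t in rel:
        for e in product(dom, repeat=len(extra)):
            env = dict(zip(us, t))
            env.update(zip(extra, e))
            out.add(tuple(env[y] for y in ys))
    return out


def translate(f, dom):
    """Return (ys, rel) with rel = dom^m intersected with {ys | f}."""
    ys, kind = fv(f), f[0]
    if kind == "atom":                      # Case 1: selection over dom^m
        _, a, op, b = f
        rel = set()
        for t in product(dom, repeat=len(ys)):
            env = dict(zip(ys, t))
            if OPS[op](env.get(a, a), env.get(b, b)):
                rel.add(t)
        return ys, rel
    if kind == "not":                       # Case 3: dom^m minus E_1
        _, r1 = translate(f[1], dom)
        return ys, set(product(dom, repeat=len(ys))) - r1
    if kind == "or":                        # Case 4: union of padded operands
        us, r1 = translate(f[1], dom)
        vs, r2 = translate(f[2], dom)
        return ys, pad(r1, us, ys, dom) | pad(r2, vs, ys, dom)
    if kind == "exists":                    # Case 5: project the variable away
        us, r1 = translate(f[2], dom)
        return ys, {tuple(dict(zip(us, t))[y] for y in ys) for t in r1}
    raise ValueError(kind)


dom = {1, 2, 3}
has_larger = ("exists", "y", ("atom", "x", "<", "y"))
```

For instance, `translate(has_larger, dom)` evaluates the body over pairs drawn from `dom` and then projects onto `x`, following Case 5. Restricting every intermediate result to a power of `dom` is what keeps negation finite, as in Case 3.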
6 Conclusion

An element algebra supporting the element calculus is proposed. The xdm-elements are the inputs and outputs of the suggested algebra. Formal definitions of xdm-element schema, xdm-element with a given schema, similar schemas, subschemas and equal xdm-elements are given. Based on these definitions, an element algebra is formally defined. The equivalence rules for algebraic expressions are presented. The algebraic expressions are considered as mappings. Inference rules for the resulting schemas of algebraic expressions are offered. The equivalence of the element calculus and the element algebra is proved. Finally, our approach to XML algebra allows relational optimization techniques to be applied.

References

1. Garcia-Molina, H., Ullman, J., Widom, J.: Database Systems: The Complete Book. Prentice Hall (2002)
2. Biron, P., Malhotra, A.: XML Schema Part 2: Datatypes, http://www.w3.org. (2001)
3. Thompson, H., Beech, D., et al.: XML Schema Part 1: Structures, http://www.w3.org. (2001)
4. Fernandez, M., Robie, J.: XML Query Data Model, http://www.w3.org. (2001)
5. Fankhauser, P., et al.: The XML Query Algebra, http://www.w3.org. (2001)
6. Chamberlin, D., Florescu, D., Robie, J., Simeon, J., Stefanescu, M.: XQuery: A Query Language for XML, http://www.w3.org. (2001)
7. Date, C.: An Introduction to Database Systems. Addison-Wesley (2004)
8. Manukyan, M.: Extensible data model. In: ADBIS. (2008)
9. Beech, D., Malhotra, A., Rys, M.: A formal data model and algebra for XML, Communication W3C. (1999)
10. Bekai, A.E., Rossiter, N.: A tree based algebra framework for XML data systems. In: ICEIS. (2005)
11. Christophides, V., Cluet, S., Simeon, J.: On wrapping query languages and efficient XML integration. In: ACM SIGMOD Conference on Management of Data. (2000)
12. Fernandez, M., Simeon, J., Wadler, P.: A semi-monad for semi-structured data. In: ICDT. (2001)
13. Jagadish, H., et al.: TAX: A tree algebra for XML. In: DBPL. (2001)
14. Jagadish, H., et al.: Timber: A native XML database. VLDB (2002)
15. McHugh, J., et al.: Lore: A database management system for semi-structured data. SIGMOD (1997)
16. Viglas, S., et al.: Putting XML Query Algebras into Context, http://www.cs.wisc.edu/niagara/publications.html. (2002)
17. Zhang, M., Yao, J.: XML algebra for data mining. In: SPIE. (2004)
18. Codd, E.: A relational model of data for large shared data banks. Communications of the ACM (1970)
19. Maier, D.: The Theory of Relational Databases. Computer Science Press (1983)
20. Ullman, J.: Principles of Database Systems. Computer Science Press (1980)
21. Clark, J., DeRose, S.: XML Path Language, http://www.w3.org. (1999)