Some aspects of references behaviour when querying XML with XQuery

B. Khvostichenko
boris.khv@pobox.spbu.ru

B. Novikov
borisnov@acm.org

Abstract

During XQuery query evaluation, the query output is built as a completely new structure containing copies of all elements that satisfy the query. At the same time, references contained in these data point to the original elements, hence the cross-references between extracted data are not represented in the query output. In this paper, we propose a method for preserving cross-references, thus enriching the tree produced by query evaluation with a graph structure.

1 Introduction

XML is widely considered the most relevant standard for data representation and exchange among applications on the Internet. The popularity of XML is in large part due to its flexibility in representing many kinds of information. As the importance of XML has increased, a series of standards defined by the World Wide Web Consortium (W3C) has grown up around it: XML Schema, XPath, XSLT, etc.

XML data are often quite heterogeneous and distribute their meta-data throughout the document. XML documents may contain many levels of nested elements and can represent missing information simply by the absence of an element. For these reasons several special languages were designed for querying XML data. One of these languages is XQuery [10], which is being developed by the W3C XML Query Working Group.

One can think of an XML document as a tree, but the XML standard allows references between elements to be defined using ID-IDREF pairs. Thus, an XML document turns into a directed graph whose edges represent either a parent-child relationship or an IDREF-ID reference.

This paper describes the problem that arises with XML references when querying XML data with XQuery. It is organized as follows: Section 2 describes the problem in detail; Section 3 reviews other approaches to dealing with references.
Section 4 presents the suggested solution and Section 5 summarizes the topic.

This work was supported by the Russian Foundation for Basic Research under grant 04-01-00173.

Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems SYRCoDIS, St.-Petersburg, Russia, 2004

2 Problem Statement

During query evaluation, XQuery builds a new structure containing copies of all elements that satisfy the query. Because XQuery uses the XPath data model described in [9], it copies IDs and IDREFs as attributes from the original document into the newly created structure. It does not treat references in any special way, considering them ordinary element attributes. This can lead to some tricky situations, described below (we assume that if the original document contains an element that references another through an IDREF, then it also contains an element with the corresponding ID):

1. The ID attribute is copied into the new structure, but the IDREF is not. The problem is that the same ID already exists in the original document.

2. The IDREF attribute is copied into the new structure, but the ID is not (a hanging IDREF). As XML is a semistructured data model, we do not pay attention to the lack of some elements or attributes.

3. Both the ID and the corresponding IDREF are copied into the new structure. This combines the first case with the need to distinguish the original reference edge from the new one.

Problem: Define and implement a modified query evaluation technique for XQuery which is able to handle cross-references between extracted XML elements.

2.1 Example

As an example we use a fragment of a musical encyclopaedia, shown in Figure 1. It stores bands and musicians in one catalog, recording the casts of each band and the instruments that each person can play. Each cast contains references to the musicians who played in it. Thus, different casts may reference the same musician (e.g. Robert Fripp is in every King Crimson cast). Moreover, different bands can reference the same musician (e.g. Greg Lake appeared in both Emerson, Lake and Palmer and King Crimson). Queries to this encyclopaedia, corresponding to the problems described above, are as follows:

A. Select all guitar players (Figure 2; results are in Figure 5).

B. Choose the first cast of King Crimson (Figure 3; results are in Figure 6).

C. Find performers that contain Lake in their name (Figure 4; results are in Figure 7).

    <catalog>
      <band name="Emerson, Lake and Palmer" style="rock" ID="elp">
        <artist IDREF="emerson">K. Emerson</artist>
      </band>
      <band name="King Crimson" style="rock" ID="kingcrimson">
        <cast number="1" years="1969-1970">
          <artist IDREF="giles">M. Giles</artist>
        </cast>
        <cast number="2" years="1970-1972">
          <artist IDREF="collins">M. Collins</artist>
          <artist IDREF="haskell">G. Haskell</artist>
        </cast>
      </band>
      <musician name="Keith Emerson" ID="emerson">
        <instrument>piano</instrument>
        <instrument>organ</instrument>
      </musician>
      <musician name="Greg Lake" ID="lake">
      </musician>
      <musician name="Carl Palmer" ID="palmer">
        <instrument>percussion</instrument>
      </musician>
      <musician name="Robert Fripp" ID="fripp">
      </musician>
    </catalog>

Figure 1: Music encyclopaedia (fragment)

    for $i in document("catalog.xml")/catalog/musician
    where $i/instrument = "Guitar"
    return
      <guitar-player>
        { $i/@ID }
        <name>{ data($i/@name) }</name>
        { for $j in $i/instrument return $j }
      </guitar-player>

Figure 2: Query A. Select all guitar players

    for $i in document("catalog.xml")/catalog/band[@name = "King Crimson"]
    return
      for $j in $i/cast[@number = "1"]
      return $j

Figure 3: Query B. Choose first cast of King Crimson

3 Related Work

3.1 XQuery

As mentioned above, this problem arises with the XQuery query language [10] and the XPath data model [9]. XQuery offers no special mechanism for preserving IDREF references in constructed results. Moreover, in the XPath data model every element has an intrinsic identity, but this internal identity exists independently of the ID given by the user.
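The gap between node identity and user-assigned ID can be made concrete with a small Python sketch that mimics result construction by deep-copying elements with the standard library's xml.etree.ElementTree. It is only an analogy: Python object identity stands in for the intrinsic identity of the XPath data model. The two-musician catalog, the Guitar instrument values, the result wrapper element and the helper duplicate_ids are assumptions made for illustration and do not come from the paper.

```python
import copy
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-musician catalog; the instrument values are assumed.
CATALOG = """
<catalog>
  <musician name="Greg Lake" ID="lake"><instrument>Guitar</instrument></musician>
  <musician name="Robert Fripp" ID="fripp"><instrument>Guitar</instrument></musician>
</catalog>
"""


def duplicate_ids(*roots):
    """Return, sorted, the ID values occurring more than once across the trees."""
    counts = Counter(
        el.get("ID")
        for root in roots
        for el in root.iter()
        if el.get("ID") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


original = ET.fromstring(CATALOG)

# Mimic query output construction: every selected element is copied.
result = ET.Element("result")  # hypothetical wrapper for the result sequence
for musician in original.findall("musician"):
    result.append(copy.deepcopy(musician))

lake_original = original.find("musician[@ID='lake']")
lake_copy = result.find("musician[@ID='lake']")

# Distinct nodes, identical user-level ID.
print(lake_copy is lake_original)
print(lake_copy.get("ID") == lake_original.get("ID"))
print(duplicate_ids(original, result))
```

Taken alone, the original tree has no duplicate IDs. The clash only appears once the original and the constructed result are considered together, which is the situation of case 1 in Section 2.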

    for $i in document("catalog.xml")//*
    where contains(data($i/@name), "Lake")
    return $i

Figure 4: Query C. Find performers that contain Lake in their name

    <guitar-player ID="lake">
      <name>Greg Lake</name>
    </guitar-player>
    <guitar-player ID="fripp">
      <name>Robert Fripp</name>
    </guitar-player>

Figure 5: Query A evaluation result

3.2 Lorel

Lorel [1] is a query language for the Lore [6] database system for semistructured data. Originally, Lore was built on its own data model, called OEM. A substantial difference between OEM and XML is that OEM represents a graph with directed edges, while XML represents a tree. Every object in OEM has a unique identifier and can be referenced by this identifier. Thus, this problem does not exist in the OEM data model. After migrating Lore from OEM to XML [5], the developers paid attention to this difference and introduced a new XML-based Lore data model in which the user can choose how to interpret IDREFs (as graph edges or as text attributes).

3.3 ODMG

ODMG proposed an Object Data Model and an Object Query Language in its standard ODMG 3.0 [3]. They allow objects to reference each other, but the ODMG data model is strictly typed and has integrity constraints. They treat internal references as graph edges and do not have the problems described in Section 2.

It was suggested [2] to use XML as an exchange format between applications that comply with the ODMG standard. However, this proposal does not use the IDREF-ID mechanism for referencing within an XML document; it uses a special structure to encode relationships. XML is used as an intermediate language to transfer objects from one system to another, so there is no need to query these data.

4 Proposed Solution

The proposed solution to the problems described in Section 2 consists of treating IDREF-ID references as graph edges, with respect to the XML data model and validity constraints.

    <cast number="1" years="1969-1970">
      <artist IDREF="giles">M. Giles</artist>
    </cast>

Figure 6: Query B evaluation result

    <band name="Emerson, Lake and Palmer" style="rock" ID="elp">
      <artist IDREF="emerson">K. Emerson</artist>
    </band>
    <musician name="Greg Lake" ID="lake">
    </musician>

Figure 7: Query C evaluation result

4.1 Copied ID (without corresponding IDREF)

The problem here relates to the ID validity constraint of XML [8]. If the original document is used together with the created one later in query processing, ID ambiguity appears. The proposed solution is to copy the ID attribute from the original element and change its value (to comply with the ID validity constraint). To retain access to the original element, we suggest creating an IDREF attribute in the newly created element whose value equals the ID of the original element. Thus, whenever one needs to reach the original element from the new document, one traverses two graph edges (to the created element and then to the original one).

4.2 Copied IDREF (without corresponding ID)

This is not a problem, because XML is a semistructured data model and can be incomplete by its nature. One can consider this hanging IDREF in two ways:

1. A reference to the element from the original document (if that document is used afterwards).

2. A true hanging reference, when the original document is not used further in query processing.

4.3 Copied both ID and corresponding IDREF

This problem also relates to the ID validity constraint, but requires some more insight. When an entire edge is copied from one graph to another, both elements might not be the same as they were in the original document. Moreover, the type of both elements may change (and XQuery is sensitive to element types). We therefore suggest proceeding as follows. For the referenced element, its ID is changed and an IDREF reference to the original element is created. For the referencing element, the IDREF value is changed to the new ID value of the referenced element. This solves the ID validity problem and still allows the originally referenced element to be accessed if needed.

4.4 Implementation

The proposed implementation is straightforward. First, the XQuery processor evaluates the query result as a temporary document. Second, it searches this document for IDs that were extracted from the original document and creates a dictionary of changes, in which the original ID is the key and the new ID is the value. This dictionary must guarantee the uniqueness of all IDs. Third, the query processor traverses the temporary document once again and does the following:

1. If an element has an ID that is to be replaced, the ID is replaced with the one taken from the dictionary and an IDREF attribute is added with the value of the original ID (a reference to the original element).

2. If an element has an IDREF referencing an ID that is subject to change, the IDREF is changed to the appropriate value (taken from the dictionary).

    <guitar-player ID="lake-01" IDREF="lake">
      <name>Greg Lake</name>
    </guitar-player>
    <guitar-player ID="fripp-01" IDREF="fripp">
      <name>Robert Fripp</name>
    </guitar-player>

Figure 8: Query A updated evaluation result

    <cast number="1" years="1969-1970">
      <artist IDREF="giles">M. Giles</artist>
    </cast>

Figure 9: Query B updated evaluation result (no changes)

    <band name="Emerson, Lake and Palmer" style="rock" ID="elp-01" IDREF="elp">
      <artist IDREF="emerson">K. Emerson</artist>
      <artist IDREF="lake-01">G. Lake</artist>
    </band>
    <musician name="Greg Lake" ID="lake-01" IDREF="lake">
    </musician>

Figure 10: Query C updated evaluation result

4.5 Example

Let us now show how our proposal is reflected in the examples given in Section 2.
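As a runnable companion to these examples, the two-pass procedure of Section 4.4 can be sketched in Python roughly as follows, applied to a hand-written stand-in for the Query C result. This is a minimal sketch under stated assumptions rather than the authors' implementation. The function names build_id_map and rewrite_references, the result wrapper element and the two-digit postfix format are all hypothetical. The set of original IDs is assumed to be known to the processor and is here copied by hand from Figure 1. Each element is assumed to carry at most one attribute named ID and one named IDREF.

```python
import xml.etree.ElementTree as ET

# Stand-in for the temporary document produced by Query C.
QUERY_C_RESULT = """
<result>
  <band name="Emerson, Lake and Palmer" style="rock" ID="elp">
    <artist IDREF="emerson">K. Emerson</artist>
    <artist IDREF="lake">G. Lake</artist>
  </band>
  <musician name="Greg Lake" ID="lake"/>
</result>
"""

# IDs present in the original document (taken from Figure 1).
ORIGINAL_IDS = {"elp", "kingcrimson", "emerson", "lake", "palmer", "fripp"}


def build_id_map(result_root, original_ids):
    """Pass 1: map every copied original ID to a fresh, unused ID."""
    taken = set(original_ids)
    taken.update(el.get("ID") for el in result_root.iter() if el.get("ID"))
    mapping = {}
    for el in result_root.iter():
        old = el.get("ID")
        if old in original_ids and old not in mapping:
            n = 1
            while f"{old}-{n:02d}" in taken:  # keep all IDs unique
                n += 1
            mapping[old] = f"{old}-{n:02d}"
            taken.add(mapping[old])
    return mapping


def rewrite_references(result_root, mapping):
    """Pass 2: rename copied IDs and redirect IDREFs that pointed at them."""
    for el in result_root.iter():
        old_id = el.get("ID")
        ref = el.get("IDREF")
        if old_id in mapping:
            el.set("ID", mapping[old_id])
            el.set("IDREF", old_id)  # back-reference to the original element
        elif ref in mapping:
            el.set("IDREF", mapping[ref])
    return result_root


root = ET.fromstring(QUERY_C_RESULT)
mapping = build_id_map(root, ORIGINAL_IDS)
rewrite_references(root, mapping)
print(ET.tostring(root, encoding="unicode"))
```

In this sketch the reference to emerson is left untouched, because no element with that ID was copied into the result; it remains a hanging IDREF in the sense of Section 4.2. An element that carries both a copied ID and its own IDREF is not covered by the two rules of Section 4.4. The sketch simply overwrites the IDREF in that case, which is a design shortcut and not a recommendation.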
For simplicity, we use an -N postfix for newly created IDs, where N is a number.

A. The first example shows the problem with ID duplicates. We need to change the IDs of the lake and fripp elements and create references from them to the original elements (Figure 8).

B. The second case is not really a problem, so nothing changes there (Figure 9).

C. The third example combines two problems. We need to solve the problem with the duplicated elp ID (as in the first case) and then solve the problem with the cross-reference between the artist in elp and lake (Figure 10).

5 Conclusion

In this paper we investigated the behaviour of IDREF-ID references in an XML document during XQuery query evaluation. We found a problem with extracting ID and IDREF attributes from the original document into a new one and proposed a solution to this problem that can be applied to the existing XQuery data model.

While our straightforward approach can serve as a solution for the moment, there is another aspect that is not covered in this article. The current version of XML allows the use of XML Namespaces [7], but there are no details on IDREF-ID references between namespaces. Moreover, it is not clear whether an XML element can have different IDs in different namespaces. If it can, one might use namespaces to propose another solution to the problem described in this article. So the question is open and might be a good subject for further investigation.

Furthermore, one can think of optimizing queries using data about types (such as a DTD) [4]. This raises another problem, as reference types might change, and is also a good subject for investigation.

References

[1] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68-88, 1997. Also http://www-db.stanford.edu/pub/papers/lorel96.ps from Stanford DB group on-line publications http://www-db.stanford.edu/pub/.

[2] G. M. Bierman. Using XML as an object interchange format, 2000.

[3] R. G. G. Cattell, Douglas K. Barry, Mark Berler, Jeff Eastman, David Jordan, Craig Russell, Olaf Schadow, Torsten Stanienda, and Fernando Velez. The Object Data Standard: ODMG 3.0. Elsevier Science and Technology Books, 1999.

[4] Chang-Won Park, Jun-Ki Min, and Chin-Wan Chung. Structural function inlining technique for structurally recursive XML queries. In Proc. VLDB 2002, 2002.

[5] R. Goldman, J. McHugh, and J. Widom. From semistructured data to XML: Migrating the Lore data model and query language. In Workshop on the Web and Databases (WebDB '99), pages 25-30, 1999.

[6] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record (ACM Special Interest Group on Management of Data), 26(3):54-66, September 1997.

[7] Namespaces in XML, 1999. http://www.w3.org/tr/rec-xml-names/.

[8] Extensible Markup Language (XML) 1.0 (third edition), 2004. http://www.w3.org/tr/2004/rec-xml-20040204/.

[9] XQuery 1.0 and XPath 2.0 Data Model, 2003. http://www.w3.org/tr/xpath-datamodel/.

[10] XQuery 1.0: An XML Query Language, 2003. http://www.w3.org/tr/xquery/.