Use of XML Schema and XML Query for ENVISAT product data handling

Stéphane Mbaye
stephane.mbaye@gael.fr
GAEL Consultant
Cité Descartes, 8 rue Albert Einstein
77420 Champs-sur-Marne, France

Abstract *

This paper proposes a solution that uses XML and related technologies such as XML Schema, XML Query and XSL-FO to access, transform and render ENVISAT product data. The solution extends the XML technologies so that they can be applied directly to non-XML documents such as ENVISAT products. It defines an extension of XML Schema that describes the physical and logical structure of the products down to the bit-level representation. A dedicated API named DRB (similar to the DOM [10] API specified by the W3C) gives access to the products. DRB supports the latest XML Query language on top of XPath to locate, select and transform any data from one or more ENVISAT products. In addition, the output can be stored in one of the handled formats (including the ENVISAT product formats themselves). As an application of DRB, we introduce the Derby software, a data mining tool for managing and analyzing data. Derby is an integrated environment that exposes the DRB functionality through a graphical user interface. We also present some Earth Observation (EO) applications illustrating the benefits of this concept during the design and operational phases.

1. Introduction

In addition to the definition of the Extensible Markup Language (XML) [1], the World Wide Web Consortium (W3C) develops a large number of related technology specifications. The objective is to provide the Web developer community with a complete and unified set of recommendations gathering all emerging and consolidated concepts in the domain.
Although most of the W3C work remains under development, an increasing number of applications are already available. The space engineering community quickly adopted XML technologies by developing standards such as the Data Entity Dictionary Specification Language¹, the Baseband Data Archive Interchange Format², DIMAP³ or the latest International Metadata Standard⁴. These applications focus on the management of metadata, for which the XML language is suitable.

In contrast, XML does not fit the management of the EO data themselves (i.e. signal data or measurement data). Translating such data to XML is ineffective in most cases. For instance, translating binary encoded data or large datasets to XML character encoding is clearly not worthwhile: such an operation would result in both a dramatic increase in size and a decrease in access performance.

Despite these drawbacks, XML-related technologies such as XML Schema [2] and the XQuery language [4] remain advantageous for the management of EO data. It was therefore necessary to find a way to combine the use of XML-related technologies with the handling of EO data regardless of their encoding format. Building on GAEL Consultant's experience in the handling of EO formats, we have studied and are currently developing a system for this purpose. Its first application is dedicated to the analysis of ENVISAT product data during the maintenance and operational phases of the instrument processors.

* This study has been funded in the framework of contract N 5247/0/I-LG, "Development of ENVISAT LVL0 Analysis and Product Comparison Tool", agreed between the European Space Agency ESA/ESRIN ENVISAT Ground Segment Engineering Division and GAEL Consultant.

¹ The Data Entity Dictionary Specification Language (DEDSL) is a standard developed by the Consultative Committee for Space Data Systems (CCSDS).
² The Baseband Data Archive Interchange Format (CEOS ICF) is a standard developed by the Committee on Earth Observation Satellites (CEOS).
³ DIMAP is the format used for the dissemination of SPOT 5 products.
⁴ The International Metadata Standard (ISO/DIS 19115) is a Draft International Standard developed by ISO Technical Committee 211.

2. Use of XML Schema

The W3C has recently recommended XML Schema [2][3], a language dedicated to the validation of XML documents. In summary, XML Schema is an XML language for describing and constraining the content of XML documents.

2.1. Need for extension

The XML Schema language focuses on the logical structure of XML documents and does not provide any functionality for the description of their physical structure. In order to handle ENVISAT products, it was therefore necessary to extend XML Schema with specific markups. The extension is introduced through a new namespace [9] currently identified by http://www.gael.fr/drb/. The markups defined within this namespace are inserted in the XML Schema document as allowed by the W3C recommendation [2] (see the example below):

    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            xmlns:drb="http://www.gael.fr/drb/">
      <complexType name="MPH">
        <sequence>
          <element name="FileName" type="String">
            <drb:byteoffset>0</drb:byteoffset>
          </element>
        </sequence>
      </complexType>
    </schema>

Figure 1. Example of physical description markup

The extension of XML Schema for the description of ENVISAT products has been preferred to the development of a new language. Thanks to this choice, the same files define both the logical and the physical description of the products, which facilitates the maintenance of the file descriptors and guarantees their integrity.

2.2. New markups

The following list gives the most important markups that have been added to describe the physical structure of the products.

    <drb:byteoffset>   The byte offset of an element.
    <drb:bitoffset>    The bit offset of an element.
    <drb:length>       The bit length of an element.
    <drb:occurrence>   The occurrence count of an element.

Figure 2. Example of new markups

3. DRB, an API for accessing products

3.1. DOM interfaces

The W3C provides a standard programming interface for manipulating XML documents: the Document Object Model (DOM) [10].
It is designed essentially for the management of HTML pages in Web browsers or servers. Even though the DOM definition has been refined several times, it does not yet support XML Schema (in particular Schema types), XPath 2.0 or the XQuery language.

3.2. DRB interfaces

Because the DOM interfaces support neither XML Schema nor XQuery yet, it was necessary to design and develop a new API called the Data Request Broker (DRB). DRB proposes several interfaces that address the limitations listed above. We nevertheless intend to minimize the departure from the DOM specifications.

3.3. DRB implementations

We are currently developing the implementations of the DRB interfaces that are necessary to support ENVISAT products. At the end of the project, the software will be able to browse and query XML documents and binary data files such as ENVISAT products. These implementations are presented in the next sections.

3.4. XML implementation

The DRB API is able to handle XML documents. The XML implementation has not been developed from scratch but consists of a wrapper around the Xerces API provided by the Apache organization. The XML implementation is necessary, for instance, to read the XML Schema documents.

3.5. SDF implementation

The implementation dedicated to the management of ENVISAT products is called Structured Data File (SDF). It uses the XML Schema definitions and the DRB extensions presented previously to locate and extract the ENVISAT product fields.

4. Queries for transformations

4.1. XPath to locate information

XPath [7] is used to locate the data within the documents. For instance, the next example locates all the top of atmosphere radiance values of the first band in an ENVISAT MERIS Level 1 parent product. In this expression, mds stands for Measurement Data Set, mdsr for Measurement Data Set Record and toa_rad for Top of Atmosphere Radiance.

    document("MER_RR__1P")/mds[1]//mdsr/toa_rad

Figure 3. Top of Atmosphere Radiance locator

The node names are not embedded in the ENVISAT products but derived from the XML Schema. Their types as well as their documentation are also retrieved from the same XML Schema definition.

4.2. XQuery

On top of XPath, the XQuery language [4] provides a powerful solution for the selection and transformation of data extracted from the products. Even though XQuery is not yet a W3C recommendation, the basics of its syntax are stable enough to be implemented. Using XQuery, it is possible to access any information from one or more ENVISAT products. The following examples give an overview of what can be done with the XQuery language applied to ENVISAT products.

    /mds[range 2 TO 5]//mdsr[val_qi = 0]

Figure 4. Simple selection

The query presented in Figure 4 extracts the qualified scan lines (i.e. those whose measurement quality indicator equals 0) from bands 2 to 5 of an ENVISAT MERIS product.

    FOR $rec IN /mds[2]//mdsr[range 100 TO 200]
    WHERE $rec/val_qi = 0
    RETURN $rec/toa_rad[range 500 TO 600]

Figure 5. Selection with constraint

The query presented in Figure 5 applies a FLWR expression [4] to a MERIS product. The result is a window extracted from band 2, bounded by columns 500 to 600 and lines 100 to 200. As in Figure 4, only the qualified lines are extracted.

    FOR $rec IN /mds[2]//mdsr[range 100 TO 200]
    IF $rec/val_qi = 0
    THEN RETURN $rec/toa_rad[range 100 TO 500]

Figure 6. Conditional expression

Figure 6 rewrites the query of Figure 5 using the conditional expressions IF and THEN instead of the WHERE clause.

    FOR $current_mds IN //mds
    RETURN count($current_mds/mdsr[val_qi = -1])

Figure 7. Built-in function

The query of Figure 7 uses a built-in function (count() in this example), which simplifies the writing of the query and improves processing performance.

    FOR $p1 IN document("MER_RR")/mds[1]//mdsr,
        $p2 IN document("MER_FR")/mds[1]//mdsr
    WHERE $p1/time = $p2/time
    RETURN count($p1)

Figure 8. Joins

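As a rough illustration of what the join of Figure 8 computes, the following Python sketch pairs the records of two products that share the same time stamp. It does not use DRB: the record layout (dictionaries with a "time" key), the function name join_on_time and the sample values are all hypothetical, and a hash index stands in for the two nested FOR clauses of the query.

```python
from collections import defaultdict


def join_on_time(rr_records, fr_records):
    """Pair records from two products that carry the same time stamp.

    Both arguments are lists of dicts with a "time" key (a hypothetical
    layout standing in for the mdsr nodes of the two products).
    """
    # Index the second product by time so that each lookup is a single
    # dictionary access instead of a scan of the whole list.
    fr_by_time = defaultdict(list)
    for rec in fr_records:
        fr_by_time[rec["time"]].append(rec)

    # Counterpart of the WHERE clause: keep only pairs with equal time.
    return [
        (rr_rec, fr_rec)
        for rr_rec in rr_records
        for fr_rec in fr_by_time.get(rr_rec["time"], [])
    ]


# Made-up sample records: time stamps and radiances are illustrative only.
rr = [
    {"time": 10, "toa_rad": 101},
    {"time": 20, "toa_rad": 102},
    {"time": 30, "toa_rad": 103},
]
fr = [
    {"time": 20, "toa_rad": 202},
    {"time": 30, "toa_rad": 203},
    {"time": 40, "toa_rad": 204},
]

pairs = join_on_time(rr, fr)
print(len(pairs))  # number of synchronized record pairs
```

Indexing one side by time keeps the cost close to linear in the number of records, whereas a literal translation of the two nested FOR clauses would compare every possible pair.
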
The use of joins (i.e. selections from multiple documents with interrelated constraints) is probably one of the most interesting features of queries on ENVISAT products. In Figure 8, the measurement data set records are extracted from two distinct files (a reduced resolution product and a full resolution product). The extracted records are synchronized in time within the FOR loop by means of a WHERE clause. Such a query may be useful to compare or extract values from both a parent and a child product (a child product is an extraction performed on a product at the request of the final users of ENVISAT products).

5. Rendering

The results of the queries can be displayed in several views. As an example, we are developing display modules handling tables, plots, images and reports. The rendering definition is written in the XSL Formatting Objects (XSL-FO) language [5].

Figure 9. Example of rendered plot

In addition, the output of the queries may be saved in one of the supported formats (XML or ENVISAT formats) for further analysis or for exchange with other systems.

6. Derby software

As an application of DRB, we are currently developing the Derby software, a data mining tool for managing and analyzing data. Derby is an integrated environment that exposes the DRB functionality through a graphical user interface.

Figure 10. Derby main interface

As an example, Derby will make it possible to browse all ENVISAT product fields from a tree representation. It will also allow users to edit XQuery scripts and to render their results as tables, plots or image views, or to compile them into a configurable report. Additional functionality such as syntax highlighting and automatic completion for query editing will ease usage and facilitate training. The first release of the Derby software will be made available by the beginning of next year. Its operational use is foreseen during the early phases of ENVISAT to support the changes in the different instrument processors.

7. Benefits and example applications

The presented system enables the use of ENVISAT product data with minimal knowledge of their physical representation. Each part of the product is described by a unique description based on XML Schema, an existing and public standard that is open to any other system. All information can be extracted from one or more products through XQuery scripts. These main features make the system advantageous for many EO applications. A non-exhaustive list of such applications is introduced in the next sections.

7.1. Processor configuration management

The system is helpful for the configuration management of the instrument processors.
It makes it possible to compare the processed products against reference ones, highlighting possible discrepancies between them. This application was the starting point of the present study.

7.2. New missions

Although initially developed for ENVISAT product data handling, the open architecture of the system reduces the manpower required to support additional missions.

7.3. Products definition

XML Schema is a good candidate for the editing and maintenance of the product definitions in a unique repository that may be shared among data users. From the XML Schema, it is possible to extract all the information usually printed in the product specification documents, with the advantage of being directly exploitable by software in a networked or local environment.

7.4. Translations

The system is able to read and write information regardless of the physical format. It can therefore be used as a format translator by defining an output encoding format different from the input one. In addition, minor processing (at least what the XQuery operators make possible) may be applied during translation.

7.5. Software development

The DRB software may be used as a standard API for accessing all supported products. This may spare software developers from re-engineering the product specifications and minimize the implementation effort for components requiring access to product data.

7.6. Quality analysis

The system may be useful for Systematic Quality Analysis (SQA) as well as Long Loop Performance Analysis (LLPA) by providing a simple interface for the extraction and interpretation of relevant product information. ESA/ESRIN already foresees using this software as an integrated part of the next operational quality control system of ENVISAT.

8. Conclusions

The outputs of the study confirmed the feasibility of a system combining the use of XML-related technologies with the handling of ENVISAT product data. Moreover, they emphasized the interest of providing the highest level of abstraction from the physical representation of the data, allowing EO scientists, engineers and managers to concentrate on their lines of business. This promising concept should lead to further investigations and developments.

Acknowledgments

We would like to thank Eric Monjoux from ESA ESRIN for his contribution and comments in reviewing the present paper.

References

[1] Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 6 October 2000, http://www.w3.org/TR/2000/REC-xml-20001006
[2] XML Schema Part 1: Structures, W3C Recommendation 2 May 2001, http://www.w3.org/TR/xmlschema-1/
[3] XML Schema Part 2: Datatypes, W3C Recommendation 2 May 2001, http://www.w3.org/TR/xmlschema-2/
[4] XQuery 1.0: an XML Query Language, W3C Working Draft 07 May 2001, http://www.w3.org/TR/xquery/
[5] Extensible Stylesheet Language (XSL) Version 1.0, W3C Recommendation 15 October 2001, http://www.w3.org/TR/xsl/
[6] XSL Transformations (XSLT) Version 1.0, W3C Recommendation 16 November 1999, http://www.w3.org/TR/xslt
[7] XML Path Language (XPath), W3C Recommendation 16 November 1999, http://www.w3.org/TR/xpath
[8] XML Linking Language, W3C Recommendation 27 June 2001, http://www.w3.org/TR/xlink/
[9] Namespaces in XML, W3C Recommendation 14 January 1999, http://www.w3.org/TR/REC-xml-names
[10] Document Object Model (DOM) Level 3 Core Specification, W3C Working Draft 13 September 2001, http://www.w3.org/TR/DOM-Level-3-Core