Part 2: XML and Data Management
Chapter 6: Overview of XML
Prof. Dr. Stefan Böttcher

6. Overview of the XML standards: XML, DTD, XML Schema
7. Navigation in XML documents: XML axes, DOM, SAX, XPath, Tree Pattern queries
Further topics: XML streaming, compression

Data-centric XML - XML data storage

  <doc>
    <order>                            (start tag)
      <customer> Alice </customer>
      <PC> pc400 </PC>
    </order>
    <order>
      <customer> Bob </customer>
      <PC> pc500 </PC>
    </order>
    <order>
      <customer> Carla </customer>
      <PC> pc600 </PC>
    </order>
  </doc>                               (end tag)

The same data as a table named doc:

  %       customer  PC
  order(  Alice     pc400 ).
  order(  Bob       pc500 ).
  order(  Carla     pc600 ).

Extensible Markup Language (XML)
XML - a family of standards:
  XML (Extensible Markup Language)
    data format exchangeable across different operating systems, applications, and enterprises
    often used for content
  XPath
    path expressions used for navigation in XML trees
    used within other XML standards (e.g. XSL(T))
  XSL (Extensible Stylesheet Language)
    used to describe the layout of content / to convert data
  many more standards:
    XQuery (queries), DTD (type definitions), XML Schema (integrity constraints)

Separation of content and layout
  content (product2.xml) + layout (technican2.xsl)
  content (product1.xml) + layout (customer1.xsl)
  The HTML file combines the requested data with the requested layout.

Separation of content and layout (2)
Consequences:
  1 (content) data source for different layouts (technician, seller, customer, reseller, ...)
  the layout may change without changing the content (different logo, different seller or customer, different employee or job, new view of the data)
  reuse 1 layout for different content (frame with company logo, ...)
  the content may change without changing the layout (new prices, ...)
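
The consequences above can be illustrated with a minimal Python sketch. It does not use XSL; two plain functions stand in for the stylesheets, and the product data, the function names, and the HTML fragments are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Content: one data source (sample values, not taken from the slides).
CONTENT = "<product><name>pc400</name><price>499</price></product>"


def customer_layout(product):
    """Hypothetical customer view: shows name and price."""
    return "<h1>{}</h1><p>Price: {} EUR</p>".format(
        product.findtext("name"), product.findtext("price"))


def technician_layout(product):
    """Hypothetical technician view: shows only the model name."""
    return "<h1>Service sheet</h1><p>Model: {}</p>".format(
        product.findtext("name"))


def render(content_xml, layout):
    """Combine the requested content with the requested layout."""
    return layout(ET.fromstring(content_xml))


customer_page = render(CONTENT, customer_layout)
technician_page = render(CONTENT, technician_layout)
print(customer_page)
print(technician_page)
```

Changing a price only touches CONTENT, and changing a view only touches one layout function, which is the point of the separation.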

XML on Java servers
XML + XSL:
  layout (.xsl file)
  content data (.xml file)
  separate layout and content
  combine them in the web server
Processing:
  the browser (client) sends its input and calls a servlet on the server
  the servlet transforms XML file + XSL file => HTML page
  the generated HTML page is returned to the browser

XML document as a data storage

  <doc>
    <order>                            (opening tag)
      <customer> Alice </customer>
      <PC> pc400 </PC>
    </order>
    <order>
      <customer> Bob </customer>
      <PC> pc500 </PC>
    </order>
    <order>
      <customer> Carla </customer>
      <PC> pc600 </PC>
    </order>
  </doc>                               (closing tag)

The same data as a table named doc:

  %       customer  PC
  order(  Alice     pc400 ).
  order(  Bob       pc500 ).
  order(  Carla     pc600 ).
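
The correspondence between the XML document and the table rows can be sketched in Python with the standard-library ElementTree parser. The function name orders_as_tuples is an assumption for illustration; the data is the example from the slide.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
  <order><customer> Alice </customer><PC> pc400 </PC></order>
  <order><customer> Bob </customer><PC> pc500 </PC></order>
  <order><customer> Carla </customer><PC> pc600 </PC></order>
</doc>"""


def orders_as_tuples(xml_text):
    """Return one (customer, PC) tuple per order element."""
    root = ET.fromstring(xml_text)
    return [(order.findtext("customer").strip(), order.findtext("PC").strip())
            for order in root.findall("order")]


orders = orders_as_tuples(DOC)
for customer, pc in orders:
    print("order( {} {} ).".format(customer, pc))
```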

XML syntax
XML prolog:
  <?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
      version, character set, standalone="yes": without DTD!
  <?xml-stylesheet type="text/xsl" href="xmlbsp1.xsl"?>
      stylesheet used (only inside IE5)
XML main part:
  <order>                              element with start tag / end tag
    <customer> Alice </customer>       " Alice " is a text node
    <PC> pc400 </PC>
  </order>

XML syntax (2)
In the XML main part:
  <offers>
    <offer supplier="vobis" item="pc500"></offer>      element with no text node
    <offer supplier="IBM" item="pc600" />              end of tag (no text)
  </offers>
  supplier, item: attributes
  "vobis", "pc500", "IBM", "pc600": attribute values

XML syntax (3)
  all tags must be closed (<tag> ... </tag> or <singletag />)
  incorrectly nested tags are not allowed (<tag1> <tag2> ... </tag1> </tag2>)
  tag names are case-sensitive (<tag> is different from <Tag>)
  attribute values must be quoted (e.g. <p align="center">)
  text must be enclosed in elements
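
These rules are what a parser checks as well-formedness. A minimal Python sketch, assuming the standard-library ElementTree parser (which checks well-formedness only, not validity against a DTD); the sample strings and the name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "closed tags": "<tag><singletag /></tag>",
    "unclosed tag": "<tag><inner></tag>",
    "bad nesting": "<doc><tag1><tag2></tag1></tag2></doc>",
    "case mismatch": "<tag></Tag>",
    "unquoted attribute": "<p align=center></p>",
    "quoted attribute": '<p align="center"></p>',
    "text outside elements": "<doc/> stray text",
}
results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, "->", "well-formed" if ok else "rejected")
```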

XML document as a tree

  <doc>
    <customer name="Alice">
      <order> ... </order>
      <address> </address>
    </customer>
    <customer>
      <order/>
      <address/>
    </customer>
  </doc>

Tree:

  doc
    customer
      name = Alice     (attribute)
      order
      address
    customer
      order
      address
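
The tree view can be reproduced with a short recursive walk. This is a sketch using Python's ElementTree; the function name tree_lines and the "@name = value" notation for attributes are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
  <customer name="Alice"><order/><address/></customer>
  <customer><order/><address/></customer>
</doc>"""


def tree_lines(element, depth=0):
    """Return one indented line per element and attribute, in document order."""
    lines = ["  " * depth + element.tag]
    for name, value in element.attrib.items():
        lines.append("  " * (depth + 1) + "@{} = {}".format(name, value))
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines


lines = tree_lines(ET.fromstring(DOC))
print("\n".join(lines))
```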

XML node types
7 kinds of nodes:
  root                       has no parent node
  element
  text                       leaf node (has no child node)
  attribute                  leaf node (has no child node)
  comment                    leaf node (has no child node)
  namespace                  leaf node (has no child node)
  processing-instruction     leaf node (has no child node)
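
Most of these node kinds can be observed with Python's xml.dom.minidom. The sketch below counts them in a small made-up document. Two caveats: DOM stores attributes in a map on the element rather than as child nodes, and it has no namespace nodes, so that kind does not appear here. The KIND mapping and count_kinds are illustrative names.

```python
from collections import Counter
from xml.dom import minidom, Node

# Sample document (an assumption): one processing instruction, one comment,
# three elements, one attribute, one text node.
DOC = ('<?xml version="1.0"?>'
       '<?xml-stylesheet type="text/xsl" href="xmlbsp1.xsl"?>'
       '<!-- orders -->'
       '<doc><order id="o1"><customer>Alice</customer></order></doc>')

KIND = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
}


def count_kinds(node, counts):
    """Count node kinds in the subtree rooted at node."""
    counts[KIND.get(node.nodeType, "other")] += 1
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes are not child nodes in DOM, so count them separately.
        counts["attribute"] += node.attributes.length
    for child in node.childNodes:
        count_kinds(child, counts)
    return counts


counts = count_kinds(minidom.parseString(DOC), Counter())
print(dict(counts))
```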

Node type definitions: common aspects
DTD, XML Schema, or RELAX NG:
  defines the structure of all XML trees exchanged
  => one common data format for all participants
  data formats exchangeable across company borders
New data exchange formats and languages based on XML
  example: ebXML (E-Business XML) as a basis for OTA (OpenTravel Alliance)
  data exchange between travel agency, airline, etc.
Consequence of these standards:
  (economic) pressure to use the standard

Differences between DTD and XML Schema
DTD (the older standard):
  + defines the structure (nesting of tags) of the documents, e.g. <customer> <order> <item>
  + defines structural dependencies, e.g. every order contains at least one item element
XML Schema (the newer standard) additionally:
  + binds XML elements to types defined in the XML Schema
  + defines domains
  + defines integrity constraints

Document Type Definition (DTD)

DTD file xmlbsp2d.dtd:

  <!-- DTD xmlbsp2d.dtd for example xmlbsp2d.xml -->
  <!ELEMENT orders ( order )* >            root element; * = arbitrarily many
  <!ELEMENT order ( customer, PC ) >       sequence; both sub-elements required
  <!ELEMENT customer (#PCDATA) >           parsed character data
  <!ELEMENT PC (#PCDATA) >

XML file xmlbsp2d.xml:

  <?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
  <!DOCTYPE orders SYSTEM "xmlbsp2d.dtd">
  <?xml-stylesheet type="text/xsl" href="xmlbsp2.xsl"?>
  <orders>
    <order>
      <customer> Alice </customer>
      <PC> pc400 </PC>
    </order>
    <order> ... </order>
  </orders>

Element declarations in DTDs

  <!ELEMENT PC (#PCDATA) >                   text (no elements)
  <!ELEMENT offer EMPTY >                    empty
  <!ELEMENT supplies (offer) >               1 sub-element
  <!ELEMENT offers (offer)* >                arbitrarily many sub-elements
  <!ELEMENT order (customer, PC) >           sequence
  <!ELEMENT payment (cash | card) >          choice
  <!ELEMENT E ((A | B)*, C, (D)?)+ >         parentheses for grouping

  ?   0 or 1
  *   arbitrarily many
  +   at least 1
  ,   sequence
  |   choice
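
A DTD content model is a regular expression over the sequence of child element names. As a minimal sketch (not a DTD validator), the model ((A | B)*, C, (D)?)+ from the last declaration can be hand-translated into a Python regular expression; the translation and the helper name matches_model are assumptions for illustration.

```python
import re

# ((A | B)*, C, (D)?)+ translated by hand: each child name is followed by
# one space, so the names are matched as whole tokens.
CONTENT_MODEL = re.compile(r"(?:(?:A |B )*C (?:D )?)+")


def matches_model(child_names):
    """Check a list of child element names against the content model of E."""
    sequence = "".join(name + " " for name in child_names)
    return CONTENT_MODEL.fullmatch(sequence) is not None


for children in (["C"], ["A", "B", "C", "D"], ["C", "C", "D"],
                 ["A", "B"], ["D", "C"], []):
    print(children, "->", matches_model(children))
```

The element C is the only mandatory part of each repetition, so any accepted sequence contains at least one C.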

Attribute declarations in DTDs

  <!-- DTD xmlbsp2d.dtd for the example xmlbsp2d.xml -->
  <!ELEMENT offers (offer)* >          root element; arbitrarily many offer elements
  <!ELEMENT offer EMPTY >              empty
  <!ATTLIST offer
      supplier CDATA #REQUIRED         attribute; type CDATA (character data); must occur
      item     CDATA #REQUIRED >

  <offers>
    <offer supplier="vobis" item="pc500"></offer>
    <offer supplier="IBM" item="pc600" />
  </offers>
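
Python's standard library has no validating parser, so the effect of #REQUIRED can only be sketched by hand. The table REQUIRED below mirrors the ATTLIST declaration above; it and the function missing_required are illustrative assumptions, not a real DTD validator.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for: <!ATTLIST offer supplier CDATA #REQUIRED
#                                            item     CDATA #REQUIRED >
REQUIRED = {"offer": ["supplier", "item"]}


def missing_required(xml_text):
    """Return (element name, attribute name) pairs for missing attributes."""
    problems = []
    for element in ET.fromstring(xml_text).iter():
        for attribute in REQUIRED.get(element.tag, []):
            if attribute not in element.attrib:
                problems.append((element.tag, attribute))
    return problems


valid_doc = ('<offers><offer supplier="vobis" item="pc500"></offer>'
             '<offer supplier="IBM" item="pc600" /></offers>')
invalid_doc = '<offers><offer supplier="IBM" /></offers>'

print(missing_required(valid_doc))
print(missing_required(invalid_doc))
```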

XML Schema
Beyond DTDs:
  distinguish between element name and element type (e.g. street and firstname both being of type xsd:string)
  domain restriction (e.g. plz, a postal code, being a 5-digit xsd:integer)
  sub-typing by domain restriction
  type extension (e.g. additional children or attributes)
  structural constraints (e.g. minOccurs="5" maxOccurs="7")

XML Schema - simple data types
Basic types:
  xsd:string
  xsd:decimal
  xsd:integer
  xsd:float
  xsd:boolean
  xsd:date
  xsd:time
  xsd:QName (qualified name)
  xsd:anyURI
  xsd:language (e.g. de-de, en-us, ...)
  xsd:ID
  xsd:IDREF

XML Schema - list and union data types

Data type definitions:

  <xsd:simpleType name="intday">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="1"/>
      <xsd:maxInclusive value="7"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="days">
    <xsd:list itemType="intday"/>
  </xsd:simpleType>

Instance of type days:

  <days> 1 2 3 4 5 </days>
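
What the two type definitions accept can be sketched in plain Python, without a schema processor. The function names is_intday and is_days are assumptions; the check follows the definitions above: a whitespace-separated list whose items are integers between 1 and 7.

```python
import re

INTEGER = re.compile(r"[+-]?[0-9]+")  # lexical form of an integer


def is_intday(token):
    """intday: an integer restricted to the range 1..7 (inclusive)."""
    return INTEGER.fullmatch(token) is not None and 1 <= int(token) <= 7


def is_days(text):
    """days: a whitespace-separated list of intday items."""
    return all(is_intday(token) for token in text.split())


for text in (" 1 2 3 4 5 ", "1 8", "1 two 3"):
    print(repr(text), "->", is_days(text))
```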

XML Schema example (1)

  <xsd:element name="address">
    <xsd:complexType>
      <xsd:sequence>                               fullname followed by either street or POB
        <xsd:element name="fullname" maxOccurs="1">
          <xsd:complexType>
            <xsd:sequence>                         nested elements, both in this order
              <xsd:element name="firstname" type="xsd:string"/>
              <xsd:element name="lastname" type="xsd:string"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
        <xsd:choice>                               either street or POB
          <xsd:element name="street" type="xsd:string"/>
          <xsd:element name="pob" type="xsd:integer"/>
        </xsd:choice>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
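
The structure this schema describes (a sequence containing a nested sequence and a choice) can be checked by hand in Python. This is only a sketch of the constraints, not a schema validator; the function name is_valid_address and the sample documents are assumptions.

```python
import re
import xml.etree.ElementTree as ET


def is_valid_address(xml_text):
    """Check: address = (fullname = (firstname, lastname), (street | pob))."""
    address = ET.fromstring(xml_text)
    if address.tag != "address":
        return False
    children = [child.tag for child in address]
    # Sequence: exactly fullname first, then one alternative of the choice.
    if len(children) != 2 or children[0] != "fullname":
        return False
    if children[1] not in ("street", "pob"):
        return False
    # Nested sequence: firstname, then lastname.
    if [child.tag for child in address[0]] != ["firstname", "lastname"]:
        return False
    # pob must be an integer; street may be any string.
    if children[1] == "pob":
        text = (address[1].text or "").strip()
        return re.fullmatch(r"[+-]?[0-9]+", text) is not None
    return True


NAME = ("<fullname><firstname>Alice</firstname>"
        "<lastname>Smith</lastname></fullname>")
with_street = "<address>" + NAME + "<street>Main St</street></address>"
with_pob = "<address>" + NAME + "<pob>4711</pob></address>"
with_both = ("<address>" + NAME +
             "<street>Main St</street><pob>4711</pob></address>")

print(is_valid_address(with_street), is_valid_address(with_pob),
      is_valid_address(with_both))
```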

XML Schema example (2)

  <xsd:element name="shipto" type="coaddress"/>          element of extended type

  <xsd:complexType name="address">                       complex type address
    <xsd:sequence>                                       nested elements
      <xsd:element name="fullname"/>
      <xsd:element name="street"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="coaddress">                     extension of complex type address
    <xsd:complexContent>
      <xsd:extension base="address">
        <xsd:sequence>
          <xsd:element name="countrycode"/>              additional element
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

ID and IDREF

  <person pnr="12345">
    <in pid="p1"/>
    <in pid="p2"/>
  </person>
  <project p_id="p1">
    <with persid="12345"/>
    <with persid="6789"/>
  </project>
  <project p_id="p2">
    <with persid="12345"/>
  </project>

  <!-- DTD for element person -->
  <!ELEMENT person (in)* >
  <!ATTLIST person pnr ID #REQUIRED >            ID: unique
  <!ELEMENT in EMPTY >
  <!ATTLIST in pid IDREF #REQUIRED >

  <!-- DTD for element project -->
  <!ELEMENT project (with)* >
  <!ATTLIST project p_id ID #REQUIRED >
  <!ELEMENT with EMPTY >
  <!ATTLIST with persid IDREF #REQUIRED >        IDREF: no type checking, error-prone

Note: a validating parser requires ID values to be XML names, which must not start with a digit, so purely numeric values such as 12345 would need a letter prefix.

ID and IDREFS

  <person pnr="12345" pids="p1 p2" />

  <!-- DTD for element person -->
  <!ELEMENT person EMPTY >
  <!ATTLIST person
      pnr  ID     #REQUIRED
      pids IDREFS #REQUIRED >

  <project p_id="p1" persids="12345 6789" />
  <project p_id="p2" persids="12345" />

  <!-- DTD for element project -->
  <!ELEMENT project EMPTY >
  <!ATTLIST project
      p_id    ID     #REQUIRED
      persids IDREFS #REQUIRED >
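
The two constraints behind ID and IDREFS (IDs are unique; every reference points to some existing ID) can be sketched by hand in Python. The wrapper element db, the attribute tables, and the function check_references are assumptions for illustration. As on the slide, the check does not look at what kind of element a reference points to.

```python
import xml.etree.ElementTree as ET

# The fragments from the slide, wrapped in a hypothetical root element.
DOC = """<db>
  <person pnr="12345" pids="p1 p2"/>
  <project p_id="p1" persids="12345 6789"/>
  <project p_id="p2" persids="12345"/>
</db>"""

ID_ATTRS = {"person": "pnr", "project": "p_id"}          # attributes of type ID
IDREFS_ATTRS = {"person": "pids", "project": "persids"}  # attributes of type IDREFS


def check_references(xml_text):
    """Return (duplicate IDs, references without a matching ID)."""
    root = ET.fromstring(xml_text)
    ids, duplicates = set(), []
    for element in root.iter():
        attribute = ID_ATTRS.get(element.tag)
        if attribute and attribute in element.attrib:
            value = element.get(attribute)
            if value in ids:
                duplicates.append(value)
            ids.add(value)
    dangling = []
    for element in root.iter():
        attribute = IDREFS_ATTRS.get(element.tag)
        if attribute and attribute in element.attrib:
            dangling += [ref for ref in element.get(attribute).split()
                         if ref not in ids]
    return duplicates, dangling


duplicates, dangling = check_references(DOC)
print("duplicate IDs:", duplicates)
print("dangling references:", dangling)
```

In this fragment the reference 6789 is reported as dangling because no person with that pnr is part of the sample.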

XML - summary
  XML: tree structure for content
  DTD: structure definition
  XML Schema: additionally type checking and logical consistency checking
  well-documented standards: http://www.w3c.org