2 Basics of XML and XML documents
Survivor's Guide to XML, or XML for Computer Scientists / Dummies
  2.1 XML and XML documents
  2.2 Basics of XML DTDs
  2.3 XML Namespaces

SDPL 2011 2: XML Basics 1

2.1 XML and XML documents
  - XML 1.0 (Extensible Markup Language): W3C Recommendation, February 1998
    » not an official standard, but a stable industry standard
    » 2nd Ed. 2000, 3rd Ed. 2004, 4th Ed. 2006, 5th Ed. Nov 2008: editorial revisions, not new versions of XML 1.0
  - a simplified subset of SGML (Standard Generalized Markup Language, ISO 8879:1987)
    » what is said later about valid XML documents applies to SGML documents, too

What is XML?
  - Extensible Markup Language is not a markup language!
    » it does not fix a tag set nor its semantics (unlike markup languages, e.g. HTML)
  - XML documents have no inherent (processing or presentation) semantics
    » even though many think that XML is semantic or self-describing; see next

Semantics of XML Markup
  - Meaning of this XML fragment?
    » The computer cannot know it either!
  - Implementing the semantics is the topic of this course

What is XML (2)?
  XML is
  (A) a way to use markup to represent information
  (B) a metalanguage
    » supports definition of specific markup languages through XML DTDs or Schemas
    » e.g. XHTML, a reformulation of HTML using XML
  (C) often "XML" = XML + XML technology
    » that is, the processing models and languages we're studying (and many others...)

How does it look?
  <?xml version="1.0" encoding="iso-8859-1"?>
  <invoice num="1234">
    <client clnum="00-01">
      <name>Pekka Kilpeläinen</name>
      <email>kilpelai@cs.uku.fi</email>
    </client>
    <item price="60" unit="EUR">XML Handbook</item>
    <item price="350" unit="FIM">XSLT Programmer's Ref</item>
  </invoice>
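The point that XML carries no inherent semantics can be illustrated with a small sketch: the parser only hands over a tree, and it is the application that decides that `price` is a number and `unit` a currency. The following is a minimal illustration using Python's standard `xml.etree.ElementTree`; the function name `totals_by_unit` is hypothetical, and the XML declaration is left out only because the text is passed as a Python string rather than as bytes.

```python
import xml.etree.ElementTree as ET

# The invoice example from the slide (XML declaration omitted, see lead-in).
INVOICE = """<invoice num="1234">
  <client clnum="00-01">
    <name>Pekka Kilpeläinen</name>
    <email>kilpelai@cs.uku.fi</email>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer's Ref</item>
</invoice>"""


def totals_by_unit(xml_text):
    """Sum item prices per currency unit.

    That 'price' is an integer and 'unit' a currency is an interpretation
    supplied by this function, not something the XML markup itself says.
    """
    root = ET.fromstring(xml_text)
    totals = {}
    for item in root.findall("item"):
        unit = item.get("unit")
        totals[unit] = totals.get(unit, 0) + int(item.get("price"))
    return totals


if __name__ == "__main__":
    print(totals_by_unit(INVOICE))
```

A different application could read the same document and treat `price` as an opaque string; nothing in the markup would contradict it.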
Essential Features of XML
  - Overview of XML essentials
    » many details skipped; some to be discussed in exercises or with other topics when the need arises
  - Learn to consult original sources (specifications, documentation etc.) for details!
    » The XML specification is easy to browse

XML Document Characters
  - XML documents are made of ISO-10646 (32-bit) characters; in practice, of 16-bit Unicode chars (cf. Java)
  - Three aspects of characters (see next):
    » representation by bytes/octets
    » numeric code points (97 98 99 ...)
    » visual presentation

External Aspects of Characters
  - Documents are stored/transmitted as a sequence of bytes (or octets). An encoding determines how characters are represented by bytes.
    » UTF-8 (a superset of 7-bit ASCII) is the XML default encoding
    » encoding="iso-8859-1" ~> 256 Western European chars as single bytes
  - A font (a collection of character images called glyphs) determines the visual presentation of characters

XML Encoding of Structure 1
  - XML is, essentially, just a textual encoding scheme of labelled, ordered and attributed trees, in which
    » internal nodes are elements labelled by type names
    » leaves are text nodes labelled by string values, or empty element nodes
    » the left-to-right order of children of a node matters
    » element nodes may carry attributes (= name-string-value pairs)

XML Encoding of Structure 2
  - XML encoding of a tree corresponds to a pre-order walk
  - start of an element node with type name A is denoted by a start tag <A>, and its end by the end tag </A>
  - possible attributes are written within the start tag: <A attr_1="value_1" ... attr_n="value_n">
    » names must be unique: $attr_k \neq attr_h$ when $k \neq h$
  - text nodes are written as their string value

XML Encoding of Structure: Example
  [Figure: tree with root S whose children are W (text "Hello"), an empty E with attribute A=1, and W (text "world!")]
  <S> <W>Hello</W> <E A="1"/> <W>world!</W> </S>
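The correspondence between trees and their XML encoding can be sketched in a few lines. The code below is an illustrative sketch only: the `Element` class and `serialize` function are hypothetical names, the element type names S and E follow the example above, and W for the two text-carrying elements is an assumption. It performs the pre-order walk described above, writing a start tag on entering an element and an end tag on leaving it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Element:
    """An element node: type name, attributes, ordered children."""
    type_name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def serialize(node):
    """Encode a tree as XML text by a pre-order walk."""
    if isinstance(node, str):
        # Text node: written as its string value (markup characters escaped).
        return escape(node)
    attrs = "".join(
        f" {name}={quoteattr(value)}" for name, value in node.attributes.items()
    )
    if not node.children:
        # Empty element node: use the empty-element tag form.
        return f"<{node.type_name}{attrs}/>"
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.type_name}{attrs}>{inner}</{node.type_name}>"


# The example tree: S with children W("Hello"), E with A=1, W("world!").
tree = Element("S", children=[
    Element("W", children=["Hello"]),
    Element("E", {"A": "1"}),
    Element("W", children=["world!"]),
])

if __name__ == "__main__":
    print(serialize(tree))
```

Because a Python dict cannot hold the same key twice, the uniqueness requirement on attribute names is enforced by the data structure itself.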
XML: Logical Document Structure
  - Elements
    » indicated by matching (case-sensitive!) tags <ElementTypeName> ... </ElementTypeName>
    » can contain text and/or subelements
    » can be empty: <elem-type></elem-type> or <elem-type/> (e.g. <br/> in XHTML)
    » unique root element => the document is a single tree

Logical document structure (2)
  - Attributes
    » name-value pairs attached to elements
    » metadata, usually not treated as content
    » in the start-tag, after the element type name: <div class="preface" date='990126'>
  - Also:
    » <!-- comments outside other markup -->
    » <?PI value of processing instruction PI?>

CDATA Sections
  - CDATA sections are used to include XML markup characters as textual content:
    <![CDATA[ Here we can include for example code fragments:
    <example>if (Count < 5 && Count > 0) </example> ]]>

Two levels of correctness (1)
  - Well-formed documents
    » roughly: follow the syntax of XML, markup correct (elements properly nested, tag names match, attributes of an element have unique names, ...)
    » a violation is a fatal error
  - Valid documents
    » (in addition to being well-formed) obey an associated grammar (DTD/Schema)

XML docs and valid XML docs
  [Figure: DTD-valid documents and Schema-valid documents shown as subsets of XML documents (= well-formed XML documents)]

An XML Processor (Parser)
  - Reads XML documents and reports their contents to an application
    » relieves the application from details of markup
  - The XML Recommendation specifies, partly, the behaviour of XML processors:
    » recognition of characters as markup or data
    » what information to pass to applications
    » how to check the correctness of documents
    » validation based on comparing a document against its grammar
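Since a well-formedness violation is a fatal error, a conforming parser simply refuses the document. A minimal sketch of this, assuming Python's standard `xml.etree.ElementTree` (a non-validating parser) and a hypothetical helper name `is_well_formed`:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Any well-formedness violation is reported as a fatal parse error.
        return False


if __name__ == "__main__":
    samples = [
        "<a><b/></a>",            # properly nested
        "<a><b></a></b>",         # improper nesting
        "<A></a>",                # tag names are case-sensitive
        '<a x="1" x="2"/>',       # duplicate attribute name
        "<a/><b/>",               # more than one root element
        "<x>if (Count < 5) </x>", # raw '<' in content
        "<x><![CDATA[if (Count < 5 && Count > 0)]]></x>",  # same text in CDATA
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```

Validity is a separate question: this check says nothing about whether the document obeys a DTD or Schema.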
2.2 Basics of XML DTDs
  - A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents [defined in XML Rec]
  - Syntax (in the prolog of a document instance):
      <!DOCTYPE rootelemtype SYSTEM "ex.dtd"     (the "external subset" is in file ex.dtd)
      [ <!-- the "internal subset" may come here --> ]>
  - DTD = union of the external and internal subset (either could be empty); the internal subset has higher precedence
    » it can override entity and attribute declarations (see next)

Markup Declarations
  - A DTD consists of markup declarations
    » element type declarations: similar to productions of ECFGs
    » attribute-list declarations: for declared element types
    » entity declarations: short-hand notations and physical storage units
    » notation declarations: information about external (binary) objects

What Do Declarations Look Like?
  <!ELEMENT invoice (client, item+)>
  <!ATTLIST invoice num NMTOKEN #REQUIRED>
  <!ELEMENT client (name, email?)>
  <!ATTLIST client clnum NMTOKEN #IMPLIED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item price NMTOKEN #REQUIRED
                 unit (FIM | EUR) "EUR">
  - NMTOKEN: a string of name characters; (Letter | [0-9] | . | - | _ | :)+

Element type declarations
  - The general form is <!ELEMENT elemtype (E)>, where E is a content model (regular expr.); compare the ECFG production elemtype -> E
  - Content model operators:
    » E | F : alternation
    » E, F : concatenation
    » E? : optional
    » E* : zero or more
    » E+ : one or more
    » (E) : grouping
  - No priorities: write either (A,B)|C or A,(B|C), but not A,B|C

Attribute-List Declarations
  - Can declare attributes for elements: name, data type and possible default value:
      <!ATTLIST FIG id    ID      #IMPLIED
                    descr CDATA   #REQUIRED
                    class (a | b | c) "a">
  - Semantics mainly up to the application
    » the parser checks that ID attributes are unique and that targets of IDREF and IDREFS attributes exist

Mixed, Empty and Arbitrary Content
  - Mixed content: <!ELEMENT P (#PCDATA | I | IMG)*>
    » may contain text (#PCDATA) and elements
  - Empty content: <!ELEMENT IMG EMPTY>
  - Arbitrary content: <!ELEMENT HTML ANY>
    » (= <!ELEMENT HTML (#PCDATA | choice-of-all-declared-element-types)*>)
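The content-model operators listed under "Element type declarations" behave like regular-expression operators over the sequence of child element types. The sketch below makes that concrete by translating an element-content model into a Python regular expression. It is a rough illustration under stated assumptions: the function names are hypothetical, only element content is handled (no #PCDATA, EMPTY or ANY), and Python's backtracking `re` engine stands in for whatever matching technique a real validator uses.

```python
import re

# Tokens of an element-content model: names and the operators ( ) , | ? * +
_TOKEN = re.compile(r"[A-Za-z_][\w.\-]*|[(),|?*+]")


def content_model_to_regex(model):
    """Translate e.g. '(client, item+)' into a regex over 'name ' sequences."""
    parts = []
    for tok in _TOKEN.findall(model):
        if tok == ",":
            continue                      # concatenation is implicit in a regex
        elif tok == "(":
            parts.append("(?:")           # grouping without capturing
        elif tok in ")|?*+":
            parts.append(tok)             # same meaning in DTDs and regexes
        else:
            # An element type name, followed by a space as separator.
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))


def children_match(model, child_types):
    """Check that the sequence of child element types matches the model."""
    encoded = "".join(name + " " for name in child_types)
    return content_model_to_regex(model).fullmatch(encoded) is not None


if __name__ == "__main__":
    print(children_match("(client, item+)", ["client", "item", "item"]))
    print(children_match("(client, item+)", ["client"]))
    print(children_match("(name, email?)", ["name"]))
```

Encoding each child as its name plus a trailing space keeps names such as `item` and `items` from being confused when the sequence is matched as one string.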
Entities (1)
  - Named storage units or XML fragments (~ macros in some languages)
  - character entities:
    » &lt;, &#60; and &#x3C; all expand to < (treated as data, not as start-of-markup)
    » other predefined entities: &amp; &gt; &apos; &quot; for & > ' "
  - general entities are shorthand notations:
      <!ENTITY CS "School of Computing">

Entities (2)
  - physical storage units comprising a document
    » parsed entities: <!ENTITY chap1 SYSTEM "http://myweb/ch1">
    » the document entity is the starting point of processing
    » entities and elements must nest properly:
        <!DOCTYPE doc [
          <!ENTITY chap1 (as above)>
        ]>
        <doc> &chap1; </doc>
      [Figure: replacement text of chap1, consisting of whole elements: <sec num="1"> </sec> <sec num="2"> </sec>]

Unparsed Entities
  - For connecting external binary objects to XML documents (an XML processor handles only text)
      <!NOTATION TIFF SYSTEM "bin/gimp">
      <!ENTITY fig123 SYSTEM "figs/f123.tif" NDATA TIFF>
      <!ATTLIST IMG file ENTITY #REQUIRED>
  - Usage: <IMG file="fig123"/>
    » the parser provides the notation and the address to the application
  - an SGML legacy technique(?)

An Alternative Way
  - I have rarely used unparsed entities or notations
  - Easier "HTML-style" linking: <IMG type="tiff" file="figs/fig123"/>
    » the link indicates the format and the address directly
    » maintenance of numerous entities might be easier with (indirect references through) explicit declarations

Parameter Entities
  - to parameterize and modularize DTDs:
      <!ENTITY % tabledtd SYSTEM "dtds/tab.dtd">
      %tabledtd; <!-- include sub-dtd -->
      <!ENTITY % stattr "ready (yes | no) 'no'">
      <!ATTLIST CHAP %stattr; >
      <!ATTLIST SECT %stattr; >
  - (Parameter entities inside a markup declaration are allowed only in the external DTD subset)

Speculations about XML Parsing
  - Parsing involves two things:
    1. Pulling the entities together, and checking the well-formedness
    2. Building a parse tree for the input (à la DOM), or otherwise informing the application about document content and structure (e.g. à la SAX)
  - Task 2 is simple (thanks to the simplicity of XML markup; see next)
  - Checking well-formedness is straightforward; implementing validation is a bit more challenging
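The two parsing tasks can be observed with an event-based (SAX-style) parser: entity references are resolved before the application sees anything, and content is reported as a stream of start, text and end events. The following is a small sketch using Python's standard `xml.sax`; the class `EventRecorder`, the function `sax_events`, and the `dept` attribute in the sample document are hypothetical, and the entity declaration reuses the `CS` shorthand from "Entities (1)".

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records the events the parser reports to the application."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # A parser may deliver text in several chunks; merge adjacent ones.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + content)
        else:
            self.events.append(("text", content))


def sax_events(xml_text):
    recorder = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), recorder)
    return recorder.events


# Internal DTD subset declaring a general entity as a shorthand notation.
DOC = ('<!DOCTYPE doc [<!ENTITY CS "School of Computing">]>'
       '<doc dept="cs">&CS;</doc>')

if __name__ == "__main__":
    for event in sax_events(DOC):
        print(event)
```

The application never sees `&CS;` itself, only its replacement text, which is the sense in which the parser "pulls the entities together".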
Building an XML Parse Tree
  [Figure: tree with root S and children W (text "Hello"), E, and W (text "world!")]
  <S><W>Hello</W><E></E><W>world!</W></S>

Simplified XML Tree Building Algorithm
  (uses DOM-like tree operations)
    C := new Document();   // current node
    Scan document text until at end:
      case of scanned text:
        "<Type>":   N := new ElementNode(Type); C.addChildNode(N); C := N;
        "Txt":      C.addChildNode(new TextNode(Txt));
        "</Type>":  C := C.parentNode();
    return C;

Sketching XML validation
  - Treat the document as a tree d
  - The document is valid w.r.t. a grammar (DTD/Schema) G iff d is a syntax tree over G
    » Check that the root is labelled by the start symbol of G
    » For each element node n of the tree, check that its
      · attributes match with those of the element type
      · content matches the content model of its type: if n is of type A and its children are of type $B_1, \ldots, B_n$, check that the grammar has a production $A \to E$ for which
          $B_1 \cdots B_n \in L(E)$   (1)

Sketching the validation (2)
  - How to check condition (1), the matching of children with a content model?
    » by an automaton built from the content model E
  - Example: <!ELEMENT A ((B | C)+, D)>

2.3 XML Namespaces
  - Documents often comprise parts processed by different applications (and/or defined in different schemas)
  - for example, in XSLT scripts:
      <xsl:template match="doc/title">       (XSLT element/instruction)
        <H1>                                 (HTML element)
          <xsl:apply-templates />            (XSLT element/instruction)
        </H1>                                (HTML element)
      </xsl:template>                        (XSLT element/instruction)
  - How to manage multiple sets of names?

XML Namespaces (2/5)
  - Solution: XML Namespaces, W3C Rec. Jan 1999, for separating possibly overlapping vocabularies (sets of element type and attribute names) within a single document
    » by introducing (arbitrary) local name prefixes, and binding them to (fixed) globally unique URIs
  - For example, the local prefix xsl: is conventionally used in XSLT scripts
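The simplified tree-building algorithm can be turned into a runnable sketch almost line by line. The version below is an illustration under the same simplifications as the pseudocode: the `Node` class and `build_tree` function are hypothetical names, attributes, empty-element tags, comments and entities are not handled, and no well-formedness checking is done beyond rejecting markup the tokenizer does not recognise. As one departure from the pseudocode, whitespace-only text is skipped to keep the resulting tree small.

```python
import re


class Node:
    """A minimal DOM-like node: kind is 'document', 'element' or 'text'."""

    def __init__(self, kind, value=None):
        self.kind = kind
        self.value = value        # element type name or text content
        self.parent = None
        self.children = []

    def add_child(self, node):
        node.parent = self
        self.children.append(node)


# One token is an end tag, a start tag, or a run of text.
_TOKEN = re.compile(
    r"</(?P<end>[^<>\s/]+)\s*>|<(?P<start>[^<>\s/]+)\s*>|(?P<text>[^<]+)"
)


def build_tree(text):
    current = Node("document")                    # C := new Document()
    pos = 0
    while pos < len(text):                        # scan until at end
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unsupported markup at offset {pos}")
        pos = match.end()
        if match.group("start"):                  # "<Type>"
            node = Node("element", match.group("start"))
            current.add_child(node)
            current = node
        elif match.group("end"):                  # "</Type>"
            current = current.parent
        elif match.group("text").strip():         # "Txt"
            current.add_child(Node("text", match.group("text").strip()))
    return current                                # back at the document node


if __name__ == "__main__":
    doc = build_tree("<S><W>Hello</W><E></E><W>world!</W></S>")
    root = doc.children[0]
    print(root.value, [child.value for child in root.children])
```

For well-formed input every end tag moves `current` back up one level, so the loop finishes at the document node it started from.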
XML Namespaces briefly (3/5)
  - A namespace is identified by a URI (through the associated local prefix)
    » e.g. http://www.w3.org/1999/XSL/Transform for XSLT
    » it is conventional but not required to use URLs
    » the identifying URI has to be unique, but it does not have to be an existing address
  - The association is inherited by sub-elements
    » see the next example (of an XSLT script)

XML Namespaces (4/5)
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns="http://www.w3.org/TR/xhtml1/strict">
    <!-- XHTML is the default namespace -->
    <xsl:template match="doc/title">
      <H1> <xsl:apply-templates /> </H1>
    </xsl:template>
  </xsl:stylesheet>

XML Namespaces briefly (5/5)
  - Built on top of basic XML
    » by overloading attribute syntax (xmlns:foo="...")
    » does not affect validation
      · namespace attributes must be declared for DTD-validity
      · all element type names must be declared (with their prefixes!)
    => other schema languages (XML Schema, Relax NG) are better for validating documents with Namespaces
  - Processing languages allow declaring NS prefixes; see next

Example: Namespaces in XSLT
  XML:
    <test xmlns:foo="http://www.uef.fi/test">
      <foo:a>aaa</foo:a>
      <foo:b>bbb</foo:b>
    </test>
  XSLT:
    <xsl:transform version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:bar="http://www.uef.fi/test">
      <xsl:template match="/">
        <xsl:apply-templates select="//bar:*" />
      </xsl:template>
    </xsl:transform>

Example: Namespaces in XQuery
  ns-test.xml:
    <test xmlns:foo="http://www.uef.fi/test">
      <foo:a>aaa</foo:a>
      <foo:b>bbb</foo:b>
    </test>
  XQuery:
    xquery version "1.0";
    declare namespace zap = "http://www.uef.fi/test";
    doc("ns-test.xml")//zap:*

  For NS support in XML APIs, see next
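The same point, that the prefix is local and only the URI identifies the namespace, can be checked from a general-purpose API as well. The sketch below uses Python's standard `xml.etree.ElementTree`, which reports namespaced names in the form `{URI}localname`. The helper `elements_in_namespace` is a hypothetical name, and the prefix `zap` is chosen on the query side exactly as in the XQuery example, independently of the `foo` prefix used in the document.

```python
import xml.etree.ElementTree as ET

# Content of ns-test.xml from the examples above, inlined as a string.
NS_TEST = """<test xmlns:foo="http://www.uef.fi/test">
  <foo:a>aaa</foo:a>
  <foo:b>bbb</foo:b>
</test>"""

TEST_URI = "http://www.uef.fi/test"


def elements_in_namespace(root, uri):
    """Return all elements whose name belongs to the namespace `uri`."""
    qualified_prefix = "{" + uri + "}"
    return [el for el in root.iter() if el.tag.startswith(qualified_prefix)]


root = ET.fromstring(NS_TEST)

if __name__ == "__main__":
    # Roughly what //zap:* selects in the XQuery example.
    for element in elements_in_namespace(root, TEST_URI):
        print(element.tag, element.text)
    # A prefix-to-URI map on the query side; 'zap' need not match 'foo'.
    print(root.find("zap:a", {"zap": TEST_URI}).text)
```

The root element `test` carries the namespace declaration but is not itself in the namespace, so it is not selected.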