COMP60411 Semi-structured Data and the Web A bit of XPath, namespaces, and XML schema

Week 2. Uli Sattler, University of Manchester

Plagiarism again... Work through the COMP609PM Plagiarism and Malpractice Test, possibly a second time. If you have questions, ask! Don't risk your marks; don't risk your degree.

...Blackboard again... Bijan and you have answered hundreds of questions, discussed various solutions, and given numerous hints. Some of you were stuck, mostly with M1, but didn't read those hints; rather, you waited for us to explain (a waste of time), or panicked...

Errata. Last week, I defined validity of an XML document w.r.t. a schema and forgot one condition. Here is the correct definition: a document D is valid if D is well-formed, D is associated with a DTD (internal or external or both), D is valid w.r.t. that DTD, and the element named in the document type declaration is D's root element.

Any questions? 5

XML documents... There are various standards, tools, APIs, and data models for XML: to describe XML documents and validate XML documents against such a description (we have seen DTDs; today: XML Schema); to parse and manipulate XML documents (we have seen SAX and DOM; in next week's coursework: DOM); to transform an XML document into another XML document or into an instance of another format, e.g., HTML, Excel, relational tables.

Manipulation of XML documents XPath for navigating and querying through XML documents XQuery more expressive than XPath, uses XPath for querying and data manipulation Turing complete designed to access large amounts of data, to interface with relational systems XSLT similar to XQuery in that it uses XPath,... designed for styling, together with XSL-FO or CSS contrast this with DOM and SAX: a collection of APIs for programmatic manipulation includes data model and parser to build your own applications 7

XPath is designed to navigate to/select parts of a well-formed XML document; it has no transformational capabilities (unlike XQuery and XSLT). It is a W3C standard: XPath 1.0 is a 1999 W3C standard; XPath 2.0 is a 2007 W3C standard that extends/is a superset of XPath 1.0, with a richer set of WXS datatypes and support for type information from WXS validation (more on XML Schema later); see http://www.w3.org/TR/xpath20. XPath allows us to select/define parts of an XML document (a sequence of nodes; what is the difference between a sequence and a set?). It uses path expressions to navigate in XML documents and to select node-lists in an XML document; these are similar to path expressions in a traditional computer file system, e.g., rm */*/*.pdf. XPath provides numerous built-in functions, e.g., for string values, numeric values, date and time comparison, node and QName manipulation, sequence manipulation, Boolean values, etc.

XPath: Datamodel remember how an XML document can be seen as a node-labelled tree with element names as labels XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax - but not on DOM tree! XPath uses XQuery/XPath Datamodel there is a translation at http://www.w3.org/tr/xpath20/#datamodel see XPath process model... 9

[Figure: levels at which an XML document can be viewed. Columns: level; data unit examples; information or property required. Levels, top to bottom: cognitive; application; tree adorned with... (namespace, schema); tree; complex (Element, Attribute); token (<foo:name t="8">Bob); character (< foo:name t="8">Bob); bit (10011010). Information or property required: nothing; a schema; well-formedness; which encoding (e.g., UTF-8). Arrows: parsing, serializing. Application choice: DOM, Infoset, XPath.]

XPath: Datamodel. The XPath DM uses the following concepts. Nodes: element, attribute, text, namespace, processing-instruction, comment, document (root). Atomic values: behave like nodes without children or parents, e.g., a value of type xsd:string. Items: atomic values or nodes. Example (labelled on the slide with: document (root) node, element node, attribute node, text node):
<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

XPath Data Model From: http://oreilly.com/perl/excerpts/systemadmin-with-perl/ten-minute-xpath-utorial.html <?xml version="1.0" encoding="utf-8"?> <network> <description name="boston"> This is the configuration of our network in the Boston office. </description> <host name="agatha" type="server" os="linux"> <interface name="eth0" type="ethernet"> <arec>agatha.example.edu</arec> <cname>mail.example.edu</cname> <addr>192.168.0.4</addr> </interface> <service>smtp</service> <service>pop3</service> <service>imap4</service> </host> <host name="gil" type="server" os="linux"> 13

Comparison: XPath DM and DOM data model. [Figure: DOM tree fragment. Document: nodeType = DOCUMENT_NODE, nodeName = #document, nodeValue = (null). Element: nodeType = ELEMENT_NODE, nodeName = mytext, nodeValue = (null), with firstChild, lastChild, attributes.] XPath DM and DOM DM are similar, but different, most importantly regarding names and values of nodes, but also structurally. In XPath, only attributes, elements, processing instructions, and namespace nodes have names, of the form (local part, namespace URI), whereas DOM uses pseudo-names like #document, #comment, #text. In XPath, the value of an element or root node is the concatenation of the values of all its text node descendants, not null as it is in DOM: e.g., the XPath value of <a>a<b>b</b></a> is ab. XPath does not have separate nodes for CDATA sections (they are merged with their surrounding text), e.g., in <N>here is some text and <![CDATA[some CDATA < >]]> </N>. XPath has no representation of the DTD.
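The difference in node values can be tried out with Python's standard library. This is only a sketch: `xml.etree.ElementTree` is not an XPath data model implementation, so the string value is computed by hand by concatenating descendant text, and `xml.dom.minidom` stands in for a DOM.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

doc = "<a>a<b>b</b></a>"

# XPath-style string value: concatenation of all descendant text, in document order.
root = ET.fromstring(doc)
xpath_string_value = "".join(root.itertext())

# DOM: an element node has no value of its own.
dom_node_value = minidom.parseString(doc).documentElement.nodeValue

print(xpath_string_value)  # ab
print(dom_node_value)      # None

# CDATA sections are merged with the surrounding text: there is one text value, no CDATA node.
n = ET.fromstring("<N>here is some text and <![CDATA[some CDATA < >]]> </N>")
merged_text = n.text
print(repr(merged_text))   # 'here is some text and some CDATA < > '
```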

XPath: core terms, relations between nodes. Each node has at most one parent: each node but the root node has exactly one parent, and the root node has no parent. Each node has zero or more children. Ancestor is the transitive closure of parent, i.e., a node's parent, its parent's parent, and so on. Descendant is the transitive closure of child, i.e., a node's children, their children, and so on. When evaluating an XPath expression p, we assume that we know which document and which context we are evaluating p over; we see later how they are chosen/given. An XPath expression evaluates to a node sequence, where a node is a document/element/attribute node or an atomic value; document order is preserved among items.
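A minimal sketch of ancestor and descendant as transitive closures, assuming a made-up document `<a><b><c/><d/></b><e/></a>`. ElementTree keeps no parent pointers and has no document node, so the parent relation is built explicitly and the ancestors stop at the root element.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b><c/><d/></b><e/></a>")

# parent relation: every node except the root has exactly one parent
parent = {child: node for node in doc.iter() for child in node}

def ancestors(node):
    """Transitive closure of parent: the parent, its parent, and so on."""
    result = []
    while node in parent:
        node = parent[node]
        result.append(node)
    return result

def descendants(node):
    """Transitive closure of child, listed in document order."""
    result = []
    for child in node:
        result.append(child)
        result.extend(descendants(child))
    return result

c = doc.find("b/c")
ancestor_tags = [n.tag for n in ancestors(c)]
descendant_tags = [n.tag for n in descendants(doc)]
print(ancestor_tags)    # ['b', 'a']
print(descendant_tags)  # ['b', 'c', 'd', 'e']
```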

XPath - by example <?xml version="1.0" encoding="utf-8"?> <network> <description name="boston"> This is the configuration of our network in the Boston office. </description> <host name="agatha" type="server" os="linux"> <interface name="eth0" type="ethernet"> <arec>agatha.example.edu</arec> <cname>mail.example.edu</cname> <addr>192.168.0.4</addr> </interface> <service>smtp</service> <service>pop3</service> <service>imap4</service> </host> <host name="gil" type="server" os="linux"> 16

XPath - abbreviated syntax by example context node XPath expression: */*[2] <?xml version="1.0" encoding="utf-8"?> <network> <description name="boston"> This is the configuration of our network in the Boston office. </description> <host name="agatha" type="server" os="linux"> <interface name="eth0" type="ethernet"> <arec>agatha.example.edu</arec> <cname>mail.example.edu</cname> <addr>192.168.0.4</addr> </interface> <service>smtp</service> <service>pop3</service> <service>imap4</service> </host> <host name="gil" type="server" os="linux"> 17

XPath - abbreviated syntax by example context node XPath expression: */*[2]/*[1]/*[3] <?xml version="1.0" encoding="utf-8"?> <network> <description name="boston"> This is the configuration of our network in the Boston office. </description> <host name="agatha" type="server" os="linux"> <interface name="eth0" type="ethernet"> <arec>agatha.example.edu</arec> <cname>mail.example.edu</cname> <addr>192.168.0.4</addr> </interface> <service>smtp</service> <service>pop3</service> <service>imap4</service> </host> <host name="gil" type="server" os="linux"> 18

XPath - abbreviated syntax know your context node context node XPath expression: *[1] <?xml version="1.0" encoding="utf-8"?> <network> <description name="boston"> This is the configuration of our network in the Boston office. </description> <host name="agatha" type="server" os="linux"> <interface name="eth0" type="ethernet"> <arec>agatha.example.edu</arec> <cname>mail.example.edu</cname> <addr>192.168.0.4</addr> </interface> <service>smtp</service> <service>pop3</service> <service>imap4</service> </host> <host name="gil" type="server" os="linux"> 19

XPath - abbreviated syntax absolute paths context node XPath expression: /*/*[1] <?xml version="1.0" encoding="utf-8"?> <network> <description name="boston"> This is the configuration of our network in the Boston office. </description> <host name="agatha" type="server" os="linux"> <interface name="eth0" type="ethernet"> <arec>agatha.example.edu</arec> <cname>mail.example.edu</cname> <addr>192.168.0.4</addr> </interface> <service>smtp</service> <service>pop3</service> <service>imap4</service> </host> <host name="gil" type="server" os="linux"> 20
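The four expressions on these slides can be traced with a small hand-written evaluator. This is a sketch, not an XPath engine: it understands only `*` and `*[n]` child steps, the helper names (`DocumentNode`, `evaluate`) are made up, and the truncated second host is closed as an empty element purely so that the document parses. ElementTree's own path language is deliberately not used here, because its position predicates have to follow a tag name.

```python
import re
import xml.etree.ElementTree as ET

NETWORK = """<network>
  <description name="boston">
    This is the configuration of our network in the Boston office.
  </description>
  <host name="agatha" type="server" os="linux">
    <interface name="eth0" type="ethernet">
      <arec>agatha.example.edu</arec>
      <cname>mail.example.edu</cname>
      <addr>192.168.0.4</addr>
    </interface>
    <service>smtp</service>
    <service>pop3</service>
    <service>imap4</service>
  </host>
  <host name="gil" type="server" os="linux"/>
</network>"""  # assumption: the second host is shown empty; the slide truncates it

class DocumentNode:
    """Stand-in for the XPath document (root) node; its only child is the root element."""
    def __init__(self, root):
        self.root = root
    def __iter__(self):
        return iter([self.root])

def evaluate(expression, context, document):
    """Evaluate a path made only of `*` and `*[n]` child steps."""
    if expression.startswith("/"):      # absolute path: start at the document node
        nodes, expression = [document], expression[1:]
    else:                               # relative path: start at the context node
        nodes = [context]
    for step in expression.split("/"):
        match = re.fullmatch(r"\*(?:\[(\d+)\])?", step)
        if match is None:
            raise ValueError("unsupported step: " + step)
        position = int(match.group(1)) if match.group(1) else None
        selected = []
        for node in nodes:
            children = list(node)       # element children, in document order
            if position is None:
                selected.extend(children)
            elif 1 <= position <= len(children):
                selected.append(children[position - 1])
        nodes = selected
    return nodes

document = DocumentNode(ET.fromstring(NETWORK))
agatha = evaluate("*/*[2]", document, document)[0]
addr = evaluate("*/*[2]/*[1]/*[3]", document, document)[0]
print(agatha.get("name"))                             # agatha
print(addr.text)                                      # 192.168.0.4
print(evaluate("*[1]", document, document)[0].tag)    # network
print(evaluate("*[1]", agatha, document)[0].tag)      # interface
print(evaluate("/*/*[1]", agatha, document)[0].tag)   # description
```

The relative expression `*[1]` gives different answers for different context nodes, while the absolute `/*/*[1]` ignores the context node altogether.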

Modelling Or, how to make do

Case study 1 1 + 1

Let us consider a simple format: arithmetic expressions! This is our domain. High-level description: addition, multiplication, subtraction over the integers. Examples in informal notation: 4+5*7; twelve minus fifty-nine plus one; that sort of thing. Request: design a reasonable XML format for this domain, and provide a DTD that describes that format (schema as medium of communication). We have choices! The first choice is the root element name. Let's say, expression.

Work from an example!

Pick a root element

Work from an example!

Picking example(s) Different principles Coverage (hit all the features) Simplicity (easy to get right or get something) Corner cases (the hard situations) Realism (hit an actual situation) Trade off principles E.g., coverage vs. simplicity More examples are (or can be) better! Your example(s) should Give you insight into the domain Highlight benefits or drawbacks Force (or elicit) design decisions Ours 2+3*(5-4)

Initial design thoughts. Three design issue families: elements vs. attributes vs. text (also the type of textual content); structural relationships (e.g., nesting); naming. Choices on one constrain the others: suppose we decide to represent operators with elements. We cannot then use the names +, *, and -, because they aren't legal element names. Design 1: elements for everything. Names: plus, minus, times, open_parens, close_parens, int. Structure?

A flat style

How to evaluate? Establish evaluation axes Coverage Did I capture all of my domain? E.g., is there an arithmetic expression I can t encode? Usability Can people write/read/otherwise use your format? Is working with the format error prone? Naturalness or Fit or Fidelity Is your format natural? Does your format make good (standard, best practice) use of the data model? Evolvability Can we extend the format with minimal disruption? Can we change the format with minimal disruption? Again, there may be trade offs It depends on the context and interests

Structure?

Maybe indentation will help?

The system ignores it!

The system ignores it? Why should we care what the formatter does? It's a warning sign: we cannot enforce key syntax constraints, e.g., parentheses must balance (for every open there is a close). We cannot express that to an arbitrary depth! Wait! Maybe we can? Cool CS fact: regular expressions can't express nesting, and DTD content models are regular expressions over element names.

Really can't!

Other aspects of the system Queries! (over (10 + 5) * (6-4) ) Get all parenthesized expressions I.e., (10 + 5) and (6-4) Get the content of the first parenthesized expression I.e., 10 + 5 Get the first element of the first parenthesized expression I.e., 10 We need to parse! We need a stack! (Can we do this in XPath?) DTD constraints We talked about SAX processing DOM processing Looks like the SAX! No nesting! No tree!
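To see why the flat design forces us to parse, here is a sketch of the stack-based pass needed to answer the three queries. The flat encoding of (10 + 5) * (6 - 4) is an assumption built from the Design 1 element names, and `parenthesized_groups` and `show` are made-up helpers.

```python
import xml.etree.ElementTree as ET

# Assumed Design 1 (flat) encoding of (10 + 5) * (6 - 4); not taken from the slides.
FLAT = (
    "<expression>"
    "<open_parens/><int>10</int><plus/><int>5</int><close_parens/>"
    "<times/>"
    "<open_parens/><int>6</int><minus/><int>4</int><close_parens/>"
    "</expression>"
)
SYMBOLS = {"plus": "+", "minus": "-", "times": "*"}

def parenthesized_groups(expression):
    """Return the tokens inside each parenthesized group, in order of opening."""
    groups = []        # every group seen so far
    open_groups = []   # the stack of groups that are still open
    for token in expression:
        if token.tag == "open_parens":
            group = []
            groups.append(group)
            open_groups.append(group)
        elif token.tag == "close_parens":
            if not open_groups:
                raise ValueError("close_parens without matching open_parens")
            open_groups.pop()
        else:
            for group in open_groups:   # a token belongs to every enclosing group
                group.append(token)
    if open_groups:
        raise ValueError("open_parens without matching close_parens")
    return groups

def show(tokens):
    return " ".join(t.text if t.tag == "int" else SYMBOLS[t.tag] for t in tokens)

groups = parenthesized_groups(ET.fromstring(FLAT))
print([show(g) for g in groups])   # ['10 + 5', '6 - 4']
print(groups[0][0].text)           # 10
```

The balance check lives in our code rather than in the schema, which is exactly the complaint about the flat design.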

SAX Processing
startdocument
startelement [null] [expression] [expression]
comment [ A simple expression which touches all features. 2+3*(5-4) ] [21] [68]
startelement [null] [int] [int]
characters [2] [102] [1]
endelement [null] [int] [int]
startelement [null] [plus] [plus]
endelement [null] [plus] [plus]
startelement [null] [int] [int]
characters [3] [131] [1]
endelement [null] [int] [int]
startelement [null] [times] [times]
endelement [null] [times] [times]
startelement [null] [open_parens] [open_parens]
endelement [null] [open_parens] [open_parens]
startelement [null] [int] [int]
characters [5] [180] [1]
endelement [null] [int] [int]
startelement [null] [minus] [minus]
endelement [null] [minus] [minus]
startelement [null] [int] [int]
characters [4] [210] [1]
endelement [null] [int] [int]
startelement [null] [close_parens] [close_parens]
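A comparable event trace can be produced with Python's `xml.sax`, as a sketch of what the flat design looks like to an event-based parser. The flat encoding of 2+3*(5-4) is an assumption, and `EventLogger` is a made-up handler that also records the nesting depth.

```python
import xml.sax

# Assumed Design 1 (flat) encoding of 2+3*(5-4)
FLAT = (b"<expression><int>2</int><plus/><int>3</int><times/>"
        b"<open_parens/><int>5</int><minus/><int>4</int><close_parens/></expression>")

class EventLogger(xml.sax.ContentHandler):
    """Record SAX events and the deepest element nesting that was seen."""
    def __init__(self):
        super().__init__()
        self.events = []
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.depth -= 1
        self.events.append(("endElement", name))

    def characters(self, content):
        if content.strip():              # skip whitespace-only text
            self.events.append(("characters", content))

logger = EventLogger()
xml.sax.parseString(FLAT, logger)
for event in logger.events[:6]:
    print(*event)
print("max depth:", logger.max_depth)    # max depth: 2
```

The parenthesised subexpression never increases the depth: every token is a direct child of expression, so there is no tree for the parser to hand us.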

Ok already! New design Design 1 Elements for everything Names: plus, minus, times, open_parens, close_parens, int Structure = Flat Design 2 Elements for everything Names: plus, minus, times, parens, int Structure = Nested parens Is this better?

Doesn't look too different!

But the system knows!

Evaluating Design 2. Design 2: elements for everything. Names: plus, minus, times, parens, int. Structure = nested parens. DTD constraints: more! Nested parens work. SAX/DOM processing: some tree structure, but still some flat analysis. Queries?
<!ELEMENT expression (parens)>
<!ELEMENT parens ((int, (minus | times | plus), int) | parens)>
<!ELEMENT plus EMPTY>
<!ELEMENT minus EMPTY>
<!ELEMENT times EMPTY>
<!ELEMENT int (#PCDATA)>

Queries are better! (10 + 5) * (6-4) Get all parenthesized expressions //parens

Queries are better! (10 + 5) * (6-4) Get the content of the first parenthesized expression //parens[1]/*

Queries are better! (10 + 5) * (6-4) Get the first element of the first parenthesized expression //parens[1]/*[1]
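A sketch of the three queries using ElementTree's limited XPath support. The Design 2 encoding of (10 + 5) * (6 - 4) below is an assumption and has not been checked against the slide's DTD. Because ElementTree's position predicates must follow a tag name, the `*` and `*[1]` steps are done in plain Python.

```python
import xml.etree.ElementTree as ET

# Assumed Design 2 encoding of (10 + 5) * (6 - 4)
DESIGN2 = """<expression>
  <parens><int>10</int><plus/><int>5</int></parens>
  <times/>
  <parens><int>6</int><minus/><int>4</int></parens>
</expression>"""

root = ET.fromstring(DESIGN2)

all_parens = root.findall(".//parens")        # //parens
first_parens = root.findall(".//parens[1]")   # //parens[1]
content = list(first_parens[0])               # //parens[1]/*
first_element = content[0]                    # //parens[1]/*[1]

print(len(all_parens))                 # 2
print([node.tag for node in content])  # ['int', 'plus', 'int']
print(first_element.text)              # 10
```

Note that `//parens[1]` selects every parens that is the first parens child of its parent, not the first parens in the whole document; the two coincide here only because the parens elements are siblings.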

Good use of the data model. Flat vs. nested parens: both encode parenthesization! But Design 2 does it in a more XML-natural way. How can we tell? All XML-sensitive tools and languages do more with Design 2: key features of our information are salient to those tools! Is this the best use of the model?

Take it further! Design 1 Elements for everything Names: plus, minus, times, open_parens, close_parens, int Structure = Flat Design 2 Elements for everything Names: plus, minus, times, parens, int Structure = Nested parens Design 3 Elements for everything Names: plus, minus, times, int Structure = Nested operators (no parens!)

Nesting, nesting everywhere?

Evaluating Design 3. Design 3: elements for everything. Names: plus, minus, times, int. Structure = nested operators (no parens!). DTD constraints: lots! SAX/DOM processing: natural tree structure. Usability: fewer elements (and concepts!). Queries? All our old ones are irrelevant! Our new ones are more content oriented, e.g., get all additions of subtractions.
<!ELEMENT expression (plus | times | minus | int)>
<!ELEMENT plus ((plus | times | minus | int), (plus | times | minus | int))>
<!ELEMENT times ((plus | times | minus | int), (plus | times | minus | int))>
<!ELEMENT minus ((plus | times | minus | int), (plus | times | minus | int))>
<!ELEMENT int (#PCDATA)>

Last twiddle to get to calc1. Design 3.2: elements for everything except int values. Names: plus, minus, times, int, value. Structure = nested operators (no parens!). The change: from <int>3</int> to <int value="3"/>. Why prefer one to the other?

Look familiar?

We have our language design! Design 3.2: elements for everything except int values. Names: plus, minus, times, int, value. Structure = nested operators (no parens!). We're done, right? We need to capture it, say in a DTD. Our DTD needs evaluation too!

Two versions
Version 1:
<!ELEMENT expression (plus | times | minus | int)>
<!ELEMENT plus ((plus | times | minus | int), (plus | times | minus | int)+)>
<!ELEMENT times ((plus | times | minus | int), (plus | times | minus | int)+)>
<!ELEMENT minus ((plus | times | minus | int), (plus | times | minus | int))>
<!ELEMENT int EMPTY>
<!ATTLIST int value NMTOKEN #REQUIRED>
Version 2:
<!ENTITY % expr "(plus | times | minus | int)">
<!ELEMENT expression %expr;>
<!ELEMENT plus (%expr;, (%expr;)+)>
<!ELEMENT times (%expr;, (%expr;)+)>
<!ELEMENT minus (%expr;, %expr;)>
<!ELEMENT int EMPTY>
<!ATTLIST int value NMTOKEN #REQUIRED>
They say the same thing! Exactly. One says it better (slightly).
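That the two versions say the same thing can be checked by treating the parameter entity as the textual macro it is. The sketch below is a simplification: `expand` is a made-up helper doing plain string substitution, whereas a real XML processor also pads the replacement text with a space on each side.

```python
import re

ENTITIES = {"expr": "(plus | times | minus | int)"}

def expand(declaration, entities):
    """Textually replace each %name; reference with its replacement text."""
    return re.sub(r"%(\w+);", lambda m: entities[m.group(1)], declaration)

long_minus = "<!ELEMENT minus ((plus | times | minus | int), (plus | times | minus | int))>"
expanded_minus = expand("<!ELEMENT minus (%expr;, %expr;)>", ENTITIES)
expanded_plus = expand("<!ELEMENT plus (%expr;, (%expr;)+)>", ENTITIES)

print(expanded_minus == long_minus)   # True
print(expanded_plus)
# <!ELEMENT plus ((plus | times | minus | int), ((plus | times | minus | int))+)>
```

For plus and times the expansion carries one redundant pair of parentheses around the repeated group, which does not change the content model.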

What does calc1.dtd give us? For SaxCalc: it documented the format. It could be used for authoring, both as documentation and in tools (autocompletion, error correction); it can (partially) generate examples. It could be used to check input, by validation or by hard-coded checks. But it doesn't say what to do with the format, and it doesn't capture all the constraints: some must be hard coded (integers!).

What is validity? XML 1.0 definition <http://www.w3.org/TR/REC-xml/#sec-prolog-dtd>: [Definition: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.] Two conditions (beyond being well-formed): the document has a document type declaration, and it meets the constraints of the document type definition (the DTD). Sometimes known as internally valid (though that's a bit of a misnomer; the spec calls this Valid). The document says what it is.

Is this a good notion? SaxCalc error handling: what would a user want us to do?
Document 1:
<expression>
  <plus>
    <times>
      <int value="7"/>
      <minus> <int value="4"/> <int value="24"/> </minus>
    </times>
    <int value="7"/>
  </plus>
</expression>
Document 2:
<html>
  <expression>
    <plus>
      <times>
        <int value="7"></int>
        <minus> <int value="4"></int> <int value="24"></int> </minus>
      </times>
      <int value="7"></int>
    </plus>
  </expression>
</html>
Document 3:
<html>
  <expression>
    <plus>
      Multiplication is cool!
      <times>
        <int value="7"></int>
        <minus> <int value="4"></int> <int value="24"></int> </minus>
      </times>
      <int value="7"></int>
    </plus>
  </expression>
</html>
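One possible answer, sketched as a small recursive evaluator over the nested design. This is not SaxCalc itself (it builds a tree with ElementTree rather than using SAX); `evaluate` and `evaluate_document` are made-up names, and rejecting anything but an expression root is just one of several defensible policies.

```python
import math
import xml.etree.ElementTree as ET

def evaluate(node):
    """Evaluate a nested plus/times/minus/int element."""
    if node.tag == "int":
        return int(node.attrib["value"])      # int() fails loudly on non-integers
    args = [evaluate(child) for child in node]
    if node.tag == "plus":
        return sum(args)
    if node.tag == "times":
        return math.prod(args)
    if node.tag == "minus":
        left, right = args                    # exactly two operands
        return left - right
    raise ValueError("unexpected element: " + node.tag)

def evaluate_document(text):
    root = ET.fromstring(text)
    if root.tag != "expression" or len(root) != 1:
        raise ValueError("expected an <expression> root with exactly one child")
    return evaluate(root[0])

DOC = """<expression>
  <plus>
    <times>
      <int value="7"/>
      <minus> <int value="4"/> <int value="24"/> </minus>
    </times>
    <int value="7"/>
  </plus>
</expression>"""

result = evaluate_document(DOC)
print(result)   # -133, i.e. 7 * (4 - 24) + 7
```

With this policy the html-wrapped documents are rejected, while stray text such as "Multiplication is cool!" inside an operator would be silently ignored; whether either behaviour is what a user wants is the question the slide is asking.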

External validation XML Documents conform to many DTDs That is, are valid with respect to many DTDs Some tighter, some looser Different schemas can offer different things External validation breaks self-describingness

External validation (2). Internal validation can go wrong in a lot of ways. Missing declaration: the user forgot! Missing definition: internal subsets are verbose and hard to update, and Web-based schemas have problems (you can't depend on them, since the app breaks if the server dies, and they are hard on the server). External validation with DTDs is poorly handled by SAX; even redirecting is a bit of a PITA. Other schema languages fare better: javax.xml.validation.Validator.

Error handling. Do we need anything? In SaxCalc, you could get errors for free: parseInt failures. Keep track of arguments: why not hard code? May need to track characters(). But throwing unhandled exceptions is cool, right? You'll get a stack trace! Separate input checking and evaluation: allow the format to evolve more or less separately, within certain constraints. What about non-well-formed documents?

DTDs as a language. We can evaluate it! Usability: the syntax isn't too bad; regular expression syntax is terse, yet readable(ish); some things get awkward. Expressivity: what can we say? We can't constrain text to be integers. Maintainability, readability, evolvability: parameter entities are pretty weak (textual macros!). Computability: how hard is it to validate? Can we do better?

Case study 2 or, my brain hurts

M1: An XML Syntax for DTDs DTDs do not have an XML based syntax, which makes their manipulation by XML tools difficult. In this assignment, you will create an XML format (called "DTD/XML") for a subset of the DTD formalism. The root element of your format is named dtdx. The children of your root element are element declarations, attribute declarations, or comments, roughly following the markupdecl production from the XML 1.0 Recommendation. Note that you are not covering the full DTD language. DTD is a language for describing/manipulating XML But DTDs are not homoiconic That is, they are not represented in the same language they represent The meta level and the object level are distinct Thus, e.g., can t use XPath to count element declarations Can t use DTDs to define subsets or extensions Is homoiconicity desirable?

An argument the first striking and unexpected observation is that most published DTDs are not correct, with missing elements, wrong syntax or incompatible attribute declarations. This might prove that such DTDs are being used for documentation purposes only and are not meant to be used for validation at all. The reason might be that because of the ad-hoc syntax of DTDs (inherited from SGML), there are no standard tools to validate them. The observation: Published DTDs are broken Explanation: Ad hoc syntax inhibits tools Solution: Use XML for the syntax This issue will be addressed by proposals that use XML as the syntax to describe DTDs themselves http://www.springerlink.com/content/jrabc1a5hvpdmhc9/ Everything You Ever Wanted to Know About DTDs, But Were Afraid to Ask

A modelling challenge! First, get clear on the domain What are we trying to represent? Element declarations, attribute declarations, and comments M1 said so! What is the structure of the information? Perhaps an example? <!ELEMENT foo EMPTY> Covering? Simple? Corner case? Realistic? Whatʼs a reasonable, simple first design? I trust we donʼt need to consider flat designs. How about a bit of CDATA? <dtdx> <![CDATA[<!ELEMENT foo EMPTY>]]> </dtdx>

That was not a good design. Design 1: move the DTD into an element. Doesn't really change the DTD syntax; validation is not helpful; no nice querying; still need a traditional DTD (DTD/trad) syntax parser. Design 2: lots of elements, some attributes (comment, element, choice, seq, plus, etc.); some nesting; mirror the syntax tree!

It's verbose. Why all the attributes?
<!DOCTYPE dtdx SYSTEM "dtdx1.dtd">
<dtdx>
  <comment>I omit most comments for brevity in this example.</comment>
  <element name="expression">
    <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice>
  </element>
  <element name="plus">
    <seq>
      <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice>
      <plus>
        <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice>
      </plus>
    </seq>
  </element>
  <element name="times">
    <seq>
      <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice>
      <plus>
        <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice>
      </plus>
    </seq>
  </element>
  <element name="minus">
    <seq>
      <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice>
      <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice>
    </seq>
  </element>
  <element name="int"><empty/></element>
  <attlist on="int">
    <attdef name="value">
      <tokenized type="nmtoken"/>
      <required/>
    </attdef>
  </attlist>
</dtdx>

Why attributes? DTDs can constrain attributes More than elements In two ways: Type constraints Key constraints Both are helpful for representing DTD syntax!

Examples. Key constraints: every element can have at most one declaration, <!ATTLIST element name ID #REQUIRED>; every ref must have a corresponding def, <!ATTLIST ref to IDREF #REQUIRED>; every attlist ref must have a corresponding def, <!ATTLIST attlist on IDREF #REQUIRED>. Type constraints: IDs and IDREFs must be NMTOKENS (convenient!); tokenized attribute values have a constrained set, <!ATTLIST tokenized type (ID | IDREF | IDREFS | ENTITY | ENTITIES | NMTOKEN | NMTOKENS) #REQUIRED> (this could be done with an element).

Our Design Design 1 Move the DTD into an element Doesn t really change the DTD syntax Validation is not helpful No nice querying Still need a traditional DTD (DTD/trad) syntax parser Design 2.1 Lots of elements, attributes for type and key constraints comment, element, choice, seq, plus, etc. Some nesting Mirror the syntax tree! Use ID and IDREF as appropriate

Tail wagging dog. One design decision is odd: attributes for name, ref, etc. It is not driven by general design considerations; it is driven by limitations of our schema language! Another strike against DTDs! Ad hoc syntax. Expressivity limitations (enough to distort modelling): key constraints only on attribute content; type constraints only on attribute content; limited types (no integers!). Poor structured development support: parameter entities!

Evaluating Design 2.1 Design 2.1 Lots of elements, attributes for type and key constraints comment, element, choice, seq, plus, etc. Some nesting Mirror the syntax tree! Use ID and IDREF as appropriate Authoring Bit verbose, but good editor support DTD constraints Not bad at all! SAX/DOM processing Pretty good Queries?

Queries
Get all comments: //comment
Get all elements with a top level choice: //element/choice/..
Get all elements with a choice anywhere: /*/element//choice/ancestor::element
Get all attribute declarations on <int>: /*/attlist[@on="int"]
How many comments? count(/*/comment)
Queryability excellent!
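These queries can be approximated with ElementTree's XPath subset, as a sketch over a cut-down DTD/XML document (the document below is an abbreviated assumption, not the full example). ElementTree has no ancestor axis and no count(), so those two queries are rewritten: `element[choice]` replaces `//element/choice/..`, and Python does the rest.

```python
import xml.etree.ElementTree as ET

# Abbreviated DTD/XML document, assumed for illustration
DTDX = """<dtdx>
  <comment>A cut-down example.</comment>
  <element name="expression">
    <choice><ref to="plus"/><ref to="int"/></choice>
  </element>
  <element name="plus">
    <seq>
      <choice><ref to="plus"/><ref to="int"/></choice>
      <plus><choice><ref to="plus"/><ref to="int"/></choice></plus>
    </seq>
  </element>
  <element name="int"><empty/></element>
  <attlist on="int">
    <attdef name="value"><tokenized type="nmtoken"/><required/></attdef>
  </attlist>
</dtdx>"""

root = ET.fromstring(DTDX)   # the dtdx element

comments = root.findall(".//comment")                         # //comment
top_level_choice = root.findall("element[choice]")            # elements with a choice child
choice_anywhere = [e for e in root.findall("element")
                   if e.find(".//choice") is not None]        # a choice at any depth
int_attlists = root.findall("attlist[@on='int']")             # /*/attlist[@on="int"]
comment_count = len(root.findall("comment"))                  # count(/*/comment)

print([e.get("name") for e in top_level_choice])   # ['expression']
print([e.get("name") for e in choice_anywhere])    # ['expression', 'plus']
print(len(int_attlists), comment_count)            # 1 1
```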

XML Namespaces or, making things simpler by making them much more complex

An observation Both calc1-bjp and dtdx-bjp have a plus element <plus> <int value="4"/> <int value="5"/> </plus> <plus> <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice> </plus> We have an element name conflict! How do we distinguish them? Semantically? In a combined document?

Uniquing the names (1) We can add some characters <calcplus> <int value="4"/> <int value="5"/> </calcplus> <dtdxplus> <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice> </dtdxplus> No name clash now But the meaningful part of the name is hard to see calcplus isn t a real word!

Uniquing the names (2) We can use a separator or other convention <calc:plus> <int value="4"/> <int value="5"/> </calc:plus> <dtdx:plus> <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice> </dtdx:plus> No name clash now The meaningful part of the name is clear The disambiguator is clear But we can get clashes! Need a registry to coordinate?

Uniquing the names (3). Use URIs for disambiguation:
<http://bjp.org/calc/:plus> <int value="4"/> <int value="5"/> </http://bjp.org/calc/:plus>
<http://bjp.org/dtdx/:plus> <choice> <ref to="plus"/> <ref to="times"/> <ref to="minus"/> <ref to="int"/> </choice> </http://bjp.org/dtdx/:plus>
No name clash now. The meaningful part of the name is clear. The disambiguator is clear. Clashes are hard to get. There is an existing URI allocation mechanism. But not well formed!

Uniquing the names (4). Combine the two!
<calc:plus xmlns:calc="http://bjp.org/calc/">
  <int value="4"/>
  <int value="5"/>
</calc:plus>
<dtdx:plus xmlns:dtdx="http://bjp.org/dtdx/">
  <choice>
    <ref to="plus"/>
    <ref to="times"/>
    <ref to="minus"/>
    <ref to="int"/>
  </choice>
</dtdx:plus>
No name clash now. The meaningful part of the name is clear. The disambiguator is clear. Clashes are hard to get. There is an existing URI allocation mechanism. Well formed! But the model doesn't know.

Layered!

Anatomy of Namespaces Namespace declarations Qualified names ( QNames ) Prefix Local name Expanded name {http://bjp.org/calc/}plus Namespace name http://bjp.org/calc/ <calc:plus xmlns:calc="http://bjp.org/calc/"> <int value="4"/> <int value="5"/> </calc:plus>

We don't need a prefix. We can have default namespaces: terser, less cluttered, retrofit legacy documents, safer for non-namespace-aware processors. But trickiness! What's the expanded name of int in each document? Default namespaces and attributes interact weirdly.
<plus xmlns="http://bjp.org/calc/">
  <int value="4"/>
  <int value="5"/>
</plus>
<calc:plus xmlns:calc="http://bjp.org/calc/">
  <int value="4"/>
  <int value="5"/>
</calc:plus>
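The question about int can be answered by letting a namespace-aware parser report expanded names. ElementTree writes them in the {namespace name}local form, so this sketch simply prints the tags; `expanded_names` is a made-up helper.

```python
import xml.etree.ElementTree as ET

DEFAULT_NS = '<plus xmlns="http://bjp.org/calc/"><int value="4"/><int value="5"/></plus>'
PREFIXED = ('<calc:plus xmlns:calc="http://bjp.org/calc/">'
            '<int value="4"/><int value="5"/></calc:plus>')

def expanded_names(text):
    """Expanded names of all elements, in document order."""
    return [node.tag for node in ET.fromstring(text).iter()]

default_names = expanded_names(DEFAULT_NS)
prefixed_names = expanded_names(PREFIXED)
attribute_names = list(ET.fromstring(DEFAULT_NS)[0].attrib)

print(default_names)
# ['{http://bjp.org/calc/}plus', '{http://bjp.org/calc/}int', '{http://bjp.org/calc/}int']
print(prefixed_names)
# ['{http://bjp.org/calc/}plus', 'int', 'int']
print(attribute_names)   # ['value']
```

With the default namespace, int inherits the calc namespace; with the prefix, the unprefixed int is in no namespace at all. The unprefixed value attribute is in no namespace in either document, which is the weird interaction: default namespaces do not apply to attributes.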

Multiple namespaces. We can have multiple declarations. Each declaration has a scope: the element where the declaration appears and the descendants of that element, except those descendants which have a conflicting declaration, i.e., a declaration with the same prefix (and their descendants, etc.). Scopes nest and shadow: nested declarations redefine outer declarations.
<plus xmlns="http://bjp.org/calc/" xmlns:n="http://bjp.org/numbers/">
  <n:int value="4"/>
  <n:int value="5"/>
</plus>
<plus xmlns="http://bjp.org/calc/">
  <int xmlns="http://bjp.org/numbers/" value="4"/>
  <int value="5"/>
</plus>
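The scoping rule can be spelled out as a stack of prefix bindings, one frame per open element. The sketch below does this by hand on top of `xml.sax` in its default, non-namespace mode, where xmlns attributes arrive as ordinary attributes. `ScopeResolver` and `resolve` are made-up names, and an undeclared prefix is quietly mapped to no namespace, where a real namespace-aware parser would report an error.

```python
import xml.sax

class ScopeResolver(xml.sax.ContentHandler):
    """Resolve element QNames to expanded names using nested declaration scopes."""
    def __init__(self):
        super().__init__()
        self.scopes = [{}]    # stack of prefix -> namespace name; "" is the default
        self.expanded = []

    def startElement(self, qname, attrs):
        scope = dict(self.scopes[-1])              # inherit the enclosing scope
        for name in attrs.getNames():
            if name == "xmlns":                    # default namespace declaration
                scope[""] = attrs.getValue(name)
            elif name.startswith("xmlns:"):        # prefixed declaration
                scope[name[len("xmlns:"):]] = attrs.getValue(name)
        self.scopes.append(scope)                  # shadows outer declarations
        prefix, _, local = qname.rpartition(":")
        self.expanded.append("{%s}%s" % (scope.get(prefix, ""), local))

    def endElement(self, qname):
        self.scopes.pop()                          # the declarations go out of scope

def resolve(text):
    handler = ScopeResolver()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.expanded

PREFIXED = ('<plus xmlns="http://bjp.org/calc/" xmlns:n="http://bjp.org/numbers/">'
            '<n:int value="4"/><n:int value="5"/></plus>')
SHADOWED = ('<plus xmlns="http://bjp.org/calc/">'
            '<int xmlns="http://bjp.org/numbers/" value="4"/><int value="5"/></plus>')

print(resolve(PREFIXED))
print(resolve(SHADOWED))
```

In the second document only the first int is in the numbers namespace: its redeclaration of the default namespace ends with that element, so the second int falls back to the calc namespace.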

Much more about NS in our future Issues Namespaces are increasingly controversial Modelling principles Schema language support Speaking of which...

DTDs and Namespaces. Another expressivity limitation: DTDs came before namespaces and were never updated, so DTDs only understand XML without namespaces. Namespaces are syntactically layered, so we can do something with them, but not with full generality. General schema principle: if two documents are the same (equivalent) according to the data model, then if one is valid w.r.t. a schema, so should the other be. This is not true for namespace-equivalent documents and DTDs.

An example A simple document with a namespace <foo xmlns="http://ex.org/"> <bar/> </foo> We can write a DTD for it <!DOCTYPE foo [ <!ELEMENT foo (bar)> <!ATTLIST foo xmlns CDATA #FIXED "http://ex.org/"> <!ELEMENT bar EMPTY> ]> <foo xmlns="http://ex.org/"> <bar/> </foo> This document is valid!

An example (cont) A simple document with a namespace a) <foo xmlns="http://ex.org/"> <bar/> </foo> We can compare with another document b) <ex:foo xmlns:ex="http://ex.org/"> <ex:bar/> </ex:foo> This is, from a namespace point of view, the same as a) But it is not valid wrt: <!ELEMENT foo (bar)> <!ATTLIST foo xmlns CDATA #FIXED "http://ex.org/"> <!ELEMENT bar EMPTY> Not the same!
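A sketch of why a) and b) are the same from a namespace point of view yet different to a DTD. ElementTree gives the expanded names; the names a DTD matches against are the raw qualified names, pulled out here with a deliberately crude regular expression (`qnames` is a made-up helper, adequate only for these two snippets).

```python
import re
import xml.etree.ElementTree as ET

A = '<foo xmlns="http://ex.org/"><bar/></foo>'
B = '<ex:foo xmlns:ex="http://ex.org/"><ex:bar/></ex:foo>'

def expanded_names(text):
    """Element names as a namespace-aware data model sees them."""
    return [node.tag for node in ET.fromstring(text).iter()]

def qnames(text):
    """Element names as written in start tags, which is what a DTD matches."""
    return re.findall(r"<([^\s/>!?]+)", text)

print(expanded_names(A) == expanded_names(B))   # True
print(qnames(A))   # ['foo', 'bar']
print(qnames(B))   # ['ex:foo', 'ex:bar']
```

The DTD declares foo and bar, so it has nothing to say about elements called ex:foo and ex:bar, and it fixes an xmlns attribute that b) does not carry.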

Problems with DTDs. Ad hoc, non-XML-based syntax. Expressivity limitations (enough to distort modelling): key constraints only on attribute content; type constraints only on attribute content; limited types (no integers!); no namespace support. (We'll see more!) Poor structured development support: parameter entities! Poor support from tools and APIs: external validation is tricky in SAX. Some security issues.

Security aside. Consumers: entity expansion attacks ( http://bit.ly/ds6wyt ), i.e., the million laughs attack and quadratic blowup; disappearing DTDs (network failure or removal). Publishers: accidental DOS (http://hsivnen.iki.fi/no-dtd/): "the RSS 0.91 DTD is retrieved over 4 million times per day. That's nuts. Burdening a single third party like that for something as useless as a DTD makes no sense."

Should you use DTDs? Probably not, but: the syntax is reasonable for authoring and reading, so they are useful for prototyping; there are a fair number of DTDs out there, so they are a common denominator. The lack of namespaces hurts a lot, as we'll see. They are more interesting as a starting point.

XML Schema - another schema language for XML 91

Schema languages for XML provide means to define the legal structure of an XML document.
cartoon.dtd, a DTD for cartoon descriptions:
<?xml version="1.0" encoding="utf-8"?>
<!ELEMENT cartoon (prolog, panels)>
<!ATTLIST cartoon copyright CDATA #REQUIRED>
<!ATTLIST cartoon year CDATA #REQUIRED>
<!ELEMENT prolog (series, author, characters)>
<!ELEMENT series (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT characters (character)*>
...
Two documents described by it:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright="United Feature Syndicate" year="2000">
  <prolog>
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
...
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright="Bill Watterson" year="1994">
  <prolog>
    <series>Calvin and Hobbs</series>
    <author>Bill Watterson</author>
    <characters>
      <character>Calvin</character>
      <character>Hobbs</character>
      <character>Snowman</character>
    </characters>
...

Schema languages for XML. A variety of schema languages have been developed for XML; they vary w.r.t. (1) their expressive power: do I have a means to express foo? How hard is it to describe foo? E.g., try to say, in a DTD, that an element must contain, in any order, an element1, an element2,..., and an element27. (2) Ease of use/understanding: how easy is it to write a schema? How easy is it to understand a schema written by somebody else? (3) The complexity of validating a document w.r.t. a schema: how much space/time does it take to verify whether a document is valid w.r.t. a schema (in the size of document and schema)? E.g., checking this for DTDs requires only space linear in the depth of the document.
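To get a feel for the "in any order" example, here is a sketch that writes out the naive DTD content model listing every ordering. `any_order_content_model` is a made-up helper, and the enumeration ignores the fact that a real DTD content model must also be deterministic, which would require factoring the alternatives without making the model small.

```python
from itertools import permutations
from math import factorial

def any_order_content_model(names):
    """Naive DTD content model: one sequence alternative per ordering of the names."""
    alternatives = ["(" + ", ".join(order) + ")" for order in permutations(names)]
    return "(" + " | ".join(alternatives) + ")"

model = any_order_content_model(["a", "b", "c"])
print(model)
# ((a, b, c) | (a, c, b) | (b, a, c) | (b, c, a) | (c, a, b) | (c, b, a))

# Three elements need 3! = 6 alternatives; 27 elements would need 27!,
# which is more than 10**28.
alternatives_for_27 = factorial(27)
print(alternatives_for_27 > 10**28)   # True
```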

Schema languages for XML
  provide means to define the legal structure of an XML document

  cartoon.xsd, an XML schema for cartoon descriptions:

    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified">
      <xs:element name="cartoon">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="prolog"/>
            <xs:element ref="panels"/>
          </xs:sequence>
          <xs:attributeGroup ref="attlist.cartoon"/>
          ...

  two cartoon descriptions:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE cartoon SYSTEM "cartoon.dtd">
    <cartoon copyright="United Feature Syndicate" year="2000">
      <prolog>
        <series>Dilbert</series>
        <author>Scott Adams</author>
        <characters>
          <character>The Pointy-Haired Boss</character>
          <character>Dilbert</character>
        </characters>
        ...

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE cartoon SYSTEM "cartoon.dtd">
    <cartoon copyright="Bill Watterson" year="1994">
      <prolog>
        <series>Calvin and Hobbs</series>
        <author>Bill Watterson</author>
        <characters>
          <character>Calvin</character>
          <character>Hobbs</character>
          <character>Snowman</character>
        </characters>
        ...

XML Schema
  XML Schema is also referred to as XML Schema Definition (XSD)
  is a W3C standard, see http://www.w3.org/XML/Schema
  can be seen as successor of DTDs:
    a DTD is not a well-formed XML document
    an XML Schema is a well-formed XML document
  XML Schema is mostly more expressive than DTDs; we'll talk about this at length
  in contrast to DTDs, XML Schema supports
    namespaces, so we can combine several documents: for schema validation, universal names are used (rather than qualified names)
    datatypes, including simple datatypes for parsed character data and for attribute values, e.g., for dates (when was 11/10/2006?)
  XML Schema provides more support for describing the (element and mixed) content of elements

XML Schema: a first example
  Example with DTD:

    <?xml version="1.0"?>
    <!DOCTYPE note SYSTEM "note.dtd">
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <senton>2007-01-29</senton>
      <body> Have a nice weekend! </body>
    </note>

  note.dtd:

    <?xml version="1.0" encoding="utf-8"?>
    <!ELEMENT note (to, from, senton, body)>
    <!ELEMENT to (#PCDATA)>
    <!ELEMENT from (#PCDATA)>
    <!ELEMENT senton (#PCDATA)>
    <!ELEMENT body (#PCDATA)>

XML Schema: a first example
  The same document, now using a namespace:

    <?xml version="1.0"?>
    <note xmlns="http://www.w3schools.com"
          xmlns:xs="http://www.w3.org/2001/XMLSchema"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <to>Tove</to>
      <from>Jani</from>
      <senton>2007-01-29</senton>
      <body> Have a nice weekend! </body>
    </note>

  note.xsd:

    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.w3schools.com"
               xmlns="http://www.w3schools.com"
               elementFormDefault="qualified">
      <xs:element name="note">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="to" type="xs:string"/>
            <xs:element name="from" type="xs:string"/>
            <xs:element name="senton" type="xs:date"/>
            <xs:element name="body" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

XML Schema: some remarks
  to validate an XML document against an XML schema, we use a validating XML parser that supports XML Schema, e.g., DOM level 2, SAX2
  in XML Schema, each element and type can only be declared once
  almost all elements can contain an element <xs:annotation>...</xs:annotation> as their first child: useful, e.g., for

    <xs:simpleType name="northweststates">
      <xs:annotation>
        <xs:documentation>States in the Pacific Northwest of US</xs:documentation>
      </xs:annotation>
      <xs:restriction base="xs:string"></xs:restriction>
    </xs:simpleType>

  XML Schema provides support for modularity & re-use through
    xs:import
    xs:include
    xs:redefine

XML Schema & Namespaces
  most XML schemas start like this, e.g., in note.xsd:

    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.w3schools.com"
               xmlns="http://www.w3schools.com"
               elementFormDefault="qualified">
      ..
    </xs:schema>

    xmlns:xs is the XML Schema namespace, e.g., for datatypes
    targetNamespace is the target namespace of the elements defined in this schema

  and a document using such a schema looks like this:

    <?xml version="1.0"?>
    <note xmlns="http://www.w3schools.com"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    xmlns is the local (default) namespace
    xmlns:xsi says that this document uses a schema

XML Schema & Namespaces
  in contrast to DTDs, XML Schema supports namespaces
  an XML Schema either has no namespace or 2 namespaces:
    the targetNamespace for those elements defined in the schema, and
    the XMLSchema namespace http://www.w3.org/2001/XMLSchema

  a document using a prefix for the target namespace:

    <?xml version="1.0"?>
    <p:note xmlns:p="http://www.w3schools.com"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <p:to>Tove</p:to>

  the same document using a default namespace:

    <?xml version="1.0"?>
    <note xmlns="http://www.w3schools.com"
          xmlns:xs="http://www.w3.org/2001/XMLSchema"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <to>Tove</to>

  note.xsd:

    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.w3schools.com"
               xmlns="http://www.w3schools.com"
               elementFormDefault="qualified">
      <xs:element name="note">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="to" type="xs:string"/>
            ...
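The point that validation works on universal names rather than qualified names can be checked with any namespace-aware parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which reports element names in the `{namespace-URI}local-name` form; the two document strings are shortened versions of the prefixed and default-namespace variants of the note example.

```python
import xml.etree.ElementTree as ET

# Same content, once with an explicit prefix and once with a default namespace.
prefixed = (
    '<p:note xmlns:p="http://www.w3schools.com">'
    "<p:to>Tove</p:to>"
    "</p:note>"
)
defaulted = (
    '<note xmlns="http://www.w3schools.com">'
    "<to>Tove</to>"
    "</note>"
)

root_a = ET.fromstring(prefixed)
root_b = ET.fromstring(defaulted)

# The qualified names differ (p:note vs. note), the universal names do not.
print(root_a.tag)     # {http://www.w3schools.com}note
print(root_b.tag)     # {http://www.w3schools.com}note
print(root_a[0].tag)  # {http://www.w3schools.com}to
```

Both documents therefore look identical to a schema whose target namespace is http://www.w3schools.com, whichever prefix the author chose.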

XML Schema core concepts: datatypes
  in the previous examples, we used 2 built-in datatypes: xs:string and xs:date
  many more:
    built-in/atomic/primitive, e.g., xs:dateTime
    composite/user-defined, e.g., xs:list, xs:union
    through restrictions/user-defined, e.g., ints < 10

XML Schema core concepts: datatypes
  each XML Schema datatype comes with
    a value space, e.g., for xs:boolean, this is {true, false},
    a lexical space, e.g., for xs:boolean, this is {true, false, 1, 0},
    a lexical-to-value mapping that need be neither injective nor surjective (for boolean, it's surjective, but not injective), and
    constraining facets that can be used in restrictions of that datatype, e.g., maxInclusive, maxExclusive, minInclusive, minExclusive for most ...
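The three ingredients of a datatype can be written down directly for xs:boolean. This is only a sketch of the idea, not a schema processor: the dictionary and function names are made up for the illustration, and stripping surrounding whitespace before the lookup is an assumption about how the lexical form is normalised.

```python
# Lexical space {true, false, 1, 0} mapped onto the value space {True, False}.
LEXICAL_TO_VALUE = {"true": True, "1": True, "false": False, "0": False}


def parse_xs_boolean(lexical: str) -> bool:
    """Return the value denoted by a lexical form of xs:boolean."""
    key = lexical.strip()  # assumption: surrounding whitespace is removed first
    if key not in LEXICAL_TO_VALUE:
        raise ValueError(f"not in the lexical space of xs:boolean: {lexical!r}")
    return LEXICAL_TO_VALUE[key]


# Surjective: every value is denoted by at least one lexical form.
print(set(LEXICAL_TO_VALUE.values()) == {True, False})

# Not injective: two different lexical forms denote the same value.
print(parse_xs_boolean("true") == parse_xs_boolean("1"))
```

Note that "TRUE" or "yes" are rejected: they are outside the lexical space, even though a human reader would guess what they mean.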

XML Schema: types
  We can define types in XSD, in two ways:
    xs:simpleType for simple types, to be used for attribute values and elements without element child nodes and without attributes
    xs:complexType for complex types, to be used for elements with element content, or mixed element content, or text content and attributes

XML Schema: type declarations
  can be anonymous, e.g., in the definition of age or person below:

    <xs:element name="age">
      <xs:simpleType>
        <xs:restriction base="xs:integer">
          <xs:minInclusive value="3"/>
          <xs:maxInclusive value="7"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>

    <xs:element name="person">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="Name" type="NameType"/>
          <xs:element name="DoB" type="xs:date"/>
        </xs:sequence>
        <xs:attribute name="friend" type="xs:boolean"/>
      </xs:complexType>
    </xs:element>

  can be named, e.g., AgeType or PersonType:

    <xs:element name="age" type="AgeType"/>
    <xs:simpleType name="AgeType">
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="3"/>
        <xs:maxInclusive value="7"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:element name="person" type="PersonType"/>
    <xs:complexType name="PersonType">
      <xs:sequence>
        <xs:element name="Name" type="NameType"/>
        <xs:element name="DoB" type="xs:date"/>
      </xs:sequence>
      <xs:attribute name="friend" type="xs:boolean"/>
    </xs:complexType>

XML Schema: atomic simple types
  are based on the numerous built-in datatypes
  that can be restricted using xs:restriction facets, e.g.,
    enumeration
    length, maxLength, minLength
    maxExclusive/maxInclusive, minExclusive/minInclusive
    patterns (using regular expressions close to Perl's)

    <xs:simpleType name="biketype">
      <xs:restriction base="xs:string">
        <xs:enumeration value="MTB"/>
        <xs:enumeration value="road"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:simpleType name="eightchar">
      <xs:restriction base="xs:string">
        <xs:length value="8"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:simpleType name="medstr">
      <xs:restriction base="xs:string">
        <xs:minLength value="5"/>
        <xs:maxLength value="8"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:simpleType name="age">
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="120"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:simpleType name="simplestr">
      <xs:restriction base="xs:string">
        <xs:pattern value="([a-z][a-z])+"/>
      </xs:restriction>
    </xs:simpleType>
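What a validator does with these facets can be sketched in a few lines. The function names below are invented for the illustration, and Python's `re` syntax is used as a stand-in for XML Schema's own regular expression dialect, which is close but not identical. A pattern facet has to match the whole value, hence `re.fullmatch`.

```python
import re


def check_string(value, *, length=None, min_length=None, max_length=None,
                 enumeration=None, pattern=None):
    """Check a string value against a few restriction facets."""
    if length is not None and len(value) != length:
        return False
    if min_length is not None and len(value) < min_length:
        return False
    if max_length is not None and len(value) > max_length:
        return False
    if enumeration is not None and value not in enumeration:
        return False
    if pattern is not None and re.fullmatch(pattern, value) is None:
        return False
    return True


def check_integer(value, *, min_inclusive=None, max_inclusive=None):
    """Check an integer value against the inclusive bound facets."""
    if min_inclusive is not None and value < min_inclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    return True


# The restricted types from the slide, expressed as facet checks.
print(check_string("MTB", enumeration={"MTB", "road"}))    # biketype
print(check_string("abcdefgh", length=8))                  # eightchar
print(check_string("abcd", min_length=5, max_length=8))    # medstr, too short
print(check_integer(121, min_inclusive=0, max_inclusive=120))  # age, too large
print(check_string("abab", pattern=r"([a-z][a-z])+"))      # simplestr
```

Each restricted type accepts a subset of the values of its base type, which is what makes restriction a safe form of derivation.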

XML Schema: composite simple types
  we can use built-in datatypes not only in restrictions, but also in compositions to
    xs:list
    xs:union

    <xs:simpleType name='mylist'>
      <xs:list itemType='xs:integer'/>
    </xs:simpleType>

    <xs:simpleType name='shortlist'>
      <xs:restriction base='mylist'>
        <xs:maxLength value='8'/>
      </xs:restriction>
    </xs:simpleType>

    <xs:simpleType name="colourlistordate">
      <xs:union memberTypes="colourlist xs:date"/>
    </xs:simpleType>

    <xs:simpleType name="colourlist">
      <xs:list>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="red"/>
            <xs:enumeration value="green"/>
            <xs:enumeration value="blue"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:list>
    </xs:simpleType>
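A rough way to read these composite types: a list value is a whitespace-separated sequence of items of the item type, and a union value is accepted if one of the member types accepts it. The sketch below mimics shortlist, colourlist and colourlistordate; the function names are invented here, and `date.fromisoformat` is a simplification of xs:date that only handles the plain YYYY-MM-DD form.

```python
from datetime import date

COLOURS = {"red", "green", "blue"}


def parse_shortlist(lexical, max_length=8):
    """List of integers; maxLength on a list counts items, not characters."""
    items = [int(token) for token in lexical.split()]
    if len(items) > max_length:
        raise ValueError("too many list items")
    return items


def parse_colourlist(lexical):
    """List whose items come from an enumeration."""
    items = lexical.split()
    for item in items:
        if item not in COLOURS:
            raise ValueError(f"not a colour: {item!r}")
    return items


def parse_colourlist_or_date(lexical):
    """Union: try the member types in the order in which they are listed."""
    try:
        return parse_colourlist(lexical)
    except ValueError:
        return date.fromisoformat(lexical.strip())


print(parse_shortlist("1 2 3"))
print(parse_colourlist_or_date("red blue"))
print(parse_colourlist_or_date("2006-10-11"))
```

A value such as "purple" is rejected by both member types, so the union rejects it as well.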

XML Schema: simple types
  can be used in
    element declarations, for elements without element child nodes (instead of PCDATA in DTDs)
    attribute declarations (instead of CDATA in DTDs)
  as in DTDs, we can specify fixed or default values

    <xs:complexType name="PersonType">
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="DoB" type="xs:date"/>
      </xs:sequence>
      <xs:attribute name="friend" type="xs:boolean" default="true"/>
      <xs:attribute name="phone" type="xs:string"/>
    </xs:complexType>

XML Schema: simple content
  xs:simpleType is for attribute values and elements without element child nodes and without attributes
  for elements where we cannot use xs:simpleType because of attribute declarations, but that have simple (e.g., text) content only, we can use xs:simpleContent, e.g.

    <xs:element name="size">
      <xs:complexType>
        <xs:simpleContent>
          <xs:extension base="xs:integer">
            <xs:attribute name="country" type="xs:string"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:element>

XML Schema: complex types
  xs:complexType is for elements with element content, or mixed element content, or text content and attributes
  element order enforcement constructs:
    sequence: order preserving
    all: like sequence, but not order preserving
    choice: choose exactly one

    <xs:complexType name="NameType">
      <xs:sequence>
        <xs:element name="fname" type="xs:string"/>
        <xs:element name="lname" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>

  these constructs can be combined with minOccurs and maxOccurs: by default, both are set to 1, but minOccurs can be set to any non-negative integer and maxOccurs to any non-negative integer or unbounded, e.g.

    <xs:complexType name="NameType">
      <xs:sequence>
        <xs:element name="fname" type="xs:string"/>
        <xs:element name="mname" type="xs:string" minOccurs="0" maxOccurs="7"/>
        <xs:element name="lname" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
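The effect of minOccurs and maxOccurs on a sequence can be sketched as a matcher over the list of child element names. This is a simplified illustration with invented names, not how a real validator is implemented: it only handles a flat sequence of element particles, and its greedy strategy assumes that neighbouring particles have different names.

```python
def matches_sequence(children, particles):
    """Check child element names against (name, min_occurs, max_occurs) particles.

    max_occurs=None stands for maxOccurs="unbounded".
    """
    position = 0
    for name, min_occurs, max_occurs in particles:
        count = 0
        while (position < len(children)
               and children[position] == name
               and (max_occurs is None or count < max_occurs)):
            position += 1
            count += 1
        if count < min_occurs:
            return False
    # Every child must have been consumed by some particle.
    return position == len(children)


# The second NameType from the slide: fname, up to seven mname, lname.
NAME_TYPE = [("fname", 1, 1), ("mname", 0, 7), ("lname", 1, 1)]

print(matches_sequence(["fname", "lname"], NAME_TYPE))
print(matches_sequence(["fname", "mname", "mname", "lname"], NAME_TYPE))
print(matches_sequence(["lname", "fname"], NAME_TYPE))
print(matches_sequence(["fname"] + ["mname"] * 8 + ["lname"], NAME_TYPE))
```

The first two calls succeed; the third fails because sequence is order preserving, and the fourth fails because eight mname children exceed maxOccurs="7".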

XML Schema: mixed content
  to allow for mixed content, set attribute mixed="true", e.g.,

    <xs:complexType name="PersonType" mixed="true">
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="DoB" type="xs:date"/>
      </xs:sequence>
      <xs:attribute name="friend" type="xs:boolean" default="true"/>
      <xs:attribute name="phone" type="xs:string"/>
    </xs:complexType>

  like in DTDs, we cannot constrain where the text occurs between elements; we can only say that content can be mixed

XML Schema: restriction and extension
  we have already used xs:extension and xs:restriction, both for simple types and complex types
  they are XML Schema's mechanisms for inheritance
  extension: specifying a new type X by extending Y; this appends X's definition to Y's, e.g.,

    <xs:simpleType name="AgeType">
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="3"/>
        <xs:maxInclusive value="7"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:complexType name="NewAgeType">
      <xs:simpleContent>
        <xs:extension base="AgeType">
          <xs:attribute name="range" type="xs:string"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>

    <xs:complexType name="PersonType">
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="DoB" type="xs:date"/>
      </xs:sequence>
      <xs:attribute name="friend" type="xs:boolean" default="true"/>
      <xs:attribute name="phone" type="xs:string"/>
    </xs:complexType>

    <xs:complexType name="LongPersonType">
      <xs:complexContent>
        <xs:extension base="PersonType">
          <xs:sequence>
            <xs:element name="address" type="xs:string"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>

XML Schema: restriction and extension
  restriction:
    easy for simple types; we have seen it several times

      <xs:simpleType name="AgeType">
        <xs:restriction base="xs:integer">
          <xs:minInclusive value="3"/>
          <xs:maxInclusive value="7"/>
        </xs:restriction>
      </xs:simpleType>

    cumbersome for complex types: specifying a new type X by restricting a complex type Y requires the reproduction of Y's definition, e.g.,

      <xs:complexType name="PersonType">
        <xs:sequence>
          <xs:element name="Name" type="xs:string"/>
          <xs:element name="DoB" type="xs:date"/>
        </xs:sequence>
        <xs:attribute name="friend" type="xs:boolean"/>
        <xs:attribute name="phone" type="xs:string"/>
      </xs:complexType>

      <xs:complexType name="StrictPersonType">
        <xs:complexContent>
          <xs:restriction base="PersonType">
            <xs:sequence>
              <xs:element name="Name">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:pattern value="[a-z]([a-z]+)"/>
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="DoB" type="xs:date"/>
            </xs:sequence>
            <xs:attribute name="friend" type="xs:boolean"/>
            <xs:attribute name="phone" type="xs:string"/>
          </xs:restriction>
        </xs:complexContent>
      </xs:complexType>

XML Schema: restriction and extension
  usage: in a document, an element of a type derived by restriction or extension from Y can be used in place of an element of type Y, provided you say so explicitly, e.g., in

    <person phone="2">
      <Name>Peter</Name>
      <DoB>1966-05-04</DoB>
    </person>

    <person xsi:type="LongPersonType" phone="5432" friend="0">
      <Name>Paul</Name>
      <DoB>1967-05-04</DoB>
      <address>Manchester</address>
    </person>

  this means that a validating XML parser has to manage a schema's type hierarchy, to ensure that LongPersonType was really derived by restriction or extension from the type expected for person
  but it doesn't have to guess an element's type from its properties
  also: compare the pain & gain of using types to the pain & gain of using substitution groups!
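"Managing the type hierarchy" boils down to following base-type links. A minimal sketch, assuming the schema's derivations have already been collected into a dictionary (the dictionary, the function names and the use of xs:anyType as the root are choices made for this illustration):

```python
# Derived type -> base type, as declared by xs:extension / xs:restriction.
BASE_TYPE = {
    "LongPersonType": "PersonType",
    "StrictPersonType": "PersonType",
    "PersonType": "xs:anyType",
}


def is_derived_from(type_name, expected):
    """True if type_name equals expected or is derived from it, in any number of steps."""
    current = type_name
    while current is not None:
        if current == expected:
            return True
        current = BASE_TYPE.get(current)  # None once we run out of base types
    return False


def accept_xsi_type(declared_type, xsi_type=None):
    """Pick the type to validate against, or fail if xsi:type is not a legal substitute."""
    if xsi_type is None:
        return declared_type
    if not is_derived_from(xsi_type, declared_type):
        raise ValueError(f"{xsi_type} is not derived from {declared_type}")
    return xsi_type


# <person> is declared with PersonType; the second person claims LongPersonType.
print(accept_xsi_type("PersonType"))
print(accept_xsi_type("PersonType", "LongPersonType"))
```

The parser never has to infer a type from the element's children: it either uses the declared type or checks the one named in xsi:type.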

XML Schema: restriction and extension
  to prevent a type from being instantiated directly, use e.g.,

    <xs:complexType name="StrictPersonType" abstract="true">

  to prevent a type from being further extended and/or restricted, use e.g.,

    <xs:complexType name="StrictPersonType" final="#all">

  closely related to the mechanism of restriction/extension are substitution groups, i.e., a mechanism that allows one element to be replaced with one of a group of others

XML Schema: summary of complex types
  we have simple and complex types:
    simple types for attribute values and text in elements
    complex types for elements with child elements or attributes
  we have simple and complex content of elements:
    simple content: elements with only text between tags and possibly attributes
    complex content:
      element content (elements only)
      mixed content (elements and text)
      empty content (at most attributes)
  a complex type can be specified in 3 ways:
    using element order enforcement constructs (all, sequence, choice)
    a single child of simpleContent: derive a complex type from a simple type or from a complex type with simple content
    a single child of complexContent: derive a complex type from another complex type using restriction or extension

Comparing XML Schema & DTDs
  You know one better than the other
  one is simpler than the other
  in DTDs, no equivalent to xs:all in XML Schema
  in DTDs, no cool & useful datatypes, lists, unions
  in DTDs, no restriction & extension, no types:
    in a document, an element of a type derived by restriction from Y can be used in place of an element of type Y
    this can make writing complex schemas easier!
    but this means that a validating XML parser has to manage a schema's type hierarchy
  we will see later that both DTDs and XML Schema have additional constraints on content models, so that matching a node's child node sequence against the corresponding content model is easier
    ...will be discussed next week
  is there a set of XML documents (e.g., your cartoon descriptions) for which we can formulate a DTD but not an XML schema? or the other way round?
    ...more next week

Let's have a look at this week's coursework