Structured documents
An overview of XML
Michael Houghton, 15/11/2000
Unstructured documents
Broadly speaking, text and multimedia document formats can be structured or unstructured. An unstructured document simply contains the instructions needed to render the content on screen, such as:
- position information
- typefaces and sizes
- colours
They are typically stored in binary, and their formats are often secret, hindering document exchange. Examples include PostScript, PDF, TeX, and Word (mostly).
An example
Structured documents
Structured document formats describe the function of each part of a document, for instance:
- titles, subtitles
- citations, quotes
- table of contents, index
They are often encoded in text (ASCII or Unicode), with the emphasis on document sharing. Structured documents are friendlier to automated processing (e.g. autogeneration of indices). Examples include LaTeX, HTML, and XML.
Structure benefits
A known document structure enables uses other than reading and printing:
- Separation of style and content
    e.g. company-wide styles, structured templates
- Automated document production
    e.g. generation of indices, tables of contents
- Archiving and retrieval
    e.g. searching for documents with a given title
- Metadata applications
    e.g. keywords, library indexing, annotations, interdocument linking
Markup languages
Markup is the system of annotations used to express the structure of a document. Markup can take two forms:
- Macros
    a code fragment or function (as in an office application)
- Tags
    a text sequence identifying the start or end of a part of the document
HTML and XML are derived from SGML, which stands for Standard Generalised Markup Language. SGML uses the 'tag' form.
Markup concepts in SGML
Structured documents in SGML consist of plain text, with tag sequences identifying the document structure, for example:
    <author> John Q Normal </author>
This fragment describes two elements:
- author
- a text element: "John Q Normal"
A structured document in SGML is a nested set of elements, with a single parent element, called the document element.
A simple document (1)
    <person>
      <name> <!-- the person's name -->
        <given> John </given>
        <family> Random </family>
        <initial> Q </initial>
        <prefers> Jack </prefers>
      </name>
      <contact> <!-- their contact details -->
        <email> johnq@random.org </email>
        <phone> 0555 555555 </phone>
      </contact>
    </person>
A simple document (2)
In the previous example, the document element is <person>. Some important points:
- '<' and '>' are (usually) special characters
    To appear in a text element, they must be 'escaped' as &lt; and &gt; respectively
- Whitespace is preserved
    However, it is often ignored by applications
- '<!--' and '-->' enclose comments
    These are special sequences identifying a note that annotates the document but is not part of the structure.
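As a minimal sketch of the escaping rule, the Python standard library's `xml.sax.saxutils` module can encode and decode the special characters. The sample string is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# A made-up sentence containing characters that are special in markup.
raw = "if a < b and b > c"

# escape() replaces '&', '<' and '>' with entity references.
escaped = escape(raw)
print(escaped)                   # if a &lt; b and b &gt; c

# unescape() reverses the substitution.
print(unescape(escaped) == raw)  # True
```

The ampersand is escaped as well, since it introduces the entity references themselves.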
Tree structures
SGML documents are often visualised as tree structures. Here's part of the previous document in tree form:
[Figure: tree diagram of the person document]
(Note that this tree ignores whitespace nodes.)
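The tree view can also be approximated in code. This is a rough sketch using Python's standard `xml.etree.ElementTree` parser on a cut-down variant of the person document. The `outline` helper is a hypothetical name introduced here, and like the diagram it skips whitespace-only text.

```python
import xml.etree.ElementTree as ET

# A cut-down variant of the person document, for illustration only.
DOC = """<person>
  <name>
    <given>John</given>
    <family>Random</family>
  </name>
  <contact>
    <email>johnq@random.org</email>
  </contact>
</person>"""

def outline(element, depth=0):
    """Return one indented line per element, ignoring whitespace-only text."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (f': "{text}"' if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document.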
Attributes
Attributes are properties of an element which are not considered part of the main document structure. For example:
    <person id="458">
Attributes have a name and a value. In this example:
- the name is 'id'
- the value is '458'
An element may have any number of attributes, but each must have a different name.
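A short sketch of how a parser exposes attributes, using Python's standard `xml.etree.ElementTree`. The `email` attribute and the second `id` value are invented for the example.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring('<person id="458"><name>John</name></person>')

print(person.attrib)                # {'id': '458'}
print(person.get("id"))             # 458
print(person.get("email", "none"))  # none (attribute absent, default returned)

# Two attributes with the same name make the document ill-formed,
# so the parser rejects it.
try:
    ET.fromstring('<person id="458" id="459"/>')
    duplicate_ok = True
except ET.ParseError:
    duplicate_ok = False

print(duplicate_ok)                 # False
```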
SGML and HTML
HTML is an application of SGML. An application is a set of SGML tags and attributes that can be used to describe a particular class of document. Here's a simple HTML document:
    <html>
      <head>
        <title> John Q. Random's home page </title>
      </head>
      <body>
        <h1>hello there!</h1>
        <p> <font color="blue">welcome to my home page!</font> </p>
      </body>
    </html>
Example HTML
Here's the previous example in a browser:
[Figure: the home page rendered in a browser]
What is XML?
XML stands for Extensible Markup Language. It is an attempt to:
- Introduce formal structure to web documents
- Separate style from content
- Expand the scope and usefulness of web content
It has been designed to allow the creation and combination of custom markup languages. The XML standards suite is supported by the World Wide Web Consortium (W3C).
What's wrong with HTML?
HTML has several failings:
- Little separation of style and content
    Heading and font tags are used interchangeably; current browsers still lack full CSS compliance
- De facto hardcoded presentation
    Leads to widespread incompatibility, and crash-prone browsers
- Little support for automated processing through metadata
    Only <META>, which is of limited use for complex metadata
- Lenient parsers allow poorly structured markup
    e.g. <B>... <I>... </B>... </I>...
Why not use SGML?
XML is a cut-down profile of SGML (Standard Generalised Markup Language). SGML was considered too complicated for web use; complete SGML implementations are large and complex. XML is simpler and stricter than SGML:
- XML requires end tags
    e.g. </P> is not optional
- Empty elements need to be identified
    e.g. <BR> becomes <BR/>
- Attribute values must be quoted
    e.g. id="johnqrandom"
Core XML concepts
- Document validation: checking for correct document structure
    Document Type Definitions (DTDs)
    XML Schema Definition Language
- Presentation and transformation: transformation for display and document exchange
    Cascading Style Sheets (CSS)
    Extensible Stylesheet Language (XSL)
- Internal document structure
    XPath
Some simple XML
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE book SYSTEM "http://somesite.org/book.dtd">
    <book id="nielsen01">
      <title> Designing Web Usability </title>
      <subtitle> The Practice of Simplicity </subtitle>
      <author> Jakob Nielsen </author>
      <info>
        <key> Design </key>
        <key> Internet </key>
        <isbn> 1-56205-810-X </isbn>
      </info>
    </book>
CDATA
If you wish to 'protect' some text from being interpreted as markup, you can:
- encode the '<' and '>' characters as &lt; and &gt;
- enclose all the text in a CDATA section
CDATA sections look like this:
    Here is some text containing a <sequence/> which would be interpreted as markup
    <![CDATA[ This time, the <sequence/> won't be interpreted as markup ]]>
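The two protection methods can be compared with a small sketch using Python's standard `xml.etree.ElementTree`. The `<note>` wrapper element is an assumption made so that each fragment is a complete document.

```python
import xml.etree.ElementTree as ET

# Same text, protected in two different ways, plus an unprotected version.
escaped = ET.fromstring("<note>a &lt;sequence/&gt; here</note>")
cdata = ET.fromstring("<note><![CDATA[a <sequence/> here]]></note>")
markup = ET.fromstring("<note>a <sequence/> here</note>")

# Both protected forms yield identical character data and no child elements.
print(escaped.text)               # a <sequence/> here
print(cdata.text == escaped.text) # True
print(len(cdata))                 # 0

# Unprotected, <sequence/> becomes a real child element.
print(len(markup), markup[0].tag) # 1 sequence
```

To an application reading the parsed document, escaped text and a CDATA section are indistinguishable; the choice only affects how the source looks.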
Checking for correctness
XML parsers check documents for two kinds of correctness:
- Well-formedness
    Checks that tags nest correctly, attributes are quoted, and singleton tags are correctly closed. An XML document must be well-formed.
- Validity
    Checks the document against a DTD, to see if its structure is allowed. Validation is optional; however, a validating parser will fail an invalid document.
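A well-formedness check can be sketched with Python's standard `xml.etree.ElementTree`, which is a non-validating parser: it rejects ill-formed input but does not check a document against a DTD. The helper name `is_well_formed` and the sample fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<b><i>text</i></b>",    # correct nesting
    "<b><i>text</b></i>",    # overlapping tags
    "<p id=intro>text</p>",  # unquoted attribute value
    "<p>line<br></p>",       # singleton tag not closed
    "<p>line<br/></p>",      # singleton tag closed correctly
]

for sample in samples:
    print(is_well_formed(sample), sample)
```

Only the first and last samples pass; the other three break exactly the rules listed under well-formedness.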
The DTD (1)
A Document Type Definition (DTD) describes the possible valid structures of an XML document. A DTD can be associated with the document in two ways:
- As a linked document in the XML header, e.g.
    <?xml version="1.0"?>
    <!DOCTYPE book SYSTEM "http://somesite.org/book.dtd">
- By directly embedding it into the document
The DTD appears before the document root node, but after the XML declaration.
The DTD (2)
Here's a DTD for a book like the example:
    <!DOCTYPE book [
      <!ELEMENT book (title, subtitle?, author+, info) >
      <!ATTLIST book id CDATA #REQUIRED >
      <!ELEMENT title (#PCDATA) >
      <!ELEMENT subtitle (#PCDATA) >
      <!ELEMENT author (#PCDATA) >
      <!ELEMENT info (key*, isbn) >
      <!ELEMENT key (#PCDATA) >
      <!ELEMENT isbn (#PCDATA) >
    ]>
The DTD (3)
These lines define the book element, and the allowed child elements and attributes:
    <!ELEMENT book (title, subtitle?, author+, info) >
    <!ATTLIST book id CDATA #REQUIRED >
The book element consists of:
- a mandatory title
- an optional subtitle
- one or more author elements
- a mandatory info element
The book element has a required attribute id, which consists of a character data value.
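The content model reads much like a regular expression over the sequence of child elements. The sketch below uses that analogy to check the book rules by hand in Python. It is an illustration of what the two declarations mean, not a DTD validator, and `BOOK_MODEL` and `book_is_valid` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# (title, subtitle?, author+, info) written as a regular expression over
# space-terminated child element names.
BOOK_MODEL = re.compile(r"title (subtitle )?(author )+info ")

def book_is_valid(xml_text):
    """Check only the rules for <book>: its children and its id attribute."""
    book = ET.fromstring(xml_text)
    if book.tag != "book" or "id" not in book.attrib:
        return False  # wrong element, or the #REQUIRED attribute is missing
    children = "".join(child.tag + " " for child in book)
    return BOOK_MODEL.fullmatch(children) is not None

good = '<book id="b1"><title>T</title><author>A</author><info/></book>'
no_author = '<book id="b1"><title>T</title><info/></book>'
no_id = '<book><title>T</title><author>A</author><info/></book>'

print(book_is_valid(good))       # True
print(book_is_valid(no_author))  # False
print(book_is_valid(no_id))      # False
```

A real validating parser would also check the declarations for title, info and the other elements; this sketch stops at the book element.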
XML Schema
One drawback of XML DTDs is that they are described in a separate syntax, inherited from SGML. XML Schema offers an alternative way to describe XML document structure, in XML syntax. This provides many benefits:
- simplicity
    XML Schema rules are often easier to understand.
- interrogation of data structure
    XSLT transformations can know more about document structure.
- tool reuse
    The same tools used to create and maintain XML documents can be used to maintain their structure.
An example schema
    <?xml version="1.0"?>
    <schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">
      <element name="book">
        <complexType content="elementOnly">
          <sequence>
            <element ref="title" />
            <element ref="subtitle" minOccurs="0" maxOccurs="1" />
            <element ref="author" minOccurs="1" maxOccurs="unbounded" />
          </sequence>
          <attribute name="id" use="required" type="string"/>
        </complexType>
      </element>
      <element name="title">
        <complexType content="elementOnly" />
      </element>
      ...
    </schema>
Namespaces
Namespaces are the cornerstone of the modular design of the XML framework. XML uses namespaces to allow the combination of different markup languages. For example, this document includes an element to describe a person from another namespace:
    <slide>
      <author>
        <person:name xmlns:person="http://somesite.org/">
          <person:given>John</person:given>
          <person:initial>Q</person:initial>
          <person:family>Random</person:family>
        </person:name>
      </author>
    </slide>
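A brief sketch of how a namespace-aware parser sees such a document, using Python's standard `xml.etree.ElementTree` on a shortened copy of the slide example. The prefix `p` in the lookup table is deliberately different from `person` to show that only the namespace URI matters.

```python
import xml.etree.ElementTree as ET

DOC = """<slide>
  <author>
    <person:name xmlns:person="http://somesite.org/">
      <person:given>John</person:given>
      <person:family>Random</person:family>
    </person:name>
  </author>
</slide>"""

root = ET.fromstring(DOC)

# Prefixes are local shorthand; the parser stores the full namespace URI.
ns = {"p": "http://somesite.org/"}
given = root.find("author/p:name/p:given", ns)

print(given.tag)   # {http://somesite.org/}given
print(given.text)  # John

# Without the namespace the element is not found: 'name' and
# '{http://somesite.org/}name' are different element names.
print(root.find("author/name"))  # None
```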
Stylesheets
XML pages can make use of Cascading Stylesheets in the same way as HTML. However, they can also make use of more sophisticated XSL transformations. XSL is a two-part technology:
- XSLT is a rules-based system used to transform XML documents to other XML forms (such as WML), and to HTML.
- XSL-FO (Formatting Objects) is used to describe presentation objects for rendering in a browser.
An XSL stylesheet will typically map XML to XSL-FO by means of XSLT rules.
XSLT rules
An XSLT stylesheet consists of a set of XSLT rules. These rules are described in XML, with XSLT structures described using the xsl: namespace. Essentially, a rule is a chunk of output document, scripted by XSLT constructs which describe the parts of the source to which the rules are applied. The output is created by applying the closest matching rule to each part of the input document. Some XSLT processors allow extension rules to be written which talk to databases and other back-end systems.
Example XSLT rules
A title rule:
    <xsl:template match="book">
      Book Review: <xsl:value-of select="./title"/>
      by <xsl:value-of select="./author"/>
    </xsl:template>
Applied to the example, this might produce:
    Book Review: Designing Web Usability by Jakob Nielsen
XSLT is a so-called push/pull stylesheet system: push rules (xsl:template) can pull information from other parts of the document.
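Python's standard library has no XSLT processor, so the following is only a rough analogue of the push/pull idea written in plain Python, not XSLT itself. The `RULES` table stands in for `match` patterns, and `book_rule`, `RULES` and `transform` are hypothetical names. Unlike `xsl:value-of`, the sketch trims surrounding whitespace, and its fallback is much simpler than XSLT's built-in rules.

```python
import xml.etree.ElementTree as ET

BOOK = """<book id="nielsen01">
  <title> Designing Web Usability </title>
  <author> Jakob Nielsen </author>
</book>"""

def book_rule(book):
    """Mimic the book template: pull title and author into the output."""
    title = book.findtext("./title").strip()
    author = book.findtext("./author").strip()
    return f"Book Review: {title} by {author}"

# Element name -> rule, loosely playing the role of match="...".
RULES = {"book": book_rule}

def transform(element):
    """Push each element to its rule; with no rule, descend into children."""
    rule = RULES.get(element.tag)
    if rule is not None:
        return rule(element)
    return "".join(transform(child) for child in element)

print(transform(ET.fromstring(BOOK)))
# Book Review: Designing Web Usability by Jakob Nielsen
```

The "push" half is `transform` handing each element to a matching rule; the "pull" half is the rule reaching into the element for the title and author.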
XHTML and migration
As we've seen, HTML is an application of SGML. XHTML is a similar markup language, expressed in XML syntax. XHTML can be viewed in current browsers, with some limitations. Because XHTML is XML, converting HTML to XHTML will allow content created today to be processed into other forms. Conversion can be done with Dave Raggett's HTML Tidy utility, available from the W3C website.
XHTML profiles
Three profiles of XHTML are under development by the Web Consortium:
- Transitional
    This is designed to ease the transition from HTML to XHTML. All of the 'questionable' parts of HTML (font colours, etc.) are still available.
- Strict
    This profile strictly enforces a style/content separation. According to the W3C, it is free of any tags associated with layout, and is intended to be used with W3C's Cascading Style Sheet language (CSS).
- Frameset
    This profile has support for 'multi-framed' web designs.
XML and metadata
Metadata in HTML
HTML's support of metadata is limited to the <META> element. This element supports a simple name/value pair, with two attributes:
- name
    The identifier of the data element
- content
    The data element's content
This scheme is basic; more structured metadata has to be expressed through complicated naming schemes.
Metadata in XHTML (1)
The <META> element still works, in its XML form:
    <meta name="author" content="Joe Bloggs" />
With XSLT, this data can be interrogated and acted on in a stylesheet or transformation. Thus metadata can be preserved while migrating content from HTML to XHTML, then extracted and processed into a more expressive form.
Metadata in XHTML (2)
However, with XHTML you're not restricted to <META>. Using namespaces, it is possible to combine other metadata formats with XHTML. Common uses include adding RDF metadata, which can be done without breaking HTML backward compatibility. Current browsers ignore tags they don't recognise, but still include any text data inside them in the output document, so it is best to use metadata schemes that carry their content in attributes.
Dublin Core/RDF in XHTML
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>open.gov.uk - organisation index (a-b)</title>
        <rdf:RDF xmlns:rdf="http://w3.org/tr/1999/pr-rdf-syntax-19990105#"
                 xmlns:dc="http://purl.org/metadata/dublin_core#">
          <rdf:Description about="http://www.open.gov.uk/index/orgindex.htm"
            dc:creator="Neil Pawley"
            dc:title="open.gov.uk - organisation index"
            dc:subject="organisation, index, listing, directory"
            dc:description="This section contains the open.gov.uk..."
            dc:publisher="CCTA"... />
        </rdf:RDF>
More on RDF
RDF (Resource Description Framework) is a W3C recommendation for general website metadata. It is already used in conjunction with Dublin Core metadata schemas. However, it is also in use:
- in 'intelligent' browsers
    such as Netscape 6 (the 'site summary' tree browser) and Metabrowser (http://metabrowser.spirit.net.au/)
- for content rating
    RDF was inspired by PICS, and there is a PICS to RDF mapping
Metadata migration
Some ideas for metadata migration:
- Convert your HTML content to XHTML
    Use HTML Tidy
- Change your content generation scripts
    If your pages are generated dynamically, make the scripts generate XHTML with RDF
- Translate <META> data
    If your (X)HTML is static, consider crunching it with an XSLT processor to extract <META> data and output RDF instead
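The last idea can be sketched without an XSLT processor, using Python's standard `xml.etree.ElementTree` instead. Everything specific here is an assumption for illustration: the sample page, the `META_TO_DC` name mapping, the `meta_to_rdf` helper, the example.org address, and the use of the current RDF and Dublin Core namespace URIs rather than the draft-era ones shown in the earlier example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # assumed: current RDF URI
DC_NS = "http://purl.org/dc/elements/1.1/"              # assumed: current DC URI

# Hypothetical mapping from <meta> names to Dublin Core properties.
META_TO_DC = {"author": "creator", "keywords": "subject",
              "description": "description"}

# A made-up static XHTML page.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Example page</title>
    <meta name="author" content="Joe Bloggs" />
    <meta name="keywords" content="xml, metadata" />
    <meta name="generator" content="some tool" />
  </head>
  <body><p>Hello</p></body>
</html>"""

def meta_to_rdf(xhtml_text, about):
    """Build an rdf:RDF element carrying mapped <meta> data as attributes."""
    page = ET.fromstring(xhtml_text)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description",
                         {f"{{{RDF_NS}}}about": about})
    for meta in page.iter(f"{{{XHTML_NS}}}meta"):
        dc_name = META_TO_DC.get(meta.get("name", ""))
        if dc_name:  # unmapped names such as 'generator' are skipped
            desc.set(f"{{{DC_NS}}}{dc_name}", meta.get("content", ""))
    return rdf

# Ask the serialiser to use the conventional prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

rdf = meta_to_rdf(PAGE, "http://example.org/page.html")
print(ET.tostring(rdf, encoding="unicode"))
```

The metadata is written as attributes on the description element, in line with the earlier advice to prefer schemes that carry their content in attributes.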