Structured documents

An overview of XML: Structured documents
Michael Houghton, 15/11/2000

Unstructured documents
Broadly speaking, text and multimedia document formats can be structured or unstructured. An unstructured document simply contains the instructions needed to render the content on screen, such as:
  position information
  typefaces and sizes
  colours
They are typically stored in binary, and their formats are often secret, hindering document exchange. Examples include PostScript, PDF, TeX, and Word (mostly).

An example

Structured documents
Structured document formats describe the function of each part of a document, for instance:
  titles, subtitles
  citations, quotes
  table of contents, index
They are often encoded in text (ASCII or Unicode), with the emphasis on document sharing. Structured documents are friendlier to automated processing (e.g. autogeneration of indices). Examples include LaTeX, HTML, and XML.

Structure benefits
A known document structure enables uses other than reading and printing:
  Separation of style and content (e.g. company-wide styles, structured templates)
  Automated document production (e.g. generation of indices, tables of contents)
  Archiving and retrieval (e.g. searching for documents with a given title)
  Metadata applications (e.g. keywords, library indexing, annotations, interdocument linking)

Markup languages
Markup is the system of annotations used to express the structure of a document. Markup can take two forms:
  Macros: a code fragment or function (as in an office application)
  Tags: a text sequence identifying the start or end of a part of the document
HTML and XML are derived from SGML, which stands for Standard Generalised Markup Language. SGML uses the 'tag' form.

Markup concepts in SGML
Structured documents in SGML consist of plain text, with tag sequences identifying the document structure, for example:
  <author> John Q Normal </author>
This fragment describes two elements:
  author
  a text element: "John Q Normal"
A structured document in SGML is a nested set of elements, with a single parent element, called the document element.

A simple document (1)
<person>
  <name>
    <!-- the person's name -->
    <given> John </given>
    <family> Random </family>
    <initial> Q </initial>
    <prefers> Jack </prefers>
  </name>
  <contact>
    <!-- their contact details -->
    <email> johnq@random.org </email>
    <phone> 0555 555555 </phone>
  </contact>
</person>

A simple document (2)
In the previous example, the document element is <person>. Some important points:
  '<' and '>' are (usually) special characters. To appear in a text element, they must be 'escaped' as &lt; and &gt; respectively.
  Whitespace is preserved. However, it is often ignored by applications.
  '<!--' and '-->' enclose comments. These are special sequences identifying a note that annotates the document but is not part of its structure.
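
As a rough illustration of the escaping rule, here is a minimal Python sketch using only the standard library. The sample sentence and the <note> element are made up for the example. Note that escape() also rewrites '&' as &amp;, since the ampersand is special too.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical text that contains both special characters.
raw = "if a < b then b > a"

# escape() replaces '&', '<' and '>' with their entity references.
escaped = escape(raw)
print(escaped)  # if a &lt; b then b &gt; a

# Once escaped, the text can sit safely inside an element.
# The parser turns the entity references back into the characters.
note = ET.fromstring(f"<note>{escaped}</note>")
print(note.text)  # if a < b then b > a
```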

Tree structures
SGML documents are often visualised as tree structures. Here's part of the previous document in tree form:
[Figure: tree diagram of part of the <person> document]
(Note that this tree ignores whitespace nodes.)
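
One way to see the tree for yourself is to parse a document and print one indented line per element. This is only a sketch: it uses Python's standard XML parser (the slides name no tool), the outline() helper is a made-up name, and the input is just the contact part of the person document. Whitespace-only text is dropped, as in the note above.

```python
import xml.etree.ElementTree as ET

# The contact part of the person document from the earlier slide.
PERSON_XML = """<person>
  <contact>
    <email> johnq@random.org </email>
    <phone> 0555 555555 </phone>
  </contact>
</person>"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    text = (element.text or "").strip()  # ignore whitespace-only text
    label = element.tag + (f': "{text}"' if text else "")
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(PERSON_XML))
print("\n".join(tree_lines))
# person
#   contact
#     email: "johnq@random.org"
#     phone: "0555 555555"
```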

Attributes
Attributes are properties of an element which are not considered part of the main document structure. For example:
  <person id="458">
Attributes have a name and a value. In this example, the name is 'id' and the value is '458'. An element may have any number of attributes, but they must all have different names.
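
A short sketch of how attributes look to a program, again assuming Python's standard parser. The <given> child and the second id value are invented for the example. The last check shows the parser refusing two attributes with the same name.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring('<person id="458"><given>John</given></person>')

# Attributes are exposed as name/value pairs, separate from child elements.
print(person.attrib)        # {'id': '458'}
print(person.get("id"))     # 458
print(person.get("email"))  # None (no such attribute)


def parses(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Two attributes with the same name on one element are rejected.
duplicate_ok = parses('<person id="458" id="459"/>')
print(duplicate_ok)  # False
```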

SGML and HTML
HTML is an application of SGML. An application is a set of SGML tags and attributes that can be used to describe a particular class of document. Here's a simple HTML document:
<html>
  <head>
    <title> John Q. Random's home page </title>
  </head>
  <body>
    <h1>Hello there!</h1>
    <p> <font color="blue">Welcome to my home page!</font> </p>
  </body>
</html>

Example HTML
[Figure: the previous example rendered in a browser]

An overview of XML: What is XML?

What is XML?
XML stands for Extensible Markup Language. It is an attempt to:
  Introduce formal structure to web documents
  Separate style from content
  Expand the scope and usefulness of web content
It has been designed to allow the creation and combination of custom markup languages. The XML standards suite is supported by the World Wide Web Consortium (W3C).

What's wrong with HTML?
HTML has several failings:
  Little separation of style and content. Heading and font tags are used interchangeably; current browsers still lack full CSS compliance.
  De facto hardcoded presentation. This leads to widespread incompatibility and crash-prone browsers.
  Little support for automated processing through metadata. There is only <META>, which is of limited use for complex metadata.
  Lenient parsers allow poorly structured markup, e.g. <B>... <I>... </B>... </I>...

Why not use SGML?
XML is a cut-down implementation (a profile) of SGML (Standard Generalised Markup Language). SGML was considered too complicated for web use; complete SGML implementations are large and complex. XML is simpler and stricter than SGML:
  XML requires end tags, e.g. </P> is not optional
  Empty elements need to be identified, e.g. <BR> becomes <BR/>
  Attribute values must be quoted, e.g. id="johnqrandom"
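
The three rules above can be tried out against a real XML parser. This is a sketch under assumptions: it uses Python's standard parser, and the <doc> wrapper and the sample fragments are invented. Each pair holds a form that lenient HTML/SGML practice would tolerate, followed by its XML equivalent.

```python
import xml.etree.ElementTree as ET


def accepted_by_xml_parser(xml_text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# rule -> (form XML rejects, form XML accepts)
examples = {
    "end tags required": ("<doc><P>text</doc>", "<doc><P>text</P></doc>"),
    "empty elements marked": ("<doc>one<BR>two</doc>", "<doc>one<BR/>two</doc>"),
    "attribute values quoted": ("<doc id=johnqrandom/>", '<doc id="johnqrandom"/>'),
}

results = {
    rule: (accepted_by_xml_parser(loose), accepted_by_xml_parser(strict))
    for rule, (loose, strict) in examples.items()
}
for rule, outcome in results.items():
    print(rule, outcome)  # each rule prints (False, True)
```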

Core XML concepts
  Document validation: checking for correct document structure
    Document Type Definitions (DTDs)
    XML Schema Definition Language
  Presentation and transformation: transformation for display and document exchange
    Cascading Style Sheets (CSS)
    Extensible Stylesheet Language (XSL)
  Internal document structure
    XPath

Some simple XML
<?xml version="1.0" standalone="no"?>
<!DOCTYPE book SYSTEM "http://somesite.org/book.dtd">
<book id="nielsen01">
  <title> Designing Web Usability </title>
  <subtitle> The Practice of Simplicity </subtitle>
  <author> Jakob Nielsen </author>
  <info>
    <key> Design </key>
    <key> Internet </key>
    <isbn> 1-56205-810-X </isbn>
  </info>
</book>

CDATA
If you wish to 'protect' some text from being interpreted as markup, you can:
  encode the '<' and '>' characters
  enclose all the text in a CDATA section
CDATA sections look like this:
  Here is some text containing a <sequence/> which would be interpreted as markup
  <![CDATA[ This time, the <sequence/> won't be interpreted as markup ]]>
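
To see the difference a CDATA section makes, here is a small sketch with Python's standard parser. The <examples>, <plain> and <protected> element names are invented wrappers. The same characters are parsed once as markup and once as plain text.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<examples>"
    "<plain>Here is a <sequence/> in text</plain>"
    "<protected><![CDATA[Here is a <sequence/> in text]]></protected>"
    "</examples>"
)
plain = doc.find("plain")
protected = doc.find("protected")

# Outside CDATA, <sequence/> becomes a child element.
print(len(plain), plain[0].tag)  # 1 sequence

# Inside CDATA, the same characters are kept as ordinary text.
print(len(protected))  # 0
print(protected.text)  # Here is a <sequence/> in text
```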

Checking for correctness
XML parsers check documents for two kinds of correctness:
  Well-formedness: checks that tags nest correctly, attributes are quoted, and singleton tags are correctly closed. An XML document must be well-formed.
  Validity: checks the document against a DTD, to see if its structure is allowed. Validation is optional. However, a validating parser will fail an invalid document.
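
The two kinds of correctness can be told apart with a non-validating parser. In this sketch, Python's standard parser stands in for one: it reads the embedded DTD but does not enforce it. The <pamphlet> element and both sample documents are invented. A document that breaks its own DTD still parses, while one that is not well-formed is rejected outright.

```python
import xml.etree.ElementTree as ET

# Well-formed, but invalid: the DTD says a book holds exactly one title.
INVALID_BUT_WELL_FORMED = """<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title)>
  <!ELEMENT title (#PCDATA)>
]>
<book><pamphlet>no title here</pamphlet></book>"""

# A non-validating parser accepts it, because it only checks well-formedness.
book = ET.fromstring(INVALID_BUT_WELL_FORMED)
child_tags = [child.tag for child in book]
print(child_tags)  # ['pamphlet']

# Not well-formed: <title> is never closed, so every XML parser must fail.
NOT_WELL_FORMED = "<book><title>Unclosed</book>"
try:
    ET.fromstring(NOT_WELL_FORMED)
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)  # True
```

A validating parser would go one step further and fail the first document as well.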

The DTD (1)
A Document Type Definition (DTD) describes the possible valid structures of an XML document. A DTD can be associated with the document in two ways:
  As a linked document in the XML header, e.g.
    <?xml version="1.0"?>
    <!DOCTYPE book SYSTEM "http://somesite.org/book.dtd">
  By directly embedding it into the document
The DTD appears before the document root node, but after the XML declaration.

The DTD (2)
Here's a DTD for a book like the example:
<!DOCTYPE book [
  <!ELEMENT book (title, subtitle?, author+, info) >
  <!ATTLIST book id CDATA #REQUIRED >
  <!ELEMENT title (#PCDATA) >
  <!ELEMENT subtitle (#PCDATA) >
  <!ELEMENT author (#PCDATA) >
  <!ELEMENT info (key*, isbn) >
  <!ELEMENT key (#PCDATA) >
  <!ELEMENT isbn (#PCDATA) >
]>

The DTD (3)
These lines define the element book, and the allowed child elements and attributes:
  <!ELEMENT book (title, subtitle?, author+, info) >
  <!ATTLIST book id CDATA #REQUIRED >
The book element consists of:
  a mandatory title
  an optional subtitle
  one or more author elements
  a mandatory info element
The book element has a required attribute id, which consists of a character data value.
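
A content model such as (title, subtitle?, author+, info) reads much like a regular expression over the names of the child elements. The sketch below takes that analogy literally. It is an illustration only, not how a validating parser is implemented, and BOOK_MODEL, conforms() and the sample books are made-up names.

```python
import re
import xml.etree.ElementTree as ET

# (title, subtitle?, author+, info) rewritten as a regular expression.
# Each child name is followed by a comma so that names cannot run together.
BOOK_MODEL = re.compile(r"title,(subtitle,)?(author,)+info,")


def conforms(book):
    """Check the child order and the required id attribute of a book."""
    child_names = "".join(child.tag + "," for child in book)
    return BOOK_MODEL.fullmatch(child_names) is not None and "id" in book.attrib


two_authors = ET.fromstring(
    '<book id="b1"><title/><author/><author/><info/></book>'
)
no_author = ET.fromstring('<book id="b2"><title/><subtitle/><info/></book>')
no_id = ET.fromstring("<book><title/><author/><info/></book>")

print(conforms(two_authors))  # True
print(conforms(no_author))    # False (author+ needs at least one author)
print(conforms(no_id))        # False (id is #REQUIRED)
```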

XML Schema
One drawback of XML DTDs is that they are described in a separate syntax, inherited from SGML. XML Schema offers an alternative way to describe XML document structure, in XML syntax. This provides many benefits:
  Simplicity: XML Schema rules are often easier to understand.
  Interrogation of data structure: XSLT transformations can know more about document structure.
  Tool reuse: the same tools used to create and maintain XML documents can be used to maintain their structure.

An example schema
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/1999/XMLSchema">
  <element name="book">
    <complexType content="elementOnly">
      <sequence>
        <element ref="title" />
        <element ref="subtitle" minOccurs="0" maxOccurs="1" />
        <element ref="author" minOccurs="1" maxOccurs="unbounded" />
      </sequence>
      <attribute name="id" use="required" type="string"/>
    </complexType>
  </element>
  <element name="title">
    <complexType content="textOnly" />
  </element>
  ...
</schema>

Namespaces
Namespaces are the cornerstone of the modular design of the XML framework. XML uses namespaces to allow the combination of different markup languages. For example, this document includes an element from another namespace to describe a person:
<slide>
  <author>
    <person:name xmlns:person="http://somesite.org/">
      <person:given>John</person:given>
      <person:initial>Q</person:initial>
      <person:family>Random</person:family>
    </person:name>
  </author>
</slide>
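
Here is a sketch of how a namespace-aware parser sees a document shaped like the one above, assuming Python's standard parser. The prefix "p" used in the queries is deliberately different from the "person" prefix in the document: only the namespace URI matters, and the prefix is a local shorthand.

```python
import xml.etree.ElementTree as ET

SLIDE_XML = """<slide>
  <author>
    <person:name xmlns:person="http://somesite.org/">
      <person:given>John</person:given>
      <person:initial>Q</person:initial>
      <person:family>Random</person:family>
    </person:name>
  </author>
</slide>"""

slide = ET.fromstring(SLIDE_XML)

# Map a prefix of our own choosing to the namespace URI.
ns = {"p": "http://somesite.org/"}
given = slide.find("author/p:name/p:given", ns)

# The parser stores the full URI with the name, not the prefix.
print(given.tag)   # {http://somesite.org/}given
print(given.text)  # John

# Elements without a prefix stay outside that namespace.
print(slide.find("author").tag)  # author

# An unqualified query does not match the namespaced element.
print(slide.find("author/name"))  # None
```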

Stylesheets
XML pages can make use of Cascading Style Sheets in the same way as HTML. However, they can also make use of more sophisticated XSL transformations. XSL is a two-part technology:
  XSLT is a rules-based system used to transform XML documents to other XML forms (such as WML), and to HTML.
  XSL-FO (Formatting Objects) is used to describe presentation objects for rendering in a browser.
An XSL stylesheet will typically map XML to XSL-FO by means of XSLT rules.

XSLT rules
An XSLT stylesheet consists of a set of XSLT rules. These rules are described in XML, with XSLT structures described using the xsl: namespace. Essentially, a rule is a chunk of the output document, scripted by XSLT constructs which describe the parts of the source to which the rule is applied. The output is created by applying the closest matching rule to each part of the input document. Some XSLT processors allow extension rules to be written which talk to databases and other back-end systems.

Example XSLT rules
A title rule:
<xsl:template match="book">
  Book Review: <xsl:value-of select="./title"/>
  by <xsl:value-of select="./author"/>
</xsl:template>
Applied to the example, this might produce:
  Book Review: Designing Web Usability by Jakob Nielsen
XSLT is a so-called push/pull stylesheet system: push rules (xsl:template) can pull information from other parts of the document.
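
To make the push/pull idea concrete, here is a plain Python imitation of the title rule. It is not an XSLT processor, and book_rule(), RULES and apply_rules() are invented names. The dictionary plays the part of the template matching (push), and the findtext() calls play the part of xsl:value-of (pull). Surrounding whitespace is trimmed for readability, which a real XSLT processor would not do on its own.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book id="nielsen01">
  <title> Designing Web Usability </title>
  <subtitle> The Practice of Simplicity </subtitle>
  <author> Jakob Nielsen </author>
</book>"""


def book_rule(book):
    """Imitates the title rule: pull title and author into fixed text."""
    title = book.findtext("./title").strip()
    author = book.findtext("./author").strip()
    return f"Book Review: {title} by {author}"


# Element name -> rule, standing in for match="..." on xsl:template.
RULES = {"book": book_rule}


def apply_rules(element):
    """Use the matching rule if there is one, otherwise visit the children."""
    rule = RULES.get(element.tag)
    if rule is not None:
        return rule(element)
    return "".join(apply_rules(child) for child in element)


output = apply_rules(ET.fromstring(BOOK_XML))
print(output)  # Book Review: Designing Web Usability by Jakob Nielsen
```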

XHTML and migration
As we've seen, HTML is an application of SGML. XHTML is a similar markup language, expressed in XML syntax. XHTML can be viewed in current browsers, with some limitations. Converting HTML to XHTML will allow content created today to be processed into other forms, since XHTML is XML. Conversion can be done with Dave Raggett's HTML Tidy utility, available from the W3C website.

XHTML profiles
Three profiles of XHTML are under development by the World Wide Web Consortium (W3C):
  Transitional: this is designed to ease the transition from HTML to XHTML. All of the 'questionable' parts of HTML (font colours, etc.) are still available.
  Strict: this profile strictly enforces a style/content separation. According to the W3C, it is free of any tags associated with layout, and is meant to be used with W3C's Cascading Style Sheet language (CSS).
  Frameset: this profile has support for 'multi-framed' web designs.

An overview of XML: XML and metadata

Metadata in HTML
HTML's support of metadata is limited to the <META> element. This element supports a simple name/value pair, with two attributes:
  name: the identifier of the data element
  content: the data element's content
This scheme is basic; more structured metadata has to be expressed through complicated naming schemes.

Metadata in XHTML (1)
The <META> element still works, in its XML form:
  <meta name="author" content="Joe Bloggs" />
With XSLT, this data can be interrogated and acted on in a stylesheet or transformation. Thus metadata can be preserved while migrating content from HTML to XHTML, then extracted and processed into a more expressive form.
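
Because XHTML is XML, any XML tool can read the <meta> data, not only an XSLT processor. The sketch below uses Python's standard parser as a stand-in for a stylesheet. The page, the keywords entry and the read_meta() helper are invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Example page</title>
    <meta name="author" content="Joe Bloggs" />
    <meta name="keywords" content="xml, metadata" />
  </head>
  <body><p>Hello</p></body>
</html>"""


def read_meta(xhtml_text):
    """Collect the name/content pairs of every <meta> element."""
    root = ET.fromstring(xhtml_text)
    pairs = {}
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        name = meta.get("name")
        if name is not None:
            pairs[name] = meta.get("content", "")
    return pairs


metadata = read_meta(PAGE)
print(metadata)  # {'author': 'Joe Bloggs', 'keywords': 'xml, metadata'}
```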

Metadata in XHTML (2)
However, with XHTML you're not restricted to <META>. Using namespaces, it is possible to combine other metadata formats with XHTML. Common uses include adding RDF metadata, which can be done without breaking HTML backward compatibility. Since current browsers ignore tags they don't recognise, but include any text data in the output document, it is best to use metadata schemes that carry their content in attributes.

Dublin Core/RDF in XHTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>open.gov.uk - organisation index (a-b)</title>
<rdf:RDF xmlns:rdf="http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
         xmlns:dc="http://purl.org/metadata/dublin_core#">
  <rdf:Description about="http://www.open.gov.uk/index/orgindex.htm"
    dc:creator="Neil Pawley"
    dc:title="open.gov.uk - organisation index"
    dc:subject="organisation, index, listing, directory"
    dc:description="This section contains the open.gov.uk..."
    dc:publisher="CCTA"... />
</rdf:RDF>

More on RDF
RDF (Resource Description Framework) is a W3C recommendation for general website metadata. It is already used in conjunction with Dublin Core metadata schemas. However, it is also in use:
  in 'intelligent' browsers such as Netscape 6 (the 'site summary' tree browser) and Metabrowser (http://metabrowser.spirit.net.au/)
  for content rating: RDF was inspired by PICS, and there is a PICS to RDF mapping

Metadata migration
Some ideas for metadata migration:
  Convert your HTML content to XHTML: use HTML Tidy.
  Change your content generation scripts: if your pages are generated dynamically, make the scripts generate XHTML with RDF.
  Translate <META> data: if your (X)HTML is static, consider crunching it with an XSLT processor to extract <META> data and output RDF instead.
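
The last idea can be sketched without an XSLT processor. The Python below reads <meta> elements from a static XHTML page and builds an RDF description carrying the values as Dublin Core attributes. Several things here are assumptions rather than part of the slides: the namespace URIs are the current RDF and Dublin Core ones, not the older ones used in the earlier example; the META_TO_DC mapping, the sample page, the example.org URL and the meta_to_rdf() helper are all invented.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumption: current RDF and Dublin Core namespaces.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from <meta> names to Dublin Core element names.
META_TO_DC = {"author": "creator", "description": "description", "keywords": "subject"}

PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Organisation index</title>
    <meta name="author" content="Joe Bloggs" />
    <meta name="keywords" content="organisation, index" />
    <meta name="generator" content="a text editor" />
  </head>
  <body><p>Index goes here.</p></body>
</html>"""


def meta_to_rdf(xhtml_text, page_url):
    """Build an rdf:RDF element describing page_url from the <meta> data."""
    root = ET.fromstring(xhtml_text)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": page_url}
    )
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        dc_name = META_TO_DC.get(meta.get("name", ""))
        if dc_name is not None:  # skip <meta> names with no mapping
            description.set(f"{{{DC_NS}}}{dc_name}", meta.get("content", ""))
    return rdf


# Ask the serialiser to use the conventional prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

rdf = meta_to_rdf(PAGE, "http://www.example.org/index.htm")
print(ET.tostring(rdf, encoding="unicode"))
```

The values travel in attributes instead of element text, in line with the earlier advice about browsers that display the text of tags they don't recognise.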