Introduction to Semistructured Data and XML. Contents

Contents
Overview
What is XML?
How the Web is Today
New Universal Data Exchange Format: XML
What is the W3C?
Semistructured Data
What is Self-describing Data?
The Semistructured Data Model
Characteristics of Semistructured Data
Conversion from XML to Objects
Conversion from Objects to XML
XML's origin is document processing, not databases
From HTML to XML
HTML
XML
Is XML a Database?
Databases and XML
The Extensible Markup Language (XML)
Markup Languages
A brief history of markup
XML is
XML Features and motivations
XML features
XML: Motivation
XML Structure
Rules for Well-Formed XML
Motivation for Nesting
Structure of XML
Attributes Vs. Subelements
Namespaces
XML: A Simple Example
XML Data Model - A Tree
XML Document Schema
Document Type Definition (DTD)
Element Specification in DTD
Attribute specification in DTD
IDs and IDREFs
Limitations of DTDs
XML Processing: The XML Parser
Well-Formed XML Documents
Important XML Standards
XML Terminology Summary

Introduction to Semistructured Data and XML

Sources
Database System Concepts, Silberschatz
Database Management Systems, R. Ramakrishnan
Some slides by Dan Suciu, University of Washington

Overview
What is XML?
Semistructured data
HTML vs. XML
XML Terminology
Namespace
DTD
XML Schema

What is XML?
XML stands for Extensible Markup Language.
XML is a markup language much like HTML.
XML was designed to describe data.
XML tags are not predefined; you must define your own tags.
XML is self-describing.
XML can use a DTD (Document Type Definition) or a schema to formally describe the data.

XML is a standard for data exchange.
All major database products have been retrofitted with facilities to store and construct XML documents.
There are already database products that are specifically designed to work with XML documents rather than relational or object-oriented data.
XML is closely related to object-oriented and so-called semistructured data.

XML can be used to exchange data
In the real world, computer systems and databases contain data in incompatible formats. One of the most time-consuming challenges for developers has been to exchange data between such systems over the Internet. Converting the data to XML can greatly reduce this complexity and create data that can be read by different types of applications.

XML is suitable for semistructured data and has become a standard:
o Easy to describe object-like data
o Self-describing
o Doesn't require a schema (but one can be provided optionally)

XML-related documents and languages:
o DTDs: an older way to specify a schema
o XML Schema: a newer, more powerful (and much more complex!) way of specifying a schema
o Query and transformation languages: XPath, XSLT, XQuery

How the Web is Today
HTML documents
o often generated by applications
o consumed by humans only
o easy access: across platforms, across organizations
No application interoperability:
o HTML not understood by applications

New Universal Data Exchange Format: XML
A recommendation from the W3C
XML = data
XML generated by applications
XML consumed by applications
Easy access: across platforms, organizations

What is the W3C?
Group of member organizations (more than 400)
o Hosted by MIT, INRIA, Keio University
o 50 full-time staff members

Posts specifications for the Web
o Notes: submitted by a member, made public for comments, no endorsement yet
o Working drafts: specifications that are under consideration and open to comment
o Recommendation: an accepted working draft becomes a recommendation; since the W3C is not a government body, it cannot use the term "standard"

XML is suitable for semistructured data
o Easy to describe object-like data
o Self-describing
o Doesn't require a schema (but one can be provided optionally)

Semistructured Data
Examples of data sources with non-rigid structure
o Biological data
o Web data
To make the previous student list suitable for machine consumption on the Web, it should be
o self-describing (some schema-like information, like attribute names, is part of the data itself)

What is Self-describing Data?

The Semistructured Data Model
Syntax for Semistructured Data
Observe: Nested tuples, set-values, oids!

Characteristics of Semistructured Data
Missing or additional attributes
Multiple attributes
Different types in different objects
Heterogeneous collections
Comparison with Relational Data

Conversion from XML to Objects
Conversion from Objects to XML

XML's origin is document processing, not databases
Allows things like standalone text (useless for databases)
o <foo> Hello <moo>123</moo> Bye </foo>
Attributes aren't needed; they just bloat the number of ways to represent the same thing
XML data is ordered, while database data is not:
    <something><foo>1</foo><bar>2</bar></something>
is different from
    <something><bar>2</bar><foo>1</foo></something>
but these two complex values are the same:
    [something: [bar:2, foo:1]]
    [something: [foo:1, bar:2]]

Overview of XML
The HyperText Markup Language (HTML)
o A simple language for distributing text-based information
XML is extensible, unlike HTML
o Users can add new tags, and separately specify how the tag should be handled for display
XML combined with other Web technologies to yield
o A distributed information Web
Like HTML, but any number of different tags can be used (up to the document author)
Unlike HTML, no semantics behind the tags
o For instance, HTML's <table> </table> means: render contents as a table; in XML it doesn't mean anything
Unlike HTML, XML is intolerant of bugs
o Browsers will render buggy HTML pages
o XML processors are not supposed to process buggy XML documents

From HTML to XML
HTML describes the presentation

HTML
    <h1> Bibliography </h1>
    <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
    <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999

XML
    <bibliography>
      <book>
        <title> Foundations </title>
        <author> Abiteboul </author>
        <author> Hull </author>
        <author> Vianu </author>
        <publisher> Addison Wesley </publisher>
        <year> 1995 </year>
      </book>
    </bibliography>

XML describes the content
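Because XML tags name the content rather than its presentation, a program can ask for the authors or the year directly. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the document string is modeled on the bibliography fragment above with whitespace trimmed, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Bibliography fragment modeled on the example above (whitespace trimmed).
doc = """
<bibliography>
  <book>
    <title>Foundations</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""

root = ET.fromstring(doc)

# The tags describe the content, so the data can be selected by name.
authors = [author.text for author in root.iter("author")]
year = int(root.find("book/year").text)

print(authors, year)
```

Extracting the same facts from the HTML version would require guessing from italics and line breaks, which is the interoperability problem the slides describe.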

Is XML a Database?
An XML document is a database only in the strictest sense of the term. That is, it is a collection of data. In many ways, this makes it no different from any other file; after all, all files contain data of some sort.
As a "database" format, XML has some advantages. For example,
o it is self-describing (the markup describes the structure and type names of the data, although not the semantics),
o it is portable (Unicode), and
o it can describe data in tree or graph structures.
It also has some disadvantages. For example,
o it is verbose and
o access to the data is slow due to parsing and text conversion.

Do XML and its surrounding technologies constitute a DBMS? The answer to this question is, "Sort of."
On the plus side, XML provides many of the things found in databases:
o storage (XML documents),
o schemas (DTDs, XML Schemas, and so on),
o query languages (XQuery, XPath, XQL, XML-QL, QUILT, etc.),
o programming interfaces (SAX, DOM, JDOM), and so on.
On the minus side, it lacks many of the things found in real databases:
o efficient storage, indexes,
o security, transactions and data integrity,
o multi-user access, triggers,
o and so on.
It may be possible to use an XML document or documents as a database in environments with
o small amounts of data,
o few users, and
o modest performance requirements.
It will fail in most production environments, which have
o many users,
o strict data integrity requirements,
o and the need for good performance.
Examples of less sophisticated data sets for which an XML document might be suitable as a database are
o personal contact lists (names, phone numbers, addresses, etc.),
o descriptions of the MP3s you've got.
However, given the low price and ease of use of databases like dBASE and Access, there seems little reason to use an XML document as a database even in these cases.

Databases and XML
Database content can be presented in XML
o An XML processor can access a DBMS or file system and convert data to XML
o A Web server can serve content as either XML or HTML
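As a rough illustration of presenting database content in XML, the sketch below reads rows from an in-memory SQLite table and emits one element per row. The table name, column names, and sample row are assumptions made for this example; they are not taken from any particular system.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_to_xml(conn):
    """Convert every row of the (hypothetical) account table to an <account> element."""
    root = ET.Element("bank")
    cursor = conn.execute(
        "SELECT account_number, branch_name, balance "
        "FROM account ORDER BY account_number"
    )
    for number, branch, balance in cursor:
        account = ET.SubElement(root, "account")
        ET.SubElement(account, "account-number").text = number
        ET.SubElement(account, "branch-name").text = branch
        ET.SubElement(account, "balance").text = str(balance)  # XML text is always a string
    return root

# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (account_number TEXT, branch_name TEXT, balance INTEGER)"
)
conn.execute("INSERT INTO account VALUES ('A-101', 'Downtown', 500)")

xml_text = ET.tostring(rows_to_xml(conn), encoding="unicode")
print(xml_text)
```

Note that the integer balance becomes plain text in the XML output; any typing has to be restored by the consumer or by a schema.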

The Extensible Markup Language (XML)
A metalanguage
o A language used to describe other languages using markup
Markup describes properties of the data
Designed to be structured
o Strict rules about how data can be formatted
Designed to be extensible
o Can define own terms and markup

Markup Languages
XML has its roots in document management.
XML is derived from a language for structuring large documents known as the Standard Generalized Markup Language (SGML).
To understand XML, it is important to understand its roots as a document markup language.
The term markup refers to anything in a document that is not intended to be part of the printed output.
In electronic document processing, a markup language is a formal description of
o what part of the document is content,
o what part is markup, and
o what the markup means.

Markup languages evolved from specifying instructions for how to print parts of the document to specifying the function of the content. For instance, with functional markup, text representing section headings (for this section, the words "Markup Languages") would be marked up as being a section heading, instead of being marked up as text to be printed in large size, bold font. Such functional markup allows the document to be formatted differently in different situations. It also helps different parts of a large document, or different pages in a large Web site, to be formatted in a uniform manner. Functional markup also helps automate extraction of key parts of documents.

For the family of markup languages that includes HTML, SGML, and XML, the markup takes the form of tags enclosed in angle brackets, <>. Tags are used in pairs, with <tag> and </tag> delimiting the beginning and the end of the portion of the document to which the tag refers. For example, the title of a document might be marked up as follows.
    <title> Database System Concepts </title>

Unlike HTML, XML does not prescribe the set of tags allowed, and the set may be specialized as needed. This feature is the key to XML's major role in data representation and exchange, whereas HTML is used primarily for document formatting. The ability to specify new tags, and to create nested tag structures, made XML a great way to exchange data, not just documents.

Much of the use of XML has been in data exchange applications, not as a replacement for HTML.
Tags make data (relatively) self-documenting, describing:
o Syntax: the permitted arrangement or structure of letters and words in a language, as defined by a grammar (XML)
o Semantics: the meaning of letters or words in a language

A brief history of markup
GML: Generalised Markup Language
o Developed in the 60's and 70's by IBM
o Used for IBM technical manuals
SGML: Standardized GML
o 70's, 80's, with ANSI standard in 1983
o Flexible and very general, but difficult and costly
HTML
o Early 90's: compact markup for hypertext docs
o Now seen as a step backwards
XML

XML is
Simpler than SGML
More flexible than HTML
An application of SGML
A toolkit; however, it is common to refer to documents as being "written in XML"
Surrounded by a family of technologies which extend its use (e.g. transformation)

XML Features and motivations
XML features
Represents most kinds of information
Easily customizable
Allows validation of documents
Easy to read by humans and machines
Open standard, managed by W3C

A Piece of XML
Tags provide context for each value and allow semantics of the value to be identified.

Another Piece of XML

XML: Advantages
Compared to storage of data in a database, the XML representation may be inefficient, since tag names are repeated throughout the document. However, in spite of this disadvantage, an XML representation has significant advantages when it is used to exchange data.
o First, the presence of the tags makes the message self-documenting; that is, a schema need not be consulted to understand the meaning of the text.
o Second, the format of the document is not rigid. For example, if some sender adds additional information, such as a tag lastaccessed noting the last date on which an account was accessed, the recipient of the XML data may simply ignore the tag. The ability to recognize and ignore unexpected tags allows the format of the data to evolve over time, without invalidating existing applications.
o Finally, since the XML format is widely accepted, a wide variety of tools are available to assist in its processing, including browser software and database tools.
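The second advantage can be sketched in a few lines: a recipient that only knows three account fields reads a message carrying an extra lastaccessed tag and simply skips it. The field list, the function name read_account, and the date value are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Fields this (hypothetical) recipient was written to understand.
KNOWN_FIELDS = ("account-number", "branch-name", "balance")

def read_account(xml_text):
    """Return only the fields the recipient understands; skip everything else."""
    account = ET.fromstring(xml_text)
    return {
        child.tag: (child.text or "").strip()
        for child in account
        if child.tag in KNOWN_FIELDS
    }

# A newer sender adds <lastaccessed>; the date is made up for the example.
message = """
<account>
  <account-number> A-102 </account-number>
  <branch-name> Perryridge </branch-name>
  <balance> 400 </balance>
  <lastaccessed> 2005-01-01 </lastaccessed>
</account>
"""

record = read_account(message)
print(record)
```

The older recipient keeps working unchanged, which is what lets the format evolve over time.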

XML: Motivation
Data interchange is critical in today's networked world
o Examples:
    Banking: funds transfer
    Order processing (especially inter-company orders)
    Scientific data
        Chemistry: ChemML
        Genetics: BSML (Bio-Sequence Markup Language)
o Paper flow of information between organizations is being replaced by electronic flow of information
Each application area has its own set of standards for representing information
XML has become the basis for all new generation data interchange formats
Just as SQL is the dominant language for querying relational data, XML is becoming the dominant format for data exchange.
Each XML-based standard defines what are valid elements, using
o XML type specification languages to specify the syntax
    DTD (Document Type Definition)
    XML Schema
o Plus textual descriptions of the semantics

A wide variety of tools is available for parsing, browsing and querying XML documents/data
Many other specifications based upon XML are coming out
o XSL/XSLT, XQL, XPath, XPointer, XLink, MathML, CML, BIOML, GAME, BSML, XML-

Communication and Integration
XML can represent
o many kinds of structured data used in business applications
o database data
XML is particularly useful as a data format when an application must communicate with another application, or integrate information from several other applications

XML Structure
Structure of XML Data
The fundamental construct in an XML document is the element.
Element: section of data beginning with <tagname> and ending with matching </tagname>
Tag: label for a section of data

Example
Elements must be properly nested
o Proper nesting
    <account> <balance> ... </balance> </account>
o Improper nesting
    <account> <balance> ... </account> </balance>
o Formally: every start tag must have a unique matching end tag that is in the context of the same parent element.
o Every document must have a single top-level element

Rules for Well-Formed XML
Some basic rules for XML
o All tags must be balanced: <TAG>...</TAG>
o Empty tags are expressed as <EMPTY_TAG/>
o Tags must be properly nested: <B><I> </I></B>
o All element attributes must be quoted: <TAG name="value">
o Tag names are case-sensitive: <TAG> != <Tag>
o Comments are allowed: <!-- -->
o Should begin with an XML declaration: <?xml version="1.0"?>
o Special characters must be escaped

Special Characters
Some characters need to be escaped because they have special significance:
o <  is written as &lt;
o >  is written as &gt;
o &  is written as &amp;
o '  is written as &apos;
o "  is written as &quot;
If they were not escaped, they would be processed as markup by the XML engine
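The escaping rule can be tried out with Python's standard library. This is a small sketch, not part of the original slides: xml.sax.saxutils.escape replaces &, < and > with their entities, and the helper is_well_formed (a name chosen for this example) reports whether a string parses.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

raw = "AT&T says 1 < 2"
escaped = escape(raw)  # replaces &, < and > with entity references

# Unescaped special characters are read as markup and break the document.
print(is_well_formed(f"<note>{raw}</note>"))
# The escaped form parses, and the parser hands back the original text.
print(is_well_formed(f"<note>{escaped}</note>"))
print(ET.fromstring(f"<note>{escaped}</note>").text == raw)
# Improperly nested tags are rejected as well.
print(is_well_formed("<B><I></B></I>"))
```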

A Piece of XML
Example of Nested Elements

Motivation for Nesting
Nesting of data is useful in data transfer
o Example: elements representing customer-id, customer name, and address nested within an order element

Nesting is not supported, or is discouraged, in relational databases
o With multiple orders, customer name and address are stored redundantly
o Normalization replaces nested structures in each order by a foreign key into a table storing customer name and address information
But nesting is appropriate when transferring data
o External application does not have direct access to data referenced by a foreign key
Nested representations are widely used in XML data interchange applications to avoid joins.

Structure of XML
Mixture of text with sub-elements is legal in XML.
o Example:
    <account>
      This account is seldom used any more.
      <account-number> A-102</account-number>
      <branch-name> Perryridge</branch-name>
      <balance>400 </balance>
    </account>
o Useful in a document-processing context, but discouraged for more structured data representation such as database content in XML
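To see how a parser exposes mixed content, the sketch below loads the account example with Python's xml.etree.ElementTree. In that API (an implementation detail of this library, not of XML itself) the text before the first subelement is available as .text on the parent, while the subelements are ordinary children.

```python
import xml.etree.ElementTree as ET

# The mixed-content account example, written without extra whitespace.
account = ET.fromstring(
    "<account>This account is seldom used any more."
    "<account-number>A-102</account-number>"
    "<branch-name>Perryridge</branch-name>"
    "<balance>400</balance>"
    "</account>"
)

free_text = account.text                               # text before the first subelement
fields = {child.tag: child.text for child in account}  # the structured part

print(free_text)
print(fields)
```

A data-oriented consumer would typically use only the structured fields, which is one reason mixed content is discouraged for database-style data.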

Attributes
Elements can have attributes
    <account acct-type="checking" monthly-fee="5">
      <account-number> A-102 </account-number>
      <branch-name> Perryridge </branch-name>
      <balance> 400 </balance>
    </account>
Attributes are specified by name="value" pairs inside the starting tag of an element
Attribute values are strings and do not contain markup
An element may have several attributes, but each attribute name can only occur once in a given tag

Attributes Vs. Subelements
Distinction between subelement and attribute
o In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents
o In the context of data representation, the difference is unclear and may be confusing
    Same information can be represented in two ways
        <account account-number="A-101"> ... </account>
        <account> <account-number>A-101</account-number> </account>
o Suggestion: use attributes for identifiers of elements, and use subelements for contents

More on XML Syntax
Elements without subelements or text content can be abbreviated by ending the start tag with /> and deleting the end tag
    <account number="A-101" branch="Perryridge" balance="200"/>

Other XML Constructs
XML declaration, comments, processing instructions, DTD
XML Declaration
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
Comments
    <!-- this is a comment -->
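The two representations carry the same information but are read through different calls. A minimal sketch with Python's xml.etree.ElementTree, using the account-number example from above (variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# Same information, once as an attribute and once as a subelement.
as_attribute = ET.fromstring('<account account-number="A-101"/>')
as_subelement = ET.fromstring(
    "<account><account-number>A-101</account-number></account>"
)

from_attribute = as_attribute.get("account-number")            # attribute lookup
from_subelement = as_subelement.findtext("account-number")     # child element text

print(from_attribute, from_subelement)

# The abbreviated form <account .../> is an element with no children and no text.
print(len(as_attribute), as_attribute.text)
```

A consumer therefore has to know which representation the sender chose, which is why the slides suggest a convention (attributes for identifiers, subelements for contents).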

Processing Instruction
    <?xml-stylesheet href="book.css" type="text/css"?>

Namespaces
Since anybody can create their own tags, there is a possibility of naming collisions
XML data has to be exchanged between organizations
The same tag name may have different meanings in different organizations, causing confusion on exchanged documents
Specifying a unique string as an element name avoids confusion
Better solution: use unique-name:element-name
Avoid using long unique names all over the document by using XML Namespaces

Example
Want to provide two different ratings for each movie
    <movie>
      <name>Apollo 13</name>
      <off-rating>PG 13</off-rating>
      <qual-rating>excellent</qual-rating>
    </movie>
Better
    <movie xmlns:off="http://somefilmauthority.com/ratings"
           xmlns:qual="http://somefilmreview.net/ratings">
      <name>Apollo 13</name>
      <off:rating>PG 13</off:rating>
      <qual:rating>excellent</qual:rating>
    </movie>

Namespaces
    <bank xmlns:FB="http://www.firstbank.com">
      <FB:branch>
        <FB:branchname>Downtown</FB:branchname>
        <FB:branchcity> Brooklyn </FB:branchcity>
      </FB:branch>
    </bank>
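A namespace-aware parser distinguishes the two rating elements by their namespace URI, not by the prefix. The sketch below, based on the movie example, uses Python's xml.etree.ElementTree; the dictionary ns maps prefixes chosen by the reader of the document to the URIs and is an assumption of this example.

```python
import xml.etree.ElementTree as ET

doc = """
<movie xmlns:off="http://somefilmauthority.com/ratings"
       xmlns:qual="http://somefilmreview.net/ratings">
  <name>Apollo 13</name>
  <off:rating>PG 13</off:rating>
  <qual:rating>excellent</qual:rating>
</movie>
"""

# Prefix-to-URI map used for lookups; only the URIs have to match the document.
ns = {
    "off": "http://somefilmauthority.com/ratings",
    "qual": "http://somefilmreview.net/ratings",
}

movie = ET.fromstring(doc)
official = movie.findtext("off:rating", namespaces=ns)
quality = movie.findtext("qual:rating", namespaces=ns)

# Internally the parser stores the full URI with the local name.
expanded_tag = movie.find("off:rating", ns).tag

print(official, quality, expanded_tag)
```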

XML: A Simple Example
Here's a simple example of an XML document, in this case a made-up document describing information related to ordering parts (from a factory, for example).
Note how the tags are written, surrounded by angle brackets. Tags of the form <Tag> always have a matching end tag of the form </Tag>; this is one of the syntax rules. Tags that don't have ends (and don't have content inside them) can be written as <Tag />, with a special trailing slash. That's another syntax rule.
The start and end tag, plus the content in between, is called an element, and is the basic component of an XML document. Elements must be organized hierarchically, again one of the syntax rules.
Elements can be assigned properties, called attributes, by putting the attribute and its value in the start tag, as in <order ref="xyxxy">
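The hierarchy of elements can be made visible by walking the parsed document. In the sketch below only the start tag <order ref="xyxxy"> comes from the text; the part, name, quantity and rush elements are hypothetical stand-ins for the order document, and outline is a helper written for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; only <order ref="xyxxy"> appears in the text above.
doc = (
    '<order ref="xyxxy">'
    "<part><name>bolt</name><quantity>10</quantity></part>"
    "<rush/>"
    "</order>"
)

def outline(elem, depth=0):
    """Return one line per element, indented by nesting depth."""
    label = elem.tag + "".join(f' {key}="{value}"' for key, value in elem.attrib.items())
    lines = ["  " * depth + label]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting, so the printed outline is the tree view of the document.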


XML Data Model - A Tree
This illustrates how the document structure maps directly onto a data tree (a directed acyclic graph, actually)

XML Document Schema
Database schemas constrain
o what information can be stored, and
o the data types of stored values
XML documents are not required to have an associated schema
However, schemas are very important for XML data exchange
o Otherwise, a site cannot automatically interpret data received from another site
It is possible to use XML "as is", as long as it follows the rules for well-formed XML
But this doesn't allow structure to be validated
o Can't check that all elements are present and correct
o Can't check that attributes are correct
o Can't specify value type of attributes
o Can't describe format to others
Two mechanisms for specifying an XML schema, i.e., to define and validate XML
o Document Type Definition (DTD)
    Widely used
o XML Schema
    Newer, increasing use

Document Type Definition (DTD)
A DTD is a file which contains a formal definition of the permitted structure of the document
A DTD constrains the structure of XML data
o What elements can occur
o What attributes an element can/must have
o What subelements can/must occur inside each element, and how many times
A DTD does not constrain data types
o All values are represented as strings in XML

DTD syntax
o <!ELEMENT element (subelements-specification)>
o <!ATTLIST element (attributes)>

Element Specification in DTD
Elements can be specified as
o names of subelements, or
o #PCDATA (parsed character data), i.e., character strings
Example
    <!ELEMENT depositor (customer-name, account-number)>
    <!ELEMENT customer-name (#PCDATA)>
    <!ELEMENT account-number (#PCDATA)>
Subelement specification may have regular expressions
    <!ELEMENT bank ((account | customer | depositor)+)>
Notation:
    |  alternatives
    +  1 or more occurrences
    *  0 or more occurrences

Bank DTD
    <!DOCTYPE bank [
      <!ELEMENT bank ((account | customer | depositor)+)>
      <!ELEMENT account (account-number, branch-name, balance)>
      <!ELEMENT customer (customer-name, customer-street, customer-city)>
      <!ELEMENT depositor (customer-name, account-number)>
      <!ELEMENT account-number (#PCDATA)>
      <!ELEMENT branch-name (#PCDATA)>
      <!ELEMENT balance (#PCDATA)>
      <!ELEMENT customer-name (#PCDATA)>
      <!ELEMENT customer-street (#PCDATA)>
      <!ELEMENT customer-city (#PCDATA)>
    ]>

Attribute specification in DTD: for each attribute
o A name
o A type
    CDATA (character data)
    ID (identifier), IDREF (ID reference), or IDREFS (multiple IDREFs); more on this later
o A default declaration
    mandatory (#REQUIRED),
    has a default value (value), or
    neither (#IMPLIED): no default value has been provided
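Since a subelement specification is a regular expression over child element names, the idea can be imitated with Python's re module. This is a hand-rolled sketch, not a DTD validator: the content models of the bank DTD are translated by hand into patterns, assuming the subelements of account, customer and depositor are meant as ordered sequences, and elements without an entry are treated as #PCDATA leaves.

```python
import re
import xml.etree.ElementTree as ET

# Bank DTD content models, hand-translated into regular expressions over a
# comma-terminated list of child tag names (an assumption of this sketch).
CONTENT_MODELS = {
    "bank": r"((account|customer|depositor),)+",
    "account": r"account-number,branch-name,balance,",
    "customer": r"customer-name,customer-street,customer-city,",
    "depositor": r"customer-name,account-number,",
}

def conforms(elem):
    """Check elem and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return len(elem) == 0  # treated as #PCDATA: no child elements allowed
    children = "".join(child.tag + "," for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in elem)

good = ET.fromstring(
    "<bank><account><account-number>A-101</account-number>"
    "<branch-name>Downtown</branch-name><balance>500</balance></account></bank>"
)
bad = ET.fromstring(
    "<bank><account><account-number>A-101</account-number>"
    "<balance>500</balance><branch-name>Downtown</branch-name></account></bank>"
)
print(conforms(good), conforms(bad))
```

A real application would hand the DTD to a validating parser instead; the Python standard library parser used here does not validate against DTDs.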

Example of Attribute specification in DTD
A DTD specification for the element account, which has an attribute acct-type of type CDATA with default value checking
    <!ATTLIST account acct-type CDATA "checking">
A DTD specification for the element customer
    <!ATTLIST customer
        customer-id ID     #REQUIRED
        accounts    IDREFS #REQUIRED>

IDs and IDREFs
An element can have at most one attribute of type ID
The ID attribute value of each element in an XML document must be distinct
o Thus the ID attribute value is an object identifier
An attribute of type IDREF must contain the ID value of an element in the same document
An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must be the ID value of an element in the same document
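The two ID/IDREF rules (distinct IDs, references that resolve within the same document) can be checked by hand, as in the sketch below. It is not a DTD validator: the attribute tables are written out manually, and treating account-number as the ID of account and owners as its IDREFS attribute is an assumption for this example, as is the sample data.

```python
import xml.etree.ElementTree as ET

# Which attribute plays the ID / IDREFS role for each element (assumed here;
# a validating parser would read this from the DTD's ATTLIST declarations).
ID_ATTRS = {"account": "account-number", "customer": "customer-id"}
IDREFS_ATTRS = {"account": "owners", "customer": "accounts"}

def check_ids(root):
    """Return (duplicate ID values, references that match no ID)."""
    seen, duplicates = set(), []
    for elem in root.iter():
        id_attr = ID_ATTRS.get(elem.tag)
        if id_attr is not None and id_attr in elem.attrib:
            value = elem.attrib[id_attr]
            if value in seen:
                duplicates.append(value)
            seen.add(value)
    dangling = []
    for elem in root.iter():
        refs_attr = IDREFS_ATTRS.get(elem.tag)
        if refs_attr is not None:
            # An IDREFS value is a whitespace-separated list of ID values.
            for ref in elem.get(refs_attr, "").split():
                if ref not in seen:
                    dangling.append(ref)
    return duplicates, dangling

doc = """
<bank>
  <account account-number="A-401" owners="C100 C102"/>
  <customer customer-id="C100" accounts="A-401"/>
  <customer customer-id="C102" accounts="A-401"/>
</bank>
"""
result = check_ids(ET.fromstring(doc))
print(result)
```

Note that the check accepts any ID as a reference target, which mirrors the untyped nature of IDREFs: nothing stops owners from pointing at another account.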

Bank DTD with Attributes
XML data with ID and IDREF attributes

Limitations of DTDs
No typing of text elements and attributes
o All values are strings, no integers, reals, etc.
Difficult to specify unordered sets of subelements
o Order is usually irrelevant in databases
o (A | B)* allows specification of an unordered set, but
    Cannot ensure that each of A and B occurs only once
IDs and IDREFs are untyped
o The owners attribute of an account may contain a reference to another account, which is meaningless
    The owners attribute should ideally be constrained to refer to customer elements
Can have a single key item (ID), but:
o No support for multi-attribute keys

XML Schema
XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports
o Primitive data types (integers, strings, dates, etc.)
    Also, constraints on min/max values
o Specified in XML format, unlike DTDs
    More standard representation, but verbose
o Is integrated with namespaces
o Many more features
    List types, uniqueness and foreign key constraints, inheritance, ...
o Supports value-based constraints
    E.g., (integers > 100)

More on XML Schema
BUT: significantly more complicated than DTDs, not yet widely used.
An XML document that conforms to a given schema is said to be schema valid and is called an instance of the schema
Similarly to DTDs, the XML Schema specification does not require an XML processor to actually use the document schema
o In contrast with databases, where ALL data MUST comply with the schema

XML Processing: The XML Parser
The parser must verify that the XML is syntactically correct
Such data is said to be well-formed
o The minimal requirement to be XML

A parser MUST stop processing if the data isn't well-formed
o E.g., stop processing and throw an exception to the XML-based application. The XML 1.0 spec requires this behaviour
This is how all parsers work. They all have some sort of API so a program can get at the XML data. All parsers must also indicate failure, and refuse to further process XML data, if the data violates the basic syntax requirements of XML. These requirements are called well-formedness constraints: if the data doesn't comply with them, it is not well-formed XML.

Well-Formed XML Documents
XML documents are subject to two specific constraints
o Well-formedness: An XML document is well-formed if:
    It has a root element
    Every opening tag is followed by a matching closing tag, and the elements are properly nested inside each other
    Any attribute can occur at most once in a given opening tag, its value must be provided, and the value must be quoted
o Validity: An XML document is valid if it obeys the document type definition (DTD) or XML schema that you use to specify the legal syntax of the document
Ensures that the XML document parses into a labeled tree
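The "stop and report" behaviour is easy to observe. In this sketch, Python's xml.etree.ElementTree raises ParseError as soon as a well-formedness constraint is violated; the wrapper try_parse is a name chosen for the example. This parser checks well-formedness only and does not validate against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def try_parse(text):
    """Return (root, None) for well-formed input, or (None, error message)."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        # The parser refuses to hand over any data once a constraint is violated.
        return None, str(err)

root, error = try_parse("<account><balance>400</balance></account>")
bad_root, bad_error = try_parse("<account><balance>400</account></balance>")

print(root.tag, error)
print(bad_root, bad_error)
```

An application built on top of the parser would typically surface the error message, which includes the line and column of the violation, and discard the document.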

XML Schema Version of Bank DTD
A Piece of XML

Important XML Standards
XSL/XSLT: presentation and transformation standards
XPath/XPointer/XLink: standards for linking to documents and elements within them
Namespaces: for resolving name clashes
DOM: Document Object Model for manipulating XML documents
SAX: Simple API for XML parsing
XQuery: query language

XML Terminology Summary
Tags: book, title, author, ...
o start tag: <book>, end tag: </book>
Elements: <book> ... </book>, <author> ... </author>
o elements can be nested
o empty element: <red></red> (can be abbreviated <red/>)
XML document: has a single root element
Well-formed XML document: has matching tags
Valid XML document: conforms to a schema
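The difference between the DOM and SAX interfaces listed above can be sketched with Python's standard library bindings (xml.dom.minidom and xml.sax). The book fragment and the AuthorHandler class are made up for this example: DOM builds the whole document in memory and is then queried, while SAX calls back into the program as each tag is read.

```python
import xml.dom.minidom
import xml.sax

doc = (
    "<book><title>Foundations</title>"
    "<author>Abiteboul</author><author>Hull</author></book>"
)

# DOM: the whole document is loaded into a tree, then queried.
dom = xml.dom.minidom.parseString(doc)
dom_authors = [node.firstChild.data for node in dom.getElementsByTagName("author")]

# SAX: the parser streams through the document and fires callbacks.
class AuthorHandler(xml.sax.ContentHandler):
    """Collects the text of every <author> element."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_author = False

    def startElement(self, name, attrs):
        if name == "author":
            self._in_author = True
            self.authors.append("")

    def characters(self, content):
        if self._in_author:
            self.authors[-1] += content  # text may arrive in several chunks

    def endElement(self, name):
        if name == "author":
            self._in_author = False

handler = AuthorHandler()
xml.sax.parseString(doc.encode("utf-8"), handler)

print(dom_authors, handler.authors)
```

DOM is convenient for manipulating a document; SAX keeps memory use low on large inputs because no tree is built.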