Semistructured data, XML, DTDs

Similar documents
XML. Document Type Definitions XML Schema. Database Systems and Concepts, CSCI 3030U, UOIT, Course Instructor: Jarek Szlichta

Relational Data Model is quite rigid. powerful, but rigid.

XML. Document Type Definitions. Database Systems and Concepts, CSCI 3030U, UOIT, Course Instructor: Jarek Szlichta

The Semi-Structured Data Model. csc343, Introduction to Databases Diane Horton originally based on slides by Jeff Ullman Fall 2017

CSCI3030U Database Models

CS145 Introduction. About CS145 Relational Model, Schemas, SQL Semistructured Model, XML

Query Languages for XML

Overview. Structured Data. The Structure of Data. Semi-Structured Data Introduction to XML Querying XML Documents. CMPUT 391: XML and Querying XML

Introduction to XML Zdeněk Žabokrtský, Rudolf Rosa

XML: Introduction. !important Declaration... 9:11 #FIXED... 7:5 #IMPLIED... 7:5 #REQUIRED... Directive... 9:11

Chapter 13 XML: Extensible Markup Language

EMERGING TECHNOLOGIES. XML Documents and Schemas for XML documents

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

The concept of DTD. DTD(Document Type Definition) Why we need DTD

Copyright 2008 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Chapter 7 XML

Data Presentation and Markup Languages

Introduction to Semistructured Data and XML. Overview. How the Web is Today. Based on slides by Dan Suciu University of Washington

7.1 Introduction. extensible Markup Language Developed from SGML A meta-markup language Deficiencies of HTML and SGML

Chapter 1: Getting Started. You will learn:

XML: Extensible Markup Language

XML. Extensible Markup Language

Tutorial 2: Validating Documents with DTDs

Constructing a Document Type Definition (DTD) for XML

markup language carry data define your own tags self-descriptive W3C Recommendation

CSS, Cascading Style Sheets

Introduction to XML. An Example XML Document. The following is a very simple XML document.

EMERGING TECHNOLOGIES

2009 Martin v. Löwis. Data-centric XML. XML Syntax

Introduction to XML. XML: basic elements

Overview. Introduction. Introduction XML XML. Lecture 16 Introduction to XML. Boriana Koleva Room: C54

Additional Readings on XPath/XQuery Main source on XML, but hard to read:

Introduction to XML. Chapter 133

Introduction to XML. National University of Computer and Emerging Sciences, Lahore. Shafiq Ur Rahman. Center for Research in Urdu Language Processing

XML. Jonathan Geisler. April 18, 2008

Structured documents

Well-formed XML Documents


ODL Design to Relational Designs Object-Relational Database Systems (ORDBS) Extensible Markup Language (XML)

COMP9321 Web Application Engineering. Extensible Markup Language (XML)

XML, DTD, and XPath. Announcements. From HTML to XML (extensible Markup Language) CPS 116 Introduction to Database Systems. Midterm has been graded

Introduction to Semistructured Data and XML. Contents

extensible Markup Language

CSC Web Technologies, Spring Web Data Exchange Formats

User Interaction: XML and JSON

Introduction to Database Systems CSE 414

XML. Objectives. Duration. Audience. Pre-Requisites

COMP9321 Web Application Engineering

Querying XML. COSC 304 Introduction to Database Systems. XML Querying. Example DTD. Example XML Document. Path Descriptions in XPath

Outline. XML vs. HTML and Well Formed vs. Valid. XML Overview. CSC309 Tutorial --XML 4. Edward Xia

Semistructured Data and XML

ADT 2005 Lecture 7 Chapter 10: XML

Fundamentals of Web Programming a

SDPL : XML Basics 2. SDPL : XML Basics 1. SDPL : XML Basics 4. SDPL : XML Basics 3. SDPL : XML Basics 5

XML. extensible Markup Language. Overview. Overview. Overview XML Components Document Type Definition (DTD) Attributes and Tags An XML schema

Introduction to XML. Yanlei Diao UMass Amherst April 17, Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.

XQUERY: an XML Query Language

Session [2] Information Modeling with XSD and DTD

CSCC43 Introduction to Databases. XML Query Languages CSCC43

Author: Irena Holubová Lecturer: Martin Svoboda

Introduction to XML (Extensible Markup Language)

Information Systems. DTD and XML Schema. Nikolaj Popov

extensible Markup Language (XML) Basic Concepts

(One) Layer Model of the Semantic Web. Semantic Web - XML XML. Extensible Markup Language. Prof. Dr. Steffen Staab Dipl.-Inf. Med.

Delivery Options: Attend face-to-face in the classroom or remote-live attendance.

XML. Rodrigo García Carmona Universidad San Pablo-CEU Escuela Politécnica Superior

Delivery Options: Attend face-to-face in the classroom or via remote-live attendance.

XML and DTD. Mario Alviano A.Y. 2017/2018. University of Calabria, Italy 1 / 28

H2 Spring B. We can abstract out the interactions and policy points from DoDAF operational views

DISCUSSION 5min 2/24/2009. DTD to relational schema. Inlining. Basic inlining

XML. extensible Markup Language. ... and its usefulness for linguists

Introduction Syntax and Usage XML Databases Java Tutorial XML. November 5, 2008 XML

XML. COSC Dr. Ramon Lawrence. An attribute is a name-value pair declared inside an element. Comments. Page 3. COSC Dr.

EXtensible Markup Language XML

Informatics 1: Data & Analysis

Introduction to XML. When talking about XML, here are some terms that would be helpful:

XML Structures. Web Programming. Uta Priss ZELL, Ostfalia University. XML Introduction Syntax: well-formed Semantics: validity Issues

Extensible Markup Language (XML) Hamid Zarrabi-Zadeh Web Programming Fall 2013

Part II: Semistructured Data

10/24/12. What We Have Learned So Far. XML Outline. Where We are Going Next. XML vs Relational. What is XML? Introduction to Data Management CSE 344

Introduction to Data Management CSE 344

XML databases. Jan Chomicki. University at Buffalo. Jan Chomicki (University at Buffalo) XML databases 1 / 9

Java EE 7: Back-end Server Application Development 4-2

CSE 880. Advanced Database Systems. Semistuctured Data and XML

Describing, Transforming, and Querying Semistructured Data

XML 2 APPLICATION. Chapter SYS-ED/ COMPUTER EDUCATION TECHNIQUES, INC.

Markup Languages SGML, HTML, XML, XHTML. CS 431 February 13, 2006 Carl Lagoze Cornell University

XML. Semi-structured data (SSD) SSD Graphs. SSD Examples. Schemas for SSD. More flexible data model than the relational model.

XML in Databases. Albrecht Schmidt. al. Albrecht Schmidt, Aalborg University 1

XML Information Set. Working Draft of May 17, 1999

COMP9321 Web Application Engineering

Comp 336/436 - Markup Languages. Fall Semester Week 4. Dr Nick Hayward

Chapter 1: Semistructured Data Management XML

CHAPTER 2 MARKUP LANGUAGES: XHTML 1.0

2006 Martin v. Löwis. Data-centric XML. Document Types

DATA MODELS FOR SEMISTRUCTURED DATA

What is XML? XML is designed to transport and store data.

W3C XML XML Overview

Chapter 1: Semistructured Data Management XML

Introduction to Database Systems CSE 414

One of the main selling points of a database engine is the ability to make declarative queries---like SQL---that specify what should be done while

Transcription:

Semistructured data, XML, DTDs Introduction to Databases Manos Papagelis Thanks to Ryan Johnson, John Mylopoulos, Arnold Rosenbloom and Renee Miller for material in these slides

Structured vs. unstructured data 2 Databases are highly structured Well-known data format: relations and tuples Every tuple conforms to a known schema Data independence? Woe unto you if you lose the schema Plain text is unstructured Cannot assume any predefined format Apparent organization makes no guarantees Self-describing: little external knowledge needed... but have to infer what the data means

Semi-structured data 4 Observation: most data has some structure Text: sentences, paragraphs, sections,... Books: chapters Web pages: HTML Idea of semistructured data: Enforce well-formatted data => Always know how to read/parse/manipulate it Optionally, enforce well-structured data also => Might help us interpret the data, too Pro: highly portable Con: verbose/redundant

XML 6

XML: designed for data interchange 7 <books search-terms= database+design > <book> <title>database Design for Mere Mortals </title> <author>michael J. Hernandez</author> <date>13/03/2003 </date> </book> <book id= B2 > <title>beginning Database Design</title> <subtitle>from Novice to Professional</subtitle> <author>clare Churcher</author> </book> </books>

Features of XML 8 Intentionally similar syntax to HTML Tree-structured (hierarchical) format Elements surrounded by opening and closing tags Attributes embedded in opening tags => <tag-name attr-name= attr-value >data</tag-name> But with important differences Strictly well-formed (must close all tags, etc.) Tag/attribute names carry no semantic meaning Data-only format: no implied presentation Valid identifiers (for elements and attributes) [a-z][a-z] Both descendants of SGML

XML terminology 9 <?xml version= 1.0?> attributes <PersonList Type= Student Date= 2002-02-02 > <Title Value= Student List /> <Person> </Person> <Person> </Person> </PersonList> elements Empty element Root element Element (or tag) names Elements are nested Root element contains all others

Content of Person XML terminology (cont.) Opening tag <Person Name = John Id = s111111111 > John is a nice fellow <Address> <Number>21</Number> <Street>Main St.</Street> </Address> </Person> standalone text, not very useful as data, non-uniform Child of Address, Descendant of Person Nested element, child of Person Parent of Address, Ancestor of Number Closing tag: What is open must be closed

Example XML Document 11 <?xml version= 1.0?> <!-- Some comment --> <Students> <Student StudId= 111111111 > <Name><First>John</First><Last>Doe</Last></Name> <Status>U2</Status> <CrsTaken CrsCode= CS308 Semester= F1997 /> <CrsTaken CrsCode= MAT123 Semester= F1997 /> </Student> <Student StudId= 987654321 > <Name><First>Bart</First><Last>Simpson</Last></Name> <Status>U4</Status> <CrsTaken CrsCode= CS308 Semester= F1994 /> </Student> </Students> <!-- Some other comment -->

XML Document is a Tree 12

Two kinds of XML Documents 13 Well-Formed XML Just need to use proper nesting Can invent your own tags Any tag can go anywhere Valid XML Can invent tags, but you define what they are and where they can go A DTD (document type definition) specifies these rules

Rules for well-formed XML Must have a root element Every opening tag must have matching closing tag Elements must be properly nested <foo><bar></foo></bar> is a no-no An attribute name can occur at most once in an opening tag. If it occurs: It must have an explicitly specified value (Boolean attrs, like in HTML, are not allowed) The value must be quoted (with or ) XML processors not allowed to fix ill-formed documents (HTML browsers expected to!)

XML, text, and whitespace 16 Adjacent non-tag chars parsed as text nodes Parser never ignores whitespace Leading and trailing space left with its text node Whitespace between tags produces empty text nodes Example: <foo> hi<bar> ho </bar> </foo> foo hi bar \n ho \n

Example: Well-Formed XML 17 <?xml version = 1.0 standalone = yes?> <BARS> <BAR><NAME>Joe s Bar</NAME> <BEER><NAME>Bud</NAME> Root tag <PRICE>2.50</PRICE></BEER> <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER> </BAR> <BAR> </BAR> </BARS> Tags surrounding a BAR element A NAME subelement A BEER subelement Nesting rule for tags must be obeyed

Checking your XML 18 http://validator.w3.org xmllint command on cdf. By default, checks if well-formed --debug Outputs an annotated tree of the parsed document

Problems with well-formed XML 19 If a program will process XML, good to know things like: What tags are allowed What order, nesting What attributes for each tag What s mandatory or optional A DTD specifies exactly this

DOCUMENT TYPE DEFINITION (DTD) 20

Document type definition (DTD) 21 Enforces more than well-formed -ness Which entities may (or must) appear where Attributes entities may (or must) have Types attributes and data must adhere to DTD is separate from XML it constraints May be embedded in separate section Most often linked (referenced externally) Validation: checking XML against its DTD(s) Important for interpreting/validating data Not necessary for parsing

DTD building blocks 22 Elements (<an-element>...</an-element>) Must always close tags If no contents: <empty-element/> Attributes (<... an-attr=......>) Text PCDATA (parsed character data) Mixed text and markup Use entities to escape >, etc. which should not be parsed CDATA ([non-parsed] character data) Plain text data Tags not parsed, entities not expanded Entities ( special tokens) e.g. < > & " &apos; HTML defines lots of others (e.g. )

DTD elements 23 <!ELEMENT $e...> $e is the element name " " may contain any of: Nothing: <!ELEMENT $e EMPTY> Anything: <!ELEMENT $e ANY> Text data: <!ELEMENT $e (#PCDATA)> Always parsed (#CDATA not allowed here) Child elements: <!ELEMENT $e (...)> Any child referenced must also be declared Child elements may themselves have children Mixed content: <!ELEMENT $e (#PCDATA...

DTD elements: children 24 Base construct: sequence (,) <!ELEMENT $e (a)> <!ELEMENT $e (a, b, c,...)> Comma ","defines order of which children must appear in XML Either-or content ( ) <!ELEMENT $e (a b...)> Exactly one of the options must appear in the XML Constraining child cardinality <!ELEMENT $e (a, b+, c*, d?)> not followed by any of +, *,? : exactly one (e.g., a) +: at least one (e.g., b) *: zero or more (e.g., c)?: at most one (e.g., d)

DTD elements: Example 25 <!ELEMENT resume ( bio,interests,education, experience,awards,service)> <!ELEMENT bio ( name, addr, phone, email?, fax?, url?)> <!ELEMENT interests (interest+)> <!ELEMENT education (degree*)> <!ELEMENT awards ((award honor)*)>... Sequences and either-or can both nest

DTD elements: Another Example 26 <!DOCTYPE BARS [ ]> <!ELEMENT BARS (BAR*)> <!ELEMENT BAR (NAME, BEER+)> <!ELEMENT NAME (#PCDATA)> <!ELEMENT BEER (NAME, PRICE)> <!ELEMENT PRICE (#PCDATA)> NAME and PRICE are text A BARS object has zero or more BAR s nested within A BAR has one NAME and one or more BEER subobjects A BEER has a NAME and a PRICE

DTD Attributes 27 <!ATTLIST $e $a $type $required> Declares an attribute $a on element $e $type may be any of character data: CDATA one of a set of values: (v1 v2...) unique identifier: ID references to one/many ID token(s) of other attributes: IDREF[S] valid xml name (or list of names): NMTOKEN[S] entity (or entities): ENTITY/ENTITIES $required may be required (not required): #REQUIRED (#IMPLIED) fixed value (always the same): #FIXED $value default value (used if none given): $value

DTD attributes: examples 28 <!ATTLIST person sin ID #REQUIRED spouse IDREF #IMPLIED name CDATA John Doe trusted (yes no) no species #FIXED homo sapiens alive (yes no) #IMPLIED > #IMPLIED unless specified otherwise

DTD attributes: ID[REF][S] 29 ID attribute type Uniquely identifies an element in the document (like keys) Error to have two Like HTML id attribute, but can have any name IDREF Refers to another element by ID (like foreign keys) Error if corresponding ID does not exist Like HTML href attribute, but no # needed IDREFS List of IDREF attributes, space-separated Problem: only one global set of IDs

Example: The DTD 30 <!DOCTYPE BARS [ <!ELEMENT BARS (BAR*, BEER*)> <!ELEMENT BAR (SELLS+)> <!ATTLIST BAR name ID #REQUIRED> <!ELEMENT SELLS (#PCDATA)> <!ATTLIST SELLS thebeer IDREF #REQUIRED> <!ELEMENT BEER EMPTY> <!ATTLIST BEER name ID #REQUIRED> <!ATTLIST BEER soldby IDREFS #IMPLIED> ]>

Example: The XML Document 31 <BARS> <BAR name = JoesBar > <SELLS thebeer = Bud >2.50</SELLS> <SELLS thebeer = Miller >3.00</SELLS> </BAR> <BEER name = Bud soldby = JoesBar SuesBar /> </BARS>

DTD entities 32 The XML equivalent of #define <!ENTITY $name $substituted-value > Can t take parameters, though Used just like other entities <politician-speak> I vow to lead the fight to stamp out &buzz-word; by instituting powerful new programs that will... </politician-speak> Pick your favorite substitution: <!ENTITY buzz-word communism > <!ENTITY buzz-word racism > <!ENTITY buzz-word terrorism > <!ENTITY buzz-word illegal file sharing > Not heavily used: better templating methods exist

Embedded vs. External DTD Specified as part of a document <?xml version= 1.0?> <!DOCTYPE Book [ ]> <Book> </Book> Reference to external (stand-alone) DTD <?xml version= 1.0?> <!DOCTYPE Book http://csc343.com/book.dtd > <Book> </Book>

EXAMPLE: Emdedded DTD 34 <?xml version = 1.0 standalone = no?> <!DOCTYPE BARS [ <!ELEMENT BARS (BAR*)> <!ELEMENT BAR (NAME, BEER+)> The DTD <!ELEMENT NAME (#PCDATA)> <!ELEMENT BEER (NAME, PRICE)> <!ELEMENT PRICE (#PCDATA)> ]> <BARS> <BAR><NAME>Joe s Bar</NAME> <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER> <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER> </BAR> <BAR> </BARS> The document

EXAMPLE: External DTD 35 Assume the BARS DTD is in file bar.dtd: <?xml version = 1.0 standalone = no?> <!DOCTYPE BARS SYSTEM bar.dtd > <BARS> <BAR><NAME>Joe s Bar</NAME> <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER> <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER> </BAR> <BAR> </BARS> Get the DTD from the file bar.dtd

Limitations of DTDs 36 Don t understand namespaces Very limited typing (just strings and xml names) Very weak referential integrity All ID / IDREF / IDREFS use global ID space Can t express unordered contents conveniently How to specify that a,b,c must all appear, but in any order? All element names are global Is <name> for people or companies? can t declare both in the same DTD

XML Schema 37 Designed to improve on DTDs Advantages: Integrated with namespaces Many built-in types User-defined types Has local element names Powerful key and referential constraints Disadvantages: Unwieldy, much more complex than DTDs We won t cover XML schema in class

What is Next? 38 XML Query Languages XPATH XQUERY