Ghislain Fourny
Big Data
9. Data Models
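Syntax vs. Data Models
Physical view (syntax):
<a>
  <d e="f"/>
  <c>this is <b>text</b>.</c>
</a>
Logical view (data model): a tree rooted at the element a, whose children are d (carrying the attribute e = f) and c; c contains the text "this is ", the element b (containing the text "text"), and the text ".".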
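To make the two views concrete, here is a minimal Python sketch that parses the physical syntax and prints the logical tree. It uses the standard library's xml.etree.ElementTree, which is only one possible in-memory model and is not the Infoset or XDM; the helper name outline is hypothetical.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one line per node of the logical tree, indented by depth."""
    pad = "  " * depth
    lines = [pad + "element " + element.tag]
    for name, value in element.attrib.items():
        lines.append(pad + "  attribute " + name + " = " + value)
    # Text directly inside the element, before its first child.
    if element.text and element.text.strip():
        lines.append(pad + "  text " + repr(element.text))
    for child in element:
        lines.extend(outline(child, depth + 1))
        # Text that follows a child is stored on the child as its "tail".
        if child.tail and child.tail.strip():
            lines.append(pad + "  text " + repr(child.tail))
    return lines

SYNTAX = '<a><d e="f"/><c>this is <b>text</b>.</c></a>'
tree = outline(ET.fromstring(SYNTAX))
print("\n".join(tree))
```

Two documents that differ only in indentation between elements produce the same outline here, because the sketch skips whitespace-only text: different syntax, same logical tree.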
Edge vs. Node labeling
[Figure: trees with the labels foo, bar and foobar placed on edges versus on nodes]
XML Data models
Information Set (Infoset): http://www.w3.org/TR/xml-infoset/
Post-Schema-Validation Infoset (PSVI): http://www.w3.org/TR/xmlschema11-1/
XQuery and XPath Data Model (XDM): http://www.w3.org/TR/xpath-datamodel/

HTML/XML Data model
Document Object Model (DOM): http://www.w3.org/TR/REC-DOM-Level-1/

XML Information Set
Information Set
<?xml version="1.0" encoding="utf-8"?>
<dc:metadata xmlns:dc="http://www.systems.ethz.ch">
  <title xml:lang="en" year="2008">Systems Group</title>
  <publisher>ETH Zurich</publisher>
</dc:metadata>
The 11 XML Information Items
Document
Element
Attribute
Processing Instruction
Character
Comment
Namespace
Unexpanded Entity Reference
DTD
Unparsed Entity
Notation
Document Information Items
Document Information Item doc
[children] Element Information Item metadata
[document element] Element Information Item metadata
[notations] <empty>
[unparsed entities] <empty>
[base URI] file:///users/bigdata/documents/info.xml
[character encoding scheme] UTF-8
[standalone] <no value>
[version] 1.0
Element Information Items
Element Information Item metadata
[namespace name] http://www.systems.ethz.ch
[local name] metadata
[prefix] dc
[children] Element Information Items title, publisher
[attributes] <empty>
[namespace attributes] Attribute Information Item xmlns:dc
[in-scope namespaces] Namespace Information Items dc->systems, xml->ns
[base URI] file:///users/bigdata/documents/info.xml
[parent] Document Information Item
Attribute Information Items
Attribute Information Item xmlns:dc
[namespace name] http://www.w3.org/2000/xmlns/
[local name] dc
[prefix] xmlns
[normalized value] http://www.systems.ethz.ch
[specified] true
[attribute type] <no value>
[references] unknown
[owner element] Element Information Item metadata
Namespace Information Items
Namespace Information Item dc->systems
[prefix] dc
[namespace name] http://www.systems.ethz.ch
Namespace Information Item xml->ns
[prefix] xml
[namespace name] http://www.w3.org/XML/1998/namespace
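The [namespace name] and [local name] properties can be observed with a short Python sketch. It assumes the standard library's xml.etree.ElementTree, which is not an Infoset implementation: it reports names in its own "{namespace}local" notation and does not keep the [prefix]. The helper split_name is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<dc:metadata xmlns:dc="http://www.systems.ethz.ch">
  <title xml:lang="en" year="2008">Systems Group</title>
  <publisher>ETH Zurich</publisher>
</dc:metadata>"""

def split_name(expanded):
    """Split ElementTree's '{namespace}local' form into (namespace name, local name)."""
    if expanded.startswith("{"):
        namespace, local = expanded[1:].split("}", 1)
        return namespace, local
    return None, expanded  # the name is in no namespace

root = ET.fromstring(DOC)
title = root[0]
print(split_name(root.tag))
print(split_name(title.tag))
print([split_name(name) for name in title.attrib])
```

Note that title has no namespace name: the dc prefix is in scope, but an unprefixed element only picks up a default namespace, and none is declared here.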
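XML Infoset - the tree
doc
  metadata (namespace attribute: xmlns:dc; in-scope namespaces: dc->systems, xml->ns)
    title (attributes: lang, year; in-scope namespaces: dc->systems, xml->ns)
      Systems Group
    publisher (in-scope namespaces: dc->systems, xml->ns)
      ETH Zurich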
Post-Schema-Validation Infoset
Infoset + Types = Post-Schema-Validation Infoset (PSVI)

XPath and XQuery Data Model
XDM: Sequences of Items
(item, item, item, item, item, item)

XDM: Sequence of one item
item = (item)

XDM: Sequences are flat
((item1, item2), item3) = (item1, item2, item3)
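The two sequence rules (a single item equals the sequence containing it, and sequences never nest) can be imitated in Python. This is only an analogy, since Python tuples do nest; the helper seq is hypothetical and flattens its arguments the way XDM does.

```python
def seq(*parts):
    """Build a flat XDM-style sequence; nested sequences are spliced in."""
    flat = []
    for part in parts:
        if isinstance(part, tuple):
            # A sequence inside a sequence contributes its items, not itself.
            flat.extend(seq(*part))
        else:
            flat.append(part)
    return tuple(flat)

nested = seq(seq("item1", "item2"), "item3")
flat = seq("item1", "item2", "item3")
print(nested == flat)
```

In this sketch a bare item is passed straight to seq, so seq("x") and seq(seq("x")) are the same one-item sequence.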
XDM: Items
Atomic
Node

XDM: Seven Kinds of XML Nodes
Document node
Element node
Attribute node
Text node
Comment node
Processing instruction node
Namespace node

XDM: Seven Kinds of XML Nodes
[Figure: Infoset and XDM]

XDM vs. Infoset
[Figure: Infoset and XDM, with the label xs:untyped]

XDM: New Items in 3.0 and 3.1
Functions
Maps
Arrays

XDM and Querying
[Figure: an Expression surrounded by query keywords and operators: for, let, where, order by, return, if, then, else, any, every, while, exit with, +, =]
Types (In general)

Types (General)
Atomic Types vs. Structured Types

Atomic Types
Strings
Numbers
Booleans
Dates and Times
Time Intervals
Binaries
Null
Lexical Space vs. Value Space
[Figure: literals in the lexical space (among them 1, 01, 3.1415 and 15e+0) mapped onto values in the value space]
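A small Python sketch of the distinction: several literals (lexical space) can denote one value (value space). It uses decimal.Decimal as a stand-in; its lexical space is an assumption of this sketch and is not identical to any XML Schema type (for example, xs:decimal does not allow an exponent).

```python
from decimal import Decimal

# Four different literals from the lexical space ...
lexical_forms = ["1", "01", "1.0", "+1"]

# ... all map to a single element of the value space.
values = {Decimal(form) for form in lexical_forms}
print(len(values))

# The same holds for an exponent notation and a plain integer literal.
print(Decimal("15e+0") == Decimal("15"))
```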
Subtypes
[Figure: the subtype's value space contained in the supertype's value space]
Structured Types
Data Structure: Associative Arrays (a.k.a. maps). Examples: JSON Object, Protobuf Message, Set of XML Attributes
Data Structure: Ordered Lists. Examples: JSON Array, XML Element, Protobuf repeated field
Cardinality
How many?      Common sign   Common adjective
One            (none)        required
Zero or more   *
Zero or one    ?             optional
One or more    +
Protocol Buffers

Messages
message Person {
  required string last_name = 1;
  repeated string first_name = 2;
  optional Title title = 3;
  optional Person boss = 4;
}

Scalar types
double, float
int32, int64 and variants
bool
string
bytes

Enums
enum Title {
  MR = 1;
  MS = 2;
  MRS = 3;
}

In C++
person.boss().first_name()
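To relate the message definition to the cardinalities discussed earlier, here is a rough Python analogy using dataclasses. This is not the Protocol Buffers API and not generated code; it only mirrors how required, repeated and optional fields map to one, zero or more, and zero or one values. The sample names are made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Title(Enum):
    MR = 1
    MS = 2
    MRS = 3

@dataclass
class Person:
    last_name: str                                        # required: exactly one
    first_name: List[str] = field(default_factory=list)  # repeated: zero or more
    title: Optional[Title] = None                         # optional: zero or one
    boss: Optional["Person"] = None                       # optional and recursive

# Hypothetical sample data.
boss = Person(last_name="Muster", first_name=["Erika"], title=Title.MS)
person = Person(last_name="Beispiel", first_name=["Hans", "Peter"], boss=boss)

# Plain-Python counterpart of the C++ accessor chain person.boss().first_name().
print(person.boss.first_name)
```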
Validation

Validation: The Pipeline
Document -> Well-Formedness -> Validation

On the oXygen Cheat Sheet
Validity
Well-Formedness

Validation vs. Annotation
Validation
Annotation

Validation

DTD Validation
Document Type Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a>
<a>
  <d e="f"/>
  <c>this is <b>text</b>.</c>
</a>

Document Type Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
]>
<a>
  <d e="f"/>
  <c>this is <b>text</b>.</c>
</a>
Element Type Declaration: Empty Content
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
]>
<a/>
Element Type Declaration: Simple Content
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a (#PCDATA)>
]>
<a>
  This is text.
</a>
Element Type Declaration: Complex Content
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a (foo, bar)>
<!ELEMENT foo EMPTY>
<!ELEMENT bar EMPTY>
]>
<a>
  <foo/>
  <bar/>
</a>
Element Type Declaration: Complex Content
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a (foo+, bar*, foobar?)>
<!ELEMENT foo EMPTY>
<!ELEMENT bar EMPTY>
<!ELEMENT foobar EMPTY>
]>
<a>
  <foo/>
  <foo/>
  <foo/>
  <foobar/>
</a>
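A content model such as (foo+, bar*, foobar?) is a regular expression over child element names. The following Python sketch checks it by hand; it is an illustration only, not a DTD validator (the standard library's xml.etree.ElementTree does not validate against DTDs), and the helper children_match is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The content model (foo+, bar*, foobar?) as a regex over child names,
# where every child name is followed by exactly one space.
CONTENT_MODEL = re.compile(r"(foo )+(bar )*(foobar )?")

def children_match(xml_text):
    """Check whether the root's children satisfy CONTENT_MODEL."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return CONTENT_MODEL.fullmatch(names) is not None

print(children_match("<a><foo/><foo/><foo/><foobar/></a>"))
print(children_match("<a><bar/></a>"))
```

The first document corresponds to the instance above; the second is rejected because at least one foo is required.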
Element Type Declaration: Complex Content
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a ( bar* | (foo | foobar)+)?>
<!ELEMENT foo EMPTY>
<!ELEMENT foobar EMPTY>
]>
<a>
  <foo/>
  <foobar/>
  <foo/>
  <foobar/>
</a>
Element Type Declaration: Mixed Content
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a (#PCDATA | foo)*>
<!ELEMENT foo EMPTY>
]>
<a>
  <foo/>lorem<foo/>ipsum<foo/>
</a>
Element Type Declaration: Mixed Content
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a ANY>
<!ELEMENT foo EMPTY>
<!ELEMENT bar EMPTY>
]>
<a>
  <foo/>
  Lorem
  <bar/>
  Ipsum
  <bar/>
  <bar/>
</a>
Attribute-List Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a foo CDATA #REQUIRED>
]>
<a foo="this is a &quot;value&quot;"></a>

Attribute-List Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a foo CDATA #IMPLIED>
]>
<a foo="this is a &quot;value&quot;"></a>

Attribute-List Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a foo CDATA "bar">
]>
<a foo="this is a &quot;value&quot;"></a>

Attribute-List Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a foo CDATA "bar">
]>
<a></a>

Attribute-List Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a foo CDATA #FIXED "bar">
]>
<a foo="bar"></a>
Attribute-List Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a foo NMTOKEN #REQUIRED>
]>
<a foo="123atoken456"></a>

Attribute-List Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a foo NMTOKENS #REQUIRED>
]>
<a foo="123atoken 456"></a>
Attribute-List Declaration
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE root [
<!ELEMENT root (foo+, bar, barlist)>
<!ELEMENT foo EMPTY>
<!ELEMENT bar EMPTY>
<!ELEMENT barlist EMPTY>
<!ATTLIST foo myid ID #REQUIRED>
<!ATTLIST bar ref IDREF #REQUIRED>
<!ATTLIST barlist ref IDREFS #REQUIRED>
]>
<root>
  <foo myid="foobar"/>
  <foo myid="foobar2"/>
  <bar ref="foobar"/>
  <barlist ref="foobar foobar2"/>
</root>
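The ID, IDREF and IDREFS constraints boil down to two checks: ID values are unique, and every reference points at an existing ID. A minimal Python sketch, assuming the element and attribute names of the example above are hard-coded (a real validator would read them from the DTD); the function check_ids is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
  <foo myid="foobar"/>
  <foo myid="foobar2"/>
  <bar ref="foobar"/>
  <barlist ref="foobar foobar2"/>
</root>"""

def check_ids(xml_text):
    """Return (ids_are_unique, list_of_dangling_references)."""
    root = ET.fromstring(xml_text)
    ids = [foo.get("myid") for foo in root.iter("foo")]
    # IDREF: a single reference per attribute value.
    refs = [bar.get("ref") for bar in root.iter("bar")]
    # IDREFS: a whitespace-separated list of references.
    for barlist in root.iter("barlist"):
        refs.extend(barlist.get("ref").split())
    unique = len(ids) == len(set(ids))
    dangling = [ref for ref in refs if ref not in ids]
    return unique, dangling

print(check_ids(DOC))
```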
DTD Example: External Subset
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a SYSTEM "schema.dtd">
<a foo="bar"/>
Warning: DTDs and Namespaces
<!ELEMENT eth (date, president, Rektor)>
<!ATTLIST eth xmlns CDATA #FIXED "http://www.ethz.ch"
              xmlns:xmldb CDATA #FIXED "http://www.dbis.ethz.ch"
              date CDATA #IMPLIED
              xmldb:date CDATA #IMPLIED>
<!ELEMENT date (#PCDATA)>
<!ELEMENT president (#PCDATA)>
<!ATTLIST president number CDATA #IMPLIED>
<!ELEMENT Rektor (#PCDATA)>
Notations and Unparsed Entities
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo [
<!ENTITY presentation SYSTEM "/Users/bigdata/desktop/presentation.pptx" NDATA pptx>
<!NOTATION pptx PUBLIC "powerpoint">
<!ELEMENT foo EMPTY>
<!ATTLIST foo foo ENTITY #REQUIRED>
]>
<foo foo="presentation"></foo>
XML Schema

Empty Schema
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
</xs:schema>
Simple Scenario
schema.xsd:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>
file.xml:
<?xml version="1.0" encoding="utf-8"?>
<foo
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="schema.xsd">
  This is text.
</foo>
Simple Scenario
schema.xsd:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:integer"/>
</xs:schema>
file.xml:
<?xml version="1.0" encoding="utf-8"?>
<foo
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="schema.xsd">
  142857
</foo>
Simple Types: Built-in
Strings: string, anyURI, QName
Numbers: decimal, integer, float, double, long, int, short, byte, positiveInteger, nonNegativeInteger..., unsignedLong, unsignedInt...
Booleans: boolean

Simple Types: Built-in
Dates and Times: dateTime, time, date, gYearMonth, gMonthDay, gYear, gMonth, gDay, dateTimeStamp
Time Intervals: duration, yearMonthDuration, dayTimeDuration
Binaries: hexBinary, base64Binary
Null: -
Dates
2014-12-02
2014-12-02T10:15:00Z
01:15:00-08:00
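The three literals above are a date, a date with time in UTC, and a time with a timezone offset. A short sketch of reading them with Python's datetime module follows; the parsing rules are Python's, assumed here to be close enough for these three literals, and they are not a full implementation of the XML Schema lexical spaces.

```python
from datetime import date, datetime, time, timedelta, timezone

# A date without time or timezone.
d = date.fromisoformat("2014-12-02")

# A date and time in UTC; the trailing "Z" is matched literally,
# then the UTC timezone is attached explicitly.
dt = datetime.strptime("2014-12-02T10:15:00Z", "%Y-%m-%dT%H:%M:%SZ")
dt = dt.replace(tzinfo=timezone.utc)

# A time of day with an offset of eight hours behind UTC.
t = time.fromisoformat("01:15:00-08:00")

print(d, dt, t)
```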
User-defined types
Restriction
List (not atomic)
Union (not atomic)
Restriction
schema.xsd:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="myfixedlengthstring">
    <xs:restriction base="xs:string">
      <xs:length value="3"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="foo" type="myfixedlengthstring"/>
</xs:schema>
file.xml:
<?xml version="1.0" encoding="utf-8"?>
<foo
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="schema.xsd">zrh</foo>
Restriction
<xs:simpleType name="myfixedlengthstring">
  <xs:restriction base="xs:string">
    <xs:length value="3"/>
  </xs:restriction>
</xs:simpleType>

<foo>zrh</foo>

List
<xs:simpleType name="mylist">
  <xs:list itemType="xs:string"/>
</xs:simpleType>

<foo>foo bar foobar</foo>

Union
<xs:simpleType name="myunion">
  <xs:union memberTypes="xs:integer xs:boolean"/>
</xs:simpleType>

<foo>true</foo>
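The three kinds of user-defined simple types can be imitated with plain Python functions. This is a sketch of the idea only, not an XML Schema processor: the function names are hypothetical, and the union accepts just the integer literals and the boolean literals true and false.

```python
import re

def is_fixed_length_string(text, length=3):
    """Restriction: a string whose length is exactly `length`."""
    return len(text) == length

def parse_list(text):
    """List: items are separated by whitespace."""
    return text.split()

def parse_union(text):
    """Union of integer and boolean: try the member types in order."""
    text = text.strip()
    if re.fullmatch(r"[+-]?[0-9]+", text):
        return int(text)
    if text in ("true", "false"):
        return text == "true"
    raise ValueError("not in the value space of the union: " + repr(text))

print(is_fixed_length_string("zrh"))
print(parse_list("foo bar foobar"))
print(parse_union("true"))
```

Because the member types are tried in order, a literal such as 1 comes out as an integer here even though it is also a legal boolean literal in XML Schema.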
Complex Types
Empty: <foo/>
Simple Content: <foo>text</foo>
Complex Content: <foo> <a/> <b/> </foo>
Mixed Content: <foo> Text<a/>Text<b/> </foo>
Complex content
<xs:complexType name="complexcontent">
  <xs:sequence>
    <xs:element name="twotofour" type="xs:string" minOccurs="2" maxOccurs="4"/>
    <xs:element name="zeroorone" type="xs:boolean" minOccurs="0" maxOccurs="1"/>
  </xs:sequence>
</xs:complexType>

<foo>
  <twotofour>foobar</twotofour>
  <twotofour>foobar</twotofour>
  <twotofour>foobar</twotofour>
  <zeroorone>true</zeroorone>
</foo>
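The effect of minOccurs and maxOccurs in a sequence can be sketched in a few lines of Python. The occurrence bounds are copied by hand from the schema above; the sketch checks only the order and count of child elements, not their types, and the function matches_sequence is hypothetical.

```python
import xml.etree.ElementTree as ET

# (element name, minOccurs, maxOccurs), in sequence order.
SEQUENCE = [("twotofour", 2, 4), ("zeroorone", 0, 1)]

def matches_sequence(xml_text, sequence=SEQUENCE):
    """Check the root's children against the occurrence bounds."""
    names = [child.tag for child in ET.fromstring(xml_text)]
    pos = 0
    for name, min_occurs, max_occurs in sequence:
        count = 0
        # Consume as many consecutive children with this name as allowed.
        while pos < len(names) and names[pos] == name and count < max_occurs:
            pos += 1
            count += 1
        if count < min_occurs:
            return False
    # Every child must have been consumed by some particle.
    return pos == len(names)

DOC = ("<foo><twotofour>foobar</twotofour><twotofour>foobar</twotofour>"
       "<twotofour>foobar</twotofour><zeroorone>true</zeroorone></foo>")
print(matches_sequence(DOC))
```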
Empty content
<xs:complexType name="emptytype">
  <xs:sequence/>
</xs:complexType>

<foo/>

Simple content
<xs:complexType name="datecountry">
  <xs:simpleContent>
    <xs:extension base="xs:date">
      <xs:attribute name="country" type="xs:string"/>
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>

<foo country="switzerland">2014-12-02</foo>

Mixed content
<xs:complexType name="mixedcontent" mixed="true">
  <xs:sequence>
    <xs:element name="b" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

<foo>some text and some <b>bold</b> text.</foo>

Simple type on attributes
<xs:complexType name="withattribute">
  <xs:sequence/>
  <xs:attribute name="country"
                type="xs:string"
                default="switzerland"/>
</xs:complexType>

<foo country="switzerland"/>
Named Types
<xs:complexType name="empty">
  <xs:sequence/>
</xs:complexType>

<xs:element name="c" type="empty">
</xs:element>

Anonymous Types
<xs:element name="c">
  <xs:complexType>
    <xs:sequence/>
  </xs:complexType>
</xs:element>
No namespaces
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>

<?xml version="1.0" encoding="utf-8"?>
<foo
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="schema.xsd">
  This is text.
</foo>
With namespaces
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://www.example.com/bigdata"
  xmlns:big="http://www.example.com/bigdata">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>

<?xml version="1.0" encoding="utf-8"?>
<big:foo
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.example.com/bigdata schema.xsd"
  xmlns:big="http://www.example.com/bigdata">
  This is text.
</big:foo>
Keys
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      ..
    </xs:complexType>
    <xs:key name="foo-id">
      <xs:selector xpath="foo"/>
      <xs:field xpath="@id"/>
    </xs:key>
  </xs:element>
</xs:schema>
xs:selector: what must be unique
xs:field: what makes it unique

<?xml version="1.0" encoding="utf-8"?>
<root
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="schema.xsd">
  <foo id="foo"/>
  <foo id="bar"/>
  <foo id="foobar"/>
</root>
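What a key demands can be sketched in Python: every element picked by the selector must have the field, and no two of them may share its value. The selector and field are passed as a plain child name and attribute name, which is a simplification of the XPath expressions in the schema; the function key_violations is hypothetical.

```python
import xml.etree.ElementTree as ET

def key_violations(xml_text, selector="foo", field="id"):
    """List the problems a key on selector/@field would report."""
    root = ET.fromstring(xml_text)
    seen = set()
    problems = []
    for element in root.findall(selector):
        value = element.get(field)
        if value is None:
            # A key (unlike a mere uniqueness constraint) requires the field.
            problems.append("missing key field")
        elif value in seen:
            problems.append("duplicate key: " + value)
        else:
            seen.add(value)
    return problems

DOC = '<root><foo id="foo"/><foo id="bar"/><foo id="foobar"/></root>'
print(key_violations(DOC))
```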
Bonus material: The Schema of Schemas
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.w3.org/2001/XMLSchema">
  <xs:element name="schema" id="schema">
    <xs:complexType>
      <xs:complexContent>
        ..
      </xs:complexContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="element" type="xs:topLevelElement" id="element"/>
  <xs:element name="simpleType" type="xs:topLevelSimpleType" id="simpleType"/>
  <xs:element name="complexType" type="xs:topLevelComplexType" id="complexType"/>
  <xs:complexType name="element" abstract="true">
    <xs:complexContent>
      ..
    </xs:complexContent>
  </xs:complexType>
</xs:schema>