SAX Simple API for XML

Similar documents
1 <?xml encoding="utf-8"?> 1 2 <bubbles> 2 3 <!-- Dilbert looks stunned --> 3

4 SAX. XML Application. SAX Parser. callback table. startelement! startelement() characters! 23

Part V. SAX Simple API for XML. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 84

Part V. SAX Simple API for XML

XML Databases 4. XML Processing,

4. XML Processing. XML Databases 4. XML Processing, The XML Processing Model. 4.1The XML Processing Model. 4.1The XML Processing Model

Part VII. Querying XML The XQuery Data Model. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2005/06 153

The XQuery Data Model

The DOM approach has some obvious advantages:

Outline of this part (I) Part IV. Querying XML Documents. Querying XML Documents. Outline of this part (II)

Part XII. Mapping XML to Databases. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 321

MANAGING INFORMATION (CSCU9T4) LECTURE 4: XML AND JAVA 1 - SAX

XML APIs. Web Data Management and Distribution. Serge Abiteboul Philippe Rigaux Marie-Christine Rousset Pierre Senellart

XML Parsers. Asst. Prof. Dr. Kanda Runapongsa Saikaew Dept. of Computer Engineering Khon Kaen University

Chapter 13 XML: Extensible Markup Language

Parsing XML documents. DOM, SAX, StAX

1 The size of the subtree rooted in node a is 5. 2 The leaf-to-root paths of nodes b, c meet in node d

Document Parser Interfaces. Tasks of a Parser. 3. XML Processor APIs. Document Parser Interfaces. ESIS Example: Input document

Introduction to XML. Large Scale Programming, 1DL410, autumn 2009 Cons T Åhs

XML: Managing with the Java Platform

Databases and Information Systems XML storage in RDBMS and core XPath implementation. Prof. Dr. Stefan Böttcher

XML: Extensible Markup Language

Semi-structured Data: Programming. Introduction to Databases CompSci 316 Fall 2018

Part IV. DOM Document Object Model

Trees : Part 1. Section 4.1. Theory and Terminology. A Tree? A Tree? Theory and Terminology. Theory and Terminology

Databases and Information Systems 1

XML in the Development of Component Systems. Parser Interfaces: SAX

XML Master: Professional V2

The Xlint Project * 1 Motivation. 2 XML Parsing Techniques

Part IV. DOM Document Object Model. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 70

Web Data Management. Tree Pattern Evaluation. Philippe Rigaux CNAM Paris & INRIA Saclay

Part XII. Mapping XML to Databases. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2005/06 324

XML. Presented by : Guerreiro João Thanh Truong Cong

Accelerating SVG Transformations with Pipelines XML & SVG Event Pipelines Technologies Recommendations

Evaluating XPath Queries

Markup Languages SGML, HTML, XML, XHTML. CS 431 February 13, 2006 Carl Lagoze Cornell University

Chapter 2 XML, XML Schema, XSLT, and XPath

Simple API for XML (SAX)

LECTURE 11 TREE TRAVERSALS

XML Data Management. 5. Extracting Data from XML: XPath

COMP4317: XML & Database Tutorial 2: SAX Parsing

XML: Tools and Extensions

CSI 3140 WWW Structures, Techniques and Standards. Representing Web Data: XML

XML: Tools and Extensions

XML Programming in Java

Data Exchange. Hyper-Text Markup Language. Contents: HTML Sample. HTML Motivation. Cascading Style Sheets (CSS) Problems w/html

XML and Databases. Lecture 10 XPath Evaluation using RDBMS. Sebastian Maneth NICTA and UNSW

Making XT XML capable

H2 Spring B. We can abstract out the interactions and policy points from DoDAF operational views

xmlreader & xmlwriter Marcus Börger

Figure 1. A breadth-first traversal.

M359 Block5 - Lecture12 Eng/ Waleed Omar

XML and Databases. Outline. Outline - Lectures. Outline - Assignments. from Lecture 3 : XPath. Sebastian Maneth NICTA and UNSW

Semi-structured Data. 8 - XPath

Streaming Validation of Schemata: the Lazy Typing Discipline

Introduction to XML. XML: basic elements

xmlreader & xmlwriter Marcus Börger

Part II. XML Basics. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 15

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

2 Well-formed XML. We will now try to approach XML in a slightly more formal way. The nuts and bolts of XML are pleasingly easy to grasp.

The Extensible Markup Language (XML) and Java technology are natural partners in helping developers exchange data and programs across the Internet.

Part III. Well-Formed XML. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 33

Lecture Notes. char myarray [ ] = {0, 0, 0, 0, 0 } ; The memory diagram associated with the array can be drawn like this

SAX & DOM. Announcements (Thu. Oct. 31) SAX & DOM. CompSci 316 Introduction to Database Systems

Grammars and Parsing, second week

XML Technologies. Doc. RNDr. Irena Holubova, Ph.D. Web pages:

[ DATA STRUCTURES ] Fig. (1) : A Tree

Analysis of Algorithms

SFilter: A Simple and Scalable Filter for XML Streams

Part VI. Updating XML Documents. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 587

Lec 17 April 8. Topics: binary Trees expression trees. (Chapter 5 of text)

INF2220: algorithms and data structures Series 1

Systems Programming. 05. Structures & Trees. Alexander Holupirek

Informatics 1: Data & Analysis

Accelerating XML Structural Matching Using Suffix Bitmaps

Assignment 3 Suggested solutions

Web architectures Laurea Specialistica in Informatica Università di Trento. DOM architecture

Practical 5: Reading XML data into a scientific application

Informatics 1: Data & Analysis

Data Abstractions. National Chiao Tung University Chun-Jen Tsai 05/23/2012

Summer Final Exam Review Session August 5, 2009

Introduction p. 1 An XML Primer p. 5 History of XML p. 6 Benefits of XML p. 11 Components of XML p. 12 BNF Grammar p. 14 Prolog p. 15 Elements p.

Part XVII. Staircase Join Tree-Aware Relational (X)Query Processing. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 440

XML. Rodrigo García Carmona Universidad San Pablo-CEU Escuela Politécnica Superior

Trees. (Trees) Data Structures and Programming Spring / 28

Handling SAX Errors. <coll> <seqment> <title PMID="xxxx">title of doc 1</title> text of document 1 </segment>

CSE 431S Type Checking. Washington University Spring 2013

Supporting Positional Predicates in Efficient XPath Axis Evaluation for DOM Data Structures

Ecient XPath Axis Evaluation for DOM Data Structures

Binary Trees. Height 1

Definitions A A tree is an abstract t data type. Topic 17. "A tree may grow a. its leaves will return to its roots." Properties of Trees

Crit-bit Trees. Adam Langley (Version )

<Insert Picture Here> Experience with XML Signature and Recommendations for future Development

SERIOUS ABOUT SOFTWARE. Qt Core features. Timo Strömmer, May 26,

.. Cal Poly CPE/CSC 366: Database Modeling, Design and Implementation Alexander Dekhtyar..

To accomplish the parsing, we are going to use a SAX-Parser (Wiki-Info). SAX stands for "Simple API for XML", so it is perfect for us

XQuery Advanced Topics. Alin Deutsch

Intro to XML. Borrowed, with author s permission, from:

Acceleration Techniques for XML Processors

CSE 530A. B+ Trees. Washington University Fall 2013

Transcription:

4. SAX SAX Simple API for XML SAX 7 (Simple API for XML) is, unlike DOM, not a W3C standard, but has been developed jointly by members of the XML-DEV mailing list (ca. 1998). SAX processors use constant space, regardless of the XML input document size. Communication between the SAX processor and the backend XML application does not involve an intermediate tree data structure. Instead, the SAX parser sends events to the application whenever a certain piece of XML text has been recognized (i.e., parsed). The backend acts on/ignores events by populating a callback function table. 7 http://www.saxproject.org/ Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 79

A SAX processor reads its input document sequentially and once only. No memory of what the parser has seen so far is retained while parsing. As soon as a significant bit of XML text has been recognized, an event is sent. The application is able to act on events in parallel with the parsing Marc H. progress. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 80 4. SAX Sketch of SAX s mode of operations <? x m l... <... startelement! callback table characters! startelement() SAX XML Parser Application

SAX Events 4. SAX SAX Events To meet the constant memory space requirement, SAX reports fine-grained parsing events for a document: Event... reported when seen Parameters sent startdocument <?xml...?> 8 enddocument EOF startelement <t a 1=v 1... a n=v n> t, (a 1, v 1),..., (a n, v n) endelement </t> t characters text content Unicode buffer ptr, length comment <!--c--> c processinginstruction <?t pi?> t, pi. 8 N.B.: Event startdocument is sent even if the optional XML text declaration should be missing. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 81

4. SAX SAX Events 1 <?xml encoding="utf-8"?> 1 2 <bubbles> 2 3 <!-- Dilbert looks stunned --> 3 dilbert.xml 4 <bubble speaker="phb" to="dilbert"> 4 5 Tell the truth, but do it in your usual engineering way 6 so that no one understands you. 5 7 </bubble> 6 8 </bubbles> 7 8 Event 9 10 Parameters sent 1 startdocument 2 startelement t = "bubbles" 3 comment c = " Dilbert looks stunned " 4 startelement t = "bubble", ("speaker","phb"), ("to","dilbert") 5 characters buf = "Tell the... understands you.", len = 99 6 endelement t = "bubble" 7 endelement t = "bubbles" 8 enddocument 9 Events are reported in document reading order 1, 2,..., 8. 10 N.B.: Some events suppressed (white space). Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 82

SAX Callbacks 4. SAX SAX Callbacks To provide an efficient and tight coupling between the SAX frontend and the application backend, the SAX API employs function callbacks: 11 1 Before parsing starts, the application registers function references in a table in which each event has its own slot: Event Callback. startelement? endelement?. SAX register(startelement, startelement ()) SAX register(endelement, endelement ()) Event Callback. startelement startelement () endelement endelement (). 2 The application alone decides on the implementation of the functions it registers with the SAX parser. 3 Reporting an event i then amounts to call the function (with parameters) registered in the appropriate table slot. 11 Much like in event-based GUI libraries. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 83

4. SAX SAX Callbacks Java SAX API In Java, populating the callback table is done via implementation of the SAX ContentHandler interface: a ContentHandler object represents the callback table, its methods (e.g., public void enddocument ()) represent the table slots. Example: Reimplement content.cc shown earlier for DOM (find all XML text nodes and print their content) using SAX (pseudo code): content (File f ) printtext ((Unicode) buf, Int len) // register the callback, // we ignore all other events SAX register (characters, printtext); SAX parse (f ); Int i; foreach i 1... len do print (buf [i]); Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 84

4. SAX SAX and XML Trees SAX and the XML Tree Structure Looking closer, the order of SAX events reported for a document is determined by a preorder traversal of its document tree 12 : 1 Doc 16 Sample XML document 1 1 2 <a> 2 3 <b> 3 foo 4 </b> 5 4 <!--sample--> 6 5 <c> 7 6 <d> 8 bar 9 </d> 10 7 <e> 11 baz 12 </e> 13 8 </c> 14 9 </a> 15 16 b 2 Elem 15 a 3 Elem 5 6 Comment 7 Elem 14 c 4 Text "sample" 8 Elem 10 d 11 Elem 13 "foo" 9 Text 12 Text e "bar" "baz" N.B.: An Elem [Doc] node is associated with two SAX events, namely startelement and endelement [startdocument, enddocument]. 12 Sequences of sibling Char nodes have been collapsed into a single Text node. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 85

Challenge 4. SAX SAX and XML Trees This left-first depth-first order of SAX events is well-defined, but appears to make it hard to answer certain queries about an XML document tree. Collect all direct children nodes of an Elem node. In the example on the previous slide, suppose your application has just received the startelement(t = "a") event 2 (i.e., the parser has just parsed the opening element tag <a>). With the remaining events 3... 16 still to arrive, can your code detect all the immediate children of Elem node a (i.e., Elem nodes b and c as well as the Comment node)? Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 86

4. SAX SAX and XML Trees The previous question can be answered more generally: SAX events are sufficient to rebuild the complete XML document tree inside the application. (Even if we most likely don t want to.) SAX-based tree rebuilding strategy (sketch): 1 [startdocument] Initialize a stack S of node IDs (e.g. Z). Push first ID for this node. 2 [startelement] Assign a new ID for this node. Push the ID onto S. 13 3 [characters, comment,... ] Simply assign a new node ID. 4 [endelement, enddocument] Pop S (no new node created). Invariant: The top of S holds the identifier of the current parent node. 13 In callbacks 2 and 3 we might wish to store further node details in a table or similar summary data structure. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 87

4. SAX SAX and XML Trees SAX Callbacks SAX callbacks to rebuild XML document tree: We maintain a summary table of the form ID NodeType Tag Content ParentID insert (id, type, t, c, pid) inserts a row into this table. Maintain stack S of node IDs, with operations push(id), pop(), top(), and empty(). startdocument () id 0; S.empty(); insert (id, Doc,,, ); S.push(id); enddocument () S.pop(); Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 88

SAX Callbacks 4. SAX SAX and XML Trees startelement (t, (a 1, v 1),... ) id id + 1; insert (id, Elem, t,, S.top()); S.push(id); characters (buf, len) id id + 1; insert (id, Text,, buf [1... len], S.top()); endelement (t) S.pop(); comment (c) id id + 1; insert (id, Comment,, c, S.top()); Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 89

4. SAX SAX and XML Trees Run against the example given above, we end up with the following summary table: ID NodeType Tag Content ParentID 0 Doc 1 Elem a 0 2 Elem b 1 3 Text "foo" 2 4 Comment "sample" 1 5 Elem c 1 6 Elem d 5 7 Text "bar" 6 8 Elem e 5 9 Text "baz" 8 Since XML defines tree structures only, the ParentID column is all we need to recover the complete node hierarchy of the input document. Walking the XML node hierarchy? Explain how we may use the summary table to find the (a) children, (b) siblings, (c) ancestors, (d) descendants of a given node (identified by its ID). Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 90

4. SAX SAX and Path Queries SAX and Path Queries Path queries are the core language feature of virtually all XML query languages proposed so far (e.g., XPath, XQuery, XSLT,... ). To keep things simple for now, let a path query take one of two forms (the t i represent tag names): //t 1 /t 2 /... /t n or //t 1 /t 2 /... /t n 1 /text() Semantics: A path query selects a set of Elem nodes [with text(): Text nodes] from a given XML document: 1 The selected nodes have tag name t n [are Text nodes]. 2 Selected nodes have a parent Elem node with tag name t n 1, which in turn has a parent node with tag name t n 2, which... has a parent node with tag name t 1 ( not necessarily the document root element). Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 91

Examples: 4. SAX SAX and Path Queries 1 Retrieve all scene nodes from a DilbertML document: //panels/panel/scene 2 Retrieve all character names from a DilbertML document: //strip/characters/character/text() Path Query Evaluation The summary table discussed in the previous section obviously includes all necessary information to evaluate both types of path queries. Evaluating path queries using the materialized tree structure. Sketch a summary table based algorithm that can evaluate a path query. (Use //a/c/d/text() as an example.) Note that, although based on SAX, such a path query evaluator would probably consume as much memory as a DOM-based implementation. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 92

4. SAX SAX and Path Queries SAX-based path query evaluation (sketch): 1 Preparation: Represent path query //t 1/t 2/... /t n 1/text() via the step array path[0] = t 1, path[1] = t 2,..., path[n 1] = text(). Maintain an array index i = 0... n, the current step in the path. Maintain a stack S of index positions. 2 [startdocument] Empty stack S. We start with the first step. 3 [startelement] If the current step s tag name path[i] and the reported tag name match, proceed to next step. Otherwise make a failure transition 14. Remember how far we have come already: push the current step i onto S. 4 [endelement] The parser ascended to a parent element. Resume path traversal from where we have left earlier: pop old i from S. 5 [characters] If the current step path[i] = text() we have found a match. Otherwise do nothing. 14 This Knuth-Morris-Pratt failure function fail[] is to be explained in the tutorial. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 93

4. SAX SAX and Path Queries SAX-based path query evaluation (given step array path[0... n 1]): startdocument () i 0; S.empty(); These SAX callbacks startelement (t, (a 1, v 1 ),... ) S.push(i); while true do if path[i] = t then i i + 1; if i = n then Match ; i fail[i]; break; if i = 0 then break; i fail[i]; characters (buf, len) endelement (t) if path[i] = text() i S.pop(); then Match ; N.B.: 1 evaluate a path query while we receive events (stream processing), and 2 operate without building a summary data structure and can thus evaluate path queries on documents of arbitrary size. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 94

4. SAX SAX and Path Queries 1 Doc 16 Tracing SAX Events... Is there a bound on the stack depth we need during the path query execution? b 2 Elem 15 a 3 Elem 5 6 Comment 7 Elem 14 c 4 Text "sample" 8 Elem 10 d 11 Elem 13 "foo" 9 Text "bar" 12 Text "baz" Path Query (length n = 4): //a/c/d/text() path[0]=a, path[1]=c, path[2]=d, path[3]=text() e 1 startdocument() i = 0 S = 2 startelement(t = a) i = 1 S = 0 3 startelement(t = b) i = 0 S = 1 0 4 characters("foo", 3) i = 0 S = 1 0 5 endelement(b) i = 1 S = 0 6 comment("sample") i = 1 S = 0 7 startelement(c) i = 2 S = 1 0 8 startelement(d) i = 3 S = 2 1 0 9 characters("bar", 3) i = 3 S = 2 1 0 Match 10 endelement(d) i = 2 S = 1 0 11 startelement(e) i = 0 S = 2 1 0 12 characters("baz", 3) i = 0 S = 2 1 0 13 endelement(e) i = 2 S = 1 0 14 endelement(c) i = 1 S = 0 15 endelement(a) i = 0 S = 16 enddocument() Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 95

Final Remarks on SAX 4. SAX Final remarks on SAX For an XML document fragment shown on the left, SAX might actually report the events indicated on the right: XML fragment 1 <affiliation> 2 AT&T Labs 3 </affiliation> XML + SAX events 1 <affiliation> 1 2 AT 2& 3T Labs 3 4</affiliation> 5 1 startelement(affiliation) 2 characters("\n AT", 5) 3 characters("&", 1) 4 characters("t Labs\n", 7) 5 endelement(affiliation) White space is reported. Multiple characters events may be sent for text content (although adjacent). (Often SAX parsers break text on entities, but may even report each character on its own.) Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 96