Part V. SAX Simple API for XML

Similar documents
Part V. SAX Simple API for XML. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 84

4 SAX. XML Application. SAX Parser. callback table. startelement! startelement() characters! 23

SAX Simple API for XML

1 <?xml encoding="utf-8"?> 1 2 <bubbles> 2 3 <!-- Dilbert looks stunned --> 3

XML Databases 4. XML Processing,

4. XML Processing. XML Databases 4. XML Processing, The XML Processing Model. 4.1The XML Processing Model. 4.1The XML Processing Model

Part IV. DOM Document Object Model

Part IV. DOM Document Object Model. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 70

Part XII. Mapping XML to Databases. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 321

Part III. Well-Formed XML. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 33

Part II. Markup Basics. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 18

Part VII. Querying XML The XQuery Data Model. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2005/06 153

The Xlint Project * 1 Motivation. 2 XML Parsing Techniques

The DOM approach has some obvious advantages:

XML Parsers. Asst. Prof. Dr. Kanda Runapongsa Saikaew Dept. of Computer Engineering Khon Kaen University

XML APIs. Web Data Management and Distribution. Serge Abiteboul Philippe Rigaux Marie-Christine Rousset Pierre Senellart

Trees : Part 1. Section 4.1. Theory and Terminology. A Tree? A Tree? Theory and Terminology. Theory and Terminology

Simple API for XML (SAX)

Document Parser Interfaces. Tasks of a Parser. 3. XML Processor APIs. Document Parser Interfaces. ESIS Example: Input document

Databases and Information Systems 1

2 Well-formed XML. We will now try to approach XML in a slightly more formal way. The nuts and bolts of XML are pleasingly easy to grasp.

XML Master: Professional V2

XML in the Development of Component Systems. Parser Interfaces: SAX

XML: Tools and Extensions

XML: Tools and Extensions

COMP4317: XML & Database Tutorial 2: SAX Parsing

Databases and Information Systems XML storage in RDBMS and core XPath implementation. Prof. Dr. Stefan Böttcher

MANAGING INFORMATION (CSCU9T4) LECTURE 4: XML AND JAVA 1 - SAX

Parsing XML documents. DOM, SAX, StAX

Looking Inside the Developer s Toolkit: Introduction to Processing XML with RPG and SQL Too! Charles Guarino

Grammars and Parsing, second week

Web Data Management. Tree Pattern Evaluation. Philippe Rigaux CNAM Paris & INRIA Saclay

Semi-structured Data: Programming. Introduction to Databases CompSci 316 Fall 2018

XML Programming in Java

Introduction to XML. Large Scale Programming, 1DL410, autumn 2009 Cons T Åhs

XML: Managing with the Java Platform

Evaluating XPath Queries

Part XVII. Staircase Join Tree-Aware Relational (X)Query Processing. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 440

LECTURE 11 TREE TRAVERSALS

CSI 3140 WWW Structures, Techniques and Standards. Representing Web Data: XML

Outline. Approximation: Theory and Algorithms. Ordered Labeled Trees in a Relational Database (II/II) Nikolaus Augsten. Unit 5 March 30, 2009

Accelerating SVG Transformations with Pipelines XML & SVG Event Pipelines Technologies Recommendations

The Extensible Markup Language (XML) and Java technology are natural partners in helping developers exchange data and programs across the Internet.

XML: Extensible Markup Language

Introduction to XML. XML: basic elements

Systems Programming. 05. Structures & Trees. Alexander Holupirek

Part II. XML Basics. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 15

Figure 1. A breadth-first traversal.

Chapter 13 XML: Extensible Markup Language

Acceleration Techniques for XML Processors

Data Abstractions. National Chiao Tung University Chun-Jen Tsai 05/23/2012

The XQuery Data Model

Making XT XML capable

CSE 431S Type Checking. Washington University Spring 2013

Markup Languages SGML, HTML, XML, XHTML. CS 431 February 13, 2006 Carl Lagoze Cornell University

XML. Presented by : Guerreiro João Thanh Truong Cong

Assignment 3 Suggested solutions

xmlreader & xmlwriter Marcus Börger

EXIP - Embeddable EXI implementation in C EXIP USER GUIDE. December 11, Rumen Kyusakov PhD student, Luleå University of Technology

Introduction p. 1 An XML Primer p. 5 History of XML p. 6 Benefits of XML p. 11 Components of XML p. 12 BNF Grammar p. 14 Prolog p. 15 Elements p.

SERIOUS ABOUT SOFTWARE. Qt Core features. Timo Strömmer, May 26,

xmlreader & xmlwriter Marcus Börger

Accelerating XML Structural Matching Using Suffix Bitmaps

To accomplish the parsing, we are going to use a SAX-Parser (Wiki-Info). SAX stands for "Simple API for XML", so it is perfect for us

XML and Databases. Lecture 10 XPath Evaluation using RDBMS. Sebastian Maneth NICTA and UNSW

Handling SAX Errors. <coll> <seqment> <title PMID="xxxx">title of doc 1</title> text of document 1 </segment>

Binary Trees. Height 1

Ecient XPath Axis Evaluation for DOM Data Structures

Streaming Validation of Schemata: the Lazy Typing Discipline

CS301 - Data Structures Glossary By

Data Abstraction and Specification of ADTs

Enabling Semantic Web Programming by Integrating RDF and Common Lisp

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

Accelerated, Threaded XML Parsing

Linked List Implementation of Queues

XML and Databases. Outline. Outline - Lectures. Outline - Assignments. from Lecture 3 : XPath. Sebastian Maneth NICTA and UNSW

.. Cal Poly CPE/CSC 366: Database Modeling, Design and Implementation Alexander Dekhtyar..

The tree data structure. Trees COL 106. Amit Kumar Shweta Agrawal. Acknowledgement :Many slides are courtesy Douglas Harder, UWaterloo

1. Stack overflow & underflow 2. Implementation: partially filled array & linked list 3. Applications: reverse string, backtracking

SDPL : XML Basics 2. SDPL : XML Basics 1. SDPL : XML Basics 4. SDPL : XML Basics 3. SDPL : XML Basics 5

Processing XML Documents with SAX Using BSF4ooRexx

[ DATA STRUCTURES ] Fig. (1) : A Tree

XBinder. XML Schema Compiler Version 1.4 C EXI Runtime Reference Manual

Call: JSP Spring Hibernate Webservice Course Content:35-40hours Course Outline

M359 Block5 - Lecture12 Eng/ Waleed Omar

Data Exchange. Hyper-Text Markup Language. Contents: HTML Sample. HTML Motivation. Cascading Style Sheets (CSS) Problems w/html

Computer Science 210 Data Structures Siena College Fall Topic Notes: Trees

Tree. Virendra Singh Indian Institute of Science Bangalore Lecture 11. Courtesy: Prof. Sartaj Sahni. Sep 3,2010

Java EE 7: Back-end Server Application Development 4-2

References and Homework ABSTRACT DATA TYPES; LISTS & TREES. Abstract Data Type (ADT) 9/24/14. ADT example: Set (bunch of different values)

Compilers. Intermediate representations and code generation. Yannis Smaragdakis, U. Athens (original slides by Sam

A6-R3: DATA STRUCTURE THROUGH C LANGUAGE

Applied Databases. Sebastian Maneth. Lecture 4 SAX Parsing, Entity Relationship Model. University of Edinburgh - January 21st, 2016

Prototype for Wrapping and Visualising Geo- Referenced Data in Distributed Environment using XML

Arrays. Comp Sci 1570 Introduction to C++ Array basics. arrays. Arrays as parameters to functions. Sorting arrays. Random stuff

Experience with XML Signature and Recommendations for future Development

Duncan Temple Lang. This is more of a ``how to'' than a ``why'', or ``how does it work'' document.

Largest Online Community of VU Students

Analysis of Algorithms

MID TERM MEGA FILE SOLVED BY VU HELPER Which one of the following statement is NOT correct.

Transcription:

Part V SAX Simple API for XML Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 76 Outline of this part 1 SAX Events 2 SAX Callbacks 3 SAX and the XML Tree Structure 4 Final Remarks on SAX Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 77

SAX Simple API for XML SAX 5 (Simple API for XML) is, unlike DOM, not a W3C standard, but has been developed jointly by members of the XML-DEV mailing list (ca 1998) SAX processors use constant space, regardless of the XML input document size Communication between the SAX processor and the back-end XML application does not involve an intermediate tree data structure Instead, the SAX parser sends events to the application whenever a certain piece of XML text has been recognized (ie, parsed) The back-end acts on/ignores events by populating a callback function table 5 http://wwwsaxprojectorg/ Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 78 Sketch of SAX s mode of operations? x m l SAX Parser startelement! characters! callback table XML Application startelement() A SAX processor reads its input document sequentially and once only No memory of what the parser has seen so far is retained while parsing As soon as a significant bit of XML text has been recognized, an event is sent The application is able to act on events in parallel with the parsing progress Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 79

SAX Events SAX Events To meet the constant memory space requirement, SAX reports fine-grained parsing events for a document: Event reported when seen Parameters sent startdocument?xml? 6 enddocument EOF startelement t a 1 =v 1 a n =v n t, (a 1, v 1 ),, (a n, v n ) endelement /t t characters text content Unicode buffer ptr, length comment!--c-- c processinginstruction?t pi? t, pi 6 NB: Event startdocument is sent even if the optional XML text declaration should be missing Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 80 1?xml encoding= utf-8? 1 2 bubbles 2 SAX Events 3!-- Dilbert looks stunned -- 3 dilbertxml 4 bubble speaker= phb to= dilbert 4 5 Tell the truth, but do it in your usual engineering way 6 so that no one understands you 5 7 /bubble 6 8 /bubbles 7 8 Event 7 8 Parameters sent 1 startdocument 2 startelement t = bubbles 3 comment c = Dilbert looks stunned 4 startelement t = bubble, ( speaker, phb ), ( to, dilbert ) 5 characters buf = Tell the understands you, len = 99 6 endelement t = bubble 7 endelement t = bubbles 8 enddocument 7 Events are reported in document reading order 1, 2,, 8 8 NB: Some events suppressed (white space) Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 81

1?xml encoding= utf-8? 1 2 bubbles 2 SAX Events 3!-- Dilbert looks stunned -- 3 dilbertxml 4 bubble speaker= phb to= dilbert 4 5 Tell the truth, but do it in your usual engineering way 6 so that no one understands you 5 7 /bubble 6 8 /bubbles 7 8 Event 7 8 Parameters sent 1 startdocument 2 startelement t = bubbles 3 comment c = Dilbert looks stunned 4 startelement t = bubble, ( speaker, phb ), ( to, dilbert ) 5 characters buf = Tell the understands you, len = 99 6 endelement t = bubble 7 endelement t = bubbles 8 enddocument 7 Events are reported in document reading order 1, 2,, 8 8 NB: Some events suppressed (white space) Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 81 SAX Callbacks SAX Callbacks To provide an efficient and tight coupling between the SAX frontend and the application backend, the SAX API employs function callbacks: 9 1 Before parsing starts, the application registers function references in a table in which each event has its own slot: Event Callback startelement? endelement? SAX register(startelement, startelement ()) SAX register(endelement, endelement ()) Event Callback startelement startelement () endelement endelement () 2 The application alone decides on the implementation of the functions it registers with the SAX parser 3 Reporting an event i then amounts to call the function (with parameters) registered in the appropriate table slot 9 Much like in event-based GUI libraries Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 82

SAX Callbacks Java SAX API In Java, populating the callback table is done via implementation of the SAX ContentHandler interface: a ContentHandler object represents the callback table, its methods (eg, public void enddocument ()) represent the table slots Example: Reimplement contentcc shown earlier for DOM (find all XML text nodes and print their content) using SAX (pseudo code): content (File f ) // register the callback, // we ignore all other events SAX register (characters, printtext); SAX parse (f ); return ; printtext ((Unicode) buf, Int len) Int i; foreach i 1 len do print (buf [i]); return ; Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 83 SAX and XML Trees SAX and the XML Tree Structure Looking closer, the order of SAX events reported for a document is determined by a preorder traversal of its document tree 10 : 1 Doc 16 Sample XML document 2 Elem 15 a 1 1 2 a 2 3 b 3 foo 4 /b 5 4!--sample-- 6 5 c 7 6 d 8 bar 9 /d 10 7 e 11 baz 12 /e 13 8 /c 14 9 /a 15 16 b 3 Elem 5 4 Text foo 6 Comment 7 Elem 14 c sample 8 Elem 10 d 11 Elem 13 9 Text 12 Text e bar baz NB: An Elem [Doc] node is associated with two SAX events, namely startelement and endelement [startdocument, enddocument] 10 Sequences of sibling Char nodes have been collapsed into a single Text node Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 84

Challenge SAX and XML Trees This left-first depth-first order of SAX events is well-defined, but appears to make it hard to answer certain queries about an XML document tree Collect all direct children nodes of an Elem node In the example on the previous slide, suppose your application has just received the startelement(t = a ) event 2 (ie, the parser has just parsed the opening element tag a ) With the remaining events 3 16 still to arrive, can your code detect all the immediate children of Elem node a (ie, Elem nodes b and c as well as the Comment node)? Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 85 SAX and XML Trees The previous question can be answered more generally: SAX events are sufficient to rebuild the complete XML document tree inside the application (Even if we most likely don t want to) SAX-based tree rebuilding strategy (sketch): 1 [startdocument] Initialize a stack S of node IDs (eg Z) Push first ID for this node 2 [startelement] Assign a new ID for this node Push the ID onto S 11 3 [characters, comment, ] Simply assign a new node ID 4 [endelement, enddocument] Pop S (no new node created) Invariant: The top of S holds the identifier of the current parent node 11 In callbacks 2 and 3 we might wish to store further node details in a table or similar summary data structure Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 86

SAX and XML Trees SAX Callbacks SAX callbacks to rebuild XML document tree: We maintain a summary table of the form ID NodeType Tag Content ParentID insert (id, type, t, c, pid) inserts a row into this table Maintain stack S of node IDs, with operations push(id), pop(), top(), and empty() startdocument () id 0; Sempty(); insert (id, Doc,,, ); Spush(id); enddocument () Spop(); Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 87 SAX Callbacks SAX and XML Trees startelement (t, (a 1, v 1 ), ) id id + 1; insert (id, Elem, t,, Stop()); Spush(id); endelement (t) Spop(); characters (buf, len) id id + 1; insert (id, Text,, buf [1 len], Stop()); comment (c) id id + 1; insert (id, Comment,, c, Stop()); Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 88

SAX and XML Trees Run against the example given above, we end up with the following summary table: ID NodeType Tag Content ParentID 0 Doc 1 Elem a 0 2 Elem b 1 3 Text foo 2 4 Comment sample 1 5 Elem c 1 6 Elem d 5 7 Text bar 6 8 Elem e 5 9 Text baz 8 Since XML defines tree structures only, the ParentID column is all we need to recover the complete node hierarchy of the input document Walking the XML node hierarchy? Explain how we may use the summary table to find the (a) children, (b) siblings, (c) ancestors, (d) descendants of a given node (identified by its ID) Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 89 Final remarks on SAX Final Remarks on SAX For an XML document fragment shown on the left, SAX might actually report the events indicated on the right: XML fragment 1 affiliation 2 AT&T Labs 3 /affiliation XML + SAX events 1 affiliation 1 2 AT 2 & 3 T Labs 3 4 /affiliation 5 1 startelement(affiliation) 2 characters( n AT, 5) 3 characters( &, 1) 4 characters( T Labs n, 7) 5 endelement(affiliation) White space is reported Multiple characters events may be sent for text content (although adjacent) (Often SAX parsers break text on entities, but may even report each character on its own) Torsten Grust (WSI) Database-Supported XML Processors Winter 2012/13 90