Conception of Information Systems Part 1: Data Representation - XML. 2002, Karl Aberer, EPFL-SSC, Laboratoire de systèmes d'informations répartis



2 PART I Data Representation - XML
1. Motivation: Data Integration
2. Repetition: Data Models
3. Data on the Web - XML
4. XML Syntax
   1. Well-formed XML
   2. Document Type Definitions
   3. Entities
   4. Namespaces
5. XML Architecture
   1. Document Object Model
   2. XPath
   3. XSLT
   4. XML Schema
   5. XML Query
6. References

3 1. Data Management
- Applications use persistent data
- Problems: reuse, dependencies, sharing
[Figure: applications accessing the data files Data 1 to Data 4 directly through FILE calls]

In the most general sense, an information system runs applications that access (a large amount of) persistently stored information. Persistence means that the lifetime of data is independent of the execution of a program. In that sense "persistence bridges temporal distance among applications", much as distribution bridges spatial distance. Typical examples of applications with persistent data are:
- Product catalog: order application
- Bank account data: bank transaction application
- Gene databank: protein search application

Large databases are normally used by many different applications, and the same applications use many different databases. This leads to a number of problems:
- All applications need to know how to manage access to a large database. There is little reuse of the functions needed to access large datasets efficiently.
- Inconsistencies and redundancies occur in the data when different applications access the same databases and each application makes its own assumptions about which constraints or dependencies hold.
- Concurrent access to the same data leads to conflicts if multiple applications use the same database at the same time.
- Dependencies exist among databases (databases assume the existence of certain data in other databases), between databases and applications (applications assume that certain properties hold in the database), and among applications (applications assume that other applications perform certain operations).

4 Database Management Systems
- Provide a unique point of access to the data
- DDL/DML
- Transactions
- Software system
[Figure: on one side, applications 1 to 3 access Data 1 to Data 4 through FILE calls; on the other, the same applications access a single database through DDL/DML, the unique interface to the database]

A DBMS (database management system) factors out the standard functions needed to access large databases. In fact, we can see a DBMS as a way of integrating the access to otherwise unrelated data files.

A DBMS supports an abstract data model that consists of
- DDL = language to describe the structure of data
- DML = language to access the data (query, update)

A DBMS coordinates the access by multiple users (operational model) by supporting the concept of transactions.
- Transaction = operation that moves the database from one consistent state to another

The notions of database and database management system need to be distinguished: the DBMS is a software system that supports DDL/DML/transactions for accessing a database.

The main differences among DBMS are
- Data and operational model
- Type of transaction support
- Performance
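The transaction concept can be made concrete with a small sketch. It assumes Python's built-in `sqlite3` module as the DBMS and a hypothetical `account` table whose CHECK constraint defines what "consistent" means; none of these names come from the lecture.

```python
import sqlite3

# Hypothetical schema: the CHECK constraint defines the consistent states.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(src, dst, amount):
    """Credit dst and debit src as one transaction; return True on commit."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit above has been rolled back as well


def balances():
    """Return the current state as {account id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM account"))
```

A transfer that would overdraw the source account fails on the second update, and the first update is undone with it, so the database never leaves a consistent state.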

5 Integrating Databases
[Figure: on one side, applications 1 to 3 access DBMS 1 to DBMS 4 (managing Data 1 to Data 4) separately through their own DDL/DML; on the other, the applications access a federated DBMS through one query language, and wrappers translate to the query languages ql1 to ql4 of DBMS 1 to DBMS 4]

As soon as multiple database systems exist, it is likely that multiple applications will want to use data from several of the databases managed by those DBMS. This means that queries spanning multiple databases should be possible, updates should be directed to the databases where the data originates, and integrity constraints spanning multiple databases should be supported.

Example: integration of publication data. An integrated database could draw from many different resources:
- Web resources: e.g. DB Server in Trier, CiteSeer citation index
- Local bibliography: e.g. MS Access application for lab library
- University library server: e.g. Oracle database application
- Private bibliography: e.g. BibTeX file, MS Word document

The difficulty of this task originates in problems like
- Different data models at the participating DBMS
- Different data schemas and representations at the participating DBMS
- Different processing capabilities at the participating DBMS
- Distribution and communication

This means we encounter the typical problems of distribution, heterogeneity and autonomy when trying to integrate the access to multiple databases that are a priori unrelated. A system that overcomes these problems is called a federated database management system. As we have learned in the introduction, an embedding of the participating databases is also required; it is provided by so-called wrappers. Wrappers overcome, for example, differences in the data models (relational, OO, file, XML, etc.).
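The role of wrappers can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names `RelationalWrapper` and `TextFileWrapper`, the common record format (a dict with `title` and `year`), and the two made-up sources. It is not the architecture of any particular federated DBMS.

```python
import sqlite3


class RelationalWrapper:
    """Exposes rows of a relational source as records of the common model."""

    def __init__(self, conn):
        self.conn = conn

    def papers(self):
        for title, year in self.conn.execute("SELECT title, year FROM paper"):
            yield {"title": title, "year": year}


class TextFileWrapper:
    """Exposes 'year;title' lines of a plain-text source as the same records."""

    def __init__(self, text):
        self.text = text

    def papers(self):
        for line in self.text.strip().splitlines():
            year, title = line.split(";", 1)
            yield {"title": title.strip(), "year": int(year)}


def federated_query(wrappers, predicate):
    """Evaluate one predicate over all sources through their wrappers."""
    return [rec for w in wrappers for rec in w.papers() if predicate(rec)]


# Two made-up sources with different native representations.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE paper (title TEXT, year INTEGER)")
db.execute("INSERT INTO paper VALUES ('Paper A', 1994)")
notes = "1996; Paper B\n1990; Paper C\n"

sources = [RelationalWrapper(db), TextFileWrapper(notes)]
recent = federated_query(sources, lambda rec: rec["year"] >= 1994)
```

The query function never sees SQL or the text format; each wrapper hides the native model of its source behind the common one.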

6 Database Integration
[Figure: three axes (distribution, autonomy, heterogeneity) on which centralized DBMS, distributed databases, multidatabases and federated databases are positioned]

To position federated database systems, let us look at our problem dimensions. Federated databases have to deal with all three problem dimensions. There are, however, situations where autonomy and heterogeneity are less important. With distributed databases we have no heterogeneity and no autonomy at all, as they deal only with the problem of distributing one logical database over the network in order to improve performance. Multidatabases are in between: for example, they assume no heterogeneity in the underlying data models (e.g. all databases being relational).

7 Distributed Database Systems
- Distribution of one logical database on different physical locations
- A priori
- Fragmentation
  - Horizontal
  - Vertical
- Replication
  - Lazy
  - Eager
- Allocation problem

Distributed databases deal with the distribution of one logical database onto different physical locations. Distributed databases are typically designed "a priori"; they do not have to deal with the problem of "a posteriori" integration. Therefore heterogeneity does not really occur as a problem: there is one unique data model (so no syntactic heterogeneity) and there exists a well-defined global schema (so no semantic heterogeneity). The local database managers in a distributed database are fully under the control of the global database manager, and therefore autonomy does not occur either.

The fundamental problems addressed by distributed databases are:
- Fragmentation: units of distribution
  - Horizontal: based on selection predicates in user queries
  - Vertical: based on projection attributes in user queries
- Replication
  - Trade-off: efficient read vs. expensive write
  - Update strategy: lazy vs. eager
- Allocation: given a set of sites, a set of queries and a set of fragments, find an allocation of the fragments to the sites such that the storage, query, update and communication costs are minimized and the performance of the system (response time, throughput) is optimized. This is in general a difficult optimization problem.

In summary, building a distributed database is mostly a "design problem" and not an "integration problem".
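Horizontal and vertical fragmentation can be sketched in a few lines. The sample relation, the helper names and the choice of `PName` as key are assumptions made for illustration only.

```python
professors = [
    {"PName": "Karl", "ZIP": 1026, "State": "VD"},
    {"PName": "Andre", "ZIP": 1018, "State": "VD"},
    {"PName": "John", "ZIP": 8001, "State": "ZH"},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep whole tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def vertical_fragment(relation, key, attributes):
    """Projection: keep the key plus some attributes of every tuple."""
    return [{a: t[a] for a in (key, *attributes)} for t in relation]


# Horizontal fragments, one per site, defined by disjoint predicates.
site_vd = horizontal_fragment(professors, lambda t: t["State"] == "VD")
site_zh = horizontal_fragment(professors, lambda t: t["State"] == "ZH")
rebuilt_h = site_vd + site_zh  # reconstruction by union

# Vertical fragments share the key so that they can be joined again.
zips = vertical_fragment(professors, "PName", ["ZIP"])
states = vertical_fragment(professors, "PName", ["State"])
rebuilt_v = [
    {**z, **s} for z in zips for s in states if z["PName"] == s["PName"]
]
```

Horizontal fragments are recombined by union, vertical fragments by a join on the shared key; in both cases the original relation is recovered.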

8 Data Integration Checklist
Task: integrate distributed, heterogeneous (and probably autonomous) data sources
1. Abstract model: which model to use? Relational, object-oriented, XML?
2. Embedding
3. Architectures and systems
4. Methodology

For federated database management systems the situation is fundamentally different. Since the databases participating in a federation have been developed independently (design autonomy), we normally encounter a high degree of heterogeneity. In addition, the participating DBMS continue to serve their original purposes even when participating in a database federation, and therefore they have a high degree of autonomy.

Of the four key questions in information systems integration, we now look at the first one: which abstract model is used for the integrated system, in other words, which data model is supported by the federated DBMS and used to access the integrated databases. We can consider all of the existing kinds of data models that have been proposed, and in fact each of them has also been used as global model for federated DBMS:
- Relational model
  - Most frequently used data model for centralized and distributed DBMS
  - Difficult to represent data from more complex models (documents, OO)
  - Thus best suited for integrating relational data sources
  - Difficult to represent constraints occurring in integration (e.g. generalization/specialization)
- Object-oriented data model
  - Expressive
  - Has proven to be a successful approach and has been used in research and some systems

9 2. Repetition: Relational Data Model
- Data model
  - All data is represented as a set of tables
  - A table is a set of tuples
- Notation
  - Schema: $RS = (A_1 : D_1, A_2 : D_2, \ldots, A_n : D_n)$, in short $RS = A_1 A_2 \ldots A_n$
  - Instance: $R \subseteq D_1 \times D_2 \times \cdots \times D_n$
- Domains
  - Usually only simple data types (integer, string(n), ...)
  - Operations on the domains: comparison, addition, ...

10 Example Relational Database
Professors(PName: STRING, ZIP: INT, State: STRING)
Lectures(LNr: INT, PName: STRING, nrstudents: INT, title: STRING)
Rooms(number: STRING, places: INT)

Professor
    Name   ZIP   State
    Karl   1026  VD
    Andre  1018  VD
    John   8001  ZH
    Serge  1026  VD
    Alain  1034  VD
    Hans   8001  ZH

Rooms
    number  places
    INN IN IN INR INJ

Lectures
    Lnr  Pname  nrstudents  title
    1    Karl   30          Dist Sys
    2    Karl   50          Inf Syst
    3    Andre  30          Dist Sys
    4    Alain  180         Progr

11 Integrity Constraints
- Keys
  - Key: a set of (non-null) attributes that identify a tuple uniquely
  - Candidate key: minimal set of key attributes
  - Primary key: one selected candidate key
- Functional dependency
  - A set of attributes depends functionally on another set of attributes
  - Special case: the set of all attributes depends on the primary key
- Referential integrity
  - Foreign key: a subset X of the attributes of a relation R that refers to a primary key of a relation S
  - Referential integrity is the guarantee that for each value of the foreign key there exists a tuple in S whose primary key has this value: no dangling references
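How a DBMS enforces primary keys and referential integrity can be tried out with a short sketch. It assumes Python's built-in `sqlite3` module (where foreign key checks must be switched on explicitly) and uses the Professors and Lectures relations of the example database; the helper `try_insert` is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only if enabled
conn.execute(
    "CREATE TABLE Professors (PName TEXT PRIMARY KEY, ZIP INTEGER, State TEXT)"
)
conn.execute(
    "CREATE TABLE Lectures ("
    " LNr INTEGER PRIMARY KEY,"
    " PName TEXT REFERENCES Professors(PName),"  # foreign key
    " nrstudents INTEGER,"
    " title TEXT)"
)
conn.execute("INSERT INTO Professors VALUES ('Karl', 1026, 'VD')")
conn.execute("INSERT INTO Lectures VALUES (1, 'Karl', 30, 'Dist Sys')")


def try_insert(row):
    """Insert a lecture; return False if an integrity constraint rejects it."""
    try:
        conn.execute("INSERT INTO Lectures VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

Inserting a lecture for a professor that does not exist, or reusing an existing lecture number, is rejected, so no dangling reference and no duplicate key can enter the database.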

12 Normalization
- Join
  - θ-join: $R \bowtie_P S = \sigma_P(R \times S)$
  - Natural join: $R \bowtie S = \pi_{att(R) \cup att(S)}(\sigma_{R.(att(R) \cap att(S)) = S.(att(R) \cap att(S))}(R \times S))$
- Normalized relations
  - To avoid update anomalies, relations must be decomposed
  - Natural joins are used for reconstruction

Person
    Name   ZIP   State
    Karl   1026  VD
    Andre  1018  VD
    Tom    8001  ZH
    Serge  1026  VD
    Alain  1034  VD
    Hans   8001  ZH

$\pi_{Name,ZIP}(Person)$
    Name   ZIP
    Karl   1026
    Andre  1018
    Tom    8001
    Serge  1026
    Alain  1034
    Hans   8001

$\pi_{ZIP,State}(Person)$
    ZIP   State
    1026  VD
    1018  VD
    8001  ZH
    1034  VD

$Person = \pi_{Name,ZIP}(Person) \bowtie \pi_{ZIP,State}(Person)$
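The reconstruction property can be checked directly on the Person example. The sketch below represents relations as Python sets of tuples; the second decomposition (joining on State instead of ZIP) is an added counterexample, not part of the slide, to show what goes wrong when the join attribute does not determine the rest.

```python
person = {
    ("Karl", 1026, "VD"), ("Andre", 1018, "VD"), ("Tom", 8001, "ZH"),
    ("Serge", 1026, "VD"), ("Alain", 1034, "VD"), ("Hans", 8001, "ZH"),
}

# Decomposition along ZIP: projections on (Name, ZIP) and (ZIP, State).
name_zip = {(n, z) for n, z, s in person}
zip_state = {(z, s) for n, z, s in person}

# Natural join on the common attribute ZIP.
rejoined = {(n, z, s) for n, z in name_zip for z2, s in zip_state if z == z2}

# Counterexample: decomposing along State creates spurious tuples.
name_state = {(n, s) for n, z, s in person}
bad_join = {(n, z, s) for n, s in name_state for z, s2 in zip_state if s == s2}
```

The join on ZIP returns exactly the original relation because ZIP determines State; the join on State also pairs every VD name with every VD ZIP code and is therefore lossy.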

13 Language for Relational Data - SQL
- SQL92 is an industry standard
- Sample DDL statement

    CREATE TABLE Lectures (
        LNr        CHAR(15) NOT NULL,
        PName      VARCHAR(20),
        NrStudents DECIMAL(10) DEFAULT 0,
        Title      CHAR(9),
        PRIMARY KEY (LNr));

- Generic structure of DML statements (query)

    SELECT [DISTINCT] A_1, A_2, ..., A_n
    FROM R_1, R_2, ..., R_m
    WHERE P

  - $A_i$ are attributes from the relations $R_j$
  - $P$ is a predicate
  - Corresponds to the following relational algebra expression: $\pi_{A_1, \ldots, A_n}(\sigma_P(R_1 \times R_2 \times \cdots \times R_m))$
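The correspondence between a SELECT-FROM-WHERE block and the algebra expression can be spelled out as a naive evaluator. This is a sketch of the semantics only (real query processors avoid the full Cartesian product); the function name and the representation of relations as lists of dicts keyed by alias are assumptions.

```python
from itertools import product


def select_from_where(attrs, relations, predicate):
    """Evaluate pi_attrs(sigma_predicate(R_1 x ... x R_m)) literally.

    attrs:     list of (alias, attribute) pairs to project on
    relations: dict mapping an alias to a list of tuples (dicts)
    predicate: function taking {alias: tuple} and returning a bool
    """
    aliases = list(relations)
    result = []
    for combo in product(*(relations[a] for a in aliases)):  # Cartesian product
        row = dict(zip(aliases, combo))
        if predicate(row):                                   # selection
            result.append(tuple(row[al][at] for al, at in attrs))  # projection
    return result


professor = [
    {"PName": "Karl", "State": "VD"},
    {"PName": "Andre", "State": "VD"},
]
lectures = [
    {"LNr": 1, "PName": "Karl", "title": "Dist Sys"},
    {"LNr": 2, "PName": "Karl", "title": "Inf Syst"},
    {"LNr": 3, "PName": "Andre", "title": "Dist Sys"},
]

# SELECT P.PName, L.title FROM Professor P, Lectures L WHERE P.PName = L.PName
answer = select_from_where(
    [("P", "PName"), ("L", "title")],
    {"P": professor, "L": lectures},
    lambda row: row["P"]["PName"] == row["L"]["PName"],
)
```

As in SQL without DISTINCT, duplicates are kept, so the result is a bag rather than a set.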

14 SQL Features
- Join

    SELECT P.Name, L.Title, R.Number, P.Zip
    FROM Professor P, Rooms R, Lectures L
    WHERE P.Name = L.PName AND L.nrStudents < R.Places

- Nesting

    SELECT *
    FROM Lectures L
    WHERE L.nrStudents > ANY (
        SELECT L2.nrStudents
        FROM Lectures L2
        WHERE L2.PName = 'Karl')

- Aggregation

    SELECT L.PName, MAX(L.nrStudents)
    FROM Lectures L
    GROUP BY L.PName
    HAVING COUNT(*) >
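The nesting and aggregation features can be run against the Lectures table of the example database. The sketch assumes Python's built-in `sqlite3` module. Two points are assumptions and not taken from the slide: SQLite has no `ANY` operator, so the nested query is rewritten with `MIN`, which is equivalent as long as the subquery is non-empty, and the `HAVING` threshold is set to 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Lectures (LNr INTEGER PRIMARY KEY, PName TEXT,"
    " nrStudents INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO Lectures VALUES (?, ?, ?, ?)",
    [(1, "Karl", 30, "Dist Sys"), (2, "Karl", 50, "Inf Syst"),
     (3, "Andre", 30, "Dist Sys"), (4, "Alain", 180, "Progr")],
)

# Nesting: "> ANY (subquery)" rewritten as "> (SELECT MIN ...)" for SQLite.
nested = conn.execute(
    "SELECT L.title FROM Lectures L WHERE L.nrStudents > "
    "(SELECT MIN(L2.nrStudents) FROM Lectures L2 WHERE L2.PName = 'Karl') "
    "ORDER BY L.LNr"
).fetchall()

# Aggregation: the threshold 1 is an assumption, not taken from the slide.
aggregated = conn.execute(
    "SELECT L.PName, MAX(L.nrStudents) FROM Lectures L "
    "GROUP BY L.PName HAVING COUNT(*) > 1"
).fetchall()
```

With this data, the nested query selects the lectures that are larger than at least one of Karl's lectures, and the aggregation keeps only professors who give more than one lecture.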

15 Object-oriented Data Model - ODMG, OQL
- ODMG: extends the OMG object model for data management
  - Extent, keys, relationships (ODL extends IDL)
  - Query language (OQL extends SQL), language bindings (C++, Smalltalk, Java)
- More or less supported by all OO vendors
  - Object Design, Objectivity Inc., ONTOS, O2 Technology, POET, Sunsoft, Versant Object Technology
- Close to the basic object model
  - Classes and types, methods, type constructors
  - However, the top-level type constructor is always tuple
  - Support for relationships (referential integrity)
  - Inheritance
- Was considered an appropriate model for database integration!

Object-oriented database management systems (OODBMS) provide persistent storage for objects that are represented in an object-oriented data model. The original object-oriented data models for OODBMS were usually extensions of OO programming languages by database-specific concepts related to the management of a large number of objects. Most importantly, classes are considered not only as type definitions but also as containers holding the objects of the type (the extension). Concepts to identify objects (keys) and to reference objects (relationships) are also required. The query languages in OODBMS are typically extensions of SQL with operators that support access to complex objects.

ODMG is a standardized data model for object-oriented databases. It builds on the object model of the OMG (Object Management Group), which in particular also standardized the distributed object management architecture (OMA/CORBA) that we will introduce later. It extends the programming-oriented object model by the mechanisms mentioned above. It supports all standard OO concepts (as known from Java or C++), like types, classes, methods, type constructors, and inheritance. In addition, it introduces a mechanism to model relationships (similar to relationships in an ER model).

A specific assumption of the model is that every object has the tuple constructor as its root type. This reflects the fact that the model is used for databases. You can compare it to the ER model, where each entity type also has a set of attributes at the top level, which in turn can be of a complex type.

16 ODL Example

    class Slides : Slides_type (extent slides key snr)

    interface Slides_type {
        // Interface body
        // Type declaration
        struct Bullet {string text, Content content};
        // Attribute declaration
        attribute string title;
        attribute set<Bullet> bullets;
        // Relationship declaration
        relationship set<Lecture> lecture inverse Lecture::slides;
        // Method declaration
        short nrbulletscontaining(string s);
        short nrbullets();
        Bullet firstbullet();
    }

To give a flavor, we illustrate ODL by one simple example: the schema for an object class Slides. A class has a type (here Slides_type), an extent, which can be referred to by the name "slides", and a primary key, which is referred to by "snr". The type Slides_type is constructed like any other object-oriented type: it has attributes, which can be simple (like title) or complex (like bullets). It can contain named types (like Bullet), which can be used within the scope of the type specification. There is a specific type, relationship, which can be used to refer to objects of another type (like Lecture). Relationships are always declared together with an inverse relationship (in this case Lecture::slides), so that navigation in both directions is possible. The OODBMS is responsible for managing these relationships consistently. Finally, objects may have methods declared. The implementation of the methods is not part of the ODMG type specification but needs to be provided using system-specific deployment mechanisms when instantiating the database.

17 Main Elements of OQL
- Fully functional language, extension of SQL

    select s from Slides s where s.title = "xyz"
    first(select s from Slides s).nrbulletscontaining("introduction")

- Constructors

    Slide(title: "Introduction")

- Structuring of results

    select struct(title: s.title, lectures: s.lecture)
    from Slides s
    where s.nrbullets > 3

  Result is of type set<struct(title: string, lectures: set<Lecture>)>

- Path expressions and method calls

    select s.firstbullet.content.comment
    from Lecture l, l.slides s
    where l.author = "karl"

  Result is of type set<string>

OQL is a functional query language that builds on SQL concepts. We see this in the first example, a query that accesses objects of class Slides using a selection predicate on a simple attribute. This query matches its relational counterpart exactly. The second example shows the functional nature of the language: we can, for example, apply an operator "first" to extract the first element of a set and then apply a method (nrbulletscontaining) to it. This also results in a valid query. Other features of the language allow the creation of new objects of specific types by using a constructor (Slide), or the definition of query results of a complex structured type by creating the complex data values on the fly. An important aspect of OQL is its support for path expressions, which navigate along relationships and attributes of complex structured data values, as seen in the last example.
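For readers more familiar with programming languages than with OQL, the path-expression query can be mimicked over in-memory objects. This is only an analogy under stated assumptions: the Python classes below are hypothetical stand-ins for the ODL types, and a set comprehension plays the role of the select-from-where block.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    comment: str


@dataclass
class Bullet:
    text: str
    content: Content


@dataclass
class Slide:
    title: str
    bullets: list = field(default_factory=list)

    def firstbullet(self):
        return self.bullets[0]


@dataclass
class Lecture:
    author: str
    slides: list = field(default_factory=list)


# Made-up sample objects.
lectures = [
    Lecture("karl", [Slide("Intro", [Bullet("b1", Content("c1"))]),
                     Slide("XML", [Bullet("b2", Content("c2"))])]),
    Lecture("anna", [Slide("Other", [Bullet("b3", Content("c3"))])]),
]

# select s.firstbullet.content.comment
# from Lecture l, l.slides s where l.author = "karl"
comments = {
    s.firstbullet().content.comment
    for l in lectures
    for s in l.slides
    if l.author == "karl"
}
```

The dependent range `l.slides s` in the from clause corresponds to the nested loop, and the result is a set of strings, matching the result type set<string>.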

18 Object-oriented Data Model SQL99
- Adaptation of OO concepts to SQL92
- More relevant than ODMG in practice
- Implementations of large DB vendors are "proprietary"
  - Different (non-standard) implementations and component concepts
  - Oracle 8 (Cartridges)
  - Informix (DataBlades)
  - DB2 (Extenders)
- Focus on new base types and stored functions rather than on constructed types
  - "Killer application": multimedia in relational databases
  - Large object types, arrays
- Standardization in SQL

Despite their extended functionality, OODBMS never succeeded in replacing the big relational database systems (for both economic and technical reasons) and are today niche products for specialized applications (e.g. for engineering or model data). At the same time, however, relational vendors started to extend their products (and thus the relational model) with object-oriented features. These extensions came with concepts for packaging database extensions that support new data types into components, so that they could be easily deployed within the database systems. The emphasis of these extensions was mainly on providing new base types (extending the existing atomic type systems of the relational DBMS) and on supporting stored functions, in particular those implementing operations on these new atomic types. Constructed complex types in the object-oriented sense were less of a focus, though they are foreseen in the SQL99 standard and have been partially implemented. Thus the typical examples of components extending relational databases support multimedia data, like image or text, by introducing new base types together with operators for the typical processing and search operations on multimedia data.

19 Main Elements of SQL99
- Distinct types

    CREATE TYPE authorname AS CHAR(20)

  E.g. author names can no longer be compared with titles
- Structured types (named row types) and implicit types

    CREATE ROW TYPE slide_type (
        title   CHAR(10),
        bullets Bullet ARRAY[10],
        author  ROW (firstname CHAR(10), lastname CHAR(10)),
        snr     INTEGER(5))
      INSTANTIABLE NOT FINAL REF(snr)
      INSTANCE METHOD nrbulletscontaining(s CHAR(10)) RETURNS INTEGER(2)

- Classes

    CREATE TABLE slides OF slide_type

  Each row has an OID (either implicitly or explicitly defined)
- Type and class inheritance
- Paths and method calls in queries

    SELECT s.author.firstname
    FROM slides s
    WHERE s.nrbulletscontaining('xyz') > 1

Next we review some of the important extensions of SQL99 over SQL92, and we see that they are very similar to concepts that have been introduced in OQL. The simplest way of introducing a new type in SQL99 is to rename an existing type. This enforces type safety: values of the new type can, for example, only be compared to values of the same type. New complex types can be introduced by a type definition, which is very similar to a normal table definition in SQL92. A complex type consists of a number of attributes, which, in contrast to SQL92, can also be based on new type constructors (like ARRAY) or on user-defined complex types (like author). The clause INSTANTIABLE NOT FINAL expresses that subtyping is allowed. As in ODMG, method interfaces can also be defined for the type. Classes in SQL99 correspond to tables whose attribute types are obtained from a complex type definition. SQL99 provides an inheritance mechanism both for classes (extensional: the extensions of subclasses are subsets of the extensions of their superclasses) and for types (the extensions of classes derived from the types are unrelated).

Since complex types and methods can be defined, they can be used in queries through path expressions, similarly to OQL.

20 Data Model Desiderata for Data Integration
- Model integrity constraints
  - Identifiers (keys)
  - Relationships (between different databases)
- Flexibility
  - Missing values, deviations from schema
- Complex datatypes
  - Sets, lists, arrays, graphs
  - Simplifies mapping from different representations
- Inheritance
  - Factor out commonalities among different schemas (generalization)
  - Derive specialized representations for specific schemas from general models
- Methods
  - Encapsulation of complex access functions and transformations
- OO seems to be quite adequate

21 Recapitulation
- What services does a database management system provide?
- What is the difference between a distributed and a federated database management system?
- What is a wrapper?
- Which are desirable properties of a canonical data model for data integration?
- Which features does the object-oriented data model offer that the relational model doesn't?
- Which features does the object-oriented query language offer that the relational query language doesn't?
- Which are the differences between ODMG and SQL99?

22 3. Data on the Web: HTML and XML
- The Web is
  - A huge collection of hypermedia documents
  - A gateway to databases and applications
  - A platform for interacting with services (Web Services)
- It started with HTML
  - Markup language to define document layout
  - Hyperlinks for navigation
  - Interactive forms

The Web can be considered an interface to documents, databases and interactive services. HTML is used as the surface language at the user interface level. All the developments described so far for making database models type-extensible took place essentially before the Web emerged as the major platform for exchanging and managing data. The development of the Web had a profound impact, in particular on the way data from different resources is integrated. Therefore we look next at the data models that developed in the context of the Web, starting from the hypertext document model up to the models we find today, where the Web serves as a ubiquitous platform for exchanging data and services.

23 Data on the Web: XML
- Limitations of HTML
  - Structure of data expressed as layout (example)
  - Semantics of data hard to analyse and difficult to share
  - No schemas, no constraints
- Thus XML (eXtensible Markup Language) has been developed
  - Markup language to define structured documents
  - Document schemas to fix the structure of documents
  - User-defined markup to express semantics
  - XML architecture for processing and extended functionality

A substantial amount of data on the Web is represented in HTML (tables, lists, matrices). It quickly turned out that this is not a good idea, as it is hard to interpret the data correctly, not only for humans but even more so for computers. The problem is that a layout format like HTML does not allow the definition of a schema according to which the data is structured. To remedy this situation XML was developed. It transferred the concept of schemas into the document world. Document-oriented people also say that XML supports "user-defined" markup, as opposed to the "system markup" of HTML.

24 Relationship between HTML and XML
[Figure: three levels. Data model: SGML, XML. Document type ("schema"): HTML, e.g. MathML. Document ("database"): e.g. a homepage, mathematical formulas]

Actually, XML was not as new an invention as it may appear. Before HTML there already existed a language called SGML, which was mostly used by the publishing industry to mark up documents in their production process. More by accident than by design, SGML served as the language for defining the HTML standard (think of SGML as a data model and HTML as a schema). Only after HTML had become so successful, and the need for a data model allowing application-specific markup/schemas was recognized, did people remember its roots in SGML. Since SGML was a somewhat complex language, it was simplified in many respects, and the result was XML.

25 Example: Data in HTML vs. XML (1)

    <html>
    <head>
    <meta http-equiv="content-type" content="text/html; charset=iso ">
    <title>Publications</title>
    </head>
    <body bgcolor="#f2effc" link="#0000ff" vlink="#800080">
    <table border="0" width="100%">
      <tr>
        <td width="100%" bgcolor="#008080"><font color="#ffffff" face="arial">Journals</font></td>
      </tr>
      <tr>
        <td width="100%"><ol>
          <li><a name="_ref "><font size="2" face="arial">W. Klas, G. Fischer, K. Aberer: "Integrating a Relational Database System into VODAK using its Metaclass Concept", <i>Journal of Systems Integration</i>, Kluwer Academic Publishers, Vol. 4, No. 4, pp , 1994.</font></a></li>
          <li><a name="_ref "><font size="2" face="arial">M. Volz, K. Aberer, K. Böhm: "An OODBMS-IRS Coupling for Structured Documents", <i>Data Engineering Bulletin 19(1)</i>, pp 34-42, 1996.</font></a></li>
        </ol></td>
      </tr>
    </table>
    </body>
    </html>

To illustrate the difference between data represented in XML and in HTML, let us first look at a typical HTML document as it would be published on the Web.

26 Example: Data in HTML vs. XML (2)

In a visualization generated by an HTML editor the logical (hierarchical) structure of the document becomes somewhat clearer.

27 Example: Data in HTML vs. XML (3)

    <publications>
      <journals>
        <paper>
          <authors>
            <author>W. Klas</author>
            <author>G. Fischer</author>
            <author>K. Aberer</author>
          </authors>
          <title>Integrating a Relational Database System into VODAK using its Metaclass Concept</title>
          <journal>Journal of Systems Integration</journal>
          <publisher>Kluwer Academic Publishers</publisher>
          <issue pages=' '>Vol. 4, No. 4</issue>
          <year>1994</year>
        </paper>
        <paper>
          <authors>
            <author>M. Volz</author>
            <author>K. Aberer</author>
            <author>K. Böhm</author>
          </authors>
          <title>An OODBMS-IRS Coupling for Structured Documents</title>
          <journal>Data Engineering Bulletin</journal>
          <issue pages='34-42'>19(1)</issue>
          <year>1996</year>
        </paper>
      </journals>
    </publications>

And this is the same data as it could be represented in XML. The difference is that the data elements are now marked up with tags that carry semantically meaningful names, and the structuring of the tags follows a specific schema. This substantially simplifies the interpretation of the document, both by users and by programs.
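That programs can interpret such a document easily can be shown with a few lines of code. The sketch assumes Python's standard `xml.etree.ElementTree` module and uses the second paper of the example as input; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<publications>
  <journals>
    <paper>
      <authors>
        <author>M. Volz</author>
        <author>K. Aberer</author>
        <author>K. Böhm</author>
      </authors>
      <title>An OODBMS-IRS Coupling for Structured Documents</title>
      <journal>Data Engineering Bulletin</journal>
      <issue pages='34-42'>19(1)</issue>
      <year>1996</year>
    </paper>
  </journals>
</publications>
"""

root = ET.fromstring(DOCUMENT)

# The tag names tell the program what each piece of text means.
papers = [
    {
        "authors": [a.text for a in paper.findall("authors/author")],
        "title": paper.findtext("title"),
        "pages": paper.find("issue").get("pages"),
        "year": int(paper.findtext("year")),
    }
    for paper in root.findall("journals/paper")
]
```

Extracting the same fields from the HTML version would require guessing from punctuation and font tags where the author list ends and the title begins.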

28 Example: Data in HTML vs. XML (4)

    <!DOCTYPE publications [
    <!ELEMENT publications (journals, conferences, books)>
    <!ELEMENT journals (paper*)>
    <!ELEMENT paper (authors, title, journal, publisher?, issue, year)>
    <!ELEMENT authors (author*)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT journal (#PCDATA)>
    <!ELEMENT publisher (#PCDATA)>
    <!ELEMENT year (#PCDATA)>
    <!ELEMENT issue (#PCDATA)>
    <!ATTLIST issue pages CDATA #REQUIRED>
    ]>

Document type "Schema"

The structure of an XML document (normally) follows a specific schema, which is given by a so-called document type definition. The document type definition specifies which elements can occur and according to which constraints they can be contained within each other. On the left-hand side we see a graphical representation of the XML document from the previous page as it would be shown by an XML document editor. One can easily recognize the hierarchical structure. In this way we can also interpret the document as a complex hierarchical data type. The rules (or the type) according to which this hierarchy is organized are given on the right-hand side. These rules specify, for example, the containment of elements within each other: one rule expresses that the element "paper" occurs within the element "journals".
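A content model in a document type definition is essentially a regular expression over element names. The sketch below makes this explicit for the `paper` rule. It is a hand translation for illustration, not a DTD validator: the regular expression `PAPER_MODEL` and the helper `matches_paper_model` are assumptions, and Python's standard library does not itself validate against DTDs.

```python
import re
import xml.etree.ElementTree as ET

# <!ELEMENT paper (authors, title, journal, publisher?, issue, year)>
# translated by hand into a regular expression over child element names.
PAPER_MODEL = re.compile(r"authors,title,journal,(publisher,)?issue,year,")


def matches_paper_model(paper):
    """Check the sequence of child elements against the content model."""
    children = "".join(child.tag + "," for child in paper)
    return PAPER_MODEL.fullmatch(children) is not None


with_publisher = ET.fromstring(
    "<paper><authors/><title/><journal/><publisher/><issue/><year/></paper>"
)
without_publisher = ET.fromstring(
    "<paper><authors/><title/><journal/><issue/><year/></paper>"
)
missing_title = ET.fromstring(
    "<paper><authors/><journal/><issue/><year/></paper>"
)
```

The optional `publisher?` becomes an optional group, so both the first and the second paper are accepted, while a paper without a title is rejected.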

29 Data and Documents
- "Serialization"
- Document = medium for exchange of information
[Figure: information system 1 and information system 2 communicating by exchanging a document]

The two notions of document and data, and the ambiguous role of XML as both a document and a data format, require some clarification. A document is a piece of data that is used as a medium for the exchange of information. For that purpose it must be represented as a sequence of symbols that can be sent over a communication channel, i.e. a text string. Within an information system, on the other hand, the same information can be represented as a complex data type on which programs can operate. In that context we talk of "data". When information that is represented as data within one information system must be exchanged with another information system, the data must be turned into a document: it must be serialized. The nice thing about XML is that it can be viewed both as a document (when we look at it as a string containing markup and interpret the schema as a grammar) and as data (when we look at it as a hierarchically structured data object and at the schema as a type definition). We can also say that XML data has a natural serialization, or that an XML document has a natural representation as a data type.
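The two views can be demonstrated as a round trip between a data structure and its serialization. The sketch assumes Python's standard `xml.etree.ElementTree` module; the functions `to_document` and `to_data` and the dictionary layout are illustrative choices.

```python
import xml.etree.ElementTree as ET


def to_document(paper):
    """Serialize the data view (a dict) into the document view (a string)."""
    root = ET.Element("paper")
    authors = ET.SubElement(root, "authors")
    for name in paper["authors"]:
        ET.SubElement(authors, "author").text = name
    ET.SubElement(root, "title").text = paper["title"]
    ET.SubElement(root, "year").text = str(paper["year"])
    return ET.tostring(root, encoding="unicode")


def to_data(document):
    """Parse the document view back into the data view."""
    root = ET.fromstring(document)
    return {
        "authors": [a.text for a in root.iter("author")],
        "title": root.findtext("title"),
        "year": int(root.findtext("year")),
    }


paper = {
    "authors": ["M. Volz", "K. Aberer", "K. Böhm"],
    "title": "An OODBMS-IRS Coupling for Structured Documents",
    "year": 1996,
}
message = to_document(paper)   # what travels over the communication channel
received = to_data(message)    # what the receiving system operates on
```

The string `message` is what information system 1 would send; information system 2 recovers an equivalent data structure from it.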

30 What is XML?
- Interpretation depends on viewpoint and intended use. XML is
  - a language to describe the structure of documents
  - the foundation of the W3C architecture for hypermedia documents on the Web
  - the successor of HTML
  - a method to put structured data into text documents
  - a standard data exchange format
  - a data model for semi-structured (partially structured) data
- What are the main characteristics of the XML language?
  - No schema is required, but schemas can be used
  - Flexible data model: complex data structures, optional elements
  - Canonical serialization for data exchange
  - Use of names allows (agreed) semantics to be imposed on structural elements
  - Human-readability
  - Widely accepted and standardized
  - Wide availability of tools for processing XML data

What XML is depends on the viewpoint one takes: in the simplest case it may be considered a document syntax, but depending on the context other interpretations are possible. It is a language to describe hierarchically structured text documents (this is also where the origin of XML lies). The W3C chose XML as the foundation of the whole architecture of the Web, which also explains why it is called the successor of HTML. For database people XML is another way to structure data, with the additional advantage of getting a textual representation (a serialization) for free. The textual form of an XML document can thus be used directly to exchange data via messages, which is why the EDI (Electronic Data Interchange) community nowadays views XML as its standard message syntax. From the data perspective XML also leads to a more fundamental shift in the way data management is perceived: XML schemas are much less strict in their constraints and thus allow more flexibility at the data level (e.g. optional elements, recursive definitions, etc.).

This leads to the whole new area of semi-structured data management (which is a topic of the companion lecture "Distributed Information Systems").

31 Why Do We Need XML?
Tim Bray: "XML will be the ASCII of the Web - basic, essential, unexciting"
Distributed information systems
- XML as data model for managing semi-structured data
- distinction between documents and data disappears
- XML as canonical model to integrate heterogeneous data
- XML as canonical data format to exchange data among information systems
Web information systems
- separation of layout and structure
- better support to keep data consistent
- reuse of data
- client-side processing
- more semantics available for more intelligent processing (personalization, agents, search engines)
Electronic business
- integration of business processes
- messages and contracts represented in XML
- replaces former EDI formats
Standardization
- XML is an integration technology
From the perspective of distributed information systems, XML plays an increasing role as a canonical data model for integrating heterogeneous data. This results on the one hand from its dual role as document and data model, and on the other hand from its flexibility in modelling less regularly structured data (semi-structured data). From an application perspective XML plays its role in two main areas: 1. Information systems on the Web: here the important advantage compared to HTML is the separation of layout specification from data structure. This makes it possible to keep data consistent, to reuse it, and to process it more intelligently. 2. In electronic business XML serves as the message format for implementing distributed processes within and across enterprises. It replaces earlier formats (EDI), and by being an authoritative standard it resolves some issues of heterogeneity (mainly at the level of syntax).

32 XML Architecture
[Figure: layered architecture, from top to bottom]
- Standard XML applications: XHTML, SMIL, P3P, MathML; specific XML applications
- Layout: XSL, CSS
- Hyperlinks: XLink, XPointer
- Metadata: RDF, RDFS, PICS
- API: DOM, SAX
- Schemas: XML Schema
- Queries: XSLT, XPath, XQL, XQuery
- XML 1.0, DTD, Namespaces
- Unicode, URI
The importance of XML is rooted not only in its inherent properties as a document markup language, but also in the fact that it is used as the basis for a whole architecture covering almost every aspect of information processing imaginable. This architecture is developed under the auspices of the W3C and builds on the XML 1.0 syntax, which has been approved by the W3C. The architecture comprises the following components: Programming APIs: DOM and SAX, which are required to build applications that process parsed XML documents (with DOM, the parsed document is made available in main memory). Layout and hyperlinks: standardize the language constructs needed to represent the hypertext and layout properties of documents. XML data management: XML Schema and XQuery provide the languages for managing large data-oriented documents and document collections. Metadata: as XML data will be exchanged and processed in many different contexts, it will often be necessary to provide additional data (information) about the data (metadata) in order to enable applications and users to interpret and use the data correctly. In addition, standardized applications for specific domains are specified (e.g. SMIL for multimedia presentation, P3P for privacy, MathML for the markup of mathematical documents). The examples named here are W3C recommendations; many other domain-specific XML applications are standardized by other industrial or governmental bodies rather than by the W3C.

33 Main Elements of the XML Core Architecture
- DOM (Document Object Model): object-oriented representation of XML documents, and API
- XPath: a language to access XML document parts
- XSLT: a language to transform XML documents
- XML Schema: a representation of database schemas in XML
- XML Query: a language for set-oriented query access to XML documents and document collections

34 Recapitulation
- What is the difference between HTML and XML?
- What is the difference between data and document?
- What is the difference between XML as a data model and the relational model?
- What is the difference between XML as a data model and the object-oriented model?
- How does the XML architecture compare to a database management system?
- Which aspects of XML (data/document) do the three main application areas for XML emphasize or exploit?

35 4. XML Syntax
1. Well-formed XML
2. Document Type Definitions
3. Entities
4. Namespaces

36 4.1 Well-formed XML
Well-formed XML conforms to a basic XML syntax and some semantic constraints for well-formedness
Main concepts
- Elements: used to structure the document, identify a portion of the document
- Attributes: associate data values with elements, used to reduce the number of elements and for typed data
- Character data (PCDATA): textual content of the document
  <journal>
    <issue page="1"> PCDATA, the content </issue>
    <issue page="2"> <last/> more PCDATA </issue>
  </journal>
Though XML supports schemas (by means of DTDs), XML documents are not required to have one. Rather, any document that follows a certain basic syntax is an acceptable XML document. Such documents are called well-formed XML documents (the term "valid" is reserved for documents that additionally conform to a DTD). The constituents of an XML document are elements (delimited by tags), attributes and textual content. Attribute values must be quoted. The tags must be properly nested, such that the elements form a hierarchy. Therefore it is always possible to view an XML document as a tree, as illustrated in the example. It is important always to consider both views of XML: as a document, as in the ASCII (actually Unicode) representation on the right, or as data, as in the tree representation on the left.
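The two views can be tried out with Python's standard `xml.etree.ElementTree` module. The following is a small sketch, not part of the lecture: it parses the journal example (with quoted attribute values) and lists the resulting tree. The helper name `outline` is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Document view: a string containing markup.
DOCUMENT = """<journal>
  <issue page="1">PCDATA, the content</issue>
  <issue page="2"><last/>more PCDATA</issue>
</journal>"""


def outline(element, depth=0):
    """Return one indented line per element: the tree (data) view."""
    lines = ["  " * depth + element.tag + " " + str(element.attrib)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)   # data view: a tree of Element objects
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Note that ElementTree stores the text following `<last/>` as the `tail` of that element, which is one of several possible ways to represent mixed text and elements in a tree.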

37 Well-Formed XML Syntax
Syntax (excerpt)
  document  ::= prolog element Misc*
  element   ::= EmptyElemTag | STag content ETag
  STag      ::= '<' Name (S Attribute)* S? '>'
  ETag      ::= '</' Name S? '>'
  Attribute ::= Name Eq AttValue
Syntactic properties
- single root element
- tag names start with a letter (or with '_' or ':')
- tags must be properly nested
- hierarchic structure induced by the parent-child relationship
- special syntax for empty tags
Semantic constraints
- start and end tag names must match
- attribute names within an element are unique
Well-formed XML can be specified by means of a formal syntax, of which we illustrate only some important production rules. In particular one can see the hierarchical build-up in the production for elements: each element is either an empty element or content enclosed by a start tag and an end tag. The content in turn can consist of further elements or of textual content. This syntax characterizes the syntactic properties of well-formed XML documents but does not characterize them completely. In addition, semantic properties that are not expressed by the grammar need to be satisfied by XML documents. These concern in particular the element and attribute names that are used.
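A non-validating parser enforces exactly these properties. As a hedged sketch (the sample strings and the helper name `is_well_formed` are made up for illustration), Python's `xml.etree.ElementTree` rejects each kind of violation with a `ParseError`:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "ok": "<a><b/></a>",
    "mismatched tags": "<a><b></a></b>",       # improper nesting
    "two roots": "<a/><b/>",                   # more than one root element
    "duplicate attribute": '<a x="1" x="2"/>',  # attribute names not unique
    "unquoted attribute": "<a x=1/>",          # AttValue must be quoted
}
results = {name: is_well_formed(text) for name, text in samples.items()}
```

Only the first sample is accepted; no DTD is involved in any of these checks.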

38 Structure of a Well-formed XML Document
XML document (labels from the figure in parentheses):
  <?xml version="1.0"?>                                      (Prologue)
  <!DOCTYPE publications [                                   (Document Type Definition)
    <!ELEMENT publications (journals, conferences, books)>
    ...
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT issue (#PCDATA)>
    <!ATTLIST issue pages CDATA #REQUIRED>
    <!ENTITY JSI " <journal>Journal of Systems Integration</journal> <publisher>Kluwer Academic Publishers</publisher>">
  ]>
  <publications>                                             (Root Document Element; Document)
    <journals> ... &JSI; ...
  </publications>
Every XML document follows the same global structure. It starts with a prologue, which typically indicates the XML version in use. This may be followed by the document type definition, whose name must match the root element if the document is to be validated. After the document type definition the document itself starts. Here it is important to observe that every document has one single root document element.

39 4.2 XML Document Type Definitions
Declarations
- Definition of element and attribute names
- Association of attributes with elements
Content model (regular expressions)
- Association of elements with other elements (containment)
- Order and cardinality constraints
XML document type definitions (DTDs) are used to specify which elements and attributes are allowed, and how they can appear in the document in relation to each other. However, in contrast to database schemas, they are not given as types (a relational schema, for example, is a (data) type definition) but rather in the form of a grammar.

40 Element Declarations
Basic form
  <!ELEMENT elementname (contentmodel)>
Content model
- determines which other elements can be contained
- given by a regular expression
Atomic contents
- Element content:   <!ELEMENT example ( a )>
- Text content:      <!ELEMENT example (#PCDATA)>
- Empty element:     <!ELEMENT example EMPTY>
- Arbitrary content: <!ELEMENT example ANY>
The main construct in DTDs is the declaration of elements. It consists of two parts: the name of the element and the content model, which is a regular expression that determines which other elements are allowed to appear within (or below) the element, and in which order and multiplicity. The regular expression is built up from atomic contents. These are other element names; text, which is represented by the special built-in keyword #PCDATA (parsed character data); the EMPTY content; and the ANY content, which imposes no constraints on the elements that may occur.

41 Element Declarations
- Sequence:                               <!ELEMENT example ( a, b )>
- Alternative:                            <!ELEMENT example ( a | b )>
- Optional (zero or one):                 <!ELEMENT example ( a )?>
- Optional and repeatable (zero or more): <!ELEMENT example ( a )*>
- Required and repeatable (one or more):  <!ELEMENT example ( a )+>
- Mixed content:                          <!ELEMENT example (#PCDATA | a)*>
Content models can be grouped by parentheses
Cyclic element containment is allowed
From the atomic content models one can construct composite content models by using the standard regular expression operators sequence, alternative, optional and repeatable. If text content occurs together with user-defined elements in the content model, this is called mixed content. The regular expression operators can be nested using parentheses, and cyclic element containment is allowed. This makes it possible, for example, to specify DTDs that allow XML documents of arbitrary depth.
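Because a content model is a regular expression over element names, it can be checked with an ordinary regex engine. The sketch below is a simplification made for illustration, not how a real validating parser is implemented: it handles only element names, sequence, alternative, grouping and the `?`, `*`, `+` operators (not `#PCDATA`, `EMPTY` or `ANY`). The function names and the sample model are assumptions.

```python
import re


def content_model_to_regex(model: str) -> re.Pattern:
    """Translate a DTD element content model into a Python regex.

    The regex is matched against the child element names written as
    'name name name ' (each name followed by one space).
    """
    compact = model.replace(" ", "")
    # Every element name becomes a group that matches "name ".
    pattern = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:" + re.escape(m.group(0)) + " )",
                     compact)
    # ',' means sequence, which is plain concatenation in a regex.
    return re.compile(pattern.replace(",", ""))


def children_match(model: str, children: list) -> bool:
    """Check a list of child element names against a content model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


MODEL = "(title, author+, (journal | conference), year?)"  # hypothetical
ok = children_match(MODEL, ["title", "author", "author", "journal", "year"])
missing_author = children_match(MODEL, ["title", "journal"])
```

Here `ok` is true and `missing_author` is false, because `author+` requires at least one author element.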

42 Attribute Declarations
Each element can be associated with an arbitrary number of attributes
Basic form
  <!ATTLIST Elementname
            Attributename Type Default
            Attributename Type Default
            ... >
Possible defaults
- Required attribute: #REQUIRED
- Optional attribute: #IMPLIED
- Fixed attribute: #FIXED "value"
- Default: "value"
Attributes are used to associate additional data with elements that is not represented as content. Attributes are a heritage from the document processing world, where elements were used to structure the documents and attributes were used to specify instructions for document processing. From a data modelling viewpoint, attributes and elements can in many cases be used interchangeably, and the preference is a matter of taste and of the capabilities of the XML processing environment.

43 Example
Document Type Definition
  <!ELEMENT shipto (#PCDATA)>
  <!ATTLIST shipto
            country CDATA #REQUIRED
            state   CDATA #IMPLIED
            version CDATA #FIXED "1.0"
            payment (cash | creditcard) "cash">
Document
  <shipto country="Switzerland" version="1.0" payment="creditcard"> </shipto>
In this example we declare an element with four attributes. The first attribute is required: it has no default value and must be given in every document instance. The second attribute is optional and does not appear in the document instance. The third attribute has a fixed value that cannot be changed; it may be omitted from the document instance, but if it occurs it must carry exactly this value. The fourth attribute is an enumeration attribute with a default. The document instance contains a different value. If no value were given, the attribute would take the default value "cash".
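How a validating processor treats the four kinds of defaults can be sketched as follows. The table layout, the constant names and the function `effective_attributes` are assumptions made for this illustration; the declarations themselves mirror the shipto example.

```python
REQUIRED, IMPLIED, FIXED, DEFAULT = "#REQUIRED", "#IMPLIED", "#FIXED", "default"

# attribute name -> (type or enumeration, kind of default, default value)
SHIPTO_ATTRIBUTES = {
    "country": ("CDATA", REQUIRED, None),
    "state": ("CDATA", IMPLIED, None),
    "version": ("CDATA", FIXED, "1.0"),
    "payment": (("cash", "creditcard"), DEFAULT, "cash"),
}


def effective_attributes(given: dict, declarations: dict) -> dict:
    """Check the given attributes and fill in declared default values."""
    result = dict(given)
    for name, (att_type, kind, value) in declarations.items():
        if name not in result:
            if kind == REQUIRED:
                raise ValueError(f"required attribute {name!r} is missing")
            if kind in (FIXED, DEFAULT):
                result[name] = value      # value supplied by the DTD
            continue                      # #IMPLIED: simply stays absent
        if kind == FIXED and result[name] != value:
            raise ValueError(f"attribute {name!r} must be {value!r}")
        if isinstance(att_type, tuple) and result[name] not in att_type:
            raise ValueError(f"{result[name]!r} is not allowed for {name!r}")
    return result


shipto = effective_attributes(
    {"country": "Switzerland", "payment": "creditcard"}, SHIPTO_ATTRIBUTES)
```

In this run the fixed `version` attribute is added with the value "1.0" even though it was not written in the instance, while `state` stays absent.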

44 Attribute Declarations - Types
CDATA
- String
  <!ATTLIST example HREF CDATA #REQUIRED>
Enumeration
- Token from a given set of values, default possible
  <!ATTLIST example selection ( yes | no | maybe ) "yes">
ID, IDREF
- ID is a unique identifier within the document
- IDREF is a reference to an ID
- Referential integrity is checked by the (validating) parser
- IDs are determined by the application
  <!ATTLIST example identity ID #IMPLIED reference IDREF #IMPLIED>
Other attribute types: IDREFS, ENTITY, ENTITIES, NOTATION, NMTOKEN, NMTOKENS
Attributes can be typed. The standard type of an attribute is CDATA, i.e. string. Enumerations allow finite sets of data values to be specified (note that the same could be achieved by defining an appropriate DTD with an empty element type for each data value). The ID/IDREF mechanism is one case where the expressive power of attributes goes beyond that of elements. It allows references WITHIN documents to be specified. The XML specification requires a validating XML parser to check the referential integrity of those references. A number of other attribute types result from the history of XML (or SGML) in the document processing world. The most notable among those are entities, which provide a macro mechanism that allows repeating parts of documents and document type definitions to be factored out. We will introduce entities in more detail later.

45 Example: ID/IDREF
DTD fragment:
  <!ATTLIST fig id ID #IMPLIED>
  <!ATTLIST figref refid IDREF #IMPLIED>
Document fragment:
  <chapter>
    <title>Apples</title>
    <para>
      <fig id="f1"> <caption>This is a figure</caption> </fig>
    </para>
  </chapter>
  <chapter>
    <title>Frogs</title>
    <references> <figref refid="f1"/> </references>
  </chapter>
The document fragment in this example contains an element fig that has an identifier, and an element figref that uses this identifier to create a reference to the fig element. Note that the value of an ID attribute must be an XML name, so it cannot start with a digit; the identifier is therefore written as "f1" here.
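The referential integrity check itself is simple. The sketch below performs it by hand on a parsed tree, since Python's ElementTree does not read attribute types from a DTD; the attribute names `id` and `refid` are therefore passed in as assumptions, and the sample document and function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<book>
  <chapter><fig id="f1"/></chapter>
  <chapter><figref refid="f1"/><figref refid="f9"/></chapter>
</book>"""


def dangling_references(root, id_attr="id", ref_attr="refid"):
    """Return all reference values that point to no ID in the document."""
    ids = [e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None]
    if len(ids) != len(set(ids)):
        raise ValueError("ID values must be unique within the document")
    return [e.get(ref_attr) for e in root.iter()
            if e.get(ref_attr) is not None and e.get(ref_attr) not in ids]


missing = dangling_references(ET.fromstring(DOCUMENT))
```

In this sample the reference to "f9" has no matching ID, so a validating parser would reject the document.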

46 Inclusion of XML Document Type Definitions
External DTD declaration
  <?xml version="1.0" encoding="iso..."?>
  <!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN" "...">
  <test> "test" is a document element </test>
Internal DTD declaration
  <!DOCTYPE test [ <!ELEMENT test EMPTY> ]>
  <test/>
Mixed usage
  <!DOCTYPE test SYSTEM "..." [ <!ENTITY hello "hello world"> ]>
  <test>&hello;</test>
Depending on the application requirements, the DTD can be included in the document, which makes sense for documents that are processed by applications that cannot know the DTD, or it can be kept external to the document, which makes sense if there exists a common context among the applications from which they can obtain the DTD. In XML an external DTD is referenced either by SYSTEM followed by a system identifier (a URI) or by PUBLIC followed by a public identifier and a system identifier. Sometimes part of the DTD is referred to externally and completed by parts that are kept local to the document. We see here an example where entities are defined that are used only within the scope of this specific document, whereas the other declarations come from an external DTD.

47 More Constructs of Well-formed XML
CDATA sections
- Content is not processed (interpreted as markup) by the parser
- Used for code examples etc.
  ...
  <script>
  <![CDATA[
    if ( a < b ) { &subroutine(a,b) } else { &subroutine(b,a) }
  ]]>
  </script>
  ...
Comments
- Enclosed between <!-- and -->
These constructs are provided for practical purposes.

48 Processing Instructions
- PIs are not part of the document but calls to external applications
- They are forwarded to the application without change
- Poor programming practice ("hacker style")
- <?xml is reserved for the prolog of an XML document and is used to determine the character encoding
  <?TARGETAPPLICATION Parameter, Program, etc.?>
  ...
  <?xml version="1.0" encoding="iso..." standalone="no"?>
  ...
  <?php echo $title;?>

49 4.3 Entities
Entities allow XML documents to be organized physically
Entities work like macros
External entities
- Distribution of one logical document over several physical files
- Reduce the size of files
- Organization of files
- Integration of non-XML resources
- Reuse of DTD fragments
Internal entities
- Factorization of repeating contents within a document
- Better readability of the document
- Less code, reuse
- Consistency
DTDs, when considered as a schema language, do not provide any means for modularizing schema designs of the kind we know, for example, from object-oriented (schema) languages (inheritance). There exists, however, a syntactic mechanism in XML for factoring out repeating constructs in DTDs and documents. It is called entities and works as a macro mechanism. With entities, fragments of XML documents and DTDs can either be stored in separate files (external entities) or be declared within a DTD (internal entities).

50 Example
  <!ENTITY journals SYSTEM "...">
  <!ENTITY conferences SYSTEM "...">
  <publications> &journals; &conferences; </publications>
  ...
  <!ENTITY JSI " <journal>Journal of Systems Integration</journal> <publisher>Kluwer Academic Publishers</publisher>">
  ...
  <journals>
    <paper>
      <authors> <author>W. Klas</author><author>G. Fischer</author><author>K. Aberer</author> </authors>
      <title>Integrating a Relational Database System into VODAK using its Metaclass Concept</title>
      &JSI;
      <issue pages='...'>Vol. 4, No. 4</issue>
      <year>1994</year>
    </paper>
  ...
This example shows the use of both external and internal entities. In the upper document fragment XML code is included by referring to external entities. In particular, the entity journals is declared in the first line (using the keyword SYSTEM to indicate that it is an external entity); at any place within the document where the entity is used (indicated by the syntax &journals;), the text found in the file journals.xml will be substituted. Within the file journals.xml we see an example of an internal entity. Anywhere in the document where it is used (syntax: &JSI;), the text found in the declaration is substituted literally.
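The macro-like behaviour of an internal entity can be observed with a parser that expands entities declared in the internal DTD subset. The sketch below uses Python's `xml.etree.ElementTree`, which is built on the expat parser; the wrapper document is a shortened, hypothetical version of the example above.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE paper [
  <!ENTITY JSI "<journal>Journal of Systems Integration</journal>
                <publisher>Kluwer Academic Publishers</publisher>">
]>
<paper><title>Example</title>&JSI;<year>1994</year></paper>"""

paper = ET.fromstring(DOCUMENT)
# After parsing, the entity reference has been replaced by the elements
# contained in its declaration; the tree no longer mentions JSI at all.
child_tags = [child.tag for child in paper]
print(child_tags)
```

External entities (declared with SYSTEM) are a different matter: whether a parser fetches them is implementation- and configuration-dependent, so this sketch deliberately sticks to an internal one.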

51 Parsed vs. Unparsed Entities
Internal entities always become part of the document and are parsed
  <!ENTITY JSI " <journal>Journal of Systems Integration</journal> <publisher>Kluwer Academic Publishers</publisher>">
External entities can either be parsed or not
- Parsed entity
  <!ENTITY journals SYSTEM "...">
- Unparsed entity: used to include non-XML data
  <!ENTITY pic SYSTEM "logo.gif" NDATA GIF>
Internal entities are always parsed after they are substituted. External entities can either be parsed (the standard case) or unparsed. Unparsed entities are used to include non-XML data, such as images. They are indicated by the keyword NDATA. Their processing requires adequate support from the application processing the XML document.

52 General vs. Parameter Entities
Difference in usage of the entities
- General entities are used in the document content
- Parameter entities are used in the DTD declarations
Parameter entities
- Allow DTDs to be modularized
- Fewer declarations in a DTD
- Both external and internal
- Always parsed
- Different syntax
  <!ENTITY % address "(name, street, zip)">
  <!ELEMENT customer %address;>
  <!ELEMENT supplier %address;>
Depending on whether entities are used to substitute parts of a DTD or parts of a document, one distinguishes parameter entities and general entities. Parameter entities are used within DTDs, as shown in the example above. Note the different syntax: a % is used instead of & (BOTH in the declaration and in the usage). As they are part of the DTD they are necessarily parsed.

53 4.4 XML Namespaces
An XML namespace is
- a collection of names (markup vocabulary)
- identified by a URI reference
- used in XML documents as element and attribute names
URIs are used just as unique identifiers, nothing else
- in particular they do not refer to a DTD or schema
Uses
- universally agreed names
- combination of names from different DTDs without name conflicts
- but not combination of different DTDs
XML namespaces are a convention for distinguishing the vocabularies of different applications in order to avoid name collisions. Just consider how many XML applications would use the element name "name", and what happens if an application encounters this element name in documents originating from different sources without being able to distinguish the origin of the name. Therefore element and attribute names are prefixed by a label that is unique for a namespace. The problem is of course how to obtain these unique labels. The solution is very simple: one uses Uniform Resource Identifiers (typically URLs). But one must be careful here: the only purpose of using a URI (URL) is to have a unique prefix. Nothing else is associated with the URL; in particular, the corresponding URL need not exist, and it need not contain any information like a DTD or a schema (though useful information pertaining to the namespace is often found under its URL). Namespaces are important in order to provide universally agreed names with unambiguous semantics that can be exploited by applications. They also make it possible to combine element names from different DTDs that would otherwise be in conflict with each other. But the namespace concept is not related in any way to the problem of combining definitions from different DTDs.

54 Declaration
Declaration of
- Default namespace: xmlns (without :ns-name)
- Identification by prefix: xmlns:ns (ns is the prefix)
Declared in different places
- In the internal DTD by using default attributes
- In the document by using attributes
  <?xml version="1.0"?>
  <book xmlns="urn:loc.gov:book" xmlns:isbn="urn:isbn...">
    <title>XML Handbook</title>
    <isbn:number>...</isbn:number>
  </book>
  <!DOCTYPE doc [
    <!ELEMENT doc (x)>
    <!ELEMENT x EMPTY>
    <!ATTLIST x xmlns CDATA #FIXED "...">
  ]>
  <doc><x/></doc>
Within the scope of one element one can always declare one default namespace, to which all names without a prefix then belong. In addition, an arbitrary number of other namespaces can be distinguished by prefixes. The namespace declaration can be given either in the DTD, where it is a default attribute, or within the document by using attributes. The example illustrates all these cases: in the example document an attribute xmlns is used to declare the default namespace, such that for example the element title is from the namespace "urn:loc.gov:book", and a second namespace with the prefix isbn is declared. In the example DTD the element type x is equipped with the default namespace given by its xmlns attribute. Note that the keyword #FIXED is used in the attribute declaration in order to express that the attribute value cannot be changed.
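What a namespace-aware parser does with these declarations can be seen in Python's `xml.etree.ElementTree`, which replaces prefixes by the namespace URI and reports names in the form `{uri}localname`. In the sketch below the URI `urn:example:isbn` and the number are placeholders chosen for illustration, not values from the slide.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<book xmlns="urn:loc.gov:book" xmlns:isbn="urn:example:isbn">
  <title>XML Handbook</title>
  <isbn:number>0000000000</isbn:number>
</book>"""

root = ET.fromstring(DOCUMENT)
# Unprefixed names belong to the default namespace, prefixed names to the
# namespace bound to the prefix; the prefix itself is not significant.
tags = [child.tag for child in root]

# Searching uses a prefix map of our own choosing; only the URIs must match.
title = root.find("b:title", {"b": "urn:loc.gov:book"})
print(tags, title.text)
```

The parser never tries to retrieve anything from the namespace URIs; they serve purely as identifiers.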

55 Recapitulation
- Which constraints (syntactic and semantic) does a well-formed XML document have to obey?
- Which atomic element types and operators can be used to build an element content model?
- Which default values exist for attributes?
- What is the ID/IDREF mechanism?
- Is a DTD always a separate document?
- Is a DTD a grammar specification?
- What is the difference between a CDATA section and a comment?
- Which entities need not be parsed?
- Can parameter entities be external entities?
- What information can be found at the location of a URI that is used as a namespace identifier?
- Are namespaces always declared in the document instance?

56 4 Standard XML applications
- Document Object Model
- XPath
- XSLT
- XML Schema
- XML Query

57 4.1 Document Object Model
Standardized object-oriented API for accessing XML documents
- Specifies only interfaces (in IDL), not the implementation
- Views an XML document as a tree structure
- Methods for navigation and manipulation
- Not all concepts supported, e.g. DTDs, entities
Abstract classes
- Node: superclass for all constituents of a document
- NodeList: representation of node lists
- NamedNodeMap: attributes of an element
Concrete classes
- Document, Element, Attribute, etc.
The Document Object Model (DOM) is an object-oriented interface for accessing XML documents that have been parsed into an object-oriented representation of the XML document tree. It specifies exclusively the interfaces that are required to access the document tree, not the implementation. It specifies the interfaces of basic methods for navigation, search and manipulation of XML document trees. Some concepts found in XML are not supported (in the current version 2.0). The DOM specification consists of a collection of classes specified in IDL (Interface Definition Language, the language for defining interfaces specified by the OMG, which also defines CORBA; the ODMG object database standard builds on IDL as well). There exists an abstract class Node that captures the common properties of all constituents of documents, from which the concrete classes for elements, attributes etc. are derived as subclasses. In addition there exist auxiliary abstract classes that are needed to hold intermediate results of processing XML document trees: NodeList, which represents lists of nodes and is required as the type of the property childNodes as well as the result type of methods like selectNodes and getElementsByTagName, and the class NamedNodeMap, which represents all the attributes of an element.

58 Attributes and Methods of Abstract Class Node
[Figure: a node in the centre with its attributes nodeName, nodeType and nodeValue; parentNode and ownerDocument above it; previousSibling and nextSibling on either side; attributes (a NamedNodeMap) attached to it; childNodes (a NodeList) with firstChild and lastChild below it.]
This figure illustrates the attributes of the abstract class Node, which is used to capture the structural properties of an XML document tree, essentially the relationships with the neighbouring nodes required for navigation. The blue node in the middle is the node that owns the attributes indicated in the figure.

59 Example: Class Node Interface
  interface Node {
    // NodeType
    const unsigned short ELEMENT_NODE = 1;
    ...
    const unsigned short NOTATION_NODE = 12;
    readonly attribute DOMString nodeName;
    attribute DOMString nodeValue;
    readonly attribute unsigned short nodeType;
    // Navigation
    readonly attribute Node parentNode;
    readonly attribute NodeList childNodes;
    readonly attribute Node firstChild;
    readonly attribute Node lastChild;
    readonly attribute Node previousSibling;
    readonly attribute Node nextSibling;
    readonly attribute NamedNodeMap attributes;
    readonly attribute Document ownerDocument;
    // Methods
    Node insertBefore(in Node newChild, in Node refChild) raises(DOMException);
    Node replaceChild(in Node newChild, in Node oldChild) raises(DOMException);
    Node removeChild(in Node oldChild) raises(DOMException);
    Node appendChild(in Node newChild) raises(DOMException);
    boolean hasChildNodes();
    Node cloneNode(in boolean deep);
  };
This is the corresponding IDL specification of the Node class. One can see that node types are encoded as short integers. The methods supported are used to insert, replace, remove and append child nodes and to test whether a node has children. One specific method is cloneNode, which copies a node; when deep is true it creates deep copies of whole document (sub-)trees.
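The same attributes and methods are available in language bindings of the DOM. As a brief sketch, the following uses Python's standard `xml.dom.minidom` binding on a small made-up document to exercise the navigation attributes and `cloneNode`.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<journal><issue page="1">first</issue><issue page="2">second</issue></journal>')

journal = doc.documentElement     # the root element node
first = journal.firstChild        # navigation downwards
second = first.nextSibling        # navigation sideways

# nodeType is one of the integer constants declared in the interface.
kinds = (first.nodeType == Node.ELEMENT_NODE,
         first.firstChild.nodeType == Node.TEXT_NODE)

# attributes is a NamedNodeMap, childNodes a NodeList.
page = first.attributes.getNamedItem("page").value
child_count = journal.childNodes.length

# cloneNode(True) copies the whole subtree; the copy has no parent yet.
copy = journal.cloneNode(True)
```

The document contains no whitespace between elements, so `firstChild` is the first issue element rather than a whitespace text node.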

60 Concrete Classes and Their Relationships
  Node type                                           | Possible children
  Document                                            | Element (at most one), ProcessingInstruction, Comment, DocumentType
  DocumentFragment, Element, Entity, EntityReference  | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
  Attr                                                | Text, EntityReference
  DocumentType, ProcessingInstruction, Comment, Text, CDATASection, Notation | none
This table provides an overview of the concrete classes of DOM and their possible parent-child relationships.

61 Processing of XML Documents
XML processor (XML engine): supports the access to the structure and content of XML documents
XML application: software that uses the services provided by the XML processor
[Figure: the XML application sends a call for an XML document to the XML processor; the processor answers with callback methods; the application then issues DOM calls to the processor.]
This figure illustrates the basic working of an XML processor. When an XML application requests a document from the XML processor, the processor starts to parse this document. During parsing the parser may generate events that are passed to the application via callback methods (for which the application has to provide corresponding implementations in order to handle the events). In fact, there exist simpler XML processing models (in particular SAX) where the parser only invokes these callback methods without creating an internal document representation. DOM parsers may also support such callback methods. Once a document is parsed, the application can use the internal document representation of the XML processor in order to access and manipulate the document.
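The callback style of processing can be sketched with Python's standard `xml.sax` module. The application supplies a handler whose methods the parser invokes as it reads the document; no tree is built. The handler class name `EventLog` and the sample document are invented for this illustration.

```python
import xml.sax


class EventLog(xml.sax.ContentHandler):
    """Application-side callbacks: record every event the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        if content.strip():               # ignore whitespace-only text
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))


handler = EventLog()
xml.sax.parseString(b'<journal><issue page="1">first</issue></journal>', handler)
for event in handler.events:
    print(event)
```

A DOM-style application would instead wait until parsing has finished and then navigate the complete tree.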

62 Parsing of XML Documents (XML processor)
[Figure: parsing pipeline. Document branch: document, tokenization (metasyntax), general entity (GEntity) expansion with access to external DB/files, well-formedness check, element tree. DTD branch: DTD, parameter entity (PEntity) expansion with access to external DB/files, validity check, instantiated schema.]
This figure illustrates the details of the parsing process as it takes place in the XML processor. The parser first obtains the document and performs the tokenization. Taking the tokenized document and the general entities (GEntity), it expands the entities in the document. To do so it may need to access external files containing external entities. The document modified in that way is then checked for well-formedness, and if the check is successful an internal document tree representation is created that can be accessed by the applications (note that up to now we have not made use of the DTD). If a DTD is given (this can be seen from the document), the parameter entities in the DTD also need to be replaced, and the element tree produced previously is then validated against the DTD expanded in this way. If the validation is successful, the element tree represents a validated instance of the DTD schema.

63 Example: Use of DOM in JavaScript
  <HTML><HEAD>
  <TITLE>Generic Page for DOM tests</TITLE>
  <SCRIPT>
  window.onload = myload;
  function myload() {
    show("if you are reading this, your DOM is working!");
  }
  function show(str) {
    var resulttext = document.createTextNode(str);  // create a text node
    var resultbr = document.createElement("br");    // create an element with tag BR
    var resultelement = document.body;              // access the BODY element of the document
    resultelement.appendChild(resulttext);          // insert the text node as child
    resultelement.appendChild(resultbr);            // insert the BR element
  }
  </SCRIPT>
  </HEAD>
  <BODY> </BODY></HTML>
This is a very simple, yet complete, example of how to use DOM from within JavaScript. The script is embedded in an HTML page. When the HTML page is loaded into a (DOM-enabled) browser (such as Internet Explorer 5.0 or higher), it is automatically parsed and available for DOM accesses. Upon loading the page the function myload is invoked. It calls the function show, which accesses the DOM tree of the displayed page and appends a text node followed by a <BR> element to the body. Thus when the document is shown, the text passed as parameter to the function show will also appear on the page.
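Since DOM is a language-independent interface, the same sequence of calls can be written in other language bindings. The following sketch repeats the idea with Python's `xml.dom.minidom` on a small in-memory page; it runs outside a browser, so the page string and the way the body element is looked up are assumptions of this illustration.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<html><head><title>DOM test</title></head><body></body></html>")


def show(document, text):
    """Append a text node and a br element to the body of the document."""
    body = document.getElementsByTagName("body")[0]
    body.appendChild(document.createTextNode(text))   # insert the text node
    body.appendChild(document.createElement("br"))    # insert the br element


show(doc, "if you are reading this, your DOM is working!")
body_markup = doc.getElementsByTagName("body")[0].toxml()
print(body_markup)
```

Serializing the body afterwards with `toxml()` shows that the tree was modified in place.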

64 4.2 XPath Overview
Logical addressing of document parts
Key concepts
  Location paths select nodes relative to a given context node
  Absolute location paths start at the document root
  Location paths consist of a sequence of location steps
  The last location step determines the result
  Filter expressions contain location paths
  Context of filter expressions is the associated location step
Practical aspects
  Non-XML syntax
  Used by other XML standards (XSLT, XPointer)
  Used within XML attributes and URIs
[Figure: a sequence of location steps, each with a filter test (yes/no), ending in the result.]
When considering XML documents as data, we need a query language to access this data. As the data view is a tree representation, a primary capability of any XML query language is to navigate along the paths in the tree. This is the essential function that XPath provides. It allows document parts to be addressed (specified) by giving a navigation "instruction" that describes how the part can be reached. The basic principle of XPath is to navigate in the document tree in a way comparable to navigation in the directory tree of a file system, and to evaluate additional filter conditions at each step. The navigation steps (location steps) and the evaluation of filter conditions depend on the current navigation context (the place in the tree where the navigation has arrived). It is important to understand that the result of a location path is always a set of nodes, typically element or attribute nodes, and in particular not an XML document fragment. From a practical viewpoint it is worth mentioning that XPath does not use an XML syntax, so an XPath expression is not a well-formed XML document (other standards, e.g. XSLT, the XML document transformation language, use XML syntax to denote expressions in their specific model). XPath is a component that is reused in other standards, notably in XSLT. XPath expressions are also intended for use within URIs (Uniform Resource Identifiers), so that a URL can be extended to address parts of an XML document.

65 XML example document
This is an example of an XML document we will use to introduce XPath.

66 XPath Example
We introduce XPath by means of a simple example. As one can see from the XPath expression, the syntax is reminiscent of the UNIX syntax for navigating directories. The expression in square brackets is a filter condition. In this example the navigation starts at the root of the document and follows the shaded path. At the item element a filter condition is evaluated, and only one of the possible continuations of the path is selected.

67 XPath Location Paths
A location step consists of an axis (the navigation direction), a node test, and a predicate
Axis operators
  AxisName ::= 'ancestor' | 'ancestor-or-self' | 'attribute' | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'namespace' | 'parent' | 'preceding' | 'preceding-sibling' | 'self'
Example: absolute location path
  /child::purchaseorder[child::shipto/child::name="alice"]/child::items/child::item[position()=1]
Abbreviated syntax (used in practice)
  /purchaseorder[shipto/name="alice"]/items/item[1]
The simple navigation pattern of the previous example (downwards to elements with specific names) is generalized in XPath. Each navigation step is characterized by three properties. The first is the direction (axis): a navigation step need not move from a parent to a child node, but can follow any relation among elements, in particular allowing traversal of elements at the same tree level in document order (following, preceding) and traversal across several levels (descendant). The slide gives the complete list of axes. The second is the node test, which checks whether an element node encountered in the course of the navigation matches a certain element name. The third is the predicate: predicates are the filter expressions that select nodes based on properties other than their name. In particular, filters can use other XPath queries to check a property of an element; in that case the predicate is considered satisfied if the query produces a non-empty result set. To accommodate the different axes, XPath has an extended syntax that specifies the axis for each location step. In practice, however, the abbreviated syntax that we have already seen is more common.
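
The relationship between the two syntaxes can be made concrete with a small rewriting function. The sketch below expands abbreviated location paths into the extended syntax. The function names expand_step and expand_path are hypothetical, and the sketch deliberately handles only paths without predicates, since a predicate may itself contain slashes.

```python
def expand_step(step):
    """Rewrite one abbreviated location step into axis::nodetest form."""
    if step == ".":
        return "self::node()"
    if step == "..":
        return "parent::node()"
    if step.startswith("@"):
        return "attribute::" + step[1:]
    if "::" in step:          # the axis is already explicit
        return step
    return "child::" + step   # child is the default axis


def expand_path(path):
    """Expand an abbreviated location path (no predicates) into the extended syntax."""
    # '//' abbreviates '/descendant-or-self::node()/'
    path = path.replace("//", "/descendant-or-self::node()/")
    # An empty first part stands for the leading '/' of an absolute path.
    return "/".join(expand_step(part) if part else "" for part in path.split("/"))


for p in ["/purchaseorder/items/item", "purchaseorder//item", "name/..", "item/@partnum"]:
    print(p, "->", expand_path(p))
```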

68 XPath Abbreviated Syntax
Selection of elements
  item    selects from the current context all elements with name item
Hierarchy operators
  item/price    all price elements within item
  ./item    equivalent to item
  purchaseorder//item    all descendant elements with name item
  name/..    parent node of name
We introduce all operators of the abbreviated syntax by means of examples. The hierarchy operators are straightforward to understand, as their semantics coincides with the UNIX directory navigation semantics, except for the // operator. It represents the navigation to all descendant elements of the current context element, i.e. all elements on the paths to the leaves are selected (not only the leaf nodes!).
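
The hierarchy operators can be tried out with Python's standard xml.etree.ElementTree module, which implements a subset of the abbreviated XPath syntax. The purchase order below is a made-up document that only borrows the element names used on the slides. In this sketch the context node is the purchaseorder element itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase order; only the element names are taken from the slides.
DOC = """
<purchaseorder>
  <shipto><name>alice</name></shipto>
  <billto><name>bob</name></billto>
  <items>
    <item><productname>car</productname><price>100</price></item>
    <item><productname>bike</productname><price>20</price></item>
  </items>
</purchaseorder>
"""
root = ET.fromstring(DOC)  # context node: the purchaseorder element

prices = [e.text for e in root.findall("items/item/price")]  # child steps
same = root.findall("./items") == root.findall("items")       # ./x is equivalent to x
all_items = root.findall(".//item")                            # descendants at any depth
parents = [e.tag for e in root.findall(".//name/..")]          # parent of each name element

print(prices, same, len(all_items), parents)
```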

69 XPath Abbreviated Syntax
Wildcards
  purchaseorder/*/item    all item elements that are reachable by passing through one arbitrary element
  */*    all element grandchildren of the context node
  @*    all attributes
Indexed access
  item[1]    first item element
  item[position()=1 or position()=4]    first and fourth item element
  item[last()]    last item element
Wildcards match all element names; in other words, no name test is performed. Be careful: the wildcard * has a different meaning in XPath than in a regular expression or in an XML DTD, where it indicates repeated occurrence; in XPath that role is played by the // operator. Attributes can also be selected. This is particularly useful in filter expressions when attribute values need to be checked. Indexed access makes it possible to address neighbouring elements at the same tree level by position. We also see here one example of using a (built-in) function in a predicate, namely last(). XPath specifies a number of basic functions for node access and string manipulation.
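
Wildcards and indexed access can be sketched with Python's standard xml.etree.ElementTree module. The document and its attribute values are invented for illustration. ElementTree supports only a subset of XPath: it cannot return attribute nodes (there is no @* step), so attributes are read from the selected elements instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names follow the slides.
DOC = """
<purchaseorder orderdate="2002-01-01">
  <items>
    <item partnum="a100"><productname>car</productname></item>
    <item partnum="b200"><productname>bike</productname></item>
    <item partnum="c300"><productname>boat</productname></item>
  </items>
</purchaseorder>
"""
root = ET.fromstring(DOC)  # context node: the purchaseorder element

via_wildcard = [i.get("partnum") for i in root.findall("*/item")]  # pass through any one element
first = root.find("items/item[1]").get("partnum")                   # positions are 1-based
last = root.find("items/item[last()]").get("partnum")
attributes = dict(root.attrib)                                       # stand-in for @*

print(via_wildcard, first, last, attributes)
```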

70 XPath Abbreviated Syntax
Filter
  item[price]    all item elements containing a price element
  purchaseorder[billto]/items[item]    all items elements containing an item element that are contained in a purchaseorder element containing a billto element
  purchaseorder[items/item]
  purchaseorder[shipto and billto]
  item[productname="car"]    all item elements containing a productname element with textual content "car"
  item[@partnumber="a100"]
Union
  purchaseorder/billto/name | purchaseorder/shipto/name
  Only at top level!
We see that two types of filter expressions exist. The first type returns a set of element (or attribute) nodes; such a filter is true if the set is non-empty. In this way Boolean combinations of XPath expressions (e.g. shipto and billto) also make sense. The second type returns a Boolean value (e.g. by using the equality predicate); such a filter is true if the condition is satisfied for at least one element selected by the XPath expression appearing in the predicate. XPath provides only one set operator, for set union (in contrast to SQL), with the restriction that it can only be applied at the top level of the XPath expression.
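
The filter expressions can be tried out with Python's standard xml.etree.ElementTree module on a made-up purchase order. ElementTree implements only part of XPath: it has neither the and operator nor the union operator |, so the union is emulated in this sketch by concatenating two result lists (a real XPath union would return the nodes in document order).

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase order; names follow the filter examples on the slide.
DOC = """
<purchaseorder>
  <shipto><name>alice</name></shipto>
  <billto><name>bob</name></billto>
  <items>
    <item partnumber="a100"><productname>car</productname><price>100</price></item>
    <item partnumber="b200"><productname>bike</productname></item>
  </items>
</purchaseorder>
"""
root = ET.fromstring(DOC)  # context node: the purchaseorder element

# item[price]: true when the nested path selects a non-empty node set
with_price = [i.get("partnumber") for i in root.findall("items/item[price]")]
# item[productname='car']: true when at least one productname child has this text
cars = [i.get("partnumber") for i in root.findall("items/item[productname='car']")]
# item[@partnumber='a100']: comparison on an attribute value
by_attr = [i.findtext("productname") for i in root.findall("items/item[@partnumber='a100']")]
# Emulated union of billto/name and shipto/name (concatenation, not document order)
names = [n.text for n in root.findall("billto/name") + root.findall("shipto/name")]

print(with_price, cars, by_attr, names)
```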

71 XPath Abbreviated Syntax
Operators
  /      Child operator: elements that are direct children
  //     Recursive descent: all elements that are descendants at any depth
  .      Current context node
  *      Wildcard: matches all elements (with @, all attributes)
  @      Attribute: distinguishes attributes from elements
  []     Filter
  f()    Method call
  ()     Grouping

72 4.3 XSLT
Restructuring of XML documents
  Layout of documents
  Overcoming heterogeneity, transformation of schemas
[Figure: an XSLT processor applies a rule set (the XSLT stylesheet; each rule consists of a document pattern and a template) to a source document and produces the target document.]
XSLT is a language for implementing mappings from XML documents into other XML documents or general Unicode documents. An XSLT program consists of a set of rules that are recursively applied to the nodes of the XML document, transforming them into the nodes of the target document. Each rule consists of a pattern, which matches nodes in the document to be transformed, and a template, which is applied if the match is successful. XSLT was originally designed to produce different layouts (e.g. in the HTML format) from one single XML document. In the meantime the language is also widely used to transform XML documents in a more general way, e.g. to transform documents from different sources into common representations (schemas) or to transform XML documents conforming to one schema into documents conforming to a different schema.

73 XSLT Example
XML Source
  <?xml version="1.0"?>
  <LectureNotes>
    <chapter>first Chapter</chapter>
    <chapter>second Chapter
      <chapter>subchapter 1</chapter>
      <chapter>subchapter 2</chapter>
    </chapter>
    <chapter>third Chapter
      <chapter>subchapter A</chapter>
      <chapter>subchapter B
        <chapter>sub a</chapter>
        <chapter>sub b</chapter>
      </chapter>
      <chapter>subchapter C</chapter>
    </chapter>
  </LectureNotes>
XSLT Stylesheet
  <xsl:stylesheet xmlns:xsl=' '>
    <xsl:template match="/">
      <TABLE BORDER="1">
        <TR> <TH>Number</TH> <TH>text</TH> </TR>
        <xsl:for-each select="//chapter">
          <TR>
            <TD><xsl:number/></TD>
            <TD><xsl:value-of select="./text()" /></TD>
          </TR>
        </xsl:for-each>
      </TABLE>
    </xsl:template>
  </xsl:stylesheet>
HTML Output (formatted)
This example illustrates the basic working of XSLT. On the left-hand side we see an XML document that should be displayed on the Web in HTML. In the middle we have an XSLT stylesheet. As we can see, XSLT stylesheets are themselves represented in XML format, so they can be treated just like XML documents. The stylesheet consists of one template rule contained within the template element. In the match attribute of the template element we recognize an XPath expression. In fact, XPath is the language used in XSLT to match the nodes to which a specific template should be applied. The content of the template element is the document fragment that is produced when the rule matches. In addition, templates contain procedure-like instructions that make it possible to manipulate the original document in order to produce the desired result. We will subsequently go into more detail on this aspect of the XSLT language. On the right-hand side we see the output that would be produced (in HTML) by applying the template, as it appears in a browser.

74 Two Types of XSLT Stylesheets
Literal result-element generation
  <?xml version='1.0'?>
  <doc xmlns:xsl=' ' xsl:version='1.0'>
    <xsl:copy-of select='/publications//author'/>
  </doc>
Template-based stylesheet
  <?xml version='1.0'?>
  <xsl:stylesheet version='1.0' xmlns:xsl=' '>
    <xsl:template match='//paper'>
      <contributor>
        <xsl:value-of select='authors/author' />
      </contributor>
    </xsl:template>
  </xsl:stylesheet>
There exist two fundamentally different types of XSLT stylesheets. Stylesheets using literal result-element generation use no templates. Rather, they instantiate the result document by processing the XSLT statements that are embedded in an "ordinary" XML document. In other words, these stylesheets are XML documents that exploit some of the result generation capabilities that XSLT provides. So in the first example one result element doc would be created that contains a copy of the elements selected by the path expression /publications//author. Normally, however, an XSLT stylesheet consists of a set of template rules (their order is not essential). This is the second example. We see (highlighted in yellow on the slide) the element stylesheet, which indicates that this is a stylesheet containing template rules. Within this element we find one template rule, which consists of the element "template" with the match attribute containing an XPath expression. For each matching node (element "paper") an output document fragment is generated, consisting of the "contributor" element and containing the value of the first element found for the path expression 'authors/author'. An important question is how the template rules of the second, template-based kind of stylesheet are applied. The processing is initiated by applying the template rules to the root node of the document. Within a template there are operators that then control the further invocation of template rules for other nodes.

75 Basic XSLT Programming Constructs: Generating Output
Copying from the input document
  <xsl:copy-of select=' '>: copies all nodes from the node set selected (and their subelements)
  <xsl:value-of select=' '>: copies the textual content of the first node of the node set selected
  <result><xsl:copy-of select='//authors'/></result>
  yields
  <authors><author>w. Klas</author><author>G. Fischer</author><author>K. Aberer</author></authors> <authors><author>m. Volz</author><author>K. Aberer</author><author>K. Böhm</author></authors>
  <result><xsl:value-of select='//authors'/></result>
  yields
  <result>w. Klas G. Fischer K. Aberer</result>
Explicit instantiation of nodes
  <result> <xsl:element name='doc'/> <doc> <xsl:attribute name='id' >32</xsl:attribute></doc></result>
  yields
  <result version='1.0'><doc/><doc id='32'/></result>
The question is now how a result document can be generated (either when a template matches or in literal result generation). There exist two basic possibilities:
1. The result consists of fragments of the source document that are copied to the output. Two operators support this. "copy-of" takes all nodes that are selected by an XPath expression and copies them unchanged to the output document. The context of the XPath expression (given in the "select" attribute of the "copy-of" element) is the node at which the template match has been successful (in case it is a relative XPath expression, i.e. one not starting with /). The second operator, "value-of", copies only textual content (but no elements or attributes). It does this only for the FIRST element matched by the XPath expression.
2. The result consists of document fragments that are constructed from scratch. For this purpose there exist a number of XSLT operators (essentially one for each XML construct). For example, the operator <xsl:element name='doc'/> is used to generate an element with name "doc". We can also generate such an element by writing it directly into the output, as shown above. However, the element operator makes sense, e.g., when the element name should be chosen dynamically.
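
The difference between the two copy operators can be emulated outside of XSLT. The sketch below uses Python's standard xml.etree.ElementTree module on a made-up publications document; the functions copy_of and value_of are hypothetical stand-ins for xsl:copy-of and xsl:value-of, not an XSLT implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; author names are taken from the slide.
DOC = """<publications>
<paper><authors><author>W. Klas</author><author>G. Fischer</author></authors></paper>
<paper><authors><author>M. Volz</author><author>K. Aberer</author></authors></paper>
</publications>"""
root = ET.fromstring(DOC)
selected = root.findall(".//authors")  # plays the role of select='//authors'


def copy_of(nodes):
    """Like xsl:copy-of: copy every selected node together with its subtree."""
    return "".join(ET.tostring(n, encoding="unicode").strip() for n in nodes)


def value_of(nodes):
    """Like xsl:value-of: the textual content of the FIRST selected node only."""
    return "".join(nodes[0].itertext()) if nodes else ""


print(copy_of(selected))
print(value_of(selected))
```

Because the sample input has no whitespace between the author elements, the text of the first authors element is concatenated without separators.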

76 Basic XSLT Programming Constructs: Program Logic
Conditional statement: <xsl:if test=' '>
For-each loop: <xsl:for-each select=' '>
Sorting of nodes in for-each: <xsl:sort select=' '/>
  <authorcount xmlns:xsl=' ' xsl:version='1.0'>
    <xsl:for-each select='//paper'>
      <xsl:sort select='year'/>
      <xsl:if test='count(.//author) > 2'>
        <xsl:copy-of select='title'/>
        <nr><xsl:value-of select='count(.//author)'/></nr>
      </xsl:if>
    </xsl:for-each>
  </authorcount>
XSLT also supports standard programming logic such as conditional execution and loops. For conditional execution an XPath expression is evaluated, and the test succeeds if the expression evaluates to true (for an expression returning a node set: if the set is non-empty). In that case the XSLT operators within the conditional expression are processed. For loops, again an XPath expression is evaluated. The operators enclosed in the iterator element "for-each" are then processed for each of the elements found. Within the scope of the iterator expression the found elements serve as context for the evaluation of XPath expressions occurring in that scope. The sort operator sorts the result of an iterator before it is processed, by specifying (again by means of an XPath expression) which element is to be used as the sorting criterion. The "xsl:sort" instruction elements must appear as the initial children of an "xsl:for-each" element.
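
The control flow of the stylesheet on this slide can be mirrored step by step in an ordinary programming language. The following sketch uses Python's standard xml.etree.ElementTree module on a made-up publications document; it imitates the for-each, sort and if instructions and is not an XSLT processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document with three papers.
DOC = """<publications>
<paper><title>B</title><year>2001</year>
  <author>a1</author><author>a2</author><author>a3</author></paper>
<paper><title>A</title><year>1999</year>
  <author>a1</author><author>a2</author><author>a3</author><author>a4</author></paper>
<paper><title>C</title><year>2000</year><author>a1</author></paper>
</publications>"""
root = ET.fromstring(DOC)

result = ET.Element("authorcount")
# xsl:for-each select='//paper' combined with xsl:sort select='year' (text comparison)
for paper in sorted(root.iter("paper"), key=lambda p: p.findtext("year")):
    n_authors = len(paper.findall(".//author"))
    if n_authors > 2:  # xsl:if test='count(.//author) > 2'
        ET.SubElement(result, "title").text = paper.findtext("title")  # xsl:copy-of select='title'
        ET.SubElement(result, "nr").text = str(n_authors)              # xsl:value-of select='count(...)'

output = ET.tostring(result, encoding="unicode")
print(output)
```

Inside the loop body each paper acts as the context node, which is why the relative paths title and .//author are evaluated against the current paper.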

77 XSLT Templates
XSLT templates correspond to functions
  Allow for modularization of stylesheets
  Allow recursive programming!
  Can be called with parameters
Invocation
  Explicitly by a given name
    <xsl:template name='name'> called by <xsl:call-template name='name'/>
  Implicitly by pattern matching (declarative programming)
    Pattern given as XPath expression
    <xsl:template match='pattern'> called by <xsl:apply-templates select=' '/>
Now we come to the XSLT templates, which correspond to functions. Their purpose is to modularize stylesheets, to support recursive programming and to allow parametrized functions. Templates can be invoked explicitly, by giving them a name and calling them by that name. Alternatively, and this makes up much of the power (and declarative nature) of XSLT, they can be invoked implicitly by specifying a pattern for the nodes to which they should be applied. The invocation is then initiated by the operator "apply-templates", which, when invoked, causes all templates to be matched against the nodes that are processed in the subtree for which the "apply-templates" operator has been invoked.

78 Example
  <?xml version='1.0'?>
  <xsl:stylesheet version='1.0' xmlns:xsl=' '>
    <xsl:template match='//paper'>
      <xsl:call-template name='printtitle'/>
    </xsl:template>
    <xsl:template name='printtitle'>
      <TITLE><xsl:value-of select='title'/></TITLE>
    </xsl:template>
  </xsl:stylesheet>
Annotations on the slide: default invocation; invocation for each element of the context
This example illustrates the use of templates. When the traversal of the document starts at the root node, the first template is invoked (this happens by default; we will explain immediately how). This template matches all "paper" elements. For each matched element the template body says that the template "printtitle" should be invoked. This then completes the processing of this stylesheet.

79 Template Processing
Initialisation: implicit template
  <xsl:template match='/'> <xsl:apply-templates/> </xsl:template>
Conflict resolution
  Only templates that match the current context node may be chosen
  Imported rules have lower priority
  Otherwise the rule with higher priority is chosen
  Priority is computed from the structure of the matching pattern or given explicitly
  Two rules of equal priority are an error
Default templates
  If no template applies to a node that needs to be processed due to an apply-templates statement, built-in default template rules apply, e.g.
  <xsl:template match='*|/'> <xsl:apply-templates select='node()' /> </xsl:template>
  <xsl:template match='text()|@*'> <xsl:value-of select='.' /> </xsl:template>
We have to understand three issues for processing templates:
1. How is the processing initialized?
2. What happens if multiple templates can be applied to a node?
3. What happens if no template can be applied to a node?
Question 1 is answered as follows: an implicit template as shown above is assumed to be present, which initiates the template processing at the root node. It applies the operator apply-templates to the root node.
Question 2: It may occur that multiple patterns match a given node. In this case priorities determine which template must be applied. Imported rules always have lower priority than the rules of the importing stylesheet (rules brought in with xsl:include are treated like local rules). For local rules the matching patterns are compared, and essentially the more specialized one is chosen. The priority can also be specified explicitly. A tie is not possible: if the priority of two rules is equal, this produces an error. Note that it is, for example, possible to override the implicitly given initialization rule by any rule that matches the root node.
Question 3: When the operator apply-templates is invoked in a specific processing context, it can (frequently) occur that none of the template rules match a node. In that case built-in rules are again applied, which essentially copy the text content of the node to the output. This default behavior can sometimes be quite cumbersome when developing XSLT stylesheets.
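
The effect of the built-in rules can be illustrated with a toy template processor. The sketch below is a strong simplification written for this purpose: templates are matched by element name only (no XPath patterns, no priorities, no imports), and all names in it (apply_templates, paper_rule, the sample document) are hypothetical. It shows how text from elements without a matching rule ends up in the output.

```python
import xml.etree.ElementTree as ET


def apply_templates(node, templates, out):
    """Process node with the user rule for its tag if one exists, else with the built-in rule."""
    rule = templates.get(node.tag)
    if rule is not None:
        rule(node, templates, out)
        return
    # Built-in rule: copy text to the output and recurse into the child elements.
    if node.text:
        out.append(node.text)
    for child in node:
        apply_templates(child, templates, out)
        if child.tail:
            out.append(child.tail)


def paper_rule(node, templates, out):
    """Hypothetical user template: emit only the title of a paper."""
    out.append("[" + node.findtext("title") + "]")


# Hypothetical input: only 'paper' has a user rule, 'note' falls back to the built-in rule.
DOC = "<publications><note>draft</note><paper><title>XML</title><year>2002</year></paper></publications>"
out = []
apply_templates(ET.fromstring(DOC), {"paper": paper_rule}, out)
result = "".join(out)
print(result)
```

The text of the note element appears in the result although no rule was written for it, which is the behavior the notes describe as cumbersome. The year is absent because the user rule for paper does not process its children any further.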

80 Parameters and Variables
Templates may have parameters
  <xsl:template name='function'>
    <xsl:param name='n1' />
    <xsl:param name='n2' />
    ...
  </xsl:template>
Called as follows
  <xsl:call-template name='function'>
    <xsl:with-param name='n1' select='expr1' />
    <xsl:with-param name='n2' select='expr2' />
  </xsl:call-template>
Templates may also have variables
  Analogous syntax for declaration
  No setting of value at invocation time
  Can be declared everywhere
Variables and parameters
  Can be defined only once per template
  Are globally visible when declared as children of xsl:stylesheet
Named templates can also be equipped with parameters that can be set at invocation time. The syntax is given above. In addition to named parameters, XSLT also supports named variables via the xsl:variable instruction. The syntax of the xsl:variable instruction is identical to that of xsl:param except for the element name. The difference between xsl:param and xsl:variable is that parameters, unlike variables, can have their initial values overridden at template invocation time by using xsl:with-param. Additionally, xsl:param instructions must appear at the top of the template in which they are included; xsl:variable instructions can appear anywhere an instruction is allowed. In either case, a given variable or parameter name can only be defined once per template. Variables and parameters defined as children of an xsl:stylesheet element are global in scope and are visible across all templates, but a template can hide the global definition by defining a variable or parameter with the same name.

81 Example: Recursive Programming
  <?xml version='1.0'?>
  <xsl:stylesheet version='1.0' xmlns:xsl=' '>
    <xsl:template match='/publications'>
      <xsl:call-template name='recursive'>
        <xsl:with-param name='n' select='count(journals/paper)' />
      </xsl:call-template>
    </xsl:template>
    <xsl:template name='recursive'>
      <xsl:param name='n' />
      <xsl:copy-of select='journals/paper[$n]/title'/>
      <xsl:if test='$n > 0'>
        <xsl:call-template name='recursive'>
          <xsl:with-param name='n' select='$n - 1' />
        </xsl:call-template>
      </xsl:if>
    </xsl:template>
  </xsl:stylesheet>
Templates can be used for recursive programming, as this example illustrates. The first template initially invokes a template named recursive (with one parameter that is used to end the recursion). The template recursive invokes itself again, provided the condition on the parameter value evaluates to true. What this stylesheet does is traverse the papers one by one (by decreasing the count of papers) and copy the titles of the papers to the output. (Of course this could also be done in a simpler way; we use recursion for illustration purposes only.) Another interesting feature that can be observed in this example is the use of parameters. In the XPath expression select='$n - 1' we see how a parameter value is used to derive a new parameter value from it (via an arithmetic operation). Note that this is a valid XPath expression, as any arithmetic expression (or other expression returning an atomic data value) is an XPath expression.
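
The recursion pattern of this stylesheet can be traced in a few lines of ordinary code. In the sketch below the list titles stands in for the title elements selected by journals/paper/title, and the Python function recursive is a hypothetical counterpart of the named template, not generated from it.

```python
def recursive(titles, n, out):
    """Counterpart of the named template: copy the n-th title, then recurse with n - 1."""
    # paper[$n] is 1-based; for n == 0 the selection is empty and nothing is copied.
    if 1 <= n <= len(titles):
        out.append(titles[n - 1])
    if n > 0:  # xsl:if test='$n > 0'
        recursive(titles, n - 1, out)
    return out


titles = ["first", "second", "third"]         # hypothetical paper titles
copied = recursive(titles, len(titles), [])   # initial call with n = count(journals/paper)
print(copied)
```

The trace shows that the titles are emitted from the last paper to the first, and that the template is invoked one final time with n = 0 without producing output.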

82 Summary of XSLT Concepts
Declarative programming language
  Functional programming paradigm (no side-effects!)
  XPath is used as locator language
  XML representation of programs
Programming constructs
  Loops, conditional statements, sorting
  Templates (correspond to parametrized functions)
  Pattern matching (for invocation of templates)
  Copying and creation of document constituents (elements, attributes, text, ...)
Program execution
  Input document is processed in a top-down fashion
  Output is generated during processing
  Processing context: each command is executed in a context consisting of a node set, initial context is the root node
  Conflict resolution for templates
  Default processing rules

83 Processing Architectures for XSLT
[Figure: four architectures, each with a server providing an XML document: a new Web browser (e.g. IE 5.0) with the XSL interpreter in the browser; an old Web browser (e.g. IE 4.0) with the XSL interpreter at the server; an old Web browser (e.g. IE 4.0) with the XSL interpreter at a proxy/firewall; a new Web browser (e.g. IE 5.0) with XSL interpreters at both the browser and the server.]
We give here a small overview of possible processing architectures when using XSLT. The original intention was that the XSLT interpreter resides in the Web browser. The idea is that Web servers provide XML content and end users visualize the content in an individualized manner using their personal stylesheets. If users have Web browsers that are not XSLT-enabled, the Web servers would have to do the actual conversion. If the servers do not do that either, an alternative is that some proxy server performs this task, probably on behalf of an organization (or the intranet of the organization). Finally, any combination is possible: conversion at the server side, e.g. for preparing the XML content for general distribution (e.g. selecting the relevant content), together with conversion at the client side for adapting to individual needs. Besides the more qualitative problem of the availability of XSLT processing capabilities, there is also the concern about resource consumption in terms of processing power. XSLT conversion is fairly expensive, and thus server-side XSLT conversion may quickly lead to a very high load and possible congestion of the server.

84 Recapitulation
  What is the difference between abstract and concrete DOM classes?
  What is the difference in access granularity between DOM and XSLT?
  What is an axis in XPath?
  Which result types are possible in XPath?
  What is the difference between a/*/b and a//b?
  What is the relationship between XSLT and XPath?
  How is an XSLT template rule processed?
  How can the output of an XSLT stylesheet be generated from the input document?
  What is the difference between variables and parameters in XSLT?

85 4.4 XML Schemas - Motivation
If we consider typical data-oriented XML documents, such as the purchase order document shown in this example, we recognize that they resemble regularly structured complex objects much more than flexibly structured documents. Using XML DTDs to capture this structure works quite well to a certain degree, but DTDs have some deficiencies that are not easy to remedy, because they originate in the document processing world rather than the data processing world.

86 Limitations of XML DTDs
  <!ENTITY % Address "name, street, city, state, zip">
  <!ENTITY % Address.Attribute "country CDATA #REQUIRED">
  <!ELEMENT purchaseorder (shipto, billto, comment?, items)>
  <!ATTLIST purchaseorder orderdate CDATA #REQUIRED>
  <!ELEMENT shipto (%Address;)>
  <!ATTLIST shipto %Address.Attribute;>
  <!ELEMENT billto (%Address;)>
  <!ATTLIST billto %Address.Attribute;>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>
  <!ATTLIST billto country CDATA #REQUIRED>
  <!ELEMENT comment (#PCDATA)>
  <!ELEMENT items (item*)>
  <!ELEMENT item (productname, quantity, price, comment?, shipdate?)>
  <!ATTLIST item partnum CDATA #REQUIRED>
  <!ELEMENT productname (#PCDATA)>
  <!ELEMENT quantity (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ELEMENT shipdate (#PCDATA)>
Non-XML syntax
Modularisation: textual replacement with parameter entities
Only String datatype
Only constructor is the content model
Here we point out some of these deficiencies. Very obviously, DTDs have their own proprietary (non-XML) syntax. This may not seem to be a major issue, but anyone who intends to manage a large number of DTDs needs special mechanisms, either by managing them in a separate (non-XML) repository or by providing some kind of proprietary mapping to XML (which has in fact been done in many cases). One of the nice things about XSLT, for example, was that XSLT stylesheets could be treated just like any other XML document. This would also be desirable for XML DTDs (or XML schemas in general). The second major deficiency is the lack of modularization support. This is of great importance for complex documents in particular, and when inspecting real-world DTDs one sees that they make extensive use of entities in order to modularize. Using entities is dangerous, however: since they are only textually replaced, we can never be sure whether the type of the entities that we substitute matches the requirements. In object-oriented models this problem is dealt with by inheritance mechanisms, and something comparable would also be desirable for typing XML documents. Thirdly, the set of datatypes available in XML DTDs is extremely limited. Compared to the data models used in DBMSs, there is practically nothing. And finally, the use of content models to construct types leaves a lot of flexibility (and is elegant), but there are a number of type constraints that cannot be expressed with content models, for example quantitative cardinality constraints or referential integrity constraints.

87 XML Schema Overview
Unifies a typical object-oriented modelling paradigm with the DTD constructs in an XML syntax
Main concepts
  Simple types: rich set of basic types, user-definable simple types
  Complex types: extend the content model of DTDs
    Anonymous complex types
    Choice and sequence construct (instead of | and , in DTDs)
    Explicit cardinality constraints (instead of ?, + and * in DTDs)
  Inheritance mechanisms by extension and restriction
  Integrity constraints (uniqueness constraints)
All these problems motivated the development of a new standard for typing XML documents, namely XML Schema. It is closely related to the object-oriented (database) modelling paradigm (i.e. it contains many elements from relational and particularly object-oriented data models like ODMG and SQL99), but remains compatible with the typing constructs supplied by the XML DTD model. Furthermore, XML Schemas are specified using an XML syntax. In the following we introduce some of the key concepts of XML Schema.

88 Example Document Sequence Constructor XML Document <USAddress country="us"> <name>alice Smith</name> <street>123 Maple Street</street> <city>mill Valley</city> <state>ca</state> <zip>90952</zip> </USAddress > DTD <!ELEMENT USAdress (name, street, city, state, zip )> <!ATTLIST USAdress country CDATA #FIXED > <!ELEMENT name #PCDATA> etc. XML Schema <xsd:complextype name="usaddress"> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="street" type="xsd:string"/> <xsd:element name="city" type="xsd:string"/> <xsd:element name="state" type="xsd:string"/> <xsd:element name="zip" type="xsd:decimal"/> </xsd:sequence> <xsd:attribute name="country" type="xsd:nmtoken" use="fixed" value="us"/> </xsd:complextype> 2002, Karl Aberer, EPFL-SSC, Laboratoire de systèmes d'informations rèpartis 88 We illustrate first the difference between an XML DTD and an XML Schema by means of an example. Given the document on top, we can give the corresponding XML DTD and XML schema below. Now there is a first important observation: The use of XML Schema has no influence whatsoever on the nature of the XML document instances! These are still well-formed XML documents, XML Schema just provides a different mechanism for expressing the type of these documents. The first observation on XML Schema is that it is in XML Syntax. We also see that rather having an element content model we introduce a complex type which is named using the XML Schema attribute "name" within the XML Schema element "complextype". Furthermore we see that the content model operator "sequence" corresponds to an element sequence which contains the elements that must occur in the USAdress element. The elements are specified by giving them a name using the XML Schema attribute "name" in "element" and specifying the type using the attribute "type". 
A major difference is that we can use not only the type "xsd:string", which essentially corresponds to #PCDATA, but also other simple types such as "xsd:decimal". Next we see that an attribute is also declared in a different way, by using the "attribute" element. The attributes "name", "type", "use" and "value" are provided in order to specify the same properties of an attribute as within an XML DTD.

89 XML Schema DTD Differences
XML Schema:
  <xsd:complexType name="PurchaseOrderType">
    <xsd:element name="shipto" type="Address"/>
    <xsd:element name="billto" type="Address"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
    <xsd:attribute name="orderdate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:element name="comment" type="xsd:string"/>
versus DTD:
  <!ELEMENT purchaseorder (shipto, billto, comment?, items)>
  <!ATTLIST purchaseorder orderdate CDATA #REQUIRED>
  <!ELEMENT shipto %Address;>
  <!ATTLIST shipto ...>
  <!ELEMENT comment (#PCDATA)>
These two fragments of an XML Schema and an XML DTD reveal more differences:
1. Elements can be declared either locally or globally in XML Schema, depending on the scope that is desired. The elements shipto, billto and items are declared locally (element declarations), whereas comment is declared globally, i.e. outside of the scope of the type definition of PurchaseOrderType, and used via a so-called element reference. This distinction cannot be made in a DTD, where every element is declared globally.
2. Cardinality constraints are expressed explicitly using the attributes minOccurs and maxOccurs, which can take any non-negative integer value (maxOccurs may also be "unbounded"). For example, the DTD operator ? translates into minOccurs="0" maxOccurs="1". The default value for minOccurs and maxOccurs is always 1. In this way more general cardinality constraints can be given than with the DTD operators.
3. Elements are typed. By assigning the type Address to the elements shipto and billto, address elements that correspond to subtypes of Address also become correct instantiations of shipto and billto. This is clearly not possible with a simple string replacement mechanism such as entities in DTDs, which cannot support inheritance.
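The translation of DTD occurrence operators into explicit bounds (point 2 above) can be sketched in a few lines of Python. This is only an illustration, not part of XML Schema itself: the table DTD_TO_BOUNDS and the helper occurrence_ok are hypothetical names, and None stands in for maxOccurs="unbounded".

```python
# Hypothetical illustration: DTD occurrence operators as (minOccurs, maxOccurs).
# None plays the role of maxOccurs="unbounded".
DTD_TO_BOUNDS = {
    "": (1, 1),      # no operator: exactly once (the XML Schema default)
    "?": (0, 1),     # optional
    "*": (0, None),  # zero or more
    "+": (1, None),  # one or more
}


def occurrence_ok(count, min_occurs=1, max_occurs=1):
    """Return True if an element occurring `count` times satisfies the bounds."""
    if count < min_occurs:
        return False
    return max_occurs is None or count <= max_occurs


# comment? in the DTD corresponds to minOccurs="0" maxOccurs="1"
low, high = DTD_TO_BOUNDS["?"]
print(occurrence_ok(0, low, high), occurrence_ok(2, low, high))

# A bound such as "between 2 and 5 occurrences" has no single DTD operator,
# but is directly expressible with explicit bounds.
print(occurrence_ok(3, 2, 5))
```

The last call shows why explicit bounds are more general: a DTD would have to spell out such a range by repeating the element in the content model.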

90 Anonymous Types and User-Defined Simple Types
  <xsd:complexType name="Items">
    <xsd:sequence>
      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
        <!-- anonymous complex type -->
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="productname" type="xsd:string"/>
            <xsd:element name="quantity">
              <!-- user-defined simple type -->
              <xsd:simpleType>
                <xsd:restriction base="xsd:positiveInteger">
                  <xsd:maxExclusive value="100"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:element>
            <xsd:element name="usprice" type="xsd:decimal"/>
            <xsd:element ref="comment" minOccurs="0"/>
            <xsd:element name="shipdate" type="xsd:date" minOccurs="0"/>
          </xsd:sequence>
          <xsd:attribute name="partnum" type="sku"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
This example illustrates two further concepts in XML Schema:
1. Anonymous types: The type of the element "item" is not explicitly declared but included as content in the "xsd:element" element. Thus it obtains no name, is therefore anonymous, and cannot be reused in another context. One can compare this to the declaration of attributes in XML DTDs. They are also directly attached to their corresponding element, and their type is not explicitly given a name (as opposed to all the elements in an XML DTD).
2. User-defined simple types: Within the element with name "quantity" we find a user-defined (anonymous) simple type. It is derived from one of the built-in XML Schema simple types, xsd:positiveInteger. It is obtained by imposing an additional constraint on the possible range of integer values, namely an exclusive upper bound of 100, i.e. a maximal value of 99.
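The effect of the restriction on "quantity" can be mimicked by a small check. This is a minimal sketch under simplifying assumptions: the function name is_valid_quantity is hypothetical, and the lexical rules of xsd:positiveInteger are reduced to plain decimal digits with an optional leading plus sign.

```python
def is_valid_quantity(text):
    """Check a value against positiveInteger restricted by maxExclusive=100.

    Simplified: only ASCII decimal digits, optionally preceded by '+'.
    """
    s = text.strip()
    if s.startswith("+"):
        s = s[1:]
    if not (s.isascii() and s.isdigit()):
        return False
    value = int(s)
    # positive, and strictly below the exclusive upper bound
    return 1 <= value < 100


for sample in ["1", "99", "100", "0", "-5", "abc"]:
    print(sample, is_valid_quantity(sample))
```

Only the first two samples pass: 100 violates the exclusive bound, 0 and -5 are not positive, and "abc" is not an integer at all.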

91 Model Groups and Choice
  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:group ref="shipandbill"/>
        <xsd:element name="singleaddress" type="Address"/>
      </xsd:choice>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items" type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderdate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:group name="shipandbill">
    <xsd:sequence>
      <xsd:element name="shipto" type="Address"/>
      <xsd:element name="billto" type="Address"/>
    </xsd:sequence>
  </xsd:group>
In this example we see three more constructs of XML Schema:
1. The sequence operator is assumed by default: We have added the xsd:sequence operator explicitly in this example for clarity only. Remember that we omitted it in the original example.
2. Choice operator: The choice operator xsd:choice corresponds to the | operator in XML DTDs; thus exactly one of the alternatives occurring within the choice must occur in the document instance.
3. Model groups: The model group shipandbill can be used exactly like a parameter entity in an XML DTD for substitution. Note that it is not declared as a type and therefore cannot be used for declaring the type of an element.

92 Model Groups: All
  <xsd:complexType name="PurchaseOrderType">
    <xsd:all>
      <xsd:element name="shipto" type="Address"/>
      <xsd:element name="billto" type="Address"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items" type="Items"/>
    </xsd:all>
    <xsd:attribute name="orderdate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:element name="comment" type="xsd:string"/>
If the xsd:all operator is used instead of the sequence operator, there is no constraint on the order in which the elements appear within the group/type. They just all have to be present (unless minOccurs="0" makes one optional, as for comment). This operator actually existed in SGML, the predecessor of XML, but was abandoned in order to simplify document processing (unordered collections are very expensive to parse). The usage of xsd:all is, however, substantially restricted:
1. The xsd:all operator can only be used at the top level of the content model (not nested).
2. The children occurring within the operator must all be individual elements (groups or nested model groups are not allowed).
3. No element may occur more than once, i.e. maxOccurs cannot be larger than 1.
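What the xsd:all group demands of an instance can be sketched with the Python standard library. This is an illustrative check only, not a schema validator: the dictionary ALL_GROUP and the function satisfies_all_group are hypothetical, and the parent element name purchaseorder is taken from the DTD fragment shown earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical description of the xsd:all group: element name -> required?
# (comment has minOccurs="0", so it is optional)
ALL_GROUP = {"shipto": True, "billto": True, "comment": False, "items": True}


def satisfies_all_group(parent, group=ALL_GROUP):
    """True if the children of `parent` form a valid instance of the group."""
    names = [child.tag for child in parent]
    if any(name not in group for name in names):
        return False                      # unexpected element
    if len(names) != len(set(names)):
        return False                      # an element occurs more than once
    # every required element must be present; the order is irrelevant
    return all(name in names for name, required in group.items() if required)


doc = ET.fromstring("<purchaseorder><items/><billto/><shipto/></purchaseorder>")
print(satisfies_all_group(doc))
```

Because each name may occur at most once, a set-membership test suffices; this is the simplification that restriction 3 buys.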

93 XML Schema: Inheritance
  <complexType name="address">
    <element name="name" type="string"/>
    <element name="street" type="string"/>
    <element name="city" type="string"/>
  </complexType>
  <complexType name="us-address" base="address" derivedBy="extension">
    <element name="state" type="us-state"/>
    <element name="zip" type="positiveInteger"/>
  </complexType>
  <complexType name="uk-address" base="address" derivedBy="extension">
    <element name="postcode" type="uk-postcode"/>
    <attribute name="export-code" type="positiveInteger" use="fixed" value="1"/>
  </complexType>
XML Schema also provides inheritance mechanisms. The first is called inheritance by extension and consists basically of subtyping by adding elements. We see this in the example above: a base type Address is refined into two more specialized types, USAddress and UKAddress (us-address and uk-address in the fragment), by adding the corresponding elements and attributes.

94 Using Derived Types
  <shipto exportcode="1" xsi:type="ipo:UKAddress">
    <name>Helen Zoe</name>
    <street>47 Eden Street</street>
    <city>Cambridge</city>
    <postcode>CB1 1JR</postcode>
  </shipto>
  <billto xsi:type="ipo:USAddress">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billto>
With inheritance it is always possible to use a derived type where the base type is expected. In our example, if a document has to comply with a schema where the type Address is expected for shipto/billto, the above document fragment would be valid. However, there is an important rule: when a derived type is used, it must be explicitly specified in the document instance via the xsi:type attribute, as shown in the example.
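The substitution rule can be sketched as a lookup along the derivation chain. This is a hedged illustration, not how a schema processor is actually implemented: the table BASE_OF, both helper functions and the sample fragment are hypothetical, and namespace prefixes in xsi:type values are simply stripped instead of being resolved.

```python
import xml.etree.ElementTree as ET

# Attribute name of xsi:type in ElementTree's {namespace}local notation
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical derivation table: type name -> base type (None for a root type)
BASE_OF = {"Address": None, "USAddress": "Address", "UKAddress": "Address"}


def is_derived_from(type_name, expected):
    """Walk up the derivation chain until `expected` is found or the chain ends."""
    while type_name is not None:
        if type_name == expected:
            return True
        type_name = BASE_OF.get(type_name)
    return False


def effective_type_ok(element, declared_type):
    """An element may announce a derived type explicitly via xsi:type."""
    qname = element.get(XSI_TYPE)
    if qname is None:
        return True                       # the declared type itself is used
    local_name = qname.split(":")[-1]     # simplification: ignore the prefix
    return is_derived_from(local_name, declared_type)


fragment = ET.fromstring(
    '<order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<shipto xsi:type="ipo:UKAddress"/>'
    '<billto xsi:type="ipo:Items"/>'
    "</order>"
)
print([effective_type_ok(child, "Address") for child in fragment])
```

The shipto element is accepted because UKAddress is derived from Address, while the billto element is rejected because Items is not part of the derivation table.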

95 Derived Types by Restriction
  <complexType name="Items">
    <sequence>
      <element name="item" minOccurs="0" maxOccurs="unbounded">
        <complexType>
          <sequence>
            <element name="productname" type="string"/>
            <element name="quantity"> ... </element>
          </sequence>
        </complexType>
      </element>
    </sequence>
  </complexType>
  <complexType name="ConfirmedItems">
    <restriction base="ipo:Items">
      <sequence>
        <element name="item" minOccurs="1" maxOccurs="unbounded">
          <complexType> same as before </complexType>
        </element>
      </sequence>
    </restriction>
  </complexType>
Restriction can be used to derive a more specialized type by adding constraints. The constraints that can be added concern cardinality (minOccurs, maxOccurs), default values and fixed values. We see this in the example: the type ConfirmedItems is derived from Items by imposing a more restrictive constraint on minOccurs. It is important to note that the complete content model of the base type needs to be repeated when defining a derived type by restriction (this is different from derivation by extension, where only the additional elements and attributes need to be given!).

96 XML Schema: Integrity Constraints
  <element name="purchasereport">
    <complexType>
      <element name="regions" type="regionstype"/>
      <element name="parts" type="partstype"/>
      <attribute name="period" type="timeDuration"/>
      <attribute name="periodending" type="date"/>
    </complexType>
    <unique>
      <selector>regions/zip</selector>
    </unique>
    <key name="pnumkey">
      <selector>parts/part</selector>
    </key>
    <keyref refer="pnumkey">
      <selector>regions/zip/part</selector>
    </keyref>
  </element>
  (the selector contents are XPath expressions)
  <complexType name="regionstype">
    <element name="zip" minOccurs="1" maxOccurs="unbounded">
      <complexType>
        <element name="part">
          <complexType content="empty">
            <attribute name="number" type="sku"/>
            <attribute name="quantity" type="positiveInteger"/>
          </complexType>
        </element>
        <attribute name="code" type="positiveInteger"/>
      </complexType>
    </element>
  </complexType>
  <complexType name="partstype">
    <element name="part" minOccurs="1" maxOccurs="unbounded">
      <complexType content="textOnly">
        <attribute name="number" type="sku"/>
      </complexType>
    </element>
  </complexType>
XML Schema also supports the standard key integrity constraints known from relational databases. We see in this example a complex type definition. In the element declaration (on the left-hand side of the slide) three integrity constraints are declared:
1. A uniqueness constraint on the zip code is specified. This requires two things. First, the context in which the uniqueness constraint has to hold must be determined. This is achieved by specifying an XPath expression (regions/zip; as usual the current element is used as navigation context) which selects a set of elements. Second, the element/attribute that should be unique within that context must be specified. This is done by a second XPath expression, relative to the selected elements. In the example this is the zip code attribute: it should have unique values, or may alternatively be missing, within the selected scope.
Values of attributes in other contexts are not affected by this integrity constraint.
2. A primary key constraint: This follows the same pattern as the uniqueness constraint. The main difference is that each selected element is required to actually have a corresponding value (i.e. it cannot be missing, resp. NULL). The primary key is given a name, such that it can be referred to by foreign keys.
3. Foreign key constraints: These are specified by first giving the name of a primary key. In the example this is the attribute "refer" of the element "keyref". Then, similarly as before, the context for which the constraint has to hold is determined, and the element/attribute for which the foreign key value is specified is selected within that context.
Remark: one may think of the path expression selecting the context as the selection/definition of a relation, and of the selection of an element/attribute within the context as the selection of an attribute of that relation.

97 Integrity Constraints at Instance Level
  <purchasereport xmlns="..." period="P3M" periodending="...">
    <regions>
      <zip code="95819">
        <part number="872-AA" quantity="1"/>
        <part number="926-AA" quantity="1"/>
        <part number="833-AA" quantity="1"/>
        <part number="455-BX" quantity="1"/>
      </zip>
      <zip code="63143">
        <part number="455-BX" quantity="4"/>
      </zip>
    </regions>
    <parts>
      <part number="872-AA">Lawnmower</part>
      <part number="926-AA">Baby Monitor</part>
      <part number="833-AA">Lapis Necklace</part>
      <part number="455-BX">Sturdy Shelves</part>
    </parts>
  </purchasereport>
From this instance, which respects the integrity constraints from the schema, we see that uniqueness only has to hold within the context for which the constraint has been defined. The value "872-AA" of the attribute "number" is unique only among the part elements within parts (the context of the key), whereas part numbers occur again, and even repeatedly, in other contexts such as regions/zip/part.
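The three constraints can be checked by hand on a shortened version of this instance. This is a minimal sketch with the Python standard library, not a schema validator: all function names are hypothetical, the namespace declaration is omitted, and the constrained fields (the code attribute for unique, the number attribute for key and keyref) are assumptions, since the field expressions are not visible in the schema fragment.

```python
import xml.etree.ElementTree as ET

REPORT = """
<purchasereport>
  <regions>
    <zip code="95819">
      <part number="872-AA" quantity="1"/>
      <part number="455-BX" quantity="1"/>
    </zip>
    <zip code="63143">
      <part number="455-BX" quantity="4"/>
    </zip>
  </regions>
  <parts>
    <part number="872-AA">Lawnmower</part>
    <part number="455-BX">Sturdy Shelves</part>
  </parts>
</purchasereport>
"""


def field_values(root, selector, attribute):
    """Selector path picks the context elements, the attribute is the field."""
    return [element.get(attribute) for element in root.findall(selector)]


def is_unique(values):
    """unique: values that are present must be distinct; missing ones are allowed."""
    present = [v for v in values if v is not None]
    return len(present) == len(set(present))


def is_key(values):
    """key: like unique, but no value may be missing."""
    return None not in values and is_unique(values)


def keyref_holds(referring, key_values):
    """keyref: every referring value that is present must match a key value."""
    return all(v in key_values for v in referring if v is not None)


root = ET.fromstring(REPORT)
zip_codes = field_values(root, "regions/zip", "code")
part_keys = field_values(root, "parts/part", "number")
ordered = field_values(root, "regions/zip/part", "number")

print(is_unique(zip_codes), is_key(part_keys))
print(keyref_holds(ordered, set(part_keys)))
print(is_unique(ordered))  # part numbers repeat outside the key's context
```

The last check illustrates the point of the slide: the part number 455-BX appears twice under regions, which is fine because no uniqueness constraint was declared for that context.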

98 4.5 XQuery - Querying XML Data
Problem: XPath lacks basic capabilities of database query languages, in particular the join capability and the creation of new XML structures.
XQuery extends XPath to remedy this problem.
Additional concepts in XQuery:
  Extended path expressions
  Element constructors
  FLWR expressions
  Expressions involving operators and functions
  Conditional expressions
  Quantified expressions
It should have become clear that XPath lacks basic capabilities one would expect from a (declarative, set-oriented) database query language. In particular, it has no general support for set operators; it can return only sets of elements and attributes and is thus not closed; and it has no support for an operation equivalent to a relational join, which would be required to establish value-based relationships among parts of XML documents. We now introduce the most important additional concepts of XQuery as they were specified in June. The goal is not to obtain a thorough knowledge of XQuery and the ability to use it, but to understand the substantial concepts it adds to XPath and its relationship to SQL. We must be aware that this standard is not yet finalized, and we may expect certain changes (probably not major ones). With a knowledge of XPath and SQL, the following presentation of XQuery should be straightforward to follow.

99 Dereference Operator
  document("zoo.xml")//chapter[title = "Frogs"]//figref/@refid->fig/caption
Find captions of figures that are referenced by figref elements in the chapter of "zoo.xml" with title "Frogs".
Example document (zoo.xml):
  <chapter>
    <title>Apples</title>
    <para>
      <fig id="1">
        <caption>this is a figure</caption>
      </fig>
    </para>
  </chapter>
  <chapter>
    <title>Frogs</title>
    <references>
      <figref refid="1"/>
    </references>
  </chapter>
Navigation in XQuery is possible over IDREF attributes. In this example, first the figref element is located, from which the IDREF value is taken; then the fig element with that ID value is located, leading to the result.
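The two navigation steps of the dereference can be imitated with the Python standard library. This is a rough sketch, not an XQuery implementation: the wrapping book root element is an assumption added to make the sample a well-formed document, and the ID lookup is built by hand as a dictionary because ElementTree has no notion of ID/IDREF typing.

```python
import xml.etree.ElementTree as ET

ZOO = """
<book>
  <chapter>
    <title>Apples</title>
    <para><fig id="1"><caption>this is a figure</caption></fig></para>
  </chapter>
  <chapter>
    <title>Frogs</title>
    <references><figref refid="1"/></references>
  </chapter>
</book>
"""

root = ET.fromstring(ZOO)

# Index all fig elements by their id attribute (the targets of the IDREFs)
figs_by_id = {fig.get("id"): fig for fig in root.iter("fig")}

captions = []
for chapter in root.iter("chapter"):
    if chapter.findtext("title") == "Frogs":          # chapter[title = "Frogs"]
        for figref in chapter.iter("figref"):         # //figref
            target = figs_by_id.get(figref.get("refid"))   # @refid->fig
            if target is not None:
                captions.append(target.findtext("caption"))  # /caption

print(captions)
```

Note that the referenced figure lives in a different chapter than the reference, so the result cannot be reached by purely hierarchical navigation from the "Frogs" chapter.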

100 Element Constructor and FLWR Expressions
  <result> {
    LET $a := avg(document("bib.xml")//book/price)
    FOR $b IN document("bib.xml")//book
    WHERE $b/price > $a
    RETURN
      <expensive_book>
        {$b/title}
        <price_difference> {$b/price - $a} </price_difference>
      </expensive_book>
  } </result>
Slide annotations: LET acts like a "macro", FOR corresponds to "FROM" in SQL, RETURN to "SELECT".
For each book whose price is greater than the average price, return the title of the book and the amount by which the book's price exceeds the average price.
This query exhibits a whole wealth of new concepts of XQuery. Most notably, one can see that XQuery allows variable binding like SQL. Unlike in SQL, where relations are used to bind variables, in XQuery variables have to be bound to sets that result from other XQuery expressions (this is the only possibility to obtain sets). This is done in the FOR clause. In addition, the LET clause can be used to introduce variables that factor out repeatedly occurring expressions in the queries. Note that these variables are used differently from the ones bound to set-valued expressions: they are just syntactically replaced in the query. The second observation is that a WHERE clause is available to express conditions. In particular, this makes it possible to express joins when multiple variables are bound in the FOR clause. The third observation is that a RETURN clause makes it possible to return structured results, creating new XML document fragments. Finally, we see that a query expression can itself be nested within an XML document fragment, as illustrated by the expression $b/title.
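To make the roles of the four clauses concrete, the same query can be written out imperatively in Python. This is an illustrative sketch only: the three-book bib.xml content is invented sample data, and the element construction via ElementTree merely stands in for XQuery's element constructors.

```python
import xml.etree.ElementTree as ET

# Invented sample data standing in for bib.xml
BIB = """<bib>
  <book><title>A</title><price>10</price></book>
  <book><title>B</title><price>30</price></book>
  <book><title>C</title><price>50</price></book>
</bib>"""

root = ET.fromstring(BIB)
books = root.findall(".//book")

# LET: bind $a once to the average price
avg = sum(float(b.findtext("price")) for b in books) / len(books)

result = ET.Element("result")
for b in books:                                   # FOR $b IN ...//book
    price = float(b.findtext("price"))
    if price > avg:                               # WHERE $b/price > $a
        item = ET.SubElement(result, "expensive_book")          # RETURN
        ET.SubElement(item, "title").text = b.findtext("title")
        ET.SubElement(item, "price_difference").text = str(price - avg)

print(ET.tostring(result, encoding="unicode"))
```

With this sample data the average price is 30, so only book C qualifies; book B does not, because the comparison is strict.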


101 FLWR Expression Evaluation
The semantics of XQuery expressions is defined similarly to that of SQL (which is, in short: build the Cartesian product of the relations in the FROM clause, evaluate the predicates in the WHERE clause, and then project on the attributes in the SELECT clause). For a FLWR expression, too, one first generates all tuples of the Cartesian product of the sets to which the variables are bound. An important difference is that the order among document elements needs to be preserved; therefore the order in which the variables appear in the FOR clause also has an impact on the order in which the result tuples are sorted. The WHERE clause is evaluated as in SQL, and for each remaining tuple an XML document fragment is generated by replacing the variables with the tuple values. There is also an XML query algebra under development, which is intended to provide a precise semantics for XQuery.
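The evaluation scheme described above (Cartesian product, filter, construct) can be sketched generically. This is a simplified model under stated assumptions, not the formal semantics: the function flwr and the sample sequences are hypothetical, and plain Python tuples stand in for XML elements.

```python
from itertools import product


def flwr(bindings, where, ret):
    """Evaluate a FOR/WHERE/RETURN expression naively.

    bindings: list of (variable name, sequence) pairs in FOR-clause order.
    """
    names = [name for name, _ in bindings]
    results = []
    # Cartesian product; the first variable varies slowest, so the order of
    # the FOR clause determines the order of the result.
    for values in product(*(sequence for _, sequence in bindings)):
        env = dict(zip(names, values))
        if where(env):                    # WHERE: keep qualifying tuples
            results.append(ret(env))      # RETURN: build one result per tuple
    return results


# Hypothetical data: (id, title) pairs and (book id, remark) pairs
books = [("b1", "Apples"), ("b2", "Frogs")]
reviews = [("b2", "good"), ("b1", "dense"), ("b1", "long")]


def same_book(t):
    return t["b"][0] == t["r"][0]         # value-based join condition


def title_and_remark(t):
    return (t["b"][1], t["r"][1])


by_book = flwr([("b", books), ("r", reviews)], same_book, title_and_remark)
by_review = flwr([("r", reviews), ("b", books)], same_book, title_and_remark)
print(by_book)
print(by_review)
```

Both calls return the same three pairs, but in a different order: binding the books first yields the results in book order, binding the reviews first yields them in review order.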

