Documents and markup languages

The term XML stands for Extensible Markup Language. A markup language is used to label the different parts of documents. Labeling helps in:
- displaying the documents in a formatted way
- querying the document based on content

SGML: Standard Generalized Markup Language
A very general specification of how to mark up a document. It is complex and is used by specialized publishers.

HTML: HyperText Markup Language
A subset (application) of SGML. HTML conforms to the SGML specification for markup and defines the tags that can be used for web pages, but it has one significant limitation: the tags that you can use are fixed by the W3C specification.

XML: Extensible Markup Language
A subset of SGML with user-defined tags, for both displaying and logically structuring documents.

XML and the DBMS

When a student transcript is sent as an XML document, the receiving organization is likely to receive transcripts for many students and, therefore, it needs some way of processing the XML document to extract and store the relevant data in its own database.

    <studenttranscript>
      <studentid>s05</studentid>
      <studentname>Ellis</studentname>
      <enrolments>
        <course>
          <coursecode>c2</coursecode>
          <title>Syntax</title>
          <credit>30</credit>
        </course>
      </enrolments>
    </studenttranscript>

Structure of XML documents

An XML document is a sequence of characters, which is partitioned into groups that are treated as either markup or character data.

Elements

The main structure of a document is determined by its elements, where an element is bounded by a start-tag (<element-name>) and an end-tag (</element-name>):

    <element-name ...> element content </element-name>

A tag's element-name is a label that begins with a letter and continues with letters, digits or symbols such as a hyphen, underscore or full stop (a colon is also allowed but it has a special meaning). Whatever appears between a start-tag and its matching end-tag is the content of the element (character data or other elements).
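As a rough illustration of how a receiving organization might pull element names and content out of such a document, the sketch below parses a copy of the transcript with Python's standard xml.etree.ElementTree module. The choice of Python and of this module is an assumption made for illustration; the notes do not prescribe any particular tool.

```python
import xml.etree.ElementTree as ET

# A copy of the student transcript from the notes, with matching tag names.
TRANSCRIPT = """\
<studenttranscript>
  <studentid>s05</studentid>
  <studentname>Ellis</studentname>
  <enrolments>
    <course>
      <coursecode>c2</coursecode>
      <title>Syntax</title>
      <credit>30</credit>
    </course>
  </enrolments>
</studenttranscript>
"""

# Parsing separates the markup (tags) from the character data (text).
root = ET.fromstring(TRANSCRIPT)

# The content of the root element is a sequence of child elements.
child_names = [child.tag for child in root]

# The content of the title element is character data.
title = root.find("enrolments/course/title")

print(root.tag)               # element name of the outermost element
print(child_names)            # element names of its children
print(title.tag, title.text)  # an element name and its character data
```

The values extracted this way (student id, course code, credit and so on) are what the receiving organization would then store in its own database.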
An empty element, which has no content, is written as <element-name .../>, where the ending /> signifies that there is no matching end-tag.

Examples from the transcript:
- <title>Syntax</title>: the element name is title and the character data content of the element is Syntax.
- The complex element beginning with the start-tag <studenttranscript> and finishing with the matching end-tag </studenttranscript>: the element name is studenttranscript and its content is the sequence of elements named studentid, studentname and enrolments, each of which has its own content.

A document obeying the following rules is said to be well-formed XML:
- Every start-tag must have a matching end-tag.
- A document has one root element.
- Every other element must be contained within some parent element, which requires that each pair of matching tags is nested within another pair.

Defining documents

The original means of defining a document is given as part of the XML specification and is known as a document type definition (DTD). A DTD defines what can belong in a document.

A DTD can define the structure of an XML document in many ways. It allows elements to have very varied content, such as containing both character data and elements (known as mixed content) or containing elements that are optional or unordered. There are more limited variations for attributes, since they have just a single character data value; for example, they can be optional and can have a default value.

Properties of a DTD:
1- The ability to define an entity, which is a named value that can be included within character data.
2- The data defined by a DTD for both element content and attribute values is always character data. There is no way to define a data type such as integer or date.
3- The ordering of definitions in a DTD does not matter: you can mix up the element and attribute definitions without affecting their meaning, since the DTD still determines the same structure.
4- An entity (a named value that can be included within character data) is referenced by giving its name embedded between & and ;. When the document is parsed, the name is replaced by its value.

Alternatives to DTDs: XML Schema

The main alternative, developed by the W3C, is called XML Schema. It provides the means to define the structure of an XML document in a way that is different from a DTD, and it also specifies a range of data types that can be used to define the values allowed within an XML document. An XML schema is itself an XML document, with its own schema to determine how it should be written. While this kind of schema meets the requirements of data-oriented applications, it is considered too complex for some uses of XML.
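The entity mechanism and the character-data-only nature of a DTD, both listed among the DTD properties above, can be sketched as follows. This is a minimal example that assumes Python's standard xml.etree.ElementTree module; the element names provider and the entity name uni, along with its value, are made up for illustration. Note that this parser reads the internal DTD in order to expand entities but does not validate the document against it.

```python
import xml.etree.ElementTree as ET

# A small document with an internal DTD. The entity "uni" and the
# "provider" element are hypothetical, invented for this sketch.
DOCUMENT = """\
<?xml version="1.0"?>
<!DOCTYPE course [
  <!ELEMENT course (title, credit, provider)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT credit (#PCDATA)>
  <!ELEMENT provider (#PCDATA)>
  <!ENTITY uni "Example University">
]>
<course>
  <title>Syntax</title>
  <credit>30</credit>
  <provider>&uni;</provider>
</course>
"""

root = ET.fromstring(DOCUMENT)

# The reference &uni; has been replaced by the entity's value during parsing.
provider_text = root.find("provider").text

# A DTD can only say that credit holds character data (#PCDATA), so the
# application receives a string and must convert it to a number itself.
credit_text = root.find("credit").text
credit_value = int(credit_text)

print(provider_text)
print(type(credit_text).__name__, credit_value)
```

With XML Schema, by contrast, the schema itself could state that credit holds an integer, so the constraint would not need to live only in application code.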
Alternatives to DTDs: Relax NG schema

Relax NG is a schema language that is an evolution and generalization of a DTD. It focuses on defining the structure of an XML document, though it can also define values by using the data types specified for XML Schema. A Relax NG schema can be written either as an XML document or with a more compact syntax.

Schematron is another kind of schema language, which is based on writing assertions about the tree structure of an XML document.

Difference between database and XML schemas

A database must have a schema to define its tables before you can enter data. An XML document, by contrast, does not need a DTD or schema: as long as its markup follows the rules, it is legitimate, well-formed XML.

Document prolog: XML and encoding declarations

External DTD: the DTD is kept in a file and the document type declaration references the file using a system identifier. Such a system identifier can take various forms, since the DTD can be anywhere on the internet, but the simplest option is to place the DTD in the same folder as the referencing XML document, in which case you need only the file name.

Document prolog: processing instructions and comments

Another kind of markup in a prolog is called a processing instruction (PI). The purpose of a PI is to specify how an XML document may be processed, and so the first part of this markup identifies the kind of processing to which the PI relates. A common requirement is to display an XML document according to a stylesheet that specifies the format and appearance of the document.

Processing XML documents

Parsing XML

An XML parser takes as input the sequence of characters from an XML document and analyses it to separate its markup and character data. A parser can then make the character data available to an XML application in various ways, depending on the interface it provides to applications.
While parsing is common to processing all kinds of language, XML has some particular features that we need to examine so that you can appreciate what is happening.

How the parser can provide the data

- A parser may process the input document sequentially and extract the data requested by the XML application as it does so, without retaining a copy of the parsed XML.
- A parser may process the input document and construct an internal representation of the parsed XML, which is available for further requests from the XML application.

The Document Object Model (DOM)

When the parsed XML is retained, it is kept as a tree structure and it can be accessed by an XML application via an interface defined as the Document Object Model (DOM). This interface also allows the parsed tree to be updated.
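The two ways a parser can provide data might be sketched as below, assuming Python's standard library: xml.sax for sequential processing and xml.dom.minidom for a retained DOM tree. The class name TitleCollector and the replacement credit value 60 are hypothetical, chosen only for this illustration.

```python
import xml.sax
from xml.dom import minidom

XML_TEXT = (
    "<enrolments><course><coursecode>c2</coursecode>"
    "<title>Syntax</title><credit>30</credit></course></enrolments>"
)


class TitleCollector(xml.sax.ContentHandler):
    """Sequential style: keeps only the title text, not the parsed document."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Character data may arrive in several pieces, so buffer it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


# 1. Sequential parsing: data is extracted as the input is read.
collector = TitleCollector()
xml.sax.parseString(XML_TEXT.encode("utf-8"), collector)
print(collector.titles)

# 2. DOM parsing: the whole tree is retained and can be queried and updated.
dom = minidom.parseString(XML_TEXT)
credit_node = dom.getElementsByTagName("credit")[0]
credit_node.firstChild.data = "60"  # update the text node in the tree
updated = dom.documentElement.toxml()
print(updated)
```

The sequential style suits large documents where only a few values are needed, while the DOM style suits applications that need to revisit or modify the parsed document.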
Selecting XML content

Of the two approaches to selecting content, XPath and XQuery, XPath is the simpler, and it applies to the tree structure of a parsed XML document. It is similar to the way files and folders can be referenced by a path. An XML tree in XPath has one significant difference from the tree of elements described previously: its root is the whole document. This means that the XPath tree includes processing instructions and comments that are part of a document's prolog but, because the document has been parsed, it does not include things like entities. An absolute path starts from the document root (referenced by /) and then specifies the steps to the required elements of the document.

XQuery

XQuery is a query language that provides querying capabilities comparable to SQL, and thus can be quite complex. However, it has one form that makes direct use of XPath. In this form, the XML document to be queried is in the file given as the argument of the XQuery function doc, which is followed by the XPath expression that selects the elements to be returned. The result of the XQuery expression is the element selected by that path. There is an alternative and more general way of writing XQuery that is directly comparable to SQL.

XQuery is not the only use of XPath. XPath is also used for XPointer, a way for one XML document to reference another, and in transforming one XML document into another, which is considered next. XPath and XQuery are also needed when we consider the use of XML in databases.

Transforming XML documents

There are two kinds of transformation. First, an XML document can be transformed into another XML document, and the W3C specification of how to do this is known as XSLT (Extensible Stylesheet Language Transformations). Secondly, an XML document can be transformed into some other format suitable for output, such as a PDF file, and the W3C specification of how to do this is known as XSL-FO (Extensible Stylesheet Language Formatting Objects).
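Path-based selection, as described for XPath above, can be tried out with the limited XPath subset supported by Python's standard xml.etree.ElementTree module. This is only a sketch under that assumption: ElementTree evaluates paths relative to an element rather than from the document root, so the full XPath absolute path /studenttranscript/enrolments/course/title is written here as a path relative to the root element.

```python
import xml.etree.ElementTree as ET

TRANSCRIPT = """\
<studenttranscript>
  <studentid>s05</studentid>
  <studentname>Ellis</studentname>
  <enrolments>
    <course>
      <coursecode>c2</coursecode>
      <title>Syntax</title>
      <credit>30</credit>
    </course>
  </enrolments>
</studenttranscript>
"""

root = ET.fromstring(TRANSCRIPT)

# Step-by-step path from the root element down to the title elements.
titles = [e.text for e in root.findall("./enrolments/course/title")]

# "//" style search: any title element at any depth below the root.
titles_anywhere = [e.text for e in root.findall(".//title")]

# A predicate selects only the course elements whose credit child is '30'.
codes_with_30_credits = [
    course.findtext("coursecode")
    for course in root.findall("./enrolments/course[credit='30']")
]

print(titles)
print(titles_anywhere)
print(codes_with_30_credits)
```

A full XPath or XQuery processor offers much more than this subset, but the idea of naming the steps from the root to the required elements is the same.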
Comparing XML and relational data

The main features of relational data are as follows.
- Relational data is held as atomic values in a tabular structure with a unique name.
- Columns of a table are identified by name within the table and all values in a column are of the same type; the order of columns is not significant.
- Rows of a table are distinct, distinguished by the values in each row (particularly the primary key); the order of rows is not significant.
- Access to and manipulation of data are expressed in terms of table operations that only involve value specifications (i.e. there is no concept of location of data by row number or column number).
- Relations are logical structures that do not have any storage implications.

The main features of XML data are as follows.
- XML data is held as nested elements in a tree structure with a named root element.
- Elements in a tree are named; they can have attributes with values and can contain other elements or character data, which are all represented as character strings, though an attribute or character data value can have a data type determined by a schema.
- An element is distinguished by its location in the tree structure, specified as a path in terms of named elements and sequence numbers from the root element.
- Access to and manipulation of elements and their contents are expressed in terms of operations that are based on the location of elements in the tree.
- XML is a logical structure with a specified storage representation as a sequence of characters.

Embedded SQL

Embedded SQL is SQL that is embedded in programs. There is a need to understand the relationship between SQL and the programming language that is used to write the program (the host language), as well as the means of transferring data between a compiled program and a DBMS.

Processing embedded SQL

This section considers how the source code is processed to give an executable program. There are two main factors relating to this issue.
1- An embedded SQL program is a hybrid, but the compilers that are used to produce an executable program cannot cope with such a mixture of languages: they are designed for, say, pure C or pure Fortran. There needs to be some mechanism to convert SQL statements into a form acceptable to a language compiler.
2- SQL needs to be portable, so that each SQL statement works in the same way for different DBMSs. However, each DBMS is produced with its own interface, designed by the vendor to implement the range of capabilities required to manage a database, with different ways in which these capabilities are invoked. This is called the native interface to the DBMS. Because the native interface is different for each DBMS, there is no standard way of converting SQL into a form that is understandable to a compiler.

The solution to this problem is that each DBMS vendor provides what is called a pre-compiler for each language. A pre-compiler processes a hybrid program of SQL and host language source code into pure source code for that language by replacing all SQL statements with requests to the native interface supported by that DBMS.
A compiler for the host language can now process the result of the pre-compilation without having to be aware of any SQL: it is pure source code.

EXERCISE 2.1
It is a requirement of SQL that it is portable. Explain why it is only the source code for an embedded SQL program that is portable. Describe what must be done when an embedded SQL program, working with one vendor's DBMS, is required to work with another vendor's DBMS.

SOLUTION
Only the source code for an embedded SQL program is portable because the pre-compilation processing is different for each DBMS and results in object code that will only work with that DBMS; that is, the object code is not portable. Transferring an embedded SQL program to another vendor's DBMS requires the original source code to be processed using that vendor's pre-compiler, and then compiled to produce another version of the object code.
ODBC (Open Database Connectivity)

ODBC is not embedded SQL, though it is used embedded in programs and it does involve SQL.

What is ODBC?

In general computing terms, ODBC is an application programming interface (API). It was developed by DBMS vendors, mainly Microsoft, to respond to a problem with embedded SQL: a compiled embedded SQL program is not portable between different vendors' DBMSs. Software developers wanted to produce compiled, shrink-wrapped packages that would work with any DBMS.

The source code for the program is written with requests to the ODBC interface expressed using normal programming language invocation, so that it can be compiled without any pre-processing. The compiled object code, executing as an application process, submits requests for database access via the ODBC interface.

EXERCISE 2.6
Explain why an ODBC driver for a data source needs to be appropriate for the DBMS to be used.

SOLUTION
An ODBC driver converts database requests expressed in terms of the ODBC interface into requests expressed in terms of the native interface for the DBMS being used. The ODBC driver can do this only if the driver is appropriate, that is, if it is written so that it can interact with the native interface of that DBMS.
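The idea of submitting SQL through ordinary function calls, rather than embedding SQL statements that need a pre-compiler, can be illustrated by analogy. The sketch below does not use ODBC itself: it assumes Python's standard sqlite3 module, which follows Python's own call-level database interface, and the course table and its columns are hypothetical, modelled on the transcript data used earlier in these notes.

```python
import sqlite3

# Connect through a call-level interface. With ODBC the connection would
# name a data source and a driver; here an in-memory SQLite database
# stands in for the DBMS.
conn = sqlite3.connect(":memory:")

# SQL is passed as plain strings to normal function calls, so the program
# is pure host-language source code and needs no pre-compilation.
conn.execute(
    "CREATE TABLE course (coursecode TEXT PRIMARY KEY, title TEXT, credit INTEGER)"
)
conn.execute("INSERT INTO course VALUES (?, ?, ?)", ("c2", "Syntax", 30))

# Values are transferred between program and DBMS as call parameters
# and as returned rows.
rows = conn.execute(
    "SELECT title, credit FROM course WHERE coursecode = ?", ("c2",)
).fetchall()
conn.close()

print(rows)
```

In the ODBC case, the translation from such interface calls to the native interface of a particular DBMS is the job of the driver, which is why the driver must match the DBMS in use.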