Introduction to XML

CEMIS, University of Nizwa, Oman

XML stands for Extensible Markup Language. It is a markup language developed by the W3C (World Wide Web Consortium), mainly to overcome limitations in HTML. HTML is an immensely popular and successful markup language, but it has some major shortcomings. XML was developed to address these shortcomings, not to replace HTML.

When talking about XML, a few terms are helpful:

XML: Extensible Markup Language, a standard created by the W3C for marking up data.
DTD: Document Type Definition, a set of rules defining relationships within a document. DTDs can be "internal" (within a document) or "external" (a link to another document).
XML Parser: Software that reads XML documents and interprets, or "parses", the code according to the XML standard. A parser is needed to perform actions on XML, such as comparing an XML document to a DTD.

XML Anatomy

If you have ever done HTML coding, creating an XML document will seem very familiar. Like HTML, XML is based on SGML (Standard Generalized Markup Language) and is designed for use with the Web. If you haven't coded in HTML before, you should find creating HTML documents easy after creating an XML document.

XML documents, at a minimum, are made of two parts: the prolog and the content.

1. The prolog, or head of the document, usually contains the administrative metadata about the rest of the document. It holds information such as the version of XML used, the character set used, and the DTD, either included internally or linked as an external file.
2. The content is usually divided into two parts: the structural markup and the content contained in the markup, which is usually plain text.

Let's take a look at a simple prolog for an XML document:

    <?xml version="1.0" encoding="iso-8859-1"?>

<?xml declares to a processor that this is where the XML document begins.
version="1.0" declares which recommended version of XML the document should be evaluated against.
encoding="iso-8859-1" identifies the standardized character set used to write the markup and content of the XML.

The structural markup consists of elements, attributes, and entities; we will focus primarily on elements and attributes.

Elements have a few particular rules:

1. Element names can be any mixture of characters, with a few exceptions. However, element names are case sensitive, unlike in HTML. For instance, <elementname> is different from <ELEMENTNAME>, which is different from <ElementName>.

Note: The characters that are excluded from element names in XML are &, <, ", and >, which XML uses to indicate markup. The colon (:) should also be avoided, as it is used for special extensions in XML. If you want to use these restricted characters as part of the content within elements, without creating new elements, use the following entities:

XML Entity Names for Restricted Characters

    Use       For
    &amp;     &
    &lt;      <
    &gt;      >
    &quot;    "
2. Elements containing content must have opening and closing tags: <elementname> (opening) and </elementname> (closing). Note that the closing tag is exactly the same as the opening tag, but with a forward slash (/) in front of the element name.

The content within elements can be either elements or character data. If an element has additional elements within it, it is considered a parent element; those contained within it are called child elements. For example:

    <elementname>This is a sample of <anotherelement>simple XML</anotherelement> coding</elementname>

In this example, <elementname> is the parent element. <anotherelement> is the child of elementname, because it is nested within elementname.

Elements can have attributes attached to them in the following format:

    <elementname attributename="attributevalue">

While attributes can be added to elements in XML, there are a few reasons to use attributes sparingly:

- XML parsers have a harder time checking attributes against DTDs.
- If the information in the attribute is valuable, why not contain that information in an element?
- Since some attributes can only have predefined categories, you can't go back and easily add new categories.

We recommend using attributes for information that isn't absolutely necessary for interpreting the document, or that has a predefined number of options that will not change in the future.

When using attributes in XML, the value of the attribute must always be enclosed in quotes. The quotes can be either single or double quotes. For example, the attribute version="1.0" in the opening XML declaration could be written version='1.0' and would be interpreted the same way by the XML parser. However, if the attribute value itself contains quotes, it is necessary to use the other style of quotation marks to enclose the value. For example, an attribute name with a value of John "Q." Public would need to be marked up in XML as name='John "Q." Public', using the quotation marks that do not appear in the value itself.
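The element and attribute rules above can be tried out with a parser. The following is a minimal sketch in Python, assuming the standard library modules xml.etree.ElementTree and xml.sax.saxutils. The <note>, <part>, and <person> element names and the helper is_well_formed are made up for illustration and do not come from the examples in this handout.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Hypothetical helper: True if the parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Element names are case sensitive, so <part> is not closed by </Part>.
matching_case = is_well_formed("<note><part>text</part></note>")
mismatched_case = is_well_formed("<note><part>text</Part></note>")

# Restricted characters in content are written as entities.
raw_text = 'Tom & Jerry: 2 < 3 and 3 > 2, "quoted"'
# escape() handles &, < and >; the extra mapping adds the double quote.
escaped = escape(raw_text, {'"': "&quot;"})
# The parser turns the entities back into the original characters.
round_trip = ET.fromstring("<note>" + escaped + "</note>").text

# Single and double quotes around an attribute value are equivalent.
double_quoted = ET.fromstring('<person name="John Public"/>').get("name")
single_quoted = ET.fromstring("<person name='John Public'/>").get("name")

# A value containing double quotes is enclosed in single quotes.
quoted_inside = ET.fromstring("""<person name='John "Q." Public'/>""").get("name")

print(matching_case, mismatched_case)
print(escaped)
print(round_trip == raw_text, double_quoted == single_quoted)
print(quoted_inside)
```

The parser is only used here to confirm the rules; any conforming XML parser should accept and reject the same inputs.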
Creating a Simple XML Document

Now that you know the basic rules for creating an XML document, let's try them out. Like most, if not all, standards developed by the W3C, XML documents can be created using a plain text editor like Notepad (PC), TextEdit (Mac), or Pico (UNIX). You can also use programs like Dreamweaver and Cooktop, but all that is necessary to create the document is a text editor.

Let's say we have two types of documents we would like to wrap in XML: emails and letters. We want to encode the emails and letters because we are creating an online repository of archival messages within an organization or by an individual. By encoding them in XML, we hope to encode their content once and be able to translate it to a variety of outputs, like HTML, PDFs, or types not yet created.

To begin, we need to declare an XML version:

    <?xml version="1.0" encoding="iso-8859-1"?>

Now, after declaring the XML version, we need to determine the root element for the documents. Let's use message as the root element, since both emails and letters can be classified as messages.

    <?xml version="1.0" encoding="iso-8859-1"?>
    <message>
    </message>

Note: You might have noticed that I created both the opening and closing tags for the message element. When creating XML documents, it is useful to create both the opening and closing tags at the same time and then fill in the content. Forgetting to close an element is a fatal error in XML, so if you write the opening and closing tags together each time you create an element, you won't accidentally forget to do so.
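To see why an unclosed element is treated as a fatal error, here is a small sketch in Python, assuming the standard library module xml.etree.ElementTree. It builds the empty message root element programmatically, so the closing tag cannot be forgotten, and then shows that a parser refuses a document whose root is never closed. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Building the element through the API always produces a matching closing tag.
root = ET.Element("message")
# short_empty_elements=False writes <message></message> instead of <message />.
serialized = ET.tostring(root, encoding="unicode", short_empty_elements=False)

# A hand-written document that forgets the closing tag is rejected outright.
try:
    ET.fromstring("<message>")
    unclosed_error = None
except ET.ParseError as error:
    unclosed_error = error

print(serialized)
print("parser rejected the unclosed element:", unclosed_error is not None)
```

A plain text editor offers no such safety net, which is why writing both tags first is a useful habit.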
Parent and child relationships

One way of describing relationships in XML is the terminology of parent and child. In our examples, the parent or "root" element is <message>, which then has two child elements, <email> and <letter>. An easy way of showing how elements are related in XML is to indent the code to show that an element is a child of another. For example:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <message>
      <email>
      </email>
    </message>

Now that we have the XML declaration, the root element, and the child element (email), let's determine the information we want to break out in an email. Say we want to keep information about the sender, recipients, subject, and the body of the text. Since the information about the sender and recipients is generally in the head of the document, let's consider them child elements of a parent element that we will call <header>. In addition to <header>, the other child elements of <email> will be <subject> and <text>. So our XML will look something like this:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <message>
      <email>
        <header>
          <sender>me@ischool.utexas.edu</sender>
          <recipient>you@ischool.utexas.edu</recipient>
        </header>
        <subject>Re: XML</subject>
        <text>I'm working on my XML project right now.</text>
      </email>
    </message>
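The parent and child relationships in the email document can also be read back by a program. The sketch below is one way to do it in Python, assuming the standard library module xml.etree.ElementTree. The function outline is a hypothetical helper written for this handout; it walks the tree and indents each element under its parent.

```python
import xml.etree.ElementTree as ET

# Passed as bytes because the document carries its own encoding declaration.
EMAIL_XML = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<message>
  <email>
    <header>
      <sender>me@ischool.utexas.edu</sender>
      <recipient>you@ischool.utexas.edu</recipient>
    </header>
    <subject>Re: XML</subject>
    <text>I'm working on my XML project right now.</text>
  </email>
</message>
"""


def outline(element, depth=0):
    """Return one line per element, indented two spaces per nesting level."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(EMAIL_XML)
tree_outline = outline(root)
# A path of element names follows the parent/child chain from the root.
sender = root.findtext("email/header/sender")

print("\n".join(tree_outline))
print(sender)
```

The path email/header/sender mirrors the nesting in the document: sender is a child of header, which is a child of email, which is a child of the root.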
Now, let's create an XML document for a letter. The information we want to capture from a letter includes the sender, the recipient, and the text of the letter. Additionally, we want to know the date that it was sent and what salutation was used to start off the message. Let's see what this would look like in XML:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <message>
      <letter>
        <letterhead>
          <sender>Margaret</sender>
          <recipient>God</recipient>
          <date>1970</date>
        </letterhead>
        <text>
          <salutation>Are you there God?</salutation>
          It's me Margaret...
        </text>
      </letter>
    </message>

Now say we wanted to keep track of whether or not these messages were replies. Instead of creating an additional element called <reply>, let's assign an attribute to the elements <email> and <letter> indicating whether that document was a reply to a previous message. In XML, it would look something like this:

    <email reply="yes"> or <letter reply="no">

When creating XML documents, it's always useful to spend a little time thinking about what information you want to store, as well as what relationships the elements will have.

Now that we've made some XML documents, let's talk about "well formed" XML and valid XML. A parser can check on its own that a document is well formed; a DTD additionally tells a validating parser whether the well-formed document is valid.
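Before moving on, here is a short sketch of how a program might read the reply attribute described above. It is written in Python and assumes the standard library module xml.etree.ElementTree. The shortened sample messages and the choice to treat a missing attribute as "no" are assumptions made for this example, not rules stated in the handout.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical messages; only the reply attribute matters here.
MESSAGES = [
    '<message><email reply="yes"><subject>Re: XML</subject></email></message>',
    '<message><letter reply="no"><text>It is me, Margaret...</text></letter></message>',
    "<message><letter><text>No reply attribute here.</text></letter></message>",
]


def is_reply(message_xml):
    """True if the first child of <message> carries reply="yes"."""
    root = ET.fromstring(message_xml)
    document = root[0]  # the <email> or <letter> element
    # Assumption: a missing attribute is treated the same as reply="no".
    return document.get("reply", "no") == "yes"


reply_flags = [is_reply(message) for message in MESSAGES]
print(reply_flags)
```

Because reply has a small, fixed set of values, it fits the earlier recommendation for when an attribute is preferable to an element.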
DTD (Document Type Definition)

A Document Type Definition is a mechanism to describe the structure of documents.

Sometimes XML is too flexible:

- Most programs can only process a subset of all possible XML applications.
- For exchanging data, the format (i.e., elements, attributes and their semantics) must be fixed.

Document Type Definitions (DTDs) establish the vocabulary for one XML application (in some sense comparable to schemas in databases). A document is valid with respect to a DTD if it conforms to the rules specified in that DTD. Most XML parsers can be configured to validate.

The syntax for DTD

The syntax for DTDs is different from the syntax for XML documents. Example 1 is the address book XML code but with one difference: it has a new <!DOCTYPE> statement. The new statement is introduced in the section Document Type Declaration. For now, it suffices to say that it links the document file to the DTD file. Example 2 is its DTD.

Example 1: An Address Book in XML

    <?xml version="1.0"?>
    <!DOCTYPE address-book SYSTEM "address-book.dtd">
    <!-- loosely inspired by vCard 3.0 -->
    <address-book>
      <entry>
        <name>John Doe</name>
        <address>
          <street>34 Fountain Square Plaza</street>
          <region>OH</region>
          <postal-code>45202</postal-code>
          <locality>Cincinnati</locality>
          <country>US</country>
        </address>
        <tel preferred="true">513-555-8889</tel>
        <tel>513-555-7098</tel>
        <email href="mailto:jdoe@emailaholic.com"/>
      </entry>
      <entry>
        <name><fname>Jack</fname><lname>Smith</lname></name>
        <tel>513-555-3465</tel>
        <email href="mailto:jsmith@emailaholic.com"/>
      </entry>
    </address-book>

Example 2: DTD for the Address Book

    <!-- top-level element, the address book is a list of entries -->
    <!ELEMENT address-book (entry+)>

    <!-- an entry is a name followed by addresses, phone numbers, etc. -->
    <!ELEMENT entry (name,address*,tel*,fax*,email*)>

    <!-- name is made of string, first name and last name.
         This is a very flexible model to accommodate exotic names -->
    <!ELEMENT name (#PCDATA | fname | lname)*>
    <!ELEMENT fname (#PCDATA)>
    <!ELEMENT lname (#PCDATA)>

    <!-- definition of the address structure
         if several addresses, the preferred attribute signals the default one -->
    <!ELEMENT address (street,region?,postal-code,locality,country)>
    <!ATTLIST address preferred (true | false) "false">
    <!ELEMENT street (#PCDATA)>
    <!ELEMENT region (#PCDATA)>
    <!ELEMENT postal-code (#PCDATA)>
    <!ELEMENT locality (#PCDATA)>
    <!ELEMENT country (#PCDATA)>

    <!-- phone, fax and email, same preferred attribute as address -->
    <!ELEMENT tel (#PCDATA)>
    <!ATTLIST tel preferred (true | false) "false">
    <!ELEMENT fax (#PCDATA)>
    <!ATTLIST fax preferred (true | false) "false">
    <!ELEMENT email EMPTY>
    <!ATTLIST email href CDATA #REQUIRED
                    preferred (true | false) "false">
DTD Example: Element Declarations in a DTD

There is one element declaration for each element type:

    <!ELEMENT element_name content_specification>

where content_specification can be:

    (#PCDATA)        parsed character data
    (child)          one child element
    (c1,...,cn)      a sequence of child elements c1 ... cn
    (c1|...|cn)      one of the elements c1 ... cn

Special Characters

For each component c, possible counts can be specified:

    c     exactly one such element
    c+    one or more
    c*    zero or more
    c?    zero or one

Plus arbitrary combinations using parentheses:

    <!ELEMENT f ((a|b)*,c+,(d|e))*>
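The content specifications and counts above behave much like regular expressions over the sequence of child element names. The sketch below illustrates that idea in Python using only the standard re module. It is not a DTD validator: the functions content_model_to_regex and children_match are hypothetical helpers, and they handle only element names, commas, vertical bars, parentheses, and the ?, *, + counts (not #PCDATA, EMPTY, or ANY).

```python
import re


def content_model_to_regex(model):
    """Translate an element-only DTD content model into a regular expression."""
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", model)
    parts = []
    for token in tokens:
        if token == ",":
            continue  # a sequence is plain concatenation in a regex
        if token == "(":
            parts.append("(?:")  # non-capturing group
        elif token in {")", "|", "?", "*", "+"}:
            parts.append(token)  # same meaning in DTDs and in regexes
        else:
            # Each child name is followed by one space in the sequence string.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """True if the list of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


F_MODEL = "((a|b)*,c+,(d|e))*"
ENTRY_MODEL = "(name,address*,tel*,fax*,email*)"

print(children_match(F_MODEL, ["a", "b", "c", "d"]))
print(children_match(F_MODEL, ["a", "d"]))
print(children_match(ENTRY_MODEL, ["name", "address", "tel", "tel", "email"]))
print(children_match(ENTRY_MODEL, ["tel", "name"]))
```

In the second call the sequence fails because the model requires at least one c between the (a|b) group and the (d|e) group; in the last call it fails because name must come first.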
Elements with mixed content:

    <!ELEMENT text (#PCDATA|index|cite|glossary)*>

Elements with empty content:

    <!ELEMENT image EMPTY>

Elements with arbitrary content (not suitable for production-level DTDs):

    <!ELEMENT thesis ANY>

Attribute Declaration

Attribute Example
Attributes are declared per element:

    <!ATTLIST section number CDATA #REQUIRED
                      title  CDATA #REQUIRED>

declares two required attributes for the element section.

Possible attribute defaults:

    #REQUIRED         the attribute is required in each element instance
    #IMPLIED          the attribute is optional
    #FIXED default    the attribute always has this default value
    default           the attribute has this default value if it is omitted from the element instance

Possible attribute types:

    CDATA          string data
    (A1|...|An)    enumeration of all possible values of the attribute (each is an XML name token)
    ID             unique XML name to identify the element
    IDREF          refers to the ID attribute of some other element ("intra-document link")
    IDREFS         list of IDREF values, separated by white space

Linking DTD and XML Docs

DTDs are of two types:

a) Internal
b) Separate

A separate DTD lives in its own file and is linked from the document with a SYSTEM identifier, as in Example 1 (address-book.dtd). An internal DTD is written inside the <!DOCTYPE> statement itself.

Internal DTD:

    <?xml version="1.0"?>
    <!DOCTYPE article [
      <!ELEMENT article (title,author+,text)>
      ...
      <!ELEMENT index (#PCDATA)>
    ]>
    <article>
    ...
    </article>
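To make the internal DTD less abstract, the following sketch reads the declarations of an internal DTD with Python's standard xml.parsers.expat module. It only reports what is declared; expat is a non-validating parser, so it does not check the document against those declarations. The shortened address book used here is an assumption made for the example (Example 1 uses an external DTD), and the handler function names are made up.

```python
import xml.parsers.expat

# A shortened address book with its DTD moved into the internal subset.
DOCUMENT = b"""<?xml version="1.0"?>
<!DOCTYPE address-book [
  <!ELEMENT address-book (entry+)>
  <!ELEMENT entry (tel*, email*)>
  <!ELEMENT tel (#PCDATA)>
  <!ATTLIST tel preferred (true|false) "false">
  <!ELEMENT email EMPTY>
  <!ATTLIST email href CDATA #REQUIRED>
]>
<address-book>
  <entry>
    <tel>513-555-8889</tel>
    <email href="mailto:jdoe@emailaholic.com"/>
  </entry>
</address-book>
"""

declared_elements = []
declared_attributes = []


def on_element_decl(name, model):
    # Called once for each <!ELEMENT ...> declaration.
    declared_elements.append(name)


def on_attlist_decl(element_name, attribute_name, attribute_type, default, required):
    # Called once for each attribute declared in an <!ATTLIST ...>.
    declared_attributes.append((element_name, attribute_name, bool(required)))


parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.AttlistDeclHandler = on_attlist_decl
parser.Parse(DOCUMENT, True)

print(declared_elements)
print(declared_attributes)
```

The last value in each attribute tuple shows whether the attribute was declared #REQUIRED, which is the distinction drawn in the list of attribute defaults above.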
Both ways can be mixed; the internal DTD overrides external entity information:

    <!DOCTYPE article SYSTEM "article.dtd" [
      <!ENTITY % pub_content "(title+,author*,text)">
    ]>

Flaws of DTDs

- No support for basic data types like integers, doubles, dates, times, ...
- No structured, self-definable data types
- No type derivation
- ID/IDREF links are quite loose (the target is not specified)
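The first flaw has a practical consequence: a DTD can only declare the <date> element from the letter example as (#PCDATA), so any text at all is valid there, and type checks fall to the application. The sketch below shows such a check in Python, assuming the standard library modules re and xml.etree.ElementTree. The helper read_year and the rule that a date must be a four-digit year are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET


def read_year(letter_xml):
    """Return the <date> content as an integer year, or None if it is not one."""
    text = ET.fromstring(letter_xml).findtext("letterhead/date", default="")
    # Assumption for this sketch: a date is exactly four ASCII digits.
    match = re.fullmatch(r"\s*([0-9]{4})\s*", text)
    return int(match.group(1)) if match else None


# Both letters would satisfy <!ELEMENT date (#PCDATA)>; only one holds a year.
good_year = read_year("<letter><letterhead><date>1970</date></letterhead></letter>")
bad_year = read_year("<letter><letterhead><date>last Tuesday</date></letterhead></letter>")

print(good_year, bad_year)
```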
RSS (Really Simple Syndication)

About RSS

If you frequent weblogs, you've seen the little XML icons inviting you to "syndicate this site", but what does that really mean? A long time ago, newspaper managers realized that if they could use articles and stories from other newspapers in their paper, they could garner more readers because they could cover a wider area than they could with just their own reporters. This is an example of how syndication can work in print.

Online, there are potentially millions of authors writing about millions of topics each day. It can be very difficult to keep track of them without some type of automated system. And that's where RSS comes in. Really Simple Syndication (RSS) is an easy way for Web sites to share headlines and stories with other sites. Web surfers can use news readers, also called RSS aggregators, to browse these headlines.

A Brief History of RSS

RSS was first developed by Netscape, when they were trying to get into the portal business. They wanted an XML format (RSS 0.90) that would make it easy to pull news stories and information from other sites and add them automatically to their own site. They then released RSS 0.91, and dropped it when they decided to get out of the portal business.

What is RSS?

RSS is a protocol that lets users subscribe to online content using an RSS reader or aggregator, which checks subscribed Web pages and automatically downloads new content. The aggregator displays a list of subscriptions, with highlighting or another indicator for RSS feeds that have added content since the user last logged in. Without having to go to all of the individual Web sites, users can quickly and easily access new material from sites that interest them.

Why is it significant?

For many, RSS has become the pipe through which content flows from providers to consumers. What makes RSS important is that users decide exactly what content is allowed through that pipe. Since its introduction in the late 1990s, RSS has become almost ubiquitous. An excellent mechanism for distributing regularly updated content, RSS is a natural complement to blogs, news sites, photo-sharing applications, and podcasts. The popularity of podcasting results on some level from RSS technology. When new podcasts are available, the aggregator (or, in this case, podcatcher) automatically downloads the new file to your computer or portable music player.

RSS Example:

RSS files are essentially XML-formatted plain text. The RSS file itself is relatively easy to read, both by automated processes and by humans. An example file could have contents such as the following. It could be served over any appropriate protocol for file retrieval, such as HTTP or FTP, and reading software would use the information to present a neat display to end users.
    <?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0">
      <channel>
        <title>RSS Title</title>
        <description>This is an example of an RSS feed</description>
        <link>http://www.someexamplerssdomain.com/main.html</link>
        <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000 </lastBuildDate>
        <pubDate>Mon, 06 Sep 2009 16:20:00 +0000 </pubDate>
        <ttl>1800</ttl>
        <item>
          <title>Example entry</title>
          <description>Here is some text.</description>
          <link>http://www.wikipedia.org/</link>
          <guid>unique string per item</guid>
          <pubDate>Mon, 06 Sep 2009 16:20:00 +0000 </pubDate>
        </item>
      </channel>
    </rss>
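As a rough idea of what reading software does with such a file, the sketch below parses a shortened copy of the example feed in Python, assuming the standard library modules xml.etree.ElementTree and email.utils. The function read_items and the dictionary keys it returns are made up for this example. Element names such as pubDate are case sensitive, as for any XML document.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Shortened copy of the example feed; bytes because of the encoding declaration.
FEED = b"""<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>RSS Title</title>
    <description>This is an example of an RSS feed</description>
    <link>http://www.someexamplerssdomain.com/main.html</link>
    <ttl>1800</ttl>
    <item>
      <title>Example entry</title>
      <description>Here is some text.</description>
      <link>http://www.wikipedia.org/</link>
      <guid>unique string per item</guid>
      <pubDate>Mon, 06 Sep 2009 16:20:00 +0000 </pubDate>
    </item>
  </channel>
</rss>
"""


def read_items(feed_bytes):
    """Return the channel title and a list of item dictionaries."""
    channel = ET.fromstring(feed_bytes).find("channel")
    items = []
    for item in channel.findall("item"):
        # Dates in the feed use the e-mail date format, so email.utils can parse them.
        published = parsedate_to_datetime(item.findtext("pubDate").strip())
        items.append(
            {
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "published": published,
            }
        )
    return channel.findtext("title"), items


channel_title, feed_items = read_items(FEED)
print(channel_title)
for entry in feed_items:
    print(entry["published"].date(), entry["title"], entry["link"])
```

An aggregator would repeat this for every subscribed feed and compare the guid or pubDate values with what it saw last time to decide which items are new.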