SGML and Exceptions
Pekka Kilpeläinen and Derick Wood


Technical Report HKUST-CS96-30, June 1996
The Hong Kong University of Science & Technology, Technical Report Series, Department of Computer Science

Pekka Kilpeläinen: Department of Computer Science, University of Helsinki, Helsinki, Finland
Derick Wood: Department of Computer Science, Hong Kong University of Science & Technology, Clear Water Bay, Kowloon, Hong Kong

Abstract

The Standard Generalized Markup Language (SGML) allows users to define document type definitions (DTDs), which are essentially extended context-free grammars in a notation that is similar to extended Backus-Naur form. The right-hand side of a production is called a content model, and its semantics can be modified by exceptions. We give precise definitions of the semantics of exceptions and prove that they do not increase the expressive power of SGML: for each DTD with exceptions we can construct a structurally equivalent extended context-free grammar. On the other hand, exceptions are a powerful shorthand notation; eliminating them may cause exponential growth in the size of a DTD.

The research of the first author was supported by the Academy of Finland, and the research of the second author was supported by grants from the Natural Sciences and Engineering Research Council of Canada and from the Information Technology Research Centre of Ontario.

July 3, 1996

Pekka Kilpeläinen, Department of Computer Science, University of Helsinki, Helsinki, Finland. kilpelai@cs.helsinki.fi
Derick Wood, Department of Computer Science, Hong Kong University of Science & Technology, Clear Water Bay, Kowloon, Hong Kong. dwood@cs.ust.hk

1 Introduction

The Standard Generalized Markup Language (SGML) [9, 11] promotes the interchangeability and application-independent management of electronic documents by providing a syntactic metalanguage for the definition of textual markup systems. An SGML document consists of an SGML prolog and a marked-up document instance. The prolog contains a document type definition (DTD), which is an extended context-free grammar in which the right-hand sides of productions are both extended and restricted regular expressions called content models. Fig. 1 gives an example of a simple SGML DTD.

    <!DOCTYPE message [
    <!ELEMENT message - - (head, body)>
    <!ELEMENT head - - (from & to & subject)>
    <!ELEMENT from - - (person)>
    <!ELEMENT to - - (person)+>
    <!ELEMENT person - - (alias | (forename?, surname))>
    <!ELEMENT body - - (paragraph)*>
    <!ELEMENT subject, alias, forename, surname, paragraph - - (#PCDATA)>
    ]>

Figure 1: An example SGML DTD.

The DTD in Fig. 1 defines a document type for messages, which consist of a head followed by a body. The element (or nonterminal) head consists of subelements from, to, and subject that can appear in any order. The element from is defined to be a person that can be denoted either by an alias or by an optional forename followed by a surname. The element to consists of a nonempty list of persons. The body of a message consists of a (possibly empty) sequence of paragraphs. Finally, the last element definition specifies that elements subject, alias, forename, surname, and paragraph are unstructured strings, denoted by the keyword #PCDATA.

The structural elements of a document instance are made visible by enclosing them in matching pairs of start tags and end tags. A possible instance of the DTD of Fig. 1 is given in Fig. 2.

    <message>
    <head>
    <from><person><alias>boss</alias></person></from>
    <subject>tomorrow's meeting...</subject>
    <to><person><surname>franklin</surname></person>
    <person><alias>betty</alias></person></to>
    </head>
    <body><paragraph>..has been cancelled.</paragraph></body>
    </message>

Figure 2: An SGML document instance.

The semantics of content models can be modified by what the Standard calls exceptions. Inclusion exceptions allow named elements to appear anywhere within a content model, and exclusion exceptions preclude named elements from appearing in a content model. To define the placement of sidebars, figures, equations, footnotes, and similar objects in a DTD using the usual grammatical approach is laborious; exceptions provide an alternative, concise, and formal mechanism.

For example, with the DTD of Fig. 1, we might want to allow notes to appear anywhere in the bodies of messages, except within notes themselves. We could add the inclusion exception

    <!ELEMENT body - - (paragraph)* +(note)>

to the definition of element body. This modification allows notes to appear within notes; therefore, to prevent such recursive appearances we add an exclusion exception to the definition of element type note:

    <!ELEMENT note - - (#PCDATA) -(note)>

Exclusion exceptions seem to be a useful concept, but their exact meaning is unclear from the Standard [11] and from Goldfarb's annotation of the Standard [9]. We give rigorous definitions for the meaning of exceptions. In the full paper [10], we also give algorithms for transforming grammars with exceptions to grammars without exceptions, together with complete proofs of the results mentioned here. The correctness proofs of these methods imply that exceptions do not increase the expressiveness of SGML DTDs.

An application that requires the elimination of exceptions from content models is the translation of DTDs into static database schemas. This method of integrating textual documents into an object-oriented database has been suggested by Christofides et al. [8].
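As a rough illustration of how the content models of Fig. 1 constrain an instance, the following sketch checks the instance of Fig. 2 against hand-translated models. It is not an SGML parser: the names `any_order`, `CONTENT_MODELS`, and `conforms` are assumptions made for this example, the `&` connector is expanded into all orderings, exceptions are not handled, and the instance is read with an XML parser, which works here only because every tag in Fig. 2 is written out in full.

```python
import re
import xml.etree.ElementTree as ET
from itertools import permutations


def any_order(*tokens):
    """Expand SGML's '&' connector into an alternation over all orderings."""
    return "(?:" + "|".join("".join(p) for p in permutations(tokens)) + ")"


# Hand-translated content models of Fig. 1 (an assumption of this sketch).
# Each child element is encoded as the token "<name>".
CONTENT_MODELS = {
    "message": "<head><body>",
    "head": any_order("<from>", "<to>", "<subject>"),
    "from": "<person>",
    "to": "(?:<person>)+",
    "person": "<alias>|(?:<forename>)?<surname>",
    "body": "(?:<paragraph>)*",
}
PCDATA_ELEMENTS = {"subject", "alias", "forename", "surname", "paragraph"}


def conforms(element):
    """Check an element and, recursively, its children against the models."""
    children = list(element)
    if element.tag in PCDATA_ELEMENTS:
        return not children              # #PCDATA: no subelements allowed
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False                     # element type not declared
    sequence = "".join(f"<{child.tag}>" for child in children)
    return (re.fullmatch(model, sequence) is not None
            and all(conforms(child) for child in children))


INSTANCE = """<message>
<head>
<from><person><alias>boss</alias></person></from>
<subject>tomorrow's meeting...</subject>
<to><person><surname>franklin</surname></person>
<person><alias>betty</alias></person></to>
</head>
<body><paragraph>..has been cancelled.</paragraph></body>
</message>"""

print(conforms(ET.fromstring(INSTANCE)))
```

Expanding `&` into permutations is acceptable for three subelements but grows factorially, which is one reason the connector receives special treatment later in the paper.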

The SGML Standard requires content models to be unambiguous, meaning that each nonempty prefix of an input string determines uniquely which symbols of the content model match the symbols of the prefix. Our methods of eliminating exceptions preserve the unambiguity of the original content models. In this respect our work extends the work of Brüggemann-Klein and Wood [3, 4, 5, 6, 7].

The Standard gives rather vague restrictions on the applicability of exclusion exceptions. We propose a simple and rigorous definition for the applicability of exclusions; in the full paper [10], we also present an optimal algorithm for testing applicability.

In this extended abstract we focus on the essential ideas underlying our approach. For this reason, we consider the removal of exceptions only from extended context-free grammars with exceptions, although we mention the problems of transferring this approach to DTDs. We refer the reader to the full paper [10] for more details.

2 Extended Context-Free Grammars with Exceptions

We introduce extended context-free grammars as a model for SGML DTDs. We treat extended context-free grammars as context-free grammars in which the right-hand sides of productions are regular expressions. Let $V$ be an alphabet. Then, we define a regular expression over $V$ and its language in the usual way [1, 12]. The symbol $\epsilon$ denotes the empty string. We denote by $\mathrm{sym}(E)$ the set of symbols of $V$ that appear in a regular expression $E$.

An extended context-free grammar $G$ is specified by a tuple $(N, \Sigma, P, S)$, where $N$ and $\Sigma$ are disjoint finite alphabets of nonterminal symbols and terminal symbols, respectively, $P$ is a finite set of production schemas, and the nonterminal $S$ is the sentence symbol. Each production schema has the form $A \to E$, where $A$ is a nonterminal and $E$ is a regular expression over $V = N \cup \Sigma$. When $\beta = \beta_1 A \beta_2 \in V^*$, $A \to E \in P$, and $\alpha \in L(E)$, the string $\beta_1 \alpha \beta_2$ can be derived from the string $\beta$, and we denote this fact by writing $\beta \Rightarrow \beta_1 \alpha \beta_2$.
The language $L(G)$ of an extended context-free grammar $G$ is the set of terminal strings derivable from the sentence symbol of $G$. Formally, $L(G) = \{ w \in \Sigma^* \mid S \Rightarrow^+ w \}$, where $\Rightarrow^+$ denotes the transitive closure of the derivability relation. Even though a production schema may correspond to an infinite number of ordinary context-free productions, it is known that extended and ordinary context-free grammars describe exactly the same languages; for example, see the text of Wood [12].

An extended context-free grammar $G$ with exceptions is specified by a tuple $(N, \Sigma, P, S)$ and is similar to an extended context-free grammar except that the production schemas in $P$ have the form $A \to E + I - X$, where $A$ is in $N$, $E$ is a regular expression over $V = N \cup \Sigma$, and $I$ and $X$ are subsets of $N$. The intuitive idea is that the derivation of any string $w$ from the nonterminal $A$ using the production schema $A \to E + I - X$ must not involve any nonterminal in $X$, yet $w$ may contain, in any position, strings that are derivable from nonterminals in $I$. When a nonterminal is both included and excluded, its exclusion overrides its inclusion.

We now define the effect of inclusions and exclusions on languages. Let $L$ be a language over the alphabet $V$ and let $I, X \subseteq V$. We define a language $L$ with inclusions $I$ as the language

$$L_{+I} = \{ w_0 a_1 w_1 \cdots a_n w_n \mid a_1 \cdots a_n \in L, \text{ for } n \ge 0, \text{ and } w_i \in I^*, \text{ for } i = 0, \ldots, n \}.$$

Thus, $L_{+I}$ consists of the strings in $L$ with arbitrary strings from $I^*$ inserted into them. The language $L$ with exclusions $X$ is defined as the language $L_{-X}$ that consists of the strings in $L$ that do not contain any symbol in $X$. Notice that $(L_{+I})_{-X} \subseteq (L_{-X})_{+I}$, but the converse does not hold in general. In the sequel we will write $L_{+I-X}$ for $(L_{+I})_{-X}$.

We formally describe the global effect of exceptions by attaching exceptions to nonterminals and by defining derivations from nonterminals with exceptions. We denote a nonterminal $A$ with inclusions $I$ and exclusions $X$ with the symbol $A_{+I-X}$. Normally, we rewrite the nonterminal $A$, say, with a string $\alpha$, where $A \to E$ is the production schema for $A$ and $\alpha \in L(E)$. But when $A$ has inclusions $I$ and exclusions $X$, and the production schema for $A$ is $A \to E + I_A - X_A$, we must cumulate the inclusions and exclusions in the string $\alpha$. Observe that $I$ and $X$ are the exceptions associated with $A$, whereas $I_A$ and $X_A$ are the exceptions to be applied to $A$'s derived strings. We, therefore, replace $A_{+I-X}$ with $\alpha^{(I \cup I_A, X \cup X_A)}$. This cumulation of inclusions and exclusions is described informally in the Standard.
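The two language operations can be made concrete with a small membership test. The sketch below assumes a finite language given as a Python set of strings over single-character symbols; the function names are made up for this illustration. It also exhibits a string that separates $(L_{-X})_{+I}$ from $(L_{+I})_{-X}$, which is why the order of the two operations matters.

```python
def in_with_inclusions(w, L, I):
    """True iff w is in L_{+I}: deleting some symbols of I from w yields a string of L."""
    def match(rest, u):
        if not rest:
            return not u
        # Either the next symbol of w is the next symbol of u ...
        if u and rest[0] == u[0] and match(rest[1:], u[1:]):
            return True
        # ... or it is an inserted symbol, which must belong to I.
        return rest[0] in I and match(rest[1:], u)
    return any(match(w, u) for u in L)


def with_exclusions(L, X):
    """L_{-X}: the strings of L that contain no symbol of X."""
    return {u for u in L if not set(u) & set(X)}


def in_incl_then_excl(w, L, I, X):
    """Membership in (L_{+I})_{-X}, written L_{+I-X} in the text."""
    return in_with_inclusions(w, L, I) and not set(w) & set(X)


def in_excl_then_incl(w, L, I, X):
    """Membership in (L_{-X})_{+I}."""
    return in_with_inclusions(w, with_exclusions(L, X), I)


L, I, X = {"a"}, {"x"}, {"x"}
print(in_incl_then_excl("xa", L, I, X))   # exclusion overrides inclusion
print(in_excl_then_incl("xa", L, I, X))   # inclusion applied last
```

With $L = \{a\}$ and $I = X = \{x\}$, the string $xa$ belongs to $(L_{-X})_{+I}$ but not to $(L_{+I})_{-X}$, matching the rule that an exclusion overrides an inclusion.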

We modify the standard definition of a derivation step in an extended context-free grammar as follows. For a string $w$ over $\Sigma \cup N$, we denote by $w^{(I,X)}$ the string obtained from $w$ by replacing every nonterminal $A \in \mathrm{sym}(w)$ with $A_{+I-X}$. Thus, we have attached the same inclusions and exclusions to every nonterminal in $w$. Let $\beta A_{+I-X} \gamma$ be a string of nonterminal symbols with exceptions and terminal symbols. We say that the string $\beta \alpha' \gamma$ can be derived from $\beta A_{+I-X} \gamma$, when the following two conditions hold:

1. $A \to E + I_A - X_A$ is a production schema in $P$.
2. For some string $\alpha$ in $L(E)_{+(I \cup I_A)-(X \cup X_A)}$, $\alpha' = \alpha^{(I \cup I_A, X \cup X_A)}$.

Observe that the second condition reflects the idea that exceptions are propagated and cumulated by derivations.

We illustrate these ideas with the following example grammar with exceptions. This grammar is also used to show that the exception-removal method we design can lead to an exponential blow-up in grammar size.

Example 1 The example grammar is specified as follows:

$$A \to (A_1 \mid \cdots \mid A_m) + \emptyset - \emptyset$$
$$A_1 \to (a_1 \mid A) + \{A_2\} - \emptyset$$
$$A_2 \to (a_2 \mid A) + \{A_3\} - \emptyset$$
$$\vdots$$
$$A_m \to (a_m \mid A) + \{A_1\} - \emptyset$$

We now demonstrate how exception propagation works. Consider a derivation step from $A_1$ with empty inclusions and empty exclusions (that is, from $(A_1)_{+\emptyset-\emptyset}$). Now, $(A_1)_{+\emptyset-\emptyset}$ derives

$$(A A_2 A_2)^{(\{A_2\},\emptyset)} = A_{+\{A_2\}-\emptyset} \, (A_2)_{+\{A_2\}-\emptyset} \, (A_2)_{+\{A_2\}-\emptyset},$$

since the production schema $A_1 \to (a_1 \mid A) + \{A_2\} - \emptyset$ is in the grammar and $A A_2 A_2 \in L(a_1 \mid A)_{+\{A_2\}-\emptyset}$. □
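The derivation step of Example 1 can be replayed mechanically. The sketch below is an illustration under simplifying assumptions: it fixes $m = 3$, represents strings as tuples of symbol names, and lists each $L(E)$ explicitly, which is possible only because the regular expressions of Example 1 denote finite languages. The names `SCHEMAS` and `derive_step` are hypothetical.

```python
def in_with_inclusions(w, L, I):
    """True iff w is in L_{+I} (w and the members of L are tuples of symbols)."""
    def match(rest, u):
        if not rest:
            return not u
        if u and rest[0] == u[0] and match(rest[1:], u[1:]):
            return True
        return rest[0] in I and match(rest[1:], u)
    return any(match(w, u) for u in L)


EMPTY = frozenset()

# Example 1 with m = 3: nonterminal -> (L(E), I_A, X_A).
SCHEMAS = {
    "A":  ({("A1",), ("A2",), ("A3",)}, EMPTY, EMPTY),
    "A1": ({("a1",), ("A",)}, frozenset({"A2"}), EMPTY),
    "A2": ({("a2",), ("A",)}, frozenset({"A3"}), EMPTY),
    "A3": ({("a3",), ("A",)}, frozenset({"A1"}), EMPTY),
}


def derive_step(A, I, X, alpha):
    """Return alpha^(I u I_A, X u X_A) if alpha may replace A_{+I-X}, else None."""
    language, I_A, X_A = SCHEMAS[A]
    I2, X2 = I | I_A, X | X_A          # cumulate the exceptions
    # alpha must lie in (L(E)_{+I2})_{-X2}; exclusions override inclusions.
    if set(alpha) & X2 or not in_with_inclusions(alpha, language, I2):
        return None
    # Attach the cumulated exceptions to every nonterminal of alpha.
    return [(s, I2, X2) if s in SCHEMAS else s for s in alpha]


print(derive_step("A1", EMPTY, EMPTY, ("A", "A2", "A2")))
```

Each nonterminal in the result carries the inclusion set $\{A_2\}$ and an empty exclusion set, as in the worked derivation; a candidate string such as $A\,A_3$ is rejected because $A_3$ is not an inclusion at that point.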

Finally, the language $L(G)$ of an extended context-free grammar $G$ with exceptions consists of the terminal strings derivable from the sentence symbol with empty inclusions and exclusions. Formally,

$$L(G) = \{ w \in \Sigma^* \mid S_{+\emptyset-\emptyset} \Rightarrow^+ w \}.$$

Exceptions seem to be a context-dependent feature: legal expansions of a nonterminal depend on the context in which the nonterminal appears. We show, however, that exceptions do not extend the descriptive power of extended context-free grammars by giving a transformation that produces an extended context-free grammar that is structurally equivalent to an extended context-free grammar with exceptions. The transformation propagates exceptions to production schemas and modifies their associated regular expressions to capture the effect of exceptions.

Step 1: We explain how to modify regular expressions to capture the effect of exceptions. Let $E$ be a regular expression over $V = \Sigma \cup N$ and let $I = \{i_1, \ldots, i_k\}$ be a set of inclusion exceptions. First, observe that we can remove the $\emptyset$ symbol from the regular expression $E$ and maintain equivalence, if the language of the expression is not $\emptyset$. We modify $E$ to obtain a regular expression $E_{+I}$ such that $L(E_{+I}) = L(E)_{+I}$ by replacing each occurrence of a symbol $a \in \mathrm{sym}(E)$ with

$$(i_1 \mid i_2 \mid \cdots \mid i_k)^* \, a \, (i_1 \mid i_2 \mid \cdots \mid i_k)^*$$

and each occurrence of $\epsilon$ with

$$(i_1 \mid i_2 \mid \cdots \mid i_k)^*.$$

For a set $X$ of excluded elements, we obtain a regular expression $E_{-X}$ such that $L(E_{-X}) = L(E)_{-X}$ by replacing each occurrence of a symbol $a \in X$ in $E$ with $\emptyset$.

Step 2: We describe an algorithm for eliminating exceptions from an extended context-free grammar $G = (N, \Sigma, P, S)$ with exceptions. It propagates the exceptions in a production schema to nonterminals in the schema; see Fig. 3. The algorithm produces an extended context-free grammar $G' = (N', \Sigma', P', S')$ that is structurally equivalent to $G$. The nonterminals of $G'$ have the form $A_{+I-X}$, where $A \in N$ and $I, X \subseteq N$. A derivation step using a new production schema $A_{+I-X} \to E$ in $P'$ corresponds to a derivation step using an old production schema for nonterminal $A$ under inclusions $I$ and exclusions $X$.

    $N' := \{ A_{+\emptyset-\emptyset} \mid A \in N \}$; $S' := S_{+\emptyset-\emptyset}$; $\Sigma' := \Sigma$;
    $Q := \{ A_{+\emptyset-\emptyset} \to E + I - X \mid A \to E + I - X \in P \}$;
    $P'' := \emptyset$;
    for all $A_{+I_A-X_A} \to E + I - X \in Q$ do
        for all ($B \in (\mathrm{sym}(E) \cup I) - X$) and $B_{+I-X} \notin N'$ do
            $N' := N' \cup \{ B_{+I-X} \}$;
            $Q := Q \cup \{ B_{+I-X} \to E_B + (I \cup I_B) - (X \cup X_B) \mid B_{+\emptyset-\emptyset} \to E_B + I_B - X_B \in Q \}$
        od;
        $Q := Q - \{ A_{+I_A-X_A} \to E + I - X \}$;
        $P'' := P'' \cup \{ A_{+I_A-X_A} \to E + I - X \}$
    od;
    $P' := \{ A_{+I_A-X_A} \to E_A \mid A_{+I_A-X_A} \to E + I - X \in P'' \text{ and } E_A = ((E_{+I})_{-X})^{(I,X)} \}$;

Figure 3: Exception elimination from an extended context-free grammar $(N, \Sigma, P, S)$ with exceptions.

Termination: The algorithm terminates since it generates, from each nonterminal $A$, at most $2^{2|N|}$ new nonterminals of the form $A_{+I-X}$. In the worst case the algorithm can exhibit this potentially exponential behavior. Given the grammar with exceptions that we defined in Example 1, the algorithm produces production schemas of the form $A_{+I-\emptyset} \to E$ for every subset $I \subseteq \{A_1, \ldots, A_m\}$.

We do not know whether this exponential behavior can be avoided. Is it always possible to obtain an extended context-free grammar $G'$ without exceptions that is (structurally) equivalent to an extended context-free grammar $G$ with exceptions such that the size of $G'$ is bounded by a polynomial in the size of $G$? We conjecture that the answer is negative.
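The growth described under Termination can be observed by running only the nonterminal-generation part of the algorithm on Example 1. The following worklist sketch is an approximation of Fig. 3 under stated assumptions: it ignores terminals and the rewriting of regular expressions, it restricts $B$ to nonterminals, and it looks up the original schema of $B$ directly in $P$ rather than in $Q$. The names `example1` and `annotated_nonterminals` are hypothetical.

```python
from collections import deque


def example1(m):
    """Example 1 as {nonterminal: (nonterminals in E, I_A, X_A)}; terminals omitted."""
    names = [f"A{i}" for i in range(1, m + 1)]
    P = {"A": (frozenset(names), frozenset(), frozenset())}
    for i, name in enumerate(names):
        successor = names[(i + 1) % m]            # A_m includes A_1
        P[name] = (frozenset({"A"}), frozenset({successor}), frozenset())
    return P


def annotated_nonterminals(P):
    """Collect every triple (B, I, X) that the propagation of Fig. 3 creates."""
    start = [(A, frozenset(), frozenset()) for A in P]
    seen = set(start)
    work = deque(start)
    while work:
        A, I, X = work.popleft()
        symbols, I_A, X_A = P[A]
        I2, X2 = I | I_A, X | X_A                 # cumulated exceptions of the schema
        for B in (symbols | I2) - X2:
            item = (B, I2, X2)
            if item not in seen:
                seen.add(item)
                work.append(item)
    return seen


for m in range(1, 6):
    print(m, len(annotated_nonterminals(example1(m))))
```

In this encoding every pair of a nonterminal and a subset $I \subseteq \{A_1, \ldots, A_m\}$ is reached, so the count is $(m + 1) \cdot 2^m$, while the input grammar has only $m + 1$ production schemas.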

3 Exception-Removal for DTDs

Document type definitions (DTDs) are, essentially, extended context-free grammars that have restricted and generalized regular expressions, called content models in the ISO Standard [9, 11], on the right-hand sides of their productions. The major difference between regular expressions and content models is that content models have the additional operators $F \& G$, $F?$, and $F^+$, where $F \& G \equiv FG \mid GF$.

The SGML Standard describes the basic meaning of inclusions as follows: "Elements named in an inclusion can occur anywhere within the content of the element being defined, including anywhere in the content of its subelements." The description is refined by the rule specifying that "...an element that can satisfy an element token in the content model is considered to do so, even if the element is also an inclusion." This refinement means, for example, that given the content model $(a \mid b)$ with inclusion $a$, $baa$ is a valid string of the content model, as one would expect intuitively; however, $aab$ is not a valid string of the content model. The reason is that the first $a$ in $aab$ must correspond to the $a$ in the content model, and then the suffix $ab$ cannot be obtained. On the other hand, the string $aaa$ is a valid string of the content model.

The Standard recommends that inclusions "...should be used only for elements that are not logically part of the content"; for example, neither for $a$ nor for $b$ in the preceding example. Since the difficulty of understanding inclusions is caused, however, by the inclusion of elements that appear in the content model, we have to take them into account.

The basic idea of compiling the inclusion of the set $I = \{i_1, \ldots, i_k\}$ of symbols in a content model $E$ is to insert new subexpressions of the form $(i_1 \mid \cdots \mid i_k)^*$ in $E$. Preserving the unambiguity of the content model requires some extra care.
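The $(a \mid b)$ example can be checked mechanically. The sketch below anticipates the formal definition given next in this section: an included symbol may be inserted at a position only if it could not instead satisfy the next token of the content model. It assumes a finite language of strings over single-character symbols, and the function names are chosen for this illustration only.

```python
def first(L):
    """Symbols that can start a string of L."""
    return {u[0] for u in L if u}


def tail(L, w):
    """Strings u such that wu is in L."""
    return {u[len(w):] for u in L if u.startswith(w)}


def in_sgml_inclusion(w, L, I):
    """True iff w is in L under the SGML reading of inclusions I."""
    I = set(I)

    def match(rest, u, i):
        # i symbols of u have been matched so far.
        if not rest:
            return i == len(u)
        if i < len(u) and rest[0] == u[i] and match(rest[1:], u, i + 1):
            return True
        # An insertion is legal only if the symbol cannot satisfy a model token here.
        allowed = I - first(tail(L, u[:i]))
        return rest[0] in allowed and match(rest[1:], u, i)

    return any(match(w, u, 0) for u in L)


model = {"a", "b"}                       # the language of (a | b)
for s in ("baa", "aab", "aaa"):
    print(s, in_sgml_inclusion(s, model, {"a"}))
```

The same function applied to the language $\{ab, ba\}$ with inclusion $a$ accepts $aab$ and $ba$ but rejects $b$ and $bab$, in line with the characterization of that language stated after the definition.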
We define the SGML effect of inclusions $I$ on language $L \subseteq V^*$, where $V$ is an alphabet, as the language

$$L_{\oplus I} = \{ w_0 a_1 \cdots w_{n-1} a_n w_n \mid a_1 \cdots a_n \in L,\ n \ge 0,\ w_i \in (I - \mathrm{first}(\mathrm{tail}(L, a_1 \cdots a_i)))^*,\ i = 0, \ldots, n \},$$

where

$$\mathrm{first}(L) = \{ a \in V \mid au \in L \text{ for some } u \in V^* \}$$

and

$$\mathrm{tail}(L, w) = \{ u \in V^* \mid wu \in L \}.$$

For example, the language $\{ab, ba\}_{\oplus\{a\}}$ consists of all strings of the forms $a^k b a^l$ and $b a^k$, where $k \ge 1$ and $l \ge 0$.

We introduce the difficulties caused by the & operator with the following example. Consider the content model $E = a? \& b$, which is unambiguous. A content model that captures the inclusion of symbol $a$ in $E$ should match an arbitrary sequence of $a$s after the $b$. A straightforward transformation would produce a content model of the form $F \& (b a^*)$ or of the form $(F \& b) a^*$, where $a \in \mathrm{first}(L(F))$ and $\epsilon \in L(F)$. It is easy to see that these content models are ambiguous since, in each case, any $a$ following an initial $b$ can be matched by both $F$ and $a^*$.

Our strategy to handle such problematic subexpressions $F \& G$ is first to replace them by the equivalent subexpression $(FG \mid GF)$. (Notice that this substitution may not suffice, since $FG \mid GF$ can be ambiguous even if $F \& G$ is unambiguous. For example, the content model $(a?b \mid ba?)$ is ambiguous, whereas the content model $a? \& b$ is unambiguous.) Then, given a content model $E$ and a set $I$ of inclusions, we compute a new content model $E_{\oplus I}$ such that $L(E_{\oplus I}) = L(E)_{\oplus I}$.

Example 2 Let $E = (a? \& b?)c$ and $I = \{a, c\}$. We first transform it into the content model $(ab? \mid ba?)?c$ and then into the content model

$$(a a^* (b a^*)? \mid b (a a^*)?)? \, c \, (a \mid c)^*.$$ □

In the full paper [10], we give a complete algorithm for computing the content model $E_{\oplus I}$ from a given content model $E$ and a given set of inclusions $I$.

A clause of the SGML Standard states that "...exclusions modify the effect of model groups to which they apply by precluding options that would otherwise have been available". The exact meaning of the phrase "precluding options" is not clear from the Standard. Our first task is, therefore, to formalize the intuitive notion of exclusion. As a motivating example, consider excluding the symbol $b$ from the content model $E = a(b \mid c)c$, which defines the language $L(E) = \{abc, acc\}$. The element $b$ is clearly an alternative to the first occurrence of $c$, and we can realize its exclusion by modifying the content model to obtain $E' = acc$. Now, consider excluding $b$ from the content model $F = a(bc \mid cc)$. The case is not as clear since $b$ appears in a seq subexpression. On the other hand, both $E$ and $F$ define the same language.

Let $L \subseteq V^*$ be a language and let $X \subseteq V$. Motivated by the preceding examples, we define the effect of excluding $X$ from $L$, which we denote by $L_{-X}$, to be the set of all strings in $L$ that do not contain any symbol of $X$. As an example, the effect of excluding $\{b\}$ from the language of the preceding content models $E$ and $F$ is

$$L(E)_{-\{b\}} = L(F)_{-\{b\}} = \{acc\}.$$

Notice that an exclusion always specifies a subset of the original language.

In the full paper [10], we show how to compute a content model $E_{\ominus X}$ such that $L(E_{\ominus X}) = L(E)_{-X}$ from a given content model $E$ and a given set $X$ of exclusions. The modified content model $E_{\ominus X}$ is unambiguous if the original content model $E$ is unambiguous, and its computation takes time linear in the size of $E$.

As a restriction on the applicability of exclusions, the Standard states that "...an exclusion cannot affect a specification in a model group that indicates that an element is required." The Standard does not specify rigorously how a model group (a subexpression of a content model) indicates that an element is required. The intent of the Standard appears to be that when $A$ is an element, then in the contexts $A?$, $(A \mid B)$, and $A^*$, the $A$ is optional, but in the contexts $A$, $A^+$, and $A \& B$, it is required.

Note that a content model cannot denote a language that is either $\emptyset$ or $\{\epsilon\}$. The Standard gives a syntactic definition of the applicability of exclusions; we prefer to give a semantic definition. Therefore, a reasonable requirement for the applicability of excluding $X$ from a content model $E$ is that $L(E)_{-X} \not\subseteq \{\epsilon\}$.

Intuitively, $E_{\ominus X} \equiv \emptyset$ or $E_{\ominus X} \equiv \epsilon$ means that excluding $X$ from $E$ precludes all elements from the content of $E$. On the other hand, $E_{\ominus X} \not\equiv \emptyset$ and $E_{\ominus X} \not\equiv \epsilon$ means that $X$ precludes only elements that are optional in $L(E)$. We propose that the preceding requirement be the formalization of how a content model indicates that an element is required. Notice that computing $E_{\ominus X}$ is a reasonable and efficient test for the applicability of exclusions $X$ to a content model $E$.
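For content models that denote finite languages, the proposed applicability requirement can be tested directly on the language. The sketch below is only an illustration with hypothetical function names; it enumerates strings rather than computing the modified content model, which is what the efficient test described above would do.

```python
def exclude(L, X):
    """L_{-X}: the strings of L that contain no symbol of X."""
    return {u for u in L if not set(u) & set(X)}


def exclusion_applicable(L, X):
    """The proposed requirement: L_{-X} is not a subset of {epsilon}."""
    return not exclude(L, X) <= {""}


E = {"abc", "acc"}                 # the language of a(b|c)c and of a(bc|cc)
print(exclude(E, {"b"}))
print(exclusion_applicable(E, {"b"}))     # b is optional
print(exclusion_applicable(E, {"c"}))     # c is required in every string
```

Excluding $b$ leaves $\{acc\}$, so the exclusion is applicable; excluding $c$ leaves the empty language, so under this requirement the exclusion is not applicable.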

We are now in a position to consider the removal of exceptions from a DTD. Let $G_1 = (N_1, \Sigma, P_1, S_1)$ be an extended context-free grammar with exceptions and let $G_2 = (N_2, \Sigma, P_2, S_2)$ be the extended context-free grammar that results by eliminating exceptions from $G_1$ using the algorithm in Fig. 3. If $B_{+I-X} \in N_2$, then there is a production schema $B_{+I-X} \to E_B$ in $P_2$ if and only if there is a production schema $B \to E + I_B - X_B$ in $P_1$ such that $E_B = (E_{+(I \cup I_B)-(X \cup X_B)})^{(I \cup I_B, X \cup X_B)}$. Lastly, we can apply the same idea to an SGML DTD with exceptions to obtain a structurally equivalent DTD without exceptions.

4 Concluding Remarks and Open Problems

When we apply the exception-removal transformation of Fig. 3 to an SGML DTD with exceptions, we do indeed obtain a new DTD without exceptions. Unfortunately, the document instances of the original DTD do not conform to the new DTD, since the new DTD has new elements, and new tags corresponding to those elements, that do not appear in the old instances. Therefore, how useful are our results?

First, the results are interesting in their own right as a contribution to the theory of extended context-free grammars and SGML DTDs. We can eliminate exceptions to give structurally equivalent grammars and DTDs while preserving their SGML unambiguity.

Second, during the DTD design phase, it may be convenient to use exceptions. Our results imply that we can eliminate the exceptions and produce a final DTD design without exceptions before any document instances are created.

Third, rather than producing a new DTD, we can emulate it with an extended context-free grammar. We first apply the exception-removal transformation to the extended context-free grammar with exceptions given by the original DTD with exceptions. We then modify its productions to explicitly include the old tags. For example, we transform a production of the form

$$A_{+I-X} \to E_A$$

into a production of the form

$$A_{+I-X} \to \text{`<A>'} \; E_A \; \text{`</A>'},$$

where `<A>' and `</A>' in $\Sigma'$ are the start and end tags that the new grammar has to use as delimiters for the element $A$. The new productions can be applied to the old DTD instances.

Lastly, we can attack the document-instance problem head on by translating old instances into new instances. A convenient technique is to use a generalization of syntax-directed translation grammars (see Aho and Ullman [1, 2] and Wood [12]) to give extended context-free transduction grammars and the corresponding transduction version of DTDs that we call "Document Type Transduction Definitions." We are currently investigating this approach, which would also be applicable to the DTD database schema issue raised by Christofides et al. [8]. It could also be used to convert a document marked up according to one DTD into a document marked up according to a different, but related, DTD.

Acknowledgements

We would like to thank Anne Brüggemann-Klein and Gaston Gonnet for the discussions that encouraged us to continue our investigation of the exception problem in SGML.

References

[1] A.V. Aho and J.D. Ullman. The Theory of Parsing, Translation, and Compiling, Vol. I: Parsing. Prentice-Hall, Inc., Englewood Cliffs, NJ.
[2] A.V. Aho and J.D. Ullman. The Theory of Parsing, Translation, and Compiling, Vol. II: Compiling. Prentice-Hall, Inc., Englewood Cliffs, NJ.
[3] A. Brüggemann-Klein. Unambiguity of extended regular expressions in SGML document grammars. In Th. Lengauer, editor, Algorithms - ESA '93. Springer-Verlag.
[4] A. Brüggemann-Klein. Regular expressions into finite automata. Theoretical Computer Science, 120:197-213.

[5] A. Brüggemann-Klein. Compiler-construction tools and techniques for SGML parsers: Difficulties and solutions. To appear in EPODD.
[6] A. Brüggemann-Klein and D. Wood. One-unambiguous regular languages. To appear in Information and Computation.
[7] A. Brüggemann-Klein and D. Wood. The validation of SGML content models. To appear in Mathematical and Computer Modelling.
[8] V. Christofides, S. Christofides, S. Cluet, and M. Scholl. From structured documents to novel query facilities. SIGMOD Record, 23(2):313-324, June (Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data).
[9] C. F. Goldfarb. The SGML Handbook. Clarendon Press, Oxford.
[10] P. Kilpeläinen and D. Wood. Exceptions in SGML document grammars. Submitted for publication.
[11] International Organization for Standardization. ISO 8879: Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), October.
[12] D. Wood. Theory of Computation. John Wiley, New York, NY.
