Evaluating the Role of Context in Syntax Directed Compression of XML Documents

S. Hariharan, Priti Shankar
Department of Computer Science and Automation
Indian Institute of Science
Bangalore 560012, India
November 16, 0

Abstract

We propose a new technique based on recursive finite state machines for tracking context to be used in a statistical compression scheme for XML documents. We also study the tradeoffs between space and compression ratio by observing the effects of either using or ignoring root-to-leaf contexts for textual content in the associated tree structures. The advantage of our scheme is that it is syntax aware, and the compressor and decompressor can be generated automatically from the Document Type Definition (DTD).

1 Introduction

Extensible Markup Language (XML) [7] is a standard meta-language used to describe a class of data objects, called XML documents, and to specify how they are to be processed by computer programs. XML is rapidly becoming a standard for the creation and parsing of documents. However, a significant disadvantage is document size, which is a consequence of the overhead of markup information. Present-day XML databases are massive and the need for compression is pressing. XML documents have their structure dictated by a Document Type Definition (DTD), which specifies the syntax of the documents. It is therefore natural to investigate the use of syntactic models for the compression of such data. We propose a syntax-directed scheme for compression that is fully automatic: the user is required to specify just the DTD, and the compressor and decompressor are generated from the syntactic specification. The model that is constructed mirrors the DTD, in that it tracks the structure of the document and is able to make accurate predictions of the expected elements.
Therefore, whenever the predicted element is unique, there is no need to encode it at all, since the decoder generates the same model from the DTD and is thus able to generate the unique expected symbol. Most markup symbols fall into this category. Character data associated with a single element can either be directed to the same model for arithmetic compression irrespective of the instance of the element in the DTD, in which case the model is said to be path agnostic, or one may choose to have a separate model for each root-to-leaf path in the underlying tree for the document, in which case the scheme is path sensitive. We evaluate both schemes in this paper. Since elements may be nested recursively, the set of models used is in general a set of mutually recursive automata. A stack is used to store root-to-leaf context in the underlying structure tree, and operations on the stack are governed by syntax. We have run experiments on five large databases and compared the performance of our tool with that of two well-known XML-aware compression schemes, XMill [9] and XMLPPM [10]. We do not address the problem of querying compressed documents in this paper.

Section 2 describes related work. Section 3 provides the relevant background. Section 4 describes the new syntactic model proposed and used in our experiments. Section 5 gives the results of experiments run using our tool and compares them with those obtained using XMill and XMLPPM. Finally, Section 6 concludes the paper.

2 Related Work

The XML-specific compression schemes that we are aware of are XMLZIP [8], XMill [9] and XMLPPM [10]. The last two take advantage of the structure in XML data either by transforming the file after parsing and breaking up the tree into components (as in the case of XMill), or by injecting hierarchical element structure symbols into a model that multiplexes several models based on the syntactic structure of XML (in the case of XMLPPM). They do not require the DTD to compress the document.

XMLZIP parses XML data and creates the underlying tree. It then breaks up the tree into many components: the root component at depth d and a component for each of the subtrees at depth d. Each of the subtrees is compressed using Java's ZIP-DEFLATE archive library. The advantage of such a scheme is that it allows limited random access to parts of the document without the need to have the whole tree in main memory.

XMill separates the structure from the content and compresses them separately.
Data items are grouped into containers and each container is compressed separately. Different compressors are applied to different containers depending on the content. The criterion for grouping data into a container is not just the tag name but also the path from the root to the tag name.

XMLPPM uses a modeling technique called Multiplexed Hierarchical Modeling (MHM), based on the SAX [4] encoding and on PPM [3] modeling. The technique employs two basic ideas: multiplexing several text compression models based on the syntactic structure of XML (one model for element structure, one for attributes, and so on), and injecting hierarchical element structure symbols into the multiplexed models (these are essentially root-to-leaf paths to the element). Multiplexing enables more effective modeling of hierarchical structure. A common dependency is for the enclosing element tag to be strongly correlated with the enclosed data. MHM exploits this by injecting the enclosing tag symbol into the element, attribute or string model immediately before an element, attribute or string is encoded. Injecting a symbol means telling the model that it has been seen, but not explicitly encoding or decoding it.

Earlier, our tool XAUST (XML Compression with AUtomata and STack) was compared with XMLPPM and XMill using unbounded memory for PCDATA; here we evaluate the role of context in the bounded-memory case.

3 Background

3.1 Arithmetic Coding

Arithmetic coding does not replace every input symbol with a specific code. Instead, it processes a stream of input symbols and replaces it with a single number greater than or equal to 0 and less than 1. This single number can be uniquely decoded to recreate the exact stream of symbols that went into its construction. In order to construct the output number, the symbols being encoded need to have a set of probabilities assigned to them. Initially the range of the message is the interval [0, 1). As each symbol is processed, the range is narrowed to the portion of it allocated to that symbol. The range thus gets narrower and narrower, requiring an increasing number of bits to represent it as successive symbols are encoded. At the end, a single number in the final interval encodes the stream. The decoder works in exactly the same manner and mimics the action of the encoder. For this scheme to be effective, the model should produce probabilities that deviate from a uniform distribution. The better the model is at making such predictions, the better the compression ratios will be.

3.1.1 Finite Context Modeling

In a finite context scheme, the probability of each symbol is calculated based on the context the symbol appears in. In its traditional setting, the context is just the symbols that have been previously encountered. The order of the model refers to the number of previous symbols that make up the context. In an adaptive order-k model, both the compressor and the decompressor start with the same model. The compressor encodes a symbol using the existing model and then updates the model to account for the new symbol. Typically a model is a set of frequency tables, one for each context. After a symbol is seen, the frequency counts in the tables are updated. The frequency counts are used to approximate the probabilities, and the scheme is adaptive because this is done as the symbols are being scanned.
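The interval narrowing and adaptive counting described above can be sketched as follows. This is a minimal illustration, not the coder used in XAUST: it assumes an order-0 model with every count initialised to 1, uses exact rational arithmetic from Python's `fractions` module instead of the fixed-precision arithmetic a practical coder needs, and the function names `narrow`, `encode` and `decode` are hypothetical. The decoder is told the message length rather than using an end-of-stream symbol.

```python
from collections import Counter
from fractions import Fraction


def narrow(low, high, symbol, counts):
    """Return the part of [low, high) allocated to `symbol` under `counts`."""
    total = sum(counts.values())
    width = high - low
    cumulative = 0
    for s in sorted(counts):  # fixed symbol order shared by both sides
        if s == symbol:
            return (low + width * Fraction(cumulative, total),
                    low + width * Fraction(cumulative + counts[s], total))
        cumulative += counts[s]
    raise KeyError(symbol)


def encode(message, alphabet):
    """Encode `message` as a final interval using an adaptive order-0 model."""
    counts = Counter({s: 1 for s in alphabet})  # both sides start identically
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        low, high = narrow(low, high, symbol, counts)
        counts[symbol] += 1  # update the model only after coding the symbol
    return low, high


def decode(value, length, alphabet):
    """Recover `length` symbols from any `value` inside the final interval."""
    counts = Counter({s: 1 for s in alphabet})
    low, high = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(length):
        for s in sorted(counts):
            lo, hi = narrow(low, high, s, counts)
            if lo <= value < hi:
                decoded.append(s)
                low, high = lo, hi
                counts[s] += 1  # mirror the encoder's update
                break
    return "".join(decoded)
```

For the message `aab` over the alphabet `{a, b}`, `encode` narrows [0, 1) to [1/4, 1/3), and any number in that interval decodes back to `aab`.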
The decompressor similarly decodes a symbol using the existing model and then updates the model. Since there are potentially $q^k$ possibilities for level-k contexts, where q is the size of the symbol space, update can be a costly process, and the tables consume a large amount of space. This causes arithmetic coding to be somewhat slow.

3.2 XML Syntax

XML documents contain element tags, which include start tags like <name> and end tags like </name>. Elements can nest other elements, and therefore a tree structure can be associated with an XML document. Elements can also contain plain text, comments and special processing instructions for XML processors. In addition, opening element tags can have attributes with values, such as gender in <person gender="female">. Detailed specifications are given in [7]. XML documents have to conform to a specified syntax, usually in the form of a DTD. Usually XML documents are parsed to ensure that only valid data reaches an application. Most XML parsing libraries use either the SAX interface or the DOM (Document Object Model) interface. SAX is an event-based interface suitable for search tools and algorithms that need one pass. The DOM model, on the other hand, is suitable for algorithms that have to make multiple passes.

Since XML documents are stored as plain text files, one possibility is to use standard compression tools like bzip2 or ppm*. Cheney [10] has performed a study of the compression achieved using such general-purpose tools and observes that each general-purpose compressor performs poorly on at least one document. Since XML documents are governed by a rather restrictive set of rules, the obvious way to go is to use the rules to predict what symbols to expect. Further, if the rules are known a priori, then a compressor tuned to take advantage of the rules can be generated directly from the rules themselves. This is what we achieve with our tool XAUST (XML Compression with AUtomata and STack).

The scheme proposed in this paper assumes that the DTD describing the data is known to both the sender and the receiver. Typically, an element of a DTD consists of distinct beginning and ending tags enclosing regular expressions over other elements. As noted above, elements can also contain plain text, comments and processing instructions, and opening element tags can have attributes with values.

Example 1 Consider a DTD defined as follows:

    <!DOCTYPE addressbook [
      <!ELEMENT addressbook (card*)>
      <!ELEMENT card ((name | (givenname, familyname)), , note?)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT givenname (#PCDATA)>
      <!ELEMENT familyname (#PCDATA)>
      <!ELEMENT (#PCDATA)>
      <!ELEMENT note (#PCDATA)>
    ]>

Below is an instance of an XML document conforming to this DTD.

    <addressbook>
      <card>
        <givenname>hariharan</givenname>
        <familyname>iyer</familyname>
        < >hari@gmail.com</ >
      </card>
      <card>
        <name>Priti Shankar</name>
        < >priti@gmail.com</ >
        <note>hariharan's advisor</note>
      </card>
    </addressbook>

It can be seen that each rule has an element name followed by a regular expression involving elements.
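To make the "regular expression over elements" reading concrete, the content model of card can be checked with an ordinary regular expression over the sequence of child element names. This is only an illustrative sketch: the name of the third child of card is blank in the text above, so `email` is used here purely as a placeholder assumption, and `CARD_MODEL` and `conforms` are hypothetical names.

```python
import re

# Content model of `card`, written over space-separated child element names.
# ASSUMPTION: "email" is a placeholder for the element whose name is blank
# in the DTD as given in the text.
CARD_MODEL = re.compile(r"(?:name|givenname familyname) email(?: note)?")


def conforms(children):
    """Return True if the list of child element names matches the model."""
    return CARD_MODEL.fullmatch(" ".join(children)) is not None
```

Both cards in the instance document conform: `conforms(["givenname", "familyname", "email"])` and `conforms(["name", "email", "note"])` are true, while a sequence such as `["name", "note"]` is rejected.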
It is thus natural to associate a deterministic finite automaton (DFA) with the element definition in a rule. For example, the DFA in Figure 1 represents the rule for the element card.

[Figure 1: DFA for the element card in Example 1. Edge labels include name, givenname, familyname and note.]

There are two kinds of states in this automaton: those having a single output transition and those with multiple output transitions. Symbols that begin elements labeling single output transitions need not be encoded, as their occurrence probability is 1. Thus encoding of symbols by the arithmetic compressor needs to be performed only at states with more than one outgoing transition. An arithmetic encoding procedure is called at each such state for each element. As we observed in Section 3, the arithmetic encoder maintains a set of frequency tables which it updates each time it encodes a symbol. Each element which has #PCDATA content will result in a call to an arithmetic encoder which, whenever a path agnostic scheme is used, uses a common set of tables for all instances of that element. If a path sensitive scheme is used, different sets of tables are used for each state which has a transition labeled by that element. An example will illustrate the difference.

Example 2 Consider the elements below:

    <!ELEMENT Project (date, date,...) >
    <!ELEMENT Employee (date,...) >
    <!ELEMENT date (#PCDATA)>

XAUST provides the choice of either using a single set of tables for date, or using the contexts Project and Employee to route textual data associated with the element date to two separate sets of tables.

A typical sequence of actions is then as follows. Enter the start state of a DFA representing the right side of a rule. If there is only one edge out of the state, no tag needs to be encoded; if the element has #PCDATA content, encode the string of symbols using the frequency tables associated with that element. If there is more than one edge, encode the tag beginning the element labeling the edge taken, using an arithmetic encoder for that state, and transit to the start state of the DFA for that element.
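The rule "encode only where there is a choice" can be sketched on the DFA for card. This is a hedged illustration rather than XAUST's implementation: the state numbering is invented, `email` is again a placeholder for the element whose name is blank in Example 1, and the end of the element is modelled as one more alternative at a final state (an interpretation suggested by the close-tag handling in the pseudo-code of Section 4). The function reports which symbols would have to be passed to the arithmetic coder.

```python
CLOSE = "</card>"  # stands for the end-of-element choice at a final state

# Hypothetical state numbering for the DFA of `card`.
# ASSUMPTION: "email" is a placeholder for the unnamed element.
CARD_DFA = {
    0: {"name": 2, "givenname": 1},
    1: {"familyname": 2},
    2: {"email": 3},
    3: {"note": 4},
    4: {},
}
FINAL_STATES = {3, 4}


def choices(state):
    """All symbols that may legally come next in `state`."""
    options = sorted(CARD_DFA[state])
    if state in FINAL_STATES:
        options.append(CLOSE)
    return options


def symbols_to_encode(children):
    """Return the (state, symbol) pairs that are not certain and must be coded."""
    state, coded = 0, []
    for symbol in list(children) + [CLOSE]:
        options = choices(state)
        if symbol not in options:
            raise ValueError(f"{symbol!r} is not allowed in state {state}")
        if len(options) > 1:  # a unique successor has probability 1
            coded.append((state, symbol))
        if symbol != CLOSE:
            state = CARD_DFA[state][symbol]
    return coded
```

For the first card of Example 1, only the choice of `givenname` at the start state and the decision to end the element without a note need to be coded; `familyname` and the placeholder `email` are predicted with certainty.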
The decoder mimics the action of the encoder, generating symbols that are certain and using the arithmetic decoder for symbols that are not. We now define the model more formally.

4 The Recursive Finite State Machine

We recall that the strings following each element declaration are just regular expressions over element names, and therefore each of them can be associated with a deterministic finite automaton.

The collection of elements is described by a recursive finite state machine, which we now define.

Definition 4.1 A recursive finite state machine $M$ over an alphabet $\Sigma$ is specified by a tuple $\langle M_1, M_2, \ldots, M_k \rangle$, where each element of the tuple is a finite state machine $M_i = (S_i, \Sigma, \delta_i, s_{0i}, F_i)$, where $S_i$ is a finite set of states, $\Sigma$ is the input alphabet, $s_{0i}$ is the start state, $F_i \subseteq S_i$ is the set of final states, and $\delta_i$ is a mapping from $S_i \times \Sigma$ to $S_i$.

In the present setting, the members of $\Sigma$ are the elements of the DTD and $k$ is the number of elements. There is one finite state machine for each element. The recursive finite state machine maintains a stack during its operation. A configuration of $M$ is a quadruple (index, state, stack, string), where index is the index of the current DFA which $M$ is traversing, state is the state of that DFA where $M$ is currently stationed, stack represents the content of the context stack, which is initially empty, and string represents the unconsumed suffix of the input string, namely the XML document to be compressed.

Assume that the current configuration of $M$ is $(i, s_{li}, \alpha, o_m s)$, where $o_m$ is an open tag for element $m$ and $s$ is the suffix of the input string after $o_m$. When the open tag for element $m$ is encountered in the document, the pair $(i, s_{li})$ is stored on the calling stack and the start state $s_{0m}$ of the DFA for element $m$ is entered. The current configuration of $M$ now becomes $(m, s_{0m}, \alpha(i, s_{li}), s)$, where the pair $(i, s_{li})$ is concatenated with the stack contents. When the closing tag $c_m$ for element $m$ is encountered, the stack is popped and the new configuration of $M$ becomes $(i, s'_{li}, \alpha, s')$, where $\delta_i(s_{li}, m) = s'_{li}$ and $s'$ is the suffix of the input following $c_m$.

We now indicate how to use the states of $M$ to refine probability estimates. Each state of $M$ is associated with a frequency table if there is more than one output transition from the state.
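The push and pop behaviour of the configurations above, together with the two ways of choosing PCDATA tables from Example 2, can be sketched as follows. Everything concrete here is an assumption made for illustration: the root element `Company`, the truncated content models `(date, date)` and `(date)` (the elided parts of Example 2 are simply left out), the event format, and the function name. Keying the path sensitive tables on the element names held on the stack is one possible reading of "root-to-leaf context", not necessarily the one XAUST implements.

```python
# Hypothetical DTD, for illustration only:
#   Company  -> (Project | Employee)*
#   Project  -> (date, date)      (rest of the content model omitted)
#   Employee -> (date)            (rest of the content model omitted)
#   date     -> (#PCDATA)
DFAS = {
    "Company": {0: {"Project": 0, "Employee": 0}},
    "Project": {0: {"date": 1}, 1: {"date": 2}, 2: {}},
    "Employee": {0: {"date": 1}, 1: {}},
    "date": {0: {"#PCDATA": 1}, 1: {}},
}


def pcdata_table_keys(events, root="Company", path_sensitive=False):
    """Run the recursive machine and return the table key used for each text."""
    element, state, stack = root, 0, []
    keys = []
    for kind, value in events:
        if kind == "open":
            if value not in DFAS[element][state]:
                raise ValueError(f"<{value}> not allowed inside <{element}>")
            stack.append((element, state))      # store (index, state)
            element, state = value, 0           # enter start state of callee
        elif kind == "close":
            closed = element
            element, state = stack.pop()        # return to the caller
            state = DFAS[element][state][closed]  # transition on the element
        else:  # character data
            path = tuple(name for name, _ in stack) + (element,)
            keys.append(path if path_sensitive else (element,))
            state = DFAS[element][state]["#PCDATA"]
    return keys


EVENTS = [
    ("open", "Project"),
    ("open", "date"), ("text", "2004"), ("close", "date"),
    ("open", "date"), ("text", "2005"), ("close", "date"),
    ("close", "Project"),
    ("open", "Employee"),
    ("open", "date"), ("text", "1999"), ("close", "date"),
    ("close", "Employee"),
]
```

With `path_sensitive=False` all three date strings share one table key; with `path_sensitive=True` the dates under Project and under Employee are routed to two different keys.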
The entries in the table are the labels of edges leaving that state. The frequencies are the frequencies with which the edges are taken. An order-0 arithmetic encoder is used at each state with the appropriate table to represent probabilities. The machine M begins in the start state of the first element, i.e. the element specified in the DOCTYPE statement. Each time it sees an opening tag o_e, it takes the transition labeled with element e, pushes the current state and the index of the current machine M_i on the context stack as described, and moves to the start state of the machine associated with element e. Each state initiates an encoding (or decoding, if decompression is being carried out) action. If there is a single transition out of that state, then the element is not encoded, as its probability is one, and there is no need to maintain any table. If there is more than one transition, then an order-0 frequency table is maintained which gives the probabilities at that state. In the example below, we need not encode the tag D, but we have to encode B and C.

    <!ELEMENT A ((B | C), D)>

We note here that there is an implicit transition out of every final state of every DFA M_i to the state on top of the context stack. However, such transitions depend on the calling context and are detected only at runtime (i.e. during compression or decompression). These transitions are taken on encountering the closing tag of the element. If the element is associated with PCDATA, then a path sensitive scheme uses the contents of the stack to route the compressor to the correct set of tables. This corresponds to possibly having a different model for PCDATA associated with each instance of the element in the DTD. In contrast, a path agnostic scheme will route all PCDATA associated with any instance of that element in the DTD to the same set of tables. We implement both schemes in this paper to study space/compression-ratio tradeoffs. We note here that the stack contents denote the root-to-element path in the implied tree representation of the structure. Consider the element below.

    <!ELEMENT A ((#PCDATA | B)*)>

There are two transitions from the start state of the DFA for element A. The first invokes the arithmetic model for PCDATA. The second invokes the DFA for element B after pushing the current pair on the stack.

The pseudo-code for the Encoder (Compressor) is given below. The pseudo-code for the Decoder (Decompressor) is similar. Encoding attributes is similar to encoding PCDATA and hence is not shown.

    void Encoder()
    {
        ExitLoop = false;
        // StateStruct is the pair of ints (ElementIndex, StateIndex).
        // ElementIndex identifies the automaton.
        // StateIndex is the state in that automaton.
        StateStruct CurrState(0, 0);
        while (ExitLoop == false)
        {
            Type = GetNextType(FilePointer, ElementIndex);
            switch (Type)
            {
            case OPENTAG:
                // Encode ElementIndex in CurrState context
                EncodeOpenTag(CurrState, ElementIndex);
                Stack.push(CurrState);
                CurrState = StateStruct(ElementIndex, 0);
                break;
            case CLOSETAG:
                // Encode CLOSETAG in CurrState context
                EncodeCloseTag(CurrState);
                if (Stack.empty() == true)
                {
                    ExitLoop = true;
                }
                else
                {
                    CurrState = Stack.pop();
                    // Make state transition in CurrState.ElementIndex
                    // automaton and get the next state
                    CurrState.StateIndex = MakeStateTransition(CurrState, ElementIndex);
                }
                break;
            case PCDATA:
                // Encode PCDATA in path sensitive or path agnostic context
                EncodePcdata(CurrState);
                CurrState.StateIndex = MakeStateTransition(CurrState, PCDATA);
                break;
            }
        }
    }

5 Experimental Results

We have experimented with allocation of a memory block of fixed size for runtime memory during compression. We have examined the performance of the tool in terms of the compression ratio as a function of two parameters. The first is the strategy for flushing context when the maximum memory allocated for PCDATA is full. The three strategies implemented are: flushing out all context tables, flushing out context tables until their size reduces to half of the memory allocated, and flushing the largest table. The second is the memory allocated for PCDATA, which varies from to 0 Mb; as an optimization strategy, the memory allocated for encoding states and attributes is unbounded. The sizes of the documents used are displayed in Table 1.

Table 1: Sizes of XML documents that were compressed

    Name      Size (in MB)
    XMark1    113
    XMark2    230
    DBLP      302
    UniRef    79

We define the Compression Ratio as the ratio of the size of the compressed document to the size of the original document, expressed as a percentage. The compression ratios achieved for various sizes of the allocated memory block are displayed for each strategy. We have also measured the effect of using root-to-leaf context in a path sensitive scheme for the PCDATA tables. In this case the range of memory block sizes is higher, as the tables need more space. We observe that the path agnostic scheme seems to perform better under a limited block size constraint. When the results are compared with those of XMLPPM, we see that ours is better for XMark by 2.% and for DBLP by 0.1%, and XMLPPM is better for UniRef by 2.7%.
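The compression ratio definition and the three flushing strategies can be sketched as follows. This is a hedged illustration, not the XAUST implementation: a context table is modelled as a dictionary, its size is taken to be its number of entries, and the text does not say which tables the second strategy evicts, so evicting the largest tables first is an assumption. All function names are hypothetical.

```python
def compression_ratio(compressed_size, original_size):
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_size / original_size


def total_size(tables):
    """ASSUMPTION: the size of a table is its number of entries."""
    return sum(len(table) for table in tables.values())


def flush_all(tables):
    """Strategy 1: discard every context table."""
    return {}


def flush_to_fraction(tables, budget, fraction=0.5):
    """Strategy 2: evict tables until the total fits in fraction * budget.

    ASSUMPTION: the largest tables are evicted first.
    """
    kept = dict(tables)
    for key in sorted(kept, key=lambda k: len(kept[k]), reverse=True):
        if total_size(kept) <= fraction * budget:
            break
        del kept[key]
    return kept


def flush_largest(tables):
    """Strategy 3: discard only the single largest table."""
    kept = dict(tables)
    if kept:
        del kept[max(kept, key=lambda k: len(kept[k]))]
    return kept
```

Each function returns a new dictionary and leaves its argument untouched, so the three strategies can be compared on the same set of tables. Under the definition above, a smaller compression ratio is better.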

6 Conclusion and Future Work

We present and evaluate new schemes for syntax-directed compression of XML documents, where the underlying context model for the compression of tags is a recursive finite state automaton generated directly from the DTD of the document. The model is switched automatically on transiting from one automaton to another, with enough information stored on the stack that a return to the right state is possible; this ensures that the correct model is always used for compression. (In fact, it precisely achieves the multiplexing of models mentioned in XMLPPM in a completely natural manner.) We have measured the effects of allocating a fixed-size block of runtime memory for the compressor, as well as varying strategies for flushing out the context tables. We have also compared the path sensitive and path agnostic schemes for storing context for PCDATA. Our experiments indicate that path sensitive schemes are less effective in the fixed memory model.

Future work will concentrate on modifying this scheme to facilitate simple tree queries on the XML text. The fact that the tree structure is implicit in the textual representation, and that function calls to elements may be augmented with parameters, makes it feasible to handle tree queries which require only a forward pass over the implicit tree while the document is being decompressed.

References

[1] DBLP Computer Science Bibliography, ley/db
[2] Nelson, M.: Arithmetic Coding. Dr. Dobbs Journal (1991)
[3] Teahan, W.J.: PPMD+, PPM* source code. wjt/.
[4] Megginson, D.: SAX: A Simple API for XML.
[5] Witten, I.H., Neal, R.M., Cleary, J.G.: Arithmetic Coding for Data Compression. Communications of the ACM, 30(6): -40, June
[6] XMark - An XML Benchmark project. Efficient query evaluation over compressed XML data. In Proc. of EDBT 04.
[7] Extensible Markup Language (XML) 1.0. W3C Recommendation Feb, Reference: REC-xml
[8] XML Solutions. XMLZIP.
[9] Liefke, H., Suciu, D.: XMill: an efficient compressor for XML data. Proceedings of ACM SIGMOD, 00.
[10] Cheney, J.: Compressing XML with Multiplexed Hierarchical Models. Proceedings of the 01 IEEE Data Compression Conference, pp
[11] UniProt (Universal Protein Resource).

[Figure 2: Statistics for compression ratios versus memory usage for XAUST, and compression ratios for XMLPPM (continued on next page). Panels: XMark1, XMark2, DBLP and UniRef, all without path context; memory is given in MB.]

[Figure 3: Continued from previous page. Panels: XMark1, XMark2 and DBLP with path context, and XMLPPM results for DBLP, XMark and UniRef.]
