Evaluating the Role of Context in Syntax Directed Compression of XML Documents
S. Hariharan, Priti Shankar
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India
November 16, 0

Abstract

We propose a new technique based on recursive finite state machines for tracking context to be used in a statistical code compression scheme for XML documents. We also study the tradeoffs between space and compression ratio, by observing the effects of either using or ignoring root to leaf contexts for textual content in the associated tree structures. The advantage of our scheme is that it is syntax aware, and the compressor and decompressor can be generated automatically from the Document Type Definition (DTD).

1 Introduction

Extensible Markup Language (XML) [7] is a standard meta language used to describe a class of data objects, called XML documents, and to specify how they are to be processed by computer programs. XML is rapidly becoming a standard for the creation and parsing of documents. However, a significant disadvantage is document size, which is a consequence of the overhead of markup information. Present day XML databases are massive and the need for compression is pressing. XML documents have their structure dictated by a Document Type Definition (DTD), which specifies the syntax of the documents. It is therefore natural to investigate the use of syntactic models for the compression of such data. We propose a syntax directed scheme for compression, which is totally automatic: the user is required to specify just the DTD, and the compressor and decompressor are generated from the syntactic specification. The model that is constructed mirrors the DTD, in that it tracks the structure of the document, and is able to make accurate predictions of the expected elements.
Therefore, whenever the predicted element is unique, there is no need to encode it at all, as the decoder generates the same model from the DTD and is thus able to generate the unique expected symbol. Most markup symbols fall into this category. Character data associated with a single element can either be automatically directed to the same model for arithmetic compression irrespective of the instance of the element in the DTD, in which case the model is said to be path agnostic, or one may choose to have a separate
model for each root to leaf path in the underlying tree for the document, in which case the scheme is path sensitive. We evaluate both schemes in this paper. Since elements may be nested recursively, the set of models used is in general a set of mutually recursive automata. A stack is used to store root to leaf context in the underlying structure tree, and operations on the stack are governed by syntax. We have run experiments on five large databases and compared the performance of our tool with that of two well known XML-aware compression schemes, XMill [9] and XMLPPM [10]. We do not address the problem of querying compressed documents in this paper.

Section 2 describes related work. Section 3 provides the relevant background. Section 4 describes the new syntactic model proposed and used in our experiments. Section 5 gives the results of experiments run using our tool and compares the results with those obtained using XMill and XMLPPM. Finally, Section 6 concludes the paper.

2 Related Work

The XML-specific compression schemes that we are aware of are XMLZIP [8], XMill [9] and XMLPPM [10]. The last two have tried to take advantage of the structure in XML data by either transforming the file after parsing, breaking up the tree into components (as in the case of XMill), or injecting hierarchical element structure symbols into a model that multiplexes several models based on the syntactic structure of XML (in the case of XMLPPM). They do not require the DTD to compress the document.

XMLZIP parses XML data and creates the underlying tree. It then breaks up the tree into many components: the root component at depth d and a component for each of the subtrees at depth d. Each of the subtrees is compressed using Java's ZIP-DEFLATE archive library. The advantage of such a scheme is that it allows limited random access to parts of the document without the need to have the whole tree in main memory.

XMill separates the structure from the content and compresses them separately.
Data items are grouped into containers and each container is compressed separately. Different compressors are applied to different containers depending on the content. The criterion for grouping data into a container is not just the tag name but also the path from the root to the tag name.

XMLPPM uses a modeling technique called Multiplexed Hierarchical Modeling (MHM), based on the SAX [4] encoding and on PPM [3] modeling. The technique employs two basic ideas: multiplexing several text compression models based on the syntactic structure of XML (one model for element structure, one for attributes, and so on), and injecting hierarchical element structure symbols into the multiplexed models (these are essentially root to leaf paths to the element). Multiplexing enables more effective hierarchical structure modeling. A common case for these dependencies is for the enclosing element tag to be strongly correlated with enclosed data. MHM exploits this by injecting the enclosing tag symbol into the element, attribute or string model immediately before an element, attribute or string is encoded. Injecting a symbol means telling the model that it has been seen, but not explicitly encoding or decoding it.

Earlier, our tool XAUST (XML Compression with AUtomata and STack) was compared with XMLPPM and XMill using unbounded memory for PCDATA; here we evaluate the role of contexts in the bounded memory case.
3 Background

3.1 Arithmetic Coding

Arithmetic coding does not replace every input symbol with a specific code. Instead it processes a stream of input symbols and replaces it with a single number greater than or equal to 0 and less than 1. This single number can be uniquely decoded to recreate the exact stream of symbols that went into its construction. In order to construct the output number, the symbols being encoded need to have a set of probabilities assigned to them. Initially the range of the message is the interval [0, 1). As each symbol is processed, the range is narrowed to that portion of it allocated to the symbol. The range thus gets narrower and narrower, requiring an increasing number of bits to represent it as successive symbols are encoded. At the end, a single number in the final interval encodes the stream. The decoder works in exactly the same manner and mimics the action of the encoder. For this scheme to be effective, the model should produce probabilities that deviate from a uniform distribution. The better the model is at making such predictions, the better the compression ratios will be.

Finite Context Modeling

In a finite context scheme, the probability of each symbol is calculated based on the context the symbol appears in. In its traditional setting, the context is just the symbols that have been previously encountered. The order of the model refers to the number of previous symbols that make up the context. In an adaptive order k model, both the compressor and the decompressor start with the same model. The compressor encodes a symbol using the existing model and then updates the model to account for the new symbol. Typically a model is a set of frequency tables, one for each context. After a symbol is seen, the frequency counts in the tables are updated. The frequency counts are used to approximate the probabilities, and the scheme is adaptive because this is done as the symbols are being scanned.
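To make the interval narrowing and the adaptive update concrete, the following Python sketch encodes a short message with exact rational arithmetic and an adaptive order 0 model. It is only an illustration under stated assumptions: the function names (make_counts, interval_for, encode, decode) are ours, every symbol starts with a count of 1, and the message length is handed to the decoder separately. A practical coder would use fixed-precision integers and emit bits incrementally.

```python
from fractions import Fraction


def make_counts(alphabet):
    # Assumption: every symbol starts with count 1 so no probability is zero.
    return {s: 1 for s in alphabet}


def interval_for(counts, symbol):
    """Return the sub-interval [low, high) of [0, 1) allocated to `symbol`."""
    total = sum(counts.values())
    cum = 0
    for s in sorted(counts):
        if s == symbol:
            return Fraction(cum, total), Fraction(cum + counts[s], total)
        cum += counts[s]
    raise KeyError(symbol)


def encode(message, alphabet):
    """Narrow [0, 1) symbol by symbol; return (low, width) of the final range."""
    counts = make_counts(alphabet)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        s_low, s_high = interval_for(counts, sym)
        low += width * s_low
        width *= s_high - s_low
        counts[sym] += 1  # adaptive update, done after the symbol is coded
    return low, width


def decode(value, length, alphabet):
    """Mimic the encoder: same model, same updates, driven by `value`."""
    counts = make_counts(alphabet)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        target = (value - low) / width  # position of value in current range
        for sym in sorted(counts):
            s_low, s_high = interval_for(counts, sym)
            if s_low <= target < s_high:
                break
        out.append(sym)
        low += width * s_low
        width *= s_high - s_low
        counts[sym] += 1
    return "".join(out)


if __name__ == "__main__":
    msg = "abracadabra"
    low, width = encode(msg, set(msg))
    # Any number inside the final range identifies the message.
    print(decode(low + width / 2, len(msg), set(msg)))
```

Because the final width is the product of the probabilities the model assigned, a model that predicts well leaves a wider range, and a wider range needs fewer bits to single out a number inside it.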
The decompressor similarly decodes a symbol using the existing model and then updates the model. Since there are potentially $q^k$ contexts of order $k$, where $q$ is the size of the symbol space, update can be a costly process, and the tables consume a large amount of space. This causes arithmetic coding to be somewhat slow.

3.2 XML Syntax

XML documents contain element tags, which include start tags like <name> and end tags like </name>. Elements can nest other elements, and therefore a tree structure can be associated with an XML document. Elements can also contain plain text, comments and special processing instructions for XML processors. In addition, opening element tags can have attributes with values, such as gender in <person gender="female">. Detailed specifications are given in [7]. XML documents have to conform to a specified syntax, usually in the form of a DTD. Usually XML documents are parsed to ensure that only valid data reaches an application. Most XML parsing libraries use either the SAX interface or the DOM (Document Object Model) interface. SAX is an event based interface suitable for search tools and algorithms
that need one pass. The DOM model, on the other hand, is suitable for algorithms that have to make multiple passes.

Since XML documents are stored as plain text files, one possibility is to use standard compression tools like bzip2 or ppm*. Cheney [10] has performed a study of compression using such general purpose tools and observes that each general purpose compressor performs poorly on at least one document. Since XML documents are governed by a rather restrictive set of rules, the obvious way to go is to use the rules to predict what symbols to expect. Further, if the rules are known a priori, then a compressor tuned to take advantage of the rules can be generated directly from the rules themselves. This is what we achieve with our tool XAUST (XML Compression with AUtomata and STack). The scheme proposed in this paper assumes that the DTD describing the data is known to both the sender and the receiver.

Typically, an element of a DTD consists of distinct beginning and ending tags enclosing regular expressions over other elements. Elements can also contain plain text, comments and special instructions for XML processors ("processing instructions"). Opening element tags can have attributes with values.

Example 1 Consider a DTD defined as follows:

    <!DOCTYPE addressbook [
    <!ELEMENT addressbook (card*)>
    <!ELEMENT card ((name | (givenname, familyname)), email, note?)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT givenname (#PCDATA)>
    <!ELEMENT familyname (#PCDATA)>
    <!ELEMENT email (#PCDATA)>
    <!ELEMENT note (#PCDATA)>
    ]>

Below is an instance of an XML document conforming to this DTD.

    <addressbook>
      <card>
        <givenname>Hariharan</givenname>
        <familyname>Iyer</familyname>
        <email>hari@gmail.com</email>
      </card>
      <card>
        <name>Priti Shankar</name>
        <email>priti@gmail.com</email>
        <note>Hariharan's advisor</note>
      </card>
    </addressbook>

It can be seen that each rule has an element name followed by a regular expression involving elements.
It is thus natural to associate a deterministic finite automaton (DFA) with an element definition in a rule. For example, the DFA in Figure 1 represents the rule
for the element card.

Figure 1: DFA for the element card in Example 1

There are two kinds of states in this automaton: those having a single output transition and those with multiple output transitions. Symbols that begin elements which label single output transitions need not be encoded, as their occurrence probability is 1. Thus encoding of symbols by the arithmetic compressor needs to be performed only at states with more than one outgoing transition. An arithmetic encoding procedure is called at each such state for each element. As we observed in Section 3, the arithmetic encoder maintains a set of tables of frequencies which it updates each time it encodes a symbol. Each element which has #PCDATA content will result in a call to an arithmetic encoder which uses a common set of tables for all instances of that element, whenever a path agnostic scheme is used. If a path sensitive scheme is used, different sets of tables are used for each state which has a transition labeled by that element. An example will illustrate the difference.

Example 2 Consider the elements below:

    <!ELEMENT Project (date, date,...) >
    <!ELEMENT Employee (date,...) >
    <!ELEMENT date (#PCDATA)>

XAUST provides the choice of either using a single set of tables for date, or using the contexts Project and Employee to route textual data associated with the element date to two separate sets of tables.

A typical sequence of actions is then as follows. Enter the start state of a DFA representing the right side of a rule. If there is only one edge out of the state, do nothing. If that element has #PCDATA content, encode the string of symbols using the frequency tables associated with that element. If there is more than one edge, encode the tag beginning the element labeling the edge taken, using an arithmetic encoder for that state, and transit to the start state of the DFA for that element.
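The distinction between states with one and with several outgoing transitions can be illustrated on the card element of Example 1. The Python sketch below is a hand-built approximation, not XAUST itself: the state numbering and the names CARD_DFA, CARD_FINAL, choices and trace are hypothetical, the fourth child is assumed to be called email, and the closing tag is counted as one more alternative at a final state.

```python
# Hand-built DFA for the content model of `card`:
#   ((name | (givenname, familyname)), email, note?)
# State numbers are arbitrary; "email" is an assumed element name.
CARD_DFA = {
    0: {"name": 2, "givenname": 1},
    1: {"familyname": 2},
    2: {"email": 3},
    3: {"note": 4},
    4: {},
}
CARD_FINAL = {3, 4}
CLOSE = "</card>"


def choices(state):
    """Symbols possible at `state`: edge labels, plus the close tag if final."""
    opts = sorted(CARD_DFA[state])
    if state in CARD_FINAL:
        opts.append(CLOSE)
    return opts


def trace(children):
    """Walk the DFA over the child tags; return (symbol, must_encode) pairs.

    must_encode is False when the symbol is the only possibility, in which
    case a decoder holding the same DFA can regenerate it for free.
    """
    state, steps = 0, []
    for sym in list(children) + [CLOSE]:
        opts = choices(state)
        if sym not in opts:
            raise ValueError(f"{sym!r} is not allowed in state {state}")
        steps.append((sym, len(opts) > 1))
        if sym != CLOSE:
            state = CARD_DFA[state][sym]
    return steps


if __name__ == "__main__":
    for card in (["givenname", "familyname", "email"],
                 ["name", "email", "note"]):
        print(trace(card))
```

For both cards in the sample document only two of the four symbols involve a real choice (the first child, and whether a note follows), so only those two would reach the arithmetic coder under these assumptions.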
The decoder mimics the action of the encoder, generating symbols that are certain and using the arithmetic decoder for symbols that are not. We now define the model more formally.

4 The Recursive Finite State Machine

We recall that the strings following each element declaration are just regular expressions over element names, and therefore each of them can be associated with a deterministic finite automaton.
The collection of elements is described by a recursive finite state machine, which we now define.

Definition 4.1 A recursive finite state machine $M$ over an alphabet $\Sigma$ is specified by a tuple $\langle M_1, M_2, \ldots, M_k \rangle$ where each element of the tuple is a finite state machine $M_i = (S_i, \Sigma, \delta_i, s_{0i}, F_i)$, where $S_i$ is a finite set of states, $\Sigma$ is the input alphabet, $s_{0i}$ is the start state, $F_i \subseteq S_i$ is the set of final states, and $\delta_i$ is a mapping from $S_i \times \Sigma$ to $S_i$.

In the present setting, the members of $\Sigma$ are the elements of the DTD and $k$ is the number of elements. There is one finite state machine for each element. The recursive finite state machine maintains a stack during its operation. A configuration of $M$ is a quadruple (index, state, stack, string), where index is the index of the current DFA which $M$ is traversing, state is the state of the DFA where $M$ is currently stationed, stack represents the content of the context stack, which is initially empty, and string represents the unconsumed suffix of the input string, namely the XML document to be compressed.

Assume that the current configuration of $M$ is $(i, s_{li}, \alpha, o_m s)$, where $o_m$ is an open tag for element $m$ and $s$ is the suffix of the input string after $o_m$. When an open tag is encountered for element $m$ in the document, the pair $(i, s_{li})$ is stored on the calling stack and the start state $s_{0m}$ of the DFA for the element $m$ is entered. The current configuration of $M$ now becomes $(m, s_{0m}, \alpha(i, s_{li}), s)$, where the pair $(i, s_{li})$ is concatenated with the stack contents. When the closing tag $c_m$ is encountered for element $m$, the stack is popped and the new configuration of $M$ becomes $(i, s_{l'i}, \alpha, s')$, where $\delta_i(s_{li}, m) = s_{l'i}$ and $s'$ is the suffix of the input following $c_m$.

We now indicate how to use the states of $M$ to refine probability estimates. Each state of $M$ is associated with a frequency table if there is more than one output transition from the state.
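The configuration changes in the definition above (push the pair on an open tag, pop and take the transition on a close tag) can be traced with a small Python sketch. Everything in it is illustrative: the function names walk and table_key, the event format, the root element Org and the truncated content models for Project and Employee (based on Example 2, whose declarations are elided in the text) are assumptions made only for this example. The stack contents are also used to derive a root to element path, which is one way a path sensitive scheme could select a set of PCDATA tables.

```python
# One DFA per element: element -> state -> symbol -> next state.
# Content models are hypothetical stand-ins for the elided ones in Example 2.
MACHINES = {
    "Org": {0: {"Project": 1}, 1: {"Employee": 2}, 2: {}},
    "Project": {0: {"date": 1}, 1: {"date": 2}, 2: {}},
    "Employee": {0: {"date": 1}, 1: {}},
    "date": {0: {"#PCDATA": 1}, 1: {}},
}

EVENTS = [
    ("open", "Org"),
    ("open", "Project"),
    ("open", "date"), ("text", "2004-01-05"), ("close", "date"),
    ("open", "date"), ("text", "2004-06-30"), ("close", "date"),
    ("close", "Project"),
    ("open", "Employee"),
    ("open", "date"), ("text", "1999-09-01"), ("close", "date"),
    ("close", "Employee"),
    ("close", "Org"),
]


def walk(machines, events):
    """Run the recursive machine; return (root-to-element path, text) pairs."""
    stack = []               # suspended (element, state) pairs
    current = state = None   # index and state of the DFA being traversed
    routed = []
    for kind, value in events:
        if kind == "open":
            if current is not None:
                if value not in machines[current][state]:
                    raise ValueError(f"unexpected <{value}> in {current}")
                stack.append((current, state))   # remember where to return
            current, state = value, 0            # enter start state of value
        elif kind == "text":
            path = tuple(e for e, _ in stack) + (current,)
            routed.append((path, value))
            state = machines[current][state]["#PCDATA"]
        elif kind == "close":
            if value != current:
                raise ValueError(f"mismatched </{value}>")
            if not stack:
                break                            # root element closed
            caller, caller_state = stack.pop()
            # Resume the caller and take its transition on the closed element.
            current, state = caller, machines[caller][caller_state][value]
    return routed


def table_key(path, path_sensitive):
    """Key of the PCDATA table set: whole path, or just the element name."""
    return path if path_sensitive else path[-1]


if __name__ == "__main__":
    for path, text in walk(MACHINES, EVENTS):
        print(table_key(path, False), table_key(path, True), text)
```

With this input the path agnostic key sends all three dates to a single table set, while the path sensitive key separates the dates under Project from the date under Employee, at the cost of maintaining a second set of tables.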
The elements in the table are the labels of edges leaving that state. The frequencies are the frequencies with which the edges are taken. An order 0 arithmetic encoder is used at each state with the appropriate table to represent probabilities.

The machine $M$ begins in the start state of the first element, i.e. the element specified in the DOCTYPE statement. Each time it sees an opening tag $o_e$, it takes the transition labeled with element $e$, pushes the current state and the index of the current machine $M_i$ on the context stack as described, and moves to the start state of the machine associated with element $e$. Each state initiates an encoding (or decoding, if decompression is being carried out) action. If there is a single transition out of that state, then the element is not encoded, as its probability is one and there is no need to maintain any table. If there is more than one transition, then an order 0 frequency table is maintained which gives the probabilities at that state. In the example below, we need not encode the tag D but we have to encode B and C.

    <!ELEMENT A ((B | C), D)>

We note here that there is an implicit transition out of every final state of every DFA $M_i$ to the state on top of the context stack. However, such transitions depend on the calling context and are detected only at runtime (i.e. during compression or decompression). These transitions are taken on encountering the closing tag of the element.

If the element is associated with PCDATA, then a path sensitive scheme uses the contents of the stack to route the compressor to the correct set of tables. This corresponds to possibly having a different model for PCDATA associated with each instance of the element in the DTD. In contrast, a
path insensitive scheme will route all PCDATA associated with any instance of that element in the DTD to the same set of tables. We implement both schemes in this paper to study space/compression-ratio tradeoffs. We note here that the stack contents denote the root to element path in the implied tree representation of the structure. Consider the element below.

    <!ELEMENT A ((#PCDATA | B)*)>

There are two transitions from the start state of the DFA for element A. The first invokes the arithmetic model for PCDATA. The second invokes the DFA for element B after pushing the current pair on the stack.

The pseudo-code for the Encoder (Compressor) is given below. The pseudo-code for the Decoder (Decompressor) is similar. Encoding attributes is similar to encoding PCDATA and hence is not shown.

    void Encoder()
    {
        ExitLoop = false;
        // StateStruct is the pair of ints (ElementIndex, StateIndex)
        // ElementIndex represents the automaton
        // StateIndex is the state in the above automaton
        StateStruct CurrState(0, 0);
        while (ExitLoop == false)
        {
            Type = GetNextType(FilePointer, ElementIndex);
            switch (Type)
            {
            case OPENTAG:
                // Encode ElementIndex in CurrState context
                EncodeOpenTag(CurrState, ElementIndex);
                Stack.push(CurrState);
                CurrState = StateStruct(ElementIndex, 0);
                break;
            case CLOSETAG:
                // Encode CLOSETAG in CurrState context
                EncodeCloseTag(CurrState);
                if (Stack.empty() == true)
                {
                    ExitLoop = true;
                }
                else
                {
                    CurrState = Stack.pop();
                    // Make state transition in CurrState.ElementIndex
                    // automaton and get the next state
                    CurrState.StateIndex = MakeStateTransition(CurrState, ElementIndex);
                }
                break;
            case PCDATA:
                // Encode Pcdata in path sensitive or path agnostic context
                EncodePcdata(CurrState);
                CurrState.StateIndex = MakeStateTransition(CurrState, PCDATA);
                break;
            }
        }
    }

5 Experimental Results

We have experimented with allocation of a memory block of fixed size for runtime memory during compression. We have examined the performance of the tool in terms of the compression ratio as a function of two parameters. The first is the strategy for flushing context when the maximum memory allocated for PCDATA is full. The three strategies implemented are: flushing out all context tables, flushing out context tables such that the size reduces to half of the memory allocated, and flushing the largest table. The second is the memory allocated for PCDATA, which varies from to 0 Mb; as an optimization strategy, the memory allocated for encoding states and attributes is unbounded. The sizes of the documents are displayed in Table 1.

Table 1: Sizes of XML documents that were compressed

    Name      Size (in MB)
    XMark1    113
    XMark2    230
    DBLP      302
    UniRef    79

We define the Compression Ratio as the ratio of the size of the compressed document to the size of the original document, expressed as a percentage. The compression ratios achieved for various sizes of the memory block allocated are displayed for each strategy. We have also measured the effect of using root to leaf context, in a path sensitive scheme, for the PCDATA tables. In this case the range of memory block sizes is higher, as the tables need more space. We observe that the path agnostic scheme seems to perform better under a limited block size constraint. When the results are compared with those of XMLPPM, we see that ours is better for XMARK by 2.% and DBLP by 0.1%, and XMLPPM is better for UniRef by 2.7%.
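The three flushing strategies named in this section can be sketched as follows. This is a rough Python illustration rather than the XAUST implementation: table size is measured as the number of entries, the budget is given in the same unit, the function names are hypothetical, and for the second strategy we assume the largest tables are dropped first, since the text does not say which tables are discarded.

```python
def total_size(tables):
    """Total number of frequency entries held across all context tables."""
    return sum(len(t) for t in tables.values())


def flush_all(tables, budget):
    """Strategy 1: discard every context table."""
    tables.clear()


def flush_to_half(tables, budget):
    """Strategy 2: discard tables until at most half the budget is in use.

    Assumption: the largest tables are discarded first.
    """
    for key in sorted(tables, key=lambda k: len(tables[k]), reverse=True):
        if total_size(tables) <= budget / 2:
            break
        del tables[key]


def flush_largest(tables, budget):
    """Strategy 3: discard only the single largest table."""
    if tables:
        del tables[max(tables, key=lambda k: len(tables[k]))]


def record(tables, key, symbol, budget, flush):
    """Count `symbol` in the table for `key`, flushing first if memory is full."""
    if total_size(tables) >= budget:
        flush(tables, budget)
    table = tables.setdefault(key, {})
    table[symbol] = table.get(symbol, 0) + 1


if __name__ == "__main__":
    tables = {}
    for i, ch in enumerate("the quick brown fox jumps over the lazy dog"):
        # Hypothetical routing: alternate between two PCDATA contexts.
        record(tables, ("name", "note")[i % 2], ch, budget=12,
               flush=flush_largest)
    print({k: len(v) for k, v in tables.items()})
```

Passing the strategy as a function keeps the counting code identical across the three variants, so only the flush policy changes between runs.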
6 Conclusion and Future Work

We present and evaluate new schemes for syntax directed compression of XML documents, where the underlying context model for the compression of tags is a recursive finite state automaton generated directly from the DTD of the document. The model is automatically switched on transiting from one automaton to another, storing enough information on the stack so that return to the right state is possible; this ensures that the correct model is always used for compression. (In fact it precisely achieves the multiplexing of models mentioned in XMLPPM in a completely natural manner.) We have measured the effects of allocating a fixed size block of runtime memory for the compressor, as well as varying strategies for flushing out the context tables. We have also compared the path sensitive and path agnostic schemes for storing context for PCDATA. Our experiments indicate that path sensitive schemes are less effective in the fixed memory model.

Future work will concentrate on modifying this scheme to facilitate simple tree queries on the XML text. The fact that the tree structure is implicit in the textual representation, and that function calls to elements may be augmented with parameters, makes it feasible to handle tree queries which require only a forward pass over the implicit tree, while the document is being decompressed.

References

[1] DBLP Computer Science Bibliography, ley/db
[2] Nelson, M.: (1991) Arithmetic Coding. Dr. Dobbs Journal
[3] Teahan, W.J.: PPMD+, PPM* source code. wjt/.
[4] Megginson, D.: SAX: A Simple API for XML.
[5] Ian H. Witten, Radford M. Neal, John G. Cleary.: Arithmetic Coding for Data Compression. Communications of the ACM, 30(6): -40, June
[6] XMark - An XML Benchmark project. Efficient query evaluation over compressed XML data. In Proc. of EDBT 04.
[7] Extensible Markup Language (XML) 1.0. W3C Recommendation Feb, Reference: REC-xml
[8] XML Solutions. XMLZIP.
[9] Hartmut Liefke, Dan Suciu.: XMill: an efficient compressor for XML data, Proceedings of ACM SIGMOD, 00.
[10] James Cheney.: Compressing XML with Multiplexed Hierarchical Models. Proceedings of the 01 IEEE Data Compression Conference, pp
[11] UniProt (Universal Protein Resource).
Figure 2: Statistics for Compression Ratios Versus Memory Usage for XAUST and Compression Ratios for XMLPPM (continued in next page). Panels: XMark1, XMark2, DBLP and UniRef (no path context).

Figure 3: Continued from previous page. Panels: XMark1, XMark2 and DBLP (path context), and XMLPPM (DBLP, XMark, UniRef).
Chapter 8 Evaluating XPath Queries Peter Wood (BBK) XML Data Management 201 / 353 Introduction When XML documents are small and can fit in memory, evaluating XPath expressions can be done efficiently But
More informationXML: Introduction. !important Declaration... 9:11 #FIXED... 7:5 #IMPLIED... 7:5 #REQUIRED... Directive... 9:11
!important Declaration... 9:11 #FIXED... 7:5 #IMPLIED... 7:5 #REQUIRED... 7:4 @import Directive... 9:11 A Absolute Units of Length... 9:14 Addressing the First Line... 9:6 Assigning Meaning to XML Tags...
More informationData Structure. IBPS SO (IT- Officer) Exam 2017
Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data
More informationAADL Graphical Editor Design
AADL Graphical Editor Design Peter Feiler Software Engineering Institute phf@sei.cmu.edu Introduction An AADL specification is a set of component type and implementation declarations. They are organized
More informationNavigation- vs. Index-Based XML Multi-Query Processing
Navigation- vs. Index-Based XML Multi-Query Processing Nicolas Bruno, Luis Gravano Columbia University {nicolas,gravano}@cs.columbia.edu Nick Koudas, Divesh Srivastava AT&T Labs Research {koudas,divesh}@research.att.com
More informationDefinition: A context-free grammar (CFG) is a 4- tuple. variables = nonterminals, terminals, rules = productions,,
CMPSCI 601: Recall From Last Time Lecture 5 Definition: A context-free grammar (CFG) is a 4- tuple, variables = nonterminals, terminals, rules = productions,,, are all finite. 1 ( ) $ Pumping Lemma for
More informationXCQ: A Queriable XML Compression System
Under consideration for publication in Knowledge and Information Systems XCQ: A Queriable XML Compression System Wilfred Ng 1, Wai-Yeung Lam 1, Peter T. Wood 2 and Mark Levene 2 1 Department of Computer
More informationXML databases. Jan Chomicki. University at Buffalo. Jan Chomicki (University at Buffalo) XML databases 1 / 9
XML databases Jan Chomicki University at Buffalo Jan Chomicki (University at Buffalo) XML databases 1 / 9 Outline 1 XML data model 2 XPath 3 XQuery Jan Chomicki (University at Buffalo) XML databases 2
More informationEfficient subset and superset queries
Efficient subset and superset queries Iztok SAVNIK Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, 5000 Koper, Slovenia Abstract. The paper
More informationIntroduction to XML Zdeněk Žabokrtský, Rudolf Rosa
NPFL092 Technology for Natural Language Processing Introduction to XML Zdeněk Žabokrtský, Rudolf Rosa November 28, 2018 Charles Univeristy in Prague Faculty of Mathematics and Physics Institute of Formal
More informationFinite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.
Finite automata We have looked at using Lex to build a scanner on the basis of regular expressions. Now we begin to consider the results from automata theory that make Lex possible. Recall: An alphabet
More informationCompression of Probabilistic XML documents
Compression of Probabilistic XML documents Irma Veldman i.e.veldman@student.utwente.nl July 9, 2009 Abstract Probabilistic XML (PXML) files resulting from data integration can become extremely large, which
More informationLecture 6: The Declarative Kernel Language Machine. September 13th, 2011
Lecture 6: The Declarative Kernel Language Machine September 13th, 2011 Lecture Outline Computations contd Execution of Non-Freezable Statements on the Abstract Machine The skip Statement The Sequential
More informationCS Lecture 2. The Front End. Lecture 2 Lexical Analysis
CS 1622 Lecture 2 Lexical Analysis CS 1622 Lecture 2 1 Lecture 2 Review of last lecture and finish up overview The first compiler phase: lexical analysis Reading: Chapter 2 in text (by 1/18) CS 1622 Lecture
More informationsimplefun Semantics 1 The SimpleFUN Abstract Syntax 2 Semantics
simplefun Semantics 1 The SimpleFUN Abstract Syntax We include the abstract syntax here for easy reference when studying the domains and transition rules in the following sections. There is one minor change
More informationTheory of Computation Dr. Weiss Extra Practice Exam Solutions
Name: of 7 Theory of Computation Dr. Weiss Extra Practice Exam Solutions Directions: Answer the questions as well as you can. Partial credit will be given, so show your work where appropriate. Try to be
More informationOutline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata
Outline 1 2 Regular Expresssions Lexical Analysis 3 Finite State Automata 4 Non-deterministic (NFA) Versus Deterministic Finite State Automata (DFA) 5 Regular Expresssions to NFA 6 NFA to DFA 7 8 JavaCC:
More informationA new generation of tools for SGML
Article A new generation of tools for SGML R. W. Matzen Oklahoma State University Department of Computer Science EMAIL rmatzen@acm.org Exceptions are used in many standard DTDs, including HTML, because
More informationLecture 10: Nested Depth First Search, Counter- Example Generation Revisited, Bit-State Hashing, On-The-Fly Model Checking
CS 267: Automated Verification Lecture 10: Nested Depth First Search, Counter- Example Generation Revisited, Bit-State Hashing, On-The-Fly Model Checking Instructor: Tevfik Bultan Buchi Automata Language
More informationCOMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou Administrative! Any questions about the syllabus?! Course Material available at www.cs.unic.ac.cy/ioanna! Next time reading assignment [ALSU07]
More informationSEMANTIC ANALYSIS TYPES AND DECLARATIONS
SEMANTIC ANALYSIS CS 403: Type Checking Stefan D. Bruda Winter 2015 Parsing only verifies that the program consists of tokens arranged in a syntactically valid combination now we move to check whether
More informationAn Analysis of Approaches to XML Schema Inference
An Analysis of Approaches to XML Schema Inference Irena Mlynkova irena.mlynkova@mff.cuni.cz Charles University Faculty of Mathematics and Physics Department of Software Engineering Prague, Czech Republic
More informationIntermediate Code Generation
Intermediate Code Generation In the analysis-synthesis model of a compiler, the front end analyzes a source program and creates an intermediate representation, from which the back end generates target
More informationOrganizing Spatial Data
Organizing Spatial Data Spatial data records include a sense of location as an attribute. Typically location is represented by coordinate data (in 2D or 3D). 1 If we are to search spatial data using the
More informationLexical Analysis 1 / 52
Lexical Analysis 1 / 52 Outline 1 Scanning Tokens 2 Regular Expresssions 3 Finite State Automata 4 Non-deterministic (NFA) Versus Deterministic Finite State Automata (DFA) 5 Regular Expresssions to NFA
More informationVariants of Turing Machines
November 4, 2013 Robustness Robustness Robustness of a mathematical object (such as proof, definition, algorithm, method, etc.) is measured by its invariance to certain changes Robustness Robustness of
More informationWelcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson
Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson zhuyongxin@sjtu.edu.cn 2 Lossless Compression Algorithms 7.1 Introduction 7.2 Basics of Information
More informationIntroduction to Computers & Programming
16.070 Introduction to Computers & Programming Theory of computation 5: Reducibility, Turing machines Prof. Kristina Lundqvist Dept. of Aero/Astro, MIT States and transition function State control A finite
More informationLecture 7 February 26, 2010
6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some
More informationMIDTERM EXAM (Solutions)
MIDTERM EXAM (Solutions) Total Score: 100, Max. Score: 83, Min. Score: 26, Avg. Score: 57.3 1. (10 pts.) List all major categories of programming languages, outline their definitive characteristics and
More informationPart V. Relational XQuery-Processing. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 297
Part V Relational XQuery-Processing Marc H Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 297 Outline of this part (I) 12 Mapping Relational Databases to XML Introduction Wrapping Tables into XML
More informationScribe: Virginia Williams, Sam Kim (2016), Mary Wootters (2017) Date: May 22, 2017
CS6 Lecture 4 Greedy Algorithms Scribe: Virginia Williams, Sam Kim (26), Mary Wootters (27) Date: May 22, 27 Greedy Algorithms Suppose we want to solve a problem, and we re able to come up with some recursive
More information9/5/17. The Design and Implementation of Programming Languages. Compilation. Interpretation. Compilation vs. Interpretation. Hybrid Implementation
Language Implementation Methods The Design and Implementation of Programming Languages Compilation Interpretation Hybrid In Text: Chapter 1 2 Compilation Interpretation Translate high-level programs to
More informationM301: Software Systems & their Development. Unit 4: Inheritance, Composition and Polymorphism
Block 1: Introduction to Java Unit 4: Inheritance, Composition and Polymorphism Aims of the unit: Study and use the Java mechanisms that support reuse, in particular, inheritance and composition; Analyze
More informationCOP4020 Programming Languages. Syntax Prof. Robert van Engelen
COP4020 Programming Languages Syntax Prof. Robert van Engelen Overview Tokens and regular expressions Syntax and context-free grammars Grammar derivations More about parse trees Top-down and bottom-up
More informationPRINCIPLES OF COMPILER DESIGN UNIT I INTRODUCTION TO COMPILERS
Objective PRINCIPLES OF COMPILER DESIGN UNIT I INTRODUCTION TO COMPILERS Explain what is meant by compiler. Explain how the compiler works. Describe various analysis of the source program. Describe the
More informationXML: some structural principles
XML: some structural principles Hayo Thielecke University of Birmingham www.cs.bham.ac.uk/~hxt October 18, 2011 1 / 25 XML in SSC1 versus First year info+web Information and the Web is optional in Year
More information2. Syntax and Type Analysis
Content of Lecture Syntax and Type Analysis Lecture Compilers Summer Term 2011 Prof. Dr. Arnd Poetzsch-Heffter Software Technology Group TU Kaiserslautern Prof. Dr. Arnd Poetzsch-Heffter Syntax and Type
More informationEmbedded Rate Scalable Wavelet-Based Image Coding Algorithm with RPSWS
Embedded Rate Scalable Wavelet-Based Image Coding Algorithm with RPSWS Farag I. Y. Elnagahy Telecommunications Faculty of Electrical Engineering Czech Technical University in Prague 16627, Praha 6, Czech
More informationRecognizing regular tree languages with static information
Recognizing regular tree languages with static information Alain Frisch (ENS Paris) PLAN-X 2004 p.1/22 Motivation Efficient compilation of patterns in XDuce/CDuce/... E.g.: type A = [ A* ] type B =
More informationXML Tree Structure Compression
XML Tree Structure Compression Sebastian Maneth NICTA & University of NSW Joint work with N. Mihaylov and S. Sakr Melbourne, Nov. 13 th, 2008 Outline -- XML Tree Structure Compression 1. Motivation 2.
More informationOptimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching
Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching Tiancheng Li Ninghui Li CERIAS and Department of Computer Science, Purdue University 250 N. University Street, West
More informationXML Filtering Technologies
XML Filtering Technologies Introduction Data exchange between applications: use XML Messages processed by an XML Message Broker Examples Publish/subscribe systems [Altinel 00] XML message routing [Snoeren
More informationA Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 A Real Time GIS Approximation Approach for Multiphase
More informationPoint Enclosure and the Interval Tree
C.S. 252 Prof. Roberto Tamassia Computational Geometry Sem. II, 1992 1993 Lecture 8 Date: March 3, 1993 Scribe: Dzung T. Hoang Point Enclosure and the Interval Tree Point Enclosure We consider the 1-D
More informationHeap Compression for Memory-Constrained Java
Heap Compression for Memory-Constrained Java CSE Department, PSU G. Chen M. Kandemir N. Vijaykrishnan M. J. Irwin Sun Microsystems B. Mathiske M. Wolczko OOPSLA 03 October 26-30 2003 Overview PROBLEM:
More informationPioneering Compiler Design
Pioneering Compiler Design NikhitaUpreti;Divya Bali&Aabha Sharma CSE,Dronacharya College of Engineering, Gurgaon, Haryana, India nikhita.upreti@gmail.comdivyabali16@gmail.com aabha6@gmail.com Abstract
More informationSyntax Analysis, V Bottom-up Parsing & The Magic of Handles Comp 412
Midterm Exam: Thursday October 18, 7PM Herzstein Amphitheater Syntax Analysis, V Bottom-up Parsing & The Magic of Handles Comp 412 COMP 412 FALL 2018 source code IR Front End Optimizer Back End IR target
More informationSecurity Based Heuristic SAX for XML Parsing
Security Based Heuristic SAX for XML Parsing Wei Wang Department of Automation Tsinghua University, China Beijing, China Abstract - XML based services integrate information resources running on different
More informationPARALLEL XPATH QUERY EVALUATION ON MULTI-CORE PROCESSORS
PARALLEL XPATH QUERY EVALUATION ON MULTI-CORE PROCESSORS A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI I IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF
More information7.1 Introduction. extensible Markup Language Developed from SGML A meta-markup language Deficiencies of HTML and SGML
7.1 Introduction extensible Markup Language Developed from SGML A meta-markup language Deficiencies of HTML and SGML Lax syntactical rules Many complex features that are rarely used HTML is a markup language,
More informationXDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013
Assured and security Deep-Secure XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013 This technical note describes the extensible Data
More informationLexical Analysis. COMP 524, Spring 2014 Bryan Ward
Lexical Analysis COMP 524, Spring 2014 Bryan Ward Based in part on slides and notes by J. Erickson, S. Krishnan, B. Brandenburg, S. Olivier, A. Block and others The Big Picture Character Stream Scanner
More informationCopyright 2008 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Chapter 7 XML
Chapter 7 XML 7.1 Introduction extensible Markup Language Developed from SGML A meta-markup language Deficiencies of HTML and SGML Lax syntactical rules Many complex features that are rarely used HTML
More informationfor (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }
Ex: The difference between Compiler and Interpreter The interpreter actually carries out the computations specified in the source program. In other words, the output of a compiler is a program, whereas
More informationCOMPILER CONSTRUCTION LAB 2 THE SYMBOL TABLE. Tutorial 2 LABS. PHASES OF A COMPILER Source Program. Lab 2 Symbol table
COMPILER CONSTRUCTION Lab 2 Symbol table LABS Lab 3 LR parsing and abstract syntax tree construction using ''bison' Lab 4 Semantic analysis (type checking) PHASES OF A COMPILER Source Program Lab 2 Symtab
More informationSyntax and Type Analysis
Syntax and Type Analysis Lecture Compilers Summer Term 2011 Prof. Dr. Arnd Poetzsch-Heffter Software Technology Group TU Kaiserslautern Prof. Dr. Arnd Poetzsch-Heffter Syntax and Type Analysis 1 Content
More informationAn Empirical Evaluation of XML Compression Tools
An Empirical Evaluation of XML Compression Tools Sherif Sakr School of Computer Science and Engineering University of New South Wales 1 st International Workshop on Benchmarking of XML and Semantic Web
More informationA Simple Syntax-Directed Translator
Chapter 2 A Simple Syntax-Directed Translator 1-1 Introduction The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation for it, called
More informationCSc 453 Lexical Analysis (Scanning)
CSc 453 Lexical Analysis (Scanning) Saumya Debray The University of Arizona Tucson Overview source program lexical analyzer (scanner) tokens syntax analyzer (parser) symbol table manager Main task: to
More informationEE-575 INFORMATION THEORY - SEM 092
EE-575 INFORMATION THEORY - SEM 092 Project Report on Lempel Ziv compression technique. Department of Electrical Engineering Prepared By: Mohammed Akber Ali Student ID # g200806120. ------------------------------------------------------------------------------------------------------------------------------------------
More information