An Analysis of Approaches to XML Schema Inference

Similar documents
Evolution of XML Applications

Similarity of DTDs Based on Edit Distance and Semantics

XML Data in (Object-) Relational Databases

FlexBench: A Flexible XML Query Benchmark

Discovering XML Keys and Foreign Keys in Queries

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

Towards Schema-Guided XML Query Induction

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

FROM XML SCHEMA TO OBJECT-RELATIONAL DATABASE AN XML SCHEMA-DRIVEN MAPPING ALGORITHM

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Michal Klempa. Optimization and Refinement of XML Schema Inference Approaches

Grammar vs. Rules. Diagnostics in XML Document Validation. Petr Nálevka

Fast Learning of Restricted Regular Expressions and DTDs

FAdo: Interactive Tools for Learning Formal Computational Models

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Introduction to XML Zdeněk Žabokrtský, Rudolf Rosa

Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5

Lecture 4: Syntax Specification

MASTER THESIS. Julie Vyhnanovská. Automatic Construction of an XML Schema for a Given Set of XML Documents

Extending E-R for Modelling XML Keys

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

XML databases. Jan Chomicki. University at Buffalo. Jan Chomicki (University at Buffalo) XML databases 1 / 9

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

Chapter 13 XML: Extensible Markup Language

CSCI312 Principles of Programming Languages!

Schemachine. (C) 2002 Rick Jelliffe. A framework for modular validation of XML documents

Theoretical Part. Chapter one:- - What are the Phases of compiler? Answer:

Zhizheng Zhang. Southeast University

M359 Block5 - Lecture12 Eng/ Waleed Omar

Schema-Guided Query Induction

Delivery Options: Attend face-to-face in the classroom or remote-live attendance.

Languages and Compilers

Introduction to Dependable Systems: Meta-modeling and modeldriven

Compiler Design 1. Bottom-UP Parsing. Goutam Biswas. Lect 6

Introduction to Syntax Analysis. The Second Phase of Front-End

CSE 311 Lecture 21: Context-Free Grammars. Emina Torlak and Kevin Zatloukal

Delivery Options: Attend face-to-face in the classroom or via remote-live attendance.

ISO/IEC INTERNATIONAL STANDARD. Information technology Document Schema Definition Languages (DSDL) Part 3: Rule-based validation Schematron

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

Chapter 3. Semantics. Topics. Introduction. Introduction. Introduction. Introduction

Author: Irena Holubová Lecturer: Martin Svoboda

COP 3402 Systems Software Syntax Analysis (Parser)

CS 403: Scanning and Parsing

Appendix Set Notation and Concepts

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

CMSC 330: Organization of Programming Languages. Context-Free Grammars Ambiguity

This book is licensed under a Creative Commons Attribution 3.0 License

XML and Web Application Programming

Parsing Expression Grammars and Packrat Parsing. Aaron Moss

Introduction to Syntax Analysis

Building Compilers with Phoenix

Statistical Analysis of Real XML Data Collections

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

A new generation of tools for SGML

Restricting Grammars with Tree Automata

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

LL Parsing, LR Parsing, Complexity, and Automata

Syntax Analysis. The Big Picture. The Big Picture. COMP 524: Programming Languages Srinivas Krishnan January 25, 2011

358 cardinality 18, 55-57, 62, 65, 71, 75-76, 85, 87, 99, 122, 129, , 140, 257, , , 295, categorization 18, 26, 56, 71, 7

Non-deterministic Finite Automata (NFA)

DSD: A Schema Language for XML

Chapter Seven: Regular Expressions. Formal Language, chapter 7, slide 1

A Typed Lambda Calculus for Input Sanitation

challenges in domain-specific modeling raphaël mannadiar august 27, 2009

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

A Methodology for Integrating XML Data into Data Warehouses

LR Parsing LALR Parser Generators

UserMap an Enhancing of User-Driven XML-to-Relational Mapping Strategies

Structural and Syntactic Pattern Recognition

Chapter Seven: Regular Expressions

UserMap an Exploitation of User-Specified XML-to-Relational Mapping Requirements and Related Problems

XML Technologies. Doc. RNDr. Irena Holubova, Ph.D. Web page:

Decision Properties of RLs & Automaton Minimization

Talen en Compilers. Jurriaan Hage , period 2. November 13, Department of Information and Computing Sciences Utrecht University

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Lexical Analysis. Introduction

Specifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ

CS561 Spring Mixed Content

SQL: Data De ni on. B0B36DBS, BD6B36DBS: Database Systems. h p:// Lecture 3

Course Project 2 Regular Expressions

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

LR Parsing LALR Parser Generators

The Generalized Star-Height Problem

CMSC 330 Practice Problem 4 Solutions

B4M36DS2, BE4M36DS2: Database Systems 2

Foundations of SPARQL Query Optimization

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Winter /15/ Hal Perkins & UW CSE C-1

XML Research for Formal Language Theorists

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

SFilter: A Simple and Scalable Filter for XML Streams

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

CMSC 330: Organization of Programming Languages

PART 3 - SYNTAX ANALYSIS. F. Wotawa TU Graz) Compiler Construction Summer term / 309

Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University

Grammar Rules in Prolog

OWL 2 Profiles. An Introduction to Lightweight Ontology Languages. Markus Krötzsch University of Oxford. Reasoning Web 2012

DVA337 HT17 - LECTURE 4. Languages and regular expressions

Using Input Buffers for Streaming XSLT Processing

Transcription:

An Analysis of Approaches to XML Schema Inference Irena Mlynkova irena.mlynkova@mff.cuni.cz Charles University Faculty of Mathematics and Physics Department of Software Engineering Prague, Czech Republic Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 1

Overview 1. Introduction 2. Existing approaches 3. Open issues 4. Conclusion Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 2

Introduction XML = a standard for data representation and manipulation XML documents + XML schema Allowed data structure W3C recommendations: DTD, XML Schema (XSD) ISO standards: RELAX NG, Schematron, Why schema? Known structure, valid data, limited complexity of processing, Optimization of XML processing Storing, querying, updating, compressing, Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 3

Real-World XML Schemas Statistical analyses of real-word XML data: 52% of randomly crawled / 7.4% of semi-automatically collected documents: no schema 0.09% of randomly crawled / 38% of semi-automatically collected documents with schema: use XSD 85% of randomly crawled XSDs: equivalent to DTDs Problem: Users do not use schemas at all Extreme opinion: I do not want to follow the rules of an XML schema in my XML data. Schema = a kind of documentation Documents are not valid, schemas are not correct Mlynkova, Toman, Pokorny: Statistical Analysis of Real XML Data Collections. In COMAD Nov 30 '06, - Dec pages 3, 2008 20 31, New Delhi, SITIS India, 2008 2006. - Bali, Tata Indonesia McGraw-Hill Publishing Co. Ltd. 4

Inference of XML Schemas Solution: Automatic inference of XML schema S D for a given set of documents D Multiple solutions Too general = accepts too many documents Too restrictive = accepts only D Advantages: S D = a good initial draft for user-specified schema S D = a reasonable representative when no schema is available User-defined XML schemas are too general (*, +, recursion, ) S D can be more precise Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 5

XML Schemas and Grammars An extended context-free grammar is quadruple G = (N,T,P,S), where N and T are finite sets of nonterminals and terminals, P is a finite set of productions and S is a non terminal called a start symbol. Each production is of the form A α, where A N and α is a regular expression over alphabet N T. Given the alphabet Σ, a regular expression (RE) over Σ is inductively defined as follows: (empty set) and ε (empty string) are REs a Σ: a is a RE If r and s are REs over Σ, then (rs) (concatenation), (r s) (alternation) and (r*) (Kleene closure) are REs DTD adds: (s ε) = (s?), (s s*) = (s+), concatenation = ',' XML Schema adds: unordered sequence Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 6

Overview 1. Introduction 2. Existing approaches 3. Open issues 4. Conclusion Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 7

Classification of Approaches Type of the result (DTD vs. XSD) DTDs are most common Some works infer XSDs, but with expressive power of DTD Key aim: Inference of REs (content models) The way we construct the result Heuristic = no theoretic basis Generalization of a trivial schema Rules: If there are > 3 occurrences of E, it can occur arbitrary times" E* or E+ Inferring a grammar = inference of a set of regular expressions Gold's theorem: Regular languages are not identifiable in the limit only from positive examples (valid XML documents) Inference of subclasses of regular languages Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 8

Classical Steps 1. Derivation of initial grammar (IG) For each element E and its subelements E 1, E 2,, E n we create production E E 1 E 2 E n 2. Clustering of rules of IG According to element names vs. broader context 3. Construction of prefix tree automaton (PTA) for each cluster 4. Generalization of PTAs Merging state algorithms 5. Inference of simple data types and integrity constraints Often ignored 6. Refactorization Correction and simplification of the derived REs 7. Expressing the inferred REs in target XML schema language Most common: Direct rewriting of REs to content models Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 9

Step 1: Initial Grammar Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 10

Step 2: Clustering Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 11

Step 3: Construction of PTA Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 12

Step 4. PTA Generalization Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 13

Heuristic Approaches Various generalization rules Observations of real-world data, common prefixes, suffixes, Generalization process Generalize IG until a satisfactory solution is reached Problem: wrong step Generate a set of candidates and choose the optimal one Problem: space overhead How to generalize Until any rule can be applied Until a better schema can be found Problems: Evaluation of quality of schemas (MDL principle) Conciseness = bits required to describe schema Preciseness = bits required for description of input data using schema Efficient search strategy (greedy search vs. ACO heuristics) Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 14

Approaches Inferring a Grammar Common idea: regular languages are not identifiable in the limit from positive examples inferring a subclass that can be Difference: The selected class of languages k-contextual, (k,h)-contextual = having a limited context f-distinguishable = having a distinguishing function single-occurrence REs, chain REs, k-local single-occurrence = simple types of REs occurring in real-world XML schemas Approaches: Merging state algorithms Merging criteria are given by the language class directly Note: Necessary requirement of W3C = 1-unambiguity Deterministic content models Example: (A,B) (A,C) vs. A, (B C) Often ignored Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 15

Overview 1. Introduction 2. Existing approaches 3. Open issues 4. Conclusion Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 16

1. User Interaction Existing approaches: Automatic inference of an XML schema Problem: How to find the optimal generalization? MDL principle: Good schema = tightly represents data, concise, compact User's preferences can be different resulting schema may be unnatural Bex et al. (VLDB'06, VLDB'07): Let us infer only schema constructs that occur in real-world XML data Natural improvement: user interaction Refining the clustering, preferred merging, preferred schema constructs, refining the REs, Problem: A user may not be skilled in specifying complex REs A user is not able to make too many decisions Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 17

2. Other Input Information Input in existing works: a set of positive examples Problem: Gold's theorem Question: Are there any other ways? Input 1: An obsolete XML schema Typical situation: a user creates an XML schema updates only the data schema is obsolete Idea: The schema contains partially correct information Note: XML schema evolution = opposite problem Input 2: XML queries Idea: partial information on the structure Input 3 - : Negative examples, user requirements, statistical analysis of XML documents, Mlynkova: On Inference of XML Schema with the Knowledge of an Obsolete One. In ADC 09 (to appear), volume 92, Wellington, New Zealand, 2009. ACS. Necasky, Nov Mlynkova: 30 - Dec 3, 2008 Enhancing XML Schema SITIS 2008 Inference - Bali, Indonesia with Keys and Foreign Keys. 18 In SAC 09 (to appear), Honolulu, Hawaii, USA, 2009. ACM.

3. XML Schema Simple Data Types Advantage of XML Schema: wide support of simple data types 44 built-in data types User-defined data types derived from existing simple types Natural improvement: precise inference of simple data types Current approaches: Omit simple data types at all Two exceptions: selected built-in data types Do we need simple data types? Inferring within an XML editor: yes Inferring for optimization purposes: not always necessary Schema-driven XML-to-relational mapping methods Ideas: exploitation of additional information Queries, semantics of element names, obsolete schema, Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 19

4. XML Schema Advanced Constructs Advantage of XML Schema: object-oriented features User-defined data types, inheritance, substitutability of both data types and elements, Disadvantage: Do not extend the expressive power "syntactic sugar" Advantages: More user-friendly and realistic schemas Can carry more precise information for optimization Inheritance, shared globally defined items, Problem: constructs are equivalent how to find the optimal expression? User-interaction Additional information Vosta, Mlynkova, Pokorny. Even an Ant Can Create an XSD. In DASFAA 08, LNCS 4947, pages 35 50. New Delhi, India, 2008. Springer-Verlag. Mlynkova, Nov 30 Necasky: - Dec 3, 2008 Towards Inference SITIS of More 2008 Realistic - Bali, Indonesia XSDs. 20 In SAC 09 (to appear), Honolulu, Hawaii, USA, 2009. ACM.

5. Integrity Constraints (ICs) DTD: ID, IDREF, IDREFS = keys and foreign keys XML Schema: ID, IDREF, IDREFS unique, key, keyref More precise expression of keys and foreign keys + uniqueness assert, report Special constraints expressed using XPath More powerful ICs: Cannot be expressed in XML Schema but can be inferred Aim of ICs Optimization of XML processing approaches Existing works: Restricted cases of ICs in special situations (applications) No general/universal approach Necasky, Mlynkova: Enhancing XML Schema Inference with Keys and Foreign Keys. Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 21 In SAC 09 (to appear), Honolulu, Hawaii, USA, 2009. ACM.

6. Other Schema Definition Languages W3C: DTD, XML Schema Most popular ones There are other languages RELAX NG Similar strategy as XML Schema and DTD Describes the structure of XML documents using content models Simpler syntax than XSDs, richer set of simple data types than DTD Schematron Different strategy Specifies a set of conditions (ICs) the documents must follow Expressed using XPath A brand new method A first step towards inference of general ICs Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 22

7. XML Data Streams Data streams Special type of XML data Recently became popular Special processing Parsing, validation, querying, transforming, Inference of XML schema? Features: Cannot be kept in a memory Cannot be read more than once Processing cannot "wait" for the last portion The situation is complicated No inference method for XML data streams Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 23

Overview 1. Introduction 2. Existing approaches 3. Open issues 4. Conclusion Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 24

Conclusion Almost any approach can benefit from XML schemas = knowledge of data structure Currently Data-exchange: inferred schema = candidate for further improving Optimization: inferred schema = the only option May be more precise Main observations: Basic aspects (inference of REs) are solved Advanced aspects are still waiting for solutions Aim of this study: A good starting point for researchers searching a solution or a research topic Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 25

Thank you Nov 30 - Dec 3, 2008 SITIS 2008 - Bali, Indonesia 26