Identifying Hierarchical Structure in Sequences. Sean Whalen


Identifying Hierarchical Structure in Sequences. Sean Whalen, 07.19.06

Review of Grammars

Syntax vs Semantics
Syntax - the pattern of sentences in a language
CS - the rules governing the formation of statements in a programming language
Semantics - the study of language meaning
CS - the meaning of a string in some language, as opposed to syntax, which describes how symbols may be combined independent of their meaning

Digits
How might we express the syntactic structure of positive integers?
digit ::= 0
digit ::= 1
digit ::= 2
...
digit ::= 9
"::=" means "can be a" ("digit ::= 1" means "a digit can be a 1")
We can collect these rules together using the or operator (|)
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Essentially, digit generates the elements of the set {0,1,2,3,4,5,6,7,8,9}

Terminology
We gave 10 rules (called productions) describing possible digits
The symbol on the left-hand side of a production is called a nonterminal, because it can generate additional symbols
The right-hand side may contain terminals, more nonterminals, or both
Replacing digit with something that can be a digit is a derivation step; a sequence of such steps is a derivation
Given a set of productions, what set of strings can be derived (generated)?

Formalization
A grammar can be defined as G = (N, T, P, S)
N - a set of nonterminals
T - a set of terminals
P - a set of productions
S - a start symbol

Numbers
How would we generate a number with multiple digits?
number ::= digit digit
number ::= digit digit digit
Can we use a finite number of productions? Recursion!
number ::= number digit | digit
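A minimal sketch of how the recursive production derives a multi-digit number, with the grammar written as the tuple G = (N, T, P, S). The Python names (`N`, `T`, `P`, `S`, `leftmost_derivation`) are illustrative choices, not part of the original slides.

```python
# G = (N, T, P, S) for the number grammar; all names here are illustrative.
N = {"number", "digit"}
T = set("0123456789")
P = {
    "number": [["number", "digit"], ["digit"]],
    "digit": [[d] for d in "0123456789"],
}
S = "number"


def leftmost_derivation(text):
    """Return the sentential forms of a leftmost derivation of `text`."""
    if not text or any(ch not in T for ch in text):
        raise ValueError(f"{text!r} is not in the language of G")
    recursive, base = P["number"]
    form = [S]
    steps = [" ".join(form)]
    # Apply number ::= number digit once per extra digit ...
    for _ in range(len(text) - 1):
        form = recursive + form[1:]
        steps.append(" ".join(form))
    # ... then end the recursion with number ::= digit.
    form = base + form[1:]
    steps.append(" ".join(form))
    # Finally rewrite each digit nonterminal, left to right.
    for i, ch in enumerate(text):
        form[i] = ch
        steps.append(" ".join(form))
    return steps


for step in leftmost_derivation("42"):
    print(step)
```

Two productions are enough for a number of any length: the recursive alternative is applied once per extra digit, and the base alternative ends the recursion.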


Names
Let's say a name in the phonebook is defined as last name, first name middle initial
name ::= last_name , first_name middle_initial
Middle initial is easiest
char ::= A | B | C | ... | Z
middle_initial ::= char

Names
Similar to number, we need a recursive definition for a string of characters
char_list ::= char_list char | char
Now we can define names
first_name ::= char_list
last_name ::= char_list

Database Record
One can describe the structure of a database entry this way
student_record ::= name address ssn
address ::= house_address | apt_address
house_address ::= number street city state zip
apt_address ::= number street apt_num city state zip
This is a context-free grammar (CFG) written in Backus-Naur Form (BNF)

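Precedence
assignment ::= identifier = expression ;
expression ::= term | expression + term | expression - term
term ::= factor | term * factor | term / factor
factor ::= identifier | literal | ( expression )
Because term sits below expression in the grammar, * and / bind more tightly than + and -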
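A small recursive-descent sketch of the expression part of this grammar, showing how the layering of expression, term and factor turns into precedence in the parse tree. The left-recursive alternatives are implemented as loops, which is a common rewrite and not something the slides specify; the tokenizer and all function names are assumptions made for illustration.

```python
import re


def tokenize(source):
    # Identifiers, integer literals, operators and parentheses.
    # Whitespace and any other characters are silently skipped in this sketch.
    return re.findall(r"[A-Za-z_]\w*|\d+|[-+*/()]", source)


def parse_expression(tokens, pos):
    # expression ::= term | expression + term | expression - term
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        right, after = parse_term(tokens, pos + 1)
        node, pos = (tokens[pos], node, right), after
    return node, pos


def parse_term(tokens, pos):
    # term ::= factor | term * factor | term / factor
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        right, after = parse_factor(tokens, pos + 1)
        node, pos = (tokens[pos], node, right), after
    return node, pos


def parse_factor(tokens, pos):
    # factor ::= identifier | literal | ( expression )
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "(":
        node, pos = parse_expression(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if tokens[pos] in ("+", "-", "*", "/", ")"):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tokens[pos], pos + 1


def parse(source):
    tokens = tokenize(source)
    tree, pos = parse_expression(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


print(parse("a + b * c"))   # ('+', 'a', ('*', 'b', 'c'))
print(parse("a - b - c"))   # ('-', ('-', 'a', 'b'), 'c')
```

The multiplication ends up deeper in the tree than the addition, and the left recursion in the grammar shows up as left associativity in the second example.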

Grammatical Inference

Grammatical Inference
The task of learning (inferring) a grammar from a set of sample strings
Automata / grammar equivalence
For any regular grammar G there is a finite automaton accepting L(G)
The same holds for context-free grammars and pushdown automata
Thus, learning automata from sample strings is also grammatical inference (GI)
Probabilistic versions - learn a finite state machine (FSM) that generates strings according to some probability distribution
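The grammar/automaton equivalence for the regular case can be sketched as follows, assuming the grammar is right-linear (every production has the form A ::= a B or A ::= a). Each nonterminal becomes a state and each production becomes a transition. The encoding of productions as `(terminal, next_nonterminal_or_None)` pairs and the names `grammar_to_nfa`, `accepts` and `ACCEPT` are illustrative assumptions.

```python
from collections import defaultdict

ACCEPT = "<accept>"  # extra accepting state, name chosen for this sketch


def grammar_to_nfa(productions):
    """Build NFA transitions from a right-linear grammar.

    `productions` maps a nonterminal to a list of (terminal, tail) pairs,
    where tail is the following nonterminal, or None for A ::= a.
    """
    delta = defaultdict(set)
    for head, alternatives in productions.items():
        for terminal, tail in alternatives:
            delta[(head, terminal)].add(ACCEPT if tail is None else tail)
    return delta


def accepts(delta, start, text):
    """Simulate the NFA on `text` by tracking the set of reachable states."""
    states = {start}
    for ch in text:
        states = set().union(*(delta.get((q, ch), set()) for q in states))
    return ACCEPT in states


# Right-linear version of the number grammar:
# number ::= d number | d   for every digit d
digits = "0123456789"
number_grammar = {
    "number": [(d, "number") for d in digits] + [(d, None) for d in digits],
}
nfa = grammar_to_nfa(number_grammar)
print(accepts(nfa, "number", "2006"))   # True
print(accepts(nfa, "number", "20a6"))   # False
```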

Algorithms
Regular Positive and Negative Inference (RPNI, Oncina & Garcia 92) - Given labeled samples both in (positive) and not in (negative) the target language, infer a DFA acceptor consistent with the samples (the minimal DFA, given a sufficiently representative sample)
Alergia (Carrasco & Oncina 94/99) - Given a set of example strings, infer a stochastic regular grammar (a stochastic finite automaton)
MDI (Dupont et al. 2000) - Refinement of Alergia; addresses global control of generalization from the training sample using KL divergence

Sequitur
Nevill-Manning and Witten, Journal of AI Research 97
Sequitur is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively
Creates a set of productions and a start symbol which can derive the original string
The grammar represents hierarchical structure and simultaneously performs lossless compression (similar to Lempel-Ziv)

High-Level Algorithm
Scan the input string, adding one symbol at a time to the starting production
For each digram (2 consecutive symbols) formed by the last 2 symbols:
Store the digram in a table
If it's already in the table, it's been seen before
Create a production - replace the current and matched digrams with the nonterminal for the new production

Constraints
No pair of adjacent symbols can appear more than once in the grammar (digram uniqueness)
This includes nonterminals for productions
Digrams containing nonterminals create the hierarchy
Every rule is used more than once (rule utility)
Delete rules when used only once
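The two constraints can be sketched in a few lines of Python. This is a naive illustration, not the linear-time implementation from the paper: it rescans the whole grammar after every appended symbol instead of maintaining a digram table, so it is much slower, but it enforces digram uniqueness and rule utility in the same way. Nonterminals are represented as integers (0 is the start symbol) and terminals as one-character strings; those conventions and all function names are assumptions made for this sketch.

```python
from collections import Counter
from itertools import count

START = 0  # nonterminals are ints, terminals are one-character strings


def find_repeat(rules):
    """Return (digram, first_pos, second_pos) for a repeated digram, or None."""
    seen = {}
    for name, body in rules.items():
        for i in range(len(body) - 1):
            digram = (body[i], body[i + 1])
            if digram not in seen:
                seen[digram] = (name, i)
            elif seen[digram] != (name, i - 1):  # ignore overlaps such as "aaa"
                return digram, seen[digram], (name, i)
    return None


def fix_once(rules, fresh):
    """Repair one constraint violation; return False when none is left."""
    hit = find_repeat(rules)
    if hit:  # digram uniqueness
        digram, first, second = hit
        owner = next((n for n, b in rules.items()
                      if n != START and tuple(b) == digram), None)
        if owner is None:  # no existing rule generates this digram
            owner = next(fresh)
            rules[owner] = list(digram)
        # Replace right to left so the earlier index stays valid.
        for name, i in sorted((first, second), reverse=True):
            if name != owner:
                rules[name][i:i + 2] = [owner]
        return True
    uses = Counter(s for body in rules.values() for s in body
                   if isinstance(s, int))
    for name in rules:  # rule utility
        if name != START and uses[name] < 2:
            body = next(b for b in rules.values() if name in b)
            i = body.index(name)
            body[i:i + 1] = rules.pop(name)  # inline the rule and delete it
            return True
    return False


def sequitur(text):
    rules = {START: []}
    fresh = count(1)  # supply of new nonterminal ids
    for symbol in text:
        rules[START].append(symbol)
        while fix_once(rules, fresh):
            pass
    return rules


def expand(rules, symbol=START):
    """Derive the string generated by `symbol`."""
    if isinstance(symbol, str):
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])


print(sequitur("abcdbc"))  # {0: ['a', 1, 'd', 1], 1: ['b', 'c']}
```

For the input abcdbc the result corresponds to S ::= aAdA, A ::= bc, with 0 standing for S and 1 for A. Calling `expand` on any result returns the original string, which is the lossless-compression property.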

Example 1
string > abcdbc
Parse consecutive digrams and store
Store ab, store bc, store cd, store db, then bc again
bc matches! Create new rule A ::= bc
S ::= aAdA
A ::= bc

Example 2 - Nested
string > abcdbcabcdbc
S ::= AA
A ::= aBdB
B ::= bc

Example 3 - Violation
string > abcdbcabcdbc (same)
S ::= AA
A ::= abcdbc
bc occurs twice in rule A, which violates digram uniqueness

Step by Step

Implementation Issues
Appending a symbol to start symbol S
Using an existing rule - substitute nonterminal for two symbols
Creating a new rule - new nonterminal, insert 2 symbols, substitute
Deleting a rule - replace nonterminal with contents
Data structures: doubly linked list (symbols), hash table (digrams)
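A rough sketch of why these two data structures give constant-time operations: the hash table finds an earlier occurrence of a digram in one lookup, and the doubly linked list lets a digram be spliced out without shifting the rest of the sequence. This covers only lookup and substitution on a single rule body, and it does not update the table after a substitution, so it is far from a full implementation. The class and function names are hypothetical.

```python
class Symbol:
    """One node of the doubly linked symbol list."""

    def __init__(self, value):
        self.value, self.prev, self.next = value, None, None


def link(left, right):
    left.next, right.prev = right, left


def build(text):
    """Return a sentinel head node followed by one node per symbol."""
    head = tail = Symbol(None)
    for ch in text:
        node = Symbol(ch)
        link(tail, node)
        tail = node
    return head


def find_repeated_digram(head):
    """Return the first nodes of two non-overlapping equal digrams, or None."""
    table = {}  # digram -> node where it was first seen (hash lookup, O(1))
    node = head.next
    while node and node.next:
        key = (node.value, node.next.value)
        if key in table and table[key].next is not node:
            return table[key], node
        table.setdefault(key, node)
        node = node.next
    return None


def substitute(first, nonterminal):
    """Splice one new node in place of the digram starting at `first`, in O(1)."""
    second = first.next
    node = Symbol(nonterminal)
    link(first.prev, node)  # the sentinel guarantees first.prev exists
    if second.next:
        link(node, second.next)
    return node


def to_list(head):
    out, node = [], head.next
    while node:
        out.append(node.value)
        node = node.next
    return out


head = build("abcdbc")
earlier, later = find_repeated_digram(head)
substitute(later, "A")
substitute(earlier, "A")
print(to_list(head))  # ['a', 'A', 'd', 'A']
```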

Digram Table

Complexity
Linear time in length of string (good), linear space in size of grammar (bad)
Creation/deletion/substitution/lookup each take a constant number of operations thanks to the doubly linked list and hash table
Number of rules can increase without bound for unstructured data
Linearity proof is valid for 10^9 symbols on 32-bit machines

Performance

Compression Performance


Application
Run algorithm on protocol data
Convert resulting grammar to DFA
How big is the resulting model?
Is the structure of the protocol more or less apparent?