Grammars and Parsing. Paul Klint. Grammars and Parsing

Similar documents
Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

CS 2210 Sample Midterm. 1. Determine if each of the following claims is true (T) or false (F).

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

3. Parsing. Oscar Nierstrasz

EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing. Görel Hedin Revised:

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}

CMSC 330: Organization of Programming Languages. Context Free Grammars

CSE 3302 Programming Languages Lecture 2: Syntax

CPS 506 Comparative Programming Languages. Syntax Specification

Syntax Analysis Part I

CSCI312 Principles of Programming Languages!

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Winter /15/ Hal Perkins & UW CSE C-1

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Part III : Parsing. From Regular to Context-Free Grammars. Deriving a Parser from a Context-Free Grammar. Scanners and Parsers.

Introduction to Lexing and Parsing

CMSC 330: Organization of Programming Languages

CS2210: Compiler Construction Syntax Analysis Syntax Analysis

Wednesday, September 9, 15. Parsers

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs:

A programming language requires two major definitions A simple one pass compiler

Properties of Regular Expressions and Finite Automata

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Spring UW CSE P 501 Spring 2018 C-1

CSE 130 Programming Language Principles & Paradigms Lecture # 5. Chapter 4 Lexical and Syntax Analysis

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

PART 3 - SYNTAX ANALYSIS. F. Wotawa TU Graz) Compiler Construction Summer term / 309

Error Recovery during Top-Down Parsing: Acceptable-sets derived from continuation

Writing a Simple DSL Compiler with Delphi. Primož Gabrijelčič / primoz.gabrijelcic.org

Defining syntax using CFGs

Derivations vs Parses. Example. Parse Tree. Ambiguity. Different Parse Trees. Context Free Grammars 9/18/2012

Top down vs. bottom up parsing

EDA180: Compiler Construc6on Context- free grammars. Görel Hedin Revised:

Optimizing Finite Automata

CMSC 330: Organization of Programming Languages

Syntax Analysis. Martin Sulzmann. Martin Sulzmann Syntax Analysis 1 / 38

Part 5 Program Analysis Principles and Techniques

Introduction to Syntax Analysis

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

Introduction to Syntax Analysis. The Second Phase of Front-End

LL(k) Parsing. Predictive Parsers. LL(k) Parser Structure. Sample Parse Table. LL(1) Parsing Algorithm. Push RHS in Reverse Order 10/17/2012

Homework & Announcements

CSE 401 Midterm Exam Sample Solution 2/11/15

Lexical and Syntax Analysis

Architecture of Compilers, Interpreters. CMSC 330: Organization of Programming Languages. Front End Scanner and Parser. Implementing the Front End

Chapter 3: Lexing and Parsing

4. Lexical and Syntax Analysis

CMSC 330: Organization of Programming Languages

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Specifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ

Syntax Analysis/Parsing. Context-free grammars (CFG s) Context-free grammars vs. Regular Expressions. BNF description of PL/0 syntax

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

4. Lexical and Syntax Analysis

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Parsers. Xiaokang Qiu Purdue University. August 31, 2018 ECE 468

Building Compilers with Phoenix

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

COP 3402 Systems Software Syntax Analysis (Parser)

Section A. A grammar that produces more than one parse tree for some sentences is said to be ambiguous.

3. Syntax Analysis. Andrea Polini. Formal Languages and Compilers Master in Computer Science University of Camerino

Course Overview. Introduction (Chapter 1) Compiler Frontend: Today. Compiler Backend:

Wednesday, August 31, Parsers

Lexical and Syntax Analysis

CMSC 330: Organization of Programming Languages

Lexical and Syntax Analysis. Top-Down Parsing

CS 403: Scanning and Parsing

CMSC 330: Organization of Programming Languages. Context Free Grammars

CMSC 330: Organization of Programming Languages. Context Free Grammars

Abstract Syntax Trees & Top-Down Parsing

Abstract Syntax Trees & Top-Down Parsing

Introduction to Lexical Analysis

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University

Monday, September 13, Parsers

Evaluation of Semantic Actions in Predictive Non- Recursive Parsing

Parsing II Top-down parsing. Comp 412

Building Compilers with Phoenix

Syntax. Syntax. We will study three levels of syntax Lexical Defines the rules for tokens: literals, identifiers, etc.

Abstract Syntax Trees & Top-Down Parsing

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

Parsing. Handle, viable prefix, items, closures, goto s LR(k): SLR(1), LR(1), LALR(1)

1. Explain the input buffer scheme for scanning the source program. How the use of sentinels can improve its performance? Describe in detail.

CS 314 Principles of Programming Languages

Bottom-Up Parsing. Lecture 11-12

10/5/17. Lexical and Syntactic Analysis. Lexical and Syntax Analysis. Tokenizing Source. Scanner. Reasons to Separate Lexical and Syntax Analysis

Lecture 7: Deterministic Bottom-Up Parsing

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

10/4/18. Lexical and Syntactic Analysis. Lexical and Syntax Analysis. Tokenizing Source. Scanner. Reasons to Separate Lexical and Syntactic Analysis

Introduction to Syntax Analysis. Compiler Design Syntax Analysis s.l. dr. ing. Ciprian-Bogdan Chirila

Chapter 3. Describing Syntax and Semantics ISBN

A simple syntax-directed

Chapter 4. Lexical and Syntax Analysis

Bottom-Up Parsing. Lecture 11-12

Lecture 8: Deterministic Bottom-Up Parsing

Academic Formalities. CS Modern Compilers: Theory and Practise. Images of the day. What, When and Why of Compilers

LR Parsing, Part 2. Constructing Parse Tables. An NFA Recognizing Viable Prefixes. Computing the Closure. GOTO Function and DFA States

Parsing Expression Grammars and Packrat Parsing. Aaron Moss

Defining Program Syntax. Chapter Two Modern Programming Languages, 2nd ed. 1

Syntax Analysis. Prof. James L. Frankel Harvard University. Version of 6:43 PM 6-Feb-2018 Copyright 2018, 2015 James L. Frankel. All rights reserved.

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Transcription:

Paul Klint

Grammars and Languages are one of the most established areas of Natural Language Processing and Computer Science 2

N. Chomsky, Aspects of the theory of syntax, 1965 3

A Language...... is a (possibly infinite) set of sentences Exercise: Give examples of finite languages Give examples of infinite languages 4

A Grammar...... is a set of formation rules to describe the sentences in a language The Chomsky hierarchy: Context-sensitive languages Natural language processing Context-free languages Syntax of programming languages Regular languages Regular expressions, grep, lexical syntax 5

Syntax Analysis (aka Parsing)...... is the process of analyzing the syntactic structure of a sentence. A recognizer only says Yes or No (+ messages) to the question: Does this sentence belong to language L? A parser also builds a structured representation when the text is syntactically correct. Such a syntax tree or parse trees is a proof how the grammar rules can be sued to derive the sentence. 6

A Recognizer Grammar l1 -> r1 l2 -> r2... ln -> rn Source text Recognizer Yes or No ( + Errors) 7

A Parser Grammar l1 -> r1 l2 -> r2... ln -> rn Parse tree Source text or Errors Parser 8

Why are Techniques relevant? A grammar is a formal method to describe a (textual) language Programming languages: C, Java, C#, JavaScript Domain-specific languages: BibTex, Mathematica Data formats: log files, protocol data Parsing: Tests whether a text conforms to a grammar Turns a correct text into a parse tree 9

A.V. Aho & J.D. Ullman, The Theory of Parsing, Translation and Compiling, Parts I + II, 1972 10

A.V. Aho, R. Sethi, J.D. Ullman, Compiler, Principles, Techniques and Tools, 1986 11

D. Grune, C. Jacobs, Parsing Techniques, A Practical Guide, 2008 12

What is Syntax Analysis about? Syntax analysis (or parsing) is about recognizing structure in text (or the lack thereof) The question Is this a textually correct Java program? can be answered by syntax analysis. Note: other correctness aspects are outside the scope of parsing: Has this variable been declared? Is this expression type correct? Is this method called with the right parameters? 13

When is Syntax Analysis Used? Compilers IDEs Software analysis Software transformation DSL implementations Natural Language processing Genomics (parsing of DNA fragments) 14

How to define a grammar? Simplistic solution: finite set of acceptable sentences Problem: what to do with infinite languages? Realistic solution: finite recipe that describes all acceptable sentences A grammar is a finite description of a possibly infinite set of acceptable sentences 15

Example: Tom, Dick and Harry Suppose we want describe a language that contains the following legal sentences: Tom Tom and Dick Tom, Dick and Harry Tom, Harry, Tom and Dick... How do we find a finite recipe for this? 16

The Tom, Dick and Harry Grammar Name -> tom Name -> dick Name -> harry Sentence -> Name Sentence -> List End Non-terminals: Name, Sentence, List, End List -> Name List -> List, Name, Name End -> and Name Terminals: tom, dick, harry, and,, Start Symbol: Sentence 17

Example Name -> tom Name -> dick Name -> harry Sentence -> Name Sentence -> List End List -> Name List -> List, Name Sentence -> Name -> tom Sentence -> List End -> List, Name End -> Name, Name End -> tom, Name End -> tom, dick End ->, Name End -> and Name tom and dick 18

Variations in Notation Name -> tom dick harry <Name> ::= tom dick harry tom dick harry -> Name In Rascal: syntax Name = tom dick harry ; 19

Chomsky s Grammar Hierarchy Type-0: Recursively Enumerable Type-1: Context-sensitive Rules: αaβ -> αγβ Type-2: Context-free Rules: α -> β (unrestricted) Rules: A -> γ Type-3: Regular Rules: A -> a and A -> ab 20

Context-free Grammar for TDH Name -> tom dick harry Sentence -> Name List and Name List -> Name, List Name 21

Exercise: What changed and Why? Name -> tom Name -> tom Name -> dick Name -> dick Name -> harry Name -> harry Sentence -> Name Sentence -> Name Sentence -> List End List -> Name List -> List, Name Sentence -> List and Name List -> Name List -> Name, List, Name End -> and Name 22

In practice... Regular grammars used for lexical syntax: Keywords: if, then, while Constants: 123, 3.14, a string Comments: /* a comment */ Context-free grammars used for structured and nested concepts: Class declaration If statement 23

We start with text Consider the assignment statement: Position := initial + rate * 60 First approximation, this is a string of characters: p o s i t i o n : = i n i t i a l + r a t e * 6 0 24

From text to tokens p o s i t i o n : = i n i t i a l The identifier position The assignment symbol := The identifier initial The addition operator + The identifier rate The multiplication operator * The number 60 + r a t e * 6 0 25

Lexical syntax Regular expressions define lexical syntax: Literal characters: a,b,c,1,2,3 Character classes: [a-z], [0-9] Operators: sequence (space), repetition (* or +), option (?) Examples: Identifier: [a-z][a-z0-9]* Number: [0-9]+ Floating constant: [0-9]*.[0-9]*(e-[0-9]+) 26

Lexical syntax Regular expressions can be implemented with a finite automaton Consider [a-z][a-z0-9]* [a-z0-9]* Start [a-z] Start 27

From text to tokens Classify characters by lexical category: Tokenization Lexical analysis Lexical scanning p o s i t i o n : = identifier position i n i t i identifier := initial a l + r a t e identifier + rate * 6 0 number * 60 28

From Tokens to Parse Tree assignment statement expression expression identifier position := expression expression expression identifier identifier number initial + rate * 60 29

Expression Grammar The hierarchical structure of expressions can be described by recursive rules: 1. Any Identifier is an expression 2. Any Number is an expression 3. If Expression1 and Expression2 are expressions then so are: Expression1 + Expression2 Expression1 * Expression2 ( Expression1 ) 30

Statement Grammar 1. If Identifier1 is an identifier and Expression1 is an expression then the following is a statement: Identifier1 := Expression1 2. If Expression1 is an expression and Statement1 is an statement then the following are statements: while ( Expression1 ) Statement1 if ( Expression1) then Statement1 31

How do we get a parser? Use a parser generator Pro: regenerate when grammar changes Pro; recognized language is exactly known Pro: less effort Con: Grammar has to fit in the grammar class accepted by the parser generator (this may be very hard!) Con: mixing of parsing and other actions somewhat restricted Con: limited error recovery 32

Language A start syntax A = "a"; parsetreeviewer(#start[a]) 33

Language AB start syntax AB = "a" "b"; parsetreeviewer(#start[ab]) 34

Language AB start syntax AB = "a" "b"; parsetreeviewer(#start[ab]) 35

Language AB (with layout) layout Whitespace = [\ \t\n]*; start syntax AB = "a" "b"; parsetreeviewer(#start[ab]) 36

Language AB2 layout Whitespace = [\ \t\n]*; syntax A = "a"; syntax B = "b"; start syntax AB2 = A B; parsetreeviewer(#start[ab2]) 37

Language C layout Whitespace = [\ \t\n]*; syntax A = "a"; syntax B = "b"; start syntax C = "c" A C B; parsetreeviewer(#start[c]) 38

Language E layout Whitespace = [\ \t\n]*; lexical Integer = [0-9]+; start syntax E = Integer E "*" E E "+" E "(" E ")" ; parsetreeviewer(#start[e]) 39

Language E: ambiguity parsetreeviewer(#start[e]) 40

Language E: Using Parentheses parsetreeviewer(#start[e]) 41

Language E1: Define Priority layout Whitespace = [\ \t\n]*; lexical Integer = [0-9]+; start syntax E1 = Integer E1 "*" E1 > E1 "+" E1 "(" E1 ")" ; > defines that E1 * E1 has higher priority than E1 + E1 parsetreeviewer(#start[e1]) 42

Language E2: Extra non-terminals layout Whitespace = [\ \t\n]*; lexical Integer = [0-9]+; start syntax E2 = E2 "+" T T; syntax T = T "*" P P; syntax P = "(" E2 ")" Integer; parsetreeviewer(#start[e2]) 43

Language La0: List of zero or more a's start syntax La0 = "a"*; parsetreeviewer(#start[la0]) 44

Language La0: List of one or more a's start syntax La1 = "a"+; parsetreeviewer(#start[la1]) 45

Language LaS0: List of zero or more a's separated by comma's start syntax LaS0 = {"a" ","}*; parsetreeviewer(#start[las0]) 46

Language LaS1: List of one or more a's separated by comma's start syntax LaS1 = {"a" ","}+; parsetreeviewer(#start[las1]) 47

Language S: Statement-like layout Whitespace = [\ \t\n]*; lexical Integer = [0-9]+; syntax E1 = Integer E1 "*" E1 > E1 "+" E1 "(" E1 ")" ; lexical Id = [a-z][a-z0-9]*; start syntax S = Id "=" E1 "while" E1 "do" {S ";"}+ "od" ; parsetreeviewer(#start[s]) 48

What is a grammar? A context-free grammar is 4-tuple G=(N,Σ,P,S) N is a set of nonterminals Σ is a set of terminals (literal symbols, disjoint from N) P is a set of production rules of the form (A, α) with A a nonterminal, and α a list of zero or more terminals or nonterminals. Notation: A ::= α (in BNF) syntax A = α; (in Rascal) S ε N, is the start symbol. 49

Derivations A grammar is a formal system with one proof rule: α A β > α γ β if A ::= γ is a production A is a nonterminal, α, β, γ possibly empty lists of (non)terminals 50

Example N = {E} Σ = { +, *, (, ), -, a} S=E P = {E::=E+E, E::=E*E, E::=( E ), E::=-E, E::= a} A derivation: E > - E > - ( E ) > - ( E + E) > - ( a + E) > -(a+a) A derivation generates a sentence from the start symbol 51

Exercise Give a derivation for a + a * a 52

Language defined by a Grammar Extend the one step derivation > to >* derive in zero or more steps >+ derive in one or more steps The language defined by a grammar G = (N,Σ,P,S) is: L(G) = { w ε Σ* S >+ w } A sentence w ε L(G) only contains terminals 53

Derivations At each derivation step there are choices: Which nonterminal will we replace? Which alternative of the selected nonterminal will we apply? Two choices: Leftmost: always select leftmost nonterminal Rightmost: always select leftmost nonterminal 54

Examples Recall our -(a+a) example Leftmost derivation of -(a+a): E > - E > - ( E ) > - ( E + E) > - ( a + E) > - ( a + a ) Rightmost derivation of -(a+a): E > - E > - ( E ) > - ( E + E) > - ( E + a) > - ( a + a ) 55

Derivation versus parsing A derivation generates a sentence from the start symbol A recognizer does the inverse: it deduces the start symbol from the sentence Leftmost derivation leads to a topdown recognizer (LL parser) Rightmost derivation leads to a bottom-up recognizer (LR parser) 56

Recognizing versus Parsing Recognizer: Is this string in the language? Parser: Is this string in the language? If so, return a syntax tree Generalized Parser: Idem, but may return more than one tree Accepts larger class of grammars 57

A Recognizer Grammar l1 -> r1 l2 -> r2... ln -> rn Source text Recognizer Yes or No ( + Errors) 58

A Parser Grammar l1 -> r1 l2 -> r2... ln -> rn Parse tree Source text or Errors Parser 59

Generalized Parser (as used in Rascal) Grammar l1 -> r1 l2 -> r2... ln -> rn Parse trees or Errors Source text GLL+ 60

Recall Language E parsetreeviewer(#start[e]) 61

General Parsing Approaches Top-Down (predictive) Predict what you want to parse, and verify the input Leftmost derivation Bottom-Up Recognize token by token and infer what you are recognizing by combining these tokens. Rightmost derivation The type of grammar determines the parsing techniques that can be used 62

Parsing techniques Top-Down LL(1): Left-to-right, Leftmost derivation, 1 symbol lookahead LL(k), k symbols lookahead... Bottom-Up LR(1): Left-to-right, Rightmost, 1 symbol lookahead LR(k) LALR(k) SLR(k)... 63

Top-Down Parser assignment statement expression identifier position Integer := 70 64

Bottom-Up Parser assignment statement expression identifier position Integer := 70 65

How do we get a parser? Write it by hand Pro: large flexibility Pro: each mixing of parsing with other actions Type checking Tree building Pro: specialized error messages/error recovery Con: more effort Con: reprogramming needed when grammar changes Con: unclear which language is recognized 66

Example: Writing a Parser syntax A = "a"; syntax B = "b"; start syntax C = "c" A C B; Idea: implement three functions bool parsea() bool parseb() bool parsec() that parse the corresponding nonterminal 67

Infrastructure syntax A = "a"; syntax B = "b"; start syntax C = "c" A C B; str input = ""; int cursor = -1; private void initparse(str s) { input = s; cursor = 0; } private str lookahead() = cursor < size(input)? input[cursor] : "$"; private void nextchar(){ cursor += 1; } public bool endofstring() = lookahead() == "$"; public bool parsec(str s){ initparse(s); return parsec() && endofstring(); } cursor nextchar() input $ lookahead() 68

Parser syntax A = "a"; syntax B = "b"; start syntax C = "c" A C B; public bool parseterm(str term){ if(lookahead() == term){ nextchar(); return true; } return false; } public bool parsea() = parseterm("a"); public bool parseb() = parseterm("b"); public bool parsec(){ if(lookahead() == "c") return parseterm("c"); if(lookahead() == "a"){ parsea(); if(parsec()){ if(lookahead() == "b"){ return parseb(); } } } return false; } rascal>parsec("aaacbbb") bool: true rascal>parsec("aaacbb") bool: false 69

Trickier Cases Fixed lookahead > 1 characters No fixed lookahead Alternatives overlap partially: Naive approach tries first alternative and then fails but another alternative may match. A backtracking approach tries each alternative and if it fails it restores the input position and tries other alternatives. Generalized parsing approach: try alternatives in parallel 70

Automatic Parser Generation Syntax of L Parser Generator L text Parser for L L tree 71

Some Parser Generators Bottom-up Yacc/Bison, LALR(1) CUP, LALR(1) SDF, SGLR Top-down: ANTLR, LL(k) JavaCC, LL(k) Rascal, GLL+ Except SDF and Rascal, all depend on a scanner generator 72

Assessment parser implementation Manual parser construction + Good error recovery + Flexible combination of parsing and actions - A lot of work Parser generators + May save a lot of work - Complex and rigid frameworks - Rigid actions - Error recovery more difficult 73

Parser Generation Architecture (table-generator) Lexical Grammar Context-free Grammar Scanner Generator Parse Table Generator Scanner Table Source Source Code Code Scanner table interpreter Scanner Parse Table Parser Parse Tree Parse table interpreter Grammars and Parsing 74

Parser Generation Architecture (program-generator) Lexical Grammar Context-free Grammar Scanner Generator Parser Generator Scanner Parser Source Source Code Code Executable Scanner program Parse Tree Executable Parsing Grammars and program Parsing 75

Parser Generation Architecture Lexical Grammar Context-free Grammar Parser Generator Source Source Code Code Scannerless Parser Parse Tree Executable Parsing program Grammars and Parsing 76

Pragmatic Issues How do I get my grammar in a form accepted by the parser generator: Rewriting, refactoring, renaming,... May be very hard (or impossible!) How does the scanner get its input? How are scanner and parser interfaced? How are actions attached to grammar rules? Semantic actions in C/Java code + Interface variables and Parsing How to define errorgrammars recovery? 77

Parsing in Rascal Scannerless, GLL+ parser Grammars can easily be composed (this is not possible with other technologies) Parsing and executing Rascal code can be mixed. Work in progress: error recovery. 78

Conclusions Parsing is a vital ingredient for many systems Formal languages, grammars, etc. are a wellestablished but rather theoretical part of computer science. Learn the basic notions! Not always easy to get to grips with a specific parsing technology Grammar rewriting/refactoring is difficult Rascal's scannerless GLL+ parser makes this unnecessary. 79

Parser in a Bigger Picture Call Graph Visualization Source Source Code Code Grammar Rules Parser Extract Calls Parse Tree Rules {<, >, <, >,... } Call Relation Visualize Graph Visualization 80

Parser in a Bigger Picture Compiler Source Source Code Code Grammar Rules Parser Typechecker Parse Tree Rules Code generator Annotated Parse Tree Code 81

Parser in a Bigger Picture Refactoring Source Source Code Code Grammar Rules Parser Typechecker Parse Tree Rules Refactor Annotated Parse Tree Refactored Program 82

Further Reading http://en.wikipedia.org/wiki/chomsky_hierarchy D. Grune & C.J.H. Jacobs, Parsing Techniques: A Practical Guide, Second Edition, Springer, 2008 Tutor/Rascalopedia (Grammar, Language, LanguageDefinition) Tutor/Rascal (Concepts/SyntaxDefinitionAndParsing, Declarations/Syntaxdefinition) Tutor/Recipes/Languages 83