Introduction to Lexing and Parsing

Similar documents
Theory and Compiling COMP360

COP 3402 Systems Software Syntax Analysis (Parser)

CSE450 Translation of Programming Languages. Lecture 4: Syntax Analysis

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}

CSCI312 Principles of Programming Languages!

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

Syntax Analysis Part I

Building Compilers with Phoenix

CMSC 330: Organization of Programming Languages. Context Free Grammars

CMSC 330: Organization of Programming Languages. Architecture of Compilers, Interpreters

Derivations of a CFG. MACM 300 Formal Languages and Automata. Context-free Grammars. Derivations and parse trees

Theoretical Part. Chapter one:- - What are the Phases of compiler? Answer:

Defining syntax using CFGs

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Architecture of Compilers, Interpreters. CMSC 330: Organization of Programming Languages. Front End Scanner and Parser. Implementing the Front End

CSE 130 Programming Language Principles & Paradigms Lecture # 5. Chapter 4 Lexical and Syntax Analysis

ECE251 Midterm practice questions, Fall 2010

CMSC 330: Organization of Programming Languages. Context Free Grammars

Syntax. In Text: Chapter 3

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages

Chapter 3. Describing Syntax and Semantics ISBN

Lexical Scanning COMP360

CSE 3302 Programming Languages Lecture 2: Syntax

CMPS Programming Languages. Dr. Chengwei Lei CEECS California State University, Bakersfield

CMSC 330: Organization of Programming Languages

Programming Language Syntax and Analysis

Structure of Programming Languages Lecture 3

CMSC 330: Organization of Programming Languages. Context Free Grammars

CSCE 314 Programming Languages

CS164: Midterm I. Fall 2003

Dr. D.M. Akbar Hussain

Lecture 4: Syntax Specification

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

CPS 506 Comparative Programming Languages. Syntax Specification

Syntax. 2.1 Terminology

EDA180: Compiler Construc6on Context- free grammars. Görel Hedin Revised:

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

Specifying Syntax COMP360

Lexical and Syntax Analysis. Top-Down Parsing

Structure of a Compiler: Scanner reads a source, character by character, extracting lexemes that are then represented by tokens.

Describing Syntax and Semantics

Languages and Compilers

4. Lexical and Syntax Analysis

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing. Görel Hedin Revised:

3. Context-free grammars & parsing

Wednesday, September 9, 15. Parsers

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs:

Syntax. Syntax. We will study three levels of syntax Lexical Defines the rules for tokens: literals, identifiers, etc.

4. Lexical and Syntax Analysis

Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Syntax Analysis Check syntax and construct abstract syntax tree

CS 315 Programming Languages Syntax. Parser. (Alternatively hand-built) (Alternatively hand-built)

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

flex is not a bad tool to use for doing modest text transformations and for programs that collect statistics on input.

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Chapter 4. Lexical and Syntax Analysis

Introduction to Parsing. Lecture 8

Formal Languages. Formal Languages

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Chapter 3: CONTEXT-FREE GRAMMARS AND PARSING Part 1

CSE302: Compiler Design

COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR

Lexical and Syntax Analysis

Week 2: Syntax Specification, Grammars

Context-Free Grammars

Homework & Announcements

Outline. Limitations of regular languages. Introduction to Parsing. Parser overview. Context-free grammars (CFG s)

Introduction to Parsing. Lecture 5

CS 314 Principles of Programming Languages

Introduction to Parsing. Lecture 5

Grammars and Parsing. Paul Klint. Grammars and Parsing

CS 314 Principles of Programming Languages. Lecture 3

Context-Free Languages and Parse Trees

Lecture 8: Context Free Grammars

Optimizing Finite Automata

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Habanero Extreme Scale Software Research Project

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

Introduction to Syntax Analysis

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University

ICOM 4036 Spring 2004

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

EECS 6083 Intro to Parsing Context Free Grammars

Programming Language Specification and Translation. ICOM 4036 Fall Lecture 3

Part 5 Program Analysis Principles and Techniques

Introduction to Parsing Ambiguity and Syntax Errors

3. Parsing. Oscar Nierstrasz

Introduction to Parsing Ambiguity and Syntax Errors

Writing a Simple DSL Compiler with Delphi. Primož Gabrijelčič / primoz.gabrijelcic.org

COP 3402 Systems Software Top Down Parsing (Recursive Descent)

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

Introduction to Syntax Analysis. The Second Phase of Front-End

Lexical Analysis. Lecture 2-4

CS 314 Principles of Programming Languages

Transcription:

Introduction to Lexing and Parsing ECE 351: Compilers Jon Eyolfson University of Waterloo June 18, 2012

1 Riddle Me This, Riddle Me That What is a compiler?

1 Riddle Me This, Riddle Me That What is a compiler? It s a specific kind of language processor.

2 Terminology We can think of a language processor as a black box translator. Input Language Magic Output Language

2 Terminology We can think of a language processor as a black box translator. Input Language Magic Output Language A compiler is a translator whose input language is a programming language and outputs machine or assembly language.

3 More Specific Translators Assembler Transliterator (or Preprocessor) Intermediate Code How the input is represented (usually internally) before generating output (e.g. AST, LLVM IR, Bytecode) Interpreter (or Simulator)

3 More Specific Translators Assembler Transforms assembly language to machine language Transliterator (or Preprocessor) Transforms one high level language to another Intermediate Code How the input is represented (usually internally) before generating output (e.g. AST, LLVM IR, Bytecode) Interpreter (or Simulator) Directly executes intermediate code

4 Inside the Black Box Input Language Lexical Analysis (Lexer) Tokens Syntax Analysis (Parser) Parse Tree Semantic Analysis (Type Checker) Code Generation Intermediate Language Output Language

5 The Lexer Also known as a scanner/screener. Goal Break up input characters into groups (tokens) Why? Ignores whitespace Provides a nice abstraction

5 The Lexer Also known as a scanner/screener. Goal Break up input characters into groups (tokens) Why? Ignores whitespace Provides a nice abstraction Example In F, we don t care that the input is a or blahlblahblah, they re both identifiers

6 Language Definition Let s revisit some more terminology. Alphabet - a finite set of symbols {a, b, c, d,...} String - any finite sequence of symbols in the alphabet Empty String - a sequence with no symbols ε

6 Language Definition Let s revisit some more terminology. Alphabet - a finite set of symbols {a, b, c, d,...} String - any finite sequence of symbols in the alphabet Empty String - a sequence with no symbols ε Language - a subset of strings in a particular alphabet

7 Example Language for Identifiers (1) Alphabet - letters and numbers {a, b, c, d,...} {0, 1, 2, 3,..} Strings x bbjl15 1monaway got1

7 Example Language for Identifiers (1) Alphabet - letters and numbers {a, b, c, d,...} {0, 1, 2, 3,..} Strings x bbjl15 1monaway got1 Which of these strings should belong to our language?

7 Example Language for Identifiers (1) Alphabet - letters and numbers {a, b, c, d,...} {0, 1, 2, 3,..} Strings x bbjl15 1monaway got1 Which of these strings should belong to our language? Why?

8 Expressing a Language Using Regular Expressions Recall a regular expression over an alphabet is made up of symbols (in the alphabet) and the following operators: * repetition (zero or more) alternation (or) sequence (implied) () grouping [] character sets

8 Expressing a Language Using Regular Expressions Recall a regular expression over an alphabet is made up of symbols (in the alphabet) and the following operators: * repetition (zero or more) alternation (or) sequence (implied) () grouping [] character sets Notes: a+ a a* (one or more) a?b b (a b) (zero or one)

9 Example Language for Identifiers (2) If our language for identifiers should begin with a letter followed by any number of letters and numbers, what should our regular expression be? Hint: [A-Z], [a-z] and [0-9] may be useful.

9 Example Language for Identifiers (2) If our language for identifiers should begin with a letter followed by any number of letters and numbers, what should our regular expression be? Hint: [A-Z], [a-z] and [0-9] may be useful. Answer: ([A-Z] [a-z])([a-z] [a-z] [0-9])*

10 Using Regular Expressions So, how do we use regular expressions (or what does grep do)? One way is to convert the regular expression to a finite state automaton (FSA) and follow it for each input character.

10 Using Regular Expressions So, how do we use regular expressions (or what does grep do)? One way is to convert the regular expression to a finite state automaton (FSA) and follow it for each input character. Finite State Automaton A set of states and state transitions Contains a start state and one or more final states States can be arbitrarily numbered, state transitions are for individual symbols in the alphabet (characters)

11 Finite State Automaton Notation s n s 0 s 1 State n Starting state Final state State transitions are represented by labeled arrows

12 Finite State Automaton Usage To see if a sentence is in our language we do the following: 1 Start a the starting state(!) 2 Follow the state transition for each character No transition, reject 3 Accept if we re in a final state, reject otherwise

13 Finite State Automaton Example Consider the simplest language, represented by the regular expression a. Implicitly our alphabet is the set of keyboard characters. This corresponds to the following FSA: s a 0 s 1

13 Finite State Automaton Example Consider the simplest language, represented by the regular expression a. Implicitly our alphabet is the set of keyboard characters. This corresponds to the following FSA: s a 0 s 1 Do we accept or reject these sentences? a ε bob

13 Finite State Automaton Example Consider the simplest language, represented by the regular expression a. Implicitly our alphabet is the set of keyboard characters. This corresponds to the following FSA: s a 0 s 1 Do we accept or reject these sentences? a ε bob Answer: we only accept a

Finite State Automaton Basic Conversions Regular Expression: a b s a 0 s b 1 s 2 14

14 Finite State Automaton Basic Conversions Regular Expression: a b s a 0 s b 1 s 2 Regular Expression: a b a s 0 s 1 b

14 Finite State Automaton Basic Conversions Regular Expression: a b s a 0 s b 1 s 2 Regular Expression: a b a s 0 s 1 b Regular Expression: a* s 0 a

15 Finite State Automaton for Identifiers What does the FSA look like for identifiers? Recall: ([A-Z] [a-z])([a-z] [a-z] [0-9])*

15 Finite State Automaton for Identifiers What does the FSA look like for identifiers? Recall: ([A-Z] [a-z])([a-z] [a-z] [0-9])* [a-z] [a-z] s 0 s 1 [0-9] [A-Z] [A-Z]

15 Finite State Automaton for Identifiers What does the FSA look like for identifiers? Recall: ([A-Z] [a-z])([a-z] [a-z] [0-9])* [a-z] [a-z] s 0 s 1 [0-9] [A-Z] [A-Z] This accepts bbjl15 and rejects 1monaway

16 Regular Languages It is known all regular expressions can be converted to a FSA Any language which can be expressed using a regular expression or a FSA is a regular language

16 Regular Languages It is known all regular expressions can be converted to a FSA Any language which can be expressed using a regular expression or a FSA is a regular language For example, W is a regular language, F is not. Why?

16 Regular Languages It is known all regular expressions can be converted to a FSA Any language which can be expressed using a regular expression or a FSA is a regular language For example, W is a regular language, F is not. Why? Regular languages cannot handle: Nesting Indefinite counting Balancing of symbols

17 Illustration of Regular Language Limitations Can we write a FSA for ( n a) n (simple parenthesis matching)?

17 Illustration of Regular Language Limitations Can we write a FSA for ( n a) n (simple parenthesis matching)? s 0 s 1 ( ) s 2 a s 3 ( ) s 4 a s 5 This is as close as we can get (in this amount of space)

17 Illustration of Regular Language Limitations Can we write a FSA for ( n a) n (simple parenthesis matching)? s 0 s 1 ( ) s 2 a s 3 ( ) s 4 a s 5 This is as close as we can get (in this amount of space) We need an FSA of infinite size (contradiction!)

18 Real Regular Expressions While this is technically correct (the best kind of correct) most regular expression implementations are somewhere in the grey area Can you write a regular expression to match: a n b n?

18 Real Regular Expressions While this is technically correct (the best kind of correct) most regular expression implementations are somewhere in the grey area Can you write a regular expression to match: a n b n? With Perl Regular Expressions, we can use: ^(a(?1)?b)$

18 Real Regular Expressions While this is technically correct (the best kind of correct) most regular expression implementations are somewhere in the grey area Can you write a regular expression to match: a n b n? With Perl Regular Expressions, we can use: ^(a(?1)?b)$ Basically, (?1) matches (a(?1)?b) and recurses to match the same number of a s and b s ^(a(?1)?b)$ ^(a(a(?1)?b)?b)$... This is just for general interest, no need to worry Source: http://tinyurl.com/6rayj5a

19 Push Down Automata We can modify our FSA to be able to match ( n a) n as follows: Add a push down stack Add another condition for a transition 1 The input symbol (as before) 2 The top symbol on the stack Allow transitions to push and pop from the stack

19 Push Down Automata We can modify our FSA to be able to match ( n a) n as follows: Add a push down stack Add another condition for a transition 1 The input symbol (as before) 2 The top symbol on the stack Allow transitions to push and pop from the stack The modified FSA is called a finite state control The stack and the FSC together form a push down automata

20 Push Down Automata Example Notation: ε means the top of the stack is empty α means the top of the stack may be anything Transitions are: symbol, top of stack, optional push/pop (, α, push ), (, pop (, ε, push a, α ε, ε s 0 s 1 s 2 s 3

20 Push Down Automata Example Notation: ε means the top of the stack is empty α means the top of the stack may be anything Transitions are: symbol, top of stack, optional push/pop (, α, push ), (, pop (, ε, push a, α ε, ε s 0 s 1 s 2 s 3 Theoretically there is no stack limit, so this works Examples: (a), (((a))) are accepted and (a)) is rejected

21 Context-Free Language Any language which can be expressed using a push down automata or context-free grammar is a context-free language We haven t used a push down automata, and neither have any of our tools, how did express a grammar for F?

21 Context-Free Language Any language which can be expressed using a push down automata or context-free grammar is a context-free language We haven t used a push down automata, and neither have any of our tools, how did express a grammar for F? We used a context-free grammar, which is specified in Extended Backus-Naur Form (BNF)

22 Backus-Naur Form BNF is a 4-tuple (T, N, S, P ), where T is a set of terminal symbols (tokens) N is a set of nonterminal symbols (rule names) S is the starting rule, which is a member of N P is a set of rules (or productions)

22 Backus-Naur Form BNF is a 4-tuple (T, N, S, P ), where T is a set of terminal symbols (tokens) N is a set of nonterminal symbols (rule names) S is the starting rule, which is a member of N P is a set of rules (or productions) All rules have the form: A γ A N (A is a nonterminal) γ (N T ) (γ is a string of terminals/nonterminals or ε) Note: B C D is shorthand for B C, B D

23 Backus-Naur Form Example Consider the grammar G = (T, N, S, P ), where T = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, + } N = { E } S = E P = E E + E E 1 2 3 4 5 6 7 8 9 0

24 Backus-Naur Form Derivations Consider x and y such that x, y (N T ) x and y are strings of terminals/nonterminals or ε We say x derives y in one step (x y) if we can apply a single rule (in P ) to x and get y

24 Backus-Naur Form Derivations Consider x and y such that x, y (N T ) x and y are strings of terminals/nonterminals or ε We say x derives y in one step (x y) if we can apply a single rule (in P ) to x and get y E + E E + E + E since E E + E P

24 Backus-Naur Form Derivations Consider x and y such that x, y (N T ) x and y are strings of terminals/nonterminals or ε We say x derives y in one step (x y) if we can apply a single rule (in P ) to x and get y E + E E + E + E since E E + E P We say x derives y (x y) if we can apply one or more steps to x to get y

25 Backus-Naur Form Usage Now that we have a grammar G, we want to know what s in our language L Our strings in this case are a sequence of terminals (or tokens)

25 Backus-Naur Form Usage Now that we have a grammar G, we want to know what s in our language L Our strings in this case are a sequence of terminals (or tokens) L(G) is the set of all strings of terminals that can be derived from the starting rule S In other words (CS): L(G) = {s S s and s T } Note: L(G) is likely an infinite set (all possible valid programs)

26 Backus-Naur Form Derivation Example Consider the string 1 + 2 + 3, is it in L(G)? Yes, since: E E + E E + E + E E + E + 3 E + 2 + 3 1 + 2 + 3

26 Backus-Naur Form Derivation Example Consider the string 1 + 2 + 3, is it in L(G)? Yes, since: E E + E E + E + E E + E + 3 E + 2 + 3 1 + 2 + 3 Note: if our derivation contains terminals and nonterminals, we call it a sentential form of G

27 BNF Leftmost Derivation (1) Consider the string 1 + 2 + 3 again, we can do a leftmost derivation by replacing the leftmost nonterminal in every step E E + E E + E + E 1 + E + E 1 + 2 + E 1 + 2 + 3

27 BNF Leftmost Derivation (1) Consider the string 1 + 2 + 3 again, we can do a leftmost derivation by replacing the leftmost nonterminal in every step E E + E E + E + E 1 + E + E 1 + 2 + E 1 + 2 + 3 This corresponds to the following parse tree...

28 Parse Tree (1) E E + E E + E 3 1 2

29 BNF Leftmost Derivation (2) Again, considering 1 + 2 + 3 again, there s another leftmost derivation, what is it?

29 BNF Leftmost Derivation (2) Again, considering 1 + 2 + 3 again, there s another leftmost derivation, what is it? E E + E 1 + E 1 + E + E 1 + 2 + E 1 + 2 + 3 If there s more than one leftmost derivation the grammar is ambiguous (that s bad)

29 BNF Leftmost Derivation (2) Again, considering 1 + 2 + 3 again, there s another leftmost derivation, what is it? E E + E 1 + E 1 + E + E 1 + 2 + E 1 + 2 + 3 If there s more than one leftmost derivation the grammar is ambiguous (that s bad) This corresponds to the following parse tree...

30 Parse Tree (2) E E + E 1 E + E 2 3

31 Summary Definitions for string, language, regular language and context-free language Creating and using finite state automaton Using BNF grammars and detecting ambiguous grammars

32 Lab 7 Use EOI in the expansion of your starting rule This makes sure parboiled tries to parse the entire input Common Problems: Your output AST is missing a bunch of input You re recognizing strings you shouldn t be Solution: Sequence(ZeroOrMore(DesignUnit()), EOI)

33 Next Lecture Removing ambiguity using precedence and associativity Extended Backus-Naur Form (EBNF) Other sources of ambiguity Methods of parsing