Transcription:

Programming Languages and Compilers (CS 421)
Elsa L Gunter, 2112 SC, UIUC
http://courses.engr.illinois.edu/cs421
Based in part on slides by Mattox Beckman, as updated by Vikram Adve and Gul Agha
10/20/16

Regular Expressions
- (0 ∨ 1)*1
  - The set of all strings of 0's and 1's ending in 1: {1, 01, 11, …}
- a*b(a*)
  - The set of all strings of a's and b's with exactly one b
- ((01) ∨ (10))*
  - You tell me
- Regular expressions (equivalently, regular grammars) are important for lexing: breaking strings into recognized words

Regular Grammars
- Subclass of BNF (covered in detail soon)
- Only rules of the form
    <nonterminal> ::= <terminal><nonterminal>
    <nonterminal> ::= <terminal>
    <nonterminal> ::= ε
- Defines the same class of languages as regular expressions
- Important for writing lexers (programs that convert strings of characters into strings of tokens)
- Close connection to nondeterministic finite state automata: nonterminals ≈ states; rule ≈ edge

Example
- Regular grammar:
    <Balanced> ::= ε
    <Balanced> ::= 0<OneAndMore>
    <Balanced> ::= 1<ZeroAndMore>
    <OneAndMore> ::= 1<Balanced>
    <ZeroAndMore> ::= 0<Balanced>
- Generates even-length strings where every initial substring of even length has the same number of 0's as 1's

Example: Lexing
- Regular expressions are good for describing lexemes (words) in a programming language
  - Identifier = (a ∨ b ∨ … ∨ z ∨ A ∨ B ∨ … ∨ Z) (a ∨ b ∨ … ∨ z ∨ A ∨ B ∨ … ∨ Z ∨ 0 ∨ 1 ∨ … ∨ 9)*
  - Digit = (0 ∨ 1 ∨ … ∨ 9)
  - Number = 0 ∨ (1 ∨ … ∨ 9)(0 ∨ … ∨ 9)* ∨ ~ (1 ∨ … ∨ 9)(0 ∨ … ∨ 9)*
  - Keywords: if = if, while = while, …

Implementing Regular Expressions
- Regular expressions are a reasonable way to generate strings in a language
- Not so good for recognizing when a string is in the language
- Problems with regular expressions:
  - which option to choose,
  - how many repetitions to make
- Answer: finite state automata
- Should have seen in CS374
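To make the "nonterminals ≈ states; rule ≈ edge" remark concrete, here is a minimal sketch that reads the <Balanced> grammar above as a finite state automaton. The names `state`, `step`, and `balanced` are illustrative assumptions, not part of the slides; the ε rule is modeled by treating Balanced as the only accepting state.

```ocaml
(* One state per nonterminal of the <Balanced> grammar. *)
type state = Balanced | OneAndMore | ZeroAndMore

(* One edge per rule of the form <A> ::= c<B>; anything else is a dead end. *)
let step (st : state) (c : char) : state option =
  match st, c with
  | Balanced, '0' -> Some OneAndMore
  | Balanced, '1' -> Some ZeroAndMore
  | OneAndMore, '1' -> Some Balanced
  | ZeroAndMore, '0' -> Some Balanced
  | _ -> None

(* Run the automaton over the whole string. The rule <Balanced> ::= ε
   makes Balanced the accepting state. *)
let balanced (s : string) : bool =
  let rec go st i =
    if i = String.length s then st = Balanced
    else
      match step st s.[i] with
      | Some st' -> go st' (i + 1)
      | None -> false
  in
  go Balanced 0
```

For instance, `balanced "1001"` follows the edges 1, 0, 0, 1 back to the accepting state, while `balanced "0011"` gets stuck on the second 0.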

Lexing
- Different syntactic categories of "words": tokens
- Example: convert a sequence of characters into a sequence of strings, integers, and floating point numbers
- "asd 123 jkl 3.14" will become:
    [String "asd"; Int 123; String "jkl"; Float 3.14]

Lex, ocamllex
- Could write the reg exp, then translate to a DFA by hand
  - A lot of work
- Better: write a program that takes reg exps as input and automatically generates automata
- Lex is such a program
- ocamllex is the version for OCaml

How to do it
- To use regular expressions to parse our input we need:
  - Some way to identify the input string; call it a lexing buffer
  - A set of regular expressions
  - A corresponding set of actions to take when they are matched
- The lexer will take the regular expressions and generate a state machine
- The state machine will take our lexing buffer and apply the transitions...
- If we reach an accepting state from which we can go no further, the machine will perform the appropriate action

Mechanics
- Put a table of reg exps and corresponding actions (written in OCaml) into a file <filename>.mll
- Call
    ocamllex <filename>.mll
- Produces OCaml code for a lexical analyzer in file <filename>.ml

Sample Input
    rule main = parse
        ['0'-'9']+ { print_string "Int\n" }
      | ['0'-'9']+'.'['0'-'9']+ { print_string "Float\n" }
      | ['a'-'z']+ { print_string "String\n" }
      | _ { main lexbuf }
    {
    let newlexbuf = (Lexing.from_channel stdin) in
    print_string "Ready to lex.\n";
    main newlexbuf
    }
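To show roughly what a generated lexer has to do, the following is a hand-written sketch in plain OCaml that turns "asd 123 jkl 3.14" into the token list from the Lexing slide. It is not what ocamllex emits; the names `token`, `span`, and `tokenize` are assumptions made for this illustration, and it only handles digits, lower-case letters, and simple decimal floats.

```ocaml
type token = Int of int | Float of float | String of string

let is_digit c = '0' <= c && c <= '9'
let is_lower c = 'a' <= c && c <= 'z'

(* Index of the first position at or after i where predicate p fails. *)
let rec span p s i =
  if i < String.length s && p s.[i] then span p s (i + 1) else i

(* Scan s from index i, always taking the longest match at each point. *)
let rec tokenize (s : string) (i : int) : token list =
  let n = String.length s in
  if i >= n then []
  else if is_digit s.[i] then
    let j = span is_digit s i in
    (* Prefer digits '.' digits over plain digits when both match. *)
    if j + 1 < n && s.[j] = '.' && is_digit s.[j + 1] then
      let k = span is_digit s (j + 1) in
      Float (float_of_string (String.sub s i (k - i))) :: tokenize s k
    else Int (int_of_string (String.sub s i (j - i))) :: tokenize s j
  else if is_lower s.[i] then
    let j = span is_lower s i in
    String (String.sub s i (j - i)) :: tokenize s j
  else tokenize s (i + 1) (* skip any other character, like the _ case *)
```

Calling `tokenize "asd 123 jkl 3.14" 0` yields the list shown on the slide. Writing this by hand for every token class is the "lot of work" that ocamllex is meant to remove.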

General Input
    { header }
    let ident = regexp ...
    rule entrypoint [arg1 ... argn] = parse
        regexp { action }
      | ...
      | regexp { action }
    and entrypoint [arg1 ... argn] = parse ...
    and ...
    { trailer }

Ocamllex Input
- header and trailer contain arbitrary OCaml code put at the top and bottom of <filename>.ml
- let ident = regexp ... introduces ident for use in later regular expressions
- <filename>.ml contains one lexing function per entrypoint
  - Name of function is the name given for entrypoint
  - Each entry point becomes an OCaml function that takes n+1 arguments, the extra implicit last argument being of type Lexing.lexbuf
- arg1 ... argn are for use in action

Ocamllex Regular Expression
- Single quoted characters for letters: 'a'
- _ (underscore): matches any character
- eof: special end_of_file marker
- Concatenation: same as usual
- "string": concatenation of a sequence of characters
- e1 | e2: choice (what was e1 ∨ e2)
- [c1 - c2]: choice of any character between first and second inclusive, as determined by character codes
- [^c1 - c2]: choice of any character NOT in the set
- e*: same as before
- e+: same as e e*
- e?: option (was e ∨ ε)
- e1 # e2: the characters in e1 but not in e2; e1 and e2 must describe just sets of characters
- ident: abbreviation for an earlier reg exp in let ident = regexp
- e1 as id: binds the result of e1 to id, to be used in the associated action

Ocamllex Manual
- More details can be found at
  http://caml.inria.fr/pub/docs/manual-ocaml/lexyacc.html

Example: test.mll
    {
    type result = Int of int | Float of float | String of string
    }
    let digit = ['0'-'9']
    let digits = digit +
    let lower_case = ['a'-'z']
    let upper_case = ['A'-'Z']
    let letter = upper_case | lower_case
    let letters = letter +

    rule main = parse
        (digits)'.'digits as f { Float (float_of_string f) }
      | digits as n { Int (int_of_string n) }
      | letters as s { String s }
      | _ { main lexbuf }

    {
    let newlexbuf = (Lexing.from_channel stdin) in
    print_string "Ready to lex.";
    print_newline ();
    main newlexbuf
    }

Example
    # #use "test.ml";;
    val main : Lexing.lexbuf -> result = <fun>
    val ocaml_lex_main_rec : Lexing.lexbuf -> int -> result = <fun>
    Ready to lex.
    hi there 234 5.2
    - : result = String "hi"

What happened to the rest?!?

Example
    # let b = Lexing.from_channel stdin;;
    # main b;;
    hi 673 there
    - : result = String "hi"
    # main b;;
    - : result = Int 673
    # main b;;
    - : result = String "there"

Your Turn
- Work on ML4
  - Add a few keywords
  - Implement booleans and unit
  - Implement Ints and Floats
  - Implement identifiers

Problem
- How to get the lexer to look at more than the first token at one time?
- One answer: the action tells it to, via recursive calls
- Side benefit: can add "state" into lexing
- Note: already used this with the _ case
- Mainly useful when you can make your lexer be your parser
  - OCamlyacc parser needs tokens one at a time

Example
    rule main = parse
        (digits) '.' digits as f { Float (float_of_string f) :: main lexbuf }
      | digits as n { Int (int_of_string n) :: main lexbuf }
      | letters as s { String s :: main lexbuf }
      | eof { [] }
      | _ { main lexbuf }

Example Results
    Ready to lex.
    hi there 234 5.2
    - : result list = [String "hi"; String "there"; Int 234; Float 5.2]
    #
  Used Ctrl-d to send the end-of-file signal

Dealing with comments
First attempt:
    let open_comment = "(*"
    let close_comment = "*)"
    rule main = parse
        (digits) '.' digits as f { Float (float_of_string f) :: main lexbuf }
      | digits as n { Int (int_of_string n) :: main lexbuf }
      | letters as s { String s :: main lexbuf }
      | open_comment { comment lexbuf }
      | eof { [] }
      | _ { main lexbuf }
    and comment = parse
        close_comment { main lexbuf }
      | _ { comment lexbuf }

Dealing with nested comments
    rule main = parse
        ...
      | open_comment { comment 1 lexbuf }
      | eof { [] }
      | _ { main lexbuf }
    and comment depth = parse
        open_comment { comment (depth+1) lexbuf }
      | close_comment { if depth = 1
                        then main lexbuf
                        else comment (depth - 1) lexbuf }
      | _ { comment depth lexbuf }
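The depth counter in the nested-comment rules can also be traced in plain OCaml without ocamllex. The sketch below is an assumption-laden illustration (the name `strip_comments` and the choice to return the text outside comments are mine, not the slides'): depth 0 plays the role of `main`, and depth greater than 0 plays the role of `comment depth`.

```ocaml
(* Return the text of s with all comments removed, nested ones included. *)
let strip_comments (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  (* True when the two characters at positions i and i+1 are a then b. *)
  let pair_at i a b = i + 1 < n && s.[i] = a && s.[i + 1] = b in
  let rec go i depth =
    if i >= n then ()
    else if pair_at i '(' '*' then go (i + 2) (depth + 1)
    else if depth > 0 && pair_at i '*' ')' then go (i + 2) (depth - 1)
    else begin
      (* Outside every comment we keep the character; inside we drop it. *)
      if depth = 0 then Buffer.add_char buf s.[i];
      go (i + 1) depth
    end
  in
  go 0 0;
  Buffer.contents buf
```

With a single flag instead of a counter, the first closing delimiter would end the comment even when it belongs to an inner comment, which is the flaw in the first attempt.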

Dealing with nested comments
    rule main = parse
        (digits) '.' digits as f { Float (float_of_string f) :: main lexbuf }
      | digits as n { Int (int_of_string n) :: main lexbuf }
      | letters as s { String s :: main lexbuf }
      | open_comment { comment 1 lexbuf }
      | eof { [] }
      | _ { main lexbuf }
    and comment depth = parse
        open_comment { comment (depth+1) lexbuf }
      | close_comment { if depth = 1
                        then main lexbuf
                        else comment (depth - 1) lexbuf }
      | _ { comment depth lexbuf }

Types of Formal Language Descriptions
- Regular expressions, regular grammars
- Context-free grammars, BNF grammars, syntax diagrams
- Finite state automata
- A whole family of further grammars and automata is covered in automata theory

Sample Grammar
- Language: parenthesized sums of 0's and 1's
- <Sum> ::= 0
- <Sum> ::= 1
- <Sum> ::= <Sum> + <Sum>
- <Sum> ::= (<Sum>)

BNF Grammars
- Start with a set of characters, a, b, c, …
  - We call these terminals
- Add a set of different characters, X, Y, Z, …
  - We call these nonterminals
- One special nonterminal S called the start symbol
- BNF rules (aka productions) have the form
    X ::= y
  where X is any nonterminal and y is a string of terminals and nonterminals
- A BNF grammar is a set of BNF rules such that every nonterminal appears on the left of some rule

Sample Grammar
- Terminals: 0 1 + ( )
- Nonterminals: <Sum>
- Start symbol = <Sum>
- <Sum> ::= 0
- <Sum> ::= 1
- <Sum> ::= <Sum> + <Sum>
- <Sum> ::= (<Sum>)
- Can be abbreviated as
    <Sum> ::= 0 | 1 | <Sum> + <Sum> | (<Sum>)

BNF Derivations
- Given rules X ::= yZw and Z ::= v, we may replace Z by v to say
    X => yZw => yvw
- A sequence of such replacements is called a derivation
- A derivation is called right-most if we always replace the right-most non-terminal

Example
- Start with the start symbol:
    <Sum> =>
- Pick a non-terminal: <Sum>
- Pick a rule and substitute:
  - <Sum> ::= <Sum> + <Sum>
    <Sum> => <Sum> + <Sum>
- Pick a non-terminal

Example cont.
- Pick a rule and substitute:
  - <Sum> ::= ( <Sum> )
    <Sum> => <Sum> + <Sum>
          => ( <Sum> ) + <Sum>
- Pick a non-terminal
- Pick a rule and substitute:
  - <Sum> ::= <Sum> + <Sum>
          => ( <Sum> + <Sum> ) + <Sum>
- Pick a non-terminal
- Pick a rule and substitute:
  - <Sum> ::= 1
          => ( <Sum> + 1 ) + <Sum>
- Pick a non-terminal

Example cont.
- Pick a rule and substitute:
  - <Sum> ::= 0
          => ( <Sum> + 1 ) + <Sum>
          => ( <Sum> + 1 ) + 0
- Pick a non-terminal
- Pick a rule and substitute:
  - <Sum> ::= 0
          => ( <Sum> + 1 ) + <Sum>
          => ( <Sum> + 1 ) + 0
          => ( 0 + 1 ) + 0
- ( 0 + 1 ) + 0 is generated by the grammar

Your Turn
    <Sum> ::= 0 | 1 | <Sum> + <Sum> | (<Sum>)
    <Sum> =>

BNF Semantics
- The meaning of a BNF grammar is the set of all strings consisting only of terminals that can be derived from the start symbol
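The derivation of ( 0 + 1 ) + 0 can be replayed mechanically. The sketch below models a sentential form as a list of symbols and each step as "pick the k-th nonterminal, substitute a rule's right-hand side". All names here (`symbol`, `rewrite`, `show`, `derivation`, and the rule bindings) are assumptions introduced for the illustration.

```ocaml
(* Terminals carry their text; Sum is the grammar's only nonterminal. *)
type symbol = T of string | Sum

(* Replace the k-th nonterminal (0-based, counted from the left) by rhs. *)
let rec rewrite k rhs form =
  match form with
  | [] -> invalid_arg "rewrite: no such nonterminal"
  | Sum :: rest when k = 0 -> rhs @ rest
  | Sum :: rest -> Sum :: rewrite (k - 1) rhs rest
  | (T _ as t) :: rest -> t :: rewrite k rhs rest

(* Right-hand sides of the four rules for Sum. *)
let plus = [Sum; T "+"; Sum]
let paren = [T "("; Sum; T ")"]
let zero = [T "0"]
let one = [T "1"]

let show form =
  String.concat " " (List.map (function T s -> s | Sum -> "<Sum>") form)

(* The same sequence of choices as in the slides. *)
let derivation =
  [Sum]
  |> rewrite 0 plus   (* <Sum> + <Sum> *)
  |> rewrite 0 paren  (* ( <Sum> ) + <Sum> *)
  |> rewrite 0 plus   (* ( <Sum> + <Sum> ) + <Sum> *)
  |> rewrite 1 one    (* ( <Sum> + 1 ) + <Sum> *)
  |> rewrite 1 zero   (* ( <Sum> + 1 ) + 0 *)
  |> rewrite 0 zero   (* ( 0 + 1 ) + 0 *)
```

Here `show derivation` gives the final string of terminals. A string belongs to the language exactly when some sequence of `rewrite` steps starting from `[Sum]` ends in a form with no `Sum` left.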

Regular Grammars
- Subclass of BNF
- Only rules of the form
    <nonterminal> ::= <terminal><nonterminal>
    <nonterminal> ::= <terminal>
    <nonterminal> ::= ε
- Defines the same class of languages as regular expressions
- Important for writing lexers (programs that convert strings of characters into strings of tokens)
- Close connection to nondeterministic finite state automata: nonterminals ≈ states; rule ≈ edge

Example
- Regular grammar:
    <Balanced> ::= ε
    <Balanced> ::= 0<OneAndMore>
    <Balanced> ::= 1<ZeroAndMore>
    <OneAndMore> ::= 1<Balanced>
    <ZeroAndMore> ::= 0<Balanced>
- Generates even-length strings where every initial substring of even length has the same number of 0's as 1's

Extended BNF Grammars
- Alternatives: allow rules of the form X ::= y | z
  - Abbreviates X ::= y, X ::= z
- Options: X ::= y[v]z
  - Abbreviates X ::= yvz, X ::= yz
- Repetition: X ::= y{v}*z
  - Can be eliminated by adding a new nonterminal V and rules X ::= yz, X ::= yVz, V ::= v, V ::= vV

Parse Trees
- Graphical representation of a derivation
- Each node is labeled with either a non-terminal or a terminal
- If a node is labeled with a terminal, then it is a leaf (no sub-trees)
- If a node is labeled with a non-terminal, then it has one branch for each character in the right-hand side of the rule used to substitute for it

Example
- Consider the grammar:
    <exp> ::= <factor> | <factor> + <factor>
    <factor> ::= <bin> | <bin> * <exp>
    <bin> ::= 0 | 1
- Problem: build a parse tree for 1 * 1 + 0 as an <exp>

Example cont.
- 1 * 1 + 0: <exp>
- <exp> is the start symbol for this parse tree

Example cont.
- 1 * 1 + 0: start from <exp>
- Use rule: <exp> ::= <factor>
- Use rule: <factor> ::= <bin> * <exp>
- Use rules: <bin> ::= 1 and <exp> ::= <factor> + <factor>
- Use rule: <factor> ::= <bin> (for both factors)
- Use rules: <bin> ::= 1 | 0
- Resulting parse tree (children indented under their parent):
    <exp>
      <factor>
        <bin>
          1
        *
        <exp>
          <factor>
            <bin>
              1
          +
          <factor>
            <bin>
              0
- The fringe of the tree is the string generated by the grammar

Your Turn: 1 * 0 + 0 * 1

Parse Tree Data Structures
- Parse trees may be represented by OCaml datatypes
- One datatype for each nonterminal
- One constructor for each rule
- Defined as a mutually recursive collection of datatype declarations

Example
- Recall the grammar:
    <exp> ::= <factor> | <factor> + <factor>
    <factor> ::= <bin> | <bin> * <exp>
    <bin> ::= 0 | 1
- Datatypes:
    type exp = Factor2Exp of factor
             | Plus of factor * factor
    and factor = Bin2Factor of bin
               | Mult of bin * exp
    and bin = Zero | One

Example cont.
- The parse tree for 1 * 1 + 0 can be represented as
    Factor2Exp
      (Mult (One,
             Plus (Bin2Factor One,
                   Bin2Factor Zero)))

Ambiguous Grammars and Languages
- A BNF grammar is ambiguous if its language contains strings for which there is more than one parse tree
- If all BNFs for a language are ambiguous, then the language is inherently ambiguous
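Returning to the parse tree datatypes, the claim that the fringe of a tree is the string it generates can be checked directly. The type declarations below are the ones from the slides; the functions `fringe_exp`, `fringe_factor`, `fringe_bin` and the binding `tree` are assumptions added for this sketch.

```ocaml
(* One datatype per nonterminal, one constructor per rule. *)
type exp = Factor2Exp of factor | Plus of factor * factor
and factor = Bin2Factor of bin | Mult of bin * exp
and bin = Zero | One

(* Read the leaves left to right, one function per nonterminal. *)
let rec fringe_exp = function
  | Factor2Exp f -> fringe_factor f
  | Plus (f1, f2) -> fringe_factor f1 ^ " + " ^ fringe_factor f2

and fringe_factor = function
  | Bin2Factor b -> fringe_bin b
  | Mult (b, e) -> fringe_bin b ^ " * " ^ fringe_exp e

and fringe_bin = function
  | Zero -> "0"
  | One -> "1"

(* The parse tree for 1 * 1 + 0 given on the slide. *)
let tree = Factor2Exp (Mult (One, Plus (Bin2Factor One, Bin2Factor Zero)))
```

Evaluating `fringe_exp tree` reproduces the input string. Because each constructor corresponds to exactly one rule, a value of type `exp` cannot represent a tree that the grammar does not allow.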

Example: Ambiguous Grammar
- 0 + 1 + 0 has two parse trees under <Sum> ::= <Sum> + <Sum>
  [figure: two parse trees, one grouping the string as (0 + 1) + 0 and the other as 0 + (1 + 0)]

Example
- What is the result for: 3 + 4 * 5 + 6
- Possible answers:
  - 41 = ((3 + 4) * 5) + 6
  - 47 = 3 + (4 * (5 + 6))
  - 29 = (3 + (4 * 5)) + 6 = 3 + ((4 * 5) + 6)
  - 77 = (3 + 4) * (5 + 6)

Example
- What is the value of: 7 - 5 - 2
- Possible answers:
  - In Pascal, C++, SML, associate to the left: 7 - 5 - 2 = (7 - 5) - 2 = 0
  - In APL, associate to the right: 7 - 5 - 2 = 7 - (5 - 2) = 4

Two Major Sources of Ambiguity
- Lack of determination of operator precedence
- Lack of determination of operator associativity
- Not the only sources of ambiguity
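The different answers above come from different parse trees for the same string. As a small sketch, the trees can be written out as values of an expression type and evaluated; the names `expr`, `eval`, `left_assoc`, `right_assoc`, and `readings` are assumptions made for this illustration.

```ocaml
type expr =
  | Num of int
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr

let rec eval = function
  | Num n -> n
  | Add (a, b) -> eval a + eval b
  | Sub (a, b) -> eval a - eval b
  | Mul (a, b) -> eval a * eval b

(* Two trees for 7 - 5 - 2: the associativity choice changes the value. *)
let left_assoc = Sub (Sub (Num 7, Num 5), Num 2)
let right_assoc = Sub (Num 7, Sub (Num 5, Num 2))

(* Four trees for 3 + 4 * 5 + 6, in the order listed on the slide. *)
let readings =
  [ Add (Mul (Add (Num 3, Num 4), Num 5), Num 6);
    Add (Num 3, Mul (Num 4, Add (Num 5, Num 6)));
    Add (Add (Num 3, Mul (Num 4, Num 5)), Num 6);
    Mul (Add (Num 3, Num 4), Add (Num 5, Num 6)) ]
```

Once the tree is fixed, `eval` has no choices left to make. The ambiguity lives entirely in the step from the flat string to the tree, which is why it has to be resolved in the grammar.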