Course Project 2: Regular Expressions
CSE 30151, Spring 2017
Version of February 16, 2017

In this project, you'll write a regular expression matcher similar to grep, called mere (for "match and echo using regular expression"). This has three major steps: first, parse a regular expression into regular operations; second, execute the regular operations to create an NFA; third, run the NFA on input strings. Because we use a linear-time NFA recognition algorithm, our regular expression matcher will actually be much faster than one written using Perl's or Python's regular expression engine. (Most implementations of grep, as well as Google's RE2, are linear like ours.)

You will need a correct solution for CP1 to complete this project. If your CP1 doesn't work correctly (or you just weren't happy with it), you may use the official solution or another team's solution, as long as you properly cite your source.

Getting started

We've made some minor updates, so please have one team member run the commands

    git pull https://github.com/nd-cse-30151-sp17/theory-project-skeleton
    git push

and then the other team members should run git pull. The project repository should then include the following files:

    bin/
        compare_nfa
        parse_re
        union_nfa
        concat_nfa
        star_nfa
        mere
    examples/
        sipser-n{1,2,3,4}.nfa
    tests/
        test-cp2.sh
        time-cp2.sh
    cp2/

Please place the programs that you write into the cp2/ subdirectory.
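To make the goal concrete, here is the kind of session the finished matcher should support (an illustrative invocation and input of our own choosing; the behavior of mere is specified in Section 4 below):

    $ printf 'ab\naa\nb\n' | cp2/mere '(ab|a)*'
    ab
    aa

The first two lines match the regular expression in their entirety and are echoed; the third does not, so it is not printed.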

1 Parser

Note: Part 1 and the subparts of Parts 2 and 3 can all be written and tested independently.

In this first part, we'll write a parser for regular expressions. Our regular expressions allow union (|), concatenation, Kleene star (*), and the empty set (@). Parentheses are used for grouping. The grammar is as follows. The start nonterminal is Union; terminal symbols are quoted.

    Union   → Union '|' Concat
    Union   → Concat
    Concat  → Concat Unary       if not followed by '|' or ')' or end of string
    Concat  → ε                  otherwise
    Unary   → Primary '*'
    Unary   → Primary
    Primary → a                  for any symbol a not in { '|', '@', '(', ')', '*' }
    Primary → '@'
    Primary → '(' Union ')'

A parser for the CFG of a programming language essentially converts the CFG into a deterministic pushdown automaton (DPDA) and runs the DPDA on programs. There are several ways of doing this conversion; it is a complex topic that you can learn more about by taking Compilers. Here, we'll take the simplest route, a recursive-descent parser (which many of you implemented in Data Structures for a fragment of Scheme). The pseudocode is shown in Algorithm 1. The top-level function is parseRegExp. For each nonterminal symbol X, there is a function parseX, which tries to read in a string that matches X. For example, if the regular expression is ab, the trace of the parser is shown in Figure 1.

"If this is a pushdown automaton," you must be asking, "where is the stack?" The call stack itself is being used as the stack: every time we call a function, we push a return address, and every time a function returns, the return address is popped. So the stack alphabet is the set of return addresses, of which there is clearly a finite number.

Each of these functions returns the semantics of the (sub)expression that it reads in. The functions emptyset, epsilon, and symbol create semantic objects, and the functions union, concat, and star build semantic objects from smaller semantic objects. Eventually, the semantics of a regular expression will be an NFA equivalent to the regular expression. But initially, to facilitate unit testing, the semantics will just be a string containing the regular expression written in prefix notation (kind of like Scheme). For example, the regular expression (ab|a)* should become the string:

    star(union(concat(symbol(a),symbol(b)),symbol(a)))

    line                                            input
         parseRegExp()                              ab
    2:     M ← parseUnion()                         ab
    8:       M ← parseConcat()                      ab
    17:        M ← parseUnary()                     ab
    22:          M ← parsePrimary()                 ab
    38:            read a                           b
    39:            return symbol(a)                 b
    22:          M ← symbol(a)                      b
    27:          return symbol(a)                   b
    17:        M ← symbol(a)                        b
    19:        M ← concat(symbol(a), parseUnary())  b
    22:          M ← parsePrimary()                 b
    38:            read b                           ε
    39:            return symbol(b)                 ε
    22:          M ← symbol(b)                      ε
    27:          return symbol(b)                   ε
    19:        M ← concat(symbol(a), symbol(b))     ε
    20:        return concat(symbol(a), symbol(b))  ε
    8:       M ← concat(symbol(a), symbol(b))       ε
    12:      return concat(symbol(a), symbol(b))    ε
    2:     M ← concat(symbol(a), symbol(b))         ε
    4:     return concat(symbol(a), symbol(b))      ε
         concat(symbol(a), symbol(b))               ε

Figure 1: Trace of the parser (Algorithm 1) for the example regular expression ab. Computed values that have been substituted into the code are shown in blue.

Algorithm 1: Pseudocode for the recursive-descent parser.

     1: function parseRegExp()
     2:     M ← parseUnion()
     3:     if no next token then
     4:         return M
     5:     else
     6:         error
     7: function parseUnion()
     8:     M ← parseConcat()
     9:     while next token is '|' do
    10:         read '|'
    11:         M ← union(M, parseConcat())
    12:     return M
    13: function parseConcat()
    14:     if no next token, or next token is '|' or ')' then
    15:         return epsilon()
    16:     else
    17:         M ← parseUnary()
    18:         while next token exists and is not '|' or ')' do
    19:             M ← concat(M, parseUnary())
    20:         return M
    21: function parseUnary()
    22:     M ← parsePrimary()
    23:     if next token is '*' then
    24:         read '*'
    25:         return star(M)
    26:     else
    27:         return M
    28: function parsePrimary()
    29:     if next token is '(' then
    30:         read '('
    31:         M ← parseUnion()
    32:         read ')'
    33:         return M
    34:     else if next token is '@' then
    35:         read '@'
    36:         return emptyset()
    37:     else if next token is some symbol a not in { '|', '(', ')', '*' } then
    38:         read a
    39:         return symbol(a)
    40:     else
    41:         error

So, for testing purposes, these functions should do the following:

    emptyset()
        Arguments: none
        Returns: the string "emptyset()"

    epsilon()
        Arguments: none
        Returns: the string "epsilon()"

    symbol(a)
        Argument: alphabet symbol a ∈ Σ
        Returns: the string "symbol(a)"

    union(M1, M2)
        Arguments: strings M1, M2
        Returns: the string "union(M1,M2)"

    concat(M1, M2)
        Arguments: strings M1, M2
        Returns: the string "concat(M1,M2)"

    star(M)
        Argument: string M
        Returns: the string "star(M)"

Write a program called parse_re to test your parser:

    parse_re regexp

should output the prefix-notation version of regexp. Test your program by running tests/test-cp2.sh.
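For concreteness, here is one possible sketch of parse_re in Python, following Algorithm 1 and using the string-building semantic functions specified above. The class and method names (Parser, peek, read, and so on) are our own illustrative choices, not something this handout requires; any implementation whose output passes tests/test-cp2.sh is fine.

    #!/usr/bin/env python3
    # Sketch of parse_re: Algorithm 1 as a recursive-descent parser whose
    # semantic functions return prefix-notation strings.
    import sys

    # Semantic functions (string versions, for unit-testing the parser).
    def emptyset():     return "emptyset()"
    def epsilon():      return "epsilon()"
    def symbol(a):      return "symbol(" + a + ")"
    def union(m1, m2):  return "union(" + m1 + "," + m2 + ")"
    def concat(m1, m2): return "concat(" + m1 + "," + m2 + ")"
    def star(m):        return "star(" + m + ")"

    class Parser:
        def __init__(self, s):
            self.s, self.i = s, 0

        def peek(self):
            # Next token, or None at end of string (each character is a token).
            return self.s[self.i] if self.i < len(self.s) else None

        def read(self, expected=None):
            # Consume the next token, optionally checking that it is the expected one.
            c = self.peek()
            if c is None or (expected is not None and c != expected):
                raise ValueError("parse error at position " + str(self.i))
            self.i += 1
            return c

        def parse_regexp(self):                      # Algorithm 1, lines 1-6
            m = self.parse_union()
            if self.peek() is not None:
                raise ValueError("trailing input")
            return m

        def parse_union(self):                       # lines 7-12
            m = self.parse_concat()
            while self.peek() == "|":
                self.read("|")
                m = union(m, self.parse_concat())
            return m

        def parse_concat(self):                      # lines 13-20
            if self.peek() in (None, "|", ")"):
                return epsilon()
            m = self.parse_unary()
            while self.peek() not in (None, "|", ")"):
                m = concat(m, self.parse_unary())
            return m

        def parse_unary(self):                       # lines 21-27
            m = self.parse_primary()
            if self.peek() == "*":
                self.read("*")
                m = star(m)
            return m

        def parse_primary(self):                     # lines 28-41
            c = self.peek()
            if c == "(":
                self.read("(")
                m = self.parse_union()
                self.read(")")
                return m
            elif c == "@":
                self.read("@")
                return emptyset()
            elif c is not None and c not in "|@()*":
                return symbol(self.read())
            else:
                raise ValueError("unexpected token")

    if __name__ == "__main__":
        print(Parser(sys.argv[1]).parse_regexp())

For example, Parser("(ab|a)*").parse_regexp() returns star(union(concat(symbol(a),symbol(b)),symbol(a))), matching the expected output above.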

2 Easy operations

Write functions that construct the following NFAs.

    emptyset()
        Arguments: none
        Returns: NFA recognizing the empty language

    epsilon()
        Arguments: none
        Returns: NFA recognizing the language {ε}

    symbol(a)
        Argument: alphabet symbol a ∈ Σ
        Returns: NFA recognizing the language {a}

These operations are trivial to implement, and we won't bother writing tests for them.

3 Regular operations

Write functions that perform the regular operations, using the constructions given in the book:

    union(M1, M2)
        Arguments: NFAs M1, M2
        Returns: NFA recognizing the language L(M1) ∪ L(M2)

    concat(M1, M2)
        Arguments: NFAs M1, M2
        Returns: NFA recognizing the language L(M1)L(M2)

    star(M)
        Argument: NFA M
        Returns: NFA recognizing the language L(M)*

Optional: The book's construction creates a lot of ε-transitions, and in later projects these ε-transitions will proliferate. For greater efficiency, you could try using the construction in Appendix A, which creates no ε-transitions. (However, note that if you do this, the tests for union_nfa, concat_nfa, and star_nfa will fail; please contact the instructor if you choose this option.)

Write three programs, called union_nfa, concat_nfa, and star_nfa, to test your operations. Each takes one or two command-line arguments, each of which is the name of a file containing an NFA in the same format you used in CP1, and each writes an NFA to stdout in the same format.

    union_nfa nfafile1 nfafile2     Writes the union of the two NFAs to stdout
    concat_nfa nfafile1 nfafile2    Writes the concatenation of the two NFAs to stdout
    star_nfa nfafile                Writes the Kleene star of the NFA to stdout

Test your programs by running tests/test-cp2.sh.

4 Putting it together

Write a function that puts all the above functions and your NFA simulator from CP1 together:

    Arguments: regular expression α, string w
    Returns: true if α matches w, false otherwise

Put your function into a command-line tool called mere:

    mere regexp

where regexp is a regular expression. The program should read zero or more lines from stdin and write to stdout just the lines that match the regular expression. Unlike grep, the regular expression should match the entire line, not just part of the line.

Test your program by running tests/test-cp2.sh.
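For Sections 2 and 3, a minimal sketch of the book's ε-transition constructions might look like the following. The in-memory NFA representation, the EPSILON label, and the state-renaming scheme are illustrative assumptions of this sketch, not part of the handout; your CP1 representation will likely differ, and union_nfa, concat_nfa, and star_nfa still need to read and write NFAs in the CP1 file format.

    # Sketch of the easy operations and the regular operations on NFAs,
    # using the book's constructions (which introduce ε-transitions).
    # The alphabet is left implicit here for brevity.
    from dataclasses import dataclass

    EPSILON = ""  # label used for ε-transitions in this sketch

    @dataclass
    class NFA:
        states: set        # state names (strings)
        start: str
        accept: set
        transitions: set   # (state, symbol, state) triples; symbol may be EPSILON

    # Easy operations (Section 2).
    def emptyset():
        return NFA({"q0"}, "q0", set(), set())

    def epsilon():
        return NFA({"q0"}, "q0", {"q0"}, set())

    def symbol(a):
        return NFA({"q0", "q1"}, "q0", {"q1"}, {("q0", a, "q1")})

    def _rename(m, tag):
        # Prefix every state name so that two operand NFAs never share states.
        f = lambda q: tag + q
        return NFA({f(q) for q in m.states}, f(m.start), {f(q) for q in m.accept},
                   {(f(q), a, f(r)) for (q, a, r) in m.transitions})

    # Regular operations (Section 3).
    def union(m1, m2):
        # New start state with ε-transitions to both old start states.
        m1, m2 = _rename(m1, "1."), _rename(m2, "2.")
        s = "u0"
        return NFA(m1.states | m2.states | {s}, s, m1.accept | m2.accept,
                   m1.transitions | m2.transitions
                   | {(s, EPSILON, m1.start), (s, EPSILON, m2.start)})

    def concat(m1, m2):
        # ε-transitions from every accept state of M1 to the start state of M2.
        m1, m2 = _rename(m1, "1."), _rename(m2, "2.")
        return NFA(m1.states | m2.states, m1.start, m2.accept,
                   m1.transitions | m2.transitions
                   | {(q, EPSILON, m2.start) for q in m1.accept})

    def star(m):
        # New accepting start state; ε-transitions from it and from every
        # old accept state back to the old start state.
        m = _rename(m, "1.")
        s = "s0"
        return NFA(m.states | {s}, s, m.accept | {s},
                   m.transitions
                   | {(s, EPSILON, m.start)}
                   | {(q, EPSILON, m.start) for q in m.accept})

Reusing the names emptyset, epsilon, symbol, union, concat, and star means that switching the parser from the string semantics of Part 1 to the NFA semantics is just a matter of plugging in these definitions instead.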

5 Testing performance

Now run tests/time-cp2.sh. For n = 1, ..., 20, this script creates the regular expression

    (a|)(a|)···(a|)aa···a

(n copies of (a|) followed by n copies of a) and tries to match it against the string aⁿ, using a short Perl script, our solution (bin/mere), and your solution (cp2/mere). Observe that Perl has an exponential running time, whereas our solution has roughly quadratic running time. Hopefully, yours does too. In order to get full credit, your solution must run faster than Perl for some value of n.

If needed, you may increase the maximum value of n by modifying the variable NMAX in time-cp2.sh; please be sure to commit your change. However, a better strategy would be to optimize your code, looking in particular for inadvertent linear searches and copies. (Python code is especially prone to the former, and C++ code is especially prone to the latter.) The hot spots are the innermost loops of the intersection and emptiness algorithms, so look there first.
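Much of this performance advice comes down to how the NFA is simulated: keep the set of states currently reachable and advance it one input character at a time, using hash sets rather than lists so that membership tests do not turn into linear searches. A minimal sketch, assuming transitions are stored as (state, symbol, state) triples with "" as the ε label (an assumption of this sketch, not the CP1 format); your simulator from CP1 may well be organized differently:

    # Sketch of a linear-time, set-based NFA simulation for whole-string matching.

    EPSILON = ""  # ε label assumed by this sketch

    def eps_closure(transitions, states):
        # All states reachable from `states` by zero or more ε-transitions.
        closure, frontier = set(states), list(states)
        while frontier:
            q = frontier.pop()
            for (p, a, r) in transitions:
                if p == q and a == EPSILON and r not in closure:
                    closure.add(r)
                    frontier.append(r)
        return closure

    def matches(transitions, start, accept, w):
        # True iff the NFA accepts exactly the string w (mere matches whole lines).
        current = eps_closure(transitions, {start})
        for c in w:
            step = {r for (p, a, r) in transitions if p in current and a == c}
            current = eps_closure(transitions, step)
        return bool(current & set(accept))

For the timing test, it also helps to index the transitions by (state, symbol) in a dictionary once, so that each step looks up only the relevant transitions instead of rescanning the whole transition set.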

Submission instructions

Your code should build and run on studentnn.cse.nd.edu. The automatic tester will clone your repository, cd into its root directory, run make -C cp2, and then run tests/test-cp2.sh. You're advised to try all of the above steps yourself and ensure that all tests pass.

To submit your work, please push your repository to GitHub and then create a new release with tag version cp2 (note that the tag version is not the same thing as the release title). If you are making a partial submission, then use a tag version of the form cp2-12345, indicating which parts you're submitting.

Rubric

    Parser             9
    Easy operations    3
    Union              4
    Concatenation      4
    Kleene star        4
    Main program       3
    Speed              3
    Total             30

A The Berry-Sethi construction

The Berry-Sethi construction [Berry and Sethi, 1986] is a more efficient way of converting a regular expression to an NFA. Unlike the construction in the book, it does not create any ε-transitions.

Base cases. If α = ∅, α = ε, or α is a single symbol a ∈ Σ, the construction is the same as in the book.

Union. If α = α1 | α2: Convert α1 and α2 to

    M1 = (Q1, Σ, s1, F1)
    M2 = (Q2, Σ, s2, F2)

where Q1 ∩ Q2 = ∅. Then:

    - Create a new start state s.
    - For each transition out of s1 or s2 that reads a symbol a and goes to a state r, create a transition from s that reads a and goes to r.
    - State s is an accept state if either s1 or s2 is.
    - Delete s1, s2, and their outgoing transitions.

Concatenation. If α = α1 α2: Convert α1 and α2 to

    M1 = (Q1, Σ, s1, F1)
    M2 = (Q2, Σ, s2, F2)

where Q1 ∩ Q2 = ∅. Then:

    - For each accept state q ∈ F1 and each transition out of s2 that reads a symbol a and goes to a state r, create a transition from q that reads a and goes to r.
    - Each accept state q ∈ F1 continues to be an accept state iff s2 is an accept state.
    - Delete s2 and its outgoing transitions.

Kleene star. If α = α1*: Convert α1 to M1 = (Q1, Σ, s1, F1). Then:

    - For each accept state q ∈ F1 and each transition out of s1 that reads a symbol a and goes to a state r, create a transition from q that reads a and goes to r.
    - Make s1 an accept state.

References

Gerard Berry and Ravi Sethi. From regular expressions to deterministic automata. Theoretical Computer Science, 48:117-126, 1986. URL http://dx.doi.org/10.1016/0304-3975(86)90088-5.