Chapter Eight: Regular Expression Applications. Formal Language, chapter 8, slide 1

Similar documents
Lexical Analysis. Chapter 2

Chapter Seven: Regular Expressions. Formal Language, chapter 7, slide 1

Lexical Analysis. Lecture 2-4

Lexical Analysis. Lecture 3-4

Implementation of Lexical Analysis

CSE 105 THEORY OF COMPUTATION

Regular Languages. MACM 300 Formal Languages and Automata. Formal Languages: Recap. Regular Languages

Compiler phases. Non-tokens

Implementation of Lexical Analysis

CS 301. Lecture 05 Applications of Regular Languages. Stephen Checkoway. January 31, 2018

Monday, August 26, 13. Scanners

Wednesday, September 3, 14. Scanners

CSE 105 THEORY OF COMPUTATION

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Figure 2.1: Role of Lexical Analyzer

Regular Expressions & Automata

Automata & languages. A primer on the Theory of Computation. The imitation game (2014) Benedict Cumberbatch Alan Turing ( ) Laurent Vanbever

Scanning. COMP 520: Compiler Design (4 credits) Professor Laurie Hendren.

CS143 Handout 04 Summer 2011 June 22, 2011 flex In A Nutshell

Lexical Analysis. Implementation: Finite Automata

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

CSE 401 Midterm Exam Sample Solution 2/11/15

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

Introduction to Lexical Analysis

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Automatic Scanning and Parsing using LEX and YACC

CMSC 132: Object-Oriented Programming II

J. Xue. Tutorials. Tutorials to start in week 3 (i.e., next week) Tutorial questions are already available on-line

Compiler Design. 2. Regular Expressions & Finite State Automata (FSA) Kanat Bolazar January 21, 2010

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Chapter Seven: Regular Expressions

Lex Spec Example. Int installid() {/* code to put id lexeme into string table*/}

Finite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.

Compiler course. Chapter 3 Lexical Analysis

Implementation of Lexical Analysis

Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Compiler Construction

Context-Free Grammars

Chapter 3 Lexical Analysis

Structure of Programming Languages Lecture 3

Chapter 2 :: Programming Language Syntax

CMSC 350: COMPILER DESIGN

R10 SET a) Construct a DFA that accepts an identifier of a C programming language. b) Differentiate between NFA and DFA?

Alternation. Kleene Closure. Definition of Regular Expressions

DOMjudge team manual. Summary. Reading and writing. Submitting solutions. Viewing scores, submissions, etc.

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

Course Project 2 Regular Expressions

Lexical Analysis. Lecture 3. January 10, 2018

Regular Languages and Regular Expressions

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

CSE302: Compiler Design

Lexical Analyzer Scanner

Decision, Computation and Language

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Compiler Construction

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Lecture 18 Regular Expressions

1. (10 points) Draw the state diagram of the DFA that recognizes the language over Σ = {0, 1}

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 3

Java in 21 minutes. Hello world. hello world. exceptions. basic data types. constructors. classes & objects I/O. program structure.

The structure of a compiler

Lexical Analyzer Scanner

Context-Free Grammars

CSE450. Translation of Programming Languages. Automata, Simple Language Design Principles

Finite Automata Part Three

Finite Automata. Dr. Nadeem Akhtar. Assistant Professor Department of Computer Science & IT The Islamia University of Bahawalpur

Formal Languages and Compilers Lecture VI: Lexical Analysis

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Regular Expressions. Steve Renals (based on original notes by Ewan Klein) ICL 12 October Outline Overview of REs REs in Python

Lesson 3: Accepting User Input and Using Different Methods for Output

Scanning. COMP 520: Compiler Design (4 credits) Alexander Krolik MWF 13:30-14:30, MD 279

CSE 401/M501 Compilers

CS103 Handout 42 Spring 2017 May 31, 2017 Practice Final Exam 1

An introduction to Flex

DOMjudge team manual. Summary. Reading and writing. Submitting solutions. Viewing scores, submissions, etc.

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 2

G52LAC Languages and Computation Lecture 6

Implementation of Lexical Analysis. Lecture 4

Formal Definition of Computation. Formal Definition of Computation p.1/28

COMP Logic for Computer Scientists. Lecture 25

ECE251 Midterm practice questions, Fall 2010

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

More Scripting and Regular Expressions. Todd Kelley CST8207 Todd Kelley 1

LECTURE 6 Scanning Part 2

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Lexical Analysis. Finite Automata

Lecture 5: Regular Expression and Finite Automata

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Implementation of Lexical Analysis

Regular Expressions. Regular Expressions. Regular Languages. Specifying Languages. Regular Expressions. Kleene Star Operation

e) Implicit and Explicit Type Conversion Pg 328 j) Types of errors Pg 371

CMPSCI 250: Introduction to Computation. Lecture 20: Deterministic and Nondeterministic Finite Automata David Mix Barrington 16 April 2013

Transcription:

Chapter Eight: Regular Expression Applications Formal Language, chapter 8, slide 1 1

We have seen some of the implementation techniques related to DFAs and NFAs. These important techniques are like tricks of the programmer's trade, normally hidden from the end user. Not so with regular expressions: they are often visible to the end user, and part of the user interface of a variety of useful software tools. Formal Language, chapter 8, slide 2 2

Outline 8.1 The egrep Tool 8.2 Non-Regular Regexps 8.3 Implementing Regexps 8.4 Regular Expressions in Java 8.5 The lex Tool Formal Language, chapter 8, slide 3 3

Text File Search Unix tool: egrep Searches a text file for lines that contain a substring matching a specified pattern Echoes all such lines to standard output Formal Language, chapter 8, slide 4 4

Example: A Constant Substring File names: fred barney wilma betty egrep command and results: % egrep 'a' names barney wilma % Formal Language, chapter 8, slide 5 5

More Than Simple Substrings egrep understands a language of patterns Various dialects of its pattern-language are also used by many other tools Confusingly, these patterns are often called regular expressions, but they differ from ours To keep the two ideas separate, we'll call the text patterns used by egrep and other tools by their common nickname: regexps Formal Language, chapter 8, slide 6 6

A Regexp Dialect * like our Kleene star: for any regexp x, x* matches strings that are concatenations of zero or more strings from the language specified by x like our +: for any regexps x and y, x y matches strings that match either x or y (or both) () used for grouping ^ this special symbol at the start of the regexp allows it to match only at the start of the line $ this special symbol at the end of the regexp allows it to match only at the end of the line. matches any symbol (except the end-of-line marker) Formal Language, chapter 8, slide 7 7

Example File names: fred barney wilma betty egrep for a, followed by any string, followed by y: % egrep 'a.*y' names barney % Formal Language, chapter 8, slide 8 8

Example File names: fred barney wilma betty egrep for odd-length string; what went wrong? % egrep '.(..)*' names fred barney wilma betty % Formal Language, chapter 8, slide 9 9

Example File names: fred barney wilma betty egrep for odd-length line: % egrep '^.(..)*$' names fred barney wilma betty % Formal Language, chapter 8, slide 10 10

Example File numbers: 0 1 10 11 100 101 110 111 1000 1001 1010 egrep for numbers divisible by 3: % egrep '^(0 1(01*0)*1)*$' numbers 0 11 110 1001 % Formal Language, chapter 8, slide 11 11

Outline 8.1 The egrep Tool 8.2 Non-Regular Regexps 8.3 Implementing Regexps 8.4 Regular Expressions in Java 8.5 The lex Tool Formal Language, chapter 8, slide 12 12

Capturing Parentheses Many regexp dialects can define more than just the regular languages Capturing parentheses: \( r \) captures the text that was matched by the regexp r \n matches the same text captured by the nth previous capturing left parenthesis Found in grep (but not most versions of egrep) Formal Language, chapter 8, slide 13 13

Example File test: abaaba ababa abbbabbb abbaabb grep for lines that consist of doubled strings: % grep '^\(.*\)\1$' test abaaba abbbabbb % Formal Language, chapter 8, slide 14 14

More Than Regular The formal language corresponding to that example is {xx x Σ*} It turns out that this language is not regular Like DFAs, regular expressions can do only what you could implement in a computer using a fixed, finite amount of memory Capturing parentheses must remember a string whose size is unbounded We'll see this more formally later Formal Language, chapter 8, slide 15 15

Outline 8.1 The egrep Tool 8.2 Non-Regular Regexps 8.3 Implementing Regexps 8.4 Regular Expressions in Java 8.5 The lex Tool Formal Language, chapter 8, slide 16 16

Many Regexp Tools Many programs make use of regexp dialects: Text tools like emacs, vi, and sed Compiler construction tools like lex Programming languages like Perl, Ruby, and Python Program language libraries like those for Java and the.net languages How do all these systems implement regexp matching? Formal Language, chapter 8, slide 17 17

Implementing Regexps We've already seen how, roughly: Convert regexp to an NFA Simulate that Or, convert to DFA and simulate that Many implementation tricks are possible; we haven't worried much about efficiency And some important details are different because regexps are used to match substrings Formal Language, chapter 8, slide 18 18

Using a DFA Our basic DFA decides after it reads the whole string For regexps, we need to find whether any substring is accepted That means running the DFA repeatedly, on each successive starting position Run the DFA until: it enters an accepting state: that's a match enters a non-accepting trap state: restart the DFA from the next possible starting position hits the end of the string: restart the DFA from the next possible starting position Formal Language, chapter 8, slide 19 19

Which Match? Some tools needs to know which substring matched Capturing parentheses, for example If there is more than one match in a given string, which should the tool find? The string abcab contains two substrings that match the regexp ab It isn't enough to specify the leftmost match: what if several matches start at the same place? The string abb contains three substrings that match the regexp ab*, and they all start at the first symbol Formal Language, chapter 8, slide 20 20

Longest Leftmost Some tools are required to find the longest leftmost match in a string The string abbcabb contains six matches for ab* The first abb is the longest leftmost match That means running the DFA past accepting states Run the DFA starting from each successive position, until it enters a non-accepting trap or hits the end As you go, keep track of the last accepting state entered, and the string position at the time At the end of this iteration, if any accepting state was recorded, that is the longest leftmost match Formal Language, chapter 8, slide 21 21

Using an NFA Similar accommodations are required Run from each successive starting position When an implementation using backtracking finds a match, it cannot necessarily stop there If the longest match is required, it must remember the match and continue Explore all paths through the NFA to make sure the longest match is found Formal Language, chapter 8, slide 22 22

Outline 8.1 The egrep Tool 8.2 Non-Regular Regexps 8.3 Implementing Regexps 8.4 Regular Expressions in Java 8.5 The lex Tool Formal Language, chapter 8, slide 23 23

java.util.regex The Java package java.util.regex contains classes for working with regexps in Java Two particularly important ones: The Pattern class A compiled version of a regexp, ready to be given an input string to test A bit like a Java representation of an NFA The Matcher class Has a Pattern, an input string to run it on, and the current state of the search for a match Can find matches within a string and report their locations Formal Language, chapter 8, slide 24 24

Example A mini-grep written in Java We'll take a regexp from the command line, and make it into a Pattern Then, for each line of the standard input: we'll make a Matcher for that line and use it to test for a match with our Pattern If it matches, we'll echo the line to the standard output Formal Language, chapter 8, slide 25 25

import java.io.*; import java.util.regex.*; /** * A Java application to demonstrate the Java package * java.util.regex. We take one command-line argument, * which is treated as a regexp and compiled into a * Pattern. We then use that pattern to filter the * standard input, echoing to standard output only * those lines that match the Pattern. */ Formal Language, chapter 8, slide 26 26

class RegexFilter { public static void main(string[] args) throws IOException { Pattern p = Pattern.compile(args[0]); // the regexp BufferedReader in = // standard input new BufferedReader(new InputStreamReader(System.in)); // Read and echo lines until EOF. } } String s = in.readline(); while (s!=null) { Matcher m = p.matcher(s); if (m.matches()) System.out.println(s); s = in.readline(); } Formal Language, chapter 8, slide 27 27

Example, Continued Now this Java application can be used to do our divisible-by-three filtering: % java RegexFilter '^(0 1(01*0)*1)*$' < numbers 0 11 110 1001 % Formal Language, chapter 8, slide 28 28

Outline 8.1 The egrep Tool 8.2 Non-Regular Regexps 8.3 Implementing Regexps 8.4 Regular Expressions in Java 8.5 The lex Tool Formal Language, chapter 8, slide 29 29

Preconstructing Automata Applications like grep take a regexp as user input: Convert regexp to automaton Simulate the automaton on a text file Discard automaton when finished Other applications, like compilers, use the same regexp and automaton each time It would be a waste of time to do the regexp-to-automaton conversion each time the compiler is run Formal Language, chapter 8, slide 30 30

The lex Tool lex preconstructs automata: Regexps as input C source code for a DFA as output Similar tools exist for many other output languages Useful for applications like compilers that use the same set of regexps over and over Formal Language, chapter 8, slide 31 31

The lex Input File definition section %% rules section %% user subroutines Definition section: a variety of preliminary definitions User subroutines: C code for extra C functions used from inside the rules section In our examples, both these sections will be empty: %% rules section %% Formal Language, chapter 8, slide 32 32

Rules Section %% abc {fprintf(yyout, "Found one.\n");}. \n {} %% The rules section is a list of regexps Each regexp is followed by some C code, to be executed whenever a match is found This example has two regexps with code Formal Language, chapter 8, slide 33 33

First regexp: abc Rules Section %% abc {fprintf(yyout, "Found one.\n");}. \n {} %% Action when abc is matched: print a message to the current output file (yyout), which is the standard output in this case Formal Language, chapter 8, slide 34 34

Rules Section %% abc {fprintf(yyout, "Found one.\n");}. \n {} %% Second regexp:. \n. matches any symbol except end-of-line \n matches end-of-line Action when. \n is matched: do nothing The string abc matches both regexps, but the lexgenerated code goes with the longest match Formal Language, chapter 8, slide 35 35

Running lex % flex abc.l % gcc lex.yy.c -o abc ll % Assuming our example is stored as abc.l flex is the gnu implementation of lex C code output by flex is stored in lex.yy.c The gcc command compiles the C code It puts the executable in abc (because of -o abc) It links with the special lex library (because of -ll) Formal Language, chapter 8, slide 36 36

Running The lex Program Suppose abctest contains these lines: abc aabbcc abcabc Then our lex program does this: % abc < abctest Found one. Found one. Found one. % Formal Language, chapter 8, slide 37 37

Example %% ^(0 1(01*0)*1)*$ {fprintf(yyout, "%s\n", yytext);}. \n {} %% This lex program echoes numbers divisible by 3 The same thing we've already done with Java and with egrep The lex variable yytext gives us a way to access the substring that matched the regexp Formal Language, chapter 8, slide 38 38

Larger Applications For simple applications, the code produced by lex can be compiled as a stand-alone program, as in the previous examples For most large applications, the code produced by lex is one of many source files from which the full program is compiled Compilers are sometimes written this way: lex generates some source Other tools (like yacc) generate some source Some is written by hand Formal Language, chapter 8, slide 39 39