CS 301. Lecture 05 Applications of Regular Languages. Stephen Checkoway. January 31, 2018

Similar documents
Lecture 18 Regular Expressions

ECE251 Midterm practice questions, Fall 2010

CSE 105 THEORY OF COMPUTATION

Lexical Analysis. Lecture 2-4

Lexical Analysis. Chapter 2

Lexical Analysis. Lecture 3-4

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Compiler phases. Non-tokens

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Unleashing the Shell Hands-On UNIX System Administration DeCal Week 6 28 February 2011

CSE 105 THEORY OF COMPUTATION

Formal Languages and Compilers Lecture VI: Lexical Analysis

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Lecture 2. Regular Expression Parsing Awk

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Structure of Programming Languages Lecture 3

JFlex Regular Expressions

Regular Expressions. Computer Science and Engineering College of Engineering The Ohio State University. Lecture 9

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer.

Dr. Sarah Abraham University of Texas at Austin Computer Science Department. Regular Expressions. Elements of Graphics CS324e Spring 2017

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Lexical and Syntax Analysis

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

System & Network Engineering. Regular Expressions ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn 22 september 2008

Compiler course. Chapter 3 Lexical Analysis

Lexical Analysis. Lecture 3. January 10, 2018

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Chapter Seven: Regular Expressions. Formal Language, chapter 7, slide 1

Chapter 3 Lexical Analysis

Introduction to Lexical Analysis

CS 4240: Compilers and Interpreters Project Phase 1: Scanner and Parser Due Date: October 4 th 2015 (11:59 pm) (via T-square)

CSC 467 Lecture 3: Regular Expressions

Lecture Outline. COMP-421 Compiler Design. What is Lex? Lex Specification. ! Lexical Analyzer Lex. ! Lex Examples. Presented by Dr Ioanna Dionysiou

G52LAC Languages and Computation Lecture 6

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Regular Expressions Primer

MidTerm Papers Solved MCQS with Reference (1 to 22 lectures)

CSCI 2132 Software Development. Lecture 7: Wildcards and Regular Expressions

Compilers CS S-01 Compiler Basics & Lexical Analysis

CSCI 2132 Software Development. Lecture 8: Introduction to C

Figure 2.1: Role of Lexical Analyzer

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Outline CS4120/4121. Compilation in a Nutshell 1. Administration. Introduction to Compilers Andrew Myers. HW1 out later today due next Monday.

CS S-01 Compiler Basics & Lexical Analysis 1

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Part 5 Program Analysis Principles and Techniques

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

Regular Expressions. Regular Expression Syntax in Python. Achtung!

2. Lexical Analysis! Prof. O. Nierstrasz!

Implementation of Lexical Analysis

Lexical Analysis (ASU Ch 3, Fig 3.1)

LECTURE 6 Scanning Part 2

Regex, Sed, Awk. Arindam Fadikar. December 12, 2017

Languages and Compilers

UNIT -2 LEXICAL ANALYSIS

Compilers CS S-01 Compiler Basics & Lexical Analysis

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 3

Homework 1 Due Tuesday, January 30, 2018 at 8pm

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

CSE 401/M501 Compilers

Chapter Eight: Regular Expression Applications. Formal Language, chapter 8, slide 1

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

Regular Expressions. Todd Kelley CST8207 Todd Kelley 1

Lecture 3 Tonight we dine in shell. Hands-On Unix System Administration DeCal

CPS 506 Comparative Programming Languages. Syntax Specification

Chapter 2 :: Programming Language Syntax

CMSC 350: COMPILER DESIGN

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Lexical Analysis. Introduction

CSc 453 Compilers and Systems Software

Scanning. COMP 520: Compiler Design (4 credits) Alexander Krolik MWF 13:30-14:30, MD 279

CSCI 4152/6509 Natural Language Processing Lecture 6: Regular Expressions; Text Processing in Perl

Lecture 5: Regular Expression and Finite Automata

Implementation of Lexical Analysis

Spoke. Language Reference Manual* CS4118 PROGRAMMING LANGUAGES AND TRANSLATORS. William Yang Wang, Chia-che Tsai, Zhou Yu, Xin Chen 2010/11/03

Regexs with DFA and Parse Trees. CS230 Tutorial 11

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

DVA337 HT17 - LECTURE 4. Languages and regular expressions

The Structure of a Syntax-Directed Compiler

IB047. Unix Text Tools. Pavel Rychlý Mar 3.

Lexical analysis. Syntactical analysis. Semantical analysis. Intermediate code generation. Optimization. Code generation. Target specific optimization

Automation Framework for Large-Scale Regular Expression Matching on FPGA. Thilan Ganegedara, Yi-Hua E. Yang, Viktor K. Prasanna

Regular Expressions. with a brief intro to FSM Systems Skills in C and Unix

27-Sep CSCI 2132 Software Development Lecture 10: Formatted Input and Output. Faculty of Computer Science, Dalhousie University. Lecture 10 p.

BNF, EBNF Regular Expressions. Programming Languages,

MATVEC: MATRIX-VECTOR COMPUTATION LANGUAGE REFERENCE MANUAL. John C. Murphy jcm2105 Programming Languages and Translators Professor Stephen Edwards

Compiler Design. 2. Regular Expressions & Finite State Automata (FSA) Kanat Bolazar January 21, 2010

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Typescript on LLVM Language Reference Manual

Compiler Construction LECTURE # 3

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 2

Front End: Lexical Analysis. The Structure of a Compiler

Dr. D.M. Akbar Hussain

CS415 Compilers. Lexical Analysis

Transcription:

CS 301 Lecture 05 Applications of Regular Languages Stephen Checkoway January 31, 2018 1 / 17

Characterizing regular languages The following four statements about the language A are equivalent The language A is regular Some DFA M recognizes A (i.e., L(M) = A) Some NFA N recognizes A (i.e., L(N) = A) Some regular expression R generates (or describes) A (i.e., L(R) = A) 2 / 17

Converting between DFA, NFA, regex DFA M = (Q 1, Σ, δ 1, q 1, F 1 ) δ 2 (q, t) = {δ 1 (q, t)} Q 1 = P (Q 2 ) Construct GNFA and remove states Construct NFAs for base cases and combine NFA N = (Q 2, Σ, δ 2, q 2, F 2 ) Regular Expression Construct GNFA and remove states 3 / 17

Types of regular expressions Formal language-theoretic regular expressions (this class) Portable Operating System Interface (POSIX) basic and extended regular expressions Perl-compatible regular expressions (PCRE) (not always regular!) Many languages use similar regex, Java, JavaScript, Python, Ruby,... Vim regular expressions Boost regular expressions... 4 / 17

Regex in text processing Alphabet is usually ASCII characters Common tasks include Finding lines that match (or have a substring that matches) the regex Text substitution: match a regex, replace parts of it E.g., restructuring formatted data Validating input E.g., untainting user input in Perl Web (or other data) scraping Syntax highlighting in editors 5 / 17

POSIX regex Most characters match literally E.g., the formal regex red would be written red or /red/ Metacharacters. Equivalent to Σ: matches any character (not completely true as newlines are typically not matched) 6 / 17

POSIX regex Most characters match literally E.g., the formal regex red would be written red or /red/ Metacharacters. Equivalent to Σ: matches any character (not completely true as newlines are typically not matched) [ ] Matches characters contained in the brackets E.g., [abc] is a b c; [a-za-z0-9] is a b z A B Z 0 1 9 6 / 17

POSIX regex Most characters match literally E.g., the formal regex red would be written red or /red/ Metacharacters. Equivalent to Σ: matches any character (not completely true as newlines are typically not matched) [ ] Matches characters contained in the brackets E.g., [abc] is a b c; [a-za-z0-9] is a b z A B Z 0 1 9 [ˆ ] Matches characters not contained in the brackets E.g., [ˆabc] matches any character except a, b, or c 6 / 17

POSIX regex Most characters match literally E.g., the formal regex red would be written red or /red/ Metacharacters. Equivalent to Σ: matches any character (not completely true as newlines are typically not matched) [ ] Matches characters contained in the brackets E.g., [abc] is a b c; [a-za-z0-9] is a b z A B Z 0 1 9 [ˆ ] Matches characters not contained in the brackets E.g., [ˆabc] matches any character except a, b, or c ˆ Matches the start of the string or the start of the line 6 / 17

POSIX regex Most characters match literally E.g., the formal regex red would be written red or /red/ Metacharacters. Equivalent to Σ: matches any character (not completely true as newlines are typically not matched) [ ] Matches characters contained in the brackets E.g., [abc] is a b c; [a-za-z0-9] is a b z A B Z 0 1 9 [ˆ ] Matches characters not contained in the brackets E.g., [ˆabc] matches any character except a, b, or c ˆ Matches the start of the string or the start of the line $ Matches the end of the string or the end of the line 6 / 17

POSIX regex Most characters match literally E.g., the formal regex red would be written red or /red/ Metacharacters. Equivalent to Σ: matches any character (not completely true as newlines are typically not matched) [ ] Matches characters contained in the brackets E.g., [abc] is a b c; [a-za-z0-9] is a b z A B Z 0 1 9 [ˆ ] Matches characters not contained in the brackets E.g., [ˆabc] matches any character except a, b, or c ˆ Matches the start of the string or the start of the line $ Matches the end of the string or the end of the line ( ) Defines a subexpression 6 / 17

POSIX regex More metacharacters * Matches the preceding element zero or more times E.g., ab*c is ab c + Matches the preceding element one or more times E.g., ab+c is abb c? Matches the preceding element zero or one times E.g., ab?c is a(b ε)c {m,n} Matches the preceding element at least m and at most n times E.g.,.{2,4} is ΣΣ ΣΣΣ ΣΣΣΣ Normal or E.g., abc def is abc def 7 / 17

Character classes Character classes are shorthands for [ ] or [ˆ ] expressions [:alpha:] Equivalent to [A-Za-z] [:digit:] Equivalent to [0-9] (written \d in PCRE or Vim)... The POSIX ones (with the brackets and colons) must appear inside brackets E.g., [[:digit:]abc] matches a digit or a, b, or c 8 / 17

Some common tools grep (or egrep): Selects lines that match a regex egrep '((1-)?[0-9]{3}-)?[0-9]{3}-[0-9]{4}' file awk (or gawk or mawk or nawk): Runs a program on lines that match awk '/cat hat/ { print $1, $3 }' sed: Reads lines from files and applies commands sed -E 's/([^,]*),(.*)/\2,\1/' file 9 / 17

Programming language support Built-in support Perl: $foo =~ /foo bar?/ or $foo =~ s/red/blue/ Bash: if [[ "$x" =~ foo bar baz ]]; then echo match; fi Ruby: 'haystack' =~ /hay/... Standard library support Python. re module has re.compile('ab*') and related functions C++11. std::regex... Languages without built-in support usually use strings for regex and this leads to lots of escaping: /\d/ becomes "\\\d" 10 / 17

Match objects or variables Usually, just matching a string isn t enough We want to extract matching substrings and do something with them Parentheses denote capturing groups and the text that matches the corresponding subexpression is available using special variables (like $1, $2,... ) 'foo bar baz ' =~ /([^ ]+) ([^ ]+)/; print "$1\n"; # prints foo print "$2\n"; # prints bar via returned match object >>> import re >>> m = re. match (r'([^ ]+) ([^ ]+) ', " foo bar baz ") >>> m. group (1) 'foo ' >>> m. group (2) 'bar ' 11 / 17

Much much more There s a lot more than I ve touched on Read some of the documentation to see how best to use regex in your language of choice Many popular regex implementations have extentions that allow the language to match strings from some nonregular languages 12 / 17

You cannot parse HTML with regular expressions! 13 / 17

Compiler construction Compilers typically operate in phases 1 Lexical analysis (lexing or tokenizing) splits sequences of characters into tokens 2 Syntax analysis (parsing) generates a parse tree and checks that the program is syntatically correct (more on this later!) 3 Semantic analysis checks if the parse tree follows the rules of the language 4 Code generation and optimization (the bulk of the work of a compiler) 14 / 17

Lexing Lexing splits a sequence of characters into tokens with types and values Consider int foo = 32; This might be split into a sequence of tokens IDENTIFIER, int, IDENTIFIER, foo, EQUAL SIGN, INTEGER, 32, SEMICOLON The parsing stage might have a rule that says that a variable declaration consists of two identifiers, an equal sign, an expression, and a semicolon The semantic analysis phase would check that the first identifier was a valid type and that the second identifier was a valid variable name, and that the expression was valid 15 / 17

Flex Flex is a tool that is used to construct (usually C) source code to run as tokens are created /* Definitions */ IDENTIFIER [A-Za -z_ ][A-Za -z0-9_]* DIGIT [0-9] %% /* Rules for what code to run when matching the * corresponding regular expression */ { DIGIT }+ { /* construct INTEGER token */ } { DIGIT }+"."{ DIGIT }* { /* construct FLOAT token */ } { IDENTIFIER } { /* construct IDENTIFIER token */ } 16 / 17

Implementing regular expression matching Some options Table driven: convert to DFA and encode δ as a table Encode as loops and conditionals: convert to DFA but encode the transitions using control structures from the target language Backtracking: convert to NFA and employ a backtracking strategy if a choice was incorrect Brzozowski derivative (named for Janusz Brzozowski): for the first character t in the string, construct a new regular expression t 1 R to match against the remaining characters, repeat 17 / 17