Regular Expressions. using REs to find patterns. implementing REs using finite state automata. Sunday, 4 December 11

Similar documents
CSE528 Natural Language Processing Venue:ADB-405 Topic: Regular Expressions & Automata. www. l ea rn ersd esk.weeb l y. com

applied regex implementing REs using finite state automata using REs to find patterns Informatics 1 School of Informatics, University of Edinburgh 1

Lecture 3 Regular Expressions and Automata

Natural Language Processing

Compiler Design. 2. Regular Expressions & Finite State Automata (FSA) Kanat Bolazar January 21, 2010

Structure of Programming Languages Lecture 3

Describing Languages with Regular Expressions

Tuesday, September 30, 14.

Regular Expressions. Computer Science and Engineering College of Engineering The Ohio State University. Lecture 9

Basic Text Processing

Regular Expressions. Steve Renals (based on original notes by Ewan Klein) ICL 12 October Outline Overview of REs REs in Python

FSASIM: A Simulator for Finite-State Automata

Regular Expressions. Regular expressions are a powerful search-and-replace technique that is widely used in other environments (such as Unix and Perl)

Understanding Regular Expressions, Special Characters, and Patterns

Lecture 18 Regular Expressions

Expressions, Text Normalization, Edit Distance

Ruby Regular Expressions AND FINITE AUTOMATA

Lecture 11: Regular Expressions. LING 1330/2330: Introduction to Computational Linguistics Na-Rae Han


Computer Systems and Architecture

Computer Systems and Architecture

CSCI 2132 Software Development. Lecture 7: Wildcards and Regular Expressions

Regular Expressions. Regular expressions match input within a line Regular expressions are very different than shell meta-characters.

C How to Program, 6/e by Pearson Education, Inc. All Rights Reserved.

UNIX / LINUX - REGULAR EXPRESSIONS WITH SED

Formal Languages. Formal Languages

STREAM EDITOR - REGULAR EXPRESSIONS

Regex, Sed, Awk. Arindam Fadikar. December 12, 2017

SPEECH RECOGNITION COMMON COMMANDS

Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)

Regular Expressions. Michael Wrzaczek Dept of Biosciences, Plant Biology Viikki Plant Science Centre (ViPS) University of Helsinki, Finland

Digital Humanities. Tutorial Regular Expressions. March 10, 2014

Languages and Compilers

Regular Expressions. Todd Kelley CST8207 Todd Kelley 1

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

Part III. Shell Config. Tobias Neckel: Scripting with Bash and Python Compact Max-Planck, February 16-26,

CMSC 132: Object-Oriented Programming II

System & Network Engineering. Regular Expressions ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn 22 september 2008

Paolo Santinelli Sistemi e Reti. Regular expressions. Regular expressions aim to facilitate the solution of text manipulation problems

Regular Expressions & Automata

Dr. Sarah Abraham University of Texas at Austin Computer Science Department. Regular Expressions. Elements of Graphics CS324e Spring 2017

Introduction to Lexing and Parsing

Regular Expressions 1

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Set and Set Operations

Learning Ruby. Regular Expressions. Get at practice page by logging on to csilm.usu.edu and selecting. PROGRAMMING LANGUAGES Regular Expressions

CS Unix Tools & Scripting

More Details about Regular Expressions

Here's an example of how the method works on the string "My text" with a start value of 3 and a length value of 2:

Behaviour Diagrams UML

Wildcards and Regular Expressions

Regular Expressions.

Lexical and Syntax Analysis

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

TAFL 1 (ECS-403) Unit- V. 5.1 Turing Machine. 5.2 TM as computer of Integer Function

CSc 453 Lexical Analysis (Scanning)

CSCI 4152/6509 Natural Language Processing Lecture 6: Regular Expressions; Text Processing in Perl

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Regular Expressions Explained

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Perl Regular Expressions. Perl Patterns. Character Class Shortcuts. Examples of Perl Patterns

Ray Pereda Unicon Technical Report UTR-02. February 25, Abstract

CSCI312 Principles of Programming Languages!

Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Regular Expressions. Chapter 6

Bash Script. CIRC Summer School 2015 Baowei Liu

Theory of Computations Spring 2016 Practice Final Exam Solutions

Part 5 Program Analysis Principles and Techniques

UNIX II:grep, awk, sed. October 30, 2017

CS Bootcamp Boolean Logic Autumn 2015 A B A B T T T T F F F T F F F F T T T T F T F T T F F F

CS102: Variables and Expressions

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

1. Draw the state graphs for the finite automata which accept sets of strings composed of zeros and ones which:

Regular Expressions. Regular Expression Syntax in Python. Achtung!

Regular Expressions. Perl PCRE POSIX.NET Python Java

Principles of Programming Languages COMP251: Syntax and Grammars

2. λ is a regular expression and denotes the set {λ} 4. If r and s are regular expressions denoting the languages R and S, respectively

Regex Guide. Complete Revolution In programming For Text Detection

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Pattern Matching. An Introduction to File Globs and Regular Expressions

Features of C. Portable Procedural / Modular Structured Language Statically typed Middle level language

ASCII Code - The extended ASCII table

Indian Institute of Technology Kharagpur. PERL Part III. Prof. Indranil Sen Gupta Dept. of Computer Science & Engg. I.I.T.

Today s Lecture. The Unix Shell. Unix Architecture (simplified) Lecture 3: Unix Shell, Pattern Matching, Regular Expressions

KEY. A 1. The action of a grammar when a derivation can be found for a sentence. Y 2. program written in a High Level Language

Pattern Matching. An Introduction to File Globs and Regular Expressions. Adapted from Practical Unix and Programming Hunter College

Lexical Analysis 1 / 52

Motivation (Scenarios) Topic 4: Grep, Find & Sed. Displaying File Names. grep

Structure of a Compiler: Scanner reads a source, character by character, extracting lexemes that are then represented by tokens.

Pieter van den Hombergh. April 13, 2018

Regular expressions. LING78100: Methods in Computational Linguistics I

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

CS 301. Lecture 05 Applications of Regular Languages. Stephen Checkoway. January 31, 2018

1 CS580W-01 Quiz 1 Solution

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

SECTION A (40 marks)

1.0 Languages, Expressions, Automata

Lexical Analysis. Lecture 3-4

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Transcription:

Regular Expressions using REs to find patterns implementing REs using finite state automata

REs and FSAs Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata Finite-state automata are a way of implementing regular expressions

Regular expressions A formal language for specifying text strings How can we search for any of these? woodchuck woodchucks Woodchuck Woodchucks

Regular Expressions for Textual Searches Who does it? Everybody: Web search engines, CGI scripts Information retrieval Word processing (Emacs, vi, MSWord) Linux tools (sed, awk, grep) Computation of frequencies from corpora Perl

http://xkcd.com/

Regular Expression Regular expression: formula in algebraic notation for specifying a set of strings String: any sequence of alphanumeric characters letters, numbers, spaces, tabs, punctuation marks Regular expression search pattern: specifying the set of strings we want to search for corpus: the texts we want to search through

Basic Regular Expression Patterns Case sensitive: d is not the same as D Disjunctions: [dd] [0123456789] Ranges: [0-9] [A-Z] Negations: [^Ss] (only when ^ occurs immediately after [ ) Optional characters:? and * Wild :. Anchors: ^ and $, also \b and \B Disjunction, grouping, and precedence: (pipe)

Caret for negation, ^, or anchor RE Match (single characters) Example Patterns Matched [^A-Z]! not an uppercase letter Oyfn pripetchik [^Ss]! neither S nor s I have no exquisite reason for t [^\.]! not a period our resident Djinn [e/]! either e or ^ look up ˆ now a^b! the pattern a^b look up aˆb now ^T T at the beginning of a line The Dow Jones closed up one

Optionality and Counters RE Match Example Patterns Matched woodchucks?! woodchuck or woodchucks The woodchuck hid colou?r!! color or colour comes in three colours (he){3} exactly 3 he s and he said hehehe.? zero or one occurrences of previous char or expression * zero or more occurrences of previous char or expression + one or more occurrences of previous char or expression {n} exactly n occurrences of previous char or expression {n, m} between n to m occurrences {n, } at least n occurrences

Wild card. RE Match Example Patterns Matched beg.n!!! any char between beg and n begin, beg n, begun big.*dog find lines where big and the big dog bit the little dog occur the big black dog bit the

Operator Precedence Hierarchy 1. Parenthesis ( ) 2. Counters * +? { } 3. Sequences and Anchors the ^my end$ 4. Disjunction Examples: /moo+/ /try ies/ /and or/

10/17/11

Example Find all instances of the word the in a text. 10/17/11

Example Find all instances of the word the in a text. /the/ 10/17/11

Example Find all instances of the word the in a text. /the/ Misses capitalized examples 10/17/11

Example Find all instances of the word the in a text. /the/ Misses capitalized examples /[tt]he/ 10/17/11

Example Find all instances of the word the in a text. /the/ Misses capitalized examples /[tt]he/ Finds other or theology 10/17/11

Example Find all instances of the word the in a text. /the/ Misses capitalized examples /[tt]he/ Finds other or theology /\b[tt]he\b/ 10/17/11

Example Find all instances of the word the in a text. /the/ Misses capitalized examples /[tt]he/ Finds other or theology /\b[tt]he\b/ /[^a-za-z][tt]he[^a-za-z]/ 10/17/11

Example Find all instances of the word the in a text. /the/ Misses capitalized examples /[tt]he/ Finds other or theology /\b[tt]he\b/ /[^a-za-z][tt]he[^a-za-z]/ Misses sentence-initial the 10/17/11

Example Find all instances of the word the in a text. /the/ Misses capitalized examples /[tt]he/ Finds other or theology /\b[tt]he\b/ /[^a-za-z][tt]he[^a-za-z]/ Misses sentence-initial the /(^ [^a-za-z])[tt]he[^a-za-z]/ 10/17/11

Errors The process we just went through was based on fixing two kinds of errors Matching strings that we should not have matched (there, then, other) False positives (Type I) Not matching things that we should have matched (The) False negatives (Type II)

A more complex example Write a RE that will match any PC with more than 500MHz and 32 Gb of disk space for less than $1000. First a RE for prices /$[0-9]+/ # whole dollars /$[0-9]+\.[0-9][0-9]/ # dollars and cents /$[0-9]+(\.[0-9][0-9])?/ #cents optional /\b$[0-9]+(\.[0-9][0-9])?\b/ #word boundaries

A more complex example Write a RE that will match any PC with more than 500MHz and 32 Gb of disk space for less than $1000. First a RE for prices /$[0-9]+/ # whole dollars /$[0-9]+\.[0-9][0-9]/ # dollars and cents /$[0-9]+(\.[0-9][0-9])?/ #cents optional /\b$[0-9]+(\.[0-9][0-9])?\b/ #word boundaries

Continued Specifications for processor speed /\b[0-9]+ *(MHz [Mm]egahertz Ghz [Gg]igahertz)\b/ Memory size /\b[0-9]+ *(Mb [Mm]egabytes?)\b/ /\b[0-9](\.[0-9]+) *(Gb [Gg]igabytes?)\b/ Vendors /\b(win(95 98 NT dows *(NT 95 98 2000)?))\b/ /\b(mac Macintosh Apple)\b/

Substitutions and Memory Substitutions: s/regexp/pattern/) s/color/colour/ Memory (\1, \2, etc. refer back to found matches) e.g., Put angle brackets around all integers in text the 39 students ==> the <39> students s/([0-9]+)/<\1>/ 15

Using Backslash RE Match Example Patterns Matched \*! an asterisk * K*A*P*L*A*N \.! a period. Dr. Livingston, I presume \?! a question mark Would you light my candle? \n! a newline \t! a tab

Some Useful Aliases RE Expansion Match Example Patterns \d [0-9] any digit Party of 5 \D [ˆ0-9] any non-digit 99p \w [a-za-z0-9_] any alphanumeric or underscore 99p \W [ˆ\w] a non-alphanumeric!!!! \s [ \r\t\n\f] whitespace (sp, tab) \S [ˆ\s] Non-whitespace in Concord

Substitutions and Memory Substitutions: s/regexp/pattern/) s/color/colour/ Memory (\1, \2, etc. refer back to found matches) e.g., Put angle brackets around all integers in text the 39 students ==> the <39> students s/([0-9]+)/<\1>/

Example Swap first two words of line s/(\w+) +(\w+)/\2 \1/ % perl -de 42 DB<1> $s = DOES HE LIKE BEER ; DB<2> print $s; DOES HE LIKE BEER DB<3> $s =~ s/(\w+) +(\w+)/\2 \1/; DB<4> print $s; HE DOES LIKE BEER

Finite State Automata & Regular Expressions Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. FSAs and their probabilistic relatives are at the core of much of what we ll do this quarter

FSAs as Graphs Let s start with the sheep language /baa+!/

FSAs as Graphs Let s start with the sheep language /baa+!/ baa! baaa! baaaa! baaaaa!...

Sheep FSA We can say the following things about this machine It has 5 states b, a, and! are in its alphabet q 0 is the start state q 4 is an accept state It has 5 transitions

More Formally You can specify an FSA by enumerating the following things. The set of states: Q A finite alphabet: Σ A start state A set of accept/final states A transition function that maps QxΣ to Q 10/17/11 24

Yet Another View The guts of FSAs can be represented as tables 0 1 b a! e 1 2 2 2,3 3 4 4 10/17/11 25

Yet Another View The guts of FSAs can be represented as tables If you re in state 1 and you re looking at an a, go to state 2 b a! e 0 1 1 2 2 2,3 3 4 4 10/17/11 25

Recognition Recognition is the process of determining if a string should be accepted by a machine Or it s the process of determining if a string is in the language we re defining with the machine Or it s the process of determining if a regular expression matches a string Those all amount the same thing in the end

Recognition Traditionally, (Turing s notion) this process is depicted with a tape.

Recognition Start in the start state Examine the current input Consult the table Go to a new state and update the tape pointer. Until you run out of tape. 10/17/11 28

Tracing a Rejection q 0 a b a! b 10/17/11 Slide from Dorr/Monz

Tracing a Rejection q 0 a b a! b b a a a! 0 1 2 3 4 10/17/11 Slide from Dorr/Monz

Tracing a Rejection q 0 a b a! b REJECT b a a a! 0 1 2 3 4 10/17/11 Slide from Dorr/Monz

Tracing an Accept q 0 q 1 q 2 q 3 q 3 q 4 b a a a! 10/17/11 Slide from Dorr/Monz

Tracing an Accept q 0 q 1 q 2 q 3 q 3 q 4 b a a a! b a a a! 0 1 2 3 4 10/17/11 Slide from Dorr/Monz

Tracing an Accept q 0 q 1 q 2 q 3 q 3 q 4 b a a a! ACCEPT b a a a! 0 1 2 3 4 10/17/11 Slide from Dorr/Monz

Regular expression search http://www.inf.ed.ac.uk/teaching/courses/il1/2010/labs/2010-10-07/regex.xml http://www.learn-javascript-tutorial.com/regularexpressions.cfm#h1.2 Search for the following expressions Alice brillig m.m c..c [A-Z][A-Z]+ J j (J j) \(.*\) l.*l l.*?l l.+l What does. stand for? (any character) * is for repetition - zero or more times [aeiou] is for any vowel 31

More Examples Finite State Automata and Regular Expressions

rara S r a r a r r Rara is similar

ra(ra)* S r a r r ra(ra)*

suss this? [ ] s s u su s sus s s u state FSAs can be represented as: - graphs - transition tables If you re in state s and you re looking at a u, go to state su input [] s su sus s s s sus s u [] su []. [] [] [] [] Now we try to write finite state machines that will search for regular expressions.

(flip) (flop) f l i [] f fl fl(i o) p S f l o fl(i o) [] f fl p

fl(i o)p S [ f f l fl i fl(i p o

regular expressions any character is a regexp matches itself if R and S are regexps, so is RS matches a match for R followed by a match for S if R and S are regexps, so is R S matches any match for R or S (or both) if R is a regexp, so is R* (R+) matches any sequence of 0 (1) or more matches for R Kleene *, + * + Stephen Cole Kleene 1909-1994