Regular Expressions. Steve Renals (based on original notes by Ewan Klein) ICL 12 October Outline Overview of REs REs in Python

Similar documents
Compiler Design. 2. Regular Expressions & Finite State Automata (FSA) Kanat Bolazar January 21, 2010

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Dr. D.M. Akbar Hussain

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Regular Expressions & Automata

Structure of Programming Languages Lecture 3

1.3 Functions and Equivalence Relations 1.4 Languages

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Regular Expressions.

CMSC 132: Object-Oriented Programming II

CHAPTER TWO LANGUAGES. Dr Zalmiyah Zakaria

Perl Regular Expressions. Perl Patterns. Character Class Shortcuts. Examples of Perl Patterns

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Lexical Analysis (ASU Ch 3, Fig 3.1)

CSE : Python Programming

CSE 105 THEORY OF COMPUTATION

Dr. Sarah Abraham University of Texas at Austin Computer Science Department. Regular Expressions. Elements of Graphics CS324e Spring 2017

More Details about Regular Expressions

CMPSCI 250: Introduction to Computation. Lecture #28: Regular Expressions and Languages David Mix Barrington 2 April 2014

Chapter Seven: Regular Expressions

DVA337 HT17 - LECTURE 4. Languages and regular expressions

8 ε. Figure 1: An NFA-ǫ

Lexical Analysis. Sukree Sinthupinyo July Chulalongkorn University

Formal Languages and Automata

Lexical Analysis - 1. A. Overview A.a) Role of Lexical Analyzer

Lexical Analysis. Chapter 2

Regular Languages and Regular Expressions

Regular Expressions. Computer Science and Engineering College of Engineering The Ohio State University. Lecture 9

CS 314 Principles of Programming Languages. Lecture 3

Lecture 18 Regular Expressions

Formal Languages and Compilers Lecture VI: Lexical Analysis

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Lexical Analysis. Lecture 3-4

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

CSE528 Natural Language Processing Venue:ADB-405 Topic: Regular Expressions & Automata. www. l ea rn ersd esk.weeb l y. com

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

System & Network Engineering. Regular Expressions ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn 22 september 2008


This page covers the very basics of understanding, creating and using regular expressions ('regexes') in Perl.

Lexical Analysis 1 / 52

Principles of Programming Languages COMP251: Syntax and Grammars

Lexical Analysis. Lecture 2-4

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Regular Expressions. Chapter 6

Part 5 Program Analysis Principles and Techniques

1 Finite Representations of Languages

CS/ECE 374 Fall Homework 1. Due Tuesday, September 6, 2016 at 8pm

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

Automata & languages. A primer on the Theory of Computation. The imitation game (2014) Benedict Cumberbatch Alan Turing ( ) Laurent Vanbever

CS 314 Principles of Programming Languages

CSE 105 THEORY OF COMPUTATION

Regular Expressions. using REs to find patterns. implementing REs using finite state automata. Sunday, 4 December 11

Compiler Construction

Non-deterministic Finite Automata (NFA)

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Lexical Analysis. Lecture 3. January 10, 2018

2. Lexical Analysis! Prof. O. Nierstrasz!

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Figure 2.1: Role of Lexical Analyzer

Pattern Matching. An Introduction to File Globs and Regular Expressions

Finite Automata. Dr. Nadeem Akhtar. Assistant Professor Department of Computer Science & IT The Islamia University of Bahawalpur

LING115 Lecture Note Session #7: Regular Expressions

Chapter Seven: Regular Expressions. Formal Language, chapter 7, slide 1

Pattern Matching. An Introduction to File Globs and Regular Expressions. Adapted from Practical Unix and Programming Hunter College

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

Automating Construction of Lexers

Tuesday, September 30, 14.

Regular Expressions. Regular expressions are a powerful search-and-replace technique that is widely used in other environments (such as Unix and Perl)

Algorithmic Approaches for Biological Data, Lecture #8

Regular Expressions in programming. CSE 307 Principles of Programming Languages Stony Brook University

UNION-FREE DECOMPOSITION OF REGULAR LANGUAGES

Formal Languages. Formal Languages

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

CS Unix Tools & Scripting

applied regex implementing REs using finite state automata using REs to find patterns Informatics 1 School of Informatics, University of Edinburgh 1

Introduction to regular expressions

The Structure of a Syntax-Directed Compiler

CS402 Theory of Automata Solved Subjective From Midterm Papers. MIDTERM SPRING 2012 CS402 Theory of Automata

Reading Assignment. Scanner. Read Chapter 3 of Crafting a Compiler.

Regular expressions. LING78100: Methods in Computational Linguistics I

Lexical Analyzer Scanner

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Lexical Analyzer Scanner

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 3

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Introduction to Lexing and Parsing

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Ling/CSE 472: Introduction to Computational Linguistics. 4/6/15: Morphology & FST 2

Zhizheng Zhang. Southeast University

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

CSE 431S Scanning. Washington University Spring 2013

Regexs with DFA and Parse Trees. CS230 Tutorial 11

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Languages and Compilers

Introduction to Lexical Analysis

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

CMPSCI 250: Introduction to Computation. Lecture #7: Quantifiers and Languages 6 February 2012

Transcription:

Regular Expressions Steve Renals s.renals@ed.ac.uk (based on original notes by Ewan Klein) ICL 12 October 2005

Introduction Formal Background to REs Extensions of Basic REs Overview Goals: a basic idea of the formal background for REs an ability to write small Python programs that do useful things with REs

Introduction Formal Background to REs Extensions of Basic REs Motivation Task: To search for strings using (partially specified) patterns Why: validate data fields (dates, email addresses, URLs) filter text (spam, disallowed web sites) identify particular strings in a text (token boundaries for tokenization) convert the output of one processing component into the format required for a second component (rabbit_nn <word pos= NN >rabbit</word>)

Introduction Formal Background to REs Extensions of Basic REs The Basic Idea Regular expressions form a language for expressing patterns. The language can be stated as a formal algebra. Recognizers for RE can be efficiently implemented. Regular expression also a term for a pattern that is constructed using the language. Every pattern specifies a set of strings. Text string: a sequence of letters, numerals, spaces, tabs, punctuation,...

Introduction Formal Background to REs Extensions of Basic REs Initital Examples Pattern Matches concatenation abc abc disjunction a b a, b (a bb) d ad, bbd closure a* ɛ, a, aa, aaa, aaaa,... c(a bb)* c, ca, cbb, cabb, caa, cbbbb,...

Introduction Formal Background to REs Extensions of Basic REs Two Types of RE Literals Every normal text character is an RE, and denotes itself. Metacharacters Special characters which allow you to specify various sets of strings. Example Kleene star (*) a denotes a a* denotes ɛ (empty string), a, aa, aaa,...

Introduction Formal Background to REs Extensions of Basic REs Preliminaries: Operations on Sets of Strings Let Σ be a finite set of symbols and let Σ be the set of all strings (including the empty string) over Σ. Suppose L, L 1, L 2 are subsets of Σ.

Introduction Formal Background to REs Extensions of Basic REs Preliminaries: Operations on Sets of Strings Let Σ be a finite set of symbols and let Σ be the set of all strings (including the empty string) over Σ. Suppose L, L 1, L 2 are subsets of Σ. The union of L 1, L 2, denoted L 1 L 2, is the set of strings x such that x L 1 or x L 2.

Introduction Formal Background to REs Extensions of Basic REs Preliminaries: Operations on Sets of Strings Let Σ be a finite set of symbols and let Σ be the set of all strings (including the empty string) over Σ. Suppose L, L 1, L 2 are subsets of Σ. The union of L 1, L 2, denoted L 1 L 2, is the set of strings x such that x L 1 or x L 2. The concatenation of L 1, L 2, denoted L 1 L 2, is the set of strings xy such that x L 1 and y L 2.

Introduction Formal Background to REs Extensions of Basic REs Preliminaries: Operations on Sets of Strings Let Σ be a finite set of symbols and let Σ be the set of all strings (including the empty string) over Σ. Suppose L, L 1, L 2 are subsets of Σ. The union of L 1, L 2, denoted L 1 L 2, is the set of strings x such that x L 1 or x L 2. The concatenation of L 1, L 2, denoted L 1 L 2, is the set of strings xy such that x L 1 and y L 2. The Kleene closure of L, denoted L, is the set of strings constructed by concatenating any number of strings from L. L contains ɛ, the empty string.

Introduction Formal Background to REs Extensions of Basic REs Preliminaries: Operations on Sets of Strings Let Σ be a finite set of symbols and let Σ be the set of all strings (including the empty string) over Σ. Suppose L, L 1, L 2 are subsets of Σ. The union of L 1, L 2, denoted L 1 L 2, is the set of strings x such that x L 1 or x L 2. The concatenation of L 1, L 2, denoted L 1 L 2, is the set of strings xy such that x L 1 and y L 2. The Kleene closure of L, denoted L, is the set of strings constructed by concatenating any number of strings from L. L contains ɛ, the empty string. The positive closure of L, denoted L +, is the same as L but without ɛ.

Introduction Formal Background to REs Extensions of Basic REs Examples Let L 1 = {a, b} and L 2 = {c}. Then L 1 L 2 = {a, b, c} L 1 L 2 = {ac, bc} {a, b} = {ɛ, a, b, aa, bb, ab, ba,...} {a, b} + = {a, b, aa, bb, ab, ba,...}

Introduction Formal Background to REs Extensions of Basic REs Formal Definition of Regular Expressions Regular expressions over a finite alphabet Σ:

Introduction Formal Background to REs Extensions of Basic REs Formal Definition of Regular Expressions Regular expressions over a finite alphabet Σ: 1. ɛ is a regular expression and denotes the set {ɛ}.

Introduction Formal Background to REs Extensions of Basic REs Formal Definition of Regular Expressions Regular expressions over a finite alphabet Σ: 1. ɛ is a regular expression and denotes the set {ɛ}. 2. For each a in Σ, a is a regular expression and denotes the set {a}.

Introduction Formal Background to REs Extensions of Basic REs Formal Definition of Regular Expressions Regular expressions over a finite alphabet Σ: 1. ɛ is a regular expression and denotes the set {ɛ}. 2. For each a in Σ, a is a regular expression and denotes the set {a}. 3. If r and s are regular expressions denoting the sets R and S respectively, then (r s) is a regular expression denoting R S. (rs) is a regular expression denoting RS. (r ) is a regular expression denoting R.

Introduction Formal Background to REs Extensions of Basic REs Recognizers A recognizer for a language is a program that takes as input a string x and answers yes if x is a sentence of the language and no otherwise. We can think of this program as a machine which only emits two possible responses to its input.

Introduction Formal Background to REs Extensions of Basic REs Finite State Automata A Finite State Automaton (FSA) is an abstract finite machine. Regular expressions can be viewed as a way to describe a Finite State Automaton (FSA) Kleene s theorem (1956): FSA and RE describe the same languages: Any regular expression can be implemented as an FSA. Any FSA can be described by a regular expression. Regular languages are those that can be recognized by FSAs (or characterized by a regular expression).

Introduction Formal Background to REs Extensions of Basic REs Metacharacters NB. Different sets of metacharacters and notations used by different host languages (e.g., Unix grep, GNU emacs, Perl, Java, Python, etc.). Cf. Jurafsky & Martin, Appendix A) Disjunction: Wild card:. Optionality:? Quantification: * and + Choice: [Mm] [0123456789] Ranges: [a-z] [0-9] Negation: [ Mm] (only when occurs immediately after [ )

Introduction Formal Background to REs Extensions of Basic REs Special Backslash Sequences Standard escape sequences \t: tab \n: newline Abbreviatory forms \d: digit (i.e., numeral) \s: whitespace ([ \t\n]) \w: alphanumeric ([a-za-z0-9]) \D: non-digit \S: non-whitespace \W: non-alphanumeric \ is a general escape character; e.g., \. is not a wildcard, but matches a period,. If you want to use \ in a string, it has to be escaped: \\

Introduction Formal Background to REs Extensions of Basic REs Anchors (Also: zero-width characters) Anchors don t match strings in the text, instead they match positions in the text. ^: matches beginning of line (or text) $: matches end of line (or text) \b: matches word boundary (i.e., a location with \w on one side but not the other)

Wildcard >>> from nltk_lite.utilities import re_show >>> s = BP has agreed to sell... it s petrochemicals unit for $5.1bn. >>> re_show(..., s) {BP }{has}{ ag}{ree}{d t}{o s}{ell} {it }{s p}{etr}{och}{emi}{cal}{s u}{nit}{ fo}{r $}{5.1}{bn.

Wildcard >>> from nltk_lite.utilities import re_show >>> s = BP has agreed to sell... it s petrochemicals unit for $5.1bn. >>> re_show(..., s) {BP }{has}{ ag}{ree}{d t}{o s}{ell} {it }{s p}{etr}{och}{emi}{cal}{s u}{nit}{ fo}{r $}{5.1}{bn. >>> re_show(.a.., s) BP {has }agreed to sell it s petrochemi{cals} unit for $5.1bn.

Wildcards with Quantifiers >>> re_show( s.*l, s) BP ha{s agreed to sell} it {s petrochemical}s unit for $5.1bn.

Wildcards with Quantifiers >>> re_show( s.*l, s) BP ha{s agreed to sell} it {s petrochemical}s unit for $5.1bn. >>> re_show( B.*P, s) {BP} has agreed to sell it s petrochemicals unit for $5.1bn.

Wildcards with Quantifiers >>> re_show( s.*l, s) BP ha{s agreed to sell} it {s petrochemical}s unit for $5.1bn. >>> re_show( B.*P, s) {BP} has agreed to sell it s petrochemicals unit for $5.1bn. >>> re_show( B.+P, s) BP has agreed to sell it s petrochemicals unit for $5.1bn.

Disjunction >>> re_show( has it, s) BP {has} agreed to sell {it} s petrochemicals un{it} for $5.1bn.

Disjunction >>> re_show( has it, s) BP {has} agreed to sell {it} s petrochemicals un{it} for $5.1bn. >>> re_show( has it, s) BP {has }agreed to sell it s petrochemicals unit for $5.1bn.

Disjunction >>> re_show( has it, s) BP {has} agreed to sell {it} s petrochemicals un{it} for $5.1bn. >>> re_show( has it, s) BP {has }agreed to sell it s petrochemicals unit for $5.1bn. >>> re_show( (e l)+, s) BP has agr{ee}d to s{ell} it s p{e}troch{e}mica{l}s unit for $5.1bn.

Zero Width Characters >>> re_show( l, s) BP has agreed to se{l}{l} it s petrochemica{l}s unit for $5.1bn.

Zero Width Characters >>> re_show( l, s) BP has agreed to se{l}{l} it s petrochemica{l}s unit for $5.1bn. >>> re_show( l$, s) BP has agreed to sel{l} it s petrochemicals unit for $5.1bn.

Zero Width Characters >>> re_show( l, s) BP has agreed to se{l}{l} it s petrochemica{l}s unit for $5.1bn. >>> re_show( l$, s) BP has agreed to sel{l} it s petrochemicals unit for $5.1bn. >>> re_show( i, s) BP has agreed to sell {i}t s petrochem{i}cals un{i}t for $5.1bn.

Zero Width Characters >>> re_show( l, s) BP has agreed to se{l}{l} it s petrochemica{l}s unit for $5.1bn. >>> re_show( l$, s) BP has agreed to sel{l} it s petrochemicals unit for $5.1bn. >>> re_show( i, s) BP has agreed to sell {i}t s petrochem{i}cals un{i}t for $5.1bn. >>> re_show( ^i, s) BP has agreed to sell {i}t s petrochemicals unit for $5.1bn.

Escaping Special Characters >>> re_show(., s) {B}{P}{ }{h}{a}{s}{ }{a}{g}{r}{e}{e}{d}...

Escaping Special Characters >>> re_show(., s) {B}{P}{ }{h}{a}{s}{ }{a}{g}{r}{e}{e}{d}... >>> re_show( \., s) BP has agreed to sell it s petrochemicals unit for $5{.}1bn{.}

Escaping Special Characters >>> re_show(., s) {B}{P}{ }{h}{a}{s}{ }{a}{g}{r}{e}{e}{d}... >>> re_show( \., s) BP has agreed to sell it s petrochemicals unit for $5{.}1bn{.} >>> re_show( $, s) BP has agreed to sell{} it s petrochemicals unit for $5.1bn.{}

Escaping Special Characters >>> re_show(., s) {B}{P}{ }{h}{a}{s}{ }{a}{g}{r}{e}{e}{d}... >>> re_show( \., s) BP has agreed to sell it s petrochemicals unit for $5{.}1bn{.} >>> re_show( $, s) BP has agreed to sell{} it s petrochemicals unit for $5.1bn.{} >>> re_show( \$, s) BP has agreed to sell it s petrochemicals unit for {$}5.1bn.

Metacharacters and Negated Ranges >>> re_show( \w,s) {B}{P} {h}{a}{s} {a}{g}{r}{e}{e}{d}...

Metacharacters and Negated Ranges >>> re_show( \w,s) {B}{P} {h}{a}{s} {a}{g}{r}{e}{e}{d}... >>> re_show( \d,s) BP has agreed to sell it s petrochemicals unit for ${5}.{1}bn.

Metacharacters and Negated Ranges >>> re_show( \w,s) {B}{P} {h}{a}{s} {a}{g}{r}{e}{e}{d}... >>> re_show( \d,s) BP has agreed to sell it s petrochemicals unit for ${5}.{1}bn. >>> re_show( [^a-z\s],s) {B}{P} has agreed to sell it{ }s petrochemicals unit for {$}{5}{.}{1}bn{.}

Metacharacters and Negated Ranges >>> re_show( \w,s) {B}{P} {h}{a}{s} {a}{g}{r}{e}{e}{d}... >>> re_show( \d,s) BP has agreed to sell it s petrochemicals unit for ${5}.{1}bn. >>> re_show( [^a-z\s],s) {B}{P} has agreed to sell it{ }s petrochemicals unit for {$}{5}{.}{1}bn{.} >>> re_show( [^\w],s) BP{ }has{ }agreed{ }to{ }sell{ }it{ }s{ }petrochemicals{ }unit{ }for{ }{$}5{.}1bn{.}

Using, 1 Usually best to compile the RE into a PatternObject; more efficient, and it can be re-used. >>> import re >>> str = do you say hello or hullo? >>> hellore = re.compile( h[eu]llo ) The resulting PatternObject has a number of methods: findall(s): returns a list of all matches of pattern in string s search(s): searches for leftmost occurrence of pattern in string s match(s): tries to match pattern at the beginning of string s

Using, 2 The PatternObject method findall returns a list: >>> hellore.findall(str) [ hello, hullo ]

Using, 2 The PatternObject method findall returns a list: >>> hellore.findall(str) [ hello, hullo ] The PatternObject method search (and match) returns a MatchObject or None. A MatchObject has a variety of methods, but is not a string. >>> m = hellore.search(str) >>> m <_sre.sre_match object at 0x47b138> >>> m.group() # return matched substring (sort of!) hello >>> m.end() # index of end of target 16

Groups Groups in regular expressions are captured using parentheses. >>> import re >>> str = do you say hello or hullo? >>> regrp = re.compile( (d.)(.*)(e..) ) >>> m = regrp.search(str) >>> m <_sre.sre_match object at 0x64390> >>> m.groups() ( do, you say h, ell )

Named Groups Name groups captured using (?P<name>): FROM = re.compile(""" ^From: # Anchor to start of line \s* # maybe some spaces (?P<user>\w+) # user : group of word characters @ (?P<domain> # the domain : \S+) # some non-space characters \s # finally, a space character """,re.verbose)

Named Groups (cont.) from nltk_lite.corpus import twenty_newsgroups for item in twenty_newsgroups.items( misc.forsale ): text = twenty_newsgroups.read(item) m = FROM.search(text) if m: print %s is at %s % \ (m.group( user ), m.group( domain )) kedz is at bigwpi.wpi.edu myoakam is at cis.ohio-state.edu gt1706a is at prism.gatech.edu jvinson is at xsoft.xerox.com hungjenc is at usc.edu thouchin is at cs.umr.edu kssimon is at silver.ucs.indiana.edu

Tokenization with Regular Expressions (1) The method tokenize.regexp() takes a string and a regular expression, and returns the list of substrings that match the RE >>> from nltk_lite import tokenize >>> s = "Hello. Isn t this fun?" >>> pat= r \w+ [^\w\s]+ >>> list(tokenize.regexp(s, pat)) [ Hello,., Isn, " ", t, this, fun,? ] This is a simple tokenizer that may break up things we want to keep as a single token: >>> t = "That poster from the U.S.A. costs $22.50." >>> list(tokenize.regexp(t, pat)) [ That, poster, from, the, U,., S,., A,., costs, $, 22,., 50,. ]

Tokenization with Regular Expressions (2) Add further components to the RE used in the tokenizer: >>> import re >>> pat2 = re.compile(r... \$?\d+(\.\d+)? # currency amounts (eg $22.50)... ([A-Z]\.)+ # abbreviations (eg U.S.A.)... \w+ # sequences of word characters... [^\w\s]+ # punctuation sequences..., re.verbose) >>> list(tokenize.regexp(t, pat2)) [ That, poster, from, the, U.S.A., costs, $22.50,. ]

Reading Jurafsky & Martin, Chap 2 NLTK Lite Tutorial: Regular Expressions available from http: //nltk.sourceforge.net/lite/doc/en/regexps.html