Describing Languages with Regular Expressions


INF4820: Algorithms for AI and NLP
Jonathon Read
University of Oslo, Department of Informatics
25 September 2012

Outlook

How can we write programs that handle sentences?

- Describing languages with regular expressions
- Representing and implementing regular expressions using finite state automata
- Estimating the probability of unobserved strings of words with language models
- Sequence-labelling part-of-speech using Hidden Markov models

Productivity of languages

Even simple formal languages are infinite:

  x = 1 + 2
  x = 1 + 2 + 3
  x = 1 + 2 + 3 + ...

With natural languages there are so many more choices:

  The fox.
  The hungry fox.
  The hungry fox ate.
  The hungry fox ate the chicken.
  The hungry fox quickly ate the chicken.
  The hungry brown fox quickly ate the delicious roast chicken and washed it down with a pint of beer.

Characterising language

Simplifying assumption: a language is a set of utterances.

- Utterances inside this set are well-formed.
- Utterances not in this set are ill-formed.

How do we represent sets of utterances, if the set is infinite?

Regular expressions

Regular expressions (RE, RegEx, RegExp):

- Algebraic notation for characterising sets of strings
- They consist of constants and operators

Example:

  /[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*/

Note: an implementation is supplied in many programming languages and text editors; for instance, try C-M-s in Emacs, or grep on the command line.

Matching

Sequences of character constants specify how to match strings. Further expressiveness is added by metacharacters, including:

  .   any single character (except newlines)
  ^   the start of a line
  $   the end of a line

Example:

  /^Chapter .$/   { Chapter 1, Chapter 2, ..., Chapter & }

Note: when an operator or metacharacter, i.e. one of {}[]()^$.|*+?\ is to be matched literally, it must be escaped using a backslash, e.g. match a full stop with /\./
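The anchors, the wildcard and escaping can be tried out with a short Python sketch. It assumes Python's `re` module as the implementation; the candidate strings and variable names are made up for illustration.

```python
import re

# "Chapter " followed by exactly one character, anchored at both ends.
chapter = re.compile(r"^Chapter .$")

candidates = ["Chapter 1", "Chapter &", "Chapter 10", "My Chapter 1"]
matched = [s for s in candidates if chapter.search(s)]

# An unescaped "." matches any character; "\." matches only a full stop.
unescaped_hit = re.search(r"^3.14$", "3x14") is not None
escaped_hit = re.search(r"^3\.14$", "3x14") is not None

# re.escape builds a pattern that matches its argument literally.
literal = re.escape("a.b")
```

Here `matched` keeps only the two strings with a single character after the space, and the escaped pattern rejects "3x14" while the unescaped one accepts it.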

Disjunction

The operator | expresses a logical or.

Example:

  /^a (fox|wolf)$/   { a fox, a wolf }

Note: the | operator has low precedence; the brackets ensure that the expression does not specify the set { a fox, wolf }.
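A small Python sketch of the precedence point, assuming the `re` module; the test strings are illustrative. Without brackets the two alternatives are `^a fox` and `wolf$`, so each anchor applies to only one branch.

```python
import re

grouped = re.compile(r"^a (fox|wolf)$")
ungrouped = re.compile(r"^a fox|wolf$")   # alternatives: "^a fox" and "wolf$"

tests = ["a fox", "a wolf", "wolf", "a fox terrier"]

grouped_hits = [s for s in tests if grouped.search(s)]
ungrouped_hits = [s for s in tests if ungrouped.search(s)]
```

With the brackets only "a fox" and "a wolf" are accepted. Without them, all four test strings contain a match, because "wolf" alone satisfies the second branch and "a fox terrier" satisfies the first.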

Character classes

Character classes can also be used to specify disjunction. They are expressed using square brackets, [ and ]:

Examples:

  /^[Ff]ox$/          { Fox, fox }
  /^f[aio]x$/         { fax, fix, fox }
  /^[a-z]$/           { a, b, c, ..., z }
  /^Chapter [1-9]$/   { Chapter 1, Chapter 2, ..., Chapter 9 }

Used inside a character class, ^ negates the class:

Examples:

  /[^A-Za-z]/   matches any non-alphabetic character
  /[^ ]/        matches anything that is not a space

Many implementations provide named character classes:

Examples:

  /\d/   /[[:digit:]]/   /[0-9]/
  /\w/   /[[:alnum:]]/   /[a-zA-Z0-9_]/
  /\D/                   /[^0-9]/
  /[[:punct:]]/          matches punctuation characters

(\w also matches the underscore, which [[:alnum:]] does not.)
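The classes above can be explored with a Python sketch. One caveat: Python's `re` module does not support the POSIX bracket names such as `[[:digit:]]`, so this sketch uses only the backslash shorthands and explicit ranges. The sample strings are made up for illustration.

```python
import re

# Negated classes: everything that is NOT a letter / NOT a space.
non_letters = re.findall(r"[^A-Za-z]", "Fox 42!")
words = re.findall(r"[^ ]+", "a fox ate")

# Shorthand classes and their complements.
digits = re.findall(r"\d", "Chapter 12")
non_digits = re.findall(r"\D+", "Chapter 12")

# \w accepts the underscore; a plain alphanumeric range does not.
word_ok = re.fullmatch(r"\w+", "snake_case") is not None
alnum_ok = re.fullmatch(r"[a-zA-Z0-9]+", "snake_case") is not None
```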

Quantification

Quantification can be specified in a number of ways:

  ?       zero or one of the preceding element
  *       zero or more of the preceding element
  +       one or more of the preceding element
  {n}     exactly n of the preceding element
  {n,m}   from n to m of the preceding element
  {n,}    n or more of the preceding element
  {,m}    at most m of the preceding element (not supported by every implementation)

Example:

  /^Chapter [1-9]\d*$/   { Chapter 1, Chapter 2, ..., Chapter 99999, ... }
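One way to see what each quantifier admits is to test it against runs of a single character. The sketch below assumes Python's `re` module (which accepts the `{,m}` form); the helper `accepted_lengths` is hypothetical and exists only for this illustration.

```python
import re

def accepted_lengths(quantifier, max_len=5):
    """Return the run lengths n (0..max_len) for which 'a' * n fully
    matches the pattern 'a' followed by the given quantifier."""
    pattern = re.compile("a" + quantifier)
    return [n for n in range(max_len + 1) if pattern.fullmatch("a" * n)]

optional = accepted_lengths("?")
star = accepted_lengths("*")
plus = accepted_lengths("+")
exactly_three = accepted_lengths("{3}")
two_to_four = accepted_lengths("{2,4}")
three_or_more = accepted_lengths("{3,}")
at_most_two = accepted_lengths("{,2}")

# The chapter pattern: a non-zero first digit, then any number of digits.
chapter = re.compile(r"^Chapter [1-9]\d*$")
titles = ["Chapter 1", "Chapter 99999", "Chapter 0", "Chapter 07", "Chapter"]
valid_titles = [t for t in titles if chapter.search(t)]
```

Note that `{,2}` accepts runs of length 0, 1 and 2, so the upper bound is inclusive.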

Lazy quantification

How to match quoted items?

  "Yes," he said, "but why?"

Normal quantification operators are greedy: they will match the longest possible sequence in the input:

  /".+"/    { "Yes," he said, "but why?" }

This can be overridden with ?, which becomes the lazy operator when used next to a quantification operator:

  /".+?"/   { "Yes,", "but why?" }
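The difference between greedy and lazy matching can be reproduced with a brief Python sketch, assuming the `re` module and assuming the sample sentence is written with plain double quotes.

```python
import re

text = '"Yes," he said, "but why?"'

# Greedy: ".+" runs to the end of the line, then backs up to the last quote.
greedy = re.findall(r'".+"', text)

# Lazy: ".+?" stops at the first closing quote it can reach.
lazy = re.findall(r'".+?"', text)
```

The greedy pattern returns the whole line as one match, while the lazy pattern returns the two quoted items separately.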

Capturing groups

Brackets are used to specify matching groups, which (a) enforce precedence and (b) indicate groups for later reference, using an escaped number (\1 to \9).

Examples:

  <[bi]>.+?</[bi]>    { <b> + ... + </b>, <i> + ... + </i>, <b> + ... + </i>, <i> + ... + </b> }
  <([bi])>.+?</\1>    { <b> + ... + </b>, <i> + ... + </i> }
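A minimal Python sketch of the back-reference, assuming the `re` module; the markup string is invented for the example. The first pattern accepts any opening and closing tag, while the second requires the closing tag to repeat whatever group 1 captured.

```python
import re

markup = "<b>bold</b> <i>italic</i> <b>broken</i>"

loose = re.compile(r"<[bi]>.+?</[bi]>")    # tags need not agree
strict = re.compile(r"<([bi])>.+?</\1>")   # \1 repeats the captured tag name

loose_hits = [m.group(0) for m in loose.finditer(markup)]
strict_hits = [m.group(0) for m in strict.finditer(markup)]

# The captured group is also available after a match.
tag_names = [m.group(1) for m in strict.finditer(markup)]
```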

Putting it all together

What does this match?

  /[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*/

  /[A-Z][a-z]* /   a word with an initial capital, followed by a space
  /\d+[A-Z]?/      one or more digits, optionally followed by a capital letter
  /, /             a comma and a space
  /\d{4} /         four digits, followed by a space
  /[A-Z][a-z]*/    a word with an initial capital

  Gaustadalléen 23B, 0373 Oslo
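The address pattern can be checked piece by piece with a Python sketch. Several things here are assumptions of the sketch rather than part of the slides: the group names, the use of `[A-Z]?` after the house number (following the prose description), and the ASCII-only respelling of the street name. In Python's `re`, the range `[a-z]` does not include "é", so the accented spelling is not matched by this exact pattern.

```python
import re

# Named groups are added only to label the pieces; the pattern is otherwise
# the one discussed above.
ADDRESS = re.compile(
    r"(?P<street>[A-Z][a-z]*) "   # capitalised word, then a space
    r"(?P<number>\d+[A-Z]?), "    # digits, optional capital letter, comma, space
    r"(?P<postcode>\d{4}) "       # four digits, then a space
    r"(?P<city>[A-Z][a-z]*)"      # capitalised word
)

# ASCII-only respelling (an assumption of this sketch).
match = ADDRESS.search("Gaustadalleen 23B, 0373 Oslo")
parts = match.groupdict() if match else {}

# With the accented letter the street name falls outside [A-Z][a-z]*.
accented = ADDRESS.search("Gaustadalléen 23B, 0373 Oslo")
```

A wider class for the street name would be needed to accept accented letters; which class is appropriate depends on the implementation.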

Some exercises

Write regular expressions for the following:

1. all alphabetic strings;
2. all lower case alphabetic strings ending in a b;
3. all strings of two repeated words;
4. all strings from the alphabet a,b such that each a is immediately preceded by and immediately followed by a b;
5. capturing the first word of an English sentence (making sure to deal with punctuation).

Some exercises

1. all alphabetic strings;
   /[a-zA-Z]+/
2. all lower case alphabetic strings ending in a b;
   /[a-z]*b/
3. all strings of two repeated words, separated by a space;
   /([a-zA-Z]+) \1/
4. all strings from the alphabet a,b such that each a is immediately preceded by and immediately followed by a b;
   /b+(ab+)+/
5. capturing the first word of an English sentence (making sure to deal with punctuation).
   /^[^a-zA-Z]*([a-zA-Z]+)/
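The five answers can be exercised with a Python sketch, assuming the `re` module. The variable names and test strings are illustrative. `fullmatch` is used for answers 1 to 4 so that the whole string must belong to the set, which the unanchored patterns do not enforce on their own.

```python
import re

alphabetic = re.compile(r"[a-zA-Z]+")
ends_in_b = re.compile(r"[a-z]*b")
repeated = re.compile(r"([a-zA-Z]+) \1")
b_around_a = re.compile(r"b+(ab+)+")
first_word = re.compile(r"^[^a-zA-Z]*([a-zA-Z]+)")

def accepts(pattern, s):
    """True if the whole string s is matched by the pattern."""
    return pattern.fullmatch(s) is not None

results = {
    "Fox": accepts(alphabetic, "Fox"),
    "fox1": accepts(alphabetic, "fox1"),
    "crab": accepts(ends_in_b, "crab"),
    "crabs": accepts(ends_in_b, "crabs"),
    "the the": accepts(repeated, "the the"),
    "the fox": accepts(repeated, "the fox"),
    "babab": accepts(b_around_a, "babab"),
    "baab": accepts(b_around_a, "baab"),
    "bbb": accepts(b_around_a, "bbb"),   # answer 4 requires at least one a
}

# Answer 5: skip leading punctuation, then capture the first run of letters.
word = first_word.search('"Hello," she said.').group(1)
```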

Applications in AI and NLP

Weizenbaum 1966

  User:  Men are all alike.
  Eliza: In what way?
  User:  They're always bugging us about something or other.
  Eliza: Can you think of a specific example?
  User:  Well, my boyfriend made me come here.
  Eliza: Your boyfriend made you come here?
  User:  He says I am depressed much of the time.
  Eliza: I am sorry to hear you are depressed.

Can be reproduced with a cascade of regular expression substitutions, e.g. using sed:

  s/.* all.*/in what way/
  s/.* always.*/can you think of a specific example/
  s/.* I am (depressed|sad).*/i am sorry to hear you are \1/
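The substitution cascade can be sketched in Python with `re.subn`. This is an illustration under stated assumptions, not a port of sed: the function `respond` and the list `RULES` are hypothetical names, and the sketch stops at the first rule that fires rather than running every substitution in sequence.

```python
import re

# (pattern, replacement) pairs taken from the three substitutions above.
RULES = [
    (r".* all.*", "in what way"),
    (r".* always.*", "can you think of a specific example"),
    (r".* I am (depressed|sad).*", r"i am sorry to hear you are \1"),
]

def respond(utterance):
    """Apply the first rule whose pattern matches; otherwise echo the input."""
    for pattern, replacement in RULES:
        reply, count = re.subn(pattern, replacement, utterance)
        if count:
            return reply
    return utterance

replies = [
    respond("Men are all alike."),
    respond("They're always bugging us about something or other."),
    respond("He says I am depressed much of the time."),
]
```

The third rule shows the capturing group at work: whichever of "depressed" or "sad" was matched is copied into the reply through `\1`.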

Applications in AI and NLP

Lexical morphology:

  s/mouse/mice/
  s/(bush|fox|house)/\1es/
  s/(.)$/\1s/

Concise expressions of genetic sequences:

- finding codons, e.g. s/CG.|AG[AG]/arginine/
- specifying patterns, e.g. /CG.|AG[AG].{,100}GG./
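Both applications can be sketched in Python. The rule set below is adapted rather than copied: the anchors, the first-match-wins ordering, and leaving "house" to the default rule are assumptions of this sketch, as is reading the sequence in non-overlapping triplets from position 0. The names `pluralise` and `arginine_codons` are hypothetical.

```python
import re

# Ordered plural rules: irregular form, then -es nouns, then the default -s.
PLURAL_RULES = [
    (r"^mouse$", "mice"),
    (r"^(bush|fox)$", r"\1es"),
    (r"^(.*)$", r"\1s"),
]

def pluralise(noun):
    """Rewrite the noun with the first rule whose pattern matches."""
    for pattern, replacement in PLURAL_RULES:
        if re.search(pattern, noun):
            return re.sub(pattern, replacement, noun, count=1)
    return noun

plurals = [pluralise(n) for n in ["mouse", "fox", "bush", "house", "cat"]]

# Arginine codons: CG followed by any base, or AG followed by A or G.
ARGININE = re.compile(r"CG.|AG[AG]")

def arginine_codons(sequence):
    """Split the sequence into triplets and keep those that code for arginine."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    return [c for c in codons if ARGININE.fullmatch(c)]

found = arginine_codons("AUGCGUAGAGGC")
```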

Summary

Regular expressions:

- A finite way of specifying infinite sets
- Character constants, metacharacters and operators
- The fundamental operations are:
  - Matching characters, wildcards (.) and anchors (^ and $)
  - Disjunction (| and [ ])
  - Quantification (?, *, + and {n,m})
- Precedence can be enforced with brackets (( and ))
- More complex operations include capturing groups

Next week:

- Finite state automata
- Searching state spaces