LING115 Lecture Note Session #7: Regular Expressions

Similar documents
Regular expressions. LING78100: Methods in Computational Linguistics I

Lecture 18 Regular Expressions

CSE : Python Programming

Understanding Regular Expressions, Special Characters, and Patterns

Regular Expressions in programming. CSE 307 Principles of Programming Languages Stony Brook University

Regexp. Lecture 26: Regular Expressions

Part 5 Program Analysis Principles and Techniques

Regular Expressions. Steve Renals (based on original notes by Ewan Klein) ICL 12 October Outline Overview of REs REs in Python

RegExpr:Review & Wrapup; Lecture 13b Larry Ruzzo

LECTURE 8. The Standard Library Part 2: re, copy, and itertools

Regular Expressions. Regular expressions are a powerful search-and-replace technique that is widely used in other environments (such as Unix and Perl)

Lecture 15 (05/08, 05/10): Text Mining. Decision, Operations & Information Technologies Robert H. Smith School of Business Spring, 2017

Regular Expressions. Computer Science and Engineering College of Engineering The Ohio State University. Lecture 9

Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)

CS 301. Lecture 05 Applications of Regular Languages. Stephen Checkoway. January 31, 2018

Introduction to regular expressions

Regular Expressions. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Regular Expressions. Pattern and Match objects. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

S206E Lecture 19, 5/24/2016, Python an overview

Chapter 1 Summary. Chapter 2 Summary. end of a string, in which case the string can span multiple lines.

Regular Expressions 1 / 12

Compiler Design. 2. Regular Expressions & Finite State Automata (FSA) Kanat Bolazar January 21, 2010

Lexical Analysis. Lecture 3-4

Regular Expressions. Regular Expression Syntax in Python. Achtung!

CSE 401 Midterm Exam 11/5/10


Lecture 11: Regular Expressions. LING 1330/2330: Introduction to Computational Linguistics Na-Rae Han

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Non-deterministic Finite Automata (NFA)

Digital Humanities. Tutorial Regular Expressions. March 10, 2014

Chapter Seven: Regular Expressions. Formal Language, chapter 7, slide 1

\n is used in a string to indicate the newline character. An expression produces data. The simplest expression

Object-Oriented Software Engineering CS288

Indian Institute of Technology Kharagpur. PERL Part III. Prof. Indranil Sen Gupta Dept. of Computer Science & Engg. I.I.T.

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

More Examples. Lex/Flex/JLex

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

System & Network Engineering. Regular Expressions ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn 22 september 2008

Languages and Strings. Chapter 2

Languages and Compilers

CSE 105 THEORY OF COMPUTATION

Lecture 4: Syntax Specification

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation

Optimizing Finite Automata

Regular Expressions. Michael Wrzaczek Dept of Biosciences, Plant Biology Viikki Plant Science Centre (ViPS) University of Helsinki, Finland

Regular Expressions. Perl PCRE POSIX.NET Python Java

Getting Started Values, Expressions, and Statements CS GMU

Dr. Sarah Abraham University of Texas at Austin Computer Science Department. Regular Expressions. Elements of Graphics CS324e Spring 2017

Regex Guide. Complete Revolution In programming For Text Detection

Lab 20: Regular Expressions in Python. Ling 1330/2330: Computational Linguistics Na-Rae Han

STATS Data analysis using Python. Lecture 0: Introduction and Administrivia

CMPSCI 250: Introduction to Computation. Lecture #7: Quantifiers and Languages 6 February 2012

a b c d a b c d e 5 e 7

Configuring the RADIUS Listener LEG

CSCE 120: Learning To Code

LING/C SC/PSYC 438/538. Lecture 20 Sandiway Fong

Table ofcontents. Preface. 1: Introduction to Regular Expressions xv

This page covers the very basics of understanding, creating and using regular expressions ('regexes') in Perl.

Regular Expressions. Pattern and Match objects. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Regular Languages and Regular Expressions

Week - 04 Lecture - 01 Merge Sort. (Refer Slide Time: 00:02)

CMSC 132: Object-Oriented Programming II

An Introduction to Regular Expressions in Python

ECE251 Midterm practice questions, Fall 2010

R in Linguistic Analysis. Week 2 Wassink Autumn 2012

CMSC 330: Organization of Programming Languages

CSC401 Tutorial 3 - Regular Expressions

Regular Expressions. Upsorn Praphamontripong. CS 1111 Introduction to Programming Spring [Ref:

This example uses the or-operator ' '. Strings which are enclosed in angle brackets (like <N>, <sg>, and <pl> in the example) are multi-character symb

Effective Programming Practices for Economists. 17. Regular Expressions

Regular Expressions. with a brief intro to FSM Systems Skills in C and Unix

Architecture of Compilers, Interpreters. CMSC 330: Organization of Programming Languages. Front End Scanner and Parser. Implementing the Front End

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

1 Finite Representations of Languages

Introduction to Scientific Computing

Regular Expressions & Automata

CSE 3302 Programming Languages Lecture 2: Syntax

DATA STRUCTURE AND ALGORITHM USING PYTHON

Formal Languages and Compilers Lecture VI: Lexical Analysis

CMSC 330: Organization of Programming Languages

Project 2: How Parentheses and the Order of Operations Impose Structure on Expressions

Regular Expressions. Chapter 6

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

A Functional Evaluation Model

B The SLLGEN Parsing System

Server-side Web Development (I3302) Semester: 1 Academic Year: 2017/2018 Credits: 4 (50 hours) Dr Antoun Yaacoub

SOFTWARE ENGINEERING DESIGN I

CMPSCI 250: Introduction to Computation. Lecture #28: Regular Expressions and Languages David Mix Barrington 2 April 2014

Scripting Languages Course 1. Diana Trandabăț

Perl Regular Expressions. Perl Patterns. Character Class Shortcuts. Examples of Perl Patterns

Chapter Seven: Regular Expressions

Python source materials

1 Lexical Considerations

Lexical Considerations

Essentials for Scientific Computing: Stream editing with sed and awk

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

CSC 467 Lecture 3: Regular Expressions

JME Language Reference Manual

$ /path/to/python /path/to/soardoc/src/soardoc.py

Transcription:

LING115 Lecture Note Session #7: Regular Expressions 1. Introduction We need to refer to a set of strings for various reasons: to ignore case-distinction, to refer to a set of files that share a common substring, etc. A regular expression is a formal notation that refers to a set of strings. For example, the regular expression [Ll]inguistics is equivalent to {Linguistics, linguistics}. The idea is that it is much more convenient to use regular expressions than to enumerate all the strings, if possible. Formal systems like regular expressions depend on what the users of the system agree on or its conventions: what the individual symbols mean, what the operators that attach to the symbols mean, etc. There are a small number of different kinds of regular expressions that follow different conventions: POSIX-basic, POSIX-extended, Perl-derivative, etc. We will learn the Perl-derivative regular expressions in this class. 2. Basics In its simplest form, a regular expression is a single character like a, which basically refers to the string a. There are three basic ways a regular expression can become more complex from this: concatenation, disjunction, and Kleene-star. 2.1. Concatenation We can concatenate regular expressions to derive a regular expression that refers to a longer string. For example, concatenating two regular expressions a and b will result in ab, which now refers to the string ab. We can, of course, concatenate ab and c to get abc, which refers to abc. 2.2. Disjunction We can combine two or more regular expressions by disjunction to refer to a larger set of strings. Each regular expression refers to a set of strings. Disjunction of two or more regular expressions refers to the union of sets of strings, where each set is a set of strings specified by each regular expression. There are two ways to expression disjunction: and []. Two regular expressions conjoined by refer to the union of two sets each represented by a regular expression. For example, regular expression means either regular or expression. On the other hand, [] is used to refer to the any one of the characters inside the brackets. For example, [regular] is equal to {r, e, g, u, l, a}. 2.3. Kleene-star 1

It helps to specify a repetition of characters or strings. In regular expression, this is done by the Kleene star: zero or more instances of what immediately comes before the star. For example, regular* is equal to {regula, regular, regularr, regularrr,...}. 2.4. Scope and * are operators. Regular expression becomes more powerful if we control the scope of these operators. We use the parentheses to identify the scope. For example, the parentheses in (regular)* means that the Kleene star applies to regular, not just r. So the regular expression means zero or more repetitions of regular. Similarly, reg(ular ister) means that the disjunction operator applies to ular and ister, rather than regular and ister. So the regular expression means either regular or register. 3. Special characters There are special characters which allow us to use regular expression more easily. Below is a list of useful characters. Character Meaning Example. Any single character.* = zero or more repetitions of any character + One or more instances of regular+ = {regular, regu larr, regularrr,...}? Zero or one instance of the linguistics? = {linguistic,linguistics} - Range of characters when used with characters in brackets [a-z] = {a,b,c,d,...,z} {m,n} from m through n instances of reguar{2,4} = {regularr, regularrr, regularrr} {m,} m or more instances of regualar{2,} = { regularr, regularrr,...} {,n} up to n instances of regulaar{,3} = {regula, regular, regularr, regularrr} {m} exactly m instances of regualar{2} = regularr ^ Beginning of a line ^ab = ab where a begins a line ^ Any character except the [^abc] = any character but {a,b,c} characters listed in brackets. This meaning takes effect only when ^ is the first character in brackets. $ End of a line ab$ = ab where b ends a line \d Any digit \d{2} = {00,01,...,99} \D Any non-digit character \D = [^0-9] \w Any alphanumeric character \w = [a-za-z0-9_] 2

or underscore \W Any character but \w \W = [^a-za-z0-9_] In order to refer to these special characters literally in a regular expression, we add a backslash before the characters. For example, \. would literally refer to the dot character, rather than any single character. 4. Python re module Python provides a module named re with various functions for manipulating regular expressions: searching text for that match a regular expression, substituting a string for a string that matches the regular expression, etc. Keep in mind that as these functions are defined in the re module, we must first import re before we use them. 4.1. Matching There are several functions for matching strings against the regular expression we specify. Keep in mind that in all of the functions here, we specify the regular expression as strings. We will frequently use the following three functions: re.match(regex,string) Match regular expression at the beginning of the string. If the two match, the function returns a regular-expression object. Otherwise, it returns None. >>> re.match( ab+, abc ) >>> re.match( bc, abc ) re.search(regex,string) Scan through the string and look for a substring that matches the regular expression. If the two match, the function returns a regular-expression object. Otherwise, it returns None. >>> re.search( bc, abc ) re.findall(regex,string) Find all substrings of the string that match the regular expression. The function returns a list of all matching substrings. >>> re.findall( bc, abcdcbc ) 4.2. Substitution We can rewrite parts of a string that match a regular expression as some other string. re.sub(regex,replacement,string) 3

Replace parts of string that match regex by replacement. >>> re.sub( ab, xy, sabcdabdw ) 4.3. Regular expression objects and grouping Recall that re.match() and re.search() return a regular expression object as the result of matching. Similar to other data-types we have seen, the re module provides several methods specific to the regular expression objects. Among those methods, there are a few that are related to grouping the part of the string that matches the regular expression. group() Return the string that matches the regular expression. Again, the output of re.match() or re.search() is a regular expression object, which doesn t tell us much by itself. For easier interpretation, it can be quite useful to read the result of matching as a string. >>> vowel.group() start() Return the staring index of the part of the string that matches the regular expression. >>> vowel.start() end() Return the end index of the part of the string that matches the regular expression. >>> vowel.end() span() Return a tuple consisting of the starting index and the end index of the part of the string that matches the regular expression. >>> vowel.span() 4.4. Greedy vs. Non-greedy Regular expression functions in Python are greedy in the sense that they look for longest possible substring that matches the regular expression. Consider the following, for example: >>> line= <a href= http://www.sjsu.edu >Welcome to SJSU</a> >>> re.sub( <.+>,,line) 4

Running the re.sub function will completely delete the line. If we wanted to only remove the first html tag, for example, we would need to make re.sub() less greedy and stop scanning the line as soon as it finds the first >. In Python, we use? to make the regular expression functions non-greedy. For example, >>> re.sub( <.+?>,,line> 5