Regular Expressions. Computer Science and Engineering College of Engineering The Ohio State University. Lecture 9

Similar documents
Compiler Design. 2. Regular Expressions & Finite State Automata (FSA) Kanat Bolazar January 21, 2010

Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)

PYTHON- AN INNOVATION

Dr. Sarah Abraham University of Texas at Austin Computer Science Department. Regular Expressions. Elements of Graphics CS324e Spring 2017

Ruby: Introduction, Basics

Ruby: Introduction, Basics

CS 301. Lecture 05 Applications of Regular Languages. Stephen Checkoway. January 31, 2018

Regular Expressions. Michael Wrzaczek Dept of Biosciences, Plant Biology Viikki Plant Science Centre (ViPS) University of Helsinki, Finland

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens


CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

CS Unix Tools & Scripting

X Language Definition

DVA337 HT17 - LECTURE 4. Languages and regular expressions

CSE 3302 Programming Languages Lecture 2: Syntax

Structure of Programming Languages Lecture 3

Behaviour Diagrams UML

正则表达式 Frank from

Learning Ruby. Regular Expressions. Get at practice page by logging on to csilm.usu.edu and selecting. PROGRAMMING LANGUAGES Regular Expressions

Describing Languages with Regular Expressions

PowerGREP. Manual. Version October 2005

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Regular Expressions. Steve Renals (based on original notes by Ewan Klein) ICL 12 October Outline Overview of REs REs in Python

Paolo Santinelli Sistemi e Reti. Regular expressions. Regular expressions aim to facilitate the solution of text manipulation problems

Ruby: Introduction, Basics

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

More Details about Regular Expressions

Regular Expressions in programming. CSE 307 Principles of Programming Languages Stony Brook University

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Regular Expressions. using REs to find patterns. implementing REs using finite state automata. Sunday, 4 December 11

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Regular Expressions. Regular expressions are a powerful search-and-replace technique that is widely used in other environments (such as Unix and Perl)

Regular Expressions.

Ruby Regular Expressions AND FINITE AUTOMATA

Lexical Analysis. Lecture 3-4

Regular Expressions. Perl PCRE POSIX.NET Python Java

Variables, Constants, and Data Types

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

CSCI 2132 Software Development. Lecture 7: Wildcards and Regular Expressions

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

7/8/10 KEY CONCEPTS. Problem COMP 10 EXPLORING COMPUTER SCIENCE. Algorithm. Lecture 2 Variables, Types, and Programs. Program PROBLEM SOLVING

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

CS52 - Assignment 10

Dr. D.M. Akbar Hussain

Regular expressions. LING78100: Methods in Computational Linguistics I

CS/ECE 374 Fall Homework 1. Due Tuesday, September 6, 2016 at 8pm

Homework 1 Due Tuesday, January 30, 2018 at 8pm

Strings, characters and character literals

Regular Expressions. Regular Expression Syntax in Python. Achtung!

CSE 154 LECTURE 11: REGULAR EXPRESSIONS

Here's an example of how the method works on the string "My text" with a start value of 3 and a length value of 2:

Overview. - General Data Types - Categories of Words. - Define Before Use. - The Three S s. - End of Statement - My First Program

Computer Organization

CSCI312 Principles of Programming Languages!

Outline CS4120/4121. Compilation in a Nutshell 1. Administration. Introduction to Compilers Andrew Myers. HW1 out later today due next Monday.

1 Lexical Considerations

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Parsing and Pattern Recognition

Compiler phases. Non-tokens

Non-deterministic Finite Automata (NFA)

Lecture 4: Syntax Specification

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

STREAM EDITOR - REGULAR EXPRESSIONS

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Lexical and Syntax Analysis

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Regular Languages. MACM 300 Formal Languages and Automata. Formal Languages: Recap. Regular Languages

IC Language Specification

Formal Languages and Compilers Lecture VI: Lexical Analysis

JFlex Regular Expressions

Languages and Compilers

Lexical Scanning COMP360

Lexical Considerations

CS103L FALL 2017 UNIT 1: TYPES, VARIABLES,EXPRESSIONS,C++ BASICS

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

Introduction to Regular Expressions Version 1.3. Tom Sgouros

LING115 Lecture Note Session #7: Regular Expressions

CSCI 2010 Principles of Computer Science. Data and Expressions 08/09/2013 CSCI

Version November 2017

Regex Guide. Complete Revolution In programming For Text Detection

2.1. Chapter 2: Parts of a C++ Program. Parts of a C++ Program. Introduction to C++ Parts of a C++ Program

CS 2112 Lab: Regular Expressions

CSE : Python Programming

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Chapter 1 Summary. Chapter 2 Summary. end of a string, in which case the string can span multiple lines.

CSE 401 Midterm Exam 11/5/10

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Full file at

Lecture 18 Regular Expressions

Regex, Sed, Awk. Arindam Fadikar. December 12, 2017

This page covers the very basics of understanding, creating and using regular expressions ('regexes') in Perl.

CS112 Lecture: Primitive Types, Operators, Strings

Lexical Considerations

=~ determines to which variable the regex is applied. In its absence, $_ is used.

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

Compiling Regular Expressions COMP360

EDAN65: Compilers, Lecture 02 Regular expressions and scanning. Görel Hedin Revised:

FSASIM: A Simulator for Finite-State Automata

DECLARATIONS. Character Set, Keywords, Identifiers, Constants, Variables. Designed by Parul Khurana, LIECA.

Transcription:

Regular Expressions Computer Science and Engineering College of Engineering The Ohio State University Lecture 9

Language Definition: a set of strings Examples Activity: For each above, find (the cardinality of the set)

Programming Languages Q: Are C, Java, Ruby, Python, languages in this formal sense?

Programming Languages Q: Are C, Java, Ruby, Python, languages in this formal sense? A: Yes! is the set of well-formed Ruby programs What the interpreter (compiler) accepts The syntax of the language But what does one such string mean? The semantics of the language Not part of formal definition of "language" But necessary to know to claim "I know Ruby"

Regular Expression (RE) A formal mechanism for defining a language Precise, unambiguous, well-defined In math, a clear distinction between: Characters in strings (the "alphabet") Meta characters used in writing a RE In computer applications, there isn't Is '*' a Kleene star or an asterisk? (a b)*a(a b)(a b)(a b)

Literals A literal represents a character from the alphabet Some are easy: f, i, s, h, Whitespace is hard (invisible!) \t is a tab (ascii 0x09) \n is a newline (ascii 0x0A) \r is a carriage return (ascii 0x0D) So the character '\' needs to be escaped! \\ is a \ (ascii 0x5c)

Basic Operators ( ) for grouping, for choice Examples cat dog fish (h H)ello R(uby ails) (G g)r(a e)y These operators are meta-characters too To represent the literal: \( \) \ \(61(3 4)\) Activity: For each RE above, write out the corresponding language explicitly (ie, as a set of strings)

Character Class Set of possible characters (0 1 2 3 4 5 6 7 8 9) is annoying! Syntax: [ ] Explicit list as [0123456789] Range as [0-9] Negate with ^ at the beginning [^A-Z] a character that is not a capital letter Activity: Write the language defined by Gr[ae]y 0[xX][0-9a-fA-F] [Qq][^u]

Character Class Shorthands Common \d for digit, ie [0-9] \s for whitespace, ie [ \t\r\n] \w for word character, ie [0-9a-zA-Z_] And negations too \D, \S, \W (ie [^\d], [^\s], [^\w]) Warning: [^\d\s] [\D\S] POSIX standard (& Ruby) includes [[:alpha:]] alphabetic character [[:lower:]] lowercase alphabetic character [[:digit:]] decimal digit (unicode! Eg ١) [[:xdigit:]] hexadecimal digit [[:space:]] whitespace including newlines

Wildcards A. matches any character (almost) Includes space, tab, punctuation, etc! But does not include newline So add. to list of meta-characters Use \. for a literal period Examples Gr.y buckeye\.\d Problem: What is RE for OSU email address for everyone named Smith? Answer is not: smith\.\d@osu\.edu

Repetition Applies to preceding character or ( ) group? means 0 or 1 time * means 0 or more times (unbounded) + means 1 or more times (unbounded) {k} means exactly k times {a,b} means k times, for a k b More meta-characters to escape! \? \* \+ \{ \}

Examples colou?r smith\.[1-9]\d*@osu\.edu 0[xX](0 [1-9a-fA-F][0-9a-fA-F]*).*\.jpe?g

Your Turn (Language consisting of) strings that: Contain only letters, numbers, and _ Start with a letter Do not contain 2 consecutive _'s Do not end with _ Exemplars and counter-exemplars: EOF, 4Temp, Test_Case3, _class, a4_sap_x, S T_2 Write the corresponding RE

Your Turn (Language consisting of) strings that: Contain only letters, numbers, and _ Start with a letter Do not contain 2 consecutive _'s Do not end with _ Exemplars and counter-exemplars: EOF, 4Temp, Test_Case3, _class, a4_sap_x, S T_2 Write the corresponding RE

Your Turn (Language consisting of) strings that: Contain only letters, numbers, and _ Start with a letter Do not contain 2 consecutive _'s Do not end with _ Exemplars and counter-exemplars: EOF, 4Temp, Test_Case3, _class, a4_sap_x, S T_2 Write the corresponding RE [a-za-z](_[a-za-z0-9] [a-za-z0-9])*

Finite State Automota (FSA) An FSA is an "accepting rule" Finite set of states Transition function (relation) between states based on next character in string DFA vs NFA Start state (s 0 ) Set of accepting states An FSA "accepts" a string if you can start in s 0 and end up in an accepting state, consuming 1 character per step

Example What language is defined by this FSA? 1 1 0 S 0 S 1 0

Example What language is defined by this FSA? A. Binary strings (0's and 1's) with an even number of 0's 1 1 0 S 0 S 1 0

Your Turn (Language consisting of) strings that: Contain only letters, numbers, and _ Start with a letter Do not contain 2 consecutive _'s Do not end with _ Exemplars and counter-exemplars: EOF, 4Temp, Test_Case3, _class, a4_sap_x, S T_2 Write the corresponding FSA

Solution a-za-z0-9 a-za-z S 0 S 1 a-za-z0-9 _ S 2

Fundamental Results Expressive power of RE is the same as FSA Expressive power of RE is limited Write a RE for "strings of balanced parens" ()(()()), ()(), (((()))), (((, ())((), Can not be done! (impossibility result) Take CSE 3321

REs in Practice REs often used to find a "match" A substring s within a longer string such that s is in the language defined by the RE (CSE cse)?3901 Possible uses: Report matching substrings and locations Replace match with something else Practical aspects of using REs this way Anchors Greedy vs lazy matching

Anchors Used to specify where matching string should be with respect to a line of text Newlines are natural breaking points ^ anchors to the beginning of a line $ anchors to the end of a line Ruby: \A \z for beginning/end of string Examples ^Hello World$ \A[Tt]he ^[^\d].\.jpe?g end\.\z

Greedy vs Lazy Repetition (+ and *) allows multiple matches to begin at same place Example: <.*> <h1>title</h1> <h1>title</h1> The match selected depends on whether the repetition matching is greedy, ie matches as much as possible lazy, ie matches as little as possible Default is typically greedy For lazy matching, use *? or +?

Regular Expressions in Ruby Instance of a class (Regexp) pattern = Regexp.new('^Rub.') But literal notation is common: /pattern/ /[aeiou]*/ %r{hello+} #no need to escape / Match operator =~ (negation is!~) Operands: String and Regexp (in either order) Returns index of first match (or nil if not present) 'hello world' =~ /o/ #=> 4 /or/ =~ 'hello' #=> nil Options post-pended: /pattern/options i ignore case x ignore whitespace & comments ("free spacing")

Strings and Regular Expressions Find all matches as an array a.scan /[[:alpha:]]/ Delimeter for spliting string into array a.split /[aeiou]/ Substitution: sub and gsub (+/-!) Replace first match vs all ("globally") a = "the quick brown fox" a.sub /[aeiou]/, '@' #=> "th@ quick brown fox" a.gsub /[aeiou]/, '@' #=> "th@ q@@ck br@wn f@x"

Your Turn (Regular Expressions) Check if phone number in valid format phone = "614-292-2900" #not ok phone = "(614) 292-2900" #ok format =? if phone? format #well-formatted

Summary Language: A set of strings RE: Defines a language Recipe for making elements of language Literals Distinguish characters and metacharacters Character classes Represent 1 character in RE Repetition FSA Expressive power same as RE