The Little Regular Expressionist

Similar documents
Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)

Regular Expressions. Michael Wrzaczek Dept of Biosciences, Plant Biology Viikki Plant Science Centre (ViPS) University of Helsinki, Finland

Dr. Sarah Abraham University of Texas at Austin Computer Science Department. Regular Expressions. Elements of Graphics CS324e Spring 2017

Perl Regular Expressions. Perl Patterns. Character Class Shortcuts. Examples of Perl Patterns

The name of our class will be Yo. Type that in where it says Class Name. Don t hit the OK button yet.

Introduction to Regular Expressions Version 1.3. Tom Sgouros

PowerGREP. Manual. Version October 2005

CMSC 132: Object-Oriented Programming II

Regular Expressions & Automata

Regular Expressions. Perl PCRE POSIX.NET Python Java

Regular Expressions. Regular expressions are a powerful search-and-replace technique that is widely used in other environments (such as Unix and Perl)

CHAPTER TWO LANGUAGES. Dr Zalmiyah Zakaria

Regular Expressions. Computer Science and Engineering College of Engineering The Ohio State University. Lecture 9

Starting Boolean Algebra

Using Basic Formulas 4

Lecture 15 (05/08, 05/10): Text Mining. Decision, Operations & Information Technologies Robert H. Smith School of Business Spring, 2017

Regular Expressions!!

Understanding Regular Expressions, Special Characters, and Patterns

Transcriber(s): Aboelnaga, Eman Verifier(s): Yedman, Madeline Date Transcribed: Fall 2010 Page: 1 of 9


CS 2112 Lab: Regular Expressions

Regular Expressions. Steve Renals (based on original notes by Ewan Klein) ICL 12 October Outline Overview of REs REs in Python

Regular Expressions. Upsorn Praphamontripong. CS 1111 Introduction to Programming Spring [Ref:

Regexs with DFA and Parse Trees. CS230 Tutorial 11

Assignment 0. Nothing here to hand in

Pattern Matching. An Introduction to File Globs and Regular Expressions

Pattern Matching. An Introduction to File Globs and Regular Expressions. Adapted from Practical Unix and Programming Hunter College

Version November 2017

ASCII Code - The extended ASCII table

CS 230 Programming Languages

Regular expressions. LING78100: Methods in Computational Linguistics I

Regular Expressions. Regular Expression Syntax in Python. Achtung!

Regular Expressions in programming. CSE 307 Principles of Programming Languages Stony Brook University

More Details about Regular Expressions

Regex, Sed, Awk. Arindam Fadikar. December 12, 2017

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

OOstaExcel.ir. J. Abbasi Syooki. HTML Number. Device Control 1 (oft. XON) Device Control 3 (oft. Negative Acknowledgement

Paolo Santinelli Sistemi e Reti. Regular expressions. Regular expressions aim to facilitate the solution of text manipulation problems

Shorthand for values: variables

Subversion was not there a minute ago. Then I went through a couple of menus and eventually it showed up. Why is it there sometimes and sometimes not?

If you have to pick between reluctant and greedy, often (but not always) you ll want the reluctant versions. So let s say we have the string:

Password Management Guidelines for Cisco UCS Passwords

Class 1 Supplement: Pattern matching, and dealing with files

SPEECH RECOGNITION COMMON COMMANDS

Regular Expressions Explained

Lecture Outline. COMP-421 Compiler Design. What is Lex? Lex Specification. ! Lexical Analyzer Lex. ! Lex Examples. Presented by Dr Ioanna Dionysiou

Introduction to Unix

c) Comments do not cause any machine language object code to be generated. d) Lengthy comments can cause poor execution-time performance.

Project Collaboration

Accounts and Passwords

This page covers the very basics of understanding, creating and using regular expressions ('regexes') in Perl.

Number Systems II MA1S1. Tristan McLoughlin. November 30, 2013

Regular Expressions in Practice

While Statement Examples. While Statement (35.15) Until Statement (35.15) Until Statement Example

CSC401 Tutorial 3 - Regular Expressions

CMSC201 Computer Science I for Majors

Section 0.3 The Order of Operations

Worksheet - Reading Guide for Keys and Passwords

1 de 6 07/03/ :28 p.m.

STATISTICAL TECHNIQUES. Interpreting Basic Statistical Values

Filtering Service

- In light of so many reports of USB sticks failing on the C64 mini, I thought I'd try test out some of Camcam's collection.

9 R1 Get another piece of paper. We re going to have fun keeping track of (inaudible). Um How much time do you have? Are you getting tired?

Structure of Programming Languages Lecture 3

Regular Expressions. Todd Kelley CST8207 Todd Kelley 1

Functions that return lists

STATS Data Analysis using Python. Lecture 15: Advanced Command Line

L A TEX Primer. Randall R. Holmes. August 17, 2018

Word: Print Address Labels Using Mail Merge

Full file at C How to Program, 6/e Multiple Choice Test Bank

successes without magic London,

CMPSCI 250: Introduction to Computation. Lecture #28: Regular Expressions and Languages David Mix Barrington 2 April 2014

CS Unix Tools. Fall 2010 Lecture 5. Hussam Abu-Libdeh based on slides by David Slater. September 17, 2010

Chapter 17. Fundamental Concepts Expressed in JavaScript

Writing a Lexer. CS F331 Programming Languages CSCE A331 Programming Language Concepts Lecture Slides Monday, February 6, Glenn G.

CMSC201 Computer Science I for Majors

Computer Systems and Architecture

Installing a Custom AutoCAD Toolbar (CUI interface)

Formulas in Microsoft Excel

Regular Expressions. James Balamuta STAT UIUC. Lecture 25: Nov 9, 2018

CHAPTER-2 STRUCTURE OF BOOLEAN FUNCTION USING GATES, K-Map and Quine-McCluskey

MITOCW watch?v=w_-sx4vr53m

Intermediate Excel 2013

Lexical Analysis. Sukree Sinthupinyo July Chulalongkorn University

Language Basics. /* The NUMBER GAME - User tries to guess a number between 1 and 10 */ /* Generate a random number between 1 and 10 */

Haskell Syntax in Functions

Introduction; Parsing LL Grammars

A language is a subset of the set of all strings over some alphabet. string: a sequence of symbols alphabet: a set of symbols

MITOCW watch?v=se4p7ivcune

正则表达式 Frank from

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Slide Set 5. for ENEL 353 Fall Steve Norman, PhD, PEng. Electrical & Computer Engineering Schulich School of Engineering University of Calgary

Using Lex or Flex. Prof. James L. Frankel Harvard University

Python allows variables to hold string values, just like any other type (Boolean, int, float). So, the following assignment statements are valid:

ITST Searching, Extracting & Archiving Data

Programming for Kids

Innovative User Group Conference 2009 Anaheim 1

SYMMETRY v.1 (square)

CS 115 Lecture 13. Strings. Neil Moore. Department of Computer Science University of Kentucky Lexington, Kentucky

Excel Basics Rice Digital Media Commons Guide Written for Microsoft Excel 2010 Windows Edition by Eric Miller

Transcription:

The Little Regular Expressionist Vilja Hulden August 2016 v0.1b CC-BY-SA 4.0 This little pamphlet, which is inspired by The Little Schemer by Daniel Friedman and Matthias Felleisen, aims to serve as a gentle introduction to regular expressions. You may want to cover the right half of the page, and only move the cover down answer by answer; that way you give yourself time to digest the question and maybe even come up with the answer (sometimes you will have enough information to at least take a guess, though not always). This pamphlet is by no means exhaustive; there is much more to regular expressions. It s a good idea to test and play; http://regexr.com/ has both a tester and reference resources, and a good comprehensive cheat sheet can be found at https://www.cheatography.com/davechild/cheat-sheets/ regular-expressions/. 1 Basics Is a a regular expression? Yes, it matches string a. Is ba*c a regular expression? Does ba*c match bac? Does ba*c match baaaac? Does ba*c match bc? 1

Hmm. So do you mean that ba*c can be read as b followed by zero or more a s and then a c? I do indeed. So Mo*re would match both More and Moore? Yes, but it would of course also match Mre. Oh. What if I don t want it to match Mre? Use the plus (+) instead of the star (*). Aha! bc? So ba+c matches bac but not OK, so does ba+c match baaaac? Yes, because + means one or more. And it also matches baac and baaaaac and baaaaaaaac and... Yes, you ve got the idea. So ba+c can be read as b followed by one or more a s and then a c? Quite so. OK, let me check a few more. b*a match ba? Does Does b*a match a? Yes, because the star means you don t have to have a b. Does b+a+ match aaaa? No, because the plus means that you have to have at least one b. Does b+a match bbbba? Yes, because you can have as many bs as you like, as long as you have at least one. Does b*a match bbbb? No. The a is not followed by a star, so it has to be there. 2

Does ba*c match BAC? No, because the matches are casesensitive. Does b*a match bbbba? No, because the matches are casesensitive! A lowercase b only matches a lowercase b, not an uppercase B. Oh, OK. What if I want to match both? We ll get to that, don t worry. Fine. What if I want to match any character? Use. (the period). Like this: b.c to match bac? Or the same thing, b.c, to match bxc? Does b.c match baaaac too? No. Does it match bxxxxc? No. The period only matches a single character. Oh. So does b.*c match baaaac? Yes! Does b.*c match bbbbbaaaac? Yes, because b is also a character. Does b.*k match bark? Yes, because both a and r are characters. Does B.*K match bark? No, because the matches are casesensitive. B only matches an uppercase B, not a lowercase b. OK. But does c.*p match co-op? Yes, because - (the hyphen) is a character too. 3

So c.*p would match c&?$#e$p? Yes, those are all characters. Does a.*p match b635%#p? No, because there s no a. Does a.*p match a&4>p? What if I have a word that looks like this: ab cd Will a.*d match it? No, the one character that. does not match is the line break. Oh, so it will only match a string if the string is all on one line? So let s say I have a line like this: Cats beat dogs I want to match the Cats in that line. I write C.*s to do that, right? Actually, no. What?? C.*s will match the whole line. Why?? Because regular expressions are greedy. They take everything they can. Oh, so C.*s will actually match the whole string Cats beat dogs because that string ends with s too? So, let me make sure I ve got all this right. Please do. What s the minimum number of a s a string has to have for the expression ba*c to match? Zero. 4

And what s the minimum number of a s a string has to have for the expression ba+ to match? One. Is there an upper limit to how many consecutive a s there can be in a string for the expressions ba* or ba+ to match? No. What does. match? Any character except line breaks. And, a regular expression is greedy. Yes!! Can I make it ungreedy? Can I match the Cats in Cats beat dogs somehow? Yes, you can add? to the quantifier (the star or plus). Like this: C.*?s? Cheat sheet. any character * zero or more + one or more *? zero or more, ungreedy +? one or more, ungreedy 5

2 Alternatives Does a b match a? Does a b match b? Does a b match c? No, there s nothing in the expression that could match c. Does a b match ab? No. So a b means a or b? Yes! Does a b+ match abbbb? No, it only matches one or the other side of the, not both at the same time. OK. So does a b+ match bbbb? And does a b+ match a? Does a b+ match aaa? No, because there s only one a in the expression. Does a+ b+ match aaabbb? No, it still only matches one or the other side of the pipe! Oh, right. So does a+ b+ match aaa? Does a+ b+ match bbb? So dog cat matches dog? But does dog cat match dogs? No, there s no s in the expression. 6

Oh yeah, that s true. But does dog cat match the dog in dogs? Yes! Does dog cat match docat? No, because the whole expression on one side of the (the pipe) has to match. Ah, right. So does dog cat match the cat in docat? Yes! What if I want to match both dogat and docat? Group the g c with parentheses. Like this: do(g c)at? That matches dogat as well as docat. Does dog cats match dogs? No, you can t combine the two sides of the pipe. Does dog cats match cats? Yes, because cats is all on one side of the pipe. Does (dog cat)s match dogs? Yes, because now you ve grouped the alternatives. Does (dog cat)s match cats? Yes, because you ve grouped the alternatives. Does (dog cat)s match cat? No, because the s has to be there. Does (dog cat)s* match cat? Yes, because now you ve made the s optional. 7

3 Character classes Does [abc] match a? Yes, because a is included in the class. Does [abc] match b? Yes, because b is included in the class. Does [abc] match d? No, because d is not included in the class. Does [abc] match ab? No, because the class only represents one of its members at a time. Does [abc][abc] match ab? Yes, because now there are two classes in a row. Does [abc][abc] match cb? Yes, because the class can represent any one of its members. Does [abc]+ match ab? Yes! Very good. Does [abc]+ match ba? Does b[abc]* match ba? Does [abc]+ match bacab? So if a class is followed by * or +, any number of the members of that class can be strung together in any order? Yes (well, to be precise, zero or more for * and one or more for +). If a class contains all letters between a and h, do I have to list them all, like this: [abcdefgh]? No, you can define the range like this: [a-h]. So, [a-z] matches a? 8

And [a-z] matches g? And [a-z] match k? We could go on. Let s not. But does [a-z] match A? No, the class is case sensitive. Does [a-z] match aa? No, because the class only represents one of its members at a time. Does [a-z]+ match abc? Yes, of course. Does [a-z]+ match bye? Yes, of course. Does [a-z] match a-z? No, the expression defines a class, not a string. Does [a-k] match k? Does [a-k] match x? No, x is not included in the range a-k. Does [a-z] match 2? No, no digits are included in the range a-z. Does [a-z] match &? No, no punctuation marks are included in the range a-z. Does [0-5] match 2? Does [0-5] match 7? No, 7 is not included in the range 0-5. Does [0-5] match b? No, no letters are included in the range 0-5. 9

Does [a-z0-9] match b? Does [a-z0-9] match 5? Does [a-z0-9] match b5? No, because the class only represents one of its members at a time. Does [a-z0-9]+ match b5? Yes, because both b and 5 are members of the class and + means one or more. Does [a-z0-9]+ match good4you? Does [a-z0-9]+ match good4you!? No, because the exclamation mark is not a member of the class. Does [a-z0-9!]+ match good4you!? Yes, because now you ve added the exclamation mark to the class. So if I put any characters inside square brackets, those characters become members of a class? Hey, couldn t I use this to match both uppercase and lowercase letters, like I wanted to do earlier (on page 1)? Yes! Like this: [Bb]ob to match both bob and Bob? Yes! Or different spellings, like gr[ae]y to match both grey and gray? Absolutely! OK, so by saying [0-9] I can match anything that s a number. That s right. 10

But what if I want to match anything except a number? Do I have to make a class that lists everything that s not a number? No, silly, of course not. So how do I match anything except a number? You negate the number class with ^ (a caret). Oh, like this: [^0-9]? Cheat sheet a b [abc] [a-z] [A-Z] [Bb] [0-9] any number in the range 0-9 [0-4] any number in the range 0-4 [^246] not 2, 4, or 6 [^0-9] not a number a or b a or b or c any lowercase letter in the range a-z any uppercase letter in the range a-z uppercase or lowercase b 11

4 Special characters and shorthands So if a period is short for any character, then how do I match a period and nothing else? Good question! You have to escape it with a backslash, like this: \. So \.org would only match.org, not, say, borg? That s right. Is it the same for * and +? The period, star, and plus are all special characters with a special meaning. You have to escape them to make them represent the literal character. So writing \*borg\.org\+ would only match *borg.org+ and nothing else? That s right. Hey, could I also match.org by saying [.]org? Yes! Inside a character class, only - (the hyphen) and ^ the caret are special characters. What if I want to include the hyphen in a character class? You put it first, since then it can t define a range. So [-a-z]+ would match bye-bye? And the caret? You put it anywhere but first, since it only negates if it s the first character. So [a-z^]+ would match o^o? Yep. What about if I want to match all whitespace? Do I make a character class? You could, or you can just say \s to cover spaces, tabs, and the various flavors of newlines. So kitty\s*cat would match both kittycat and kitty cat? 12

Are there more shorthands like that? You bet. Too many to list here. Is there a shorthand for all digits? Yes! It s \d. So \d matches the same thing as [0-9] Is there another way to say [^0-9], too? Yes, \D. Oh! So is \S then any character except whitespace? It is! Speaking of special characters, does the caret mean anything outside a character class? Yes, it means beginning of line. So ^dog would find all lines beginning with dog? Is there a character for end of line, too? Yes, $. 5 Grouping and substitution Say I want to match kittykittykitty as well as kittykitty. Can I do that with a plus, like I can match aa and aaa with a+? Yes! You can make kitty a single group by putting it in parentheses. Like this: (kitty)+? So if I replace (kitty)+ with doggie, do I get doggiedoggiedoggie? No, you get doggie, because (kitty)+ matches the whole string, no matter how many times kitty it has. 13

Oh. So actually to replace kittykittykitty with doggiedoggiedoggie, I should just replace kitty with doggie? Can I also say (kitty doggie)+ and match either kittykittykitty and doggiedoggiedoggie? Yes, of course you can. And that will of course also match kitty and doggiedoggie and so on. Right. So what if I wanted to replace kittykitty with here-kittykitty and doggiedoggiedoggie with here-doggiedoggiedoggie? Can I do that with one expression? Yes; you can use a backreference. Any grouped expression is saved and numbered and can be accessed by $1, $2, and so on. So replacing That s a cute (kitty) with I like that $1 would produce I like that kitty? Absolutely. So then can I say replace (kitty doggie)+ with here-$1 to get here-doggiedoggie and so on? No, because now the grouped part only has kitty or doggie once. Oh, so I ll always get here-kitty or here-doggie and never here-doggiedoggie. Yes; you have to include the plus in a group. Would this work: ((kitty doggie )+)? It would. Cheat sheet \ escape character \. period (literal) \s any whitespace character (space, tab, newline) \d any digit \S any non-whitespace character \D any non-digit character (ab)* ab zero or more times $1 cat in kitty(cat) $2 s in kitty(cat)(s) $1 kittycat in (kitty(cat)) $2 cat in (kitty(cat)) 14