Regular Expressions Primer


Jeremy Stephens, Computer Systems Analyst, Department of Biostatistics
December 18, 2015

What are they?
Regular expressions are a way to describe patterns in text.


Why use them?
To find stuff:
    haystack <- c("abcdef", "anbeceddelfe", "abcdef")
    grep("n.e.e.d.l.e", haystack) #=> [1] 2
To remove cruft:
    x <- c("123", "123 oz", "123 ounces")
    sub("\\D+$", "", x) #=> [1] "123" "123" "123"
To extract data:
    x <- "The Predators lost in overtime with 4:43 left on the clock."
    gsub("^.+(\\d{1,2}:\\d{2}).+$", "\\1", x) #=> [1] "4:43"

(A screen-filling regular expression, apparently the well-known pattern for validating RFC 822 email addresses.)

Don't Panic!

Things you (might) need to know
There are 3 main regular expression standards:
  POSIX Basic Regular Expressions
  POSIX Extended Regular Expressions (R uses this by default)
  Perl-based Regular Expressions (R has support for this as well)

Using regular expressions in R
    grep(pattern, text, ...)
    sub(pattern, replacement, text, ...)
    regexpr(pattern, text, ...)
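As a rough illustration of how these three functions differ, the sketch below runs each of them on the same vector. The sample strings and variable names are made up for this example and do not come from the slides.

```r
# Hypothetical sample data (not from the original slides).
desserts <- c("apple pie", "banana split", "cherry tart")

# grep() returns the indices of the elements that contain a match.
hits <- grep("an", desserts)

# sub() replaces only the first match within each element.
replaced <- sub("a", "A", desserts)

# regexpr() gives the position of the first match in each element,
# or -1 where there is no match.
positions <- as.integer(regexpr("an", desserts))

print(hits)
print(replaced)
print(positions)
```

Two close relatives are worth knowing: gsub() works like sub() but replaces every match, and grepl() returns TRUE/FALSE values instead of indices.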

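Using regular expressions in other languages
    Ruby: text =~ /pattern/
    JavaScript: text.match(/pattern/)
    Python: re.search('pattern', text)
Note that Python's re.match only matches at the start of the string; re.search looks for the pattern anywhere in it, like the Ruby and JavaScript forms above.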

The simplest regular expression
Plain text!
    grep("test", c("foo", "bar", "test")) #=> [1] 3
The pattern "test" is a perfectly valid regular expression.

Metacharacters
A "metacharacter" in a regular expression is a character with special meaning.

Your first metacharacter: dot
A "dot" (or period) means match any one character.
    grep("foo.bar", c("fooxbar", "foo bar", "foobar", "foo12bar")) #=> [1] 1 2
In the above example, "foo.bar" is the regular expression.

Your first metacharacter: dot (cont.)
Using multiple wildcards:
    grep("foo..bar", c("foo12bar", "foo123bar")) #=> [1] 1
    grep("foo...bar", c("foo12bar", "foo123bar")) #=> [1] 2

Variable-length matching: plus
Use + to indicate one or more.
    grep("foo.+bar", c("fooxbar", "foo bar", "foobar", "foo12bar")) #=> [1] 1 2 4

Variable-length matching: plus (cont.)
Using + works on normal characters too:
    grep("a+rgh", c("argh", "aaargh", "aaaaaaaargh", "ugh")) #=> [1] 1 2 3

Quantifiers
Quantifiers are metacharacters that describe "how many", like the + character.

Matching zero or more: asterisk
Use * to indicate zero or more.
    grep("foo.*bar", c("fooxbar", "foo bar", "foobar", "foo12bar")) #=> [1] 1 2 3 4

Matching zero or one: question mark
Use ? to indicate zero or one.
    grep("abc?def", c("abdef", "abcdef", "abccdef")) #=> [1] 1 2

Matching n times: curly braces
Use {} to indicate exactly n times.
    grep("10{6}", c("1000", "1000000")) #=> [1] 2

Matching n to m times: curly braces
You can also use {} to indicate a range.
    grep("10{2,4}1", c("101", "1001", "10001", "100001", "1000001")) #=> [1] 2 3 4
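To compare the quantifiers side by side, the following sketch applies each one to the same set of strings. The test strings and variable names are invented for illustration; grepl() is used instead of grep() so that each result is a TRUE/FALSE value per input.

```r
# Hypothetical test strings: "a", then zero to three "b"s, then "c".
x <- c("ac", "abc", "abbc", "abbbc")

zero_or_more <- grepl("ab*c", x)      # * : zero or more "b"
one_or_more  <- grepl("ab+c", x)      # + : one or more "b"
zero_or_one  <- grepl("ab?c", x)      # ? : zero or one "b"
exactly_two  <- grepl("ab{2}c", x)    # {2} : exactly two "b"
two_to_three <- grepl("ab{2,3}c", x)  # {2,3} : two or three "b"

# One row per input string, one column per quantifier.
result <- data.frame(x, zero_or_more, one_or_more, zero_or_one,
                     exactly_two, two_to_three)
print(result)
```

Reading the table row by row shows which quantifiers accept a given number of repetitions.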

Intermission

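Anchor down
Consider the following:
    grep("10{3}", c("100", "1000", "10000")) #=> [1] 2 3
The pattern matches "10000" as well, because nothing requires the match to end after the third zero.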

Anchoring to the end: dollar sign
Use $ to indicate anchoring at the end.
    grep("10{3}$", c("100", "1000", "10000")) #=> [1] 2

Anchoring to the beginning: caret
Use ^ to indicate anchoring at the beginning.
    grep("^10{3}", c("1000", "abc1000")) #=> [1] 1

Using both anchors
You can use both ^ and $ to anchor at both ends.
    grep("10{3}", c("abc1000", "1000", "10000")) #=> [1] 1 2 3
    grep("10{3}$", c("abc1000", "1000", "10000")) #=> [1] 1 2
    grep("^10{3}$", c("abc1000", "1000", "10000")) #=> [1] 2
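Anchors also control which occurrence a replacement touches. The sketch below is a small illustration with sub(); the input string and variable names are assumptions made for this example.

```r
# Hypothetical input containing "foo" at both ends.
x <- "foobarfoo"

# ^ restricts the match to the start of the string.
from_start <- sub("^foo", "", x)

# $ restricts the match to the end of the string.
from_end <- sub("foo$", "", x)

# With both anchors the pattern must account for the whole string.
whole <- grepl("^foo$", c("foo", "foobarfoo"))

print(from_start)
print(from_end)
print(whole)
```

Anchoring at both ends is a common way to check that a string consists of the pattern and nothing else.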

Matching groups of characters: brackets
Use [ ] to indicate a group of characters, also known as a character class.
    grep("ab[cdef]", c("abc", "abd", "abe", "abf", "abg")) #=> [1] 1 2 3 4

Matching groups of characters: brackets (cont.)
You can also use [ ] with a range of characters.
    grep("ab[c-f]", c("abc", "abd", "abe", "abf", "abg")) #=> [1] 1 2 3 4

Matching groups of characters: brackets (cont.)
Using [ ] with multiple ranges:
    grep("ab[c-fC-F]", c("abc", "abC", "abf", "abF", "abg")) #=> [1] 1 2 3 4

Matching groups of characters: brackets (cont.)
Mix and match ranges and characters:
    grep("ab[c-fC-F123]", c("abc", "abF", "ab1", "ab2")) #=> [1] 1 2 3 4

Context in character classes
Most metacharacters lose their meaning inside character classes:
    grep("[.+*]", c(".", "+", "*", "x")) #=> [1] 1 2 3
    grep("[a-c-]", c("a", "b", "c", "-")) #=> [1] 1 2 3 4

Using quantifiers with character classes
It's possible to put quantifiers on character classes:
    grep("[a-z]+", c("123", "abcdef")) #=> [1] 2

Negating a character class
Use ^ inside a character class to negate it.
    grep("[^a-z]+", c("123", "abcdef")) #=> [1] 1

Built-in character classes
    Shortcut    Expanded
    \d          [0-9]
    \w          [a-zA-Z0-9_]
    \s          [ \t\n\r\f]
    \D          [^0-9]
    \W          [^a-zA-Z0-9_]
    \S          [^ \t\n\r\f]
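One way to convince yourself that a shortcut and its expanded form agree is to run both on the same input. This is a minimal sketch for plain ASCII strings; the sample vector and variable names are invented here, and the doubled backslashes are needed because the patterns are written as R strings.

```r
# Hypothetical ASCII-only sample strings.
x <- c("abc", "123", "a_1", " ", "x y")

digit_short <- grep("\\d", x)
digit_long  <- grep("[0-9]", x)

word_short <- grep("\\w", x)
word_long  <- grep("[a-zA-Z0-9_]", x)

space_short <- grep("\\s", x)
space_long  <- grep("[ \t\n\r\f]", x)

nondigit_short <- grep("\\D", x)
nondigit_long  <- grep("[^0-9]", x)

# Each pair should select the same elements.
print(identical(digit_short, digit_long))
print(identical(word_short, word_long))
print(identical(space_short, space_long))
print(identical(nondigit_short, nondigit_long))
```

With non-ASCII text the shortcuts may behave differently depending on the locale and regex engine, so the equivalence should be treated as an ASCII-only rule of thumb.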

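A note about backslashes in R
To use built-in character classes in R, you need to escape the backslashes, because R processes backslash escapes in the string literal before the regular expression engine sees the pattern.
    grep("\d", c("123", "abcdef")) #=> Error!
    grep("\\d", c("123", "abcdef")) #=> [1] 1

A note about backslashes in R (cont.)
To match a literal backslash in a regular expression in R, write four backslashes in the string: R turns them into two, which the regex engine reads as one escaped backslash.
    grep("\\\\", c("123", "123\\")) #=> [1] 2
    grep("\\\\\\\\", c("123", "123\\\\")) #=> [1] 2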
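A quick way to see what the regex engine actually receives is to inspect the string after R has parsed it. The variable names below are hypothetical; nchar() counts the characters in the parsed string and writeLines() prints it without R's escaping.

```r
# "\\d" in source code is a two-character string: a backslash and "d".
digit_pattern <- "\\d"
digit_length <- nchar(digit_pattern)
writeLines(digit_pattern)

# "\\\\" in source code is two backslashes, which the regex engine
# interprets as one literal backslash.
backslash_pattern <- "\\\\"
backslash_length <- nchar(backslash_pattern)
writeLines(backslash_pattern)

print(digit_length)
print(backslash_length)
```

Printing a pattern with writeLines() before using it can help when the number of backslashes becomes hard to follow.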

Grouping
Use ( ) to specify a group to be referenced later using \N, where N is the group number (written "\\1", "\\2", and so on in an R string).
    sub("foo(.+)", "\\1", c("foobar", "foobaz")) #=> [1] "bar" "baz"
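Groups become more useful when there is more than one of them. The sketch below, using a made-up name string, swaps two captured groups with sub() and then pulls the groups out individually with regexec() and regmatches(), both from base R.

```r
# Hypothetical input in "Last, First" form.
name <- "Doe, Jane"
pattern <- "(\\w+), (\\w+)"

# \\1 is the first group (last name), \\2 the second (first name).
swapped <- sub(pattern, "\\2 \\1", name)

# regexec() records the positions of the full match and of each group;
# regmatches() turns those positions into the matched text.
parts <- regmatches(name, regexec(pattern, name))[[1]]

print(swapped)
print(parts)
```

The first element of parts is the full match, followed by one element per group in order.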

Advanced usage: negative lookahead
Use ?! inside a group to indicate a negative lookahead match. This requires perl=TRUE in R functions.
    grep("foo(?!bar).+", c("foo123", "foobar", "foo123bar"), perl=TRUE) #=> [1] 1 3

Advanced usage: negative lookbehind
Use ?<! inside a group to indicate a negative lookbehind match. This requires perl=TRUE in R functions.
    grep("(?<!foo)bar", c("bar", "foobar", "123bar"), perl=TRUE) #=> [1] 1 3

Regular expressions on the command line
You can use regular expressions with the grep and sed commands in your terminal.
    grep -E 'foo.+bar' *.txt
    egrep 'foo.+bar' *.txt
    sed 's/foo/bar/' *.txt

References
  http://www.regular-expressions.info/
  man perlre (from a terminal)
  ?sub (in R)
  RegularExpressionPrimer on the Biostatistics Wiki

The End