Regular Expressions
Michael Wrzaczek
Dept of Biosciences, Plant Biology
Viikki Plant Science Centre (ViPS)
University of Helsinki, Finland
November 11th, 2015

Regular expressions provide a flexible way to identify and subsequently manipulate strings of text of interest, such as words or any patterns of characters. For example:
  the sequence of characters "car" in any context, such as "car", "cartoon", or "bicarbonate"
  the word "car" when it appears as an isolated word
  the word "car" when preceded by the word "blue" or "red"
  a dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits
  a URL or an email address (http://... or <name>@<text>.<text>)
  all AGI codes in a given text, which look like At<digit>g<five digits>
  duplicated words in a text

Regular expressions, or regexes, are a very powerful tool for searching and manipulating huge amounts of data (text, databases, output of commands) very efficiently. Many programming languages have implementations of regular expressions. The Perl implementation is built into the core of the language; other languages use add-on packages for regex support. Unix has several tools that use regular expressions, most notably the scripting language Awk (not featured in this lecture) and the tool grep (more powerful in the egrep implementation).

REGular EXpressions are a way of thinking!

Egrep Metacharacters
Metacharacters are special markers that, combined with other characters, ensure the desired results. Without metacharacters it is very difficult or impossible to build efficient regular expressions, and a search essentially becomes a plain text search. A plain text search for the word cat also finds vacation. In egrep (and Perl) the metacharacters for start of line and end of line are ^ (caret) and $ (dollar sign). The search ^cat returns only the lines where cat is right at the beginning of a line, whereas cat$ returns only those where cat is at the end (as in scat).

Egrep Metacharacters
What would the following expressions find?
  ^cat$
  ^$
  ^

Egrep Metacharacters
What would the following expressions find?
  ^cat$   Matches if the line has a beginning (which all lines have), followed immediately by cat, then followed immediately by the end of the line (which all lines should have).
  ^$      Matches if the line has a beginning, followed immediately by the end of the line. Finds empty lines.
  ^       Matches if the line has a beginning (which every line has). It matches empty and non-empty lines and essentially achieves nothing.
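
The anchors can be tried out quickly outside of egrep. The following is a minimal sketch that uses Python's re module as a stand-in for egrep; the list of lines and the helper name matching are made up for illustration.

```python
import re

# Hypothetical input lines (not from the lecture).
lines = ["cat", "scat", "cat food", "", "vacation"]

def matching(pattern, lines):
    """Return the lines in which the pattern matches somewhere."""
    return [line for line in lines if re.search(pattern, line)]

starts_with_cat = matching(r"^cat", lines)   # cat at the start of the line
ends_with_cat = matching(r"cat$", lines)     # cat at the end of the line
only_cat = matching(r"^cat$", lines)         # line consisting of just cat
empty = matching(r"^$", lines)               # empty lines only
anything = matching(r"^", lines)             # every line has a beginning

print(starts_with_cat)   # ['cat', 'cat food']
print(ends_with_cat)     # ['cat', 'scat']
print(only_cat)          # ['cat']
print(empty)             # ['']
print(len(anything))     # 5
```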

Egrep Character Classes
Suppose you want to search for grey, but you also want to find it when it is spelled gray. Instead of doing two independent searches you can use the [ ] construct to create a character class.
  gr[ea]y
This will find a g, followed by r, followed by either e or a, finally followed by a y.

Egrep Character Classes
The character class can contain as many characters as you like. To search for a particular locus on all Arabidopsis chromosomes you can use a character class:
  At[12345]g09970

Egrep Character Classes
Multiple ranges are fine. You could define something like this:
  [abcdefABCDEF0123456]
This is awkward to write, so it is better to use a shorthand for it:
  [a-fA-F0-6]
The following class
  [0-9A-Z_!.?]
will match digits, uppercase letters, underscore, exclamation mark, period and question mark.
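
To see that a character class always stands for exactly one character out of the listed ones, the two classes from the slides can be tested on a few strings. This is a small sketch in Python rather than egrep; the test identifiers and characters are invented for the example.

```python
import re

# Locus class from the lecture: chromosome number 1 to 5.
locus = re.compile(r"At[12345]g09970")
ids = ["At1g09970", "At5g09970", "At7g09970", "AtCg09970"]  # hypothetical IDs
hits = [i for i in ids if locus.search(i)]

# Class from the lecture: digits, uppercase letters, _ ! . ?
# Anchored with ^ and $ so that each test string must be one class member.
special = re.compile(r"^[0-9A-Z_!.?]$")
chars = ["A", "7", "_", "?", ".", "a", "-"]
accepted = [c for c in chars if special.search(c)]

print(hits)      # ['At1g09970', 'At5g09970']
print(accepted)  # ['A', '7', '_', '?', '.']
```

Note that the period and the question mark are ordinary characters inside the brackets.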

Egrep Character Classes
Note: The dash is something special. In a character class it usually indicates a range of characters (A-Z). Outside a character class it matches the normal dash. However, if the dash is the first or last character listed in the class, it is interpreted as a plain character.

Egrep Character Classes
You can also use negated character classes if you use [^ ] instead of [ ]. For example, [^1-6] matches a character that is not 1, 2, 3, 4, 5 or 6. The caret is the same character that was introduced before as the anchor for the beginning of a line, but inside a character class it negates the class.

Egrep Character Classes
A search for q[^u] (a q followed by a character that is not u) in a word list returns:
  Iraqi Iraqian miqra qasida qintar qoph zaqqum
Words that are in the list but were not found: Qantas and Iraq. Why?
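
One way to explore the question on this slide is to rerun the search. The sketch below assumes that the search was for q[^u], a q followed by one character that is not u; the word quit is added here only as a control. It suggests two reasons: the pattern is case sensitive, so the Q in Qantas is not a q, and a negated class still has to match a character, so the q at the very end of Iraq has nothing after it to match.

```python
import re

# Words from the slide, plus "quit" as a hypothetical control word.
words = ["Iraqi", "Iraqian", "miqra", "qasida", "qintar", "qoph",
         "zaqqum", "Qantas", "Iraq", "quit"]

# Assumed pattern: q followed by exactly one character that is not u.
pattern = re.compile(r"q[^u]")

found = [w for w in words if pattern.search(w)]
missed = [w for w in words if not pattern.search(w)]

print(found)   # ['Iraqi', 'Iraqian', 'miqra', 'qasida', 'qintar', 'qoph', 'zaqqum']
print(missed)  # ['Qantas', 'Iraq', 'quit']
```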

Egrep Character Classes (Overview)
Character classes in egrep:
  .           any character except newline
  [a-z]       any lowercase letter from a to z (use [A-Z] for uppercase)
  [0-9]       any digit
  \w          alphanumeric characters and underscore, [A-Za-z0-9_]
  [:alnum:]   Alphanumeric characters.
  [:alpha:]   Alphabetic characters.
  [:blank:]   Space and TAB characters.
  [:cntrl:]   Control characters.
  [:digit:]   Numeric characters.
  [:graph:]   Characters that are both printable and visible. (A space is printable but not visible, whereas an `a' is both.)
  [:lower:]   Lowercase alphabetic characters.
  [:print:]   Printable characters (characters that are not control characters).
  [:punct:]   Punctuation characters (characters that are not letters, digits, control or space characters).
  [:space:]   Space characters (such as space, TAB, and formfeed, to name a few).
  [:upper:]   Uppercase alphabetic characters.
  [:xdigit:]  Characters that are hexadecimal digits.
The [:name:] classes are used inside a bracket expression, for example [[:digit:]] or [^[:space:]].
While egrep can use negated classes, the -v option, which prints the lines that do not match, is often a more convenient way to find everything except the defined class.

Alternation
Looking back, we used the following construct to search for grey and gray:
  gr[ea]y
This can also be written using alternation instead of a character class:
  gr(e|a)y
The parentheses are required because the search term gre|ay would match either gre or ay, which is clearly not what is wanted here.
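
The effect of the parentheses on the scope of the alternation is easy to check. Here is a minimal sketch in Python with a made-up word list: the grouped form gr(e|a)y behaves like gr[ea]y, whereas the ungrouped form gre|ay accepts anything that contains gre or ay.

```python
import re

# Hypothetical test words.
words = ["grey", "gray", "agree", "day", "gruy"]

grouped = [w for w in words if re.search(r"gr(e|a)y", w)]
char_class = [w for w in words if re.search(r"gr[ea]y", w)]
ungrouped = [w for w in words if re.search(r"gre|ay", w)]

print(grouped)     # ['grey', 'gray']
print(char_class)  # ['grey', 'gray']
print(ungrouped)   # ['grey', 'gray', 'agree', 'day']
```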

Alternation
The following alternations result in the same outcome:
  Jeffrey|Jeffery
  Jeff(rey|ery)
  Jeff(re|er)y
To have them also match the spellings Geoffrey and Geoffery we can modify them further:
  (Geoff|Jeff)(rey|ery)
  (Geo|Je)ff(rey|ery)
  (Geo|Je)ff(re|er)y
All of those match the same names as the longer (but simpler)
  Jeffrey|Jeffery|Geoffrey|Geoffery
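
That these spellings of the pattern are interchangeable can be verified mechanically. The sketch below, in Python, runs each variant (written with the | alternation operator) against the same small list of names; the two non-matching names are invented controls.

```python
import re

# Four target spellings plus two hypothetical controls that should not match.
names = ["Jeffrey", "Jeffery", "Geoffrey", "Geoffery", "Jeffry", "Geoff"]

variants = [
    r"(Geoff|Jeff)(rey|ery)",
    r"(Geo|Je)ff(rey|ery)",
    r"(Geo|Je)ff(re|er)y",
    r"Jeffrey|Jeffery|Geoffrey|Geoffery",
]

# fullmatch requires the whole name to match, not just a part of it.
results = [[n for n in names if re.fullmatch(v, n)] for v in variants]

for variant, matched in zip(variants, results):
    print(variant, matched)
```

Every variant accepts exactly the first four names.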

Ignoring Differences in Capitalization
To make your regex case-insensitive you can specify the -i option in egrep (in Perl and most other programming languages, use the i modifier for your regex).

Word Boundaries
To avoid finding occurrences of your word embedded in a bigger word you can use word boundaries. In grep you can use the slightly odd-looking \< and \> metasequences to specify them. The expression \<cat\> literally means: match if we can find a start-of-word position, followed immediately by c, a and t, followed immediately by an end-of-word position. The characters < and > become word boundary metasequences only in combination with the backslash \.
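
The difference between an embedded match and a whole-word match can be counted directly. This is a sketch in Python, whose re module does not know \< and \>; it uses \b, which matches at both the start and the end of a word. The sample sentence is made up.

```python
import re

# Hypothetical sample text with cat as a word and inside other words.
text = "The cat took a vacation; concatenate, cat, and scat."

embedded = re.findall(r"cat", text)         # every occurrence of c, a, t
whole_words = re.findall(r"\bcat\b", text)  # only cat as a separate word

print(len(embedded))     # 5
print(len(whole_words))  # 2
```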

Metacharacter  Name                      Matches
.              dot                       any one character
[...]          character class           any character listed
[^...]         negated character class   any character not listed
^              caret                     position at the start of line
$              dollar                    position at the end of line
\<             backslash less than       position at start of word
\>             backslash greater than    position at end of word
|              or, bar, pipe             matches either expression it separates
(...)          parentheses               used to limit scope of |, plus additional uses (discussed later)

Quantifiers
With quantifiers we can specify how many instances of a certain character or character class we want to match. Quantifiers can be separated into greedy and non-greedy. Greedy quantifiers match as much as they can, while non-greedy ones match only until a given criterion is met for the first time.
Greedy quantifiers:
  ?       matches zero or one instance
  *       matches zero or more instances
  +       matches one or more instances
  {n}     matches exactly n instances
  {m,n}   matches at least m but at most n instances, matching the maximum possible

Quantifiers
Search for color and colour:
  colou?r
July or the abbreviation Jul:
  July?
You can use parentheses to group characters in order to apply a quantifier to the group:
  4(th)?
will find 4 but also 4th.
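
The three examples with the optional quantifier can be checked as follows. This is a minimal sketch in Python; the third string in each test list is an invented near miss that should be rejected.

```python
import re

color = re.compile(r"colou?r")   # the u is optional
july = re.compile(r"July?")      # the y is optional
fourth = re.compile(r"4(th)?")   # the group th is optional as a unit

# fullmatch is used so that the whole string has to fit the pattern.
color_hits = [w for w in ["color", "colour", "colouur"] if color.fullmatch(w)]
july_hits = [w for w in ["Jul", "July", "Julyy"] if july.fullmatch(w)]
fourth_hits = [w for w in ["4", "4th", "4t"] if fourth.fullmatch(w)]

print(color_hits)   # ['color', 'colour']
print(july_hits)    # ['Jul', 'July']
print(fourth_hits)  # ['4', '4th']
```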

Parentheses and Backreferences
So far we have used parentheses to limit the scope of alternation or to group multiple characters into larger units to which you can apply quantifiers. Parentheses can also remember the text matched by the subexpression they enclose. This can be used, for example, to solve the problem of finding doubled words.
  \<the +the\>
finds a word boundary, then the, followed immediately by at least one space, then the again and a word boundary. To make this work for other words as well, we can modify it like this:
  \<([a-zA-Z]+) +\1\>
The \1 (backslash 1) is a backreference pointing to the text matched in the parentheses.
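
The doubled-word search can be sketched in Python as well. The pattern below is an adaptation, not the egrep original: it uses \b in place of \< and \>, and [a-zA-Z]+ for the word. The sample sentence is invented; it also contains "the theory", which must not be reported because the second the is not a complete word.

```python
import re

# \b...\b replaces egrep's \< and \>; \1 refers back to the first group.
doubled = re.compile(r"\b([a-zA-Z]+) +\1\b")

# Hypothetical sample text.
text = "This is the the cat that that sat on the theory"

repeats = [m.group(1) for m in doubled.finditer(text)]
print(repeats)  # ['the', 'that']
```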

The Great Escape
So how can you use a character that is usually a metacharacter as an actual character? You use the backslash to escape it. The . (period) usually matches any character except newline. To match an actual . you escape it: \. To use an actual \ (backslash) you also escape it: \\
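
A short illustration of escaping, sketched in Python with a made-up test string. The raw-string prefix r"..." keeps Python itself from interpreting the backslashes, so the regex engine sees them unchanged.

```python
import re

# Hypothetical text; "\\" in a normal Python string is one backslash.
text = "price 3.50 or 3x50, path C:\\temp"

unescaped = re.findall(r"3.50", text)   # . matches any character
escaped = re.findall(r"3\.50", text)    # \. matches only a literal period
backslash = re.findall(r"\\", text)     # \\ matches one literal backslash

print(unescaped)       # ['3.50', '3x50']
print(escaped)         # ['3.50']
print(len(backslash))  # 1

# re.escape adds the backslashes for you.
auto = re.escape("3.50")
print(auto == r"3\.50")  # True
```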

Some egrep examples
egrep can use the output of any Unix command through a pipe:
  ls /usr | egrep PATTERN
egrep can, however, also search files directly:
  egrep PATTERN filename
Modifiers for egrep:
  -i   case-insensitive
  -v   everything but the matches
AGI code examples:
  egrep -i PATTERN agi.txt
  egrep -iv PATTERN agi.txt
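
What the -i and -v modifiers do can be imitated in a few lines. The following is only a rough Python imitation, not egrep itself: the function name grep, the sample lines and the AGI pattern At[0-9]g[0-9]{5} (built from the description At<digit>g<five digits>) are assumptions made for this sketch.

```python
import re

def grep(pattern, lines, ignore_case=False, invert=False):
    """Return matching lines; with invert=True return the non-matching ones."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [ln for ln in lines if bool(regex.search(ln)) != invert]

# Hypothetical file content.
lines = ["At1g09970 kinase", "AT3G12345 unknown", "no locus here", "at5g00001"]
agi = r"At[0-9]g[0-9]{5}"

case_sensitive = grep(agi, lines)                                # like egrep
case_insensitive = grep(agi, lines, ignore_case=True)            # like egrep -i
non_matching = grep(agi, lines, ignore_case=True, invert=True)   # like egrep -iv

print(case_sensitive)    # ['At1g09970 kinase']
print(case_insensitive)  # ['At1g09970 kinase', 'AT3G12345 unknown', 'at5g00001']
print(non_matching)      # ['no locus here']
```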

How Does Pattern Matching Work? (NFA and DFA)
Both regex engines follow two rules:
  1. The match that begins earliest (leftmost) wins.
  2. The standard quantifiers (*, +, ? and {m,n}) are greedy.

1. Earliest Match Wins Rule
This rule says that any match that begins earlier in the string is always preferred over any plausible match that begins later. The match is first attempted at the very beginning of the string to be searched: the entire (perhaps complex) regex is tested starting right at that spot. If all possibilities are exhausted and a match is not found, the complete expression is retried starting from just before the second character. This full retry occurs at each position in the string until a match is found. "No match" is reported only after the full retry has been attempted at each position, all the way to the end of the string (after the last character).

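1. Earliest Match Wins Rule
Example: searching for ORA in FLORAL. The first attempt, starting at the first position, fails (ORA does not match FLO). The second attempt also fails (ORA does not match LOR either). The attempt starting at the third position, however, matches, so the engine stops and reports the match: the ORA in FLORAL.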
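
The retry at each position can be imitated by hand. The sketch below is not how a regex engine is implemented internally; it only reproduces the order of attempts for the search for ORA in FLORAL, using the pos argument of Python's Pattern.match to anchor one attempt at each start position. The function name is made up.

```python
import re

def first_match_with_trace(pattern, text):
    """Try the pattern at each start position, left to right."""
    regex = re.compile(pattern)
    trace = []
    for start in range(len(text) + 1):
        m = regex.match(text, start)      # attempt anchored at this position
        trace.append((start, m is not None))
        if m:
            return m, trace               # earliest match wins: stop here
    return None, trace

match, trace = first_match_with_trace("ORA", "FLORAL")

print(trace)         # [(0, False), (1, False), (2, True)]
print(match.span())  # (2, 5)
```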

1. Why Is This Rule Important?
  The dragging belly indicates your cat is too fat.
If you search for cat in this sentence, the match is found inside indicates and not in the word cat, because indicates appears earlier in the string. This is not important in cases like grep, where you just test for the presence of a string, but if you search AND replace, the distinction becomes paramount.
Where will this match in the example above?
  fat|cat|belly|your
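
Both points of this slide can be tried out on the example sentence. The sketch below uses Python; the replacement word dog is invented. It shows that a search for cat lands inside indicates, what that does to a search and replace, and which alternative of fat|cat|belly|your wins (the one that starts earliest in the sentence, not the one listed first).

```python
import re

sentence = "The dragging belly indicates your cat is too fat."

# The leftmost occurrence of c, a, t is inside "indicates".
m_cat = re.search(r"cat", sentence)
inside_indicates = m_cat.start() == sentence.index("indicates") + 4

# Replacing only the first match changes the wrong word.
replaced = re.sub(r"cat", "dog", sentence, count=1)

# With alternation, the alternative that begins earliest in the text wins.
m_alt = re.search(r"fat|cat|belly|your", sentence)

print(inside_indicates)  # True
print(replaced)          # The dragging belly indidoges your cat is too fat.
print(m_alt.group())     # belly
```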

2. The Standard Quantifiers Are Greedy
Greedy means that the quantifiers will match as many characters as possible. They will settle for less than the maximum if they have to, but they always attempt to match as many times as they can, up to the absolute maximum allowed. The only time they settle for anything less than their allowed maximum is when matching too much ends up causing some later part of the regex to fail.
Example:
  \b\w+s\b
The \w+ happily matches the whole word, but if it did, there would be nothing left for the s to match. For the match to succeed, \w+ has to give back the last character so that s\b is able to match.
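
How much the greedy \w+ finally keeps can be made visible by putting it in parentheses. This is a small sketch in Python; the capturing group and the sample text are additions for illustration.

```python
import re

# Same regex as above, with \w+ captured so we can inspect what it matched.
m = re.search(r"\b(\w+)s\b", "two regexes")

print(m.group(0))  # regexes  (the overall match)
print(m.group(1))  # regexe   (\w+ gave back the final s)
```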

2. Greedy Quantifiers: First Come, First Served
What is being captured by the parentheses in this example?
  Text:  2003
  Regex: ^.*([0-9]+)
Why?
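
The question can be answered by experiment. In the sketch below the text "Copyright 2003" is an assumed example ending in the year from the slide. The .* comes first and grabs everything; it then gives back only as much as [0-9]+ needs to succeed, which is a single digit. For comparison, the second search uses the non-greedy .*? (Perl-style syntax that Python supports), which leaves all four digits to the parentheses.

```python
import re

text = "Copyright 2003"   # assumed example text

# Greedy: .* is served first and leaves only the last digit.
greedy_capture = re.search(r"^.*([0-9]+)", text).group(1)

# Non-greedy: .*? takes as little as possible.
lazy_capture = re.search(r"^.*?([0-9]+)", text).group(1)

print(greedy_capture)  # 3
print(lazy_capture)    # 2003
```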

Where to go from here?
Regular expressions are quite a complicated topic, and we barely scratched the surface here. We did not address the different types of regex engines, and we did not touch on the performance and efficiency of regular expressions.
Suggested further reading:
  Mastering Regular Expressions: THE regex bible! Covers almost every aspect of regular expressions.
  Regular Expressions Pocket Reference: a quick and good reference to regexes in most Unix tools and scripting languages. It does, however, require an understanding of regular expressions.
Michael Wrzaczek, michael.wrzaczek@helsinki.fi