Regular Expressions. Michael Wrzaczek Dept of Biosciences, Plant Biology Viikki Plant Science Centre (ViPS) University of Helsinki, Finland


Regular Expressions. Michael Wrzaczek, Dept of Biosciences, Plant Biology, Viikki Plant Science Centre (ViPS), University of Helsinki, Finland. November 11th, 2015

Regular expressions provide a flexible way to identify and subsequently manipulate strings of text of interest, such as words or any patterns of characters. For example:
- the sequence of characters "car" in any context, such as "car", "cartoon", or "bicarbonate"
- the word "car" when it appears as an isolated word
- the word "car" when preceded by the word "blue" or "red"
- a dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits
- a URL (http://...) or an email address (<name>@<text>.<text>)
- all AGI codes in a given text, which look like At<digit>g<five digits>
- duplicated words in a text
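As a rough sketch, two of the patterns described above could be written as follows. Python's re module is used here as an assumption for illustration (the lecture itself works with egrep and Perl), and the sample text, the second locus code and the prices are made up.

```python
import re

# Hypothetical sample text; At5g12345 and the prices are invented for this example.
text = "Loci At1g09970 and At5g12345 were tested; the kit cost $25 or $19.99."

# AGI code: "At", one digit, "g", then exactly five digits.
agi_codes = re.findall(r"At[0-9]g[0-9]{5}", text)

# Dollar sign, one or more digits, optionally a period and exactly two digits.
# (?:...) groups without capturing, so findall returns the whole match.
prices = re.findall(r"\$[0-9]+(?:\.[0-9]{2})?", text)

print(agi_codes)
print(prices)
```

The dollar sign and the period are escaped with a backslash because both are metacharacters.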

Regular expressions, or regexes, are a very powerful tool to search and manipulate huge amounts of data (text, databases, output of commands) very efficiently. Many programming languages have implementations of regular expressions. The Perl implementation of regular expressions is built into the core of the language; other languages use add-on packages for regex support. Unix has several tools that use regular expressions, most notably the scripting language Awk (not featured in this lecture) and the tool grep (more powerful in the egrep implementation).

REGular EXpressions are a way of thinking!

Egrep Metacharacters. Metacharacters are special markers that ensure the desired results when combined with other characters. Without metacharacters it is very difficult or impossible to build efficient regular expressions, and a search essentially becomes a simple plain text search. In a search for the word cat, a plain text search also finds vacation. In egrep (and Perl) the metacharacters for start of line and end of line are ^ (caret) and $ (dollar sign). The search ^cat returns only the lines where cat is right at the beginning of the line, whereas cat$ returns only those where cat is at the end (like scat).

Egrep Metacharacters. What would the following expressions find: ^cat$, ^$ and ^?

Egrep Metacharacters. What would the following expressions find?
^cat$ matches if the line has a beginning (which all lines have), followed immediately by cat, followed immediately by the end of the line (which all lines have). It finds lines that consist only of cat.
^$ matches if the line has a beginning, followed immediately by the end of the line. It finds empty lines.
^ matches if the line has a beginning (which every line has). It matches empty and non-empty lines and essentially achieves nothing.
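The behaviour of the two anchors can be checked with a small sketch. It uses Python's re module rather than egrep (an assumption for illustration; the anchors behave the same way on single lines), and the list of lines is made up.

```python
import re

# Hypothetical input lines, one string per line of a file.
lines = ["cat", "cat food", "scat", "", "vacation"]

def matching(pattern):
    """Return the lines in which the pattern is found, as egrep would print them."""
    return [line for line in lines if re.search(pattern, line)]

print(matching(r"^cat"))   # lines that start with cat
print(matching(r"cat$"))   # lines that end with cat
print(matching(r"^cat$"))  # lines that consist only of cat
print(matching(r"^$"))     # empty lines
print(matching(r"^"))      # every line, empty or not
```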

Egrep Character Classes. Suppose you want to search for grey, but you want to find it also when it is spelled gray. Instead of doing two independent searches you can use the [ ] construct to create a character class. gr[ea]y will find a g, followed by r, followed by either e or a, finally followed by a y.

Egrep Character Classes. The character class can contain as many characters as you like. To search for a particular locus on all Arabidopsis chromosomes you can use a character class: At[12345]g09970

Egrep Character Classes. Multiple ranges are fine. You could define something like this: [abcdefABCDEF0123456] This is awkward to write, so it is better to use a shorthand for it: [a-fA-F0-6] The class [0-9A-Z_!.?] will match digits, uppercase letters, underscore, exclamation mark, period and question mark.

Egrep Character Classes. Note: The dash is something special. In a character class it usually indicates a range of characters (A-Z). Outside a character class it matches the normal dash. However, if it is the first or last character in the class, it is interpreted as a plain character.

Egrep Character Classes. You can also use negated character classes if you use [^ ] instead of [ ]. For example, [^1-6] matches a character that is not 1, 2, 3, 4, 5 or 6. The caret is the same character that was introduced before as an anchor for the beginning of a line, but as the first character inside a class it means negation.
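A minimal sketch of plain, range and negated character classes, using Python's re module as a stand-in for egrep (an assumption for illustration). The test strings are made up; At7g09970 and AtCg09970 are included only as non-matching examples.

```python
import re

# Either spelling of grey/gray, but nothing else.
spellings = re.findall(r"gr[ea]y", "grey, gray, groy")

# The same locus on chromosomes 1 to 5, written with a range instead of [12345].
loci = re.findall(r"At[1-5]g09970", "At1g09970 At3g09970 At7g09970 AtCg09970")

# Negated class: every character that is not 1, 2, 3, 4, 5 or 6.
not_1_to_6 = re.findall(r"[^1-6]", "1728x")

print(spellings)
print(loci)
print(not_1_to_6)
```

Note that a negated class still has to match one character: it means "a character that is not listed", not "no character".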

Egrep Character Classes. Words found: Iraqi, Iraqian, miqra, qasida, qintar, qoph, zaqqum. Words that are in the list but were not found: Qantas and Iraq. WHY???
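The slide does not show the search pattern. Assuming it was q[^u] (a q followed by a character that is not u), which is a guess based on the words listed, the result can be reproduced with this sketch in Python's re module. The word "quit" is added here only as a control.

```python
import re

# Words from the slide, plus "quit" as a made-up control word.
words = ["Iraqi", "Iraqian", "miqra", "qasida", "qintar", "qoph",
         "zaqqum", "Qantas", "Iraq", "quit"]

# Assumed pattern: lowercase q followed by any one character other than u.
pattern = r"q[^u]"

found = [w for w in words if re.search(pattern, w)]
missed = [w for w in words if not re.search(pattern, w)]

print(found)
print(missed)
```

Under this assumption, Qantas is missed because its Q is uppercase, and Iraq is missed because nothing follows the q: the negated class needs a character to match.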

Egrep Character Classes (Overview). Character classes in egrep:
.            any character except newline
[a-z]        any lowercase letter from a to z (use [A-Z] for uppercase)
[0-9]        any digit
\w           alphanumeric characters and underscore, [A-Za-z0-9_]
[:alnum:]    Alphanumeric characters.
[:alpha:]    Alphabetic characters.
[:blank:]    Space and TAB characters.
[:cntrl:]    Control characters.
[:digit:]    Numeric characters.
[:graph:]    Characters that are both printable and visible. (A space is printable but not visible, whereas an `a' is both.)
[:lower:]    Lowercase alphabetic characters.
[:print:]    Printable characters (characters that are not control characters).
[:punct:]    Punctuation characters (characters that are not letters, digits, control or space characters).
[:space:]    Space characters (such as space, TAB, and formfeed, to name a few).
[:upper:]    Uppercase alphabetic characters.
[:xdigit:]   Characters that are hexadecimal digits.
The [:name:] classes are written inside a bracket expression, for example [[:digit:]].
While egrep can use negated classes, the -v option (print only the lines that do not match) is often a more convenient way to find everything except the defined class.

Alternation. Looking back, we used the following construct to search for grey and gray: gr[ea]y This can also be written using alternation instead of a character class: gr(e|a)y The parentheses are required because the search term gre|ay would match either gre or ay, which is clearly not what is wanted here.

Alternation. The following alternations result in the same outcome:
Jeffrey|Jeffery
Jeff(rey|ery)
Jeff(re|er)y
To have them also match the spellings Geoffrey or Geoffery we can modify them further:
(Geoff|Jeff)(rey|ery)
(Geo|Je)ff(rey|ery)
(Geo|Je)ff(re|er)y
All of those match the same names as the longer (but simpler) Jeffrey|Jeffery|Geoffrey|Geoffery
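That these alternations are equivalent can be checked with a short sketch. It uses Python's re module (an assumption for illustration; the | operator works the same way as in egrep), and "Jeffry" is a made-up control that none of the patterns should accept.

```python
import re

# The four spellings from the slide, plus one control spelling.
names = ["Jeffrey", "Jeffery", "Geoffrey", "Geoffery", "Jeffry"]

patterns = [
    r"Jeffrey|Jeffery|Geoffrey|Geoffery",
    r"(Geoff|Jeff)(rey|ery)",
    r"(Geo|Je)ff(rey|ery)",
    r"(Geo|Je)ff(re|er)y",
]

# fullmatch requires the pattern to cover the whole name.
results = [[n for n in names if re.fullmatch(p, n)] for p in patterns]

for pattern, matched in zip(patterns, results):
    print(pattern, "->", matched)
```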

Ignoring Differences in Capitalization. To make your regex case-insensitive you can specify the -i option in egrep (in Perl and most other programming languages, use the i modifier for your regex).
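In Python's re module (used here as an assumption for illustration), the counterpart of the -i option is the re.IGNORECASE flag. A minimal sketch with a made-up test string:

```python
import re

text = "Cat, cat and CAT"

# Default: the match is case-sensitive.
case_sensitive = re.findall(r"cat", text)

# With the flag, capitalization is ignored.
case_insensitive = re.findall(r"cat", text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```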

Word Boundaries. To avoid finding occurrences of your word embedded in a bigger word you can use word boundaries. In grep you can use the slightly odd-looking \< and \> metasequences to specify them. The expression \<cat\> literally means: match if we can find a start-of-word position, followed immediately by c, a and t, followed immediately by an end-of-word position. Note that < and > on their own are ordinary characters; they become word boundary metasequences only in combination with the backslash \
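The effect of word boundaries can be sketched as follows. Python's re module does not support \< and \>; it uses \b for both the start and the end of a word, so \bcat\b is used here as the closest equivalent (an assumption for illustration). The test sentence is made up.

```python
import re

text = "the cat sat on a scatter of vacation photos; concatenate"

# Plain search: also finds cat inside scatter, vacation and concatenate.
plain = re.findall(r"cat", text)

# With word boundaries: only the isolated word cat.
whole_word = re.findall(r"\bcat\b", text)

print(len(plain))
print(len(whole_word))
```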

Metacharacter   Name                      Matches
.               dot                       any one character
[...]           character class           any character listed
[^...]          negated character class   any character not listed
^               caret                     position at the start of line
$               dollar                    position at the end of line
\<              backslash less-than       position at start of word
\>              backslash greater-than    position at end of word
|               or, bar, pipe             either expression it separates
(...)           parentheses               limits the scope of |, plus additional uses (discussed later)

Quantifiers. With quantifiers we can specify how many instances of a certain character or character class we want to match. Quantifiers can be separated into greedy and non-greedy. Greedy quantifiers match as much as they can, while non-greedy ones match only until a given criterion is met for the first time. Greedy quantifiers:
?       matches zero or one instance
*       matches zero or more instances
+       matches one or more instances
{n}     matches exactly n instances
{m,n}   matches at least m but at most n instances, and matches the maximum possible

Quantifiers. Search for color and colour: colou?r Search for July or its abbreviation Jul: July? You can use parentheses to group characters in order to apply a quantifier to the group: 4(th)? will find 4 but also 4th
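The quantifier examples above can be tried out with this sketch in Python's re module (an assumption for illustration). The test strings, including the misspelling "colouur" and the digit runs, are made up.

```python
import re

# ? makes the preceding u optional; "colouur" has one u too many.
colours = re.findall(r"colou?r", "color colour colouur")

# ? applies only to the y, not to the whole word.
months = re.findall(r"July?", "Jul 4 and July 4th")

# Parentheses group "th" so that ? applies to both letters.
# finditer with group(0) returns the whole match, not just the group.
ordinals = [m.group(0) for m in re.finditer(r"4(th)?", "4 and 4th")]

# {2,3} is greedy: it takes three digits whenever three are available.
digit_runs = re.findall(r"[0-9]{2,3}", "1 22 333 4444")

print(colours, months, ordinals, digit_runs)
```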

Parentheses and Backreferences. So far we have used parentheses to limit the scope of alternation or to group multiple characters into larger units to which you can apply quantifiers. Parentheses can also remember the text matched by the subexpression they enclose. This can be used to solve the problem of finding doubled words, for example. \<the +the\> finds a word boundary, then the, followed immediately by at least one space, then the again and a word boundary. To make this work for other words as well, we can modify it like this: \<([a-zA-Z]+) +\1\> The \1 (backslash 1) is a backreference pointing to the text matched in the parentheses.
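A minimal sketch of the doubled-word search, written for Python's re module (an assumption for illustration). Since Python has no \< and \>, \b stands in for both word boundaries; the backreference \1 works as described. The test sentence is made up.

```python
import re

text = "This is is a test of the the doubled word finder, not of then the."

# ([a-zA-Z]+) remembers a word, " +" requires at least one space,
# and \1 demands exactly the same text again, ending at a word boundary.
doubled_word = re.compile(r"\b([a-zA-Z]+) +\1\b")

doubled = [m.group(0) for m in doubled_word.finditer(text)]
print(doubled)
```

The word boundaries matter: without them, "This is" and "then the" could produce false hits on parts of words.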

The Great Escape. So how can you use a character that is usually a metacharacter as an actual character? You use the backslash to escape it. The . (period) usually matches any character except newline. To match an actual . you escape it: \. To use an actual \ (backslash) you also escape it: \\
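The difference between an escaped and an unescaped period can be seen in this sketch, written with Python's re module (an assumption for illustration). The test string is made up and contains exactly one backslash.

```python
import re

# In this ordinary string literal, "\\" stands for a single backslash.
text = "file.txt filextxt C:\\temp"

# Unescaped: the period matches any character, so both names are found.
unescaped = re.findall(r"file.txt", text)

# Escaped: only a literal period is accepted.
escaped = re.findall(r"file\.txt", text)

# A literal backslash is matched by an escaped backslash.
backslashes = re.findall(r"\\", text)

print(unescaped, escaped, len(backslashes))
```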

Some egrep examples. egrep can use the output of any Unix command through a pipe:
ls /usr | egrep <pattern>
egrep can, however, also search files directly:
egrep <pattern> filename
Modifiers for egrep:
-i   case-insensitive
-v   everything but the matches
AGI code examples:
egrep -i <pattern> agi.txt
egrep -iv <pattern> agi.txt

How Does Pattern Matching Work? (NFA and DFA). Both types of regex engine follow two rules: 1. The match that begins earliest (leftmost) wins. 2. The standard quantifiers (*, +, ? and {m,n}) are greedy.

1. Earliest Match Wins Rule. This rule says that any match that begins earlier in the string is always preferred over any plausible match that begins later. The match is first attempted at the very beginning of the string to be searched: the entire (perhaps complex) regex is tested starting right at that spot. If all possibilities are exhausted and a match is not found, the complete expression is retried starting from just before the second character. This full retry occurs at each position in the string until a match is found. "No match" is reported only after the full retry has been attempted at each position all the way to the end of the string (after the last character).

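1. Earliest Match Wins Rule. Example: searching for ORA in FLORAL. The first attempt, at the start of the string, fails (ORA does not match FLO). The second attempt also fails (ORA does not match LOR either). The attempt starting at the third position, however, matches, so the engine stops and reports the match: FLORAL.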
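The retry procedure can be imitated with a short loop. This is only a sketch of the idea, not how any real engine is implemented: the hypothetical helper leftmost_match uses Python's re module to attempt the whole pattern at one start position at a time.

```python
import re

def leftmost_match(pattern, text):
    """Try the whole pattern at each position, left to right; stop at the first success."""
    compiled = re.compile(pattern)
    # The range includes the position after the last character.
    for start in range(len(text) + 1):
        m = compiled.match(text, start)  # attempt at this position only
        if m:
            return start, m.group(0)
    return None  # reported only after every position has been tried

print(leftmost_match("ORA", "FLORAL"))
print(leftmost_match("X", "FLORAL"))
```

Positions are counted from 0 here, so the third position of the string has index 2.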

1. Why Is This Rule Important? Example sentence: The dragging belly indicates your cat is too fat. If you search for cat, the match is found in indicates rather than at the word cat, because indicates appears earlier in the string. This is not important in cases like grep, where you just test for the presence of a string, but if you search AND replace, the distinction becomes paramount. Where will this match in the example above: fat|cat|belly|your
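Both searches can be run on the example sentence with this sketch in Python's re module (an assumption for illustration; the rule applies to egrep and Perl alike).

```python
import re

sentence = "The dragging belly indicates your cat is too fat."

# Searching for cat: the earliest occurrence is inside "indicates".
cat_match = re.search(r"cat", sentence)
inside_indicates = cat_match.start() == sentence.index("indicates") + 4

# Alternation: the alternative whose match starts earliest in the string wins,
# regardless of the order in which the alternatives are written.
first = re.search(r"fat|cat|belly|your", sentence).group(0)

# In search and replace, the difference becomes visible.
replaced = re.sub(r"cat", "dog", sentence, count=1)

print(inside_indicates)
print(first)
print(replaced)
```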

2. The Standard Quantifiers Are Greedy. Greedy means that the quantifiers will match as many characters as possible. They will settle for less than the maximum if they have to, but they always attempt to match as many times as they can, up to the absolute maximum allowed. The only time they settle for anything less than their allowed maximum is when matching too much ends up causing some later part of the regex to fail. Example: \b\w+s\b The \w+ happily matches the whole word, but if it did, there would be nothing left for the s to match. For the match to succeed, \w+ has to give back the final s so that s\b is able to match.
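To make the give-back visible, the sketch below adds two pairs of parentheses to the pattern (an addition for illustration, not part of the slide) and runs it with Python's re module on the made-up word "regexes".

```python
import re

# Group 1 shows what \w+ ends up with, group 2 what is left for the s.
m = re.search(r"\b(\w+)(s)\b", "regexes")

greedy_part = m.group(1)
final_s = m.group(2)

print(greedy_part, final_s)
```

\w+ first takes the whole word, then releases its last character so that the rest of the pattern can succeed.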

2. Greedy Quantifiers: First Come, First Served. What is being captured by the parentheses in this example? Text: 2003. Regex: ^.*([0-9]+). WHY???
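The question can be explored with a sketch in Python's re module (an assumption for illustration). The second pattern uses the non-greedy form .*?, which Perl and Python support, and is added here only for comparison.

```python
import re

text = "2003"

# Greedy: .* comes first and takes everything, then gives back only the
# single character that [0-9]+ needs in order to match at all.
greedy = re.search(r"^.*([0-9]+)", text).group(1)

# Non-greedy: .*? takes as little as possible, so [0-9]+ gets all the digits.
non_greedy = re.search(r"^.*?([0-9]+)", text).group(1)

print(greedy)
print(non_greedy)
```

The quantifier that comes first is served first: [0-9]+ is greedy too, but it can only take what .* is forced to leave behind.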

Where to go from here? Regular expressions are quite a complicated topic, and we have barely scratched the surface here. We did not address the different types of regex engines, and we did not touch on the performance and efficiency of regular expressions. Suggested further reading:
- Mastering Regular Expressions: THE regex bible! Covers almost every aspect of regular expressions.
- Regular Expressions Pocket Reference: a quick and good reference to regexes in most Unix tools and scripting languages. It does, however, require an understanding of regular expressions.
Michael Wrzaczek, michael.wrzaczek@helsinki.fi