Introduction to regular expressions

What are regular expressions

Regular expressions describe a chunk of text with certain properties.

Why is that useful, and where is it useful?

- Searching substrings
- Searching and replacing, scientist style
- Concise text manipulation programs
    - Find duplicate addresses in address books and delete one entry
    - Switch first and last names in text
    - Manipulate tabulated data
    - Manipulate email addresses
    - Extract biological data from a table

Here's how we do it

- 4 iterations, 4 exercises
- 4 models, from Rutherford to pretty nice
- Solutions at 4.15 and 5.30

DISCLAIMER: don't take slides from the first iterations as the full truth. They just provide models to help you understand.

Iteration 1: skill level > Wollowitz

Building blocks

Literal text

In the most basic case, the regex searches for literal text:

    import re

    pattern = 'ACAC'
    string = 'TACAGACACGAC'
    match = re.search(pattern, string)
    # finds: ACAC

Character classes

Often, we want to look for a set of characters instead of a single literal character.

The dot

The dot character matches everything except the newline character.

    pattern = r'AC.C'
    string = 'TACTCACACGAC'
    # finds: ACTC

Standard sets

Standard sets describe categories of characters:

    \w   alphanumeric characters and underscore: a-z, A-Z, 0-9 and _
    \d   decimal digits: 0-9
    \s   whitespace: space, \t, \n etc.

Complements to the standard sets

\W, \D and \S mean everything except \w, \d or \s respectively.

    pattern = r'\w\W'
    string = 'Hello World'
    # finds: 'o '

Character ranges

Instead of the standard sets, you may use custom sets by including these elements in brackets []:

1. Literal characters: [abc]
2. Standard set: [\w\d]
3. Ranges: [a-e]

4. Complement: [^\w]

    beginning_of_headline = r'[A-E]\) '   # the closing parenthesis must be escaped
    headline1 = 'A) Regexes are useful'   # matches
    headline2 = 'F) Regexes are fun'      # does not match: F is outside A-E

Searching with re.search

re.search(pattern, string)

- starts looking for pattern at the beginning of string
- goes through all positions in the string until a match is found

re.search returns

- a match object if a match was found
- None otherwise

We will talk more about the match object later. Key point for now: it is truthy.

The regex engine, 1/10

Text based and regex based engines

There are two different algorithmic approaches to deal with regular expression searches:

1. Text based engine (DFA)
2. Regex based engine (NFA)

Here, we are only concerned with regex based engines. These engines are used in Java, Perl, Python, R etc., so this is likely what you will encounter most of the time.

The regex engine is eager

To find its match, the regex engine follows this basic algorithm:

1. Start at position 0 (beginning of the string)
2. Try every possible way to match the pattern from this position
3. As soon as a complete match is found: end the search and return the match
4. If no match was found: go to the next position and repeat from step 2

Incredibly important implications 1/10

1. One of the leftmost matches wins

Quantifiers

Quantifiers specify how often a regex token may appear.

m times

To specify that a token has to appear m times:

    pattern = r'.{3}b'

Between m and n times

To specify that a token may appear between m and n times:

    pattern = r'.{3,5}b'

Shortcuts

    {0,}    *
    {1,}    +
    {0,1}   ?

The regex engine, 2/10

By default the regex engine is greedy

The default quantifiers are greedy. They try to match as much of the text as possible.

    pattern = r'.*cat'
    string = 'my cat is a really fat cat'
    # matches: 'my cat is a really fat cat'

The regex engine uses backtracking to try out all possible ways to match a pattern

This was explained on the blackboard. Here are the main points for your reference:

- the regex engine keeps track of two positions
    - the current token in the regex
    - the current position in the string
- the engine works through all tokens of the regex step by step
- the position in the string is updated as required by matching of the tokens
- whenever the regex engine can do more than one thing, it will keep track of its decisions

- if a later token in the regex can't be matched on the current matching 'path', the engine goes back to the last branching point in the path and takes an alternative decision
- this algorithm is followed until
    - the first match is found: the engine stops as soon as a successful match is found, independent of whether more and perhaps longer matches could be found by continuing the search
    - all possible ways to match a regex have been tried without success: no match is found

Iteration 2: skill level > Rakesh

Alternatives

To allow the engine to select between alternatives, combine them with |

    pattern = r'(Howard|Rakesh|Sheldon|Leonard) was here'
    string = 'Rakesh was here'

The regex engine, 3/10

Alternatives are tried from left to right

Implications:

1. The first viable alternative is taken
2. The alternatives operator is not greedy

Incredibly important consequences of the algorithm

1. One of the leftmost matches wins
2. The first viable alternative is taken, even if a longer alternative would also match

Substitution

    re.sub(pattern, replacement, string)

More building blocks

Capturing groups

Standard capturing groups

To capture and reuse parts of a match, put the regex tokens in parentheses.

    get_day_from_date = r'\w+ (\d+)'
    date = 'May 15'
    # 15 is captured

Anchors

    ^    beginning of the string
    $    end of the string
    \b   word boundary: between \w and \W, or between \w and the start or end of the string

Reusing captured content

In the same pattern: \N

    get_double_day_error = r'\w+ (\d\d)\1'   # raw string, otherwise '\1' is read as a control character
    date = 'May 15'     # nothing matched, this one is ok
    date = 'May 1515'   # date is matched, this one is not ok

In substitutions: \N

    string = 'The protein BNIP3... BNIP-3.. bnip three...'
    pattern = r'bnip ?-?(3|three)'   # needs re.IGNORECASE to match the upper-case spellings
    replacement = r'bnup \1'

Through the match object

    m.groups()     # the content of all captured groups, as a tuple
    m.group(0)     # the entire match
    m.group(1,2)   # the content of groups 1 and 2, as a tuple
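The capturing-group slides above can be tried out with a short sketch like the following. The date patterns are taken from the slides and written as raw strings. The name-swapping pattern and the sample name are illustrative assumptions. The BNIP3 pattern is also an assumed reading: the optional space and the re.IGNORECASE flag are added here so that all three spellings in the sample string match.

```python
import re

# Pattern from the slides: one word, a space, then the captured day number.
GET_DAY_FROM_DATE = r'\w+ (\d+)'
m = re.search(GET_DAY_FROM_DATE, 'May 15')
whole_match = m.group(0)   # group 0 is always the entire match
day = m.group(1)           # content of the first pair of parentheses
all_groups = m.groups()    # tuple holding every captured group

# Backreference inside the pattern: \1 must repeat what group 1 captured.
GET_DOUBLE_DAY_ERROR = r'\w+ (\d\d)\1'
ok_date = re.search(GET_DOUBLE_DAY_ERROR, 'May 15')      # no doubled day
bad_date = re.search(GET_DOUBLE_DAY_ERROR, 'May 1515')   # doubled day

# Backreferences in the replacement: swap two captured words.
# Pattern and sample name are assumptions made for this example.
SWAP_NAMES = r'^(\w+) (\w+)$'
name_parts = re.search(SWAP_NAMES, 'Ada Lovelace').group(1, 2)
swapped = re.sub(SWAP_NAMES, r'\2, \1', 'Ada Lovelace')

# Assumed reading of the BNIP3 slide: optional space, optional hyphen,
# case-insensitive matching.
text = 'The protein BNIP3... BNIP-3.. bnip three...'
replaced = re.sub(r'bnip ?-?(3|three)', r'bnup \1', text, flags=re.IGNORECASE)

print(whole_match, day, all_groups)   # May 15 15 ('15',)
print(ok_date, bad_date.group(0))     # None May 1515
print(name_parts, swapped)            # ('Ada', 'Lovelace') Lovelace, Ada
print(replaced)                       # The protein bnup 3... bnup 3.. bnup three...
```

Writing the patterns as raw strings matters here: in a normal string literal, '\1' is a control character rather than a backreference.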

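Looking back at the engine rules from iterations 1 and 2 (one of the leftmost matches wins, quantifiers are greedy, the first viable alternative is taken), a small sketch may help to check them at the Python prompt. It uses an upper-case version of the dot pattern and the cat sentence from the slides. The 'cat|catfish' alternation and its sample word are assumptions made for illustration.

```python
import re

# Rule 1: the engine is eager, so re.search reports only the leftmost match,
# although a second match exists further to the right.
dna = 'TACTCACACGAC'
first_hit = re.search(r'AC.C', dna)
leftmost = (first_hit.group(0), first_hit.start())
all_hits = re.findall(r'AC.C', dna)   # every non-overlapping match, left to right

# Rule 2: quantifiers are greedy. The .* first runs to the end of the string
# and then backtracks only as far as needed for 'cat' to match.
greedy = re.search(r'.*cat', 'my cat is a really fat cat').group(0)

# Rule 3: alternatives are tried from left to right and the first viable
# one is taken, even if a longer alternative would also match.
short_first = re.search(r'cat|catfish', 'catfish').group(0)
long_first = re.search(r'catfish|cat', 'catfish').group(0)

print(leftmost)      # ('ACTC', 1)
print(all_hits)      # ['ACTC', 'ACAC']
print(greedy)        # my cat is a really fat cat
print(short_first)   # cat
print(long_first)    # catfish
```

Putting the longer alternative first is the usual way to make the engine prefer it, since the order of the alternatives decides which one is tried first.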
If you want to learn more

https://docs.python.org/3.5/howto/regex.html

Author: Stephen Kraemer
Created: 2015-11-25 Wed 07:50