Introduction to regular expressions

Table of Contents

- Introduction to regular expressions
- Here's how we do it
- Iteration 1: skill level > Wollowitz
- Iteration 2: skill level > Rakesh

Introduction to regular expressions

What are regular expressions?

Regular expressions describe a chunk of text with certain properties.

Why is that useful, and where is it useful?

- Searching substrings
- Searching and replacing, scientist style
- Concise text manipulation programs
- Find duplicate addresses in address books and delete one entry
- Switch first and last names in text
- Manipulate tabulated data
- Manipulate email addresses
- Extract biological data from a table

Here's how we do it

- 4 iterations, 4 exercises
- 4 models, from Rutherford to pretty nice
- Solutions at 4.15 and 5.30

DISCLAIMER: don't take slides from the first iterations as the full truth. They just provide models to help you understand.

Iteration 1: skill level > Wollowitz

Building blocks

Literal text

In the most basic case, the regex searches for literal text:

    import re

    pattern = 'ACAC'
    string = 'TACAGACACGAC'
    match = re.search(pattern, string)
    # finds: ACAC

Character classes

Often, we want to look for a set of characters instead of a single literal character.

The dot

The dot character matches everything except the newline character.

    pattern = r'AC.C'
    string = 'TACTCACACGAC'
    # finds: ACTC

Standard sets

Standard sets describe categories of characters:

    \w    alphanumeric characters and underscore: a-z, A-Z, 0-9 and _
    \d    decimal digits: 0-9
    \s    whitespace: space, \t, \n etc.

Complements to the standard sets

\W, \D and \S mean everything except \w, \d or \s, respectively.

    pattern = r'\w\W'
    string = 'Hello World'
    # finds: 'o '

Character ranges

Instead of the standard sets, you may use custom sets by including these elements in brackets []:

1. Literal characters: [abc]
2. Standard sets: [\w\d]
3. Ranges: [a-e]

4. Complement: [^\w]

    beginning_of_headline = r'[A-E]\) '
    headline1 = 'A) Regexes are useful'   # matches
    headline2 = 'F) Regexes are fun'      # does not match

Searching with re.search

re.search(pattern, string)

- starts looking for pattern at the beginning of string
- goes through all positions in the string until a match is found

re.search returns

- a match object if a match was found
- None otherwise

We will talk more about the match object later. Key point for now: it is truthy.

The regex engine, 1/10

Text based and regex based engines

There are two different algorithmic approaches to regular expression searches:

1. Text based engine (DFA)
2. Regex based engine (NFA)

Here, we are only concerned with regex based engines. These engines are used in Java, Perl, Python, R etc., so this is likely what you will encounter most of the time.

The regex engine is eager

To find its match, the regex engine follows this basic algorithm:

1. Start at position 0 (beginning of the string)
2. Try every possible way to match the pattern from this position
3. As soon as a complete match is found: end the search and return the match
4. If no match was found: go to the next position and repeat from step 2

Incredibly important implications, 1/10

1. One of the leftmost matches wins
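The four steps of the eager search can be sketched in a few lines of Python. This is only an illustration of the model, not how the re module is implemented. The helper name eager_search is made up for this sketch. It relies on Pattern.match(string, pos), which succeeds only if a match begins exactly at position pos.

```python
import re


def eager_search(pattern, string):
    """Mimic the four steps above: try each start position in turn."""
    compiled = re.compile(pattern)
    for start in range(len(string) + 1):
        # Pattern.match only succeeds if a match begins exactly at `start`.
        match = compiled.match(string, start)
        if match:
            return match  # step 3: stop at the first success
    return None  # step 4 ran out of positions: no match


string = 'TACAGACACGAC'
sketch = eager_search(r'AC.C', string)
builtin = re.search(r'AC.C', string)
print(sketch.group(0), sketch.span(), builtin.span())

# A match object is truthy and None is falsy, so the result can be used
# directly in a condition.
headlines = ['A) Regexes are useful', 'F) Regexes are fun']
lettered = [h for h in headlines if re.search(r'[A-E]\) ', h)]
print(lettered)
```

Both calls report the same span, because the leftmost position at which the pattern can match wins.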

Quantifiers

Quantifiers specify how often a regex token may appear.

m times

To specify that a token has to appear m times:

    pattern = r'.{3}b'

Between m and n times

To specify that a token may appear between m and n times:

    pattern = r'.{3,5}b'

Shortcuts

    {,}      *
    {1,}     +
    {0,1}    ?

The regex engine, 2/10

By default, the regex engine is greedy

Quantifiers are greedy by default: they try to match as much of the text as possible.

    pattern = r'.*cat'
    string = 'my cat is a really fat cat'
    # matches: 'my cat is a really fat cat'

The regex engine uses backtracking to try out all possible ways to match a pattern

This was explained on the blackboard. Here are the main points for your reference:

- the regex engine keeps track of two positions
  - the current token in the regex
  - the current position in the string
- the engine works through all tokens of the regex step by step
- the position in the string is updated as required by matching of the tokens
- whenever the regex engine can do more than one thing, it will keep track of its decisions

- if a later token in the regex can't be matched on the current matching 'path', the engine goes back to the last branching point in the path and takes an alternative decision
- this algorithm is followed until
  - the first match is found: the engine stops as soon as a successful match is found, independent of whether more and perhaps longer matches could be found by continuing the search
  - all possible ways to match a regex have been tried without success: no match is found

Iteration 2: skill level > Rakesh

Alternatives

To allow the engine to select between alternatives, combine them with |:

    pattern = r'(Howard|Rakesh|Sheldon|Leonard) was here'
    string = 'Rakesh was here'

The regex engine, 3/10

Alternatives are tried from left to right

Implications:

1. The first viable alternative is taken
2. The alternatives operator is not greedy

Incredibly important consequences of the algorithm

1. One of the leftmost matches wins
2. The first viable alternative is taken, even if a longer alternative would also match

Substitution

    re.sub(pattern, replacement, string)
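The behaviour described in this iteration can be checked with a short Python sketch. The sample strings 'aaaab' and 'category' and the replacement text are made up for illustration; the other patterns come from the slides above.

```python
import re

# Greedy quantifier: .* first grabs the whole line, then backtracks just far
# enough for 'cat' to match, so the match ends at the LAST 'cat'.
text = 'my cat is a really fat cat'
greedy = re.search(r'.*cat', text).group(0)

# Counted quantifier: exactly three arbitrary characters, then 'b'.
# Position 0 fails ('aaa' is followed by 'a'), position 1 succeeds.
counted = re.search(r'.{3}b', 'aaaab').group(0)

# Alternatives are tried left to right; the first viable one wins,
# even though the longer 'category' would also match here.
first_viable = re.search(r'cat|category', 'category').group(0)

# Substitution: replace the whole match with a fixed string.
names = r'(Howard|Rakesh|Sheldon|Leonard) was here'
replaced = re.sub(names, 'Nobody was here', 'Rakesh was here')

print(greedy)        # my cat is a really fat cat
print(counted)       # aaab
print(first_viable)  # cat
print(replaced)      # Nobody was here
```

If the longer alternative should win, one option is to list it first, as in r'category|cat'.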

More building blocks

Capturing groups

Standard capturing groups

To capture and reuse parts of a match, put the regex tokens in parentheses:

    get_day_from_date = r'\w+ (\d+)'
    date = 'May 15'
    # 15 is captured

Anchors

    ^     beginning of the string
    $     end of the string
    \b    word boundary: between \w and \W, or between \w and the start or end of the string

Reusing captured content

In the same pattern: \N

    get_double_day_error = r'\w+ (\d\d)\1'
    date = 'May 15'    # nothing matched, this one is ok
    date = 'May 1515'  # date is matched, this one is not ok

In substitutions: \N

    string = 'The protein BNIP3... BNIP-3.. bnip three...'
    pattern = r'bnip?-?(3|three)'
    replacement = r'bnup \1'

Through the match objects

Return the content of all captured groups:

    m.groups()      # all captured groups as a tuple
    m.group(0)      # the entire match
    m.group(1,2)    # groups 1 and 2 as a tuple
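A minimal sketch of the three ways to reuse captured content, assuming Python 3. The extra group around the month and the sample name 'Sheldon Cooper' are additions made for this illustration. The patterns are written as raw strings so that \1 reaches the regex engine as a backreference rather than as a control character.

```python
import re

# Through the match object: two groups, month and day.
m = re.search(r'(\w+) (\d+)', 'May 15')
whole = m.group(0)          # the entire match
parts = m.groups()          # all captured groups as a tuple
same_parts = m.group(1, 2)  # selected groups as a tuple

# In the same pattern: \1 must repeat exactly what group 1 captured.
double_day = r'\w+ (\d\d)\1'
ok_date = re.search(double_day, 'May 15')     # no repeated day: None
bad_date = re.search(double_day, 'May 1515')  # repeated day: match object

# In substitutions: swap first and last name with \2 and \1.
swapped = re.sub(r'(\w+) (\w+)', r'\2, \1', 'Sheldon Cooper')

print(whole, parts, same_parts)
print(ok_date, bad_date.group(0))
print(swapped)
```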

If you want to learn more

https://docs.python.org/3.5/howto/regex.html

Author: Stephen Kraemer
Created: 2015-11-25 Wed 07:50