Regular Expressions in Practice

University of Kentucky UKnowledge
Library Presentations, University of Kentucky Libraries, 12-20-2016
Regular Expressions in Practice
Kathryn Lybarger, University of Kentucky, kathryn.lybarger@uky.edu
Part of the Cataloging and Metadata Commons
Repository Citation: Lybarger, Kathryn, "Regular Expressions in Practice" (2016). Library Presentations. 157. https://uknowledge.uky.edu/libraries_present/157
This Presentation is brought to you for free and open access by the University of Kentucky Libraries at UKnowledge. It has been accepted for inclusion in Library Presentations by an authorized administrator of UKnowledge. For more information, please contact UKnowledge@lsv.uky.edu.

Regular expressions in practice Kathryn Lybarger Dec. 20, 2016 University of Kentucky #Mashcat

Audience participation! Chat Make sure you can get to your chat window Regex Bingo Get your individual card! http://www.zemkat.org/mashcat/regex.php

What are regular expressions? ("regex" or "regexp") Powerful search and replace. Rather than just searching for specific things, you can search for kinds of things. Symbols define a search pattern (similar to "wildcards"). This pattern may match your data. Data may be modified or rearranged based on that matching.

Examples: Regex for finding data Find all the lines that start with 2016 Find all 10- or 13-digit numbers Find all the fields like Includes biblio something something... references. (but not bibliographical )
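As a rough illustration of the first two searches, here is a minimal Python sketch using the standard `re` module. The sample lines are invented for the example, and the `\b` word boundaries and `{n}` repeat counts are assumptions of this sketch, since the slides do not cover them.

```python
import re

# Invented sample data, one entry per line.
lines = [
    "2016-12-20 Mashcat webinar",
    "Published in 2016",
    "ISBN 9780440414803",
    "ISBN 0440414806",
]

# ^ anchors the pattern to the start of the line.
starts_with_2016 = [ln for ln in lines if re.search(r"^2016", ln)]

# Exactly 13 or exactly 10 digits; \b stops a match inside a longer number.
digit_pattern = re.compile(r"\b(?:\d{13}|\d{10})\b")
numbers = [hit for ln in lines for hit in digit_pattern.findall(ln)]

print(starts_with_2016)  # ['2016-12-20 Mashcat webinar']
print(numbers)           # ['9780440414803', '0440414806']
```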

Use case: Links per record? Large MARC file of ebook records. How many links in each record? Do some have more than one? Or none? Is having the same number of 001 and 856 enough?
001 eve0483837
856 $u http://www.evender.com/
001 eve0483839
856 $u http://www.evender.com/
856 $u http://www.evender.com/
001 eve0483834
001 eve0483845
856 $u http://www.evender.com/
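One possible way to answer the question is to count 856 lines per record rather than overall. The Python sketch below assumes the records have been dumped as plain text with one field per line, exactly as on the slide; a real MARC file would first need converting to such a listing.

```python
import re

# The listing from the slide, assumed to be one field per line.
listing = """\
001 eve0483837
856 $u http://www.evender.com/
001 eve0483839
856 $u http://www.evender.com/
856 $u http://www.evender.com/
001 eve0483834
001 eve0483845
856 $u http://www.evender.com/
"""

links_per_record = {}
current_id = None
for line in listing.splitlines():
    start = re.search(r"^001 (.+)$", line)
    if start:
        # A 001 field begins a new record.
        current_id = start.group(1)
        links_per_record[current_id] = 0
    elif current_id is not None and re.search(r"^856 ", line):
        links_per_record[current_id] += 1

print(links_per_record)
# {'eve0483837': 1, 'eve0483839': 2, 'eve0483834': 0, 'eve0483845': 1}
```

There are four 001 fields and four 856 fields in total, yet one record has two links and one has none, so matching totals alone are not enough.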

Examples: Regex for modifying data. Reformat all phone numbers from ###-###-#### to (###) ###-####. Re-order names from "LastName, FirstName" to "FirstName LastName", handling Jr. or Sr. correctly.
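The phone number case might look like this in Python. This is only a sketch: the phone numbers are made up, and the `{3}` and `{4}` repeat counts are an assumption since the slides do not cover them.

```python
import re

# Invented sample text with two numbers in ###-###-#### form.
text = "Call the cataloging desk at 859-555-0100 or 859-555-0199."

# Each parenthesized group captures one block of digits;
# \1, \2 and \3 put the blocks back in the new layout.
reformatted = re.sub(r"(\d{3})-(\d{3})-(\d{4})", r"(\1) \2-\3", text)

print(reformatted)
# Call the cataloging desk at (859) 555-0100 or (859) 555-0199.
```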

Use Case: What to catalog? https://xkcd.com/208/

Use case: Digital Library software < /> Regex XML

Common software for cataloging / programming has regex support: MarcEdit; vendor-specific tools (Voyager's Global Data Change, etc.); Google Sheets; text/XML editors (vim, Sublime, EditPad, Oxygen); programming languages; databases like Oracle, MySQL; command line tools (grep, sed, awk).

sed -e 's/now/meow/' file1 > file2

A nice learning tool: regex101.com

There are different flavors of regex. Which one you get depends on the software or language you're using (e.g. a language with PCRE support, the vim text editor). Different features are available, with slightly different syntax to invoke some of them. BUT there is a common core. Once you know this, you can look up details for your specific application.

Structured text? Biko, S. B., BIKO SPEAKS: A SELECTION OF HIS WRITINGS, Johannesburg, Bayakha Books, 1997, 30 pp. G (2 copies)

Forward slashes (/) are often used to indicate a regex (regular expression): /cat/

Forward slashes (/) are often used to indicate a regex: /cat/ What does this pattern match?

Literal characters: the simplest regex! /cat/ Matches: cat, cataloging, scatter, wildcat

Literal characters: the simplest regex! /cat/ Does NOT match: cart, CAT, act, c a t

Unless specified otherwise: Matching is case-sensitive. To match, the capitalization in your data must match the capitalization in your pattern. Matching only part of the data is fine. It doesn't have to be the whole word or line.
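Both defaults can be checked with a short Python sketch. Python is only one flavor; here `re.search` finds a match anywhere in the string, and the `re.IGNORECASE` flag is how this particular flavor "specifies otherwise".

```python
import re

words = ["cat", "cataloging", "scatter", "wildcat", "cart", "CAT", "act", "c a t"]

# Default behavior: case-sensitive, and a partial match is enough.
matches = [w for w in words if re.search(r"cat", w)]

# Python's way of turning off case sensitivity.
insensitive = [w for w in words if re.search(r"cat", w, re.IGNORECASE)]

print(matches)      # ['cat', 'cataloging', 'scatter', 'wildcat']
print(insensitive)  # ['cat', 'cataloging', 'scatter', 'wildcat', 'CAT']
```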

What matches this?

Regex Bingo! I will put a regex search pattern on the slide Search the indicated column (like B, I, N, G, O) does the regex match any of those strings? If so, mark that space off! If you get five in a line, shout BINGO! (in chat)

Meta-characters: \ ^ $ | ( ) . [ ] ? * + { } These are what give regular expressions their power. If the word or phrase you're searching for has punctuation, it may not match as you intend.

But what if I want to search for those?! You can escape such characters by preceding them with a backslash. For example, if you wanted to search for a literal caret, include this in your regex: \^
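A small Python sketch of escaping, using a shortened version of the bibliography entry shown earlier. `re.escape` is a Python helper that adds the backslashes for you; other tools may have no equivalent.

```python
import re

entry = "BIKO SPEAKS: A SELECTION OF HIS WRITINGS, 1997, 30 pp. G (2 copies)"

# Unescaped parentheses form a group, so only the text inside them is searched for.
unescaped = re.search(r"(2 copies)", entry).group()

# Escaped parentheses are treated as literal characters.
escaped = re.search(r"\(2 copies\)", entry).group()

# re.escape() builds the escaped pattern from a literal string.
auto = re.search(re.escape("(2 copies)"), entry).group()

# A literal caret, as on the slide.
caret_hits = re.findall(r"\^", "2^10 = 1024")

print(unescaped)   # 2 copies
print(escaped)     # (2 copies)
print(auto)        # (2 copies)
print(caret_hits)  # ['^']
```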

Caret ^ Start of line anchor /^dog/ Matches: dog, dogfood, dog house, dogma

Caret ^ Start of line anchor /^dog/ Does NOT match: bulldog, What a cute dog!, dgo, dooooog

Dollar sign $ End of line anchor /ugh$/ Matches: ugh, Tough, dough, laugh

Dollar sign $ End of line anchor /ugh$/ Does NOT match: taught, ugh that's gross, UGH, upright
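What counts as a "line" is one of the details that varies by flavor. In Python, for instance, ^ and $ anchor to the whole string unless the `re.MULTILINE` flag is given. The sketch below uses invented three-line text to show the difference.

```python
import re

text = "dog house\nbulldog\ndogma"

# Without a flag, ^ only matches at the very start of the string.
whole_string_hits = re.findall(r"^dog", text)

# With re.MULTILINE, ^ matches at the start of every line.
per_line_hits = re.findall(r"^dog", text, flags=re.MULTILINE)

# Alternatively, split into lines first and test each one.
matching_lines = [ln for ln in text.splitlines() if re.search(r"^dog", ln)]

print(whole_string_hits)  # ['dog']
print(per_line_hits)      # ['dog', 'dog']
print(matching_lines)     # ['dog house', 'dogma']
```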

Vertical bar | This or that? /ab|c/ Matches: abdicate, cabaret, laboratory, Jack

Vertical bar | This or that? /ab|c/ Does NOT match: rad, blue, band, Canada

Dot (period) . It matches any character /m.t.e/ Matches: matte, mother, permitted, tomatoes

Dot (period) . It matches any character /m.t.e/ Does NOT match: smote, tempted, mothmen, Mister
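A short Python sketch of the dot. Two caveats apply here: to match an actual period the dot has to be escaped, and in Python the dot does not match a newline unless the `re.DOTALL` flag is set. The "pp." sample text is invented.

```python
import re

# The dots each stand in for exactly one character.
matched_text = re.search(r"m.t.e", "tomatoes").group()
no_match = re.search(r"m.t.e", "smote")

# An unescaped dot also matches the "m" in "ppm".
loose = re.findall(r"pp.", "30 pp. and 12 ppm")
strict = re.findall(r"pp\.", "30 pp. and 12 ppm")

# By default the dot does not match a newline in Python.
across_newline = re.search(r"a.c", "a\nc")
with_dotall = re.search(r"a.c", "a\nc", flags=re.DOTALL)

print(matched_text)  # matoe
print(no_match)      # None
print(loose)         # ['pp.', 'ppm']
print(strict)        # ['pp.']
```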

Character classes Square brackets [ ] This part of the expression only matches one character Matches any character that appears in the brackets (but no others)

Character classes Square brackets [ ] /m[ai]n/ Matches: man, mine, woman, swimming

Character classes Square brackets [ ] /m[ai]n/ Does NOT match: main, chimney, women, am not

Character classes: some shorthand. [a-z] instead of [abcdefghijklmnopqrstuvwxyz]. [a-zA-Z] for any letter. \d instead of [0123456789]. \s for any whitespace like <space>, <tab>.

Character classes (excluding) - caret in square brackets [^ ] You can use similar notation to match any character EXCEPT what you specify. This is the same caret character as used in the left anchor, but there should be no confusion: it only has this meaning as the first character inside square brackets.

Character classes (excluding) - caret in square brackets [^ ] /o[^ao]r/ Matches: your, autograph, cobra, ourselves

Character classes (excluding) - caret in square brackets [^ ] /o[^ao]r/ Does NOT match: oar, poor, or, coder
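The character class slides can be replayed in Python as below. The helper name `matching` is made up for this sketch, and the shorthand examples use text borrowed from elsewhere in the slides.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that contain a match for the pattern."""
    return [c for c in candidates if re.search(pattern, c)]

include = matching(r"m[ai]n", ["man", "mine", "woman", "swimming",
                               "main", "chimney", "women", "am not"])
exclude = matching(r"o[^ao]r", ["your", "autograph", "cobra", "ourselves",
                                "oar", "poor", "or", "coder"])

# Shorthand classes: \d is any digit, [A-Z] any capital, \s any whitespace.
year = re.findall(r"\d\d\d\d", "New York : Random House, 2015.")
capitals = re.findall(r"[A-Z]", "Holes / Louis Sachar.")
no_spaces = re.sub(r"\s", "_", "a b\tc")

print(include)    # ['man', 'mine', 'woman', 'swimming']
print(exclude)    # ['your', 'autograph', 'cobra', 'ourselves']
print(year)       # ['2015']
print(capitals)   # ['H', 'L', 'S']
print(no_spaces)  # a_b_c
```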

Quantifiers: How many of a thing? These most recent patterns we've looked at have each matched exactly ONE character. But what if it is optional? Or you can / must have more than one? These metacharacters directly follow patterns to specify such things.

Question mark ? (zero or one time) Thing before it is optional /do?r/ Matches: ardor, dorsal, drone, Andrew

Question mark? (zero or one time) Thing before it is optional /do?r/ Does NOT match: door dor and Lando

Star (asterisk) * (zero, one, or more!) Thing before is optional, repeatable /il*e/ Matches: pumpkin pie, bibliophile, Miller, gillless

Star (asterisk) * (zero, one, or more!) Thing before is optional, repeatable /il*e/ Does NOT match: fills, fire, dinner, tilted

Plus sign + (one or more times) Thing before is required, repeatable /an+/ Matches: an, annotate, manner, annnnnnnnn

Plus sign + (one or more times) Thing before is required, repeatable /an+/ Does NOT match: apple, minnow, Andover, nasty
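To see which part of the data each quantifier actually consumes, it can help to print the matched text. A minimal Python sketch using words from the slides:

```python
import re

# ? : the "o" is optional, so both "dr" and "dor" match, but not "door".
optional_absent = re.search(r"do?r", "Andrew").group()
optional_present = re.search(r"do?r", "ardor").group()
too_many = re.search(r"do?r", "door")

# * : zero or more "l" between "i" and "e".
star_none = re.search(r"il*e", "pumpkin pie").group()
star_many = re.search(r"il*e", "gillless").group()

# + : at least one "n" after "a".
plus_two = re.search(r"an+", "manner").group()
plus_missing = re.search(r"an+", "apple")

print(optional_absent, optional_present, too_many)  # dr dor None
print(star_none, star_many)                         # ie illle
print(plus_two, plus_missing)                       # ann None
```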

All of these can be used together (What do these match?) /^(dog|raccoon)?fo+d$/ /[A-Z][a-z]* [A-Z][a-z]+/
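One way to explore the two combined patterns is to run them against a few test strings. The candidate strings below are invented for this Python sketch.

```python
import re

# Whole line must be an optional "dog" or "raccoon", then "f", one or more "o", then "d".
food = re.compile(r"^(dog|raccoon)?fo+d$")
candidates = ["food", "dogfood", "raccoonfood", "fod",
              "catfood", "dog food", "foods"]
food_hits = [c for c in candidates if food.search(c)]

# A capital with optional lowercase letters, a space, then a capitalized word.
name = re.compile(r"[A-Z][a-z]* [A-Z][a-z]+")
found_name = name.search("Dr. Martha Jones wrote it").group()
lowercase_name = name.search("martha jones")

print(food_hits)       # ['food', 'dogfood', 'raccoonfood', 'fod']
print(found_name)      # Martha Jones
print(lowercase_name)  # None
```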

Going back to this example: /cat/ Matches: cat, cataloging, scatter, wildcat

Search and replace Search for: /cat/ Replace with: fuzzy beast Data: My cats are the best. Results: My fuzzy beasts are the best.

Search and replace Flexibility for searching Search for: /(cat|dog|horse)/ Replace with: fuzzy beast Data: My horses are the best. Results: My fuzzy beasts are the best.
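Both search-and-replace slides can be reproduced with Python's `re.sub`. The last line of the sketch is an extra caution that is not on the slides: because a partial match is enough, the replacement also fires inside longer words.

```python
import re

simple = re.sub(r"cat", "fuzzy beast", "My cats are the best.")
flexible = re.sub(r"(cat|dog|horse)", "fuzzy beast", "My horses are the best.")

# Partial matches get replaced too, which is not always what you want.
oops = re.sub(r"cat", "fuzzy beast", "I like to catalog.")

print(simple)    # My fuzzy beasts are the best.
print(flexible)  # My fuzzy beasts are the best.
print(oops)      # I like to fuzzy beastalog.
```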

Parentheses ( ) Capturing patterns As you match patterns, you can capture parts of your data and rearrange them In the replacement pattern, refer to these parts by their back references like \1, \2, \3, in order of parentheses in the pattern

Simple example: Reformatting names My data: Jones, Martha Search for: /([A-Za-z]+), ([A-Za-z]+)/ This sets: \1 = Jones \2 = Martha Replace with: \2 \1 Result: Martha Jones
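The same name swap, sketched in Python. Python happens to use the same \1 and \2 notation in the replacement string; some other tools write back references as $1 and $2 instead.

```python
import re

pattern = r"([A-Za-z]+), ([A-Za-z]+)"
data = "Jones, Martha"

# Inspect what each pair of parentheses captured.
captured = re.search(pattern, data).groups()

# Swap the two captured parts.
result = re.sub(pattern, r"\2 \1", data)

print(captured)  # ('Jones', 'Martha')
print(result)    # Martha Jones
```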

What does this match? /(.*) (.*)/

What does this do? My data: James Earl Jones Search for: /(.*) (.*)/ This sets: \1 = ? \2 = ? Replace with: \2, \1 Result: ?

Regular expressions are greedy. That first capture group (.*) has two choices: capture "James" and leave "Earl Jones" (or just "Earl") for the second, or capture "James Earl" and leave "Jones" for the second. Greedy means it will capture as much as it can.

Results: My data: James Earl Jones Search for: /(.*) (.*)/ This sets: \1 = James Earl \2 = Jones Replace with: \2, \1 Result: Jones, James Earl
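The greedy result can be confirmed in Python. For contrast, the sketch also shows the non-greedy form `.*?`, which is not covered in these slides and is not available in every flavor, so treat it as an aside.

```python
import re

data = "James Earl Jones"

# Greedy: the first group takes as much as it can.
greedy_groups = re.search(r"(.*) (.*)", data).groups()
greedy_result = re.sub(r"(.*) (.*)", r"\2, \1", data)

# Non-greedy (aside): the first group takes as little as it can.
lazy_groups = re.search(r"(.*?) (.*)", data).groups()

print(greedy_groups)  # ('James Earl', 'Jones')
print(greedy_result)  # Jones, James Earl
print(lazy_groups)    # ('James', 'Earl Jones')
```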

Think about this: How to remove all GMD? Data: (lots of records like this)
=100 1_ $a Sachar, Louis, $e author.
=245 00 $a Holes $h [electronic resource] / $c Louis Sachar.
=264 _1 $a New York : $b Random House, $c 2015.
Goal: Remove 245 $h
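One possible answer, sketched in Python. It assumes the records are plain text laid out exactly as on the slide, including the spaces around the subfield codes, and that the $h text always sits in square brackets; real exports may be spaced differently, so the pattern would need adjusting and testing on a copy of the data first.

```python
import re

record = """\
=100 1_ $a Sachar, Louis, $e author.
=245 00 $a Holes $h [electronic resource] / $c Louis Sachar.
=264 _1 $a New York : $b Random House, $c 2015."""

# Only touch lines that start with =245. Capture everything before $h,
# then drop "$h [ ... ]" and one following space, if present.
cleaned = re.sub(r"^(=245 .*)\$h \[[^\]]*\] ?", r"\1", record, flags=re.MULTILINE)

print(cleaned.splitlines()[1])
# =245 00 $a Holes / $c Louis Sachar.
```

The pattern combines several pieces from the slides: the ^ anchor, escaped literal characters (\$ and \[), an excluding character class ([^\]] for "anything but a closing bracket"), and a back reference in the replacement.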

References
Regex101: http://www.regex101.com/
Regular-Expressions.info: http://www.regular-expressions.info/
Documentation for your particular software / application: where to use regex, which features are available, how to invoke them
(Cat pictures from the @EmrgencyKittens twitter)

Contact Me Email: Kathryn.Lybarger@uky.edu Twitter: @zemkat Tumblr: problem-cataloger.tumblr.com