Paolo Santinelli Sistemi e Reti. Regular expressions. Regular expressions aim to facilitate the solution of text manipulation problems

Regular expressions

Regular expressions:
- aim to facilitate the solution of text manipulation problems;
- are symbolic notations used to identify patterns in text;
- are supported by many command line tools;
- are supported by most programming languages.

grep (the name means "global regular expression print")

grep searches text files for occurrences of a specified regular expression and writes every line containing a match to standard output.

grep usage:

    grep [options] regex [file...]

where regex is a regular expression.

ITIS E. Fermi, Modena

grep options

Option  Description
------  -----------
-i      Ignore case. Do not distinguish between uppercase and lowercase characters. May also be specified --ignore-case.
-v      Invert match. Normally, grep prints lines that contain a match. This option causes grep to print every line that does not contain a match. May also be specified --invert-match.
-c      Print the number of matching lines (or non-matching lines if the -v option is also specified) instead of the lines themselves. May also be specified --count.
-l      Print the name of each file that contains a match instead of the lines themselves. May also be specified --files-with-matches.
-L      Like the -l option, but print only the names of files that do not contain matches. May also be specified --files-without-match.
-n      Prefix each matching line with the number of the line within the file. May also be specified --line-number.

grep options (continued)

Option  Description
------  -----------
-h      For multi-file searches, suppress the output of filenames. May also be specified --no-filename.

Some text files to search:

    paolo@ubuntu-server:~$ ls /bin > dirlist-bin.txt
    paolo@ubuntu-server:~$ ls /usr/bin > dirlist-usr-bin.txt
    paolo@ubuntu-server:~$ ls /sbin > dirlist-sbin.txt
    paolo@ubuntu-server:~$ ls /usr/sbin > dirlist-usr-sbin.txt
    paolo@ubuntu-server:~$ ls dirlist*.txt
    dirlist-bin.txt  dirlist-sbin.txt  dirlist-usr-sbin.txt  dirlist-usr-bin.txt

grep simple search

grep searches all of the listed files for the string bzip and finds two matches, both in the file dirlist-bin.txt:

    paolo@ubuntu-server:~$ grep bzip dirlist*.txt
    dirlist-bin.txt:bzip2
    dirlist-bin.txt:bzip2recover

-l option: print only the list of files that contain matches:

    paolo@ubuntu-server:~$ grep -l bzip dirlist*.txt
    dirlist-bin.txt

-L option: print the list of files that do not contain a match:

    paolo@ubuntu-server:~$ grep -L bzip dirlist*.txt
    dirlist-sbin.txt
    dirlist-usr-bin.txt
    dirlist-usr-sbin.txt
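The remaining options from the table can be tried without any files by feeding grep a few lines on standard input. This is only a sketch: the four sample lines and the variable names are made up for illustration, and a POSIX shell is assumed.

```shell
# Hypothetical sample input: four lines, one name per line.
sample=$(printf 'Zip\nzip\nunzip\ntar\n')

# -i: ignore case, so "Zip" matches as well as "zip" and "unzip".
ignore_case=$(printf '%s\n' "$sample" | grep -i 'zip')

# -v: print only the lines that do NOT contain "zip".
inverted=$(printf '%s\n' "$sample" | grep -v 'zip')

# -c: print the number of matching lines instead of the lines.
count=$(printf '%s\n' "$sample" | grep -c 'zip')

# -n: prefix each matching line with its line number.
numbered=$(printf '%s\n' "$sample" | grep -n 'zip')

printf '%s\n' "$ignore_case" "$inverted" "$count" "$numbered"
```

Note that -c counts lines, not individual occurrences: a line that contains the pattern twice is still counted once.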

Some Definitions

literal: A literal is any character we use in a search or matching expression. For example, to find ind in windows, ind is a literal string: each character plays a part in the search, and together they form the string we want to find.

metacharacter: A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example, the character ^ (circumflex or caret) is a metacharacter.

target string: The string that we will be searching, that is, the string in which we want to find our match or search pattern.

escape sequence: An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal. In a regular expression, an escape sequence involves placing the metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal. For example, to find (s) in the target string window(s) we use the search expression \(s\) (in extended regular expression syntax, for example with grep -E). To find \\file in the target string c:\\file we need the search expression \\\\file: each of the two \ characters we want to match literally is preceded by an escaping \.
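The escape sequence for (s) can be checked directly. The sketch below assumes a grep that supports -E and -F; the -F option (fixed strings) is not mentioned in the slides and is shown only as an alternative that avoids escaping altogether.

```shell
target='window(s)'

# With -E, a backslash turns ( and ) into literal characters.
escaped=$(printf '%s\n' "$target" | grep -E 'window\(s\)')

# Without the backslashes the parentheses only group, so the pattern
# means "windows". The target does not match and grep exits with 1.
grouped=$(printf '%s\n' "$target" | grep -E 'window(s)' || true)

# -F treats the whole pattern as a fixed string: nothing needs escaping.
fixed=$(printf '%s\n' "$target" | grep -F 'window(s)')

printf 'escaped=[%s] grouped=[%s] fixed=[%s]\n' "$escaped" "$grouped" "$fixed"
```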

Metacharacters

In addition to literals, regular expressions may also include metacharacters, which are used to specify more complex matches:

    ^ $ . [ ] { } - ? * + ( ) | \

The Any Character

The dot or period character (.) is used to match any character. If it is included in a regular expression, it will match any character in that character position. Here's an example:

    paolo@ubuntu-server:~$ grep -h '.zip' dirlist*.txt
    bunzip2
    bzip2
    bzip2recover
    gunzip
    gzip
    funzip
    gpg-zip
    preunzip
    prezip
    prezip-bin
    unzip
    unzipsfx

The command searches for any line in the files that matches the regular expression .zip (the length of the required match is four characters).
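Because the dot must consume one character, a bare zip cannot match .zip. A short sketch with made-up names shows this, together with an escaped dot that matches only a literal period:

```shell
# The dot needs a character in front of "zip", so the line "zip" is skipped.
dot_matches=$(printf 'zip\ngzip\nunzip\n' | grep '.zip')

# Escaping the dot makes it a literal period.
literal_dot=$(printf 'a.zip\nazip\n' | grep '\.zip')

printf '%s\n' "$dot_matches" "$literal_dot"
```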

Anchors

The caret (^) and dollar sign ($) characters are treated as anchors: they cause the match to occur only if the regular expression is found at the beginning of the line (^) or at the end of the line ($). Here are some examples:

    paolo@ubuntu-server:~$ grep -h '^zip' dirlist*.txt
    zip
    zipcloak
    zipgrep
    zipinfo
    zipnote
    zipsplit
    paolo@ubuntu-server:~$ grep -h 'zip$' dirlist*.txt
    gunzip
    gzip
    funzip
    gpg-zip
    preunzip
    prezip
    unzip
    zip
    paolo@ubuntu-server:~$ grep -h '^zip$' dirlist*.txt
    zip
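Combining both anchors with nothing in between gives ^$, which matches only empty lines. The following sketch counts the blank lines in a made-up piece of text; the variable names are hypothetical.

```shell
# Hypothetical text with three empty lines in it.
text=$(printf 'first\n\nsecond\n\n\nthird\n')

# ^$ : start of line immediately followed by end of line.
blank_count=$(printf '%s\n' "$text" | grep -c '^$')

echo "$blank_count"
```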

Bracket Expressions and Character Classes

Bracket expressions match a single character from a specified set of characters:
- [ ] matches any one of the characters inside the square brackets, for ONE character position, once and only once;
- the set may contain any number of characters;
- metacharacters lose their special meaning when placed within brackets, with two exceptions: a leading caret (^) negates the set and a dash (-) denotes a range.

    paolo@ubuntu-server:~$ grep -h '[bg]zip' dirlist*.txt
    bzip2
    bzip2recover
    gzip

Negation

^ negates the expression. If the first character in a bracket expression is a caret (^), the remaining characters are taken to be a set of characters that must not be present at the given character position. A negated set still requires some character at that position; it does not match "nothing".

    paolo@ubuntu-server:~$ grep -h '[^bg]zip' dirlist*.txt
    bunzip2
    gunzip
    funzip
    gpg-zip
    preunzip
    prezip
    prezip-bin
    unzip
    unzipsfx
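The point that a negated set still consumes a character can be seen on a small made-up list: zip is rejected because nothing precedes it, not because of a b or a g.

```shell
names=$(printf 'zip\nbzip2\ngzip\nunzip\n')

# [^bg] must match one character that is neither b nor g.
negated=$(printf '%s\n' "$names" | grep '[^bg]zip')

echo "$negated"
```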

Character Ranges

A dash (-) inside square brackets is the range separator; it allows a range of characters to be defined. For example, [0123456789] can be rewritten as [0-9]. The following two commands are equivalent:

    paolo@ubuntu-server:~$ grep -h '^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' dirlist*.txt
    paolo@ubuntu-server:~$ grep -h '^[A-Z]' dirlist*.txt
    MAKEDEV
    ControlPanel
    GET
    HEAD
    POST
    X
    X11
    Xorg
    MAKEFLOPPIES
    NetworkManager
    NetworkManagerDispatcher

Character Ranges (continued)

Any range of characters can be expressed this way, including multiple ranges: [0-9A-C] matches one character from 0 to 9 or from A to C. The following command matches all filenames starting with a letter or a digit:

    paolo@ubuntu-server:~$ grep -h '^[A-Za-z0-9]' dirlist*.txt

To include a literal dash in a bracket expression, make it the first character in the expression:

    paolo@ubuntu-server:~$ grep -h '[-AZ]' dirlist*.txt

This matches every filename containing a dash, an uppercase A, or an uppercase Z.
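The difference between a leading dash and a range separator can be sketched as follows. The names are made up, and LC_ALL=C is set as an assumption so that A-Z means exactly the ASCII uppercase letters; in other locales the collation order may change what a range covers.

```shell
names=$(printf 'gpg-zip\nMAKEDEV\nZoo\nbzip2\n')

# Leading dash: the set contains exactly three characters: -, A and Z.
literal_dash=$(printf '%s\n' "$names" | LC_ALL=C grep '[-AZ]')

# Dash between two characters: the set is the whole range A through Z.
range=$(printf '%s\n' "$names" | LC_ALL=C grep '[A-Z]')

printf '%s\n' "$literal_dash" "$range"
```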

POSIX Character Classes

The POSIX standard includes a number of character classes which provide useful ranges of characters.

Character Class  Description
---------------  -----------
[:alnum:]        The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:]         The same as [:alnum:], with the addition of the underscore (_) character. This class is not part of the POSIX set and is only available in some tools.
[:alpha:]        The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]        Includes the space and tab characters.
[:cntrl:]        The ASCII control codes. Includes the ASCII characters 0 through 31 and 127.
[:digit:]        The numerals zero through nine.
[:graph:]        The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:]        The lowercase letters.
[:punct:]        The punctuation characters. In ASCII, equivalent to: [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
[:print:]        The printable characters. All the characters in [:graph:] plus the space character.

POSIX Character Classes (continued)

Character Class  Description
---------------  -----------
[:space:]        The whitespace characters, including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:]        The uppercase characters.
[:xdigit:]       Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]

POSIX Character Classes: example

A character class is written inside a bracket expression, hence the double brackets:

    paolo@ubuntu-server:~$ ls /usr/sbin/[[:upper:]]*
    /usr/sbin/MAKEFLOPPIES
    /usr/sbin/NetworkManagerDispatcher
    /usr/sbin/NetworkManager

Note that in this command the pattern is expanded by the shell (pathname expansion), not interpreted as a regular expression; POSIX character classes are available in both.
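The same classes can be used in a grep pattern. A brief sketch on a made-up list of names, assuming only that grep supports POSIX bracket expressions:

```shell
names=$(printf 'MAKEDEV\nbzip2\nX11\nzip\n')

# Lines that start with an uppercase letter.
upper_start=$(printf '%s\n' "$names" | grep '^[[:upper:]]')

# Lines that contain at least one digit anywhere.
has_digit=$(printf '%s\n' "$names" | grep '[[:digit:]]')

printf '%s\n' "$upper_start" "$has_digit"
```

The outer brackets form the bracket expression and the inner [:name:] selects the class, so several classes can share one expression, as in [[:upper:][:digit:]].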

POSIX Basic vs. Extended Regular Expressions

POSIX splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE).

BRE recognizes the following metacharacters:

    ^ $ . [ ] *

All other characters are treated as literals. However, ( ) { } are treated as metacharacters in BRE if they are escaped with a backslash.

ERE adds the following metacharacters:

    ( ) { } ? + |

In ERE, preceding any of these with a backslash causes it to be treated as a literal.

ERE is supported by the egrep program, and by grep when the -E option is used.
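The mirrored role of the backslash in the two syntaxes can be sketched with braces. The inputs aaa and a{3} are made up for illustration; each of the four commands below is expected to print its input line.

```shell
# BRE (plain grep): escaped braces act as a repetition count.
bre_interval=$(printf 'aaa\n' | grep '^a\{3\}$')

# ERE (grep -E): unescaped braces act as a repetition count.
ere_interval=$(printf 'aaa\n' | grep -E '^a{3}$')

# BRE: unescaped braces are ordinary characters.
bre_literal=$(printf 'a{3}\n' | grep '^a{3}$')

# ERE: escaped braces are ordinary characters.
ere_literal=$(printf 'a{3}\n' | grep -E '^a\{3\}$')

printf '%s\n' "$bre_interval" "$ere_interval" "$bre_literal" "$ere_literal"
```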

Extended Regular Expressions

Alternation is the facility that allows a match to occur from among a set of expressions. It is written with the vertical bar (|):

    paolo@ubuntu-server:~$ echo "AAA" | grep -E 'AAA|BBB'
    AAA
    paolo@ubuntu-server:~$ echo "BBB" | grep -E 'AAA|BBB'
    BBB
    paolo@ubuntu-server:~$ echo "CCC" | grep -E 'AAA|BBB'
    paolo@ubuntu-server:~$

The regular expression 'AAA|BBB' means: match either the string AAA or the string BBB. Alternation is not limited to two choices:

    paolo@ubuntu-server:~$ echo "AAA" | grep -E 'AAA|BBB|CCC'
    AAA

Parentheses ( )

Parentheses group part of a regular expression. To combine alternation with other regular expression elements, parentheses can be used to delimit the alternation.

This expression matches the filenames that start with either bz, gz, or zip:

    paolo@ubuntu-server:~$ grep -Eh '^(bz|gz|zip)' dirlist*.txt

This expression matches any filename that begins with bz, or contains gz, or contains zip:

    paolo@ubuntu-server:~$ grep -Eh '^bz|gz|zip' dirlist*.txt
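The effect of the parentheses is easy to see on a short made-up list, where unzip is the name that separates the two patterns:

```shell
names=$(printf 'bzip2\ngzip\nunzip\nzipinfo\ntar\n')

# Anchor applies to the whole group: the line must START with bz, gz or zip.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)')

# Anchor applies only to bz: gz and zip may appear anywhere in the line.
ungrouped=$(printf '%s\n' "$names" | grep -E '^bz|gz|zip')

printf '%s\n' "$grouped" "$ungrouped"
```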

Quantifiers

Quantifiers are used to specify the number of times an element is matched.

? - Match An Element Zero Or One Time

This means, in effect, "make the preceding element optional". Let's say we wanted to check a phone number for validity and we considered a phone number to be valid if it matched either of these two forms:

    (nnn) nnn-nnnn
    nnn nnn-nnnn

where n is a numeral. We could construct a regular expression like this:

    ^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$

? - Match An Element Zero Or One Time (examples)

    paolo@ubuntu-server:~$ echo "(555) 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
    (555) 123-4567
    paolo@ubuntu-server:~$ echo "555 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
    555 123-4567
    paolo@ubuntu-server:~$ echo "AAA 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
    paolo@ubuntu-server:~$

* - Match An Element Zero Or More Times

Like ?, the * is used to denote an optional item; unlike ?, the item may occur any number of times, not just once. Let's say we wanted to see if a string was a sentence; that is, it starts with an uppercase letter, then contains any number of uppercase and lowercase letters and spaces, and ends with a period. We could use a regular expression like this:

    [[:upper:]][[:upper:][:lower:] ]*\.

The expression consists of three items: a bracket expression containing the [:upper:] character class, a bracket expression containing the [:upper:] and [:lower:] character classes and a space, followed by the * metacharacter, and a period escaped with a backslash.

* - Match An Element Zero Or More Times (examples)

    paolo@ubuntu-server:~$ echo "This works." | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
    This works.
    paolo@ubuntu-server:~$ echo "This Works." | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
    This Works.
    paolo@ubuntu-server:~$ echo "this does not" | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
    paolo@ubuntu-server:~$
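The sentence pattern above is not anchored, so it also accepts a line in which a sentence merely appears somewhere. The sketch below shows one possible tightening with ^ and $; the anchored variant and the sample line are assumptions added for illustration, not part of the slides.

```shell
pattern='[[:upper:]][[:upper:][:lower:] ]*\.'

# Unanchored: the substring "This works." is enough for the line to match.
unanchored=$(printf 'see: This works.\n' | grep -E "$pattern")

# Anchored: the whole line must be a sentence, so this line is rejected.
anchored=$(printf 'see: This works.\n' | grep -E "^$pattern\$" || true)

# A line that is a sentence from start to end still matches.
anchored_ok=$(printf 'This works.\n' | grep -E "^$pattern\$")

printf '[%s] [%s] [%s]\n' "$unanchored" "$anchored" "$anchored_ok"
```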

+ - Match An Element One Or More Times

The + metacharacter works much like the *, except that it requires at least one instance of the preceding element to cause a match. Example: a regular expression that matches lines consisting of groups of one or more alphabetic characters separated by single spaces:

    ^([[:alpha:]]+ ?)+$

    paolo@ubuntu-server:~$ echo "This that" | grep -E '^([[:alpha:]]+ ?)+$'
    This that
    paolo@ubuntu-server:~$ echo "a b c" | grep -E '^([[:alpha:]]+ ?)+$'
    a b c
    paolo@ubuntu-server:~$ echo "a b 9" | grep -E '^([[:alpha:]]+ ?)+$'
    paolo@ubuntu-server:~$ echo "abc  d" | grep -E '^([[:alpha:]]+ ?)+$'
    paolo@ubuntu-server:~$

The third line fails because 9 is not alphabetic; the last fails because it contains two consecutive spaces.
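The difference between * and + comes down to whether zero occurrences are accepted. A minimal sketch on three made-up lines:

```shell
lines=$(printf 'ac\nabc\nabbc\n')

# b* allows zero b's, so "ac" matches.
star=$(printf '%s\n' "$lines" | grep -E '^ab*c$')

# b+ needs at least one b, so "ac" is rejected.
plus=$(printf '%s\n' "$lines" | grep -E '^ab+c$')

printf '%s\n' "$star" "$plus"
```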

{ } - Match An Element A Specific Number Of Times

The { and } metacharacters are used to express minimum and maximum numbers of required matches. They may be specified in four possible ways:

Specifier  Meaning
---------  -------
{n}        Match the preceding element if it occurs exactly n times.
{n,m}      Match the preceding element if it occurs at least n times, but no more than m times.
{n,}       Match the preceding element if it occurs n or more times.
{,m}       Match the preceding element if it occurs no more than m times.
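The first three specifiers can be tried on lines of repeated a characters, as in the sketch below (the input is made up). The {,m} form is left out here because it is a GNU extension and other grep implementations may not accept it.

```shell
lines=$(printf 'a\naa\naaa\naaaa\n')

# {2}: exactly two.
exactly_two=$(printf '%s\n' "$lines" | grep -E '^a{2}$')

# {2,3}: at least two, at most three.
two_to_three=$(printf '%s\n' "$lines" | grep -E '^a{2,3}$')

# {3,}: three or more.
three_or_more=$(printf '%s\n' "$lines" | grep -E '^a{3,}$')

printf '%s\n' "$exactly_two" "$two_to_three" "$three_or_more"
```

The anchors matter: without ^ and $, a{2} would also match inside aaa and aaaa, since any two consecutive a characters satisfy it.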

{ } - Match An Element A Specific Number Of Times (phone number)

Check a phone number for validity, considering a phone number valid if it matches either of these two forms: (nnn) nnn-nnnn or nnn nnn-nnnn, where n is a numeral. Two equivalent regular expressions:

    ^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$
    ^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$

{ } - Match An Element A Specific Number Of Times (examples)

    paolo@ubuntu-server:~$ echo "(555) 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
    (555) 123-4567
    paolo@ubuntu-server:~$ echo "555 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
    555 123-4567
    paolo@ubuntu-server:~$ echo "AAA 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
    paolo@ubuntu-server:~$
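One limitation of the phone number pattern is worth knowing: because each parenthesis is optional on its own, a number with only one of the two is also accepted. The sketch below shows this and one possible stricter pattern built with alternation; the stricter pattern and the variable names are assumptions for illustration, not something the slides propose.

```shell
loose='^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
# Hypothetical stricter variant: either both parentheses or neither.
strict='^(\([0-9]{3}\)|[0-9]{3}) [0-9]{3}-[0-9]{4}$'

# The loose pattern accepts an unbalanced opening parenthesis.
loose_hit=$(printf '(555 123-4567\n' | grep -E "$loose")

# The strict pattern rejects it (grep exits with 1 on no match).
strict_hit=$(printf '(555 123-4567\n' | grep -E "$strict" || true)

# Both valid forms still pass the strict pattern.
strict_ok=$(printf '(555) 123-4567\n555 123-4567\n' | grep -E "$strict")

printf '[%s] [%s]\n%s\n' "$loose_hit" "$strict_hit" "$strict_ok"
```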