CMSC 330: Organization of Programming Languages. Ruby Regular Expressions



String Processing in Ruby
Earlier, we motivated scripting languages using a popular application of them: string processing
The Ruby String class provides many useful methods for manipulating strings
  - Concatenating them, grabbing substrings, searching in them, etc.
A key feature in Ruby is its native support for regular expressions
  - Very useful for parsing and searching
  - First gained popularity in Perl

String Operations in Ruby
    "hello".index("l", 0)
  - Returns the index of the first occurrence of "l" in "hello", starting the search at offset 0
    "hello".sub("h", "j")
  - Replaces the first occurrence of "h" with "j" in the string
  - Use gsub ("global" sub) to replace all occurrences
    "r1\tr2\t\tr3".split("\t")
  - Returns an array of substrings delimited by tab
Consider these three examples again
  - All involve searching in a string for a certain pattern
  - What if we want to find more complicated patterns?
    - Find first occurrence of "a" or "b"
    - Split string at tabs, spaces, and newlines
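
The three plain-string operations above can be tried directly. This is a minimal sketch; the variable names are only illustrative and are not part of the slides.

```ruby
# Plain-string versions of the three operations on the slide.
s = "hello"

first_l = s.index("l", 0)                    # search for "l" starting at offset 0
jello   = s.sub("h", "j")                    # replaces the first "h" only
all_l   = s.gsub("l", "L")                   # gsub replaces every occurrence
fields  = "r1\tr2\t\tr3".split("\t")         # two tabs in a row yield an empty field

puts first_l          # 2
puts jello            # jello
puts all_l            # heLLo
puts fields.inspect   # ["r1", "r2", "", "r3"]
```

Note that sub and gsub return new strings; s itself is unchanged.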

Regular Expressions
A way of describing patterns or sets of strings
  - Searching and matching
  - Formally describing strings
    - The symbols (lexemes or tokens) that make up a language
Common to lots of languages and tools
  - awk, sed, perl, grep, Java, OCaml, C libraries, etc.
  - Popularized (and made fast) as a language feature in Perl
Based on some really elegant theory
  - Future lecture

Example Regular Expressions in Ruby
/Ruby/
  - Matches exactly the string "Ruby"
  - Regular expressions can be delimited by /'s
  - Use \ to escape /'s in regular expressions
/(Ruby|OCaml|Java)/
  - Matches either "Ruby", "OCaml", or "Java"
/(Ruby|Regular)/ or /R(uby|egular)/
  - Matches either "Ruby" or "Regular"
  - Use ( )'s for grouping; use \ to escape ( )'s

Using Regular Expressions
Regular expressions are instances of Regexp
  - We'll see use of Regexp.new later
Basic matching using =~ method of String
    line = gets                 # read line from standard input
    if line =~ /Ruby/ then      # returns nil if not found
      puts "Found Ruby"
    end
Can use regular expressions in index, search, etc.
    offset = line.index(/(MAX|MIN)/)     # search starting from 0
    line.sub(/(Perl|Python)/, "Ruby")    # replace
    line.split(/(\t|\n| )/)              # split at tab, space, newline
                                         # (text matched by the parentheses is also returned in the array)
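
The calls above can be exercised without standard input. In this sketch the sample line is made up for illustration, and the split uses a character class instead of a parenthesized alternation so that the delimiters are not captured into the result.

```ruby
# A fixed sample line stands in for `line = gets` (the text is invented).
line = "MIN is 3, Perl and Python say\tso\n"

found  = (line =~ /Ruby/)                    # nil: "Ruby" does not occur
offset = line.index(/(MAX|MIN)/)             # position of the first MAX or MIN
fixed  = line.sub(/(Perl|Python)/, "Ruby")   # only the first match is replaced
words  = line.split(/[\t\n ]/)               # split at tab, newline, or space

# With a parenthesized delimiter, split also returns the delimiters themselves.
with_delims = "a b".split(/( )/)

puts found.inspect        # nil
puts offset               # 0
puts words.inspect        # ["MIN", "is", "3,", "Perl", "and", "Python", "say", "so"]
puts with_delims.inspect  # ["a", " ", "b"]
```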

Repetition in Regular Expressions
/(Ruby)*/
  - {"", "Ruby", "RubyRuby", "RubyRubyRuby", ...}
  - * means zero or more occurrences
/Ruby+/
  - {"Ruby", "Rubyy", "Rubyyy", ...}
  - + means one or more occurrences
  - so /e+/ is the same as /ee*/
/(Ruby)?/
  - {"", "Ruby"}
  - ? means optional, i.e., zero or one occurrence

Repetition in Regular Expressions
/(Ruby){3}/
  - {"RubyRubyRuby"}
  - {x} means repeat the search for exactly x occurrences
/(Ruby){3,}/
  - {"RubyRubyRuby", "RubyRubyRubyRuby", ...}
  - {x,} means repeat the search for at least x occurrences
/(Ruby){3,5}/
  - {"RubyRubyRuby", "RubyRubyRubyRuby", "RubyRubyRubyRubyRuby"}
  - {x,y} means repeat the search for at least x occurrences and at most y occurrences
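
One way to check which strings belong to the sets listed for each repetition operator is to wrap the pattern in ^( ... )$ so that the whole string has to match. The helper name in_set? is hypothetical and only used for this sketch.

```ruby
# Hypothetical helper: true if the entire string belongs to the pattern's set.
# pattern.source gives the text between the slashes, which is re-wrapped in ^( )$.
def in_set?(pattern, str)
  !!(str =~ /^(#{pattern.source})$/)
end

puts in_set?(/(Ruby)*/, "")                  # true:  zero occurrences are allowed
puts in_set?(/Ruby+/, "Rubyyy")              # true:  + repeats only the y
puts in_set?(/(Ruby)?/, "RubyRuby")          # false: at most one occurrence
puts in_set?(/(Ruby){3}/, "Ruby" * 3)        # true
puts in_set?(/(Ruby){3,}/, "Ruby" * 2)       # false: fewer than three
puts in_set?(/(Ruby){3,5}/, "Ruby" * 5)      # true
puts in_set?(/(Ruby){3,5}/, "Ruby" * 6)      # false: more than five
```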

Watch Out for Precedence
/(Ruby)*/ means {"", "Ruby", "RubyRuby", ...}
/Ruby*/ means {"Rub", "Ruby", "Rubyy", ...}
In general
  - *, {n}, and + bind most tightly
  - Then concatenation (adjacency of regular expressions)
  - Then |
Best to use parentheses to disambiguate
  - Note that parentheses have another use, to extract matches, as we'll see later
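
The precedence rules can be seen by comparing grouped and ungrouped patterns on the same input. This is a small sketch; the test strings are chosen only for illustration.

```ruby
# * binds more tightly than concatenation: without parentheses it repeats only the y.
grouped   = !!("RubyRuby" =~ /^(Ruby)*$/)   # the whole group repeats
ungrouped = !!("RubyRuby" =~ /^Ruby*$/)     # only "Rub" followed by y's is accepted
rub       = !!("Rub" =~ /^Ruby*$/)          # zero y's is fine

# | binds most loosely: /^ab|cd$/ reads as (^ab)|(cd$).
loose = !!("abX" =~ /^ab|cd$/)              # matches via the ^ab alternative
tight = !!("abX" =~ /^(ab|cd)$/)            # parentheses force a whole-line match

puts grouped, ungrouped, rub, loose, tight  # true, false, true, true, false
```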

Character Classes
/[abcd]/
  - {"a", "b", "c", "d"} (Can you write this another way?)
/[a-zA-Z0-9]/
  - Any upper or lower case letter or digit
/[^0-9]/
  - Any character except 0-9 (the ^ is like not and must come first)
/[\t\n ]/
  - Tab, newline or space
/[a-zA-Z_\$][a-zA-Z_\$0-9]*/
  - Java identifiers ($ escaped...see next slide)
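
As a quick check of the character classes above, the sketch below anchors the Java-identifier pattern with ^ and $ and compares three ways of writing /[abcd]/. The method name java_identifier? is hypothetical.

```ruby
# Hypothetical helper: does the whole (single-line) string look like a Java identifier?
def java_identifier?(s)
  !!(s =~ /^[a-zA-Z_\$][a-zA-Z_\$0-9]*$/)
end

puts java_identifier?("foo_1")    # true
puts java_identifier?("$tmp")     # true:  $ is allowed anywhere
puts java_identifier?("9lives")   # false: cannot start with a digit
puts java_identifier?("a-b")      # false: - is not in either class

# Three spellings of the same one-character set.
same = ("a".."e").all? do |ch|
  a = !!(ch =~ /^[abcd]$/)
  b = !!(ch =~ /^[a-d]$/)
  c = !!(ch =~ /^(a|b|c|d)$/)
  a == b && b == c
end
puts same                         # true
```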

Special Characters
    .     any character (except newline)
    ^     beginning of line
    $     end of line
    \$    just a $
    \d    digit, [0-9]
    \s    whitespace, [ \t\r\n\f\v]
    \w    word character, [A-Za-z0-9_]
    \D    non-digit, [^0-9]
    \S    non-space, [^ \t\r\n\f\v]
    \W    non-word, [^A-Za-z0-9_]
Using /^pattern$/ ensures entire string/line must match pattern
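
A few of these special characters in action. The sample strings are invented for this sketch; each result is the index where the match starts, or nil if there is no match.

```ruby
course    = ("cmsc330"  =~ /^\w+\d{3}$/)   # word characters ending in three digits
has_space = ("cmsc 330" =~ /^\S+$/)        # nil: the space is not a non-space character
price     = ("cost: $5" =~ /\$\d/)         # \$ is a literal dollar sign
per_line  = ("foo\nbar" =~ /^bar$/)        # ^ and $ anchor at line boundaries
dot       = ("a\nb"     =~ /a.b/)          # nil: . does not match a newline

puts course.inspect      # 0
puts has_space.inspect   # nil
puts price.inspect       # 6
puts per_line.inspect    # 4
puts dot.inspect         # nil
```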

Potential Character Class Confusions
^
  - Inside character classes: not
  - Outside character classes: beginning of line
[ ]
  - Inside regular expressions: character class
  - Outside regular expressions: array
    - Note: [a-z] does not make a valid array
( )
  - Inside character classes: literal characters ( )
    - Note: /(0..2)/ does not mean 012
  - Outside character classes: used for grouping
-
  - Inside character classes: range (e.g., a to z given by [a-z])
  - Outside character classes: subtraction

Summary
Let re represent an arbitrary pattern; then:
    /re/           matches regexp re
    /(re1|re2)/    match either re1 or re2
    /(re)*/        match 0 or more occurrences of re
    /(re)+/        match 1 or more occurrences of re
    /(re)?/        match 0 or 1 occurrences of re
    /(re){2}/      match exactly two occurrences of re
    /[a-z]/        same as (a|b|c|...|z)
    /[^0-9]/       match any character that is not 0, 1, etc.
    ^, $           match start or end of line

Try out regexps at rubular.com

Regular Expression Practice
Make Ruby regular expressions representing
  - All lines beginning with a or b
        /^(a|b)/
  - All lines containing at least two (only alphabetic) words separated by white-space
        /[a-zA-Z]+\s+[a-zA-Z]+/
  - All lines where a and b alternate and appear at least once
        /^(((ab)+a?)|((ba)+b?))$/
    (the outer parentheses keep ^ and $ from attaching to only one alternative)
  - An expression which would match both of these lines (but not radically different ones)
    - CMSC330: Organization of Programming Languages: Fall 2016
    - CMSC351: Algorithms: Fall 2016
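
The practice answers can be checked against a few sample lines. The first three patterns are the ones discussed on the slide (with the alternation fully parenthesized); the last pattern, course_line, is only one possible answer to the final exercise and is an assumption of this sketch, not the official solution.

```ruby
starts_ab   = /^(a|b)/
two_words   = /[a-zA-Z]+\s+[a-zA-Z]+/
alternating = /^(((ab)+a?)|((ba)+b?))$/
# One possible (assumed) answer for the course-title lines.
course_line = /^CMSC3\d\d: [A-Za-z ]+: Fall 2016$/

puts !!("apple" =~ starts_ab)        # true
puts !!("cab" =~ starts_ab)          # false
puts !!("hello world" =~ two_words)  # true
puts !!("hello" =~ two_words)        # false
puts !!("ababa" =~ alternating)      # true
puts !!("abba" =~ alternating)       # false

titles = [
  "CMSC330: Organization of Programming Languages: Fall 2016",
  "CMSC351: Algorithms: Fall 2016"
]
puts titles.all? { |t| t =~ course_line }   # true
```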

Quiz 1
How many different strings could this regex match?
    /^Hello. Anyone awake?$/
A. 1
B. 2
C. 4
D. More than 4

Quiz 1 (answer)
How many different strings could this regex match?
    /^Hello. Anyone awake?$/
A. 1
B. 2
C. 4
D. More than 4 (correct)
  - . matches any character
  - e? matches e or nothing

Quiz 2
Which regex is not equivalent to the others?
A. ^[computer]$
B. ^(c|o|m|p|u|t|e|r)$
C. ^([comp]|[uter])$
D. ^c?o?m?p?u?t?e?r?$

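Quiz 2 (answer)
Which regex is not equivalent to the others?
A. ^[computer]$
B. ^(c|o|m|p|u|t|e|r)$
C. ^([comp]|[uter])$
D. ^c?o?m?p?u?t?e?r?$ (correct)
  - A, B, and C each match exactly one of the letters c, o, m, p, u, t, e, r
  - D also matches the empty string and longer strings such as "computer"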

Quiz 3
Which string does not match the regex?
    /[a-z]{4}\d{3}/
A. cmsc\d\d\d
B. cmsc330
C. hellocmsc330
D. cmsc330world

Quiz 3 (answer)
Which string does not match the regex?
Recall that without ^ and $, a regex will match any substring
    /[a-z]{4}\d{3}/
A. cmsc\d\d\d (correct: the string contains no digits)
B. cmsc330
C. hellocmsc330
D. cmsc330world

Extracting Substrings Based on R.E.'s
Method 1: Back References
Two options to extract substrings based on R.E.'s
Use back references
  - Ruby remembers which strings matched the parenthesized parts of r.e.'s
  - These parts can be referred to using special variables called back references (named $1, $2, ...)

Back Reference Example
Extract information from a report
    gets =~ /^Min: (\d+) Max: (\d+)$/
    min, max = $1, $2         # sets min = $1 and max = $2
Warning
  - Despite their names, $1 etc. are local variables
    def m(s)
      s =~ /(Foo)/
      puts $1    # prints Foo
    end
    m("Foo")
    puts $1      # $1 is nil here
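
The same two points can be tried without reading from standard input. In this sketch the helper find_foo and the sample report line are hypothetical stand-ins for m(s) and gets.

```ruby
# Hypothetical helper mirroring m(s) on the slide: the match happens inside the method.
def find_foo(s)
  s =~ /(Foo)/
  $1                      # "Foo" while we are still inside the method
end

inside  = find_foo("Foo")
outside = $1              # nil: no match has been performed in this scope yet

# A fixed sample line stands in for `gets`.
report = "Min: 3 Max: 17"
report =~ /^Min: (\d+) Max: (\d+)$/
min, max = $1, $2         # back references are strings; call to_i for integers

puts inside               # Foo
puts outside.inspect      # nil
puts min.to_i + max.to_i  # 20
```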

Another Back Reference Example
Warning 2
  - If another search is performed, all back references are reset to nil
    Code                        Input / output
    gets =~ /(h)e(ll)o/         hello   (input)
    puts $1                     h
    puts $2                     ll
    gets =~ /h(e)llo/           hello   (input)
    puts $1                     e
    puts $2                     nil
    gets =~ /hello/             hello   (input)
    puts $1                     nil
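
The reset behavior can be reproduced with a fixed string in place of gets. This is a minimal sketch; the array names are only for illustration.

```ruby
s = "hello"               # stands in for each line read with gets

s =~ /(h)e(ll)o/
first = [$1, $2]          # both groups matched

s =~ /h(e)llo/
second = [$1, $2]         # $2 is reset because this pattern has only one group

s =~ /hello/
third = [$1, $2]          # no groups at all, so both are nil

puts first.inspect        # ["h", "ll"]
puts second.inspect       # ["e", nil]
puts third.inspect        # [nil, nil]
```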

Quiz 4
What is the output of the following code?
    s = "help I'm stuck in a text editor"
    s =~ /([A-Z]+)/
    puts $1
A. help
B. I
C. I'm
D. I'm stuck in a text editor

Quiz 4 (answer)
What is the output of the following code?
    s = "help I'm stuck in a text editor"
    s =~ /([A-Z]+)/
    puts $1
A. help
B. I (correct: the first run of upper-case letters in the string as shown)
C. I'm
D. I'm stuck in a text editor

Quiz 5
What is the output of the following code?
    "Why was 6 afraid of 7?" =~ /\d\s(\w+).*(\d)/
    puts $2
A. afraid
B. Why
C. 6
D. 7

Quiz 5 (answer)
What is the output of the following code?
    "Why was 6 afraid of 7?" =~ /\d\s(\w+).*(\d)/
    puts $2
A. afraid
B. Why
C. 6
D. 7 (correct: $1 is "afraid", and the second group captures the last digit)

Method 2: String.scan
Also extracts substrings based on regular expressions
Can optionally use parentheses in regular expression to affect how the extraction is done
Has two forms that differ in what Ruby does with the matched substrings
  - The first form returns an array
  - The second form uses a code block
    - We'll see this later

First Form of the Scan Method
str.scan(regexp)
  - If regexp doesn't contain any parenthesized subparts, returns an array of matches
    - An array of all the substrings of str which matched
    s = "CMSC 330 Fall 2007"
    s.scan(/\S+ \S+/)    # returns array ["CMSC 330", "Fall 2007"]
    - Note: these strings are chosen sequentially from as yet unmatched portions of the string, so while "330 Fall" does match the regular expression above, it is not returned since "330" has already been matched by a previous substring.
    s.scan(/\S{2}/)      # => ["CM", "SC", "33", "Fa", "ll", "20", "07"]

First Form of the Scan Method (cont.)
  - If regexp contains parenthesized subparts, returns an array of arrays
    - Each sub-array contains the parts of the string which matched one occurrence of the search
    s = "CMSC 330 Fall 2007"
    s.scan(/(\S+) (\S+)/)    # [["CMSC", "330"],
                             #  ["Fall", "2007"]]
    - Each sub-array has the same number of entries as the number of parenthesized subparts
    - All strings that matched the first part of the search (or $1 in back-reference terms) are located in the first position of each sub-array
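
Both shapes of the first form can be compared side by side on the slide's sample string. This is a minimal sketch; the variable names are illustrative only.

```ruby
s = "CMSC 330 Fall 2007"

flat   = s.scan(/\S+ \S+/)                # no parentheses: array of matched substrings
pairs  = s.scan(/\S{2}/)                  # matches never overlap
nested = s.scan(/(\S+) (\S+)/)            # parentheses: one sub-array per match
firsts = nested.map { |pair| pair[0] }    # what $1 would have been for each match

puts flat.inspect     # ["CMSC 330", "Fall 2007"]
puts pairs.inspect    # ["CM", "SC", "33", "Fa", "ll", "20", "07"]
puts nested.inspect   # [["CMSC", "330"], ["Fall", "2007"]]
puts firsts.inspect   # ["CMSC", "Fall"]
```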

Practice with Scan and Back-references
    > ls -l
    drwx------ 2 sorelle sorelle 4096 Feb 18 18:05 bin
    -rw------- 1 sorelle sorelle  674 Jun  1 15:27 calendar
    drwx------ 3 sorelle sorelle 4096 May 11  2006 cmsc311
    drwx------ 2 sorelle sorelle 4096 Jun  4 17:31 cmsc330
    drwx------ 1 sorelle sorelle 4096 May 30 19:19 cmsc630
    drwx------ 1 sorelle sorelle 4096 May 30 19:20 cmsc631
Extract just the file or directory name from a line using
  - scan
    name = line.scan(/\S+$/)    # ["bin"]
  - back-references
    if line =~ /(\S+$)/
      name = $1                 # "bin"
    end
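
Both techniques can be applied to a whole listing at once. In this sketch the listing is a shortened copy of the one above held in a string, and collecting only the directory names (lines starting with d) is an extra step added for illustration.

```ruby
# A shortened copy of the listing, held in a string instead of read from ls.
listing = <<~LS
  drwx------ 2 sorelle sorelle 4096 Feb 18 18:05 bin
  -rw------- 1 sorelle sorelle 674 Jun 1 15:27 calendar
  drwx------ 3 sorelle sorelle 4096 May 11 2006 cmsc311
LS

names = []
dirs  = []
listing.each_line do |line|
  names.concat(line.scan(/\S+$/))          # scan: last run of non-space characters
  dirs << $1 if line =~ /^d.*\s(\S+)$/     # back reference, directories only
end

puts names.inspect   # ["bin", "calendar", "cmsc311"]
puts dirs.inspect    # ["bin", "cmsc311"]
```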

Quiz 6
What is the output of the following code?
    s = "Hello World"
    t = s.scan(/\w{2}/).length
    puts t
A. 3
B. 4
C. 5
D. 6

Quiz 6 (answer)
What is the output of the following code?
    s = "Hello World"
    t = s.scan(/\w{2}/).length
    puts t
A. 3
B. 4 (correct: the matches are "He", "ll", "Wo", "rl")
C. 5
D. 6

Quiz 7
What is the output of the following code?
    s = "To be, or not to be!"
    a = s.scan(/(\S+) (\S+)/)
    puts a.inspect
A. ["To", "be,", "or", "not", "to", "be!"]
B. [["To", "be,"], ["or", "not"], ["to", "be!"]]
C. ["To", "be,"]
D. ["to", "be!"]

Quiz 7 (answer)
What is the output of the following code?
    s = "To be, or not to be!"
    a = s.scan(/(\S+) (\S+)/)
    puts a.inspect
A. ["To", "be,", "or", "not", "to", "be!"]
B. [["To", "be,"], ["or", "not"], ["to", "be!"]] (correct: one sub-array per match)
C. ["To", "be,"]
D. ["to", "be!"]

Second Form of the Scan Method
Can take a code block as an optional argument
    str.scan(regexp) { |match| block }
  - Applies the code block to each match
  - Short for str.scan(regexp).each { |match| block }
  - The regular expression can also contain parenthesized subparts

Example of Second Form of Scan
Input file: will be read line by line, but column summation is desired
    12 34 23
    19 77 87
    11 98 3
    2 45 0
Sums up three columns of numbers
    sum_a = sum_b = sum_c = 0
    while (line = gets)
      line.scan(/(\d+)\s+(\d+)\s+(\d+)/) { |a,b,c|
        sum_a += a.to_i    # to_i converts the string to an integer
        sum_b += b.to_i
        sum_c += c.to_i
      }
    end
    printf("total: %d %d %d\n", sum_a, sum_b, sum_c)
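
The column-summing loop can be run without an input file by iterating over a string. This sketch assumes the four rows shown on the slide and keeps the same pattern and block.

```ruby
# The slide's input rows, held in a string instead of read with gets.
input = "12 34 23\n19 77 87\n11 98 3\n2 45 0\n"

sum_a = sum_b = sum_c = 0
input.each_line do |line|
  # With three groups in the pattern, the block receives the three captures.
  line.scan(/(\d+)\s+(\d+)\s+(\d+)/) do |a, b, c|
    sum_a += a.to_i
    sum_b += b.to_i
    sum_c += c.to_i
  end
end

printf("total: %d %d %d\n", sum_a, sum_b, sum_c)   # total: 44 254 113
```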

Standard Library: File
Lots of convenient methods for IO
    f = File.new("file.txt", "r+")    # open for read/write access ("rw" is not a valid mode)
    f.readline                        # reads the next line from a file
    f.readlines                       # returns an array of all file lines
    f.eof                             # returns true if at end of file
    f << object                       # convert object to string and write to f
    f.close                           # close file
    $stdin, $stdout, $stderr          # global variables for standard UNIX IO
By default stdin reads from keyboard, and stdout and stderr both write to terminal
File inherits some of these methods from IO
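
A short round trip through these methods, as a sketch. It uses Tempfile from the standard library to get a scratch path so that no real file.txt is needed; the scratch name is arbitrary.

```ruby
require "tempfile"

tmp  = Tempfile.new("cmsc330_demo")   # scratch file; the name prefix is arbitrary
path = tmp.path
tmp.close

out = File.new(path, "w")
out << "first line\n" << 42 << "\n"   # << converts 42 to "42" before writing
out.close

f = File.new(path, "r")
first  = f.readline                   # next line, including its newline
rest   = f.readlines                  # array of the remaining lines
at_end = f.eof                        # true once everything has been read
f.close
tmp.unlink                            # remove the scratch file

puts first.inspect    # "first line\n"
puts rest.inspect     # ["42\n"]
puts at_end           # true
```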

Exceptions
Use begin...rescue...ensure...end
  - Like try...catch...finally in Java
    begin
      f = File.open("test.txt", "r")
      while !f.eof
        line = f.readline
        puts line
      end
    rescue Exception => e     # Exception: class of exception to catch; e: local name for exception
      puts "Exception:" + e.to_s + " (class " + e.class.to_s + ")"
    ensure                    # always happens
      f.close if f != nil
    end
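
The same pattern wrapped in a method, as a sketch. The method name read_all and the missing path are hypothetical, and this version rescues StandardError rather than Exception, which is a narrower choice than the slide's.

```ruby
# Hypothetical helper: returns the file's lines, or a one-element error report.
def read_all(path)
  f = nil
  lines = []
  begin
    f = File.open(path, "r")
    lines << f.readline until f.eof
  rescue StandardError => e           # a missing file raises Errno::ENOENT
    lines = ["Exception: #{e.class}"]
  ensure
    f.close if f                      # runs whether or not an exception occurred
  end
  lines
end

result = read_all("no/such/file.txt") # hypothetical path assumed not to exist
puts result.inspect                   # ["Exception: Errno::ENOENT"]
```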

Command Line Arguments
Stored in predefined global constant ARGV
Example
  - If you invoke test.rb as "ruby test.rb a b c"
  - Then
    ARGV[0] = "a"
    ARGV[1] = "b"
    ARGV[2] = "c"

Practice: Amino Acid Counting in DNA
Write a function that will take a filename and read through that file, counting the number of times each group of three letters appears, so these numbers can be accessed from a hash.
(Assume the number of chars per line is a multiple of 3.)
    gcggcattcagcacccgtatactgttaagcaatccagatttttgtgtataacataccggc
    catactgaagcattcattgaggctagcgctgataacagtagcgctaacaatgggggaatg
    tggcaatacggtgcgattactaagagccgggaccacacaccccgtaaggatggagcgtgg
    taacataataatccgttcaagcagtgggcgaaggtggagatgttccagtaagaatagtgg
    gggcctactacccatggtacataattaagagatcgtcaatcttgagacggtcaatggtac
    cgagactatatcactcaactccggacgtatgcgcttactggtcacctcgttactgacgga

Practice: Amino Acid Counting in DNA
    def countaa(filename)
      file = File.new(filename, "r")    # get the file handle
      lines = file.readlines            # array of lines from the file
      file.close
      hash = Hash.new                   # initialize the hash, or you will get an error
                                        # when trying to index into an array with a string
      lines.each { |line|               # for each line in the file
        acids = line.scan(/.../)        # get an array of triplets in the line
        acids.each { |aa|               # for each triplet in the line
          if hash[aa] == nil
            hash[aa] = 1
          else
            hash[aa] += 1
          end
        }
      }
      hash                              # return the counts so they can be accessed
    end
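
A variant of the same counting idea, as a sketch. It takes an array of lines rather than a filename so it can be tried without a data file, and it uses Hash.new(0), which removes the nil check; the method name count_triplets and the two short sample lines are hypothetical.

```ruby
# Hypothetical variant: count triplets in an array of lines.
def count_triplets(lines)
  counts = Hash.new(0)                 # missing keys read as 0, so no nil check is needed
  lines.each do |line|
    # . does not match the trailing newline, so only full triplets are counted
    line.scan(/.../) { |aa| counts[aa] += 1 }
  end
  counts
end

counts = count_triplets(["gcggcagcg\n", "gcattt\n"])
puts counts.inspect   # {"gcg"=>2, "gca"=>2, "ttt"=>1}
```

With a file, the lines could come from File.readlines(filename).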

Comparisons
Sorting requires ability to compare two values
Ruby comparison method <=>
  - -1 = less
  - 0 = equals
  - +1 = greater
Examples
    3 <=> 4    # returns -1
    4 <=> 3    # returns +1
    3 <=> 3    # returns 0

Sorting
Two ways to sort an Array
Default sort (puts values in ascending order)
    [2,5,1,3,4].sort                        # returns [1,2,3,4,5]
Custom sort (based on value returned by code block)
    [2,5,1,3,4].sort { |x,y| y <=> x }      # returns [5,4,3,2,1]
  - Where -1 = less, 0 = equals, +1 = greater
  - Code block return value used for comparisons
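
The comparison operator and both sorting styles in one place, as a sketch. The string example sorted by length is an addition for illustration.

```ruby
cmp  = [3 <=> 4, 4 <=> 3, 3 <=> 3]                 # less, greater, equal

asc  = [2, 5, 1, 3, 4].sort                        # default: ascending order
desc = [2, 5, 1, 3, 4].sort { |x, y| y <=> x }     # swapped operands: descending

# The block can compare anything derived from the elements, e.g. string length.
by_len = ["ccc", "a", "bb"].sort { |x, y| x.length <=> y.length }

puts cmp.inspect      # [-1, 1, 0]
puts asc.inspect      # [1, 2, 3, 4, 5]
puts desc.inspect     # [5, 4, 3, 2, 1]
puts by_len.inspect   # ["a", "bb", "ccc"]
```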

Ruby Summary
Interpreted
Implicit declarations
Dynamically typed
Built-in regular expressions
Easy string manipulation
  - Makes it quick to write small programs
  - Hallmark of scripting languages
Object-oriented
  - Everything (!) is an object
Code blocks
  - Easy higher-order programming!
  - Get ready for a lot more of this...

Other Scripting Languages
Perl and Python are also popular scripting languages
  - Also are interpreted, use implicit declarations and dynamic typing, have easy string manipulation
  - Both include optional compilation for speed of loading/execution
Will look fairly familiar to you after Ruby
  - Lots of the same core ideas
  - All three have their proponents and detractors
  - Use whichever language you personally prefer

Example Perl Program
    #!/usr/bin/perl
    foreach (split(//, $ARGV[0])) {
      if ($G{$_}) {
        $RE .= "\\" . $G{$_};
      } else {
        $RE .= $N ? "(?!\\" . join("|\\", values(%G)) . ')(\w)' : '(\w)';
        $G{$_} = ++$N;
      }
    }

Example Python Program
    #!/usr/bin/python
    import re
    list = ("deep", "deer", "duck")
    x = re.compile("^\S{3,5}.[aeiou]")
    for i in list:
        if re.match(x, i):
            print i
        else:
            print