Unix/Regex Lab
CS 341: Natural Language Processing
Heather Pon-Barry

Based on Unix for Poets (by Ken Church)

Overview
1. Setup & Unix review
2. Count words in a text
3. Sort a list of words in various ways
4. Search with grep
5. Two-minute response
1. Setup & Unix Review

Setting Up
In your home directory, make a cs341 folder.
Make a directory called unixforpoets for today's lab activity.
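One way to do the setup from the command line is sketched below. It assumes that unixforpoets is meant to live inside cs341; the slide does not say so explicitly, so adjust the path if your layout differs.

```shell
# Assumption: unixforpoets is created inside cs341 (not stated on the slide).
mkdir -p ~/cs341/unixforpoets   # -p also creates cs341 if it does not exist yet
cd ~/cs341/unixforpoets         # move into the lab directory
pwd                             # print the current directory to confirm
```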
Unix Tools
pwd
ls
cd <dirname>
cd ../
less <filename>
head <filename>
tail <filename>
man <command>
piping: | > <
CTRL-C
grep: search for a pattern (regular expression)
sort
uniq -c (count duplicates)
tr (translate characters)
wc (word or line count)
cat (send file(s) in stream)
sed (edit string: replacement)
2. Count words in a text

Counting lines, words, characters
    wc alice.txt
    1601 27336 135029 alice.txt
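The three numbers printed by wc are the line count, the word count, and the character (byte) count, in that order. A minimal sketch on a made-up two-line file (sample.txt is a stand-in, not one of the lab files) shows how to get each number separately:

```shell
# sample.txt is a made-up two-line file, not one of the lab files
workdir=$(mktemp -d)
printf 'the cat sat\non the mat\n' > "$workdir/sample.txt"

wc "$workdir/sample.txt"    # lines, words, bytes, then the file name

# Reading from standard input drops the file name from the output
lines=$(wc -l < "$workdir/sample.txt")   # lines only
words=$(wc -w < "$workdir/sample.txt")   # words only
bytes=$(wc -c < "$workdir/sample.txt")   # bytes only
echo "lines=$((lines)) words=$((words)) bytes=$((bytes))"
# prints: lines=2 words=6 bytes=23
```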
tr command
NAME
    tr - translate or delete characters
SYNOPSIS
    tr [OPTION]... SET1 [SET2]
DESCRIPTION
    Translate, squeeze, and/or delete characters from standard input, writing to standard output.
    -c      use the complement of SET1
    -s      if SET2 is specified, squeezes repeated SET2 characters to a single character
    --help  display this help and exit

Counting Words
Input: mini-alice.txt; alice.txt
Output: list of words with frequency counts
Algorithm
1. Create a file with one token per line (tr -sc)
2. Sort (sort)
3. Count duplicates (uniq -c)
Practice using tr, sort, and uniq incrementally on mini-alice.txt.
Once you understand each step, run your command on alice.txt.
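The three steps of the algorithm can be tried one at a time. The sketch below uses a made-up file, mini.txt, as a stand-in for mini-alice.txt, so the counts it produces are not the lab's counts.

```shell
# mini.txt is a made-up stand-in for mini-alice.txt
workdir=$(mktemp -d)
printf 'the cat saw the dog\nthe dog ran\n' > "$workdir/mini.txt"

# Step 1: one token per line. -c takes the complement of A-Za-z (every
# non-letter), and -s squeezes runs of the replacement newline into one.
tr -sc 'A-Za-z' '\n' < "$workdir/mini.txt"

# Steps 1-2: sort, so that identical words end up on adjacent lines
tr -sc 'A-Za-z' '\n' < "$workdir/mini.txt" | sort

# Steps 1-3: uniq -c collapses each run of identical lines and prefixes a count
counts=$(tr -sc 'A-Za-z' '\n' < "$workdir/mini.txt" | sort | uniq -c)
printf '%s\n' "$counts"
```

Sorting has to come before uniq because uniq only compares neighbouring lines.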
Output
    632 a
    1 abide
    1 able
    94 about
    3 above
    1 absence
    2 absurd
    1 acceptance
    2 accident
    1 accidentally
    ...
Solution:
    tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c

head and tail
head gives you the first n lines (n=10 by default; you can specify n with the flag -n)
    tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c | head -n 5
    632 a
    1 abide
    1 able
    94 about
    3 above
What do you think tail does?
3. Sort a list of words in various ways

Most Frequent Words
Exercise
Find the 50 most common words in alice.txt.
Hint: Use sort a second time, then head.
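As a hedged illustration of the hint, the sketch below runs the idea on a made-up file (mini.txt, not a lab file) and keeps the top 2 words rather than the top 50. It assumes the second sort should be numeric and reversed (sort -nr), which is one reasonable reading of the hint rather than the only possible solution.

```shell
# mini.txt is a made-up stand-in; the real exercise uses alice.txt
workdir=$(mktemp -d)
printf 'the cat saw the dog\nthe dog ran\n' > "$workdir/mini.txt"

# First sort groups identical words; uniq -c counts them;
# second sort orders by count: -n compares numerically, -r puts largest first.
top=$(tr -sc 'A-Za-z' '\n' < "$workdir/mini.txt" | sort | uniq -c | sort -nr | head -n 2)
printf '%s\n' "$top"
```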
4. Search with grep

grep
Grep finds patterns specified as regular expressions.
g/re/p: globally search for regular expression and print
grep
Try this:
    grep cheshire alice.txt
    it s a cheshire cat said the duchess and that s why pig she said the last word with such sudden violence that alice quite jumped but she saw in another moment that it was addressed to the baby and not to her so she took courage and went on again i didn t know that cheshire cats always grinned in fact i didn t know that cats could grin
Next, try grepping other phrases.

grep
Make an intermediary words file:
    tr -sc 'A-Za-z' '\n' < alice.txt > alice.words
Finding words ending in ing:
    grep 'ing$' alice.words | sort | uniq -c
grep
grep is a filter: you keep only some lines of the input.
Try these on alice.words:
    grep gh                                 keep lines containing gh
    grep '^con'                             keep lines beginning with con
    grep 'ing$'                             keep lines ending with ing
    grep -v gh                              keep lines NOT containing gh
    grep -i '[aeiou].*[aeiou]'              keep lines with two or more vowels
    grep -i '^[^aeiou]*[aeiou][^aeiou]*$'   keep lines with exactly one vowel

Take-home Message
Piping commands together can be simple yet powerful in Unix.
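To see each filter at work before running it on alice.words, the sketch below applies the same patterns to a made-up seven-word list (words.txt is hypothetical and chosen only so that every pattern has a visible effect). The comments list the lines each command keeps on this list.

```shell
# words.txt is a made-up word list, one word per line
workdir=$(mktemp -d)
printf '%s\n' sing light contest night icon rhythm a > "$workdir/words.txt"

grep 'gh' "$workdir/words.txt"      # keeps: light night
grep '^con' "$workdir/words.txt"    # keeps: contest (icon has con, but not at the start)
grep 'ing$' "$workdir/words.txt"    # keeps: sing
grep -v 'gh' "$workdir/words.txt"   # keeps: sing contest icon rhythm a

# Two or more vowels: a vowel, anything, then another vowel
two_plus=$(grep -i '[aeiou].*[aeiou]' "$workdir/words.txt")
printf '%s\n' "$two_plus"           # keeps: contest icon

# Exactly one vowel: the anchors force the pattern to cover the whole line
exactly_one=$(grep -i '^[^aeiou]*[aeiou][^aeiou]*$' "$workdir/words.txt")
printf '%s\n' "$exactly_one"        # keeps: sing light night a
```

The patterns are quoted so that the shell passes characters such as $, * and [ to grep unchanged.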
https://xkcd.com/208/ 5. Two-minute response
Two-minute Response
In Piazza, post a Note to Instructor only:
1. What is one thing you understand better after today's activity?
2. What is something that's still unclear, or a question you have?

Extra Exercises
Sorting exercises
In alice.txt:
Find the words that end in ling using sorting (and not using grep).
Hint: what does this do?
    tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq | head | rev

Exercises on grep & wc
In alice.txt:
How many 4-letter words are there?
How many different words are there with no vowels? What subtypes do they belong to?
How many 1-syllable words are there? That is, ones with exactly one vowel.
Answer these with respect to word types, not word tokens.
grep
We used the following to keep lines with exactly one vowel:
    grep -i '^[^aeiou]*[aeiou][^aeiou]*$'
What would happen if we instead used the command below? In what contexts is this important?
    grep -i '[^aeiou]*[aeiou][^aeiou]*'