Tools for Text. Unix Pipe Fitting for Data Analysis. David A. Smith

Similar documents
CS 124/LINGUIST 180 From Languages to Information. Unix for Poets Dan Jurafsky

CS 124/LINGUIST 180 From Languages to Information

IB047. Unix Text Tools. Pavel Rychlý Mar 3.

CS 124/LINGUIST 180 From Languages to Informa<on. Unix for Poets (in 2013) Christopher Manning Stanford University

Tuesday, September 30, 14.

Overview. Unix/Regex Lab. 1. Setup & Unix review. 2. Count words in a text. 3. Sort a list of words in various ways. 4.

Unix for Poets (in 2016) Christopher Manning Stanford University Linguistics 278

Table of contents. Our goal. Notes. Notes. Notes. Summer June 29, Our goal is to see how we can use Unix as a tool for developing programs


STATS Data Analysis using Python. Lecture 15: Advanced Command Line

Lecture 11: Regular Expressions. LING 1330/2330: Introduction to Computational Linguistics Na-Rae Han


Scripting Languages Course 1. Diana Trandabăț

Shells & Shell Programming (Part B)

CS Unix Tools. Fall 2010 Lecture 5. Hussam Abu-Libdeh based on slides by David Slater. September 17, 2010

Unleashing the Shell Hands-On UNIX System Administration DeCal Week 6 28 February 2011

Unix Processes: C programmer s view. Unix Processes. Output a file: C Code. Process files/stdin: C Code. A Unix process executes in this environment

Unix Processes. A Unix process executes in this environment

A Brief Introduction to the Linux Shell for Data Science

ITST Searching, Extracting & Archiving Data

CST Lab #5. Student Name: Student Number: Lab section:

Practical 02. Bash & shell scripting


538 Text processing basics

Chapter 4. Unix Tutorial. Unix Shell

If you re using a Mac, follow these commands to prepare your computer to run these demos (and any other analysis you conduct with the Audio BNC

CSE 374 Programming Concepts & Tools. Laura Campbell (thanks to Hal Perkins) Winter 2014 Lecture 6 sed, command-line tools wrapup

CSE 390a Lecture 2. Exploring Shell Commands, Streams, and Redirection

Today. Review. Unix as an OS case study Intro to Shell Scripting. What is an Operating System? What are its goals? How do we evaluate it?

Dr. Sarah Abraham University of Texas at Austin Computer Science Department. Regular Expressions. Elements of Graphics CS324e Spring 2017

Part III. Shell Config. Tobias Neckel: Scripting with Bash and Python Compact Max-Planck, February 16-26,

Digital Humanities. Tutorial Regular Expressions. March 10, 2014

CSE 390a Lecture 2. Exploring Shell Commands, Streams, Redirection, and Processes

Linux command line basics III: piping commands for text processing. Yanbin Yin Fall 2015

Regular Expressions for Technical Writers (tutorial)

Regular Expressions. Michael Wrzaczek Dept of Biosciences, Plant Biology Viikki Plant Science Centre (ViPS) University of Helsinki, Finland

Unix as a Platform Exercises + Solutions. Course Code: OS 01 UNXPLAT

Review of Fundamentals

Midterm 1 1 /8 2 /9 3 /9 4 /12 5 /10. Faculty of Computer Science. Term: Fall 2018 (Sep4-Dec4) Student ID Information. Grade Table Question Score

CS Unix Tools & Scripting

CSE 303 Lecture 2. Introduction to bash shell. read Linux Pocket Guide pp , 58-59, 60, 65-70, 71-72, 77-80

Introduction To Linux. Rob Thomas - ACRC

Mineração de Dados Aplicada

CS160A EXERCISES-FILTERS2 Boyd

sottotitolo A.A. 2016/17 Federico Reghenzani, Alessandro Barenghi

Regular Expressions. Regular expressions match input within a line Regular expressions are very different than shell meta-characters.

Pipelines, Forks, and Shell


Getting to grips with Unix and the Linux family

Lab #2 Physics 91SI Spring 2013

Lecture 3. Essential skills for bioinformatics: Unix/Linux

Essentials for Scientific Computing: Bash Shell Scripting Day 3

Shell Programming Systems Skills in C and Unix

Module 11: Search and Sort Tools

Regex, Sed, Awk. Arindam Fadikar. December 12, 2017

Shells and Shell Programming

do shell script in AppleScript

A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer.

Wildcards and Regular Expressions

LAB 8 (Aug 4/5) Unix Utilities

Regular Expressions. Regular Expression Syntax in Python. Achtung!

Regular Expressions. Todd Kelley CST8207 Todd Kelley 1

Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)

My brother-in-law died suddenly

Regular Expressions. Computer Science and Engineering College of Engineering The Ohio State University. Lecture 9

5/8/2012. Exploring Utilities Chapter 5

CSE 374 Midterm Exam 2/6/17 Sample Solution. Question 1. (12 points, 4 each) Suppose we have the following files and directories:

UNIX II:grep, awk, sed. October 30, 2017

Practical Unix exercise MBV INFX410

Systems Programming/ C and UNIX

Lecture 3 Tonight we dine in shell. Hands-On Unix System Administration DeCal

Regular Expressions. using REs to find patterns. implementing REs using finite state automata. Sunday, 4 December 11

Bash command shell language interpreter

Lecture 18 Regular Expressions

CS 25200: Systems Programming. Lecture 11: *nix Commands and Shell Internals

LAB 8 (Aug 4/5) Unix Utilities

Review of Fundamentals. Todd Kelley CST8207 Todd Kelley 1

Shells and Shell Programming

Computer Systems and Architecture

9.2 Linux Essentials Exam Objectives

Lecture 5. Essential skills for bioinformatics: Unix/Linux

Linux Command Line Interface. December 27, 2017

Introduction p. 1 Who Should Read This Book? p. 1 What You Need to Know Before Reading This Book p. 2 How This Book Is Organized p.

Week Overview. Simple filter commands: head, tail, cut, sort, tr, wc grep utility stdin, stdout, stderr Redirection and piping /dev/null file

Reading and manipulating files

CSCI 4061: Pipes and FIFOs

Shell Programming Overview

CS 301. Lecture 05 Applications of Regular Languages. Stephen Checkoway. January 31, 2018

CSCI 2132 Software Development. Lecture 7: Wildcards and Regular Expressions

Bioinformatics? Reads, assembly, annotation, comparative genomics and a bit of phylogeny.

Computer Systems and Architecture

Essential Skills for Bioinformatics: Unix/Linux

Basic Linux (Bash) Commands

Introduction to Linux

CS Unix Tools & Scripting Lecture 7 Working with Stream

UNIX files searching, and other interrogation techniques

Regular Expressions in programming. CSE 307 Principles of Programming Languages Stony Brook University

Regular Expressions Explained

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

UNIX, GNU/Linux and simple tools for data manipulation

Transcription:

Tools for Text Unix Pipe Fitting for Data Analysis David A. Smith 1

Tools 2

3

4

Data 5

6

6

6

Data Structures nobody:*:-2: nogroup:*:-1: wheel:*:0:root daemon:*:1:root kmem:*:2:root sys:*:3:root tty:*:4:root operator:*:5:root mail:*:6:_teamsserver Sequence of lines Delimiters separate fields 7

Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace separates fields 8

Data Structures Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who, for his own amusement, never took up any book but the Baronetage; there he found occupation for an idle hour, and consolation in a distressed one; there his faculties were roused into admiration and respect, by contemplating the limited remnant of the earliest patents; there any unwelcome sensations, arising from domestic affairs Tokens separated by more complex orthographic conventions 9

Data Structures Plain text Flat file databases Linear not random access Scales to arbitrary sizes 10

Tools for Research 11

Tools for Research If we knew what we wanted to do... 11

Tools for Research If we knew what we wanted to do...... and what our data should look like,... 11

Tools for Research If we knew what we wanted to do...... and what our data should look like,...... we could build special-purpose programs and data structures. 11

Tools for Research If we knew what we wanted to do...... and what our data should look like,...... we could build special-purpose programs and data structures. But we don t know what we want! 11

Tools for Research If we knew what we wanted to do...... and what our data should look like,...... we could build special-purpose programs and data structures. But we don t know what we want! (It s research) 11

Tools for Research If we knew what we wanted to do...... and what our data should look like,...... we could build special-purpose programs and data structures. But we don t know what we want! (It s research) Some tools already exist; some don t. 11

Tools for Research If we knew what we wanted to do...... and what our data should look like,...... we could build special-purpose programs and data structures. But we don t know what we want! (It s research) Some tools already exist; some don t. Tools communicate with simple text data. 11

Discoverability Composability 12

Big Tools Discoverability Composability 12

Big Tools Discoverability CLI Composability 12

Big Tools Discoverability CLI+Google CLI Composability 12

Today s Tasks Sorting and counting words Regular expressions Matching and substitution Scripting and automation 13

Getting Some Data http://nltk.googlecode.com/svn/trunk/ nltk_data/packages/corpora/gutenberg.zip Unzip (e.g., unzip -a gutenberg.zip) Open the Termimal app, or similar Go to the new directory cd ~/Downloads/gutenberg ls 14

Looking at Files head -100 : first 100 lines tail -20 : last 20 lines -n +20 : tail starting at 20th line less Your favorite text editor 15

Counting Words wc Discoverability: man wc But I really meant, counting different words Really simple tokenization: tr Sorting: sort Aggregation (counting): uniq 16

Directing Data < Input from a file > Output to a file Pipe 17

Counting Words tr -sc 'A-Za-z' '\n' Turns one set of characters into another What about: tr 'A-Z' 'a-z' sort sort -n numerical order sort -r reverse order sort -k 2 sort by 2nd (whitespace) field uniq -c: count contiguous repeated lines 18

Counting Words $ tr -sc 'A-Za-z' '\n' < austen-emma.txt sort uniq -c 1 125 A 31 Abbey 1 Abbots 1 Abdy... Try piping to less, head, or out to a file. 19

More Counting Sort by word frequency Merge counts for upper and lower case Sort by word endings (rev) 20

Sorting by Frequency $ tr -sc 'A-Za-z' '\n' < austen-emma.txt sort uniq -c sort -rn 5186 to 4846 the 4673 and 4281 of 3192 I... 21

Merging Case $ tr -sc 'A-Za-z' '\n' < austen-emma.txt tr 'A-Z' 'a-z' sort uniq -c 1 1595 a 1 abbreviation 1 abdication 1 abide 3 abilities 30 able... 22

Parallel Streams paste file1 file2 paste <(pipe1) <(pipe2) cut -f 1 Counting bigrams join -1 2-2 2 : join on second field Inputs must be sorted on the joined field! 23

Parallel Streams $ tr -sc 'A-Za-z' '\n' < austen-emma.txt tr 'A-Z' 'a-z' > austenemma.words $ paste austen-emma.words <(tail -n +2 austen-emma.words) emma emma by by jane jane austen austen volume volume i i chapter chapter i i emma emma woodhouse woodhouse handsome handsome clever clever and and rich rich with... 24

Counting Bigrams $ paste austen-emma.words <(tail -n +2 austen-emma.words) sort uniq -c sort 608 to be 566 of the 449 it was 446 in the 395 i am 334 she had 331 she was 308 had been 301 it is 299 mr knightley 283 i have 278 could not 265 of her 256 mrs weston... 25

Comparing Counts $ tr -sc 'A-Za-z' '\n' < austen-sense.txt grep -v '^$' tr 'A-Z' 'a-z' sort uniq -c > austen-sense.counts $ tr -sc 'A-Za-z' '\n' < austen-persuasion.txt grep -v '^$' tr 'A-Z' 'a-z' sort uniq -c > austen-persuasion.counts $ join -1 2-2 2 austen-sense.counts austen-persuasion.counts a 2092 1595 abilities 9 3 able 46 30 abode 5 1 about 144 97 above 20 6 abroad 9 5 absence 11 9... 26

Comparing Counts $ tr -sc 'A-Za-z' '\n' < austen-sense.txt grep -v '^$' wc -l 120734 $ tr -sc 'A-Za-z' '\n' < austen-persuasion.txt grep -v '^$' wc -l 84123 $ join -1 2-2 2 austen-sense.counts austen-persuasion.counts awk '{ print $1, $2 * 10000 / 120734, $3 * 10000 / 84123 }' > sensepersuasion.counts $ less sense-persuasion.counts a 173.273 189.603 abilities 0.74544 0.356621 able 3.81003 3.56621 abode 0.414134 0.118874 about 11.927 11.5307 above 1.65653 0.713241 abroad 0.74544 0.594368 absence 0.911094 1.06986 absent 0.24848 0.475494... 27

Comparing Counts pcounts$persuasion 0.1 0.5 1.0 5.0 50.0 500.0 hall mary bath anne father smith sort everything looking thing mother beginning cousin bad decided morrow high colonel rooms also completely everybody value kind pride town entered equally body affection fancy whatever thousand spent declared easily t assured concern bed pounds edward expectation fanny behaviour park fear likewise john resolved Filtered out words w/similar frequencies 0.1 0.5 1.0 5.0 10.0 50.0 pcounts$sense 28

Digression: R graphs > counts <- read.table("sense-persuation.counts") > names(counts) <- alist(word,sense,persuasion) > pcount <- subset(counts, (sense > 2 persuasion > 2) & abs(log(sense/persuasion)) > 1.1) > plot(pcounts$sense, pcounts$persuasion, type="n", log="xy") > text(pcount$sense, pcount$persuasion, label=pcount$word) 29

Filters grep '^[A-Z]' generalized regular expression parser???! Matches words beginning with a capital Count results? What are regular expressions? 30

Regular Expressions 31

Regular Expressions Two types of characters Literal Every normal alphanumeric character is an RE and matches itself Meta-characters Special characters that allow you to combine REs in various ways Example a matches a a* matches ε or a or aa or aaa or... 32

Regex Basics Pattern Matches Character Concat went went Alternatives (go went) go went [aeiou] a o u disjunc. negation [^aeiou] b c d f g wildcard char. a z & Loops & skips a* ε a aa aaa... one or more a+ a aa aaa zero or one colou?r color colour 33

More Regexes Special characters \t tab \v vertical tab \n newline \r carriage return Aliases (shorthand) \d digits [0-9] \D non-digits [^0-9] \w alphabetic [a-za-z] \W non-alphabetic [^a-za-z] \s whitespace [\t\n\r\f\v] \w alphabetic [a-za-z] Examples \d+ dollars 3 dollars, 50 dollars, 982 dollars \w*oo\w* food, boo, oodles Escape character \ is the general escape character; e.g. \. is not a wildcard, but matches a period. m, UMass Amherst, if you want to use \ in a string it has to be escaped \\ 34

More Regexes Anchors. AKA, zero width characters. They match positions in the text. ^ beginning of line $ end of line \b word boundary, i.e. location with \w on one side but not on the other. \B negated word boundary, i.e. any location that would not match \b Examples: \bthe\b the together Counters {1}, {1,2}, {3,} 35

More Regexes Grouping a (good bad) movie He said it (again and )*again. Parens also indicate Registers (saved contents) b(\w+)h\1 matches boohoo and baha, but not boohaa The digit after the \ indicates which of multiple paren groups, as ordered by when then were opened. Grouping without the cost of register saving He went (?:this that) way. 36

Exercising grep How many upper case words? How many distinct upper case words? Are there any words with no vowels? Any one-syllable (i.e. one vowel) words? Omit words with silent e Gotchas: grep doesn t support all modern perl-style regex extensions, try egrep 37

Regex Substitutions Lots of languages support them: perl, python, ruby, Java (and the JVM),... perl -pe 's/regex/subst/g' easiest on the command line perl -000 -pe 's/regex/sub/g' Match paragraphs, not lines perl -ne 'print $1 while /foo (bar)/g' Print only the captured part 38

Munging with Regexes Remove common morphological suffixes from word list and get counts Patterns that indicate proper names Extract direct speech from novels Put each paragraph on a single line For, e.g., Mallet input 39

Replicable Research 40

Shell Scripts If you ve figured out a good pipeline, write it down! Better yet, make it executable. Make a file histogram containing #!/bin/sh tr -sc 'A-Za-z' '\n' \ sort uniq -c sort -rn Then run chmod +x histogram 41

Makefiles Output from one process is input for another Capture dependencies in a Makefile Run make target to generate output comparison:! emma.txt sense.txt join -1 2-2 2 $^ \ hist-stats > $@ %.hist:!%.txt histogram!./histogram < $< > $@ 42

Where Next? Transforming data for more specialized analysis tools E.g., Gephi, R, Voyant Programming to supply missing tools E.g., R, python ( Programming Historian ) Tools for more structured (XML) text Though regexes can work pretty well! 43

Thanks Inspired by Ken Church s Unix for Poets My colleagues at the NULab for Texts, Maps & Networks 44