CS 124/LINGUIST 180 From Languages to Information. Unix for Poets Dan Jurafsky

Similar documents
CS 124/LINGUIST 180 From Languages to Information

CS 124/LINGUIST 180 From Languages to Informa<on. Unix for Poets (in 2013) Christopher Manning Stanford University

Unix for Poets (in 2016) Christopher Manning Stanford University Linguistics 278

Overview. Unix/Regex Lab. 1. Setup & Unix review. 2. Count words in a text. 3. Sort a list of words in various ways. 4.

538 Text processing basics

IB047. Unix Text Tools. Pavel Rychlý Mar 3.

A Brief Introduction to the Linux Shell for Data Science

Mineração de Dados Aplicada

5/8/2012. Exploring Utilities Chapter 5

Introduction To Linux. Rob Thomas - ACRC

Tools for Text. Unix Pipe Fitting for Data Analysis. David A. Smith

Table of contents. Our goal. Notes. Notes. Notes. Summer June 29, Our goal is to see how we can use Unix as a tool for developing programs

CS160A EXERCISES-FILTERS2 Boyd

CS Unix Tools. Fall 2010 Lecture 5. Hussam Abu-Libdeh based on slides by David Slater. September 17, 2010

STATS Data Analysis using Python. Lecture 15: Advanced Command Line

UNIX files searching, and other interrogation techniques

More text file manipulation: sorting, cutting, pasting, joining, subsetting,

Unleashing the Shell Hands-On UNIX System Administration DeCal Week 6 28 February 2011


ITST Searching, Extracting & Archiving Data

Digital Humanities. Tutorial Regular Expressions. March 10, 2014

Lab #2 Physics 91SI Spring 2013

Linux Text Utilities 101 for S/390 Wizards SHARE Session 9220/5522

Week Overview. Simple filter commands: head, tail, cut, sort, tr, wc grep utility stdin, stdout, stderr Redirection and piping /dev/null file

Part III. Shell Config. Tobias Neckel: Scripting with Bash and Python Compact Max-Planck, February 16-26,


Advanced training. Linux components Command shell. LiLux a.s.b.l.

CSC209. Software Tools and Systems Programming.

CPSC 217 Midterm (Python 3 version)

Lecture 3 Tonight we dine in shell. Hands-On Unix System Administration DeCal

Introduction to Scripting using bash

Unix as a Platform Exercises + Solutions. Course Code: OS 01 UNXPLAT

Practical Session 0 Introduction to Linux

QUESTION BANK ON UNIX & SHELL PROGRAMMING-502 (CORE PAPER-2)

CST Lab #5. Student Name: Student Number: Lab section:

CSCI 4061: Pipes and FIFOs

Introduction to Unix

Recap From Last Time:

Introduction p. 1 Who Should Read This Book? p. 1 What You Need to Know Before Reading This Book p. 2 How This Book Is Organized p.

CSE 390a Lecture 2. Exploring Shell Commands, Streams, and Redirection

BGGN 213 Working with UNIX Barry Grant

Introduction To. Barry Grant

Reading and manipulating files

Unix L555. Dept. of Linguistics, Indiana University Fall Unix. Unix. Directories. Files. Useful Commands. Permissions. tar.

Basic Shell Scripting Practice. HPC User Services LSU HPC & LON March 2018

Scripting Languages Course 1. Diana Trandabăț

Practical 02. Bash & shell scripting

Computer Systems and Architecture

Introduction to Linux (and the terminal)

5/20/2007. Touring Essential Programs


bash, part 3 Chris GauthierDickey

The input can also be taken from a file and similarly the output can be redirected to another file.

Common File System Commands

Today s Lecture. The Unix Shell. Unix Architecture (simplified) Lecture 3: Unix Shell, Pattern Matching, Regular Expressions

Module 8 Pipes, Redirection and REGEX

Getting to grips with Unix and the Linux family

BNF, EBNF Regular Expressions. Programming Languages,

Python allows variables to hold string values, just like any other type (Boolean, int, float). So, the following assignment statements are valid:

CSE 390a Lecture 2. Exploring Shell Commands, Streams, Redirection, and Processes

Regular Expressions in Perl

Linux command line basics III: piping commands for text processing. Yanbin Yin Fall 2015

Introduction to Unix

Introduction to UNIX command-line II

CS214-AdvancedUNIX. Lecture 2 Basic commands and regular expressions. Ymir Vigfusson. CS214 p.1

Computer Systems and Architecture

Introduction to Linux

Introduction: What is Unix?

Introduction To. Barry Grant

CSE 303 Lecture 2. Introduction to bash shell. read Linux Pocket Guide pp , 58-59, 60, 65-70, 71-72, 77-80

CS Unix Tools & Scripting Lecture 7 Working with Stream

6 Redirection. Standard Input, Output, And Error. 6 Redirection

Recap From Last Time: Setup Checklist BGGN 213. Todays Menu. Introduction to UNIX.

find starting-directory -name filename -user username

- c list The list specifies character positions.

Using UNIX. -rwxr--r-- 1 root sys Sep 5 14:15 good_program

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010


Dalhousie University CSCI 2132 Software Development Winter 2018 Lab 2, January 25

If you re using a Mac, follow these commands to prepare your computer to run these demos (and any other analysis you conduct with the Audio BNC

6.033 Computer System Engineering

CS Unix Tools & Scripting

Basics. I think that the later is better.

From pipelines to graphs: Escape the tyranny of the shell s linear pipelines with dgsh

UNIX II:grep, awk, sed. October 30, 2017

Shell Programming Overview

CSC209H Lecture 1. Dan Zingaro. January 7, 2015

CSC209. Software Tools and Systems Programming.

Unix Processes: C programmer s view. Unix Processes. Output a file: C Code. Process files/stdin: C Code. A Unix process executes in this environment

2) clear :- It clears the terminal screen. Syntax :- clear

Sub-capacity (Virtualization) License Counting Rules

ADVANCED LINUX SYSTEM ADMINISTRATION

Lecture 18 Regular Expressions

Table of Contents. Evaluation Copy

Introduction to UNIX. Introduction. Processes. ps command. The File System. Directory Structure. UNIX is an operating system (OS).

Introduction to UNIX. CSE 2031 Fall November 5, 2012

Unix Essentials. BaRC Hot Topics Bioinformatics and Research Computing Whitehead Institute October 12 th

sottotitolo A.A. 2016/17 Federico Reghenzani, Alessandro Barenghi

CS 25200: Systems Programming. Lecture 11: *nix Commands and Shell Internals

UNIX, GNU/Linux and simple tools for data manipulation

Transcription:

CS 124/LINGUIST 180 From Languages to Information Unix for Poets Dan Jurafsky (original by Ken Church, modifications by me and Chris Manning) Stanford University

Unix for Poets Text is everywhere The Web Dictionaries, corpora, email, etc. Billions and billions of words What can we do with it all? It is better to do something simple, than nothing at all. You can do simple things from a Unix command-line Sometimes it s much faster even than writing a quick python tool 2 DIY is very satisfying

Exercises we ll be doing today 1. Count words in a text 2. Sort a list of words in various ways ascii order rhyming order 3. Extract useful info from a dictionary 4. Compute ngram statistics 5. Work with parts of speech in tagged text 3

Tools grep: search for a pattern (regular expression) sort uniq c (count duplicates) tr (translate characters) wc (word or line count) sed (edit string -- replacement) cat (send file(s) in stream) echo (send text in stream) cut (columns in tab-separated files) paste (paste columns) head tail rev (reverse lines) comm join shuf (shuffle lines of text) 4

Prerequisites: get the text file we are using myth: ssh into a myth and then do: scp cardinal:/afs/ir/class/cs124/nyt_200811.txt.gz. Or if you re using your own Mac or Unix laptop, do that or you could download, if you haven't already: http://cs124.stanford.edu/nyt_200811.txt.gz Then: gunzip nyt_200811.txt.gz 5

Prerequisites The unix man command e.g., man tr (shows command options; not friendly) Input/output redirection: > output to a file < input from a file pipe CTRL-C 6

Exercise 1: Count words in a text Input: text file (nyt_200811.txt) (after it s gunzipped) Output: list of words in the file with freq counts Algorithm 1. Tokenize (tr) 2. Sort (sort) 3. Count duplicates (uniq c) Go read the man pages and figure out how to pipe these 7 together

Solution to Exercise 1 tr -sc 'A-Za-z' '\n' < nyt_200811.txt sort uniq -c 8 25476 a 1271 A 3 AA 3 AAA 1 Aalborg 1 Aaliyah 1 Aalto 2 aardvark

Some of the output tr -sc 'A-Za-z' '\n' < nyt_200811.txt sort uniq -c head n 5 25476 a tr -sc 'A-Za-z' '\n' < nyt_200811.txt sort uniq -c head Gives you the first 10 lines 9 1271 A 3 AA 3 AAA 1 Aalborg tail does the same with the end of the input (You can omit the -n but it s discouraged.)

Extended Counting Exercises 1. Merge upper and lower case by downcasing everything Hint: Put in a second tr command 2. How common are different sequences of vowels (e.g., ieu) 10 Hint: Put in a second tr command

Solutions Merge upper and lower case by downcasing everything tr -sc 'A-Za-z' '\n' < nyt_200811.txt tr 'A-Z' 'a-z' sort uniq -c or tr -sc 'A-Za-z' '\n' < nyt_200811.txt tr '[:upper:]' '[:lower:]' sort uniq -c 1. tokenize by replacing the complement of letters with newlines 2. replace all uppercase with lowercase 3. sort alphabetically

Solutions How common are different sequences of vowels (e.g., ieu) tr -sc 'A-Za-z' '\n' < nyt_200811.txt tr 'A-Z' 'a-z' tr -sc 'aeiou' '\n' sort uniq -c 12

Sorting and reversing lines of text sort sort f Ignore case sort n Numeric order sort r Reverse sort sort nr Reverse numeric sort echo "Hello" rev 13

Counting and sorting exercises Find the 50 most common words in the NYT Hint: Use sort a second time, then head Find the words in the NYT that end in "zz" Hint: Look at the end of a list of reversed words tr 'A-Z' 'a-z' < filename tr sc 'A-Za-z' '\n' rev sort rev uniq -c 14

Counting and sorting exercises Find the 50 most common words in the NYT tr -sc 'A-Za-z' '\n' < nyt_200811.txt sort uniq -c sort -nr head -n 50 Find the words in the NYT that end in "zz" 15 tr -sc 'A-Za-z' '\n' < nyt_200811.txt tr 'A-Z' 'a-z' rev sort uniq -c rev tail -n 10

Lesson Piping commands together can be simple yet powerful in Unix It gives flexibility. Traditional Unix philosophy: small tools that can be 16 composed

Bigrams = word pairs and their counts Algorithm: 1. Tokenize by word 2. Create two almost-duplicate files of words, off by one line, using tail 3. paste them together so as to get word and word on the same line i i +1 4. Count 17

Bigrams tr -sc 'A-Za-z' '\n' < nyt_200811.txt > nyt.words tail -n +2 nyt.words > nyt.nextwords paste nyt.words nyt.nextwords > nyt.bigrams head n 5 nyt.bigrams 18 KBR said said Friday Friday the the global global economic

Exercises Find the 10 most common bigrams (For you to look at:) What part-of-speech pattern are most of them? Find the 10 most common trigrams 19

Solutions Find the 10 most common bigrams tr 'A-Z' 'a-z' < nyt.bigrams sort uniq -c sort -nr head -n 10 Find the 10 most common trigrams tail -n +3 nyt.words > nyt.thirdwords paste nyt.words nyt.nextwords nyt.thirdwords > nyt.trigrams cat nyt.trigrams tr "[:upper:]" "[:lower:]" sort uniq -c sort -rn head -n 10 20

grep Grep finds patterns specified as regular expressions grep rebuilt nyt_200811.txt Conn and Johnson, has been rebuilt, among the first of the 222 move into their rebuilt home, sleeping under the same roof for the the part of town that was wiped away and is being rebuilt. That is to laser trace what was there and rebuilt it with accuracy," she home - is expected to be rebuilt by spring. Braasch promises that a the anonymous places where the country will have to be rebuilt, "The party will not be rebuilt without moderates being a part of 21

grep Grep finds patterns specified as regular expressions globally search for regular expression and print Finding words ending in ing: grep 'ing$' nyt.words sort uniq c 22

grep grep is a filter you keep only some lines of the input grep gh keep lines containing gh grep 'ˆcon' keep lines beginning with con grep 'ing$' keep lines ending with ing grep v gh keep lines NOT containing gh egrep [extended syntax] egrep '^[A-Z]+$' nyt.words sort uniq -c ALL UPPERCASE (egrep, grep e, grep P, even grep might work)

Counting lines, words, characters wc nyt_200811.txt 140000 1007597 6070784 nyt_200811.txt wc -l nyt.words 1017618 nyt.words Exercise: Why is the number of words different? 24

Exercises on grep & wc How many all uppercase words are there in this NYT file? How many 4-letter words? How many different words are there with no vowels What subtypes do they belong to? How many 1 syllable words are there That is, ones with exactly one vowel Type/token distinction: different words (types) vs. instances (tokens) 25

Solutions on grep & wc How many all uppercase words are there in this NYT file? grep -P '^[A-Z]+$' nyt.words wc How many 4-letter words? grep -P '^[a-za-z]{4}$' nyt.words wc How many different words are there with no vowels grep -v '[AEIOUaeiou]' nyt.words sort uniq wc How many 1 syllable words are there tr 'A-Z' 'a-z' < nyt.words grep -P '^[^aeiouaeiou]*[aeiouaeiou]+[^aeiouaeiou]*$' uniq wc Type/token distinction: different words (types) vs. instances (tokens)

sed sed is used when you need to make systematic changes to strings in a file (larger changes than tr ) It s line based: you optionally specify a line (by regex or line numbers) and specific a regex substitution to make For example to change all cases of George to Jane : sed 's/george/jane/' nyt_200811.txt less 27

sed exercises Count frequency of word initial consonant sequences Take tokenized words Delete the first vowel through the end of the word Sort and count Count word final consonant sequences 28

sed exercises Count frequency of word initial consonant sequences tr "[:upper:]" "[:lower:]" < nyt.words sed 's/[aeiouaeiou].*$//' sort uniq -c Count word final consonant sequences tr "[:upper:]" "[:lower:]" < nyt.words sed 's/^.*[aeiou]//g' sort uniq -c sort -rn less 29

cut tab separated files scp <sunet>@myth.stanford.edu:/afs/ir/class/cs124/parses.conll.gz. gunzip parses.conll.gz head n 5 parses.conll 1 Influential _ JJ JJ _ 2 amod 2 members _ NNS NNS _ 10 nsubj 3 of _ IN IN _ 2 prep 4 the _ DT DT _ 6 det 5 House _ NNP NNP _ 6 nn 30

cut tab separated files Frequency of different parts of speech: cut -f 4 parses.conll sort uniq -c sort -nr Get just words and their parts of speech: cut -f 2,4 parses.conll You can deal with comma separated files with: cut d, 31

cut exercises How often is that used as a determiner (DT) that rabbit versus a complementizer (IN) I know that they are plastic versus a relative (WDT) The class that I love Hint: With grep, you can use '\t' for a tab character What determiners occur in the data? What are the 5 most common? 32

cut exercise solutions How often is that used as a determiner (DT) that rabbit versus a complementizer (IN) I know that they are plastic versus a relative (WDT) The class that I love cat parses.conll grep -P '(that\t_\tdt) (that\t_\tin) (that\t_\twdt)' cut -f 2,4 sort uniq -c What determiners occur in the data? What are the 5 most common? cat parses.conll tr 'A-Z' 'a-z' grep -P