CS 124/LINGUIST 180 From Languages to Informa<on. Unix for Poets (in 2013) Christopher Manning Stanford University

Similar documents
CS 124/LINGUIST 180 From Languages to Information

CS 124/LINGUIST 180 From Languages to Information. Unix for Poets Dan Jurafsky

Unix for Poets (in 2016) Christopher Manning Stanford University Linguistics 278

Overview. Unix/Regex Lab. 1. Setup & Unix review. 2. Count words in a text. 3. Sort a list of words in various ways. 4.

538 Text processing basics

IB047. Unix Text Tools. Pavel Rychlý Mar 3.

A Brief Introduction to the Linux Shell for Data Science

Mineração de Dados Aplicada

Tools for Text. Unix Pipe Fitting for Data Analysis. David A. Smith

CSC209. Software Tools and Systems Programming.

5/8/2012. Exploring Utilities Chapter 5

STATS Data Analysis using Python. Lecture 15: Advanced Command Line

Lab #2 Physics 91SI Spring 2013

The input can also be taken from a file and similarly the output can be redirected to another file.

Part III. Shell Config. Tobias Neckel: Scripting with Bash and Python Compact Max-Planck, February 16-26,

Table of contents. Our goal. Notes. Notes. Notes. Summer June 29, Our goal is to see how we can use Unix as a tool for developing programs

Practical Session 0 Introduction to Linux


Digital Humanities. Tutorial Regular Expressions. March 10, 2014

Lecture 3 Tonight we dine in shell. Hands-On Unix System Administration DeCal

Behrang Mohit : txt proc! Review. Bag of word view. Document Named

Linux command line basics III: piping commands for text processing. Yanbin Yin Fall 2015

CSE 303 Lecture 2. Introduction to bash shell. read Linux Pocket Guide pp , 58-59, 60, 65-70, 71-72, 77-80

CSE 390a Lecture 2. Exploring Shell Commands, Streams, and Redirection

Introduction p. 1 Who Should Read This Book? p. 1 What You Need to Know Before Reading This Book p. 2 How This Book Is Organized p.

Scripting Languages Course 1. Diana Trandabăț

CSE 390a Lecture 2. Exploring Shell Commands, Streams, Redirection, and Processes

Unleashing the Shell Hands-On UNIX System Administration DeCal Week 6 28 February 2011

CSC209. Software Tools and Systems Programming.

Introduction To. Barry Grant

CSCI 4061: Pipes and FIFOs

Practical 02. Bash & shell scripting

sottotitolo A.A. 2016/17 Federico Reghenzani, Alessandro Barenghi

Shell Programming Systems Skills in C and Unix

Getting to grips with Unix and the Linux family

More text file manipulation: sorting, cutting, pasting, joining, subsetting,

Advanced training. Linux components Command shell. LiLux a.s.b.l.


Unix as a Platform Exercises + Solutions. Course Code: OS 01 UNXPLAT

Shell Programming Overview

Linux II and III. Douglas Scofield. Crea-ng directories and files 18/01/14. Evolu5onary Biology Centre, Uppsala University

Shells & Shell Programming (Part B)

Practical Linux examples: Exercises

QUESTION BANK ON UNIX & SHELL PROGRAMMING-502 (CORE PAPER-2)

Regular Expressions. with a brief intro to FSM Systems Skills in C and Unix

CS160A EXERCISES-FILTERS2 Boyd

Introduction to Unix

Reading and manipulating files

Useful commands in Linux and other tools for quality control. Ignacio Aguilar INIA Uruguay

If you re using a Mac, follow these commands to prepare your computer to run these demos (and any other analysis you conduct with the Audio BNC

Introduction To. Barry Grant

Laboratory 1 Semester 1 11/12

Week Overview. Simple filter commands: head, tail, cut, sort, tr, wc grep utility stdin, stdout, stderr Redirection and piping /dev/null file

Basic Shell Scripting Practice. HPC User Services LSU HPC & LON March 2018

Lecture 5. Essential skills for bioinformatics: Unix/Linux

UNIX files searching, and other interrogation techniques

Programming Concepts. Perl. Adapted from Practical Unix and Programming Hunter College

Lecture 18 Regular Expressions

Lab - 8 Awk Programming

Introduction: What is Unix?

CS214-AdvancedUNIX. Lecture 2 Basic commands and regular expressions. Ymir Vigfusson. CS214 p.1

CS 25200: Systems Programming. Lecture 11: *nix Commands and Shell Internals

The UNIX Shells. Computer Center, CS, NCTU. How shell works. Unix shells. Fetch command Analyze Execute

Introduction To Linux. Rob Thomas - ACRC

Introduction to Scripting using bash

Lecture 10: Potpourri: Enum / struct / union Advanced Unix #include function pointers

bash, part 3 Chris GauthierDickey

CS Unix Tools. Fall 2010 Lecture 5. Hussam Abu-Libdeh based on slides by David Slater. September 17, 2010

Unix/Linux Basics. Cpt S 223, Fall 2007 Copyright: Washington State University

Regular expressions: Text editing and Advanced manipulation. HORT Lecture 4 Instructor: Kranthi Varala

Linux Command Line Interface. December 27, 2017

Bioinformatics? Reads, assembly, annotation, comparative genomics and a bit of phylogeny.

Basics. proper programmers write commands


Shell. SSE2034: System Software Experiment 3, Fall 2018, Jinkyu Jeong

Lecture 2. Regular Expression Parsing Awk

Recap From Last Time: Setup Checklist BGGN 213. Todays Menu. Introduction to UNIX.

COMS 6100 Class Notes 3

Files and Directories

CST Lab #5. Student Name: Student Number: Lab section:

5/20/2007. Touring Essential Programs

Lecture 9: Potpourri: Call by reference vs call by value Enum / struct / union Advanced Unix

On successful completion of the course, the students will be able to attain CO: Experiment linked. 2 to 4. 5 to 8. 9 to 12.

Unix Tools / Command Line

Using UNIX. -rwxr--r-- 1 root sys Sep 5 14:15 good_program

"Bash vs Python Throwdown" -or- "How you can accomplish common tasks using each of these tools" Bash Examples. Copying a file: $ cp file1 file2

From pipelines to graphs: Escape the tyranny of the shell s linear pipelines with dgsh

CPSC 217 Midterm (Python 3 version)

LAB 8 (Aug 4/5) Unix Utilities

Useful Unix Commands Cheat Sheet

Introduction to UNIX. Introduction. Processes. ps command. The File System. Directory Structure. UNIX is an operating system (OS).

Introduction to UNIX. CSE 2031 Fall November 5, 2012

Handling important NGS data formats in UNIX Prac8cal training course NGS Workshop in Nove Hrady 2014

Answers to AWK problems. Shell-Programming. Future: Using loops to automate tasks. Download and Install: Python (Windows only.) R

Basic Linux (Bash) Commands

Introduction to Unix

ITST Searching, Extracting & Archiving Data

Introduction to Text-Processing. Jim Notwell 23 January 2013

CSE 374 Programming Concepts & Tools. Laura Campbell (thanks to Hal Perkins) Winter 2014 Lecture 6 sed, command-line tools wrapup

Recap From Last Time:

Transcription:

CS 124/LINGUIST 180 From Languages to Informa<on Unix for Poets (in 2013) Christopher Manning Stanford University

Unix for Poets (based on Ken Church s presenta<on) Text is available like never before The Web Dic=onaries, corpora, email, etc. Billions and billions of words What can we do with it all? It is beeer to do something simple, than nothing at all. You can do simple things from a Unix command- line DIY is more sa=sfying than begging for help 2

Exercises to be addressed 1. Count words in a text 2. Sort a list of words in various ways 1. ascii order 2. rhyming order 3. Extract useful info from a dic=onary 4. Compute ngram sta=s=cs 5. Work with parts of speech in tagged text 3

Tools grep: search for a paeern (regular expression) sort uniq c (count duplicates) tr (translate characters) wc (word or line count) sed (edit string - - replacement) cat (send file(s) in stream) echo (send text in stream) cut (columns in tab- separated files) paste (paste columns) head tail rev (reverse lines) comm join shuf (shuffle lines of text) 4

Prerequisites ssh into a corn cp /afs/ir/class/cs124/nyt_200811.txt. man, e.g., man tr (shows command op=ons; not friendly) Input/output redirec=on: > < CTRL- C 5

Exercise 1: Count words in a text Input: text file (nyt_200811.txt) Output: list of words in the file with freq counts Algorithm 1. Tokenize(tr) 2. Sort(sort) 3. Count duplicates (uniq c) 6

Solu<on to Exercise 1 tr - sc A- Za- z \n < nyt_200811.txt sort uniq - c 25476 a 1271 A 3 AA 3 AAA 1 Aalborg 1 Aaliyah 1 Aalto 2 aardvark 7

Some of the output tr - sc A- Za- z \n < nyt_200811.txt sort uniq - c head n 5 25476 a 1271 A 3 AA 3 AAA 1 Aalborg tr - sc A- Za- z \n < nyt_200811.txt sort uniq - c head Gives you the first 10 lines tail does the same with the end of the input (You can omit the - n but it s discouraged.) 8

Extended Coun<ng Exercises 1. Merge upper and lower case by downcasing everything Hint: Put in a second tr command 2. How common are different sequences of vowels (e.g., ieu) Hint: Put in a second tr command 9

sort sort f Sor<ng and reversing lines of text Ignore case sort n Numeric order sort r Reverse sort sort nr Reverse numeric sort echo Hello rev 10

Coun<ng and sor<ng exercises Find the 50 most common words in the NYT Hint: Use sort a second =me, then head Find the words in the NYT that end in zz Hint: Look at the end of a list of reversed words 11

Lesson Piping commands together can be simple yet powerful in Unix It gives flexibility. Tradi=onal Unix philosophy: small tools that can be composed 12

Bigrams = word pairs counts Algorithm 1. tokenize by word 2. print word i and word i +1 on the same line 3. count 13

Bigrams tr - sc A- Za- z \n < nyt_200811.txt > nyt.words tail n +2 nyt.words > nyt.nextwords paste nyt.words nyt.nextwords > nyt.bigrams head n 5 nyt.bigrams 14 KBR said said Friday Friday the the global global economic

Exercises Find the 10 most common bigrams (For you to look at:) What part- of- speech paeern are most of them? Find the 10 most common trigrams 15

grep Grep finds paeerns specified as regular expressions grep rebuilt nyt_200811.txt Conn and Johnson, has been rebuilt, among the first of the 222 move into their rebuilt home, sleeping under the same roof for the the part of town that was wiped away and is being rebuilt. That is to laser trace what was there and rebuilt it with accuracy," she home - is expected to be rebuilt by spring. Braasch promises that a the anonymous places where the country will have to be rebuilt, "The party will not be rebuilt without moderates being a part of 16

grep Grep finds paeerns specified as regular expressions globally search for regular expression and print Finding words ending in ing: grep ing$ nyt.words sort uniq - c 17

grep grep is a filter you keep only some lines of the input grep gh keep lines containing gh grep ˆcon grep ing$ grep v gh keep lines beginning with con keep lines ending with ing keep lines NOT containing gh grep - P Perl regular expressions (extended syntax) grep - P '^[A- Z]+$' nyt.words sort uniq c ALL UPPERCASE 18

Coun<ng lines, words, characters wc nyt_200811.txt 140000 1007597 6070784 nyt_200811.txt wc - l nyt.words 1017618 nyt.words 19

grep & wc exercises How many all uppercase words are there in this NYT file? How many 4- leeer words? How many different words are there with no vowels What subtypes do they belong to? How many 1 syllable words are there That is, ones with exactly one vowel Type/token dis=nc=on: different words (types) vs. instances (tokens) 20

sed sed is a simple string (i.e., lines of a file) editor You can match lines of a file by regex or line numbers and make changes Not much used in 2013, but The general regex replace func=on s=ll comes in handy sed 's/george Bush/Dubya/' nyt_200811.txt less 21

sed exercises Count frequency of word ini=al consonant sequences Take tokenized words Delete the first vowel through the end of the word Sort and count Count word final consonant sequences 22

awk Ken Church s slides then describe awk, a simple programming language for short programs on data usually in fields I honestly don t think it s worth learning awk in 2013 BeEer to write liele programs in your favorite scrip=ng language, be that Python, or Perl, or groovy, or 23

shuf Randomly permutes (shuffles) the lines of a file Exercises Print 10 random word tokens from the NYT excerpt 10 instances of words that appear, each word instance equally likely Print 10 random word types from the NYT excerpt 10 different words that appear, each different word equally likely 24

cut tab separated files cp /afs/ir/class/cs124/parses.conll. head n 5 parses.conll 1 Influen=al _ JJ JJ _ 2 amod 2 members _ NNS NNS _ 10 nsubj 3 of _ IN IN _ 2 prep 4 the _ DT DT _ 6 det 5 House _ NNP NNP _ 6 nn 25

cut tab separated files Frequency of different parts of speech: cut - f 4 parses.conll sort uniq - c sort nr Get just words and their parts of speech: cut - f 2,4 parses.conll You can deal with comma separate files with: cut d, 26

cut exercises How oen is that used as a determiner (DT) that man versus a complemen=zer (IN) I know that he is rich versus a rela=ve (WDT) The class that I love Hint: With grep P, you can use \t for a tab character What determiners occur in the data? What are the 5 most common? 27