NLP Lab Session Week 4, September 17, 2014
Reading and Processing Text, Stemming and Lemmatization

Getting Started

In this lab session, we will use saved files of Python commands and definitions, and we will also work together through a series of small examples using the IDLE window, as described in this lab document. All of these examples can be found in Python files on the Blackboard system, under Resources:

Bigram.py, processtextfile.py, Labweek4examples.py, desert.txt, Smart.English.stop

Put all of these files in one directory (and remember where that is!) and open an IDLE window. This section uses material from the NLTK book, Chapter 3, which recommends starting the session with several imports:

>>> from __future__ import division
>>> import nltk, re, pprint

Reading Text from the Web or from a File

So far, we have used a small number of books from the Gutenberg Project that are provided in the NLTK. There is a larger collection available online, and you can see what is available at http://www.gutenberg.org/catalog/. Text number 2554 in the catalog is the book Crime and Punishment by Dostoevsky, and we can access this text directly from Python by using the url libraries to connect to the URL and read the content.

>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176893
>>> raw[:100]

Note that the text obtained from this web page is fairly clean; that is, it does not contain HTML tags or other web formatting that would interfere with text processing. So from the raw text, we could proceed with tokenization and further steps of text processing.
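For example, following the pattern from Chapter 3 of the NLTK book, a minimal sketch of those next steps might look like this (the variable name crimetext is just illustrative, not part of the lab files):

>>> tokens = nltk.word_tokenize(raw)
>>> crimetext = nltk.Text(tokens)
>>> crimetext.concordance('Raskolnikov')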

But if there is HTML formatting, we may have to take further steps to clean up the text. Let's look at the example given in the NLTK book: http://news.bbc.co.uk/2/hi/health/2284783.stm

>>> blondurl = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(blondurl).read()
>>> html[:1000]

Note that there are a lot of tags and formatting that are not text. One simple approach is to use the NLTK function clean_html to just remove the HTML tags. If we wanted to do more sophisticated clean-up, we could investigate the Python library BeautifulSoup (a short sketch is given below).

>>> braw = nltk.clean_html(html)
>>> btokens = nltk.word_tokenize(braw)
>>> btokens[:100]

For further cleanup, we could try to eliminate text representing menus and ads by hand, either on the string or on the tokens. Here we try to pick out just the tokens representing the text portion by inspecting the text and doing some trial and error.

>>> btokens = btokens[96:399]
>>> btokens[:100]
>>> btokens = btokens[5:]
>>> btokens[:100]
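One caution: in newer versions of NLTK (3.0 and later), clean_html is no longer supported, and BeautifulSoup is the recommended replacement. A minimal sketch, assuming the beautifulsoup4 package is installed (the bs4 import name and the get_text method are BeautifulSoup's API, not part of NLTK):

>>> from bs4 import BeautifulSoup
>>> braw = BeautifulSoup(html).get_text()
>>> btokens = nltk.word_tokenize(braw)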

Finally, another way to get text in is for you to collect the text into a file and read that file. To see this, I have collected the text of a news story by copying it directly from the web page and pasting it into a text file: desert.txt

Now reading a file is very simple in Python, and the only difficulty is making sure that Python can find the file in the right directory. One option is to put the file that you want to read in the current directory of IDLE. To find out what that directory is, we can do:

>>> import os
>>> os.getcwd()

Then we could put desert.txt in that directory and just open it:

>>> f = open('desert.txt')

Or instead we could add the path to the directory of the file by adding to sys.path, as shown later in this lab. Or we can just read the file by giving the full path. Here you must substitute the path to the file desert.txt on your own lab machine.

>>> f = open('H:\\NLPclass\\LabExamplesWeek4\\desert.txt')
>>> text = f.read()

For Mac users, the path will use forward slashes. For example, this is the path that I use on my Mac laptop:

>>> f = open('/users/njmccrac1/aaadocs/nlpfall2014/labs/labexamplesweek4/desert.txt')
>>> text = f.read()

Module of Bigram functions

When we processed bigrams from text last week, we defined a function to filter out tokens consisting entirely of non-alphabetical characters. However, in the Emma text, we saw that there were weird tokens containing special characters and starting with underbars that were given high scores by the pointwise mutual information (PMI) scoring function. So instead, we might filter out the tokens that don't consist entirely of alphabetical characters. This will filter out words like Syracuse-based, but it will also filter out those annoying tokens with many underbars at the beginning or end, which skew our mutual information results.

To make it easier to use any functions that we have defined, we can place them into one Python file, which Python calls a module. The name of the file is Bigram.py, and if the file is in the same directory as another Python program, you can say import Bigram to access the function definitions. Then you use the name of the module to qualify the function names, for example, Bigram.alpha_filter.

In your IDLE window, use the File menu to open the Bigram.py file. Now if you want to use the functions in the Bigram file in your IDLE window, you will need to add the path of the Bigram file to the system path. To see which directories the system will search to find programs for the IDLE window:

>>> import sys
>>> sys.path

Suppose that you added the Bigram.py file to a directory on the H: drive called NLPclass. To add that directory to your path:

>>> sys.path.append('H:\\NLPclass\\LabExamplesWeek4')

For those working on a Mac, my path on my laptop is:

>>> sys.path.append('/users/njmccrac1/aaadocs/nlpfall2013/labs/labexamplesweek4')
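For reference, here is a minimal sketch of what a filter function like alpha_filter might look like; this is an assumption about the contents of Bigram.py, not a copy of it. Following the apply_word_filter convention, it returns True for tokens that should be filtered out:

>>> def alpha_filter(w):
...     # True means "filter this token out": keep only purely alphabetical tokens
...     return not w.isalpha()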

Now you can use functions such as alpha_filter in your IDLE window without copying the definitions into the window.

Mutual Information

Review the definition of the association ratio from the Church and Hanks paper, and the mutualinfo function. I have put this function, along with the other unigram and bigram frequency distribution functions, into the Bigram.py file. From your IDLE window, under the File menu, use Open to open the Bigram.py file into another IDLE window. Read through this module to see the function definitions.

Now we will use the text of the novel Crime and Punishment that we got from Project Gutenberg. Check that your text is still in a variable called raw. Then tokenize the text, but this time, we won't use lowercase.

>>> raw[:200]
>>> crimetokens = nltk.word_tokenize(raw)
>>> len(crimetokens)

In order to use our filter functions, we can:

>>> import Bigram

Now we can run our text processing steps using filter definitions from the module.

>>> from nltk.collocations import *
>>> from operator import itemgetter
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> stopwords = nltk.corpus.stopwords.words('english')

Make a bigram finder and apply the non-alphabetical and stopword filters.

>>> finder = BigramCollocationFinder.from_words(crimetokens)
>>> finder.apply_word_filter(Bigram.alpha_filter)
>>> finder.apply_word_filter(lambda w: w in stopwords)

# score by bigram frequency
>>> scored = finder.score_ngrams(bigram_measures.raw_freq)
>>> scored[:30]

# Or for NLTK 3.0, sort the results first:
>>> sorted(scored, key=itemgetter(1), reverse=True)[:30]

Now run the scorer again to get bigrams in order by Pointwise Mutual Information.

>>> scored = finder.score_ngrams(bigram_measures.pmi)
>>> scored[:30]

# Or for NLTK 3.0, sort the results first:
>>> sorted(scored, key=itemgetter(1), reverse=True)[:30]
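As a reminder of what the pmi scorer computes: the pointwise mutual information of a bigram (x, y) is the log (base 2) of the ratio between the observed bigram probability and the probability expected if x and y occurred independently. Here is a minimal sketch computing it from raw counts; the function and variable names are illustrative, not NLTK's internals:

>>> import math
>>> def pmi(count_xy, count_x, count_y, n):
...     # observed bigram probability over the product of the unigram probabilities
...     p_xy = count_xy / float(n)
...     p_x = count_x / float(n)
...     p_y = count_y / float(n)
...     return math.log(p_xy / (p_x * p_y), 2)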

We can run this all again with the other alphabetical filter. [Note that if you make changes to the Bigram module, you should save it and then use the reload function, instead of importing it again:

>>> reload(Bigram)
]

Stemming and Lemmatization

For this part, we will use the Crime and Punishment text from Project Gutenberg as we did above. Note that crimetokens has words with regular capitalization, and we can make crimewords to hold the lower-case words with no capitalization.

>>> crimewords = [w.lower() for w in crimetokens]

NLTK has two stemmers, Porter and Lancaster, described in section 3.6 of the NLTK book. To use these stemmers, you first create them.

>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()

Then we'll compare how the stemmers work, using both the regular-cased text and the lower-cased text.

>>> crimepstem = [porter.stem(t) for t in crimetokens]
>>> crimepstem[:200]
>>> crimelstem = [lancaster.stem(t) for t in crimetokens]
>>> crimelstem[:100]

Try the same examples with crimewords to see the lower-case version of the stemmers.

The NLTK has a lemmatizer that uses the WordNet on-line thesaurus as a dictionary to look up word roots.

>>> wnl = nltk.WordNetLemmatizer()
>>> crimelemma = [wnl.lemmatize(t) for t in crimetokens]
>>> crimelemma[:200]
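To get a quick feel for how the two stemmers and the lemmatizer differ, you can try them side by side on a few hand-picked words (this word list is just an illustration). Note that the lemmatizer, unlike the stemmers, returns real dictionary words, but it treats each word as a noun unless you pass a part-of-speech argument:

>>> for w in ['running', 'women', 'maximum', 'crying']:
...     print w, porter.stem(w), lancaster.stem(w), wnl.lemmatize(w)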

Processing Text from Files

Open processtextfile.py and read through it to observe how it reads text from the file desert.txt and how it creates a stop word list from the file Smart.English.stop (a sketch of the general pattern appears at the end of this section). Look at the Smart.English.stop file in a text editor. Note that processtextfile then runs the bigram finder and scorers.

Run processtextfile by selecting Run Module from the Run menu. The results will appear in your main IDLE window (which restarts the session and erases all the definitions that you have already made). Modify the code to try the mutual information function with and without stopwords, by using the comment character # to switch between the two statements. Also try frequency thresholds of 3, and perhaps 5.

Note that another list of stopwords is available in NLTK, in nltk.corpus.stopwords. This list has 127 words and can be accessed as follows:

>>> from nltk.corpus import stopwords
>>> nltkstoplist = nltk.corpus.stopwords.words('english')
>>> nltkstoplist

For your homework, if you develop a file similar to processtextfile in lab, then you can rerun your examples without as much typing into IDLE.

Note that you can also run this Python file by using Python directly from a terminal/command prompt window. That is, on a PC, open a command prompt window and change directory until you get to the LabExamplesWeek4 directory. Then run (I think):

% processtextfile.py

On a Mac, open a terminal window and change directory until you get to the LabExamplesWeek4 directory. Then run:

% python processtextfile.py
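For reference, creating a stop word list from a file like Smart.English.stop (one word per line) follows a simple pattern; this is a sketch of the general idea, not necessarily the exact code in processtextfile.py:

>>> f = open('Smart.English.stop')
>>> stoplist = [line.strip() for line in f if line.strip()]
>>> f.close()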

Ideas for Homework

For documents that come from NLTK corpora, read the NLTK book sections from Chapter 2 on the different corpora. Also note that Chapter 2 discusses how to load your own text with the PlaintextCorpusReader, or you can just read text from files as we did in the lab example with desert.txt.

An example of using word frequencies to analyze text is Nate Silver's analysis of State of the Union speeches from 1962 to 2010:

http://www.fivethirtyeight.com/2010/01/obamas-sotu-clintonian-in-good-way.html

Here is a statement of the question that he is trying to answer by looking at word frequencies to compare the SOTU speech in 2010 with earlier speeches: What did President Obama focus his attention upon, and how does this compare to his predecessors?

[You may also find it interesting to note the criticism of the technique at http://thelousylinguist.blogspot.com/2010/01/bad-linguistics-sigh.html. I am not expecting you to be experts in how you categorize the words or bigrams that you use in your analysis (and I actually think that Nate Silver was not too far off the mark in picking categories of words to answer his question).]

How is this example different from your assignment? It does more comparisons than you are required to do, and it only uses word frequencies, while you must also look at bigram frequencies and mutual information. But if you choose the analysis option, this is the type of thing that you are aiming for. Also, you can just use word lists and frequencies/scores without making colored graphs.

Also note that you can collect your own text from just about anywhere. For a (too short) example of this, see the potato chip marketing example.

Discuss the programming option.

There is no exercise for today.