Lab 12: Processing a Corpus. Ling 1330/2330: Computational Linguistics Na-Rae Han


Objectives
How to process a corpus
(10/4/2018)

Beyond a single, short text
So far, we have been handling relatively short texts, one at a time.
Going multiple: find out what's involved in processing a text archive of multiple text files (aka a corpus). Let's try this today.
Going big: find out what's involved in processing HUMONGOUS text files.

Processing multiple texts
From the NLTK Corpora page, download the C-Span Inaugural Address Corpus.
The C-Span Inaugural Address Corpus includes 56 past presidential inaugural addresses, from 1789 (Washington) to 2009 (Obama).
The directory has 56 .txt files and one README file.
QUESTION: How do we effectively process this many files?

Corpus vs. sub-corpora
[Diagram labels: Sub-corpus 1, Sub-corpus 2, Entire Corpus]

Big token lists for sub-corpora
[Diagram: several texts feed one token list per sub-corpus (sub-corpus 1 TOKENS, sub-corpus 2 TOKENS)]
Good when individual texts don't need separate attention.

Pools & individual token lists
[Diagram: each text has its own token list, and the token lists also feed the sub-corpus pools (sub-corpus 1 TOKENS, sub-corpus 2 TOKENS)]
Individual token lists as well as sub-corpus pools.

Using glob
glob: a file-name globbing utility. Returns a list of file names that match the specified pattern.
>>> import glob
>>> files = glob.glob(r'D:\Lab\inaugural\*.txt')
>>> len(files)
56
>>> files[:5]
['D:\\Lab\\inaugural\\1789-Washington.txt', 'D:\\Lab\\inaugural\\1793-Washington.txt', 'D:\\Lab\\inaugural\\1797-Adams.txt', 'D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt']
>>> files[-1]
'D:\\Lab\\inaugural\\2009-Obama.txt'
>>>
The pattern matches all files ending in .txt, so README is excluded.

Using glob
Addresses from the 1800s only:
>>> files2 = glob.glob(r'D:\Lab\inaugural\18*.txt')
>>> len(files2)
25
>>> files2[:5]
['D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt', 'D:\\Lab\\inaugural\\1809-Madison.txt', 'D:\\Lab\\inaugural\\1813-Madison.txt', 'D:\\Lab\\inaugural\\1817-Monroe.txt']
>>> files2[-1]
'D:\\Lab\\inaugural\\1897-McKinley.txt'
>>>
The pattern matches all files starting with '18' and ending with '.txt'.
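The two glob patterns above can be tried without the actual corpus. The following is a minimal sketch, assuming a throwaway directory with a handful of made-up placeholder files (the directory and its contents are assumptions for illustration, not the real download). It also sorts the result, since glob does not promise any particular order.

```python
import glob
import os
import tempfile

# Throwaway directory that mimics the corpus layout. The file names are
# placeholders for this sketch; the real corpus has 56 speeches plus a README.
corpus_dir = tempfile.mkdtemp()
for name in ['1789-Washington.txt', '1801-Jefferson.txt',
             '1897-McKinley.txt', '2009-Obama.txt', 'README']:
    with open(os.path.join(corpus_dir, name), 'w', encoding='utf-8') as f:
        f.write('placeholder text')

# os.path.join picks the right directory separator for the current platform.
# sorted() gives a stable order, because glob returns matches in arbitrary order.
all_files = sorted(glob.glob(os.path.join(corpus_dir, '*.txt')))
files_1800s = sorted(glob.glob(os.path.join(corpus_dir, '18*.txt')))

print(len(all_files))                                 # README has no .txt extension
print([os.path.basename(p) for p in files_1800s])     # only names starting with '18'
```

Building the pattern with os.path.join instead of a hard-coded Windows path keeps the same script usable on macOS and Linux.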

Build dictionary of texts
For-loop through file names and build a dictionary of key (file name): value (text content).
The full name is too long. How do we extract the short name?
>>> files[0]
'D:\\Lab\\inaugural\\1789-Washington.txt'
>>> files[0][12:-4]
'ural\\1789-Washington'
>>> files[0][17:-4]
'1789-Washington'
>>> files[2][17:-4]
'1797-Adams'
Hard-coding the start index 17 gets the job done.
>>> files[0].index('\\')
2
>>> files[0].rindex('\\')
16
rindex() returns the highest index of '\' (the Windows directory separator).
>>> files[0][files[0].rindex('\\')+1:-4]
'1789-Washington'
This is the more principled way of extracting the short file name.
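As an alternative to slicing by hand, the standard library can split off the directory and the extension. This is a sketch, not the method used in the lab: PureWindowsPath is used so the Windows-style path from the slide parses the same way on any operating system, and short_name is a hypothetical helper name introduced here for native paths.

```python
import os
from pathlib import PureWindowsPath

# The Windows-style path from the slide, parsed with Windows rules on any OS.
longname = 'D:\\Lab\\inaugural\\1789-Washington.txt'
short = PureWindowsPath(longname).stem     # file name without directory or extension
print(short)

def short_name(path):
    """Hypothetical helper: strip directory and extension from a native path."""
    return os.path.splitext(os.path.basename(path))[0]

# os.path.join builds a path with the current platform's separator.
print(short_name(os.path.join('inaugural', '1797-Adams.txt')))
```

Unlike the [start:-4] slice, this does not assume the extension is exactly four characters long.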

Build dictionary of texts
For-loop through file names and build a dictionary of key (file name): value (text content).
>>> files[0][files[0].rindex('\\')+1:-4]
'1789-Washington'
>>> fn2txt = {}
>>> for longname in files:
        f = open(longname)
        txt = f.read()
        f.close()
        start = longname.rindex('\\')+1
        short = longname[start:-4]
        fn2txt[short] = txt

>>> fn2txt['1809-Madison'][:40]
'Unwilling to depart from examples of the'
>>> fn2txt['1789-Washington'][:40]
'Fellow-Citizens of the Senate and of the'
fn2txt: file name as key, text string as value.
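The same dictionary-building loop can be written with a with-statement, which closes each file automatically. The sketch below runs on a temporary directory holding two short placeholder files (their contents are just the opening 40 characters shown above). The utf-8 encoding is an assumption; the encoding of the actual corpus files should be checked before relying on it.

```python
import glob
import os
import tempfile

# Placeholder corpus: two tiny files in a temporary directory (an assumption
# for this sketch; the real corpus lives wherever it was downloaded).
corpus_dir = tempfile.mkdtemp()
samples = {
    '1789-Washington': 'Fellow-Citizens of the Senate and of the',
    '1809-Madison': 'Unwilling to depart from examples of the',
}
for short, text in samples.items():
    with open(os.path.join(corpus_dir, short + '.txt'), 'w', encoding='utf-8') as f:
        f.write(text)

fn2txt = {}
for longname in sorted(glob.glob(os.path.join(corpus_dir, '*.txt'))):
    # 'with' closes the file even if read() raises an error.
    with open(longname, encoding='utf-8') as f:   # encoding is an assumption
        txt = f.read()
    short = os.path.splitext(os.path.basename(longname))[0]
    fn2txt[short] = txt

print(fn2txt['1809-Madison'][:40])
```

Note that the dictionary keys keep the capitalization of the file names, so lookups must use '1809-Madison', not '1809-madison'.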

Processing each text
Task: Compute the average sentence length for each presidential address. We have to build a separate token list for each speech.
>>> import textstats
>>> fn2toks = {}
>>> for (fn, txt) in fn2txt.items():
        toks = textstats.gettokens(txt)
        fn2toks[fn] = toks

fn2toks: file name as key, token list as value.
>>> len(fn2toks)
56
>>> fn2toks['1789-Washington']
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', ...
>>> fn2toks['2001-Bush'][:10]
['president', 'clinton', ',', 'distinguished', 'guests', 'and', 'my', 'fellow', 'citizens', ',']

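Speech length, 'peace' count
>>> for fn in fn2toks:
        toks = fn2toks[fn]
        print(len(toks), fn)

[Output: one token count per address. The numbers are missing from this transcript; surviving labels: Washington, Washington, Obama]
>>> for fn in fn2toks:
        toks = fn2toks[fn]
        print(toks.count('peace'), '\t', fn)

[Output: one 'peace' count per address. The numbers are missing from this transcript; surviving labels: Eisenhower, Eisenhower, Kennedy, Johnson, Nixon, Nixon, Carter]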

Average sentence length, per address
>>> for fn in fn2toks:
        toks = fn2toks[fn]
        sentcount = toks.count('.') + toks.count('!') \
                    + toks.count('?')
        avgsentlen = len(toks)/sentcount
        print(avgsentlen, '\t', fn)

[Output: one average per address. The numbers are missing from this transcript; surviving labels: Washington, Washington, Adams, Jefferson, Jefferson, Madison, Bush, Bush, Obama]
>>>
This assumes every sentence ends with '.', '!', or '?'.
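The loop above divides by sentcount, which would raise ZeroDivisionError for a text containing none of the three end marks. A minimal sketch of a guarded version follows. Because textstats is a course module, simple_tokens is a hypothetical stand-in for textstats.gettokens, and the two toy speeches are made up for illustration.

```python
import re

def simple_tokens(txt):
    """Hypothetical stand-in for textstats.gettokens: lowercase the text,
    then split it into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", txt.lower())

def avg_sent_len(toks):
    """Average tokens per sentence, counting '.', '!' and '?' as sentence ends.
    Returns None when no end mark is found, instead of dividing by zero."""
    sentcount = sum(toks.count(p) for p in ('.', '!', '?'))
    if sentcount == 0:
        return None
    return len(toks) / sentcount

# Toy data, not taken from the corpus.
fn2toks = {
    'speech-a': simple_tokens('We meet today. We begin again!'),
    'speech-b': simple_tokens('no end mark here'),
}
for fn, toks in fn2toks.items():
    print(avg_sent_len(toks), '\t', fn)
```

As in the slide, punctuation tokens count toward the token total, and abbreviations such as 'Mr.' would be miscounted as sentence ends.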

Treating files as a single corpus
Task: Compile word frequency of the Inaugural Speeches. For this, we only need to build a single pool of tokenized words. For each text, tokenize it, and then add the result to the pool of tokenized words.
>>> import textstats
>>> alltoks = []
>>> for txt in fn2txt.values():
        toks = textstats.gettokens(txt)
        alltoks.extend(toks)

>>> len(alltoks)
>>> alltoks[:15]
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
>>> alltoks[-15:]
['you', '.', 'god', 'bless', 'you', '.', 'and', 'god', 'bless', 'the', 'united', 'states', 'of', 'america', '.']

Word frequency of entire corpus
>>> allfreq = textstats.getfreq(alltoks)
>>> allfreq['citizens']
237
>>> allfreq['battle']
12
>>> for k in sorted(allfreq, key=allfreq.get, reverse=True)[:10]:
        print(k, allfreq[k])

the 9906
of 6986
, 6862
and
to 4432
in 2749
a 2193
our 2058
that 1726
[Some of the output is missing from this transcript, including the count for 'and'.]
>>>
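The standard library's collections.Counter offers a similar frequency table with a built-in top-N method. This sketch is an alternative to the course's textstats.getfreq, not a description of how that function works, and the token list is a made-up toy example.

```python
from collections import Counter

# Toy token list, not taken from the corpus.
alltoks = ['the', 'people', 'of', 'the', 'united', 'states', ',',
           'the', 'people', '.']

allfreq = Counter(alltoks)      # maps each token to its count
print(allfreq['the'])
print(allfreq['battle'])        # a missing key counts as 0, no KeyError

# most_common(n) returns the n highest-count (token, count) pairs.
for word, count in allfreq.most_common(2):
    print(word, count)
```

A Counter is a dict subclass, so the sorted(allfreq, key=allfreq.get, reverse=True) idiom from the slide works on it unchanged.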

Treating files as a single corpus, take 2
Task: Compile word frequency of the Inaugural Speeches.
Alternative approach: join all text strings into a single gigantic text string, and then tokenize it all at once.
>>> alltxt = '\n'.join(fn2txt.values())
All speech texts are concatenated with a line break in between.
>>> alltoks = textstats.gettokens(alltxt)
>>> len(alltoks)
>>> alltoks[:15]
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
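The two pooling approaches should produce the same token list as long as the tokenizer never forms a token across a line break. The sketch below checks this on toy data; simple_tokens is a hypothetical stand-in for textstats.gettokens, so the result says nothing definite about the course module itself.

```python
import re

def simple_tokens(txt):
    """Hypothetical stand-in for textstats.gettokens: lowercase the text,
    then split it into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", txt.lower())

# Toy texts, not the real speeches.
fn2txt = {
    'speech-a': 'Fellow-Citizens of the Senate.',
    'speech-b': 'God bless you.',
}

# Approach 1: tokenize each text, then extend one shared pool.
pooled = []
for txt in fn2txt.values():
    pooled.extend(simple_tokens(txt))

# Approach 2: join all texts with a line break, then tokenize once.
joined = simple_tokens('\n'.join(fn2txt.values()))

print(pooled == joined)
```

Joining with '\n' rather than the empty string matters: without a separator, the last word of one speech and the first word of the next could fuse into a single token.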


Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Due by 11:59:59pm on Tuesday, March 16, 2010 This assignment is based on a similar assignment developed at the University of Washington. Running

More information

Machine Learning in GATE

Machine Learning in GATE Machine Learning in GATE Angus Roberts, Horacio Saggion, Genevieve Gorrell Recap Previous two days looked at knowledge engineered IE This session looks at machine learned IE Supervised learning Effort

More information

Respondus Test Software

Respondus Test Software Respondus Test Software Create and Format the Test in Word... 1 Basic Formatting... 1 Formatting Specific Question Types... 2 Import the Test into Respondus... 3 Preview and Edit the Test in Respondus...

More information

R Basics / Course Business

R Basics / Course Business R Basics / Course Business We ll be using a sample dataset in class today: CourseWeb: Course Documents " Sample Data " Week 2 Can download to your computer before class CourseWeb survey on research/stats

More information

Due: Tuesday 29 November by 11:00pm Worth: 8%

Due: Tuesday 29 November by 11:00pm Worth: 8% CSC 180 H1F Project # 3 General Instructions Fall 2016 Due: Tuesday 29 November by 11:00pm Worth: 8% Submitting your assignment You must hand in your work electronically, using the MarkUs system. Log in

More information

Tutorial to QuotationFinder_0.4.3

Tutorial to QuotationFinder_0.4.3 Tutorial to QuotationFinder_0.4.3 What is Quotation Finder and for which purposes can it be used? Quotation Finder is a tool for the automatic comparison of fully digitized texts. It can either detect

More information

Office of Presidential Libraries; Proposed Disposal of George H.W. Bush and Clinton. Agency: National Archives and Records Administration (NARA)

Office of Presidential Libraries; Proposed Disposal of George H.W. Bush and Clinton. Agency: National Archives and Records Administration (NARA) This document is scheduled to be published in the Federal Register on 06/28/2013 and available online at http://federalregister.gov/a/2013-15564, and on FDsys.gov Billing Code 7515-01U NATIONAL ARCHIVES

More information

Storing and Reusing Macros

Storing and Reusing Macros 101 CHAPTER 9 Storing and Reusing Macros Introduction 101 Saving Macros in an Autocall Library 102 Using Directories as Autocall Libraries 102 Using SAS Catalogs as Autocall Libraries 103 Calling an Autocall

More information

CSCE C. Lab 10 - File I/O. Dr. Chris Bourke

CSCE C. Lab 10 - File I/O. Dr. Chris Bourke CSCE 155 - C Lab 10 - File I/O Dr. Chris Bourke Prior to Lab Before attending this lab: 1. Read and familiarize yourself with this handout. 2. Review the following free textbook resources: http://en.wikibooks.org/wiki/c_programming/file_io

More information

EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates

EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates Alina Karakanta, Mihaela Vela, Elke Teich Department of Language Science and Technology, Saarland University Outline Introduction

More information

Python Odds & Ends. April 23, CSCI Intro. to Comp. for the HumaniDes and Social Sciences 1

Python Odds & Ends. April 23, CSCI Intro. to Comp. for the HumaniDes and Social Sciences 1 Python Odds & Ends April 23, 2015 CSCI 0931 - Intro. to Comp. for the HumaniDes and Social Sciences 1 Today Web InteracDon and Forms Graphical User Interfaces Natural Language Processing CSCI 0931 - Intro.

More information

Creating N-gram profile for a Wikipedia Corpus

Creating N-gram profile for a Wikipedia Corpus Programming Assignment 1 CS 435 Introduction to Big Data Creating N-gram profile for a Wikipedia Corpus Due: Feb. 21, 2018 5:00PM Submission: via Canvas, individual submission Objectives The goal of this

More information

Forging Industry Association

Forging Industry Association Forging Industry Association Alan s TEC Generic Prosperity in the Age of Decline Brian Beaulieu CEO 213 Forecast Results 2 Duration Forecast Actuals Accuracy US GDP 12 $15.818 Trillion $15.966 Dec 99.3%

More information

1 Modules 2 IO. 3 Lambda Functions. 4 Some tips and tricks. 5 Regex. Sandeep Sadanandan (TU, Munich) Python For Fine Programmers May 30, / 22

1 Modules 2 IO. 3 Lambda Functions. 4 Some tips and tricks. 5 Regex. Sandeep Sadanandan (TU, Munich) Python For Fine Programmers May 30, / 22 1 Modules 2 IO 3 Lambda Functions 4 Some tips and tricks 5 Regex Sandeep Sadanandan (TU, Munich) Python For Fine Programmers May 30, 2009 1 / 22 What are they? Modules are collections of classes or functions

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Tutorial to QuotationFinder_0.6

Tutorial to QuotationFinder_0.6 Tutorial to QuotationFinder_0.6 What is QuotationFinder, and for which purposes can it be used? QuotationFinder is a tool for the automatic comparison of fully digitized texts. It can detect quotations,

More information

NLTK is distributed with several corpora (singular: corpus). A corpus is a body of text (or other language data, eg speech).

NLTK is distributed with several corpora (singular: corpus). A corpus is a body of text (or other language data, eg speech). 1 ICL/Introduction to Python 3/2006-10-02 2 NLTK NLTK: Python Natural Language ToolKit NLTK is a set of Python modules which you can import into your programs, eg: from nltk_lite.utilities import re_show

More information

File Input/Output in Python. October 9, 2017

File Input/Output in Python. October 9, 2017 File Input/Output in Python October 9, 2017 Moving beyond simple analysis Use real data Most of you will have datasets that you want to do some analysis with (from simple statistics on few hundred sample

More information

CS 124/LINGUIST 180 From Languages to Information

CS 124/LINGUIST 180 From Languages to Information CS 124/LINGUIST 180 From Languages to Information Unix for Poets Dan Jurafsky (original by Ken Church, modifications by Chris Manning) Stanford University Unix for Poets (based on Ken Church s presentation)

More information

Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017

Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017 Tokenization and Sentence Segmentation Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017 Outline 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation

More information

jldadmm: A Java package for the LDA and DMM topic models

jldadmm: A Java package for the LDA and DMM topic models jldadmm: A Java package for the LDA and DMM topic models Dat Quoc Nguyen School of Computing and Information Systems The University of Melbourne, Australia dqnguyen@unimelb.edu.au Abstract: In this technical

More information

TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing

TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing Sairam Gurajada, Stephan Seufert, Iris Miliaraki, Martin Theobald Databases & Information Systems Group ADReM Research

More information

Regular Expressions. Todd Kelley CST8207 Todd Kelley 1

Regular Expressions. Todd Kelley CST8207 Todd Kelley 1 Regular Expressions Todd Kelley kelleyt@algonquincollege.com CST8207 Todd Kelley 1 Our standard.bashrc and.bash_profile (or.profile) Our standard script header Regular Expressions 2 [ -z "${PS1-}" ] &&

More information

CSC401 Natural Language Computing

CSC401 Natural Language Computing CSC401 Natural Language Computing Jan 19, 2018 TA: Willie Chang Varada Kolhatkar, Ka-Chun Won, and Aryan Arbabi) Mascots: r/sandersforpresident (left) and r/the_donald (right) To perform sentiment analysis

More information