Lab 12: Processing a Corpus. Ling 1330/2330: Computational Linguistics Na-Rae Han
2 Objectives
How to process a corpus
10/4/2018
3 Beyond a single, short text
So far, we have been handling relatively short texts, one at a time.
Going multiple: Find out what's involved in processing a text archive of multiple text files (aka a corpus). Let's try this today.
Going big: Find out what's involved in processing HUMONGOUS text files.
4 Processing multiple texts
From the NLTK Corpora page, download: C-Span Inaugural Address Corpus
The C-Span Inaugural Address Corpus includes 56 past presidential inaugural addresses, from 1789 (Washington) to 2009 (Obama).
The directory has 56 .txt files and one README file.
QUESTION: How do we effectively process this many files?
5 Corpus vs. sub-corpora
[Diagram: the entire corpus, divided into sub-corpus 1 and sub-corpus 2]
6 Big token lists for sub-corpora
[Diagram: the texts of each sub-corpus are pooled into one big token list: sub-corpus 1 TOKENS, sub-corpus 2 TOKENS]
Good when individual texts don't need separate attention.
7 Pools & individual token lists
[Diagram: each text gets its own token list, and the token lists of each sub-corpus are also pooled: sub-corpus 1 TOKENS, sub-corpus 2 TOKENS]
Individual token lists as well as sub-corpus pools
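The "individual lists plus pools" layout can be sketched with two dictionaries. This is only an illustration under assumptions: the token lists are made up, and grouping by century via the hypothetical helper `century()` is one possible way to define sub-corpora, not something the slides prescribe.

```python
# Hypothetical token lists keyed by short file name (not real corpus data).
fn2toks = {
    "1801-Jefferson": ["friends", "and", "fellow", "citizens"],
    "1897-McKinley": ["fellow", "citizens"],
    "2009-Obama": ["my", "fellow", "citizens"],
}

def century(fn):
    """Sub-corpus label from the year prefix, e.g. '1801-...' -> '1800s'."""
    return fn[:2] + "00s"

sub2toks = {}   # sub-corpus label -> pooled token list
for fn, toks in fn2toks.items():
    # setdefault creates the pool the first time a label is seen.
    sub2toks.setdefault(century(fn), []).extend(toks)

# fn2toks still holds the individual lists; sub2toks holds the pools.
print({sub: len(toks) for sub, toks in sub2toks.items()})
```

Keeping both dictionaries costs extra memory, but it lets per-text and per-sub-corpus questions be answered without re-tokenizing.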
8 Using glob
glob: a file-name globbing utility
Returns a list of file names that match the specified pattern
>>> import glob
>>> files = glob.glob(r'D:\Lab\inaugural\*.txt')
>>> len(files)
56
>>> files[:5]
['D:\\Lab\\inaugural\\1789-Washington.txt', 'D:\\Lab\\inaugural\\1793-Washington.txt', 'D:\\Lab\\inaugural\\1797-Adams.txt', 'D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt']
>>> files[-1]
'D:\\Lab\\inaugural\\2009-Obama.txt'
The pattern '*.txt' matches all files ending in .txt, so it excludes README.
9 Using glob
Addresses from the 1800s only
>>> files2 = glob.glob(r'D:\Lab\inaugural\18*.txt')
>>> len(files2)
25
>>> files2[:5]
['D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt', 'D:\\Lab\\inaugural\\1809-Madison.txt', 'D:\\Lab\\inaugural\\1813-Madison.txt', 'D:\\Lab\\inaugural\\1817-Monroe.txt']
>>> files2[-1]
'D:\\Lab\\inaugural\\1897-McKinley.txt'
The pattern '18*.txt' matches all files starting with '18' and ending with '.txt'.
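The two patterns can be tried without the Windows-specific path. The following is a minimal sketch that creates a throwaway directory with a few empty files (the directory and the handful of file names are assumptions for illustration, not the real corpus) and matches them with the same `*.txt` and `18*.txt` patterns. `sorted()` is used because `glob.glob()` does not promise any particular order.

```python
import glob
import os
import tempfile

# Hypothetical stand-in for the inaugural directory: a temp folder
# holding a few empty files named like the corpus files.
corpus_dir = tempfile.mkdtemp()
for name in ["1789-Washington.txt", "1801-Jefferson.txt",
             "1897-McKinley.txt", "2009-Obama.txt", "README"]:
    with open(os.path.join(corpus_dir, name), "w") as f:
        f.write("")

# os.path.join builds the pattern with the right separator for the OS.
all_txt = sorted(glob.glob(os.path.join(corpus_dir, "*.txt")))
only_1800s = sorted(glob.glob(os.path.join(corpus_dir, "18*.txt")))

print(len(all_txt))   # README is not counted: it has no .txt extension
print([os.path.basename(p) for p in only_1800s])
```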
10 Build dictionary of texts
For-loop through file names and build a dictionary of key (filename): value (text content)
# Full name is too long. How to extract the short name?
>>> files[0]
'D:\\Lab\\inaugural\\1789-Washington.txt'
>>> files[0][12:-4]
'ural\\1789-Washington'
# Hard-coded start index: gets the job done
>>> files[0][17:-4]
'1789-Washington'
>>> files[2][17:-4]
'1797-Adams'
>>> files[0].index('\\')
2
# Highest index of '\' (Windows dir separator)
>>> files[0].rindex('\\')
16
# This is the more principled way of extracting the short file name
>>> files[0][files[0].rindex('\\')+1:-4]
'1789-Washington'
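An alternative to slicing by hand is to let the standard library split the path. The sketch below uses `ntpath`, the Windows flavor of `os.path`, only so that the slide's Windows-style string is parsed the same way on any operating system; in a real script `os.path` would be the usual choice. The helper name `short_name` is hypothetical.

```python
import ntpath  # Windows-flavored os.path, importable on any OS

def short_name(longname):
    """Return the file name without its directory or extension."""
    base = ntpath.basename(longname)      # drops the directory part
    stem, _ext = ntpath.splitext(base)    # splits off the '.txt' extension
    return stem

longname = 'D:\\Lab\\inaugural\\1789-Washington.txt'
print(short_name(longname))
print(longname[longname.rindex('\\') + 1:-4])   # the slide's approach
```

Unlike the `[...:-4]` slice, `splitext` does not assume the extension is exactly four characters long.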
11 Build dictionary of texts
For-loop through file names and build a dictionary of key (filename): value (text content)
>>> files[0][files[0].rindex('\\')+1:-4]
'1789-Washington'
>>> fn2txt = {}
>>> for longname in files:
        f = open(longname)
        txt = f.read()
        f.close()
        start = longname.rindex('\\')+1
        short = longname[start:-4]
        fn2txt[short] = txt

>>> fn2txt['1809-Madison'][:40]
'Unwilling to depart from examples of the'
>>> fn2txt['1789-Washington'][:40]
'Fellow-Citizens of the Senate and of the'
fn2txt: file name as key, text string as value
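The same loop can be written with a `with` block, which closes each file automatically. This is a sketch under assumptions: the two-file mini-corpus is made up so the code runs anywhere, and `encoding="utf-8"` is a guess for illustration, since the slides do not state the encoding of the corpus files.

```python
import glob
import os
import tempfile

# Hypothetical mini-corpus written to a temp folder.
corpus_dir = tempfile.mkdtemp()
samples = {"1789-Washington": "Fellow-Citizens of the Senate",
           "1809-Madison": "Unwilling to depart from examples"}
for name, text in samples.items():
    path = os.path.join(corpus_dir, name + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

fn2txt = {}
for longname in glob.glob(os.path.join(corpus_dir, "*.txt")):
    # 'with' closes the file even if read() raises an error.
    with open(longname, encoding="utf-8") as f:
        txt = f.read()
    # basename + splitext: short file name without directory or extension.
    short = os.path.splitext(os.path.basename(longname))[0]
    fn2txt[short] = txt

print(sorted(fn2txt))
print(fn2txt["1809-Madison"][:20])
```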
12 Processing each text
Task: Compute the average sentence length for each presidential address.
We have to build separate token lists for each speech.
>>> import textstats
>>> fn2toks = {}
>>> for (fn, txt) in fn2txt.items():
        toks = textstats.gettokens(txt)
        fn2toks[fn] = toks

fn2toks: file name as key, token list as value
>>> len(fn2toks)
56
>>> fn2toks['1789-Washington']
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':',...
>>> fn2toks['2001-Bush'][:10]
['president', 'clinton', ',', 'distinguished', 'guests', 'and', 'my', 'fellow', 'citizens', ',']
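13 Speech length, 'peace' count
>>> for fn in fn2toks:
        toks = fn2toks[fn]
        print(len(toks), fn)

[Output: one line per address, token count followed by file name, from Washington to Obama; the values are not preserved in this transcript.]
>>> for fn in fn2toks:
        toks = fn2toks[fn]
        print(toks.count('peace'), '\t', fn)

[Output: one line per address, count of 'peace' followed by file name; the surviving lines cover Eisenhower, Eisenhower, Kennedy, Johnson, Nixon, Nixon, Carter, with the counts not preserved.]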
14 Average sentence length, per address
>>> for fn in fn2toks:
        toks = fn2toks[fn]
        sentcount = toks.count('.') + toks.count('!') \
                    + toks.count('?')
        avgsentlen = len(toks)/sentcount
        print(avgsentlen, '\t', fn)

[Output: one line per address, average sentence length followed by file name, from Washington to Obama; the values are not preserved in this transcript.]
Assumes every sentence ends with '.', '!', or '?'
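The computation above divides by `sentcount`, which fails with `ZeroDivisionError` for a text that contains none of the three punctuation marks. The sketch below adds a guard. Two things here are assumptions: `gettokens` is a hypothetical regex-based stand-in for `textstats.gettokens` (whose implementation is not shown in these slides), and treating a text without sentence-final punctuation as one sentence is just one possible design choice.

```python
import re

def gettokens(txt):
    # Hypothetical stand-in for textstats.gettokens: lowercase, then
    # split into word tokens and single punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", txt.lower())

def avg_sent_len(toks):
    """Tokens per sentence, counting '.', '!' and '?' as sentence ends."""
    sentcount = sum(toks.count(p) for p in ".!?")
    if sentcount == 0:
        # No sentence-final punctuation: treat the text as one sentence.
        return float(len(toks))
    return len(toks) / sentcount

# Made-up texts, not corpus data.
fn2txt = {"a": "We the people. We stand!", "b": "No end mark here"}
for fn, txt in fn2txt.items():
    print(avg_sent_len(gettokens(txt)), "\t", fn)
```

As in the slide, punctuation tokens are counted in `len(toks)`, so the averages run slightly higher than a words-only count would.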
15 Treating files as a single corpus
Task: Compile word frequency of the Inaugural Speeches.
For this, we only need to build a single pool of tokenized words.
For each text, tokenize it, and then add the result to the pool of tokenized words.
>>> import textstats
>>> alltoks = []
>>> for txt in fn2txt.values():
        toks = textstats.gettokens(txt)
        alltoks.extend(toks)

>>> len(alltoks)
>>> alltoks[:15]
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
>>> alltoks[-15:]
['you', '.', 'god', 'bless', 'you', '.', 'and', 'god', 'bless', 'the', 'united', 'states', 'of', 'america', '.']
16 Word frequency of entire corpus
>>> allfreq = textstats.getfreq(alltoks)
>>> allfreq['citizens']
237
>>> allfreq['battle']
12
>>> for k in sorted(allfreq, key=allfreq.get, reverse=True)[:10]:
        print(k, allfreq[k])

the 9906
of 6986
, 6862
and [count not preserved]
to 4432
in 2749
a 2193
our 2058
that 1726
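For comparison, the standard library's `collections.Counter` builds the same kind of token-to-count mapping and has a built-in `most_common()` method. This is a sketch on a made-up token list; it is not a claim about how `textstats.getfreq` is implemented.

```python
from collections import Counter

# Made-up token list, not corpus data.
alltoks = ["the", "people", "of", "the", "united", "states",
           "of", "the", "world", "."]

allfreq = Counter(alltoks)   # maps each token to its count
print(allfreq["the"])

# most_common(n) returns the n highest-count (token, count) pairs.
for word, count in allfreq.most_common(3):
    print(word, count)
```

Because `Counter` is a `dict` subclass, the slide's `sorted(allfreq, key=allfreq.get, reverse=True)` idiom works on it unchanged.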
17 Treating files as a single corpus, take 2
Task: Compile word frequency of the Inaugural Speeches.
Alternative approach: join all text strings into a single gigantic text string, and then tokenize it all at once.
# All speech texts, concatenated with a line break in between
>>> alltxt = '\n'.join(fn2txt.values())
>>> alltoks = textstats.gettokens(alltxt)
>>> len(alltoks)
>>> alltoks[:15]
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
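The two pooling approaches give the same token list as long as the tokenizer never builds a token across a line break. The sketch below checks this on made-up texts with a hypothetical regex-based `gettokens`, which stands in for `textstats.gettokens` (not shown in these slides).

```python
import re

def gettokens(txt):
    # Hypothetical stand-in for textstats.gettokens.
    return re.findall(r"\w+|[^\w\s]", txt.lower())

# Made-up texts, not corpus data.
fn2txt = {"a": "Fellow citizens:", "b": "God bless you."}

# Approach 1: tokenize each text, then pool the tokens.
pooled = []
for txt in fn2txt.values():
    pooled.extend(gettokens(txt))

# Approach 2: join the texts with a line break, then tokenize once.
joined = gettokens("\n".join(fn2txt.values()))

print(pooled == joined)
```

The line break matters: joining with an empty string could glue the last word of one speech to the first word of the next.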