Handout 12: Textual models


Taylor Arnold

Loading and parsing the data

The full text of every State of the Union address through 2016 is available in the R package sotu, available on CRAN. The package also contains metadata about each speech, which we add to the document table while annotating the corpus. The annotation is run with:

    library(sotu)
    library(cleanNLP)

    data(sotu_text)
    data(sotu_meta)
    init_spaCy()
    sotu <- cleanNLP::run_annotators(sotu_text, as_strings = TRUE,
                                     meta = sotu_meta)

The annotation object, which is used throughout the following analysis, is stored in the object sotu. We create a single data frame with all of the tokens and save the metadata as a separate data set:

    tokens <- cleanNLP::get_token(sotu, combine = TRUE)
    tokens

    ## # A tibble: 2,181,493 x 14
    ##       id   sid   tid word
    ##    <int> <int> <int> <chr>
    ##  1     1     1     1 Fellow
    ##  2     1     1     2 -
    ##  3     1     1     3 Citizens
    ##  4     1     1     4 of
    ##  5     1     1     5 the
    ##  6     1     1     6 Senate
    ##  7     1     1     7 and
    ##  8     1     1     8 House
    ##  9     1     1     9 of
    ## 10     1     1    10 Representatives
    ## # ... with 2,181,483 more rows, and 10 more
    ## #   variables: lemma <chr>, upos <chr>,
    ## #   pos <chr>, cid <int>, source <int>,
    ## #   relation <chr>, word_source <chr>,
    ## #   lemma_source <chr>, entity_type <chr>,
    ## #   entity <chr>

    doc <- cleanNLP::get_document(sotu)

Models

The get_tfidf function provided by cleanNLP converts a token table into a sparse matrix representing the term-frequency inverse document frequency (TF-IDF) matrix, or any intermediate part of that calculation. This is particularly useful when building models from a textual corpus. The tidy_pca function, also included with the package, takes a matrix and returns a data frame containing the desired number of principal components. Dimension reduction involves passing the token table for a corpus to get_tfidf and passing the result to tidy_pca.

    # filter() and select() come from dplyr
    library(dplyr)

    # keep only common nouns (Penn Treebank tags NN and NNS)
    temp <- filter(tokens, pos %in% c("NN", "NNS"))
    tfidf <- get_tfidf(temp, min_df = 0.05, max_df = 0.95,
                       type = "tfidf", tf_weight = "dnorm")$tfidf
    pca <- tidy_pca(tfidf, get_document(sotu))
    select(pca, year, PC1, PC2)

In this example only non-proper nouns are included, which reduces the influence of each speech's style and puts the focus on its content. A scatter plot of the speeches using these components is shown in Figure 1. There is a definite temporal pattern to the documents, with the 20th century addresses forming a distinct cluster on the right side of the plot.

The output of the get_tfidf function may be given directly to the LDA function in the package topicmodels. The topic model function requires raw counts, so the type argument of get_tfidf is set to "tf"; the results may then be passed directly to LDA.

    library(topicmodels)

    temp <- filter(tokens, pos %in% c("NN", "NNS"))
    temp <- get_tfidf(temp, min_df = 0.05, max_df = 0.95,
                      type = "tf", tf_weight = "raw")
    # fit a 16-topic model on the raw term counts
    tm <- LDA(temp$tf, k = 16, control = list(verbose = 1))

The topics, ordered by approximate time period, are visualized in Figure 2. Most topics persist for a few decades and then largely disappear, though some persist over non-contiguous periods of the presidency.
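
The ordering of topics by time period can be made concrete with a small sketch. According to the caption of Figure 2, each document is placed into the topic with the maximum posterior probability, and topics are sorted by the median year of the documents placed in them. The matrix and years below are made-up values used only for illustration; with topicmodels, a matrix of the same shape (documents in rows, topics in columns) is returned by posterior(tm)$topics.

```r
# Hypothetical posterior matrix: rows are documents, columns are topics.
post <- matrix(
  c(0.7, 0.2, 0.1,
    0.6, 0.3, 0.1,
    0.1, 0.1, 0.8,
    0.2, 0.1, 0.7,
    0.1, 0.8, 0.1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("topic1", "topic2", "topic3"))
)
year <- c(1800, 1810, 1900, 1910, 2000)  # made-up document years

# Assign each document to its most probable topic.
top_topic <- colnames(post)[apply(post, 1, which.max)]

# Median year of the documents assigned to each topic, sorted chronologically.
topic_year <- sort(tapply(year, top_topic, median))
topic_order <- names(topic_year)
print(topic_order)
```

A topic to which no document is assigned would drop out of this ordering, so a full analysis may want to handle that case separately.
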
The Energy topic, for example, appears during the 1950s and crops up again during the energy crisis of the 1970s. The "world, man, freedom, force, defense" topic peaks during both World Wars, but is absent during the 1920s and early 1930s.

Figure 1: State of the Union speeches, highlighting each President's first address, plotted using the first two principal components of the term frequency matrix of non-proper nouns. [Scatter plot: PC1 on the horizontal axis, PC2 on the vertical axis; points are labeled by president and colored by year bin, from (1790,1813] to (1993,2016].]

Finally, the cleanNLP data model is also convenient for building predictive models. The State of the Union corpus does not lend itself to an obviously applicable prediction problem, so for the purpose of illustration a classifier is constructed here that distinguishes speeches made by George W. Bush from those made by Barack Obama. As a first step, a term-frequency matrix is extracted using the same technique as was used for the topic model. Here, however, the frequency is computed for each sentence in the corpus rather than for each document as a whole. The ability to do this with a single additional mutate call that defines a new id illustrates the flexibility of the get_tfidf function.

    library(Matrix)
    library(dplyr)

    temp <- left_join(tokens, doc)
    temp <- filter(temp, year > 2000)
    temp <- filter(temp, pos %in% c("NN", "NNS"))
    # one pseudo-document per sentence: document id plus sentence id
    temp <- mutate(temp, new_id = paste(id, sid, sep = "-"))
    mat <- get_tfidf(temp, min_df = 0, max_df = 1, type = "tf",
                     tf_weight = "raw", doc_var = "new_id")

It is necessary to define a response variable y indicating whether a sentence comes from a speech made by President Obama, as well as a training flag indicating which speeches were made in odd-numbered years. This is done via a separate table join and a pair of mutations.

    meta <- left_join(data_frame(new_id = mat$id),
                      temp[!duplicated(temp$new_id),])
    meta <- mutate(meta, y = as.numeric(president == "Barack Obama"))
    meta <- mutate(meta, train = year %in% seq(2001, 2016, by = 2))

The output may now be used as input to the elastic net function provided by the glmnet package. The family is set to binomial given the binary nature of the response, and training is done on only those speeches occurring in odd-numbered years.
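
The sentence-level setup described above (a new id per sentence, raw term counts, a binary response and an odd-year training flag) can be sketched in base R on a toy token table. This is an illustration under stated assumptions rather than what get_tfidf does internally: the column names id, sid and lemma follow the tokens table, while the token values, years and the doc_meta table are made up.

```r
# Toy token table (made-up values); id = document, sid = sentence.
toks <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2),
  sid   = c(1, 1, 2, 2, 1, 1, 1),
  lemma = c("job", "tax", "job", "job", "freedom", "enemy", "freedom"),
  stringsAsFactors = FALSE
)
# Hypothetical document-level metadata.
doc_meta <- data.frame(
  id = c(1, 2), year = c(2011, 2002),
  president = c("Barack Obama", "George W. Bush"),
  stringsAsFactors = FALSE
)

# One pseudo-document per sentence: combine document id and sentence id.
toks$new_id <- paste(toks$id, toks$sid, sep = "-")

# Raw term counts: rows are sentences, columns are lemmas.
tf <- unclass(table(toks$new_id, toks$lemma))

# Response and training flag, one entry per row of tf.
row_doc <- toks$id[match(rownames(tf), toks$new_id)]
meta <- doc_meta[match(row_doc, doc_meta$id), ]
y <- as.numeric(meta$president == "Barack Obama")
train <- meta$year %% 2 == 1  # odd years form the training set
print(tf)
```

Because every sentence inherits the year of its speech, all sentences from one address fall on the same side of the train/test split.
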
Cross-validation is used to select the best value of the model's tuning parameter.

    library(glmnet)

    # fit on the training rows only (sentences from odd-numbered years)
    model <- cv.glmnet(mat$tf[meta$train,], meta$y[meta$train],
                       family = "binomial")

The algorithm does a very good job of separating the speeches, as shown in Figure 3. Comparing the odd years with the even years (the training and testing sets, respectively) indicates that the model has not been over-fit.

Figure 2: Distribution of topic model posterior probabilities over time on the State of the Union corpus. Top five words associated with each topic are displayed, with topics sorted by the median year of all documents placed into the respective topic using the maximum posterior probabilities. [Plot: Year on the horizontal axis, 1800 to 2000; posterior probability shown on a scale of 0.25, 0.50, 0.75. Topics, in plotted order:]

    subject, citizen, object, measure, commerce
    duty, act, treaty, commerce, right
    bank, duty, money, currency, system
    citizen, act, treaty, duty, force
    report, citizen, subject, attention, land
    gold, cent, condition, work, silver
    question, treaty, service, right, officer
    man, business, work, condition, corporation
    system, condition, service, purpose, policy
    world, man, life, freedom, way
    world, force, effort, defense, freedom
    dollar, expenditure, program, price, tax
    program, system, effort, development, world
    program, energy, policy, legislation, effort
    child, tonight, tax, family, budget
    job, business, world, tax, family

Figure 3: Boxplot of predicted probabilities, at the sentence level, for all 16 State of the Union addresses by Presidents George W. Bush and Barack Obama. The probability represents the extent to which the model believes the sentence was spoken by President Obama. Odd years were used for training and even years for testing. Cross-validation on the training set was used, with the one standard error rule, to set the lambda tuning parameter. [Plot: predicted probability on the horizontal axis, 0.00 to 1.00; Year on the vertical axis, 2001 to 2016; boxes colored by President.]

One benefit of the penalized regression model is that its coefficients can be interpreted in a meaningful way. Here are the non-zero elements of the regression vector, coded by whether they have a positive (more Obama) or negative (more Bush) sign:

    # drop the intercept, then print each retained term with its sign
    beta <- coef(model, s = model[["lambda"]][11])[-1]
    sprintf("%s (%d)", mat$vocab, sign(beta))[beta != 0]

     [1] "job (1)"          "business (1)"      "citizen (-1)"
     [4] "terrorist (-1)"   "government (-1)"   "freedom (-1)"
     [7] "home (1)"         "college (1)"       "weapon (-1)"
    [10] "deficit (1)"      "company (1)"       "peace (-1)"
    [13] "enemy (-1)"       "terror (-1)"       "income (-1)"
    [16] "drug (-1)"        "kid (1)"           "regime (-1)"
    [19] "class (1)"        "crisis (1)"        "industry (1)"
    [22] "need (-1)"        "fact (1)"          "relief (-1)"
    [25] "bank (1)"         "liberty (-1)"      "society (-1)"
    [28] "duty (-1)"        "folk (1)"          "account (-1)"
    [31] "compassion (-1)"  "environment (-1)"  "inspector (-1)"

These generally seem as expected given the main policy topics of focus under each administration. During most of the Bush presidency, as mentioned before, the focus was on national security and foreign policy. Obama, on the other hand, inherited the recession of 2008 and was far more focused on overall economic policy.
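
The caption of Figure 3 mentions that the lambda tuning parameter was set with the one standard error rule. As a minimal sketch of that rule, assuming a hypothetical grid of penalties with made-up cross-validated errors: find the penalty with the lowest mean error, then take the largest penalty whose error is still within one standard error of that minimum.

```r
# Hypothetical cross-validation summary over a grid of penalties.
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)  # candidate penalty values
cv_err <- c(0.60, 0.45, 0.40, 0.38, 0.39)  # mean held-out error per lambda
cv_se  <- c(0.03, 0.03, 0.03, 0.03, 0.03)  # standard error of each mean

# Penalty with the smallest mean error.
i_min <- which.min(cv_err)
lambda_min <- lambda[i_min]

# One standard error rule: the largest penalty whose error is within one
# standard error of the minimum, which favors a sparser model.
threshold <- cv_err[i_min] + cv_se[i_min]
lambda_1se <- max(lambda[cv_err <= threshold])
print(c(lambda_min = lambda_min, lambda_1se = lambda_1se))
```

In glmnet, a cv.glmnet fit stores the corresponding values as lambda.min and lambda.1se, so coef(model, s = "lambda.1se") would be one way to apply the rule directly.
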