Learning Latent Linguistic Structure to Optimize End Tasks
David A. Smith, with Jason Naradowsky and Xiaoye Tiger Wu. 12 October 2012


Why Use NLP?
K 2: oh, we need consecutive
S str 1: we need 6 and 7
S hnd 2: i have 8 9 4
C pup 2: why do n t you get 7

Task-directed Learning What does the cow say? The cow says mooo

Language Learning
[Diagram: the utterance "the cow says moo" is analyzed (REL: say) and entered into a KB as say(cow, "moo") = 0.87, with a learning feedback loop from the KB back to the analyzer.]

How to Use NLP?
K 2: oh, we need consecutive
S str 1: we need 6 and 7
S hnd 2: i have 8 9 4
C pup 2: why do n t you get 7
Can we parse these as coordinations?
Do we need training data for tagging, parsing, entity detection, coref, etc.?
Isn't predicting this, or other game states, what we care about?

The Story So Far
Joint inference prevents propagating errors. But...
Out-of-domain data plays havoc with NLP.
Joint training requires everything to be annotated.

Learning Latent Representations
Require less training data for each stage in the pipeline and for adapting to a new domain.
Efficient inference and learning for coupling structured observed and latent variables.
Features inspired by pipelined models.

This Talk
Motivation
Latent representations from auxiliary tasks
Latent representations from structured end tasks

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)
Start state: ([ ], [1, ..., n], { })
Final state: (S, [ ], A)
Shift: (S, i|B, A) ⇒ (S|i, B, A)
Reduce: (S|i, B, A) ⇒ (S, B, A)
Right-Arc: (S|i, j|B, A) ⇒ (S|i|j, B, A ∪ {i → j})
Left-Arc: (S|i, j|B, A) ⇒ (S, j|B, A ∪ {i ← j})
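The four transitions are simple enough to state directly in code. Here is a minimal Python sketch of the system above (an illustration written for this transcription, not code from the talk); a configuration is a (stack, buffer, arcs) triple and each arc is a (head, dependent, label) triple:

```python
def shift(stack, buffer, arcs):
    # (S, i|B, A) => (S|i, B, A)
    return stack + [buffer[0]], buffer[1:], arcs

def reduce_(stack, buffer, arcs):
    # (S|i, B, A) => (S, B, A); precondition: i already has a head
    assert any(dep == stack[-1] for _, dep, _ in arcs)
    return stack[:-1], buffer, arcs

def right_arc(stack, buffer, arcs, label):
    # (S|i, j|B, A) => (S|i|j, B, A + {i -> j})
    i, j = stack[-1], buffer[0]
    return stack + [j], buffer[1:], arcs | {(i, j, label)}

def left_arc(stack, buffer, arcs, label):
    # (S|i, j|B, A) => (S, j|B, A + {i <- j}); precondition: i has no head yet
    i, j = stack[-1], buffer[0]
    assert not any(dep == i for _, dep, _ in arcs)
    return stack[:-1], buffer, arcs | {(j, i, label)}
```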

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003): trace for "who did you see"

                  Stack [ ]          Buffer [who, did, you, see]   Arcs { }
Shift             Stack [who]        Buffer [did, you, see]        Arcs { }
Left-Arc (OBJ)    Stack [ ]          Buffer [did, you, see]        Arcs { who ←OBJ− did }
Shift             Stack [did]        Buffer [you, see]             Arcs { who ←OBJ− did }
Right-Arc (SBJ)   Stack [did, you]   Buffer [see]                  Arcs { who ←OBJ− did, did −SBJ→ you }
Reduce            Stack [did]        Buffer [see]                  Arcs { who ←OBJ− did, did −SBJ→ you }
Right-Arc (VG)    Stack [did, see]   Buffer [ ]                    Arcs { who ←OBJ− did, did −SBJ→ you, did −VG→ see }

At each step, choose the action with the best classifier score (100k to 1M features).
Very fast, linear-time performance: WSJ section 23 (2k sentences) in 3 s.
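Driving the sketch above through this same action sequence reproduces the trace (illustrative only, with word strings standing in for token indices):

```python
config = ([], ["who", "did", "you", "see"], set())
config = shift(*config)
config = left_arc(*config, "OBJ")   # who <-OBJ- did
config = shift(*config)
config = right_arc(*config, "SBJ")  # did -SBJ-> you
config = reduce_(*config)
config = right_arc(*config, "VG")   # did -VG-> see
print(config)  # stack ['did', 'see'], empty buffer, and the three arcs above
```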

Better Features for Parsing
[Figure: bar chart of unlabeled accuracy, comparing the baseline with 7-gram and 7-dep cluster feature variants; the values appear on the later slides.]

Classifying Actions
Stack [did]   Buffer [you, see]   Arcs { who ←OBJ− did }   Action: Shift
Features: POS and form of the top 2 stack items; POS and form in the buffer (lookahead 3); dependency labels of words in the stack and buffer.
Output: Action [+ Label]
Can these features be made less sparse?
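One concrete reading of these templates, as a hedged sketch (the function and the `pos`, `form`, and `head_label` lookup tables are assumptions made for illustration, not the talk's code):

```python
def action_features(stack, buffer, pos, form, head_label):
    feats = []
    for k, w in enumerate(reversed(stack[-2:])):   # POS, form of top 2 stack items
        feats += [f"s{k}.pos={pos[w]}", f"s{k}.form={form[w]}"]
    for k, w in enumerate(buffer[:3]):             # POS, form in buffer; lookahead 3
        feats += [f"b{k}.pos={pos[w]}", f"b{k}.form={form[w]}"]
    for name, items in (("s0", stack[-1:]), ("b0", buffer[:1])):
        if items and items[0] in head_label:       # dep. label, if already attached
            feats.append(f"{name}.label={head_label[items[0]]}")
    return feats
```

Conjunctions of indicator features like these are what produce the 100k to 1M sparse features the classifier scores; the distributional embeddings that follow are one way to make them less sparse.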

Distributional Embeddings
Keyword in 7-gram context:
among them the winner in Paris in
becoming a Snavely winner , I was but
onto the 77 winner mark with Bishops
The recent Newbury winner has been installed
to be the winner for the second
We've got the winner on the phone
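For concreteness, a tiny sketch (written for this transcription, not from the talk) of how such 7-gram contexts can be collected; the example sentence is padded with hypothetical words to fill the window:

```python
# Collect 7-gram contexts (3 words on either side) around a keyword.
def contexts(tokens, keyword, k=3):
    return [tokens[i - k:i + k + 1]
            for i, t in enumerate(tokens)
            if t == keyword and k <= i < len(tokens) - k]

sent = "among them the winner in Paris in this film".split()
print(contexts(sent, "winner"))
# [['among', 'them', 'the', 'winner', 'in', 'Paris', 'in']]
```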

Distributional Embedding
became the first female winner of the prestigious Nobel Prize

Keyword in 7-dependency context:
became the first female winner of the prestigious Nobel Prize

Discriminative language model training: replace the observed keyword with an impostor (winner → capital) and learn to predict the observed word.

Look up each context word in a codebook (7 words × 50 dimensions), feed the codes to a 3-layer perceptron that answers "correct word?", and let backprop update the codes. Words become points in 50-dimensional space.
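A sketch of this training setup in PyTorch (a reconstruction of the architecture as described, not the talk's code; the hidden size, ranking loss, and placeholder data are assumptions, while the 7-word window and 50-dimensional codes follow the slides):

```python
import torch
import torch.nn as nn

class WindowScorer(nn.Module):
    def __init__(self, vocab_size, dim=50, window=7, hidden=100):
        super().__init__()
        self.codebook = nn.Embedding(vocab_size, dim)   # lookup in codebook
        self.mlp = nn.Sequential(                        # 3-layer perceptron
            nn.Linear(window * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, windows):            # windows: (batch, 7) word ids
        codes = self.codebook(windows)     # (batch, 7, 50)
        return self.mlp(codes.flatten(1)).squeeze(-1)  # score: "correct word?"

model = WindowScorer(vocab_size=100_000)
observed = torch.randint(0, 100_000, (32, 7))        # placeholder windows
corrupted = observed.clone()
corrupted[:, 3] = torch.randint(0, 100_000, (32,))   # corrupt the center word
# Train the observed window to outscore the corrupted one (margin ranking).
loss = torch.clamp(1 - model(observed) + model(corrupted), min=0).mean()
loss.backward()                            # backprop updates the codes too
```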

Better Features for Parsing
[Figure: bar chart of unlabeled accuracy. Legend: Baseline; 7-gram clusters (7-gram, +7-gram, +7-gram +ext); 7-dep clusters (7-dep, +7-dep, +7-dep +ext). Values in the order revealed: 84.17, 82.95, 83.33, 84.66, 89.49, 89.82, 89.74, 90.28.]

Better Features for Parsing
[Figure: unlabeled accuracy by dependency length]
Length:    ROOT   1      2      3-6    7+
Baseline:  82     94.2   87.22  79.44  56.06
+7-dep:    86.5   96.38  92.18  86.71  72.12

Unsupervised Domain Adaptation
[Figure: unlabeled accuracy on WSJ and Brown]
        Baseline  W embed  B embed  W+B embed
WSJ:    84.17     89.82    87.93    89.44
Brown:  81.58     86.67    86.74    87.05

This Talk
Motivation
Latent representations from auxiliary tasks
Latent representations from structured end tasks

Semantic Role Labeling
Pierre Vinken, 61 years old, will join the board as a nonexecutive director
Vinken's joining the board as director is imminent
PRED: join   ARG0: Vinken   ARG1: board   ARG-PRD: director

Coupling Structured Variables
Outputs are complex combinatorial objects, e.g., sequences, trees, graphs, strings.
Intermediate representations are complex combinatorial objects too; e.g., syntax helps predict semantics.
Developed efficient inference and learning.

Composable Syntax
Semantic role labeling and relation extraction, e.g., employs(organization, PERSON): dependency and constituency models.
Named-entity recognition: couple the constituency model with a semi-CRF; no grammar composition.

Global Constraints
Dependency trees: cf. projective and non-projective (MST) parsing
(Nested) bracketings: cf. CKY and semi-CRFs
Matchings and alignments: weighted bipartite matching and network flow; ITG
Intersecting specialized grammars (CFG, TAG, etc.)
Other finite-state, context-free, etc., constraints...
Compare TurboParsers and dual decomposition
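To make the "global constraint" idea concrete, here is a toy sketch (written for this transcription, not the talk's models): treat each word's head choice as a variable and keep only joint assignments that satisfy a dependency-tree constraint, the kind of combinatorial factor the next slide couples to semantic predictions.

```python
from itertools import product

# Each word 1..n picks a head in 0..n (0 = root). An assignment satisfies
# the tree constraint only if every word's head chain reaches the root
# without a cycle.
def dependency_trees(n):
    for heads in product(range(n + 1), repeat=n):
        if all(_reaches_root(d, heads) for d in range(1, n + 1)):
            yield heads

def _reaches_root(d, heads):
    seen, h = set(), d
    while h != 0:
        if h in seen:
            return False        # cycle: violates the tree constraint
        seen.add(h)
        h = heads[h - 1]
    return True

print(len(list(dependency_trees(3))))  # 16 of the 4^3 = 64 assignments are trees
```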

Semantic Role Labeling
[Factor graph, shown for predicate position 5:
a) syntactic combinatorial constraint: DEP-Tree;
b) syntactic layer: Link(5,1), Link(5,2), Link(5,3), ..., Link(5,n), with D-Connect(5,1), D-Connect(5,2), D-Connect(5,3), ..., D-Connect(5,n);
c) argument prediction: Arg(5,1), Arg(5,2), Arg(5,3), ..., with roles A0, A1, A2, ..., A-TM, each under an at-most-1 constraint;
d) sense prediction: Pred(5) with senses .01, .02, ..., under an at-most-1 constraint.]

Semantic Role Labeling
[Figure: labeled F1 on Chinese, Czech, English, and German for Baseline, Oracle syntax, and Hidden syntax, shown against the median and max SRL systems.]

Hidden Syntax
Gold vs. hidden parse:
$ Alles bleibt voller Widersprüche.
(Everything remains full of contradictions.)
Heading the sentence with punctuation doesn't affect SRL.

Hidden Syntax
Gold vs. hidden parse:
$ Es bleibt nur die Hoffnung auf einen erneuten Urknall.
(There remains only the hope of a renewed Big Bang.)
It's useful to have NPs consistent inside and outside of PPs.

Relation Extraction

Relation Extraction
[Figure: labeled F1 (axis 70-80) for Baseline, Parser Deps., Hidden Deps., Parser Constit., Hidden Constit.]

In Summary
Representations trained on more appropriate auxiliary tasks support better learning.
Models with latent syntax approach the performance of those with oracle syntax.
Assertion (see papers): latent structured variables without asymptotic slowdown.

Thank you

Hidden Distribution