Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set

Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set
Tianze Shi* Liang Huang† Lillian Lee* (*Cornell University, †Oregon State University)
[Title-slide diagram: theoretical O(n^3) vs. practical O(n^6), bridged by a minimal feature set]

Short Version
Transition-based dependency parsing has an exponentially large search space. O(n^3) exact solutions exist. In practice, however, we needed rich features: O(n^6). (This work) With bi-LSTMs, now we can do O(n^3)! And we get state-of-the-art results.

Dependency Parsing
[Example: INPUT "She wanted to eat an apple"; OUTPUT a dependency tree with arcs root, nsubj, xcomp, mark, obj, det]

Transition-based Dependency Parsing
Goal: max score(transition sequence) = max score(tree), searching over all sequences from the initial state to a terminal state.

Exact Decoding with Dynamic Programming
Goal: max score(transition sequence) = max score(tree). Dynamic programming brings the search from exponential to polynomial (Huang and Sagae, 2010; Kuhlmann, Gómez-Rodríguez and Satta, 2011).

Transition Systems (all three are analyzed in our paper):
Arc-standard: O(n^4) DP complexity, 3 action types
Arc-eager: O(n^3) DP complexity, 4 action types
Arc-hybrid: O(n^3) DP complexity, 3 action types (used here for presentational convenience)

Arc-hybrid Transition System
State: a stack (..., s2, s1, s0) and a buffer (b0, b1, ...).
Initial state: ROOT on the stack, the whole sentence in the buffer.
Terminal state: only ROOT on the stack, empty buffer.
(Yamada and Matsumoto, 2003; Gómez-Rodríguez et al., 2008; Kuhlmann et al., 2011)

Arc-hybrid Transition System: Transitions
shift: move b0 from the buffer onto the stack.
reduce-left: pop s0, adding the arc b0 → s0.
reduce-right: pop s0, adding the arc s1 → s0 (same as arc-standard).
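To make the transitions concrete, here is a minimal executable sketch in Python (the list-based state and the arc bookkeeping are our own illustration, not code from the paper); it replays the derivation traced on the next slides:

```python
def shift(stack, buffer, arcs):
    """shift: move b0 from the front of the buffer onto the stack."""
    stack.append(buffer.pop(0))

def reduce_left(stack, buffer, arcs):
    """reduce-left: pop s0; its head is b0 (adds the arc b0 -> s0)."""
    arcs.append((buffer[0], stack.pop()))

def reduce_right(stack, buffer, arcs):
    """reduce-right: pop s0; its head is s1 (adds s1 -> s0), as in arc-standard."""
    dependent = stack.pop()
    arcs.append((stack[-1], dependent))

# Running example from the slides: 2n transitions for n = 6 words.
stack, buffer, arcs = ["ROOT"], "She wanted to eat an apple".split(), []
for act in [shift, reduce_left, shift, shift, reduce_left, shift, shift,
            reduce_left, shift, reduce_right, reduce_right, reduce_right]:
    act(stack, buffer, arcs)
assert stack == ["ROOT"] and not buffer  # terminal state reached
# arcs holds (head, dependent) pairs such as ("wanted", "She"), ("ROOT", "wanted")
```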

Arc-hybrid Transition System: Example
Stack | Buffer
initial: ROOT | She wanted to eat an apple
shift: ROOT She | wanted to eat an apple
reduce-left (wanted → She): ROOT | wanted to eat an apple
shift: ROOT wanted | to eat an apple
shift: ROOT wanted to | eat an apple

Arc-hybrid Transition System: Example (continued)
Stack | Buffer
ROOT wanted to | eat an apple
reduce-left (eat → to): ROOT wanted | eat an apple
shift: ROOT wanted eat | an apple
shift: ROOT wanted eat an | apple
reduce-left (apple → an): ROOT wanted eat | apple
shift: ROOT wanted eat apple | (empty)

Arc-hybrid Transition System: Example (end)
Stack | Buffer
ROOT wanted eat apple | (empty)
reduce-right (eat → apple): ROOT wanted eat | (empty)
reduce-right (wanted → eat): ROOT wanted | (empty)
reduce-right (ROOT → wanted): ROOT | (empty), terminal

Dynamic Programming for Arc-hybrid
Positions: 0 (ROOT), 1 ... n for "She wanted to eat an apple" (n = 6).
Deduction item: [i, j], with i on top of the stack and j at the front of the buffer.
Goal item: [0, n + 1].

Dynamic Programming for Arc-hybrid
shift: [i, j] ⇒ [j, j + 1]

Dynamic Programming for Arc-hybrid
reduce: combine [i, k] and [k, j] into [i, j], popping k.
This is the rule in Kuhlmann, Gómez-Rodríguez and Satta (2011)'s notation: the antecedent items [i, k] and [k, j] yield the consequent [i, j].

Dynamic Programming for Arc-hybrid (the full system)
shift: [i, j] ⇒ [j, j + 1]
reduce (left or right): [i, k] + [k, j] ⇒ [i, j]
Goal: [0, n + 1]. Three free indices (i, k, j) give O(n^3).
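The same system written as inference rules (a sketch of the deduction-system view; the rules restate the slide above, with the [j, j+1] items playing the role of axioms):

```latex
\[
\textbf{shift: } \frac{[i,\,j]}{[j,\,j+1]}
\qquad
\textbf{reduce: } \frac{[i,\,k]\quad [k,\,j]}{[i,\,j]}
\qquad
\textbf{goal: } [0,\,n+1]
\]
```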

Time Complexity in Practice
Complexity depends on the feature representation! Typical feature representation: feature templates look at specific positions in the stack and in the buffer.

Time Complexity in Practice
Compare the feature sets {s0, b0} and {s1, s0, b0}: the shift score is s_sh(i, j) in the first case but s_sh(k, i, j) in the second. The time complexities are different!

Time Complexity in Practice
Best-known complexity in practice: O(n^6) (Huang and Sagae, 2010), with templates over s2, s1, s0, b0, b1 plus the children s1.lc, s1.rc, s0.lc, s0.rc.

Best-known Time Complexities (recap)
O(n^3) theoretical vs. O(n^6) practical. The gap: feature representation.

In Practice, Instead of Exact Decoding
Greedy search (Nivre, 2003, 2004, 2008; Chen and Manning, 2014)
Beam search (Zhang and Clark, 2011; Weiss et al., 2015)
Best-first search (Sagae and Lavie, 2006; Sagae and Tsujii, 2007; Zhao et al., 2013)
Dynamic oracles (Goldberg and Nivre, 2012, 2013)
Global normalization on the beam (Zhou et al., 2015; Andor et al., 2016)
Reinforcement learning (Lê and Fokkens, 2017)
Learning to search (Daumé III and Marcu, 2005; Chang et al., 2016; Wiseman and Rush, 2016)

How Many Positional Features Do We Need?
Non-neural (manual engineering), Chen and Manning (2014): positions s2, s1, s0, b0, b1, b2, plus children s1.lc, s1.rc, s0.lc, s0.rc and grandchildren s1.lc.lc, s1.rc.rc, s0.lc.lc, s0.rc.rc.

How Many Positional Features Do We Need?
Non-neural (manual engineering): Chen and Manning (2014). Stack LSTM (summarizing the whole stack σ): Dyer et al. (2015). Bi-LSTM (a few stack and buffer positions): Kiperwasser and Goldberg (2016); Cross and Huang (2016).

How Many Positional Features Do We Need?
Bi-LSTMs give compact feature representations (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016).
Features used in Kiperwasser and Goldberg (2016): {s2, s1, s0, b0}.
Features used in Cross and Huang (2016): {s1, s0, b0}.

How Many Positional Features Do We Need?
Non-neural manual features (Chen and Manning, 2014) enable slow DP; Stack LSTMs (Dyer et al., 2015) summarize the trees on the stack, leaving only exponential search; bi-LSTMs (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016) summarize the input and enable fast DP.

Model Architecture
Scores s_sh, s_re←, s_re→ come from a multi-layer perceptron over the bi-directional LSTM output vectors at the positional features, which we shrink step by step: {s2, s1, s0, b0} → {s1, s0, b0} → {s0, b0} → {b0}. Inputs: word embeddings + POS embeddings for "She wanted to eat an apple".
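As a concrete sketch of this architecture in PyTorch, assuming the minimal {s0, b0} feature set; the class name, layer sizes, and layer counts are illustrative assumptions, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class MinimalFeatureScorer(nn.Module):
    """Word + POS embeddings -> bi-LSTM -> concatenation of the output
    vectors at positions s0 and b0 -> MLP -> three transition scores
    (s_sh, s_re_left, s_re_right)."""

    def __init__(self, n_words, n_pos, d_word=100, d_pos=25, d_lstm=200, d_hid=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_word)
        self.pos_emb = nn.Embedding(n_pos, d_pos)
        self.bilstm = nn.LSTM(d_word + d_pos, d_lstm,
                              bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(4 * d_lstm, d_hid), nn.ReLU(),
                                 nn.Linear(d_hid, 3))

    def forward(self, words, pos, s0, b0):
        # words, pos: (batch, seq) index tensors; s0, b0: the two feature positions
        x = torch.cat([self.word_emb(words), self.pos_emb(pos)], dim=-1)
        h, _ = self.bilstm(x)                            # (batch, seq, 2 * d_lstm)
        feats = torch.cat([h[:, s0], h[:, b0]], dim=-1)  # the minimal feature set
        return self.mlp(feats)                           # (batch, 3) scores
```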

How Many Positional Features Do We Need?
We answer the question empirically (with greedy decoding): two positional feature vectors are enough!
UAS on PTB (dev):
{s2, s1, s0, b0}: 94.08 ±0.13 (considered in prior work)
{s1, s0, b0}: 94.08 ±0.05 (considered in prior work)
{s0, b0}: 94.03 ±0.12
{b0}: 52.39 ±0.23

How Many Positional Features Do We Need?
Our minimal feature set: s0 (stack) and b0 (buffer).
Counter-intuitive, but it works for greedy decoding, even though the reduce transitions also involve s1.

How Many Positional Features Do We Need?
Our minimal feature set: s0 (stack) and b0 (buffer).
Counter-intuitive, but it works for greedy decoding.
The bare deduction items already contain enough information to extract the features for DP.
This leads to the first O(n^3) implementation of global decoders!

How Many Positional Features Do We Need?
Non-neural manual features (Chen and Manning, 2014) enable slow DP; Stack LSTMs (Dyer et al., 2015) summarize the trees on the stack, leaving only exponential search; bi-LSTMs (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016) summarize the input and enable fast DP. Our work: fast(er) DP.

Best-known Time Complexities (recap)
O(n^3) theoretical vs. O(n^6) practical. The gap: feature representation.

Our Contribution
With the minimal feature set, the practical complexity drops from O(n^6) to O(n^3), matching the theoretical bound.

Decoding
An item [i, j] : v carries the score v of the transition sub-sequence it represents.
shift: [i, j] : v ⇒ [j, j + 1] : 0 (new items start at score 0; the shift score enters later, at reduce time).

Decoding
reduce: [i, k] : v1 + [k, j] : v2 ⇒ [i, j] : v1 + v2 + Δ, where Δ = s_sh(i, k) + s_re(k, j).
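Putting the scored rules together gives an O(n^3) exact decoder. Below is a sketch in plain Python; the score-matrix layout, the split of the reduce score into s_re_left / s_re_right, and the backpointer bookkeeping are our assumptions about one reasonable implementation:

```python
def exact_decode(s_sh, s_re_left, s_re_right, n):
    """O(n^3) exact arc-hybrid decoding (a sketch).

    s_sh[i][k]:       score of shifting k when i is on top of the stack
    s_re_left[k][j]:  score of the reduce popping k with the arc j -> k
    s_re_right[k][j]: score of the reduce popping k with the arc i -> k
    All matrices are (n + 2) x (n + 2); words are 1..n, 0 is ROOT,
    and position n + 1 stands for the empty buffer.
    """
    NEG = float("-inf")
    best = [[NEG] * (n + 2) for _ in range(n + 2)]
    back = [[None] * (n + 2) for _ in range(n + 2)]
    for j in range(n + 1):                 # shift items [j, j+1] start at score 0
        best[j][j + 1] = 0.0
    for width in range(2, n + 2):          # build larger items from smaller ones
        for i in range(n + 2 - width):
            j = i + width
            for k in range(i + 1, j):      # k is the word popped by the reduce
                if best[i][k] == NEG or best[k][j] == NEG:
                    continue
                base = best[i][k] + best[k][j] + s_sh[i][k]
                if j <= n:                 # reduce-left: head of k is j (= b0)
                    v = base + s_re_left[k][j]
                    if v > best[i][j]:
                        best[i][j], back[i][j] = v, (k, j)
                v = base + s_re_right[k][j]  # reduce-right: head of k is i (= s1)
                if v > best[i][j]:
                    best[i][j], back[i][j] = v, (k, i)

    head = [0] * (n + 1)                   # head[w]: predicted head of word w
    def backtrace(i, j):
        if j > i + 1:
            k, h = back[i][j]
            head[k] = h
            backtrace(i, k)
            backtrace(k, j)
    backtrace(0, n + 1)
    return best[0][n + 1], head
```

Each (i, k, j) triple does O(1) work because every score is a two-index lookup; this is exactly where the minimal feature set pays off.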

Training
Separate incorrect parses from the correct one by a margin: minimize max over trees of [score(tree) + cost(tree)] − score(gold tree).
Cost-augmented decoding (Taskar et al., 2005):
reduce: [i, k] : v1 + [k, j] : v2 ⇒ [i, j] : v1 + v2 + s_sh(i, k) + s_re(k, j) + 1[the head assigned to k is wrong].
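In the DP, this cost augmentation only touches the reduce step. A sketch of how the inner loop of the exact_decode sketch above changes at training time (gold_head, giving each word's gold head, is the added assumption; the +1 is the Hamming cost from the slide):

```python
# Inside the (i, k, j) loop of exact_decode, at training time only:
# every reduce that assigns word k a non-gold head pays +1 (Hamming cost).
base = best[i][k] + best[k][j] + s_sh[i][k]
if j <= n:                                               # reduce-left: head j
    v = base + s_re_left[k][j] + float(gold_head[k] != j)
    if v > best[i][j]:
        best[i][j], back[i][j] = v, (k, j)
v = base + s_re_right[k][j] + float(gold_head[k] != i)   # reduce-right: head i
if v > best[i][j]:
    best[i][j], back[i][j] = v, (k, i)
```

The hinge loss is then the best cost-augmented score minus the gold derivation's score (non-negative, since the gold tree itself has zero cost); it reaches zero exactly when the gold tree beats every alternative by at least its Hamming distance.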

Comparing with State-of-the-art
[Scatter plot of English PTB UAS (x-axis, 93.0 to 96.0) against Chinese CTB UAS (y-axis, 86.0 to 90.5): prior local and global parsers (BGDS16, DBLMS15, KG16a, KG16b, CH16, KBKDS16, CFHGD16, WC16, DM17), successively adding our arc-eager DP and arc-hybrid DP global models, our best local model, and ensembles]

Results: CoNLL 2017 Shared Task
Macro-average LAS over 81 treebanks in 49 languages; 2nd-highest overall performance.
Ensemble: 75.00; exact arc-eager: 74.32; exact arc-hybrid: 74.00; graph-based: 73.75.
(Shi, Wu, Chen and Cheng, 2017; Zeman et al., 2017)

Conclusion
The bi-LSTM feature set is minimal yet highly effective.
First O(n^3) implementation of exact decoders.
Global training and decoding give high performance.

More in Our Paper
Description and analysis of three transition systems (arc-standard, arc-hybrid, arc-eager).
CKY-style representations of the deduction systems.
Theoretical analysis of the global methods: arc-eager models can simulate arc-hybrid models, and arc-eager models can simulate edge-factored models.

Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set
https://github.com/tzshi/dp-parser-emnlp17
Tianze Shi* Liang Huang† Lillian Lee* (*Cornell University, †Oregon State University)

CKY-style Visualization [figure not transcribed]

Results with Arc-eager and Arc-standard [results not transcribed]