Agenda for today
- Homework questions, issues?
- Non-projective dependencies
- Spanning tree algorithm for non-projective parsing


Projective vs. non-projective dependencies
If we extract dependencies from trees, they have certain properties:
- Each word has one governor that it points to
- Each word governs some span
- A word's governor cannot fall within the span that word governs
- When we project the tree onto the string, no crossing dependencies result
Other languages do not fit nicely into constituent structure, e.g., free word order languages such as Czech. These languages have a higher frequency of non-projective dependencies.

Schematic from Hall and Novák (2005)
[Figure 1: Examples of projective and non-projective trees. The trees on the left and center are both projective; the tree on the right is non-projective.]
A tree is projective if, for every three nodes w_a, w_b, and w_c where a < b < c: if w_a is governed by w_c, then w_b is transitively governed by w_c; and if w_c is governed by w_a, then w_b is transitively governed by w_a.
Keith Hall and Václav Novák. 2005. Corrective Modeling for Non-Projective Dependency Parsing. In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT).
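To make the definition concrete, here is a minimal Python sketch (my own, not from the slides) that tests projectivity via the equivalent no-crossing-arcs condition:

    def is_projective(heads):
        """heads[d] is the governor of word d (words are 1..n; heads[d] == 0
        means word d depends on the artificial ROOT; heads[0] is unused).
        A tree is projective iff no two arcs cross when drawn above the words."""
        arcs = [(min(d, heads[d]), max(d, heads[d])) for d in range(1, len(heads))]
        for (a, b) in arcs:
            for (c, d) in arcs:
                # (a,b) and (c,d) cross iff exactly one endpoint of (c,d)
                # lies strictly inside (a,b)
                if a < c < b < d:
                    return False
        return True

    # 'the dog bit the postman' with bit attached to ROOT: projective
    print(is_projective([0, 2, 3, 0, 5, 3]))   # True
    # arc 3 -> 1 crosses arc 4 -> 2: non-projective
    print(is_projective([0, 3, 4, 0, 3]))      # False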

Example from Nivre (2006)
[Figure 1: Dependency graph for the Czech sentence 'Z nich je jen jedna na kvalitu.' (gloss: 'Out-of them is only one-FEM-SG to quality', i.e., 'Only one of them concerns quality.') from the Prague Dependency Treebank. The graph is non-projective.]
Nivre, J. (2006). Constraints on Non-Projective Dependency Parsing. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Degree of non-projectivity
- For a dependency graph G, let G(i, j) be the subgraph over the substring S[i, j]
- Let e be an arc between words i and j, i < j
- The degree of e is the number of weakly connected subgraphs in G(i+1, j-1) that are not dominated by the head of e
- The degree of the graph is the maximum degree of any edge in the graph
In the graph on the previous slide, the (1,5) arc is non-projective:
- G(2, 4) contains three connected components of one word each
- Words 2 and 4 are dominated by word 5 in the graph
- Word 3 is not, hence the arc has degree 1
When searching over non-projective structures, we can restrict the maximum degree.
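A sketch of the degree computation under this definition (my own code, assuming a well-formed tree and using a naive union-find; `heads` encodes the graph as in the earlier sketch):

    def dominates(heads, h, d):
        """True iff h transitively governs d (ROOT = 0 governs everything)."""
        while d != 0:
            d = heads[d]
            if d == h:
                return True
        return False

    def arc_degree(heads, dep):
        """Degree of the arc between dep and heads[dep]: the number of weakly
        connected components of G(i+1, j-1) not dominated by the arc's head."""
        h = heads[dep]
        lo, hi = min(dep, h), max(dep, h)
        inside = list(range(lo + 1, hi))
        parent = {w: w for w in inside}

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        for w in inside:                   # union words linked by an arc
            if heads[w] in parent:         # that stays inside the span
                parent[find(w)] = find(heads[w])
        comps = {}
        for w in inside:
            comps.setdefault(find(w), []).append(w)
        # count components not dominated by the head of the arc
        return sum(1 for comp in comps.values()
                   if not all(dominates(heads, h, w) for w in comp))

On the example above (the arc between words 1 and 5): words 2 and 4 form components dominated by word 5, word 3 does not, so the arc has degree 1.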

How common is non-projectivity? (Nivre, CL, 2008)
Table 1: Data sets, from treebanks of thirteen languages with considerable typological variation. Tok = number of tokens (×1000); Sen = number of sentences (×1000); T/S = tokens per sentence (mean); Lem = lemmatization present; CPoS = number of coarse-grained part-of-speech tags; PoS = number of (fine-grained) part-of-speech tags; MSF = number of morphosyntactic features (split into atoms); Dep = number of dependency types; NPT = proportion of non-projective dependencies/tokens (%); NPS = proportion of non-projective dependency graphs/sentences (%).

    Language     Tok    Sen    T/S   Lem  CPoS  PoS  MSF  Dep  NPT   NPS
    Arabic        54    1.5   37.2   yes   14    19   19   27  0.4  11.2
    Bulgarian    190   14.4   14.8   no    11    53   50   18  0.4   5.4
    Chinese      337   57.0    5.9   no    22   303    0   82  0.0   0.0
    Czech      1,249   72.7   17.2   yes   12    63   61   78  1.9  23.2
    Danish        94    5.2   18.2   no    10    24   47   52  1.0  15.6
    Dutch        195   13.3   14.6   yes   13   302   81   26  5.4  36.4
    German       700   39.2   17.8   no    52    52    0   46  2.3  27.8
    Japanese     151   17.0    8.9   no    20    77    0    7  1.1   5.3
    Portuguese   207    9.1   22.8   yes   15    21  146   55  1.3  18.9
    Slovene       29    1.5   18.7   yes   11    28   51   25  1.9  22.2
    Spanish       89    3.3   27.0   yes   15    38   33   21  0.1   1.7
    Swedish      191   11.0   17.3   no    37    37    0   56  1.0   9.8
    Turkish       58    5.0   11.5   yes   14    30   82   25  1.5  11.6

Converting from non-projective to projective
This is an issue with our current training data if we want to use it with projective algorithms. Nivre and Nilsson discuss it in "Pseudo-projective dependency parsing" (2005). Basic approach:
- While there are non-projective arcs:
  - Find the smallest (shortest-distance) non-projective arc
  - Change the head of this arc to its head's governor
This is not guaranteed to be a minimal transformation. In that paper, they remember the lifting (via labels); then, in a post-process, they can undo the lifts. Hall and Novák (2005) also have a post-processing approach based on constituency parsers.
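A minimal sketch of the lifting loop (my own code, reusing `dominates` from the degree sketch above; the labels that record lifts for later undoing are omitted):

    def lift_to_projective(heads):
        """Pseudo-projectivize by repeatedly lifting the shortest
        non-projective arc to its head's governor (after Nivre & Nilsson
        2005; label bookkeeping omitted in this sketch)."""
        heads = list(heads)

        def nonprojective(d):
            # an arc is non-projective iff some word strictly inside its
            # span is not transitively governed by the arc's head
            h = heads[d]
            return any(not dominates(heads, h, w)
                       for w in range(min(d, h) + 1, max(d, h)))

        while True:
            bad = [d for d in range(1, len(heads)) if nonprojective(d)]
            if not bad:
                return heads
            d = min(bad, key=lambda d: abs(d - heads[d]))  # shortest arc first
            heads[d] = heads[heads[d]]                     # lift to grandparent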

Relaxing projectivity
- The brute-force CYK and Eisner algorithms result in projective dependency structures
- If we relax the requirement of projectivity:
  - Build a complete directed graph between words
  - Find the minimum spanning tree of the graph
  - Chu-Liu/Edmonds algorithm: O(n²)
- When parsing free word order languages (like Czech), the ability to capture non-projective dependencies is important
- Can hurt performance slightly in languages like English with few non-projective dependencies

Nivre-style non-projective parsing algorithm
Algorithm for incrementally checking all O(n²) pairs for a link:

    for j = 1 to n
      for i = j-1 downto 0
        if PERMISSIBLE(i, j)
          LINK(i, j)

- PERMISSIBLE checks well-formedness conditions (and non-projectivity degree limitations)
- LINK is a classifier that decides: head left, head right, or no arc
- Nivre (2007) achieves accuracy improvements for 5 languages over the projective algorithm, at some efficiency cost
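In Python the loop might look like the sketch below; `permissible` and `link` are hypothetical stand-ins for the well-formedness check and the trained classifier, not actual functions from Nivre's system:

    def parse_all_pairs(words, permissible, link):
        """Incremental all-pairs linking. `words` is a list whose 0th
        element is a placeholder for the artificial ROOT."""
        heads = [0] * len(words)            # heads[d] == 0: attached to ROOT
        for j in range(1, len(words)):
            for i in range(j - 1, -1, -1):  # i = j-1 down to 0
                if permissible(heads, i, j):
                    decision = link(words, i, j)   # 'left', 'right', 'none'
                    if decision == 'right':        # i governs j
                        heads[j] = i
                    elif decision == 'left' and i != 0:
                        heads[i] = j               # j governs i (ROOT has no head)
        return heads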

Dependency graph
- Given a sentence, we want to discover its dependency structure
- All words (but one) are dependent on one other word in the sentence
- One word is the main (or ROOT) head of the sentence ('bit' in the example that follows)
- For any word, we don't know a priori what its head is
- Build a graph with a link to every possible head
- We will approach this as a weighted disambiguation problem: every link has a weight, and we find the best solution for the particular weighted graph

Building a word graph I
[Figure: one node per word of 'the dog bit the postman', plus ROOT, with scored arcs from ROOT: the = 2, dog = 8, bit = 10, the = 2, postman = 8.]
- One node for each word in the string, plus ROOT
- Directed arcs from head to dependent, with a score (higher is better)
- Here we just show the score of each word with ROOT as its head

Building a word graph II
[Figure: the slightly pruned full graph over 'the dog bit the postman' (some arcs omitted for space), with scored arcs between word pairs, e.g., dog -> the scores 40, bit -> dog 30, dog -> bit 20, postman -> the 40.]
- Scores on arcs must depend only on the head and dependent nodes, e.g., log P(head -> dependent)

Word graph as matrix
From word w_i (row) as dependent to word w_j (column) as head:

    i \ j      ROOT   the   dog   bit   the   postman
    the          2     -     40     0     0      0
    dog          8     0     -     30     0     11
    bit         10     0     20     -     0      2
    the          2     0      0     0     -     40
    postman      8     0      3    30     0      -

One arc per row in the final solution.

Minimum spanning tree
- A tree that links all nodes in a graph, with minimum cost/distance over the space of all trees
- In our case the weight is a goodness score (like a log probability), so we want the maximum
- Sum the weights over all arcs used in the tree
- A greedy algorithm (Chu-Liu/Edmonds) finds the best solution, then eliminates cycles

Find minimum spanning tree, step 1
[Figure: the word graph from the previous slides.]
For each node in the graph, except ROOT: select the highest-scoring incoming arc.

MST algorithm, step 1 with matrix
For each row, pick the column with the highest score (marked *):

    i \ j      ROOT   the   dog   bit   the   postman
    the          2     -    40*     0     0      0
    dog          8     0     -    30*     0     11
    bit         10     0    20*     -     0      2
    the          2     0      0     0     -     40*
    postman      8     0      3   30*     0      -

This defines a subgraph, which is a candidate solution. Note that dog and bit form a cycle in the subgraph.
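A sketch of this greedy step on the example matrix (my own encoding: `scores[dep][head]`, with -inf for impossible arcs):

    NEG = float('-inf')
    #          ROOT  the  dog  bit  the  postman
    scores = [[NEG,  NEG, NEG, NEG, NEG, NEG],   # row 0 unused: ROOT has no head
              [  2,  NEG,  40,   0,   0,   0],   # the
              [  8,    0, NEG,  30,   0,  11],   # dog
              [ 10,    0,  20, NEG,   0,   2],   # bit
              [  2,    0,   0,   0, NEG,  40],   # the
              [  8,    0,   3,  30,   0, NEG]]   # postman

    def greedy_heads(scores):
        """Step 1: pick the highest-scoring incoming arc for every word."""
        return [0] + [max(range(len(scores)), key=lambda h: scores[d][h])
                      for d in range(1, len(scores))]

    def find_cycle(heads):
        """Return the nodes of one cycle in the candidate subgraph, else None."""
        for start in range(1, len(heads)):
            seen, v = set(), start
            while v != 0 and v not in seen:
                seen.add(v)
                v = heads[v]
            if v != 0:                      # the walk re-entered itself
                cycle, u = [v], heads[v]
                while u != v:
                    cycle.append(u)
                    u = heads[u]
                return cycle
        return None

    heads = greedy_heads(scores)
    print(heads[1:])            # [2, 3, 2, 5, 3]: the <- dog, dog <- bit, ...
    print(find_cycle(heads))    # [2, 3]: the dog/bit cycle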

Minimum spanning tree I
[Figure: the candidate subgraph, containing the dog/bit cycle.]
We need to deal with cycles:
- Collapse one cycle into a single node
- Recalculate arcs in and out of the new (collapsed) node
- Find the new highest-scoring incoming arc for every node

Re-calculating transitions in and out
Using the notation from McDonald et al. (2005):
- Let C be the set of nodes in the cycle
- Let s(x, x') be the score of the arc from x to x' (head to dependent)
- Let a(v) be the head of the max-scoring incoming arc of v
Then, for any x ∉ C:

    s(C, x) = max_{x' ∈ C} s(x', x)

    s(x, C) = max_{x' ∈ C} [ s(x, x') - s(a(x'), x') + s(C) ]

where

    s(C) = Σ_{v ∈ C} s(a(v), v)

Create a new matrix, with the cycle as a single node
Original matrix:

    i \ j      ROOT   the   dog   bit   the   postman
    the          2     -     40     0     0      0
    dog          8     0     -     30     0     11
    bit         10     0     20     -     0      2
    the          2     0      0     0     -     40
    postman      8     0      3    30     0      -

The new matrix will collapse dog and bit into a single node. (Note that these will not always be contiguous words.)

New matrix, with the cycle as a single node
s(C) = 30 + 20 = 50. New scores:

    i \ j      ROOT       the        dog/bit     the        postman
    the          2          -        40 (dog)      0           0
    dog/bit    40 (bit)   30 (bit)      -        30 (bit)   32 (bit)
    the          2          0         0 (dog)      -          40
    postman      8          0        30 (bit)      0           -

For example, s(ROOT, C) = max(8 - 30 + 50, 10 - 20 + 50) = 40, entering at bit.
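Continuing the sketch above, the recalculation formulas translate directly into code (`heads[v]` plays the role of a(v)):

    def collapse_scores(scores, heads, cycle):
        """Scores in and out of a collapsed cycle C (McDonald et al. 2005
        notation). Returns s(C) plus, for each outside node x, the best arc
        C -> x and x -> C together with the cycle member supplying the max."""
        s_C = sum(scores[v][heads[v]] for v in cycle)   # s(C): sum of cycle arcs
        out_of, into = {}, {}
        for x in range(len(scores)):
            if x in cycle:
                continue
            if x != 0:                       # ROOT is never a dependent
                out_of[x] = max((scores[x][c], c) for c in cycle)
            into[x] = max((scores[c][x] - scores[c][heads[c]] + s_C, c)
                          for c in cycle)
        return s_C, out_of, into

    s_C, out_of, into = collapse_scores(scores, heads, [2, 3])
    print(s_C)        # 50
    print(into[0])    # (40, 3): ROOT -> cycle, entering at bit (as on the slide)
    print(out_of[5])  # (30, 3): cycle -> postman, with bit as the head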

New graph, with the cycle node
[Figure: the contracted graph, with dog/bit collapsed into a single node; incoming arcs ROOT -> dog/bit 40, the -> dog/bit 30, postman -> dog/bit 32; outgoing arcs dog/bit -> the 40, dog/bit -> postman 30.]
- Note: we need to remember which internal node supplied the max used in calculating each arc score
- Now select the highest-scoring incoming arc for each node

MST algorithm, iteration 2 with matrix
For each row, pick the column with the highest score (marked *):

    i \ j      ROOT        the        dog/bit      the        postman
    the          2           -        40 (dog)*      0           0
    dog/bit    40 (bit)*   30 (bit)       -        30 (bit)   32 (bit)
    the          2           0         0 (dog)       -          40*
    postman      8           0        30 (bit)*      0           -

This defines a subgraph, which is a candidate solution. Arcs involving the cycle go from/to the specified cycle member. Note that this subgraph has no cycles (hence we are done).

Minimum spanning tree II
[Figure: the selected arcs, expanded back over the original words.]
- If there are no cycles, this is the minimum spanning tree
- Whichever cycle node supplied the max for the cycle's incoming arc breaks the cycle
- The max outgoing arcs give the heads for those dependents

Minimum spanning tree III
[Figure: the final tree: ROOT -> bit; bit -> dog, bit -> postman; each 'the' attached to its noun.]
Final solution:
- bit is the head of the string
- dog and postman are dependents of bit
- Each 'the' is a dependent of the closest noun

Notes on MST algorithm
- The graph has n + 1 nodes and n² transitions
- The original formulation of this algorithm (Chu-Liu/Edmonds) can have a maximum of n cycle collapses, each requiring rescoring, hence O(n³)
- Tarjan's update to the algorithm uses a Fibonacci heap to achieve O(n²)
- This is called an edge-factored model, since edges (or arcs) in the graph can only depend on the head and dependent
- Scores can be assigned via a log-linear model
- It has become very popular due to its efficiency and its ability to capture non-projective dependencies in languages like Czech
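Putting the pieces together, here is a compact recursive sketch of the naive O(n³) variant (my own implementation, reusing `NEG`, `scores`, and `find_cycle` from the step-1 sketch; Tarjan's O(n²) version needs the priority-queue machinery and is not shown):

    def chu_liu_edmonds(scores):
        """Maximum spanning arborescence over scores[dep][head]; returns heads[]."""
        n = len(scores)
        heads = [0] + [max((scores[d][h], h) for h in range(n) if h != d)[1]
                       for d in range(1, n)]
        cycle = find_cycle(heads)
        if cycle is None:
            return heads
        cyc = set(cycle)
        s_C = sum(scores[v][heads[v]] for v in cycle)
        rest = [v for v in range(n) if v not in cyc]    # ROOT stays at index 0
        c_id = len(rest)                                # index of collapsed node
        new = [[NEG] * (c_id + 1) for _ in range(c_id + 1)]
        best_out, best_in = {}, {}
        for di, d in enumerate(rest):
            for hi, h in enumerate(rest):
                new[di][hi] = scores[d][h]
            # cycle as head of d: s(C, x) = max_{x' in C} s(x', x)
            new[di][c_id], best_out[di] = max((scores[d][c], c) for c in cyc)
            # d as head of cycle: s(x, C) = max [s(x,x') - s(a(x'),x') + s(C)]
            new[c_id][di], best_in[di] = max(
                (scores[c][d] - scores[c][heads[c]] + s_C, c) for c in cyc)
        sub = chu_liu_edmonds(new)                      # solve contracted graph
        result = [0] * n
        for di, d in enumerate(rest):
            result[d] = best_out[di] if sub[di] == c_id else rest[sub[di]]
        for v in cycle:                                 # keep internal cycle arcs,
            result[v] = heads[v]
        entry = best_in[sub[c_id]]                      # except at the entry point,
        result[entry] = rest[sub[c_id]]                 # which breaks the cycle
        return result

    print(chu_liu_edmonds(scores)[1:])   # [2, 3, 0, 5, 3]: bit heads the sentence

Running it on the example matrix reproduces the final solution from the slides: bit attaches to ROOT, dog and postman to bit, and each 'the' to its noun.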

Log-linear modeling
- We haven't really talked about where the scores come from
- This is a discriminative scenario: we are given the string
- To retain quadratic complexity, features depend only on the head and dependent
  - Word and POS-tag substrings around each
- McDonald et al. (2005) used MIRA
  - An instance of a passive-aggressive online algorithm
  - We will discuss it relative to the perceptron in the next lecture
  - The basic idea is to include the error magnitude in the update
- Taking neighboring dependencies into account requires either an increase in complexity or some clever new algorithms
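For concreteness, a sketch of what an edge-factored scorer might look like; these feature templates are illustrative only, not the exact McDonald et al. (2005) feature set:

    def edge_features(words, tags, h, d):
        """Illustrative edge-factored templates: word/POS of head and
        dependent, attachment direction, and bucketed distance.
        Index 0 of words/tags holds ROOT placeholders."""
        direction = 'L' if d < h else 'R'
        dist = min(abs(h - d), 5)
        return [f'hw={words[h]},dw={words[d]}',
                f'ht={tags[h]},dt={tags[d]},{direction}',
                f'hw={words[h]},dt={tags[d]}',
                f'ht={tags[h]},dw={words[d]},{direction}{dist}']

    def edge_score(weights, words, tags, h, d):
        # score(h -> d) = w . f(h, d); fill scores[d][h] with these values,
        # then run the spanning tree algorithm above
        return sum(weights.get(f, 0.0) for f in edge_features(words, tags, h, d))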