Discriminative Training with Perceptron Algorithm for POS Tagging Task


Mahsa Yarmohammadi
Center for Spoken Language Understanding
Oregon Health & Science University
Portland, Oregon

1 Introduction

One of the most popular algorithms for structured prediction problems in natural language and speech processing is the perceptron algorithm [Rosenblatt1958, Collins2002]. The perceptron algorithm can be used to estimate the model parameters in any structured prediction learning framework. Collins presented a discriminative log-linear model, with the perceptron algorithm to estimate its parameters, as a global framework for discriminative training. This framework can be used for training finite-state tagging models for tasks such as POS tagging, shallow parsing, sentence segmentation, and named entity recognition. In the first part of this study (section 2), we present experimental results for POS tagging of English using the Collins framework.

Structured prediction models, including the perceptron, are supervised machine learning techniques, and they need a large amount of labeled input-output data to improve system performance. Training a model on that much data can be cumbersome. McDonald and colleagues [McDonald et al.2010] investigated distributed training strategies for the structured perceptron to reduce training times on two tasks, named entity recognition and dependency parsing. In the second part of this paper (section 3), we investigate their techniques for another structured prediction task, POS tagging.

2 Discriminative Model and the Perceptron Algorithm

This section describes a discriminative log-linear model and the perceptron algorithm that learns its parameters, used to train a POS tagger. This framework was first presented by Collins [Collins2002].

    Perceptron(T = {(x_i, y_i)}_{i=1}^{N}, ᾱ (default = 0), T)
        For t = 1..T
            For i = 1..N
                calculate z_i = argmax_{z ∈ GEN(x_i)} Φ(x_i, z) · ᾱ
                If z_i ≠ y_i then ᾱ = ᾱ + Φ(x_i, y_i) − Φ(x_i, z_i)
        return ᾱ

Figure 1: The perceptron algorithm presented by [Collins2002] as a global framework for discriminative training.

To train a discriminative POS tagging model, the task is to learn a mapping from inputs x ∈ X to outputs y ∈ Y, where X is the set of all input sentences and Y is the set of all possible POS tag sequences. We are given a set of training examples (x_i, y_i), a function GEN(x) that enumerates the set of possible POS tag sequences of length n (where n is the length of x), a parameter vector ᾱ ∈ R^d, and a representation Φ that maps each (x, y) ∈ X × Y to a feature vector Φ(x, y). The mapping from an input x to an output F(x) is then defined by the formula

    F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · ᾱ        (1)

where Φ(x, y) · ᾱ is the inner product Σ_j α_j Φ_j(x, y). The model learns the parameter values ᾱ during training, and the decoding algorithm searches for the y that maximizes (1). The feature vector Φ(x, y) represents arbitrary features of the sentence and the POS tag sequence; in section 2.1 we describe the feature templates used in our experiments.

To estimate the parameter values ᾱ of the model, we use the perceptron algorithm shown in Figure 1 [Collins2002]. At each training example (x_i, y_i), the algorithm updates the parameter vector ᾱ by adding the feature values of the true hypothesis y_i and subtracting the feature values of the best-scoring hypothesis z_i. The algorithm then moves to the next example. This procedure is repeated T times (epochs) over the training examples. The regular perceptron algorithm suffers from over-fitting; one solution to this problem is the averaged perceptron, which sets the final weight vector to the average of all the parameter vectors seen during training.
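To make the training loop concrete, here is a minimal Python sketch of the algorithm in Figure 1. The candidate generator gen(x) and feature map phi(x, y) are assumptions of this sketch, standing in for whatever enumeration and feature extraction the surrounding system supplies; in practice the argmax is computed by Viterbi search rather than by scoring every candidate sequence.

    from collections import defaultdict

    def perceptron(train, gen, phi, epochs, alpha0=None):
        # Figure 1: Perceptron(T, alpha (default 0), T).
        # train : list of (x, y) pairs, y a tuple of POS tags
        # gen   : gen(x) -> iterable of candidate tag tuples
        # phi   : phi(x, y) -> dict of feature -> count
        alpha = defaultdict(float, alpha0 or {})

        def score(x, z):
            return sum(alpha.get(f, 0.0) * v for f, v in phi(x, z).items())

        for _ in range(epochs):
            for x, y in train:
                # z_i = argmax over GEN(x) of Phi(x, z) . alpha
                z = max(gen(x), key=lambda cand: score(x, cand))
                if z != y:
                    for f, v in phi(x, y).items():   # add gold features
                        alpha[f] += v
                    for f, v in phi(x, z).items():   # subtract predicted features
                        alpha[f] -= v
        return alpha

The averaged variant additionally accumulates a running sum of ᾱ after every example and returns that sum divided by the number of accumulations.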

2.1 Features

We extract features from the input word sequence and the output POS-tag sequence. The feature set includes bigrams of surrounding words, a window of size 2 of the next and previous words, the POS tag of the previous word, and orthographic features, as shown in Table 1. The orthographic feature set includes prefixes and suffixes of the words (up to 4 characters) and the presence of a hyphen, a digit, or an uppercase character. We do not restrict the orthographic features to rare¹ or unknown words; we activate them for all words. This improves accuracy, at the cost of some speed, compared to activating the orthographic features only for rare or unknown words.

Similar to Ratnaparkhi [Ratnaparkhi1997] and Roark et al. [Roark et al.2012], we restricted the search space to the tag dictionary for each word. For known words, the tag dictionary contains the tags that occurred with the word in the training set; for unknown or rare words, it contains all tags in the tag set. Using the tag dictionary speeds up the tagger significantly without hurting accuracy.

¹ Rare words occur fewer than 5 times in the training data.

    Lexical                      Orthographic
    t_i, t_{i-1}                 t_i, w_i[0]
    t_i, w_i                     t_i, w_i[0..1]
    t_i, w_{i-1}                 t_i, w_i[0..2]
    t_i, w_{i+1}                 t_i, w_i[0..3]
    t_i, w_{i-2}                 t_i, w_i[n]
    t_i, w_{i+2}                 t_i, w_i[n-1..n]
    t_i, w_i, w_{i+1}            t_i, w_i[n-2..n]
    t_i, w_i, w_{i-1}            t_i, w_i[n-3..n]
    t_i, w_{i+1}, w_{i+2}        t_i, w_i contains a digit
    t_i, w_{i-1}, w_{i-2}        t_i, w_i contains a hyphen
                                 t_i, w_i contains an uppercase character

Table 1: Feature templates for POS tagging, where t_i is the tag and w_i the word at position i; w_i[0..k] denotes a prefix and w_i[n-k..n] a suffix of the n-character word w_i.
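The templates in Table 1 map directly onto string-valued indicator features. The sketch below instantiates a representative subset for a candidate tag t at position i; the particular feature-string encoding is an assumption of this sketch, not a specification of our implementation. Φ(x, y) for a full sentence is the sum of these local feature vectors over all positions.

    def local_features(words, i, t, t_prev):
        # A subset of the Table 1 templates for tag t at position i.
        w = words[i]
        prev_w = words[i - 1] if i > 0 else "<s>"
        next_w = words[i + 1] if i + 1 < len(words) else "</s>"
        feats = {
            f"t-1={t_prev}|t={t}": 1,    # <t_i, t_i-1>
            f"w={w}|t={t}": 1,           # <t_i, w_i>
            f"w-1={prev_w}|t={t}": 1,    # <t_i, w_i-1>
            f"w+1={next_w}|t={t}": 1,    # <t_i, w_i+1>
        }
        for k in range(1, min(4, len(w)) + 1):
            feats[f"pre={w[:k]}|t={t}"] = 1    # prefixes, up to 4 characters
            feats[f"suf={w[-k:]}|t={t}"] = 1   # suffixes, up to 4 characters
        if any(c.isdigit() for c in w):
            feats[f"digit|t={t}"] = 1          # <t_i, w_i contains digit>
        if "-" in w:
            feats[f"hyphen|t={t}"] = 1         # <t_i, w_i contains hyphen>
        if any(c.isupper() for c in w):
            feats[f"upper|t={t}"] = 1          # <t_i, w_i contains uppercase>
        return feats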

2.2 Experimental Results

We ran our experiments on the WSJ Penn Treebank corpus [Marcus et al.1999], using sections 2-21 for training, section 24 for development, and section 23 for testing. Decoding is performed by a Viterbi search with a Markov order-0 assumption. Table 2 shows the accuracy of POS tagging for English using our tagger.

                accuracy
    dev set     97.1%
    test set    97.3%

Table 2: POS tagging accuracy on development (section 24) and test (section 23) data

To assess how the results of our tagger generalize to an independent test set, we used a k-fold cross-validation approach (k = 20). All labeled examples (the combination of the training and development sets) are sequentially partitioned into k disjoint subsets. At each fold, one of the subsets is used as the development set and the union of the other subsets as the training set, so each of the k subsets is used exactly once as the development set. For each fold, cross-validation determines the best performance on the development set and the corresponding number of epochs; the mean of these epoch counts is ī. Our final run on the test set (section 23) is then performed by training on all labeled examples for ī epochs, with no development set. The mean of the best performances over the 20 folds is 97.1% (σ = 0.2), and the accuracy of the final run on the test set is 97.3%. These results support the generalizability of our tagger to arbitrary test sets.
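The epoch-selection procedure amounts to the following sketch; train_and_eval and train_final are hypothetical helpers standing in for the tagger pipeline, returning the best development accuracy with the epoch at which it was reached, and a model trained for a fixed number of epochs, respectively.

    def select_epochs(examples, k, train_and_eval, train_final):
        # Sequentially partition the labeled examples into k disjoint folds.
        size = (len(examples) + k - 1) // k
        folds = [examples[j * size:(j + 1) * size] for j in range(k)]
        best_epochs = []
        for j in range(k):
            dev = folds[j]
            train = [ex for fold in folds[:j] + folds[j + 1:] for ex in fold]
            _, epoch = train_and_eval(train, dev)  # best dev accuracy, its epoch
            best_epochs.append(epoch)
        i_bar = round(sum(best_epochs) / k)  # mean best epoch count over folds
        # Final run: train on all labeled examples for i_bar epochs.
        return train_final(examples, i_bar)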

3 Distributed Perceptron

In this section, we describe the two strategies we used for distributed training of the perceptron algorithm, both based on the strategies proposed in [McDonald et al.2010].

3.1 Parameter Mixing

Parameter mixing is the straightforward strategy of training separate models on disjoint subsets of the training data in parallel and then mixing all the parameters into the final model. Figure 2 shows this algorithm [McDonald et al.2010]. First, we partition the training data into S shards; then we train S separate perceptron models on these shards in parallel; finally, we mix the parameter values of the shards by averaging them. In a map-reduce framework, training the separate perceptron models is done in the map step, and mixing (averaging) the parameter values in the reduce step. The advantages of this method are that it scales easily to very large data sets and that it is resource-efficient with respect to network usage. The disadvantage is that it can be sub-optimal: it does not necessarily return a separating weight vector, even when the training set is separable.

    ParameterMix(T = {(x_i, y_i)}_{i=1}^{N})
        Shard T into S parts T = {T_1, ..., T_S}
        ᾱ_s = Perceptron(T_s, 0, T)    (for each shard s, in parallel)
        ᾱ = Σ_s µ_s ᾱ_s
        return ᾱ

Figure 2: Parameter mixing method for distributed perceptron

3.2 Iterative Parameter Mixing

A slight modification to the parameter mixing method, called iterative parameter mixing, makes it optimal. Iterative parameter mixing finds a separating hyperplane (assuming that the training set is separable) and yields accuracy comparable to or better than the serially trained perceptron, at the cost of increased network usage. As in parameter mixing, we first shard the training data into S shards. We then train a separate single-epoch perceptron on each shard and mix (average) the model weights. We train another single-epoch perceptron on each shard, this time with the mixed weight vector as the initial value for the perceptrons. The process repeats T times. Figure 3 shows this algorithm [McDonald et al.2010]. In a map-reduce framework, training the single-epoch perceptron models is done in the map step, and mixing the parameter values and re-sending them to the shards in the reduce step. A sketch of both strategies follows Figure 3.

    IterativeParameterMix(T = {(x_i, y_i)}_{i=1}^{N})
        Shard T into S parts T = {T_1, ..., T_S}
        Set ᾱ = 0
        For t = 1..T
            ᾱ_(s,t) = Perceptron(T_s, ᾱ, 1)    (for each shard s, in parallel)
            ᾱ = Σ_s µ_(s,t) ᾱ_(s,t)
        return ᾱ

Figure 3: Iterative parameter mixing method for distributed perceptron
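Given the perceptron function from the sketch in section 2, both mixing strategies reduce to a few lines. The sketch below runs the per-shard training calls serially, mirroring our simulated setup, where a real map-reduce implementation would launch them in parallel; uniform mixing weights (µ = 1/S) are assumed, as in our experiments.

    from collections import defaultdict

    def average(alphas):
        # Uniform mixture: mean of the S shard weight vectors.
        mixed = defaultdict(float)
        for alpha in alphas:
            for f, v in alpha.items():
                mixed[f] += v / len(alphas)
        return mixed

    def parameter_mix(shards, gen, phi, epochs):
        # Figure 2: train S independent perceptrons, mix once at the end.
        return average([perceptron(s, gen, phi, epochs) for s in shards])

    def iterative_parameter_mix(shards, gen, phi, rounds):
        # Figure 3: one epoch per shard per round, re-mixing in between
        # and feeding the mixed weights back in as the initial vector.
        alpha = defaultdict(float)
        for _ in range(rounds):
            alpha = average([perceptron(s, gen, phi, 1, alpha) for s in shards])
        return alpha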

3.3 Experiments

We investigated distributed training of the perceptron algorithm on the POS tagging task, using WSJ Penn Treebank sections 2-21 for training and section 24 for development. Note that we focus on training-phase results in this section. We compared POS tagging accuracy and perceptron training time across three systems:

1. Serial: non-distributed perceptron on all training data.
2. Parameter mix: distributed perceptron using the parameter mixing method.
3. Iterative parameter mix: distributed perceptron using the iterative parameter mixing method.

For all three systems we compared results for the regular and averaged perceptron algorithms. For the parallel systems (2 and 3) we used 10 disjoint, equal-sized shards of training data, built by sequentially splitting the complete training set; each shard contains around 3,900 POS-tagged sentences. To mix the model weights, we used the uniform mixing strategy of taking the mean of the weight vectors, for both the regular and averaged perceptrons. Note that the results reported in this section are not from an actual map-reduce implementation such as Hadoop. Instead, we simulated a map-reduce framework by pipelining the perceptron epochs and the weight-vector mixing procedures iteratively. The training time reported for each training epoch is the maximum time among all parallel shards, as sketched below.
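The timing bookkeeping of the simulation can be written as follows; train_one_epoch is a hypothetical helper standing in for a single-epoch perceptron run on one shard, and the reported cost of a round is the slowest shard's time, as if the shards had run concurrently.

    import time

    def timed_round(shards, train_one_epoch):
        # Run one simulated map step: each shard trains for one epoch.
        results, durations = [], []
        for shard in shards:
            start = time.perf_counter()
            results.append(train_one_epoch(shard))
            durations.append(time.perf_counter() - start)
        # Simulated parallel cost: the maximum over the S shards.
        return results, max(durations)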

3.4 Results

Results of the regular and averaged perceptrons are shown in Figure 4. Both distributed algorithms return models much more quickly than the serially trained perceptron, in wall-clock time as well as in the number of training epochs, for both the regular and averaged variants. The parameter mixing method does not match the performance of training serially on all the data for the averaged perceptron, nor for the regular perceptron except in the first few epochs. The iterative parameter mixing method achieves better performance than the parameter mixing method with both the regular and averaged perceptrons. It also achieves better accuracy than the serial scenario with the regular perceptron, and comparable accuracy with the averaged perceptron. This happens because parameter mixing has an effect similar to the averaged perceptron: mixing the parameters by averaging them reduces the variance between the different weight vectors and produces a regularization effect.

[Figure 4: POS tagging accuracy versus time (ms) for the regular perceptron (left) and averaged perceptron (right), comparing the Serial, Parameter Mix, and Iterative Parameter Mix systems.]

4 Conclusion and Future Work

In this paper, we described a discriminative training method for a POS tagger using the perceptron algorithm in serial and distributed scenarios. Our POS tagger achieves over 97% accuracy for English. The training features are language-independent, so the tagger can be used for arbitrary languages. To reduce training time, we applied two distributed training strategies for the structured perceptron, previously proposed in [McDonald et al.2010], to the task of POS tagging. Both parameter mixing methods are fast and accurate, although simple parameter mixing is not guaranteed to produce an optimal model. Both distributed approaches significantly reduce the time required to train the perceptron algorithm. The results for POS tagging mirror those reported for the other structured prediction tasks, named entity recognition and dependency parsing. Our tagger is available for download as POStagger. Support for other finite-state tagging tasks, such as shallow parsing and hedge segmentation [Yarmohammadi et al.2014], will be added to the toolkit soon. Another direction for future work is to implement a map-reduce version of the distributed perceptron algorithms in the Hadoop framework.

References

[Collins2002] M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP '02), pages 1-8, Stroudsburg, PA, USA.

[Marcus et al.1999] Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3. Linguistic Data Consortium, Philadelphia.

[McDonald et al.2010] Ryan McDonald, Keith Hall, and Gideon Mann. 2010. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10), Stroudsburg, PA, USA. Association for Computational Linguistics.

[Ratnaparkhi1997] Adwait Ratnaparkhi. 1997. A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP.

[Roark et al.2012] Brian Roark, Kristy Hollingshead, and Nathan Bodenstab. 2012. Finite-state chart constraints for reduced complexity context-free parsing pipelines. Computational Linguistics, 38(4).

[Rosenblatt1958] Frank Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6).

[Yarmohammadi et al.2014] Mahsa Yarmohammadi, Aaron Dunlop, and Brian Roark. 2014. Transforming trees into hedges and parsing with hedgebank grammars. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, Maryland, June.
