A CASE STUDY: Structure Learning for Part-of-Speech Tagging Danilo Croce WM 2011/2012 27 January 2012
TASK definition One of the tasks of EVALITA 2009 EVALITA is an initiative devoted to the evaluation of Natural Language Processing and Speech tools for Italian. Part-of-speech tagging In the Part-of-Speech Tagging task, systems are required to assign a tag, consisting of a combination of lexical category (PoS tag) and morphological features, to each token in a set of sentences. http://www.evalita.it/2009/tasks/pos
POS tagging and learning During the WM you have seen different quantitative approaches for modeling linguistic problems as stochastic processes: Hidden Markov Models (generative models) Support Vector Machines (discriminative models) The POS tagging problem can be modeled as a sequential tagging task Linguistic information can be acquired from annotated examples We will see how to combine these two paradigms
SVM and POS tagging We need to model the task as a stochastic process We aim to classify a sentence (i.e. a sequence of words) with respect to a possible sequence of POS tags The complexity is combinatorial We could classify each word without contextual information, ignoring the other words in the sentence Maybe it can work for unambiguous cases like the or often, but context is crucial to classify a word like run IDEA: classify words with respect to the POS tags, but use contextual information to find the best solution for the entire sentence
SVM and POS tagging (2) An HMM model: The sentence is a SEQUENCE Words (represented through a set of features) are our OBSERVATIONS HMM STATES are mapped onto POS tags The transition probabilities are estimated from the training set SVM classifiers are used to estimate the emission probabilities The solution is estimated by applying the Viterbi algorithm
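The decoding step described above can be sketched as follows. This is a minimal, illustrative Viterbi decoder in Java (the project language), not the SVM-hmm implementation itself: it assumes that per-word SVM scores are supplied as emission scores and that tag-to-tag scores are supplied as a transition matrix, both under names we choose here.

```java
/** Minimal Viterbi decoder sketch: combines per-word SVM scores
 *  (used as emission scores) with transition scores between POS tags.
 *  All names are illustrative, not part of SVM-hmm. */
public class ViterbiDecoder {

    /**
     * @param emission   emission[t][s] = SVM score of tag s for word t
     * @param transition transition[p][s] = score of moving from tag p to tag s
     * @return the highest-scoring tag sequence (tag indices)
     */
    public static int[] decode(double[][] emission, double[][] transition) {
        int T = emission.length;          // sentence length
        int S = emission[0].length;       // number of POS tags
        double[][] delta = new double[T][S];
        int[][] backptr = new int[T][S];

        // initialisation: first word has no incoming transition
        for (int s = 0; s < S; s++) delta[0][s] = emission[0][s];

        // recursion: best predecessor for each (word, tag) pair
        for (int t = 1; t < T; t++) {
            for (int s = 0; s < S; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < S; p++) {
                    double score = delta[t - 1][p] + transition[p][s];
                    if (score > best) { best = score; arg = p; }
                }
                delta[t][s] = best + emission[t][s];
                backptr[t][s] = arg;
            }
        }

        // termination and backtracking
        int bestLast = 0;
        for (int s = 1; s < S; s++)
            if (delta[T - 1][s] > delta[T - 1][bestLast]) bestLast = s;
        int[] tags = new int[T];
        tags[T - 1] = bestLast;
        for (int t = T - 1; t > 0; t--) tags[t - 1] = backptr[t][tags[t]];
        return tags;
    }
}
```

The running time is O(T·S²), i.e. linear in sentence length, which is what makes the sequence-level search tractable despite the combinatorial number of tag sequences.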
A simple example [Slide figure: the sentence "Yesterday a robber killed a guardian with a knife." shown word by word; each word w_i is represented through feature vectors x_{i,1}, x_{i,2}, ..., and one classifier per POS tag scores every word; * refers to the candidate POS of the word.]
SVM HMM : Structured Learning for POS The SVM HMM model learns a discriminative model isomorphic to a k-order Hidden Markov Model through the Structural SVM formulation. Input: feature vectors Output: label sequence Output labeling sequence: given a history of length k Emissions Transitions The cutting-plane algorithm is applied to estimate w in polynomial time
SVM HMM input Each line encodes one token in sparse notation (only non-zero features are listed): class qid:sent_id feature_vector # comment
4 qid:1 1:1 2:1 51:1 247:1 2675:1 # four
12 qid:1 58:1 84:1 197:1 250:1 433:1 1145:1 2677:1 # score
3 qid:1 8:1 83:1 88:1 202:1 363:1 364:1 438:1 1147:1 # and
4 qid:1 16:1 47:1 87:1 135:1 197:1 365:1 366:1 # seven
15 qid:1 30:1 49:1 142:1 197:1 202:1 387:1 # years
8 qid:1 39:1 83:1 202:1 267:1 392:1 # ago
20 qid:1 83:1 87:1 247:1 269:1 2675:1 2676:1 # our..
21 qid:2 5:1 83:1 576:1 923:1 1379:1 1469:1 # now
19 qid:2 23:1 84:1 87:1 577:1 926:1 1383:1 1470:1 # we
30 qid:2 26:1 83:1 84:1 88:1 433:1 578:1 627:1 # are
29 qid:2 7:1 8:1 9:1 87:1 88:1 438:1 628:1 1077:1 3377:1 # engaged
8 qid:2 15:1 16:1 17:1 23:1 47:1 185:1 1082:1 3381:1 # in
8 qid:3 23:1 47:1 48:1 87:1 219:1 1621:1 # on
7 qid:3 3:1 26:1 49:1 50:1 459:1 # a
9 qid:3 5:1 197:1 217:1 460:1 519:1 1535:1 1536:1 1537:1 # great
12 qid:3 8:1 109:1 202:1 219:1 522:1 531:1 1538:1 1539:1 1540:1 # battlefield
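Since the project has to produce this file from the XDG training data, a sketch of the line-writing logic in Java may help. The class and method names are our own; the format itself (class, qid, increasing feature ids with value 1, a # comment) follows the slide above.

```java
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

/** Sketch of writing one token in SVM-hmm sparse notation:
 *  "<tag> qid:<sentenceId> <featId>:1 ... # <word>".
 *  Feature ids must appear in strictly increasing order. */
public class SvmHmmLine {

    public static String format(int tagId, int sentenceId,
                                Collection<Integer> featureIds, String word) {
        // sort and deduplicate: the file format expects increasing feature ids
        SortedSet<Integer> sorted = new TreeSet<>(featureIds);
        StringBuilder sb = new StringBuilder();
        sb.append(tagId).append(" qid:").append(sentenceId);
        for (int f : sorted) sb.append(' ').append(f).append(":1");
        sb.append(" # ").append(word);
        return sb.toString();
    }
}
```

Note that qid values must be non-decreasing over the file, since they group tokens into sentences.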
How to use SVM HMM Download: http://download.joachims.org/svm_hmm/current/svm_hmm.tar.gz Compile Learn: svm_hmm_learn -c <C> --t <ORDER_T> -e 0.1 training_input.dat modelfile.dat -c: Typical SVM parameter C, trading off slack vs. magnitude of the weight vector (1, 10, 100, 10^3, 10^4; depends on the training set size). --t: Order of dependencies of transitions in HMM (1, 2 or 3) Classify: svm_hmm_classify test_input.dat modelfile.dat classify.tags
Feature Engineering The better the feature representation of words, the better the performance Feature engineering Contextual (k words before and after the target word) The word suffix Dictionary information Feature post-processing Normalization Do not mix features!!!
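The feature templates listed above (context window, suffixes) can be sketched as a small Java extractor. This is an illustrative design under our own naming scheme: it emits string feature names, which a separate dictionary would then map to the integer ids used in the SVM-hmm input file.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative feature extractor for POS tagging.
 *  Templates follow the slide (context window, suffixes);
 *  the feature-name scheme is an assumption of this sketch. */
public class FeatureExtractor {

    /** Features for the word at position i, using a +/-k context window. */
    public static List<String> extract(String[] words, int i, int k) {
        List<String> feats = new ArrayList<>();
        String w = words[i].toLowerCase();
        feats.add("w=" + w);                                   // the word itself
        // contextual features: k words before and after the target word
        for (int d = 1; d <= k; d++) {
            feats.add("prev" + d + "=" +
                    (i - d >= 0 ? words[i - d].toLowerCase() : "<s>"));
            feats.add("next" + d + "=" +
                    (i + d < words.length ? words[i + d].toLowerCase() : "</s>"));
        }
        // suffix features of length 1..3 (useful for Italian morphology)
        for (int len = 1; len <= 3 && len <= w.length(); len++)
            feats.add("suf" + len + "=" + w.substring(w.length() - len));
        return feats;
    }
}
```

Keeping each template in its own name space ("w=", "prev1=", "suf2=", ...) is one way to honour the "do not mix features" rule: the same surface string gets distinct feature ids depending on the template that produced it.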
Project objectives The project consists of defining and implementing a POS tagging system based on the SVM HMM learning framework The system must be implemented in Java For this course the experimental settings are the coarse-grained POS tag set open task setting (you can use external resources) You will be provided with the training/development data
Project objectives (2) The system must be CHAOS compliant CHAOS is a modular and lexicalized syntactic and semantic parser for Italian and for English. The system implements a modular and lexicalised approach to the syntactic parsing problem. The pool of modules defines a tokenizer, POS tagger, dependency parser, named entity recognizer Modules define a sequence of annotators, e.g. POS tagging cannot be applied without the tokenizer The XDG provides a data structure containing all the linguistic information added by each module CHAOS is written in Java
Project objectives (3) Training data will be provided within the XDG structure: tokenized and POS tagged sentences SVM HMM is written in C The system builds an input file for the learning system Test data will be provided with no POS tags SVM HMM is written in C The system builds a file for the classification system We have an SVM HMM classifier in Java You have to define a module that enriches words with POS tagging information We will help you to integrate the classifiers
Project objectives (4) A proper feature engineering must be defined Contest: when the system is ready you will be provided with a test set Sentences must be labeled and we will measure the performance Tagging accuracy: it is defined as the percentage of correctly tagged tokens with respect to the total number of tokens A final short report is required
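The evaluation metric defined above is simple to compute; a minimal Java sketch (method and class names are ours) makes the definition concrete.

```java
/** Tagging accuracy sketch: percentage of tokens whose predicted POS tag
 *  matches the gold tag. Names are illustrative. */
public class Accuracy {

    public static double tagAccuracy(String[] gold, String[] predicted) {
        if (gold.length != predicted.length)
            throw new IllegalArgumentException("sequences must be aligned");
        int correct = 0;
        for (int i = 0; i < gold.length; i++)
            if (gold[i].equals(predicted[i])) correct++;
        // percentage over the total number of tokens
        return 100.0 * correct / gold.length;
    }
}
```

For example, 3 correctly tagged tokens out of 4 gives an accuracy of 75%.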