Complex Prediction Problems


A Novel Approach to Multiple Structured Output Prediction. Yasemin Altun, Max-Planck Institute. ECML HLIE08

Information Extraction: extract structured information from unstructured data. Typical subtasks: Named Entity Recognition (person, location, organization names); Coreference Identification (noun phrases referring to the same object); Relation Extraction (e.g. Person works for Organization). Ultimate tasks: Document Summarization, Question Answering.

Problems: complex tasks consisting of multiple structured subtasks. Real-world problems are too complicated to solve at once. Such tasks are ubiquitous in many domains: Natural Language Processing, Computational Biology, Computational Vision.

Complex Prediction Example: motion tracking in Computational Vision. Overall constraints come from computer vision; subtask: identify the joint angles of the human body.

Example: 3-D protein structure prediction in Computational Biology. Subtask: secondary structure prediction from the amino-acid sequence, e.g. mapping AAYKSHGSGDYGDHDVGHPTPGDPWVEPDYGINVYHSDTYSGQW to a per-residue label sequence.

Standard Approach: the pipeline approach. Define intermediate sub-tasks and solve them individually or in a cascaded manner, using the output of the subtasks as features (input) for the target task, e.g. POS tagging feeding chunking, parsing and NER, where the NER input is x plus the POS tags. [Figure: cascaded chain models over the observations x_1, ..., x_n, one chain per subtask.] Problems: error propagation; no learning across tasks.

New Approach. Proposed approach: solve the tasks jointly and discriminatively. Decompose the multiple structured tasks and use methods from multitask learning: good predictors are smooth, so restrict the search space to smooth functions across all tasks. Devise targeted approximation methods: standard approximation algorithms do not capture the specifics of the problem, where dependencies within tasks are stronger than dependencies across tasks. Advantages: less/no error propagation; enables learning across tasks.

Structured Output (SO) Prediction. Supervised learning: given input/output pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$, e.g. $\mathcal{Y} = \{0, \ldots, m\}$ or $\mathcal{Y} = \mathbb{R}$, with data drawn from an unknown but fixed distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, the goal is to learn a mapping $h : \mathcal{X} \to \mathcal{Y}$. State-of-the-art methods are discriminative, e.g. SVMs and Boosting. In structured output prediction the response variable is multivariate with structural dependencies, and $|\mathcal{Y}|$ is exponential in the number of variables: sequences, trees, hierarchical classification, ranking.

SO Prediction Approaches. Generative framework: model $P(x, y)$. Advantages: efficient learning and inference algorithms. Disadvantages: solves a harder problem, questionable independence assumptions, limited representation. Local approaches, e.g. [Roth, 2001]. Advantages: efficient algorithms. Disadvantages: ignore or handle poorly long-range dependencies. Discriminative learning. Advantages: richer representation via kernels, captures dependencies. Disadvantages: expensive computation (SO prediction involves iteratively computing marginals or the best labeling during training).

Formal Setting. Given a sample $S = ((x_1, y_1), \ldots, (x_n, y_n))$, find $h : \mathcal{X} \to \mathcal{Y}$ with $h(x) = \operatorname{argmax}_y F(x, y)$ for a linear discriminant function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, $F_w(x, y) = \langle \psi(x, y), w \rangle$. Cost function: $\Delta(y, y') \geq 0$, e.g. 0-1 loss or Hamming loss. Canonical example: label sequence learning, where both $x$ and $y$ are sequences.
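As an illustration of this linear discriminant, the following minimal Python sketch builds a joint feature map for a toy label-sequence problem and computes the argmax by brute force. The names (`joint_feature_map`, `predict_bruteforce`) and the toy data are hypothetical, and the exhaustive search is for illustration only; later slides replace it with dynamic programming.

```python
import itertools
import numpy as np

# Hypothetical toy setup: x is a sequence of feature vectors, y a sequence of
# labels in {0, ..., K-1}. psi(x, y) stacks emission and transition counts,
# mirroring the per-time-step decomposition used later for sequences.
def joint_feature_map(x, y, num_labels):
    d = x.shape[1]
    emit = np.zeros((num_labels, d))             # observation-label features
    trans = np.zeros((num_labels, num_labels))   # label-label features
    for t, label in enumerate(y):
        emit[label] += x[t]
        if t > 0:
            trans[y[t - 1], label] += 1.0
    return np.concatenate([emit.ravel(), trans.ravel()])

def score(w, x, y, num_labels):
    return w @ joint_feature_map(x, y, num_labels)  # F_w(x, y) = <psi(x, y), w>

def predict_bruteforce(w, x, num_labels):
    # argmax over all label sequences; exponential in len(x), so only for toy data.
    best = max(itertools.product(range(num_labels), repeat=len(x)),
               key=lambda y: score(w, x, list(y), num_labels))
    return list(best)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, T, d = 3, 4, 5
    x = rng.normal(size=(T, d))
    w = rng.normal(size=K * d + K * K)
    print(predict_bruteforce(w, x, K))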

Maximum Margin Learning [Altun et al. 2003]. Define the separation margin [Crammer & Singer 2001]: $\gamma_i = F_w(x_i, y_i) - \max_{y \neq y_i} F_w(x_i, y)$. Maximize $\min_i \gamma_i$ while keeping $\|w\|$ small; equivalently, minimize $\sum_i \max_{y \neq y_i} \big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+ + \lambda \|w\|_2^2$.

Max-Margin Learning (cont.). The objective $\sum_i \max_{y \neq y_i} \big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+ + \lambda \|w\|_2^2$ corresponds to a convex, non-quadratic program: $\min_{w, \xi} \frac{1}{2}\|w\|_2^2 + \frac{C}{n}\sum_i \xi_i$ s.t. $\langle w, \psi(x_i, y_i) \rangle - \max_{y \neq y_i} \langle w, \psi(x_i, y) \rangle \geq 1 - \xi_i$ for all $i$.

Max-Margin Learning (cont.). Equivalently, a convex quadratic program: $\min_{w, \xi} \frac{1}{2}\|w\|_2^2 + \frac{C}{n}\sum_i \xi_i$ s.t. $\langle w, \psi(x_i, y_i) \rangle - \langle w, \psi(x_i, y) \rangle \geq 1 - \xi_i$ for all $i$ and all $y \neq y_i$. The number of constraints is exponential, but sparsity helps: only a few of the constraints will be active.

Max-Margin Dual Problem. Using Lagrangian techniques, the dual is $\max_\alpha -\frac{1}{2} \sum_{i,j,y,y'} \alpha_i(y)\alpha_j(y') \langle \delta\psi(x_i, y), \delta\psi(x_j, y') \rangle + \sum_{i,y} \alpha_i(y)$ s.t. $\alpha_i(y) \geq 0$ and $\sum_{y \neq y_i} \alpha_i(y) \leq \frac{C}{n}$ for all $i$, where $\delta\psi(x_i, y) = \psi(x_i, y_i) - \psi(x_i, y)$. Exploit the structure of the constraints, and replace the inner product with a kernel for an implicit non-linear mapping.

Max-Margin Optimization. Exploit sparseness and the structure of the constraints by incrementally adding constraints (cutting plane algorithm). Maintain a working set $Y_i \subseteq \mathcal{Y}$ for each training instance. Iterate over the training instances and incrementally augment (or shrink) the working set $Y_i$: compute $\hat{y} = \operatorname{argmax}_{y \neq y_i} F(x_i, y)$ via dynamic programming; check whether $F(x_i, y_i) - F(x_i, \hat{y}) \geq 1 - \epsilon$; if not, add $\hat{y}$ to $Y_i$ and optimize over the Lagrange multipliers $\alpha_i$ of $Y_i$. A sketch of this loop follows.
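A minimal sketch of this working-set loop, assuming hypothetical helpers `argmax_over_Y` (the dynamic-programming separation oracle) and `solve_qp` (re-optimization over the current working sets); it mirrors the cutting-plane idea rather than any particular implementation.

```python
def cutting_plane_train(data, score, argmax_over_Y, solve_qp, epsilon=0.1, max_iter=50):
    """Working-set training loop (sketch).

    data:          list of (x_i, y_i) pairs; y_i must be hashable (e.g. a tuple of labels)
    score:         score(w, x, y) = <w, psi(x, y)>
    argmax_over_Y: returns argmax_{y != y_i} score(w, x_i, y)      [hypothetical helper]
    solve_qp:      re-optimizes w over the current working sets    [hypothetical helper]
    """
    working_sets = [set() for _ in data]        # Y_i for each training instance
    w = solve_qp(working_sets, data)            # start from the unconstrained solution
    for _ in range(max_iter):
        added = False
        for i, (x_i, y_i) in enumerate(data):
            y_hat = argmax_over_Y(w, x_i, y_i)              # most violating candidate
            margin = score(w, x_i, y_i) - score(w, x_i, y_hat)
            if margin < 1 - epsilon and y_hat not in working_sets[i]:
                working_sets[i].add(y_hat)                  # add the violated constraint
                w = solve_qp(working_sets, data)            # re-optimize alphas of Y_i
                added = True
        if not added:                                       # no violated constraints left
            break
    return w
```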

Max-Margin Cost Sensitivity. Cost function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: 0/1 loss for multiclass, Hamming loss for sequences, $(1 - F_1)$ for parsing. Extend the max-margin framework to be cost sensitive: margin rescaling (Taskar et al. 2004), $\max_{y \neq y_i} \big(\Delta(y_i, y) + F_w(x_i, y) - F_w(x_i, y_i)\big)_+$, or slack rescaling (Tsochantaridis et al. 2004), $\max_{y \neq y_i} \Delta(y_i, y)\big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+$.
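A toy comparison of the two cost-sensitive hinge terms, using Hamming loss and hypothetical scores over a small explicit candidate set; both functions below are sketches of the margin-rescaling and slack-rescaling formulas above.

```python
# Toy illustration (hypothetical numbers): compare the two cost-sensitive hinge
# terms for one training pair, enumerating a small candidate set of outputs.
def hamming(y, y_prime):
    return sum(a != b for a, b in zip(y, y_prime))

def cost_sensitive_losses(score_true, candidates, y_true):
    """candidates: list of (y, F_w(x, y)) pairs with y != y_true."""
    margin_rescaled = max(
        max(hamming(y_true, y) + s - score_true, 0.0) for y, s in candidates)
    slack_rescaled = max(
        hamming(y_true, y) * max(1.0 + s - score_true, 0.0) for y, s in candidates)
    return margin_rescaled, slack_rescaled

if __name__ == "__main__":
    y_true = (0, 1, 1)
    candidates = [((0, 0, 1), 1.2), ((1, 0, 0), 0.4)]  # (label sequence, score)
    print(cost_sensitive_losses(score_true=1.5, candidates=candidates, y_true=y_true))
```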

Example: Sequences. Use Viterbi decoding for the argmax operation. Decompose the features over time: $\psi(x, y) = \sum_t \big(\psi(x_t, y_t) + \psi(y_t, y_{t-1})\big)$, with two types of features: observation-label, $\psi(x_t, y_t) = \phi(x_t) \otimes \Lambda(y_t)$, and label-label, $\psi(y_t, y_{t-1}) = \Lambda(y_t) \otimes \Lambda(y_{t-1})$.
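A minimal Viterbi sketch for this chain decomposition, assuming the emission and transition score tables have already been computed from $w$ and $\phi(x_t)$; the names `emit` and `trans` are hypothetical.

```python
import numpy as np

# emit[t, k] plays the role of <w, psi(x_t, y_t=k)> and trans[j, k] the role of
# <w, psi(y_t=k, y_{t-1}=j)>; both are assumed precomputed from w and phi(x_t).
def viterbi(emit, trans):
    T, K = emit.shape
    delta = np.full((T, K), -np.inf)
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans + emit[t][None, :]  # K x K, rows = previous label
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    # Backtrack the best label sequence.
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(viterbi(rng.normal(size=(6, 3)), rng.normal(size=(3, 3))))
```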

Example: Sequences (cont.). The inner product decomposes over the two feature types: $\langle \psi(x, y), \psi(\bar{x}, \bar{y}) \rangle = \sum_{s,t} \langle \phi(x_t), \phi(\bar{x}_s) \rangle \, \delta(y_t, \bar{y}_s) + \delta(y_t, \bar{y}_s)\,\delta(y_{t-1}, \bar{y}_{s-1}) = \sum_{s,t} k((x_t, y_t), (\bar{x}_s, \bar{y}_s)) + k((y_t, y_{t-1}), (\bar{y}_s, \bar{y}_{s-1}))$. This allows arbitrary kernels on $x$ and linear kernels on $y$.
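A sketch of this decomposed sequence kernel, using an RBF base kernel on the observations as one example of an "arbitrary kernel on x" and Kronecker deltas on the labels; the function name and the choice of base kernel are assumptions.

```python
import numpy as np

# Sequence kernel <psi(x, y), psi(x_bar, y_bar)> as decomposed above.
def sequence_kernel(x, y, x_bar, y_bar, gamma=0.5):
    total = 0.0
    for t in range(len(y)):
        for s in range(len(y_bar)):
            k_x = np.exp(-gamma * np.sum((x[t] - x_bar[s]) ** 2))      # base kernel on x
            total += k_x * (y[t] == y_bar[s])                           # observation-label part
            if t > 0 and s > 0:
                total += (y[t] == y_bar[s]) * (y[t - 1] == y_bar[s - 1])  # label-label part
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, x_bar = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
    print(sequence_kernel(x, [0, 1, 1, 2], x_bar, [0, 0, 1, 2, 1]))
```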

Other SO Prediction Methods. Find $w$ to minimize the expected loss $E_{(x,y) \sim D}[\Delta(y, h_f(x))]$ via regularized empirical risk minimization, $w^* = \operatorname{argmin}_w \sum_{i=1}^{n} L(x_i, y_i, w) + \lambda \|w\|^2$. Loss functions: hinge loss (the max-margin approach above); log-loss, giving CRFs [Lafferty et al. 2001], $L(x, y, f) = -F(x, y) + \log \sum_{\hat{y} \in \mathcal{Y}} \exp(F(x, \hat{y}))$; exp-loss, giving structured boosting [Altun et al. 2002], $L(x, y, f) = \sum_{\hat{y} \in \mathcal{Y}} \exp(F(x, \hat{y}) - F(x, y))$.
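For a single example with a small explicit candidate set, the three losses can be compared directly; the scores below are hypothetical, and in this sketch the exp-loss sums over the competing outputs only (including the true output would merely add a constant 1).

```python
import math

# Toy comparison of the structured losses for one example, given the score of
# the true output and the scores of all competing outputs.
def hinge_loss(score_true, other_scores):
    return max(max(1.0 + s - score_true for s in other_scores), 0.0)

def log_loss(score_true, other_scores):
    # -F(x, y) + log sum_yhat exp(F(x, yhat)): the CRF negative log-likelihood.
    all_scores = [score_true] + list(other_scores)
    return -score_true + math.log(sum(math.exp(s) for s in all_scores))

def exp_loss(score_true, other_scores):
    return sum(math.exp(s - score_true) for s in other_scores)

if __name__ == "__main__":
    print(hinge_loss(2.0, [1.5, 0.3]),
          log_loss(2.0, [1.5, 0.3]),
          exp_loss(2.0, [1.5, 0.3]))
```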

Complex Prediction via SO Prediction. Possible solution: treat the complex prediction problem as one loopy graph and use standard approximation methods. [Figure: graphical representation of (a) a linear-chain CRF and (b) a factorial CRF; hidden nodes are shown with links only to observations at the same time step.] Shortcomings: no knowledge of the graph structure is used; no knowledge that the tasks are defined over the same input space. Solution: dependencies within tasks are more important than dependencies across tasks, so exploit this in the approximation method, and restrict the function class for each task via learning across tasks.

Joint Learning of Multiple SO Predictions [Altun 2008]. Given tasks $1, \ldots, m$, learn a discriminative function $T : \mathcal{X} \to \mathcal{Y}^1 \times \cdots \times \mathcal{Y}^m$ via the joint score $T(x, y; w, \bar{w}) = \sum_l F_l(x, y^l; w_l) + \sum_{l, l'} F_{ll'}(y^l, y^{l'}; \bar{w})$, where the $w_l$ capture dependencies within individual tasks, the $\bar{w}_{l,l'}$ capture dependencies across tasks, $F_l$ is defined as before, and the $F_{ll'}$ are linear functions with respect to the clique assignments of tasks $l, l'$.
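A sketch of this joint score for two hypothetical chain tasks over the same input: within-task terms of the form $\langle w_l, \psi(x, y^l) \rangle$, plus one simple choice of cross-task term scoring co-occurring label pairs at each position. The particular decomposition of $F_{ll'}$ here is an assumption for illustration.

```python
import numpy as np

def joint_score(w_tasks, w_cross, x, y_tasks, feature_map):
    """Joint score T(x, y; w, w_bar): within-task terms plus a cross-task term."""
    # Within-task terms: F_l(x, y^l; w_l) = <w_l, psi(x, y^l)>
    score = sum(float(w_l @ feature_map(x, y_l)) for w_l, y_l in zip(w_tasks, y_tasks))
    # Cross-task term (one simple choice): sum_t w_cross[y^1_t, y^2_t]
    y1, y2 = y_tasks
    score += sum(float(w_cross[a, b]) for a, b in zip(y1, y2))
    return score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, K1, K2 = 4, 3, 3, 2
    x = rng.normal(size=(T, d))
    # Trivial, hypothetical per-task feature map, just to make the sketch runnable.
    feature_map = lambda x, y: np.array([sum(y)]) * x.mean(axis=0)
    w_tasks = [rng.normal(size=d), rng.normal(size=d)]
    w_cross = rng.normal(size=(K1, K2))
    print(joint_score(w_tasks, w_cross, x, [[0, 1, 2, 1], [0, 1, 1, 0]], feature_map))
```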

Joint Learning of Multiple SO Predictions (cont.). Assume a low-dimensional representation $\Theta$ shared across all tasks [Argyriou et al. 2007]: $F_l(x, y^l; w_l, \Theta) = \langle w_{l\sigma}, \Theta^T \psi(x, y^l) \rangle$. Find $T$ by discovering $\Theta$ and learning $w, \bar{w}$: $\min_{\Theta, w, \bar{w}} \hat{r}(\Theta) + r(w) + r(\bar{w}) + \sum_{l=1}^{m} \sum_{i=1}^{n} L_l(x_i, y_i^l; w, \bar{w}, \Theta)$, where $r$ is a regularizer, e.g. the L2 norm, $\hat{r}$ is e.g. the Frobenius or trace norm, and $L$ is a loss function, e.g. log-loss or hinge loss. This optimization is not jointly convex over $\Theta$ and $w, \bar{w}$.

Joint Learning of Multiple SO Predictions (cont.). Via a reformulation we obtain a jointly convex optimization: $\min_{A, D, \bar{w}} \sum_{l\sigma} \langle A_{l\sigma}, D^{+} A_{l\sigma} \rangle + r(\bar{w}) + \sum_{l=1}^{m} \sum_{i=1}^{n} L_l(x_i, y_i^l; A, \bar{w})$. Optimize iteratively with respect to $A$, $\bar{w}$ and $D$; the minimization over $D$ has a closed-form solution.

Optimization. $A$ and $\bar{w}$ decompose into per-task parameters; optimize with respect to each task's parameters iteratively. Problem: $F_{l,l'}$ is a function of all the other tasks. Solution: a Loopy Belief Propagation-like algorithm in which each clique assignment is approximated with respect to the current parameters iteratively: run DP for all other tasks, fix the clique assignment values, and optimize with respect to the current task.

Algorithm. Approximate each clique with the dynamic program and perform normalization; this is sufficient since the goal is not a proper conditional distribution over the complete $\mathcal{Y}^l$, but a local preference over label assignments for the current clique. Optimizing with respect to $D$ for fixed $w, \bar{w}$ corresponds to solving $\min_D \sum_l \langle a_l, D^{+} a_l \rangle$, whose optimal solution is $D = (AA^T)^{\frac{1}{2}} / \|A\|_T$, where $\|A\|_T$ denotes the trace norm of $A$.

Algorithm 1: Joint Learning of Multiple Structured Prediction Tasks
1: repeat
2:   for each task $l$ do
3:     compute $\hat{a}_l = \operatorname{argmin}_a \sum_i L(x_i, y_i^l; a) + \langle a, D^{+} a \rangle$, computing the $\psi$ functions for each $x_i$ with dynamic programming
4:   end for
5:   compute $D = (AA^T)^{\frac{1}{2}} / \|A\|_T$ and $D^{+}$
6: until convergence

Inference computes the $\psi$ functions iteratively until convergence, i.e. until $f_l(x) = \operatorname{argmax}_{y^l} T_l(x, y^l)$ no longer changes with respect to the previous iteration for any $l$.
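A sketch of the closed-form $D$ update from Algorithm 1 in NumPy, using an eigendecomposition of $AA^T$ for the matrix square root; the function name and the toy dimensions are hypothetical.

```python
import numpy as np

# Closed-form update D = (A A^T)^{1/2} / ||A||_T, where the columns of A are the
# per-task parameter vectors a_l; D^+ then enters the regularizer <a, D^+ a>.
def update_D(A, eps=1e-8):
    eigvals, eigvecs = np.linalg.eigh(A @ A.T)          # A A^T is PSD
    eigvals = np.clip(eigvals, 0.0, None)
    sqrt_AAt = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
    trace_norm = np.sqrt(eigvals).sum()                 # ||A||_T = trace((A A^T)^{1/2})
    D = sqrt_AAt / max(trace_norm, eps)
    return D, np.linalg.pinv(D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 3))                         # 3 tasks, 5-dimensional parameters
    D, D_plus = update_D(A)
    print(np.trace(D))                                  # trace(D) = 1 by construction
```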

Experiments. Task 1: POS tagging, evaluated with accuracy. Task 2: NER, evaluated with F1 score. Data: 2000 sentences from the CoNLL03 English corpus. Structure: sequences. Loss: log-loss (CRF).

Table 1: Comparison of the cascaded model and joint optimization for POS tagging and NER.
POS accuracy: Cascaded 92.63; Joint (no Θ) 93.21; MT-Joint 93.67.
NER F1: Cascaded 58.77 (no POS) / 67.42 (predicted POS) / 69.75 (true POS); Joint (no Θ) 68.51; MT-Joint 70.01.

Note that the algorithm is immediately applicable to settings with separate training data for each task, since the training of an individual task (Line 3 of Algorithm 1) does not use the correct labels of the other tasks.

Conclusion. IE involves complex tasks, i.e. multiple structured prediction tasks. Structured prediction methods include CRFs and max-margin SO learning. We proposed a novel approach to joint prediction of multiple SO problems, using a targeted approximation algorithm and multi-task methods. More experimental evaluation is required.