Exponentiated Gradient Algorithms for Large-margin Structured Classification


Peter L. Bartlett, U.C. Berkeley, bartlett@stat.berkeley.edu
Ben Taskar, Stanford University, btaskar@cs.stanford.edu
Michael Collins, MIT CSAIL, mcollins@csail.mit.edu
David McAllester, TTI at Chicago, mcallester@tti-c.org

Abstract

We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs.

1 Introduction

Structured classification is the problem of predicting y from x in the case where y has meaningful internal structure. For example, x might be a word string and y a sequence of part-of-speech labels; or x might be a Markov random field and y a labeling of x; or x might be a word string and y a parse of x. In these examples the number of possible labels y is exponential in the size of x. This paper presents a training algorithm for a general definition of structured classification covering both Markov random fields and parsing.

We restrict our attention to linear discriminative classification. We assume that pairs (x, y) can be embedded in a linear feature space Φ(x, y), and that a predictive rule is determined by a direction (weight vector) w in that feature space. In linear discriminative prediction we select the y that has the greatest value for the inner product Φ(x, y) · w.
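As a toy illustration of this prediction rule (ours, with hypothetical candidate and feature functions, not code from the paper), "select the y with the greatest inner product Φ(x, y) · w" can be sketched as:

```python
# Toy sketch of linear structured prediction: pick the candidate label y
# whose feature vector has the largest inner product with the weight vector w.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(x, candidates, feature_map, w):
    """Return argmax over candidates(x) of feature_map(x, y) . w."""
    return max(candidates(x), key=lambda y: dot(feature_map(x, y), w))

# Hypothetical toy problem: label each of two positions with 0 or 1.
def candidates(x):
    return [(a, b) for a in (0, 1) for b in (0, 1)]

def feature_map(x, y):
    # One indicator feature per (position, label) pair.
    feats = [0.0] * 4
    for pos, lab in enumerate(y):
        feats[2 * pos + lab] = 1.0
    return feats

w = [0.1, 0.9, 0.8, 0.2]  # favors label 1 at position 0, label 0 at position 1
print(predict("dummy input", candidates, feature_map, w))  # -> (1, 0)
```

In real structured problems the candidate set is exponentially large, so the max is computed by a decoding algorithm (e.g. Viterbi or CKY) rather than by enumeration.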
Linear discrimination has been widely studied in the binary and multiclass setting [6, 4]. However, the case of structured labels has only recently been considered [2, 12, 3, 13]. The structured-label case takes into account the internal structure of y in the assignment of feature vectors, the computation of loss, and the definition and use of margins.

We focus on a formulation where each label y is represented as a set of parts, or equivalently, as a bit-vector. Moreover, we assume that the feature vector for y and the loss for y are both linear in the individual bits of y. This formulation has the advantage that it naturally covers both simple labeling problems, such as part-of-speech tagging, and more complex problems such as parsing.

We consider the large-margin optimization problem defined in [12] for selecting the classification direction w given a training sample. The starting point for these methods is a

primal problem that has one constraint for each possible labeling y, or equivalently a dual problem in which each y has an associated dual variable. We give a new training algorithm that relies on an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The computation of these expectations appears to be a natural computational problem for structured problems, and has specific polynomial-time dynamic programming algorithms for some important examples: the clique-tree belief propagation algorithm can be used in Markov random fields, and the inside-outside algorithm can be used in the case of weighted context-free grammars.

The optimization method for structured labels relies on a more general result, specifically the application of exponentiated gradient (EG) updates [7, 8] to quadratic programs (QPs). We describe a method for solving QPs based on EG updates, and give bounds on its rate of convergence. The algorithm uses multiplicative updates on dual parameters in the problem. In addition to their application to the structured-labels task, the EG updates lead to simple algorithms for optimizing conventional binary or multiclass SVM problems.

Related work. [2, 12, 3, 13] consider large-margin methods for Markov random fields and (weighted) context-free grammars. We consider the optimization problem defined in [12]; [12] use a row-generation approach based on Viterbi decoding combined with an SMO optimization method. [5] describe exponentiated gradient algorithms for SVMs, but for binary classification in the hard-margin case, without slack variables. We show that the EG-QP algorithm converges significantly faster than the rates shown in [5].
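To make "expectations under Gibbs distributions" concrete: for a chain-structured model, node marginals of p(y) proportional to exp(Σ_j θ_node[j][y_j] + Σ_j θ_edge[j][y_j][y_{j+1}]) can be computed in time linear in the chain length by the standard forward-backward recursion. The sketch below is our own minimal illustration with unnormalized potentials, not the paper's implementation:

```python
import math

def chain_node_marginals(theta_node, theta_edge):
    """Node marginals of p(y) prop. to exp(sum_j theta_node[j][y_j]
    + sum_j theta_edge[j][y_j][y_{j+1}]) for a chain with K labels per node."""
    n = len(theta_node)
    K = len(theta_node[0])
    # Forward messages: alpha[j][b] sums all prefixes ending with label b at j.
    alpha = [[math.exp(theta_node[0][b]) for b in range(K)]]
    for j in range(1, n):
        alpha.append([
            sum(alpha[j - 1][a] * math.exp(theta_edge[j - 1][a][b])
                for a in range(K)) * math.exp(theta_node[j][b])
            for b in range(K)
        ])
    # Backward messages: beta[j][a] sums all suffixes continuing after j.
    beta = [[1.0] * K for _ in range(n)]
    for j in range(n - 2, -1, -1):
        for a in range(K):
            beta[j][a] = sum(math.exp(theta_edge[j][a][b])
                             * math.exp(theta_node[j + 1][b]) * beta[j + 1][b]
                             for b in range(K))
    Z = sum(alpha[n - 1])
    return [[alpha[j][b] * beta[j][b] / Z for b in range(K)] for j in range(n)]
```

The same pattern generalizes to clique-tree belief propagation for tree-structured MRFs and to inside-outside for WCFGs.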
Multiplicative updates for SVMs are also described in [11], but unlike our method, the updates in [11] do not appear to factor in a way that allows algorithms for MRFs and WCFGs based on Gibbs-distribution representations. Our algorithms are related to those for conditional random fields (CRFs) [9]. CRFs define a linear model for structured problems, in a similar way to the models in our work, and also rely on the efficient computation of marginals in the training phase. Finally, see [1] for a longer version of the current paper, which includes more complete derivations and proofs.

2 The General Setting

We consider the problem of learning a function f : X → Y, where X is a set and Y is a countable set. We assume a loss function L : X × Y × Y → R+. The quantity L(x, y, ŷ) measures the loss when y is the true label for x and ŷ is a predicted label; typically, ŷ is the label proposed by some function f(x). In general we will assume that L(x, y, ŷ) = 0 for ŷ = y. Given some distribution over examples (X, Y) in X × Y, our aim is to find a function with low expected loss, or risk, E L(X, Y, f(X)).

We consider functions f which take a linear form. First, we assume a fixed function G which maps an input x to a set of candidates G(x). For all x, we assume that G(x) is a subset of Y and that G(x) is finite. A second component to the model is a feature-vector representation Φ : X × Y → R^d. Given a parameter vector w ∈ R^d, we consider functions of the form

  f_w(x) = argmax_{y ∈ G(x)} Φ(x, y) · w.

Given n independent training examples (x_i, y_i) with the same distribution as (X, Y), we will formalize a large-margin optimization problem that is a generalization of support vector methods for binary classifiers, and is essentially the same as the formulation in [12]. The optimal parameters are taken to minimize the following regularized empirical risk function:

  (1/2)‖w‖² + C Σ_i max_y (L(x_i, y_i, y) − m_{i,y}(w))_+,

where m_{i,y}(w) = w · Φ(x_i, y_i) − w · Φ(x_i, y) is the margin on (i, y) and (z)_+ = max{z, 0}. This optimization can be expressed as the primal problem in Figure 1. Following [12], the dual of this problem is also shown in Figure 1. The dual is a quadratic

Primal problem:
  min_{w,ε} (1/2)‖w‖² + C Σ_i ε_i
  subject to the constraints: for all i, y ∈ G(x_i): w · Ψ_{i,y} ≥ L_{i,y} − ε_i; ε_i ≥ 0.

Dual problem:
  max_α J(α), where J(α) = C Σ_{i,y} α_{i,y} L_{i,y} − (C²/2) Σ_{i,y} Σ_{j,z} α_{i,y} α_{j,z} Ψ_{i,y} · Ψ_{j,z}
  subject to the constraints: for all i, Σ_y α_{i,y} = 1; α_{i,y} ≥ 0.

Relationship between optimal values: w* = C Σ_{i,y} α*_{i,y} Ψ_{i,y}, where w* is the argmin of the primal problem and α* is the argmax of the dual problem.

Figure 1: The primal and dual problems. We use the definitions L_{i,y} = L(x_i, y_i, y) and Ψ_{i,y} = Φ(x_i, y_i) − Φ(x_i, y). We assume that for all i, L_{i,y} = 0 for y = y_i. The constant C dictates the relative penalty for values of the slack variables ε_i which are greater than 0.

program J(α) in the dual variables α_{i,y} for all i = 1…n, y ∈ G(x_i). The dual variables for each example are constrained to form a probability distribution over G(x_i).

2.1 Models for structured classification

The problems we are interested in concern structured labels, which have a natural decomposition into parts. Formally, we assume some countable set of parts, R. We also assume a function R which maps each object (x, y) ∈ X × Y to a finite subset of R. Thus R(x, y) is the set of parts belonging to a particular object. In addition we assume a feature-vector representation of parts: this is a function φ : X × R → R^d. The feature vector for an object (x, y) is then a sum of the feature vectors for its parts, and we also assume that the loss function L(x, y, ŷ) decomposes into a sum over parts:

  Φ(x, y) = Σ_{r ∈ R(x,y)} φ(x, r),    L(x, y, ŷ) = Σ_{r ∈ R(x,ŷ)} l(x, y, r).

Here φ(x, r) is a local feature vector for part r paired with input x, and l(x, y, r) is a local loss for part r when proposed for the pair (x, y). For convenience we define indicator variables I(x, y, r) which are 1 if r ∈ R(x, y) and 0 otherwise. We also define sets R(x_i) = the union over y ∈ G(x_i) of R(x_i, y), for i = 1…n.

Example 1: Markov Random Fields (MRFs). In an MRF the space of labels G(x), and their underlying structure, can be represented by a graph. The graph G = (V, E) is a collection of vertices V = {v_1, v_2, …, v_l} and edges E. Each vertex v_j ∈ V has a set of possible labels, Y_j. The set G(x) is then defined as Y_1 × Y_2 × … × Y_l.
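The part decomposition of Φ and L above can be made concrete with a small sketch (the helper names and the toy sequence-labeling instance are ours, not the paper's):

```python
def global_features(parts, local_feature, x, y, d):
    """Phi(x, y) = sum of local feature vectors over the parts of (x, y)."""
    phi = [0.0] * d
    for r in parts(x, y):
        for k, v in enumerate(local_feature(x, r)):
            phi[k] += v
    return phi

def global_loss(parts, local_loss, x, y_true, y_pred):
    """L(x, y, y_hat) = sum of local losses over the parts of (x, y_hat)."""
    return sum(local_loss(x, y_true, r) for r in parts(x, y_pred))

# Hypothetical sequence-labeling instance: parts are (position, label) pairs.
def parts(x, y):
    return [(j, lab) for j, lab in enumerate(y)]

def local_feature(x, r):
    j, lab = r
    return [1.0 if lab == 1 else 0.0, 1.0 if lab == 0 else 0.0]

def local_loss(x, y_true, r):
    j, lab = r
    return 0.0 if y_true[j] == lab else 1.0  # per-part Hamming loss

print(global_loss(parts, local_loss, "x", (0, 1, 1), (1, 1, 0)))  # -> 2.0
```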
Each clique in the graph has a set of possible configurations: for example, if a particular clique contains vertices {v_i, v_j, v_k}, the set of possible configurations of this clique is Y_i × Y_j × Y_k. We define C to be the set of cliques in the graph, and for any c ∈ C we define Y(c) to be the set of possible configurations for that clique. We decompose each y ∈ G(x) into a set of parts, by defining

  R(x, y) = {(c, a) ∈ R : c ∈ C, a ∈ Y(c), (c, a) is consistent with y}.

The feature-vector representation φ(x, (c, a)) for each part can essentially track any characteristics of the assignment a for clique c, together with any features of the input x. A number of choices for the loss function l(x, y, (c, a)) are possible. For example, consider the Hamming loss used in [12], defined as L(x, y, ŷ) = Σ_j I(y_j ≠ ŷ_j). To achieve this, first assign each vertex v_j to a single one of the cliques in which it appears. Second, define l(x, y, (c, a)) to be the number of labels in the assignment (c, a) which are both incorrect and correspond to vertices which have been assigned to the clique c (note that assigning each vertex to a single clique avoids double counting of label errors).

Example 2: Weighted Context-Free Grammars (WCFGs). In this example x is an input string, and y is a parse tree for that string, i.e., a left-most derivation for x under some context-free grammar. The set G(x) is the set of all left-most derivations for x

Inputs: A learning rate η.
Data structures: A vector θ of variables θ_{i,r}, for i = 1…n, r ∈ R(x_i).
Definitions: α_{i,y}(θ) = exp(Σ_{r ∈ R(x_i,y)} θ_{i,r}) / Z_i, where Z_i is a normalization term.
Algorithm:
  Choose initial values θ^1_{i,r} for the θ_{i,r} variables (these values can be arbitrary).
  For t = 1…T+1:
    For i = 1…n, r ∈ R(x_i), calculate μ^t_{i,r} = Σ_y α_{i,y}(θ^t) I(x_i, y, r).
    Set w^t = C Σ_i (Σ_{r ∈ R(x_i,y_i)} φ_r − Σ_{r ∈ R(x_i)} μ^t_{i,r} φ_r).
    For i = 1…n, r ∈ R(x_i), calculate the updates θ^{t+1}_{i,r} = θ^t_{i,r} + η (l_r + φ_r · w^t).
Output: Parameter values w^{T+1}.

Figure 2: The EG algorithm for structured problems. We use the shorthand φ_r = φ(x_i, r) and l_r = l(x_i, y_i, r).

under the grammar. For convenience, we restrict the grammar to be in Chomsky normal form, where all rules in the grammar are of the form ⟨A → B C⟩ or ⟨A → a⟩, where A, B, C are non-terminal symbols and a is some terminal symbol. We take a part r to be a CF-rule-tuple ⟨A → B C, s, m, e⟩. Under this representation A spans words s…e inclusive in x; B spans words s…m; and C spans words (m+1)…e. The function R(x, y) maps a derivation y to the set of parts which it includes. In WCFGs φ(x, r) can be any function mapping a rule production and its position in the sentence x to a feature vector. One example of a loss function would be to define l(x, y, r) to be 1 only if r's non-terminal A is not seen spanning words s…e in the derivation y. This would lead to L(x, y, ŷ) tracking the number of constituent errors in ŷ, where a constituent is a ⟨non-terminal, start-point, end-point⟩ tuple such as ⟨A, s, e⟩.

3 EG updates for structured objects

We now consider an algorithm for computing α* = argmax_{α ∈ Δ} J(α), where J(α) is the dual form of the maximum-margin problem, as in Figure 1. In particular, we are interested in the optimal values of the primal-form parameters, which are related to α* by w* = C Σ_{i,y} α*_{i,y} Ψ_{i,y}. A key problem is that in many of our examples, the number of dual variables α_{i,y} precludes dealing with these variables directly. For example, in the MRF case or the WCFG case, the set G(x_i) is exponential in size, and the number of dual variables α_{i,y} is therefore also exponential.
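A toy illustration (ours) of how per-part parameters can represent a distribution over many labels, and of the part marginals the algorithm needs; real MRF/WCFG instances replace the brute-force enumeration below with dynamic programming:

```python
import math

def gibbs_alpha(labels, parts_of, theta):
    """alpha_y(theta) = exp(sum over r in parts_of(y) of theta[r]) / Z."""
    scores = {y: math.exp(sum(theta[r] for r in parts_of(y))) for y in labels}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

def part_marginals(labels, parts_of, theta):
    """mu_r = sum over y of alpha_y(theta) * I(r in parts_of(y))."""
    alpha = gibbs_alpha(labels, parts_of, theta)
    mu = {r: 0.0 for r in theta}
    for y, a in alpha.items():
        for r in parts_of(y):
            mu[r] += a
    return mu

# Tiny hypothetical example: labels are bit pairs, parts are (position, bit) pairs.
labels = [(0, 0), (0, 1), (1, 0), (1, 1)]
parts_of = lambda y: [(0, y[0]), (1, y[1])]
theta = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
mu = part_marginals(labels, parts_of, theta)
# With all theta equal, every part indicator has marginal 0.5.
```

Here four parts parameterize a distribution over four labels; with longer sequences the same number of parts per position covers exponentially many labels.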
We describe an algorithm that is efficient for certain examples of structured objects such as MRFs or WCFGs. Instead of representing the α_{i,y} variables explicitly, we will instead manipulate a vector θ of variables θ_{i,r}, for i = 1…n, r ∈ R(x_i). Thus we have one of these "mini-dual" variables for each part seen in the training data. Each of the variables θ_{i,r} can take any value in the reals. We now define the dual variables α_{i,y} as a function of the vector θ, which takes the form of a Gibbs distribution:

  α_{i,y}(θ) = exp(Σ_{r ∈ R(x_i,y)} θ_{i,r}) / Σ_{y′} exp(Σ_{r ∈ R(x_i,y′)} θ_{i,r}).

Figure 2 shows an algorithm for maximizing J(α(θ)). The algorithm defines a sequence of values θ^1, θ^2, …. In the next section we prove that the sequence J(α(θ^1)), J(α(θ^2)), … converges to max_α J(α). The algorithm can be implemented efficiently, independently of the dimensionality of α, provided that there is an efficient algorithm for computing the marginal terms μ_{i,r} = Σ_y α_{i,y}(θ) I(x_i, y, r) for all i = 1…n, r ∈ R(x_i), and for all θ. A key property is that the primal parameters w(θ) = C Σ_{i,y} α_{i,y}(θ) Ψ_{i,y} = C Σ_i (Φ(x_i, y_i) −

Σ_y α_{i,y}(θ) Φ(x_i, y)) can be expressed in terms of the marginal terms, because:

  Σ_y α_{i,y}(θ) Φ(x_i, y) = Σ_y α_{i,y}(θ) Σ_{r ∈ R(x_i,y)} φ(x_i, r) = Σ_{r ∈ R(x_i)} μ_{i,r}(θ) φ(x_i, r),

and hence w(θ) = C Σ_i (Φ(x_i, y_i) − Σ_{r ∈ R(x_i)} μ_{i,r}(θ) φ(x_i, r)). The μ_{i,r} values can be calculated for MRFs and WCFGs in many cases, using standard algorithms. For example, in the WCFG case, the inside-outside algorithm can be used, provided that each part r is a context-free rule production, as described in Example 2 above. In the MRF case, the μ_{i,r} values can be calculated efficiently if the tree-width of the underlying graph is small.

Note that the main storage requirements of the algorithm in Figure 2 concern the vector θ. This vector has as many components as there are parts in the training set, and in practice the number of parts in the training data can become extremely large. Fortunately, an alternative, primal-form algorithm is possible. Rather than explicitly storing the θ_{i,r} variables, we can store a vector z^t of the same dimensionality as w^t; the θ_{i,r} values can be computed from z^t. More explicitly, the main body of the algorithm in Figure 2 can be replaced with the following:

  Set z^1 to some initial value. For t = 1…T+1:
    Set w^t = 0.
    For i = 1…n: compute μ^t_{i,r} for r ∈ R(x_i), using θ^t_{i,r} = η ((t − 1) l_r + z^t · φ_r); set w^t = w^t + C (Σ_{r ∈ R(x_i,y_i)} φ_r − Σ_{r ∈ R(x_i)} μ^t_{i,r} φ_r).
    Set z^{t+1} = z^t + w^t.

It can be verified that if θ^1_{i,r} = η z^1 · φ_r, then this alternative algorithm defines the same sequence of (implicit) θ^t_{i,r} values, and (explicit) w^t values, as the original algorithm. In the next section we show that the original algorithm converges for any choice of initial values θ^1, so this restriction on θ^1 should not be significant.

4 Exponentiated gradient (EG) updates for quadratic programs

We now prove convergence properties of the algorithm in Figure 2. We show that it is an instantiation of a general algorithm for optimizing quadratic programs (QPs), which relies on Exponentiated Gradient (EG) updates [7, 8].
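The basic EG step that the following analysis studies multiplies each coordinate by exp(−η · gradient) and renormalizes, so iterates remain strictly inside the simplex. A minimal sketch (ours, for a single simplex block):

```python
import math

def eg_step(alpha, grad, eta):
    """One exponentiated-gradient step on a probability simplex:
    alpha_j <- alpha_j * exp(-eta * grad_j) / Z."""
    scaled = [a * math.exp(-eta * g) for a, g in zip(alpha, grad)]
    Z = sum(scaled)
    return [s / Z for s in scaled]

alpha = [0.25, 0.25, 0.25, 0.25]
grad = [1.0, 0.0, -1.0, 0.0]  # mass flows toward coordinates with small gradient
alpha = eg_step(alpha, grad, eta=0.5)
assert abs(sum(alpha) - 1.0) < 1e-12 and all(a > 0 for a in alpha)
assert alpha[2] == max(alpha)  # the most negative gradient gains the most mass
```

Because the update is multiplicative, a coordinate started at zero would stay at zero; this is why the algorithms above initialize in the interior of the simplex.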
In the general problem we assume a positive semi-definite matrix A ∈ R^{m×m} and a vector b ∈ R^m, specifying a loss function Q(α) = b · α + (1/2) α′Aα. Here α is an m-dimensional vector of reals. We assume that α is formed by the concatenation of n vectors α_i ∈ R^{m_i} for i = 1…n, where Σ_i m_i = m. We assume that each α_i lies in a simplex of dimension m_i, so that the feasible set is

  Δ = {α ∈ R^m : for i = 1…n, Σ_{j=1}^{m_i} α_{i,j} = 1; for all i, j, α_{i,j} ≥ 0}.  (1)

Our aim is to find argmin_{α ∈ Δ} Q(α). Figure 3 gives an algorithm, the EG-QP algorithm, for finding the minimum. In the next section we give a proof of its convergence properties.

The EG-QP algorithm can be used to find the minimum of −J(α), and hence the maximum of the dual objective J(α). We justify the algorithm in Figure 2 by showing that it is equivalent to minimization of −J(α) using the EG-QP algorithm. We give the following theorem:

Theorem 1 Define J(α) = C Σ_{i,y} α_{i,y} L_{i,y} − (C²/2) Σ_{i,y} Σ_{j,z} α_{i,y} α_{j,z} Ψ_{i,y} · Ψ_{j,z}, and assume, as in section 2, that L_{i,y} = Σ_{r ∈ R(x_i,y)} l(x_i, y_i, r) and Φ(x, y) = Σ_{r ∈ R(x,y)} φ(x, r). Consider the sequence α(θ^1), …, α(θ^{T+1}) defined by the algorithm in Figure 2, and the sequence α^1, …, α^{T+1} defined by the EG-QP algorithm when applied to Q(α) = −J(α). Then under the assumption that α(θ^1) = α^1, it follows that α(θ^t) = α^t for t = 1…(T+1).

Inputs: A positive semi-definite matrix A and a vector b, specifying a loss function Q(α) = b · α + (1/2) α′Aα. Each vector α_i lies in a simplex; the feasible set Δ is defined in Eq. 1.
Algorithm:
  Initialize α^1 to a point in the interior of Δ. Choose a learning rate η > 0.
  For t = 1…T:
    Calculate s^t = ∇Q(α^t) = b + Aα^t.
    Calculate α^{t+1} from α^t as: α^{t+1}_{i,j} = α^t_{i,j} exp(−η s^t_{i,j}) / Σ_k α^t_{i,k} exp(−η s^t_{i,k}).
Output: Return α^{T+1}.

Figure 3: The EG-QP algorithm for quadratic programs.

Proof. We can write J(α) = C Σ_{i,y} α_{i,y} L_{i,y} − (1/2)‖C Σ_{i,y} α_{i,y} Ψ_{i,y}‖². It follows that

  ∇_{i,y} Q(α^t) = −C L_{i,y} + C Ψ_{i,y} · w^t = C Φ(x_i, y_i) · w^t − C Σ_{r ∈ R(x_i,y)} (l_r + φ_r · w^t),

where as before w^t = C Σ_i (Φ(x_i, y_i) − Σ_y α^t_{i,y} Φ(x_i, y)). The rest of the proof proceeds by induction; due to space constraints we give a sketch here. The idea is to show that α(θ^{t+1}) = α^{t+1} under the inductive hypothesis that α(θ^t) = α^t. This follows from the definitions of the mappings α(θ^t) → α(θ^{t+1}) and α^t → α^{t+1} in the two algorithms, together with the identity above and the update θ^{t+1}_{i,r} = θ^t_{i,r} + η (l_r + φ_r · w^t): terms such as C Φ(x_i, y_i) · w^t that are constant across y ∈ G(x_i) cancel in the normalization, and the constant factor C can be absorbed into the learning rate η.

4.1 Convergence of the exponentiated gradient QP algorithm

The following theorem shows how the optimization algorithm converges to an optimal solution. The theorem compares the value of the objective function for the algorithm's vector α^t to the value for a comparison vector u ∈ Δ. (For example, consider u as the solution of the QP.) The convergence result is in terms of several properties of the algorithm and the comparison vector u. The distance between u and α^1 is measured using the Kullback-Leibler (KL) divergence. Recall that the KL divergence between two probability vectors u, v is defined as D(u, v) = Σ_j u_j log(u_j / v_j). For sequences of probability vectors u, v ∈ Δ, with u = (u_1, …, u_n) and u_i = (u_{i,1}, …, u_{i,m_i}), we define a divergence as the sum of KL divergences: D̄(u, v) = Σ_{i=1}^n D(u_i, v_i).
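The EG-QP algorithm of Figure 3 can be exercised on a small synthetic QP (our toy instance, in pure Python; the matrix A, vector b, and block structure below are made up for illustration). With a suitably small η the objective Q is non-increasing, as the analysis in this section predicts:

```python
import math

def eg_qp(A, b, blocks, eta, T):
    """Minimize Q(alpha) = b.alpha + 0.5*alpha'A alpha over a product of simplices.
    `blocks` lists the coordinate index sets of the simplices partitioning alpha."""
    m = len(b)
    alpha = [0.0] * m
    for blk in blocks:  # start in the interior of Delta: uniform per block
        for j in blk:
            alpha[j] = 1.0 / len(blk)
    history = []
    for _ in range(T):
        q = sum(b[j] * alpha[j] for j in range(m)) + 0.5 * sum(
            alpha[j] * A[j][k] * alpha[k] for j in range(m) for k in range(m))
        history.append(q)
        grad = [b[j] + sum(A[j][k] * alpha[k] for k in range(m)) for j in range(m)]
        for blk in blocks:  # multiplicative update, renormalized per simplex
            scaled = [alpha[j] * math.exp(-eta * grad[j]) for j in blk]
            Z = sum(scaled)
            for j, s in zip(blk, scaled):
                alpha[j] = s / Z
    return alpha, history

# Toy PSD matrix A = I and a linear term; two simplices of size 2.
A = [[1.0 if j == k else 0.0 for k in range(4)] for j in range(4)]
b = [0.5, -0.5, 0.0, 1.0]
alpha, hist = eg_qp(A, b, blocks=[[0, 1], [2, 3]], eta=0.2, T=50)
assert all(h2 <= h1 + 1e-12 for h1, h2 in zip(hist, hist[1:]))  # Q non-increasing
```

Here the largest eigenvalue of A is λ = 1, so η = 0.2 is well inside the stable range.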
Two other key parameters are λ, the largest eigenvalue of the positive semi-definite symmetric matrix A, and

  B = max_{α ∈ Δ} max_i ( max_j (∇Q(α))_{i,j} − min_j (∇Q(α))_{i,j} ).

Theorem 2 For all u ∈ Δ,

  (1/T) Σ_{t=1}^T Q(α^t) ≤ Q(u) + D̄(u, α^1)/(ηT) + (e^{2ηB}/(2 − ηλe^{2ηB})) (Q(α^1) − Q(α^{T+1}))/T.

Choosing η ≤ 1/(2(λ + B)) ensures that

  Q(α^{T+1}) ≤ (1/T) Σ_{t=1}^T Q(α^t) ≤ Q(u) + D̄(u, α^1)/(ηT) + 2e (Q(α^1) − Q(α^{T+1}))/T.

The first lemma we require is due to Kivinen and Warmuth [8].

Lemma 1 For any u ∈ Δ,

  η (Q(α^t) − Q(u)) ≤ D̄(u, α^t) − D̄(u, α^{t+1}) + D̄(α^t, α^{t+1}).

We focus on the third term. Define ∇_i Q(α) as the segment of the gradient vector corresponding to the component α_i of α, and define the random variable X_{t,i} taking the value (∇_i Q(α^t))_j with probability α^t_{i,j}.

Lemma 2

  D̄(α^t, α^{t+1}) ≤ (η²/2) Σ_{i=1}^n var(X_{t,i}).

Proof.

  D̄(α^t, α^{t+1}) = Σ_{i=1}^n Σ_j α^t_{i,j} log(α^t_{i,j} / α^{t+1}_{i,j})
                 = Σ_{i=1}^n ( η E[X_{t,i}] + log E[exp(−η X_{t,i})] )
                 = Σ_{i=1}^n log E[exp(−η (X_{t,i} − E[X_{t,i}]))]
                 ≤ Σ_{i=1}^n log(1 + (η²/2) var(X_{t,i}))
                 ≤ (η²/2) Σ_{i=1}^n var(X_{t,i}).

The last inequality is at the heart of the proof of Bernstein's inequality; e.g., see [10].

The second part of the proof of the theorem involves bounding this variance in terms of the loss. The following lemma relies on the fact that this variance is, to first order, the decrease in the quadratic loss, and that the second-order term in the Taylor series expansion of the loss is small compared to the variance, provided the steps are not too large. The lemma and its proof require several definitions. For any i, let σ_i : R^{m_i} → (0, 1)^{m_i} be the softmax function, (σ_i(ξ))_j = exp(ξ_j) / Σ_k exp(ξ_k), for ξ ∈ R^{m_i}. We shall work in the exponential parameter space: let ξ^t be the exponential parameters at step t, so that the updates are ξ^{t+1} = ξ^t − η ∇Q(α^t), and the QP variables satisfy α^t_i = σ_i(ξ^t_i). Define the random variables X_{ξ,i} taking the value (∇_i Q(α^t))_j with probability (σ_i(ξ_i))_j. This takes the same values as X_{t,i}, but its distribution is given by a different exponential parameter (ξ instead of ξ^t). Define ξ^τ = (1 − τ) ξ^t + τ ξ^{t+1} for τ ∈ [0, 1].

Lemma 3 For some τ ∈ [0, 1],

  η (1 − ηλe^{2ηB}/2) Σ_{i=1}^n var(X_{ξ^τ,i}) ≤ Q(α^t) − Q(α^{t+1}),

and for all τ ∈ [0, 1], var(X_{t,i}) ≤ e^{2ηB} var(X_{ξ^τ,i}). Hence

  Σ_{i=1}^n var(X_{t,i}) ≤ (e^{2ηB} / (η (1 − ηλe^{2ηB}/2))) (Q(α^t) − Q(α^{t+1})).

Thus, for ηλe^{2ηB} < 2, Q(α^t) is non-increasing in t. The proof is in [1]. Theorem 2 follows from an easy calculation.

5 Experiments

We compared an online version (see footnote 1) of the exponentiated gradient algorithm with the factored Sequential Minimal Optimization (SMO) algorithm of [12] on a sequence-segmentation task. We selected the first 1,000 sentences (12K words) of the CoNLL-2003 named-entity recognition challenge data set for our experiment. The goal is to extract (multiword) entity names of people, organizations, locations and miscellaneous entities. Each word is labelled by one of 9 possible tags (beginning of one of the four entity types, continuation of one of the types, or not-an-entity).
We trained a first-order Markov chain over the tags,

Footnote 1: In the online algorithm we calculate marginal terms, and updates to the w^t parameters, one training example at a time. As yet we do not have convergence bounds for this method, but we have found that it works well in practice.

[Figure 4: Number of iterations over the training set vs. dual objective for the SMO and EG algorithms. (a) Comparison with different η values (EG with η = 0.5 and η = 1); (b) comparison with η = 1 and different initial θ values (−2.7, −3, −4.5).]

where our cliques are just the nodes for the tag of each word and the edges between the tags of consecutive words. The feature vector for each node assignment consists of the word itself, its capitalization and morphological features, etc., as well as the previous and consecutive words and their features. Likewise, the feature vector for each edge assignment consists of the two words and their features as well as surrounding words.

Figure 4 shows the growth of the dual objective function after each pass through the data for SMO and EG, for several settings of the learning rate η and the initial parameter values θ^1. Note that SMO starts up very quickly but then slows down in a suboptimal region, while EG lags at the start but overtakes SMO, achieving a more than 10% increase in the value of the objective. These preliminary results suggest that a hybrid algorithm could get the benefits of both, by starting out with several SMO updates and then switching to EG. The key issue is to switch from the marginal representation μ that SMO maintains to the Gibbs representation θ that EG uses. We can find a θ that produces μ by first computing conditional probabilities that correspond to our marginals (e.g. dividing edge marginals by node marginals in this case) and then letting the θ's be the logs of these conditional probabilities.

References

[1] Long version of this paper. Available at http://www.ai.mit.edu/people/mcollins.
[2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In ICML, 2003.
[3] Michael Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods.
In Harry Bunt, John Carroll, and Giorgio Satta, editors, New Developments in Parsing Technology. Kluwer, 2004.
[4] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[5] N. Cristianini, C. Campbell, and J. Shawe-Taylor. Multiplicative updatings for support-vector learning. Technical report, NeuroCOLT2, 1998.
[6] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[7] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
[8] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3):301–329, 2001.
[9] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, 2001.
[10] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.
[11] F. Sha, L. Saul, and D. Lee. Multiplicative updates for large margin classifiers. In COLT, 2003.
[12] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[13] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.