Structured Prediction Models via the Matrix-Tree Theorem


Terry Koo, Amir Globerson, Xavier Carreras and Michael Collins
MIT CSAIL, Cambridge, MA 02139, USA

Abstract

This paper provides an algorithmic framework for learning statistical models involving directed spanning trees, or equivalently non-projective dependency structures. We show how partition functions and marginals for directed spanning trees can be computed by an adaptation of Kirchhoff's Matrix-Tree Theorem. To demonstrate an application of the method, we perform experiments which use the algorithm in training both log-linear and max-margin dependency parsers. The new training methods give improvements in accuracy over perceptron-trained models.

1 Introduction

Learning with structured data typically involves searching or summing over a set with an exponential number of structured elements, for example the set of all parse trees for a given sentence. Methods for summing over such structures include the inside-outside algorithm for probabilistic context-free grammars (Baker, 1979), the forward-backward algorithm for hidden Markov models (Baum et al., 1970), and the belief-propagation algorithm for graphical models (Pearl, 1988). These algorithms compute marginal probabilities and partition functions, quantities which are central to many methods for the statistical modeling of complex structures (e.g., the EM algorithm (Baker, 1979; Baum et al., 1970), contrastive estimation (Smith and Eisner, 2005), training algorithms for CRFs (Lafferty et al., 2001), and training algorithms for max-margin models (Bartlett et al., 2004; Taskar et al., 2004a)).

This paper describes inside-outside-style algorithms for the case of directed spanning trees. These structures are equivalent to non-projective dependency parses (McDonald et al., 2005b), and more generally could be relevant to any task that involves learning a mapping from a graph to an underlying spanning tree. Unlike the case for projective dependency structures, partition functions and marginals for non-projective trees cannot be computed using dynamic-programming methods such as the inside-outside algorithm. In this paper we describe how these quantities can be computed by adapting a well-known result in graph theory: Kirchhoff's Matrix-Tree Theorem (Tutte, 1984). A naïve application of the theorem yields $O(n^4)$ and $O(n^6)$ algorithms for computation of the partition function and marginals, respectively. However, our adaptation finds the partition function and marginals in $O(n^3)$ time using simple matrix determinant and inversion operations.

We demonstrate an application of the new inference algorithm to non-projective dependency parsing. Specifically, we show how to implement two popular supervised learning approaches for this task: globally-normalized log-linear models and max-margin models. Log-linear estimation critically depends on the calculation of partition functions and marginals, which can be computed by our algorithms. For max-margin models, Bartlett et al. (2004) have provided a simple training algorithm, based on exponentiated-gradient (EG) updates, that requires computation of marginals and can thus be implemented within our framework. Both of these methods explicitly minimize the loss incurred when parsing the entire training set. This contrasts with the online learning algorithms used in previous work with spanning-tree models (McDonald et al., 2005b).
We applied the above two marginal-based training algorithms to six languages with varying degrees of non-projectivity, using datasets obtained from the CoNLL-X shared task (Buchholz and Marsi, 2006). Our experimental framework compared three training approaches: log-linear models, max-margin models, and the averaged perceptron. Each of these was applied to both projective and non-projective parsing. Our results demonstrate that marginal-based training yields models which outperform those trained using the averaged perceptron.

In summary, the contributions of this paper are:

1. We introduce algorithms for inside-outside-style calculations for directed spanning trees, or equivalently non-projective dependency structures. These algorithms should have wide applicability in learning problems involving spanning-tree structures.

2. We illustrate the utility of these algorithms in log-linear training of dependency parsing models, and show improvements in accuracy when compared to averaged-perceptron training.

3. We also train max-margin models for dependency parsing via an EG algorithm (Bartlett et al., 2004). The experiments presented here constitute the first application of this algorithm to a large-scale problem. We again show improved performance over the perceptron.

The goal of our experiments is to give a rigorous comparative study of the marginal-based training algorithms and a highly-competitive baseline, the averaged perceptron, using the same feature sets for all approaches. We stress, however, that the purpose of this work is not to give competitive performance on the CoNLL data sets; this would require further engineering of the approach. Similar adaptations of the Matrix-Tree Theorem have been developed independently and simultaneously by Smith and Smith (2007) and McDonald and Satta (2007); see Section 5 for more discussion.

2 Background

2.1 Discriminative Dependency Parsing

Dependency parsing is the task of mapping a sentence x to a dependency structure y. Given a sentence x with n words, a dependency for that sentence is a tuple (h, m) where $h \in [0 \ldots n]$ is the index of the head word in the sentence, and $m \in [1 \ldots n]$ is the index of a modifier word. The value h = 0 is a special root-symbol that may only appear as the head of a dependency. We use D(x) to refer to all possible dependencies for a sentence x: $D(x) = \{(h, m) : h \in [0 \ldots n],\ m \in [1 \ldots n]\}$.

A dependency parse is a set of dependencies that forms a directed tree, with the sentence's root-symbol as its root. We will consider both projective trees, where dependencies are not allowed to cross, and non-projective trees, where crossing dependencies are allowed. Dependency annotations for some languages, for example Czech, can exhibit a significant number of crossing dependencies. In addition, we consider both single-root and multi-root trees. In a single-root tree y, the root-symbol has exactly one child, while in a multi-root tree, the root-symbol has one or more children. This distinction is relevant as our training sets include both single-root corpora (in which all trees are single-root structures) and multi-root corpora (in which some trees are multi-root structures). The two distinctions described above are orthogonal, yielding four classes of dependency structures; see Figure 1 for examples of each kind of structure.

Figure 1: Examples of the four types of dependency structures, drawn over the sentence "He saw her": projective vs. non-projective, each with a single-root or multi-root analysis. We draw dependency arcs from head to modifier.

We use $T_p^s(x)$ to denote the set of all possible projective single-root dependency structures for a sentence x, and $T_{np}^s(x)$ to denote the set of single-root non-projective structures for x. The sets $T_p^m(x)$ and $T_{np}^m(x)$ are defined analogously for multi-root structures. In contexts where any class of dependency structures may be used, we use the notation T(x) as a placeholder that may be defined as $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$.

Following McDonald et al. (2005a), we use a discriminative model for dependency parsing. Features in the model are defined through a function f(x, h, m) which maps a sentence x together with a dependency (h, m) to a feature vector in $\mathbb{R}^d$. A feature vector can be sensitive to any properties of the triple (x, h, m). Given a parameter vector w, the optimal dependency structure for a sentence x is

$$y^*(x; w) = \operatorname*{argmax}_{y \in T(x)} \sum_{(h,m) \in y} w \cdot f(x, h, m) \quad (1)$$

where the set T(x) can be defined as $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$, depending on the type of parsing.
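As a concrete illustration of the arc-factored scoring in Eq. 1, the sketch below builds a matrix of dependency scores $\theta_{h,m} = w \cdot f(x, h, m)$ from a weight vector and a feature function, and evaluates the objective of Eq. 1 for one candidate structure y. The helper `dep_features` is a hypothetical stand-in of our own, not the feature set of McDonald et al. (2005a).

```python
import numpy as np

def dep_features(x, h, m, d=8):
    """Hypothetical stand-in feature map: hashes a few properties of (x, h, m)
    into a d-dimensional 0/1 vector. Real parsers use rich, lexicalized features."""
    f = np.zeros(d)
    head = "<root>" if h == 0 else x[h - 1]
    f[hash((head, x[m - 1])) % d] = 1.0              # head-modifier word pair
    f[hash(("dist", min(abs(h - m), 5))) % d] = 1.0  # bucketed head-modifier distance
    return f

def score_matrix(x, w):
    """theta[h, m] = w . f(x, h, m) for all (h, m) in D(x), with h in 0..n and m in 1..n."""
    n = len(x)
    theta = np.full((n + 1, n + 1), -np.inf)
    for h in range(n + 1):
        for m in range(1, n + 1):
            if h != m:
                theta[h, m] = w @ dep_features(x, h, m)
    return theta

def tree_score(theta, y):
    """Score of a dependency structure y, given as a set of (h, m) arcs (the objective in Eq. 1)."""
    return sum(theta[h, m] for (h, m) in y)

x = ["He", "saw", "her"]
w = np.random.randn(8)
theta = score_matrix(x, w)
y = {(0, 2), (2, 1), (2, 3)}   # root -> saw, saw -> He, saw -> her
print(tree_score(theta, y))
```

Decoding (Eq. 1 itself) would then search for the highest-scoring y in T(x), e.g. with the CLE algorithm in the non-projective case discussed below.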

The parameters w will be learned from a training set $\{(x_i, y_i)\}_{i=1}^N$, where each $x_i$ is a sentence and each $y_i$ is a dependency structure. Much of the previous work on learning w has focused on training local models (see Section 5). McDonald et al. (2005a; 2005b) trained global models using online algorithms such as the perceptron algorithm or MIRA. In this paper we consider training algorithms based on work in conditional random fields (CRFs) (Lafferty et al., 2001) and max-margin methods (Taskar et al., 2004a).

2.2 Three Inference Problems

This section highlights three inference problems which arise in training and decoding discriminative dependency parsers, and which are central to the approaches described in this paper. Assume that we have a vector θ with values $\theta_{h,m} \in \mathbb{R}$ for all $(h, m) \in D(x)$; these values correspond to weights on the different dependencies in D(x). Define a conditional distribution over all dependency structures $y \in T(x)$ as follows:

$$P(y \mid x; \theta) = \frac{\exp\left\{\sum_{(h,m) \in y} \theta_{h,m}\right\}}{Z(x; \theta)} \quad (2)$$

$$Z(x; \theta) = \sum_{y \in T(x)} \exp\left\{\sum_{(h,m) \in y} \theta_{h,m}\right\} \quad (3)$$

The function Z(x; θ) is commonly referred to as the partition function. Given the distribution P(y | x; θ), we can define the marginal probability of a dependency (h, m) as

$$\mu_{h,m}(x; \theta) = \sum_{y \in T(x) :\, (h,m) \in y} P(y \mid x; \theta)$$

The inference problems are then as follows:

Problem 1: Decoding: Find $\operatorname{argmax}_{y \in T(x)} \sum_{(h,m) \in y} \theta_{h,m}$.

Problem 2: Computation of the Partition Function: Calculate Z(x; θ).

Problem 3: Computation of the Marginals: For all $(h, m) \in D(x)$, calculate $\mu_{h,m}(x; \theta)$.

Note that all three problems require a maximization or summation over the set T(x), which is exponential in size. There is a clear motivation for being able to solve Problem 1: by setting $\theta_{h,m} = w \cdot f(x, h, m)$, the optimal dependency structure $y^*(x; w)$ (see Eq. 1) can be computed. In this paper the motivation for solving Problems 2 and 3 arises from training algorithms for discriminative models. As we will describe in Section 4, both log-linear and max-margin models can be trained via methods that make direct use of algorithms for Problems 2 and 3.

In the case of projective dependency structures (i.e., T(x) defined as $T_p^s(x)$ or $T_p^m(x)$), there are well-known algorithms for all three inference problems. Decoding can be carried out using Viterbi-style dynamic-programming algorithms, for example the $O(n^3)$ algorithm of Eisner (1996). Computation of the marginals and partition function can also be achieved in $O(n^3)$ time, using a variant of the inside-outside algorithm (Baker, 1979) applied to the Eisner (1996) data structures (Paskin, 2001). In the non-projective case (i.e., T(x) defined as $T_{np}^s(x)$ or $T_{np}^m(x)$), McDonald et al. (2005b) describe how the CLE algorithm (Chu and Liu, 1965; Edmonds, 1967) can be used for decoding. However, it is not possible to compute the marginals and partition function using the inside-outside algorithm. We next describe a method for computing these quantities in $O(n^3)$ time using matrix inverse and determinant operations.
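For very short sentences, Problems 2 and 3 can be solved directly from the definitions in Eqs. 2-3 by enumerating every single-root non-projective structure. The brute-force sketch below does exactly that; it is exponential in n and useful only as a correctness check for the Matrix-Tree computations of Section 3. It assumes a dense score matrix `theta[h, m]`, as built in the previous sketch.

```python
import itertools
import numpy as np

def single_root_trees(n):
    """Enumerate all y in T^s_np(x) for an n-word sentence: one head per word,
    exactly one child of the root-symbol, and no cycles."""
    for heads in itertools.product(range(n + 1), repeat=n):   # heads[m-1] is the head of word m
        if sum(h == 0 for h in heads) != 1:
            continue                                           # single-root constraint
        ok = True
        for m in range(1, n + 1):                              # reject cycles: every word must reach the root
            seen, cur = set(), m
            while cur != 0:
                if cur in seen:
                    ok = False
                    break
                seen.add(cur)
                cur = heads[cur - 1]
            if not ok:
                break
        if ok:
            yield {(heads[m - 1], m) for m in range(1, n + 1)}

def brute_force_inference(theta, n):
    """Partition function Z(x; theta) and marginals mu[h, m] by explicit summation (Eqs. 2-3)."""
    Z = 0.0
    mu = np.zeros((n + 1, n + 1))
    for y in single_root_trees(n):
        weight = np.exp(sum(theta[h, m] for (h, m) in y))
        Z += weight
        for (h, m) in y:
            mu[h, m] += weight
    return Z, mu / Z

n = 3
theta = np.random.randn(n + 1, n + 1)
Z, mu = brute_force_inference(theta, n)
print(Z, mu[0, 1])
```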
3 Spanning-tree inference using the Matrix-Tree Theorem

In this section we present algorithms for computing the partition function and marginals, as defined in Section 2.2, for non-projective parsing. We first reiterate the observation of McDonald et al. (2005a) that non-projective parses correspond to directed spanning trees on a complete directed graph of n nodes, where n is the length of the sentence. The above inference problems thus involve summation over the set of all directed spanning trees. Note that this set is exponentially large, and there is no obvious method for decomposing the sum into dynamic-programming-like subproblems. This section describes how a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984) can be used to evaluate the partition function and marginals efficiently. In what follows, we consider the single-root setting (i.e., $T(x) = T_{np}^s(x)$), leaving the multi-root case (i.e., $T(x) = T_{np}^m(x)$) to Section 3.3.

For a sentence x with n words, define a complete directed graph G on n nodes, where each node corresponds to a word in x, and each edge corresponds to a dependency between two words in x. Note that G does not include the root-symbol h = 0, nor does it account for any dependencies (0, m) headed by the root-symbol. We assign non-negative weights to the edges of this graph, yielding the following weighted adjacency matrix $A(\theta) \in \mathbb{R}^{n \times n}$, for $h, m = 1 \ldots n$:

$$A_{h,m}(\theta) = \begin{cases} 0 & \text{if } h = m \\ \exp\{\theta_{h,m}\} & \text{otherwise} \end{cases}$$

To account for the dependencies (0, m) headed by the root-symbol, we define a vector of root-selection scores $r(\theta) \in \mathbb{R}^n$, for $m = 1 \ldots n$:

$$r_m(\theta) = \exp\{\theta_{0,m}\}$$

Let the weight of a dependency structure $y \in T_{np}^s(x)$ be defined as:

$$\psi(y; \theta) = r_{\mathrm{root}(y)}(\theta) \prod_{(h,m) \in y :\, h \neq 0} A_{h,m}(\theta)$$

Here, $\mathrm{root}(y) = m : (0, m) \in y$ is the child of the root-symbol; there is exactly one such child, since $y \in T_{np}^s(x)$. Eq. 2 and 3 can be rephrased as:

$$P(y \mid x; \theta) = \frac{\psi(y; \theta)}{Z(x; \theta)} \quad (4)$$

$$Z(x; \theta) = \sum_{y \in T_{np}^s(x)} \psi(y; \theta) \quad (5)$$

In the remainder of this section, we drop the notational dependence on x for brevity.

The original Matrix-Tree Theorem addressed the problem of counting the number of undirected spanning trees in an undirected graph. For the models we study here, we require a sum of weighted and directed spanning trees. Tutte (1984) extended the Matrix-Tree Theorem to this case. We briefly summarize his method below. First, define the Laplacian matrix $L(\theta) \in \mathbb{R}^{n \times n}$ of G, for $h, m = 1 \ldots n$:

$$L_{h,m}(\theta) = \begin{cases} \sum_{h'=1}^{n} A_{h',m}(\theta) & \text{if } h = m \\ -A_{h,m}(\theta) & \text{otherwise} \end{cases}$$

Second, for a matrix X, let $X^{(h,m)}$ be the minor of X with respect to row h and column m; i.e., the determinant of the matrix formed by deleting row h and column m from X. Finally, define the weight of any directed spanning tree of G to be the product of the weights $A_{h,m}(\theta)$ for the edges in that tree.

Theorem 1 (Tutte, 1984, p. 140). Let L(θ) be the Laplacian matrix of G. Then $L^{(m,m)}(\theta)$ is equal to the sum of the weights of all directed spanning trees of G which are rooted at m.

Furthermore, the minors vary only in sign when traversing the columns of the Laplacian (Tutte, 1984, p. 150):

$$\forall h, m: \quad (-1)^{h+m} L^{(h,m)}(\theta) = L^{(m,m)}(\theta) \quad (6)$$

3.1 Partition functions via matrix determinants

From Theorem 1, it directly follows that

$$L^{(m,m)}(\theta) = \sum_{y \in U(m)} \prod_{(h,m') \in y :\, h \neq 0} A_{h,m'}(\theta)$$

where $U(m) = \{y \in T_{np}^s : \mathrm{root}(y) = m\}$. A naïve method for computing the partition function is therefore to evaluate

$$Z(\theta) = \sum_{m=1}^{n} r_m(\theta)\, L^{(m,m)}(\theta)$$

The above would require calculating n determinants, resulting in $O(n^4)$ complexity. However, as we show below, Z(θ) may be obtained in $O(n^3)$ time using a single determinant evaluation. Define a new matrix $\hat{L}(\theta)$ to be L(θ) with the first row replaced by the root-selection scores:

$$\hat{L}_{h,m}(\theta) = \begin{cases} r_m(\theta) & h = 1 \\ L_{h,m}(\theta) & h > 1 \end{cases}$$

This matrix allows direct computation of the partition function, as the following proposition shows.

Proposition 1 The partition function in Eq. 5 is given by $Z(\theta) = |\hat{L}(\theta)|$.

Proof: Consider the row expansion of $|\hat{L}(\theta)|$ with respect to row 1:

$$|\hat{L}(\theta)| = \sum_{m=1}^{n} (-1)^{1+m}\, \hat{L}_{1,m}(\theta)\, \hat{L}^{(1,m)}(\theta) = \sum_{m=1}^{n} (-1)^{1+m}\, r_m(\theta)\, L^{(1,m)}(\theta) = \sum_{m=1}^{n} r_m(\theta)\, L^{(m,m)}(\theta) = Z(\theta)$$

The second line follows from the construction of $\hat{L}(\theta)$, and the third line follows from Eq. 6.
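The sketch below is a minimal numpy rendering of this construction: it assembles A(θ), r(θ), the Laplacian L(θ), and the matrix L̂(θ) with its first row replaced by the root-selection scores, then computes Z(θ) as a single determinant (Proposition 1). For small n the result can be checked against the brute-force enumeration sketched earlier; for long sentences one would work with scaled or log-domain quantities to avoid overflow in exp(θ), which is omitted here.

```python
import numpy as np

def build_L_hat(theta, n):
    """Construct A(theta), r(theta) and the matrix L_hat(theta) of Proposition 1
    from a dense score matrix theta[h, m] with h = 0 denoting the root-symbol."""
    A = np.exp(theta[1:n + 1, 1:n + 1])      # A[h-1, m-1] = exp(theta_{h,m}) for h, m >= 1
    np.fill_diagonal(A, 0.0)                 # A_{h,h} = 0
    r = np.exp(theta[0, 1:n + 1])            # root-selection scores r_m = exp(theta_{0,m})
    L = np.diag(A.sum(axis=0)) - A           # L_{m,m} = sum_{h'} A_{h',m};  L_{h,m} = -A_{h,m}
    L_hat = L.copy()
    L_hat[0, :] = r                          # first row replaced by the root-selection scores
    return A, r, L_hat

def partition_function(theta, n):
    """Z(theta) = |L_hat(theta)| (Proposition 1)."""
    _, _, L_hat = build_L_hat(theta, n)
    return np.linalg.det(L_hat)

n = 3
theta = np.random.randn(n + 1, n + 1)
print(partition_function(theta, n))          # should agree with the brute-force Z above
```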

3.2 Marginals via matrix inversion

The marginals we require are given by

$$\mu_{h,m}(\theta) = \frac{1}{Z(\theta)} \sum_{y \in T_{np}^s :\, (h,m) \in y} \psi(y; \theta)$$

To calculate these marginals efficiently for all values of (h, m) we use a well-known identity relating the log partition-function to marginals:

$$\mu_{h,m}(\theta) = \frac{\partial \log Z(\theta)}{\partial \theta_{h,m}}$$

Since the partition function in this case has a closed-form expression (i.e., the determinant of a matrix constructed from θ), the marginals can also be obtained in closed form. Using the chain rule, the derivative of the log partition-function in Proposition 1 is

$$\mu_{h,m}(\theta) = \frac{\partial \log |\hat{L}(\theta)|}{\partial \theta_{h,m}} = \sum_{h'=1}^{n} \sum_{m'=1}^{n} \frac{\partial \log |\hat{L}(\theta)|}{\partial \hat{L}_{h',m'}(\theta)} \frac{\partial \hat{L}_{h',m'}(\theta)}{\partial \theta_{h,m}}$$

To perform the derivative, we use the identity $\frac{\partial \log |X|}{\partial X} = (X^{-1})^T$ and the fact that $\partial \hat{L}_{h',m'}(\theta) / \partial \theta_{h,m}$ is nonzero for only a few $h', m'$. Specifically, when h = 0, the marginals are given by

$$\mu_{0,m}(\theta) = r_m(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,1}$$

and for h > 0, the marginals are given by

$$\mu_{h,m}(\theta) = (1 - \delta_{1,m})\, A_{h,m}(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,m} - (1 - \delta_{h,1})\, A_{h,m}(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,h}$$

where $\delta_{h,m}$ is the Kronecker delta. Thus, the complexity of evaluating all the relevant marginals is dominated by the matrix inversion, and the total complexity is therefore $O(n^3)$.

3.3 Multiple Roots

In the case of multiple roots, we can still compute the partition function and marginals efficiently. In fact, the derivation of this case is simpler than for single-root structures. Create an extended graph G′ which augments G with a dummy root node that has edges pointing to all of the existing nodes, weighted by the appropriate root-selection scores. Note that there is a bijection between directed spanning trees of G′ rooted at the dummy root and multi-root structures $y \in T_{np}^m(x)$. Thus, Theorem 1 can be used to compute the partition function directly: construct a Laplacian matrix for G′ and compute the minor $L^{(0,0)}(\theta)$. Since this minor is also a determinant, the marginals can be obtained analogously to the single-root case. More concretely, this technique corresponds to defining the matrix $\hat{L}(\theta)$ as

$$\hat{L}(\theta) = L(\theta) + \mathrm{diag}(r(\theta))$$

where diag(v) is the diagonal matrix with the vector v on its diagonal.

3.4 Labeled Trees

The techniques above extend easily to the case where dependencies are labeled. For a model with L different labels, it suffices to define the edge and root scores as $A_{h,m}(\theta) = \sum_{l=1}^{L} \exp\{\theta_{h,m,l}\}$ and $r_m(\theta) = \sum_{l=1}^{L} \exp\{\theta_{0,m,l}\}$. The partition function over labeled trees is obtained by operating on these values as described previously, and the marginals are given by an application of the chain rule. Both inference problems are solvable in $O(n^3 + Ln^2)$ time.
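Continuing the numpy sketches above, the closed-form single-root marginals translate directly into one matrix inversion. The multi-root variant below assembles L̂(θ) = L(θ) + diag(r(θ)) as in Section 3.3; its explicit marginal formulas are not written out in the text, so the ones used here are our own derivation from the same ∂ log|L̂(θ)| / ∂θ identity. Both functions are unoptimized sketches working directly with exp(θ).

```python
import numpy as np

def _edge_scores(theta, n):
    """A(theta) and r(theta) from a dense score matrix theta[h, m] (h = 0 is the root-symbol)."""
    A = np.exp(theta[1:n + 1, 1:n + 1])
    np.fill_diagonal(A, 0.0)
    r = np.exp(theta[0, 1:n + 1])
    return A, r

def marginals_single_root(theta, n):
    """mu[h, m] for single-root structures, via the closed-form expressions of Section 3.2."""
    A, r = _edge_scores(theta, n)
    L_hat = np.diag(A.sum(axis=0)) - A
    L_hat[0, :] = r                                 # first row replaced by root-selection scores
    L_inv = np.linalg.inv(L_hat)
    mu = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        mu[0, m] = r[m - 1] * L_inv[m - 1, 0]       # mu_{0,m} = r_m [L_hat^{-1}]_{m,1}
        for h in range(1, n + 1):
            if h == m:
                continue
            mu[h, m] = A[h - 1, m - 1] * ((m != 1) * L_inv[m - 1, m - 1]
                                          - (h != 1) * L_inv[m - 1, h - 1])
    return mu

def marginals_multi_root(theta, n):
    """Multi-root variant: L_hat = L + diag(r) (Section 3.3). The marginal formulas here
    are derived from d log|L_hat| / d theta; the paper states them only implicitly."""
    A, r = _edge_scores(theta, n)
    L_inv = np.linalg.inv(np.diag(A.sum(axis=0)) - A + np.diag(r))
    mu = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        mu[0, m] = r[m - 1] * L_inv[m - 1, m - 1]
        for h in range(1, n + 1):
            if h != m:
                mu[h, m] = A[h - 1, m - 1] * (L_inv[m - 1, m - 1] - L_inv[m - 1, h - 1])
    return mu
```

For labeled parsing (Section 3.4), only `_edge_scores` changes: each entry becomes a sum of exp(θ_{h,m,l}) over labels, and per-label marginals follow by one more application of the chain rule.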

4 Training Algorithms

This section describes two methods for parameter estimation that rely explicitly on the computation of the partition function and marginals.

4.1 Log-Linear Estimation

In conditional log-linear models (Johnson et al., 1999; Lafferty et al., 2001), a distribution over parse trees for a sentence x is defined as follows:

$$P(y \mid x; w) = \frac{\exp\left\{\sum_{(h,m) \in y} w \cdot f(x, h, m)\right\}}{Z(x; w)} \quad (7)$$

where Z(x; w) is the partition function, a sum over $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$. We train the model using the approach described by Sha and Pereira (2003). Assume that we have a training set $\{(x_i, y_i)\}_{i=1}^N$. The optimal parameters are taken to be $w^* = \operatorname{argmin}_w L(w)$ where

$$L(w) = -C \sum_{i=1}^{N} \log P(y_i \mid x_i; w) + \frac{1}{2}\|w\|^2$$

The parameter C > 0 is a constant dictating the level of regularization in the model. Since L(w) is a convex function, gradient descent methods can be used to search for the global minimum. Such methods typically involve repeated computation of the loss L(w) and gradient $\partial L(w) / \partial w$, requiring efficient implementations of both functions. Note that the log-probability of a parse is

$$\log P(y \mid x; w) = \sum_{(h,m) \in y} w \cdot f(x, h, m) - \log Z(x; w)$$

so that the main issue in calculating the loss function L(w) is the evaluation of the partition functions $Z(x_i; w)$. The gradient of the loss is given by

$$\frac{\partial L(w)}{\partial w} = w - C \sum_{i=1}^{N} \sum_{(h,m) \in y_i} f(x_i, h, m) + C \sum_{i=1}^{N} \sum_{(h,m) \in D(x_i)} \mu_{h,m}(x_i; w)\, f(x_i, h, m)$$

where

$$\mu_{h,m}(x; w) = \sum_{y \in T(x) :\, (h,m) \in y} P(y \mid x; w)$$

is the marginal probability of a dependency (h, m). Thus, the main issue in the evaluation of the gradient is the computation of the marginals $\mu_{h,m}(x_i; w)$.

Note that Eq. 7 forms a special case of the log-linear distribution defined in Eq. 2 in Section 2.2. If we set $\theta_{h,m} = w \cdot f(x, h, m)$ then we have $P(y \mid x; w) = P(y \mid x; \theta)$, $Z(x; w) = Z(x; \theta)$, and $\mu_{h,m}(x; w) = \mu_{h,m}(x; \theta)$. Thus in the projective case the inside-outside algorithm can be used to calculate the partition function and marginals, thereby enabling training of a log-linear model; in the non-projective case the algorithms in Section 3 can be used for this purpose.

4.2 Max-Margin Estimation

The second learning algorithm we consider is the large-margin approach for structured prediction (Taskar et al., 2004a; Taskar et al., 2004b). Learning in this framework again involves minimization of a convex function L(w). Let the margin for parse tree y on the i-th training example be defined as

$$m_{i,y}(w) = \sum_{(h,m) \in y_i} w \cdot f(x_i, h, m) - \sum_{(h,m) \in y} w \cdot f(x_i, h, m)$$

The loss function is then defined as

$$L(w) = C \sum_{i=1}^{N} \max_{y \in T(x_i)} \left(E_{i,y} - m_{i,y}(w)\right) + \frac{1}{2}\|w\|^2$$

where $E_{i,y}$ is a measure of the loss, or number of errors, for parse y on the i-th training sentence. In this paper we take $E_{i,y}$ to be the number of incorrect dependencies in the parse tree y when compared to the gold-standard parse tree $y_i$. The definition of L(w) makes use of the expression $\max_{y \in T(x_i)} (E_{i,y} - m_{i,y}(w))$ for the i-th training example, which is commonly referred to as the hinge loss. Note that $E_{i,y_i} = 0$, and also that $m_{i,y_i}(w) = 0$, so that the hinge loss is always non-negative. In addition, the hinge loss is 0 if and only if $m_{i,y}(w) \geq E_{i,y}$ for all $y \in T(x_i)$. Thus the hinge loss directly penalizes margins $m_{i,y}(w)$ which are less than their corresponding losses $E_{i,y}$.

Figure 2 shows an algorithm for minimizing L(w) that is based on the exponentiated-gradient algorithm for large-margin optimization described by Bartlett et al. (2004). The algorithm maintains a set of weights $\theta_{i,h,m}$ for $i = 1 \ldots N$, $(h, m) \in D(x_i)$, which are updated example-by-example. The algorithm relies on the repeated computation of marginal values $\mu_{i,h,m}$, which are defined as follows:¹

$$\mu_{i,h,m} = \sum_{y \in T(x_i) :\, (h,m) \in y} P(y \mid x_i) \qquad P(y \mid x_i) = \frac{\exp\left\{\sum_{(h,m) \in y} \theta_{i,h,m}\right\}}{\sum_{y' \in T(x_i)} \exp\left\{\sum_{(h,m) \in y'} \theta_{i,h,m}\right\}} \quad (8)$$

A similar definition is used to derive marginal values $\mu'_{i,h,m}$ from the values $\theta'_{i,h,m}$.

¹ Bartlett et al. (2004) write $P(y \mid x_i)$ as $\alpha_{i,y}$. The $\alpha_{i,y}$ variables are dual variables that appear in the dual objective function, i.e., the convex dual of L(w). Analysis of the algorithm shows that as the $\theta_{i,h,m}$ variables are updated, the dual variables converge to the optimal point of the dual objective, and the parameters w converge to the minimum of L(w).
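Both training methods reduce to the inference routines of Section 3: the log-linear gradient of Section 4.1 needs $Z(x_i; w)$ and $\mu_{h,m}(x_i; w)$, and the EG algorithm's $\mu_{i,h,m}$ and $\mu'_{i,h,m}$ in Eq. 8 are the same marginal computation applied to per-example scores. As one concrete illustration, the sketch below assembles a single example's contribution to the log-linear loss and gradient; `feats` and `inference` are assumed helpers (e.g., the feature and Matrix-Tree sketches given earlier), not part of the paper's implementation.

```python
import numpy as np

def loglinear_example_loss_and_grad(x, y_gold, w, feats, inference, C=1.0):
    """Contribution of one training example (x, y_gold) to L(w) and dL/dw (Section 4.1).

    feats(x, h, m)      -> feature vector f(x, h, m)                (hypothetical helper)
    inference(theta, n) -> (Z, mu) for the chosen tree set T(x),
                           e.g. the Matrix-Tree routines sketched above.
    The regularizer (1/2)||w||^2 and its gradient w are added once, outside the sum over i."""
    n = len(x)
    theta = np.zeros((n + 1, n + 1))
    for h in range(n + 1):                          # theta_{h,m} = w . f(x, h, m)
        for m in range(1, n + 1):
            if h != m:
                theta[h, m] = w @ feats(x, h, m)
    Z, mu = inference(theta, n)
    gold_score = sum(theta[h, m] for (h, m) in y_gold)
    loss = -C * (gold_score - np.log(Z))            # -C log P(y_gold | x; w)
    grad = np.zeros_like(w)
    for h in range(n + 1):                          # +C * sum_{(h,m) in D(x)} mu_{h,m} f(x,h,m)
        for m in range(1, n + 1):
            if h != m:
                grad += C * mu[h, m] * feats(x, h, m)
    for (h, m) in y_gold:                           # -C * sum_{(h,m) in y_gold} f(x,h,m)
        grad -= C * feats(x, h, m)
    return loss, grad
```

A full training run would sum these contributions over i, add w to the gradient and $\frac{1}{2}\|w\|^2$ to the loss, and hand both to a batch optimizer such as the conjugate-gradient method used in our experiments.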

Computation of the µ and µ′ values is again inference of the form described in Problem 3 in Section 2.2, and can be achieved using the inside-outside algorithm for projective structures, and the algorithms described in Section 3 for non-projective structures.

  Inputs: Training examples $\{(x_i, y_i)\}_{i=1}^N$.
  Parameters: Regularization constant C, starting point β, number of passes over training set T.
  Data Structures: Real values $\theta_{i,h,m}$ and $l_{i,h,m}$ for $i = 1 \ldots N$, $(h, m) \in D(x_i)$. Learning rate η.
  Initialization: Set learning rate $\eta = 1/C$. Set $\theta_{i,h,m} = \beta$ for $(h, m) \in y_i$, and $\theta_{i,h,m} = 0$ for $(h, m) \notin y_i$. Set $l_{i,h,m} = 0$ for $(h, m) \in y_i$, and $l_{i,h,m} = 1$ for $(h, m) \notin y_i$. Calculate initial parameters as
      $w = C \sum_i \sum_{(h,m) \in D(x_i)} \delta_{i,h,m}\, f(x_i, h, m)$
    where $\delta_{i,h,m} = (1 - l_{i,h,m} - \mu_{i,h,m})$ and the $\mu_{i,h,m}$ values are calculated from the $\theta_{i,h,m}$ values as described in Eq. 8.
  Algorithm: Repeat T passes over the training set, where each pass is as follows:
    Set obj = 0
    For i = 1 ... N:
      For all $(h, m) \in D(x_i)$: $\theta'_{i,h,m} = \theta_{i,h,m} + \eta C\,(l_{i,h,m} + w \cdot f(x_i, h, m))$
      For example i, calculate marginals $\mu_{i,h,m}$ from the $\theta_{i,h,m}$ values, and marginals $\mu'_{i,h,m}$ from the $\theta'_{i,h,m}$ values (see Eq. 8).
      Update the parameters: $w = w + C \sum_{(h,m) \in D(x_i)} \delta_{i,h,m}\, f(x_i, h, m)$, where $\delta_{i,h,m} = \mu_{i,h,m} - \mu'_{i,h,m}$.
      For all $(h, m) \in D(x_i)$, set $\theta_{i,h,m} = \theta'_{i,h,m}$.
      Set obj = obj + $C \sum_{(h,m) \in D(x_i)} l_{i,h,m}\, \mu'_{i,h,m}$
    Set obj = obj $- \frac{1}{2}\|w\|^2$. If obj has decreased compared to last iteration, set $\eta = \eta / 2$.
  Output: Parameter values w.

Figure 2: The EG Algorithm for Max-Margin Estimation. The learning rate η is halved each time the dual objective function (see (Bartlett et al., 2004)) fails to increase. In our experiments we chose β = 9, which was found to work well during development of the algorithm.

5 Related Work

Global log-linear training has been used in the context of PCFG parsing (Johnson, 2001). Riezler et al. (2004) explore a similar application of log-linear models to LFG parsing. Max-margin learning has been applied to PCFG parsing by Taskar et al. (2004b). They show that this problem has a QP dual of polynomial size, where the dual variables correspond to marginal probabilities of CFG rules. A similar QP dual may be obtained for max-margin projective dependency parsing. However, for non-projective parsing, the dual QP would require an exponential number of constraints on the dependency marginals (Chopra, 1989). Nevertheless, alternative optimization methods like that of Tsochantaridis et al. (2004), or the EG method presented here, can still be applied.

The majority of previous work on dependency parsing has focused on local (i.e., classification of individual edges) discriminative training methods (Yamada and Matsumoto, 2003; Nivre et al., 2004; Y. Cheng, 2005). Non-local (i.e., classification of entire trees) training methods were used by McDonald et al. (2005a), who employed online learning.

Dependency parsing accuracy can be improved by allowing second-order features, which consider more than one dependency simultaneously. McDonald and Pereira (2006) define a second-order dependency parsing model in which interactions between adjacent siblings are allowed, and Carreras (2007) defines a second-order model that allows grandparent and sibling interactions. Both authors give polytime algorithms for exact projective parsing. By adapting the inside-outside algorithm to these models, partition functions and marginals can be computed for second-order projective structures, allowing log-linear and max-margin training to be applied via the framework developed in this paper.
For higher-order non-projective parsing, however, computational complexity results (McDonald and Pereira, 2006; McDonald and Satta, 2007) indicate that exact solutions to the three inference problems of Section 2.2 will be intractable. Exploration of approximate second-order non-projective inference is a natural avenue for future research.

Two other groups of authors have independently and simultaneously proposed adaptations of the Matrix-Tree Theorem for structured inference on directed spanning trees (McDonald and Satta, 2007; Smith and Smith, 2007). There are some algorithmic differences between these papers and ours. First, we define both multi-root and single-root algorithms, whereas the other papers only consider multi-root parsing. This distinction can be important as one often expects a dependency structure to have exactly one child attached to the root-symbol, as is the case in a single-root structure. Second, McDonald and Satta (2007) propose an $O(n^5)$ algorithm for computing the marginals, as opposed to the $O(n^3)$ matrix-inversion approach used by Smith and Smith (2007) and ourselves. In addition to the algorithmic differences, both groups of authors consider applications of the Matrix-Tree Theorem which we have not discussed. For example, both papers propose minimum-risk decoding, and McDonald and Satta (2007) discuss unsupervised learning and language modeling, while Smith and Smith (2007) define hidden-variable models based on spanning trees.

In this paper we used EG training methods only for max-margin models (Bartlett et al., 2004). However, Globerson et al. (2007) have recently shown how EG updates can be applied to efficient training of log-linear models.

6 Experiments on Dependency Parsing

In this section, we present experimental results applying our inference algorithms for dependency parsing models. Our primary purpose is to establish comparisons along two relevant dimensions: projective training vs. non-projective training, and marginal-based training algorithms vs. the averaged perceptron. The feature representation and other relevant dimensions are kept fixed in the experiments.

6.1 Data Sets and Features

We used data from the CoNLL-X shared task on multilingual dependency parsing (Buchholz and Marsi, 2006). In our experiments, we used a subset consisting of six languages; Table 1 gives details of the data sets used.²

² Our subset includes the two languages with the lowest accuracy in the CoNLL-X evaluations (Turkish and Arabic), the language with the highest accuracy (Japanese), the most non-projective language (Dutch), a moderately non-projective language (Slovene), and a highly projective language (Spanish). All languages but Spanish have multi-root parses in their data. We are grateful to the providers of the treebanks that constituted the data of our experiments (Hajič et al., 2004; van der Beek et al., 2002; Kawata and Bartels, 2000; Džeroski et al., 2006; Civit and Martí, 2002; Oflazer et al., 2003).

  language   %cd   train    val.     test
  Arabic            ,064    5,315    5,373
  Dutch             ,861    16,208   5,585
  Japanese          ,966    9,495    5,711
  Slovene           ,949    5,801    6,390
  Spanish           ,310    11,024   5,694
  Turkish           ,827    5,683    7,547

Table 1: Information for the languages in our experiments. The 2nd column (%cd) is the percentage of crossing dependencies in the training and validation sets. The last three columns report the size in tokens of the training, validation and test sets.

For each language we created a validation set that was a subset of the CoNLL-X training set for that language. The remainder of each training set was used to train the models for the different languages. The validation sets were used to tune the meta-parameters (e.g., the value of the regularization constant C) of the different training algorithms. We used the official test sets and evaluation script from the CoNLL-X task. All of the results that we report are for unlabeled dependency parsing.³

³ Our algorithms also support labeled parsing (see Section 3.4). Initial experiments with labeled models showed the same trend that we report here for unlabeled parsing, so for simplicity we conducted extensive experiments only for unlabeled parsing.

The non-projective models were trained on the CoNLL-X data in its original form. Since the projective models assume that the dependencies in the data are non-crossing, we created a second training set for each language where non-projective dependency structures were automatically transformed into projective structures. All projective models were trained on these new training sets.⁴ Our feature space is based on that of McDonald et al. (2005a).⁵

⁴ The transformations were performed by running the projective parser with score +1 on correct dependencies and −1 otherwise: the resulting trees are guaranteed to be projective and to have a minimum loss with respect to the correct tree. Note that only the training sets were transformed.

⁵ It should be noted that McDonald et al. (2006) use a richer feature set that is incomparable to our features.
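The projectivization step described in footnote 4 can be pictured as follows: score every candidate arc +1 if it appears in the gold tree and −1 otherwise, then run an exact projective decoder on those scores; the best-scoring projective tree is a minimum-loss projective approximation of the gold tree. The sketch assumes a projective decoder `eisner_decode(scores)` (for instance the $O(n^3)$ algorithm of Eisner (1996)), which is not shown here.

```python
import numpy as np

def projectivize(gold_arcs, n, eisner_decode):
    """Approximate a (possibly non-projective) gold tree by a minimum-loss projective tree.

    gold_arcs: set of (h, m) dependencies of the gold tree, h in 0..n, m in 1..n.
    eisner_decode: assumed helper returning the highest-scoring projective tree
                   (as a set of (h, m) arcs) for a dense score matrix.
    """
    scores = np.full((n + 1, n + 1), -1.0)        # -1 for incorrect dependencies
    for (h, m) in gold_arcs:
        scores[h, m] = 1.0                        # +1 for correct dependencies
    return eisner_decode(scores)                  # projective by construction, minimum loss w.r.t. gold
```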

6.2 Results

We performed experiments using three training algorithms: the averaged perceptron (Collins, 2002), log-linear training (via conjugate gradient descent), and max-margin training (via the EG algorithm). Each of these algorithms was trained using projective and non-projective methods, yielding six training settings per language. The different training algorithms have various meta-parameters, which we optimized on the validation set for each language/training-setting combination. The averaged perceptron has a single meta-parameter, namely the number of iterations over the training set. The log-linear models have two meta-parameters: the regularization constant C and the number of gradient steps T taken by the conjugate-gradient optimizer. The EG approach also has two meta-parameters: the regularization constant C and the number of iterations, T.⁶ For models trained using non-projective algorithms, both projective and non-projective parsing was tested on the validation set, and the highest scoring of these two approaches was then used to decode test data sentences.

⁶ We trained the perceptron for 100 iterations, and chose the iteration which led to the best score on the validation set. Note that in all of our experiments, the best perceptron results were actually obtained with 30 or fewer iterations. For the log-linear and EG algorithms we tested a number of values for C, and for each value of C ran 100 gradient steps or EG iterations, finally choosing the best combination of C and T found in validation.

Table 2: Test data results. The p and np columns show results with projective and non-projective training respectively. (Columns: Perceptron, Max-Margin, Log-Linear, each with p and np; rows: Ara, Dut, Jap, Slo, Spa, Tur.)

Table 3: Results for the three training algorithms on the different languages (P = perceptron, E = EG, L = log-linear models). AV is an average across the results for the different languages. (Columns: Ara, Dut, Jap, Slo, Spa, Tur, AV; rows: P, E, L.)

Table 2 reports test results for the six training scenarios. These results show that for Dutch, which is the language in our data that has the highest number of crossing dependencies, non-projective training gives significant gains over projective training for all three training methods. For the other languages, non-projective training gives similar or even improved performance over projective training.

Table 3 gives an additional set of results, which were calculated as follows. For each of the three training methods, we used the validation set results to choose between projective and non-projective training. This allows us to make a direct comparison of the three training algorithms. Table 3 shows the results of this comparison.⁷ The results show that log-linear and max-margin models both give a higher average accuracy than the perceptron. For some languages (e.g., Japanese), the differences from the perceptron are small; however for other languages (e.g., Arabic, Dutch or Slovene) the improvements seen are quite substantial.

⁷ We ran the sign test at the sentence level to measure the statistical significance of the results aggregated across the six languages. Out of 2,472 sentences total, log-linear models gave improved parses over the perceptron on 448 sentences, and worse parses on 343 sentences. The max-margin method gave improved/worse parses for 500/383 sentences. Both results are significant with p

7 Conclusions

This paper describes inference algorithms for spanning-tree distributions, focusing on the fundamental problems of computing partition functions and marginals. Although we concentrate on log-linear and max-margin estimation, the inference algorithms we present can serve as black-boxes in many other statistical modeling techniques. Our experiments suggest that marginal-based training produces more accurate models than perceptron learning. Notably, this is the first large-scale application of the EG algorithm, and shows that it is a promising approach for structured learning. In line with McDonald et al. (2005b), we confirm that spanning-tree models are well-suited to dependency parsing, especially for highly non-projective languages such as Dutch. Moreover, spanning-tree models should be useful for a variety of other problems involving structured data.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments. In addition, the authors gratefully acknowledge the following sources of support. Terry Koo was funded by a grant from the NSF (DMS ) and a grant from NTT, Agmt. Dtd. 6/21/1998. Amir Globerson was supported by a fellowship from the Rothschild Foundation - Yad Hanadiv. Xavier Carreras was supported by the Catalan Ministry of Innovation, Universities and Enterprise, and a grant from NTT, Agmt. Dtd. 6/21/1998. Michael Collins was funded by NSF grants and DMS.

References

J. Baker. 1979. Trainable grammars for speech recognition. In 97th Meeting of the Acoustical Society of America.

P. Bartlett, M. Collins, B. Taskar, and D. McAllester. 2004. Exponentiated gradient algorithms for large margin structured classification. In NIPS.

L.E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41.

S. Buchholz and E. Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proc. CoNLL-X.

X. Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proc. EMNLP-CoNLL.

S. Chopra. 1989. On the spanning tree polyhedron. Oper. Res. Lett.

Y.J. Chu and T.H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14.

M. Civit and M.A. Martí. 2002. Design principles for a Spanish treebank. In Proc. of the First Workshop on Treebanks and Linguistic Theories (TLT).

M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP.

S. Džeroski, T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtsky, and A. Žele. 2006. Towards a Slovene dependency treebank. In Proc. of the Fifth Intern. Conf. on Language Resources and Evaluation (LREC).

J. Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B.

J. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proc. COLING.

A. Globerson, T. Koo, X. Carreras, and M. Collins. 2007. Exponentiated gradient algorithms for log-linear structured prediction. In Proc. ICML.

J. Hajič, O. Smrž, P. Zemánek, J. Šnaidauf, and E. Beška. 2004. Prague Arabic dependency treebank: Development in data and tools. In Proc. of the NEMLAR Intern. Conf. on Arabic Language Resources and Tools.

M. Johnson, S. Geman, S. Canon, Z. Chi, and S. Riezler. 1999. Estimators for stochastic unification-based grammars. In Proc. ACL.

M. Johnson. 2001. Joint and conditional estimation of tagging and parsing models. In Proc. ACL.

Y. Kawata and J. Bartels. 2000. Stylebook for the Japanese treebank in VERBMOBIL. Verbmobil-Report 240, Seminar für Sprachwissenschaft, Universität Tübingen.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.

R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. EACL.

R. McDonald and G. Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proc. IWPT.

R. McDonald, K. Crammer, and F. Pereira. 2005a. Online large-margin training of dependency parsers. In Proc. ACL.

R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proc. HLT-EMNLP.

R. McDonald, K. Lerman, and F. Pereira. 2006. Multilingual dependency parsing with a two-stage discriminative parser. In Proc. CoNLL-X.

J. Nivre, J. Hall, and J. Nilsson. 2004. Memory-based dependency parsing. In Proc. CoNLL.

K. Oflazer, B. Say, D. Zeynep Hakkani-Tür, and G. Tür. 2003. Building a Turkish treebank. In A. Abeillé, editor, Treebanks: Building and Using Parsed Corpora, chapter 15. Kluwer Academic Publishers.

M.A. Paskin. 2001. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report UCB/CSD, University of California, Berkeley.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (2nd edition). Morgan Kaufmann Publishers.

S. Riezler, R. Kaplan, T. King, J. Maxwell, A. Vasserman, and R. Crouch. 2004. Speed and accuracy in shallow and deep stochastic parsing. In Proc. HLT-NAACL.

F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proc. HLT-NAACL.

N.A. Smith and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proc. ACL.

D.A. Smith and N.A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proc. EMNLP-CoNLL.

B. Taskar, C. Guestrin, and D. Koller. 2004a. Max-margin Markov networks. In NIPS.

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. 2004b. Max-margin parsing. In Proc. EMNLP.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proc. ICML.

W. Tutte. 1984. Graph Theory. Addison-Wesley.

L. van der Beek, G. Bouma, R. Malouf, and G. van Noord. 2002. The Alpino dependency treebank. In Computational Linguistics in the Netherlands (CLIN).

Y. Cheng, M. Asahara, and Y. Matsumoto. 2005. Machine learning-based dependency analyzer for Chinese. In Proc. ICCC.

H. Yamada and Y. Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. IWPT.


More information

Learning Hierarchies at Two-class Complexity

Learning Hierarchies at Two-class Complexity Learning Hierarchies at Two-class Complexity Sandor Szedmak ss03v@ecs.soton.ac.uk Craig Saunders cjs@ecs.soton.ac.uk John Shawe-Taylor jst@ecs.soton.ac.uk ISIS Group, Electronics and Computer Science University

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Classification III Dan Klein UC Berkeley 1 Classification 2 Linear Models: Perceptron The perceptron algorithm Iteratively processes the training set, reacting to training errors

More information

27: Hybrid Graphical Models and Neural Networks

27: Hybrid Graphical Models and Neural Networks 10-708: Probabilistic Graphical Models 10-708 Spring 2016 27: Hybrid Graphical Models and Neural Networks Lecturer: Matt Gormley Scribes: Jakob Bauer Otilia Stretcu Rohan Varma 1 Motivation We first look

More information

Transition-Based Dependency Parsing with Stack Long Short-Term Memory

Transition-Based Dependency Parsing with Stack Long Short-Term Memory Transition-Based Dependency Parsing with Stack Long Short-Term Memory Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, Noah A. Smith Association for Computational Linguistics (ACL), 2015 Presented

More information

Building Classifiers using Bayesian Networks

Building Classifiers using Bayesian Networks Building Classifiers using Bayesian Networks Nir Friedman and Moises Goldszmidt 1997 Presented by Brian Collins and Lukas Seitlinger Paper Summary The Naive Bayes classifier has reasonable performance

More information

Hidden Markov Support Vector Machines

Hidden Markov Support Vector Machines Hidden Markov Support Vector Machines Yasemin Altun Ioannis Tsochantaridis Thomas Hofmann Department of Computer Science, Brown University, Providence, RI 02912 USA altun@cs.brown.edu it@cs.brown.edu th@cs.brown.edu

More information

Information Processing Letters

Information Processing Letters Information Processing Letters 112 (2012) 449 456 Contents lists available at SciVerse ScienceDirect Information Processing Letters www.elsevier.com/locate/ipl Recursive sum product algorithm for generalized

More information

Parsing partially bracketed input

Parsing partially bracketed input Parsing partially bracketed input Martijn Wieling, Mark-Jan Nederhof and Gertjan van Noord Humanities Computing, University of Groningen Abstract A method is proposed to convert a Context Free Grammar

More information

Improving Transition-Based Dependency Parsing with Buffer Transitions

Improving Transition-Based Dependency Parsing with Buffer Transitions Improving Transition-Based Dependency Parsing with Buffer Transitions Daniel Fernández-González Departamento de Informática Universidade de Vigo Campus As Lagoas, 32004 Ourense, Spain danifg@uvigo.es Carlos

More information

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization in Low Level Vision Low level vision problems concerned with estimating some quantity at each pixel Visual motion (u(x,y),v(x,y))

More information

Statistical Dependency Parsing

Statistical Dependency Parsing Statistical Dependency Parsing The State of the Art Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Dependency Parsing 1(29) Introduction

More information

A Unified Framework to Integrate Supervision and Metric Learning into Clustering

A Unified Framework to Integrate Supervision and Metric Learning into Clustering A Unified Framework to Integrate Supervision and Metric Learning into Clustering Xin Li and Dan Roth Department of Computer Science University of Illinois, Urbana, IL 61801 (xli1,danr)@uiuc.edu December

More information

Part II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part II C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Converting Directed to Undirected Graphs (1) Converting Directed to Undirected Graphs (2) Add extra links between

More information

Transductive Phoneme Classification Using Local Scaling And Confidence

Transductive Phoneme Classification Using Local Scaling And Confidence 202 IEEE 27-th Convention of Electrical and Electronics Engineers in Israel Transductive Phoneme Classification Using Local Scaling And Confidence Matan Orbach Dept. of Electrical Engineering Technion

More information

Dependency Parsing domain adaptation using transductive SVM

Dependency Parsing domain adaptation using transductive SVM Dependency Parsing domain adaptation using transductive SVM Antonio Valerio Miceli-Barone University of Pisa, Italy / Largo B. Pontecorvo, 3, Pisa, Italy miceli@di.unipi.it Giuseppe Attardi University

More information

Iterative CKY parsing for Probabilistic Context-Free Grammars

Iterative CKY parsing for Probabilistic Context-Free Grammars Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST

More information

MEMMs (Log-Linear Tagging Models)

MEMMs (Log-Linear Tagging Models) Chapter 8 MEMMs (Log-Linear Tagging Models) 8.1 Introduction In this chapter we return to the problem of tagging. We previously described hidden Markov models (HMMs) for tagging problems. This chapter

More information

Structure and Support Vector Machines. SPFLODD October 31, 2013

Structure and Support Vector Machines. SPFLODD October 31, 2013 Structure and Support Vector Machines SPFLODD October 31, 2013 Outline SVMs for structured outputs Declara?ve view Procedural view Warning: Math Ahead Nota?on for Linear Models Training data: {(x 1, y

More information

Machine Learning Department School of Computer Science Carnegie Mellon University. K- Means + GMMs

Machine Learning Department School of Computer Science Carnegie Mellon University. K- Means + GMMs 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University K- Means + GMMs Clustering Readings: Murphy 25.5 Bishop 12.1, 12.3 HTF 14.3.0 Mitchell

More information

The Perceptron. Simon Šuster, University of Groningen. Course Learning from data November 18, 2013

The Perceptron. Simon Šuster, University of Groningen. Course Learning from data November 18, 2013 The Perceptron Simon Šuster, University of Groningen Course Learning from data November 18, 2013 References Hal Daumé III: A Course in Machine Learning http://ciml.info Tom M. Mitchell: Machine Learning

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Conditional Random Fields - A probabilistic graphical model. Yen-Chin Lee 指導老師 : 鮑興國

Conditional Random Fields - A probabilistic graphical model. Yen-Chin Lee 指導老師 : 鮑興國 Conditional Random Fields - A probabilistic graphical model Yen-Chin Lee 指導老師 : 鮑興國 Outline Labeling sequence data problem Introduction conditional random field (CRF) Different views on building a conditional

More information

Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction

Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction Ming-Wei Chang Wen-tau Yih Microsoft Research Redmond, WA 98052, USA {minchang,scottyih}@microsoft.com Abstract Due to

More information

Conditional Random Field for tracking user behavior based on his eye s movements 1

Conditional Random Field for tracking user behavior based on his eye s movements 1 Conditional Random Field for tracing user behavior based on his eye s movements 1 Trinh Minh Tri Do Thierry Artières LIP6, Université Paris 6 LIP6, Université Paris 6 8 rue du capitaine Scott 8 rue du

More information

Confidence in Structured-Prediction using Confidence-Weighted Models

Confidence in Structured-Prediction using Confidence-Weighted Models Confidence in Structured-Prediction using Confidence-Weighted Models Avihai Mejer Department of Computer Science Technion-Israel Institute of Technology Haifa 32, Israel amejer@tx.technion.ac.il Koby Crammer

More information

Assignment 4 CSE 517: Natural Language Processing

Assignment 4 CSE 517: Natural Language Processing Assignment 4 CSE 517: Natural Language Processing University of Washington Winter 2016 Due: March 2, 2016, 1:30 pm 1 HMMs and PCFGs Here s the definition of a PCFG given in class on 2/17: A finite set

More information

Feature Extraction and Loss training using CRFs: A Project Report

Feature Extraction and Loss training using CRFs: A Project Report Feature Extraction and Loss training using CRFs: A Project Report Ankan Saha Department of computer Science University of Chicago March 11, 2008 Abstract POS tagging has been a very important problem in

More information

Easy-First POS Tagging and Dependency Parsing with Beam Search

Easy-First POS Tagging and Dependency Parsing with Beam Search Easy-First POS Tagging and Dependency Parsing with Beam Search Ji Ma JingboZhu Tong Xiao Nan Yang Natrual Language Processing Lab., Northeastern University, Shenyang, China MOE-MS Key Lab of MCC, University

More information

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010 INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,

More information

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial

More information

Probabilistic parsing with a wide variety of features

Probabilistic parsing with a wide variety of features Probabilistic parsing with a wide variety of features Mark Johnson Brown University IJCNLP, March 2004 Joint work with Eugene Charniak (Brown) and Michael Collins (MIT) upported by NF grants LI 9720368

More information

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands Svetlana Stoyanchev, Hyuckchul Jung, John Chen, Srinivas Bangalore AT&T Labs Research 1 AT&T Way Bedminster NJ 07921 {sveta,hjung,jchen,srini}@research.att.com

More information

Supplementary Material: The Emergence of. Organizing Structure in Conceptual Representation

Supplementary Material: The Emergence of. Organizing Structure in Conceptual Representation Supplementary Material: The Emergence of Organizing Structure in Conceptual Representation Brenden M. Lake, 1,2 Neil D. Lawrence, 3 Joshua B. Tenenbaum, 4,5 1 Center for Data Science, New York University

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Online Graph Planarisation for Synchronous Parsing of Semantic and Syntactic Dependencies

Online Graph Planarisation for Synchronous Parsing of Semantic and Syntactic Dependencies Online Graph Planarisation for Synchronous Parsing of Semantic and Syntactic Dependencies Ivan Titov University of Illinois at Urbana-Champaign James Henderson, Paola Merlo, Gabriele Musillo University

More information

A New Perceptron Algorithm for Sequence Labeling with Non-local Features

A New Perceptron Algorithm for Sequence Labeling with Non-local Features A New Perceptron Algorithm for Sequence Labeling with Non-local Features Jun ichi Kazama and Kentaro Torisawa Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa, 923-1292

More information

Monotone Paths in Geometric Triangulations

Monotone Paths in Geometric Triangulations Monotone Paths in Geometric Triangulations Adrian Dumitrescu Ritankar Mandal Csaba D. Tóth November 19, 2017 Abstract (I) We prove that the (maximum) number of monotone paths in a geometric triangulation

More information

Hidden Markov Models in the context of genetic analysis

Hidden Markov Models in the context of genetic analysis Hidden Markov Models in the context of genetic analysis Vincent Plagnol UCL Genetics Institute November 22, 2012 Outline 1 Introduction 2 Two basic problems Forward/backward Baum-Welch algorithm Viterbi

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Discriminative Training for Phrase-Based Machine Translation

Discriminative Training for Phrase-Based Machine Translation Discriminative Training for Phrase-Based Machine Translation Abhishek Arun 19 April 2007 Overview 1 Evolution from generative to discriminative models Discriminative training Model Learning schemes Featured

More information

HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce

HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce Andrea Gesmundo Computer Science Department University of Geneva Geneva, Switzerland andrea.gesmundo@unige.ch

More information

Density-Driven Cross-Lingual Transfer of Dependency Parsers

Density-Driven Cross-Lingual Transfer of Dependency Parsers Density-Driven Cross-Lingual Transfer of Dependency Parsers Mohammad Sadegh Rasooli Michael Collins rasooli@cs.columbia.edu Presented by Owen Rambow EMNLP 2015 Motivation Availability of treebanks Accurate

More information

Conditional Random Fields with High-Order Features for Sequence Labeling

Conditional Random Fields with High-Order Features for Sequence Labeling Conditional Random Fields with High-Order Features for Sequence Labeling Nan Ye Wee Sun Lee Department of Computer Science National University of Singapore {yenan,leews}@comp.nus.edu.sg Hai Leong Chieu

More information