Structured Prediction Models via the Matrix-Tree Theorem
Terry Koo, Amir Globerson, Xavier Carreras and Michael Collins
MIT CSAIL, Cambridge, MA 02139, USA

Abstract

This paper provides an algorithmic framework for learning statistical models involving directed spanning trees, or equivalently non-projective dependency structures. We show how partition functions and marginals for directed spanning trees can be computed by an adaptation of Kirchhoff's Matrix-Tree Theorem. To demonstrate an application of the method, we perform experiments which use the algorithm in training both log-linear and max-margin dependency parsers. The new training methods give improvements in accuracy over perceptron-trained models.

1 Introduction

Learning with structured data typically involves searching or summing over a set with an exponential number of structured elements, for example the set of all parse trees for a given sentence. Methods for summing over such structures include the inside-outside algorithm for probabilistic context-free grammars (Baker, 1979), the forward-backward algorithm for hidden Markov models (Baum et al., 1970), and the belief-propagation algorithm for graphical models (Pearl, 1988). These algorithms compute marginal probabilities and partition functions, quantities which are central to many methods for the statistical modeling of complex structures (e.g., the EM algorithm (Baker, 1979; Baum et al., 1970), contrastive estimation (Smith and Eisner, 2005), training algorithms for CRFs (Lafferty et al., 2001), and training algorithms for max-margin models (Bartlett et al., 2004; Taskar et al., 2004a)).

This paper describes inside-outside-style algorithms for the case of directed spanning trees. These structures are equivalent to non-projective dependency parses (McDonald et al., 2005b), and more generally could be relevant to any task that involves learning a mapping from a graph to an underlying spanning tree. Unlike the case for projective dependency structures, partition functions and marginals for non-projective trees cannot be computed using dynamic-programming methods such as the inside-outside algorithm. In this paper we describe how these quantities can be computed by adapting a well-known result in graph theory: Kirchhoff's Matrix-Tree Theorem (Tutte, 1984). A naïve application of the theorem yields O(n^4) and O(n^6) algorithms for computation of the partition function and marginals, respectively. However, our adaptation finds the partition function and marginals in O(n^3) time using simple matrix determinant and inversion operations.

We demonstrate an application of the new inference algorithm to non-projective dependency parsing. Specifically, we show how to implement two popular supervised learning approaches for this task: globally-normalized log-linear models and max-margin models. Log-linear estimation critically depends on the calculation of partition functions and marginals, which can be computed by our algorithms. For max-margin models, Bartlett et al. (2004) have provided a simple training algorithm, based on exponentiated-gradient (EG) updates, that requires computation of marginals and can thus be implemented within our framework. Both of these methods explicitly minimize the loss incurred when parsing the entire training set. This contrasts with the online learning algorithms used in previous work with spanning-tree models (McDonald et al., 2005b).
We applied the above two marginal-based training algorithms to six languages with varying degrees of non-projectivity, using datasets obtained from the CoNLL-X shared task (Buchholz and Marsi, 2006). Our experimental framework compared three training approaches: log-linear models, max-margin models, and the averaged perceptron. Each of these was applied to both projective and non-projective parsing. Our results demonstrate that marginal-based training yields models which outperform those trained using the averaged perceptron.
In summary, the contributions of this paper are:

1. We introduce algorithms for inside-outside-style calculations for directed spanning trees, or equivalently non-projective dependency structures. These algorithms should have wide applicability in learning problems involving spanning-tree structures.

2. We illustrate the utility of these algorithms in log-linear training of dependency parsing models, and show improvements in accuracy when compared to averaged-perceptron training.

3. We also train max-margin models for dependency parsing via an EG algorithm (Bartlett et al., 2004). The experiments presented here constitute the first application of this algorithm to a large-scale problem. We again show improved performance over the perceptron.

The goal of our experiments is to give a rigorous comparative study of the marginal-based training algorithms and a highly-competitive baseline, the averaged perceptron, using the same feature sets for all approaches. We stress, however, that the purpose of this work is not to give competitive performance on the CoNLL data sets; this would require further engineering of the approach. Similar adaptations of the Matrix-Tree Theorem have been developed independently and simultaneously by Smith and Smith (2007) and McDonald and Satta (2007); see Section 5 for more discussion.

2 Background

2.1 Discriminative Dependency Parsing

Dependency parsing is the task of mapping a sentence x to a dependency structure y. Given a sentence x with n words, a dependency for that sentence is a tuple (h, m) where h ∈ [0 ... n] is the index of the head word in the sentence, and m ∈ [1 ... n] is the index of a modifier word. The value h = 0 is a special root-symbol that may only appear as the head of a dependency. We use D(x) to refer to all possible dependencies for a sentence x: D(x) = {(h, m) : h ∈ [0 ... n], m ∈ [1 ... n]}.

A dependency parse is a set of dependencies that forms a directed tree, with the sentence's root-symbol as its root. We will consider both projective trees, where dependencies are not allowed to cross, and non-projective trees, where crossing dependencies are allowed. Dependency annotations for some languages, for example Czech, can exhibit a significant number of crossing dependencies. In addition, we consider both single-root and multi-root trees. In a single-root tree y, the root-symbol has exactly one child, while in a multi-root tree, the root-symbol has one or more children. This distinction is relevant as our training sets include both single-root corpora (in which all trees are single-root structures) and multi-root corpora (in which some trees are multi-root structures). The two distinctions described above are orthogonal, yielding four classes of dependency structures; see Figure 1 for examples of each kind of structure.

Figure 1: Examples of the four types of dependency structures (projective vs. non-projective, single-root vs. multi-root), illustrated on the sentence "He saw her". We draw dependency arcs from head to modifier.

We use T_p^s(x) to denote the set of all possible projective single-root dependency structures for a sentence x, and T_np^s(x) to denote the set of single-root non-projective structures for x. The sets T_p^m(x) and T_np^m(x) are defined analogously for multi-root structures. In contexts where any class of dependency structures may be used, we use the notation T(x) as a placeholder that may be defined as T_p^s(x), T_np^s(x), T_p^m(x) or T_np^m(x).
Following McDonald et al. (2005a), we use a discriminative model for dependency parsing. Features in the model are defined through a function f(x, h, m) which maps a sentence x together with a dependency (h, m) to a feature vector in R^d. A feature vector can be sensitive to any properties of the triple (x, h, m). Given a parameter vector w, the optimal dependency structure for a sentence x is

y*(x; w) = argmax_{y ∈ T(x)} Σ_{(h,m) ∈ y} w · f(x, h, m)   (1)

where the set T(x) can be defined as T_p^s(x), T_np^s(x), T_p^m(x) or T_np^m(x), depending on the type of parsing.
The parameters w will be learned from a training set {(x_i, y_i)}_{i=1}^N where each x_i is a sentence and each y_i is a dependency structure. Much of the previous work on learning w has focused on training local models (see Section 5). McDonald et al. (2005a; 2005b) trained global models using online algorithms such as the perceptron algorithm or MIRA. In this paper we consider training algorithms based on work in conditional random fields (CRFs) (Lafferty et al., 2001) and max-margin methods (Taskar et al., 2004a).

2.2 Three Inference Problems

This section highlights three inference problems which arise in training and decoding discriminative dependency parsers, and which are central to the approaches described in this paper. Assume that we have a vector θ with values θ_{h,m} ∈ R for all (h, m) ∈ D(x); these values correspond to weights on the different dependencies in D(x). Define a conditional distribution over all dependency structures y ∈ T(x) as follows:

P(y | x; θ) = exp{ Σ_{(h,m) ∈ y} θ_{h,m} } / Z(x; θ)   (2)

Z(x; θ) = Σ_{y ∈ T(x)} exp{ Σ_{(h,m) ∈ y} θ_{h,m} }   (3)

The function Z(x; θ) is commonly referred to as the partition function. Given the distribution P(y | x; θ), we can define the marginal probability of a dependency (h, m) as

μ_{h,m}(x; θ) = Σ_{y ∈ T(x) : (h,m) ∈ y} P(y | x; θ)

The inference problems are then as follows:

Problem 1: Decoding: Find argmax_{y ∈ T(x)} Σ_{(h,m) ∈ y} θ_{h,m}.

Problem 2: Computation of the Partition Function: Calculate Z(x; θ).

Problem 3: Computation of the Marginals: For all (h, m) ∈ D(x), calculate μ_{h,m}(x; θ).

Note that all three problems require a maximization or summation over the set T(x), which is exponential in size. There is a clear motivation for being able to solve Problem 1: by setting θ_{h,m} = w · f(x, h, m), the optimal dependency structure y*(x; w) (see Eq. 1) can be computed. In this paper the motivation for solving Problems 2 and 3 arises from training algorithms for discriminative models. As we will describe in Section 4, both log-linear and max-margin models can be trained via methods that make direct use of algorithms for Problems 2 and 3.
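To make these definitions concrete, here is a small worked example of our own (it does not appear in the paper). For a two-word sentence x, the single-root non-projective set T_np^s(x) contains exactly two structures, y_1 = {(0, 1), (1, 2)} and y_2 = {(0, 2), (2, 1)}, so

Z(x; θ) = exp{θ_{0,1} + θ_{1,2}} + exp{θ_{0,2} + θ_{2,1}}

and, since y_1 is the only structure containing the dependency (1, 2),

μ_{1,2}(x; θ) = exp{θ_{0,1} + θ_{1,2}} / Z(x; θ).

For longer sentences the number of spanning trees grows exponentially, which is why Problems 2 and 3 cannot in general be solved by enumeration.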
In the case of projective dependency structures (i.e., T(x) defined as T_p^s(x) or T_p^m(x)), there are well-known algorithms for all three inference problems. Decoding can be carried out using Viterbi-style dynamic-programming algorithms, for example the O(n^3) algorithm of Eisner (1996). Computation of the marginals and partition function can also be achieved in O(n^3) time, using a variant of the inside-outside algorithm (Baker, 1979) applied to the Eisner (1996) data structures (Paskin, 2001). In the non-projective case (i.e., T(x) defined as T_np^s(x) or T_np^m(x)), McDonald et al. (2005b) describe how the CLE algorithm (Chu and Liu, 1965; Edmonds, 1967) can be used for decoding. However, it is not possible to compute the marginals and partition function using the inside-outside algorithm. We next describe a method for computing these quantities in O(n^3) time using matrix inverse and determinant operations.

3 Spanning-tree inference using the Matrix-Tree Theorem

In this section we present algorithms for computing the partition function and marginals, as defined in Section 2.2, for non-projective parsing. We first reiterate the observation of McDonald et al. (2005a) that non-projective parses correspond to directed spanning trees on a complete directed graph of n nodes, where n is the length of the sentence. The above inference problems thus involve summation over the set of all directed spanning trees. Note that this set is exponentially large, and there is no obvious method for decomposing the sum into dynamic-programming-like subproblems. This section describes how a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984) can be used to evaluate the partition function and marginals efficiently. In what follows, we consider the single-root setting (i.e., T(x) = T_np^s(x)), leaving the multi-root case (i.e., T(x) = T_np^m(x)) to Section 3.3.
For a sentence x with n words, define a complete directed graph G on n nodes, where each node corresponds to a word in x, and each edge corresponds to a dependency between two words in x. Note that G does not include the root-symbol h = 0, nor does it account for any dependencies (0, m) headed by the root-symbol. We assign non-negative weights to the edges of this graph, yielding the following weighted adjacency matrix A(θ) ∈ R^{n×n}, for h, m = 1 ... n:

A_{h,m}(θ) = 0 if h = m, and A_{h,m}(θ) = exp{θ_{h,m}} otherwise.

To account for the dependencies (0, m) headed by the root-symbol, we define a vector of root-selection scores r(θ) ∈ R^n, for m = 1 ... n:

r_m(θ) = exp{θ_{0,m}}

Let the weight of a dependency structure y ∈ T_np^s(x) be defined as:

ψ(y; θ) = r_{root(y)}(θ) · Π_{(h,m) ∈ y : h ≠ 0} A_{h,m}(θ)

Here, root(y) = m : (0, m) ∈ y is the child of the root-symbol; there is exactly one such child, since y ∈ T_np^s(x). Eq. 2 and 3 can be rephrased as:

P(y | x; θ) = ψ(y; θ) / Z(x; θ)   (4)

Z(x; θ) = Σ_{y ∈ T_np^s(x)} ψ(y; θ)   (5)

In the remainder of this section, we drop the notational dependence on x for brevity.

The original Matrix-Tree Theorem addressed the problem of counting the number of undirected spanning trees in an undirected graph. For the models we study here, we require a sum of weighted and directed spanning trees. Tutte (1984) extended the Matrix-Tree Theorem to this case. We briefly summarize his method below. First, define the Laplacian matrix L(θ) ∈ R^{n×n} of G, for h, m = 1 ... n:

L_{h,m}(θ) = Σ_{h'=1}^n A_{h',m}(θ) if h = m, and L_{h,m}(θ) = -A_{h,m}(θ) otherwise.

Second, for a matrix X, let X^{(h,m)} be the minor of X with respect to row h and column m; i.e., the determinant of the matrix formed by deleting row h and column m from X. Finally, define the weight of any directed spanning tree of G to be the product of the weights A_{h,m}(θ) for the edges in that tree.

Theorem 1 (Tutte, 1984, p. 140). Let L(θ) be the Laplacian matrix of G. Then L^{(m,m)}(θ) is equal to the sum of the weights of all directed spanning trees of G which are rooted at m.

Furthermore, the minors vary only in sign when traversing the columns of the Laplacian (Tutte, 1984, p. 150):

∀h, m: (-1)^{h+m} L^{(h,m)}(θ) = L^{(m,m)}(θ)   (6)

3.1 Partition functions via matrix determinants

From Theorem 1, it directly follows that

L^{(m,m)}(θ) = Σ_{y ∈ U(m)} Π_{(h,m) ∈ y : h ≠ 0} A_{h,m}(θ)

where U(m) = {y ∈ T_np^s : root(y) = m}. A naïve method for computing the partition function is therefore to evaluate

Z(θ) = Σ_{m=1}^n r_m(θ) L^{(m,m)}(θ)

The above would require calculating n determinants, resulting in O(n^4) complexity. However, as we show below, Z(θ) may be obtained in O(n^3) time using a single determinant evaluation. Define a new matrix L̂(θ) to be L(θ) with the first row replaced by the root-selection scores:

L̂_{h,m}(θ) = r_m(θ) if h = 1, and L̂_{h,m}(θ) = L_{h,m}(θ) if h > 1.

This matrix allows direct computation of the partition function, as the following proposition shows.

Proposition 1 The partition function in Eq. 5 is given by Z(θ) = |L̂(θ)|.

Proof: Consider the row expansion of |L̂(θ)| with respect to row 1:

|L̂(θ)| = Σ_{m=1}^n (-1)^{1+m} L̂_{1,m}(θ) L̂^{(1,m)}(θ) = Σ_{m=1}^n (-1)^{1+m} r_m(θ) L^{(1,m)}(θ) = Σ_{m=1}^n r_m(θ) L^{(m,m)}(θ) = Z(θ)

The second equality follows from the construction of L̂(θ), and the third follows from Eq. 6.
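Proposition 1 reduces the partition function to a single determinant, which translates directly into a few lines of linear algebra. The following NumPy sketch is our own illustration of that computation (it is not the authors' code); it assumes a dense (n+1) x (n+1) score array theta whose row 0 holds the root-selection scores θ_{0,m}.

```python
import numpy as np

def partition_function(theta):
    """Single-root partition function Z(theta) via Proposition 1.

    theta: (n+1) x (n+1) array of scores; theta[0, m] holds the root score
    theta_{0,m} and theta[h, m] (h, m >= 1) the edge score theta_{h,m}.
    This layout is an assumption made for the example.
    """
    n = theta.shape[0] - 1
    A = np.exp(theta[1:, 1:])                # A_{h,m} = exp(theta_{h,m})
    np.fill_diagonal(A, 0.0)                 # no self-dependencies
    r = np.exp(theta[0, 1:])                 # r_m = exp(theta_{0,m})

    L = -A                                   # off-diagonal entries: -A_{h,m}
    L[np.diag_indices(n)] = A.sum(axis=0)    # diagonal: column sums of A

    L_hat = L.copy()
    L_hat[0, :] = r                          # replace first row with root scores
    return np.linalg.det(L_hat)              # Z(theta) = |L_hat(theta)|
```

For long sentences the determinant can easily overflow, so a practical implementation would work in log space, e.g. via numpy.linalg.slogdet; we keep the direct determinant here only for clarity.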
3.2 Marginals via matrix inversion

The marginals we require are given by

μ_{h,m}(θ) = (1 / Z(θ)) Σ_{y ∈ T_np^s : (h,m) ∈ y} ψ(y; θ)

To calculate these marginals efficiently for all values of (h, m), we use a well-known identity relating the log partition-function to the marginals:

μ_{h,m}(θ) = ∂ log Z(θ) / ∂θ_{h,m}

Since the partition function in this case has a closed-form expression (i.e., the determinant of a matrix constructed from θ), the marginals can also be obtained in closed form. Using the chain rule, the derivative of the log partition-function in Proposition 1 is

μ_{h,m}(θ) = ∂ log |L̂(θ)| / ∂θ_{h,m} = Σ_{h'=1}^n Σ_{m'=1}^n (∂ log |L̂(θ)| / ∂L̂_{h',m'}(θ)) (∂L̂_{h',m'}(θ) / ∂θ_{h,m})

To perform the derivative, we use the identity ∂ log |X| / ∂X = (X^{-1})^T and the fact that ∂L̂_{h',m'}(θ) / ∂θ_{h,m} is nonzero for only a few h', m'. Specifically, when h = 0, the marginals are given by

μ_{0,m}(θ) = r_m(θ) [L̂^{-1}(θ)]_{m,1}

and for h > 0, the marginals are given by

μ_{h,m}(θ) = (1 - δ_{1,m}) A_{h,m}(θ) [L̂^{-1}(θ)]_{m,m} - (1 - δ_{h,1}) A_{h,m}(θ) [L̂^{-1}(θ)]_{m,h}

where δ_{h,m} is the Kronecker delta. Thus, the complexity of evaluating all the relevant marginals is dominated by the matrix inversion, and the total complexity is therefore O(n^3).

3.3 Multiple Roots

In the case of multiple roots, we can still compute the partition function and marginals efficiently. In fact, the derivation of this case is simpler than for single-root structures. Create an extended graph G' which augments G with a dummy root node that has edges pointing to all of the existing nodes, weighted by the appropriate root-selection scores. Note that there is a bijection between directed spanning trees of G' rooted at the dummy root and multi-root structures y ∈ T_np^m(x). Thus, Theorem 1 can be used to compute the partition function directly: construct a Laplacian matrix L'(θ) for G' and compute the minor L'^{(0,0)}(θ). Since this minor is also a determinant, the marginals can be obtained analogously to the single-root case. More concretely, this technique corresponds to defining the matrix L̂(θ) as

L̂(θ) = L(θ) + diag(r(θ))

where diag(v) is the diagonal matrix with the vector v on its diagonal.

3.4 Labeled Trees

The techniques above extend easily to the case where dependencies are labeled. For a model with L different labels, it suffices to define the edge and root scores as A_{h,m}(θ) = Σ_{l=1}^L exp{θ_{h,m,l}} and r_m(θ) = Σ_{l=1}^L exp{θ_{0,m,l}}. The partition function over labeled trees is obtained by operating on these values as described previously, and the marginals are given by an application of the chain rule. Both inference problems are solvable in O(n^3 + Ln^2) time.
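The closed-form marginals of Section 3.2 map just as directly onto a single matrix inversion. The sketch below is again our own illustration (not the authors' code), using the same hypothetical (n+1) x (n+1) score layout as in the previous sketch and 1-based word indices as in the paper.

```python
import numpy as np

def single_root_marginals(theta):
    """Marginals mu[h, m] for the single-root model, following Section 3.2."""
    n = theta.shape[0] - 1
    A = np.exp(theta[1:, 1:])
    np.fill_diagonal(A, 0.0)
    r = np.exp(theta[0, 1:])

    L = -A
    L[np.diag_indices(n)] = A.sum(axis=0)
    L_hat = L.copy()
    L_hat[0, :] = r
    B = np.linalg.inv(L_hat)                 # B = L_hat(theta)^{-1}

    mu = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        mu[0, m] = r[m - 1] * B[m - 1, 0]    # root dependencies (0, m)
        for h in range(1, n + 1):
            if h == m:
                continue
            val = 0.0
            if m != 1:                       # (1 - delta_{1,m}) term
                val += A[h - 1, m - 1] * B[m - 1, m - 1]
            if h != 1:                       # (1 - delta_{h,1}) term
                val -= A[h - 1, m - 1] * B[m - 1, h - 1]
            mu[h, m] = val
    return mu
```

For the multi-root model of Section 3.3 one would instead build L̂(θ) = L(θ) + diag(r(θ)) and differentiate the same log-determinant; we do not spell out that variant here.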
4 Training Algorithms

This section describes two methods for parameter estimation that rely explicitly on the computation of the partition function and marginals.

4.1 Log-Linear Estimation

In conditional log-linear models (Johnson et al., 1999; Lafferty et al., 2001), a distribution over parse trees for a sentence x is defined as follows:

P(y | x; w) = exp{ Σ_{(h,m) ∈ y} w · f(x, h, m) } / Z(x; w)   (7)

where Z(x; w) is the partition function, a sum over T_p^s(x), T_np^s(x), T_p^m(x) or T_np^m(x). We train the model using the approach described by Sha and Pereira (2003). Assume that we have a training set {(x_i, y_i)}_{i=1}^N. The optimal parameters are taken to be w* = argmin_w L(w) where

L(w) = -C Σ_{i=1}^N log P(y_i | x_i; w) + (1/2) ||w||^2

The parameter C > 0 is a constant dictating the level of regularization in the model. Since L(w) is a convex function, gradient descent methods can be used to search for the global minimum. Such methods typically involve repeated computation of the loss L(w) and gradient ∂L(w)/∂w, requiring efficient implementations of both functions. Note that the log-probability of a parse is

log P(y | x; w) = Σ_{(h,m) ∈ y} w · f(x, h, m) - log Z(x; w)

so that the main issue in calculating the loss function L(w) is the evaluation of the partition functions Z(x_i; w). The gradient of the loss is given by

∂L(w)/∂w = w - C Σ_{i=1}^N Σ_{(h,m) ∈ y_i} f(x_i, h, m) + C Σ_{i=1}^N Σ_{(h,m) ∈ D(x_i)} μ_{h,m}(x_i; w) f(x_i, h, m)

where

μ_{h,m}(x; w) = Σ_{y ∈ T(x) : (h,m) ∈ y} P(y | x; w)

is the marginal probability of a dependency (h, m). Thus, the main issue in the evaluation of the gradient is the computation of the marginals μ_{h,m}(x_i; w).

Note that Eq. 7 forms a special case of the log-linear distribution defined in Eq. 2 in Section 2.2. If we set θ_{h,m} = w · f(x, h, m) then we have P(y | x; w) = P(y | x; θ), Z(x; w) = Z(x; θ), and μ_{h,m}(x; w) = μ_{h,m}(x; θ). Thus in the projective case the inside-outside algorithm can be used to calculate the partition function and marginals, thereby enabling training of a log-linear model; in the non-projective case the algorithms in Section 3 can be used for this purpose.
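The loss and gradient above are straightforward to assemble once the partition function and marginals are available. The following sketch is our own illustration of that assembly (not the authors' implementation); the callables examples, feat, deps and inference are assumed interfaces, with inference(theta) returning (log Z, mu) computed by the inside-outside algorithm in the projective case or the Matrix-Tree algorithms of Section 3 in the non-projective case.

```python
import numpy as np

def loglinear_objective(examples, w, C, feat, deps, inference):
    """L(w) and its gradient for the log-linear model of Section 4.1 (sketch).

    examples: list of (x, y) pairs, with y a set of (h, m) dependencies.
    feat(x, h, m): dense feature vector f(x, h, m).
    deps(x): the set D(x) of candidate dependencies.
    inference(theta): (log Z, mu) for per-dependency scores theta.
    """
    loss = 0.5 * np.dot(w, w)
    grad = w.copy()
    for x, y in examples:
        theta = {(h, m): np.dot(w, feat(x, h, m)) for (h, m) in deps(x)}
        log_z, mu = inference(theta)
        gold_score = sum(theta[(h, m)] for (h, m) in y)
        loss -= C * (gold_score - log_z)              # -C log P(y | x; w)
        for (h, m) in deps(x):
            grad += C * mu[(h, m)] * feat(x, h, m)    # expected features
        for (h, m) in y:
            grad -= C * feat(x, h, m)                 # observed features
    return loss, grad
```

A conjugate-gradient (or similar) optimizer can then be run directly on this (loss, gradient) pair, which is the route taken for the log-linear models in the experiments.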
4.2 Max-Margin Estimation

The second learning algorithm we consider is the large-margin approach for structured prediction (Taskar et al., 2004a; Taskar et al., 2004b). Learning in this framework again involves minimization of a convex function L(w). Let the margin for parse tree y on the i-th training example be defined as

m_{i,y}(w) = Σ_{(h,m) ∈ y_i} w · f(x_i, h, m) - Σ_{(h,m) ∈ y} w · f(x_i, h, m)

The loss function is then defined as

L(w) = C Σ_{i=1}^N max_{y ∈ T(x_i)} (E_{i,y} - m_{i,y}(w)) + (1/2) ||w||^2

where E_{i,y} is a measure of the loss or number of errors for parse y on the i-th training sentence. In this paper we take E_{i,y} to be the number of incorrect dependencies in the parse tree y when compared to the gold-standard parse tree y_i. The definition of L(w) makes use of the expression max_{y ∈ T(x_i)} (E_{i,y} - m_{i,y}(w)) for the i-th training example, which is commonly referred to as the hinge loss. Note that E_{i,y_i} = 0, and also that m_{i,y_i}(w) = 0, so that the hinge loss is always non-negative. In addition, the hinge loss is 0 if and only if m_{i,y}(w) ≥ E_{i,y} for all y ∈ T(x_i). Thus the hinge loss directly penalizes margins m_{i,y}(w) which are less than their corresponding losses E_{i,y}.

Figure 2 shows an algorithm for minimizing L(w) that is based on the exponentiated-gradient algorithm for large-margin optimization described by Bartlett et al. (2004). The algorithm maintains a set of weights θ_{i,h,m} for i = 1 ... N, (h, m) ∈ D(x_i), which are updated example-by-example. The algorithm relies on the repeated computation of marginal values μ_{i,h,m}, which are defined as follows:¹

μ_{i,h,m} = Σ_{y ∈ T(x_i) : (h,m) ∈ y} P(y | x_i)   (8)

where

P(y | x_i) = exp{ Σ_{(h,m) ∈ y} θ_{i,h,m} } / Σ_{y' ∈ T(x_i)} exp{ Σ_{(h,m) ∈ y'} θ_{i,h,m} }

A similar definition is used to derive marginal values μ'_{i,h,m} from the values θ'_{i,h,m}. Computation of the μ and μ' values is again inference of the form described in Problem 3 in Section 2.2, and can be achieved using the inside-outside algorithm for projective structures, and the algorithms described in Section 3 for non-projective structures.

¹ Bartlett et al. (2004) write P(y | x_i) as α_{i,y}. The α_{i,y} variables are dual variables that appear in the dual objective function, i.e., the convex dual of L(w). Analysis of the algorithm shows that as the θ_{i,h,m} variables are updated, the dual variables converge to the optimal point of the dual objective, and the parameters w converge to the minimum of L(w).

Inputs: Training examples {(x_i, y_i)}_{i=1}^N.
Parameters: Regularization constant C, starting point β, number of passes over training set T.
Data Structures: Real values θ_{i,h,m} and l_{i,h,m} for i = 1 ... N, (h, m) ∈ D(x_i). Learning rate η.
Initialization: Set learning rate η = 1/C. Set θ_{i,h,m} = β for (h, m) ∈ y_i, and θ_{i,h,m} = 0 for (h, m) ∉ y_i. Set l_{i,h,m} = 0 for (h, m) ∈ y_i, and l_{i,h,m} = 1 for (h, m) ∉ y_i. Calculate initial parameters as
  w = C Σ_i Σ_{(h,m) ∈ D(x_i)} δ_{i,h,m} f(x_i, h, m)
where δ_{i,h,m} = (1 - l_{i,h,m} - μ_{i,h,m}) and the μ_{i,h,m} values are calculated from the θ_{i,h,m} values as described in Eq. 8.
Algorithm: Repeat T passes over the training set, where each pass is as follows:
  Set obj = 0
  For i = 1 ... N:
    For all (h, m) ∈ D(x_i): θ'_{i,h,m} = θ_{i,h,m} + ηC (l_{i,h,m} + w · f(x_i, h, m))
    For example i, calculate marginals μ_{i,h,m} from the θ_{i,h,m} values, and marginals μ'_{i,h,m} from the θ'_{i,h,m} values (see Eq. 8).
    Update the parameters: w = w + C Σ_{(h,m) ∈ D(x_i)} δ_{i,h,m} f(x_i, h, m), where δ_{i,h,m} = μ_{i,h,m} - μ'_{i,h,m}.
    For all (h, m) ∈ D(x_i), set θ_{i,h,m} = θ'_{i,h,m}.
    Set obj = obj + C Σ_{(h,m) ∈ D(x_i)} l_{i,h,m} μ'_{i,h,m}.
  Set obj = obj - ||w||^2 / 2. If obj has decreased compared to the last iteration, set η = η/2.
Output: Parameter values w.

Figure 2: The EG Algorithm for Max-Margin Estimation. The learning rate η is halved each time the dual objective function (see (Bartlett et al., 2004)) fails to increase. In our experiments we chose β = 9, which was found to work well during development of the algorithm.
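Read end-to-end, Figure 2 is a short training loop over per-example dual scores. The sketch below is our own rendering of it (not the authors' code), following the figure as reconstructed above, including the η = 1/C initialization; examples, feat, deps and marginals_fn are assumed interfaces, with marginals_fn(theta) returning the marginals of Eq. 8 via the algorithms of Section 3 (or inside-outside in the projective case).

```python
import numpy as np

def eg_train(examples, feat, deps, marginals_fn, C, beta, T, dim):
    """EG max-margin training in the style of Figure 2 (illustrative sketch)."""
    # Per-example dual scores theta_{i,h,m} and per-dependency losses l_{i,h,m}.
    theta = [{d: (beta if d in y else 0.0) for d in deps(x)} for x, y in examples]
    loss  = [{d: (0.0 if d in y else 1.0) for d in deps(x)} for x, y in examples]

    # Initial primal parameters: w = C * sum_i sum_d (1 - l - mu) f(x_i, d).
    w = np.zeros(dim)
    for i, (x, y) in enumerate(examples):
        mu = marginals_fn(theta[i])
        for d in deps(x):
            w += C * (1.0 - loss[i][d] - mu[d]) * feat(x, *d)

    eta, prev_obj = 1.0 / C, -np.inf
    for _ in range(T):
        obj = 0.0
        for i, (x, y) in enumerate(examples):
            # Exponentiated-gradient step on the dual scores.
            new_theta = {d: theta[i][d] + eta * C * (loss[i][d] + np.dot(w, feat(x, *d)))
                         for d in deps(x)}
            mu, new_mu = marginals_fn(theta[i]), marginals_fn(new_theta)
            # Primal update reflecting the change in this example's marginals.
            for d in deps(x):
                w += C * (mu[d] - new_mu[d]) * feat(x, *d)
            theta[i] = new_theta
            obj += C * sum(loss[i][d] * new_mu[d] for d in deps(x))
        obj -= 0.5 * np.dot(w, w)
        if obj < prev_obj:       # dual objective failed to increase: halve eta
            eta /= 2.0
        prev_obj = obj
    return w
```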
5 Related Work

Global log-linear training has been used in the context of PCFG parsing (Johnson, 2001). Riezler et al. (2004) explore a similar application of log-linear models to LFG parsing. Max-margin learning has been applied to PCFG parsing by Taskar et al. (2004b). They show that this problem has a QP dual of polynomial size, where the dual variables correspond to marginal probabilities of CFG rules. A similar QP dual may be obtained for max-margin projective dependency parsing. However, for non-projective parsing, the dual QP would require an exponential number of constraints on the dependency marginals (Chopra, 1989). Nevertheless, alternative optimization methods like that of Tsochantaridis et al. (2004), or the EG method presented here, can still be applied.

The majority of previous work on dependency parsing has focused on local (i.e., classification of individual edges) discriminative training methods (Yamada and Matsumoto, 2003; Nivre et al., 2004; Y. Cheng, 2005). Non-local (i.e., classification of entire trees) training methods were used by McDonald et al. (2005a), who employed online learning.

Dependency parsing accuracy can be improved by allowing second-order features, which consider more than one dependency simultaneously. McDonald and Pereira (2006) define a second-order dependency parsing model in which interactions between adjacent siblings are allowed, and Carreras (2007) defines a second-order model that allows grandparent and sibling interactions. Both authors give polytime algorithms for exact projective parsing. By adapting the inside-outside algorithm to these models, partition functions and marginals can be computed for second-order projective structures, allowing log-linear and max-margin training to be applied via the framework developed in this paper.
For higher-order non-projective parsing, however, computational complexity results (McDonald and Pereira, 2006; McDonald and Satta, 2007) indicate that exact solutions to the three inference problems of Section 2.2 will be intractable. Exploration of approximate second-order non-projective inference is a natural avenue for future research.

Two other groups of authors have independently and simultaneously proposed adaptations of the Matrix-Tree Theorem for structured inference on directed spanning trees (McDonald and Satta, 2007; Smith and Smith, 2007). There are some algorithmic differences between these papers and ours. First, we define both multi-root and single-root algorithms, whereas the other papers only consider multi-root parsing.
This distinction can be important, as one often expects a dependency structure to have exactly one child attached to the root-symbol, as is the case in a single-root structure. Second, McDonald and Satta (2007) propose an O(n^5) algorithm for computing the marginals, as opposed to the O(n^3) matrix-inversion approach used by Smith and Smith (2007) and ourselves. In addition to the algorithmic differences, both groups of authors consider applications of the Matrix-Tree Theorem which we have not discussed. For example, both papers propose minimum-risk decoding, and McDonald and Satta (2007) discuss unsupervised learning and language modeling, while Smith and Smith (2007) define hidden-variable models based on spanning trees.

In this paper we used EG training methods only for max-margin models (Bartlett et al., 2004). However, Globerson et al. (2007) have recently shown how EG updates can be applied to efficient training of log-linear models.

6 Experiments on Dependency Parsing

In this section, we present experimental results applying our inference algorithms for dependency parsing models. Our primary purpose is to establish comparisons along two relevant dimensions: projective training vs. non-projective training, and marginal-based training algorithms vs. the averaged perceptron. The feature representation and other relevant dimensions are kept fixed in the experiments.

6.1 Data Sets and Features

We used data from the CoNLL-X shared task on multilingual dependency parsing (Buchholz and Marsi, 2006). In our experiments, we used a subset consisting of six languages; Table 1 gives details of the data sets used.² For each language we created a validation set that was a subset of the CoNLL-X training set for that language. The remainder of each training set was used to train the models for the different languages. The validation sets were used to tune the meta-parameters (e.g., the value of the regularization constant C) of the different training algorithms. We used the official test sets and evaluation script from the CoNLL-X task. All of the results that we report are for unlabeled dependency parsing.³

Table 1: Information for the languages in our experiments. The 2nd column (%cd) is the percentage of crossing dependencies in the training and validation sets. The last three columns report the size in tokens of the training, validation and test sets.

language    %cd    train     val.     test
Arabic       –      –,064    5,315    5,373
Dutch        –      –,861   16,208    5,585
Japanese     –      –,966    9,495    5,711
Slovene      –      –,949    5,801    6,390
Spanish      –      –,310   11,024    5,694
Turkish      –      –,827    5,683    7,547

The non-projective models were trained on the CoNLL-X data in its original form. Since the projective models assume that the dependencies in the data are non-crossing, we created a second training set for each language where non-projective dependency structures were automatically transformed into projective structures.

² Our subset includes the two languages with the lowest accuracy in the CoNLL-X evaluations (Turkish and Arabic), the language with the highest accuracy (Japanese), the most non-projective language (Dutch), a moderately non-projective language (Slovene), and a highly projective language (Spanish). All languages but Spanish have multi-root parses in their data. We are grateful to the providers of the treebanks that constituted the data of our experiments (Hajič et al., 2004; van der Beek et al., 2002; Kawata and Bartels, 2000; Džeroski et al., 2006; Civit and Martí, 2002; Oflazer et al., 2003).
All projective models were trained on these new training sets.⁴ Our feature space is based on that of McDonald et al. (2005a).⁵

6.2 Results

We performed experiments using three training algorithms: the averaged perceptron (Collins, 2002), log-linear training (via conjugate gradient descent), and max-margin training (via the EG algorithm). Each of these algorithms was trained using projective and non-projective methods, yielding six training settings per language. The different training algorithms have various meta-parameters, which we optimized on the validation set for each language/training-setting combination.

³ Our algorithms also support labeled parsing (see Section 3.4). Initial experiments with labeled models showed the same trend that we report here for unlabeled parsing, so for simplicity we conducted extensive experiments only for unlabeled parsing.

⁴ The transformations were performed by running the projective parser with score +1 on correct dependencies and -1 otherwise: the resulting trees are guaranteed to be projective and to have a minimum loss with respect to the correct tree. Note that only the training sets were transformed.

⁵ It should be noted that McDonald et al. (2006) use a richer feature set that is incomparable to our features.
Table 2: Test data results. The p and np columns show results with projective and non-projective training respectively. (Rows: Arabic, Dutch, Japanese, Slovene, Spanish and Turkish; column groups: Perceptron, Max-Margin and Log-Linear, each with p and np sub-columns.)

Table 3: Results for the three training algorithms on the different languages (P = perceptron, E = EG, L = log-linear models). AV is an average across the results for the different languages. (Columns: Ara, Dut, Jap, Slo, Spa, Tur, AV; rows: P, E, L.)

The averaged perceptron has a single meta-parameter, namely the number of iterations over the training set. The log-linear models have two meta-parameters: the regularization constant C and the number of gradient steps T taken by the conjugate-gradient optimizer. The EG approach also has two meta-parameters: the regularization constant C and the number of iterations, T.⁶ For models trained using non-projective algorithms, both projective and non-projective parsing was tested on the validation set, and the highest scoring of these two approaches was then used to decode test data sentences.

Table 2 reports test results for the six training scenarios. These results show that for Dutch, which is the language in our data that has the highest number of crossing dependencies, non-projective training gives significant gains over projective training for all three training methods. For the other languages, non-projective training gives similar or even improved performance over projective training.

Table 3 gives an additional set of results, which were calculated as follows. For each of the three training methods, we used the validation set results to choose between projective and non-projective training. This allows us to make a direct comparison of the three training algorithms. Table 3 shows the results of this comparison.⁷ The results show that log-linear and max-margin models both give a higher average accuracy than the perceptron. For some languages (e.g., Japanese), the differences from the perceptron are small; however for other languages (e.g., Arabic, Dutch or Slovene) the improvements seen are quite substantial.

7 Conclusions

This paper describes inference algorithms for spanning-tree distributions, focusing on the fundamental problems of computing partition functions and marginals. Although we concentrate on log-linear and max-margin estimation, the inference algorithms we present can serve as black-boxes in many other statistical modeling techniques. Our experiments suggest that marginal-based training produces more accurate models than perceptron learning. Notably, this is the first large-scale application of the EG algorithm, and shows that it is a promising approach for structured learning. In line with McDonald et al. (2005b), we confirm that spanning-tree models are well-suited to dependency parsing, especially for highly non-projective languages such as Dutch. Moreover, spanning-tree models should be useful for a variety of other problems involving structured data.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments.

⁶ We trained the perceptron for 100 iterations, and chose the iteration which led to the best score on the validation set. Note that in all of our experiments, the best perceptron results were actually obtained with 30 or fewer iterations. For the log-linear and EG algorithms we tested a number of values for C, and for each value of C ran 100 gradient steps or EG iterations, finally choosing the best combination of C and T found in validation.
In addition, the authors gratefully acknowledge the following sources of support. Terry Koo was funded by a grant from the NSF (DMS ) and a grant from NTT, Agmt. Dtd. 6/21/1998. Amir Globerson was supported by a fellowship from the Rothschild Foundation - Yad Hanadiv. Xavier Carreras was supported by the Catalan Ministry of Innovation, Universities and Enterprise, and a grant from NTT, Agmt. Dtd. 6/21/1998. Michael Collins was funded by NSF grants and DMS.

⁷ We ran the sign test at the sentence level to measure the statistical significance of the results aggregated across the six languages. Out of 2,472 sentences total, log-linear models gave improved parses over the perceptron on 448 sentences, and worse parses on 343 sentences. The max-margin method gave improved/worse parses for 500/383 sentences. Both results are significant with p
References

J. Baker. 1979. Trainable grammars for speech recognition. In 97th Meeting of the Acoustical Society of America.

P. Bartlett, M. Collins, B. Taskar, and D. McAllester. 2004. Exponentiated gradient algorithms for large margin structured classification. In NIPS.

L.E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41.

S. Buchholz and E. Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proc. CoNLL-X.

X. Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proc. EMNLP-CoNLL.

Y. Cheng, M. Asahara, and Y. Matsumoto. 2005. Machine learning-based dependency analyzer for Chinese. In Proc. ICCC.

S. Chopra. 1989. On the spanning tree polyhedron. Oper. Res. Lett.

Y.J. Chu and T.H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14.

M. Civit and M.A. Martí. 2002. Design principles for a Spanish treebank. In Proc. of the First Workshop on Treebanks and Linguistic Theories (TLT).

M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP.

S. Džeroski, T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtsky, and A. Žele. 2006. Towards a Slovene dependency treebank. In Proc. of the Fifth Intern. Conf. on Language Resources and Evaluation (LREC).

J. Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B.

J. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proc. COLING.

A. Globerson, T. Koo, X. Carreras, and M. Collins. 2007. Exponentiated gradient algorithms for log-linear structured prediction. In Proc. ICML.

J. Hajič, O. Smrž, P. Zemánek, J. Šnaidauf, and E. Beška. 2004. Prague Arabic dependency treebank: Development in data and tools. In Proc. of the NEMLAR Intern. Conf. on Arabic Language Resources and Tools.

M. Johnson, S. Geman, S. Canon, Z. Chi, and S. Riezler. 1999. Estimators for stochastic unification-based grammars. In Proc. ACL.

M. Johnson. 2001. Joint and conditional estimation of tagging and parsing models. In Proc. ACL.

Y. Kawata and J. Bartels. 2000. Stylebook for the Japanese treebank in VERBMOBIL. Verbmobil-Report 240, Seminar für Sprachwissenschaft, Universität Tübingen.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.

R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. EACL.

R. McDonald and G. Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proc. IWPT.

R. McDonald, K. Crammer, and F. Pereira. 2005a. Online large-margin training of dependency parsers. In Proc. ACL.

R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proc. HLT-EMNLP.

R. McDonald, K. Lerman, and F. Pereira. 2006. Multilingual dependency parsing with a two-stage discriminative parser. In Proc. CoNLL-X.

J. Nivre, J. Hall, and J. Nilsson. 2004. Memory-based dependency parsing. In Proc. CoNLL.

K. Oflazer, B. Say, D. Zeynep Hakkani-Tür, and G. Tür. 2003. Building a Turkish treebank. In A. Abeillé, editor, Treebanks: Building and Using Parsed Corpora, chapter 15. Kluwer Academic Publishers.

M.A. Paskin. 2001. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report UCB/CSD, University of California, Berkeley.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (2nd edition). Morgan Kaufmann Publishers.

S. Riezler, R. Kaplan, T. King, J. Maxwell, A. Vasserman, and R. Crouch. 2004. Speed and accuracy in shallow and deep stochastic parsing. In Proc. HLT-NAACL.

F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proc. HLT-NAACL.

N.A. Smith and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proc. ACL.

D.A. Smith and N.A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proc. EMNLP-CoNLL.

B. Taskar, C. Guestrin, and D. Koller. 2004a. Max-margin Markov networks. In NIPS.

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. 2004b. Max-margin parsing. In Proc. EMNLP.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proc. ICML.

W. Tutte. 1984. Graph Theory. Addison-Wesley.

L. van der Beek, G. Bouma, R. Malouf, and G. van Noord. 2002. The Alpino dependency treebank. In Computational Linguistics in the Netherlands (CLIN).

H. Yamada and Y. Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. IWPT.
More informationDependency Parsing domain adaptation using transductive SVM
Dependency Parsing domain adaptation using transductive SVM Antonio Valerio Miceli-Barone University of Pisa, Italy / Largo B. Pontecorvo, 3, Pisa, Italy miceli@di.unipi.it Giuseppe Attardi University
More informationIterative CKY parsing for Probabilistic Context-Free Grammars
Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST
More informationMEMMs (Log-Linear Tagging Models)
Chapter 8 MEMMs (Log-Linear Tagging Models) 8.1 Introduction In this chapter we return to the problem of tagging. We previously described hidden Markov models (HMMs) for tagging problems. This chapter
More informationStructure and Support Vector Machines. SPFLODD October 31, 2013
Structure and Support Vector Machines SPFLODD October 31, 2013 Outline SVMs for structured outputs Declara?ve view Procedural view Warning: Math Ahead Nota?on for Linear Models Training data: {(x 1, y
More informationMachine Learning Department School of Computer Science Carnegie Mellon University. K- Means + GMMs
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University K- Means + GMMs Clustering Readings: Murphy 25.5 Bishop 12.1, 12.3 HTF 14.3.0 Mitchell
More informationThe Perceptron. Simon Šuster, University of Groningen. Course Learning from data November 18, 2013
The Perceptron Simon Šuster, University of Groningen Course Learning from data November 18, 2013 References Hal Daumé III: A Course in Machine Learning http://ciml.info Tom M. Mitchell: Machine Learning
More informationRandom projection for non-gaussian mixture models
Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,
More informationConditional Random Fields - A probabilistic graphical model. Yen-Chin Lee 指導老師 : 鮑興國
Conditional Random Fields - A probabilistic graphical model Yen-Chin Lee 指導老師 : 鮑興國 Outline Labeling sequence data problem Introduction conditional random field (CRF) Different views on building a conditional
More informationDual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction
Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction Ming-Wei Chang Wen-tau Yih Microsoft Research Redmond, WA 98052, USA {minchang,scottyih}@microsoft.com Abstract Due to
More informationConditional Random Field for tracking user behavior based on his eye s movements 1
Conditional Random Field for tracing user behavior based on his eye s movements 1 Trinh Minh Tri Do Thierry Artières LIP6, Université Paris 6 LIP6, Université Paris 6 8 rue du capitaine Scott 8 rue du
More informationConfidence in Structured-Prediction using Confidence-Weighted Models
Confidence in Structured-Prediction using Confidence-Weighted Models Avihai Mejer Department of Computer Science Technion-Israel Institute of Technology Haifa 32, Israel amejer@tx.technion.ac.il Koby Crammer
More informationAssignment 4 CSE 517: Natural Language Processing
Assignment 4 CSE 517: Natural Language Processing University of Washington Winter 2016 Due: March 2, 2016, 1:30 pm 1 HMMs and PCFGs Here s the definition of a PCFG given in class on 2/17: A finite set
More informationFeature Extraction and Loss training using CRFs: A Project Report
Feature Extraction and Loss training using CRFs: A Project Report Ankan Saha Department of computer Science University of Chicago March 11, 2008 Abstract POS tagging has been a very important problem in
More informationEasy-First POS Tagging and Dependency Parsing with Beam Search
Easy-First POS Tagging and Dependency Parsing with Beam Search Ji Ma JingboZhu Tong Xiao Nan Yang Natrual Language Processing Lab., Northeastern University, Shenyang, China MOE-MS Key Lab of MCC, University
More informationOverview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010
INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,
More informationMore on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization
More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial
More informationProbabilistic parsing with a wide variety of features
Probabilistic parsing with a wide variety of features Mark Johnson Brown University IJCNLP, March 2004 Joint work with Eugene Charniak (Brown) and Michael Collins (MIT) upported by NF grants LI 9720368
More informationAT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands
AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands Svetlana Stoyanchev, Hyuckchul Jung, John Chen, Srinivas Bangalore AT&T Labs Research 1 AT&T Way Bedminster NJ 07921 {sveta,hjung,jchen,srini}@research.att.com
More informationSupplementary Material: The Emergence of. Organizing Structure in Conceptual Representation
Supplementary Material: The Emergence of Organizing Structure in Conceptual Representation Brenden M. Lake, 1,2 Neil D. Lawrence, 3 Joshua B. Tenenbaum, 4,5 1 Center for Data Science, New York University
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationOnline Graph Planarisation for Synchronous Parsing of Semantic and Syntactic Dependencies
Online Graph Planarisation for Synchronous Parsing of Semantic and Syntactic Dependencies Ivan Titov University of Illinois at Urbana-Champaign James Henderson, Paola Merlo, Gabriele Musillo University
More informationA New Perceptron Algorithm for Sequence Labeling with Non-local Features
A New Perceptron Algorithm for Sequence Labeling with Non-local Features Jun ichi Kazama and Kentaro Torisawa Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa, 923-1292
More informationMonotone Paths in Geometric Triangulations
Monotone Paths in Geometric Triangulations Adrian Dumitrescu Ritankar Mandal Csaba D. Tóth November 19, 2017 Abstract (I) We prove that the (maximum) number of monotone paths in a geometric triangulation
More informationHidden Markov Models in the context of genetic analysis
Hidden Markov Models in the context of genetic analysis Vincent Plagnol UCL Genetics Institute November 22, 2012 Outline 1 Introduction 2 Two basic problems Forward/backward Baum-Welch algorithm Viterbi
More informationAutomatic Domain Partitioning for Multi-Domain Learning
Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels
More informationDiscriminative Training for Phrase-Based Machine Translation
Discriminative Training for Phrase-Based Machine Translation Abhishek Arun 19 April 2007 Overview 1 Evolution from generative to discriminative models Discriminative training Model Learning schemes Featured
More informationHadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce
HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce Andrea Gesmundo Computer Science Department University of Geneva Geneva, Switzerland andrea.gesmundo@unige.ch
More informationDensity-Driven Cross-Lingual Transfer of Dependency Parsers
Density-Driven Cross-Lingual Transfer of Dependency Parsers Mohammad Sadegh Rasooli Michael Collins rasooli@cs.columbia.edu Presented by Owen Rambow EMNLP 2015 Motivation Availability of treebanks Accurate
More informationConditional Random Fields with High-Order Features for Sequence Labeling
Conditional Random Fields with High-Order Features for Sequence Labeling Nan Ye Wee Sun Lee Department of Computer Science National University of Singapore {yenan,leews}@comp.nus.edu.sg Hai Leong Chieu
More information