Structured Prediction Models via the Matrix-Tree Theorem


Terry Koo, Amir Globerson, Xavier Carreras and Michael Collins
MIT CSAIL, Cambridge, MA 02139, USA

Abstract

This paper provides an algorithmic framework for learning statistical models involving directed spanning trees, or equivalently non-projective dependency structures. We show how partition functions and marginals for directed spanning trees can be computed by an adaptation of Kirchhoff's Matrix-Tree Theorem. To demonstrate an application of the method, we perform experiments which use the algorithm in training both log-linear and max-margin dependency parsers. The new training methods give improvements in accuracy over perceptron-trained models.

1 Introduction

Learning with structured data typically involves searching or summing over a set with an exponential number of structured elements, for example the set of all parse trees for a given sentence. Methods for summing over such structures include the inside-outside algorithm for probabilistic context-free grammars (Baker, 1979), the forward-backward algorithm for hidden Markov models (Baum et al., 1970), and the belief-propagation algorithm for graphical models (Pearl, 1988). These algorithms compute marginal probabilities and partition functions, quantities which are central to many methods for the statistical modeling of complex structures (e.g., the EM algorithm (Baker, 1979; Baum et al., 1970), contrastive estimation (Smith and Eisner, 2005), training algorithms for CRFs (Lafferty et al., 2001), and training algorithms for max-margin models (Bartlett et al., 2004; Taskar et al., 2004a)).

This paper describes inside-outside-style algorithms for the case of directed spanning trees. These structures are equivalent to non-projective dependency parses (McDonald et al., 2005b), and more generally could be relevant to any task that involves learning a mapping from a graph to an underlying spanning tree. Unlike the case for projective dependency structures, partition functions and marginals for non-projective trees cannot be computed using dynamic-programming methods such as the inside-outside algorithm. In this paper we describe how these quantities can be computed by adapting a well-known result in graph theory: Kirchhoff's Matrix-Tree Theorem (Tutte, 1984). A naïve application of the theorem yields $O(n^4)$ and $O(n^6)$ algorithms for computation of the partition function and marginals, respectively. However, our adaptation finds the partition function and marginals in $O(n^3)$ time using simple matrix determinant and inversion operations.

We demonstrate an application of the new inference algorithm to non-projective dependency parsing. Specifically, we show how to implement two popular supervised learning approaches for this task: globally-normalized log-linear models and max-margin models. Log-linear estimation critically depends on the calculation of partition functions and marginals, which can be computed by our algorithms. For max-margin models, Bartlett et al. (2004) have provided a simple training algorithm, based on exponentiated-gradient (EG) updates, that requires computation of marginals and can thus be implemented within our framework. Both of these methods explicitly minimize the loss incurred when parsing the entire training set. This contrasts with the online learning algorithms used in previous work with spanning-tree models (McDonald et al., 2005b).
We applied the above two marginal-based training algorithms to six languages with varying degrees of non-projectivity, using datasets obtained from the CoNLL-X shared task (Buchholz and Marsi, 2006). Our experimental framework compared three training approaches: log-linear models, max-margin models, and the averaged perceptron. Each of these was applied to both projective and non-projective parsing. Our results demonstrate that marginal-based training yields models which outperform those trained using the averaged perceptron.

In summary, the contributions of this paper are:

1. We introduce algorithms for inside-outside-style calculations for directed spanning trees, or equivalently non-projective dependency structures. These algorithms should have wide applicability in learning problems involving spanning-tree structures.

2. We illustrate the utility of these algorithms in log-linear training of dependency parsing models, and show improvements in accuracy when compared to averaged-perceptron training.

3. We also train max-margin models for dependency parsing via an EG algorithm (Bartlett et al., 2004). The experiments presented here constitute the first application of this algorithm to a large-scale problem. We again show improved performance over the perceptron.

The goal of our experiments is to give a rigorous comparative study of the marginal-based training algorithms and a highly-competitive baseline, the averaged perceptron, using the same feature sets for all approaches. We stress, however, that the purpose of this work is not to give competitive performance on the CoNLL data sets; this would require further engineering of the approach. Similar adaptations of the Matrix-Tree Theorem have been developed independently and simultaneously by Smith and Smith (2007) and McDonald and Satta (2007); see Section 5 for more discussion.

2 Background

2.1 Discriminative Dependency Parsing

Dependency parsing is the task of mapping a sentence x to a dependency structure y. Given a sentence x with n words, a dependency for that sentence is a tuple (h, m) where $h \in [0 \ldots n]$ is the index of the head word in the sentence, and $m \in [1 \ldots n]$ is the index of a modifier word. The value h = 0 is a special root-symbol that may only appear as the head of a dependency. We use D(x) to refer to all possible dependencies for a sentence x: $D(x) = \{(h, m) : h \in [0 \ldots n],\ m \in [1 \ldots n]\}$.

A dependency parse is a set of dependencies that forms a directed tree, with the sentence's root-symbol as its root. We will consider both projective trees, where dependencies are not allowed to cross, and non-projective trees, where crossing dependencies are allowed. Dependency annotations for some languages, for example Czech, can exhibit a significant number of crossing dependencies. In addition, we consider both single-root and multi-root trees. In a single-root tree y, the root-symbol has exactly one child, while in a multi-root tree, the root-symbol has one or more children. This distinction is relevant as our training sets include both single-root corpora (in which all trees are single-root structures) and multi-root corpora (in which some trees are multi-root structures). The two distinctions described above are orthogonal, yielding four classes of dependency structures; see Figure 1 for examples of each kind of structure.

Figure 1: Examples of the four types of dependency structures, drawn over the sentence "He saw her": projective vs. non-projective, each with a single-root or multi-root analysis. We draw dependency arcs from head to modifier.

We use $T_p^s(x)$ to denote the set of all possible projective single-root dependency structures for a sentence x, and $T_{np}^s(x)$ to denote the set of single-root non-projective structures for x. The sets $T_p^m(x)$ and $T_{np}^m(x)$ are defined analogously for multi-root structures. In contexts where any class of dependency structures may be used, we use the notation T(x) as a placeholder that may be defined as $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$.

Following McDonald et al. (2005a), we use a discriminative model for dependency parsing. Features in the model are defined through a function f(x, h, m) which maps a sentence x together with a dependency (h, m) to a feature vector in $\mathbb{R}^d$. A feature vector can be sensitive to any properties of the triple (x, h, m). Given a parameter vector w, the optimal dependency structure for a sentence x is

$$y^*(x; w) = \operatorname*{argmax}_{y \in T(x)} \sum_{(h,m) \in y} w \cdot f(x, h, m) \quad (1)$$

where the set T(x) can be defined as $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$, depending on the type of parsing.
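As a concrete illustration of the arc-factored scoring in Eq. 1, the sketch below builds a matrix of dependency scores $\theta_{h,m} = w \cdot f(x, h, m)$ from a weight vector and a feature function, and evaluates the objective of Eq. 1 for one candidate structure y. The helper `dep_features` is a hypothetical stand-in of our own, not the feature set of McDonald et al. (2005a).

```python
import numpy as np

def dep_features(x, h, m, d=8):
    """Hypothetical stand-in feature map: hashes a few properties of (x, h, m)
    into a d-dimensional 0/1 vector. Real parsers use rich, lexicalized features."""
    f = np.zeros(d)
    head = "<root>" if h == 0 else x[h - 1]
    f[hash((head, x[m - 1])) % d] = 1.0              # head-modifier word pair
    f[hash(("dist", min(abs(h - m), 5))) % d] = 1.0  # bucketed head-modifier distance
    return f

def score_matrix(x, w):
    """theta[h, m] = w . f(x, h, m) for all (h, m) in D(x), with h in 0..n and m in 1..n."""
    n = len(x)
    theta = np.full((n + 1, n + 1), -np.inf)
    for h in range(n + 1):
        for m in range(1, n + 1):
            if h != m:
                theta[h, m] = w @ dep_features(x, h, m)
    return theta

def tree_score(theta, y):
    """Score of a dependency structure y, given as a set of (h, m) arcs (the objective in Eq. 1)."""
    return sum(theta[h, m] for (h, m) in y)

x = ["He", "saw", "her"]
w = np.random.randn(8)
theta = score_matrix(x, w)
y = {(0, 2), (2, 1), (2, 3)}   # root -> saw, saw -> He, saw -> her
print(tree_score(theta, y))
```

Decoding (Eq. 1 itself) would then search for the highest-scoring y in T(x), e.g. with the CLE algorithm in the non-projective case discussed below.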

The parameters w will be learned from a training set $\{(x_i, y_i)\}_{i=1}^N$, where each $x_i$ is a sentence and each $y_i$ is a dependency structure. Much of the previous work on learning w has focused on training local models (see Section 5). McDonald et al. (2005a; 2005b) trained global models using online algorithms such as the perceptron algorithm or MIRA. In this paper we consider training algorithms based on work in conditional random fields (CRFs) (Lafferty et al., 2001) and max-margin methods (Taskar et al., 2004a).

2.2 Three Inference Problems

This section highlights three inference problems which arise in training and decoding discriminative dependency parsers, and which are central to the approaches described in this paper. Assume that we have a vector θ with values $\theta_{h,m} \in \mathbb{R}$ for all $(h, m) \in D(x)$; these values correspond to weights on the different dependencies in D(x). Define a conditional distribution over all dependency structures $y \in T(x)$ as follows:

$$P(y \mid x; \theta) = \frac{\exp\left\{\sum_{(h,m) \in y} \theta_{h,m}\right\}}{Z(x; \theta)} \quad (2)$$

$$Z(x; \theta) = \sum_{y \in T(x)} \exp\left\{\sum_{(h,m) \in y} \theta_{h,m}\right\} \quad (3)$$

The function Z(x; θ) is commonly referred to as the partition function. Given the distribution P(y | x; θ), we can define the marginal probability of a dependency (h, m) as

$$\mu_{h,m}(x; \theta) = \sum_{y \in T(x) :\, (h,m) \in y} P(y \mid x; \theta)$$

The inference problems are then as follows:

Problem 1: Decoding: Find $\operatorname{argmax}_{y \in T(x)} \sum_{(h,m) \in y} \theta_{h,m}$.

Problem 2: Computation of the Partition Function: Calculate Z(x; θ).

Problem 3: Computation of the Marginals: For all $(h, m) \in D(x)$, calculate $\mu_{h,m}(x; \theta)$.

Note that all three problems require a maximization or summation over the set T(x), which is exponential in size. There is a clear motivation for being able to solve Problem 1: by setting $\theta_{h,m} = w \cdot f(x, h, m)$, the optimal dependency structure $y^*(x; w)$ (see Eq. 1) can be computed. In this paper the motivation for solving Problems 2 and 3 arises from training algorithms for discriminative models. As we will describe in Section 4, both log-linear and max-margin models can be trained via methods that make direct use of algorithms for Problems 2 and 3.

In the case of projective dependency structures (i.e., T(x) defined as $T_p^s(x)$ or $T_p^m(x)$), there are well-known algorithms for all three inference problems. Decoding can be carried out using Viterbi-style dynamic-programming algorithms, for example the $O(n^3)$ algorithm of Eisner (1996). Computation of the marginals and partition function can also be achieved in $O(n^3)$ time, using a variant of the inside-outside algorithm (Baker, 1979) applied to the Eisner (1996) data structures (Paskin, 2001). In the non-projective case (i.e., T(x) defined as $T_{np}^s(x)$ or $T_{np}^m(x)$), McDonald et al. (2005b) describe how the CLE algorithm (Chu and Liu, 1965; Edmonds, 1967) can be used for decoding. However, it is not possible to compute the marginals and partition function using the inside-outside algorithm. We next describe a method for computing these quantities in $O(n^3)$ time using matrix inverse and determinant operations.
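For very short sentences, Problems 2 and 3 can be solved directly from the definitions in Eqs. 2-3 by enumerating every single-root non-projective structure. The brute-force sketch below does exactly that; it is exponential in n and useful only as a correctness check for the Matrix-Tree computations of Section 3. It assumes a dense score matrix `theta[h, m]`, as built in the previous sketch.

```python
import itertools
import numpy as np

def single_root_trees(n):
    """Enumerate all y in T^s_np(x) for an n-word sentence: one head per word,
    exactly one child of the root-symbol, and no cycles."""
    for heads in itertools.product(range(n + 1), repeat=n):   # heads[m-1] is the head of word m
        if sum(h == 0 for h in heads) != 1:
            continue                                           # single-root constraint
        ok = True
        for m in range(1, n + 1):                              # reject cycles: every word must reach the root
            seen, cur = set(), m
            while cur != 0:
                if cur in seen:
                    ok = False
                    break
                seen.add(cur)
                cur = heads[cur - 1]
            if not ok:
                break
        if ok:
            yield {(heads[m - 1], m) for m in range(1, n + 1)}

def brute_force_inference(theta, n):
    """Partition function Z(x; theta) and marginals mu[h, m] by explicit summation (Eqs. 2-3)."""
    Z = 0.0
    mu = np.zeros((n + 1, n + 1))
    for y in single_root_trees(n):
        weight = np.exp(sum(theta[h, m] for (h, m) in y))
        Z += weight
        for (h, m) in y:
            mu[h, m] += weight
    return Z, mu / Z

n = 3
theta = np.random.randn(n + 1, n + 1)
Z, mu = brute_force_inference(theta, n)
print(Z, mu[0, 1])
```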
3 Spanning-tree inference using the Matrix-Tree Theorem

In this section we present algorithms for computing the partition function and marginals, as defined in Section 2.2, for non-projective parsing. We first reiterate the observation of McDonald et al. (2005a) that non-projective parses correspond to directed spanning trees on a complete directed graph of n nodes, where n is the length of the sentence. The above inference problems thus involve summation over the set of all directed spanning trees. Note that this set is exponentially large, and there is no obvious method for decomposing the sum into dynamic-programming-like subproblems. This section describes how a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984) can be used to evaluate the partition function and marginals efficiently. In what follows, we consider the single-root setting (i.e., $T(x) = T_{np}^s(x)$), leaving the multi-root case (i.e., $T(x) = T_{np}^m(x)$) to Section 3.3.

For a sentence x with n words, define a complete directed graph G on n nodes, where each node corresponds to a word in x, and each edge corresponds to a dependency between two words in x. Note that G does not include the root-symbol h = 0, nor does it account for any dependencies (0, m) headed by the root-symbol. We assign non-negative weights to the edges of this graph, yielding the following weighted adjacency matrix $A(\theta) \in \mathbb{R}^{n \times n}$, for $h, m = 1 \ldots n$:

$$A_{h,m}(\theta) = \begin{cases} 0 & \text{if } h = m \\ \exp\{\theta_{h,m}\} & \text{otherwise} \end{cases}$$

To account for the dependencies (0, m) headed by the root-symbol, we define a vector of root-selection scores $r(\theta) \in \mathbb{R}^n$, for $m = 1 \ldots n$:

$$r_m(\theta) = \exp\{\theta_{0,m}\}$$

Let the weight of a dependency structure $y \in T_{np}^s(x)$ be defined as:

$$\psi(y; \theta) = r_{\mathrm{root}(y)}(\theta) \prod_{(h,m) \in y :\, h \neq 0} A_{h,m}(\theta)$$

Here, $\mathrm{root}(y) = m : (0, m) \in y$ is the child of the root-symbol; there is exactly one such child, since $y \in T_{np}^s(x)$. Eq. 2 and 3 can be rephrased as:

$$P(y \mid x; \theta) = \frac{\psi(y; \theta)}{Z(x; \theta)} \quad (4)$$

$$Z(x; \theta) = \sum_{y \in T_{np}^s(x)} \psi(y; \theta) \quad (5)$$

In the remainder of this section, we drop the notational dependence on x for brevity.

The original Matrix-Tree Theorem addressed the problem of counting the number of undirected spanning trees in an undirected graph. For the models we study here, we require a sum of weighted and directed spanning trees. Tutte (1984) extended the Matrix-Tree Theorem to this case. We briefly summarize his method below. First, define the Laplacian matrix $L(\theta) \in \mathbb{R}^{n \times n}$ of G, for $h, m = 1 \ldots n$:

$$L_{h,m}(\theta) = \begin{cases} \sum_{h'=1}^{n} A_{h',m}(\theta) & \text{if } h = m \\ -A_{h,m}(\theta) & \text{otherwise} \end{cases}$$

Second, for a matrix X, let $X^{(h,m)}$ be the minor of X with respect to row h and column m; i.e., the determinant of the matrix formed by deleting row h and column m from X. Finally, define the weight of any directed spanning tree of G to be the product of the weights $A_{h,m}(\theta)$ for the edges in that tree.

Theorem 1 (Tutte, 1984, p. 140). Let L(θ) be the Laplacian matrix of G. Then $L^{(m,m)}(\theta)$ is equal to the sum of the weights of all directed spanning trees of G which are rooted at m.

Furthermore, the minors vary only in sign when traversing the columns of the Laplacian (Tutte, 1984, p. 150):

$$\forall h, m: \quad (-1)^{h+m} L^{(h,m)}(\theta) = L^{(m,m)}(\theta) \quad (6)$$

3.1 Partition functions via matrix determinants

From Theorem 1, it directly follows that

$$L^{(m,m)}(\theta) = \sum_{y \in U(m)} \prod_{(h,m') \in y :\, h \neq 0} A_{h,m'}(\theta)$$

where $U(m) = \{y \in T_{np}^s : \mathrm{root}(y) = m\}$. A naïve method for computing the partition function is therefore to evaluate

$$Z(\theta) = \sum_{m=1}^{n} r_m(\theta)\, L^{(m,m)}(\theta)$$

The above would require calculating n determinants, resulting in $O(n^4)$ complexity. However, as we show below, Z(θ) may be obtained in $O(n^3)$ time using a single determinant evaluation. Define a new matrix $\hat{L}(\theta)$ to be L(θ) with the first row replaced by the root-selection scores:

$$\hat{L}_{h,m}(\theta) = \begin{cases} r_m(\theta) & h = 1 \\ L_{h,m}(\theta) & h > 1 \end{cases}$$

This matrix allows direct computation of the partition function, as the following proposition shows.

Proposition 1 The partition function in Eq. 5 is given by $Z(\theta) = |\hat{L}(\theta)|$.

Proof: Consider the row expansion of $|\hat{L}(\theta)|$ with respect to row 1:

$$|\hat{L}(\theta)| = \sum_{m=1}^{n} (-1)^{1+m}\, \hat{L}_{1,m}(\theta)\, \hat{L}^{(1,m)}(\theta) = \sum_{m=1}^{n} (-1)^{1+m}\, r_m(\theta)\, L^{(1,m)}(\theta) = \sum_{m=1}^{n} r_m(\theta)\, L^{(m,m)}(\theta) = Z(\theta)$$

The second line follows from the construction of $\hat{L}(\theta)$, and the third line follows from Eq. 6.
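The sketch below is a minimal numpy rendering of this construction: it assembles A(θ), r(θ), the Laplacian L(θ), and the matrix L̂(θ) with its first row replaced by the root-selection scores, then computes Z(θ) as a single determinant (Proposition 1). For small n the result can be checked against the brute-force enumeration sketched earlier; for long sentences one would work with scaled or log-domain quantities to avoid overflow in exp(θ), which is omitted here.

```python
import numpy as np

def build_L_hat(theta, n):
    """Construct A(theta), r(theta) and the matrix L_hat(theta) of Proposition 1
    from a dense score matrix theta[h, m] with h = 0 denoting the root-symbol."""
    A = np.exp(theta[1:n + 1, 1:n + 1])      # A[h-1, m-1] = exp(theta_{h,m}) for h, m >= 1
    np.fill_diagonal(A, 0.0)                 # A_{h,h} = 0
    r = np.exp(theta[0, 1:n + 1])            # root-selection scores r_m = exp(theta_{0,m})
    L = np.diag(A.sum(axis=0)) - A           # L_{m,m} = sum_{h'} A_{h',m};  L_{h,m} = -A_{h,m}
    L_hat = L.copy()
    L_hat[0, :] = r                          # first row replaced by the root-selection scores
    return A, r, L_hat

def partition_function(theta, n):
    """Z(theta) = |L_hat(theta)| (Proposition 1)."""
    _, _, L_hat = build_L_hat(theta, n)
    return np.linalg.det(L_hat)

n = 3
theta = np.random.randn(n + 1, n + 1)
print(partition_function(theta, n))          # should agree with the brute-force Z above
```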

3.2 Marginals via matrix inversion

The marginals we require are given by

$$\mu_{h,m}(\theta) = \frac{1}{Z(\theta)} \sum_{y \in T_{np}^s :\, (h,m) \in y} \psi(y; \theta)$$

To calculate these marginals efficiently for all values of (h, m) we use a well-known identity relating the log partition-function to marginals:

$$\mu_{h,m}(\theta) = \frac{\partial \log Z(\theta)}{\partial \theta_{h,m}}$$

Since the partition function in this case has a closed-form expression (i.e., the determinant of a matrix constructed from θ), the marginals can also be obtained in closed form. Using the chain rule, the derivative of the log partition-function in Proposition 1 is

$$\mu_{h,m}(\theta) = \frac{\partial \log |\hat{L}(\theta)|}{\partial \theta_{h,m}} = \sum_{h'=1}^{n} \sum_{m'=1}^{n} \frac{\partial \log |\hat{L}(\theta)|}{\partial \hat{L}_{h',m'}(\theta)} \frac{\partial \hat{L}_{h',m'}(\theta)}{\partial \theta_{h,m}}$$

To perform the derivative, we use the identity $\frac{\partial \log |X|}{\partial X} = (X^{-1})^T$ and the fact that $\partial \hat{L}_{h',m'}(\theta) / \partial \theta_{h,m}$ is nonzero for only a few $h', m'$. Specifically, when h = 0, the marginals are given by

$$\mu_{0,m}(\theta) = r_m(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,1}$$

and for h > 0, the marginals are given by

$$\mu_{h,m}(\theta) = (1 - \delta_{1,m})\, A_{h,m}(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,m} - (1 - \delta_{h,1})\, A_{h,m}(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,h}$$

where $\delta_{h,m}$ is the Kronecker delta. Thus, the complexity of evaluating all the relevant marginals is dominated by the matrix inversion, and the total complexity is therefore $O(n^3)$.

3.3 Multiple Roots

In the case of multiple roots, we can still compute the partition function and marginals efficiently. In fact, the derivation of this case is simpler than for single-root structures. Create an extended graph G′ which augments G with a dummy root node that has edges pointing to all of the existing nodes, weighted by the appropriate root-selection scores. Note that there is a bijection between directed spanning trees of G′ rooted at the dummy root and multi-root structures $y \in T_{np}^m(x)$. Thus, Theorem 1 can be used to compute the partition function directly: construct a Laplacian matrix for G′ and compute the minor $L^{(0,0)}(\theta)$. Since this minor is also a determinant, the marginals can be obtained analogously to the single-root case. More concretely, this technique corresponds to defining the matrix $\hat{L}(\theta)$ as

$$\hat{L}(\theta) = L(\theta) + \mathrm{diag}(r(\theta))$$

where diag(v) is the diagonal matrix with the vector v on its diagonal.

3.4 Labeled Trees

The techniques above extend easily to the case where dependencies are labeled. For a model with L different labels, it suffices to define the edge and root scores as $A_{h,m}(\theta) = \sum_{l=1}^{L} \exp\{\theta_{h,m,l}\}$ and $r_m(\theta) = \sum_{l=1}^{L} \exp\{\theta_{0,m,l}\}$. The partition function over labeled trees is obtained by operating on these values as described previously, and the marginals are given by an application of the chain rule. Both inference problems are solvable in $O(n^3 + Ln^2)$ time.
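Continuing the numpy sketches above, the closed-form single-root marginals translate directly into one matrix inversion. The multi-root variant below assembles L̂(θ) = L(θ) + diag(r(θ)) as in Section 3.3; its explicit marginal formulas are not written out in the text, so the ones used here are our own derivation from the same ∂ log|L̂(θ)| / ∂θ identity. Both functions are unoptimized sketches working directly with exp(θ).

```python
import numpy as np

def _edge_scores(theta, n):
    """A(theta) and r(theta) from a dense score matrix theta[h, m] (h = 0 is the root-symbol)."""
    A = np.exp(theta[1:n + 1, 1:n + 1])
    np.fill_diagonal(A, 0.0)
    r = np.exp(theta[0, 1:n + 1])
    return A, r

def marginals_single_root(theta, n):
    """mu[h, m] for single-root structures, via the closed-form expressions of Section 3.2."""
    A, r = _edge_scores(theta, n)
    L_hat = np.diag(A.sum(axis=0)) - A
    L_hat[0, :] = r                                 # first row replaced by root-selection scores
    L_inv = np.linalg.inv(L_hat)
    mu = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        mu[0, m] = r[m - 1] * L_inv[m - 1, 0]       # mu_{0,m} = r_m [L_hat^{-1}]_{m,1}
        for h in range(1, n + 1):
            if h == m:
                continue
            mu[h, m] = A[h - 1, m - 1] * ((m != 1) * L_inv[m - 1, m - 1]
                                          - (h != 1) * L_inv[m - 1, h - 1])
    return mu

def marginals_multi_root(theta, n):
    """Multi-root variant: L_hat = L + diag(r) (Section 3.3). The marginal formulas here
    are derived from d log|L_hat| / d theta; the paper states them only implicitly."""
    A, r = _edge_scores(theta, n)
    L_inv = np.linalg.inv(np.diag(A.sum(axis=0)) - A + np.diag(r))
    mu = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        mu[0, m] = r[m - 1] * L_inv[m - 1, m - 1]
        for h in range(1, n + 1):
            if h != m:
                mu[h, m] = A[h - 1, m - 1] * (L_inv[m - 1, m - 1] - L_inv[m - 1, h - 1])
    return mu
```

For labeled parsing (Section 3.4), only `_edge_scores` changes: each entry becomes a sum of exp(θ_{h,m,l}) over labels, and per-label marginals follow by one more application of the chain rule.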

4 Training Algorithms

This section describes two methods for parameter estimation that rely explicitly on the computation of the partition function and marginals.

4.1 Log-Linear Estimation

In conditional log-linear models (Johnson et al., 1999; Lafferty et al., 2001), a distribution over parse trees for a sentence x is defined as follows:

$$P(y \mid x; w) = \frac{\exp\left\{\sum_{(h,m) \in y} w \cdot f(x, h, m)\right\}}{Z(x; w)} \quad (7)$$

where Z(x; w) is the partition function, a sum over $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$. We train the model using the approach described by Sha and Pereira (2003). Assume that we have a training set $\{(x_i, y_i)\}_{i=1}^N$. The optimal parameters are taken to be $w^* = \operatorname{argmin}_w L(w)$ where

$$L(w) = -C \sum_{i=1}^{N} \log P(y_i \mid x_i; w) + \frac{1}{2}\|w\|^2$$

The parameter C > 0 is a constant dictating the level of regularization in the model. Since L(w) is a convex function, gradient descent methods can be used to search for the global minimum. Such methods typically involve repeated computation of the loss L(w) and gradient $\partial L(w) / \partial w$, requiring efficient implementations of both functions. Note that the log-probability of a parse is

$$\log P(y \mid x; w) = \sum_{(h,m) \in y} w \cdot f(x, h, m) - \log Z(x; w)$$

so that the main issue in calculating the loss function L(w) is the evaluation of the partition functions $Z(x_i; w)$. The gradient of the loss is given by

$$\frac{\partial L(w)}{\partial w} = w - C \sum_{i=1}^{N} \sum_{(h,m) \in y_i} f(x_i, h, m) + C \sum_{i=1}^{N} \sum_{(h,m) \in D(x_i)} \mu_{h,m}(x_i; w)\, f(x_i, h, m)$$

where

$$\mu_{h,m}(x; w) = \sum_{y \in T(x) :\, (h,m) \in y} P(y \mid x; w)$$

is the marginal probability of a dependency (h, m). Thus, the main issue in the evaluation of the gradient is the computation of the marginals $\mu_{h,m}(x_i; w)$.

Note that Eq. 7 forms a special case of the log-linear distribution defined in Eq. 2 in Section 2.2. If we set $\theta_{h,m} = w \cdot f(x, h, m)$ then we have $P(y \mid x; w) = P(y \mid x; \theta)$, $Z(x; w) = Z(x; \theta)$, and $\mu_{h,m}(x; w) = \mu_{h,m}(x; \theta)$. Thus in the projective case the inside-outside algorithm can be used to calculate the partition function and marginals, thereby enabling training of a log-linear model; in the non-projective case the algorithms in Section 3 can be used for this purpose.

4.2 Max-Margin Estimation

The second learning algorithm we consider is the large-margin approach for structured prediction (Taskar et al., 2004a; Taskar et al., 2004b). Learning in this framework again involves minimization of a convex function L(w). Let the margin for parse tree y on the i-th training example be defined as

$$m_{i,y}(w) = \sum_{(h,m) \in y_i} w \cdot f(x_i, h, m) - \sum_{(h,m) \in y} w \cdot f(x_i, h, m)$$

The loss function is then defined as

$$L(w) = C \sum_{i=1}^{N} \max_{y \in T(x_i)} \left(E_{i,y} - m_{i,y}(w)\right) + \frac{1}{2}\|w\|^2$$

where $E_{i,y}$ is a measure of the loss, or number of errors, for parse y on the i-th training sentence. In this paper we take $E_{i,y}$ to be the number of incorrect dependencies in the parse tree y when compared to the gold-standard parse tree $y_i$. The definition of L(w) makes use of the expression $\max_{y \in T(x_i)} (E_{i,y} - m_{i,y}(w))$ for the i-th training example, which is commonly referred to as the hinge loss. Note that $E_{i,y_i} = 0$, and also that $m_{i,y_i}(w) = 0$, so that the hinge loss is always non-negative. In addition, the hinge loss is 0 if and only if $m_{i,y}(w) \geq E_{i,y}$ for all $y \in T(x_i)$. Thus the hinge loss directly penalizes margins $m_{i,y}(w)$ which are less than their corresponding losses $E_{i,y}$.

Figure 2 shows an algorithm for minimizing L(w) that is based on the exponentiated-gradient algorithm for large-margin optimization described by Bartlett et al. (2004). The algorithm maintains a set of weights $\theta_{i,h,m}$ for $i = 1 \ldots N$, $(h, m) \in D(x_i)$, which are updated example-by-example. The algorithm relies on the repeated computation of marginal values $\mu_{i,h,m}$, which are defined as follows:¹

$$\mu_{i,h,m} = \sum_{y \in T(x_i) :\, (h,m) \in y} P(y \mid x_i) \qquad P(y \mid x_i) = \frac{\exp\left\{\sum_{(h,m) \in y} \theta_{i,h,m}\right\}}{\sum_{y' \in T(x_i)} \exp\left\{\sum_{(h,m) \in y'} \theta_{i,h,m}\right\}} \quad (8)$$

A similar definition is used to derive marginal values $\mu'_{i,h,m}$ from the values $\theta'_{i,h,m}$.

¹ Bartlett et al. (2004) write $P(y \mid x_i)$ as $\alpha_{i,y}$. The $\alpha_{i,y}$ variables are dual variables that appear in the dual objective function, i.e., the convex dual of L(w). Analysis of the algorithm shows that as the $\theta_{i,h,m}$ variables are updated, the dual variables converge to the optimal point of the dual objective, and the parameters w converge to the minimum of L(w).
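Both training methods reduce to the inference routines of Section 3: the log-linear gradient of Section 4.1 needs $Z(x_i; w)$ and $\mu_{h,m}(x_i; w)$, and the EG algorithm's $\mu_{i,h,m}$ and $\mu'_{i,h,m}$ in Eq. 8 are the same marginal computation applied to per-example scores. As one concrete illustration, the sketch below assembles a single example's contribution to the log-linear loss and gradient; `feats` and `inference` are assumed helpers (e.g., the feature and Matrix-Tree sketches given earlier), not part of the paper's implementation.

```python
import numpy as np

def loglinear_example_loss_and_grad(x, y_gold, w, feats, inference, C=1.0):
    """Contribution of one training example (x, y_gold) to L(w) and dL/dw (Section 4.1).

    feats(x, h, m)      -> feature vector f(x, h, m)                (hypothetical helper)
    inference(theta, n) -> (Z, mu) for the chosen tree set T(x),
                           e.g. the Matrix-Tree routines sketched above.
    The regularizer (1/2)||w||^2 and its gradient w are added once, outside the sum over i."""
    n = len(x)
    theta = np.zeros((n + 1, n + 1))
    for h in range(n + 1):                          # theta_{h,m} = w . f(x, h, m)
        for m in range(1, n + 1):
            if h != m:
                theta[h, m] = w @ feats(x, h, m)
    Z, mu = inference(theta, n)
    gold_score = sum(theta[h, m] for (h, m) in y_gold)
    loss = -C * (gold_score - np.log(Z))            # -C log P(y_gold | x; w)
    grad = np.zeros_like(w)
    for h in range(n + 1):                          # +C * sum_{(h,m) in D(x)} mu_{h,m} f(x,h,m)
        for m in range(1, n + 1):
            if h != m:
                grad += C * mu[h, m] * feats(x, h, m)
    for (h, m) in y_gold:                           # -C * sum_{(h,m) in y_gold} f(x,h,m)
        grad -= C * feats(x, h, m)
    return loss, grad
```

A full training run would sum these contributions over i, add w to the gradient and $\frac{1}{2}\|w\|^2$ to the loss, and hand both to a batch optimizer such as the conjugate-gradient method used in our experiments.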

Computation of the µ and µ′ values is again inference of the form described in Problem 3 in Section 2.2, and can be achieved using the inside-outside algorithm for projective structures, and the algorithms described in Section 3 for non-projective structures.

  Inputs: Training examples $\{(x_i, y_i)\}_{i=1}^N$.
  Parameters: Regularization constant C, starting point β, number of passes over training set T.
  Data Structures: Real values $\theta_{i,h,m}$ and $l_{i,h,m}$ for $i = 1 \ldots N$, $(h, m) \in D(x_i)$. Learning rate η.
  Initialization: Set learning rate $\eta = 1/C$. Set $\theta_{i,h,m} = \beta$ for $(h, m) \in y_i$, and $\theta_{i,h,m} = 0$ for $(h, m) \notin y_i$. Set $l_{i,h,m} = 0$ for $(h, m) \in y_i$, and $l_{i,h,m} = 1$ for $(h, m) \notin y_i$. Calculate initial parameters as
      $w = C \sum_i \sum_{(h,m) \in D(x_i)} \delta_{i,h,m}\, f(x_i, h, m)$
    where $\delta_{i,h,m} = (1 - l_{i,h,m} - \mu_{i,h,m})$ and the $\mu_{i,h,m}$ values are calculated from the $\theta_{i,h,m}$ values as described in Eq. 8.
  Algorithm: Repeat T passes over the training set, where each pass is as follows:
    Set obj = 0
    For i = 1 ... N:
      For all $(h, m) \in D(x_i)$: $\theta'_{i,h,m} = \theta_{i,h,m} + \eta C\,(l_{i,h,m} + w \cdot f(x_i, h, m))$
      For example i, calculate marginals $\mu_{i,h,m}$ from the $\theta_{i,h,m}$ values, and marginals $\mu'_{i,h,m}$ from the $\theta'_{i,h,m}$ values (see Eq. 8).
      Update the parameters: $w = w + C \sum_{(h,m) \in D(x_i)} \delta_{i,h,m}\, f(x_i, h, m)$, where $\delta_{i,h,m} = \mu_{i,h,m} - \mu'_{i,h,m}$.
      For all $(h, m) \in D(x_i)$, set $\theta_{i,h,m} = \theta'_{i,h,m}$.
      Set obj = obj + $C \sum_{(h,m) \in D(x_i)} l_{i,h,m}\, \mu'_{i,h,m}$
    Set obj = obj $- \frac{1}{2}\|w\|^2$. If obj has decreased compared to last iteration, set $\eta = \eta / 2$.
  Output: Parameter values w.

Figure 2: The EG Algorithm for Max-Margin Estimation. The learning rate η is halved each time the dual objective function (see (Bartlett et al., 2004)) fails to increase. In our experiments we chose β = 9, which was found to work well during development of the algorithm.

5 Related Work

Global log-linear training has been used in the context of PCFG parsing (Johnson, 2001). Riezler et al. (2004) explore a similar application of log-linear models to LFG parsing. Max-margin learning has been applied to PCFG parsing by Taskar et al. (2004b). They show that this problem has a QP dual of polynomial size, where the dual variables correspond to marginal probabilities of CFG rules. A similar QP dual may be obtained for max-margin projective dependency parsing. However, for non-projective parsing, the dual QP would require an exponential number of constraints on the dependency marginals (Chopra, 1989). Nevertheless, alternative optimization methods like that of Tsochantaridis et al. (2004), or the EG method presented here, can still be applied.

The majority of previous work on dependency parsing has focused on local (i.e., classification of individual edges) discriminative training methods (Yamada and Matsumoto, 2003; Nivre et al., 2004; Y. Cheng, 2005). Non-local (i.e., classification of entire trees) training methods were used by McDonald et al. (2005a), who employed online learning.

Dependency parsing accuracy can be improved by allowing second-order features, which consider more than one dependency simultaneously. McDonald and Pereira (2006) define a second-order dependency parsing model in which interactions between adjacent siblings are allowed, and Carreras (2007) defines a second-order model that allows grandparent and sibling interactions. Both authors give polytime algorithms for exact projective parsing. By adapting the inside-outside algorithm to these models, partition functions and marginals can be computed for second-order projective structures, allowing log-linear and max-margin training to be applied via the framework developed in this paper.
For higher-order non-projective parsing, however, computational complexity results (McDonald and Pereira, 2006; McDonald and Satta, 2007) indicate that exact solutions to the three inference problems of Section 2.2 will be intractable. Exploration of approximate second-order non-projective inference is a natural avenue for future research.

Two other groups of authors have independently and simultaneously proposed adaptations of the Matrix-Tree Theorem for structured inference on directed spanning trees (McDonald and Satta, 2007; Smith and Smith, 2007). There are some algorithmic differences between these papers and ours. First, we define both multi-root and single-root algorithms, whereas the other papers only consider multi-root parsing. This distinction can be important as one often expects a dependency structure to have exactly one child attached to the root-symbol, as is the case in a single-root structure. Second, McDonald and Satta (2007) propose an $O(n^5)$ algorithm for computing the marginals, as opposed to the $O(n^3)$ matrix-inversion approach used by Smith and Smith (2007) and ourselves. In addition to the algorithmic differences, both groups of authors consider applications of the Matrix-Tree Theorem which we have not discussed. For example, both papers propose minimum-risk decoding, and McDonald and Satta (2007) discuss unsupervised learning and language modeling, while Smith and Smith (2007) define hidden-variable models based on spanning trees.

In this paper we used EG training methods only for max-margin models (Bartlett et al., 2004). However, Globerson et al. (2007) have recently shown how EG updates can be applied to efficient training of log-linear models.

6 Experiments on Dependency Parsing

In this section, we present experimental results applying our inference algorithms for dependency parsing models. Our primary purpose is to establish comparisons along two relevant dimensions: projective training vs. non-projective training, and marginal-based training algorithms vs. the averaged perceptron. The feature representation and other relevant dimensions are kept fixed in the experiments.

6.1 Data Sets and Features

We used data from the CoNLL-X shared task on multilingual dependency parsing (Buchholz and Marsi, 2006). In our experiments, we used a subset consisting of six languages; Table 1 gives details of the data sets used.²

² Our subset includes the two languages with the lowest accuracy in the CoNLL-X evaluations (Turkish and Arabic), the language with the highest accuracy (Japanese), the most non-projective language (Dutch), a moderately non-projective language (Slovene), and a highly projective language (Spanish). All languages but Spanish have multi-root parses in their data. We are grateful to the providers of the treebanks that constituted the data of our experiments (Hajič et al., 2004; van der Beek et al., 2002; Kawata and Bartels, 2000; Džeroski et al., 2006; Civit and Martí, 2002; Oflazer et al., 2003).

  language   %cd   train    val.     test
  Arabic            ,064    5,315    5,373
  Dutch             ,861    16,208   5,585
  Japanese          ,966    9,495    5,711
  Slovene           ,949    5,801    6,390
  Spanish           ,310    11,024   5,694
  Turkish           ,827    5,683    7,547

Table 1: Information for the languages in our experiments. The 2nd column (%cd) is the percentage of crossing dependencies in the training and validation sets. The last three columns report the size in tokens of the training, validation and test sets.

For each language we created a validation set that was a subset of the CoNLL-X training set for that language. The remainder of each training set was used to train the models for the different languages. The validation sets were used to tune the meta-parameters (e.g., the value of the regularization constant C) of the different training algorithms. We used the official test sets and evaluation script from the CoNLL-X task. All of the results that we report are for unlabeled dependency parsing.³

³ Our algorithms also support labeled parsing (see Section 3.4). Initial experiments with labeled models showed the same trend that we report here for unlabeled parsing, so for simplicity we conducted extensive experiments only for unlabeled parsing.

The non-projective models were trained on the CoNLL-X data in its original form. Since the projective models assume that the dependencies in the data are non-crossing, we created a second training set for each language where non-projective dependency structures were automatically transformed into projective structures. All projective models were trained on these new training sets.⁴ Our feature space is based on that of McDonald et al. (2005a).⁵

⁴ The transformations were performed by running the projective parser with score +1 on correct dependencies and −1 otherwise: the resulting trees are guaranteed to be projective and to have a minimum loss with respect to the correct tree. Note that only the training sets were transformed.

⁵ It should be noted that McDonald et al. (2006) use a richer feature set that is incomparable to our features.
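The projectivization step described in footnote 4 can be pictured as follows: score every candidate arc +1 if it appears in the gold tree and −1 otherwise, then run an exact projective decoder on those scores; the best-scoring projective tree is a minimum-loss projective approximation of the gold tree. The sketch assumes a projective decoder `eisner_decode(scores)` (for instance the $O(n^3)$ algorithm of Eisner (1996)), which is not shown here.

```python
import numpy as np

def projectivize(gold_arcs, n, eisner_decode):
    """Approximate a (possibly non-projective) gold tree by a minimum-loss projective tree.

    gold_arcs: set of (h, m) dependencies of the gold tree, h in 0..n, m in 1..n.
    eisner_decode: assumed helper returning the highest-scoring projective tree
                   (as a set of (h, m) arcs) for a dense score matrix.
    """
    scores = np.full((n + 1, n + 1), -1.0)        # -1 for incorrect dependencies
    for (h, m) in gold_arcs:
        scores[h, m] = 1.0                        # +1 for correct dependencies
    return eisner_decode(scores)                  # projective by construction, minimum loss w.r.t. gold
```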

6.2 Results

We performed experiments using three training algorithms: the averaged perceptron (Collins, 2002), log-linear training (via conjugate gradient descent), and max-margin training (via the EG algorithm). Each of these algorithms was trained using projective and non-projective methods, yielding six training settings per language. The different training algorithms have various meta-parameters, which we optimized on the validation set for each language/training-setting combination. The averaged perceptron has a single meta-parameter, namely the number of iterations over the training set. The log-linear models have two meta-parameters: the regularization constant C and the number of gradient steps T taken by the conjugate-gradient optimizer. The EG approach also has two meta-parameters: the regularization constant C and the number of iterations, T.⁶ For models trained using non-projective algorithms, both projective and non-projective parsing was tested on the validation set, and the highest scoring of these two approaches was then used to decode test data sentences.

⁶ We trained the perceptron for 100 iterations, and chose the iteration which led to the best score on the validation set. Note that in all of our experiments, the best perceptron results were actually obtained with 30 or fewer iterations. For the log-linear and EG algorithms we tested a number of values for C, and for each value of C ran 100 gradient steps or EG iterations, finally choosing the best combination of C and T found in validation.

Table 2: Test data results. The p and np columns show results with projective and non-projective training respectively. (Columns: Perceptron, Max-Margin, Log-Linear, each with p and np; rows: Ara, Dut, Jap, Slo, Spa, Tur.)

Table 3: Results for the three training algorithms on the different languages (P = perceptron, E = EG, L = log-linear models). AV is an average across the results for the different languages. (Columns: Ara, Dut, Jap, Slo, Spa, Tur, AV; rows: P, E, L.)

Table 2 reports test results for the six training scenarios. These results show that for Dutch, which is the language in our data that has the highest number of crossing dependencies, non-projective training gives significant gains over projective training for all three training methods. For the other languages, non-projective training gives similar or even improved performance over projective training.

Table 3 gives an additional set of results, which were calculated as follows. For each of the three training methods, we used the validation set results to choose between projective and non-projective training. This allows us to make a direct comparison of the three training algorithms. Table 3 shows the results of this comparison.⁷ The results show that log-linear and max-margin models both give a higher average accuracy than the perceptron. For some languages (e.g., Japanese), the differences from the perceptron are small; however for other languages (e.g., Arabic, Dutch or Slovene) the improvements seen are quite substantial.

⁷ We ran the sign test at the sentence level to measure the statistical significance of the results aggregated across the six languages. Out of 2,472 sentences total, log-linear models gave improved parses over the perceptron on 448 sentences, and worse parses on 343 sentences. The max-margin method gave improved/worse parses for 500/383 sentences. Both results are significant with p

7 Conclusions

This paper describes inference algorithms for spanning-tree distributions, focusing on the fundamental problems of computing partition functions and marginals. Although we concentrate on log-linear and max-margin estimation, the inference algorithms we present can serve as black-boxes in many other statistical modeling techniques. Our experiments suggest that marginal-based training produces more accurate models than perceptron learning. Notably, this is the first large-scale application of the EG algorithm, and shows that it is a promising approach for structured learning. In line with McDonald et al. (2005b), we confirm that spanning-tree models are well-suited to dependency parsing, especially for highly non-projective languages such as Dutch. Moreover, spanning-tree models should be useful for a variety of other problems involving structured data.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments. In addition, the authors gratefully acknowledge the following sources of support. Terry Koo was funded by a grant from the NSF (DMS ) and a grant from NTT, Agmt. Dtd. 6/21/1998. Amir Globerson was supported by a fellowship from the Rothschild Foundation - Yad Hanadiv. Xavier Carreras was supported by the Catalan Ministry of Innovation, Universities and Enterprise, and a grant from NTT, Agmt. Dtd. 6/21/1998. Michael Collins was funded by NSF grants and DMS.

References

J. Baker. 1979. Trainable grammars for speech recognition. In 97th Meeting of the Acoustical Society of America.

P. Bartlett, M. Collins, B. Taskar, and D. McAllester. 2004. Exponentiated gradient algorithms for large margin structured classification. In NIPS.

L.E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41.

S. Buchholz and E. Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proc. CoNLL-X.

X. Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proc. EMNLP-CoNLL.

S. Chopra. 1989. On the spanning tree polyhedron. Oper. Res. Lett.

Y.J. Chu and T.H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14.

M. Civit and M.A. Martí. 2002. Design principles for a Spanish treebank. In Proc. of the First Workshop on Treebanks and Linguistic Theories (TLT).

M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP.

S. Džeroski, T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtsky, and A. Žele. 2006. Towards a Slovene dependency treebank. In Proc. of the Fifth Intern. Conf. on Language Resources and Evaluation (LREC).

J. Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B.

J. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proc. COLING.

A. Globerson, T. Koo, X. Carreras, and M. Collins. 2007. Exponentiated gradient algorithms for log-linear structured prediction. In Proc. ICML.

J. Hajič, O. Smrž, P. Zemánek, J. Šnaidauf, and E. Beška. 2004. Prague Arabic dependency treebank: Development in data and tools. In Proc. of the NEMLAR Intern. Conf. on Arabic Language Resources and Tools.

M. Johnson, S. Geman, S. Canon, Z. Chi, and S. Riezler. 1999. Estimators for stochastic unification-based grammars. In Proc. ACL.

M. Johnson. 2001. Joint and conditional estimation of tagging and parsing models. In Proc. ACL.

Y. Kawata and J. Bartels. 2000. Stylebook for the Japanese treebank in VERBMOBIL. Verbmobil-Report 240, Seminar für Sprachwissenschaft, Universität Tübingen.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.

R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. EACL.

R. McDonald and G. Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proc. IWPT.

R. McDonald, K. Crammer, and F. Pereira. 2005a. Online large-margin training of dependency parsers. In Proc. ACL.

R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proc. HLT-EMNLP.

R. McDonald, K. Lerman, and F. Pereira. 2006. Multilingual dependency parsing with a two-stage discriminative parser. In Proc. CoNLL-X.

J. Nivre, J. Hall, and J. Nilsson. 2004. Memory-based dependency parsing. In Proc. CoNLL.

K. Oflazer, B. Say, D. Zeynep Hakkani-Tür, and G. Tür. 2003. Building a Turkish treebank. In A. Abeillé, editor, Treebanks: Building and Using Parsed Corpora, chapter 15. Kluwer Academic Publishers.

M.A. Paskin. 2001. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report UCB/CSD, University of California, Berkeley.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (2nd edition). Morgan Kaufmann Publishers.

S. Riezler, R. Kaplan, T. King, J. Maxwell, A. Vasserman, and R. Crouch. 2004. Speed and accuracy in shallow and deep stochastic parsing. In Proc. HLT-NAACL.

F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proc. HLT-NAACL.

N.A. Smith and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proc. ACL.

D.A. Smith and N.A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proc. EMNLP-CoNLL.

B. Taskar, C. Guestrin, and D. Koller. 2004a. Max-margin Markov networks. In NIPS.

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. 2004b. Max-margin parsing. In Proc. EMNLP.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proc. ICML.

W. Tutte. 1984. Graph Theory. Addison-Wesley.

L. van der Beek, G. Bouma, R. Malouf, and G. van Noord. 2002. The Alpino dependency treebank. In Computational Linguistics in the Netherlands (CLIN).

Y. Cheng, M. Asahara, and Y. Matsumoto. 2005. Machine learning-based dependency analyzer for Chinese. In Proc. ICCC.

H. Yamada and Y. Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. IWPT.


More information

Learning Hierarchies at Two-class Complexity

Learning Hierarchies at Two-class Complexity Learning Hierarchies at Two-class Complexity Sandor Szedmak ss03v@ecs.soton.ac.uk Craig Saunders cjs@ecs.soton.ac.uk John Shawe-Taylor jst@ecs.soton.ac.uk ISIS Group, Electronics and Computer Science University

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Classification III Dan Klein UC Berkeley 1 Classification 2 Linear Models: Perceptron The perceptron algorithm Iteratively processes the training set, reacting to training errors

More information

27: Hybrid Graphical Models and Neural Networks

27: Hybrid Graphical Models and Neural Networks 10-708: Probabilistic Graphical Models 10-708 Spring 2016 27: Hybrid Graphical Models and Neural Networks Lecturer: Matt Gormley Scribes: Jakob Bauer Otilia Stretcu Rohan Varma 1 Motivation We first look

More information

Transition-Based Dependency Parsing with Stack Long Short-Term Memory

Transition-Based Dependency Parsing with Stack Long Short-Term Memory Transition-Based Dependency Parsing with Stack Long Short-Term Memory Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, Noah A. Smith Association for Computational Linguistics (ACL), 2015 Presented

More information

Building Classifiers using Bayesian Networks

Building Classifiers using Bayesian Networks Building Classifiers using Bayesian Networks Nir Friedman and Moises Goldszmidt 1997 Presented by Brian Collins and Lukas Seitlinger Paper Summary The Naive Bayes classifier has reasonable performance

More information

Hidden Markov Support Vector Machines

Hidden Markov Support Vector Machines Hidden Markov Support Vector Machines Yasemin Altun Ioannis Tsochantaridis Thomas Hofmann Department of Computer Science, Brown University, Providence, RI 02912 USA altun@cs.brown.edu it@cs.brown.edu th@cs.brown.edu

More information

Information Processing Letters

Information Processing Letters Information Processing Letters 112 (2012) 449 456 Contents lists available at SciVerse ScienceDirect Information Processing Letters www.elsevier.com/locate/ipl Recursive sum product algorithm for generalized

More information

Parsing partially bracketed input

Parsing partially bracketed input Parsing partially bracketed input Martijn Wieling, Mark-Jan Nederhof and Gertjan van Noord Humanities Computing, University of Groningen Abstract A method is proposed to convert a Context Free Grammar

More information

Improving Transition-Based Dependency Parsing with Buffer Transitions

Improving Transition-Based Dependency Parsing with Buffer Transitions Improving Transition-Based Dependency Parsing with Buffer Transitions Daniel Fernández-González Departamento de Informática Universidade de Vigo Campus As Lagoas, 32004 Ourense, Spain danifg@uvigo.es Carlos

More information

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization in Low Level Vision Low level vision problems concerned with estimating some quantity at each pixel Visual motion (u(x,y),v(x,y))

More information

Statistical Dependency Parsing

Statistical Dependency Parsing Statistical Dependency Parsing The State of the Art Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Dependency Parsing 1(29) Introduction

More information

A Unified Framework to Integrate Supervision and Metric Learning into Clustering

A Unified Framework to Integrate Supervision and Metric Learning into Clustering A Unified Framework to Integrate Supervision and Metric Learning into Clustering Xin Li and Dan Roth Department of Computer Science University of Illinois, Urbana, IL 61801 (xli1,danr)@uiuc.edu December

More information

Part II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part II C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Converting Directed to Undirected Graphs (1) Converting Directed to Undirected Graphs (2) Add extra links between

More information

Transductive Phoneme Classification Using Local Scaling And Confidence

Transductive Phoneme Classification Using Local Scaling And Confidence 202 IEEE 27-th Convention of Electrical and Electronics Engineers in Israel Transductive Phoneme Classification Using Local Scaling And Confidence Matan Orbach Dept. of Electrical Engineering Technion

More information

Dependency Parsing domain adaptation using transductive SVM

Dependency Parsing domain adaptation using transductive SVM Dependency Parsing domain adaptation using transductive SVM Antonio Valerio Miceli-Barone University of Pisa, Italy / Largo B. Pontecorvo, 3, Pisa, Italy miceli@di.unipi.it Giuseppe Attardi University

More information

Iterative CKY parsing for Probabilistic Context-Free Grammars

Iterative CKY parsing for Probabilistic Context-Free Grammars Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST

More information

MEMMs (Log-Linear Tagging Models)

MEMMs (Log-Linear Tagging Models) Chapter 8 MEMMs (Log-Linear Tagging Models) 8.1 Introduction In this chapter we return to the problem of tagging. We previously described hidden Markov models (HMMs) for tagging problems. This chapter

More information

Structure and Support Vector Machines. SPFLODD October 31, 2013

Structure and Support Vector Machines. SPFLODD October 31, 2013 Structure and Support Vector Machines SPFLODD October 31, 2013 Outline SVMs for structured outputs Declara?ve view Procedural view Warning: Math Ahead Nota?on for Linear Models Training data: {(x 1, y

More information

Machine Learning Department School of Computer Science Carnegie Mellon University. K- Means + GMMs

Machine Learning Department School of Computer Science Carnegie Mellon University. K- Means + GMMs 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University K- Means + GMMs Clustering Readings: Murphy 25.5 Bishop 12.1, 12.3 HTF 14.3.0 Mitchell

More information

The Perceptron. Simon Šuster, University of Groningen. Course Learning from data November 18, 2013

The Perceptron. Simon Šuster, University of Groningen. Course Learning from data November 18, 2013 The Perceptron Simon Šuster, University of Groningen Course Learning from data November 18, 2013 References Hal Daumé III: A Course in Machine Learning http://ciml.info Tom M. Mitchell: Machine Learning

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Conditional Random Fields - A probabilistic graphical model. Yen-Chin Lee 指導老師 : 鮑興國

Conditional Random Fields - A probabilistic graphical model. Yen-Chin Lee 指導老師 : 鮑興國 Conditional Random Fields - A probabilistic graphical model Yen-Chin Lee 指導老師 : 鮑興國 Outline Labeling sequence data problem Introduction conditional random field (CRF) Different views on building a conditional

More information

Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction

Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction Ming-Wei Chang Wen-tau Yih Microsoft Research Redmond, WA 98052, USA {minchang,scottyih}@microsoft.com Abstract Due to

More information

Conditional Random Field for tracking user behavior based on his eye s movements 1

Conditional Random Field for tracking user behavior based on his eye s movements 1 Conditional Random Field for tracing user behavior based on his eye s movements 1 Trinh Minh Tri Do Thierry Artières LIP6, Université Paris 6 LIP6, Université Paris 6 8 rue du capitaine Scott 8 rue du

More information

Confidence in Structured-Prediction using Confidence-Weighted Models

Confidence in Structured-Prediction using Confidence-Weighted Models Confidence in Structured-Prediction using Confidence-Weighted Models Avihai Mejer Department of Computer Science Technion-Israel Institute of Technology Haifa 32, Israel amejer@tx.technion.ac.il Koby Crammer

More information

Assignment 4 CSE 517: Natural Language Processing

Assignment 4 CSE 517: Natural Language Processing Assignment 4 CSE 517: Natural Language Processing University of Washington Winter 2016 Due: March 2, 2016, 1:30 pm 1 HMMs and PCFGs Here s the definition of a PCFG given in class on 2/17: A finite set

More information

Feature Extraction and Loss training using CRFs: A Project Report

Feature Extraction and Loss training using CRFs: A Project Report Feature Extraction and Loss training using CRFs: A Project Report Ankan Saha Department of computer Science University of Chicago March 11, 2008 Abstract POS tagging has been a very important problem in

More information

Easy-First POS Tagging and Dependency Parsing with Beam Search

Easy-First POS Tagging and Dependency Parsing with Beam Search Easy-First POS Tagging and Dependency Parsing with Beam Search Ji Ma JingboZhu Tong Xiao Nan Yang Natrual Language Processing Lab., Northeastern University, Shenyang, China MOE-MS Key Lab of MCC, University

More information

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010 INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,

More information

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial

More information

Probabilistic parsing with a wide variety of features

Probabilistic parsing with a wide variety of features Probabilistic parsing with a wide variety of features Mark Johnson Brown University IJCNLP, March 2004 Joint work with Eugene Charniak (Brown) and Michael Collins (MIT) upported by NF grants LI 9720368

More information

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands Svetlana Stoyanchev, Hyuckchul Jung, John Chen, Srinivas Bangalore AT&T Labs Research 1 AT&T Way Bedminster NJ 07921 {sveta,hjung,jchen,srini}@research.att.com

More information

Supplementary Material: The Emergence of. Organizing Structure in Conceptual Representation

Supplementary Material: The Emergence of. Organizing Structure in Conceptual Representation Supplementary Material: The Emergence of Organizing Structure in Conceptual Representation Brenden M. Lake, 1,2 Neil D. Lawrence, 3 Joshua B. Tenenbaum, 4,5 1 Center for Data Science, New York University

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Online Graph Planarisation for Synchronous Parsing of Semantic and Syntactic Dependencies

Online Graph Planarisation for Synchronous Parsing of Semantic and Syntactic Dependencies Online Graph Planarisation for Synchronous Parsing of Semantic and Syntactic Dependencies Ivan Titov University of Illinois at Urbana-Champaign James Henderson, Paola Merlo, Gabriele Musillo University

More information

A New Perceptron Algorithm for Sequence Labeling with Non-local Features

A New Perceptron Algorithm for Sequence Labeling with Non-local Features A New Perceptron Algorithm for Sequence Labeling with Non-local Features Jun ichi Kazama and Kentaro Torisawa Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa, 923-1292

More information

Monotone Paths in Geometric Triangulations

Monotone Paths in Geometric Triangulations Monotone Paths in Geometric Triangulations Adrian Dumitrescu Ritankar Mandal Csaba D. Tóth November 19, 2017 Abstract (I) We prove that the (maximum) number of monotone paths in a geometric triangulation

More information

Hidden Markov Models in the context of genetic analysis

Hidden Markov Models in the context of genetic analysis Hidden Markov Models in the context of genetic analysis Vincent Plagnol UCL Genetics Institute November 22, 2012 Outline 1 Introduction 2 Two basic problems Forward/backward Baum-Welch algorithm Viterbi

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Discriminative Training for Phrase-Based Machine Translation

Discriminative Training for Phrase-Based Machine Translation Discriminative Training for Phrase-Based Machine Translation Abhishek Arun 19 April 2007 Overview 1 Evolution from generative to discriminative models Discriminative training Model Learning schemes Featured

More information

HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce

HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce Andrea Gesmundo Computer Science Department University of Geneva Geneva, Switzerland andrea.gesmundo@unige.ch

More information

Density-Driven Cross-Lingual Transfer of Dependency Parsers

Density-Driven Cross-Lingual Transfer of Dependency Parsers Density-Driven Cross-Lingual Transfer of Dependency Parsers Mohammad Sadegh Rasooli Michael Collins rasooli@cs.columbia.edu Presented by Owen Rambow EMNLP 2015 Motivation Availability of treebanks Accurate

More information

Conditional Random Fields with High-Order Features for Sequence Labeling

Conditional Random Fields with High-Order Features for Sequence Labeling Conditional Random Fields with High-Order Features for Sequence Labeling Nan Ye Wee Sun Lee Department of Computer Science National University of Singapore {yenan,leews}@comp.nus.edu.sg Hai Leong Chieu

More information