Kernel Conditional Random Fields: Representation and Clique Selection

John Lafferty    Xiaojin Zhu    Yan Liu
School of Computer Science, Carnegie Mellon University, Pittsburgh PA, USA

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

Abstract

Kernel conditional random fields (KCRFs) are introduced as a framework for discriminative modeling of graph-structured data. A representer theorem for conditional graphical models is given which shows how kernel conditional random fields arise from risk minimization procedures defined using Mercer kernels on labeled graphs. A procedure for greedily selecting cliques in the dual representation is then proposed, which allows sparse representations. By incorporating kernels and implicit feature spaces into conditional graphical models, the framework enables semi-supervised learning algorithms for structured data through the use of graph kernels. The framework and clique selection methods are demonstrated in synthetic data experiments, and are also applied to the problem of protein secondary structure prediction.

1. Introduction

Many classification problems involve the annotation of data items having multiple components, with each component requiring a classification label. Such problems are challenging because the interaction between the components can be rich and complex. In text, speech, and image processing, for example, it is often useful to label individual words, sounds, or image patches with categories to enable higher level processing; but these labels can depend on one another in a highly complex manner. For biological sequence annotation, it is desirable to annotate each amino acid in a protein with a label, with the collection of labels representing the global geometric structure of the molecule. Here the labels in principle depend on the physical characteristics of the molecule and its ambient chemical environment. In each case, classification tasks naturally arise which clearly violate the assumption of independent and identically distributed instances that is made in the majority of classification procedures in statistics and machine learning. It is therefore of central importance to extend recent advances in classification theory and practice to structured, non-independent data classification problems.

Conditional random fields (Lafferty et al., 2001) have been proposed as an approach to modeling the interactions between labels in such problems using the tools of graphical models. A conditional random field (CRF) is a model that assigns a joint probability distribution over labels conditional on the input, where the distribution respects the independence relations encoded in a graph. In general, the labels are not assumed to be independent, nor are the observations conditionally independent given the labels, as is assumed in generative models such as hidden Markov models. The CRF framework has already been used to obtain promising results in a number of domains where there is interaction between labels, including tagging, parsing and information extraction in natural language processing (Collins, 2002; Sha & Pereira, 2003; Pinto et al., 2003) and the modeling of spatial dependencies in image processing (Kumar & Hebert, 2003). In related work, Taskar et al. (2003) have studied random fields (also known as Markov networks) fit using loss functions that incorporate a generalized notion of margin, and have observed how the kernel trick applies to this family of models.
We present an extension of conditional random fields that permits the use of implicit feature spaces through Mercer kernels, using the framework of regularization theory. Such an extension is motivated by the significant body of recent work that has shown kernel methods to be extremely effective in a wide variety of machine learning techniques; for example, they enable the integration of multiple sources of information in a principled manner. Our introduction of Mercer kernels into conditional graphical models is also motivated by the problem of semi-supervised learning. In many domains, the collection of annotated training data is difficult and costly, as it requires the efforts of expert human annotators, while the collection of unlabeled data may be relatively easy and inexpensive. The emerging theme in recent research in semi-supervised learning is that kernel methods, in particular those based on graphical representations of unlabeled data, form a theoretically attractive and empirically promising set of techniques for combining labeled and unlabeled data (Belkin & Niyogi, 2002; Chapelle et al., 2002; Smola & Kondor, 2003; Zhu et al., 2003).

In Section 2 we formalize the learning problem and present a version of the classical representer theorem of Kimeldorf and Wahba (1971). Unlike the classical result, for kernel conditional random fields the dual parameters depend on all potential assignments of labels to cliques in the graph, not only the observed labels. This motivates the need for algorithms to derive sparse representations, since the full representation has parameters for each labeled clique in the graphs appearing in the training data. In Section 3 we present a greedy algorithm for selecting a small number of representative cliques. This clique selection algorithm parallels the import vector selection algorithm of kernel logistic regression (Zhu & Hastie, 2001), and the feature selection methods that have been previously proposed for random fields and conditional random fields using explicit features (McCallum, 2003). In Section 4 the ideas and methods are demonstrated on two synthetic data sets, where the effects of the underlying graph kernels, clique selection, and sequential modeling can be clearly seen. In Section 5 we report the results of experiments using kernel CRFs for protein secondary structure prediction. This is the task of mapping primary sequences of amino acids onto a string of secondary structure assignments, such as helix, sheet, or coil. It is widely believed that secondary structure can contribute valuable information to discerning how proteins fold in three dimensions. We compare kernel conditional random fields, estimated using clique selection, against support vector machine classifiers, with both methods using kernels derived from position-specific scoring matrices (PSI-BLAST profiles) as input features. In addition, we give results for the use of graph kernels derived from the PSI-BLAST profiles in a transductive, semi-supervised framework for estimating the kernel CRFs. The paper concludes with a brief discussion in Section 6.

2. Representation

Before proceeding with formalism, we give some intuition for what our framework is intended to capture. Our goal is to annotate structured data, where the structure is represented by a graph. Labels are to be assigned to the nodes in the graph in order to minimize some loss function, such as 0-1 error; the labels come from a small set $\mathcal{Y}$, for example $\mathcal{Y} = \{\text{red}, \text{blue}, \text{green}\}$. Each vertex in the graph is associated with a feature vector $x_v \in \mathcal{X}$. In image processing, the feature vector at a node might include a pixel intensity, as well as average pixel intensities smoothed over neighboring regions using wavelets. In protein secondary structure prediction, each node might correspond to an amino acid in the protein, and the feature vector at a node may include an amino acid histogram of all protein fragments in a database which closely match the given protein at that node. In the following section we present our notation and formal framework for such problems.

2.1. Cliques and labeled graphs

Let $\mathcal{G}$ denote a collection of finite graphs.
For example, $\mathcal{G}$ might be the set of finite chains, appropriate for sequence modeling, or the rectangular 2-dimensional grids, appropriate for some image processing tasks. The set of vertices of a graph $g \in \mathcal{G}$ is denoted by $V(g)$, and the size of the graph is the number of vertices, denoted $|g| = |V(g)|$. A clique is a subset of the vertices which is fully connected, with any pair of vertices joined by an edge; we denote the set of cliques in the graph by $\mathcal{C}(g)$. The number of vertices in a clique $c$ is denoted by $|c|$. Similarly, we denote by $\mathcal{C}(\mathcal{G}) = \{(g, c) : g \in \mathcal{G},\ c \in \mathcal{C}(g)\}$ the collection of cliques across varying graphs. In other words, a member of $\mathcal{C}(\mathcal{G})$ consists of a graph and a distinguished clique of that graph. We will work with kernels that compare components of different graphs. For example, we could consider a kernel $K : \mathcal{C}(\mathcal{G}) \times \mathcal{C}(\mathcal{G}) \to \{0, 1\}$ given by $K((g, c), (g', c')) = \delta(|c|, |c'|)$.

We next consider labelings of a graph. Let $\mathcal{Y}$ be a finite set of labels; infinite $\mathcal{Y}$ is also possible in a regression framework, but we restrict to finite $\mathcal{Y}$ for simplicity. The set of $\mathcal{Y}$-labelings of a graph $g$ is denoted $\mathcal{Y}(g) = \{ y : y \in \mathcal{Y}^{|g|} \}$, and the collection of all $\mathcal{Y}$-labeled graphs is $\mathcal{Y}(\mathcal{G}) = \{(g, y) : g \in \mathcal{G},\ y \in \mathcal{Y}(g)\}$. Similarly, let $\mathcal{X}$ be an input feature space; for example, $\mathcal{X} = \mathbb{R}^n$. The set $\mathcal{X}(g) = \{ x : x \in \mathcal{X}^{|g|} \}$ denotes the set of assignments of a feature vector to each vertex of the graph $g$; $\mathcal{X}(\mathcal{G}) = \{(g, x) : g \in \mathcal{G},\ x \in \mathcal{X}(g)\}$ is the collection of all such annotated graphs. Finally, let $\mathcal{Y}_{\mathcal{C}}(g) = \{ (c, y_c) : c \in \mathcal{C}(g),\ y_c \in \mathcal{Y}^{|c|} \}$ be the set of $\mathcal{Y}$-labeled cliques in a graph. As above, we similarly define $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(g) = \{(x, c, y_c) : x \in \mathcal{X}(g),\ (c, y_c) \in \mathcal{Y}_{\mathcal{C}}(g)\}$ and $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G}) = \{(g, x, c, y_c) : g \in \mathcal{G},\ (x, c, y_c) \in \mathcal{X}\mathcal{Y}_{\mathcal{C}}(g)\}$.

2.2. Representer Theorem

The prediction task for conditional graphical models is to learn a function $h : \mathcal{X}(\mathcal{G}) \to \mathcal{Y}(\mathcal{G})$, where $h(g, x) \in \mathcal{Y}(g)$ is a labeling of $g$, with the goal of minimizing a suitably defined loss function. The classifier $h = h_n$ is chosen based on a labeled sample $\{(g^{(i)}, x^{(i)}, y^{(i)})\}_{i=1}^n$, with each $(g^{(i)}, x^{(i)}, y^{(i)})$ being a labeled graph, the graph possibly changing from example to example.
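To ground this notation, the following is a small sketch (an illustration, not code from the paper) that enumerates the cliques $\mathcal{C}(g)$ and labeled cliques $\mathcal{Y}_{\mathcal{C}}(g)$ of a chain graph and implements the toy clique kernel $K((g, c), (g', c')) = \delta(|c|, |c'|)$ mentioned above; the three-element label set is an arbitrary choice.

```python
# A concrete instance of the Section 2.1 notation (a minimal sketch, not from the
# paper): a chain graph g, its clique set C(g) (vertices and edges), its labeled
# cliques Y_C(g) for Y = {H, E, C}, and the toy kernel
# K((g, c), (g', c')) = delta(|c|, |c'|) comparing cliques of different graphs.

import itertools

Y = ["H", "E", "C"]

def chain_cliques(n):
    """C(g) for a chain with n vertices: singleton cliques and adjacent pairs."""
    return [(t,) for t in range(n)] + [(t, t + 1) for t in range(n - 1)]

def labeled_cliques(n):
    """Y_C(g): every clique paired with every labeling of that clique."""
    return [(c, y_c) for c in chain_cliques(n)
                     for y_c in itertools.product(Y, repeat=len(c))]

def clique_kernel(c, c_prime):
    """K((g, c), (g', c')) = delta(|c|, |c'|)."""
    return 1.0 if len(c) == len(c_prime) else 0.0

# A chain with |g| = 4 has 4 vertex cliques and 3 edge cliques, hence
# 4 * 3 + 3 * 9 = 39 labeled cliques.
assert len(labeled_cliques(4)) == 39
```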

To limit the complexity of the hypothesis, we will assume that it is determined completely by a function $f : \mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G}) \to \mathbb{R}$. Let $f(g, x)$ denote the collection of values $\{f(g, x, c, y_c)\}$, with $c \in \mathcal{C}(g)$ varying over the cliques of $g$ and $y_c \in \mathcal{Y}^{|c|}$ varying over all possible labelings of that clique. We assume that a loss function $\phi(y, f(g, x))$ is given. As an important example, and the loss function used in this paper, consider the negative log loss

$$\phi(y, f(g, x)) = -\sum_{c \in \mathcal{C}(g)} f_c(x, y_c) + \log \sum_{y' \in \mathcal{Y}(g)} \exp \sum_{c \in \mathcal{C}(g)} f_c(x, y'_c) \qquad (1)$$

where $f_c(x, y_c)$ is shorthand for $f(g, x, c, y_c)$. The negative log marginal loss could also be considered for minimizing the per-node error. The negative log loss function corresponds to a conditional random field given by

$$p(y \mid g, x) = Z^{-1}(g, x, f)\, \exp\Big( \sum_{c \in \mathcal{C}(g)} f_c(x, y_c) \Big). \qquad (2)$$

We now discuss how the representer theorem of kernel machines (Kimeldorf & Wahba, 1971) applies to conditional graphical models. While this is a simple extension, we are not aware of an analogous formulation in the statistics or machine learning literature. Let $K$ be a Mercer kernel on $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G})$; thus $K((g, x, c, y_c), (g', x', c', y'_{c'})) \in \mathbb{R}$ for each $(x, c, y_c) \in \mathcal{X}\mathcal{Y}_{\mathcal{C}}(g)$ and $(x', c', y'_{c'}) \in \mathcal{X}\mathcal{Y}_{\mathcal{C}}(g')$. Intuitively, this assigns a measure of similarity between a labeled clique in one graph and a labeled clique in a (possibly different) graph. We denote by $\mathcal{H}_K$ the associated reproducing kernel Hilbert space, and by $\|\cdot\|_K$ the associated norm on $L_2(\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G}))$. Consider a regularized loss function of the form

$$R_\phi(f) = \sum_{i=1}^n \phi\big(y^{(i)}, f(g^{(i)}, x^{(i)})\big) + \Omega\big(\|f\|_K\big).$$

It is important to note that the loss depends on all possible assignments $y_c$ of labels to each clique, not just those observed in the labeled data $y^{(i)}$. Suppressing the dependence on the graph $g$ in the notation, let $K_c(x, y_c;\ \cdot) = K((g, x, c, y_c),\ \cdot)$. Following the argument for the standard representer theorem, it can easily be shown that the minimizer of a regularized loss function of the above form can be expressed in terms of the basis functions $f(\cdot) = K(\cdot, (g^{(i)}, x^{(i)}, c, y_c))$.

Proposition (Representer theorem for CRFs). Let $K$ be a Mercer kernel on $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G})$ with associated RKHS norm $\|\cdot\|_K$, and let $\Omega : \mathbb{R}^+ \to \mathbb{R}^+$ be strictly increasing. Then the minimizer $f^\star$ of

$$R_\phi(f) = \sum_{i=1}^n \phi\big(y^{(i)}, f(g^{(i)}, x^{(i)})\big) + \Omega\big(\|f\|_K\big),$$

if it exists, has the form

$$f^\star(\cdot) = \sum_{i=1}^n \sum_{c \in \mathcal{C}(g^{(i)})} \sum_{y_c \in \mathcal{Y}^{|c|}} \alpha_c^{(i)}(y_c)\, K_c(x^{(i)}, y_c;\ \cdot).$$

The key property distinguishing this result from the standard representer theorem is that the dual parameters $\alpha_c^{(i)}(y_c)$ now depend on all assignments of labels.

2.3. Two special cases

Let $K$ be a Mercer kernel on $Z = \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$. Thus, the kernel is defined in terms of the entries $K(z, z')$ where $z = (x, y_1, y_2)$. Using $K$ we can define a kernel on edges in $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G})$ by

$$K\big((g, x, (v_1, v_2), (y_1, y_2)),\ (g', x', (v'_1, v'_2), (y'_1, y'_2))\big) = K\big((x_{v_1}, y_1, y_2),\ (x'_{v'_1}, y'_1, y'_2)\big).$$

For the regularized risk minimization problem

$$\min_{f \in \mathcal{H}_K} R_\phi(f, \lambda) = \min_{f \in \mathcal{H}_K} \sum_{i=1}^n \phi\big(y^{(i)}, f(x^{(i)})\big) + \lambda \|f\|_K,$$

the CRF representer theorem implies that the solution $f^\star$ has the form

$$f^\star_{(v_1, v_2)}(x, y_1, y_2) = \sum_{i=1}^n \sum_{(v, v')} \sum_{y, y'} \alpha^{(i)}_{(v, v')}(y, y')\, K\big((x_{v_1}, y_1, y_2),\ (x^{(i)}_{v}, y, y')\big).$$

In the special case of the kernel $K(z, z') = K(x, x')\,\delta(y_1, y'_1)$ it follows that

$$f^\star_{(v_1, v_2)}(x, y_1, y_2) = \sum_{i=1}^n \sum_{v \in V(g^{(i)})} \alpha^{(i)}_{v}(y_1)\, K(x_{v_1}, x^{(i)}_{v}).$$

Under the probabilistic model (2), this is simply kernel logistic regression. In the special case of $K(z, z') = K(x, x')\,\delta(y_1, y'_1) + \delta(y_1, y'_1)\,\delta(y_2, y'_2)$ we get that

$$f^\star_{(v_1, v_2)}(x, y_1, y_2) = \sum_{i, v} \alpha^{(i)}_{v}(y_1)\, K(x^{(i)}_{v}, x_{v_1}) + \alpha(y_1, y_2),$$

and we recover a simple type of semiparametric CRF.
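As an illustration of these definitions (a minimal sketch, not code from the paper), the following evaluates the negative log loss of equation (1) and the conditional distribution of equation (2) for a short linear chain by brute-force enumeration over all labelings; the label set, the toy observations, and the clique function f below are arbitrary stand-ins.

```python
# Minimal sketch (illustration only): equations (1) and (2) for a tiny chain,
# with the partition function computed by brute force over all labelings.

import itertools
import math

LABELS = ["H", "E", "C"]                   # the label set Y

def cliques(n):
    """C(g) for a length-n chain: vertex cliques (t,) and edge cliques (t, t+1)."""
    return [(t,) for t in range(n)] + [(t, t + 1) for t in range(n - 1)]

def score(f, x, y):
    """sum_{c in C(g)} f_c(x, y_c) for a labeling y of the chain."""
    return sum(f(x, c, tuple(y[t] for t in c)) for c in cliques(len(x)))

def neg_log_loss(f, x, y):
    """phi(y, f(g, x)) = -sum_c f_c(x, y_c) + log sum_{y'} exp sum_c f_c(x, y'_c)."""
    log_z = math.log(sum(math.exp(score(f, x, list(yp)))
                         for yp in itertools.product(LABELS, repeat=len(x))))
    return -score(f, x, y) + log_z

def prob(f, x, y):
    """p(y | g, x) = Z^{-1}(g, x, f) exp(sum_c f_c(x, y_c)), equation (2)."""
    return math.exp(-neg_log_loss(f, x, y))

# toy clique function: reward vertex labels that match a noisy observation,
# and reward equal labels on neighbouring vertices
def f(x, c, y_c):
    if len(c) == 1:
        return 2.0 if x[c[0]] == y_c[0] else 0.0
    return 0.5 if y_c[0] == y_c[1] else 0.0

x = ["H", "H", "E", "E"]                   # stand-ins for per-vertex feature vectors
y = ["H", "H", "E", "E"]
print(neg_log_loss(f, x, y), prob(f, x, y))
```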
3. Clique Selection

The representer theorem shows that the minimizing function $f$ is supported by labeled cliques over the training examples; however, this may result in an extremely large number of parameters. We therefore pursue a strategy of incrementally selecting cliques in order to greedily reduce the regularized risk. The resulting procedure is parallel to forward stepwise logistic regression, to related methods for kernel logistic regression (Zhu & Hastie, 2001), and to the greedy selection procedure presented in (Della Pietra et al., 1997).

Our algorithm will maintain an active set $A = \{(g^{(i)}, c, y_c)\} \subset \mathcal{Y}_{\mathcal{C}}(\mathcal{G})$ of labeled cliques, where the labelings are not restricted to those appearing in the training data. Each such candidate clique can be represented by a basis function $h(\cdot) = K((g^{(i)}, x^{(i)}, c, y_c),\ \cdot) \in \mathcal{H}_K$, and is assigned a parameter $\alpha_h = \alpha_c^{(i)}(y_c)$. We work with the regularized risk

$$R_\phi(f) = \sum_i \phi\big(y^{(i)}, f(g^{(i)}, x^{(i)})\big) + \frac{\lambda}{2} \|f\|_K^2 \qquad (3)$$

where $\phi$ is the log loss of equation (1). To evaluate a candidate $h$, one strategy is to compute the gain $\sup_\alpha\, R_\phi(f) - R_\phi(f + \alpha h)$, and to choose the candidate $h$ having the largest gain. This presents an apparent difficulty, since the optimal parameter $\alpha$ cannot be computed in closed form, and must be evaluated numerically. For sequence models this involves forward-backward calculations for each candidate $h$, the cost of which is prohibitive. As an alternative, we adopt the functional gradient descent approach, which evaluates a small change to the current function. For a given candidate $h$, consider adding $h$ to the current model with small weight $\varepsilon$; thus $f \mapsto f + \varepsilon h$. Then $R_\phi(f + \varepsilon h) = R_\phi(f) + \varepsilon\, dR_\phi(f, h) + O(\varepsilon^2)$, where the functional derivative of $R_\phi$ at $f$ in the direction $h$ is computed as

$$dR_\phi(f, h) = E_f[h] - \tilde{E}[h] + \lambda \langle f, h \rangle_K$$

where $\tilde{E}[h] = \sum_i h(x^{(i)}, y^{(i)})$ is the empirical expectation and $E_f[h] = \sum_i \sum_y p(y \mid x^{(i)}, f)\, h(x^{(i)}, y)$ is the model expectation conditioned on $x$, combined with the empirical distribution on $x$. The idea is that in directions $h$ where the functional gradient $dR_\phi(f, h)$ is large, the model is mismatched with the labeled data; this direction should be added to the model to make a correction. This results in the greedy clique selection algorithm summarized in Figure 1. Following our earlier notation, $h(x^{(i)}, y) = \sum_{c \in \mathcal{C}(g^{(i)})} h(g^{(i)}, x^{(i)}, c, y_c)$ is the sum over all cliques. The candidate functions $h$ might include functions of the form $h(\cdot) = K((g^{(i)}, x^{(i)}, c, y_c),\ \cdot)$, where $i$ is a specific instance, $c$ is a particular clique of $g^{(i)}$, and $y_c$ is a labeling of that clique. Alternatively, in a slightly less greedy manner, at each step in the selection procedure a specific instance and clique may be selected, and functions for each clique labeling may be added.

Figure 1 (Greedy Clique Selection). Initialize with $f = 0$, and iterate:
1. For each candidate $h \in \mathcal{H}_K$, supported by a single labeled clique, calculate the functional derivative $dR_\phi(f, h)$.
2. Select the candidate $h^\star = \arg\max_h\, dR_\phi(f, h)$ having the largest gradient direction. Set $f \leftarrow f + \alpha_{h^\star} h^\star$.
3. Estimate the parameters $\alpha_h$ for each active $h$ by minimizing $R_\phi(f)$.
Labeled cliques encode basis functions $h$ which are greedily added to the model, using a form of functional gradient descent.

In the experiments reported below for sequences, the marginal probabilities $p(y_t = y \mid x)$ and expected counts for the state transitions are required; these are computed using the forward-backward algorithm, with log domain arithmetic to avoid underflow. A quasi-Newton method (BFGS, with a cubic-polynomial line search) is used to estimate the parameters in step 3.
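For small problems the functional derivative above can be computed directly. The sketch below (an illustration under simplifying assumptions, not the paper's implementation) scores candidates by brute-force enumeration and picks the one with the largest gradient magnitude, standing in for steps 1 and 2 of Figure 1; a real sequence model would use forward-backward recursions, and the indicator candidates in the usage snippet are stand-ins for kernel basis functions.

```python
# Sketch of the functional-gradient score for greedy clique selection:
#   dR_phi(f, h) = E_f[h] - E_tilde[h] + lambda <f, h>_K .
# Expectations are computed by brute-force enumeration over labelings.

import itertools
import math

LABELS = ["H", "E", "C"]

def cliques(n):
    return [(t,) for t in range(n)] + [(t, t + 1) for t in range(n - 1)]

def total(fn, x, y):
    """fn(x, y) = sum over all cliques c of fn(x, c, y_c) for a chain."""
    return sum(fn(x, c, tuple(y[t] for t in c)) for c in cliques(len(x)))

def functional_gradient(f, h, data, lam=0.1, inner_product=0.0):
    """dR_phi(f, h); `inner_product` stands in for <f, h>_K (zero when f = 0)."""
    grad = lam * inner_product
    for x, y in data:
        grad -= total(h, x, y)                          # -E_tilde[h]
        labelings = [list(yp) for yp in itertools.product(LABELS, repeat=len(x))]
        scores = [total(f, x, yp) for yp in labelings]
        log_z = math.log(sum(math.exp(s) for s in scores))
        for yp, s in zip(labelings, scores):            # +E_f[h]
            grad += math.exp(s - log_z) * total(h, x, yp)
    return grad

def select_candidate(f, candidates, data):
    """Step 2 of Figure 1: pick the candidate whose gradient has the largest magnitude."""
    return max(candidates, key=lambda h: abs(functional_gradient(f, h, data)))

# toy usage: indicator "basis functions" standing in for kernel sections K((g,x,c,y_c), .)
data = [(["H", "H", "E"], ["H", "H", "E"])]
candidates = [(lambda x, c, yc, t=t, lab=lab: float(c == (t,) and yc == (lab,)))
              for t in range(3) for lab in LABELS]
f0 = lambda x, c, yc: 0.0                               # start from f = 0
best = select_candidate(f0, candidates, data)
```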
Prediction is carried out using the forward-backward algorithm to compute marginals, rather than using the Viterbi algorithm.

3.1. Combining multiple kernels

The above use of kernels enables semi-supervised learning for structured prediction problems. One of the emerging themes in semi-supervised learning is that graph kernels can provide a useful framework for combining labeled and unlabeled data. Here an undirected graph is defined over labeled and unlabeled data instances, and generally the assumption is that labels vary smoothly over the graph. The graph is represented by the weight matrix $W$, and one can construct a kernel from the graph Laplacian by substituting each eigenvalue $\lambda$ with $r(\lambda)$, where $r$ is a non-negative and (typically) decreasing function. This regularizes high frequency components and encourages smooth functions on the graph; see (Smola & Kondor, 2003) for a description of this unifying view of graph kernels. It is important to note that such a use of a graph kernel for semi-supervised learning introduces an additional graphical structure, which should not be confused with the graph representing the explicit dependencies between labels in a CRF. For example, when modeling sequences, the natural CRF graph structure is a chain. By incorporating unlabeled data through the use of a graph kernel, an additional graph that will generally have many cycles is implicitly introduced. However, the graph kernel and a more standard kernel may be naturally combined as a linear combination; see, for example, (Lanckriet et al., 2004).

4. Synthetic Data Experiments

To demonstrate the properties and advantages of KCRFs, we prepared two synthetic datasets: a galaxy dataset to investigate the relation to semi-supervised and sequential learning, and an HMM with Gaussian mixture emission probabilities to demonstrate the properties of clique selection and the advantages of incorporating kernels.

Galaxy. The galaxy dataset is a variant of two spirals; see Figure 2 (left). Note the dense core of points from both classes. The sequences are generated from a 2-state hidden Markov model (HMM), where each state emits instances uniformly from one of the classes. There is a 90% chance of staying in the same state. The idea is that under a sequence model, an example from the core will have a better than random chance of being labeled correctly based on the context. This is not true under a non-sequence model, and the dataset as a whole will thus have about a 20% Bayes error rate under the i.i.d. assumption. We sample 100 sequences of length 20. Note that the choice of semi-supervised vs. standard kernels and of sequence vs. non-sequence models are orthogonal; all four combinations are tested. We construct a semi-supervised graph kernel by first creating an unweighted 10-nearest neighbor graph. We then compute the graph Laplacian $L$ and form the kernel $K = 10\,(L + I)^{-1}$; this corresponds to the function $r(\lambda) = 1/(\lambda + 1)$ on $L$'s eigenvalues, up to the overall scale factor. The standard kernel is the radial basis function (RBF) kernel with bandwidth $\sigma = 5$. All parameters here and below are tuned by cross validation.

Figure 2 (center) shows the results of using kernel logistic regression with the semi-supervised kernel and with the RBF kernel; here the sequence structure is ignored. For each training set size, which ranges from 20 to 400 points, 10 random trials were performed. The error intervals shown are one standard error. When the labeled set size is small, the graph kernel is much better than the RBF kernel. However, both kernels saturate at the 20% Bayes error rate. Next we apply both kernels to the semiparametric KCRF model of Section 2.3; see Figure 2 (right). Note that the x-axis is the number of training sequences; since each sequence has 20 instances, the range is the same as in Figure 2 (center). The kernel CRF is capable of getting under the 20% Bayes error floor of the non-sequence model, with both kernels and sufficient labeled data. However, the graph kernel is able to learn the structure much faster than the RBF kernel. Evidently, the high error rate for low label data sizes prevents the RBF model from effectively using the context.
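The graph kernel used for the galaxy data is an instance of the spectral transform described in Section 3.1. The sketch below is a hedged illustration (only the 10-nearest-neighbor construction and the form of $r$ are taken from the text; everything else is an assumption): it builds an unweighted k-NN graph and forms $K = \text{scale} \cdot U\, r(\Lambda)\, U^\top$, which reduces to $K = 10\,(L + I)^{-1}$ for $r(\lambda) = 1/(\lambda + 1)$ and scale 10.

```python
# Minimal sketch (not the authors' code) of a graph kernel obtained by replacing
# the Laplacian eigenvalues lambda with r(lambda), r non-negative and decreasing.

import numpy as np

def knn_graph(X, k=10):
    """Symmetrized, unweighted k-nearest-neighbor adjacency matrix."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    W = np.zeros_like(d2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    for i, nbrs in enumerate(nearest):
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)

def laplacian_kernel(X, k=10, r=lambda lam: 1.0 / (lam + 1.0), scale=10.0):
    """K = scale * U diag(r(lambda)) U^T from the combinatorial Laplacian L = D - W.
    With r(lam) = 1/(lam + 1) and scale = 10 this reproduces K = 10 (L + I)^{-1}."""
    W = knn_graph(X, k)
    L = np.diag(W.sum(axis=1)) - W
    lam, U = np.linalg.eigh(L)
    return scale * (U * r(lam)) @ U.T

# usage: a kernel over 200 two-dimensional points (labeled and unlabeled together)
K = laplacian_kernel(np.random.randn(200, 2))
```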
HMM with Gaussian mixtures. This more difficult dataset is generated from a 3-state HMM. Each state is a mixture of 2 Gaussians with random mean and covariance. The Gaussians strongly overlap; see Figure 3 (left). The transition probabilities favor remaining in the same state, with probability 0.8, and transitioning to each of the other two states with equal probability 0.1; we generate 100 sequences of length 30. We use an RBF kernel with bandwidth $\sigma$ tuned by cross validation. (A graph kernel is slightly worse than the RBF kernel on this dataset, and is not shown.) We perform 20 trials for each training set size, and in each trial we perform clique selection to select the top 20 vertices.

The center and right plots in Figure 3 show that the semiparametric KCRF again outperforms kernel logistic regression with the same RBF kernel. Figure 4 shows clique selection, with a training size of 20 sequences, averaged over 20 random trials. The regularized risk (left), which is the training set negative log likelihood plus the regularizer, always decreases as we select more vertices into the KCRF. On the other hand, the test set likelihood (center) and accuracy (right) saturate and even worsen slightly, showing signs of overfitting. All curves change dramatically at first, demonstrating the effectiveness of the clique selection algorithm. In fact, fewer than 10 vertex cliques are sufficient for this problem.

5. Protein Secondary Structure Prediction

For the protein secondary structure prediction task, we used the RS126 dataset, on which many current methods have been developed and tested (Cuff & Barton, 1999). It is a non-homologous dataset: among the 126 protein chains, no two proteins share more than 25% sequence identity over a length of more than 80 residues (Cuff & Barton, 1999). The dataset can be downloaded from http://barton.ebi.ac.uk/. We adopt the DSSP definition of protein secondary structure (Kabsch & Sander, 1983), which is based on hydrogen bonding patterns and geometric constraints. Following the discussion in (Cuff & Barton, 1999), the 8 DSSP labels are reduced to a 3-state model as follows: H and G map to helix (H), E and B map to sheet (E), and all other states map to coil (C). The state-of-the-art performance for secondary structure prediction is achieved by window-based methods using position-specific scoring matrices (PSSM) as input features, i.e., PSI-BLAST profiles, together with support vector machines (SVMs) as the underlying learning algorithm (Jones, 1999; Kim & Park, 2003). Finally, the raw predictions are fed into a second-layer SVM to filter out physically unrealistic predictions, such as one sheet residue surrounded by helix residues (Jones, 1999).
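The DSSP reduction described above is a simple table lookup; a minimal sketch (ours, not the authors' code) follows.

```python
# Sketch of the DSSP label reduction used above (following Cuff & Barton, 1999):
# H and G map to helix (H), E and B map to sheet (E), everything else to coil (C).

DSSP_TO_3 = {"H": "H", "G": "H", "E": "E", "B": "E"}

def reduce_dssp(labels):
    """Map a sequence of 8-state DSSP labels to the 3-state {H, E, C} alphabet."""
    return [DSSP_TO_3.get(label, "C") for label in labels]

# e.g. reduce_dssp(list("HGEBTSIC")) -> ['H', 'H', 'E', 'E', 'C', 'C', 'C', 'C']
```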

Figure 2. Left: The galaxy data. Center: Kernel logistic regression, comparing two kernels: RBF and a graph kernel using the unlabeled data. Right: Kernel conditional random fields, which take into account the sequential structure of the data. (Test error rate plotted against training set size.)

Figure 3. Left: The Gaussian mixture data (only a few data points are shown). Center: Kernel logistic regression with an RBF kernel. Right: Kernel CRF with the same kernel. (Test error rate plotted against the number of training sequences.)

In our experiments, we apply a linear transformation $L$ to the PSSM matrix elements, with $L(x) = 0$ for $x \le -5$, $L(x) = 1/2 + x/10$ for $-5 < x < 5$, and $L(x) = 1$ for $x \ge 5$. This is the same transform used by Kim and Park (2003), which achieved one of the best results in the recent CASP (Critical Assessment of Structure Prediction) competition. The window size is set to 13 by cross-validation, and each position in the window contributes 21 features (the 20 amino acids plus gap).

Clique selection. We use an RBF kernel with bandwidth $\sigma = 0.1$ chosen by cross-validation. Figure 5 (left) shows the kernel CRF risk reduction as clique selection proceeds, when only vertex clique candidates are allowed (note that there are always position-independent edge parameters in the KCRF models, to prevent the models from degrading into kernel logistic regression), and when both vertex and edge cliques are allowed. (The kernel between vertex cliques is $K(z, z') = K(x, x')\,\delta(y_1, y'_1)$, and between edge cliques it is $K(z, z') = K(x, x')\,\delta(y_1, y'_1)\,\delta(y_2, y'_2)$.) The total number of clique candidates is about 4800 for vertex-only selection, and larger when edge cliques are included. The rapid reduction in risk indicates that sparse training of kernel CRFs is successful. Also, when more flexibility is allowed by including edge cliques, the risk reduction is much faster. The more flexible model also has higher test set log likelihood (center), although this does not improve the test set accuracy much (right). These observations are generally true for other trials too.

Per-residue accuracy. To evaluate prediction performance, we use the overall per-residue accuracy (also known as Q3). We experiment with training set sizes of 5 and 10 sequences respectively. For each size we perform 10 trials in which the training sequences are randomly sampled and the remaining proteins are used as the test set. For the kernel CRF we select 300 cliques, again from either vertex candidates alone or vertex and edge candidates. We compare against the SVM-light package (Joachims, 1998) for the SVM classifier. All methods use the same RBF kernel. See Table 1: KCRFs and SVMs have comparable performance.

Transition accuracy. Further information can be obtained by studying transition boundaries, for example, the transition from coil to sheet. From the point of view of structural biology, these transition boundaries may provide important information about how proteins fold in three dimensions. On the other hand, these are the positions where most secondary structure prediction systems fail. A transition boundary is defined as a pair of adjacent positions (i, i+1) whose true labels differ; it is classified correctly only if both labels are correct. This is a very hard problem, as can be seen in Table 2, and KCRFs are able to achieve a considerable improvement over SVMs.
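A minimal sketch of the PSSM preprocessing described above follows; note that the exact form of the clipping transform $L(x)$ is reconstructed from the text (the 1/2 offset is inferred so that the transform is continuous), and the zero-padding at sequence ends is an assumption.

```python
# Sketch (reconstruction, not the authors' code) of the PSSM feature construction:
# clip each profile score x with L(x) = 0 for x <= -5, 1/2 + x/10 for -5 < x < 5,
# and 1 for x >= 5, then take a window of width 13 centered at each position.

import numpy as np

def clip_transform(x):
    """The linear clipping transform L applied elementwise."""
    return np.clip(0.5 + x / 10.0, 0.0, 1.0)

def window_features(pssm, window=13):
    """pssm: array of shape (sequence_length, 21) of PSI-BLAST profile scores.
    Returns one feature vector of length window * 21 per position,
    with positions beyond the sequence ends padded with zeros."""
    n, d = pssm.shape
    half = window // 2
    padded = np.vstack([np.zeros((half, d)), clip_transform(pssm), np.zeros((half, d))])
    return np.stack([padded[i:i + window].ravel() for i in range(n)])

# usage: features = window_features(np.random.randn(80, 21) * 5)
```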

Figure 4. Clique selection on the Gaussian mixture data. Left: regularized risk; Center: test set log likelihood; Right: test set accuracy (each plotted against the number of selected vertices).

Figure 5. Clique selection for KCRFs on the protein data. Left: regularized risk; Center: test set log likelihood; Right: test set accuracy (each plotted against the number of selected cliques). The two curves represent the cases where only vertex cliques are selected (dashed) vs. both vertex and edge cliques are selected (solid).

Semi-supervised learning. We start with an unweighted 10-nearest neighbor graph over positions in both training and test sequences, with the metric being Euclidean distance in the feature space. Then the eigensystem of the normalized Laplacian is computed. The semi-supervised graph kernel is obtained with the function $r(\lambda_i) = 1/(\lambda_i + 0.01)$ applied to the first 200 eigenvalues; the remaining eigenvalues are set to zero. We use the graph kernel together with the RBF kernel in the KCRF. As each clique candidate is associated with a kernel, we now select the two best candidates per iteration, one with the graph kernel and the other with the RBF kernel. We still run for 300 iterations in all trials. We also report results using transductive SVMs (TSVMs) (Joachims, 1999) with the RBF kernel. From the results in Table 3, we can see that the semi-supervised graph kernel is significantly better than TSVMs on the 5-protein dataset, while achieving no improvement on the other. To diagnose the cause, we examined the graph together with all the test labels. We find that the labels are not smooth with respect to the graph: on average only 54.5% of a node's neighbors have the same label as that node. Detecting faulty graphs without using a large number of labels, and constructing better graphs, remain future research. The approximate average running time of each trial, including both training and testing, is 30 minutes for KCRFs, 7 minutes for SVMs, and 16 hours for TSVMs. For KCRFs the majority of the time is spent on clique selection.

6. Conclusion

Kernel conditional random fields have been introduced as a framework for approaching graph-structured classification problems. A representer theorem was derived which shows how KCRFs can be motivated by regularization theory. The resulting techniques combine the strengths of hidden Markov models, or more general Bayesian networks, kernel machines, and standard discriminative linear classifiers including logistic regression and SVMs. The formalism presented is quite general, and should apply naturally to a wide range of problems. Our experimental results on synthetic data, while carefully controlled to be simple, clearly indicate how sequence modeling, graph kernels for semi-supervised learning, and clique selection for sparse representations work together within this framework. The success of these methods in real problems will depend on the choice of suitable kernels that capture the structure of the data.

For protein secondary structure prediction, our results are only suggestive. Secondary structure prediction is a problem that has been extensively studied for more than 20 years; yet the task remains difficult, with prediction accuracies remaining low. The major bottleneck lies in beta-sheet prediction, where there are long range interactions between regions of the protein chain that are not necessarily consecutive in the primary sequence.
Our experimental results indicate that KCRFs and semi-supervised kernels have the potential to lead to progress on this problem, where the state of the art has been based on heuristic sliding window methods. However, our results also suggest that the improvement due to semi-supervised learning is hindered by the lack of a good similarity measure with which to construct the graph. The construction of an effective graph is a challenge that may best be tackled by biologists and machine learning researchers working together.

Table 1. Per-residue accuracy (with standard deviation) of different methods for secondary structure prediction with the RBF kernel, on the 5-protein and 10-protein training sets. KCRF (v) uses vertex cliques only; KCRF (v+e) uses vertex and edge cliques; SVM is the baseline.

Table 2. Transition accuracy (with standard deviation) of KCRF (v), KCRF (v+e), and SVM on the 5-protein and 10-protein training sets.

Table 3. Per-residue accuracy (with standard deviation) of the semi-supervised methods, KCRF (v), KCRF (v+e), and transductive SVM, on the 5-protein and 10-protein training sets.

Acknowledgments

This work was supported in part by NSF ITR grants CCR, IIS, and IIS.

References

Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. ICML 2003.

Belkin, M., & Niyogi, P. (2002). Semi-supervised learning on manifolds (Technical Report). University of Chicago.

Chapelle, O., Weston, J., & Schoelkopf, B. (2002). Cluster kernels for semi-supervised learning. NIPS 2002.

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. Proceedings of EMNLP 2002.

Cuff, J., & Barton, G. (1999). Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 34.

Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE PAMI, 19.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. ECML 1998.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. ICML 1999.

Jones, D. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292.

Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22.

Kim, H., & Park, H. (2003). Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng., 16.

Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33.

Kumar, S., & Hebert, M. (2003). Discriminative fields for modeling spatial dependencies in natural images. NIPS 2003.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML 2001.

Lanckriet, G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. (2004). Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5.

McCallum, A. (2003). Efficiently inducing features of conditional random fields. UAI 2003.

Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003). Table extraction using conditional random fields. SIGIR 2003.

Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. Proceedings of HLT-NAACL.

Smola, A., & Kondor, R. (2003). Kernels and regularization on graphs. COLT 2003.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. NIPS 2003.
Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. NIPS 2001.

Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003.


More information

Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher Raj Dabre 11305R001

Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher Raj Dabre 11305R001 Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher - 113059006 Raj Dabre 11305R001 Purpose of the Seminar To emphasize on the need for Shallow Parsing. To impart basic information about techniques

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Support Vector Machines Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Support Vector Machines: introduction 2 Support Vector Machines (SVMs) SVMs

More information

Machine Learning. Chao Lan

Machine Learning. Chao Lan Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian

More information

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron CS 188: Artificial Intelligence Spring 2010 Lecture 24: Perceptrons and More! 4/20/2010 Announcements W7 due Thursday [that s your last written for the semester!] Project 5 out Thursday Contest running

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

Structured Max Margin Learning on Image Annotation and Multimodal Image Retrieval

Structured Max Margin Learning on Image Annotation and Multimodal Image Retrieval Structured Max Margin Learning on Image Annotation and Multimodal Image Retrieval Zhen Guo Zhongfei (Mark) Zhang Eric P. Xing Christos Faloutsos Computer Science Department, SUNY at Binghamton, Binghamton,

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Conditional Random Fields for Object Recognition

Conditional Random Fields for Object Recognition Conditional Random Fields for Object Recognition Ariadna Quattoni Michael Collins Trevor Darrell MIT Computer Science and Artificial Intelligence Laboratory Cambridge, MA 02139 {ariadna, mcollins, trevor}@csail.mit.edu

More information

Content-based image and video analysis. Machine learning

Content-based image and video analysis. Machine learning Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Classification: Linear Discriminant Functions

Classification: Linear Discriminant Functions Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions

More information

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Jian Guo, Debadyuti Roy, Jing Wang University of Michigan, Department of Statistics Introduction In this report we propose robust

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

Modifying Kernels Using Label Information Improves SVM Classification Performance

Modifying Kernels Using Label Information Improves SVM Classification Performance Modifying Kernels Using Label Information Improves SVM Classification Performance Renqiang Min and Anthony Bonner Department of Computer Science University of Toronto Toronto, ON M5S3G4, Canada minrq@cs.toronto.edu

More information

1 Training/Validation/Testing

1 Training/Validation/Testing CPSC 340 Final (Fall 2015) Name: Student Number: Please enter your information above, turn off cellphones, space yourselves out throughout the room, and wait until the official start of the exam to begin.

More information

SVMs for Structured Output. Andrea Vedaldi

SVMs for Structured Output. Andrea Vedaldi SVMs for Structured Output Andrea Vedaldi SVM struct Tsochantaridis Hofmann Joachims Altun 04 Extending SVMs 3 Extending SVMs SVM = parametric function arbitrary input binary output 3 Extending SVMs SVM

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

A Geometric Perspective on Data Analysis

A Geometric Perspective on Data Analysis A Geometric Perspective on Data Analysis Partha Niyogi The University of Chicago Collaborators: M. Belkin, X. He, A. Jansen, H. Narayanan, V. Sindhwani, S. Smale, S. Weinberger A Geometric Perspectiveon

More information

Adapting SVM Classifiers to Data with Shifted Distributions

Adapting SVM Classifiers to Data with Shifted Distributions Adapting SVM Classifiers to Data with Shifted Distributions Jun Yang School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 juny@cs.cmu.edu Rong Yan IBM T.J.Watson Research Center 9 Skyline

More information