Kernel Conditional Random Fields: Representation and Clique Selection

John Lafferty    Xiaojin Zhu    Yan Liu
School of Computer Science, Carnegie Mellon University, Pittsburgh PA, USA

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

Abstract

Kernel conditional random fields (KCRFs) are introduced as a framework for discriminative modeling of graph-structured data. A representer theorem for conditional graphical models is given which shows how kernel conditional random fields arise from risk minimization procedures defined using Mercer kernels on labeled graphs. A procedure for greedily selecting cliques in the dual representation is then proposed, which allows sparse representations. By incorporating kernels and implicit feature spaces into conditional graphical models, the framework enables semi-supervised learning algorithms for structured data through the use of graph kernels. The framework and clique selection methods are demonstrated in synthetic data experiments, and are also applied to the problem of protein secondary structure prediction.

1. Introduction

Many classification problems involve the annotation of data items having multiple components, with each component requiring a classification label. Such problems are challenging because the interaction between the components can be rich and complex. In text, speech, and image processing, for example, it is often useful to label individual words, sounds, or image patches with categories to enable higher level processing; but these labels can depend on one another in a highly complex manner. For biological sequence annotation, it is desirable to annotate each amino acid in a protein with a label, with the collection of labels representing the global geometric structure of the molecule. Here the labels in principle depend on the physical characteristics of the molecule and its ambient chemical environment. In each case, classification tasks naturally arise which clearly violate the assumption of independent and identically distributed instances that is made in the majority of classification procedures in statistics and machine learning. It is therefore of central importance to extend recent advances in classification theory and practice to structured, non-independent data classification problems.

Conditional random fields (Lafferty et al., 2001) have been proposed as an approach to modeling the interactions between labels in such problems using the tools of graphical models. A conditional random field (CRF) is a model that assigns a joint probability distribution over labels conditional on the input, where the distribution respects the independence relations encoded in a graph. In general, the labels are not assumed to be independent, nor are the observations conditionally independent given the labels, as is assumed in generative models such as hidden Markov models. The CRF framework has already been used to obtain promising results in a number of domains where there is interaction between labels, including tagging, parsing and information extraction in natural language processing (Collins, 2002; Sha & Pereira, 2003; Pinto et al., 2003) and the modeling of spatial dependencies in image processing (Kumar & Hebert, 2003). In related work, Taskar et al. (2003) have studied random fields (also known as Markov networks) fit using loss functions that incorporate a generalized notion of margin, and have observed how the kernel trick applies to this family of models.
We present an extension of conditional random fields that permits the use of implicit feature spaces through Mercer kernels, using the framework of regularization theory. Such an extension is motivated by the significant body of recent work that has shown kernel methods to be extremely effective in a wide variety of machine learning techniques; for example, they enable the integration of multiple sources of information in a principled manner. Our introduction of Mercer kernels into conditional graphical models is also motivated by the problem of semi-supervised learning. In many domains, the collection of annotated training data is difficult and costly, as it requires the efforts of expert human annotators, while the collection of unlabeled data may be relatively easy and inexpensive. The emerging theme in recent research in semi-supervised learning is that kernel methods, in particular those based on graphical representations of unlabeled data, form a theoretically attractive and empirically promising set of techniques for combining labeled and unlabeled data (Belkin & Niyogi, 2002; Chapelle et al., 2002; Smola & Kondor, 2003; Zhu et al., 2003).

In Section 2 we formalize the learning problem and present a version of the classical representer theorem of Kimeldorf and Wahba (1971). Unlike the classical result, for kernel conditional random fields the dual parameters depend on all potential assignments of labels to cliques in the graph, not only the observed labels. This motivates the need for algorithms to derive sparse representations, since the full representation has parameters for each labeled clique in the graphs appearing in the training data. In Section 3 we present a greedy algorithm for selecting a small number of representative cliques. This clique selection algorithm parallels the import vector selection algorithm of kernel logistic regression (Zhu & Hastie, 2001), and the feature selection methods that have been previously proposed for random fields and conditional random fields using explicit features (McCallum, 2003). In Section 4 the ideas and methods are demonstrated on two synthetic data sets, where the effects of the underlying graph kernels, clique selection, and sequential modeling can be clearly seen. In Section 5 we report the results of experiments using kernel CRFs for protein secondary structure prediction. This is the task of mapping primary sequences of amino acids onto a string of secondary structure assignments, such as helix, sheet, or coil. It is widely believed that secondary structure can contribute valuable information to discerning how proteins fold in three dimensions. We compare kernel conditional random fields, estimated using clique selection, against support vector machine classifiers, with both methods using kernels derived from position-specific scoring matrices (PSI-BLAST profiles) as input features. In addition, we give results for the use of graph kernels derived from the PSI-BLAST profiles in a transductive, semi-supervised framework for estimating the kernel CRFs. The paper concludes with a brief discussion in Section 6.

2. Representation

Before proceeding with formalism, we give some intuition for what our framework is intended to capture. Our goal is to annotate structured data, where the structure is represented by a graph. Labels are to be assigned to the nodes in the graph in order to minimize some loss function, such as 0-1 error; the labels come from a small set $\mathcal{Y}$, for example $\mathcal{Y} = \{\text{red}, \text{blue}, \text{green}\}$. Each vertex in the graph is associated with a feature vector $x_v \in \mathcal{X}$. In image processing, the feature vector at a node might include a pixel intensity, as well as average pixel intensities smoothed over neighboring regions using wavelets. In protein secondary structure prediction, each node might correspond to an amino acid in the protein, and the feature vector at a node may include an amino acid histogram of all protein fragments in a database which closely match the given protein at that node. In the following section we present our notation and formal framework for such problems.

2.1. Cliques and labeled graphs

Let $\mathcal{G}$ denote a collection of finite graphs.
For example, $\mathcal{G}$ might be the set of finite chains, appropriate for sequence modeling, or the rectangular 2-dimensional grids, appropriate for some image processing tasks. The set of vertices of a graph $g \in \mathcal{G}$ is denoted by $V(g)$, and the size of the graph is the number of vertices, denoted $|g| = |V(g)|$. A clique is a subset of the vertices which is fully connected, with any pair of vertices joined by an edge; we denote the set of cliques in the graph by $\mathcal{C}(g)$. The number of vertices in a clique $c$ is denoted by $|c|$. Similarly, we denote by $\mathcal{C}(\mathcal{G}) = \{(g, c) : g \in \mathcal{G},\ c \in \mathcal{C}(g)\}$ the collection of cliques across varying graphs. In other words, a member of $\mathcal{C}(\mathcal{G})$ consists of a graph and a distinguished clique of that graph. We will work with kernels that compare components of different graphs. For example, we could consider a kernel $K : \mathcal{C}(\mathcal{G}) \times \mathcal{C}(\mathcal{G}) \to \{0, 1\}$ given by $K((g, c), (g', c')) = \delta(|c|, |c'|)$.

We next consider labelings of a graph. Let $\mathcal{Y}$ be a finite set of labels; infinite $\mathcal{Y}$ is also possible in a regression framework, but we restrict to finite $\mathcal{Y}$ for simplicity. The set of $\mathcal{Y}$-labelings of a graph $g$ is denoted $\mathcal{Y}(g) = \{ y : y \in \mathcal{Y}^{|g|} \}$, and the collection of all $\mathcal{Y}$-labeled graphs is $\mathcal{Y}(\mathcal{G}) = \{(g, y) : g \in \mathcal{G},\ y \in \mathcal{Y}(g)\}$. Similarly, let $\mathcal{X}$ be an input feature space; for example, $\mathcal{X} = \mathbb{R}^n$. The set $\mathcal{X}(g) = \{ x : x \in \mathcal{X}^{|g|} \}$ denotes the set of assignments of a feature vector to each vertex of the graph $g$; $\mathcal{X}(\mathcal{G}) = \{(g, x) : g \in \mathcal{G},\ x \in \mathcal{X}(g)\}$ is the collection of all such annotated graphs. Finally, let $\mathcal{Y}_{\mathcal{C}}(g) = \{ (c, y_c) : c \in \mathcal{C}(g),\ y_c \in \mathcal{Y}^{|c|} \}$ be the set of $\mathcal{Y}$-labeled cliques in a graph. As above, we similarly define $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(g) = \{(x, c, y_c) : x \in \mathcal{X}(g),\ (c, y_c) \in \mathcal{Y}_{\mathcal{C}}(g)\}$ and $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G}) = \{(g, x, c, y_c) : g \in \mathcal{G},\ (x, c, y_c) \in \mathcal{X}\mathcal{Y}_{\mathcal{C}}(g)\}$.

2.2. Representer Theorem

The prediction task for conditional graphical models is to learn a function $h : \mathcal{X}(\mathcal{G}) \to \mathcal{Y}(\mathcal{G})$, where $h(g, x) \in \mathcal{Y}(g)$ is a labeling of $g$, with the goal of minimizing a suitably defined loss function. The classifier $h = h_n$ is chosen based on a labeled sample $\{(g^{(i)}, x^{(i)}, y^{(i)})\}_{i=1}^n$, with each $(g^{(i)}, x^{(i)}, y^{(i)})$ being a labeled graph, the graph possibly changing from example to example.
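To ground this notation, the following is a small sketch (an illustration, not code from the paper) that enumerates the cliques $\mathcal{C}(g)$ and labeled cliques $\mathcal{Y}_{\mathcal{C}}(g)$ of a chain graph and implements the toy clique kernel $K((g, c), (g', c')) = \delta(|c|, |c'|)$ mentioned above; the three-element label set is an arbitrary choice.

```python
# A concrete instance of the Section 2.1 notation (a minimal sketch, not from the
# paper): a chain graph g, its clique set C(g) (vertices and edges), its labeled
# cliques Y_C(g) for Y = {H, E, C}, and the toy kernel
# K((g, c), (g', c')) = delta(|c|, |c'|) comparing cliques of different graphs.

import itertools

Y = ["H", "E", "C"]

def chain_cliques(n):
    """C(g) for a chain with n vertices: singleton cliques and adjacent pairs."""
    return [(t,) for t in range(n)] + [(t, t + 1) for t in range(n - 1)]

def labeled_cliques(n):
    """Y_C(g): every clique paired with every labeling of that clique."""
    return [(c, y_c) for c in chain_cliques(n)
                     for y_c in itertools.product(Y, repeat=len(c))]

def clique_kernel(c, c_prime):
    """K((g, c), (g', c')) = delta(|c|, |c'|)."""
    return 1.0 if len(c) == len(c_prime) else 0.0

# A chain with |g| = 4 has 4 vertex cliques and 3 edge cliques, hence
# 4 * 3 + 3 * 9 = 39 labeled cliques.
assert len(labeled_cliques(4)) == 39
```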

To limit the complexity of the hypothesis, we will assume that it is determined completely by a function $f : \mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G}) \to \mathbb{R}$. Let $f(g, x)$ denote the collection of values $\{f(g, x, c, y_c)\}$, with $c \in \mathcal{C}(g)$ varying over the cliques of $g$ and $y_c \in \mathcal{Y}^{|c|}$ varying over all possible labelings of that clique. We assume that a loss function $\phi(y, f(g, x))$ is given. As an important example, and the loss function used in this paper, consider the negative log loss

$$\phi(y, f(g, x)) = -\sum_{c \in \mathcal{C}(g)} f_c(x, y_c) + \log \sum_{y' \in \mathcal{Y}(g)} \exp \sum_{c \in \mathcal{C}(g)} f_c(x, y'_c) \qquad (1)$$

where $f_c(x, y_c)$ is shorthand for $f(g, x, c, y_c)$. The negative log marginal loss could also be considered for minimizing the per-node error. The negative log loss function corresponds to a conditional random field given by

$$p(y \mid g, x) = Z^{-1}(g, x, f)\, \exp\Big( \sum_{c \in \mathcal{C}(g)} f_c(x, y_c) \Big). \qquad (2)$$

We now discuss how the representer theorem of kernel machines (Kimeldorf & Wahba, 1971) applies to conditional graphical models. While this is a simple extension, we are not aware of an analogous formulation in the statistics or machine learning literature. Let $K$ be a Mercer kernel on $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G})$; thus $K((g, x, c, y_c), (g', x', c', y'_{c'})) \in \mathbb{R}$ for each $(x, c, y_c) \in \mathcal{X}\mathcal{Y}_{\mathcal{C}}(g)$ and $(x', c', y'_{c'}) \in \mathcal{X}\mathcal{Y}_{\mathcal{C}}(g')$. Intuitively, this assigns a measure of similarity between a labeled clique in one graph and a labeled clique in a (possibly different) graph. We denote by $\mathcal{H}_K$ the associated reproducing kernel Hilbert space, and by $\|\cdot\|_K$ the associated norm on $L_2(\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G}))$. Consider a regularized loss function of the form

$$R_\phi(f) = \sum_{i=1}^n \phi\big(y^{(i)}, f(g^{(i)}, x^{(i)})\big) + \Omega\big(\|f\|_K\big).$$

It is important to note that the loss depends on all possible assignments $y_c$ of labels to each clique, not just those observed in the labeled data $y^{(i)}$. Suppressing the dependence on the graph $g$ in the notation, let $K_c(x, y_c;\ \cdot) = K((g, x, c, y_c),\ \cdot)$. Following the argument for the standard representer theorem, it can easily be shown that the minimizer of a regularized loss function of the above form can be expressed in terms of the basis functions $f(\cdot) = K(\cdot, (g^{(i)}, x^{(i)}, c, y_c))$.

Proposition (Representer theorem for CRFs). Let $K$ be a Mercer kernel on $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G})$ with associated RKHS norm $\|\cdot\|_K$, and let $\Omega : \mathbb{R}^+ \to \mathbb{R}^+$ be strictly increasing. Then the minimizer $f^\star$ of

$$R_\phi(f) = \sum_{i=1}^n \phi\big(y^{(i)}, f(g^{(i)}, x^{(i)})\big) + \Omega\big(\|f\|_K\big),$$

if it exists, has the form

$$f^\star(\cdot) = \sum_{i=1}^n \sum_{c \in \mathcal{C}(g^{(i)})} \sum_{y_c \in \mathcal{Y}^{|c|}} \alpha_c^{(i)}(y_c)\, K_c(x^{(i)}, y_c;\ \cdot).$$

The key property distinguishing this result from the standard representer theorem is that the dual parameters $\alpha_c^{(i)}(y_c)$ now depend on all assignments of labels.

2.3. Two special cases

Let $K$ be a Mercer kernel on $Z = \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$. Thus, the kernel is defined in terms of the entries $K(z, z')$ where $z = (x, y_1, y_2)$. Using $K$ we can define a kernel on edges in $\mathcal{X}\mathcal{Y}_{\mathcal{C}}(\mathcal{G})$ by

$$K\big((g, x, (v_1, v_2), (y_1, y_2)),\ (g', x', (v'_1, v'_2), (y'_1, y'_2))\big) = K\big((x_{v_1}, y_1, y_2),\ (x'_{v'_1}, y'_1, y'_2)\big).$$

For the regularized risk minimization problem

$$\min_{f \in \mathcal{H}_K} R_\phi(f, \lambda) = \min_{f \in \mathcal{H}_K} \sum_{i=1}^n \phi\big(y^{(i)}, f(x^{(i)})\big) + \lambda \|f\|_K,$$

the CRF representer theorem implies that the solution $f^\star$ has the form

$$f^\star_{(v_1, v_2)}(x, y_1, y_2) = \sum_{i=1}^n \sum_{(v, v')} \sum_{y, y'} \alpha^{(i)}_{(v, v')}(y, y')\, K\big((x_{v_1}, y_1, y_2),\ (x^{(i)}_{v}, y, y')\big).$$

In the special case of the kernel $K(z, z') = K(x, x')\,\delta(y_1, y'_1)$ it follows that

$$f^\star_{(v_1, v_2)}(x, y_1, y_2) = \sum_{i=1}^n \sum_{v \in V(g^{(i)})} \alpha^{(i)}_{v}(y_1)\, K(x_{v_1}, x^{(i)}_{v}).$$

Under the probabilistic model (2), this is simply kernel logistic regression. In the special case of $K(z, z') = K(x, x')\,\delta(y_1, y'_1) + \delta(y_1, y'_1)\,\delta(y_2, y'_2)$ we get that

$$f^\star_{(v_1, v_2)}(x, y_1, y_2) = \sum_{i, v} \alpha^{(i)}_{v}(y_1)\, K(x^{(i)}_{v}, x_{v_1}) + \alpha(y_1, y_2),$$

and we recover a simple type of semiparametric CRF.
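As an illustration of these definitions (a minimal sketch, not code from the paper), the following evaluates the negative log loss of equation (1) and the conditional distribution of equation (2) for a short linear chain by brute-force enumeration over all labelings; the label set, the toy observations, and the clique function f below are arbitrary stand-ins.

```python
# Minimal sketch (illustration only): equations (1) and (2) for a tiny chain,
# with the partition function computed by brute force over all labelings.

import itertools
import math

LABELS = ["H", "E", "C"]                   # the label set Y

def cliques(n):
    """C(g) for a length-n chain: vertex cliques (t,) and edge cliques (t, t+1)."""
    return [(t,) for t in range(n)] + [(t, t + 1) for t in range(n - 1)]

def score(f, x, y):
    """sum_{c in C(g)} f_c(x, y_c) for a labeling y of the chain."""
    return sum(f(x, c, tuple(y[t] for t in c)) for c in cliques(len(x)))

def neg_log_loss(f, x, y):
    """phi(y, f(g, x)) = -sum_c f_c(x, y_c) + log sum_{y'} exp sum_c f_c(x, y'_c)."""
    log_z = math.log(sum(math.exp(score(f, x, list(yp)))
                         for yp in itertools.product(LABELS, repeat=len(x))))
    return -score(f, x, y) + log_z

def prob(f, x, y):
    """p(y | g, x) = Z^{-1}(g, x, f) exp(sum_c f_c(x, y_c)), equation (2)."""
    return math.exp(-neg_log_loss(f, x, y))

# toy clique function: reward vertex labels that match a noisy observation,
# and reward equal labels on neighbouring vertices
def f(x, c, y_c):
    if len(c) == 1:
        return 2.0 if x[c[0]] == y_c[0] else 0.0
    return 0.5 if y_c[0] == y_c[1] else 0.0

x = ["H", "H", "E", "E"]                   # stand-ins for per-vertex feature vectors
y = ["H", "H", "E", "E"]
print(neg_log_loss(f, x, y), prob(f, x, y))
```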
3. Clique Selection

The representer theorem shows that the minimizing function $f$ is supported by labeled cliques over the training examples; however, this may result in an extremely large number of parameters. We therefore pursue a strategy of incrementally selecting cliques in order to greedily reduce the regularized risk. The resulting procedure is parallel to forward stepwise logistic regression, to related methods for kernel logistic regression (Zhu & Hastie, 2001), and to the greedy selection procedure presented in (Della Pietra et al., 1997).

Our algorithm will maintain an active set $A = \{(g^{(i)}, c, y_c)\} \subset \mathcal{Y}_{\mathcal{C}}(\mathcal{G})$ of labeled cliques, where the labelings are not restricted to those appearing in the training data. Each such candidate clique can be represented by a basis function $h(\cdot) = K((g^{(i)}, x^{(i)}, c, y_c),\ \cdot) \in \mathcal{H}_K$, and is assigned a parameter $\alpha_h = \alpha_c^{(i)}(y_c)$. We work with the regularized risk

$$R_\phi(f) = \sum_i \phi\big(y^{(i)}, f(g^{(i)}, x^{(i)})\big) + \frac{\lambda}{2} \|f\|_K^2 \qquad (3)$$

where $\phi$ is the log loss of equation (1). To evaluate a candidate $h$, one strategy is to compute the gain $\sup_\alpha\, R_\phi(f) - R_\phi(f + \alpha h)$, and to choose the candidate $h$ having the largest gain. This presents an apparent difficulty, since the optimal parameter $\alpha$ cannot be computed in closed form, and must be evaluated numerically. For sequence models this involves forward-backward calculations for each candidate $h$, the cost of which is prohibitive. As an alternative, we adopt the functional gradient descent approach, which evaluates a small change to the current function. For a given candidate $h$, consider adding $h$ to the current model with small weight $\varepsilon$; thus $f \mapsto f + \varepsilon h$. Then $R_\phi(f + \varepsilon h) = R_\phi(f) + \varepsilon\, dR_\phi(f, h) + O(\varepsilon^2)$, where the functional derivative of $R_\phi$ at $f$ in the direction $h$ is computed as

$$dR_\phi(f, h) = E_f[h] - \tilde{E}[h] + \lambda \langle f, h \rangle_K$$

where $\tilde{E}[h] = \sum_i h(x^{(i)}, y^{(i)})$ is the empirical expectation and $E_f[h] = \sum_i \sum_y p(y \mid x^{(i)}, f)\, h(x^{(i)}, y)$ is the model expectation conditioned on $x$, combined with the empirical distribution on $x$. The idea is that in directions $h$ where the functional gradient $dR_\phi(f, h)$ is large, the model is mismatched with the labeled data; this direction should be added to the model to make a correction. This results in the greedy clique selection algorithm summarized in Figure 1. Following our earlier notation, $h(x^{(i)}, y) = \sum_{c \in \mathcal{C}(g^{(i)})} h(g^{(i)}, x^{(i)}, c, y_c)$ is the sum over all cliques. The candidate functions $h$ might include functions of the form $h(\cdot) = K((g^{(i)}, x^{(i)}, c, y_c),\ \cdot)$, where $i$ is a specific instance, $c$ is a particular clique of $g^{(i)}$, and $y_c$ is a labeling of that clique. Alternatively, in a slightly less greedy manner, at each step in the selection procedure a specific instance and clique may be selected, and functions for each clique labeling may be added.

Figure 1 (Greedy Clique Selection). Initialize with $f = 0$, and iterate:
1. For each candidate $h \in \mathcal{H}_K$, supported by a single labeled clique, calculate the functional derivative $dR_\phi(f, h)$.
2. Select the candidate $h^\star = \arg\max_h\, dR_\phi(f, h)$ having the largest gradient direction. Set $f \leftarrow f + \alpha_{h^\star} h^\star$.
3. Estimate the parameters $\alpha_h$ for each active $h$ by minimizing $R_\phi(f)$.
Labeled cliques encode basis functions $h$ which are greedily added to the model, using a form of functional gradient descent.

In the experiments reported below for sequences, the marginal probabilities $p(y_t = y \mid x)$ and expected counts for the state transitions are required; these are computed using the forward-backward algorithm, with log domain arithmetic to avoid underflow. A quasi-Newton method (BFGS, with a cubic-polynomial line search) is used to estimate the parameters in step 3.
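For small problems the functional derivative above can be computed directly. The sketch below (an illustration under simplifying assumptions, not the paper's implementation) scores candidates by brute-force enumeration and picks the one with the largest gradient magnitude, standing in for steps 1 and 2 of Figure 1; a real sequence model would use forward-backward recursions, and the indicator candidates in the usage snippet are stand-ins for kernel basis functions.

```python
# Sketch of the functional-gradient score for greedy clique selection:
#   dR_phi(f, h) = E_f[h] - E_tilde[h] + lambda <f, h>_K .
# Expectations are computed by brute-force enumeration over labelings.

import itertools
import math

LABELS = ["H", "E", "C"]

def cliques(n):
    return [(t,) for t in range(n)] + [(t, t + 1) for t in range(n - 1)]

def total(fn, x, y):
    """fn(x, y) = sum over all cliques c of fn(x, c, y_c) for a chain."""
    return sum(fn(x, c, tuple(y[t] for t in c)) for c in cliques(len(x)))

def functional_gradient(f, h, data, lam=0.1, inner_product=0.0):
    """dR_phi(f, h); `inner_product` stands in for <f, h>_K (zero when f = 0)."""
    grad = lam * inner_product
    for x, y in data:
        grad -= total(h, x, y)                          # -E_tilde[h]
        labelings = [list(yp) for yp in itertools.product(LABELS, repeat=len(x))]
        scores = [total(f, x, yp) for yp in labelings]
        log_z = math.log(sum(math.exp(s) for s in scores))
        for yp, s in zip(labelings, scores):            # +E_f[h]
            grad += math.exp(s - log_z) * total(h, x, yp)
    return grad

def select_candidate(f, candidates, data):
    """Step 2 of Figure 1: pick the candidate whose gradient has the largest magnitude."""
    return max(candidates, key=lambda h: abs(functional_gradient(f, h, data)))

# toy usage: indicator "basis functions" standing in for kernel sections K((g,x,c,y_c), .)
data = [(["H", "H", "E"], ["H", "H", "E"])]
candidates = [(lambda x, c, yc, t=t, lab=lab: float(c == (t,) and yc == (lab,)))
              for t in range(3) for lab in LABELS]
f0 = lambda x, c, yc: 0.0                               # start from f = 0
best = select_candidate(f0, candidates, data)
```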
Prediction is carried out using the forward-backward algorithm to compute marginals, rather than using the Viterbi algorithm.

3.1. Combining multiple kernels

The above use of kernels enables semi-supervised learning for structured prediction problems. One of the emerging themes in semi-supervised learning is that graph kernels can provide a useful framework for combining labeled and unlabeled data. Here an undirected graph is defined over labeled and unlabeled data instances, and generally the assumption is that labels vary smoothly over the graph. The graph is represented by the weight matrix $W$, and one can construct a kernel from the graph Laplacian by substituting each eigenvalue $\lambda$ with $r(\lambda)$, where $r$ is a non-negative and (typically) decreasing function. This regularizes high frequency components and encourages smooth functions on the graph; see (Smola & Kondor, 2003) for a description of this unifying view of graph kernels. It is important to note that such a use of a graph kernel for semi-supervised learning introduces an additional graphical structure, which should not be confused with the graph representing the explicit dependencies between labels in a CRF. For example, when modeling sequences, the natural CRF graph structure is a chain. By incorporating unlabeled data through the use of a graph kernel, an additional graph that will generally have many cycles is implicitly introduced. However, the graph kernel and a more standard kernel may be naturally combined as a linear combination; see, for example, (Lanckriet et al., 2004).

4. Synthetic Data Experiments

To demonstrate the properties and advantages of KCRFs, we prepared two synthetic datasets: a galaxy dataset to investigate the relation to semi-supervised and sequential learning, and an HMM with Gaussian mixture emission probabilities to demonstrate the properties of clique selection and the advantages of incorporating kernels.

Galaxy. The galaxy dataset is a variant of two spirals; see Figure 2 (left). Note the dense core of points from both classes. The sequences are generated from a 2-state hidden Markov model (HMM), where each state emits instances uniformly from one of the classes. There is a 90% chance of staying in the same state. The idea is that under a sequence model, an example from the core will have a better than random chance of being labeled correctly based on the context. This is not true under a non-sequence model, and the dataset as a whole will thus have about a 20% Bayes error rate under the i.i.d. assumption. We sample 100 sequences of length 20. Note that the choice of semi-supervised vs. standard kernels and of sequence vs. non-sequence models are orthogonal; all four combinations are tested. We construct a semi-supervised graph kernel by first creating an unweighted 10-nearest neighbor graph. We then compute the graph Laplacian $L$ and form the kernel $K = 10\,(L + I)^{-1}$; this corresponds to the function $r(\lambda) = 1/(\lambda + 1)$ on $L$'s eigenvalues, up to the overall scale factor. The standard kernel is the radial basis function (RBF) kernel with bandwidth $\sigma = 5$. All parameters here and below are tuned by cross validation.

Figure 2 (center) shows the results of using kernel logistic regression with the semi-supervised kernel and with the RBF kernel; here the sequence structure is ignored. For each training set size, which ranges from 20 to 400 points, 10 random trials were performed. The error intervals shown are one standard error. When the labeled set size is small, the graph kernel is much better than the RBF kernel. However, both kernels saturate at the 20% Bayes error rate. Next we apply both kernels to the semiparametric KCRF model of Section 2.3; see Figure 2 (right). Note that the x-axis is the number of training sequences; since each sequence has 20 instances, the range is the same as in Figure 2 (center). The kernel CRF is capable of getting under the 20% Bayes error floor of the non-sequence model, with both kernels and sufficient labeled data. However, the graph kernel is able to learn the structure much faster than the RBF kernel. Evidently, the high error rate for low label data sizes prevents the RBF model from effectively using the context.
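The graph kernel used for the galaxy data is an instance of the spectral transform described in Section 3.1. The sketch below is a hedged illustration (only the 10-nearest-neighbor construction and the form of $r$ are taken from the text; everything else is an assumption): it builds an unweighted k-NN graph and forms $K = \text{scale} \cdot U\, r(\Lambda)\, U^\top$, which reduces to $K = 10\,(L + I)^{-1}$ for $r(\lambda) = 1/(\lambda + 1)$ and scale 10.

```python
# Minimal sketch (not the authors' code) of a graph kernel obtained by replacing
# the Laplacian eigenvalues lambda with r(lambda), r non-negative and decreasing.

import numpy as np

def knn_graph(X, k=10):
    """Symmetrized, unweighted k-nearest-neighbor adjacency matrix."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    W = np.zeros_like(d2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    for i, nbrs in enumerate(nearest):
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)

def laplacian_kernel(X, k=10, r=lambda lam: 1.0 / (lam + 1.0), scale=10.0):
    """K = scale * U diag(r(lambda)) U^T from the combinatorial Laplacian L = D - W.
    With r(lam) = 1/(lam + 1) and scale = 10 this reproduces K = 10 (L + I)^{-1}."""
    W = knn_graph(X, k)
    L = np.diag(W.sum(axis=1)) - W
    lam, U = np.linalg.eigh(L)
    return scale * (U * r(lam)) @ U.T

# usage: a kernel over 200 two-dimensional points (labeled and unlabeled together)
K = laplacian_kernel(np.random.randn(200, 2))
```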
HMM with Gaussian mixtures. This more difficult dataset is generated from a 3-state HMM. Each state is a mixture of 2 Gaussians with random mean and covariance. The Gaussians strongly overlap; see Figure 3 (left). The transition probabilities favor remaining in the same state, with probability 0.8, and transitioning to each of the other two states with equal probability 0.1; we generate 100 sequences of length 30. We use an RBF kernel with bandwidth $\sigma$ tuned by cross validation. (A graph kernel is slightly worse than the RBF kernel on this dataset, and is not shown.) We perform 20 trials for each training set size, and in each trial we perform clique selection to select the top 20 vertices.

The center and right plots in Figure 3 show that the semiparametric KCRF again outperforms kernel logistic regression with the same RBF kernel. Figure 4 shows clique selection, with a training size of 20 sequences, averaged over 20 random trials. The regularized risk (left), which is the training set negative log likelihood plus the regularizer, always decreases as we select more vertices into the KCRF. On the other hand, the test set likelihood (center) and accuracy (right) saturate and even worsen slightly, showing signs of overfitting. All curves change dramatically at first, demonstrating the effectiveness of the clique selection algorithm. In fact, fewer than 10 vertex cliques are sufficient for this problem.

5. Protein Secondary Structure Prediction

For the protein secondary structure prediction task, we used the RS126 dataset, on which many current methods have been developed and tested (Cuff & Barton, 1999). It is a non-homologous dataset: among the 126 protein chains, no two proteins share more than 25% sequence identity over a length of more than 80 residues (Cuff & Barton, 1999). The dataset can be downloaded from http://barton.ebi.ac.uk/. We adopt the DSSP definition of protein secondary structure (Kabsch & Sander, 1983), which is based on hydrogen bonding patterns and geometric constraints. Following the discussion in (Cuff & Barton, 1999), the 8 DSSP labels are reduced to a 3-state model as follows: H and G map to helix (H), E and B map to sheet (E), and all other states map to coil (C). The state-of-the-art performance for secondary structure prediction is achieved by window-based methods using position-specific scoring matrices (PSSM) as input features, i.e., PSI-BLAST profiles, together with support vector machines (SVMs) as the underlying learning algorithm (Jones, 1999; Kim & Park, 2003). Finally, the raw predictions are fed into a second-layer SVM to filter out physically unrealistic predictions, such as one sheet residue surrounded by helix residues (Jones, 1999).
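The DSSP reduction described above is a simple table lookup; a minimal sketch (ours, not the authors' code) follows.

```python
# Sketch of the DSSP label reduction used above (following Cuff & Barton, 1999):
# H and G map to helix (H), E and B map to sheet (E), everything else to coil (C).

DSSP_TO_3 = {"H": "H", "G": "H", "E": "E", "B": "E"}

def reduce_dssp(labels):
    """Map a sequence of 8-state DSSP labels to the 3-state {H, E, C} alphabet."""
    return [DSSP_TO_3.get(label, "C") for label in labels]

# e.g. reduce_dssp(list("HGEBTSIC")) -> ['H', 'H', 'E', 'E', 'C', 'C', 'C', 'C']
```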

Figure 2. Left: The galaxy data. Center: Kernel logistic regression, comparing two kernels: RBF and a graph kernel using the unlabeled data. Right: Kernel conditional random fields, which take into account the sequential structure of the data. (Test error rate plotted against training set size.)

Figure 3. Left: The Gaussian mixture data (only a few data points are shown). Center: Kernel logistic regression with an RBF kernel. Right: Kernel CRF with the same kernel. (Test error rate plotted against the number of training sequences.)

In our experiments, we apply a linear transformation $L$ to the PSSM matrix elements, with $L(x) = 0$ for $x \le -5$, $L(x) = 1/2 + x/10$ for $-5 < x < 5$, and $L(x) = 1$ for $x \ge 5$. This is the same transform used by Kim and Park (2003), which achieved one of the best results in the recent CASP (Critical Assessment of Structure Prediction) competition. The window size is set to 13 by cross-validation, and each position in the window contributes 21 features (the 20 amino acids plus gap).

Clique selection. We use an RBF kernel with bandwidth $\sigma = 0.1$ chosen by cross-validation. Figure 5 (left) shows the kernel CRF risk reduction as clique selection proceeds, when only vertex clique candidates are allowed (note that there are always position-independent edge parameters in the KCRF models, to prevent the models from degrading into kernel logistic regression), and when both vertex and edge cliques are allowed. (The kernel between vertex cliques is $K(z, z') = K(x, x')\,\delta(y_1, y'_1)$, and between edge cliques it is $K(z, z') = K(x, x')\,\delta(y_1, y'_1)\,\delta(y_2, y'_2)$.) The total number of clique candidates is about 4800 for vertex-only selection, and larger when edge cliques are included. The rapid reduction in risk indicates that sparse training of kernel CRFs is successful. Also, when more flexibility is allowed by including edge cliques, the risk reduction is much faster. The more flexible model also has higher test set log likelihood (center), although this does not improve the test set accuracy much (right). These observations are generally true for other trials too.

Per-residue accuracy. To evaluate prediction performance, we use the overall per-residue accuracy (also known as Q3). We experiment with training set sizes of 5 and 10 sequences respectively. For each size we perform 10 trials in which the training sequences are randomly sampled and the remaining proteins are used as the test set. For the kernel CRF we select 300 cliques, again from either vertex candidates alone or vertex and edge candidates. We compare against the SVM-light package (Joachims, 1998) for the SVM classifier. All methods use the same RBF kernel. See Table 1: KCRFs and SVMs have comparable performance.

Transition accuracy. Further information can be obtained by studying transition boundaries, for example, the transition from coil to sheet. From the point of view of structural biology, these transition boundaries may provide important information about how proteins fold in three dimensions. On the other hand, these are the positions where most secondary structure prediction systems fail. A transition boundary is defined as a pair of adjacent positions (i, i+1) whose true labels differ; it is classified correctly only if both labels are correct. This is a very hard problem, as can be seen in Table 2, and KCRFs are able to achieve a considerable improvement over SVMs.
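A minimal sketch of the PSSM preprocessing described above follows; note that the exact form of the clipping transform $L(x)$ is reconstructed from the text (the 1/2 offset is inferred so that the transform is continuous), and the zero-padding at sequence ends is an assumption.

```python
# Sketch (reconstruction, not the authors' code) of the PSSM feature construction:
# clip each profile score x with L(x) = 0 for x <= -5, 1/2 + x/10 for -5 < x < 5,
# and 1 for x >= 5, then take a window of width 13 centered at each position.

import numpy as np

def clip_transform(x):
    """The linear clipping transform L applied elementwise."""
    return np.clip(0.5 + x / 10.0, 0.0, 1.0)

def window_features(pssm, window=13):
    """pssm: array of shape (sequence_length, 21) of PSI-BLAST profile scores.
    Returns one feature vector of length window * 21 per position,
    with positions beyond the sequence ends padded with zeros."""
    n, d = pssm.shape
    half = window // 2
    padded = np.vstack([np.zeros((half, d)), clip_transform(pssm), np.zeros((half, d))])
    return np.stack([padded[i:i + window].ravel() for i in range(n)])

# usage: features = window_features(np.random.randn(80, 21) * 5)
```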

Figure 4. Clique selection on the Gaussian mixture data. Left: regularized risk; Center: test set log likelihood; Right: test set accuracy (each plotted against the number of selected vertices).

Figure 5. Clique selection for KCRFs on the protein data. Left: regularized risk; Center: test set log likelihood; Right: test set accuracy (each plotted against the number of selected cliques). The two curves represent the cases where only vertex cliques are selected (dashed) vs. both vertex and edge cliques are selected (solid).

Semi-supervised learning. We start with an unweighted 10-nearest neighbor graph over positions in both training and test sequences, with the metric being Euclidean distance in the feature space. Then the eigensystem of the normalized Laplacian is computed. The semi-supervised graph kernel is obtained with the function $r(\lambda_i) = 1/(\lambda_i + 0.01)$ applied to the first 200 eigenvalues; the remaining eigenvalues are set to zero. We use the graph kernel together with the RBF kernel in the KCRF. As each clique candidate is associated with a kernel, we now select the two best candidates per iteration, one with the graph kernel and the other with the RBF kernel. We still run for 300 iterations in all trials. We also report results using transductive SVMs (TSVMs) (Joachims, 1999) with the RBF kernel. From the results in Table 3, we can see that the semi-supervised graph kernel is significantly better than TSVMs on the 5-protein dataset, while achieving no improvement on the other. To diagnose the cause, we examined the graph together with all the test labels. We find that the labels are not smooth with respect to the graph: on average only 54.5% of a node's neighbors have the same label as that node. Detecting faulty graphs without using a large number of labels, and constructing better graphs, remain future research. The approximate average running time of each trial, including both training and testing, is 30 minutes for KCRFs, 7 minutes for SVMs, and 16 hours for TSVMs. For KCRFs the majority of the time is spent on clique selection.

6. Conclusion

Kernel conditional random fields have been introduced as a framework for approaching graph-structured classification problems. A representer theorem was derived which shows how KCRFs can be motivated by regularization theory. The resulting techniques combine the strengths of hidden Markov models, or more general Bayesian networks, kernel machines, and standard discriminative linear classifiers including logistic regression and SVMs. The formalism presented is quite general, and should apply naturally to a wide range of problems. Our experimental results on synthetic data, while carefully controlled to be simple, clearly indicate how sequence modeling, graph kernels for semi-supervised learning, and clique selection for sparse representations work together within this framework. The success of these methods in real problems will depend on the choice of suitable kernels that capture the structure of the data.

For protein secondary structure prediction, our results are only suggestive. Secondary structure prediction is a problem that has been extensively studied for more than 20 years; yet the task remains difficult, with prediction accuracies remaining low. The major bottleneck lies in beta-sheet prediction, where there are long range interactions between regions of the protein chain that are not necessarily consecutive in the primary sequence.
Our experimental results indicate that KCRFs and semi-supervised kernels have the potential to lead to progress on this problem, where the state of the art has been based on heuristic sliding window methods. However, our results also suggest that the improvement due to semi-supervised learning is hindered by the lack of a good similarity measure with which to construct the graph. The construction of an effective graph is a challenge that may best be tackled by biologists and machine learning researchers working together.

Table 1. Per-residue accuracy (with standard deviation) of different methods for secondary structure prediction with the RBF kernel, on the 5-protein and 10-protein training sets. KCRF (v) uses vertex cliques only; KCRF (v+e) uses vertex and edge cliques; SVM is the baseline.

Table 2. Transition accuracy (with standard deviation) of KCRF (v), KCRF (v+e), and SVM on the 5-protein and 10-protein training sets.

Table 3. Per-residue accuracy (with standard deviation) of the semi-supervised methods, KCRF (v), KCRF (v+e), and transductive SVM, on the 5-protein and 10-protein training sets.

Acknowledgments

This work was supported in part by NSF ITR grants CCR, IIS, and IIS.

References

Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. ICML 2003.

Belkin, M., & Niyogi, P. (2002). Semi-supervised learning on manifolds (Technical Report). University of Chicago.

Chapelle, O., Weston, J., & Schoelkopf, B. (2002). Cluster kernels for semi-supervised learning. NIPS 2002.

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. Proceedings of EMNLP 2002.

Cuff, J., & Barton, G. (1999). Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 34.

Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE PAMI, 19.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. ECML 1998.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. ICML 1999.

Jones, D. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292.

Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22.

Kim, H., & Park, H. (2003). Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng., 16.

Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33.

Kumar, S., & Hebert, M. (2003). Discriminative fields for modeling spatial dependencies in natural images. NIPS 2003.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML 2001.

Lanckriet, G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. (2004). Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5.

McCallum, A. (2003). Efficiently inducing features of conditional random fields. UAI 2003.

Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003). Table extraction using conditional random fields. SIGIR 2003.

Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. Proceedings of HLT-NAACL.

Smola, A., & Kondor, R. (2003). Kernels and regularization on graphs. COLT 2003.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. NIPS 2003.
Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. NIPS 2001.

Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003.


More information

Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher Raj Dabre 11305R001

Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher Raj Dabre 11305R001 Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher - 113059006 Raj Dabre 11305R001 Purpose of the Seminar To emphasize on the need for Shallow Parsing. To impart basic information about techniques

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Support Vector Machines Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Support Vector Machines: introduction 2 Support Vector Machines (SVMs) SVMs

More information

Machine Learning. Chao Lan

Machine Learning. Chao Lan Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian

More information

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron CS 188: Artificial Intelligence Spring 2010 Lecture 24: Perceptrons and More! 4/20/2010 Announcements W7 due Thursday [that s your last written for the semester!] Project 5 out Thursday Contest running

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

Structured Max Margin Learning on Image Annotation and Multimodal Image Retrieval

Structured Max Margin Learning on Image Annotation and Multimodal Image Retrieval Structured Max Margin Learning on Image Annotation and Multimodal Image Retrieval Zhen Guo Zhongfei (Mark) Zhang Eric P. Xing Christos Faloutsos Computer Science Department, SUNY at Binghamton, Binghamton,

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Conditional Random Fields for Object Recognition

Conditional Random Fields for Object Recognition Conditional Random Fields for Object Recognition Ariadna Quattoni Michael Collins Trevor Darrell MIT Computer Science and Artificial Intelligence Laboratory Cambridge, MA 02139 {ariadna, mcollins, trevor}@csail.mit.edu

More information

Content-based image and video analysis. Machine learning

Content-based image and video analysis. Machine learning Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Classification: Linear Discriminant Functions

Classification: Linear Discriminant Functions Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions

More information

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Jian Guo, Debadyuti Roy, Jing Wang University of Michigan, Department of Statistics Introduction In this report we propose robust

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

Modifying Kernels Using Label Information Improves SVM Classification Performance

Modifying Kernels Using Label Information Improves SVM Classification Performance Modifying Kernels Using Label Information Improves SVM Classification Performance Renqiang Min and Anthony Bonner Department of Computer Science University of Toronto Toronto, ON M5S3G4, Canada minrq@cs.toronto.edu

More information

1 Training/Validation/Testing

1 Training/Validation/Testing CPSC 340 Final (Fall 2015) Name: Student Number: Please enter your information above, turn off cellphones, space yourselves out throughout the room, and wait until the official start of the exam to begin.

More information

SVMs for Structured Output. Andrea Vedaldi

SVMs for Structured Output. Andrea Vedaldi SVMs for Structured Output Andrea Vedaldi SVM struct Tsochantaridis Hofmann Joachims Altun 04 Extending SVMs 3 Extending SVMs SVM = parametric function arbitrary input binary output 3 Extending SVMs SVM

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

A Geometric Perspective on Data Analysis

A Geometric Perspective on Data Analysis A Geometric Perspective on Data Analysis Partha Niyogi The University of Chicago Collaborators: M. Belkin, X. He, A. Jansen, H. Narayanan, V. Sindhwani, S. Smale, S. Weinberger A Geometric Perspectiveon

More information

Adapting SVM Classifiers to Data with Shifted Distributions

Adapting SVM Classifiers to Data with Shifted Distributions Adapting SVM Classifiers to Data with Shifted Distributions Jun Yang School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 juny@cs.cmu.edu Rong Yan IBM T.J.Watson Research Center 9 Skyline

More information