Kernels and representation

Machine Learning course (Corso di AA), academic year 2017/18, Padova
Fabio Aiolli
20 December 2017

Hierarchical Representation Learning

Recent research has addressed the representation problem by injecting some reasonable priors (regularization), including:
- Smoothness
- Multiple explanatory factors
- Shared factors across tasks
- Manifolds
- Sparsity
- Hierarchy of explanatory factors: a hierarchical structure of factors, with more abstract concepts/features higher in the hierarchy

(Hierarchical) Representation Learning: Deep Neural Networks

Deep architectures:
- generate models with several levels of abstraction, discovering more and more complicated structures;
- are the current state of the art in many different tasks;
- for instance, Deep Neural Networks (DNNs).

Some drawbacks:
- there is no clear decoupling between the representation and the model generation;
- they have a high training time (tens of layers are difficult to handle);
- they can converge to sub-optimal solutions because of local minima and vanishing-gradient issues.

Multiple Kernel Learning: Kernels and MKL

We consider the (implicit) representation given by kernels. A kernel can be seen as a scalar product in some Hilbert space, i.e. k(x, z) = φ(x)·φ(z), where φ(x) is the representation of the example x.

Multiple Kernel Learning (MKL): given a family of kernels k_r(x, z) such that k_r(x, z) = φ_r(x)·φ_r(z), r = 1, ..., R, MKL optimizes the coefficients of a weighted sum of kernels,

    k(x, z) = Σ_{r=1}^{R} a_r k_r(x, z),   a_r ≥ 0,

hence computing a new implicit representation for x given by

    φ(x) = [√a_1 φ_1(x), ..., √a_R φ_R(x)]^T.
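
A minimal numpy check of this identity, with two hand-made finite-dimensional feature maps standing in for φ_1 and φ_2 (the data and maps are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 toy examples in R^3

def phi1(X):
    return X                                               # linear features

def phi2(X):
    return np.hstack([X ** 2, X[:, :1] * X[:, 1:2]])       # a few quadratic features

k1 = phi1(X) @ phi1(X).T             # base kernel matrices
k2 = phi2(X) @ phi2(X).T

a = np.array([0.3, 1.7])             # non-negative MKL coefficients
K = a[0] * k1 + a[1] * k2            # weighted sum of kernels

# Concatenated representation scaled by sqrt(a_r) yields exactly the same kernel
Phi = np.hstack([np.sqrt(a[0]) * phi1(X), np.sqrt(a[1]) * phi2(X)])
assert np.allclose(K, Phi @ Phi.T)
```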

Expressiveness of kernel representations

The expressiveness of a kernel function, that is, the number of dichotomies that can be realized by a linear separator in its feature space, is captured by the rank of the kernel matrices it produces.

Theorem. Let K ∈ R^{L×L} be a kernel matrix over a set of L examples, and let rank(K) be its rank. Then there exists at least one subset of examples of size rank(K) that can be shattered by a linear function.
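
A small numerical illustration of the theorem (not from the slides): build a rank-deficient kernel matrix, recover an empirical feature map, select rank(K) examples with linearly independent feature vectors, and realize an arbitrary dichotomy exactly with a linear function.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
K = (X @ X.T) ** 2                     # degree-2 homogeneous polynomial kernel, rank <= 6

# Empirical feature map from the eigendecomposition: K = Phi @ Phi.T
w, V = np.linalg.eigh(K)
keep = w > 1e-9
Phi = V[:, keep] * np.sqrt(w[keep])    # shape L x rank(K)
r = Phi.shape[1]
print("rank(K) =", r)

# Select r examples with linearly independent feature vectors (pivoted QR)
_, _, piv = qr(Phi.T, pivoting=True)
S = piv[:r]

# Any dichotomy on these r examples is realized exactly by some linear function
y = rng.choice([-1.0, 1.0], size=r)
wvec = np.linalg.solve(Phi[S], y)      # Phi[S] is r x r and invertible
assert np.allclose(np.sign(Phi[S] @ wvec), y)
```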

Expressiveness of kernel representations: Spectral Ratio

The spectral ratio (SR) of a positive semi-definite matrix K is defined as the ratio between the 1-norm and the 2-norm of its eigenvalues, or equivalently

    C(K) = ||K||_T / ||K||_F,    (1)

where ||K||_T = Σ_i K_ii (trace norm) and ||K||_F = √(Σ_{i,j} K_ij²) (Frobenius norm).

Note that, compared to the rank of a matrix, it does not require the decomposition of the matrix.
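
A one-line implementation in numpy, shown as a minimal sketch (the toy RBF matrix is only for illustration):

```python
import numpy as np

def spectral_ratio(K: np.ndarray) -> float:
    """C(K) = trace(K) / ||K||_F for a PSD kernel matrix K."""
    return np.trace(K) / np.linalg.norm(K, "fro")

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
K = np.exp(-0.5 * sq)                                 # toy RBF kernel matrix
print(spectral_ratio(K), np.linalg.matrix_rank(K))    # note: 1 <= C(K)^2 <= rank(K)
```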

Expressiveness of kernel representations: Spectral ratio properties

The (squared) spectral ratio can be seen as an (efficient) strict approximation of the rank of a matrix:

    1 ≤ C(K)² ≤ rank(K).

The spectral ratio C(K) has the following additional nice properties:
- the identity matrix has the maximal spectral ratio, C(I_L) = √L (all 2^L possible dichotomies);
- the constant matrix K = 1_L 1_L^T has the minimal spectral ratio, C(1_L 1_L^T) = 1 (only 2 dichotomies);
- it is invariant to multiplication by a positive scalar, C(αK) = C(K) for all α > 0.
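
Quick numerical checks of the three properties (the spectral_ratio helper is repeated to keep the snippet self-contained):

```python
import numpy as np

def spectral_ratio(K):
    return np.trace(K) / np.linalg.norm(K, "fro")

L = 8
I = np.eye(L)
ones = np.ones((L, L))
print(np.isclose(spectral_ratio(I), np.sqrt(L)))                # maximal value: sqrt(L)
print(np.isclose(spectral_ratio(ones), 1.0))                    # minimal value: 1
print(np.isclose(spectral_ratio(3.7 * I), spectral_ratio(I)))   # scale invariance
```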

Expressiveness of kernel representations: Hierarchical Structure of Kernel Representations

Definition. Given two kernel functions k_i and k_j, we say that k_i is more general than k_j (written k_i ≥_G k_j) whenever, for any possible dataset X, we have C(K_X^(i)) ≤ C(K_X^(j)), where K_X^(i) denotes the kernel matrix evaluated on the data X using the kernel function k_i.

Expressiveness of kernel representations: The Proposed Framework

Learning over a hierarchy of feature spaces. Given a hierarchical set of features F, the algorithm is:
1. Consider a partition P = {F_0, ..., F_R} of the features and construct the kernels associated with those sets of features, in such a way as to obtain a set of kernels of increasing expressiveness, that is k_0 ≥_G k_1 ≥_G ... ≥_G k_R;
2. Apply an MKL algorithm on the kernels {k_0, ..., k_R} to learn the coefficients η ∈ R_+^{R+1}, and define k_MKL(x, z) = Σ_{s=0}^{R} η_s k_s(x, z).

A sketch of these two steps is given below.
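
A minimal end-to-end sketch of the two steps, assuming homogeneous polynomial kernels of increasing degree as the hierarchy of feature spaces and a simple kernel-target-alignment heuristic in place of the MKL solvers listed on the next slide; the data, labels and heuristic are illustrative only.

```python
import numpy as np

def hpk(X, Z, s):
    """Homogeneous polynomial kernel of degree s (degree 0 gives the constant kernel)."""
    return (X @ Z.T) ** s

def alignment(K, y):
    """Centered kernel-target alignment, a crude stand-in for learned MKL weights."""
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    denom = np.linalg.norm(Kc) * n          # ||y y^T||_F = n for y in {-1,+1}^n
    if denom < 1e-12:
        return 0.0
    return max(np.sum(Kc * np.outer(y, y)) / denom, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalize so the HPKs are bounded
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=40))

# Step 1: base kernels of increasing expressiveness (k_0 >=_G k_1 >=_G ...)
D = 5
kernels = [hpk(X, X, s) for s in range(D + 1)]

# Step 2: learn non-negative weights (here: the alignment heuristic) and combine
eta = np.array([alignment(K, y) for K in kernels])
eta /= eta.sum()
K_mkl = sum(e * K for e, K in zip(eta, kernels))
print(np.round(eta, 3))
```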

Expressiveness of kernel representations: The Proposed Framework

Several MKL methods can be used to learn the weight vector η, such as:

Based on margin optimization:
- SPG-GMKL (Jain et al., 2012): very efficient and scalable to many base kernels;
- EasyMKL (Aiolli et al., 2015): very efficient and very scalable to many base kernels (based on KOMD);
- ...

Based on radius-margin ratio optimization:
- R-MKL (Do et al., 2009): optimizes an upper bound of the radius-margin ratio;
- RM-GD (Lauriola et al., 2017): able to optimize the exact ratio between the radius and the margin;
- ...

Expressiveness of kernel representations: EasyMKL

EasyMKL (Aiolli and Donini, 2015) is able to combine sets of weak kernels by solving a simple quadratic optimization problem, with:
- empirical effectiveness;
- high scalability with respect to the number of kernels, i.e. constant in memory and linear in time.

Main idea: EasyMKL finds the coefficients of the MKL combination that maximize the distance (in feature space) between the convex hulls of the positive and negative examples (the margin).

The effectiveness strongly depends on the pre-defined weak kernels.
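
For experimentation, EasyMKL is distributed in the authors' MKLpy Python package. The snippet below is only a usage sketch: the import path, the lam parameter and the fit/predict signatures on lists of Gram matrices are assumptions about the installed version and may differ, so treat it as illustrative rather than authoritative.

```python
# pip install MKLpy   (assumed package name)
import numpy as np
from MKLpy.algorithms import EasyMKL   # assumed import path

rng = np.random.default_rng(0)
Xtr, Xte = rng.normal(size=(60, 8)), rng.normal(size=(20, 8))
ytr = np.sign(Xtr[:, 0] * Xtr[:, 1])

# Weak kernels: homogeneous polynomial kernels of increasing degree
KLtr = [(Xtr @ Xtr.T) ** s for s in range(1, 5)]   # list of training Gram matrices
KLte = [(Xte @ Xtr.T) ** s for s in range(1, 5)]   # test-vs-train kernel matrices

mkl = EasyMKL(lam=0.1)        # lam trades off margin and regularization (assumed name)
mkl = mkl.fit(KLtr, ytr)
y_pred = mkl.predict(KLte)
```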

Expressiveness of kernel representations: RM-GD

RM-GD (Lauriola, Polato and Aiolli, 2017) is an MKL algorithm able to combine kernels in a two-step optimization process. Advantages with respect to other similar approaches include:
- high scalability with respect to the number of kernels (linear in both memory and time);
- sparsity of the combination weights;
- optimization of the exact radius-margin ratio.

Main idea: it exploits a gradient descent procedure where, at each step k:
- it evaluates the current combined kernel using the current combination weights η^(k);
- it computes the gradient direction g^(k) and updates the combination weights: η^(k+1) ← η^(k) - λ g^(k).

Dot-Product Kernels: The case of Dot-Product Kernels

Theorem. A function f : R → R defines a positive definite kernel k : B(0,1) × B(0,1) → R, with k : (x, z) ↦ f(x·z), if and only if f is an analytic function admitting a Maclaurin expansion with non-negative coefficients, f(x) = Σ_{s≥0} a_s x^s, a_s ≥ 0.

  kernel               definition                        s-th DPP coefficient a_s
  Polynomial           (x·z + c)^D                       binom(D, s) c^(D-s)   (for s ≤ D)
  RBF                  exp(-γ ||x-z||²)                  e^(-2γ) (2γ)^s / s!
  Rational Quadratic   1 - ||x-z||² / (||x-z||² + c)     c 2^s / (2 + c)^(s+1)
  Cauchy               (1 + ||x-z||² / γ)^(-1)           γ 2^s / (2 + γ)^(s+1)

Table: DPP coefficients for some well-known dot-product kernels (for the non-polynomial kernels, unit-norm inputs are assumed, so that ||x-z||² = 2 - 2 x·z).
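
A short sanity check of the RBF row of the table (not from the slides): on unit-norm inputs, the truncated series Σ_s a_s (x·z)^s with a_s = e^(-2γ)(2γ)^s/s! matches exp(-γ||x-z||²).

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
x, z = x / np.linalg.norm(x), z / np.linalg.norm(z)   # unit-norm inputs

gamma, t = 0.7, float(x @ z)
rbf = np.exp(-gamma * np.sum((x - z) ** 2))

# Truncated dot-product polynomial with the coefficients from the table
a = [np.exp(-2 * gamma) * (2 * gamma) ** s / factorial(s) for s in range(25)]
dpp = sum(a_s * t ** s for s, a_s in enumerate(a))

print(rbf, dpp)              # the two values agree to high precision
assert np.isclose(rbf, dpp)
```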

Dot-Product Kernels: Famous examples (RBF)

The DPP weights of the RBF kernel correspond to a parametric Gaussian-shaped function of the degree s.

Figure: DPP weights of the RBF kernel.

Dot-Product Kernels: Homogeneous Polynomial Kernels

Figure: dependencies among monomial features (arrows represent dependencies among features):
  degree 1: x1, x4, x5, x9
  degree 2: x1x4, x1x5, x1x9, x4x5, x4x9, x5x9
  degree 3: x1x4x5, x1x4x9, x1x5x9, x4x5x9
  degree 4: x1x4x5x9

Exploiting these dependencies, we were able to demonstrate that (when normalized) k_s ≥_G k_{s+1} for any s = 0, ..., D-1.
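
An empirical illustration of this ordering (not from the slides): on unit-norm data, the spectral ratio of the (normalized) HPK matrix does not decrease with the degree.

```python
import numpy as np

def spectral_ratio(K):
    return np.trace(K) / np.linalg.norm(K, "fro")

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm rows, so k_s(x, x) = 1

ratios = []
for s in range(1, 7):
    K = (X @ X.T) ** s                 # homogeneous polynomial kernel of degree s
    ratios.append(spectral_ratio(K))

print(np.round(ratios, 3))             # non-decreasing: lower degree is more general
```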

Dot-Product Kernels: Non-parametric learning of DPKs

The goal is to learn the coefficients a_s directly from data, thus generating a new dot-product kernel

    k(x, z) = Σ_{s=0}^{D} a_s (x·z)^s,    (2)

for some fixed D > 0. This problem can easily be formulated as an MKL problem where the weak kernels are the HPKs defined as

    k_s(x, z) = (x·z)^s,   s = 0, ..., D.    (3)
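
Once the coefficients are learned, the combination is itself a single dot-product kernel that can be evaluated on new points. A small sketch with made-up coefficients a_s (hypothetical values, standing in for the output of an MKL solver):

```python
import numpy as np

# Hypothetical non-negative coefficients a_0, ..., a_D returned by an MKL solver (D = 4)
a = np.array([0.05, 0.30, 0.45, 0.15, 0.05])

def learned_dpk(X, Z, a):
    """Evaluate the learned dot-product kernel k(x, z) = sum_s a_s (x.z)^s."""
    T = X @ Z.T
    return sum(a_s * T ** s for s, a_s in enumerate(a))

rng = np.random.default_rng(0)
Xtr, Xte = rng.normal(size=(30, 5)), rng.normal(size=(10, 5))
K_train = learned_dpk(Xtr, Xtr, a)    # Gram matrix for training a kernel machine
K_test = learned_dpk(Xte, Xtr, a)     # test-vs-train kernel for prediction
```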

Dot-Product Kernels: Dot-Product Kernels of Boolean vectors

For Boolean vectors, similar expressiveness results can be given by using the conjunctive kernels (C-kernels) in place of the homogeneous polynomial kernels.

Given x, z ∈ {0,1}^n, any HP-kernel can be decomposed as a finite non-negative linear combination of C-kernels:

    κ_HP^d(x, z) = Σ_{s=0}^{d} h(s, d) κ_s(x, z),   h(s, d) ≥ 0.

Given x, z ∈ {0,1}^n such that ||x||_1 = ||z||_1 = m, any DPK can be decomposed as a finite non-negative linear combination of normalized C-kernels:

    κ(x, z) = f(⟨x, z⟩) = Σ_{s=0}^{m} g(m, s) κ̃_s(x, z),   g(m, s) ≥ 0.
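
A numerical check of the first decomposition (not from the slides), assuming the usual monotone conjunctive kernel κ_s(x, z) = binom(⟨x, z⟩, s); one explicit choice of non-negative coefficients is h(s, d) = S(d, s)·s!, with S(d, s) the Stirling numbers of the second kind.

```python
import numpy as np
from math import comb, factorial

def stirling2(d, s):
    """Stirling numbers of the second kind S(d, s), via the standard recurrence."""
    table = [[0] * (s + 1) for _ in range(d + 1)]
    table[0][0] = 1
    for i in range(1, d + 1):
        for j in range(1, min(i, s) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[d][s]

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20)
z = rng.integers(0, 2, size=20)
t = int(x @ z)                       # <x, z> for Boolean vectors

d = 4
hpk = t ** d                                              # HP-kernel value of degree d
ck_sum = sum(stirling2(d, s) * factorial(s) * comb(t, s)  # non-negative combination of
             for s in range(d + 1))                       # C-kernel values binom(<x,z>, s)
print(hpk, ck_sum)
assert hpk == ck_sum
```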

Dot-Product Kernels: Empirical results

Average accuracy (first row of each dataset) and ratio (second row) on binary datasets, using different MKL algorithms:

  dataset                     average        EasyMKL        RM-GD
  audiology (92, 84, c)       99.99 ±0.04    99.99 ±0.04    100.00 ±0.00
                               6.08 ±0.33     5.99 ±0.32      5.38 ±0.25
  primary-tumor (132, 24, c)  72.55 ±4.37    72.69 ±4.30     74.58 ±4.58
                              15.87 ±1.30    15.05 ±0.87     14.31 ±0.72
  house-votes (232, 16, b)    99.11 ±0.41    99.10 ±0.42     99.20 ±0.41
                               8.90 ±1.13     8.90 ±1.17      8.49 ±1.13
  spect (267, 23, b)          82.01 ±3.14    82.06 ±3.02     83.39 ±3.10
                              18.91 ±1.32    18.56 ±1.15     17.53 ±1.08
  tic-tac-toe (958, 27, c)    98.82 ±0.46    99.04 ±0.39     99.74 ±0.20
                              73.39 ±1.57    70.93 ±1.45     60.75 ±1.49

Dot-Product Kernels: Sparsity

Figure: combination weights learned when using 10 conjunctive kernels.