Kernel spectral clustering: model representations, sparsity and out-of-sample extensions


1 Kernel spectral clustering: model representations, sparsity and out-of-sample extensions Johan Suykens and Carlos Alzate K.U. Leuven, ESAT-SCD/SISTA Kasteelpark Arenberg B-3 Leuven (Heverlee), Belgium 4th Int. Conf. on Computational Harmonic Analysis May, Hong Kong ICCHA4, Hong Kong

2 Overview: Introduction; Kernel PCA: primal and dual model representations; Spectral clustering; Kernel spectral clustering; Model selection; Sparsity; Incorporating prior knowledge

3 Introduction. Application studies related to spectral clustering: complex networks: clustering time series in power grid networks, clustering of scientific journals; clustering time series in the CoRoT exoplanet database; prognostics tools for predicting maintenance of machines; image segmentation [Alzate et al., 2007, 2009; Varon et al.]


4 Power grid networks [Figure: normalized load versus hour, winter and summer profiles] Network: 45 substations in Belgian grid. Data: hourly load, seasonal/weekly/intra-daily patterns. Aim: short-term load forecasting, important for power generation decisions. Clustering: identifying customer profiles from time series (over 5 years) [Espinoza et al., IEEE CSM 2007; Alzate et al., 2009]

5 Kernel-based models and learning theory. Classification and regression: the use of kernel-based models is well established; aim: good predictive model (achieving good generalization); model selection (kernel parameters, regularization constants): cross-validation (CV). Clustering: underlying kernel-based models? out-of-sample extensions? learning and generalization? training, validation, test data? tuning parameters: how to determine?



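8 Kernel principal component analysis [Figure: linear PCA versus kernel PCA (RBF kernel) on data in the $(x^{(1)}, x^{(2)})$ plane] Kernel PCA [Schölkopf et al., 1998]: eigenvalue decomposition of the kernel matrix

$$\begin{bmatrix} K(x_1,x_1) & \cdots & K(x_1,x_N) \\ \vdots & \ddots & \vdots \\ K(x_N,x_1) & \cdots & K(x_N,x_N) \end{bmatrix}$$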
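The eigenvalue decomposition on this slide can be sketched numerically. The following is a minimal illustration, not the authors' code: the function names `rbf_kernel` and `kernel_pca`, the bandwidth parameterization $\exp(-\|x-z\|^2/(2\sigma^2))$, and the random data are all assumptions made for the example. The kernel matrix is centered in feature space before decomposition, as in standard kernel PCA.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """Gaussian RBF kernel matrix, K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2)).
    The bandwidth parameterization is an assumption of this sketch."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_pca(X, sigma, n_components):
    """Return the leading eigenvalues and eigenvectors of the centered kernel matrix."""
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    C = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    Kc = C @ K @ C                           # centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)    # symmetric solver, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]

# Hypothetical toy data, only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
lam, alpha = kernel_pca(X, sigma=1.0, n_components=3)
```

Because the constant vector lies in the null space of the centered matrix, each returned eigenvector sums to zero.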

9 Kernel PCA: primal and dual problem. Underlying primal problem [Suykens et al., 2003]:

$$\min_{w,b,e}\; -\frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{i=1}^{N} e_i^2 \quad \text{s.t.}\quad e_i = w^T \varphi(x_i) + b,\quad i = 1,\dots,N.$$

The (Lagrange) dual problem is kernel PCA: $\Omega_c \alpha = \lambda \alpha$ with $\lambda = 1/\gamma$, where $\Omega_{c,ij} = (\varphi(x_i) - \hat{\mu}_\varphi)^T (\varphi(x_j) - \hat{\mu}_\varphi)$ is the centered kernel matrix. Interpretation: (1) pool of candidate components (objective function equals zero); (2) select relevant components.

10 Kernel PCA: model representations. Primal and dual representations of the model M: (P): $\hat{e}_* = \sum_j w_j \varphi_j(x_*) + b = w^T \varphi(x_*) + b$; (D): $\hat{e}_* = \sum_i \alpha_i K(x_*, x_i) + b$. Both can be evaluated at any point $x_* \in \mathbb{R}^d$, where $K(x_*, x_i) = \varphi(x_*)^T \varphi(x_i)$ with $K(\cdot,\cdot)$ a positive definite kernel and feature map $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{n_h}$.

11 Sparse and robust versions. Iteratively weighted $L_2$ loss (to reduce the influence of outliers):

$$\min_{w,b,e}\; -\frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{i=1}^{N} v_i e_i^2 \quad \text{s.t.}\quad e_i = w^T \varphi(x_i) + b,\quad i = 1,\dots,N.$$

Other loss functions, e.g. Huber loss for robustness, $\epsilon$-insensitive loss for sparsity:

$$\min_{w,b,e}\; -\frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} L(e_i) \quad \text{s.t.}\quad e_i = w^T \varphi(x_i) + b,\quad i = 1,\dots,N.$$

[Alzate & Suykens, IEEE-TNN, 2008]

12 Robustness: Kernel Component Analysis [Figure: original image, corrupted image, KPCA reconstruction] Weighted LS-SVM: robustness [Alzate & Suykens, IEEE-TNN 2008]

13 Robustness: Kernel Component Analysis [Figure: original image, corrupted image, KPCA reconstruction, KCA reconstruction] Weighted LS-SVM: robustness and sparsity [Alzate & Suykens, IEEE-TNN 2008]


15 Spectral graph clustering. Minimal cut: given the graph $G = (V, E)$, find clusters $A_1, A_2$:

$$\min_{q_i \in \{-1,+1\}}\; \frac{1}{2} \sum_{i,j} w_{ij} (q_i - q_j)^2$$

with cluster membership indicator $q_i$ ($q_i = 1$ if $i \in A_1$, $q_i = -1$ if $i \in A_2$) and $W = [w_{ij}]$ the weighted adjacency matrix. [Figure: example graph with two cuts of different size, one of them the minimal cut]

16 Spectral graph clustering. Min-cut spectral clustering problem:

$$\min_{\tilde{q}^T \tilde{q} = 1}\; \tilde{q}^T L \tilde{q}$$

with $L = D - W$ the unnormalized graph Laplacian and degree matrix $D = \mathrm{diag}(d_1,\dots,d_N)$, $d_i = \sum_j w_{ij}$, giving $L \tilde{q} = \lambda \tilde{q}$. Cluster member indicators: $\hat{q}_i = \mathrm{sign}(\tilde{q}_i - \theta)$ with threshold $\theta$. Normalized cut: $L \tilde{q} = \lambda D \tilde{q}$ [Fiedler, 1973; Shi & Malik; Ng et al.; Chung, 1997; von Luxburg, 2007]. Discrete version to continuous problem (Laplace operator) [Belkin & Niyogi, 2003; von Luxburg et al., 2008; Smale & Zhou, 2007]
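A small numerical sketch of the relaxed min-cut problem may help here. It builds $L = D - W$ for a hypothetical six-node graph (two triangles joined by one weak edge, chosen only for illustration), takes the eigenvector of the second-smallest eigenvalue (the constant eigenvector with $\lambda = 0$ is the trivial solution and is skipped), and thresholds it. The function name `fiedler_bipartition` is an assumption of this example.

```python
import numpy as np

def fiedler_bipartition(W, theta=0.0):
    """Two-way spectral partition from the unnormalized Laplacian L = D - W."""
    d = W.sum(axis=1)                      # degrees d_i = sum_j w_ij
    L = np.diag(d) - W                     # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalues
    q = eigvecs[:, 1]                      # skip the trivial constant eigenvector
    labels = np.where(q - theta >= 0, 1, -1)
    return labels, eigvals

# Hypothetical graph: two triangles connected by a single weak edge (2, 3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

labels, eigvals = fiedler_bipartition(W)
```

The overall sign of an eigenvector is arbitrary, so only the grouping of nodes is meaningful, not which side receives the label $+1$.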

17 Spectral clustering + K-means


19 Kernel spectral clustering: case of two clusters. Underlying model (primal representation): $\hat{e}_* = w^T \varphi(x_*) + b$ with $\hat{q}_* = \mathrm{sign}[\hat{e}_*]$ the estimated cluster indicator at any $x_* \in \mathbb{R}^d$. Primal problem (training on given data $\{x_i\}_{i=1}^{N}$):

$$\min_{w,b,e}\; -\frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{i=1}^{N} v_i e_i^2 \quad \text{s.t.}\quad e_i = w^T \varphi(x_i) + b,\quad i = 1,\dots,N$$

with positive weights $v_i$ (will be related to the inverse degree matrix). [Alzate & Suykens, IEEE-PAMI]

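20 Lagrangian and conditions for optimality. Lagrangian:

$$\mathcal{L}(w,b,e;\alpha) = -\frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{i=1}^{N} v_i e_i^2 - \sum_{i=1}^{N} \alpha_i \left(e_i - w^T \varphi(x_i) - b\right)$$

Conditions for optimality:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum_i \alpha_i \varphi(x_i), \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum_i \alpha_i = 0,$$
$$\frac{\partial \mathcal{L}}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma v_i e_i, \qquad \frac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \Rightarrow e_i = w^T \varphi(x_i) + b, \qquad i = 1,\dots,N.$$

Eliminate $w, b, e$ and write the solution in $\alpha$.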
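The elimination step can be written out explicitly. The derivation below is a sketch that starts only from the four optimality conditions stated on this slide; the matrix notation $V = \mathrm{diag}(v_1,\dots,v_N)$, $\Omega_{ij} = \varphi(x_i)^T\varphi(x_j)$ and the all-ones vector $1_N$ are introduced here for compactness.

```latex
% Substitute w = \sum_j \alpha_j \varphi(x_j) into e_i = w^T \varphi(x_i) + b:
e = \Omega \alpha + b\, 1_N .
% Use \alpha_i = \gamma v_i e_i, i.e. \alpha = \gamma V e:
\alpha = \gamma V \left( \Omega \alpha + b\, 1_N \right) .
% Impose \sum_i \alpha_i = 1_N^T \alpha = 0 to solve for the bias:
b = - \frac{1_N^T V \Omega \alpha}{1_N^T V 1_N} .
% Insert b and collect terms:
\alpha = \gamma V \left( I_N - \frac{1_N 1_N^T V}{1_N^T V 1_N} \right) \Omega \alpha
\;\;\Longrightarrow\;\;
V M_V \Omega \alpha = \lambda \alpha , \quad
\lambda = \frac{1}{\gamma} , \quad
M_V = I_N - \frac{1}{1_N^T V 1_N}\, 1_N 1_N^T V .
```

The result is an eigenvalue problem in $\alpha$ alone, with the bias recovered afterwards from the expression for $b$.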

21 Kernel-based model representation. Dual problem: $V M_V \Omega \alpha = \lambda \alpha$ with $\lambda = 1/\gamma$, where $M_V = I_N - \frac{1}{1_N^T V 1_N} 1_N 1_N^T V$ is the weighted centering matrix and $\Omega = [\Omega_{ij}]$ is the kernel matrix with $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$. Dual model representation: $\hat{e}_* = \sum_{i=1}^{N} \alpha_i K(x_i, x_*) + b$ with $K(x_i, x_*) = \varphi(x_i)^T \varphi(x_*)$.

22 Choice of weights $v_i$. Take $V = D^{-1}$ where $D = \mathrm{diag}\{d_1,\dots,d_N\}$ and $d_i = \sum_{j=1}^{N} \Omega_{ij}$. This gives the generalized eigenvalue problem $M_D \Omega \alpha = \lambda D \alpha$ with $M_D = I_N - \frac{1}{1_N^T D^{-1} 1_N} 1_N 1_N^T D^{-1}$. This is a modified version of random walks spectral clustering. Note that $\mathrm{sign}[e_i] = \mathrm{sign}[\alpha_i]$ on training data, but $\mathrm{sign}[\hat{e}_*]$ applies beyond training data.
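The two-cluster training problem with inverse-degree weights can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the names `rbf_kernel` and `ksc_train`, the Gaussian bandwidth convention, the toy data, and the use of a dense nonsymmetric eigensolver are all choices made for this example. The bias uses $b = -1_N^T D^{-1} \Omega \alpha / (1_N^T D^{-1} 1_N)$, which follows from the condition $\sum_i \alpha_i = 0$.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """Gaussian RBF kernel (bandwidth convention assumed for this sketch)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def ksc_train(X, sigma, n_vectors=1):
    """Solve D^{-1} M_D Omega alpha = lambda alpha and return alpha, b, e, lambda, d."""
    N = X.shape[0]
    Omega = rbf_kernel(X, X, sigma)
    d = Omega.sum(axis=1)                          # degrees d_i = sum_j Omega_ij
    Dinv = np.diag(1.0 / d)
    ones = np.ones((N, 1))
    s = (ones.T @ Dinv @ ones).item()              # scalar 1^T D^{-1} 1
    M_D = np.eye(N) - (ones @ ones.T @ Dinv) / s   # weighted centering matrix
    A = Dinv @ M_D @ Omega                         # nonsymmetric, real spectrum
    eigvals, eigvecs = np.linalg.eig(A)
    order = np.argsort(eigvals.real)[::-1][:n_vectors]
    lam = eigvals.real[order]
    alpha = eigvecs[:, order].real
    b = -(ones.T @ Dinv @ Omega @ alpha) / s       # bias terms, shape (1, n_vectors)
    e = Omega @ alpha + b                          # score variables on training data
    return alpha, b.ravel(), e, lam, d

# Hypothetical data: two well-separated Gaussian blobs of 20 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(20, 2)),
               rng.normal(loc=[3.0, 0.0], scale=0.5, size=(20, 2))])
alpha, b, e, lam, d = ksc_train(X, sigma=1.0)
labels = np.sign(e[:, 0])
```

On the training set the relation $e_i = \lambda d_i \alpha_i$ holds, which is why the signs of $e_i$ and $\alpha_i$ coincide.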

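23 Kernel spectral clustering: more clusters. Case of $k$ clusters: additional sets of constraints

$$\min_{w^{(l)}, e^{(l)}, b_l}\; -\frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} + \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l\, e^{(l)T} D^{-1} e^{(l)}$$

subject to

$$e^{(1)} = \Phi_{N \times n_h} w^{(1)} + b_1 1_N,\quad e^{(2)} = \Phi_{N \times n_h} w^{(2)} + b_2 1_N,\quad \dots,\quad e^{(k-1)} = \Phi_{N \times n_h} w^{(k-1)} + b_{k-1} 1_N$$

where $e^{(l)} = [e^{(l)}_1; \dots; e^{(l)}_N]$ and $\Phi_{N \times n_h} = [\varphi(x_1)^T; \dots; \varphi(x_N)^T] \in \mathbb{R}^{N \times n_h}$. Dual problem: $M_D \Omega \alpha^{(l)} = \lambda D \alpha^{(l)}$, $l = 1,\dots,k-1$. [Alzate & Suykens, IEEE-PAMI]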

24 Primal and dual model representations. $k$ clusters, $k-1$ sets of constraints (index $l = 1,\dots,k-1$). Model M: (P): $\mathrm{sign}[\hat{e}^{(l)}_*] = \mathrm{sign}[w^{(l)T} \varphi(x_*) + b_l]$; (D): $\mathrm{sign}[\hat{e}^{(l)}_*] = \mathrm{sign}[\sum_j \alpha^{(l)}_j K(x_*, x_j) + b_l]$. Note: additional sets of constraints also appear in multi-class and vector-valued output LS-SVMs [Suykens et al., 1999]

25 Out-of-sample extension and coding [Figure: cluster assignments extended to the input space, axes $x^{(1)}$ and $x^{(2)}$]
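One way to realize the out-of-sample extension with coding is sketched below. The slide itself gives no algorithmic detail, so the specific scheme here is an assumption: the codebook is taken to be the $k$ most frequent sign patterns of the training score variables, and a new point is assigned to the codeword at minimal Hamming distance from its own sign pattern. The names `ksc_fit` and `ksc_predict` are hypothetical, and the demonstration uses two clusters only.

```python
import numpy as np
from collections import Counter

def rbf_kernel(X, Z, sigma):
    """Gaussian RBF kernel (bandwidth convention assumed for this sketch)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def ksc_fit(X, sigma, k):
    """Train on X: k-1 eigenvectors, bias terms, and a sign-pattern codebook."""
    Omega = rbf_kernel(X, X, sigma)
    dinv = 1.0 / Omega.sum(axis=1)                  # diagonal of D^{-1}
    s = dinv.sum()                                  # 1^T D^{-1} 1
    # D^{-1} M_D Omega = D^{-1} Omega - D^{-1} 1 1^T D^{-1} Omega / s
    A = dinv[:, None] * Omega - np.outer(dinv, dinv @ Omega) / s
    w, V = np.linalg.eig(A)
    order = np.argsort(w.real)[::-1][:k - 1]
    alpha = V[:, order].real
    b = -(dinv @ Omega @ alpha) / s                 # shape (k-1,)
    e = Omega @ alpha + b
    patterns = [tuple(row) for row in np.sign(e).astype(int)]
    # Assumed coding scheme: keep the k most frequent training sign patterns.
    codebook = np.array([p for p, _ in Counter(patterns).most_common(k)])
    return alpha, b, codebook

def ksc_predict(Xnew, X, sigma, alpha, b, codebook):
    """Assign each row of Xnew to the nearest codeword in Hamming distance."""
    e = rbf_kernel(Xnew, X, sigma) @ alpha + b      # dual model evaluated at new points
    signs = np.sign(e)
    hamming = (signs[:, None, :] != codebook[None, :, :]).sum(axis=2)
    return hamming.argmin(axis=1)

# Hypothetical data: two blobs for training, two unseen points for prediction.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(20, 2)),
               rng.normal(loc=[3.0, 0.0], scale=0.5, size=(20, 2))])
alpha, b, codebook = ksc_fit(X, sigma=1.0, k=2)
train_labels = ksc_predict(X, X, 1.0, alpha, b, codebook)
new_labels = ksc_predict(np.array([[-3.0, 0.1], [3.0, -0.1]]), X, 1.0, alpha, b, codebook)
```

The returned labels are indices into the codebook, so their numbering carries no meaning beyond identifying which points share a cluster.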



28 Piecewise constant eigenvectors and extension. Definition [Meila & Shi]: a vector $\alpha$ is called piecewise constant relative to a partition $(A_1,\dots,A_k)$ iff $\alpha_i = \alpha_j$ for all $x_i, x_j \in A_p$, $p = 1,\dots,k$. Proposition [Alzate & Suykens]: assume (i) a training set $\mathcal{D} = \{x_i\}_{i=1}^{N}$ and a validation set $\mathcal{D}^v = \{x^v_m\}_{m=1}^{N_v}$ i.i.d. sampled from the same underlying distribution; (ii) a set of $k$ clusters $\{A_1,\dots,A_k\}$ with $k > 2$; (iii) an isotropic kernel function such that $K(x, z) = 0$ when $x$ and $z$ belong to different clusters; (iv) the eigenvectors $\alpha^{(l)}$ for $l = 1,\dots,k-1$ are piecewise constant. Then validation set points belonging to the same cluster are collinear in the $(k-1)$-dimensional subspace spanned by the columns of $E^v \in \mathbb{R}^{N_v \times (k-1)}$, where $E^v_{ml} = e^{(l)}_m = \sum_{i=1}^{N} \alpha^{(l)}_i K(x_i, x^v_m) + b_l$.

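29 Piecewise constant eigenvectors and extension. Key aspect of the proof: for a point $x_*$ in cluster $A_p$, with $c^{(l)}_p$ the constant value of $\alpha^{(l)}$ on $A_p$, one has

$$e^{(l)}_* = \sum_{i=1}^{N} \alpha^{(l)}_i K(x_i, x_*) + b_l = c^{(l)}_p \sum_{i \in A_p} K(x_i, x_*) + \sum_{i \notin A_p} \alpha^{(l)}_i K(x_i, x_*) + b_l = c^{(l)}_p \sum_{i \in A_p} K(x_i, x_*) + b_l,$$

where the middle sum vanishes because the kernel is zero between different clusters. Model selection to determine the kernel parameters and $k$: look for line structures in the space $(e^{(1)}_i, e^{(2)}_i, \dots, e^{(k-1)}_i)$, evaluated on validation data (aiming for good generalization). Choice of kernel: Gaussian RBF kernel; $\chi^2$-kernel for images.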

30 Model selection (looking for lines): toy problem [Figure: validation-set score variables $e^{(l)}_{i,\mathrm{val}}$ plotted against each other for two values of $\sigma$, each panel annotated with its BLF value; train + validation + test data in the $(x^{(1)}, x^{(2)})$ plane]

31 Model selection (looking for lines): toy problem [Figure: validation-set score variables $e^{(l)}_{i,\mathrm{val}}$ for two values of $\sigma$, each panel annotated with its BLF value; train + validation + test data in $(x^{(1)}, x^{(2)}, x^{(3)})$]

32 Example: image segmentation (looking for lines) [Figure: validation-set score variables $e^{(l)}_{i,\mathrm{val}}$ in three dimensions]

33 [Figure: segmentation comparison with columns Image ID, Image, Proposed method, Nyström method, Human]

34 Example: power grid networks, identifying customer profiles. Power load: 45 substations, hourly data (5 years), dimension $d$. Periodic AR modelling: dimensionality reduction. k-means clustering applied after dimensionality reduction. [Figure: cluster profiles, normalized load versus hour]

35 Clustering time-series: kernel spectral clustering. Application of kernel spectral clustering directly on the $d$-dimensional data. Model selection on kernel parameter and number of clusters [Alzate, Espinoza, De Moor, Suykens, 2009] [Figure: cluster profiles, normalized load versus hour]

36 Clustering time-series: kernel spectral clustering [Figure: three cluster profiles, normalized load versus hour] Electricity load: 45 substations in Belgian grid (split into training and validation sets), $x_i \in \mathbb{R}^d$: spectral clustering on high dimensional data (5 years). 3 of 7 detected clusters: 1: Residential profile: morning and evening peaks; 2: Business profile: peaked around noon; 3: Industrial profile: increasing morning, oscillating afternoon and evening

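37 Out-of-sample eigenvectors. From the conditions for optimality, $\alpha = \gamma D^{-1} e$; an eigenvector $\alpha$ satisfies $\|\alpha\|_2 = 1$ and $1_N^T \alpha = 0$. By defining $\deg(x_*) = \sum_{j=1}^{N} K(x_*, x_j)$, the notion of eigenvector is extended to a validation set as follows:

$$\alpha_{\mathrm{val}} = \frac{\left[I_{N_v} - \frac{1}{N_v} 1_{N_v} 1_{N_v}^T\right] \gamma D_{\mathrm{val}}^{-1} e_{\mathrm{val}}}{\left\| \left[I_{N_v} - \frac{1}{N_v} 1_{N_v} 1_{N_v}^T\right] \gamma D_{\mathrm{val}}^{-1} e_{\mathrm{val}} \right\|_2}$$

satisfying $\|\alpha_{\mathrm{val}}\|_2 = 1$ and $1_{N_v}^T \alpha_{\mathrm{val}} = 0$, where $N_v$ denotes the validation set size. [Alzate & Suykens, IJCNN]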
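The extension formula on this slide amounts to scaling by the inverse validation degrees, centering, and normalizing. A minimal sketch follows; the function name `out_of_sample_eigenvector` and the numeric inputs are hypothetical, and `K_val` stands for an assumed $N_v \times N$ matrix of kernel evaluations between validation and training points.

```python
import numpy as np

def out_of_sample_eigenvector(e_val, K_val, gamma=1.0):
    """Extend an eigenvector to validation data.

    e_val : (Nv,) score variables of the validation points
    K_val : (Nv, N) kernel evaluations K(x_m^val, x_j) against the training set
    gamma : regularization constant (cancels in the normalization)
    """
    deg_val = K_val.sum(axis=1)        # deg(x) = sum_j K(x, x_j)
    z = gamma * e_val / deg_val        # gamma * D_val^{-1} e_val
    z = z - z.mean()                   # apply I - (1/Nv) 1 1^T
    return z / np.linalg.norm(z)       # unit 2-norm

# Hypothetical inputs, only to exercise the function.
rng = np.random.default_rng(3)
K_val = rng.uniform(0.1, 1.0, size=(8, 5))
e_val = rng.normal(size=8)
alpha_val = out_of_sample_eigenvector(e_val, K_val, gamma=2.0)
```

Since the result is normalized, any positive value of `gamma` yields the same `alpha_val`.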

38 Model selection (looking for dots): toy problem [Figure: validation-set eigenvector entries $\alpha^{(l)}_{i,\mathrm{val}}$ for a badly tuned and a well tuned value of $\sigma$, each panel annotated with its Fisher criterion value; data in $(x^{(1)}, x^{(2)}, x^{(3)})$]

39 Example: image segmentation (looking for dots) [Figure: Fisher criterion versus number of clusters $k$; validation-set eigenvector entries $\alpha^{(l)}_{i,\mathrm{val}}$]


41 Kernel spectral clustering: sparse kernel models [Figure: original image; binary clustering] Incomplete Cholesky decomposition: $\|\Omega - G G^T\|_2 \le \eta$ with $G \in \mathbb{R}^{N \times R}$ and $R \ll N$. Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 75 SV. $\hat{e}^{(l)}_* = \sum_{i \in S_{SV}} \alpha^{(l)}_i K(x_i, x_*) + b_l$

42 Kernel spectral clustering: sparse kernel models [Figure: original image; sparse kernel model] Incomplete Cholesky decomposition: $\|\Omega - G G^T\|_2 \le \eta$ with $G \in \mathbb{R}^{N \times R}$ and $R \ll N$. Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 75 SV. $\hat{e}^{(l)}_* = \sum_{i \in S_{SV}} \alpha^{(l)}_i K(x_i, x_*) + b_l$
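A pivoted incomplete Cholesky factorization of a kernel matrix can be sketched as below. This is a generic textbook-style variant, not necessarily the one used by the authors: the stopping rule (trace of the residual at most $\eta$, which upper-bounds the 2-norm of a positive semidefinite residual), the function names and the toy data are assumptions of this example.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """Gaussian RBF kernel (bandwidth convention assumed for this sketch)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def incomplete_cholesky(K, eta):
    """Greedy pivoted incomplete Cholesky: K ~ G G^T with G of shape (N, R)."""
    N = K.shape[0]
    diag = np.diag(K).astype(float).copy()   # diagonal of the residual K - G G^T
    G = np.zeros((N, N))
    pivots = []
    r = 0
    while r < N and diag.sum() > eta:        # trace of residual bounds its 2-norm
        p = int(np.argmax(diag))             # pivot with the largest residual
        pivots.append(p)
        G[:, r] = (K[:, p] - G[:, :r] @ G[p, :r]) / np.sqrt(diag[p])
        diag = np.maximum(diag - G[:, r] ** 2, 0.0)
        r += 1
    return G[:, :r], pivots

# Hypothetical data: 150 points in two blobs.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(75, 2)),
               rng.normal(loc=[2.0, 0.0], scale=0.5, size=(75, 2))])
K = rbf_kernel(X, X, sigma=1.0)
eta = 1e-3
G, pivots = incomplete_cholesky(K, eta)
residual_norm = np.linalg.norm(K - G @ G.T, 2)
```

The factor needs only the pivot columns of the kernel matrix, so in a large-scale setting the full $N \times N$ matrix would not have to be formed; the dense `K` here is for brevity.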

43 Highly sparse kernel models on images. Application on images: $x_i \in \mathbb{R}^3$ (r, g, b values per pixel), $i = 1,\dots,N$, pre-processed into $z_i \in \mathbb{R}^8$ (quantization to 8 colors); $\chi^2$-kernel to compare two local color histograms (5 × 5 pixels window). For large $N$, select a subset $M \ll N$ based on quadratic Renyi entropy as in the fixed-size method [Suykens et al.]. Highly sparse representations: # SV = $3k$. Completion of cluster indicators based on out-of-sample extensions $\mathrm{sign}[\hat{e}^{(l)}_*] = \mathrm{sign}[\sum_{j \in S_{SV}} \alpha^{(l)}_j K(x_*, x_j) + b_l]$ applied to the full image [Alzate & Suykens, Neurocomputing]
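The entropy-based subset selection can be illustrated with a simple random-swap search. This sketch assumes the quadratic Renyi entropy of a subset is estimated, up to a kernel normalization constant, as $-\log\big(\frac{1}{M^2}\sum_{i,j}\Omega_{ij}\big)$ over the subset's kernel matrix; the swap strategy, the Gaussian kernel (used here instead of the $\chi^2$-kernel for brevity), the function names and the data are all assumptions of this example.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """Gaussian RBF kernel (stand-in kernel for this sketch)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def renyi_entropy(X_sub, sigma):
    """Assumed estimate: -log of the mean entry of the subset kernel matrix."""
    return -np.log(rbf_kernel(X_sub, X_sub, sigma).mean())

def select_subset(X, M, sigma, n_iter=500, seed=0):
    """Random-swap search for a size-M subset with high quadratic Renyi entropy."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    idx = rng.choice(N, size=M, replace=False)
    H_init = H = renyi_entropy(X[idx], sigma)
    for _ in range(n_iter):
        i = rng.integers(M)                 # position in the subset to replace
        j = rng.integers(N)                 # candidate point from the full set
        if j in idx:
            continue
        cand = idx.copy()
        cand[i] = j
        H_cand = renyi_entropy(X[cand], sigma)
        if H_cand > H:                      # keep the swap only if entropy increases
            idx, H = cand, H_cand
    return idx, H_init, H

# Hypothetical data: 300 points, select M = 12.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
idx, H_init, H_final = select_subset(X, M=12, sigma=1.0)
```

Higher entropy corresponds to a subset that is more spread out over the data, which is the motivation for using it to pick representative points.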

44 Highly sparse kernel models: toy example [Figure: data in the $(x^{(1)}, x^{(2)})$ plane and the corresponding score variables $e^{(l)}_i$] only $3k = 9$ support vectors

45 Highly sparse kernel models: toy example [Figure: data in the $(x^{(1)}, x^{(2)})$ plane]

46 Highly sparse kernel models: toy example [Figure: data in the $(x^{(1)}, x^{(2)})$ plane with the selected support vectors] only $3k$ support vectors

47 Highly sparse kernel models: toy example [Figure: score variables $\hat{e}^{(l)}_i$ in three dimensions]

48 Highly sparse kernel models: image segmentation [Figure: score variables $e^{(l)}_i$ in three dimensions]


49 Highly sparse kernel models: image segmentation [Figure: score variables $e^{(l)}_i$ in three dimensions] only $3k$ support vectors


51 Kernel spectral clustering: adding prior knowledge. Pair of points $x_a, x_b$: $c = 1$ must-link, $c = -1$ cannot-link. Primal problem [Alzate & Suykens, IJCNN 2009]:

$$\min_{w^{(l)}, e^{(l)}, b_l}\; -\frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} + \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l\, e^{(l)T} D^{-1} e^{(l)}$$

subject to

$$e^{(l)} = \Phi_{N \times n_h} w^{(l)} + b_l 1_N, \qquad w^{(l)T} \varphi(x_a) = c\, w^{(l)T} \varphi(x_b), \qquad l = 1,\dots,k-1.$$

Dual problem: yields a rank-one downdate of the kernel matrix.

52 Kernel spectral clustering: example [Figure: original image; segmentation without constraints]

53 Kernel spectral clustering: example [Figure: original image; segmentation with constraints]

54 Conclusions. Spectral clustering within a kernel-based learning framework. Training problem: characterization in terms of primal and dual problem. Out-of-sample extensions: primal and dual model representations. Extend desirable piecewise constant property to validation level. New model selection criteria (learning and generalization aspects). (Highly) sparse kernel models. Suitable for adding prior knowledge through constraints.

55 References
Alzate C., Suykens J.A.K., Multiway Spectral Clustering with Out-of-Sample Extensions through Weighted Kernel PCA, IEEE Transactions on Pattern Analysis and Machine Intelligence.
Alzate C., Suykens J.A.K., Sparse Kernel Spectral Clustering Models for Large-Scale Data Analysis, Neurocomputing, 74(9).
Alzate C., Suykens J.A.K., A Regularized Formulation for Spectral Clustering with Pairwise Constraints, International Joint Conference on Neural Networks (IJCNN 2009), Atlanta US, 2009.
Alzate C., Suykens J.A.K., Out-of-Sample Eigenvectors in Kernel Spectral Clustering, to appear, International Joint Conference on Neural Networks (IJCNN).
Suykens J.A.K., Data Visualization and Dimensionality Reduction using Kernel Maps with a Reference Point, IEEE Transactions on Neural Networks, 2008.
Suykens J.A.K., Alzate C., Pelckmans K., Primal and dual model representations in kernel-based learning, Statistics Surveys, 4.
