Machine Learning for Signal Processing: Clustering. Bhiksha Raj. Class 11, 13 Oct 2016.


Machine Learning for Signal Processing: Clustering. Bhiksha Raj. Class 11, 13 Oct 2016. 1

Statistical Modelling and Latent Structure. Much of statistical modelling attempts to identify latent structure in the data: structure that is not immediately apparent from the observed data, but which, if known, helps us explain the data better and make predictions from or about it. Clustering methods attempt to extract such structure from proximity; this is first-level structure (as opposed to deep structure). We will see other forms of latent structure discovery later in the course. 2

Clustering 3

How 4

Clustering. What is clustering? Clustering is the determination of naturally occurring groupings of data/instances (with low within-group variability and high between-group variability). How is it done? Find groupings of the data such that the groups optimize a within-group-variability objective function of some kind. The objective function used affects the nature of the discovered clusters, e.g. Euclidean distance vs. distance from the center. (Slides 5-9)

Why Clustering? Automatic grouping into classes: different clusters may show different behavior. Quantization: all data within a cluster are represented by a single point. Preprocessing step for other algorithms: indexing, categorization, etc. 10

Finding natural structure in data Find natural groupings in data for further analysis Discover latent structure in data 11

Some Applications of Clustering Image segmentation 12

Representation: Quantization. (Figure: training and quantization panels.) Quantize every vector to one of K (vector) values. What are the optimal K vectors? How do we find them? How do we perform the quantization? The LBG algorithm. 13

Representation: BOW How to retrieve all music videos by this guy? Build a classifier But how do you represent the video? 14

Representation: BOW. Training: each point is a video frame. Representation: each number is the number of frames assigned to the codeword, e.g. [4 16 12 17 30]. Bag-of-words representations of video/audio/data. 15

Obtaining Meaningful Clusters. Two key aspects: 1. The feature representation used to characterize your data. 2. The clustering criteria employed. 16

Clustering Criterion. The clustering criterion actually has two aspects. Cluster compactness criterion: a measure that shows how good the clusters are; the objective function. Distance of a point from a cluster: to determine which cluster a data vector belongs to. 17

Compactness criteria for clustering. Distance-based measures: the total distance between each element in the cluster and every other element in the cluster; the distance between the two farthest points in the cluster; the total distance of every element in the cluster from the centroid of the cluster. Distance measures are often weighted Minkowski metrics: dist = (w_1 |a_1 - b_1|^n + w_2 |a_2 - b_2|^n + ... + w_M |a_M - b_M|^n)^(1/n). (Slides 18-24)
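
As an illustration of the weighted Minkowski metric above, here is a minimal NumPy sketch; the function name and the equal-weight default are my own choices, not from the slides:

```python
import numpy as np

def weighted_minkowski(a, b, w=None, n=2):
    """Weighted Minkowski distance: (sum_i w_i |a_i - b_i|^n)^(1/n)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    w = np.ones_like(a) if w is None else np.asarray(w, float)
    return (w * np.abs(a - b) ** n).sum() ** (1.0 / n)

# With unit weights and n = 2 this reduces to the ordinary Euclidean distance.
print(weighted_minkowski([0, 0], [3, 4]))            # 5.0
print(weighted_minkowski([0, 0], [3, 4], w=[1, 4]))  # weighted variant
```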

Clustering: Distance from a cluster. How far is a data point from a cluster? Euclidean or Minkowski distance from the centroid of the cluster; distance from the closest point in the cluster; distance from the farthest point in the cluster; probability of the data measured on the cluster distribution; fit of the data to a cluster-based regression. (Slides 25-29)

Optimal clustering: Exhaustive enumeration. All possible combinations of data must be evaluated. If there are M data points and we desire N clusters, the number of ways of separating M instances into N clusters is (1/N!) Σ_{i=0}^{N-1} (-1)^i C(N, i) (N - i)^M. Exhaustive-enumeration-based clustering requires that the objective function (the goodness measure) be evaluated for every one of these partitions, and the best one chosen. This is the only correct way of optimal clustering. Unfortunately, it is also computationally unrealistic. 30
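
To see why exhaustive enumeration is unrealistic, here is a small sketch that evaluates the count above (a Stirling number of the second kind) for a few values of M and N; the function name is mine:

```python
from math import comb, factorial

def num_partitions(M, N):
    """Number of ways to split M points into N (non-empty) clusters:
    (1/N!) * sum_{i=0}^{N-1} (-1)^i * C(N, i) * (N - i)^M."""
    return sum((-1) ** i * comb(N, i) * (N - i) ** M for i in range(N)) // factorial(N)

print(num_partitions(10, 3))   # 9330
print(num_partitions(100, 3))  # astronomically large: brute force is hopeless
```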

Not-quite non sequitur: Quantization. (Figure: probability of analog value vs. analog value; arrows are quantization levels.) Signal value S >= 3.75v: bits 11, mapped to 3*const. 3.75v > S >= 2.5v: bits 10, mapped to 2*const. 2.5v > S >= 1.25v: bits 01, mapped to 1*const. 1.25v > S >= 0v: bits 00, mapped to 0. Linear quantization (uniform quantization): each digital value represents an equally wide range of analog values, regardless of the distribution of the data. Digital-to-analog conversion is represented by a uniform table. 31

Not-quite non sequitur: Quantization. (Figure: probability of analog value vs. analog value; arrows are quantization levels.) Signal value S >= 4v: bits 11, mapped to 4.5. 4v > S >= 2.5v: bits 10, mapped to 3.25. 2.5v > S >= 1v: bits 01, mapped to 1.25. 1.0v > S >= 0v: bits 00, mapped to 0.5. Non-linear quantization: each digital value represents a different range of analog values, with finer resolution in high-density areas. Mu-law / A-law assumes a Gaussian-like distribution of the data. Digital-to-analog conversion is represented by a non-uniform table. 32

Non-uniform quantization. (Figure: probability of analog value vs. analog value.) What if the data distribution is not Gaussian-ish? Mu-law / A-law are not optimal. How do we compute the optimal ranges for quantization, or the optimal table? 33

The Lloyd Quantizer. (Figure: probability of analog value vs. analog value; arrows show quantization levels.) The Lloyd quantizer is an iterative algorithm for computing optimal quantization tables for non-uniformly distributed data, learned from training data. 34

Lloyd Quantizer. Randomly initialize the quantization points (the right-column entries of the quantization table). Assign all training points to the nearest quantization point (i.e. draw the boundaries). Re-estimate the quantization points. Iterate until convergence. (Slides 35-38)
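
A minimal 1-D sketch of the Lloyd iteration described above, assuming scalar training data and K quantization levels; the function name and the random-initialization details are my own:

```python
import numpy as np

def lloyd_quantizer(samples, K, iters=100, seed=0):
    """1-D Lloyd quantizer: alternate nearest-level assignment and level re-estimation."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, float)
    levels = rng.choice(samples, K, replace=False)       # random initial quantization points
    for _ in range(iters):
        assign = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)   # nearest level
        new = np.array([samples[assign == k].mean() if np.any(assign == k) else levels[k]
                        for k in range(K)])               # re-estimate each level
        if np.allclose(new, levels):
            break
        levels = new
    return np.sort(levels)

data = np.concatenate([np.random.default_rng(1).normal(0, 1, 500),
                       np.random.default_rng(2).normal(5, 0.5, 500)])
print(lloyd_quantizer(data, 4))   # levels concentrate where the data is dense
```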

Generalized Lloyd Algorithm: K-means clustering. K-means is an iterative algorithm for clustering vector data. MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281-297. General procedure: initially group the data into the required number of clusters somehow (initialization); assign each data point to the closest cluster; once all data points are assigned to clusters, redefine the clusters; iterate. 39

K-means. Problem: given a set of data vectors, find natural clusters. The clustering criterion is scatter: distance from the centroid. Every cluster has a centroid, and the centroid represents the cluster. Definition: the centroid is the weighted mean of the cluster, m_cluster = (1/Σ_i w_i) Σ_i w_i x_i, with weight w_i = 1 for the basic scheme. 40

K-means. 1. Initialize a set of centroids randomly. 2. For each data point x, find the distance from the centroid of each cluster: d_cluster = distance(x, m_cluster). 3. Put the data point in the cluster of the closest centroid: the cluster for which d_cluster is minimum. 4. When all data points are clustered, recompute the centroids: m_cluster = (1/N_cluster) Σ_{i in cluster} x_i, or in the weighted case m_cluster = (1/Σ_i w_i) Σ_i w_i x_i. 5. If not converged, go back to 2. (Slides 41-51)
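
A minimal NumPy sketch of the K-means loop in steps 1-5 above; the function name, convergence test, and random initialization are my own choices:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Basic K-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]                # 1. random initialization
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # 2. distances
        labels = d.argmin(axis=1)                                      # 3. closest centroid
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                        for k in range(K)])                            # 4. recompute centroids
        if np.allclose(new, centroids):                                # 5. check convergence
            break
        centroids = new
    return centroids, labels

X = np.vstack([np.random.default_rng(1).normal(0, 1, (100, 2)),
               np.random.default_rng(2).normal(6, 1, (100, 2))])
centroids, labels = kmeans(X, K=2)
print(centroids)
```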

K-means comments. The distance metric determines the clusters. In the original formulation, the distance is the L2 distance (Euclidean norm) with w_i = 1: distance(x, m) = ||x - m||^2, and m_cluster = (1/N_cluster) Σ_i x_i. If we replace every x by m_cluster(x), we get Vector Quantization. K-means is an instance of generalized EM. It is not guaranteed to converge for all distance metrics. 52

Initialization. Random initialization. Top-down clustering: initially partition the data into two (or a small number of) clusters using K-means; partition each of the resulting clusters into two (or a small number of) clusters, also using K-means; terminate when the desired number of clusters is obtained. 53

K-means for top-down clustering. 1. Start with one cluster. 2. Split each cluster into two: perturb the centroid of the cluster slightly (by < 5%) to generate two centroids. 3. Initialize K-means with the new set of centroids. 4. Iterate K-means until convergence. 5. If the desired number of clusters is not obtained, return to 2. (Slides 54-61)
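
A sketch of the splitting procedure above. Both helper names are mine, and I interpret the "< 5%" perturbation as 5% of the per-dimension data spread; other readings are equally valid:

```python
import numpy as np

def lloyd(X, centroids, iters=50):
    """Plain K-means iterations from the given initial centroids."""
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                        for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

def topdown_kmeans(X, K):
    """Top-down (splitting) K-means: start with one centroid, split each, refine, repeat."""
    centroids = X.mean(axis=0, keepdims=True)           # 1. start with one cluster
    eps = 0.05 * X.std(axis=0)                          # 2. small perturbation (one reading of "< 5%")
    while len(centroids) < K:
        centroids = np.vstack([centroids - eps, centroids + eps])[:K]
        centroids = lloyd(X, centroids)                 # 3-4. refine with K-means
    return centroids

X = np.random.default_rng(0).normal(size=(300, 2))
print(topdown_kmeans(X, 4))
```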

Non-Euclidean clusters. Basic K-means results in good clusters in Euclidean spaces. Alternately stated, it will only find clusters that are good in terms of Euclidean distances; it will not find other types of clusters. 62

Non-Euclidean clusters. f([x, y]) -> [x, y, z], with x = x, y = y, z = a(x^2 + y^2). For other forms of clusters we must modify the distance measure, e.g. distance from a circle. This may be viewed as a distance in a higher-dimensional space, i.e. kernel distances: kernel K-means. Other related clustering mechanisms: spectral clustering (non-linear weighting of adjacency), normalized cuts, etc. 63

The Kernel Trick. f([x, y]) -> [x, y, z], with x = x, y = y, z = a(x^2 + y^2). Transform the data into a synthetic higher-dimensional space where the desired patterns become natural clusters, e.g. the quadratic transform above. Problem: what is the function/space? Problem: distances in the higher-dimensional space are more expensive to compute, yet they only carry the same information as in the lower-dimensional space. 64
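
To make the quadratic transform concrete, here is a small sketch that lifts 2-D points to [x, y, a(x^2 + y^2)] and shows that two concentric rings, which Euclidean K-means cannot separate, become trivially separable in the lifted space. The ring data, the choice a = 1, and the simple threshold used for illustration are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two concentric rings in 2-D: Euclidean K-means cannot separate them.
theta = rng.uniform(0, 2 * np.pi, 400)
radius = np.concatenate([np.full(200, 1.0), np.full(200, 3.0)]) + rng.normal(0, 0.05, 400)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

# Lift to the synthetic space [x, y, z] with z = a * (x^2 + y^2), here with a = 1.
a = 1.0
Z = np.column_stack([X, a * (X ** 2).sum(axis=1)])

# In the lifted space the rings sit at different heights (z near 1 and near 9),
# so a simple split on z (or K-means on Z) recovers the two rings.
labels = (Z[:, 2] > Z[:, 2].mean()).astype(int)
print("points assigned to each ring:", np.bincount(labels))   # roughly [200, 200]
```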

Distance in the higher-dimensional space. Transform the data x through a possibly unknown function F(x) into a higher (potentially infinite) dimensional space: z = F(x). The distance between two points is computed in the higher-dimensional space: d(x_1, x_2) = ||z_1 - z_2||^2 = ||F(x_1) - F(x_2)||^2. d(x_1, x_2) can be computed without computing z, since it is a direct function of x_1 and x_2. 65

Distance in the higher-dimensional space. Distance in the lower-dimensional space is a combination of dot products: ||z_1 - z_2||^2 = (z_1 - z_2)^T (z_1 - z_2) = z_1.z_1 + z_2.z_2 - 2 z_1.z_2. Distance in the higher-dimensional space: d(x_1, x_2) = ||F(x_1) - F(x_2)||^2 = F(x_1).F(x_1) + F(x_2).F(x_2) - 2 F(x_1).F(x_2). d(x_1, x_2) can be computed without knowing F(x) if F(x_1).F(x_2) can be computed for any x_1 and x_2 without knowing F(.). 66

The Kernel Function. A kernel function K(x_1, x_2) is a function such that K(x_1, x_2) = F(x_1).F(x_2). Once such a kernel function is found, the distance in the higher-dimensional space can be found in terms of the kernels: d(x_1, x_2) = ||F(x_1) - F(x_2)||^2 = F(x_1).F(x_1) + F(x_2).F(x_2) - 2 F(x_1).F(x_2) = K(x_1, x_1) + K(x_2, x_2) - 2 K(x_1, x_2). But what is K(x_1, x_2)? 67
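
A minimal sketch of the kernel-induced squared distance above, written for an arbitrary kernel function passed as an argument; the helper name and the quadratic example kernel are mine:

```python
import numpy as np

def kernel_distance_sq(x1, x2, K):
    """Squared distance in the (implicit) feature space: K(x1,x1) + K(x2,x2) - 2 K(x1,x2)."""
    return K(x1, x1) + K(x2, x2) - 2.0 * K(x1, x2)

# Example with a quadratic kernel K(x, y) = (x.y + 1)^2, whose feature map is known explicitly.
quad = lambda x, y: (np.dot(x, y) + 1.0) ** 2
x1, x2 = np.array([1.0, 2.0]), np.array([0.0, -1.0])
print(kernel_distance_sq(x1, x2, quad))
```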

A property of the dot product. For any vector v, v^T v = ||v||^2 >= 0; this is the squared length of v and is therefore non-negative. For any vector u = Σ_i a_i v_i, ||u||^2 >= 0 => (Σ_i a_i v_i)^T (Σ_i a_i v_i) >= 0 => Σ_i Σ_j a_i a_j v_i.v_j >= 0. This holds for ANY real {a_1, a_2, ...}. 68

The Mercer Condition. If z = F(x) is a high-dimensional vector derived from x, then for all real {a_1, a_2, ...} and any set {z_1, z_2, ...} = {F(x_1), F(x_2), ...}: Σ_i Σ_j a_i a_j z_i.z_j >= 0, i.e. Σ_i Σ_j a_i a_j F(x_i).F(x_j) >= 0. If K(x_1, x_2) = F(x_1).F(x_2), then Σ_i Σ_j a_i a_j K(x_i, x_j) >= 0. Any function K(.) that satisfies the above condition is a valid kernel function. 69
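
A quick numerical sanity check of the Mercer condition on a sample of points: build the Gram matrix K_ij = K(x_i, x_j) and verify that its eigenvalues are non-negative up to round-off, which is equivalent to Σ_i Σ_j a_i a_j K(x_i, x_j) >= 0 for all real a. The data and kernel width below are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Gaussian kernel K(x, y) = exp(-||x - y||^2 / s^2)
s2 = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
G = np.exp(-sq_dists / s2)                       # Gram matrix K_ij = K(x_i, x_j)

eigvals = np.linalg.eigvalsh(G)                  # symmetric matrix, so use eigvalsh
print("smallest eigenvalue:", eigvals.min())     # non-negative up to round-off: Mercer holds
```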

The Mercer Condition. K(x_1, x_2) = F(x_1).F(x_2) => Σ_i Σ_j a_i a_j K(x_i, x_j) >= 0. A corollary: if a kernel K(.) satisfies the Mercer condition, then d(x_1, x_2) = K(x_1, x_1) + K(x_2, x_2) - 2 K(x_1, x_2) satisfies the following requirements for a distance: d(x, x) = 0; d(x, y) >= 0; d(x, w) + d(w, y) >= d(x, y). 70

Typical Kernel Functions. Linear: K(x, y) = x^T y + c. Polynomial: K(x, y) = (a x^T y + c)^n. Gaussian: K(x, y) = exp(-||x - y||^2 / s^2). Exponential: K(x, y) = exp(-||x - y|| / l). Several others exist. Choosing the right kernel with the right parameters for your problem is an art form. 71
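
Minimal NumPy implementations of the kernels listed above; the parameter defaults c, a, n, s, l are arbitrary illustrative values, not recommendations:

```python
import numpy as np

def linear_kernel(x, y, c=1.0):
    return np.dot(x, y) + c

def polynomial_kernel(x, y, a=1.0, c=1.0, n=2):
    return (a * np.dot(x, y) + c) ** n

def gaussian_kernel(x, y, s=1.0):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / s ** 2)

def exponential_kernel(x, y, l=1.0):
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(y)) / l)

x, y = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y), exponential_kernel(x, y))
```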

Kernel K-means. K(x, y) = (x^T y + c)^2. Perform the K-means in the kernel space, i.e. the space of z = F(x). The algorithm follows. 72

The mean of a cluster. The average value of the points in the cluster, computed in the high-dimensional space: m_cluster = (1/N_cluster) Σ_i F(x_i). Alternately, the weighted average: m_cluster = (1/Σ_i w_i) Σ_i w_i F(x_i) = C_cluster Σ_i w_i F(x_i), where C_cluster = 1/Σ_i w_i. RECALL: we may never actually be able to compute this mean, because F(x) is not known. (Slides 73-74)

K-means in the kernel space. Initialize the clusters with a random set of K points; each cluster has one point, so m_cluster = (1/Σ_i w_i) Σ_i w_i F(x_i). For each data point x, find the closest cluster: cluster(x) = argmin_cluster d(x, cluster), where d(x, cluster) = ||F(x) - m_cluster||^2 = (F(x) - C_cluster Σ_i w_i F(x_i))^T (F(x) - C_cluster Σ_i w_i F(x_i)) = F(x)^T F(x) - 2 C_cluster Σ_i w_i F(x)^T F(x_i) + C_cluster^2 Σ_i Σ_j w_i w_j F(x_i)^T F(x_j) = K(x, x) - 2 C_cluster Σ_i w_i K(x, x_i) + C_cluster^2 Σ_i Σ_j w_i w_j K(x_i, x_j). Computed entirely using only the kernel function! 75
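
A direct sketch of the distance expression above for the unweighted case (w_i = 1, so C_cluster = 1/N_cluster); the function name and the vectorized kernel signature are my own:

```python
import numpy as np

def kernel_dist_to_cluster(x, cluster_pts, K):
    """d(x, cluster) = K(x,x) - (2/N) * sum_i K(x, x_i) + (1/N^2) * sum_ij K(x_i, x_j)."""
    Kxx = K(x, x)
    Kxc = np.array([K(x, xi) for xi in cluster_pts])
    Kcc = np.array([[K(xi, xj) for xj in cluster_pts] for xi in cluster_pts])
    return Kxx - 2.0 * Kxc.mean() + Kcc.mean()

quad = lambda a, b: (np.dot(a, b) + 1.0) ** 2          # example kernel
pts = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([1.1, -0.1])]
print(kernel_dist_to_cluster(np.array([1.0, 0.0]), pts, quad))   # small: x sits inside the cluster
print(kernel_dist_to_cluster(np.array([5.0, 5.0]), pts, quad))   # large: x is far away
```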

Kernel K-means. 1. Initialize a set of clusters randomly. 2. For each data point x, find the distance from the centroid of each cluster: d_cluster = distance(x, m_cluster) = K(x, x) - 2 C_cluster Σ_i w_i K(x, x_i) + C_cluster^2 Σ_i Σ_j w_i w_j K(x_i, x_j). 3. Put the data point in the cluster of the closest centroid: the cluster for which d_cluster is minimum. 4. When all data points are clustered, recompute the centroids, m_cluster = (1/Σ_i w_i) Σ_i w_i F(x_i). The centroids are virtual: we do not explicitly compute them. This may even be impossible, since we do not know the high-dimensional space; we only know how to compute inner products in it. 5. If not converged, go back to 2. (Slides 76-87)
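
Putting the pieces together, here is a compact unweighted (w_i = 1) kernel K-means sketch that works entirely from the Gram matrix, using the singleton-cluster random initialization described on the slides. The function name, iteration cap, and toy two-ring data are my own; how cleanly the rings separate depends on the kernel width and the random seed:

```python
import numpy as np

def kernel_kmeans(G, K, iters=100, seed=0):
    """Unweighted kernel K-means working directly on the Gram matrix G[i, j] = K(x_i, x_j)."""
    N = len(G)
    rng = np.random.default_rng(seed)
    seeds = rng.choice(N, K, replace=False)               # K random points, each a singleton cluster
    labels = np.full(N, -1)
    labels[seeds] = np.arange(K)
    for _ in range(iters):
        D = np.full((N, K), np.inf)
        for c in range(K):
            idx = np.flatnonzero(labels == c)
            if len(idx) == 0:
                continue                                   # empty cluster: never chosen
            # d(x, c) = K(x, x) - (2/|c|) sum_i K(x, x_i) + (1/|c|^2) sum_ij K(x_i, x_j)
            D[:, c] = np.diag(G) - 2.0 * G[:, idx].mean(axis=1) + G[np.ix_(idx, idx)].mean()
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy data: two concentric rings, with a Gaussian kernel Gram matrix.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 4.0)])
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
G = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
labels = kernel_kmeans(G, K=2)
# Cluster sizes within each ring (separation quality depends on kernel width and initialization).
print(np.bincount(labels[:100], minlength=2), np.bincount(labels[100:], minlength=2))
```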

How many clusters? Assumptions: the dimensionality of the kernel space is greater than the number of clusters, and the clusters represent separate directions in the kernel space. Form the kernel correlation matrix K with K_ij = K(x_i, x_j). Find the eigenvalues l and eigenvectors e of the kernel matrix. The number of clusters is the number of dominant l_i (1^T e_i) terms. 88
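
A sketch of this eigen-analysis heuristic: compute the Gram matrix, take its eigendecomposition, and inspect the magnitudes of l_i (1^T e_i) to judge how many terms dominate. The data, kernel width, and the use of absolute values for display are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs.
X = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in ([0, 0], [5, 0], [0, 5])])

sq = ((X[:, None] - X[None]) ** 2).sum(-1)
G = np.exp(-sq / 2.0)                                # Gaussian kernel Gram matrix

eigvals, eigvecs = np.linalg.eigh(G)                 # symmetric eigendecomposition
ones = np.ones(len(X))
scores = np.abs(eigvals * (ones @ eigvecs))          # |l_i * (1^T e_i)| for each eigenpair
print(np.sort(scores)[::-1][:6])                     # a few dominant terms, then a sharp drop
```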

Spectral Methods. Spectral methods attempt to find the principal subspaces of the high-dimensional kernel space; clustering is performed in the principal subspaces. Normalized cuts; spectral clustering. This involves finding the eigenvectors and eigenvalues of the kernel matrix. Fortunately, it is provably analogous to kernel K-means. 89

Other clustering methods. Regression-based clustering: find a regression representing each cluster; associate each point with the cluster whose regression fits it best. Related to kernel methods. 90

Clustering: many, many other variants, and many applications. Important: the appropriate choice of features; an appropriate choice of features may eliminate the need for the kernel trick. Google is your friend. 91