A Dendrogram
Hierarchical Clustering [Johnson, S.C., 1967]
Given n points in R^d, compute the distance between every pair of points.
While (not done):
  Pick the closest pair of points s_i and s_j and make them part of the same cluster.
  Replace the pair by their average, s_ij.
Try the applet at: http://www.cs.mcgill.ca/~papou/#applet
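A minimal Python sketch of this agglomerative loop, assuming Euclidean distance and plain lists; the function and variable names (hierarchical_cluster, target_clusters, etc.) are illustrative, not from the slides:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hierarchical_cluster(points, target_clusters=1):
    """Agglomerative clustering: repeatedly merge the closest pair of
    clusters, replacing the pair by the average of the two centroids."""
    # Each cluster is (centroid, list of member indices).
    clusters = [(list(p), [i]) for i, p in enumerate(points)]
    merges = []
    while len(clusters) > target_clusters:
        # Find the closest pair of cluster centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (ci, mi), (cj, mj) = clusters[i], clusters[j]
        # Replace the pair by the average of the two centroids.
        merged = ([(a + b) / 2 for a, b in zip(ci, cj)], mi + mj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((mi, mj, d))
    return clusters, merges

clusters, merges = hierarchical_cluster([[0, 0], [0, 1], [5, 5], [6, 5]], 2)
print([members for _, members in clusters])   # [[0, 1], [2, 3]]
```

The list of merges, with the distances at which they happened, is exactly the information a dendrogram plots.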
Distance Metrics
For clustering, define a distance function:
Distance metric: D_k(X, Y) = ( \sum_{i=1}^{d} |X_i - Y_i|^k )^{1/k}   (k = 2: Euclidean distance)
Pearson correlation coefficient: \rho_{xy} = \frac{1}{d} \sum_{i=1}^{d} \frac{(X_i - \bar{X})}{\sigma_x} \frac{(Y_i - \bar{Y})}{\sigma_y},   with -1 \le \rho_{xy} \le 1
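As a quick sketch, both measures can be computed directly in Python (the function names minkowski and pearson are illustrative):

```python
import math

def minkowski(x, y, k=2):
    """D_k(X, Y); k = 2 gives the ordinary Euclidean distance."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1 / k)

def pearson(x, y):
    """Pearson correlation coefficient, always in [-1, 1]."""
    d = len(x)
    mx, my = sum(x) / d, sum(y) / d
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / d)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / d)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (d * sx * sy)

print(minkowski([0, 0], [3, 4]))        # 5.0
print(pearson([1, 2, 3], [2, 4, 6]))    # 1.0
```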
[Figure: clustering of the points, Start state vs. End state.]
K-Means Clustering [MacQueen, 1967]
Start with randomly chosen cluster centers.
Repeat:
  Assign points to clusters so as to give the greatest increase in score.
  Recompute cluster centers.
  Reassign points.
until (no changes)
Try the applet at: http://www.cs.mcgill.ca/~bonnef/project.html
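A minimal Python sketch of this loop (the function name kmeans, the random seed, and the iteration cap are illustrative choices, and "score" is taken here as nearest-center assignment):

```python
import math, random

def kmeans(points, k, max_iters=100, seed=0):
    """K-means: start from randomly chosen centers, then alternate
    point assignment and center recomputation until nothing changes."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iters):
        changed = False
        # Assign each point to its nearest center.
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                changed = True
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    return centers, assignment

centers, labels = kmeans([[0, 0], [0, 1], [5, 5], [6, 5]], k=2)
print(labels)   # e.g. [0, 0, 1, 1]
```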
Self-Organizing Maps [Kohonen]
A kind of neural network.
Clusters data and finds complex relationships between clusters.
Helps reduce the dimensionality of the data.
Produces a map of 1 or 2 dimensions.
Unsupervised clustering, like K-Means, but geared toward visualization.
SOM Algorithm
Select the SOM architecture, and initialize weight vectors and other parameters.
While (stopping condition not satisfied) do
  for each input point x:
    the winning node q is the one whose weight vector is closest to x;
    update the weight vector of q and its neighbors.
  Reduce the neighborhood size and learning rate.
SOM Algorithm Details
Distance between x and weight vector w_i: \|x - w_i\|
Winning node: q(x) = \arg\min_i \|x - w_i\|
Weight update (for the winner and its neighbors): w_i(k+1) = w_i(k) + \mu(k, x, i) [x(k) - w_i(k)]
Learning rate / neighborhood function: \mu(k, x, i) = \eta_0(k) \exp( -\|r_i - r_{q(x)}\|^2 / (2\sigma^2) ), where r_i is the grid position of node i.
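A small Python sketch of this update rule, assuming a rectangular grid, exponentially decaying learning rate and neighborhood width, and random initial weights; the schedules and the name train_som are assumptions, not from the slides:

```python
import math, random

def train_som(data, grid=(5, 5), epochs=100, eta0=0.5, sigma0=2.0, seed=0):
    """SOM training: the winner is the node whose weight vector is closest
    to x; every node is pulled toward x with strength
    eta(k) * exp(-|r_i - r_q(x)|^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [(r, c) for r in range(grid[0]) for c in range(grid[1])]
    weights = {n: [rng.random() for _ in range(dim)] for n in nodes}
    for k in range(epochs):
        # Shrink the learning rate and neighborhood size over time.
        eta = eta0 * math.exp(-k / epochs)
        sigma = sigma0 * math.exp(-k / epochs)
        for x in data:
            # Winning node: weight vector closest to x.
            q = min(nodes, key=lambda n: math.dist(x, weights[n]))
            for n in nodes:
                # Neighborhood strength based on grid distance to the winner.
                h = math.exp(-((n[0] - q[0]) ** 2 + (n[1] - q[1]) ** 2)
                             / (2 * sigma ** 2))
                weights[n] = [w + eta * h * (xi - w)
                              for w, xi in zip(weights[n], x)]
    return weights

weights = train_som([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
```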
World Poverty SOM
World Poverty Map
Neural Networks
[Figure: a single neuron: input X, synaptic weights W, bias θ, summation Σ, activation f(·), output y.]
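The diagram amounts to y = f(W·X + θ). A one-function sketch in Python, assuming a logistic sigmoid for f (the activation choice and the name neuron are illustrative):

```python
import math

def neuron(x, w, theta):
    """Single neuron: weighted sum of inputs plus bias, passed through
    an activation function f (here, the logistic sigmoid)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return 1.0 / (1.0 + math.exp(-s))

print(neuron([1.0, 0.5], [0.4, -0.2], 0.1))   # ~0.599
```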
Learning NN Weights
[Figure: the input X is passed through the weights W to produce an output; the output is compared with the desired response, and the resulting error drives an adaptive algorithm that adjusts the weights.]
Types of NNs
Feed-forward NN (layered)
Recurrent NN
Other issues:
  Hidden layers possible
  Different activation functions possible
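Combining the pieces, a layered feed-forward pass with one hidden layer might look like the sketch below; the topology, the sigmoid activations, and all names and numbers are illustrative assumptions:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def feed_forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """Layered feed-forward pass: input -> hidden layer -> single output.
    Each unit is the single neuron sketched above."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)

y = feed_forward([1.0, 0.0],
                 hidden_weights=[[2.0, -1.0], [-1.5, 3.0]],
                 hidden_biases=[0.0, 0.5],
                 out_weights=[1.0, -1.0],
                 out_bias=0.2)
print(y)
```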
Application: Secondary Structure Prediction
Support Vector Machines
Supervised statistical learning method for:
  Classification
  Regression
Simplest version:
  Training: present a series of labeled examples (e.g., gene expression profiles of tumor vs. normal cells).
  Prediction: predict the labels of new examples.
Learning Problems
[Figure: points in the plane labeled A and B, to be separated into two classes.]
SVM Binary Classification
Partition the feature space with a surface.
The surface is implied by a subset of the training points (vectors) near it; these vectors are referred to as Support Vectors.
Efficient with high-dimensional data.
Solid statistical theory.
Subsumes several other methods.
Learning Problems
Binary classification
Multi-class classification
Regression
SVM General Principles
SVMs perform binary classification by partitioning the feature space with a surface implied by a subset of the training points (vectors) near the separating surface. These vectors are referred to as Support Vectors.
Efficient with high-dimensional data.
Solid statistical theory.
Subsumes several other methods.
SVM Example (Radial Basis Function)
SVM Ingredients
Support Vectors
Mapping from Input Space to Feature Space
Dot Product
Kernel function
Weights
Classification of 2-D (Separable) Data
Classification of (Separable) 2-D Data
Classification of (Separable) 2-D Data
[Figure: separable points labeled +1 and -1; illustrates the margin of a point and the margin of a point set.]
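The slide leaves the definitions implicit; the standard ones (stated here as an assumption, following the usual geometric-margin convention) are

  \gamma_i = y_i \, \frac{\langle w, x_i \rangle + b}{\lVert w \rVert}, \qquad \gamma_S = \min_{1 \le i \le N} \gamma_i,

with labels y_i \in \{+1, -1\}: the margin of a point is its signed distance to the separating surface, and the margin of the point set is the smallest margin over its points.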
Classification Using the Separator
Separator: w · x + b = 0
Points on the positive side: w · x_i + b > 0
Points on the negative side: w · x_j + b < 0
Perceptron Algorithm (Primal)   [Rosenblatt, 1956]
Given a separable training set S and learning rate η > 0:
w_0 = 0;  // weight vector
b_0 = 0;  // bias
k = 0;
R = max_i ||x_i||
repeat
  for i = 1 to N
    if y_i (⟨w_k, x_i⟩ + b_k) ≤ 0 then
      w_{k+1} = w_k + η y_i x_i
      b_{k+1} = b_k + η y_i R²
      k = k + 1
until no mistakes made within the loop
Return k and (w_k, b_k), where k = number of mistakes.
(Note: the learned weight vector can be written as w = Σ_i α_i y_i x_i, which leads to the dual form below.)
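A runnable Python version of the primal algorithm (the function name, toy data, and epoch cap are illustrative; classification of new points uses the sign of w · x + b, as on the previous slide):

```python
def perceptron_primal(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron: on each mistake, nudge w toward y_i * x_i
    and adjust b by eta * y_i * R^2."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    R = max(sum(xi * xi for xi in x) ** 0.5 for x in X)
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b = b + eta * yi * R * R
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return w, b, mistakes

w, b, k = perceptron_primal([[2, 2], [1, 3], [-1, -2], [-2, -1]], [1, 1, -1, -1])
# Classify new points by the sign of w . x + b.
print(k, [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1
          for x in [[3, 3], [-3, -3]]])   # 2 [1, -1]
```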
Performance for Separable Data
Theorem: If the margin m of S is positive, then k ≤ (2R/m)²,
i.e., the algorithm will always converge, and will converge quickly.
Perceptron Algorithm (Dual)
Given a separable training set S:
α = 0;  b = 0;  R = max_i ||x_i||
repeat
  for i = 1 to N
    if y_i (Σ_j α_j y_j ⟨x_j, x_i⟩ + b) ≤ 0 then
      α_i = α_i + 1
      b = b + y_i R²
    endif
until no mistakes made within the loop
Return (α, b)
Non-linear Separators
Main idea: Map into feature space
Non-linear Separators
[Figure: a mapping from the input space X to a feature space F in which the data become linearly separable.]
Useful URLs
http://www.support-vector.net
Perceptron Algorithm (Dual)
Given a separable training set S:
α = 0;  b = 0;  R = max_i ||x_i||
repeat
  for i = 1 to N
    if y_i (Σ_j α_j y_j k(x_j, x_i) + b) ≤ 0 then
      α_i = α_i + 1
      b = b + y_i R²
until no mistakes made within the loop
Return (α, b)
where k(x_i, x_j) = Φ(x_i) · Φ(x_j)
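A Python sketch of this kernelized dual form; the function name and the choice to measure R in feature space via k(x, x) are assumptions, and with the plain dot product as kernel it behaves like the primal version above:

```python
def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual-form perceptron: the hypothesis is a sum of kernel evaluations
    against the training points, weighted by mistake counts alpha_i."""
    N = len(X)
    alpha = [0] * N
    b = 0.0
    R = max(kernel(x, x) ** 0.5 for x in X)   # norm in feature space
    for _ in range(max_epochs):
        clean_pass = True
        for i in range(N):
            f = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(N)) + b
            if y[i] * f <= 0:
                alpha[i] += 1
                b += y[i] * R * R
                clean_pass = False
        if clean_pass:
            break
    return alpha, b

dot = lambda u, v: sum(a * c for a, c in zip(u, v))
alpha, b = kernel_perceptron([[2, 2], [1, 3], [-1, -2], [-2, -1]],
                             [1, 1, -1, -1], kernel=dot)
print(alpha, b)
```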
Different Kernel Functions
Polynomial kernel:    κ(X, Y) = (X · Y)^d
Radial Basis kernel:  κ(X, Y) = exp( -||X - Y||² / (2σ²) )
Sigmoid kernel:       κ(X, Y) = tanh( ω (X · Y) + θ )
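These three kernels are easy to write down directly; the sketch below can be plugged into the kernel perceptron above (the default parameter values are arbitrary assumptions):

```python
import math

def polynomial_kernel(x, y, d=3):
    """Polynomial kernel (X . Y)^d."""
    return sum(a * b for a, b in zip(x, y)) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis kernel exp(-||X - Y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(x, y, omega=1.0, theta=0.0):
    """Sigmoid kernel tanh(omega (X . Y) + theta)."""
    return math.tanh(omega * sum(a * b for a, b in zip(x, y)) + theta)

# Any of these can be passed to the kernel perceptron above, e.g.:
# alpha, b = kernel_perceptron(X, labels, kernel=rbf_kernel)
```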
SVM Ingredients
Support Vectors
Mapping from Input Space to Feature Space
Dot Product
Kernel function
Generalizations
How to deal with more than 2 classes?  Idea: associate a weight and bias with each class.
How to deal with a non-linear separator?  Idea: Support Vector Machines.
How to deal with linear regression?
How to deal with non-separable data?
Applications
Text categorization & information filtering: 12,902 Reuters stories, 118 categories (91%!!)
Image recognition: face detection, tumor anomalies, defective parts on an assembly line, etc.
Gene expression analysis
Protein homology detection
SVM Example (Radial Basis Function)