CS 6140: Machine Learning Spring 2017
- Lionel Wood
- 6 years ago
1 CS 6140: Machine Learning Spring 2017 Instructor: Lu Wang College of Computer and Information Science Northeastern University Webpage:
2 Grades for Assignment 1 will be out next week. Assignment 3 is out and due on 03/30. Project progress report is due on 03/16. Hard copy in class.
3 Project progress report What changes have you made for the task? No change at all? Changed the data, or something else? Describe data preprocessing: What are the features? Numerical or categorical? Do you use all the data or part of it? What method have you tried? E.g., regression, SVM
4 Project progress report What results do you have now? Baselines? Metrics? Precision, recall, F1, accuracy. How do your results compare to the baselines?
5 What we learned Dimension (or feature) reduction Principal component analysis (PCA) Singular value decomposition (SVD)
6
7 What is Principal Component Analysis? Principal component analysis (PCA): reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables, that retains most of the sample's information. Useful for the compression and classification of data.
8 Geometric picture of principal components (PCs) z_1, the 1st PC, is a minimum-distance fit to a line in X space. z_2, the 2nd PC, is a minimum-distance fit to a line in the plane perpendicular to the 1st PC. PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
9 Algebraic definition of PCs The first PC is z_1 = a_1^T x. To find a_1, note that var[z_1] = E[(z_1 − E[z_1])^2] = (1/n) Σ_{i=1}^n (a_1^T x_i − a_1^T x̄)^2 = a_1^T S a_1, where S = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T is the covariance matrix and x̄ = (1/n) Σ_{i=1}^n x_i is the mean. In the following, we assume the data is centered: x̄ = 0.
10 Algebraic derivation of PCs We find that a_2 is also an eigenvector of S, whose eigenvalue λ_2 is the second largest. In general, var[z_k] = a_k^T S a_k = λ_k: the k-th largest eigenvalue of S is the variance of the k-th PC. z_k, the k-th PC in the sample, retains the k-th greatest fraction of the variation.
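The derivation above (var[z_k] = a_k^T S a_k = λ_k) can be sketched directly in NumPy. This is an illustrative example, not code from the slides; the function name `pca` and the synthetic data are mine.

```python
import numpy as np

def pca(X, k):
    """Project data onto the top-k principal components.

    Returns (Z, A, variances): the k-dimensional scores, the
    eigenvectors a_1..a_k as columns, and their eigenvalues.
    """
    X = X - X.mean(axis=0)                 # center the data (x-bar = 0)
    S = (X.T @ X) / len(X)                 # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]  # keep the k largest
    A = eigvecs[:, order]
    return X @ A, A, eigvals[order]

# Data stretched mainly along one axis: the first PC captures that direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
Z, A, var = pca(X, 2)
```

The variance of each score column Z[:, k] equals the corresponding eigenvalue, matching var[z_k] = λ_k.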
11 PCA for image compression Reconstructions with d = 1, 2, 4, 8, 16, 32, 64, 100 components, compared with the original image.
12
13
14
15
16
17 CUR decomposition In large-data settings it is normal for the matrix A being decomposed (e.g., a documents × terms matrix) to be very sparse. With SVD, even if A is sparse, U and V will be dense.
18
19
20
21
22 Today's Outline Clustering K means Hierarchical clustering Spectral clustering [some of the slides are borrowed from David Blei]
23 Clustering Goal: segment data into groups of similar points
24 Clustering Goal: segment data into groups of similar points When and why would we want to do this?
25 Clustering Goal: segment data into groups of similar points When and why would we want to do this? Useful for: organizing data; understanding hidden structure in data; summarizing high-dimensional data in a low-dimensional space
26 Clustering Examples Grouping Facebook users according to their interests Image search Topic discovery in documents
27 Setup
28
29 Segment this data into k groups What is a good distance function?
30 Squared Euclidean distance: d(x, y) = ‖x − y‖² = Σ_j (x_j − y_j)²
31 segment this data into k groups What should k be?
32 segment this data into k groups For example, k is 4
33 K means
34 K means The basic idea is to describe each cluster by its mean value. Goal: assign data to clusters and define these clusters with their means.
35 K means algorithm
36 K means algorithm
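As a concrete illustration of the alternating assign/update structure of K means, here is a minimal NumPy sketch of Lloyd's algorithm. It is not the slides' code; the function name and toy data are mine.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center's cluster.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    objective = d2.min(axis=1).sum()  # within-cluster sum of squared distances
    return labels, centers, objective

# Two well-separated blobs: k-means recovers them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers, obj = kmeans(X, k=2)
```

Because the algorithm only finds a local minimum, a common practice is to rerun it with several seeds and keep the result with the lowest objective.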
37 Example: Start
38
39
40
41
42
43 K means evaluation
44 Coordinate descent
45 Coordinate descent However, it finds a local minimum. (Multiple restarts are often necessary.)
46 for the example data
47 Example: compressing images
48 Each pixel is associated with a red, green, and blue value. A 1024 × 1024 image is a collection of values <x1, x2, x3>, which requires 3 MB of storage. How can we use k-means to compress this image?
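The compression idea can be sketched as follows: cluster the RGB pixel vectors into k palette colors, then store one small palette index per pixel plus the k × 3 palette. This is an illustrative sketch with a synthetic "image" and a tiny inlined k-means; names and data are mine, not the slides'.

```python
import numpy as np

def compress_image(pixels, k, n_iter=50, seed=0):
    """Quantize RGB pixels to k palette colors with a tiny k-means.

    pixels: (n, 3) float array. Returns (indices, palette): per-pixel
    palette indices plus the k x 3 palette of cluster mean colors.
    """
    rng = np.random.default_rng(seed)
    palette = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((pixels[:, None] - palette[None]) ** 2).sum(-1)
        idx = d2.argmin(1)
        palette = np.array([pixels[idx == j].mean(0) if (idx == j).any()
                            else palette[j] for j in range(k)])
    return idx, palette

# Synthetic 64x64 "image": half reddish pixels, half bluish pixels.
rng = np.random.default_rng(2)
img = np.vstack([np.clip(rng.normal([200, 30, 30], 10, (2048, 3)), 0, 255),
                 np.clip(rng.normal([30, 30, 200], 10, (2048, 3)), 0, 255)])
idx, palette = compress_image(img, k=2)
reconstructed = palette[idx]
# Storage drops from 3 bytes per pixel to log2(k) bits per pixel plus the palette.
```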
49
50
51
52
53
54
55
56
57 Measure of distortion Less distorted with more clusters
58 K-medoids In many settings, Euclidean distance is not appropriate: discrete data, such as purchase histories or movie watching histories. k-medoids is an algorithm that only requires knowing distances between data points. No need to define the mean.
59 K-medoids In many settings, Euclidean distance is not appropriate: discrete data, such as purchase histories or movie watching histories. k-medoids is an algorithm that only requires knowing distances between data points. No need to define the mean. Each of the clusters is associated with its most typical example.
60 k-medoids algorithm
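One common alternating variant of k-medoids (assign each point to its nearest medoid, then replace each medoid by the cluster member minimizing total within-cluster distance) can be sketched as below. This is a hedged illustration, not necessarily the exact algorithm on the slide; it uses only a precomputed distance matrix, which is the point of k-medoids.

```python
import numpy as np

def kmedoids(D, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed n x n distance matrix D.

    Only pairwise distances are needed: no means are ever computed, so
    this also works for discrete data such as purchase histories.
    """
    n = len(D)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)        # nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                # The member minimizing total distance to its own cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return D[:, medoids].argmin(axis=1), medoids

# Two separated groups on a line; D is the absolute-difference distance.
x = np.concatenate([np.arange(10), np.arange(100, 110)]).astype(float)
D = np.abs(x[:, None] - x[None, :])
labels, medoids = kmedoids(D, k=2)
```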
61 Choosing k Choosing k is a nagging problem in cluster analysis. Sometimes, the problem determines k: clustering customers for k salespeople in a business. Otherwise, it is not well-defined.
62 What happens as k increases?
70 A kink in the objective
71 Today's Outline Clustering K means Hierarchical clustering Spectral clustering
72 Hierarchical clustering Hierarchical clustering is a widely used data analysis tool. The idea is to build a binary tree of the data that successively merges similar groups of points. Visualizing this tree provides a useful summary of the data.
73 Hierarchical clustering vs. k-means Recall that k-means or k-medoids requires: a number of clusters k; an initial assignment of data to clusters; a distance measure between data points. Hierarchical clustering only requires a measure of similarity between groups of data points.
74 Agglomerative clustering Algorithm: place each data point into its own singleton group. Repeat: merge the two closest groups, until all the data are merged into a single cluster.
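The algorithm above can be written in a few lines: maintain a list of groups, repeatedly merge the closest pair under some inter-group distance. A naive O(n³)-ish sketch (illustrative names and toy data, not the slides' code):

```python
import numpy as np

def agglomerate(X, group_dist):
    """Greedy agglomerative clustering: start from singletons, repeatedly
    merge the two closest groups until one cluster remains.

    Returns the merge sequence as (members_a, members_b, distance) triples.
    """
    groups = [[i] for i in range(len(X))]
    merges = []
    while len(groups) > 1:
        # Find the closest pair of groups under the supplied distance.
        a, b = min(((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
                   key=lambda ij: group_dist(X[groups[ij[0]]], X[groups[ij[1]]]))
        merges.append((groups[a], groups[b], group_dist(X[groups[a]], X[groups[b]])))
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [groups[a] + groups[b]]
    return merges

# Average linkage: mean pairwise Euclidean distance between the two groups.
def average_linkage(A, B):
    return np.sqrt(((A[:, None] - B[None]) ** 2).sum(-1)).mean()

X = np.array([[0.0], [0.1], [5.0], [5.1], [5.2]])
merges = agglomerate(X, average_linkage)
```

The merge sequence is exactly the dendrogram data: each triple is one level of the tree, and the last merge joins the two top-level clusters.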
75
76
77
78
79
80
81
82
83
84
85
86 Agglomerative clustering Each level of the tree is a segmentation of the data. The algorithm results in a sequence of groupings. It is up to the user to choose a natural clustering from this sequence.
87 Dendrogram Agglomerative clustering is monotonic: the similarity between merged clusters is monotone decreasing with the level of the merge. Dendrogram: plot each merge at the similarity between the two merged groups. Provides an interpretable visualization of the algorithm and data. A useful tool, part of why hierarchical clustering is popular.
88 [source: mmds.org]
89 Group similarity
90
91
92 Caveats of intergroup similarity Single linkage can produce chaining, where a sequence of close observations in different groups causes early merges of those groups.
93 Caveats of intergroup similarity Single linkage can produce chaining, where a sequence of close observations in different groups causes early merges of those groups. Complete linkage has the opposite problem: it might not merge close groups because of outlier members that are far apart.
94 Caveats of intergroup similarity Single linkage can produce chaining, where a sequence of close observations in different groups causes early merges of those groups. Complete linkage has the opposite problem: it might not merge close groups because of outlier members that are far apart. Group average represents a natural compromise, but depends on the scale of the similarities: applying a monotone transformation to the similarities can change the results.
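The three intergroup similarities can be computed side by side to see the outlier effect described above. A small NumPy sketch (illustrative data; the single outlier in group B dominates complete linkage but barely moves single linkage):

```python
import numpy as np

def pairwise(A, B):
    """All Euclidean distances between rows of A and rows of B."""
    return np.sqrt(((A[:, None] - B[None]) ** 2).sum(-1))

def single_linkage(A, B):   return pairwise(A, B).min()    # closest pair
def complete_linkage(A, B): return pairwise(A, B).max()    # farthest pair
def average_linkage(A, B):  return pairwise(A, B).mean()   # mean pairwise

# Two nearby groups, but group B contains one far-away outlier.
A = np.array([[0.0, 0.0], [0.2, 0.0]])
B = np.array([[1.0, 0.0], [1.2, 0.0], [9.0, 0.0]])

d_single = single_linkage(A, B)      # driven by the closest pair only
d_complete = complete_linkage(A, B)  # driven by the outlier alone
d_average = average_linkage(A, B)    # a compromise between the two
```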
95 Caveats Hierarchical clustering should be treated with caution. Different decisions about group similarity can lead to vastly different dendrograms. The algorithm imposes a hierarchical structure on the data, even on data for which such structure is not appropriate.
96 Today's Outline Clustering K means Hierarchical clustering Spectral clustering [Some slides are borrowed from Royi Itzhak]
97 Spectral Clustering Algorithms that cluster points using eigenvectors of matrices derived from the data. Obtain a representation of the data in a low-dimensional space that can be easily clustered. Difficult to understand, but easy to implement.
98 Elements of Graph Theory A graph G = (V, E) consists of a vertex set V and an edge set E. If G is a directed graph, each edge is an ordered pair of vertices. A bipartite graph is one in which the vertices can be divided into two groups, so that all edges join vertices in different groups.
99 Similarity Graph Represent the dataset as a weighted graph G(V, E). V = {x_i}: set of n vertices {v_1, v_2, ..., v_n} representing data points. E = {W_ij}: set of weighted edges indicating pair-wise similarity between points.
100 Similarity Graph W_ij represents the similarity between vertices x_i and x_j. If W_ij = 0, there is no similarity. Set W_ii = 0.
101 The idea Clustering can be viewed as a task on the similarity graph: divide V into two disjoint groups (A, B), with V = A ∪ B. Graph partitioning is NP-hard!
102 Clustering objectives Objectives of a good clustering: 1. Points assigned to the same cluster should be highly similar. 2. Points assigned to different clusters should be highly dissimilar.
103 Clustering objectives Objectives of a good clustering: 1. Points assigned to the same cluster should be highly similar. 2. Points assigned to different clusters should be highly dissimilar. Apply these objectives to our graph representation: minimize the weight of between-group connections.
104 Graph Cuts Express the partition quality as a function of the edge cut of the partition. Cut: set of edges with only one vertex in a group. We want to find the minimal cut between groups; the groups with the minimal cut give the partition. cut(A, B) = Σ_{i∈A, j∈B} w_ij
105 Graph Cut Criteria Criterion: minimum cut. Minimize the weight of connections between groups: min cut(A, B). Degenerate case: the minimum cut can simply separate a single outlying vertex. Problem: only considers external cluster connections; does not consider internal cluster density.
106 Graph Cut Criteria Criterion: normalized cut (Shi & Malik, '97). Consider the connectivity between groups relative to the density of each group: min Ncut(A, B), where Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B). Normalize the association between groups by volume. vol(A): the total weight of the edges originating from group A. Why use this criterion? It produces more balanced partitions.
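Both criteria are straightforward to compute from the weight matrix. A small NumPy sketch (the toy graph and function names are mine): two triangles joined by one weak edge, where the balanced cut has a far smaller Ncut than splitting off a single vertex.

```python
import numpy as np

def cut(W, A, B):
    """Total weight of edges crossing between vertex sets A and B."""
    return W[np.ix_(A, B)].sum()

def ncut(W, A, B):
    """Normalized cut: cut(A,B)/vol(A) + cut(A,B)/vol(B)."""
    vol = lambda S: W[S].sum()  # total weight of edges originating from S
    c = cut(W, A, B)
    return c / vol(A) + c / vol(B)

# Toy graph: two triangles joined by one weak edge (vertex 2 -- vertex 3).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w
A, B = [0, 1, 2], [3, 4, 5]
```

Comparing `ncut(W, A, B)` with `ncut(W, [0], [1, 2, 3, 4, 5])` shows why the normalized criterion prefers the balanced partition: the degenerate single-vertex cut is heavily penalized by its tiny volume.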
107 Example: 2 Spirals The dataset exhibits complex cluster shapes. K-means performs very poorly in this space due to its bias toward dense spherical clusters. In the embedded space given by the two leading eigenvectors, the clusters are trivial to separate.
108 Spectral Graph Theory Possible approach: represent the similarity graph as a matrix and apply knowledge from linear algebra. The eigenvalues and eigenvectors of a matrix provide global information about its structure: W x = λ x. Spectral Graph Theory: analyze the spectrum of the matrix representing a graph. Spectrum: the eigenvectors of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues, Λ = {λ_1, λ_2, ..., λ_n}.
109 Adjacency matrix (A) n × n matrix A = [w_ij]: edge weight between vertices x_i and x_j. Important properties: symmetric matrix; eigenvalues are real; eigenvectors form an orthogonal basis. (Example: a 6-vertex graph and its adjacency matrix over x_1, ..., x_6.)
110 Degree matrix (D) n × n diagonal matrix with D(i, i) = Σ_j w_ij: the total weight of edges incident to vertex x_i. Important application: normalizing the adjacency matrix. (Example: the degree matrix of the same 6-vertex graph.)
111 Laplacian matrix (L) n × n symmetric matrix L = D − A. Important properties: eigenvalues are non-negative real numbers; eigenvectors are real and orthogonal. The eigenvalues and eigenvectors provide insight into the connectivity of the graph. (Example: the Laplacian of the same 6-vertex graph.)
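The three matrices and the Laplacian's stated properties can be checked numerically. A minimal sketch (the 6-vertex toy graph is mine, standing in for the slide's example): the eigenvalues come out non-negative and the smallest is 0, with the constant vector as its eigenvector.

```python
import numpy as np

# Adjacency matrix W of a small weighted graph: two triangles
# (vertices 0-2 and 3-5) joined by one weak edge between 2 and 3.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

D = np.diag(W.sum(axis=1))          # degree matrix: row sums on the diagonal
L = D - W                           # (unnormalized) Laplacian
eigvals, eigvecs = np.linalg.eigh(L)  # real, ascending eigenvalues
```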
112 Another normalized Laplacian n × n symmetric matrix L' = D^{−1/2} L D^{−1/2}. Important properties: eigenvectors are real and normalized; each off-diagonal entry L'_ij (i ≠ j) equals L_ij / √(D_ii D_jj).
113 Finding a Min-Cut (Hall '70, Fiedler '73) Express a bi-partition (A, B) as a vector p: p_i = +1 if x_i ∈ A, p_i = −1 if x_i ∈ B. We can minimize the cut of the partition by finding a nontrivial vector p that minimizes the function f(p) = (1/2) Σ_{i,j∈V} w_ij (p_i − p_j)² = p^T L p (L: the Laplacian matrix).
114 Finding a Min-Cut (Hall '70, Fiedler '73) Express a bi-partition (A, B) as a vector p: p_i = +1 if x_i ∈ A, p_i = −1 if x_i ∈ B. We can minimize the cut of the partition by finding a nontrivial vector p that minimizes the function f(p) = (1/2) Σ_{i,j∈V} w_ij (p_i − p_j)² = p^T L p (L: the Laplacian matrix). The Rayleigh Theorem shows: the minimum value of f(p) is given by the 2nd smallest eigenvalue of the Laplacian L (the smallest eigenvalue is 0, which corresponds to the constant all-ones eigenvector). The optimal solution for p is given by the corresponding eigenvector, referred to as the Fiedler vector.
115 Spectral Clustering Algorithms Three basic stages: 1. Pre-processing: construct a matrix representation of the dataset. 2. Decomposition: compute eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors. 3. Grouping: assign points to two or more clusters, based on the new representation.
116 Spectral Algorithm 1. Pre-processing: build the Laplacian matrix L of the graph. 2. Decomposition: find the eigenvalues and eigenvectors of L; map vertices to the corresponding components of the eigenvector for λ_2. (Example: the Laplacian of the 6-vertex graph, its spectrum Λ, eigenvector matrix X, and the λ_2-eigenvector component for each vertex.)
117 Spectral Algorithm The matrix X represents the eigenvectors of the Laplacian matrix.
118 Spectral Grouping Sort the components of the reduced 1-dimensional vector. Form clusters by splitting the sorted vector in two. How to choose a splitting point? Naïve approaches: split at 0, or at the mean or median value. More expensive approaches: attempt to minimize the normalized cut criterion in 1 dimension. (Example: split at 0; cluster A: points with positive components; cluster B: points with negative components.)
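The whole pipeline (Laplacian, Fiedler vector, naive split at 0) fits in a few lines. An illustrative sketch on a toy graph of my own choosing: two triangles joined by a weak edge, which the sign of the Fiedler vector separates cleanly.

```python
import numpy as np

# Two triangles (0-2 and 3-5) joined by one weak edge between 2 and 3.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian
eigvals, eigvecs = np.linalg.eigh(L)    # ascending eigenvalues
fiedler = eigvecs[:, 1]                 # eigenvector of the 2nd smallest eigenvalue
labels = (fiedler > 0).astype(int)      # naive grouping: split at 0
```

(The two sides of the split may come out as 0/1 or 1/0, since an eigenvector is only defined up to sign.)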
119 3-Clusters
120 K-Way Spectral Clustering How do we partition a graph into k clusters? Two basic approaches: 1. Recursive bi-partitioning (Hagen et al., '91): recursively apply the bi-partitioning algorithm in a hierarchical divisive manner. Disadvantages: inefficient, unstable. 2. Cluster multiple eigenvectors (Shi & Malik, '00): build a reduced space from multiple eigenvectors. Commonly used; a preferable approach, though it is much like doing PCA and then k-means.
121 Recursive bi-partitioning (Hagen et al., '91) Partition using only one eigenvector at a time; use the procedure recursively. Example: image segmentation. Uses the 2nd smallest eigenvector to define the optimal cut. Recursively generates two clusters with each cut.
122 Why use Eigenvectors? 1. Approximates the optimal cut (Shi & Malik, '00): can be used to approximate the k-way normalized cut. 2. Emphasizes cohesive clusters (Brand & Huang, '02): increases the unevenness in the distribution of the data; affinities between similar points are amplified, affinities between dissimilar points are attenuated. The data begins to approximate a clustering. 3. Well-separated space: transforms data to a new embedded space, consisting of k orthogonal basis vectors.
123 K-Eigenvector Clustering K-eigenvector algorithm (Ng et al., '01) 1. Pre-processing: construct the scaled adjacency matrix A' = D^{−1/2} A D^{−1/2}. 2. Decomposition: find the eigenvalues and eigenvectors of A'; build the embedded space from the eigenvectors corresponding to the k largest eigenvalues. 3. Grouping: apply k-means to the reduced n × k space to produce k clusters.
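The embedding stage of the Ng et al. algorithm can be sketched as below (the grouping stage would then run k-means on the rows of U). This is a hedged illustration: the row normalization follows the usual presentation of the algorithm, and the ring-of-cliques toy graph is mine. Rows from the same cluster land nearly on top of each other in the embedded space.

```python
import numpy as np

def spectral_embed(W, k):
    """Embedding stage of k-eigenvector clustering: take the eigenvectors
    of the scaled adjacency matrix D^{-1/2} W D^{-1/2} for the k largest
    eigenvalues, then normalize each row to unit length."""
    d = W.sum(axis=1)
    A = W / np.sqrt(np.outer(d, d))             # scaled adjacency matrix
    eigvals, eigvecs = np.linalg.eigh(A)        # ascending eigenvalues
    U = eigvecs[:, -k:]                         # k largest eigenvalues
    return U / np.linalg.norm(U, axis=1, keepdims=True)

# Three 4-vertex cliques, weakly connected in a ring.
n = 12
W = np.zeros((n, n))
for c in range(3):
    for i in range(4 * c, 4 * c + 4):
        for j in range(4 * c, 4 * c + 4):
            if i != j:
                W[i, j] = 1.0
for i, j in [(3, 4), (7, 8), (11, 0)]:
    W[i, j] = W[j, i] = 0.1
U = spectral_embed(W, k=3)
```

Running k-means on the rows of U would then produce the three clusters; because same-cluster rows nearly coincide, the clustering is trivial in the embedded space.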
124 Aside: How to select k? Eigengap: the difference between two consecutive eigenvalues, Δ_k = |λ_k − λ_{k−1}|. The most stable clustering is generally given by the value k that maximizes the eigengap: k* = argmax_k Δ_k. (Example: for the largest eigenvalues of the CISI/Medline data, max Δ_k = |λ_2 − λ_1|, so choose k = 2.)
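The eigengap heuristic can be sketched on the scaled adjacency spectrum of a graph with planted clusters (an illustrative toy example of mine, not the slide's CISI/Medline data): with two weakly coupled triangles, two eigenvalues sit near 1 and the largest gap falls right after them, so the heuristic picks k = 2.

```python
import numpy as np

# Two triangles (0-2 and 3-5) joined by one very weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.05

d = W.sum(axis=1)
A = W / np.sqrt(np.outer(d, d))                 # D^{-1/2} W D^{-1/2}
lam = np.sort(np.linalg.eigvalsh(A))[::-1]      # eigenvalues, descending
gaps = lam[:-1] - lam[1:]                       # consecutive eigengaps
k = gaps.argmax() + 1                           # k maximizing the eigengap
```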
125 Summary on Spectral Clustering Clustering as a graph partitioning problem. The quality of a partition can be determined using graph cut criteria. Identifying an optimal partition is NP-hard. Spectral clustering techniques: an efficient approach to calculate near-optimal bi-partitions and k-way partitions; based on well-known cut criteria and a strong theoretical background.
126 What we learned today Clustering K means Hierarchical clustering Spectral clustering
127 Homework Readings Murphy Ch. Mining of Massive Datasets, chapter 11: http://infolab.stanford.edu/~ullman/mmds/ch11.pdf A tutorial on Spectral Clustering: http:// files/publications/attachments/Luxburg07_tutorial_4488%5b0%5d.pdf
Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of
More informationA Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)
A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center
More informationCSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection
CSE 258 Lecture 6 Web Mining and Recommender Systems Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:
More informationInforma(on Retrieval
Introduc*on to Informa(on Retrieval Clustering Chris Manning, Pandu Nayak, and Prabhakar Raghavan Today s Topic: Clustering Document clustering Mo*va*ons Document representa*ons Success criteria Clustering
More informationEE 701 ROBOT VISION. Segmentation
EE 701 ROBOT VISION Regions and Image Segmentation Histogram-based Segmentation Automatic Thresholding K-means Clustering Spatial Coherence Merging and Splitting Graph Theoretic Segmentation Region Growing
More informationSocial-Network Graphs
Social-Network Graphs Mining Social Networks Facebook, Google+, Twitter Email Networks, Collaboration Networks Identify communities Similar to clustering Communities usually overlap Identify similarities
More informationModularity CMSC 858L
Modularity CMSC 858L Module-detection for Function Prediction Biological networks generally modular (Hartwell+, 1999) We can try to find the modules within a network. Once we find modules, we can look
More informationClustering. So far in the course. Clustering. Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. dist(x, y) = x y 2 2
So far in the course Clustering Subhransu Maji : Machine Learning 2 April 2015 7 April 2015 Supervised learning: learning with a teacher You had training data which was (feature, label) pairs and the goal
More informationCS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample
Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! h0p://www.cs.toronto.edu/~rsalakhu/ Lecture 3 Parametric Distribu>ons We want model the probability
More informationCSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction
CSE 255 Lecture 5 Data Mining and Predictive Analytics Dimensionality Reduction Course outline Week 4: I ll cover homework 1, and get started on Recommender Systems Week 5: I ll cover homework 2 (at the
More informationCS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning
CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning Parallel sparse matrix-vector product Lay out matrix and vectors by rows y(i) = sum(a(i,j)*x(j)) Only compute terms with A(i,j) 0 P0 P1
More informationCS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003
CS 664 Slides #11 Image Segmentation Prof. Dan Huttenlocher Fall 2003 Image Segmentation Find regions of image that are coherent Dual of edge detection Regions vs. boundaries Related to clustering problems
More information( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components
Review Lecture 14 ! PRINCIPAL COMPONENT ANALYSIS Eigenvectors of the covariance matrix are the principal components 1. =cov X Top K principal components are the eigenvectors with K largest eigenvalues
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationCS395T Visual Recogni5on and Search. Gautam S. Muralidhar
CS395T Visual Recogni5on and Search Gautam S. Muralidhar Today s Theme Unsupervised discovery of images Main mo5va5on behind unsupervised discovery is that supervision is expensive Common tasks include
More informationHypergraph Sparsifica/on and Its Applica/on to Par//oning
Hypergraph Sparsifica/on and Its Applica/on to Par//oning Mehmet Deveci 1,3, Kamer Kaya 1, Ümit V. Çatalyürek 1,2 1 Dept. of Biomedical Informa/cs, The Ohio State University 2 Dept. of Electrical & Computer
More informationNormalized Graph cuts. by Gopalkrishna Veni School of Computing University of Utah
Normalized Graph cuts by Gopalkrishna Veni School of Computing University of Utah Image segmentation Image segmentation is a grouping technique used for image. It is a way of dividing an image into different
More informationClustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015
Clustering Subhransu Maji CMPSCI 689: Machine Learning 2 April 2015 7 April 2015 So far in the course Supervised learning: learning with a teacher You had training data which was (feature, label) pairs
More informationData Mining Learning from Large Data Sets
Data Mining Learning from Large Data Sets Lecture 8 Clustering large data sets 263-5200- 00L Andreas Krause Announcements! Homework 4 out tomorrow 2 Course organizapon! Retrieval! Given a query, find most
More informationBehavioral Data Mining. Lecture 18 Clustering
Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i
More informationClustering. Chapter 10 in Introduction to statistical learning
Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What
More informationImage Segmentation continued Graph Based Methods
Image Segmentation continued Graph Based Methods Previously Images as graphs Fully-connected graph node (vertex) for every pixel link between every pair of pixels, p,q affinity weight w pq for each link
More informationImage Segmentation. Srikumar Ramalingam School of Computing University of Utah. Slides borrowed from Ross Whitaker
Image Segmentation Srikumar Ramalingam School of Computing University of Utah Slides borrowed from Ross Whitaker Segmentation Semantic Segmentation Indoor layout estimation What is Segmentation? Partitioning
More informationHow and what do we see? Segmentation and Grouping. Fundamental Problems. Polyhedral objects. Reducing the combinatorics of pose estimation
Segmentation and Grouping Fundamental Problems ' Focus of attention, or grouping ' What subsets of piels do we consider as possible objects? ' All connected subsets? ' Representation ' How do we model
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationUnsupervised learning, Clustering CS434
Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,
More informationPPI Network Alignment Advanced Topics in Computa8onal Genomics
PPI Network Alignment 02-715 Advanced Topics in Computa8onal Genomics PPI Network Alignment Compara8ve analysis of PPI networks across different species by aligning the PPI networks Find func8onal orthologs
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationExploratory data analysis for microarrays
Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA
More informationCSE 258 Lecture 5. Web Mining and Recommender Systems. Dimensionality Reduction
CSE 258 Lecture 5 Web Mining and Recommender Systems Dimensionality Reduction This week How can we build low dimensional representations of high dimensional data? e.g. how might we (compactly!) represent
More informationWeek 7 Picturing Network. Vahe and Bethany
Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups
More informationToday s lecture. Clustering and unsupervised learning. Hierarchical clustering. K-means, K-medoids, VQ
Clustering CS498 Today s lecture Clustering and unsupervised learning Hierarchical clustering K-means, K-medoids, VQ Unsupervised learning Supervised learning Use labeled data to do something smart What
More informationSEMINAR: GRAPH-BASED METHODS FOR NLP
SEMINAR: GRAPH-BASED METHODS FOR NLP Organisatorisches: Seminar findet komplett im Mai statt Seminarausarbeitungen bis 15. Juli (?) Hilfen Seminarvortrag / Ausarbeitung auf der Webseite Tucan number for
More informationImage Segmentation continued Graph Based Methods. Some slides: courtesy of O. Capms, Penn State, J.Ponce and D. Fortsyth, Computer Vision Book
Image Segmentation continued Graph Based Methods Some slides: courtesy of O. Capms, Penn State, J.Ponce and D. Fortsyth, Computer Vision Book Previously Binary segmentation Segmentation by thresholding
More information