Introduction to Computer Science

1 DM534 Introduction to Computer Science Clustering and Feature Spaces

2 Richard Roettger: About Me. Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at Berkeley). PhD at the Max Planck Institute for Computer Science in Saarbrücken. Since 2014: Assistant Professor at SDU. Research interests: bioinformatics, machine learning, clustering, biological networks. Slides are taken from Arthur Zimek.

3 Clustering & Feature Spaces: Learning Objectives. Understand the problem of clustering in general. Learn about k-means. Understand the importance of feature spaces and object representation. Understand the influence of distance functions.

4 Clustering & Feature Spaces: Lecture Content. Clustering: Clustering in General; Partitional Clustering; Visualization: Algorithmic Differences; Summary. Feature Spaces: Distances; Features for Images; Summary.

5 Purpose of Clustering. Identify a finite number of categories (classes, groups: clusters) in a given dataset. Similar objects shall be grouped in the same cluster, dissimilar objects in different clusters. Similarity is highly subjective and depends on the application scenario.

6 How Many Clusters? Image taken from:

7-8 How Many Clusters? Everitt, Brian S., et al. Cluster Analysis, 5th Edition (2011).

9 A Seemingly Simple Problem. Each dataset can be clustered in many meaningful ways. Highly problem-dependent. Not known by the algorithm a priori. Figure from Tan et al. [2006].

10 About Clustering. Clustering is the unsupervised machine-learning task of grouping or segmenting a collection of objects into subsets or clusters such that those within each cluster are more closely related to one another than objects assigned to different clusters. What is related? For example: Customers? Age? Behavior? Kinship? Treatment of outliers? Ill-posed problem... That means there exist multiple solutions. What is the best?

11-12 Overview of a Cluster Analysis

14-15 Steps to Automatization: Cluster Criteria. Cohesion: how strongly are the cluster objects connected (how similar, pairwise, to each other)? Small within-cluster distance. Separation: how well is a cluster separated from other clusters? Large between-cluster distance. There exist many other criteria, e.g., areas with the same density. It is important to choose a criterion which fits the data!

16 Optimization. Partitional clustering algorithms partition a dataset into k clusters, typically minimizing some cost function (compactness criterion), i.e., optimizing cohesion. Clusters do not overlap, and all points must be part of a cluster.

17 Assumptions for Partitional Clustering. Central assumptions for approaches in this family are typically: the number k of clusters is known (i.e., given as input); clusters are characterized by their compactness; compactness is measured by some distance function (e.g., the distance of all objects in a cluster from some cluster representative is minimal); the criterion of compactness typically leads to convex or even spherically shaped clusters.

18-19 Construction of Central Points: Basics. Objects are points $x = (x_1, \dots, x_d)$ in the Euclidean vector space $\mathbb{R}^d$. dist = Euclidean distance ($L_2$). Centroid $\mu_C$: mean vector of all points in cluster $C$.

20-21 Cluster Criteria. Measure of compactness for a cluster $C$ (sum of squares): $TD^2(C) = \sum_{p \in C} \mathrm{dist}(p, \mu_C)^2$. Measure of compactness for a clustering: $TD^2(C_1, \dots, C_k) = \sum_{i=1}^{k} TD^2(C_i)$.
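
The two compactness measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: points are tuples of floats, a clustering is a list of non-empty clusters, and the helper names (sq_dist, centroid, td2_cluster, td2_clustering) are made up for this example and do not come from the slides.

```python
def sq_dist(p, q):
    """Squared Euclidean (L2) distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Mean vector of all points in a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def td2_cluster(cluster):
    """TD^2(C): sum of squared distances of the points in C to the centroid of C."""
    mu = centroid(cluster)
    return sum(sq_dist(p, mu) for p in cluster)


def td2_clustering(clusters):
    """TD^2(C_1, ..., C_k): sum of TD^2(C_i) over all clusters."""
    return sum(td2_cluster(c) for c in clusters)


# Toy data (assumed for illustration): two clusters of two points each.
example = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
print(td2_clustering(example))  # each cluster contributes 1 + 1, so 4.0 in total
```

A smaller value means more compact clusters, which is why partitional algorithms try to minimize it for a fixed k.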

22 Basic Algorithm: Clustering by Minimization of Variance [Forgy, 1965, Lloyd, 1982]. Start with k (e.g., randomly selected) points as cluster representatives (or with a random partition into k clusters). Repeat: 1. assign each point to the closest representative; 2. compute new representatives based on the given partitions (centroid of the assigned points); until there is no change in assignment.
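
The loop above could look roughly as follows in Python. This is a sketch, not a reference implementation: the function name lloyd_kmeans, the seed and max_iter parameters, and the handling of empty clusters (the old representative is kept) are assumptions made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance; it has the same nearest representative as the distance itself."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Centroid (mean vector) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd_kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Start with k randomly selected points as cluster representatives.
    reps = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each point to the closest representative.
        new_assignment = [
            min(range(k), key=lambda i: sq_dist(p, reps[i])) for p in points
        ]
        # Stop as soon as there is no change in assignment.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: recompute each representative as the centroid of its points.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old representative
                reps[i] = mean(members)
    return reps, assignment


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
representatives, labels = lloyd_kmeans(data, k=2)
print(representatives, labels)
```

All assignments are computed before any representative is moved, which is what distinguishes this batch variant from the MacQueen variant.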

23 k-means. k-means [MacQueen, 1967] is a variant of the basic algorithm: a centroid is updated immediately when some point changes its assignment. k-means has very similar properties, but the result now depends on the order of data points in the input file. Note: The name k-means is often used indiscriminately for any variant of the basic algorithm, in particular also for the algorithm shown before [Forgy, 1965, Lloyd, 1982].
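
A possible sketch of the immediate-update variant in Python. Several details are assumptions of this example and not taken from the slides: the first k points in input order serve as initial centroids, a cluster is never emptied, and the names macqueen_kmeans and max_passes are invented here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def macqueen_kmeans(points, k, max_passes=100):
    # Assumption: the first k points (in input order) are the initial centroids.
    centroids = [list(p) for p in points[:k]]
    counts = [1] * k
    assignment = list(range(k)) + [None] * (len(points) - k)
    for _ in range(max_passes):
        changed = False
        for idx, p in enumerate(points):
            old = assignment[idx]
            new = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            if new == old:
                continue
            if old is not None and counts[old] == 1:
                continue  # assumption: never remove the last point of a cluster
            if old is not None:
                # Remove p from its old cluster and update that centroid immediately.
                counts[old] -= 1
                centroids[old] = [
                    m - (x - m) / counts[old] for m, x in zip(centroids[old], p)
                ]
            # Add p to its new cluster and update that centroid immediately.
            counts[new] += 1
            centroids[new] = [
                m + (x - m) / counts[new] for m, x in zip(centroids[new], p)
            ]
            assignment[idx] = new
            changed = True
        if not changed:
            break
    return [tuple(c) for c in centroids], assignment


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(macqueen_kmeans(data, k=2))
```

Because each centroid moves as soon as a point is reassigned, passing the same points in a different order can lead to a different result.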

25-33 k-means Clustering: Lloyd/Forgy Algorithm [step-by-step visualization]

34-47 k-means Clustering: MacQueen Algorithm [step-by-step visualization]

48 k-means Clustering: MacQueen Algorithm, Alternative Ordering

49-60 k-means Clustering: MacQueen Algorithm [step-by-step visualization]

61-63 k-means Clustering: Quality

65 k-means: Pros. Efficient: O(k n) per iteration; the number of iterations is usually on the order of 10. Easy to implement, thus very popular. Only one parameter, easy to understand. Well understood and researched. Different variants exist: fuzzy clustering, variants without n-dimensional embedding.

66 k-means: Disadvantages. k-means converges towards a local minimum. k-means (MacQueen variant) is order-dependent. Deteriorates with noise and outliers (all points are used to compute centroids). Clusters need to be convex and of (more or less) equal extent. The number k of clusters is hard to determine. Strong dependency on the initial partition (in result quality as well as runtime).

67-68 What to do? How can we tackle the initialization problem? Repeated runs. Furthest-first initialization. Subset furthest-first initialization.

69 Insertion: Furthest First Initialization. Select a random point as the first center. For each point, the minimum of its distances to the selected centers is maintained. While fewer than k points are selected, repeat: select the point p with the maximum distance to the existing centers; remove p from the not-yet-selected points and add it to the center points; for each remaining not-yet-selected point q, replace the distance stored for q by the minimum of its old value and the distance from p to q.
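
The procedure above might be sketched in Python like this. The function name furthest_first and the start and seed parameters are assumptions of this example; start only exists so that the first center can be fixed for reproducible output.

```python
import math
import random


def furthest_first(points, k, start=None, seed=0):
    """Pick k initial centers; `start` optionally fixes the index of the first one."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    first = rng.randrange(len(points)) if start is None else start
    selected = [first]
    # For each point: the minimum of its distances to the selected centers.
    min_dist = [math.dist(q, points[first]) for q in points]
    while len(selected) < k:
        # Select the point with the maximum distance to the existing centers.
        # Already selected points have distance 0 and are therefore not picked
        # again unless all remaining points coincide with a center.
        p = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(p)
        # Update the stored minimum distances with the distance to the new center.
        for q in range(len(points)):
            min_dist[q] = min(min_dist[q], math.dist(points[p], points[q]))
    return [points[i] for i in selected]


data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
print(furthest_first(data, k=3, start=0))
# [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
```

Maintaining the minimum distances means each new center costs only one pass over the points instead of a comparison against all previously selected centers.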

70-78 Furthest First Initialization: Visualization [step-by-step figures; slide 74: update the minimum distances]

79 Learnings of this Section. What is clustering? Basic idea for identifying good partitions into k clusters: selection of representative points, iterative refinement, local optimum. k-means variants [Forgy, 1965, Lloyd, 1982, MacQueen, 1967]. Different initialization methods.

81 Recall: Clustering as a Workflow

82 Similarities. Similarity (as given by some distance measure) is a central concept in data mining, e.g.: Clustering: group similar objects in the same cluster, separate dissimilar objects into different clusters. Outlier detection: identify objects that are dissimilar (by some characteristic) from most other objects. Defining a suitable distance measure is often crucial for deriving a meaningful solution in the data mining task: images, CAD objects, proteins, texts, ...

83-84 Spaces and Distance Functions

86 Categories of Feature Descriptors for Images. Distribution of colors. Texture. Shapes (contours). Many more.

87 Color Histogram. A histogram represents the distribution of colors over the pixels of an image. Definition of a color histogram: choose a color space (RGB, HSV, HLS, ...); choose the number of representatives (sample points) in the color space; possibly normalize (to account for different image sizes).
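
The three choices above can be illustrated with a small Python sketch. It assumes the RGB color space, a uniform grid with the same number of bins per channel as the representatives, and an image given as a flat list of (r, g, b) tuples with integer values from 0 to 255. The function name color_histogram and its parameters are invented for this example.

```python
def color_histogram(pixels, bins_per_channel=2, normalize=True):
    """Histogram over a uniform RGB grid with bins_per_channel**3 cells."""
    b = bins_per_channel
    hist = [0.0] * (b ** 3)
    for r, g, bl in pixels:
        # Map each 0..255 channel value to a bin index in 0..b-1.
        ri, gi, bi = r * b // 256, g * b // 256, bl * b // 256
        hist[(ri * b + gi) * b + bi] += 1
    if normalize and pixels:
        # Relative frequencies make images of different sizes comparable.
        total = len(pixels)
        hist = [h / total for h in hist]
    return hist


# Toy image (assumed): two red pixels, one blue pixel, one green pixel.
image = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 255, 0)]
print(color_histogram(image))
# [0.0, 0.25, 0.25, 0.0, 0.5, 0.0, 0.0, 0.0]
```

With more bins per channel, similar colors fall into different cells more often, so the granularity directly affects which images look similar under a histogram distance.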

88-95 Impact of Number of Representatives

96-98 Distances for Color Histograms

100 Your Choice of a Distance Measure. There are hundreds of distance functions [Deza and Deza, 2009]. For time series: DTW, EDR, ERP, LCSS, ... For texts: cosine and normalizations. For sets: based on intersection, union, ... (Jaccard). For clusters: single-link, average-link, etc. For histograms: histogram intersection, Earth mover's distance, quadratic forms with color similarity. For proteins: edit distance, structure, ... With normalization: Canberra, ... Quadratic forms / bilinear forms: $d(x, y) = x^T M y$ for some positive definite (usually symmetric) matrix $M$.
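
A few of the measures named above, sketched in plain Python. The function names are chosen for this example only. The bilinear form is computed as written on the slide; quadratic_form_distance applies it to the difference vector x - y, which is one common way to turn such a form into a distance and is an assumption of this sketch.

```python
import math


def minkowski(x, y, p=2):
    """L_p distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def cosine_distance(x, y):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norms


def jaccard_distance(a, b):
    """1 minus |intersection| / |union| for two sets."""
    a, b = set(a), set(b)
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def bilinear_form(x, M, y):
    """x^T M y for vectors x, y and a square matrix M given as a list of rows."""
    return sum(x[i] * M[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def quadratic_form_distance(x, y, M):
    """sqrt((x - y)^T M (x - y)); assumes M is positive definite."""
    diff = [a - b for a, b in zip(x, y)]
    return math.sqrt(bilinear_form(diff, M, diff))


identity = [[1.0, 0.0], [0.0, 1.0]]
print(minkowski((0, 0), (3, 4), p=1))                      # 7.0
print(minkowski((0, 0), (3, 4), p=2))                      # 5.0
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))              # 0.5
print(quadratic_form_distance((0, 0), (3, 4), identity))   # 5.0
```

With the identity matrix the quadratic form distance reduces to the Euclidean distance; other choices of M weight or correlate the dimensions.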

101 Learnings of this Section. Distances ($L_p$-norms, weighted, quadratic form). Color histograms as feature (vector) descriptors for images. Impact of the granularity of color histograms on similarity measures.
