Introduction to Computer Science


DM534 Introduction to Computer Science Clustering and Feature Spaces

Richard Roettger: About Me. Computer Science at the Technical University of Munich, with a thesis at the ICSI at the University of California, Berkeley; PhD at the Max Planck Institute for Computer Science in Saarbrücken; since 2014 Assistant Professor at SDU. Research interests: bioinformatics, machine learning, clustering, biological networks. Slides are taken from Arthur Zimek.

Clustering & Feature Spaces: Learning Objectives. Understand the problem of clustering in general; learn about k-means; understand the importance of feature spaces and object representation; understand the influence of distance functions.

Clustering & Feature Spaces: Lecture Content. Clustering: clustering in general; partitional clustering; visualization of algorithmic differences; summary. Feature Spaces: distances; features for images; summary.

Purpose of Clustering. Identify a finite number of categories (classes, groups: clusters) in a given dataset. Similar objects shall be grouped in the same cluster, dissimilar objects in different clusters; similarity is highly subjective and depends on the application scenario.

How Many Clusters? Image taken from: http://fromdatawithlove.thegovans.us/

How Many Clusters? Everitt, Brian S., et al.: Cluster Analysis, 5th edition (2011).

A Seemingly Simple Problem. Each dataset can be clustered in many meaningful ways; which one is appropriate is highly problem-dependent and not known by the algorithm a priori. Figure from Tan et al. [2006].

About Clustering. Clustering is the unsupervised machine-learning task of grouping or segmenting a collection of objects into subsets, or clusters, such that objects within each cluster are more closely related to one another than to objects assigned to different clusters. But what counts as related? For customers, for example: age? behavior? kinship? And how should outliers be treated? Clustering is an ill-posed problem: multiple solutions exist, and which one is best depends on the application.

Overview of a Cluster Analysis (workflow diagram)



Steps to Automatization: Cluster Criteria. Cohesion: how strongly are the objects of a cluster connected (how similar are they, pairwise, to each other)? Separation: how well is a cluster separated from the other clusters? In short: small within-cluster distances, large between-cluster distances. Many other criteria exist, e.g., areas of the same density; it is important to choose a criterion that fits the data!

Optimization. Partitional clustering algorithms partition a dataset into k clusters, typically by minimizing some cost function, with no overlaps and with every point assigned to a cluster (compactness criterion), i.e., they optimize cohesion.

Assumptions for Partitioning Clustering. Central assumptions for approaches in this family are typically: the number k of clusters is known (i.e., given as input); clusters are characterized by their compactness; compactness is measured by some distance function (e.g., the distance of all objects in a cluster from some cluster representative is minimal); the compactness criterion typically leads to convex or even spherically shaped clusters.

Construction of Central Points: Basics. Objects are points x = (x_1, ..., x_d) in the Euclidean vector space R^d; dist is the Euclidean distance (L_2); the centroid μ_C is the mean vector of all points in cluster C.

Cluster Criteria. Measure of compactness for a cluster (sum of squares): TD²(C) = Σ_{p ∈ C} dist(p, μ_C)². Measure of compactness for a clustering: TD²(C_1, ..., C_k) = Σ_{i=1}^{k} TD²(C_i).

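The compactness measure above can be computed directly; a minimal NumPy sketch (the function name `td2` is my own, not from the slides):

```python
import numpy as np

def td2(points, labels, centroids):
    """TD²(C_1, ..., C_k): sum over clusters of the squared Euclidean
    distances of each point to its cluster centroid."""
    total = 0.0
    for i, mu in enumerate(centroids):
        total += np.sum((points[labels == i] - mu) ** 2)
    return float(total)

# two clusters of two points each, centroids at their means
points = np.array([[0.0, 0], [0, 2], [10, 0], [10, 2]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1], [10, 1]])
print(td2(points, labels, centroids))  # 4.0
```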

Basic Algorithm: Clustering by Minimization of Variance [Forgy, 1965, Lloyd, 1982]. Start with k (e.g., randomly selected) points as cluster representatives (or with a random partition into k clusters). Repeat: 1. assign each point to the closest representative; 2. compute new representatives based on the given partitions (the centroid of the assigned points); until there is no change in assignment.
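A minimal NumPy sketch of this basic algorithm (names, the random-point initialization, and the convergence check are my own choices, not prescribed by the slides):

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # start with k randomly selected data points as representatives
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 1: assign each point to the closest representative
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no change in assignment
            break
        labels = new_labels
        # step 2: recompute each representative as the centroid of its points
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids

# two well-separated groups of three points each
X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans_lloyd(X, k=2)
```

On this toy data the algorithm recovers the two groups and their centroids regardless of which points are drawn as initial representatives.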

k-means. k-means [MacQueen, 1967] is a variant of the basic algorithm: a centroid is immediately updated when some point changes its assignment. k-means has very similar properties, but the result now depends on the order of the data points in the input file. Note: the name k-means is often used indiscriminately for any variant of the basic algorithm, in particular also for the algorithm shown before [Forgy, 1965, Lloyd, 1982].
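For comparison, a sketch of a MacQueen-style variant in which the two affected centroids are recomputed immediately whenever a point changes its assignment; seeding with the first k data points makes the order dependence explicit (the implementation details here are my own assumptions, not taken from the slides):

```python
import numpy as np

def kmeans_macqueen(X, k, max_passes=50):
    # the first k data points serve as initial representatives, which is
    # one reason the result depends on the order of the input points
    centroids = X[:k].astype(float)
    labels = np.array([int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
                       for x in X])
    for i in range(k):
        if np.any(labels == i):
            centroids[i] = X[labels == i].mean(axis=0)
    for _ in range(max_passes):
        changed = False
        for j, x in enumerate(X):
            nearest = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
            old = labels[j]
            if nearest != old:
                labels[j] = nearest
                changed = True
                # immediate update: recompute the two affected centroids
                # right away instead of at the end of the pass
                for i in (old, nearest):
                    if np.any(labels == i):
                        centroids[i] = X[labels == i].mean(axis=0)
        if not changed:
            break
    return labels, centroids

X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans_macqueen(X, k=2)
```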


k-means Clustering: Lloyd/Forgy Algorithm (step-by-step visualization of the alternating assignment and centroid-update steps)

k-means Clustering: MacQueen Algorithm (step-by-step visualization)

k-means Clustering: MacQueen Algorithm with an alternative ordering of the input points (step-by-step visualization)

k-means Clustering: Quality (visualization comparing the resulting clusterings)


k-means: Pros. Efficient: O(k·n) per iteration, and the number of iterations is usually on the order of 10. Easy to implement, and therefore very popular. Only one parameter; easy to understand. Well understood and researched. Different variants exist: fuzzy clustering, variants without an n-dimensional embedding.

k-means: Disadvantages. k-means converges towards a local minimum. k-means (the MacQueen variant) is order-dependent. It deteriorates with noise and outliers (all points are used to compute centroids). Clusters need to be convex and of (more or less) equal extension. The number k of clusters is hard to determine. Strong dependency on the initial partition (in result quality as well as runtime).


What to do? How can we tackle the initialization problem? Repeated runs; furthest-first initialization; subset furthest-first initialization.

Insertion: Furthest-First Initialization. Select a random point as the start. For each point, maintain the minimum of its distances to the selected centers. While fewer than k points are selected, repeat: select the point p with the maximum distance to the existing centers; remove p from the not-yet-selected points and add it to the center points; for each remaining not-yet-selected point q, replace the distance stored for q by the minimum of its old value and the distance from p to q.
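The steps above can be sketched as follows (the function name and the use of NumPy are my own choices):

```python
import numpy as np

def furthest_first(X, k, seed=0):
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]          # random start point
    # minimum distance of every point to the centers selected so far
    min_dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < k:
        p = int(np.argmax(min_dist))              # furthest from all centers
        chosen.append(p)
        # keep the minimum of the old value and the distance to p
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[p], axis=1))
    return X[chosen]

# two pairs of points at opposite ends of a line: whichever point is the
# random start, the second center comes from the opposite end
X = np.array([[0.0, 0], [1, 0], [9, 0], [10, 0]])
centers = furthest_first(X, k=2)
```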

Furthest First Initialization: Visualization (step-by-step; after each newly selected center, the minimum distances are updated)

Learnings of this Section. What is clustering? The basic idea for identifying good partitions into k clusters: selection of representative points, iterative refinement, local optimum. k-means variants [Forgy, 1965, Lloyd, 1982, MacQueen, 1967]. Different initialization methods.


Recall: Clustering as a Workflow

Similarities. Similarity (as given by some distance measure) is a central concept in data mining, e.g.: clustering: group similar objects in the same cluster, separate dissimilar objects into different clusters; outlier detection: identify objects that are dissimilar (by some characteristic) from most other objects. The definition of a suitable distance measure is often crucial for deriving a meaningful solution in the data mining task, whether the objects are images, CAD objects, proteins, texts, ...

Spaces and Distance Functions
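One family of distance functions that appears in the section summary are the L_p (Minkowski) norms; a minimal sketch:

```python
import numpy as np

def minkowski(x, y, p=2):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p).
    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p))

print(minkowski([0, 0], [3, 4], p=2))  # 5.0
print(minkowski([0, 0], [3, 4], p=1))  # 7.0
```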


Categories of Feature Descriptors for Images: distribution of colors, texture, shapes (contours), and many more.

Color Histogram. A histogram represents the distribution of colors over the pixels of an image. Definition of a color histogram: choose a color space (RGB, HSV, HLS, ...); choose the number of representants (sample points) in the color space; possibly normalize (to account for different image sizes).
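A minimal sketch of such a histogram for the RGB color space, using equal-width quantization per channel and size normalization (this binning scheme is one plausible choice, not necessarily the one used in the lecture):

```python
import numpy as np

def color_histogram(pixels, bins_per_channel=4):
    """Quantize each RGB channel (0-255) into equal-width bins, count the
    pixels per (r, g, b) bin, and normalize so the counts sum to 1."""
    pixels = np.asarray(pixels)
    # map each channel value to a bin index 0 .. bins_per_channel - 1
    idx = np.minimum(pixels * bins_per_channel // 256, bins_per_channel - 1)
    flat = (idx[:, 0] * bins_per_channel + idx[:, 1]) * bins_per_channel + idx[:, 2]
    hist = np.bincount(flat, minlength=bins_per_channel ** 3)
    return hist / hist.sum()  # normalization: independent of image size

# two reddish pixels and one blue pixel, 2 bins per channel (8 bins total)
h = color_histogram([[255, 0, 0], [250, 5, 5], [0, 0, 255]], bins_per_channel=2)
```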

Impact of the Number of Representants (visualization: the same image represented by color histograms of increasing granularity)

Distances for Color Histograms (visualization)
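Histogram intersection, one of the measures for histograms named below, can be turned into a distance for normalized histograms; a minimal sketch:

```python
import numpy as np

def histogram_intersection_distance(h, g):
    # for histograms normalized to sum 1, this distance lies in [0, 1]:
    # identical histograms give 0, disjoint histograms give 1
    return 1.0 - float(np.sum(np.minimum(h, g)))

h = np.array([0.5, 0.3, 0.2])
g = np.array([0.4, 0.4, 0.2])
d = histogram_intersection_distance(h, g)  # ≈ 0.1
```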


Your Choice of a Distance Measure. There are hundreds of distance functions [Deza and Deza, 2009]. For time series: DTW, EDR, ERP, LCSS, ... For texts: cosine distance and normalizations. For sets: measures based on intersection, union, ... (Jaccard). For clusters: single-link, average-link, etc. For histograms: histogram intersection, earth mover's distance, quadratic forms with color similarity. For proteins: edit distance, structure. With normalization: Canberra. Quadratic / bilinear forms: d(x, y) = sqrt((x − y)^T M (x − y)) for some positive definite (usually symmetric) matrix M.
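A quadratic-form distance is commonly defined as d(x, y) = sqrt((x − y)^T M (x − y)) for a positive definite matrix M; a minimal sketch (with M the identity, it reduces to the Euclidean distance):

```python
import numpy as np

def quadratic_form_distance(x, y, M):
    """d(x, y) = sqrt((x - y)^T M (x - y)) for a positive definite M;
    M = identity recovers the plain Euclidean (L_2) distance."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ M @ d))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(quadratic_form_distance(x, y, np.eye(2)))  # 5.0
```

A non-identity M lets bins that represent similar colors contribute to each other's similarity, which is why this form is used for color histograms.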

Learnings of this Section. Distances (L_p norms, weighted distances, quadratic forms). Color histograms as feature (vector) descriptors for images. Impact of the granularity of color histograms on similarity measures.