Cluster Validation. Ke Chen. Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011] COMP24111 Machine Learning

Size: px
Start display at page:

Download "Cluster Validation. Ke Chen. Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011] COMP24111 Machine Learning"

Transcription

1 Cluster Validation Ke Chen Reading: [5.., KPM], [Wang et al., 9], [Yang & Chen, ] COMP4 Machine Learning

2 Outline Motivation and Background Internal index Motivation and general ideas Variance-based internal indexes Application: finding the proper cluster number External index Motivation and general ideas Rand Index Application: Weighted clustering ensemble Summary COMP4 Machine Learning

3 Motivation and Background Motivation Supervised classification Class labels known for ground truth Accuracy: based on labels given Clustering analysis No class labels Evaluation still demanded Validation needs to Oranges: Apples: Evaluation Accuracy = 8/= 8% P Compare clustering algorithms Solve the number of clusters Avoid finding patterns in noise Find the best clusters from data COMP4 Machine Learning 3

4 Motivation and Background Illustrative Example: which one is the best? Data Set (Random Points) K-means (K=3) y y x K-means (K=3) y y x Agglomerative (Complete Link) x COMP4 Machine Learning x 4

5 Motivation and Background Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. How to be quantitative : To employ the measures. How to be objective : To validate the measures! INPUT: DataSet(X) Clustering Algorithm(s) Partitions P Validity Index m* Different settings/configurations COMP4 Machine Learning 5

6 Motivation and Background Internal Criteria (Indexes) Validate without external information With different number of clusters Solve the number of clusters?? External Criteria (Indexes) Validate against ground truth Compare two partitions: (how similar)?? COMP4 Machine Learning 6

7 Internal Index Ground truth is unavailable but unsupervised validation must be done with common sense or a priori knowledge. There are a variety of internal indexes: Variances-based methods Rate-distortion methods Davies-Bouldin index (DBI) Bayesian Information Criterion (BIC) Silhouette Coefficient Minimum description principle (MDL) Stochastic complexity (SC) Modified Huber s Г (MHГ) index COMP4 Machine Learning 7

8 Internal Index Variances-based methods Minimise within cluster variance (SSW) Maximise between cluster variance (SSB) Intra-cluster variance is minimized Inter-cluster variance is maximized c COMP4 Machine Learning 8

9 Internal Index Variances-based methods (cont.) Assume an algorithm leads to a partition of K clusters where cluster i has nn ii data points and ci is its centroid. d(.,.) is a distance used in this algorithm. Within cluster variance (SSW) SSW K n ( K) = d ( x, c i i= j= ij i ) Between cluster variance (SSB) SSB K = n d ( ) i ( ci, c) where c is the global mean (centroid) of the whole data set. K i= COMP4 Machine Learning 9

10 Internal Index Variance based F-ratio index Measures ratio of between-cluster variance against the withincluster variance (original F-test) F-ratio index (W-B index) for a partition of K clusters F( K) where x n K * SSW ( K) SSB( K) K i= j= = = K ij i is the K i= n i n d i d ( x ( c i ij, c, c) jth data point in cluster is the number of data points in cluster i ) c i c i COMP4 Machine Learning

11 Internal Index Application: finding the proper cluster number F-ratio (x^5) S minimum Algorithm PNN Algorithm IS Clusters COMP4 Machine Learning

12 .5.4 S4 Internal Index Application: finding the proper cluster number F-ratio.3.. minimum at 6 Algorithm PNN Algorithm IS Number of clusters COMP4 Machine Learning minimum at 5

13 External Index Ground truth is available but an clustering algorithm doesn t use such information during unsupervised learning. There are a variety of internal indexes: Rand Index Adjusted Rand Index Pair counting index Information theoretic index Set matching index DVI index Normalised mutual information (NMI) index COMP4 Machine Learning 3

14 External Index Main issues If the ground truth is known, the validity of a clustering can be verified by comparing the class or clustering labels. However, this is much more complicated than in supervised classification (where labels used in training) The cluster IDs in a partition resulting from clustering have been assigned arbitrarily due to unsupervised learning permutation. The number of clusters may be different from the number of classes (the ground truth ) inconsistence. The most important problem in external indexes would be how to find all possible correspondences between the ground truth and a partition (or two candidate partitions in the case of comparison). COMP4 Machine Learning 4

15 External Index Rand Index This is the first external index proposed by Rand (97) to address the correspondence problem. Basic idea: considering all pairs in the data set by looking into both agreement and disagreement against the ground truth The index defined as RI (X, Y)= (a+d)/(a+b+c+d) X/Y Y: Pairs in the same class Y: Pairs in different classes X: Pairs in the same cluster a b X: Pairs in different clusters c d COMP4 Machine Learning 5

16 External Index Rand Index (cont.) Example: for a 5-point data set, we have X i ii ii i i Y q p q p q To calculate a, b, c and d, we have to list all possible pairs (excluding any data point to itself) [, ], [, 3], [, 4], [, 5] [, 3], [, 4], [, 5] [3, 4], [3, 5] [4, 5] COMP4 Machine Learning 6

17 External Index Rand Index (cont.) Example: for a 5-point data set, we have X i ii ii i i Y q p q p q Initialisation: a, b, c, d For data point pair [, ], we have X: [, ] assigned to (i, ii) > in different clusters Y: [, ] labelled by (q, p) > in different classes Thus, d d + = COMP4 Machine Learning 7

18 External Index Rand Index (cont.) Example: for a 5-point data set, we have X i ii ii i i Y q p q p q Current Status: a=, b=, c=, d = For data point pair [, 3], we have X: [, 3] assigned to (i, ii) > in different clusters Y: [, 3] labelled by (q, q) > in the same class Thus, c c + = COMP4 Machine Learning 8

19 External Index Rand Index (cont.) Example: for a 5-point data set, we have X i ii ii i i Y q p q p q Current Status: a=, b=, c=, d = For data point pair [, 4], we have X: [, 4] assigned to (i, i) > in the same cluster Y: [, 4] labelled by (q, p) > in different classes Thus, b b + = COMP4 Machine Learning 9

20 External Index Rand Index (cont.) Example: for a 5-point data set, we have X i ii ii i i Y q p q p q Current Status: a=, b=, c=, d = For data point pair [, 5], we have X: [, 5] assigned to (i, i) > in the same cluster Y: [, 5] labelled by (q, q) > in the same class Thus, a a + = COMP4 Machine Learning

21 External Index Rand Index (cont.) Example: for a 5-point data set, we have X i ii ii i i Y q p q p q Current Status: a=, b=, c=, d = On-class Exercise: continuing until have the final value of a, b, c and d. COMP4 Machine Learning

22 External Index Rand Index: Contingency Table In general, a, b, c and d are calculated from a contingency table Assume there are k clusters in X and l classes in Y nn iiii : the number of points in both cluster ii and class jj COMP4 Machine Learning

23 External Index Rand Index: Contingency Table (cont.) Example: for a 5-point data set, we have X i ii ii i i Y q p q p q Contingency table p q i 3 ii 3 5 COMP4 Machine Learning 3

24 External Index Rand Index: measure the number of pairs in Same cluster/class in X and Y a = k l i= j= n ij ( n ij ) Same cluster in X /different classes in Y b = ( l n j n ij ). j= i= j= Different clusters in X / same class in Y c = ( k Different clusters/classes in X and Y d = k n i n ij ). i= i= j= ( N + k k l l l i= j= n ij ( k i= n i. + l j= n. j )) Ex. 4 COMP4 Machine Learning

25 Weighted Clustering Ensemble Motivation Evidence-accumulation based clustering ensemble (Fred & Jain, 5) is a simple clustering ensemble algorithm that use the evidence accumulated from multiple yet diversified partitions generated with different algorithms, initial conditions, different distance metrics, However, this clustering ensemble algorithm is subject to limitations: Don t distinguish between non-trivial and trivial partitions; i.e., all partitions used in a clustering ensemble are treated equally important. Sensitive to a cluster distance used in the hierarchical clustering for reaching a consensus Clustering validity indexes provide an effective way to measure non-trivialness or importance of a partition Internal index: directly measure the importance of a partition quantitatively and objectively External index: indirectly measure the importance of a partition via comparing with other partitions used in clustering ensemble for another round evidence accumulation COMP4 Machine Learning 5

26 Weighted Clustering Ensemble Use validity index values as weights Yun Yang and Ke Chen, Temporal data clustering via weighted clustering ensemble with different representations, IEEE Transactions on Knowledge and Data Engineering 3(), pp. 37-3,. Chen & Yang: Temporal Data Clustering with Different Representations COMP4 Machine Learning 6

27 Weighted Clustering Ensemble Example: convert clustering results into binary Distance matrix A B C D Cluster (C) Cluster (C) = D A D C B D C A B distance Matrix 7 Chen & Yang: Temporal Data Clustering with Different Representations

28 Weighted Clustering Ensemble Example: convert clustering results into binary Distance matrix A B C D Cluster (C) Cluster (C) = D A D C B D C A B distance Matrix 8 Chen & Yang: Temporal Data Clustering with Different Representations Cluster 3 (C3)

29 Weighted Clustering Ensemble Evidence accumulation: form the collective weighted distance matrix 9 Chen & Yang: Temporal Data Clustering with Different Representations = D = D ] ) ( ) [( 3 D w w w D w w w D N D M N D M E =

30 Weighted Clustering Ensemble Clustering analysis with our weighted clustering ensemble on CAVIAR database annotated video sequences of pedestrians, a set of high-quality moving trajectories clustering analysis of trajectories is useful for many applications Experimental setting Representations: PCF, DCF, PLS and PDWT Initial clustering analysis: K-mean algorithm (4<K<), 6 initial settings Ensemble: 3 partitions totally (8 partitions/representation) Chen & Yang: Temporal Data Clustering with Different Representations 3

31 Weighted Clustering Ensemble Clustering results on CAVIAR database Chen & Yang: Temporal Data Clustering with Different Representations 3

32 Weighted Clustering Ensemble Application: UCR time series benchmarks COMP4 Machine Learning 3

33 Weighted Clustering Ensemble Application: Rand index (%) values of clustering ensembles (Yang & Chen ) COMP4 Machine Learning 33

34 Summary Cluster validation is a process that evaluate clustering results with a pre-defined criterion. Two different types of cluster validation methods Internal indexes no ground truth available defined based on common sense or a priori knowledge Application: finding the proper number of clusters, External indexes ground truth known or reference given ( relative index ) Application: performance evaluation of clustering with reference information Apart from direct evaluation, both indexes may be applied in weighted clustering ensemble, leading to better yet robust results by ignoring trivial partitions. K. Wang et al, CVAP: Validation for cluster analysis, Data Science Journal, vol. 8, May 9. [Code online available: COMP4 Machine Learning 34

Hierarchical and Ensemble Clustering

Hierarchical and Ensemble Clustering Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu November 7, 2017 Learnt Clustering Methods Vector Data Set Data Sequence Data Text

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Consensus Clustering. Javier Béjar URL - Spring 2019 CS - MAI

Consensus Clustering. Javier Béjar URL - Spring 2019 CS - MAI Consensus Clustering Javier Béjar URL - Spring 2019 CS - MAI Consensus Clustering The ensemble of classifiers is a well established strategy in supervised learning Unsupervised learning aims the same goal:

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London What Is Text Clustering? Text Clustering = Grouping a set of documents into classes of similar

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Wiltrud Kessler & Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 0-- / 83

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017 Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:

More information

THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION

THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION Helena Aidos, Robert P.W. Duin and Ana Fred Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal Pattern Recognition

More information

Chapter 6 Continued: Partitioning Methods

Chapter 6 Continued: Partitioning Methods Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal

More information

Data Clustering. Danushka Bollegala

Data Clustering. Danushka Bollegala Data Clustering Danushka Bollegala Outline Why cluster data? Clustering as unsupervised learning Clustering algorithms k-means, k-medoids agglomerative clustering Brown s clustering Spectral clustering

More information

Alternative Clusterings: Current Progress and Open Challenges

Alternative Clusterings: Current Progress and Open Challenges Alternative Clusterings: Current Progress and Open Challenges James Bailey Department of Computer Science and Software Engineering The University of Melbourne, Australia 1 Introduction Cluster analysis:

More information

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 PV: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv IIR 6: Flat Clustering Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center

More information

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for

More information

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Clustering Algorithms for general similarity measures

Clustering Algorithms for general similarity measures Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Binary Images Clustering with k-means

Binary Images Clustering with k-means Binary Images Clustering with k-means João Ferreira Nunes Instituto Politécnico de Viana do Castelo joao.nunes@estg.ipvc.pt Faculdade de Engenharia da Universidade do Porto pro09001@fe.up.pt Presentation

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis

More information

Clustering Analysis Basics

Clustering Analysis Basics Clustering Analysis Basics Ke Chen Reading: [Ch. 7, EA], [5., KPM] Outline Introduction Data Types and Representations Distance Measures Major Clustering Methodologies Summary Introduction Cluster: A collection/group

More information

THE ENSEMBLE CONCEPTUAL CLUSTERING OF SYMBOLIC DATA FOR CUSTOMER LOYALTY ANALYSIS

THE ENSEMBLE CONCEPTUAL CLUSTERING OF SYMBOLIC DATA FOR CUSTOMER LOYALTY ANALYSIS THE ENSEMBLE CONCEPTUAL CLUSTERING OF SYMBOLIC DATA FOR CUSTOMER LOYALTY ANALYSIS Marcin Pełka 1 1 Wroclaw University of Economics, Faculty of Economics, Management and Tourism, Department of Econometrics

More information

Clustering Lecture 3: Hierarchical Methods

Clustering Lecture 3: Hierarchical Methods Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

Big Data Analytics! Special Topics for Computer Science CSE CSE Feb 9

Big Data Analytics! Special Topics for Computer Science CSE CSE Feb 9 Big Data Analytics! Special Topics for Computer Science CSE 4095-001 CSE 5095-005! Feb 9 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Clustering I What

More information

Distances, Clustering! Rafael Irizarry!

Distances, Clustering! Rafael Irizarry! Distances, Clustering! Rafael Irizarry! Heatmaps! Distance! Clustering organizes things that are close into groups! What does it mean for two genes to be close?! What does it mean for two samples to

More information

Tri-modal Human Body Segmentation

Tri-modal Human Body Segmentation Tri-modal Human Body Segmentation Master of Science Thesis Cristina Palmero Cantariño Advisor: Sergio Escalera Guerrero February 6, 2014 Outline 1 Introduction 2 Tri-modal dataset 3 Proposed baseline 4

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology Learning the Three Factors of a Non-overlapping Multi-camera Network Topology Xiaotang Chen, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Extracting Layers and Recognizing Features for Automatic Map Understanding. Yao-Yi Chiang

Extracting Layers and Recognizing Features for Automatic Map Understanding. Yao-Yi Chiang Extracting Layers and Recognizing Features for Automatic Map Understanding Yao-Yi Chiang 0 Outline Introduction/ Problem Motivation Map Processing Overview Map Decomposition Feature Recognition Discussion

More information

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Exploratory Analysis: Clustering

Exploratory Analysis: Clustering Exploratory Analysis: Clustering (some material taken or adapted from slides by Hinrich Schutze) Heejun Kim June 26, 2018 Clustering objective Grouping documents or instances into subsets or clusters Documents

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Bioimage Informatics

Bioimage Informatics Bioimage Informatics Lecture 13, Spring 2012 Bioimage Data Analysis (IV) Image Segmentation (part 2) Lecture 13 February 29, 2012 1 Outline Review: Steger s line/curve detection algorithm Intensity thresholding

More information

Accumulation. Instituto Superior Técnico / Instituto de Telecomunicações. Av. Rovisco Pais, Lisboa, Portugal.

Accumulation. Instituto Superior Técnico / Instituto de Telecomunicações. Av. Rovisco Pais, Lisboa, Portugal. Combining Multiple Clusterings Using Evidence Accumulation Ana L.N. Fred and Anil K. Jain + Instituto Superior Técnico / Instituto de Telecomunicações Av. Rovisco Pais, 149-1 Lisboa, Portugal email: afred@lx.it.pt

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 16: Flat Clustering Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2009.06.16 1/ 64 Overview

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Cluster Analysis (b) Lijun Zhang

Cluster Analysis (b) Lijun Zhang Cluster Analysis (b) Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2 161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Algorithm Engineering Applied To Graph Clustering

Algorithm Engineering Applied To Graph Clustering Algorithm Engineering Applied To Graph Clustering Insights and Open Questions in Designing Experimental Evaluations Marco 1 Workshop on Communities in Networks 14. March, 2008 Louvain-la-Neuve Outline

More information

http://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Hinrich Schütze Center for Information and Language Processing, University of Munich 04-06- /86 Overview Recap

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

DETECTING RESOLVERS AT.NZ. Jing Qiao, Sebastian Castro DNS-OARC 29 Amsterdam, October 2018

DETECTING RESOLVERS AT.NZ. Jing Qiao, Sebastian Castro DNS-OARC 29 Amsterdam, October 2018 DETECTING RESOLVERS AT.NZ Jing Qiao, Sebastian Castro DNS-OARC 29 Amsterdam, October 2018 BACKGROUND DNS-OARC 29 2 DNS TRAFFIC IS NOISY Despite general belief, not all the sources at auth nameserver are

More information

IDENTIFYING BEHAVIORS IN CROWDED SCENES THROUGH STABILITY ANALYSIS FOR DYNAMICAL SYSTEMS

IDENTIFYING BEHAVIORS IN CROWDED SCENES THROUGH STABILITY ANALYSIS FOR DYNAMICAL SYSTEMS IDENTIFYING BEHAVIORS IN CROWDED SCENES THROUGH STABILITY ANALYSIS FOR DYNAMICAL SYSTEMS Berkan Solmaz, Dr. Brian Moore and Dr. Mubarak Shah 1/21 Outline Behaviors to be Identified Theoretical Concepts

More information

Scalable Clustering of Signed Networks Using Balance Normalized Cut

Scalable Clustering of Signed Networks Using Balance Normalized Cut Scalable Clustering of Signed Networks Using Balance Normalized Cut Kai-Yang Chiang,, Inderjit S. Dhillon The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) Oct.

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Kinds of Clustering Sequential Fast Cost Optimization Fixed number of clusters Hierarchical

More information

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

Clustering CE-324: Modern Information Retrieval Sharif University of Technology Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch. 16 What

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Improving Cluster Method Quality by Validity Indices

Improving Cluster Method Quality by Validity Indices Improving Cluster Method Quality by Validity Indices N. Hachani and H. Ounalli Faculty of Sciences of Bizerte, Tunisia narjes hachani@yahoo.fr Faculty of Sciences of Tunis, Tunisia habib.ounalli@fst.rnu.tn

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

Chapter 9. Classification and Clustering

Chapter 9. Classification and Clustering Chapter 9 Classification and Clustering Classification and Clustering Classification and clustering are classical pattern recognition and machine learning problems Classification, also referred to as categorization

More information

Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme

Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Why do we need to find similarity? Similarity underlies many data science methods and solutions to business problems. Some

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

High throughput Data Analysis 2. Cluster Analysis

High throughput Data Analysis 2. Cluster Analysis High throughput Data Analysis 2 Cluster Analysis Overview Why clustering? Hierarchical clustering K means clustering Issues with above two Other methods Quality of clustering results Introduction WHY DO

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

ECM A Novel On-line, Evolving Clustering Method and Its Applications

ECM A Novel On-line, Evolving Clustering Method and Its Applications ECM A Novel On-line, Evolving Clustering Method and Its Applications Qun Song 1 and Nikola Kasabov 2 1, 2 Department of Information Science, University of Otago P.O Box 56, Dunedin, New Zealand (E-mail:

More information

Robust Clustering of PET data

Robust Clustering of PET data Robust Clustering of PET data Prasanna K Velamuru Dr. Hongbin Guo (Postdoctoral Researcher), Dr. Rosemary Renaut (Director, Computational Biosciences Program,) Dr. Kewei Chen (Banner Health PET Center,

More information

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski 10. Clustering Introduction to Bioinformatics 30.9.2008 Jarkko Salojärvi Based on lecture slides by Samuel Kaski Definition of a cluster Typically either 1. A group of mutually similar samples, or 2. A

More information

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot Foundations of Machine Learning CentraleSupélec Fall 2017 12. Clustering Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning objectives

More information

Community Detection: Comparison of State of the Art Algorithms

Community Detection: Comparison of State of the Art Algorithms Community Detection: Comparison of State of the Art Algorithms Josiane Mothe IRIT, UMR5505 CNRS & ESPE, Univ. de Toulouse Toulouse, France e-mail: josiane.mothe@irit.fr Karen Mkhitaryan Institute for Informatics

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Fabio G. Cozman - fgcozman@usp.br November 16, 2018 What can we do? We just have a dataset with features (no labels, no response). We want to understand the data... no easy to define

More information

Machine Learning: Symbol-based

Machine Learning: Symbol-based 10d Machine Learning: Symbol-based More clustering examples 10.5 Knowledge and Learning 10.6 Unsupervised Learning 10.7 Reinforcement Learning 10.8 Epilogue and References 10.9 Exercises Additional references

More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning A review of clustering and other exploratory data analysis methods HST.951J: Medical Decision Support Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision

More information

Relative Clustering Validity Criteria: A Comparative Overview

Relative Clustering Validity Criteria: A Comparative Overview Relative Clustering Validity Criteria: A Comparative Overview Lucas Vendramin, Ricardo J. G. B. Campello and Eduardo R. Hruschka Department of Computer Sciences of the University of São Paulo at São Carlos,

More information