Unsupervised Learning

Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Spring 2005
Unsupervised Learning: An overview of clustering and other exploratory data analysis methods
Lucila Ohno-Machado

A few synonyms
Agminatics, Aciniformics, Q-analysis, Botryology, Systematics, Taximetrics, Clumping, Morphometrics, Nosography, Nosology, Numerical taxonomy, Typology, Clustering
A multidimensional space needs to be reduced.

Supervised Models
[Figure: numeric values for Case 1, Case 2, ... with attributes such as age and test1; layout not recoverable.]
We are chasing PARTICULAR patterns in the data.
Evaluate against a gold standard.
Using these we predict probability of diagnosis, prognosis.

Unsupervised Models
[Figure: numeric values for Case 1, Case 2, ... with attributes such as age and test1, plus a Cluster label (1, 2 or 3) per case; layout not recoverable.]
We are chasing ANY pattern in the data.
We will need to interpret (label) the pattern.
Using these we put cases into clusters.

Exploratory Data Analysis
Goal is to flatten the dimensions of data to the spaces that we are familiar with (2-D and 3-D).
We can see the data in these dimensions and extract patterns.
We are looking for clusters of data with similar characteristics overall.
Hypothesis generation versus hypothesis testing.
Fishing expedition versus confirmatory analysis.

Outline
Proximity: Distance Metrics, Similarity Measures
Clustering: Hierarchical Clustering (Agglomerative), K-means
Multidimensional Scaling

Spatial relations
Distance and dissimilarity, e.g. Euclidean distance, perceived difference
Proximity and similarity measures, e.g. correlation coefficient
Distance matrix:

            House   Harvard   MIT   BWH
  House       -
  Harvard    15        -
  MIT        18        4       -
  BWH        10        3       5     -

Unsupervised Learning
Raw Data
Distance or Similarity Matrix
Dimensionality Reduction: MDS
Clustering: Hierarchical, Non-hierarchical
Graphical Representation
Validation: Internal, External

Algorithms, (dis)similarity measures, and graphical representations
Most algorithms are not necessarily linked to a particular metric or (dis)similarity measure.
They are also not necessarily linked to a particular graphical representation.
Cluster techniques were popular in the 1950s and 1960s (psychology experiments).
There has been recent interest in biomedicine because of the emergence of high-throughput technologies.
Old algorithms have been rediscovered and renamed.

Metrics (distances)

Minkowski r-metric for K-dimensional data, between points i and j:
Euclidean: $d_{ij} = \left[ \sum_{k=1}^{K} (x_{ik} - x_{jk})^2 \right]^{1/2}$
Manhattan (city-block): $d_{ij} = \sum_{k=1}^{K} |x_{ik} - x_{jk}|$
Generalized: $d_{ij} = \left[ \sum_{k=1}^{K} |x_{ik} - x_{jk}|^r \right]^{1/r}$
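
As a minimal sketch (not part of the original slides), the Minkowski r-metric can be written as one function, with the Euclidean (r = 2) and Manhattan (r = 1) distances as special cases. The function names and the sample points are illustrative assumptions.

```python
def minkowski(x, y, r):
    """Minkowski r-metric between two K-dimensional points (r >= 1)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def euclidean(x, y):
    """Special case r = 2."""
    return minkowski(x, y, 2)


def manhattan(x, y):
    """Special case r = 1 (city-block)."""
    return minkowski(x, y, 1)


# Hypothetical 2-dimensional points, chosen only for illustration.
x_i = [0.0, 0.0]
x_j = [3.0, 4.0]
print(euclidean(x_i, x_j))  # 5.0
print(manhattan(x_i, x_j))  # 7.0
```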

Metric spaces
Positivity and reflexivity: $d_{ij} > d_{ii} = 0$ for $i \neq j$
Symmetry: $d_{ij} = d_{ji}$
Triangle inequality: $d_{ij} \le d_{ih} + d_{hj}$
[Figure: points i, j and h.]
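
The metric-space conditions can be checked mechanically on a square distance matrix. The sketch below is an illustration only: the function name `is_metric`, the tolerance, and both example matrices are assumptions and do not come from the lecture.

```python
def is_metric(d, tol=1e-12):
    """Return True if the square matrix d satisfies the metric conditions."""
    n = len(d)
    for i in range(n):
        # Reflexivity: d_ii = 0
        if abs(d[i][i]) > tol:
            return False
        for j in range(n):
            # Positivity: d_ij > 0 for distinct points
            if i != j and d[i][j] <= 0:
                return False
            # Symmetry: d_ij = d_ji
            if abs(d[i][j] - d[j][i]) > tol:
                return False
            # Triangle inequality: d_ij <= d_ih + d_hj for every h
            for h in range(n):
                if d[i][j] > d[i][h] + d[h][j] + tol:
                    return False
    return True


# Hypothetical matrices: the first is a 3-4-5 triangle, the second
# breaks the triangle inequality (5 > 1 + 1).
good = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
print(is_metric(good))  # True
print(is_metric(bad))   # False
```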

More metrics
Ultrametric: $d_{ij} \le \max[d_{ih}, d_{hj}]$ replaces $d_{ij} \le d_{ih} + d_{hj}$
Four-point additive condition: $d_{hi} + d_{jk} \le \max[(d_{hj} + d_{ik}), (d_{hk} + d_{ij})]$ replaces $d_{ij} \le d_{ih} + d_{hj}$
[Figure: points i, j and h.]
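
Both stronger conditions can be tested by brute force over all index combinations. This is a sketch under assumptions: the function names and the example matrices (an ultrametric triple and four points on a line) are hypothetical and chosen only to show that a matrix can satisfy the four-point condition without being ultrametric.

```python
from itertools import product


def is_ultrametric(d, tol=1e-12):
    """d_ij <= max(d_ih, d_hj) for all i, j, h."""
    n = len(d)
    return all(d[i][j] <= max(d[i][h], d[h][j]) + tol
               for i, j, h in product(range(n), repeat=3))


def is_additive(d, tol=1e-12):
    """Four-point condition: d_hi + d_jk <= max(d_hj + d_ik, d_hk + d_ij)."""
    n = len(d)
    return all(d[h][i] + d[j][k]
               <= max(d[h][j] + d[i][k], d[h][k] + d[i][j]) + tol
               for h, i, j, k in product(range(n), repeat=4))


# Hypothetical ultrametric: the two largest distances in the triple are equal.
ultra = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]

# Hypothetical points on a line: a path is a tree, so the distances are additive.
positions = [0, 1, 3, 6]
line = [[abs(a - b) for b in positions] for a in positions]

print(is_ultrametric(ultra), is_additive(ultra))  # True True
print(is_ultrametric(line), is_additive(line))    # False True
```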

Similarity measures
Similarity function: $s(i, j) = \frac{i^t j}{\|i\| \, \|j\|}$
For binary attributes, $i^t j$ counts the shared attributes.
Example: $i^t = [1,0,1]$, $j^t = [0,0,1]$ gives $s(i, j) = \frac{1}{\sqrt{2} \times 1}$

Variations
Fraction of the d attributes shared: $s(i, j) = \frac{i^t j}{d}$
Tanimoto coefficient: $s(i, j) = \frac{i^t j}{i^t i + j^t j - i^t j}$
Example: $i^t = [1,0,1]$, $j^t = [0,0,1]$ gives $s(i, j) = \frac{1}{2 + 1 - 1}$
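
The similarity function and its two variations can be sketched on the binary example vectors i = [1,0,1] and j = [0,0,1]. The sketch reads the basic similarity function as the inner product normalized by the vector lengths, which is an assumption inferred from the worked value on the slide; the function names are also illustrative.

```python
import math


def dot(i, j):
    """Inner product; for binary vectors this counts shared attributes."""
    return sum(a * b for a, b in zip(i, j))


def normalized_similarity(i, j):
    """Inner product divided by the product of the vector lengths (assumed form)."""
    return dot(i, j) / (math.sqrt(dot(i, i)) * math.sqrt(dot(j, j)))


def fraction_shared(i, j):
    """Shared attributes as a fraction of all d attributes."""
    return dot(i, j) / len(i)


def tanimoto(i, j):
    """Shared attributes divided by attributes present in either vector."""
    return dot(i, j) / (dot(i, i) + dot(j, j) - dot(i, j))


i = [1, 0, 1]
j = [0, 0, 1]
print(round(normalized_similarity(i, j), 4))  # 0.7071, i.e. 1 / (sqrt(2) * 1)
print(round(fraction_shared(i, j), 4))        # 0.3333, i.e. 1 / 3
print(tanimoto(i, j))                         # 0.5, i.e. 1 / (2 + 1 - 1)
```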

Popular similarity measures
Correlation: linear, rank
Entropy-based: mutual information, based on the P(i j)
Ad-hoc: human perception

Clustering

Hierarchical Clustering
Agglomerative technique (average link):
Step 1: Merge the 2 closest cases into a cluster.
Step 2: Define a cluster representative (e.g., the cluster mean) as a case and remove the individual cases that compose the cluster.
Go to step 1 until all cases are linked.
Visualization: dendrogram, tree, Venn diagram.
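
The two-step agglomerative loop can be sketched as below. Several choices are assumptions rather than the lecture's method: Euclidean distance between representatives, a size-weighted mean so that the representative stays the true cluster mean, and the hypothetical function name `agglomerate`. The returned merge history is what a dendrogram would be drawn from.

```python
import math


def agglomerate(points):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    # Each active cluster is (member indices, representative mean vector).
    clusters = [([idx], list(p)) for idx, p in enumerate(points)]
    merges = []
    while len(clusters) > 1:
        # Step 1: find the two closest representatives.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = math.dist(clusters[a][1], clusters[b][1])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        members_a, mean_a = clusters[a]
        members_b, mean_b = clusters[b]
        # Step 2: replace the pair by one case at the size-weighted mean.
        na, nb = len(members_a), len(members_b)
        mean = [(na * u + nb * v) / (na + nb) for u, v in zip(mean_a, mean_b)]
        merges.append((members_a, members_b, dist))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((members_a + members_b, mean))
    return merges


# Hypothetical data: two tight pairs that are far apart.
history = agglomerate([(0, 0), (0, 1), (5, 0), (5, 1)])
for members_a, members_b, dist in history:
    print(members_a, members_b, dist)
```

On this toy input the two tight pairs merge first at distance 1.0, and the final merge joins them at distance 5.0.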

Data Visualization
[Figure by MIT OCW; axis label: dissimilarity.]

Hierarchical Clustering on Small Round Blue Cell Tumours
[Figure labels: RMS, BL, NB, EWS.]

Linkages
Average-linkage: proximity to the mean (centroid)
Single-linkage: proximity to the closest element in another cluster
Complete-linkage: proximity to the most distant element

Mean Linkage
Assign case according to proximity to the mean (centroid) of another cluster.
[Figure: two groups of points, marked x and O.]

Single Linkage
Assign case according to proximity to the closest element in another cluster.
[Figure: two groups of points, marked x and O.]

Complete Linkage
Assign case according to proximity to the most distant element.
[Figure: two groups of points, marked x and O.]
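
The three linkage rules differ only in how the distance from a case to a cluster is measured. The sketch below follows the slide's wording, so mean linkage is taken as distance to the centroid; note that some software uses "average linkage" for the mean of all pairwise distances instead. Euclidean distance, the function names and the sample data are assumptions.

```python
import math


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(p[k] for p in cluster) / n for k in range(len(cluster[0]))]


def mean_linkage(case, cluster):
    """Distance from the case to the cluster centroid."""
    return math.dist(case, centroid(cluster))


def single_linkage(case, cluster):
    """Distance from the case to the closest element of the cluster."""
    return min(math.dist(case, p) for p in cluster)


def complete_linkage(case, cluster):
    """Distance from the case to the most distant element of the cluster."""
    return max(math.dist(case, p) for p in cluster)


# Hypothetical case and cluster on a line.
case = (0.0, 0.0)
cluster = [(1.0, 0.0), (3.0, 0.0)]
print(mean_linkage(case, cluster))      # 2.0
print(single_linkage(case, cluster))    # 1.0
print(complete_linkage(case, cluster))  # 3.0
```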

Additive Trees
Commonly the minimum spanning tree.
Nearest neighbor approach to hierarchical clustering.

k-means clustering (Lloyd's algorithm)
1. Select k (number of clusters).
2. Select k initial cluster centers $c_1, \dots, c_k$.
3. Iterate until convergence. For each i:
   1. Determine the data vectors $v_{i1}, \dots, v_{in}$ closest to $c_i$ (i.e., partition the space).
   2. Update $c_i$ as $c_i = \frac{1}{n}(v_{i1} + \dots + v_{in})$.
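
The iteration can be sketched in a few lines. Assumptions made here and not stated on the slide: Euclidean distance, convergence declared when the centers stop changing, an iteration cap (`max_iter`, at least 1), and an empty cluster keeping its previous center. The initial centers are passed in by the caller.

```python
import math


def kmeans(vectors, centers, max_iter=100):
    """Lloyd-style iteration; returns (centers, groups of assigned vectors)."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Partition step: assign each vector to its closest center.
        groups = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(v, centers[c]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its vectors.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(center)  # assumption: keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, groups


# Hypothetical data with k = 2 and hand-picked initial centers.
data = [(0, 0), (0, 2), (10, 0), (10, 2)]
final_centers, final_groups = kmeans(data, [(0, 0), (10, 0)])
print(final_centers)  # [[0.0, 1.0], [10.0, 1.0]]
```

The result depends on the initial centers, so in practice the procedure is often repeated from several starting points.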

k-means clustering example


Common mistakes
Treating dendrograms as synonymous with hierarchical clustering in general
Misinterpreting tree-like graphical representations
Defining the clustering criterion poorly
Declaring one clustering algorithm the best
Expecting a classification model from clusters
Expecting robust results from little or poor data

Dimensionality Reduction

Multidimensional Scaling
Geometrical models.
Uncover structure or pattern in the observed proximity matrix.
Objective is to determine both the dimensionality d and the position of points in the d-dimensional space.

Classic Multidimensional Scaling
Also known as principal coordinates analysis (because it is principal components analysis).
From distances, find coordinates.
Constrain the origin to the centroid of the data.
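
Going from distances to coordinates can be sketched with the standard double-centering construction followed by an eigendecomposition. The slide does not spell out these steps, so treat the details as assumptions: the function name `classical_mds`, the use of plain power iteration with deflation (chosen to avoid external libraries), and the iteration count. The sketch also assumes the leading eigenvalues are positive, which holds for Euclidean distances.

```python
import math


def classical_mds(d, dims=2, iters=500):
    """Recover centered coordinates from a symmetric distance matrix d."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # Double centering puts the origin at the centroid of the data.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the leading eigenvector of b.
        v = [float(i + 1) for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam > 0:
            for i in range(n):
                coords[i][k] = math.sqrt(lam) * v[i]
        # Deflation: remove the component just found before the next axis.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Hypothetical input: distances of a 3-4-5 right triangle.
dist_matrix = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
points = classical_mds(dist_matrix)
print(round(math.dist(points[0], points[1]), 6))  # 3.0
print(round(math.dist(points[1], points[2]), 6))  # 5.0
```

The recovered coordinates match the original configuration only up to rotation and reflection, since distances carry no orientation.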

Metric and non-metric MDS
Metric (Torgerson 1952)
Non-metric (Shepard 1961): estimates nonlinear form of the monotonic function $s_{ij} = f_{mon}(d_{ij})$

Stress and goodness-of-fit

  Stress (%)   Goodness of fit
  20           Poor
  10           Fair
  5            Good
  2.5          Excellent
  0            Perfect
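
The slide does not give a formula for stress. As a sketch, the function below uses Kruskal's stress-1, which is an assumption about which definition is meant: the root of the summed squared differences between configuration distances and target dissimilarities, divided by the summed squared configuration distances. Multiplying the result by 100 gives a percentage comparable to the scale above, assuming those thresholds are percentages. The function name and the data are hypothetical.

```python
import math


def kruskal_stress(config, target):
    """Stress-1 of a point configuration against target dissimilarities."""
    num = 0.0
    den = 0.0
    n = len(config)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(config[i], config[j])
            num += (d - target[i][j]) ** 2
            den += d ** 2
    return math.sqrt(num / den)


# Hypothetical target dissimilarities (a 3-4-5 triangle) and two configurations.
target = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
exact = [(0, 0), (3, 0), (0, 4)]      # reproduces every distance
distorted = [(0, 0), (3, 0), (0, 3)]  # one vertex moved

print(kruskal_stress(exact, target))  # 0.0
print(round(100 * kruskal_stress(distorted, target), 1))
```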

Figures removed due to copyright reasons. Please see: Khan, J., et al. "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks." Nat Med 7, no. 6 (Jun 2001): 673-9.

Visualization
Clustering is often good for visualization, but it is generally not very useful for separating data into pre-defined categories.
But there are counterexamples.
Figures removed due to copyright reasons; see Khan et al. (2001).
