7/19/09 A Primer on Clustering: I. Models and Algorithms 1

Size: px
Start display at page:

Download "7/19/09 A Primer on Clustering: I. Models and Algorithms 1"

Transcription

1 7/19/9 A Primer on Clustering: I. Models and Algorithms 1

2 7/19/9 A Primer on Clustering: I. Models and Algorithms 2

3 7/19/9 A Primer on Clustering: I. Models and Algorithms 3

4 A Primer on Clustering : I.Models and Algorithms Atenção! Os Estados Unidos 2 : A Espanha 1. Clustering: similarity, labels, and partitions 2. U only models : SAHN partitioning 3. V only models : SOM point prototypes 4. (U, V) models : c-means with AO 5. (U V, +) models : GMD with EM-AO 7/19/9 A Primer on Clustering: I. Models and Algorithms 4

5 Numerical Pattern Recognition 4 topics Humans : "Knowledge Tidbits" Process Description Feature Nomination X = Object Data R = Relation Data Design Test Sensors Process Description Feature Analysis Classifier Design Feature Analysis Extraction PreProcessing 2D - Display Identification Prediction Control Classifier Design This Module: Cluster Analysis Cluster Analysis Exploration Validity 2D - Display 7/19/9 A Primer on Clustering: I. Models and Algorithms 5

6 2 Kinds of Data for Clustering Objects O= {o 1,, o n } : o i = i-th physical object Object Data X = {x 1,, x n } R p : x i = feature vector for o i x ji = j-th (measured) feature of x i : 1 j p Relational R = [r ij ] = relationship (o i, o j ) ) or (x i, x j ) d ij Data s ij = pairwise similarity (o i, o j ) or (x i, x j ) d ij = pairwise dissimilarity (o i, o j ) or (x i, x j ) Tyically (R = D) d ii = d ij > d ij = d ji : 1 i n : 1 i,j n : 1 i j n (Positive-definite) (Symmetric) We often convert XD with distance d ij = x i x j 7/19/9 A Primer on Clustering: I. Models and Algorithms 6

7 Measuring (Dis)Similarity : The Usual Suspects Mahalonobis <,> Norms d Maha = x v = (x M 1 v)t M 1 (x v) Diagonal d diag = x v DM -1 = (x v)t D M -1 (x v) Euclidean d 2 = x v I = (x v) T (x v) City Block Minkowski Norms d 1 = x v 1 p = x j v j j=1 Sup or Max d = x v = max{ x j v } j 1jp 7/19/9 A Primer on Clustering: I. Models and Algorithms 7

8 Sample Parameters of the Statistical Norms Mean Vector v = n x k k=1 / n Covariance Matrix M = [m ij ] = n (x k k=1 v )(x k v ) T n Variance (Matrix) m 11 L 2 s 1 L D M = M O M = M O M L m 2 pp L s p 7/19/9 A Primer on Clustering: I. Models and Algorithms 8

9 Label the similar objects What is Clustering? (Unsupervised Learning} What is similar? How many groups (c)? 7/19/9 A Primer on Clustering: I. Models and Algorithms 9

10 mixtures Data sets usually contain of objects Clustering groups and labels objects Easy for Humans! Hard for computers! 7/19/9 A Primer on Clustering: I. Models and Algorithms 1

11 Unlabeled Data X={x 1, x n } R p (or) R=[r ij ] R nxn Clustering 3 Problems P1: Tendency X or R has clusters? No : Stop P2: Clustering Find (c) clusters Most important and difficult problem! No P3: Validity Are c clusters OK? Yes : Stop 7/19/9 A Primer on Clustering: I. Models and Algorithms 11

12 Unipolar Label c = 3 Crisp = Vertices N h3 Fuzzy or Probabilistic = Triangle Possibilistic = Cube-{} N p3 N f3 7/19/9 A Primer on Clustering: I. Models and Algorithms 12

13 U = row i M ship of all o k s in cluster i o 1 o k o n Objects u 11 L u 1k L u 1n M M M u i1 L u ik L u in M M M u c1 L u ck L u cn col k M ship of o k in each cluster Partition Matrices Membership Functions u i :O [,1] u i (o k )=u ik =M ship of o k in cluster i 7/19/9 A Primer on Clustering: I. Models and Algorithms 13

14 Crisp Fuzzy/Prob Possibilistic Row sums k u ik > same same Col sums i u ik = 1 same c u ik c i=1 M ships u ik {,1} u ik [,1] same Set Name M hcn M fcn M pcn Example Take a 2nd 7/19/9 A Primer on Clustering: I. Models and Algorithms 14

15 Clustering is NOT Classifier Design Data X C M pcn M fcn M hcn + Other parameters V, Clustering labels only the submitted data z N hc N pc R p D N fc Classifier D can label all the data in its domain D(z) 7/19/9 A Primer on Clustering: I. Models and Algorithms 15

16 Hardening Partitions Defuzzification (Deprobabilization = Bayes Rule for U M fcn!) U = M pcn H 1 (U) = U MM = M hcn resolve ties randomly -cut thresholding (< <1) filters chosen levels of m ship 1 H.9 (U) = M hcn H.8 (U) = 1 1 M hcn 7/19/9 A Primer on Clustering: I. Models and Algorithms 16

17 Clusters represented by partition matrix (U) or prototypes V = {v k } U c = U f = Crisp : hard boundaries Fuzzy : soft boundaries 7/19/9 A Primer on Clustering: I. Models and Algorithms 17

18 U models : find clusters U M pcn 3 Types of Basic Clustering Models Usually relational models, e.g., single linkage V=F(U) can be computed - if you want it V models : find point prototypes V R cp Usually local sequential models, e.g., SOM U=G(V) can be computed - super important for VL Data Methods (U,V) models : find (U,V) M pcn x R cp Usually global batch models, e.g., c-means (U, V, +) have other parameters, e.g., {p i }, { i } in GMD 7/19/9 A Primer on Clustering: I. Models and Algorithms 18

19 A Primer on Clustering : I.Models and Algorithms 2. U only models : SAHN partitioning Eu - homem de Budweiser do fósforo? Por que não eu? 3. V only models : point prototypes with SOM 4. (U, V) models : c-means with AO 5. (U V, +) models : GMD with EM-AO 7/19/9 A Primer on Clustering: I. Models and Algorithms 19

20 U Models : SAHN Clustering S Sequential : at most one U c M hcn for c from n to 1 A Agglomerative : begins at c=n, merges clusters until c=1 Devisive : begins at c=1, splits clusters until c=n H Hierarchichal : joined objects remain in nested clusters N Non-overlapping : {U k } M hcn are crisp partitions of O 7/19/9 A Primer on Clustering: I. Models and Algorithms 2

21 Algorithm SAHN : find U k M hcn for each k from n to 1 Object data: X R p Relational Data:R R nn Input convert to (or) Dissimilarity: D ii = i, D=D T D n n = [D ij ] = x i x j (or) Similarity: S ii =1 i, S=S T Pick Intercluster (set) distance d SAHN = d SL, d AL or d CL ONLY 1 user choice - no parameters or thresholds! A nice aspect of SAHN clustering Set k = 1 C = { {1}, {2},,{n} } C=initial clusters 7/19/9 A Primer on Clustering: I. Models and Algorithms 21

22 Dissimilarity Input Case For k=2 to n Find W,V in C to Merge W and V min d k = d SAHN,k (W,V) = WV=(WV) min S,T C S T {d SAHN,k (S,T)} Update C C*=C-{W,V} C = C*WV Update D Delete rows/cols in D for W,V Add row/col in D for new WV {d SAHN (WV, S) : S C, S WV} Outputs Crisp partitions {U SAHN,k M h(n-k+1)n : 1kn} Merger distances {min d k : 1kn} 7/19/9 A Primer on Clustering: I. Models and Algorithms 22

23 Geometry of Set Distances (dissimilarities) Single linkage (set) distance Complete linkage (set) distance V V W W d SL (W,V) = min{d ij } i W jv d CL (W,V) = max{d ij } i W jv 7/19/9 A Primer on Clustering: I. Models and Algorithms 23

24 Geometry of Set Distances (dissimilarities) Average linkage (set) distance V W d AL (W,V) = i W jv D ij W V 7/19/9 A Primer on Clustering: I. Models and Algorithms 24

25 SL Works Geometry of Linkage Clusters (dissimilarities) CL Fails AL supposedly mediates between these 2 extremes SL Fails CL Works 7/19/9 A Primer on Clustering: I. Models and Algorithms 25

26 7/19/9 A Primer on Clustering: I. Models and Algorithms 26 D = Dissimilarity Matrix of Euclidean Distances 5 points X in R Ex.1 SL Clustering

27 MST on 5 nodes shows evolution of SL clusters 1st Clusters C 5 ={ {1},{2},{3},{4},{5} } 1st Merger C 3 = {3,4,5} 2nd Merger C 2 = {1,2} 3rd Merger C 1 = {1,2,3,4,5} D = D = D = 2.24 D = [] 7/19/9 A Primer on Clustering: I. Models and Algorithms 27

28 Dendogram on 5 nodes shows evolution of SL clusters. SL Merger Distances min d k c=5 c=3 c= c=1 7/19/9 A Primer on Clustering: I. Models and Algorithms 28

29 Naturalmente posso fechar fora a Espanha V only models : SOM point prototypes A Primer on Clustering : I.Models and Algorithms 4. (U, V) models : c-means with AO 5. (U V, +) models : GMD with EM-AO 7/19/9 A Primer on Clustering: I. Models and Algorithms 29

30 V Models : 2 ways to get Point Prototypes V R cp V S SELECTED in X X V S V R BUILT from X X V R 7/19/9 A Primer on Clustering: I. Models and Algorithms 3

31 Point Prototype Algorithms : MANY choices 7/19/9 A Primer on Clustering: I. Models and Algorithms 31

32 Three Kinds of CL Models Nbhd. Updated Examples (N t ) -1 = v i Winner only SHCM, VQ (N t ) -1 V (N t ) -1 = V A subset of V All c nodes SOM GVQ-F, SCS, DR (N t ) -1 = update "neighborhood" in R p at iterate t can be imbedded in definition of learning rates The update subset of V is not a real neighborhood in R p ; the neighborhood is determined in display space 7/19/9 A Primer on Clustering: I. Models and Algorithms 32

33 x 1 Input Layer Hidden Layer x v 1 Network Architecture of SOM/VQ Models Output Layer x 2 x p x i x v j argmin{ x v } = s j j Winner v s x p x v c 7/19/9 A Primer on Clustering: I. Models and Algorithms 33

34 Input User picks Initialize Unlabeled Object data: X R p c,, T,{ ik,t }; {N 1 t }; p; err V = (v 1,,,v c, ) cp Basic CL Algorithm For t = to T For k = 1 to n s = argmin{ x k v } j,t j v i,t+1 = v i,t i,t (x k v i,t ) v i,t N 1 t (v s,t ) If V t+1 V t err STOP Output Else next k V* cp Update Nbhd is Fcn of Winner 7/19/9 A Primer on Clustering: I. Models and Algorithms 34

35 The Update Geometry of Most CL Models =1 x v new (x-v old ) + (excitation) v old = (inhibition) v new = v old + (x v old ) 7/19/9 A Primer on Clustering: I. Models and Algorithms 35

36 Old Hidden Layer in R p "Display Space" in R q New Hidden Layer in R p v 1,t v k p cell k q 11 v 1,t v 2,t 1 2 v s,t wins in R p v 2,t v i,t i cell s,t in R q v i,t s j N t (v s,t ) in R q v s,t N t 1 (v s,t ) R p v s,t v j.t c new v s N 1 t (v s,t ) v j.t v c,t SOM/VQ Update Neighborhood N 1 t (v s,t ) v c,t 7/19/9 A Primer on Clustering: I. Models and Algorithms 36

37 VQ Example : IRIS ~ Each 4-vector represents an Iris flower n = 15 vectors p = 4 features c = 3 physical subspecies n i = 5 vectors each class x k = x 1k sepal length x 2k sepal width x 3k petal length x 4k petal width c = 3 physical clusters Iris Data, Features (3,4) But only c = 2 geometric clusters 7/19/9 A Primer on Clustering: I. Models and Algorithms 37

38 VQ Prototypes : Initializations I 1, I 2 in conv(iris) I 1 = V Final V Same V* min err =.1 I 2 = m = m M Final V M + m 2 = max err =.225 M = See reference [BP95] for algorithmic parameters 7/19/9 A Primer on Clustering: I. Models and Algorithms 38

39 7/19/9 A Primer on Clustering: I. Models and Algorithms 39 Sestosa Sestosa same same -but not so hot but not so hot- for I for I 3, I, I 4 VQ Prototypes VQ Prototypes : Initializations I 3, I 4 outside outside conv(iris) V Final Final Final Final I V I 4 YIKES!!! YIKES!!! Clusters U=G(V) Clusters U=G(V) you can get with you can get with these prototypes these prototypes will will SUCK!!! SUCK!!!

40 Things you need to about SOM/VQ Kohonen does NOT advocate SOM for clustering! MANY schemes for learning rates { i,t } and neighborhood N t Learning rates { i,t } with (t) force termination artificially Learning rates often linear { i,t = i, (1-(t/T)): t=1,2,,t} Good initialization for V is crucial < k,t < 1 Guarantee {V t } (some) V* but k,t = 2 k,t < properties of V* are unknown! 7/19/9 A Primer on Clustering: I. Models and Algorithms 4

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Unsupervised Learning

Unsupervised Learning Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs An Introduction to Cluster Analysis Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs zhaoxia@ics.uci.edu 1 What can you say about the figure? signal C 0.0 0.5 1.0 1500 subjects Two

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

Pattern Recognition Lecture Sequential Clustering

Pattern Recognition Lecture Sequential Clustering Pattern Recognition Lecture Prof. Dr. Marcin Grzegorzek Research Group for Pattern Recognition Institute for Vision and Graphics University of Siegen, Germany Pattern Recognition Chain patterns sensor

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6 Cluster Analysis and Visualization Workshop on Statistics and Machine Learning 2004/2/6 Outlines Introduction Stages in Clustering Clustering Analysis and Visualization One/two-dimensional Data Histogram,

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score

More information

What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology

What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form o unsupervised learning - no inormation

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Administration. Final Exam: Next Tuesday, 12/6 12:30, in class. HW 7: Due on Thursday 12/1. Final Projects:

Administration. Final Exam: Next Tuesday, 12/6 12:30, in class. HW 7: Due on Thursday 12/1. Final Projects: Administration Final Exam: Next Tuesday, 12/6 12:30, in class. Material: Everything covered from the beginning of the semester Format: Similar to mid-term; closed books Review session on Thursday HW 7:

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 9, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 9, 2014 1 / 47

More information

Nearest Neighbor Classification

Nearest Neighbor Classification Nearest Neighbor Classification Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 11, 2017 1 / 48 Outline 1 Administration 2 First learning algorithm: Nearest

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

OPTIMIZATION. Optimization. Derivative-based optimization. Derivative-free optimization. Steepest descent (gradient) methods Newton s method

OPTIMIZATION. Optimization. Derivative-based optimization. Derivative-free optimization. Steepest descent (gradient) methods Newton s method OPTIMIZATION Optimization Derivative-based optimization Steepest descent (gradient) methods Newton s method Derivative-free optimization Simplex Simulated Annealing Genetic Algorithms Ant Colony Optimization...

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

IBL and clustering. Relationship of IBL with CBR

IBL and clustering. Relationship of IBL with CBR IBL and clustering Distance based methods IBL and knn Clustering Distance based and hierarchical Probability-based Expectation Maximization (EM) Relationship of IBL with CBR + uses previously processed

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Clustering. Partition unlabeled examples into disjoint subsets of clusters, such that:

Clustering. Partition unlabeled examples into disjoint subsets of clusters, such that: Text Clustering 1 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

k Nearest Neighbors Super simple idea! Instance-based learning as opposed to model-based (no pre-processing)

k Nearest Neighbors Super simple idea! Instance-based learning as opposed to model-based (no pre-processing) k Nearest Neighbors k Nearest Neighbors To classify an observation: Look at the labels of some number, say k, of neighboring observations. The observation is then classified based on its nearest neighbors

More information

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Clustering Algorithms Contents K-means Hierarchical algorithms Linkage functions Vector quantization SOM Clustering Formulation

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Unsupervised Learning: Clustering Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com (Some material

More information

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Midterm Examination CS540-2: Introduction to Artificial Intelligence Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

CS6716 Pattern Recognition

CS6716 Pattern Recognition CS6716 Pattern Recognition Prototype Methods Aaron Bobick School of Interactive Computing Administrivia Problem 2b was extended to March 25. Done? PS3 will be out this real soon (tonight) due April 10.

More information

Unsupervised: no target value to predict

Unsupervised: no target value to predict Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning

More information

Preprocessing DWML, /33

Preprocessing DWML, /33 Preprocessing DWML, 2007 1/33 Preprocessing Before you can start on the actual data mining, the data may require some preprocessing: Attributes may be redundant. Values may be missing. The data contains

More information

K-Means and Gaussian Mixture Models

K-Means and Gaussian Mixture Models K-Means and Gaussian Mixture Models David Rosenberg New York University June 15, 2015 David Rosenberg (New York University) DS-GA 1003 June 15, 2015 1 / 43 K-Means Clustering Example: Old Faithful Geyser

More information

Clustering algorithms and introduction to persistent homology

Clustering algorithms and introduction to persistent homology Foundations of Geometric Methods in Data Analysis 2017-18 Clustering algorithms and introduction to persistent homology Frédéric Chazal INRIA Saclay - Ile-de-France frederic.chazal@inria.fr Introduction

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices Syracuse University SURFACE School of Information Studies: Faculty Scholarship School of Information Studies (ischool) 12-2002 Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

More information

Generalized Information Theoretic Cluster Validity Indices for Soft Clusterings

Generalized Information Theoretic Cluster Validity Indices for Soft Clusterings Generalized Information Theoretic Cluster Validity Indices for Soft Clusterings Yang Lei, James C Bezdek, Jeffrey Chan, Nguyen Xuan Vinh, Simone Romano and James Bailey Department of Computing and Information

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

6. Learning Partitions of a Set

6. Learning Partitions of a Set 6. Learning Partitions of a Set Also known as clustering! Usually, we partition sets into subsets with elements that are somewhat similar (and since similarity is often task dependent, different partitions

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008 Cluster Analysis Jia Li Department of Statistics Penn State University Summer School in Statistics for Astronomers IV June 9-1, 8 1 Clustering A basic tool in data mining/pattern recognition: Divide a

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

Machine Learning & Statistical Models

Machine Learning & Statistical Models Astroinformatics Machine Learning & Statistical Models Neural Networks Feed Forward Hybrid Decision Analysis Decision Trees Random Decision Forests Evolving Trees Minimum Spanning Trees Perceptron Multi

More information

Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information

High throughput Data Analysis 2. Cluster Analysis

High throughput Data Analysis 2. Cluster Analysis High throughput Data Analysis 2 Cluster Analysis Overview Why clustering? Hierarchical clustering K means clustering Issues with above two Other methods Quality of clustering results Introduction WHY DO

More information

FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION

FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION 1 ZUHERMAN RUSTAM, 2 AINI SURI TALITA 1 Senior Lecturer, Department of Mathematics, Faculty of Mathematics and Natural

More information

Figure (5) Kohonen Self-Organized Map

Figure (5) Kohonen Self-Organized Map 2- KOHONEN SELF-ORGANIZING MAPS (SOM) - The self-organizing neural networks assume a topological structure among the cluster units. - There are m cluster units, arranged in a one- or two-dimensional array;

More information

Clustering gene expression data

Clustering gene expression data Clustering gene expression data 1 How Gene Expression Data Looks Entries of the Raw Data matrix: Ratio values Absolute values Row = gene s expression pattern Column = experiment/condition s profile genes

More information

Data Warehousing and Machine Learning

Data Warehousing and Machine Learning Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35 Preprocessing Before you can start on the actual

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim,

More information

Fuzzy Segmentation. Chapter Introduction. 4.2 Unsupervised Clustering.

Fuzzy Segmentation. Chapter Introduction. 4.2 Unsupervised Clustering. Chapter 4 Fuzzy Segmentation 4. Introduction. The segmentation of objects whose color-composition is not common represents a difficult task, due to the illumination and the appropriate threshold selection

More information

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California,

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

4. Cluster Analysis. Francesc J. Ferri. Dept. d Informàtica. Universitat de València. Febrer F.J. Ferri (Univ. València) AIRF 2/ / 1

4. Cluster Analysis. Francesc J. Ferri. Dept. d Informàtica. Universitat de València. Febrer F.J. Ferri (Univ. València) AIRF 2/ / 1 Anàlisi d Imatges i Reconeixement de Formes Image Analysis and Pattern Recognition:. Cluster Analysis Francesc J. Ferri Dept. d Informàtica. Universitat de València Febrer 8 F.J. Ferri (Univ. València)

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Clustering Algorithms for general similarity measures

Clustering Algorithms for general similarity measures Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Outline Prototype-based Fuzzy c-means

More information

Clustering (COSC 416) Nazli Goharian. Document Clustering.

Clustering (COSC 416) Nazli Goharian. Document Clustering. Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

More information

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn KTH ROYAL INSTITUTE OF TECHNOLOGY Lecture 14 Machine Learning. K-means, knn Contents K-means clustering K-Nearest Neighbour Power Systems Analysis An automated learning approach Understanding states in

More information

Market basket analysis

Market basket analysis Market basket analysis Find joint values of the variables X = (X 1,..., X p ) that appear most frequently in the data base. It is most often applied to binary-valued data X j. In this context the observations

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes. Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or

More information

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering

More information

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi CLUSTER ANALYSIS V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi-110 012 In multivariate situation, the primary interest of the experimenter is to examine and understand the relationship amongst the

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

Clustering. Pattern Recognition IX. Michal Haindl. Clustering. Outline

Clustering. Pattern Recognition IX. Michal Haindl. Clustering. Outline Clustering cluster - set of patterns whose inter-pattern distances are smaller than inter-pattern distances for patterns not in the same cluster a homogeneity and uniformity criterion no connectivity little

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Partially Supervised Clustering for Image Segmentation

Partially Supervised Clustering for Image Segmentation Partially Supervised Clustering for Image Segmentation Amine M. Bensaid, Lawrence O. Hall Department of Computer Science and Engineering University of South Florida Tampa, FL 33620 {bensaid,hall}@csee.usf.edu

More information