Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1



2 Introduction
Cluster analysis is the formal study of algorithms and methods for grouping data. It is a tool for exploring the structure of the data.
Applications: a variety of engineering and scientific disciplines.

3 Applications of Cluster Analysis (1)
Biology, psychology, archeology, geology, marketing, information retrieval, remote sensing, etc.

4 Applications of Cluster Analysis (2)
Characterizing customer groups based on purchasing patterns.
Categorizing Web documents.
Grouping genes and proteins that have similar functionality.
Grouping spatial locations prone to earthquakes based on seismological data.
Feature extraction.
Image segmentation.

5 Background
While it is easy to give a functional definition of a cluster, it is very difficult to give an operational one.
A cluster is a set of entities that are alike, and entities from different clusters are not alike.
Open question: should "alike" be judged at a global level or at a local level?


7 Data Representation (1)

8 Data Representation (2)
Pattern matrix: it can be viewed as an n x d matrix, where n and d represent the number of objects and the number of features, respectively.

9 Data Representation (3)
Proximity matrix: it accumulates the pairwise indices of proximity in a matrix in which each row and each column represents a pattern.
Note: All proximity matrices are symmetric.
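The relation between the two representations can be sketched in a few lines of Python. This is a minimal illustration rather than anything taken from the slides: it assumes Euclidean distance as the proximity index, and the names `proximity_matrix` and `pattern_matrix` are hypothetical.

```python
import math


def proximity_matrix(patterns):
    """Build the n x n proximity matrix from an n x d pattern matrix.

    Assumption: Euclidean distance is used as the proximity index.
    """
    n = len(patterns)
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            dist = math.dist(patterns[i], patterns[k])
            prox[i][k] = dist
            prox[k][i] = dist  # the matrix is symmetric
    return prox


# Pattern matrix with n = 3 objects (rows) and d = 2 features (columns).
pattern_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]
prox = proximity_matrix(pattern_matrix)
for row in prox:
    print(row)
```

Only the upper triangle is computed; the lower triangle is filled in by symmetry, and the diagonal stays at zero because each pattern is at distance zero from itself.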

10 Data Types and Scales (1)
Data types: the degree of quantization in the data.
Binary: 0/1, yes/no.
Discrete: a finite number of possible values.
Continuous: a point on the real line.

11 Data Types and Scales (2)
Data scale: it indicates the relative significance of numbers.
Qualitative (nominal and ordinal) scales: discrete numbers can be coded on these qualitative scales.
(1) A nominal scale is not really a scale at all because numbers are simply used as names. E.g. a (yes, no) response could be coded as (0, 1) or (1, 0) or (50, 100).
(2) An ordinal scale: the numbers have meaning only in relation to one another. E.g. (1, 2, 3), (10, 20, 30), and (100, 200, 300) are all equivalent from an ordinal viewpoint.

12 Data Types and Scales (3)
Quantitative (interval and ratio) scales: on an interval scale a unit of measurement exists; on a ratio scale an absolute zero exists along with a unit of measurement.
(1) Interval: the interpretation of the numbers depends on the unit. E.g. 90 degrees Fahrenheit vs. 90 degrees Celsius, or a satisfaction score assigned by a judge.
(2) Ratio: the ratio between two numbers has meaning. E.g. the distance between two cities.

13 Proximity Indices
A proximity index between the ith and kth patterns is denoted d(i,k).
The most common proximity index for patterns is the Minkowski metric, which measures dissimilarity:
\[ d(i,k) = \left( \sum_{j=1}^{d} |x_{ij} - x_{kj}|^{r} \right)^{1/r} \]

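14 Common Distance Metrics
Euclidean distance (r = 2):
\[ d(i,k) = \left[ \sum_{j=1}^{d} (x_{ij} - x_{kj})^{2} \right]^{1/2} = \left[ (\mathbf{x}_i - \mathbf{x}_k)^{T} (\mathbf{x}_i - \mathbf{x}_k) \right]^{1/2} \]
Manhattan or city-block distance (r = 1):
\[ d(i,k) = \sum_{j=1}^{d} |x_{ij} - x_{kj}| \]
Mahalanobis distance:
\[ d(i,k) = (\mathbf{x}_i - \mathbf{x}_k)^{T} \Sigma^{-1} (\mathbf{x}_i - \mathbf{x}_k) \]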
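As a rough illustration of these metrics, the sketch below evaluates the Minkowski metric for a chosen r and the Mahalanobis form for a given inverse covariance matrix. The function names and the sample vectors are hypothetical, and the inverse covariance matrix is assumed to be supplied by the caller rather than estimated here.

```python
def minkowski(x_i, x_k, r):
    """Minkowski metric: r = 2 gives Euclidean, r = 1 gives Manhattan."""
    return sum(abs(a - b) ** r for a, b in zip(x_i, x_k)) ** (1.0 / r)


def mahalanobis(x_i, x_k, cov_inv):
    """Quadratic form (x_i - x_k)^T Sigma^{-1} (x_i - x_k).

    Assumption: cov_inv is the already-inverted covariance matrix, given
    as a list of rows. No square root is taken, matching the formula as
    written on the slide.
    """
    diff = [a - b for a, b in zip(x_i, x_k)]
    size = len(diff)
    return sum(
        diff[p] * cov_inv[p][q] * diff[q]
        for p in range(size)
        for q in range(size)
    )


x_1 = [0.0, 0.0]
x_2 = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(minkowski(x_1, x_2, 2))           # Euclidean distance
print(minkowski(x_1, x_2, 1))           # Manhattan distance
print(mahalanobis(x_1, x_2, identity))  # squared Euclidean when Sigma = I
```

With the identity as the inverse covariance matrix, the Mahalanobis form reduces to the squared Euclidean distance, which is a quick sanity check on any implementation.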

15 Normalization (1)
Some normalization is usually employed, based on the requirements of the analysis.

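16 Normalization (2)
Zero mean and unit variance. With $x^{*}_{ij}$ denoting the raw value of feature j for pattern i:
\[ m_j = \frac{1}{N} \sum_{i=1}^{N} x^{*}_{ij}, \qquad \sigma_j^{2} = \frac{1}{N} \sum_{i=1}^{N} (x^{*}_{ij} - m_j)^{2} \]
(1) Invariant to rigid displacements:
\[ x_{ij} = x^{*}_{ij} - m_j \]
(2) All features have zero mean and unit variance:
\[ x_{ij} = \frac{x^{*}_{ij} - m_j}{\sigma_j} \]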
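A small sketch of normalization (2) follows. It assumes the variance is computed with a 1/N factor, as in the formulas as reconstructed here, and that no feature is constant, since a zero standard deviation would cause a division by zero. The name `standardize` and the sample data are hypothetical.

```python
import math


def standardize(raw):
    """Rescale every feature (column) to zero mean and unit variance."""
    n = len(raw)
    d = len(raw[0])
    means = [sum(row[j] for row in raw) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in raw) / n)
        for j in range(d)
    ]
    # Assumption: every feature varies, so stds[j] is never zero.
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in raw]


raw_patterns = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]
normalized = standardize(raw_patterns)
for row in normalized:
    print(row)
```

After this step the two features, which originally differed in scale by a factor of ten, contribute equally to any Minkowski distance.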

17 Classification Types (1)
Clustering is a special kind of classification.

18 Classification Types (2)
Exclusive vs. nonexclusive: an exclusive classification assigns each object to exactly one subset, or cluster. A nonexclusive classification can assign an object to several classes.
Unsupervised vs. supervised: an unsupervised classification uses only the proximity matrix to perform the classification. A supervised classification uses category labels on the objects as well as the proximity matrix.

19 Classification Types (3)
Hierarchical vs. partitional: a hierarchical classification is a nested sequence of partitions, whereas a partitional classification is a single partition.

20 Hierarchical Clustering (1)
A picture of a hierarchical clustering is much easier for a human being to comprehend than a list of abstract symbols.
A dendrogram is a special type of tree structure that provides a convenient picture of a hierarchical clustering.
Two types: agglomerative and divisive.
Agglomerative: it starts with the disjoint clustering, which places each of the n objects in an individual cluster, and then merges clusters in a nested procedure.
Divisive: it performs the task in the reverse order, starting with all objects in a single cluster and splitting it step by step.

21 Hierarchical Clustering (2)
Step 1: Assign each object to its own cluster.
Step 2: Compute the distances between all clusters.
Step 3: Merge the two clusters that are closest to each other.
Step 4: Return to Step 2 until only one cluster is left.

22 Hierarchical Clustering (3)
{X1}, {X2}, {X3}, {X4}, {X5}
{X1, X2}, {X3}, {X4}, {X5}
{X1, X2}, {X3, X4}, {X5}
{X1, X2, X3, X4}, {X5}
{X1, X2, X3, X4, X5}
Note: Cutting a dendrogram horizontally creates a clustering.

23 Hierarchical Clustering (4)
The single-linkage algorithm:
\[ D_{SL}(C_i, C_j) = \min_{a \in C_i,\; b \in C_j} d(a, b) \]
The complete-linkage algorithm:
\[ D_{CL}(C_i, C_j) = \max_{a \in C_i,\; b \in C_j} d(a, b) \]
The average-linkage algorithm:
\[ D_{AL}(C_i, C_j) = \frac{1}{N_i N_j} \sum_{a \in C_i,\; b \in C_j} d(a, b) \]
where $N_i$ and $N_j$ are the numbers of patterns in $C_i$ and $C_j$.
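The agglomerative procedure and the three linkage rules can be combined in a short sketch. It is a naive illustration that recomputes every cluster distance at each pass, assumes Euclidean distance for d(a, b), and uses hypothetical names (`single_link`, `complete_link`, `average_link`, `agglomerate`) and made-up one-dimensional data.

```python
import math
from itertools import combinations


def single_link(c_i, c_j):
    return min(math.dist(a, b) for a in c_i for b in c_j)


def complete_link(c_i, c_j):
    return max(math.dist(a, b) for a in c_i for b in c_j)


def average_link(c_i, c_j):
    total = sum(math.dist(a, b) for a in c_i for b in c_j)
    return total / (len(c_i) * len(c_j))


def agglomerate(points, linkage):
    """Return the nested sequence of clusterings as lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # each object on its own
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(
                [points[i] for i in clusters[pair[0]]],
                [points[i] for i in clusters[pair[1]]],
            ),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        history.append([list(c) for c in clusters])
    return history


points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
for level in agglomerate(points, single_link):
    print(level)
```

Each entry of the returned history corresponds to one horizontal cut of the dendrogram. Passing `complete_link` or `average_link` instead of `single_link` changes only how the distance between two clusters is scored.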

24 Hierarchical Clustering (5)
The single-linkage algorithm allows clusters to grow long and thin.
The complete-linkage algorithm produces more compact clusters.
Both the single-linkage algorithm and the complete-linkage algorithm are susceptible to distortion by outliers or deviant observations.
The average-linkage algorithm is an attempt to compromise between the extremes of the single-linkage algorithm and the complete-linkage algorithm.

25 Hierarchical Clustering (6)

26 Partitional Clustering
Partitional: we generate a single partition of the data in an attempt to recover natural groups present in the data.
Basic idea: simply select a criterion, evaluate it for all possible partitions containing K clusters, and pick the partition that optimizes the criterion.
Hierarchical techniques: popular in the biological, social, and behavioral sciences because of the need to construct taxonomies.
Partitional techniques: used in engineering applications where single partitions are important.

27 Algorithm for Iterative Partitional Clustering
Step 1: Select an initial partition with K clusters.
Step 2: Generate a new partition by assigning each pattern to its closest cluster center.
Step 3: Compute new cluster centers as the centers of the clusters.
Step 4: Repeat Steps 2 and 3 until an optimum value of the criterion function is found.
Step 5: Adjust the number of clusters by merging and splitting existing clusters or by removing small, or outlier, clusters.

28 The K-Means Algorithm (1)
Step 1: Choose K cluster centers $\mathbf{c}_1(1), \ldots, \mathbf{c}_K(1)$.
Step 2: At the kth iterative step, distribute the samples among the K cluster domains, using the relation
\[ \mathbf{x} \in S_j(k) \quad \text{if} \quad \|\mathbf{x} - \mathbf{c}_j(k)\| < \|\mathbf{x} - \mathbf{c}_i(k)\| \quad \text{for all } i \neq j \]
Step 3: Compute the new cluster centers
\[ \mathbf{c}_j(k+1) = \frac{1}{N_j} \sum_{\mathbf{x} \in S_j(k)} \mathbf{x}, \qquad j = 1, \ldots, K \]
where $N_j$ is the number of samples in $S_j(k)$.
Step 4: If the algorithm has converged, the procedure is terminated. Otherwise go to Step 2.
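The four steps can be sketched as follows. Several choices here are assumptions rather than part of the slide: convergence is taken to mean that the centers stop changing, ties go to the lowest-numbered center, an empty cluster keeps its previous center, and a `max_iter` cap guards against endless looping. The name `k_means` and the sample data are hypothetical.

```python
import math


def k_means(samples, initial_centers, max_iter=100):
    """Return (centers, domains) after iterating Steps 2 to 4."""
    centers = [list(c) for c in initial_centers]
    domains = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 2: assign every sample to its closest cluster center.
        domains = [[] for _ in centers]
        for x in samples:
            j = min(range(len(centers)), key=lambda idx: math.dist(x, centers[idx]))
            domains[j].append(x)
        # Step 3: recompute each center as the mean of its domain.
        new_centers = []
        for j, members in enumerate(domains):
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # assumption: keep the old center
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, domains


samples = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
# Seed with the first K = 2 patterns.
centers, domains = k_means(samples, samples[:2])
print(centers)
print(domains)
```

Seeding with the first two patterns puts both initial centers inside the same natural group, yet on this well-separated data the iteration still recovers the two groups.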

29 The K-Means Algorithm (2)
Seed patterns can be the first K patterns or K randomly chosen data points.
Different initial partitions can lead to different final clustering results.
If the clustering results using several different initial partitions all lead to the same final partition, we have some confidence in the result.
The Euclidean distance can be replaced by the Mahalanobis distance.

30 The K-Means Algorithm (3)

31 The K-Means Algorithm (4)

32 The K-Means Algorithm (5)

33 Nearest-Neighbor Clustering Algorithm (1)
Step 1: Set i = 1 and k = 1. Assign pattern $x_1$ to cluster $C_1$.
Step 2: Set i = i + 1. Find the nearest neighbor of $x_i$ among the patterns already assigned to clusters. Let $d_m$ denote the distance from $x_i$ to its nearest neighbor. Suppose that the nearest neighbor is in cluster $C_m$.
Step 3: If $d_m \le t$ (a prespecified threshold), then assign $x_i$ to $C_m$. Otherwise, set k = k + 1 and assign $x_i$ to a new cluster $C_k$.
Step 4: If every pattern has been assigned to a cluster, stop. Else, go to Step 2.
Note: The number of clusters generated, K, is a function of the parameter t. As the value of t increases, fewer clusters are generated.
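A compact sketch of these steps is given below, under two assumptions: Euclidean distance is used, and the threshold test is read as "less than or equal to t". Cluster labels start at 0 instead of 1, and the function name and data are hypothetical.

```python
import math


def nearest_neighbor_clustering(patterns, t):
    """Return one cluster label per pattern, in input order."""
    labels = [0]  # Step 1: the first pattern opens the first cluster
    k = 1         # number of clusters created so far
    for i in range(1, len(patterns)):
        # Step 2: nearest neighbor among the patterns already assigned.
        m_idx = min(range(i), key=lambda p: math.dist(patterns[i], patterns[p]))
        d_m = math.dist(patterns[i], patterns[m_idx])
        # Step 3: join the neighbor's cluster or open a new one.
        if d_m <= t:
            labels.append(labels[m_idx])
        else:
            labels.append(k)
            k += 1
    return labels


patterns = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (0.5, 0.5)]
for threshold in (0.5, 1.5, 10.0):
    print(threshold, nearest_neighbor_clustering(patterns, threshold))
```

Running it with increasing thresholds shows the behavior described in the note: a larger t yields fewer clusters. The result also depends on the order in which the patterns are presented, since each pattern is compared only against earlier ones.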

34 Nearest-Neighbor Clustering Algorithm (2)

35 Nearest-Neighbor Clustering Algorithm (3)

36 Projections
Projection algorithms map a set of N n-dimensional patterns onto an m-dimensional space, where m < n.
The main motivation for projection algorithms is to permit visual examination of multidimensional data, so that one can cluster by eye and qualitatively validate clustering results.
Projection algorithms can be categorized into two types: linear and nonlinear.

37 Linear Projections (1)
\[ \mathbf{y}_i = H \mathbf{x}_i, \qquad i = 1, \ldots, N \]
Linear projection algorithms are relatively simple to use and have well-understood mathematical properties.
Eigenvector projection (the Karhunen-Loeve method) is commonly used.
The eigenvectors of the covariance matrix R define a linear projection and replace the features in the raw data with uncorrelated features.

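38 Linear Projections (2)
Let $\Sigma$ denote the covariance matrix of the data, let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ denote the eigenvalues of $\Sigma$, and let $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_d$ denote the corresponding eigenvectors (principal components). The mean vector and covariance matrix are
\[ \mathbf{m} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i, \qquad \Sigma = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^{T} \]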

39 Linear Projections (3)
Define the m x d transformation matrix H as
\[ H = \begin{bmatrix} \mathbf{c}_1^{T} \\ \mathbf{c}_2^{T} \\ \vdots \\ \mathbf{c}_m^{T} \end{bmatrix} \]

40 Linear Projections (4)
This matrix projects the pattern space into an m-dimensional subspace (m < d) whose axes are in the directions of the eigenvectors with the largest eigenvalues of $\Sigma$, as follows:
\[ \mathbf{y}_i = H \mathbf{x}_i, \qquad i = 1, \ldots, N \]
The covariance matrix in the new space becomes the diagonal matrix
\[ \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m) \]
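The projection can be sketched without any numerical library by estimating the leading eigenvectors with power iteration and deflation. Power iteration is a stand-in chosen for brevity, not a method named in the slides; in practice a library eigensolver would normally be used. The sketch assumes distinct leading eigenvalues and a starting vector that is not orthogonal to the eigenvector being sought. All names and the sample data are hypothetical.

```python
import math


def covariance(patterns):
    """Covariance matrix with a 1/N factor, as in the formula for Sigma."""
    n, d = len(patterns), len(patterns[0])
    mean = [sum(x[j] for x in patterns) / n for j in range(d)]
    return [
        [sum((x[p] - mean[p]) * (x[q] - mean[q]) for x in patterns) / n
         for q in range(d)]
        for p in range(d)
    ]


def top_eigenpairs(cov, m, iters=500):
    """Approximate the m largest eigenpairs by power iteration with deflation."""
    d = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(m):
        v = [float(j + 1) for j in range(d)]  # arbitrary starting vector
        for _ in range(iters):
            w = [sum(work[p][q] * v[q] for q in range(d)) for p in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        lam = sum(v[p] * sum(work[p][q] * v[q] for q in range(d)) for p in range(d))
        pairs.append((lam, v))
        # Deflate: remove the component just found before the next search.
        work = [[work[p][q] - lam * v[p] * v[q] for q in range(d)] for p in range(d)]
    return pairs


def project(patterns, m):
    """Return (projected patterns y_i = H x_i, leading eigenvalues)."""
    pairs = top_eigenpairs(covariance(patterns), m)
    H = [v for _, v in pairs]  # rows of H are the eigenvectors c_1..c_m
    projected = [[sum(h[j] * x[j] for j in range(len(x))) for h in H] for x in patterns]
    return projected, [lam for lam, _ in pairs]


patterns = [(2.0, 0.0), (0.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
projected, eigenvalues = project(patterns, 1)
print(eigenvalues)
print(projected)
```

On this toy data the variance is largest along the first axis, so the single retained feature is essentially the first coordinate of each pattern.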

41 Linear Projections (5)
This implies that the m new features are uncorrelated. One could choose m so that
\[ r_m = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge 0.95 \]
which would assure that 95% of the variance is retained in the new space.
Thus a good eigenvector projection is one that retains a large proportion of the variance present in the original feature space with only a few features in the transformed space.
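Choosing m from the eigenvalues is a short cumulative-sum computation. The sketch below assumes the eigenvalues are already known and non-negative; the name `choose_m` and the sample eigenvalues are made up for illustration.

```python
def choose_m(eigenvalues, retained=0.95):
    """Smallest m whose leading eigenvalues keep the requested variance share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= retained:
            return m
    return len(ordered)


# Hypothetical eigenvalues of a 4-feature covariance matrix.
print(choose_m([6.0, 3.0, 0.9, 0.1]))
```

Here the first two eigenvalues account for 90% of the total and the first three for 99%, so three features are needed to reach the 95% target.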

42 Linear Projections (6)

43 Linear Projections (7)

44 Linear Projections (8)

45 Linear Projections (9)
There is no guarantee that the features with the largest eigenvalues will be best for preserving the separation among categories.

46 Nonlinear Projections
The inability of linear projections to preserve complex data structures has made nonlinear projections more popular in recent years.
Most nonlinear projection algorithms are based on maximizing or minimizing an objective function.
Nonlinear projection algorithms are expensive to use, so several heuristics are employed to reduce the search time for the optimal solution.
In exploratory data analysis, we seek two-dimensional projections to visually perceive the structure present in the data.

47 Sammon's Algorithm (1)
Sammon proposed a nonlinear technique that tries to create a two-dimensional configuration of points in which interpattern distances are preserved.
Let $\{\mathbf{x}_i\}$ denote a set of N n-dimensional patterns and let d(i,j) denote the distance between patterns $\mathbf{x}_i$ and $\mathbf{x}_j$ in the n-dimensional space.
Let $\{\mathbf{y}_i\}$ denote the corresponding set of N m-dimensional patterns to be found and let D(i,j) denote the distance between patterns $\mathbf{y}_i$ and $\mathbf{y}_j$ in the m-dimensional space.

48 Sammon's Algorithm (2)
Sammon suggested minimizing the error function E, called stress:
\[ E = \frac{1}{\sum_{i<j} d(i,j)} \sum_{i<j} \frac{[d(i,j) - D(i,j)]^{2}}{d(i,j)} \]
Sammon's algorithm starts with a random configuration of N patterns in m dimensions and uses the method of steepest descent to reconfigure the patterns so as to minimize E in an iterative fashion.
Because steepest descent can stop in a local minimum, the algorithm should be applied with several initial configurations to improve the chance of reaching the global minimum of E.

49 Sammon's Algorithm (3)
\[ y_{ij}(t+1) = y_{ij}(t) - \alpha \frac{\partial E(t)}{\partial y_{ij}(t)} = y_{ij}(t) + \frac{2\alpha}{\lambda} \sum_{k=1,\, k \neq i}^{N} \left[ \frac{d(i,k) - D(i,k)}{d(i,k)\, D(i,k)} \right] (y_{ij}(t) - y_{kj}(t)) \]
where
\[ \lambda = \sum_{i<j} d(i,j) \]
Ref: N. R. Pal and V. K. Eluri, "Two efficient connectionist schemes for structure preserving dimensionality reduction," IEEE Trans. on Neural Networks, vol. 9, no. 6, 1998.
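The stress and a single steepest-descent update can be sketched as follows. The sketch assumes Euclidean distances in both spaces and that no two patterns coincide in either space, since a zero distance would cause a division by zero. The names `pairwise`, `stress`, and `sammon_step`, the step size, and the two-point example are hypothetical.

```python
import math


def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    n = len(points)
    return [[math.dist(points[i], points[k]) for k in range(n)] for i in range(n)]


def stress(d, D):
    """Sammon's stress E for original distances d and projected distances D."""
    n = len(d)
    lam = sum(d[i][j] for i in range(n) for j in range(i + 1, n))
    err = sum(
        (d[i][j] - D[i][j]) ** 2 / d[i][j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return err / lam


def sammon_step(y, d, alpha):
    """One steepest-descent update of the configuration y."""
    n, m = len(y), len(y[0])
    lam = sum(d[i][j] for i in range(n) for j in range(i + 1, n))
    D = pairwise(y)
    updated = []
    for i in range(n):
        row = []
        for j in range(m):
            # Partial derivative of E with respect to y_ij.
            grad = -(2.0 / lam) * sum(
                (d[i][k] - D[i][k]) / (d[i][k] * D[i][k]) * (y[i][j] - y[k][j])
                for k in range(n)
                if k != i
            )
            row.append(y[i][j] - alpha * grad)
        updated.append(row)
    return updated


# Two patterns that are 2 apart, started 1 apart in a one-dimensional layout.
d = pairwise([[0.0], [2.0]])
y = [[0.0], [1.0]]
print(stress(d, pairwise(y)))
y = sammon_step(y, d, alpha=1.0)
print(y, stress(d, pairwise(y)))
```

In a full run, `sammon_step` would be called repeatedly from a random starting configuration until the stress stops decreasing, and the whole run repeated from several starting configurations. The step size alpha is a tuning choice; too large a value can make the iteration overshoot.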

50 Sammon's Algorithm (4)
Figure: (a) iris data set; (b) 10-dimensional data set.


More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Nonlinear dimensionality reduction of large datasets for data exploration

Nonlinear dimensionality reduction of large datasets for data exploration Data Mining VII: Data, Text and Web Mining and their Business Applications 3 Nonlinear dimensionality reduction of large datasets for data exploration V. Tomenko & V. Popov Wessex Institute of Technology,

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

More information

Clustering Lecture 3: Hierarchical Methods

Clustering Lecture 3: Hierarchical Methods Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Hierarchical clustering

Hierarchical clustering Hierarchical clustering Rebecca C. Steorts, Duke University STA 325, Chapter 10 ISL 1 / 63 Agenda K-means versus Hierarchical clustering Agglomerative vs divisive clustering Dendogram (tree) Hierarchical

More information

Administration. Final Exam: Next Tuesday, 12/6 12:30, in class. HW 7: Due on Thursday 12/1. Final Projects:

Administration. Final Exam: Next Tuesday, 12/6 12:30, in class. HW 7: Due on Thursday 12/1. Final Projects: Administration Final Exam: Next Tuesday, 12/6 12:30, in class. Material: Everything covered from the beginning of the semester Format: Similar to mid-term; closed books Review session on Thursday HW 7:

More information

Clustering algorithms

Clustering algorithms Clustering algorithms Machine Learning Hamid Beigy Sharif University of Technology Fall 1393 Hamid Beigy (Sharif University of Technology) Clustering algorithms Fall 1393 1 / 22 Table of contents 1 Supervised

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi CLUSTER ANALYSIS V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi-110 012 In multivariate situation, the primary interest of the experimenter is to examine and understand the relationship amongst the

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

Image Analysis - Lecture 5

Image Analysis - Lecture 5 Texture Segmentation Clustering Review Image Analysis - Lecture 5 Texture and Segmentation Magnus Oskarsson Lecture 5 Texture Segmentation Clustering Review Contents Texture Textons Filter Banks Gabor

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany

More information

Hierarchical Clustering

Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Chapter 14: The Elements of Statistical Learning Presented for 540 by Len Tanaka Objectives Introduction Techniques: Association Rules Cluster Analysis Self-Organizing Maps Projective

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

An Unsupervised Technique for Statistical Data Analysis Using Data Mining

An Unsupervised Technique for Statistical Data Analysis Using Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 5, Number 1 (2013), pp. 11-20 International Research Publication House http://www.irphouse.com An Unsupervised Technique

More information

Community Detection. Community

Community Detection. Community Community Detection Community In social sciences: Community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group a.k.a. group,

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Clustering. Pattern Recognition IX. Michal Haindl. Clustering. Outline

Clustering. Pattern Recognition IX. Michal Haindl. Clustering. Outline Clustering cluster - set of patterns whose inter-pattern distances are smaller than inter-pattern distances for patterns not in the same cluster a homogeneity and uniformity criterion no connectivity little

More information

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes. Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or

More information

Multivariate Methods

Multivariate Methods Multivariate Methods Cluster Analysis http://www.isrec.isb-sib.ch/~darlene/embnet/ Classification Historically, objects are classified into groups periodic table of the elements (chemistry) taxonomy (zoology,

More information

What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology

What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form o unsupervised learning - no inormation

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information