The EMCLUS Procedure


Contents

Overview
Procedure Syntax
   PROC EMCLUS Statement
   VAR Statement
   INITCLUS Statement
Output from PROC EMCLUS
Examples
   Example 1: Syntax for PROC FASTCLUS
   Example 2: Use of the EMCLUS Procedure

Overview

The EMCLUS procedure uses a scalable version of the Expectation-Maximization (EM) algorithm to cluster a data set. You have the option to run the standard EM algorithm, in which the parameters are estimated using the entire data set, or the scaled EM algorithm, in which a portion of the data set is used to estimate the parameters at each iteration of the procedure. The standard EM algorithm is run by not specifying a value for the NOBS option; it requires that the entire data set fit in memory, so that the total number of observations can be determined. The scaled EM algorithm is run by specifying a value for the NOBS option that is less than the total number of observations in the data set. When the scaled EM algorithm is used, it is important that the input data set be randomized beforehand.

The EMCLUS procedure identifies primary clusters, which are the densest regions of data points, and also identifies secondary clusters, which are typically smaller dense clusters. Each primary cluster is modeled with a weighted n-dimensional multivariate normal distribution, where n is the number of variables in the data set. Thus each primary cluster is modeled with the function w*f(x|u,V), where f(x|u,V) ~ MVN(u,V) and V is a diagonal matrix.

There are four major parts in the EMCLUS procedure:

1. Obtain and possibly refine the initial parameter estimates.
2. Apply the EM algorithm to update parameter values.
3. Summarize data in the primary summarization phase.
4. Summarize data in the secondary summarization phase.

The effectiveness of the EMCLUS procedure depends on the initial parameter estimates. Good initial parameter estimates generally lead to faster convergence and better final estimates. The initial parameter estimates can be obtained "randomly", from PROC FASTCLUS, or from PROC EMCLUS. See Example 1 for the PROC FASTCLUS syntax. PROC FASTCLUS sometimes returns poor results (clusters corresponding to low frequency counts); such clusters can be ignored by specifying appropriate values in the INITCLUS statement. PROC FASTCLUS can also return clusters that are actually groups of clusters; these can be recognized by a large frequency count together with a large root-mean-square standard deviation. Such clusters should not be ignored; instead, specify a value of INITSTD that is smaller than the root-mean-square standard deviations of the clusters with the large frequency counts and large root-mean-square standard deviations (see Example 2). When PROC FASTCLUS returns poor results, it may also help to rerun PROC FASTCLUS with a larger value of MAXCLUSTERS, and then choose the best clusters as initial values for PROC EMCLUS. Initial estimates obtained from PROC EMCLUS can be used to refine the primary clusters or to obtain better primary clusters.

The EM algorithm is used to find the primary clusters and to update the model parameters. The EM algorithm terminates when two successive log-likelihood values differ in relative and absolute magnitude by less than the stopping tolerance (see the EPS= option), or when ITER iterations have been reached.

The primary summarization phase summarizes observations near each of the primary cluster means and then deletes the summarized observations from memory. For the specified value P=p0, all observations falling within the region containing 100*p0% of the volume of the MVN(u,V) distribution are summarized. At the end of the primary summarization phase, the primary clusters are checked to see whether any of them contain fewer than MIN observations. If a cluster does, it is declared inactive. An inactive cluster is not used in updating the parameter estimates in the EM algorithm. An inactive cluster remains inactive until one of the following two conditions occurs:

1. The inactive cluster is reseeded at a secondary cluster containing at least MIN observations.
2. The inactive cluster is reseeded at a point determined by an active cluster having at least one variable with standard deviation greater than INITSTD.

The secondary summarization phase first uses the k-means clustering algorithm to identify secondary clusters, and then uses a hierarchical agglomerative clustering algorithm to combine similar secondary clusters. At the end of the k-means algorithm, each of the SECCLUS clusters is tested to see whether its sample standard deviation for each variable is less than or equal to SECSTD; if so, the cluster becomes a secondary cluster. Setting SECCLUS=0 causes PROC EMCLUS to skip the secondary summarization phase, which is not recommended, because the secondary summarization phase acts as a backup method for finding the primary clusters when the initial values are poor. If the data set contains many outliers, setting SECCLUS larger than the default value will increase the chances of finding clusters. A secondary cluster is disjoint from all other secondary clusters and from all primary clusters.

Although many of the options in PROC EMCLUS are not required, it is best to specify them if you have some knowledge of the input data set. Among the most important options is NOBS, which specifies the number of observations that are read in during each iteration and, consequently, determines the number of iterations. For example, if the input data set contains 1,000 observations and NOBS is set to 100, then there will be 10 iterations. If NOBS is not specified, it is assumed that you want to run the standard EM algorithm. Another important option is INITSTD, the maximum initial standard deviation of each variable in each initial primary cluster. If INITSTD is chosen too small, the EM algorithm may have trouble finding the primary clusters. A minimal invocation sketch of the two algorithms follows.
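To make the two modes concrete, here is a minimal sketch of each invocation, using only options documented in this chapter. The data set name work.sales, the variables x1-x3, and the values CLUSTERS=4 and NOBS=200 are hypothetical.

   /* Standard EM: NOBS is omitted, so the entire data set
      (which must fit in memory) is used at every iteration. */
   PROC EMCLUS DATA=work.sales CLUSTERS=4;
      VAR x1 x2 x3;
   RUN;

   /* Scaled EM: 200 observations are read in per iteration;
      the input data set should be randomized beforehand. */
   PROC EMCLUS DATA=work.sales CLUSTERS=4 NOBS=200;
      VAR x1 x2 x3;
   RUN;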

Procedure Syntax

PROC EMCLUS <option(s)>;
   VAR variable(s);
   INITCLUS integer(s);

PROC EMCLUS Statement

Invokes the EMCLUS procedure.

PROC EMCLUS <option(s)>;

Options

DATA= | IN= <libref.>SAS-data-set
Specifies the data set to be analyzed. All the variables in this data set must be numeric. Observations with missing data are ignored.

ROLE= TRAIN | SCORE
Specifies the role of the DATA= data set. Setting ROLE=TRAIN causes the EMCLUS procedure to cluster the data set. Setting ROLE=SCORE causes the EMCLUS procedure to compute the probabilities that each observation in the DATA= data set is in each primary cluster. If ROLE=SCORE is specified, then the SEED=<file> option must also be used, where <file> is the name of the OUTSTAT data set. The default value for ROLE= is TRAIN.

CLUSTERS= positive-integer
Specifies the number of primary clusters.

SECCLUS= nonnegative-integer
Specifies the number of secondary clusters that the algorithm searches for during the secondary data summarization phase. If SECCLUS=0, then there is no secondary data summarization phase. The default value of SECCLUS= is twice the number of primary clusters.

EPS= positive-number
Specifies the stopping tolerance. The default value is

SECSTD= positive-number
Specifies the maximum allowable sample standard deviation of any variable in a summarized subset of observations for it to be deemed a secondary cluster. The default value is the smallest positive sample standard deviation obtained from the observations read in during the first iteration.

P= number
Defines a radius around each cluster mean such that any point that lies inside the radius is summarized in that cluster. The value of P= must be between 0 and 1. A value close to 1 defines a larger radius than a value close to 0. The default value is 0.5.

MAXITER= positive-integer
Specifies the maximum number of iterations of the EMCLUS procedure. The default value is the largest machine integer.

INITSTD= positive-number
Specifies the maximum standard deviation of the initial clusters. The default value is determined from a sample of data read in during the first iteration.

ITER= positive-integer
Specifies the number of iterations in the EM algorithm used to update the model parameters. The default value is 50.

INIT= RANDOM | FASTCLUS | EMCLUS
Specifies how the initial estimates are obtained. The default value is RANDOM. If INIT=FASTCLUS or INIT=EMCLUS is specified, then the SEED=<libref.>SAS-data-set option must also be specified, and the data set must be the OUTSEEDS data set from PROC FASTCLUS or the OUTSTAT data set from PROC EMCLUS, respectively.

MIN= nonnegative-integer
Specifies the minimum number of observations in each primary cluster. At any iteration, if the total number of observations summarized in a cluster is less than MIN=, then the cluster becomes inactive, and the cluster is reseeded at a more appropriate point, if one exists. The default value is 3.

NOBS= positive-integer
Specifies the number of observations to be read in for each iteration. The default value is the number of observations in the data set, provided that this number can be determined. If the number of observations in the data set cannot be determined, the default value is 500.

SEED= <libref.>SAS-data-set
Specifies the data set that is used for the initial parameter estimates. This option must be used with INIT=FASTCLUS or with INIT=EMCLUS. With PROC FASTCLUS, the specified data set is the SAS data set created by the OUTSEEDS option. With PROC EMCLUS, the specified data set is the SAS data set created by the OUTSTAT option.

OUTSTAT= <libref.>SAS-data-set
Specifies an output data set. This data set has 5+D columns, where D is the number of variables. Column 1 contains the cluster number. Column 2 is the type of cluster (primary or secondary). Column 3 is the cluster frequency. Column 4 is the estimate for the weight parameter. Column 5 is labeled _TYPE_, where _TYPE_=MEAN or _TYPE_=VAR. Columns 6 through 5+D contain either the estimates for the mean or the variance of each variable. Each cluster corresponds to two rows of this data set, in the following form:

Output Data Set from the OUTSTAT Option

   CLUSTER_1  PRIMARY    FREQ_1  WEIGHT_1  MEAN  MEAN_1  MEAN_2 ... MEAN_N
   CLUSTER_1  PRIMARY    FREQ_1  WEIGHT_1  VAR   VAR_1   VAR_2  ... VAR_N
   CLUSTER_2  SECONDARY  FREQ_2  WEIGHT_2  MEAN  MEAN_1  MEAN_2 ... MEAN_N
   CLUSTER_2  SECONDARY  FREQ_2  WEIGHT_2  VAR   VAR_1   VAR_2  ... VAR_N

DIST= nonnegative-number
Specifies the minimum distance between initial clusters when INIT=RANDOM is used. It also specifies the minimum distance between initial cluster means in the secondary data summarization phase. The default value is the square root of the average of the sample variances obtained from the observations read in during the first iteration.

PRINT= ALL | LAST | NONE
Specifies how much output is printed. If PRINT=LAST is used, then the initial estimates and the output from the last iteration are shown. The default is PRINT=LAST.

SECITER= positive-integer
Specifies the maximum number of iterations of the k-means algorithm in the secondary data summarization phase. The default value is 1.

OUT= <libref.>SAS-data-set
Specifies the name of the data set that contains the probabilities PROB_h = P(x is in cluster h), h = 1, 2, ..., k. This option must be used with ROLE=SCORE. The resulting data set also contains the original data.

CLEAR= nonnegative-integer
Specifies the value n such that, after every n iterations, the EMCLUS procedure deletes the observations that remain in memory following the secondary summarization phase. The default value is 0, which means that no observations are deleted from memory.

OUTLIERS= IGNORE | KEEP
Specifies how outlier observations are weighted when the scaled EM algorithm is used. If OUTLIERS=IGNORE is specified, observations that do not fall within the 99th percentile of any estimated primary cluster are given less weight. If OUTLIERS=KEEP is specified, these observations are weighted normally, as in the standard EM algorithm.
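As a sketch of how the training and scoring roles fit together, the following two steps first cluster a training set and save the parameter estimates with OUTSTAT=, and then use that data set as the SEED= input to score new observations. The data set names work.train, work.new, work.stats, and work.scored are hypothetical.

   /* Train: cluster the data and save the parameter estimates. */
   PROC EMCLUS DATA=work.train CLUSTERS=3 OUTSTAT=work.stats;
   RUN;

   /* Score: compute PROB_h = P(x is in cluster h) for each
      observation, reading the estimates from the SEED= data set. */
   PROC EMCLUS DATA=work.new ROLE=SCORE SEED=work.stats OUT=work.scored;
   RUN;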

VAR Statement

VAR variable(s);

variable(s)
Specifies which variables are to be used. If this statement is omitted, then all variables in the input data set are used.

INITCLUS Statement

INITCLUS integer(s);

integer(s)
Specifies which clusters from PROC FASTCLUS or PROC EMCLUS are to be used as initial estimates for PROC EMCLUS. For example, if, when viewing the output from PROC FASTCLUS, you see that only clusters 2, 4 through 8, and 10 have good results, then you could use the statement

   INITCLUS 2, 4 TO 8, 10;

to use only those cluster estimates as initial estimates in PROC EMCLUS. The number of clusters specified in the INITCLUS statement should not exceed the number of clusters specified with the CLUSTERS= option in the PROC EMCLUS statement. With INIT=EMCLUS, the default is that all the clusters are used as initial estimates. With INIT=FASTCLUS, the default is that the clusters with the highest frequency counts are used. A full invocation using this statement is sketched below.
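A minimal sketch combining the INITCLUS statement with FASTCLUS-based initialization. The data set names work.sales and work.seeds are hypothetical; work.seeds stands for the OUTSEEDS data set produced by PROC FASTCLUS (see Example 1).

   PROC EMCLUS DATA=work.sales CLUSTERS=8 INIT=FASTCLUS SEED=work.seeds;
      INITCLUS 2, 4 TO 8, 10;
   RUN;

Seven clusters are selected here (2, 4 through 8, and 10), so any CLUSTERS= value of at least 7 would be appropriate.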

Output from PROC EMCLUS

The beginning of the output shows the initial model parameter estimates. Next, the estimated model parameters, sample means, and sample variances for the active primary clusters are displayed. Inactive clusters are shown with missing values. The sample mean and variance are calculated from the observations that are summarized in the primary clusters.

In the cluster summary table, the following statistics are listed:

Current Frequency
the number of observations that are summarized in a cluster during the current iteration.

Total Frequency
the cumulative sum of the current frequencies for each cluster.

Proportion of Data Summarized
the total frequency divided by the Obs read in.

Nearest Cluster
the closest primary cluster to a primary cluster, based on the Euclidean distance between the estimated means of the two primary clusters.

Distance
the Euclidean distance from a primary cluster to its nearest cluster.

The iteration summary table displays:

Log-likelihood
the average log-likelihood over all the observations that are read in.

Obs read in this iteration
the number of observations that are read in at the current iteration.

Obs read in
the cumulative sum of observations that are read in.

Current Summarized
the sum of the current frequencies across the primary clusters.

Total Summarized
the sum of the total frequencies across the primary clusters.

Proportion Summarized
the Total Summarized divided by the Obs read in.

If there are secondary clusters, the sample mean, sample variance, and the number of observations in the secondary clusters are also displayed after the iteration summary table.

Note: The estimated variance parameter for each variable is bounded from below by the value (var)*(eps), where var is the sample variance of that variable obtained from the observations read in at the first iteration, and eps is

Both the standard and scaled EM algorithms are sometimes slow to converge; however, the scaled EM algorithm generally runs faster than the standard EM algorithm. Convergence may be sped up by increasing P= and/or EPS=, or by using the CLEAR= option. Changing these values may alter the parameter estimates.

Examples

Example 1: Syntax for PROC FASTCLUS
Example 2: Use of the EMCLUS Procedure

Example 1: Syntax for PROC FASTCLUS

   PROC FASTCLUS <DATA=libref.SAS-data-set>
      OUTSEEDS=libref.SAS-data-set
      MAXCLUSTERS=positive-integer;

A concrete instantiation is sketched below.
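A hypothetical instantiation of this skeleton, following the OUTSEEDS= spelling used in this chapter; work.sales and work.seeds are assumed data set names, and MAXCLUSTERS=10 is an arbitrary choice (the Overview suggests trying a larger MAXCLUSTERS value when the initial clusters are poor).

   PROC FASTCLUS DATA=work.sales OUTSEEDS=work.seeds MAXCLUSTERS=10;
   RUN;

The work.seeds data set can then be passed to PROC EMCLUS through the SEED= option with INIT=FASTCLUS.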

Example 2: Use of the EMCLUS Procedure

PROC FASTCLUS returns a portion of the total output, summarized in the following table:

   Cluster   Frequency   RMS Std. Deviation

Clusters 1, 3, and 4 should be used as initial estimates for PROC EMCLUS because of their high frequency counts. Cluster 1 may actually be a group of clusters because of its high RMS standard deviation. Therefore, the syntax when using PROC EMCLUS could look like this:

   PROC EMCLUS DATA=<libref.SAS-data-set>
      CLUSTERS=5
      INIT=FASTCLUS
      SEED=<the data set from the OUTSEEDS option in PROC FASTCLUS>
      INITSTD=50.0;
      INITCLUS 1, 3, 4;
   run;

Note that CLUSTERS is set to 5, but any integer greater than or equal to 3 is appropriate, since three clusters are specified in the INITCLUS statement. Also, INITSTD could have been set to any number less than the RMS standard deviation of cluster 1 and greater than 4.05.
