The EMCLUS Procedure

- Mitchell Harris
- 6 years ago
Contents

Overview
Procedure Syntax
    PROC EMCLUS Statement
    VAR Statement
    INITCLUS Statement
Output from PROC EMCLUS
Examples
    Example 1: Syntax for PROC FASTCLUS
    Example 2: Use of the EMCLUS Procedure
Overview

The EMCLUS procedure uses a scalable version of the Expectation-Maximization (EM) algorithm to cluster a data set. You have the option of running the standard EM algorithm, in which the parameters are estimated from the entire data set, or the scaled EM algorithm, in which a portion of the data set is used to estimate the parameters at each iteration of the procedure. The standard EM algorithm is run by not specifying a value for the NOBS option; it requires that the entire data set fit in memory, so that the total number of observations can be determined. The scaled EM algorithm is run by specifying a value for the NOBS option that is less than the total number of observations in the data set. When the scaled EM algorithm is used, it is important that the input data set be randomized beforehand.

The EMCLUS procedure identifies primary clusters, which are the densest regions of data points, and also identifies secondary clusters, which are typically smaller dense clusters. Each primary cluster is modeled with a weighted n-dimensional multivariate normal distribution, where n is the number of variables in the data set. Thus each primary cluster is modeled with the function w*f(x|u,V), where f(x|u,V) ~ MVN(u,V) and V is a diagonal matrix.

There are four major parts in the EMCLUS procedure:

1. Obtain and possibly refine the initial parameter estimates.
2. Apply the EM algorithm to update parameter values.
3. Summarize data in the primary summarization phase.
4. Summarize data in the secondary summarization phase.

The effectiveness of the EMCLUS procedure depends on the initial parameter estimates. Good initial parameter estimates generally lead to faster convergence and better final estimates. The initial parameter estimates can be obtained "randomly", from PROC FASTCLUS, or from PROC EMCLUS. See Example 1 for the PROC FASTCLUS syntax.
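The weighted mixture model and EM update described above can be illustrated with a minimal pure-Python sketch. The function name and the simplified loop are assumptions for illustration; this is not the EMCLUS implementation, which adds summarization and scaling on top of this basic iteration.

```python
import math

def em_diag_gauss(data, k, init_means, iters=50):
    # Weighted mixture of k Gaussians with diagonal covariance:
    # each cluster h is modeled as w_h * f(x | u_h, V_h), V_h diagonal.
    dim = len(data[0])
    means = [list(m) for m in init_means]
    varis = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k

    def dens(x, u, v):
        # product of independent univariate normal densities
        p = 1.0
        for xi, ui, vi in zip(x, u, v):
            p *= math.exp(-(xi - ui) ** 2 / (2.0 * vi)) / math.sqrt(2.0 * math.pi * vi)
        return p

    for _ in range(iters):
        # E-step: responsibility of each cluster for each observation
        resp = []
        for x in data:
            ps = [w * dens(x, u, v) for w, u, v in zip(weights, means, varis)]
            s = sum(ps) or 1e-300
            resp.append([p / s for p in ps])
        # M-step: re-estimate weights, means, and diagonal variances
        for h in range(k):
            nh = sum(r[h] for r in resp) or 1e-300
            weights[h] = nh / len(data)
            means[h] = [sum(r[h] * x[d] for r, x in zip(resp, data)) / nh
                        for d in range(dim)]
            varis[h] = [max(sum(r[h] * (x[d] - means[h][d]) ** 2
                                for r, x in zip(resp, data)) / nh, 1e-6)
                        for d in range(dim)]
    return weights, means, varis
```

With reasonable starting means, two well-separated groups are recovered quickly, which is why good initial estimates matter so much in practice.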
PROC FASTCLUS sometimes returns poor results (clusters with low frequency counts). In this case, the poor results can be ignored by specifying appropriate values for the INITCLUS option. PROC FASTCLUS can also return clusters that are actually groups of clusters; these can be recognized by a large frequency count together with a large root-mean-square standard deviation. Such clusters should not be ignored; instead, the user should specify a value of INITSTD that is smaller than the root-mean-square standard deviations of the clusters with the large frequency counts and large root-mean-square standard deviations (see Example 2). When PROC FASTCLUS returns poor results, it may also help to rerun PROC FASTCLUS with a larger value of MAXCLUSTERS and then choose the best clusters as initial values for PROC EMCLUS. Initial estimates obtained from PROC EMCLUS can be used to refine the primary clusters or to obtain better primary clusters.

The EM algorithm is used to find the primary clusters and to update the model parameters. The EM algorithm terminates when two successive log-likelihood values differ in relative and absolute magnitude by less than a particular amount, or when ITER iterations have been reached. The primary summarization phase summarizes observations near each of the primary cluster means and
then deletes the summarized observations from memory. For the specified value P=p0, all observations falling within the region containing 100*p0% of the volume of the MVN(u,V) distribution are summarized. At the end of the primary summarization phase, the primary clusters are checked to see whether any of them contain fewer than MIN observations. If a cluster does, it is declared inactive. An inactive cluster is not used in updating the parameter estimates in the EM algorithm. An inactive cluster remains inactive until one of the following two conditions occurs:

1. The inactive cluster is reseeded at a secondary cluster containing at least MIN observations.
2. The inactive cluster is reseeded at a point determined by an active cluster having at least one variable with standard deviation greater than INITSTD.

The secondary summarization phase first uses the k-means clustering algorithm to identify secondary clusters, and then uses a hierarchical agglomerative clustering algorithm to combine similar secondary clusters. At the end of the k-means algorithm, each of the SECCLUS clusters is tested to see whether its sample standard deviation for each variable is less than or equal to SECSTD; if so, the cluster becomes a secondary cluster. Setting SECCLUS=0 causes PROC EMCLUS to skip the secondary summarization phase, which is not recommended, because the secondary summarization phase acts as a backup method for finding the primary clusters when the initial values are poor. If the data set contains many outliers, setting SECCLUS larger than the default value increases the chances of finding clusters. A secondary cluster is disjoint from all other secondary clusters and from all primary clusters.

Although many of the options in PROC EMCLUS are not required, it is best to specify them if you have some knowledge of the input data set. Among the most important options is NOBS.
NOBS specifies the number of observations that are read in during each iteration, and consequently determines the number of iterations. For example, if the input data set contains 1,000 observations and NOBS is set to 100, there will be 10 iterations. If NOBS is not specified, it is assumed that you want to run the standard EM algorithm. Another important option is INITSTD, the maximum initial standard deviation of each variable in each initial primary cluster. If INITSTD is chosen too small, the EM algorithm may have trouble finding the primary clusters.
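The two summarization phases described in the overview can be sketched for the two-dimensional case, where the region containing 100*p% of the mass of a diagonal-covariance normal corresponds to squared Mahalanobis distance below the chi-square(2) quantile, which has the closed form -2*ln(1-p). The function names and the per-cluster sufficient statistics kept here are assumptions for illustration, not the procedure's actual internals.

```python
import math

def primary_summarize(points, mean, var, p=0.5):
    # Observations inside the region holding 100*p% of the mass of
    # MVN(mean, diag(var)) are summarized and removed from memory.
    # In 2 dimensions that region is {x : mahalanobis^2 <= -2*ln(1-p)}.
    threshold = -2.0 * math.log(1.0 - p)
    kept, summarized = [], []
    for x in points:
        d2 = sum((xi - ui) ** 2 / vi for xi, ui, vi in zip(x, mean, var))
        (summarized if d2 <= threshold else kept).append(x)
    n = len(summarized)
    sums = [sum(x[d] for x in summarized) for d in range(len(mean))]
    sumsq = [sum(x[d] ** 2 for x in summarized) for d in range(len(mean))]
    return kept, (n, sums, sumsq)   # sufficient statistics for the cluster

def is_secondary(points, secstd):
    # A candidate subset becomes a secondary cluster only if the sample
    # standard deviation of every variable is at most SECSTD.
    n, dim = len(points), len(points[0])
    for d in range(dim):
        m = sum(x[d] for x in points) / n
        sd = math.sqrt(sum((x[d] - m) ** 2 for x in points) / (n - 1))
        if sd > secstd:
            return False
    return True
```

Note how a larger P summarizes (and deletes) more observations per iteration, which is why increasing P can speed up convergence at the cost of possibly altering the estimates.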
Procedure Syntax

PROC EMCLUS <option(s)>;
    VAR variable(s);
    INITCLUS integer(s);
PROC EMCLUS Statement

Invokes the EMCLUS procedure.

PROC EMCLUS <option(s)>;

Options

DATA= | IN= <libref.>SAS-data-set
    Specifies the data set to be analyzed. All the variables in this data set must be numeric. Observations with missing data are ignored.

ROLE= TRAIN | SCORE
    Specifies the role of the DATA= data set. ROLE=TRAIN causes the EMCLUS procedure to cluster the data set. ROLE=SCORE causes the EMCLUS procedure to compute the probability that each observation in the DATA= data set belongs to each primary cluster. If ROLE=SCORE is specified, the SEED= option must also be used, where its value is the name of the OUTSTAT= data set. The default is TRAIN.

CLUSTERS= positive-integer
    Specifies the number of primary clusters.

SECCLUS= nonnegative-integer
    Specifies the number of secondary clusters that the algorithm searches for during the secondary data summarization phase. If SECCLUS=0, there is no secondary data summarization phase. The default is twice the number of primary clusters.

EPS= positive-number
    Specifies the stopping tolerance.

SECSTD= positive-number
    Specifies the maximum allowable sample standard deviation of any variable in a summarized subset of observations for it to be deemed a secondary cluster. The default is the smallest positive sample standard deviation obtained from the observations read in during the first iteration.

P= number
    Defines a radius around each cluster mean such that any point lying inside the radius is summarized in that cluster. The value of P must be between 0 and 1; a value close to 1 defines a larger radius than a value close to 0. The default is 0.5.

MAXITER= positive-integer
    Specifies the maximum number of iterations of the EMCLUS procedure. The default is the largest machine integer.

INITSTD= positive-number
    Specifies the maximum standard deviation of the initial clusters. The default is determined from a sample of data read in during the first iteration.

ITER= positive-integer
    Specifies the number of iterations of the EM algorithm used to update the model parameters. The default is 50.

INIT= RANDOM | FASTCLUS | EMCLUS
    Specifies how the initial estimates are obtained. The default is RANDOM. If INIT=FASTCLUS or INIT=EMCLUS is specified, the SEED=<libref.>SAS-data-set option must also be specified, and the data set must be the OUTSEEDS= data set from PROC FASTCLUS or the OUTSTAT= data set from PROC EMCLUS, respectively.

MIN= nonnegative-integer
    Specifies the minimum number of observations in each primary cluster. At any iteration, if the total number of observations summarized in a cluster is less than MIN, the cluster becomes inactive, and the cluster is reseeded at a more appropriate point, if one exists. The default is 3.

NOBS= positive-integer
    Specifies the number of observations to be read in at each iteration. The default is the number of observations in the data set, provided that this number can be determined; otherwise, the default is 500.

SEED= libref.SAS-data-set
    Specifies the data set that is used for the initial parameter estimates. This option must be used with INIT=FASTCLUS or with INIT=EMCLUS. With PROC FASTCLUS, the specified data set is the data set produced by the OUTSEEDS= option; with PROC EMCLUS, it is the data set produced by the OUTSTAT= option.

OUTSTAT= libref.SAS-data-set
    Specifies an output data set. This data set has 5+D columns, where D is the number of variables. Column 1 contains the cluster number. Column 2 is the type of cluster (primary or secondary). Column 3 is the cluster frequency. Column 4 is the estimate of the weight parameter. Column 5 is labeled _TYPE_, where _TYPE_=MEAN or _TYPE_=VAR. Columns 6 through 5+D contain either the estimates of the mean or of the variance of each variable. Each cluster corresponds to two rows of this data set, in the following form:

    Output Data Set from the OUTSTAT= Option

    CLUSTER_1  PRIMARY    FREQ_1  WEIGHT_1  MEAN  MEAN_1  MEAN_2 ... MEAN_N
    CLUSTER_1  PRIMARY    FREQ_1  WEIGHT_1  VAR   VAR_1   VAR_2  ... VAR_N
    CLUSTER_2  SECONDARY  FREQ_2  WEIGHT_2  MEAN  MEAN_1  MEAN_2 ... MEAN_N
    CLUSTER_2  SECONDARY  FREQ_2  WEIGHT_2  VAR   VAR_1   VAR_2  ... VAR_N

DIST= nonnegative-number
    Specifies the minimum distance between initial clusters when INIT=RANDOM is used. It also specifies the minimum distance between initial cluster means in the secondary data summarization phase. The default is the square root of the average of the sample variances obtained from the observations read in during the first iteration.

PRINT= ALL | LAST | NONE
    Specifies how much output is printed. If PRINT=LAST is used, the initial estimates and the output from the last iteration are shown. The default is LAST.

SECITER= positive-integer
    Specifies the maximum number of iterations of the k-means algorithm in the secondary data summarization phase. The default is 1.

OUT= libref.SAS-data-set
    Specifies the name of the data set that contains the probabilities PROB_h = P(x is in cluster h), h=1,2,...,k. This option must be used with ROLE=SCORE. The resulting data set also contains the original data.

CLEAR= nonnegative-integer
    Specifies the value n for which, after every n iterations, the EMCLUS procedure deletes the observations that remain in memory following the secondary summarization phase. The default is 0, which means that no observations are deleted from memory.

OUTLIERS= IGNORE | KEEP
    Specifies how outlier observations are weighted when the scaled EM algorithm is used. If OUTLIERS=IGNORE is specified, observations that are not within the 99th percentile of any estimated primary cluster are weighted less. If OUTLIERS=KEEP is specified, these observations are weighted normally, as in the standard EM algorithm.
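As an illustration of the OUTSTAT= layout, the data set can be consumed in row pairs, one MEAN row and one VAR row per cluster. The function name and the tuple representation of rows are assumptions for illustration.

```python
def read_outstat(rows):
    # Each OUTSTAT row: cluster id, cluster type, frequency, weight,
    # _TYPE_ (MEAN or VAR), then one value per variable (columns 6..5+D).
    clusters = {}
    for cluster, ctype, freq, weight, row_type, *values in rows:
        c = clusters.setdefault(cluster,
                                {"type": ctype, "freq": freq, "weight": weight})
        c["mean" if row_type == "MEAN" else "var"] = list(values)
    return clusters
```

This pairing is what allows an OUTSTAT= data set to be fed back in through SEED= with INIT=EMCLUS: each cluster's weight, mean vector, and diagonal variance vector are fully recoverable from its two rows.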
VAR Statement

VAR variable(s);

variable(s)
    Specifies which variables are to be used. If this statement is omitted, all variables from the input data set are used.
INITCLUS Statement

INITCLUS integer(s);

integer(s)
    Specifies which clusters from PROC FASTCLUS or PROC EMCLUS are to be used as initial estimates for PROC EMCLUS. For example, if the output from PROC FASTCLUS shows that only clusters 2, 4-8, and 10 have good results, you could use the statement

    INITCLUS 2, 4 TO 8, 10;

    to use only those cluster estimates as initial estimates in PROC EMCLUS. The number of clusters specified in the INITCLUS statement should not exceed the number of clusters specified with the CLUSTERS= option in the PROC EMCLUS statement. With INIT=EMCLUS, the default is that all the clusters are used as initial estimates; with INIT=FASTCLUS, the default is that the clusters with the highest frequency counts are used.
Output from PROC EMCLUS

The beginning of the output shows the initial model parameter estimates. Next, the estimated model parameters, sample means, and sample variances for the active primary clusters are displayed. Inactive clusters are shown with missing values. The sample mean and variance are calculated from the observations that are summarized in the primary clusters.

In the cluster summary table, the following statistics are listed:

Current Frequency
    The number of observations summarized in a cluster during the current iteration.
Total Frequency
    The cumulative sum of the current frequencies for each cluster.
Proportion of Data Summarized
    The total frequency divided by Obs read in.
Nearest Cluster
    The closest primary cluster to a primary cluster, based on the Euclidean distance between the estimated means of the two primary clusters.
Distance
    The Euclidean distance from a primary cluster to its nearest cluster.

The iteration summary table displays:

Log-likelihood
    The average log-likelihood over all the observations that have been read in.
Obs read in this iteration
    The number of observations read in at the current iteration.
Obs read in
    The cumulative sum of observations read in.
Current Summarized
    The sum of the current frequencies across the primary clusters.
Total Summarized
    The sum of the total frequencies across the primary clusters.
Proportion Summarized
    Total Summarized divided by Obs read in.

If there are secondary clusters, the sample mean, sample variance, and the number of observations in each secondary cluster are also displayed after the iteration summary table.

Note: The estimated variance parameter for each variable is bounded from below by the value
(var)*(eps), where var is the sample variance of that variable obtained from the observations read in at the first iteration, and eps is the stopping tolerance specified by the EPS= option.

Both the standard and the scaled EM algorithm are sometimes slow to converge; however, the scaled EM algorithm generally runs faster than the standard EM algorithm. Convergence may be sped up by increasing P and/or EPS, or by using the CLEAR option. Changing these values may alter the parameter estimates.
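The Nearest Cluster and Distance statistics, and the variance floor just noted, can be sketched as follows. Both function names are hypothetical; this illustrates the definitions, not the procedure's code.

```python
import math

def nearest_clusters(means):
    # For each primary cluster, the 'Nearest Cluster' and 'Distance'
    # statistics: the closest other primary cluster by Euclidean
    # distance between estimated means, and that distance.
    out = []
    for i, u in enumerate(means):
        j = min((h for h in range(len(means)) if h != i),
                key=lambda h: math.dist(u, means[h]))
        out.append((j, math.dist(u, means[j])))
    return out

def floor_variance(v_est, first_iter_var, eps):
    # Each estimated variance is bounded below by (var)*(eps), where var
    # is the first-iteration sample variance of that variable and eps is
    # the stopping tolerance.
    return max(v_est, first_iter_var * eps)
```

Two clusters whose means sit close together by this measure are candidates for being one true cluster split in two, which is worth checking when interpreting the cluster summary table.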
Examples

Example 1: Syntax for PROC FASTCLUS
Example 2: Use of the EMCLUS Procedure
Example 1: Syntax for PROC FASTCLUS

PROC FASTCLUS DATA=<libref.>SAS-data-set
    OUTSEEDS=libref.SAS-data-set
    MAXCLUSTERS=positive integer;
Example 2: Use of the EMCLUS Procedure

PROC FASTCLUS returns a portion of the total output, summarized in the following table:

Cluster    Frequency    RMS Std. Deviation

Clusters 1, 3, and 4 should be used as initial estimates for PROC EMCLUS because of their high frequency counts. Cluster 1 may actually be a group of clusters because of its high RMS standard deviation. Therefore, the PROC EMCLUS syntax could look like:

PROC EMCLUS DATA=<libref.>SAS-data-set
    CLUSTERS=5
    INIT=FASTCLUS
    SEED=<the data set from the OUTSEEDS= option in PROC FASTCLUS>
    INITSTD=50.0;
INITCLUS 1, 3, 4;
run;

Note that CLUSTERS is set to 5, but any integer greater than or equal to 3 is appropriate, since three clusters are specified in the INITCLUS statement. Also, INITSTD could have been set to any number less than cluster 1's RMS standard deviation and greater than 4.05.
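The selection reasoning in this example can be sketched as a simple filter over the PROC FASTCLUS summary. The function name, thresholds, and the frequency/RMS values in the usage below are hypothetical stand-ins (except 4.05, which appears in the example), since the example's table values are not reproduced here.

```python
def choose_fastclus_seeds(summary, min_freq, rms_cut):
    # summary rows: (cluster id, frequency, RMS std deviation).
    # Keep high-frequency clusters as initial estimates, and flag kept
    # clusters whose RMS std is large: they may be groups of clusters,
    # so INITSTD should be set below their RMS std.
    keep = [row for row in summary if row[1] >= min_freq]
    flagged = [c for c, f, r in keep if r > rms_cut]
    return [c for c, f, r in keep], flagged
```

Applied to a hypothetical summary mirroring the example, clusters 1, 3, and 4 would be kept, with cluster 1 flagged as a possible group of clusters.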
More informationChapter 1. Looking at Data-Distribution
Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw
More informationClustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin
Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014
More informationA User Manual for the Multivariate MLE Tool. Before running the main multivariate program saved in the SAS file Part2-Main.sas,
A User Manual for the Multivariate MLE Tool Before running the main multivariate program saved in the SAS file Part-Main.sas, the user must first compile the macros defined in the SAS file Part-Macros.sas
More informationCOMP5318 Knowledge Management & Data Mining Assignment 1
COMP538 Knowledge Management & Data Mining Assignment Enoch Lau SID 20045765 7 May 2007 Abstract 5.5 Scalability............... 5 Clustering is a fundamental task in data mining that aims to place similar
More informationClustering. Partition unlabeled examples into disjoint subsets of clusters, such that:
Text Clustering 1 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationClustering Part 2. A Partitional Clustering
Universit of Florida CISE department Gator Engineering Clustering Part Dr. Sanja Ranka Professor Computer and Information Science and Engineering Universit of Florida, Gainesville Universit of Florida
More information9.1. K-means Clustering
424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific
More informationChapter 6: DESCRIPTIVE STATISTICS
Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling
More informationUnsupervised Learning Hierarchical Methods
Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach
More informationPCOMP http://127.0.0.1:55825/help/topic/com.rsi.idl.doc.core/pcomp... IDL API Reference Guides > IDL Reference Guide > Part I: IDL Command Reference > Routines: P PCOMP Syntax Return Value Arguments Keywords
More informationClustering: Overview and K-means algorithm
Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin
More information10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors
Dejan Sarka Anomaly Detection Sponsors About me SQL Server MVP (17 years) and MCT (20 years) 25 years working with SQL Server Authoring 16 th book Authoring many courses, articles Agenda Introduction Simple
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationBasic Commands. Consider the data set: {15, 22, 32, 31, 52, 41, 11}
Entering Data: Basic Commands Consider the data set: {15, 22, 32, 31, 52, 41, 11} Data is stored in Lists on the calculator. Locate and press the STAT button on the calculator. Choose EDIT. The calculator
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationCluster Analysis for Microarray Data
Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationThe DMSPLIT Procedure
The DMSPLIT Procedure The DMSPLIT Procedure Overview Procedure Syntax PROC DMSPLIT Statement FREQ Statement TARGET Statement VARIABLE Statement WEIGHT Statement Details Examples Example 1: Creating a Decision
More informationAnalyzing Genomic Data with NOJAH
Analyzing Genomic Data with NOJAH TAB A) GENOME WIDE ANALYSIS Step 1: Select the example dataset or upload your own. Two example datasets are available. Genome-Wide TCGA-BRCA Expression datasets and CoMMpass
More informationMidterm Examination CS540-2: Introduction to Artificial Intelligence
Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search
More informationCluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]
Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of
More informationUnsupervised Learning and Data Mining
Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Supervised Learning ó Decision trees ó Artificial neural nets ó K-nearest neighbor ó Support vectors ó Linear regression
More informationWorkload Characterization Techniques
Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/
More informationCHAPTER 2: SAMPLING AND DATA
CHAPTER 2: SAMPLING AND DATA This presentation is based on material and graphs from Open Stax and is copyrighted by Open Stax and Georgia Highlands College. OUTLINE 2.1 Stem-and-Leaf Graphs (Stemplots),
More informationCHAPTER THREE THE DISTANCE FUNCTION APPROACH
50 CHAPTER THREE THE DISTANCE FUNCTION APPROACH 51 3.1 INTRODUCTION Poverty is a multi-dimensional phenomenon with several dimensions. Many dimensions are divided into several attributes. An example of
More informationCS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016
CS 231A CA Session: Problem Set 4 Review Kevin Chen May 13, 2016 PS4 Outline Problem 1: Viewpoint estimation Problem 2: Segmentation Meanshift segmentation Normalized cut Problem 1: Viewpoint Estimation
More informationPSS718 - Data Mining
Lecture 5 - Hacettepe University October 23, 2016 Data Issues Improving the performance of a model To improve the performance of a model, we mostly improve the data Source additional data Clean up the
More informationMining Social Network Graphs
Mining Social Network Graphs Analysis of Large Graphs: Community Detection Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com Note to other teachers and users of these slides: We would be
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationData Clustering. Danushka Bollegala
Data Clustering Danushka Bollegala Outline Why cluster data? Clustering as unsupervised learning Clustering algorithms k-means, k-medoids agglomerative clustering Brown s clustering Spectral clustering
More informationData Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski
Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationMachine Learning: Symbol-based
10d Machine Learning: Symbol-based More clustering examples 10.5 Knowledge and Learning 10.6 Unsupervised Learning 10.7 Reinforcement Learning 10.8 Epilogue and References 10.9 Exercises Additional references
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationSTP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES
STP 6 ELEMENTARY STATISTICS NOTES PART - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES Chapter covered organizing data into tables, and summarizing data with graphical displays. We will now use
More information