Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907

The Game of Clustering

Abstract

Clustering is a technique for discovering patterns and structure in data. Often, the most difficult part of this process is determining the optimal number of patterns or clusters. In this paper, we introduce a novel approach to clustering that treats the clusters as "players" in a game. During this "clustering game", players compete for ownership of data points according to a set of game rules. We describe a number of possible rule sets, and conduct an initial evaluation to ascertain which are suitable for the clustering problem. This is followed by a comparison of different game rules on four characteristic data sets. Results are promising, showing that the game framework can successfully find optimal, or near optimal, solutions, without knowing the number of clusters in advance! An analysis of these results highlights possible extensions to the game strategies.

1 The Clustering Problem

Clustering is the grouping of similar objects. The aim is to divide n objects, each represented by a point in p-dimensional space, into clusters that reflect the underlying structure of the data. This problem is NP-hard, even when the number of clusters is known [1]. Numerous algorithms, both deterministic and stochastic, have been developed for the clustering problem [2, 3, 4]. However, these techniques do not guarantee optimality, and the majority need the number of clusters a priori. Even when an algorithm can determine the number of clusters, this is often external to the clustering procedure (the process is repeated for differing values of k, and the optimal clustering selected). We propose a new approach to clustering that treats the process as a game. The players in this game are clusters, which compete amongst themselves for ownership of objects according to a set of rules or strategies. These rules determine how the game is initialised, how turns are allocated, which objects are attacked, how attacks are resolved, and finally, when to stop the game. In this paper, we explore the game approach and investigate which strategies are appropriate for specific data sets.

2 The Game of Clustering

The "clustering game" starts by randomly distributing the objects between an initial number of clusters. In turn, each player may attack any object they do not own; the object is defended by the cluster that owns it. If successful, the attacking cluster gains the object. If unsuccessful, ownership of the object does not change. Players are eliminated from the game when they lose their last object. The game concludes when the stopping criterion is met. We define a game by the following strategies: initialisation (the number of starting players, and the initial distribution of objects), attack allocation (how attacks are assigned to players), object selection (which object a given cluster should attack), attack resolution (whether the attack is successful), and a stopping criterion (when to end the game).

2.1 Initialisation

Given n objects, a simple strategy is to start with n players, each owning a single object. Alternatively, we can randomly distribute the objects between a smaller number of clusters. For the latter, it is important that the starting number of clusters is sufficiently high, or that there is some method of introducing new players, to counter the attrition of cluster numbers.

2.2 Attack Allocation

We define an attack as an attempt, by one cluster, to gain an object belonging to another cluster. A sequence of attacks by a single player is known as a turn. Turns are allocated to players according to their starting order; a set of turns (one for each player) forms a round. There are several ways of allocating attacks during a game. The single attack strategy limits each player to one attack per turn, whereas proportional allocation assigns attacks in proportion to the number of objects each cluster owns. Continue while winning allocates an extra attack to the current player for each successful attack; this can be used in conjunction with either of the above schemes.

2.3 Object Selection

This strategy identifies a victim for the attacking cluster. The random strategy chooses any point, provided it does not belong to the attacking cluster; probabilistic nearest neighbour selects closer objects with higher probability. The probabilities for the latter scheme can be calculated as a decreasing geometric series. These values can also be scaled relative to the defending cluster size, so that attacks are focused on smaller players.

2.4 Attack Resolution

When a cluster attacks an object, the resolution strategy determines whether the attack is successful. First, two "strengths" are calculated: one for the attacking cluster, and one for the defending cluster. These values quantify the plausibility of each cluster owning the object. Then, ownership of the object is changed with a probability based on these values:

P(successful attack) = Strength_attacker / (Strength_attacker + Strength_defender)

We define five strength measures for our clustering game.
These are based on the following assumptions: objects close to a cluster's centroid should probably belong to that cluster, strength should rapidly decrease as the distance from the cluster centroid increases, and the shape and spread of the attacking (or defending) cluster should have some influence on the strength.

The distance measure is a function of the distance between the attacked object and the relevant cluster centroid:

Strength_attacker = exp(1 - d(x_attacker, x_object))

where x_attacker is the centroid of the attacking cluster, x_object gives the object's coordinates, and d(x_attacker, x_object) is the Euclidean distance between the attacking cluster's centroid and the object. The defender's strength is calculated in a similar manner.

The relative distance strength removes the attacked object from the defending cluster before calculating the distance to the cluster centroid:

Strength_attacker = e - exp(d(x_attacker, x_object) / d_max)
Strength_defender = e - exp(d(x_defender-object, x_object) / d_max)

Here d_max is the maximum distance between any two points in the data set, and x_defender-object is the centroid of the defending cluster calculated as if the attacked object were not a cluster member.

Sum of squared distances within is based on the sum of the squared distances between cluster centroids and the clusters' objects. If the sum of the attacking and defending clusters' squared distances in the current arrangement is smaller than when the object is placed with the attacker, we want the probability of a successful attack to be low (as re-assigning the object would increase the overall spread of the two clusters). Thus, the strengths are calculated as follows:

Strength_attacker = Σ_{i in attacker} d(x_attacker, x_i)^2 + Σ_{i in defender} d(x_defender, x_i)^2
Strength_defender = Σ_{i in attacker+object} d(x_attacker+object, x_i)^2 + Σ_{i in defender-object} d(x_defender-object, x_i)^2

Multivariate normal density strength assumes that the cluster has a multivariate normal distribution (with the cluster's centroid and covariance values), and returns the density at the object's position:

Strength_attacker = 1 / ((2π)^(p/2) |Σ_attacker|^(1/2)) exp(-(x_object - x_attacker)^T Σ_attacker^(-1) (x_object - x_attacker) / 2)

where p is the number of dimensions and Σ_attacker is the covariance matrix for the attacking cluster. Figure 2-1 shows the multivariate normal density plot for the four cluster Ruspini [3] data set.

Figure 2-1: Multivariate density interpretation of the Ruspini data

Our final strength measure is based on gravitational attraction:

Strength_attacker = m_attacker m_object / d(x_attacker, x_object)^2

where m_attacker and m_object are the masses of the attacking cluster and the attacked object, respectively. We assign each object a mass of one unit, giving the attacking cluster a mass equal to its number of objects, and the attacked object a mass of one. The defence strength is calculated in a similar fashion.

We always want a cluster to have some probability of winning an object, so both the attacking and defending strengths must be non-zero. Additionally, some of the described functions may return undefined values. For example, if the defending cluster has only one object (which must then be the attacked object), the distance between this object and its cluster centroid is zero, so the gravitational strength would be infinite. The multivariate density strength is undefined if the cluster's covariance matrix is singular. We avoid undefined values by setting a default distance value, and by adapting the covariance information we do have. The default distance value is taken as the minimum non-zero distance between any two points in the data set. Unknown covariance values are calculated as the average of known values.

2.5 Stopping Criterion

The game is limited to a maximum number of rounds. However, this is not necessarily the best place to stop. Obviously, the game should stop when only one cluster remains. We also want the game to end when a stable clustering is found; we assume stability is reached when there are no successful attacks for several rounds of play.

3 Rule Set Evaluation

To judge the performance of our clustering game, we evaluate our algorithm over five sample data sets and compare the results to the known solutions. Figure 3-1 contains the five two-dimensional data sets we used; the correct clusters are shown in each case. The Ruspini [3] data comprises 75 objects arranged into four clusters. The other four data sets were designed to test the abilities of our algorithm. Sparse is a simple data set (25 objects, seven clusters) with well separated, small clusters; Dense has a greater number of objects (98) that form five compact clusters; Elliptical contains 75 objects divided between four clusters with elongated shapes; and Unstructured has 17 evenly spread objects.

Correctness is calculated as the percentage of objects correctly clustered with respect to the structure of the data set. If the number of clusters found exceeds that of the known solution, k, only objects in the best k clusters count toward the correctness. Testing started with a comparison of various rule sets on the Ruspini data, the aim being to find a few strategy combinations that produced acceptable results. We then selected several of these combinations and tested them on the remaining data sets to determine any limitations of the strategies. Each clustering game started with the objects divided randomly between ten clusters. In our experiments, we tested different combinations of attack allocation, object selection, and strength mechanism strategies.
In each case, the game was stopped after ten rounds of unsuccessful attacks, or upon reaching the maximum number of rounds. All experiments were performed on a Sun Sparc 5 workstation running SunOS under normal load. A non-linear additive feedback random number generator was used to return successive pseudo-random numbers in the range 0 to 2^31 - 1. The program was written in C and compiled using the GNU gcc compiler with maximum optimization options.

Figure 3-1: Experimental data sets: (a) Ruspini, (b) Sparse, (c) Dense, (d) Elliptical, (e) Unstructured

4 Results

The first rule set we tested was: random object selection, single attack allocation, and distance strength. Figure 4-1 shows the final clustering of the Ruspini data for these strategies. The objects of each of the correct clusters are shared between several players, and none of the initial players have been eliminated from the game. This rule set attained a correctness of 53.3% after 181 attacks. Changing the selection strategy to probabilistic nearest neighbour improved the correctness and decreased the execution time; the game ended after 478 attacks with a correctness of 7.7% (Figure 4-2). This strategy also retained the ten initial players. Figure 4-3 shows the correctness of the two selection schemes over the course of the game.

Next we compared the attack allocation strategies (Figure 4-4). The single attack strategy's final correctness was 7.7% after 478 turns, proportional allocation reached 65.3% correctness, and continue while winning (when starting with a single attack) was the most successful strategy, with a final correctness of 76.% after only 361 attacks. All ten players finished each of these games.

Figures 4-5 through 4-9 contain the final clusterings and the correctness plots for the five strength measures, and Table 4-1 summarises their performance. The games with the distance and multivariate normal density strengths both finished with all ten clusters (Figures 4-5, 4-8); in each case single clusters from the correct solution were shared between a number of players. These games ended after a small number of attacks. The relative distance and sum of squared distances within strengths resulted in too few clusters (Figures 4-6 and 4-7). The correctness plots show that both games did find the correct, or near correct, clustering, but the game continued (and the clustering worsened) after this point. The game with gravitational strength clustered the Ruspini data set correctly (Figure 4-9). At this stage we selected the distance, multivariate normal density, and gravitational strength measures as strategies of interest, and compared their performance on the remaining four data sets (Figures 4-10 through 4-13). Summaries of the clustering performance for these data sets are presented in Tables 4-2 through 4-5.

Figure 4-1: Clustering of Ruspini data using random object selection, single attack turns, and distance strength

Table 4-1: Strength measure performance for Ruspini data, with probabilistic nearest neighbour selection and single attack while winning allocation (rows: distance; relative distance; sum squared distances; multivariate normal density; gravity; columns: clusters found, correctness (%))

None of the games were able to cluster the Sparse data correctly, the main problem being the allocation of the three single-point clusters. Both the distance and density measures added these to nearby groups; the game using gravity strength did this for only one object (Figure 4-10). All of the clustering games performed well on the Dense data set (Figure 4-11); in fact, the gravity strength measure produced a perfect clustering, while the other measures found too many clusters. None of the strength measures was sufficient to correctly cluster the Elliptical data; there were too many clusters in all cases (Figure 4-12). The distance clustering divides the elliptical clusters into smaller spherical shapes, the multivariate density strength has overlapping clusters, and the gravity clustering has a few oddly clustered points. Our clustering game successfully clustered the Unstructured data set with both the distance and the gravitational strength measures (Figure 4-13); the number of attacks for the gravitational measure was considerably less than that for the distance strength. The multivariate density strength divided the objects between seven clusters.

Table 4-2: Strength measure performance for Sparse data with probabilistic nearest neighbour selection and single attack while winning allocation

Finally, we timed our clustering game for the Ruspini data, using the distance strength, single attack while winning turn allocation, and probabilistic nearest neighbour selection strategies. The game clustered the data set in an average of 6.69 seconds of CPU time (standard deviation 2.74); the average number of attacks was 485 (standard deviation 1683).

Figure 4-2: Clustering of Ruspini data using probabilistic nearest neighbour selection, with single attack turns and distance strength

Figure 4-3: Correctness versus number of attacks for random and probabilistic nearest neighbour selection, with single attack turns and distance strength, on Ruspini data

Table 4-3: Strength measure performance for Dense data with probabilistic nearest neighbour selection and single attack while winning allocation

5 Discussion

The performance of our clustering game is highly dependent on the choice of strength measure. Gravitational strength clustered three of the five data sets perfectly; one of the remaining sets had only one point clustered incorrectly. Both the relative distance and the sum of squared distances within measures were too aggressive, eliminating too many players from the game (and thus finding too few clusters in the data set). The games using these strengths found near perfect solutions earlier in their play. In contrast, the distance and multivariate normal density measures were not aggressive enough, and a number of players shared objects belonging to a single correct cluster. A strength cooling factor that reduces the strength of attacks as the game progresses, similar to the cooling schedule in simulated annealing, may help reduce the aggressiveness of certain strength measures. Another approach may be to introduce a method of dividing clusters when they become too large or too spread out. The non-aggressive measures may benefit from the addition of a merging step, which can join two clusters in a single attack step. This should be based on the relative position and spread of the two clusters.

Table 4-4: Strength measure performance for Elliptical data with probabilistic nearest neighbour selection and single attack while winning allocation

Figure 4-4: Correctness versus number of attacks for the single, proportional, and continue while winning attack allocation strategies, using probabilistic nearest neighbour selection and distance strength on the Ruspini data

Table 4-5: Strength measure performance for Unstructured data with probabilistic nearest neighbour selection and single attack while winning allocation

Of the strength measures, gravitational attraction was clearly the best, resulting in optimal, or near optimal, clusterings in all cases. As expected, the one data set this measure had difficulty with was the elliptical set, as the gravitational measure does not account for cluster shape. The distance strength clustered the unstructured data correctly, but was unable to find the correct number of clusters in the other sets. Games using the multivariate density measure allowed clusters to be elliptical in shape and to overlap; however, under this scheme the strength of a cluster rapidly approaches zero as we move away from the cluster centroid. This means the probability of a successful attack is low, and the game tends to divide the objects between too many clusters. The execution time of our clustering game was reduced by probabilistic nearest neighbour selection and by continue while winning turn allocation.

6 Conclusions

A novel game-based technique has been proposed for finding clusters in data. This approach treats clusters as players in a game, competing for object ownership. We found that this technique can successfully discover optimal, or near optimal, solutions for a range of data sets. However, the performance of the clustering game is highly dependent on the choice of game rules, particularly the strength measure used to resolve object ownership. The major advantage of this approach is that it offers a "natural" method of discovering the correct number of clusters in the data: the elimination of players during the game. The clustering game also recognises unstructured data as such. This method is a fast, simple, and, given suitable game strategies, accurate technique for recognising structure in data.

Figure 4-5: Performance of clustering game using distance strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Bibliography

1. Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, San Francisco, 1979.
2. John A. Hartigan. Clustering Algorithms. John Wiley and Sons, 1975.
3. Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, Inc., 1990.
4. Brian S. Everitt. Cluster Analysis. Halsted Press, third edition, 1993.

Acknowledgements

We thank our supervisors, Dr. Nick Spadaccini and Dr. Lyndon While, for their helpful suggestions.

Figure 4-6: Performance of clustering game using relative distance strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Figure 4-7: Performance of clustering game using sum of squared distances within strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Figure 4-8: Performance of clustering game using multivariate normal density strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Figure 4-9: Performance of clustering game using gravitational strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Figure 4-10: Performance of strength measures for Sparse data, with probabilistic nearest neighbour selection and single attack while winning ((a) distance; (b) multivariate normal density; (c) gravitational)

Figure 4-11: Performance of strength measures for Dense data, with probabilistic nearest neighbour selection and single attack while winning ((a) distance; (b) multivariate normal density; (c) gravitational)

Figure 4-12: Performance of strength measures for Elliptical data, with probabilistic nearest neighbour selection and single attack while winning ((a) distance; (b) multivariate normal density; (c) gravitational)

Figure 4-13: Performance of strength measures for Unstructured data, with probabilistic nearest neighbour selection and single attack while winning ((a) distance; (b) multivariate normal density; (c) gravitational)


Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

of Perceptron. Perceptron CPU Seconds CPU Seconds Per Trial

of Perceptron. Perceptron CPU Seconds CPU Seconds Per Trial Accelerated Learning on the Connection Machine Diane J. Cook Lawrence B. Holder University of Illinois Beckman Institute 405 North Mathews, Urbana, IL 61801 Abstract The complexity of most machine learning

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2002 Paper 105 A New Partitioning Around Medoids Algorithm Mark J. van der Laan Katherine S. Pollard

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

Classifier C-Net. 2D Projected Images of 3D Objects. 2D Projected Images of 3D Objects. Model I. Model II

Classifier C-Net. 2D Projected Images of 3D Objects. 2D Projected Images of 3D Objects. Model I. Model II Advances in Neural Information Processing Systems 7. (99) The MIT Press, Cambridge, MA. pp.949-96 Unsupervised Classication of 3D Objects from D Views Satoshi Suzuki Hiroshi Ando ATR Human Information

More information

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented which, for a large-dimensional exponential family G,

More information

2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m.

2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m. Cluster Analysis The main aim of cluster analysis is to find a group structure for all the cases in a sample of data such that all those which are in a particular group (cluster) are relatively similar

More information

Week 7 Picturing Network. Vahe and Bethany

Week 7 Picturing Network. Vahe and Bethany Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups

More information

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION 1 M.S.Rekha, 2 S.G.Nawaz 1 PG SCALOR, CSE, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY 2 ASSOCIATE PROFESSOR, SRI KRISHNADEVARAYA

More information

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract Clustering Sequences with Hidden Markov Models Padhraic Smyth Information and Computer Science University of California, Irvine CA 92697-3425 smyth@ics.uci.edu Abstract This paper discusses a probabilistic

More information

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California Two-Dimensional Visualization for Internet Resource Discovery Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California 90089-0781 fshli, danzigg@cs.usc.edu

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

11/14/2010 Intelligent Systems and Soft Computing 1

11/14/2010 Intelligent Systems and Soft Computing 1 Lecture 8 Artificial neural networks: Unsupervised learning Introduction Hebbian learning Generalised Hebbian learning algorithm Competitive learning Self-organising computational map: Kohonen network

More information

BMVC 1996 doi: /c.10.41

BMVC 1996 doi: /c.10.41 On the use of the 1D Boolean model for the description of binary textures M Petrou, M Arrigo and J A Vons Dept. of Electronic and Electrical Engineering, University of Surrey, Guildford GU2 5XH, United

More information

Powered Outer Probabilistic Clustering

Powered Outer Probabilistic Clustering Proceedings of the World Congress on Engineering and Computer Science 217 Vol I WCECS 217, October 2-27, 217, San Francisco, USA Powered Outer Probabilistic Clustering Peter Taraba Abstract Clustering

More information

An Agent-Based Adaptation of Friendship Games: Observations on Network Topologies

An Agent-Based Adaptation of Friendship Games: Observations on Network Topologies An Agent-Based Adaptation of Friendship Games: Observations on Network Topologies David S. Dixon University of New Mexico, Albuquerque NM 87131, USA Abstract. A friendship game in game theory is a network

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM CHAPTER 4 K-MEANS AND UCAM CLUSTERING 4.1 Introduction ALGORITHM Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

The Application of K-medoids and PAM to the Clustering of Rules

The Application of K-medoids and PAM to the Clustering of Rules The Application of K-medoids and PAM to the Clustering of Rules A. P. Reynolds, G. Richards, and V. J. Rayward-Smith School of Computing Sciences, University of East Anglia, Norwich Abstract. Earlier research

More information

Forestry Applied Multivariate Statistics. Cluster Analysis

Forestry Applied Multivariate Statistics. Cluster Analysis 1 Forestry 531 -- Applied Multivariate Statistics Cluster Analysis Purpose: To group similar entities together based on their attributes. Entities can be variables or observations. [illustration in Class]

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Bumptrees for Efficient Function, Constraint, and Classification Learning

Bumptrees for Efficient Function, Constraint, and Classification Learning umptrees for Efficient Function, Constraint, and Classification Learning Stephen M. Omohundro International Computer Science Institute 1947 Center Street, Suite 600 erkeley, California 94704 Abstract A

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Football result prediction using simple classification algorithms, a comparison between k-nearest Neighbor and Linear Regression

Football result prediction using simple classification algorithms, a comparison between k-nearest Neighbor and Linear Regression EXAMENSARBETE INOM TEKNIK, GRUNDNIVÅ, 15 HP STOCKHOLM, SVERIGE 2016 Football result prediction using simple classification algorithms, a comparison between k-nearest Neighbor and Linear Regression PIERRE

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Automated Clustering-Based Workload Characterization

Automated Clustering-Based Workload Characterization Automated Clustering-Based Worload Characterization Odysseas I. Pentaalos Daniel A. MenascŽ Yelena Yesha Code 930.5 Dept. of CS Dept. of EE and CS NASA GSFC Greenbelt MD 2077 George Mason University Fairfax

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information

Inital Starting Point Analysis for K-Means Clustering: A Case Study

Inital Starting Point Analysis for K-Means Clustering: A Case Study lemson University TigerPrints Publications School of omputing 3-26 Inital Starting Point Analysis for K-Means lustering: A ase Study Amy Apon lemson University, aapon@clemson.edu Frank Robinson Vanderbilt

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD CAR-TR-728 CS-TR-3326 UMIACS-TR-94-92 Samir Khuller Department of Computer Science Institute for Advanced Computer Studies University of Maryland College Park, MD 20742-3255 Localization in Graphs Azriel

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Satisfactory Peening Intensity Curves

Satisfactory Peening Intensity Curves academic study Prof. Dr. David Kirk Coventry University, U.K. Satisfactory Peening Intensity Curves INTRODUCTION Obtaining satisfactory peening intensity curves is a basic priority. Such curves will: 1

More information

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data EUSFLAT-LFA 2011 July 2011 Aix-les-Bains, France Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data Ludmila Himmelspach 1 Daniel Hommers 1 Stefan Conrad 1 1 Institute of Computer Science,

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

Lower Bounds for Insertion Methods for TSP. Yossi Azar. Abstract. optimal tour. The lower bound holds even in the Euclidean Plane.

Lower Bounds for Insertion Methods for TSP. Yossi Azar. Abstract. optimal tour. The lower bound holds even in the Euclidean Plane. Lower Bounds for Insertion Methods for TSP Yossi Azar Abstract We show that the random insertion method for the traveling salesman problem (TSP) may produce a tour (log log n= log log log n) times longer

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION AN EMPIRICAL STUDY OF PERSPECTIVE-BASED USABILITY INSPECTION Zhijun Zhang, Victor Basili, and Ben Shneiderman Department of Computer Science University of Maryland College Park, MD 20742, USA fzzj, basili,

More information

Worst-case running time for RANDOMIZED-SELECT

Worst-case running time for RANDOMIZED-SELECT Worst-case running time for RANDOMIZED-SELECT is ), even to nd the minimum The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty

Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty Michael Comstock December 6, 2012 1 Introduction This paper presents a comparison of two different machine learning systems

More information

COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017.

COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017. COSC 6339 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 217 Clustering Clustering is a technique for finding similarity groups in data, called

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Ben Karsin University of Hawaii at Manoa Information and Computer Science ICS 63 Machine Learning Fall 8 Introduction

More information

the number of states must be set in advance, i.e. the structure of the model is not t to the data, but given a priori the algorithm converges to a loc

the number of states must be set in advance, i.e. the structure of the model is not t to the data, but given a priori the algorithm converges to a loc Clustering Time Series with Hidden Markov Models and Dynamic Time Warping Tim Oates, Laura Firoiu and Paul R. Cohen Computer Science Department, LGRC University of Massachusetts, Box 34610 Amherst, MA

More information

Computer Experiments: Space Filling Design and Gaussian Process Modeling

Computer Experiments: Space Filling Design and Gaussian Process Modeling Computer Experiments: Space Filling Design and Gaussian Process Modeling Best Practice Authored by: Cory Natoli Sarah Burke, Ph.D. 30 March 2018 The goal of the STAT COE is to assist in developing rigorous,

More information

Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter

Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter A. Strandlie and J. Wroldsen Gjøvik University College, Norway Outline Introduction A Gaussian-sum

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Clustering Algorithms Contents K-means Hierarchical algorithms Linkage functions Vector quantization SOM Clustering Formulation

More information

2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an

2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an Competitive Neural Trees for Vector Quantization Sven Behnke and Nicolaos B. Karayiannis Department of Mathematics Department of Electrical and Computer Science and Computer Engineering Martin-Luther-University

More information

Clustering Using Elements of Information Theory

Clustering Using Elements of Information Theory Clustering Using Elements of Information Theory Daniel de Araújo 1,2, Adrião Dória Neto 2, Jorge Melo 2, and Allan Martins 2 1 Federal Rural University of Semi-Árido, Campus Angicos, Angicos/RN, Brasil

More information

Rearrangement of DNA fragments: a branch-and-cut algorithm Abstract. In this paper we consider a problem that arises in the process of reconstruction

Rearrangement of DNA fragments: a branch-and-cut algorithm Abstract. In this paper we consider a problem that arises in the process of reconstruction Rearrangement of DNA fragments: a branch-and-cut algorithm 1 C. E. Ferreira 1 C. C. de Souza 2 Y. Wakabayashi 1 1 Instituto de Mat. e Estatstica 2 Instituto de Computac~ao Universidade de S~ao Paulo e-mail:

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Isabelle Guyon Notes written by: Johann Leithon. Introduction The process of Machine Learning consist of having a big training data base, which is the input to some learning

More information

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015.

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015. COSC 6397 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 215 Clustering Clustering is a technique for finding similarity groups in data, called

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Condence Intervals about a Single Parameter:

Condence Intervals about a Single Parameter: Chapter 9 Condence Intervals about a Single Parameter: 9.1 About a Population Mean, known Denition 9.1.1 A point estimate of a parameter is the value of a statistic that estimates the value of the parameter.

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction A Monte Carlo method is a compuational method that uses random numbers to compute (estimate) some quantity of interest. Very often the quantity we want to compute is the mean of

More information

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices Syracuse University SURFACE School of Information Studies: Faculty Scholarship School of Information Studies (ischool) 12-2002 Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework IEEE SIGNAL PROCESSING LETTERS, VOL. XX, NO. XX, XXX 23 An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework Ji Won Yoon arxiv:37.99v [cs.lg] 3 Jul 23 Abstract In order to cluster

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information