HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm

R. A. Ahmed, B. Borah, D. K. Bhattacharyya
Department of Computer Science and Information Technology, Tezpur University, Napam, Tezpur, Assam, India
{bgb,dkb}@tezu.ernet.in

J. K. Kalita
Department of Computer Science, University of Colorado, Colorado Springs, CO, USA
jkalita@uccs.edu

October 10, 2005

Abstract

Clustering is an important data mining technique. There are many algorithms that cluster either numeric or categorical data, but few algorithms cluster mixed-type data sets with both numerical and categorical attributes. In this paper, we propose a similarity measure between two clusters that enables hierarchical clustering of data with numerical and categorical attributes. This similarity measure is derived from a frequency vector of attribute values in a cluster. Experimental results establish that our algorithm produces good-quality clusters.

Key words: Mixed type of attributes, clustering, frequency vector of attributes, similarity of clusters.

1 Introduction

Data mining has become increasingly important in recent years. In data mining, clustering is a useful technique for discovering interesting distributions and patterns in the underlying data. It is a technique for grouping data points such that points within a single group, called a cluster, have similar characteristics (or are close to each other), while points in different groups are dissimilar. Clustering techniques are broadly classified into two categories: partitional and hierarchical. Given a set of objects and a clustering criterion, partitional

clustering obtains a partition of the objects into k clusters such that each cluster contains at least one object and each object belongs to exactly one cluster. The number of clusters to be found, k, is pre-specified. A hierarchical clustering is a nested sequence of partitions. An agglomerative hierarchical clustering starts by placing each object in its own cluster and then merges these atomic clusters into larger clusters, until all objects are in a single cluster. Divisive hierarchical clustering reverses the process by starting with all objects in one cluster and subdividing it into smaller pieces. The goal of clustering is to discover dense and sparse regions in a data set.

Most work in clustering has focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between objects. However, many data sets consist of categorical attributes on which distance functions are not naturally defined. As an example, the cap color attribute of the MUSHROOM data set in the popular UCI Machine Learning Repository [1] can take values from the domain {brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow}. It is hard to reason that one color is like or unlike another color in a way similar to real numbers. An important characteristic of categorical domains is that they typically have a small number of attribute values; categorical attributes with large domain sizes typically do not contain information that is useful for grouping tuples into classes [2].

Real-world databases contain both numeric and categorical attributes, and as a consequence most clustering algorithms cannot be applied to such data. There has been a need for clustering algorithms capable of dealing with data sets containing different types of attributes, numeric or categorical. This paper presents a bottom-up hierarchical clustering algorithm that can cluster mixed-type data.
In the clustering process, two clusters are merged using a similarity measure we define. The similarity measure is based on attribute value frequency counts and works well for data sets containing a mixture of categorical and numerical attributes. It works equally well if the data set contains only numeric or only categorical data. The similarity measure can be computed efficiently, as it requires only one pass through the two clusters whose similarity needs to be computed. We demonstrate the effectiveness of our algorithm using a number of synthetic and real-world data sets.

The rest of the paper is organized as follows. Section 2 gives an overview of related clustering algorithms. We present our new clustering algorithm in Section 3. Section 4 gives the experimental results. Section 5 contains conclusions and directions for future work.

2 Related Work

The k-means [3] clustering technique is very popular for partitioning data sets with numerical attributes. In [4], Ralambondrainy proposed an extended version of the k-means algorithm which converts categorical attributes into binary attributes. This method incurs increased time and space costs if the categorical

attributes have many categories. Further, real values between 0 and 1 representing cluster means do not indicate cluster characteristics. Huang proposed the k-modes algorithm to tackle the problem of clustering large categorical data sets [5, 6]. The k-modes algorithm extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes to minimize the clustering cost. Huang also combined the k-modes algorithm with the k-means algorithm, resulting in the so-called k-prototypes algorithm for clustering objects described by mixed numerical and categorical attributes. These extensions remove the numeric-only limitation of the k-means algorithm and enable it to be used for efficient clustering of very large data sets from real-world databases. However, the k-modes algorithm is unstable due to non-uniqueness of the modes: the clustering results depend strongly on the selection of modes during clustering.

Another clustering algorithm for categorical and mixed data was proposed by Le [7]. The algorithm chooses the k largest sets from the Non-Expandable Strongly Connected sets built using the Breadth First Search algorithm; therefore, it does not depend on selecting points like means or modes. The remaining objects are assigned to clusters by testing the minimum distance of the object from all clusters.

3 Our Algorithm

3.1 Notation

We assume that the set of objects to be clustered is stored in a data set S defined by a set of attributes A_1, ..., A_m with domains D_1, ..., D_m, respectively. Each object in S is represented by a tuple X ∈ D_1 × ... × D_m. We consider only two general data types, namely numeric and categorical; the domains of attributes associated with these two types are called numerical and categorical, respectively. A numerical domain consists of continuous real values.
As such, each numerical data object is considered a point in a multidimensional metric space that uses a distance metric such as the Euclidean or the Manhattan measure. A domain D_i is defined as categorical if it is finite, discrete-valued and not naturally ordered. Logically, each data object X in the data set is represented as a conjunction of attribute-value pairs

    A_1 = x_1 ∧ ... ∧ A_m = x_m    (1)

where x_i ∈ D_i for 1 ≤ i ≤ m. For simplicity, we represent X as a tuple

    X = (x_1, ..., x_m) ∈ D_1 × ... × D_m.    (2)

If all D_i's are categorical domains, then objects in S are called categorical objects. We consider the clustering problem for mixed-type data objects where some domains are numeric, while others are categorical.
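In code, such mixed-type tuples can be represented directly. The following is a minimal sketch; the schema, attribute names and helper function are illustrative choices of ours, not part of the paper:

```python
# Each object is a tuple X = (x_1, ..., x_m); a parallel descriptor records
# which positions are numeric and which are categorical.
ATTR_TYPES = ["categorical", "numeric", "categorical"]  # illustrative schema

dataset = [
    ("brown", 0.42, "yes"),
    ("gray",  0.17, "no"),
    ("brown", 0.35, "yes"),
]

def objects_equal(x, y):
    """X_i = X_k iff the objects agree on every attribute value."""
    return all(a == b for a, b in zip(x, y))
```

Note that, as in the paper, two equal tuples need not be the same real-world object; equality only means component-wise agreement on all m attributes.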

Let S = {X_1, X_2, ..., X_n} be a set of n objects. Object X_i is represented as

    X_i = (x_i1, x_i2, ..., x_im)    (3)

where m is the total number of attributes. We write X_i = X_k if x_ij = x_kj for 1 ≤ j ≤ m. The relation X_i = X_k does not mean that X_i and X_k are the same object in the real-world database; it means the two objects have equal values for the attributes A_1, A_2, ..., A_m. Consider two clusters C_i ⊆ S and C_j ⊆ S with cardinalities |C_i| and |C_j| respectively. We want to measure the similarity between the two clusters C_i and C_j as defined in the following sections.

3.2 Attribute Frequency Vector of a Categorical Attribute

We define the attribute frequency vector AttV_i(A_j) of categorical attribute A_j of a given cluster C_i as

    AttV_i(A_j) = {(d, rf_i(d)) | d ∈ D_j}    (4)

where D_j is the domain of the j-th attribute A_j, and the relative frequency rf_i(d) in cluster C_i is defined as

    rf_i(d) = freq(d) / |C_i|    (5)

where freq(d) is defined as

    freq(d) = Σ_{l=1}^{|C_i|} 1 such that x_lj = d.    (6)

Here, 0 ≤ freq(d) ≤ |C_i| and hence 0 ≤ rf_i(d) ≤ 1.

3.3 Frequency Vector

We then define the frequency vector of a cluster C_i having mixed-type attributes as

    FreqV_i = (V_i1, V_i2, ..., V_im)    (7)

where m is the total number of attributes, and V_ij is defined by

    V_ij = Mean_i(A_j), if A_j is a numerical attribute;
           AttV_i(A_j), if A_j is a categorical attribute.    (8)

For a numerical attribute A_j, the mean of the j-th attribute of cluster C_i is defined as

    Mean_i(A_j) = (1/|C_i|) Σ_{k=1}^{|C_i|} x_kj    (9)

and for a categorical attribute A_j, AttV_i(A_j) is defined by Equation 4. Here 0 ≤ Mean_i(A_j) ≤ 1.
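The frequency vector of Equations 4-9 can be sketched in a few lines of Python. This is an illustration under our own naming choices; numeric values are assumed already normalized to [0, 1] as the paper requires:

```python
from collections import Counter

def frequency_vector(cluster, attr_types, domains):
    """FreqV_i of a cluster: the mean for numeric attributes and the
    relative-frequency map AttV_i for categorical attributes."""
    n = len(cluster)
    freqv = []
    for j, kind in enumerate(attr_types):
        column = [row[j] for row in cluster]
        if kind == "numeric":
            freqv.append(sum(column) / n)                         # Mean_i(A_j)
        else:
            counts = Counter(column)
            freqv.append({d: counts[d] / n for d in domains[j]})  # rf_i(d)
    return freqv

# A cluster with one categorical and one numeric attribute.
cluster = [("a3", 0.2), ("a3", 0.4), ("a4", 0.6)]
fv = frequency_vector(cluster, ["categorical", "numeric"],
                      {0: ["a1", "a2", "a3", "a4"]})
# fv[0] maps every domain value to its relative frequency; fv[1] is the mean.
```

Values absent from the cluster get relative frequency 0, matching the (a1, 0) style entries of the paper's example tables.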

3.4 Similarity of an Attribute of Two Clusters

The similarity of the k-th attribute of two clusters C_i and C_j is S_k(C_i, C_j), defined by

    S_k(C_i, C_j) = 1 − |V_ik − V_jk|, if A_k is numerical;
                    V_ik · V_jk,       if A_k is categorical    (10)

where the attribute frequency vector V_ik is defined by Equation 8, and the vector product for the k-th categorical attribute is defined as

    V_ik · V_jk = Σ_{l=1}^{|D_k|} rf_i(d_l) × rf_j(d_l)    (11)

where the relative frequency rf_i(d_l) is defined by Equation 5. Here, −1 ≤ V_ik − V_jk ≤ 1, so 0 ≤ 1 − |V_ik − V_jk| ≤ 1, and 0 ≤ V_ik · V_jk ≤ 1; hence 0 ≤ S_k(C_i, C_j) ≤ 1. All numerical values are assumed to be within the range [0, 1], with 0 and 1 inclusive.

3.5 Similarity of Clusters

The similarity of two clusters C_i and C_j is defined by

    Sim(C_i, C_j) = Σ_{k=1}^{m} S_k(C_i, C_j).    (12)

Here 0 ≤ Sim(C_i, C_j) ≤ m, where m is the total number of attributes.

Example 1. Let us take two clusters C_1 and C_2 as shown in Tables 1 and 2. The clusters are taken from a data set containing 3 categorical and 2 numerical attributes. The numerical attributes are normalized to the value range [0, 1]. The domains of the categorical attributes are:

    D_1 = {a1, a2, a3, a4}
    D_2 = {b1, b2, b3, b4, b5, b6}
    D_3 = {c1, c2, c3, c4}
    D_4 = {d1, d2, d3, d4, d5, d6}
    D_5 = {e1, e2, e3, e4, e5}

Let us find the similarity of clusters C_1 and C_2, for which the frequency vectors are computed in Tables 3 and 4 respectively. The similarities of the individual attributes S_k(C_1, C_2) (for k = 1, 2, ..., 5) are:

    A0   A1   A2   A3   A4
    a3   b         c
    a3   b         c
    a4   b         c
    a4   b         c
    a3   b         c

    Table 1: Cluster C_1

    A0   A1   A2   A3   A4
    a2   b         c
    a2   b         c
    a2   b         c

    Table 2: Cluster C_2

    S_1(C_1, C_2) = 0 × 3/3 + 3/5 × 0 + 2/5 × 0 = 0
    S_2(C_1, C_2) = 0 × 3/3 + 1/5 × 0 + 1/5 × 0 + 2/5 × 0 + 1/5 × 0 = 0
    S_3(C_1, C_2) = 1 − |3.469/5 − 1.530/3| = 0.816
    S_4(C_1, C_2) = 3/5 × 2/3 + 2/5 × 1/3 = 0.533
    S_5(C_1, C_2) = 1 − |0.033/5 − 0.011/3| = 0.997

Hence Sim(C_1, C_2) = Σ_{k=1}^{5} S_k(C_1, C_2) = 2.346.

3.6 The Algorithm

Our algorithm is given below.

Input: k = the number of required clusters.

1. Begin with n clusters, each consisting of one object.
2. Repeat step 3 a total of n − k times.
3. Find the pair of most similar clusters C_i and C_j using the similarity measure Sim(C_i, C_j) of Equation 12, and merge C_i and C_j into a single cluster. There is one fewer cluster after each merge.

Output: The remaining k disjoint clusters.
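The merge loop can be sketched end-to-end in Python. This is a naive illustration of the definitions (recomputing every pairwise similarity each pass), not the similarity-matrix implementation analyzed in the complexity section; all function names and the cluster representation are our own:

```python
from collections import Counter
from itertools import combinations

def freq_vector(cluster, attr_types):
    """FreqV of a cluster: mean for numeric attributes, relative
    frequencies for categorical ones (Equations 7-9)."""
    n = len(cluster)
    fv = []
    for j, kind in enumerate(attr_types):
        col = [x[j] for x in cluster]
        if kind == "numeric":
            fv.append(sum(col) / n)
        else:
            fv.append({d: c / n for d, c in Counter(col).items()})
    return fv

def sim(fv_a, fv_b):
    """Sim(C_i, C_j): sum over attributes of S_k (Equations 10-12)."""
    total = 0.0
    for va, vb in zip(fv_a, fv_b):
        if isinstance(va, dict):   # categorical: dot product of frequencies
            total += sum(rf * vb.get(d, 0.0) for d, rf in va.items())
        else:                      # numeric, values normalized to [0, 1]
            total += 1.0 - abs(va - vb)
    return total

def himic(data, attr_types, k):
    """Agglomerative loop: merge the most similar pair n - k times."""
    clusters = [[x] for x in data]
    while len(clusters) > k:
        fvs = [freq_vector(c, attr_types) for c in clusters]
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(fvs[p[0]], fvs[p[1]]))
        clusters[i].extend(clusters.pop(j))   # i < j, so index i is safe
    return clusters

# Toy mixed-type data: two obvious groups.
data = [("a", 0.1), ("a", 0.2), ("b", 0.9), ("b", 0.8)]
groups = himic(data, ["categorical", "numeric"], k=2)
```

Recomputing frequency vectors on every pass makes this sketch far slower than the O(n²) bookkeeping described next, but it follows the three algorithm steps literally.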

    V_00: (a1, 0), (a2, 0), (a3, 3/5), (a4, 2/5)
    V_01: (b1, 0), (b2, 0), (b3, 1/5), (b4, 1/5), (b5, 2/5), (b6, 1/5)
    V_02: 3.469/5
    V_03: (c1, 0), (c2, 0), (c3, 3/5), (c4, 2/5)
    V_04: 0.033/5

    Table 3: Frequency vector FreqV_1 of cluster C_1

    V_00: (a1, 0), (a2, 3/3), (a3, 0), (a4, 0)
    V_01: (b1, 0), (b2, 3/3), (b3, 0), (b4, 0), (b5, 0), (b6, 0)
    V_02: 1.530/3
    V_03: (c1, 0), (c2, 0), (c3, 2/3), (c4, 1/3)
    V_04: 0.011/3

    Table 4: Frequency vector FreqV_2 of cluster C_2

3.7 Complexity Analysis

The complexity of the hierarchical clustering algorithm becomes O(n²) [9] if implemented as stated below.

1. The initial similarity matrix can be computed in O(n²) for n data records. While computing the matrix, the maximum entry of each row is found and the column where it occurs is recorded.

2. Prior to searching for the pair of clusters to be merged in the k-th merge, there are n + 1 − k clusters remaining, and they are described by n − k active rows of the similarity matrix. To find the most similar pair, a search is performed for the maximum of the row maxima, which were found initially in step 1 and are repeatedly updated in step 3. At the k-th stage the search involves n − k − 1 comparisons. The total number of comparisons over n − 1 stages is of the order of O(n²).

3. After the most similar pair is found at the k-th stage, it is necessary to update the n − k similarity matrix entries corresponding to the cluster resulting from the merger. Additionally, it is necessary to find the new row maximum corresponding to the new cluster and to check the remaining updated values against their respective row maxima. For n − 1 stages the number of comparisons and the number of updates are each O(n²). Finally, it is necessary to search for a new row maximum in any row where the cluster or column corresponding to the maximum is involved in the merger; that is, the most similar pair in such rows includes one of the clusters involved in the merger. The computation burden for

this latter aspect of updating is highly contingent on the pattern of numbers in the similarity matrix and the particular clustering method employed; however, in the typical case one row would be expected to be updated for this reason at each stage, and the expected length of the row at the k-th stage is (n − k)/2. Therefore, the expected complexity of this computation is O(n²).

4. To avoid testing each row of the similarity matrix as to whether it represents an active cluster, a list of active clusters is maintained. At each stage one element of the list is removed and the remaining ones are pushed down. It is necessary to search this list for the location of the cluster to be deleted; the expected number of comparisons over n − 1 stages is O(n²).

Adding up the complexities of all steps gives a total complexity of O(n²).

4 Experimental Results

In this section, we present an experimental evaluation of our algorithm and compare its performance with counterparts such as ROCK [8] and k-sets [7]. The primary use of clustering algorithms is to discover the grouping structures inherent in data. For this purpose, an assumption is first made that a certain structure may exist in a given data set, and then a clustering algorithm is used to verify the assumption and recover the structure. We adopted an external criterion which measures the degree of correspondence between the clusters obtained from our clustering algorithm and the classes assigned a priori.

4.1 Accuracy Calculation of the Clustering Result

The accuracy measure r of the clustering result, called the clustering accuracy (exactness), is defined as follows:

    r = (1/n) Σ_{i=1}^{k} a_i    (13)

where a_i is the maximum number of data objects of cluster i belonging to the same original class in the test data (the correct answer) and n is the number of data objects in the database.
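Equation 13 translates directly into code. In this sketch (the index-based encoding of clusters is our own choice), each cluster is a list of indices into the data set and `labels` holds the a-priori classes:

```python
from collections import Counter

def clustering_accuracy(clusters, labels):
    """r = (1/n) * sum_i a_i, where a_i is the size of the majority
    original class inside cluster i (Equation 13)."""
    n = sum(len(c) for c in clusters)
    a = [Counter(labels[idx] for idx in c).most_common(1)[0][1]
         for c in clusters]
    return sum(a) / n

labels = ["D", "D", "C", "C"]
r_perfect = clustering_accuracy([[0, 1], [2, 3]], labels)  # 1.0
r_mixed = clustering_accuracy([[0, 1, 2], [3]], labels)    # 3/4 = 0.75
```

The clustering error of Equation 14 is then simply `1 - clustering_accuracy(...)`.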
Further, the clustering error e is defined as:

    e = 1 − r    (14)

4.2 Data Sets

We have experimented with three commonly used real-life data sets, described below. The data sets are available in the UCI Machine Learning Repository.

The Soybean (small) disease data set is chosen to test our algorithm because all of its attributes are categorical. This data set has 47 instances, each described by 35 attributes, all categorical. Each instance is labeled as one of four diseases: Diaporthe Stem Canker (D),

Charcoal Rot (C), Rhizoctonia Root Rot (R), and Phytophthora Rot (P). Except for Phytophthora Rot, which has 17 instances, all other diseases have 10 instances each.

The Congressional Voting data set has 435 instances described by 16 attributes (e.g., education spending, crime, etc.). All attributes are Boolean with Yes or No values. The data set contains 168 Republican (R) and 267 Democrat (D) instances.

The Credit Approval data set has 690 instances, each described by 6 numeric and 9 categorical attributes. The instances are classified into 2 classes: approved (A), labeled +, and rejected (R), labeled −. There are 37 instances having missing values on 7 attributes. The data set contains 307 approved and 383 rejected instances.

4.3 Results on the Soybean Data Set

We used our algorithm to cluster the Soybean small disease data set with 4 clusters (k = 4). The result is shown in Table 5, where C_i denotes the clusters produced by the algorithm, and D (Diaporthe Stem Canker), C (Charcoal Rot), R (Rhizoctonia Root Rot) and P (Phytophthora Rot) are the names of the original classes.

    Cluster No.   D   C   R   P
    C_1
    C_2
    C_3
    C_4

    Table 5: The clustering result on the Soybean data set with our algorithm (k = 4).

The algorithm discovers soybean disease clusters that are completely identical to the original classes, with accuracy r = 1.0. This clustering accuracy is the highest among the compared methods, as shown in Table 6.

    Algorithm                                  Accuracy (average)
    Fuzzy k-modes                              0.79
    Fuzzy k-modes with tabu search             0.99
    k-modes                                    0.89
    k-modes with refinement initialization     0.98
    k-sets                                     1.00
    Our algorithm                              1.00

    Table 6: Accuracy of clustering algorithms on the Soybean disease data; results of k-sets and the others are obtained from [7].

4.4 Results on the Congressional Voting Data Set

In the test on the Congressional Voting data set we took 432 of the 435 records as input (records having missing values in all the attributes are excluded). The result for this data set (with k = 2) is shown in Table 7. We found 2 clusters, C_1 and C_2, of sizes 173 and 259 respectively. In the cluster for Republicans (cluster C_1) around 7.51% of the members are Democrats, and in the cluster for Democrats (cluster C_2) around 2.31% of the members are Republicans. We found the accuracy for this data set to be 0.96.

    Cluster No.   Republican   Democrat
    C_1
    C_2

    Table 7: Results of our algorithm on the Congressional Voting data set with k = 2, using 432 of the 435 records.

Table 8 contains the results on the Congressional Voting data set using the ROCK and k-sets algorithms.

    ROCK (with 372 of the 435 records):
    Cluster No.   Republican   Democrat
    C_1
    C_2

    k-sets (with 350 of the 435 records):
    Cluster No.   Republican   Democrat
    C_1
    C_2

    Table 8: Results of the ROCK and k-sets algorithms on the Congressional Voting data set. Results are taken from [7, 8].

The accuracies on this data set are shown in Table 9; we observed that the accuracy of our clustering algorithm is the highest with respect to the ROCK and k-sets algorithms.

    Algorithm          Accuracy (average)
    ROCK               0.93
    k-sets             0.93
    Our algorithm      0.96

    Table 9: Accuracies of clustering algorithms on the Congressional Voting data set.

    Cluster No.   A   R
    C_1
    C_2

    Table 10: Results of our algorithm on the Credit Approval data set (666 of the 690 records). A = approved class, R = rejected class.

    Cluster No.   A   R
    C_1
    C_2

    Table 11: Results of the k-sets algorithm on the Credit Approval data set (666 of the 690 records). A = approved class, R = rejected class.

4.5 Results on the Credit Approval Data Set

For the Credit Approval data set we excluded 24 instances having missing values in numeric attributes, as done in [7]. Using our algorithm we obtained two clusters, C_1 and C_2, as shown in Table 10. In the cluster of the rejected class (C_1) we have 18.93% approved instances, and in the approved class (C_2) we found 8.40% rejected instances. The accuracy of our clustering algorithm is 0.85. Table 11 shows the result of the k-sets algorithm. The comparison of the accuracies of our algorithm with the k-prototypes, tabu-search k-prototypes and k-sets algorithms (accuracy measures are taken from [7]) is shown in Table 12, which indicates that our algorithm bests the others.

    Algorithm                      Accuracy (average)
    k-prototypes                   0.77
    Tabu search k-prototypes       0.80
    k-sets                         0.83
    Our algorithm                  0.85

    Table 12: Accuracies of clustering algorithms on the Credit Approval data set; results of k-sets and the others are obtained from [7].

5 Conclusion

In this paper, we have presented the concept of frequency vectors for clusters, and based on this idea we provided a method to measure the similarity between a pair of clusters with categorical as well as numerical attributes. We also developed a bottom-up hierarchical clustering algorithm that employs this similarity measure for merging clusters. Our method naturally extends to non-metric similarity measures that are relevant in situations where clusters are the only source of knowledge. Our method clusters not only categorical or numerical data but also mixtures of both. We have demonstrated the effectiveness of our algorithm on a number of standard data sets. In future work, we will attempt to extend the frequency vector method to find subspace clusters as well.

References

[1] C. Blake, E. Keogh and C.J. Merz, UCI Repository of Machine Learning Databases.
[2] V. Ganti, J. Gehrke and R. Ramakrishnan, CACTUS: Clustering categorical data using summaries, in Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD), USA, 1999.
[3] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, Vol. 2, No. 2.
[4] H. Ralambondrainy, A conceptual version of the k-means algorithm, Pattern Recognition Letters, Vol. 15, No. 11.
[5] Z. Huang, A fast clustering algorithm to cluster very large categorical data sets in data mining, in Research Issues on Data Mining and Knowledge Discovery.
[6] Z. Huang, Clustering large data sets with mixed numeric and categorical values, in Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, World Scientific, Singapore.
[7] S.Q. Le and T.B. Ho, A k-sets clustering algorithm for categorical and mixed data, in Proceedings of the 6th SANKEN (ISIR) International Symposium.
[8] S. Guha, R. Rastogi and K. Shim, ROCK: A robust clustering algorithm for categorical attributes, in Proc. of the 15th Int'l Conf. on Data Engineering.
[9] M. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.


More information

Module: CLUTO Toolkit. Draft: 10/21/2010

Module: CLUTO Toolkit. Draft: 10/21/2010 Module: CLUTO Toolkit Draft: 10/21/2010 1) Module Name CLUTO Toolkit 2) Scope The module briefly introduces the basic concepts of Clustering. The primary focus of the module is to describe the usage of

More information

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE Sinu T S 1, Mr.Joseph George 1,2 Computer Science and Engineering, Adi Shankara Institute of Engineering

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

PARTCAT: A Subspace Clustering Algorithm for High Dimensional Categorical Data

PARTCAT: A Subspace Clustering Algorithm for High Dimensional Categorical Data 2006 International Joint Conference on Neural Networks Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006 PARTCAT: A Subspace Clustering Algorithm for High Dimensional Categorical

More information

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique NDoT: Nearest Neighbor Distance Based Outlier Detection Technique Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Improved Performance of Unsupervised Method by Renovated K-Means

Improved Performance of Unsupervised Method by Renovated K-Means Improved Performance of Unsupervised Method by Renovated P.Ashok Research Scholar, Bharathiar University, Coimbatore Tamilnadu, India. ashokcutee@gmail.com Dr.G.M Kadhar Nawaz Department of Computer Application

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Building a Concept Hierarchy from a Distance Matrix

Building a Concept Hierarchy from a Distance Matrix Building a Concept Hierarchy from a Distance Matrix Huang-Cheng Kuo 1 and Jen-Peng Huang 2 1 Department of Computer Science and Information Engineering National Chiayi University, Taiwan 600 hckuo@mail.ncyu.edu.tw

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

Chuck Cartledge, PhD. 23 September 2017

Chuck Cartledge, PhD. 23 September 2017 Introduction Definitions Numerical data Hands-on Q&A Conclusion References Files Big Data: Data Analysis Boot Camp Agglomerative Clustering Chuck Cartledge, PhD 23 September 2017 1/30 Table of contents

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS

PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume 1 Number 1, May - June (2010), pp. 158-165 IAEME, http://www.iaeme.com/ijcet.html

More information

High Accuracy Clustering Algorithm for Categorical Dataset

High Accuracy Clustering Algorithm for Categorical Dataset Proc. of Int. Conf. on Recent Trends in Information, Telecommunication and Computing, ITC High Accuracy Clustering Algorithm for Categorical Dataset Aman Ahmad Ansari 1 and Gaurav Pathak 2 1 NIMS Institute

More information

Determination of Similarity Threshold in Clustering Problems for Large Data Sets

Determination of Similarity Threshold in Clustering Problems for Large Data Sets Determination of Similarity Threshold in Clustering Problems for Large Data Sets Guillermo Sánchez-Díaz 1 and José F. Martínez-Trinidad 2 1 Center of Technologies Research on Information and Systems, The

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Hierarchical and Ensemble Clustering

Hierarchical and Ensemble Clustering Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

A Comparative Analysis between K Means & EM Clustering Algorithms

A Comparative Analysis between K Means & EM Clustering Algorithms A Comparative Analysis between K Means & EM Clustering Algorithms Y.Naser Eldin 1, Hythem Hashim 2, Ali Satty 3, Samani A. Talab 4 P.G. Student, Department of Computer, Faculty of Sciences and Arts, Ranyah,

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING

A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING Susan Tony Thomas PG. Student Pillai Institute of Information Technology, Engineering, Media Studies & Research New Panvel-410206 ABSTRACT Data

More information

Genetic Algorithm and Simulated Annealing based Approaches to Categorical Data Clustering

Genetic Algorithm and Simulated Annealing based Approaches to Categorical Data Clustering Genetic Algorithm and Simulated Annealing based Approaches to Categorical Data Clustering Indrajit Saha and Anirban Mukhopadhyay Abstract Recently, categorical data clustering has been gaining significant

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Data Clustering With Leaders and Subleaders Algorithm

Data Clustering With Leaders and Subleaders Algorithm IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719, Volume 2, Issue 11 (November2012), PP 01-07 Data Clustering With Leaders and Subleaders Algorithm Srinivasulu M 1,Kotilingswara

More information

Agglomerative clustering on vertically partitioned data

Agglomerative clustering on vertically partitioned data Agglomerative clustering on vertically partitioned data R.Senkamalavalli Research Scholar, Department of Computer Science and Engg., SCSVMV University, Enathur, Kanchipuram 631 561 sengu_cool@yahoo.com

More information

Unsupervised learning, Clustering CS434

Unsupervised learning, Clustering CS434 Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

International Journal of Modern Trends in Engineering and Research   e-issn No.: , Date: 2-4 July, 2015 International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Privacy Preservation Data Mining Using GSlicing Approach Mr. Ghanshyam P. Dhomse

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Swarm Based Fuzzy Clustering with Partition Validity

Swarm Based Fuzzy Clustering with Partition Validity Swarm Based Fuzzy Clustering with Partition Validity Lawrence O. Hall and Parag M. Kanade Computer Science & Engineering Dept University of South Florida, Tampa FL 33620 @csee.usf.edu Abstract

More information

Clustering: An art of grouping related objects

Clustering: An art of grouping related objects Clustering: An art of grouping related objects Sumit Kumar, Sunil Verma Abstract- In today s world, clustering has seen many applications due to its ability of binding related data together but there are

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP Dynamics of Continuous, Discrete and Impulsive Systems Series B: Applications & Algorithms 14 (2007) 103-111 Copyright c 2007 Watam Press FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

XML Clustering by Bit Vector

XML Clustering by Bit Vector XML Clustering by Bit Vector WOOSAENG KIM Department of Computer Science Kwangwoon University 26 Kwangwoon St. Nowongu, Seoul KOREA kwsrain@kw.ac.kr Abstract: - XML is increasingly important in data exchange

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

A Genetic k-modes Algorithm for Clustering Categorical Data

A Genetic k-modes Algorithm for Clustering Categorical Data A Genetic k-modes Algorithm for Clustering Categorical Data Guojun Gan, Zijiang Yang, and Jianhong Wu Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada M3J 1P3 {gjgan,

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Relative Constraints as Features

Relative Constraints as Features Relative Constraints as Features Piotr Lasek 1 and Krzysztof Lasek 2 1 Chair of Computer Science, University of Rzeszow, ul. Prof. Pigonia 1, 35-510 Rzeszow, Poland, lasek@ur.edu.pl 2 Institute of Computer

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering World Journal of Computer Application and Technology 5(2): 24-29, 2017 DOI: 10.13189/wjcat.2017.050202 http://www.hrpub.org Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

More information

Unsupervised learning on Color Images

Unsupervised learning on Color Images Unsupervised learning on Color Images Sindhuja Vakkalagadda 1, Prasanthi Dhavala 2 1 Computer Science and Systems Engineering, Andhra University, AP, India 2 Computer Science and Systems Engineering, Andhra

More information

On the Consequence of Variation Measure in K- modes Clustering Algorithm

On the Consequence of Variation Measure in K- modes Clustering Algorithm ORIENTAL JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY An International Open Free Access, Peer Reviewed Research Journal Published By: Oriental Scientific Publishing Co., India. www.computerscijournal.org ISSN:

More information

Lecture 15 Clustering. Oct

Lecture 15 Clustering. Oct Lecture 15 Clustering Oct 31 2008 Unsupervised learning and pattern discovery So far, our data has been in this form: x 11,x 21, x 31,, x 1 m y1 x 12 22 2 2 2,x, x 3,, x m y We will be looking at unlabeled

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 06 07 Department of CS - DM - UHD Road map Cluster Analysis: Basic

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information