Available online www.jocpr.com

Journal of Chemical and Pharmaceutical Research, 2013, 5(12):745-749

Research Article    ISSN: 0975-7384    CODEN(USA): JCPRC5

K-means algorithm with optimal initial centroids based on dissimilarity

Wang Shunye, Cui Yeqin, Jin Zuotao and Liu Xinyuan

Department of Computer Science and Technology, Langfang Teachers University, China

ABSTRACT

The k-means clustering algorithm is one of the most popular clustering algorithms and has been applied in many fields. A major problem of the basic k-means algorithm is that the clustering result depends heavily on the initial centroids, which are chosen at random. At the same time, the algorithm is not suitable for sparse spatial datasets when spatial distance is used as the similarity measure. In this paper, an improved k-means clustering algorithm with optimal initial centroids based on dissimilarity is proposed. It adopts dissimilarity to reflect the degree of correlation between data objects, and then uses a Huffman tree to find the initial centroids. Many experiments confirm that the proposed algorithm is efficient and achieves better clustering accuracy with the same dominant time complexity.

Key words: k-means, initial centroids, Huffman tree, dissimilarity

INTRODUCTION

These days many datasets are produced by a variety of scientific disciplines and by everyday life, and data generation, collection and analysis have become central to research. Data is gathered at all times, in all places and from all sources, and should yield value in different fields. Data mining is the process of finding useful information in large data warehouses; many data mining techniques are used to discover important patterns in datasets and to support prediction.

Cluster analysis is the most important unsupervised learning method. Its main purpose is to find structure in a collection of unlabelled data. In general, clustering involves partitioning a given dataset into groups whose members are similar in some way. Clustering analysis has been widely used in data recovery, text and web mining, pattern recognition, image segmentation and software reverse engineering [1].

K-means clustering is a popular clustering algorithm. It partitions a dataset into k groups in the vicinity of its initialization, such that similar data objects are grouped in the same cluster while dissimilar data objects fall in different clusters. However, the k-means clustering algorithm also has some limitations. (1) k, the number of clusters, is a user parameter; choosing it requires much professional knowledge, and a good clustering with smaller k can have a lower SSE (Sum of the Squared Error) than a poor clustering with higher k [2]. (2) The algorithm depends heavily on the initial conditions and is sensitive to the order of input; it often converges to a local rather than a global optimum. (3) It has many problems with outliers; how to detect and mine them is also important. (4) Clustering can raise new problems on high-dimensional, sparse datasets, for example the curse of dimensionality. (5) It may also produce empty clusters. (6) It has problems when clusters are of differing sizes, densities or non-globular shapes.

Recently, many improved k-means clustering algorithms have been proposed to solve the initial centroids problem. General solutions include using multiple runs, clustering a sample first, and bisecting k-means, which is not as susceptible to initialization issues [2]. J. C. Bezdek proposed fuzzy c-means, in which an object belongs to all clusters with a weight and the sum of the weights is 1 [3]. Redmond [4] proposed a method in which the initial centroids are selected by combining the density of the data distribution with a kd-tree. Han Lingbo [5] improved the initial centroids through the density of the data and the average distance. Tong Xuejiao [6] constructed k clusters and then decided whether each data object belongs to a cluster according to a threshold. Zhang Jing [7] presented a method to improve the initial centroids through the individual silhouette coefficient.

A high-quality clustering achieves high intra-cluster similarity and low inter-cluster similarity, so the choice of similarity measure influences the clustering results. Many similarity measures have been designed for different applications and data types. Most algorithms adopt a traditional similarity based on spatial distance to describe the relationship between data objects, including the Euclidean, Manhattan, Minkowski [8] and Chebyshev distances, especially the Euclidean distance. These work well in low-dimensional data spaces but fail on high-dimensional datasets: because of sparsity and the empty-space phenomenon, the traditional methods degrade greatly in high-dimensional spaces and the results become unstable [9]. Many papers use similarity to measure the relationship between data objects [10].

In this paper, an improved k-means clustering algorithm that optimizes the initial centroids based on dissimilarity is proposed. It draws lessons from the Huffman tree of Wu Xiaorong [11]. Dissimilarity is adopted instead of the spatial-distance method, and a dimension contribution rate is used to reflect the importance of each attribute to the clustering results, so the method can also be used for dimension reduction in order to improve efficiency. The IRIS, Wine and Balance-scale datasets from UCI [12] are chosen for the experiments, which show that the proposed algorithm has a good accuracy rate, especially on high-dimensional data.

K-MEANS CLUSTERING ALGORITHM

The k-means clustering algorithm is one of the top ten data mining algorithms [13]. A description of the basic algorithm follows. A dataset D = {x_1, x_2, ..., x_m} is assumed. First, k data objects are chosen at random as the initial centroids, where k is a user parameter giving the number of clusters desired; the idea is to choose random cluster centroids, one for each cluster. Each data object is then assigned to the nearest centroid, and the centroid of each cluster is updated to the mean of the objects assigned to it. Assignment and centroid update are repeated until no data object changes cluster, that is, until no object moves from one cluster to another or, equivalently, every centroid remains the same as in the previous iteration.

Algorithm 1: the basic k-means clustering algorithm
    Choose k objects as the initial centroids at random
    Repeat
        Assign each object to the nearest cluster center
        Recompute the cluster center of each cluster
    Until the convergence criterion is met

The time complexity of the basic k-means clustering algorithm is O(k*l*m*d), where k represents the number of clusters, l is the number of iterations needed to meet the convergence criterion, m is the size of the dataset and d is the number of attributes. So the numbers k, l, m and d all influence the efficiency of the algorithm.

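For concreteness, the following Python/NumPy sketch implements Algorithm 1, assuming the textbook Euclidean-distance assignment; the random choice of centers near the top is exactly the initialization step the rest of this paper sets out to replace.

    import numpy as np

    def basic_kmeans(X, k, max_iter=100, seed=None):
        """Algorithm 1: random initial centroids, then assign/recompute passes."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iter):
            # assign each object to the nearest cluster center
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dist.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                      # convergence: no object changed cluster
            labels = new_labels
            # recompute the center of each cluster as the mean of its members
            for c in range(k):
                members = X[labels == c]
                if len(members):           # keep the old center if a cluster empties
                    centers[c] = members.mean(axis=0)
        return labels, centers

Each assignment pass costs O(m*k*d), so l passes give the O(k*l*m*d) bound above.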
IMPROVED K-MEANS CLUSTERING ALGORITHM

Formal definitions
In order to explain the algorithm proposed in this paper, the relevant definitions are introduced as follows.

Definition 1: The dataset D is defined as D = {x_1, x_2, ..., x_m}; its size is m, and each object has d attributes.

Definition 2: Attribute dissimilarity ad. Given a dataset D with x_i ∈ D and x_j ∈ D, and letting α denote any attribute, the attribute dissimilarity of x_i and x_j on attribute α is

    ad_{ij}^{\alpha} = \frac{|x_i^{\alpha} - x_j^{\alpha}|}{x_{\max}^{\alpha} - x_{\min}^{\alpha}}    (1)

where x_i^α is the value of x_i on attribute α, x_j^α is the value of x_j on attribute α, and x_max^α and x_min^α are the maximal and minimal values of attribute α over the dataset.

Because huge datasets are rarely dimensionally homogeneous, the value ranges of the attributes can be completely different. Data preprocessing, which converts raw data into suitable information, is therefore very important: formula (1) normalizes the dataset in order to avoid the influence of the differing scales of the data.

Definition 3: Object dissimilarity od. The object dissimilarity of x_i and x_j in dataset D is

    od(i,j) = \sum_{\alpha=1}^{d} w_{\alpha} \, ad_{ij}^{\alpha}    (2)

where w_α is the dimension contribution rate, which weights the influence of each attribute in the clustering procedure. It ranges from 0 to 1, can be obtained from different expressions, and can also come from experts in practical applications. The w of [9] is adopted in this paper.

Definition 4: Dissimilarity matrix dm (m*m). The dissimilarity matrix of the given dataset D is

    dm = \begin{pmatrix} od(1,1) & & & \\ od(2,1) & od(2,2) & & \\ \vdots & \vdots & \ddots & \\ od(m,1) & od(m,2) & \cdots & od(m,m) \end{pmatrix}    (3)
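Definitions 2-4 translate directly into a few lines of NumPy. The sketch below is a minimal version that assumes uniform weights w_α = 1/d when no dimension contribution rate is supplied, since the rate of [9] is not reproduced in this paper; formula (1) appears as the normalized elementwise difference and formula (2) as the weighted sum over attributes.

    import numpy as np

    def dissimilarity_matrix(X, w=None):
        """Dissimilarity matrix dm of formulas (1)-(3) for an (m, d) array X."""
        m, d = X.shape
        if w is None:
            w = np.full(d, 1.0 / d)      # uniform stand-in for the rate from [9]
        span = X.max(axis=0) - X.min(axis=0)
        span[span == 0] = 1.0            # guard against constant attributes
        # ad[i, j, a] = |x_i^a - x_j^a| / (x_max^a - x_min^a)   -- formula (1)
        ad = np.abs(X[:, None, :] - X[None, :, :]) / span
        # od(i, j) = sum over a of w_a * ad[i, j, a]            -- formula (2)
        return np.einsum('ija,a->ij', ad, w)

The result is symmetric with zeros on the diagonal, matching the matrix of formula (3).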

The optimal initial centroids based on dissimilarity
Formula (1) is used to calculate the dissimilarity between data objects on each attribute, and formula (2) is used to calculate the dissimilarity between data objects over all attributes. The value of od(i,j) reflects the degree of correlation between x_i and x_j: the smaller the value, the closer the two objects, and the greater the possibility that they should be partitioned into the same cluster. Formula (3) is then used to create the dissimilarity matrix dm, which is symmetric.

A Huffman tree is a tree with the shortest weighted path length, and here it is used to compute the initial centroids. The dissimilarity defined above measures the difference between data objects, and the dissimilarity matrix stores all the values. Select the smallest value in the initial dissimilarity matrix; it identifies the two objects most likely to lie in the same cluster. Compute the average (not the sum) of these two objects as a new object, delete the two objects from the dataset, recompute od(i,j) to obtain a new dissimilarity matrix dm(m-1, m-1), and repeat this procedure, as in the Huffman algorithm, until only one object remains. According to the Huffman tree and the value of k, k-1 nodes are then found from the root toward the leaves; deleting them leaves k sub-trees, and the values at the roots of these sub-trees are the initial centroids used in the basic k-means clustering algorithm.
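A minimal sketch of this procedure, reusing dissimilarity_matrix from above: repeatedly fuse the two least-dissimilar objects into their average until k representatives remain. Stopping after m-k merges is equivalent to building the full Huffman tree and then deleting the k-1 nodes nearest the root, so the survivors are the roots of the k sub-trees. Note that this naive version rebuilds the whole matrix after every merge; the O(m*logm) bound quoted below presumes a heap-based construction.

    import numpy as np

    def huffman_initial_centroids(X, k, w=None):
        """Merge the two closest objects (by od) into their average until
        k representatives remain; these become the initial centroids."""
        objects = [x.astype(float) for x in X]
        while len(objects) > k:
            dm = dissimilarity_matrix(np.array(objects), w)
            np.fill_diagonal(dm, np.inf)               # ignore self-dissimilarity
            i, j = np.unravel_index(np.argmin(dm), dm.shape)
            merged = (objects[i] + objects[j]) / 2.0   # the average, not the sum
            objects = [o for t, o in enumerate(objects) if t not in (i, j)]
            objects.append(merged)
        return np.array(objects)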
The description of the improved algorithm
The improved algorithm uses the initial centroids that come from the Huffman tree, which is built on the dissimilarity describing the degree of correlation. The rest of the procedure is the same as in the basic k-means clustering algorithm. The improved algorithm is described as follows.

Algorithm 2: the improved k-means clustering algorithm
    Input the dataset D with m objects, each object with d attributes, and k, the number of clusters
    Calculate ad and od, and obtain the dissimilarity matrix dm
    Construct the Huffman tree according to the dissimilarity matrix dm
    Delete k-1 nodes from the Huffman tree, leaving k sub-trees, and take each sub-tree's root value as an initial centroid
    Repeat
        Assign each object to the nearest cluster center
        Recompute the cluster center of each cluster
    Until the convergence criterion is met

Algorithm 2 shows the procedure of the improved k-means clustering algorithm. Its time complexity is affected by the size of the dataset (m), the number of attributes (d), the number of iterations (l) and the number of clusters (k). The time complexity of computing the dissimilarity is O(m*d), identical to that of the distance-based method in [14]; the time complexity of constructing the Huffman tree is O(m*logm); and the time complexity of clustering is O(m*k*l*d). The total complexity of the improved algorithm is therefore O(m*d + m*logm + m*k*l*d). Although this algorithm spends extra time on the Huffman algorithm, the value of logm is very small, and the algorithm's time consumption mainly depends on the basic k-means clustering step, so the dominant time complexity is O(m*k*l*d), in which the data size, the number of iterations and the number of attributes are the main factors. The Huffman tree also diminishes the number of iterations, which reduces the time consumption; at the same time the clustering result is stable and depends less on the initial centroids.
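Putting the pieces together, here is an end-to-end sketch of Algorithm 2: Huffman-based seeding followed by the same assign/recompute loop as Algorithm 1. The paper does not state which measure the final clustering phase uses for assignment, so the Euclidean distance of the basic algorithm is assumed here.

    import numpy as np

    def improved_kmeans(X, k, w=None, max_iter=100):
        """Algorithm 2: Huffman-seeded k-means (Euclidean assignment assumed)."""
        centers = huffman_initial_centroids(X, k, w)
        labels = None
        for _ in range(max_iter):
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dist.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                      # no object changed cluster
            labels = new_labels
            for c in range(k):
                members = X[labels == c]
                if len(members):
                    centers[c] = members.mean(axis=0)
        return labels, centers

For example, labels, centers = improved_kmeans(X, k=3) matches the 3-cluster setting used in the experiments below; because the seeding is deterministic, repeated runs give the same clustering, unlike basic_kmeans.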

RESULTS AND DISCUSSION
In order to evaluate the improved k-means clustering algorithm, the standard datasets IRIS, Wine and Balance-scale were chosen from the UCI machine learning repository. Each has 3 clusters, and the number of data objects in each cluster is shown in Table 1.

Table 1: The number of data objects in each cluster

    Cluster          IRIS   Wine   Balance-scale
    first cluster      50     59       49
    second cluster     50     71      288
    third cluster      50     48      288
    sum               150    178      625

Table 2 reports the accuracy rate, as defined in [14], of the improved algorithm in this paper. The accuracy rate is consistently above that of the distance-based algorithm in [14], especially on the larger and higher-dimensional datasets.

Table 2: Accuracy rate of the improved algorithm

                              IRIS     Wine    Balance-scale
    Dataset   first cluster     50       64       56
              second cluster    47       58      259
              third cluster     53       56      310
    right     first cluster     50       51       45
              second cluster    44       56      218
              third cluster     46       39      251
    wrong     first cluster      0       13       11
              second cluster     3        2       41
              third cluster      7       17       59
    accuracy_rate             93.33%   82.02%   82.24%
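The accuracy-rate definition comes from [14] and is not reproduced in this paper; the sketch below assumes the conventional reading, namely the number of correctly grouped objects divided by the dataset size, with each discovered cluster matched to the true class it overlaps most, which is consistent with the right/wrong counts of Table 2.

    import numpy as np

    def accuracy_rate(labels, truth):
        """Assumed accuracy rate: objects falling in their cluster's majority
        class, divided by the dataset size. truth holds integer class indices."""
        right = 0
        for c in np.unique(labels):
            members = truth[labels == c]
            right += np.bincount(members).max()   # best-overlapping class
        return right / len(truth)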

Table 3 describes the final cluster centers, as defined in [14], on the IRIS dataset for each algorithm (Center1 denotes the standard centers, Center2 the basic k-means, Center3 the clustering algorithm using a Huffman tree based on distance, and Center4 the improved algorithm in this paper). As seen from Table 3, the improved algorithm in this paper comes closest to the standard cluster centers and is better than the algorithm using a Huffman tree based on distance. The Wine and Balance-scale datasets show the same results. This means that the dissimilarity incorporating the dimension contribution rate is more suitable for big, high-dimensional datasets.

Table 3: The final cluster centers on IRIS

    Cluster          Center1 (standard)         Center2 (basic k-means)    Center3 (Huffman, distance)   Center4 (this paper)
    first cluster    5.006 3.418 1.464 0.244    5.830 3.511 1.476 0.250    5.006 1.464 3.418 0.224       5.006 3.418 1.464 0.244
    second cluster   5.936 2.770 4.260 1.326    5.756 2.716 4.026 1.118    5.901 2.748 4.394 1.434       5.905 2.779 4.278 1.354
    third cluster    6.588 2.970 5.552 2.026    6.315 2.895 5.125 1.803    6.850 3.073 5.742 2.071       6.628 2.927 5.635 2.051

CONCLUSION
The k-means clustering algorithm ranks second among the top ten data mining algorithms, but it has encountered many limitations. This paper presents an improved k-means clustering algorithm with optimal initial centroids based on dissimilarity. It adopts dissimilarity to reflect the degree of correlation between data objects, and then uses a Huffman tree to find the initial centroids, which resolves the basic k-means problem that the cluster results are sensitive to the initial centroids. It consumes less time than the basic k-means with the same values of m, k and d, because the Huffman-based initialization diminishes the number of iterations. Many experiments show that the improved algorithm has a better accuracy rate and better cluster results. However, the new dissimilarity-based algorithm still leaves problems for further research. We use the dimension contribution rate to weight the influence of each attribute in clustering; the next steps of this research are how to define the dimension contribution rate for different fields and datasets, and how to improve the algorithm's efficiency by reducing the number of attributes d through principal component analysis based on the dimension contribution rate.

Acknowledgement
This work was supported in part by the Natural Science Foundation of Langfang Teachers University in 2013 (LSZY201306).

REFERENCES
[1] Elham Karoussi, Data Mining: K-Clustering Problem, University of Agder, 2012.
[2] Tan, Steinbach, Kumar, The K-means Cluster, http://www.cs.uvm.edu/~xwu/kdd/slides/Kmeans-ICDM06.pdf, 2006.
[3] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Cornell University, Ithaca, NY, 1973.
[4] Redmond S J, Heneghan C, Pattern Recognition Letters, 2007, 28(8):965-973.
[5] Han Lingbo, Wang Qiang, Jiang Zhengfeng, Computer Engineering and Applications, 2010, 46(17):150-152.
[6] Fu Desheng, Zhou Chen, Journal of Computer Applications, 2011, 31(2):432-434.
[7] Zhang Jing, Duan Fu, Computer Engineering and Design, May 2013(5):1691-1694.
[8] B. Shanmugapriya, M. Punithavalli, International Journal of Computer Applications, April 2012(8):26-32.
[9] Wang Xiaoyang, Zhang Hongyuan, Shen Liangzhong, Chi Wanle, Computer Technology and Development, May 2013(23):30-33.
[10] Huang Maida, Chen Qimai, Microcomputer Information, 2009(27):187-188, 198.
[11] Wu Xiaorong, Research on Problems Related to the Initial Center Selection in the K-means Clustering Algorithm, Hunan University, May 2008.
[12] UCI machine learning repository, http://archive.ics.uci.edu/ml/.
[13] Tan P N, Steinbach M, Kumar V, Introduction to Data Mining, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2010.
[14] Wang Shunye, An Improved K-means Clustering Algorithm Based on Dissimilarity, Proceedings of the 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC), December 20-22, 2013, China: 2629-2633.