An Internal Clustering Validation Index for Boolean Data


BULGARIAN ACADEMY OF SCIENCES
CYBERNETICS AND INFORMATION TECHNOLOGIES, Volume 16, No 6
Special issue with selection of extended papers from the 6th International Conference on Logistics, Informatics and Service Science LISS'2016
Sofia, 2016. Print ISSN: 1311-9702; Online ISSN: 1314-4081
DOI: 10.1515/cait-2016-0091

An Internal Clustering Validation Index for Boolean Data

Liwei Fu, Sen Wu
Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing, China
Emails: Tavon_Fu@outlook.com  wusen@manage.ustb.edu.cn

Abstract: Internal clustering validation is recognized as one of the vital issues essential to clustering applications, especially when external information is not available. Existing measures have their limitations in different application circumstances, and there are still deficiencies in the internal validation of Boolean clustering. This paper proposes a new Clustering Validation index based on the Type of Attributes for Boolean data (CVTAB). It evaluates clustering quality in the light of the Dissimilarity of two clusters for Boolean Data (DBD). The attributes of Boolean data are categorized into three types: Type A, Type O and Type E, representing respectively the attribute values 1, 0, and not the same for all the objects in the set. When two clusters are composed into one, DBD applies the number of attributes whose types change and the numbers of objects involved to measure the dissimilarity of the two clusters. CVTAB evaluates clustering quality without respect to external information.

Keywords: Clustering Validation index based on Type of Attributes for Boolean data (CVTAB), Dissimilarity for Boolean Data (DBD), internal clustering validation index, Boolean data, high dimensional data.

1. Introduction

1.1. Background

Nowadays, data volume increases explosively as computer technology becomes fully integrated into social life. Internet applications, such as Micro-Blog, Social Network, and e-Business, have produced a large amount of data, particularly in recent years. Data mining is the core of knowledge discovery in databases. Technologically, it is a process of extracting implicit patterns from varied, incomplete,

fuzzy, and random data. With abundant methods of data acquisition, data normally have two characteristics: high dimensionality and no labels. Accordingly, a great deal of research has focused on unsupervised high dimensional data mining, and clustering algorithms such as k-means [1, 2] are commonly used in practice. Cluster analysis, a main task of data mining, groups objects into clusters so that objects in the same cluster are more similar to each other than to those in other clusters. It is also a common technique for statistical data analysis used in many fields, such as machine learning, image analysis, pattern recognition, information retrieval, bioinformatics and so on [3]. The result of cluster analysis depends on the characteristics of the data set, but no matter what the data distribution pattern is, a clustering algorithm will always produce a result. An index evaluating the quality of the clustering result is therefore very significant [4, 5], particularly for high dimensional and large data, such as time-series data [6].

Clustering validation indices fall into three types [7]: External Index, Internal Index and Relative Index. An External Index compares clustering results with external information; an Internal Index evaluates the goodness of a clustering structure without respect to external information; a Relative Index compares the results of various clustering algorithms. Among the three, only an Internal Index can evaluate clustering results from the interior information of the data set alone, without information from outside the data set such as original category labels. Internal indices are therefore more practical [8]. In practice, an Internal Index can also be used to select a suitable algorithm and its parameters objectively.

Categorical data and Boolean data widely exist [9]. The difference between Boolean data and categorical data is that the attributes of Boolean data take only the values 0 and 1, while those of categorical data do not.
Multiple-category attributes can be converted into binary attributes by using 0 and 1 to represent a category being absent or present [10]. This paper focuses on the evaluation of Boolean data clustering. It is also applicable to categorical data, which can be transformed into Boolean data straightforwardly.

1.2. Related work

Clustering validation measures can be affected by various data characteristics [11], such as data type and noise. An Internal Index evaluates the clustering result and is sensitive to the properties of the data set and the clusters. Specifically, compactness inside a cluster and separation among the clusters are closely related to Internal Indices [12], and much research focuses on them.

For a dataset X, let n be the number of objects. The dataset is divided into nc subsets, X = C_1 ∪ C_2 ∪ … ∪ C_nc; c is the centroid of X, and c_i is the centroid of C_i; n_i is the number of objects in C_i; d(x, y) is the distance between x and y. Most internal indices, however, lack pertinence to Boolean data. Some indices are shown in Table 1.

Table 1. Some internal indices

Calinski-Harabasz index (CH) [13]:
$$\mathrm{CH}=\frac{\sum_{i=1}^{nc} n_i\, d^2(c_i,c)/(nc-1)}{\sum_{i=1}^{nc}\sum_{x\in C_i} d^2(x,c_i)/(n-nc)}$$

Dunn index (Dunn) [14]:
$$\mathrm{Dunn}=\min_{i}\ \min_{j\neq i}\left(\frac{\min_{x\in C_i,\,y\in C_j} d(x,y)}{\max_{k}\ \max_{x,y\in C_k} d(x,y)}\right)$$

I index (I) [15]:
$$I=\left(\frac{1}{nc}\cdot\frac{\sum_{x\in X} d(x,c)}{\sum_{i=1}^{nc}\sum_{x\in C_i} d(x,c_i)}\cdot\max_{i,j} d(c_i,c_j)\right)^{q}$$

Silhouette index (S) [16]:
$$S=\frac{1}{nc}\sum_{i=1}^{nc}\frac{1}{n_i}\sum_{x\in C_i}\frac{b(x)-a(x)}{\max\{a(x),b(x)\}}$$

In Table 1, $a(x)=\frac{1}{n_i-1}\sum_{y\in C_i,\,y\neq x} d(x,y)$ and $b(x)=\min_{j\neq i}\frac{1}{n_j}\sum_{y\in C_j} d(x,y)$ for $x\in C_i$, with i, j = 1, 2, …, nc; q is a parameter of the I index, and q = 2 in this paper.

Besides, dissimilarity, as an essential element of an Internal Index, is used in many algorithms for Boolean data. In k-modes [17], the dissimilarity measure between two objects is defined by the total number of mismatches of the corresponding attributes. Formally,
$$d(X,Y)=\sum_{j=1}^{m}\frac{\delta(x_j,y_j)}{N_j},\qquad \delta(x_j,y_j)=\begin{cases}0, & x_j=y_j,\\ 1, & x_j\neq y_j,\end{cases}$$
where X and Y denote the categorical objects, $x_j$ and $y_j$ denote the attributes of X and Y, j = 1, 2, …, m; and $N_j$ denotes a coefficient. If d(X, Y) gives equal importance to each category of an attribute, $N_j = 1$; but if the frequencies of categories in the data set are taken into account, $N_j$ is half of the harmonic average of $n_{x_j}$ and $n_{y_j}$, where $n_{x_j}$ and $n_{y_j}$ are the numbers of objects taking the values $x_j$ and $y_j$ respectively. The smaller the number of mismatches is, the more similar the two objects are. This measure can only evaluate the dissimilarity of objects, not that of categories or clusters.

In CABOSFV [18], a clustering algorithm for high dimensional sparse data, the measure named Sparse Feature Dissimilarity (SFD) is proposed to represent the dissimilarity of the objects in a set. It is defined as
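The unweighted k-modes dissimilarity (the case $N_j = 1$) can be sketched in a few lines of Python; this is an illustrative sketch, not the paper's Matlab implementation:

```python
def kmodes_dissimilarity(x, y):
    """Simple-matching dissimilarity of k-modes: the count of
    mismatched attributes (unweighted case, N_j = 1 for every j)."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for a, b in zip(x, y) if a != b)

# Two Boolean objects differing in exactly two attributes
print(kmodes_dissimilarity((0, 1, 1, 0), (0, 0, 1, 1)))  # 2
```

The frequency-weighted variant would divide each mismatch by its coefficient $N_j$ instead of counting it as 1.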

$$\mathrm{SFD}(X)=\frac{e}{|X|\cdot a},$$
where a denotes the number of attributes that equal 1 for all objects, e denotes the number of attributes that equal 1 for some objects and 0 for others, and |X| denotes the number of objects in the set X. The smaller the SFD is, the more similar the objects are. However, SFD can only measure the dissimilarity of the objects within one set or cluster, not that between two sets or clusters.

Boolean data clustering is impacted negatively by the lack of an Internal Index with pertinence to data with binary attributes. This paper proposes a new index to evaluate the effectiveness of Boolean data clustering.

2. Concepts and definitions

To evaluate the dissimilarity of two clusters, the new index is proposed to measure the validation of clustering. Let A_1, A_2, …, A_m be the m attributes of the dataset X, describing a space Ω; Ω is a Boolean space if all of A_1, A_2, …, A_m are Boolean. A_i, whose acceptable values are only 1 and 0, is called a Boolean attribute, for i = 1, 2, …, m.

2.1. The definition of attribute types

Let X = {x_1, x_2, …, x_n} be a set of n objects with Boolean values, X ⊆ Ω. For a Boolean dataset, the attributes can be categorized into three types. Type A is defined for attributes whose values are 1 for all the objects in X; Type O is defined for attributes whose values are all 0; Type E is defined for attributes that do not have the same value for all the objects in X. The space Ω is thereby divided into three subsets J_A, J_O and J_E: for dataset X, if the attribute A_i is of Type A, then A_i ∈ J_A; if A_i is of Type O, then A_i ∈ J_O; and if A_i is of Type E, then A_i ∈ J_E, for 1 ≤ i ≤ m. The set J_E captures the differences among the objects, and J_A ∪ J_O captures their similarity.

2.2. Dissimilarity of two Boolean sets

For Boolean data, to compare the dissimilarity of two clusters C_i and C_j, they can be merged into one cluster C_U = C_i ∪ C_j.
For C_U, the sets J_AU, J_OU and J_EU can be calculated as in (1), (2) and (3):
(1) $J_{AU} = J_{Ai} \cap J_{Aj}$,
(2) $J_{OU} = J_{Oi} \cap J_{Oj}$,
(3) $J_{EU} = \complement(J_{AU} \cup J_{OU})$,
where J_Ai and J_Oi denote the sets J_A and J_O in C_i; J_Aj and J_Oj denote the sets J_A and J_O in C_j; J_AU and J_OU denote the sets J_A and J_O in C_U; and ∁(·) denotes the complement with respect to the attribute space Ω. Merging C_i and C_j will
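Formulas (1)-(3) can be sketched directly with set operations (a Python sketch, with 0-based attribute indices; not the paper's implementation):

```python
def type_sets(cluster):
    """Split attribute indices into (J_A, J_O, J_E) for a cluster of Boolean tuples."""
    m = len(cluster[0])
    J_A = {j for j in range(m) if all(x[j] == 1 for x in cluster)}  # Type A: all 1
    J_O = {j for j in range(m) if all(x[j] == 0 for x in cluster)}  # Type O: all 0
    J_E = set(range(m)) - J_A - J_O                                 # Type E: mixed
    return J_A, J_O, J_E

def merged_type_sets(ci, cj):
    """Formulas (1)-(3): type sets of the merged cluster C_U = C_i ∪ C_j."""
    (A_i, O_i, _), (A_j, O_j, _) = type_sets(ci), type_sets(cj)
    m = len(ci[0])
    J_AU = A_i & A_j                      # (1) all-1 in both parts
    J_OU = O_i & O_j                      # (2) all-0 in both parts
    J_EU = set(range(m)) - (J_AU | J_OU)  # (3) complement in the attribute space
    return J_AU, J_OU, J_EU

c1 = [(1, 1, 0), (1, 0, 0)]
c2 = [(1, 0, 1), (1, 0, 1)]
print(merged_type_sets(c1, c2))  # ({0}, set(), {1, 2})
```

Note that computing the type sets of the concatenated cluster directly gives the same result as formulas (1)-(3), which is exactly the point of the derivation.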

result in some attributes of Type A or O being altered to Type E, whereas Type E attributes in C_i or C_j obviously cannot change in C_U. Fig. 1 shows this process.

Fig. 1. The altering process of attributes

According to Fig. 1, an attribute of Type A or Type O that meets an attribute of Type E when the clusters are composed will be altered to Type E in C_U. This means that when two different clusters are put into one, the differences grow and the similarity shrinks, irrespective of the numbers of objects in the clusters. The situation of Type A meeting Type O is therefore the one to focus on, marked by the bold arrows in Fig. 1, and the number of Type A attributes meeting Type O attributes is used to measure the dissimilarity arising from altered attributes, as in (4):
(4) $\mathrm{Alter}A_iO_j = J_{Ai} \cap J_{Oj}$,
where AlterA_iO_j denotes the attributes altered to Type E by Type A in C_i meeting Type O in C_j; J_Ai denotes the set J_A in C_i and J_Oj the set J_O in C_j (symmetrically, AlterA_jO_i = J_Aj ∩ J_Oi). Furthermore, the value 1 is usually paid more attention than 0, particularly for sparse data, and the dissimilarity from attribute alteration is indicated by DFAA (Dissimilarity From Altered Attributes) as

(5) $\mathrm{DFAA} = \dfrac{n_i}{n_i+n_j}\,|\mathrm{Alter}A_iO_j| + \dfrac{n_j}{n_i+n_j}\,|\mathrm{Alter}A_jO_i|$,
where |·| denotes the number of elements in a set, and n_i and n_j denote the numbers of objects in C_i and C_j. In (5), AlterA_iO_j denotes the Type A attributes in C_i altered to Type E by meeting Type O in C_j, AlterA_jO_i denotes the corresponding attributes in C_j, and n_i/(n_i+n_j) and n_j/(n_i+n_j) are the weights. For symmetric variables, in which 0 and 1 are equally important, the weights in (5) are not taken into consideration, and DFAA is simply the number of Type A attributes meeting Type O ones:
(6) $\mathrm{DFAA} = |\mathrm{Alter}A_iO_j| + |\mathrm{Alter}A_jO_i|$.

On the other hand, the numbers of objects in C_i and C_j also affect the dissimilarity of the two clusters. The value of this dissimilarity is the number of objects in C_U minus the harmonic average of those in C_i and C_j:
(7) $\mathrm{DFN} = n_U - \dfrac{2}{1/n_i + 1/n_j}$,
where DFN (Dissimilarity From Number) denotes the dissimilarity from the numbers of objects in the clusters, and n_U = n_i + n_j is the number of objects in C_U. The Dissimilarity for Boolean Data (DBD) between C_i and C_j is then composed of DFAA and DFN:
(8) $\mathrm{DBD} = \mathrm{DFAA} \times \mathrm{DFN}$.
In this paper, DBD(C_i, C_j) indicates the DBD of the clusters C_i and C_j.

2.3. CVTAB for clustering validation

CVTAB (Clustering Validation index based on Type of Attributes for Boolean data), built on DBD, can evaluate the validation of clustering for Boolean data: it is the average of the DBD over each pair of clusters. Let X = {x_1, x_2, …, x_n} be a set of n objects with attributes A_1, A_2, …, A_m, X ⊆ Ω. After clustering, X is partitioned into k subsets, X = C_1 ∪ C_2 ∪ … ∪ C_k, with n_i objects in C_i for 1 ≤ i ≤ k, and
(9) $\mathrm{CVTAB} = \dfrac{1}{k(k-1)}\sum_{i=1}^{k}\ \sum_{j=1,\,j\neq i}^{k} \mathrm{DBD}(C_i, C_j)$.
CVTAB is positively associated with the quality of the clustering: a higher CVTAB denotes more differences between every two clusters and more similarity within each cluster, so a bigger value of CVTAB means a better clustering result.
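To see what DFN in (7) measures, rewrite it with a common denominator (a small derivation added here, not in the original text):

$$\mathrm{DFN} = (n_i + n_j) - \frac{2\,n_i n_j}{n_i + n_j}.$$

For balanced clusters, $n_i = n_j$, this gives $\mathrm{DFN} = 2n_i - n_i = n_i$; for a singleton merged into a large cluster, $n_i = 1$, it gives $\mathrm{DFN} = (1 + n_j) - \frac{2n_j}{1+n_j} \approx n_j - 1$. DFN therefore grows with the sizes of the merged clusters, so DBD treats the merging of two substantial clusters as a larger loss than the merging of fragments of the same total size.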

3. Steps and examples

Let X = {x_1, x_2, …, x_7} be a Boolean dataset of seven objects x_1, x_2, …, x_7 with attributes A_1, A_2, …, A_16. Assume that C_1 = {x_1, x_2}, C_2 = {x_3, x_4, x_5}, C_3 = {x_6, x_7}. The dataset X is given in Table 2.

Table 2. The dataset X
Cluster  Object  Attributes 1-16
C1       x1      0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0
C1       x2      0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0
C2       x3      1 1 1 0 0 1 0 0 1 0 0 1 0 0 0 1
C2       x4      1 1 0 0 1 1 0 0 1 0 0 1 1 0 0 0
C2       x5      1 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0
C3       x6      1 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1
C3       x7      1 1 0 0 1 0 1 1 1 0 0 1 1 1 0 1

3.1. Example of DBD

According to Fig. 1, after C_1 and C_2 are merged into C_U, the types of the attributes are transformed as given in Table 3. Further on, the sets J_A, J_O and J_E of cluster C_1, cluster C_2, and the composed cluster C_U are shown in Table 4. Based on Table 4, AlterAO, DFAA, DFN and DBD can be calculated, and the process can be illustrated with the Venn diagrams shown in Fig. 2.

Table 3. The types of attributes in C1, C2 and CU
Cluster  Attributes 1-16
C1       O A E E O E O O O E O O E O E O
C2       A A E E E A O O E O E A E O O E
CU       E A E E E E O O E E E E E O E E

Table 4. The sets JA, JO and JE
Cluster  Set   Attributes
C1       JA1   2, 13, 15
C1       JO1   1, 5, 7, 8, 9, 11, 12, 14, 16
C1       JE1   3, 4, 6, 10
C2       JA2   1, 2, 6, 12
C2       JO2   7, 8, 10, 14, 15
C2       JE2   3, 4, 5, 9, 11, 13, 16
CU       JAU   2
CU       JOU   7, 8, 14
CU       JEU   1, 3, 4, 5, 6, 9, 10, 11, 12, 13, 15, 16

(a) JA1, JA2, JAU; (b) JO1, JO2, JOU; (c) JE1, JE2; (d) JE1, JE2, JEU; (e) Type A altered to E in JA1; (f) Type A altered to E in JA2
Fig. 2. Venn diagrams of composing C1 and C2

According to Table 4 and Fig. 2, there are 12 elements in the set J_EU; 9 of them come from J_E1 and J_E2 and are not transformed, while three of them come from J_A1 and J_A2. Specifically, attribute 15 is in the intersection of J_A1 and J_O2, and attributes 1 and 12 are in the intersection of J_A2 and J_O1. According to (5), (7) and (8), DBD is calculated as:
AlterA_1O_2 = {15}, |AlterA_1O_2| = 1,
AlterA_2O_1 = {1, 12}, |AlterA_2O_1| = 2,
DFAA = (2/5)·1 + (3/5)·2 = 1.6,
DFN = (2 + 3) − 2/(1/2 + 1/3) = 2.6,
DBD = DFAA × DFN = 4.16.
In this example, the DBD of C_1 and C_2 is 4.16. Similarly, the DBD of C_1 and C_3 is 7.00, and the DBD of C_2 and C_3 is 3.64.

3.2. Example of CVTAB

As above, the DBD of every pair of clusters can be calculated; the results form a symmetric matrix, given in Table 5, with 6 off-diagonal values. The average of these values is CVTAB. According to (9),
CVTAB = (4.16 + 7.00 + 4.16 + 3.64 + 7.00 + 3.64) / (3 × (3 − 1)) = 4.9333.

Table 5. Symmetric matrix of DBD
Cluster  C1    C2    C3
C1       –     4.16  7.00
C2       4.16  –     3.64
C3       7.00  3.64  –

4. Experiments

4.1. Experiment design

In this experiment, the k-modes algorithm is run on two UCI data sets (Table 6), implemented in Matlab R2015b, to measure the effectiveness of clustering. To compare the effectiveness of various clustering validation measures, data sets with external information (original cluster labels) were selected. For the Zoo data, after eliminating repeated objects, 59 objects remain. Six further indices, the Calinski-Harabasz index (CH), Dunn index (D), I index (I), Silhouette index (S), Normalized Mutual Information (NMI) [19], and Accuracy index (ACC), are used as comparative indices to evaluate the validation properties and performance of CVTAB. Among these, CH, Dunn, I and S are internal clustering validation measures, focusing on the data set itself, while NMI and ACC are external clustering validation measures.

Table 6. Data sets for experiments
Data set       Instances  Attributes  Categories
Zoo            101        16          7
Small Soybean  47         35          4

The k-modes tests are carried out 100 times to eliminate the effects of randomness, each time with a different parameter k in the range 2-11.

4.2. Results and analysis

Tables 7 and 8 show the averages of the clustering results over the 100 runs of k-modes. There are some null values in the ACC column because ACC cannot evaluate the result when the calculated number of clusters is larger than the actual number of clusters.

Table 7. Results on the Zoo data by k-modes
k   CVTAB  CH     D (×10^1)  S (×10^3)  I (×10^1)  NMI   ACC
2   5.99   49.18  4.69       21.75      19.29      0.39  0.51
3   9.52   31.18  4.27       32.18      17.22      0.56  0.65
4   11.14  23.37  4.31       42.94      12.22      0.63  0.71
5   12.47  18.96  4.44       47.70      9.60       0.67  0.72
6   13.41  16.66  4.15       58.57      8.29       0.69  0.75
7   13.60  13.95  4.12       69.23      7.02       0.82  0.75
8   13.90  12.22  4.02       81.43      6.14       0.68  –
9   13.32  11.50  4.04       93.11      5.39       0.68  –
10  13.17  10.21  4.00       105.27     4.96       0.67  –
11  12.92  9.65   4.00       105.98     4.38       0.67  –

Table 8. Results on the Small Soybean data by k-modes
k   CVTAB  CH     D (×10^1)  S (×10^2)  I (×10^1)  NMI   ACC
2   32.10  39.04  7.22       2.86       4.61       0.45  0.57
3   43.73  22.34  6.67       4.34       3.18       0.65  0.78
4   44.69  16.73  5.74       5.49       2.30       0.90  0.87
5   39.21  12.63  5.32       5.74       1.67       0.72  –
6   35.39  10.72  5.29       6.13       1.35       0.70  –
7   31.99  9.14   5.33       7.56       1.10       0.69  –
8   28.79  8.14   5.39       8.25       0.93       0.67  –
9   26.38  7.06   5.42       8.94       0.81       0.64  –
10  24.04  6.26   5.45       9.65       0.71       0.62  –
11  22.80  5.91   5.59       10.20      0.64       0.64  –

Based on Tables 7 and 8, Fig. 3 shows the trends of the results under various numbers of clusters (k), comparing the indices after data standardization. ACC is ignored in Fig. 3 because of the null values. In fact, determining the number of clusters is one of the most important clustering validation problems [20]. Fig. 3a and b show the results of the internal validation indices. CH, S and I have a significant monotonic relationship with k: S shows a monotonic increase, while CH and I show a monotonic decrease. Clearly, CH, S and I are affected by k on Boolean data; being sensitive to k, they cannot evaluate the clustering results objectively and validly. The Dunn index, in turn, stays nearly flat in both (a) and (b) because of its sensitivity to noise, which adds uncertainty to the evaluation. So CH, I, S and D are not suitable for Boolean data, and none of them can determine the best k for the k-modes algorithm. In contrast, on the Zoo data CVTAB increases rapidly from k = 2 to k = 5, increases smoothly from k = 5 to k = 8, then decreases.
On the Small Soybean data, CVTAB shoots up until k = 3, increases smoothly from k = 3 to k = 4, then falls rapidly. In summary, CVTAB has an obvious peak as k increases, suggesting a near, or even exactly correct, cluster number on Boolean data, which CH, I, S and D cannot do.
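The suggested cluster number is simply the k at which CVTAB peaks. Using the CVTAB columns of Tables 7 and 8, the selection is a one-liner (a trivial Python sketch):

```python
# CVTAB averages from Tables 7 and 8, for k = 2..11
cvtab_zoo = dict(zip(range(2, 12),
    [5.99, 9.52, 11.14, 12.47, 13.41, 13.60, 13.90, 13.32, 13.17, 12.92]))
cvtab_soy = dict(zip(range(2, 12),
    [32.10, 43.73, 44.69, 39.21, 35.39, 31.99, 28.79, 26.38, 24.04, 22.80]))

# Pick the k that maximizes CVTAB
best_zoo = max(cvtab_zoo, key=cvtab_zoo.get)  # 8, close to the 7 actual categories
best_soy = max(cvtab_soy, key=cvtab_soy.get)  # 4, the actual number of categories
print(best_zoo, best_soy)  # 8 4
```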

(a) Internal indices on the Zoo data; (b) internal indices on the Small Soybean data; (c) CVTAB and NMI on the Zoo data; (d) CVTAB and NMI on the Small Soybean data
Fig. 3. Clustering results by k-modes

Fig. 3c and d show the trends of CVTAB and NMI as k increases. As an internal evaluation index, CVTAB shows consistency with the external clustering validation measures: the normalized value of CVTAB is near that of NMI. NMI is more accurate than CVTAB on the Zoo data, but on the Small Soybean data both perform equally well. Naturally, since NMI is an external validation measure, it knows the true cluster number in advance and is usually more precise than an internal measure. However, internal validation measures are the only option for cluster validation without external information, and conditions where external information is not available are the more common ones in practice. From this perspective, CVTAB is more applicable.

5. Conclusion

CVTAB is an effective internal clustering validation measure for high dimensional Boolean data. It is also suitable for categorical data, which can be translated into Boolean data. Experimental results show that, compared with some internal validation indices (the CH, I and S indices), CVTAB shows consistency with

the external clustering validation measure (NMI); CVTAB can point to the best clustering result instead of varying monotonically with the parameter of the algorithm. Meanwhile, CVTAB is not as sensitive to noise as the Dunn index (an internal validation index). Compared with NMI, which is an external validation index, CVTAB evaluates the clustering validation without external information; from this perspective, CVTAB is more applicable. In addition, CVTAB can help optimize a clustering algorithm by determining its parameters, or by selecting the optimal result from many experiments to avoid the negative impact of randomness.

Acknowledgments: This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No 71271027, and the Research Fund for the Doctoral Program of Higher Education under Grant No 20120006110037.

References

1. Elangasinghe, M. A., N. Singhal, K. N. Dirks et al. Complex Time Series Analysis of PM10 and PM2.5 for a Coastal Site Using Artificial Neural Network Modelling and k-Means Clustering. Atmospheric Environment, Vol. 94, 2014, pp. 106-116.
2. Ferrandez, S. M., T. Harbison, T. Weber et al. Optimization of a Truck-Drone in Tandem Delivery Network Using k-Means and Genetic Algorithm. Journal of Industrial Engineering & Management, Vol. 9, 2016, No 2, pp. 374-388.
3. Guan, N., D. Tao, Z. Luo et al. NeNMF: An Optimal Gradient Method for Nonnegative Matrix Factorization. IEEE Transactions on Signal Processing, Vol. 60, 2012, No 6, pp. 2882-2898.
4. Niennattrakul, V., C. A. Ratanamahatana. On Clustering Multimedia Time Series Data Using k-Means and Dynamic Time Warping. International Conference on Multimedia and Ubiquitous Engineering, IEEE, 2007, pp. 733-738.
5. Niennattrakul, V., C. A. Ratanamahatana. On Clustering Multimedia Time Series Data Using k-Means and Dynamic Time Warping. International Conference on Multimedia and Ubiquitous Engineering, IEEE, 2007, pp. 733-738.
6. Rani, S., G. Sikka. Recent Techniques of Clustering of Time Series Data: A Survey. International Journal of Computer Applications, Vol. 52, 2012, No 15, pp. 1-9.
7. Liu, Y. Research on Internal Clustering Validation Measures. University of Science and Technology Beijing, 2012, pp. 16-20.
8. Kremer, H., P. Kranen, T. Jansen et al. An Effective Evaluation Measure for Clustering on Evolving Data Streams. In: Proc. of 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 868-876.
9. Feng, X., S. Wu, Y. Liu. Imputing Missing Values for Mixed Numeric and Categorical Attributes Based on Incomplete Data Hierarchical Clustering. In: Proc. of International Conference on Knowledge Science, Engineering and Management, Springer Verlag, 2011, pp. 414-424.
10. Ralambondrainy, H. A Conceptual Version of the k-Means Algorithm. Pattern Recognition Letters, Vol. 16, 1995, No 11, pp. 1147-1157.
11. Liu, Y., Z. Li, H. Xiong et al. Understanding and Enhancement of Internal Clustering Validation Measures. IEEE Transactions on Systems, Man & Cybernetics, Part B: Cybernetics, Vol. 43, 2012, No 3, pp. 982-994.
12. Kraus, J. M., C. Müssel, G. Palm et al. Multi-Objective Selection for Collecting Cluster Alternatives. Computational Statistics, Vol. 26, 2011, No 2, pp. 341-353.
13. Zhang, G. X., L. Q. Pan. A Survey of Membrane Computing as a New Branch of Natural Computing. Chinese Journal of Computers, Vol. 33, 2010, No 2, pp. 208-214.

14. Busi, N. Using Well-Structured Transition Systems to Decide Divergence for Catalytic P Systems. Theoretical Computer Science, Vol. 372, 2007, No 2-3, pp. 125-135.
15. Nishida, T. Y. An Approximate Algorithm for NP-Complete Optimization Problems Exploiting P Systems. In: Proc. of 8th World Multi-Conference on Systems, Cybernetics and Informatics, 2004, pp. 109-112.
16. Huang, L. Research on Membrane Computing Optimization Methods. Zhejiang University, 2007.
17. Huang, Z. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. Research Issues on Data Mining & Knowledge Discovery, 1998, pp. 1-8.
18. Wu, S., X. Gao. CABOSFV Algorithm for High Dimensional Sparse Data Clustering. Journal of University of Science & Technology Beijing, Vol. 11, 2004, No 3, pp. 283-288.
19. Knops, Z. F., J. B. Maintz, M. A. Viergever et al. Normalized Mutual Information Based Registration Using k-Means Clustering and Shading Correction. Medical Image Analysis, Vol. 10, 2006, No 3, pp. 432-439.
20. Chen, L. F., Q. S. Jiang, S. R. Wang. A Hierarchical Method for Determining the Number of Clusters. Journal of Software, Vol. 19, 2008, No 1, pp. 62-72.