A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data


Journal of Computational Information Systems 11: 6 (2015) 2139-2146
Available at http://www.jofcis.com

Zujun CHEN, Dan LI, Chongquan ZHONG, Xiaorui XU
School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China

Abstract

Missing data handling is a challenging issue often dealt with in data mining and pattern classification. In this paper, a fuzzy c-means clustering algorithm based on pseudo-nearest-neighbor intervals for incomplete data is given. The data are first completed using the pseudo-nearest-neighbor intervals approach; the data set can then be clustered with the fuzzy c-means algorithm for interval-valued data. The proposed algorithm estimates the missing attribute values without normalization, and thus captures the essence of pattern similarities in the original, untouched data set. Additionally, the pseudo-nearest-neighbor intervals representation accounts for the implicit uncertainty of missing attribute values, and also considers the angle between incomplete data and the other data. Results on several incomplete data sets demonstrate the effectiveness of the proposed algorithm.

Keywords: Fuzzy C-means; Missing Data Recovery; Pseudo-nearest-neighbor; Clustering

This work is partially supported by the National Science Foundation of China under Grant 61305034, and the Fundamental Research Funds for the Central Universities DUT13JS03.
Corresponding author. Email address: ldan@dlut.edu.cn (Dan LI).
ISSN 1553-9105 / Copyright 2015 Binary Information Press. DOI: 10.12733/jcis13783. March 15, 2015.

1 Introduction

The fuzzy c-means (FCM) algorithm is a well-known partitional clustering method and has long played an important role in various application domains, such as pattern recognition and data mining. However, data sets can sometimes be incomplete as a result of random noise, limited time, etc. Moreover, FCM is not directly applicable to such incomplete data. If not handled properly, incomplete data may lead to large errors or biased clustering results.

Much research has been done on clustering of incomplete data. Hathaway and Bezdek proposed four different strategies for FCM clustering of incomplete data sets [1]. In the whole data strategy (WDS), only complete data are considered during FCM clustering. The partial distance strategy (PDS) calculates the partial distances between incomplete data and prototypes using the available attribute values, and then scales the quantity by the reciprocal of the proportion of components used. The third approach is referred to as the optimal completion strategy (OCS). In this approach, missing attribute values are viewed as additional parameters that are optimized to find better estimates in each iteration. The fourth strategy (nearest prototype strategy, NPS) is a simple modification of OCS, in which each missing attribute value is substituted by the corresponding attribute value of the nearest prototype.
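To make the partial distance strategy concrete, here is a minimal Python sketch of the scaled partial distance described above. It is an illustration under assumptions, not code from the paper: missing values are marked with NaN, and the function name is ours.

```python
import numpy as np

def partial_distance_sq(x, v):
    """PDS-style partial squared Euclidean distance between an incomplete
    datum x (missing entries are NaN) and a complete prototype v: the sum
    over the available components, scaled by the reciprocal of the
    proportion of components used, i.e. multiplied by s / |available|."""
    present = ~np.isnan(x)                      # attributes actually observed
    dist = np.sum((x[present] - v[present]) ** 2)
    return (x.size / present.sum()) * dist

x = np.array([5.1, np.nan, 1.4, 0.2])           # one of four attributes missing
v = np.array([5.0, 3.3, 1.5, 0.3])
print(partial_distance_sq(x, v))                # partial sum scaled by 4/3
```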

By taking into account the essence of pattern similarities, Zhu [2] studied a new missing data recovery method, namely the pseudo-nearest-neighbor substitution approach. In 2003, a kernel-based fuzzy c-means (KFCM) algorithm was presented to cluster incomplete data [3]. Lim [4] proposed a hybrid network comprising fuzzy ARTMAP and fuzzy c-means clustering for pattern classification with incomplete data. Li [5] developed a fuzzy c-means algorithm for incomplete data based on nearest-neighbor intervals (FCM-NNI), and Li [6] proposed a hybrid genetic algorithm-fuzzy c-means approach, in which the estimation of missing attribute values is limited to the subsets that contain the nearest neighbors of incomplete data rather than the entire attribute space. Wang [7] proposed a hybrid clustering method for incomplete data with nearest-neighbor intervals.

In this paper, a novel fuzzy c-means algorithm for incomplete data based on pseudo-nearest-neighbor intervals (FCM-PNNI) is given. First, missing attribute values are represented by pseudo-nearest-neighbor intervals (PNNI), which account for the implicit uncertainty of missing attribute values and, in particular, for the angle between vectors in the data set. Moreover, the proposed missing data recovery approach is performed on the original data set. Second, an FCM clustering algorithm for interval-valued data is used to cluster the incomplete data, and the interval prototypes obtained can reflect the shape of the clusters to some extent.

The rest of this paper is organized as follows. Section 2 describes the basis of pseudo-nearest-neighbor intervals for missing data recovery. Section 3 introduces the FCM algorithm based on PNNI for incomplete data (FCM-PNNI). Section 4 gives experimental results of the proposed method and compares it with five other methods. Section 5 contains the concluding remarks.

2 Pseudo-nearest-neighbor Intervals Determination

Nearest-neighbor (NN) based techniques have recently become a valuable tool for missing data recovery. However, the NN method often leads to over-fitting, so it is commonly extended to the k-nearest-neighbor (KNN) method. From the perspective of the accuracy of imputed data, Zhang [8] proposed a Shelly Neighbors method, in which each missing attribute value is imputed by those neighbors that form a shell surrounding the missing datum. To deal with heterogeneous data sets, a novel KNN imputation method based on gray distance, named gray KNN (GKNN) imputation, was presented [9].

In the above methods, the missing attribute values are estimated by numerical values, which ignores their implicit uncertainty. In addition, numerical data usually need normalization to avoid bias related to attribute magnitudes; however, normalization causes the data points to become closer to each other in Euclidean space. In this paper, the pseudo-nearest-neighbor method for missing data recovery is based on partial cosine similarity without normalization, and thus captures the essence of pattern similarities in the original data set.

To derive the pseudo-nearest-neighbor rule for missing data recovery, the concepts of pseudo-nearest-neighbor and pseudo-similarity are first introduced.
Let x_k and x_l be two incomplete data vectors of an s-dimensional data set X = {x_1, x_2, ..., x_n} ⊂ R^s. The order of the data elements is rearranged as follows [3]:

(1) If the elements x_kj and x_lj (1 ≤ j ≤ s) both have their values missing, they are placed toward the right end of the vectors, and the missing values are marked by a placeholder symbol (distinct from #).

(2) If exactly one of the elements x_kj and x_lj (1 ≤ j ≤ s) has its value missing, it is placed toward the right end of the vectors and the missing value is represented by the symbol #.

(3) If both of the elements x_kj and x_lj (1 ≤ j ≤ s) are non-missing, their locations and values remain unchanged.
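A minimal Python sketch of this rearrangement follows, assuming NaN is used in place of the paper's placeholder symbols; the function name and the ordering within each group are our choices, not the paper's.

```python
import numpy as np

def rearrange(xk, xl):
    """Align two incomplete vectors per rules (1)-(3): components present
    in both vectors stay at the front, components missing in exactly one
    vector move toward the right, and components missing in both go to
    the far right. Returns the reordered copies and d, the number of
    mutually present components used by the pseudo-similarity below."""
    s = xk.size
    both = [j for j in range(s) if not np.isnan(xk[j]) and not np.isnan(xl[j])]
    one  = [j for j in range(s) if np.isnan(xk[j]) != np.isnan(xl[j])]
    none = [j for j in range(s) if np.isnan(xk[j]) and np.isnan(xl[j])]
    order = both + one + none
    return xk[order], xl[order], len(both)

xk = np.array([1.0, np.nan, 3.0, 4.0])
xl = np.array([2.0, 1.0, np.nan, 5.0])
print(rearrange(xk, xl))    # mutually present components first, d = 2
```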

The pseudo-similarity between x_k and x_l is defined as

$$S_p(x_k, x_l) = d \cdot \frac{\sum_{i=1}^{d} x_{ki}\, x_{li}}{\sqrt{\sum_{i=1}^{d} x_{ki}^2}\, \sqrt{\sum_{i=1}^{d} x_{li}^2}} \qquad (1)$$

where d is the number of components that are present in both vectors after the rearrangement. Note that it is similar to the cosine similarity when d = s. The measurement is in fact a modification of the well-known cosine similarity, except that S_p(x_k, x_l) carries the coefficient d, which gives more weight to vector pairs that have more present elements. For a vector x_k, if x_l (l ≠ k) has the largest pseudo-similarity value, then x_k and x_l are pseudo-nearest-neighbors. The pseudo-similarity between x_k and x_l is affected both by the angle between x_k and x_l and by the coefficient d.

The procedure of the pseudo-nearest-neighbor intervals method for missing data recovery is presented as follows.

Algorithm 1: Pseudo-nearest-neighbor intervals method (PNNI)

Input: an s-dimensional incomplete data set X = {x_1, x_2, ..., x_n} ⊂ R^s
Output: an s-dimensional interval-valued data set X = {x_1, x_2, ..., x_n}, where ∀j, k: x_kj = [x_kj^-, x_kj^+]
Procedure:
 1: for each datum x_k (1 ≤ k ≤ n) in data set X
 2:   for each attribute value x_kj (1 ≤ j ≤ s) in x_k
 3:     if x_kj is missing
 4:       compute S_p(x_k, x_l) for every x_l whose x_lj is non-missing (1 ≤ l ≤ n, l ≠ k)
 5:       search for x_k's q pseudo-nearest-neighbors, which form a data set Y = {y_1, y_2, ..., y_q} ⊂ R^s
 6:       x_kj^- = min{y_1j, y_2j, ..., y_qj}
 7:       x_kj^+ = max{y_1j, y_2j, ..., y_qj}
 8:     else if x_kj is non-missing
 9:       x_kj^- = x_kj
10:       x_kj^+ = x_kj
11:     end if
12:   end for
13: end for

According to the PNNI strategy above, choosing an appropriate value of q is a key point. If the value of q is too small, the acquired PNNI is likely to be a biased estimate. On the contrary, if the value of q is too large, the clustering performance will be seriously affected; in the extreme, if q equals the number of data in the data set, the range of each PNNI is too wide to represent the missing values properly. In this paper, only one non-missing attribute value at a time is assumed to be missing at random from the given incomplete data set, which changes the distribution of the original incomplete attributes as little as possible. One then tests whether the removed non-missing attribute value falls within the corresponding PNNI for candidate values q = 2, 3, ... The whole process is repeated Z times, where Z equals the number of missing attribute values. The selected q is located at the inflection point of the resulting percentage-probability curve. This choice not only reduces the bias of the PNNI estimation but also avoids attribute confusion among clusters, which would result from excessive estimation of the interval ranges.
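The following Python sketch puts Eq. (1), Algorithm 1 and the q-selection test just described together. It is our hedged reading, not the authors' code: missing values are marked with NaN, Eq. (1) is interpreted as d times the cosine similarity over the mutually observed components, and all function names are ours.

```python
import numpy as np

def pseudo_similarity(xk, xl):
    """Pseudo-similarity S_p of Eq. (1): the cosine similarity over the d
    mutually present components, multiplied by d so that pairs sharing
    more observed attributes weigh more. Missing entries are NaN."""
    both = ~np.isnan(xk) & ~np.isnan(xl)
    d = int(both.sum())
    if d == 0:
        return 0.0                       # no shared components to compare
    a, b = xk[both], xl[both]
    return d * float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pnni(X, q):
    """Algorithm 1: replace each missing x_kj by the interval spanned by
    attribute j over the q pseudo-nearest-neighbors of x_k, searched
    among the data whose attribute j is observed."""
    n, s = X.shape
    lo, hi = X.copy(), X.copy()
    for k in range(n):
        for j in range(s):
            if np.isnan(X[k, j]):
                cands = [l for l in range(n)
                         if l != k and not np.isnan(X[l, j])]
                cands.sort(key=lambda l: pseudo_similarity(X[k], X[l]),
                           reverse=True)
                vals = X[cands[:q], j]
                lo[k, j], hi[k, j] = vals.min(), vals.max()
    return lo, hi

def pnni_accuracy(X, q, Z, rng):
    """q-selection test from the end of Section 2: hide one observed
    value at a time, rebuild the intervals, and record how often the
    hidden value falls inside its PNNI (the curves of Fig. 2)."""
    observed = np.argwhere(~np.isnan(X))
    hits = 0
    for _ in range(Z):
        k, j = observed[rng.integers(len(observed))]
        Xm = X.copy()
        true_val, Xm[k, j] = Xm[k, j], np.nan
        lo, hi = pnni(Xm, q)
        hits += int(lo[k, j] <= true_val <= hi[k, j])
    return hits / Z

# q is then read off the inflection point of the accuracy-vs-q curve:
# accs = [pnni_accuracy(X, q, Z, np.random.default_rng(0)) for q in range(2, 11)]
```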

3 FCM Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals

The FCM algorithm for incomplete data based on PNNI (FCM-PNNI) can be divided into two stages: first, the missing attribute values are represented by appropriate intervals according to the PNNI determination approach; then the intervals are clustered with the fuzzy c-means algorithm for interval-valued data. The clustering algorithm is given as follows.

Let X = {x_1, x_2, ..., x_n} be a set of n vectors in an s-dimensional interval-valued data set to be partitioned into c (fuzzy) clusters, where x_k = [x_k1, x_k2, ..., x_ks], ∀j, k: x_kj = [x_kj^-, x_kj^+] and x_kj^- ≤ x_kj^+. The clustering algorithm minimizes the following objective function

$$J(U, V) = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^m \,\|x_k - v_i\|_2^2 = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^m \big[(x_k^- - v_i^-)(x_k^- - v_i^-)^T + (x_k^+ - v_i^+)(x_k^+ - v_i^+)^T\big] \qquad (2)$$

subject to

$$\sum_{i=1}^{c} u_{ik} = 1, \quad k = 1, 2, \dots, n \qquad (3)$$

where v_i = [v_i^-, v_i^+], 1 ≤ i ≤ c, is the i-th interval cluster prototype. The following equations are used to minimize the above objective function [10]:

$$v_i^- = \frac{\sum_{k=1}^{n} u_{ik}^m \, x_k^-}{\sum_{k=1}^{n} u_{ik}^m}, \quad i = 1, 2, \dots, c \qquad (4)$$

$$v_i^+ = \frac{\sum_{k=1}^{n} u_{ik}^m \, x_k^+}{\sum_{k=1}^{n} u_{ik}^m}, \quad i = 1, 2, \dots, c \qquad (5)$$

$$u_{ik} = \left[\, \sum_{t=1}^{c} \left( \frac{(x_k^- - v_i^-)(x_k^- - v_i^-)^T + (x_k^+ - v_i^+)(x_k^+ - v_i^+)^T}{(x_k^- - v_t^-)(x_k^- - v_t^-)^T + (x_k^+ - v_t^+)(x_k^+ - v_t^+)^T} \right)^{\frac{1}{m-1}} \right]^{-1} \qquad (6)$$

Moreover, if for some k and h (1 ≤ k ≤ n, 1 ≤ h ≤ c) we have ∀j: x_kj ⊆ v_hj — in other words, x_k is within the convex hyper-polyhedron formed by v_h — then x_k can be considered to belong fully to the h-th cluster with membership 1, and to the other clusters with membership 0. Thus, if i = h then u_ik = 1; otherwise u_ik = 0.
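Before stating the full procedure, here is a minimal NumPy sketch of this interval-valued FCM iteration (Eqs. (3)-(6)). The interval arrays xlo/xhi are assumed to come from PNNI, and the function and variable names are ours, not the paper's.

```python
import numpy as np

def interval_fcm(xlo, xhi, c, m=2.0, eps=1e-6, max_iter=300, seed=0):
    """Interval-valued FCM of Section 3. xlo, xhi: (n, s) arrays holding
    the lower/upper interval bounds produced by PNNI. Returns the
    membership matrix U of shape (c, n) and prototypes (vlo, vhi)."""
    n = xlo.shape[0]
    rng = np.random.default_rng(seed)
    u = rng.random((c, n))
    u /= u.sum(axis=0)                                    # enforce Eq. (3)
    for _ in range(max_iter):
        w = u ** m                                        # fuzzified memberships
        vlo = (w @ xlo) / w.sum(axis=1, keepdims=True)    # Eq. (4)
        vhi = (w @ xhi) / w.sum(axis=1, keepdims=True)    # Eq. (5)
        # squared interval distance ||x_k - v_i||^2 from Eq. (2), shape (c, n)
        d2 = (((xlo[None, :, :] - vlo[:, None, :]) ** 2).sum(-1)
              + ((xhi[None, :, :] - vhi[:, None, :]) ** 2).sum(-1))
        p = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
        u_new = p / p.sum(axis=0)                         # Eq. (6)
        # full membership for data inside a prototype's hyper-polyhedron
        inside = ((vlo[:, None, :] <= xlo[None, :, :])
                  & (xhi[None, :, :] <= vhi[:, None, :])).all(-1)
        for k in np.flatnonzero(inside.any(axis=0)):
            u_new[:, k] = 0.0
            u_new[inside[:, k].argmax(), k] = 1.0
        done = np.abs(u_new - u).max() <= eps             # Step 6 criterion
        u = u_new
        if done:
            break
    return u, vlo, vhi
```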

For an s-dimensional incomplete data set X = {x_1, x_2, ..., x_n} ⊂ R^s, the procedure of FCM-PNNI can be described as follows.

Step 1. Choose the number of pseudo-nearest-neighbors q based on the determination strategy.
Step 2. ∀k, j, represent x_kj by the interval [x_kj^-, x_kj^+] according to the PNNI strategy.
Step 3. Fix m, c and ε, then initialize the partition matrix U^(0).
Step 4. At iteration l (l = 1, 2, ...), calculate the matrices of cluster prototypes V^-(l) and V^+(l) using Eq. (4), Eq. (5) and U^(l-1).
Step 5. Update the partition matrix U^(l) using Eq. (6), V^-(l) and V^+(l).
Step 6. Compare U^(l) to U^(l-1) using ||U^(l) - U^(l-1)|| ≤ ε. If true, stop; otherwise set l = l + 1 and return to Step 4.

4 Numerical Experiments

4.1 Data sets

In the following experiments, we show the clustering performance of FCM-PNNI on three well-known data sets: Iris, Bupa Liver Disorders and New Thyroid, which are often used as benchmarks to test the performance of clustering algorithms. The Iris data contains three types of Iris plants, and each vector is described by 4 attributes. The Bupa Liver Disorders database consists of two classes, liver disorders and no liver disorders, and includes 345 vectors with 6 attributes. The New Thyroid data contains three classes with 215 vectors, and each vector has 5 attributes.

In this paper, a certain number of attribute values are randomly selected and removed to generate incomplete data sets; that is, the missing attribute values are missing completely at random (MCAR). The random selection of missing attribute values is constrained so that [1] (a sketch of this masking scheme is given below):

(1) each original attribute vector x_k retains at least one component;

(2) each attribute has at least one value present in the incomplete data set X.

4.2 Experimental results

Table 1: The values of q on the three data sets with four different missing percentages

             %missing (NNI)           %missing (PNNI)
Data sets    5%    10%   15%   20%    5%    10%   15%   20%
Iris         5     5     6     7      5     5     6     6
Thyroid      5     5     5     5      5     5     5     5
Bupa         5     5     5     5      5     5     5     5

The performance of the given algorithm (FCM-PNNI) is compared with four versions of FCM (WDS-FCM, PDS-FCM, OCS-FCM, NPS-FCM) proposed by Hathaway and Bezdek [1] and with the FCM-NNI algorithm advocated by Li [5].
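The incomplete data sets are generated by the MCAR masking scheme of Section 4.1; below is one possible Python sketch of that scheme under the two constraints. The function name and the NaN convention are assumptions, not from the paper.

```python
import numpy as np

def make_mcar(X, frac, rng):
    """Randomly remove round(frac * n * s) attribute values (MCAR),
    under the two constraints of Section 4.1: every vector keeps at
    least one component and every attribute keeps at least one value.
    Retry-based, adequate for the moderate 5-20% settings used here."""
    Xm = X.astype(float).copy()
    n, s = Xm.shape
    target = int(round(frac * n * s))
    removed = 0
    while removed < target:
        k, j = rng.integers(n), rng.integers(s)
        if np.isnan(Xm[k, j]):
            continue                                  # already removed
        if (~np.isnan(Xm[k])).sum() > 1 and (~np.isnan(Xm[:, j])).sum() > 1:
            Xm[k, j] = np.nan                         # constraints (1), (2) hold
            removed += 1
    return Xm

rng = np.random.default_rng(0)
X = rng.random((150, 4))                              # an Iris-sized toy set
X10 = make_mcar(X, 0.10, rng)                         # 10% missing, MCAR
```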

Several notable experimental details are as follows. The stopping criterion is ||U^(l) - U^(l-1)|| ≤ ε with ε = 10^-6, and the fuzzification parameter m is set to 2. To avoid the potential variation introduced by the random initialization of the membership grade matrix and by the random placement of the missing attribute values, all the results shown in Tables 2-4 are averages over 30 trials. The values of q on the three incomplete data sets with the four missing percentages are selected by the strategy given at the end of Section 2. The optimal solutions in each row are presented in boldface, and the suboptimal solutions are underlined.

Fig. 1: Change curves of averaged NNI accuracies of 30 trials (NNI accuracy vs. number of nearest neighbors q; panels: Iris, Thyroid, Bupa; curves: 5%, 10%, 15% and 20% missing)

Fig. 2: Change curves of averaged PNNI accuracies of 30 trials (PNNI accuracy vs. number of pseudo-nearest-neighbors q; panels: Iris, Thyroid, Bupa; curves: 5%, 10%, 15% and 20% missing)

Table 2: Averaged results of 30 trials using incomplete Iris data sets

            Mean num. of iterations to termination              Mean num. of misclassifications
%missing    WDS      PDS      OCS      NPS      NNI      PNNI   WDS      PDS      OCS      NPS      NNI      PNNI
0           30.48    31.02    30.34    30.52    30.44    29.62  16       16       16       16       16       16
5           30.77    30.23    39.50    32.50    29.93    30.00  16.57    17.03    17.13    16.87    16.70    15.30
10          29.80    30.77    50.07    32.73    30.53    28.23  16.37    16.67    16.80    16.60    16.17    14.93
15          29.60    30.90    44.07    31.67    31.37    28.93  16.67    16.70    16.73    16.60    16.17    14.03
20          31.20    29.30    52.47    32.67    31.20    28.60  17.53    17.97    18.37    17.93    16.70    15.37

Table 3: Averaged results of 30 trials using incomplete Thyroid data sets

            Mean num. of iterations to termination               Mean num. of misclassifications
%missing    WDS      PDS      OCS      NPS      NNI      PNNI    WDS      PDS      OCS      NPS      NNI      PNNI
0           134.23   137.87   139.10   129.23   145.50   133.40  20       20       20       20       20       20
5           167.23   150.97   172.57   127.83   147.83   123.67  44.00    41.23    48.23    54.60    28.17    25.03
10          121.83   135.00   130.83   90.40    113.73   82.93   54.53    33.80    61.40    60.13    32.37    21.60
15          102.63   112.00   126.50   77.10    107.33   88.53   51.13    45.40    47.10    43.27    35.87    21.53
20          72.07    88.87    123.43   92.03    89.57    72.00   46.30    47.67    48.30    47.67    44.57    21.40

Table 4: Averaged results of 30 trials using incomplete Bupa data sets

            Mean num. of iterations to termination               Mean num. of misclassifications
%missing    WDS      PDS      OCS      NPS      NNI      PNNI    WDS      PDS      OCS      NPS      NNI      PNNI
0           40.80    39.82    40.44    39.68    39.98    39.58   177      177      177      177      177      177
5           41.47    41.40    45.00    43.07    40.23    41.07   176.93   177.03   177.03   177.03   176.67   175.67
10          43.13    40.57    53.07    45.03    41.90    42.70   177.77   176.87   176.60   176.97   174.43   174.67
15          40.87    42.00    61.67    46.57    41.70    41.27   176.60   177.00   177.60   177.90   173.03   172.40
20          42.10    42.70    74.70    53.83    66.00    42.03   178.07   177.20   177.23   177.30   172.13   172.90

4.3 Discussion

(1) Overall comparison: In terms of misclassification error, FCM-PNNI is always the best or among the better performers. As for the average number of iterations to termination, FCM-PNNI needs the minimal or subminimal number of iterations to achieve convergence on the three data sets, except for the 10% case of the incomplete Bupa data set. However, since clustering is often done off-line, the misclassification errors matter more than the numbers of iterations to convergence.

(2) Comparison of FCM-PNNI with WDS-FCM, PDS-FCM, OCS-FCM and NPS-FCM: WDS-FCM deletes all incomplete data, which results in a loss of information. PDS-FCM scales partial distances by the reciprocal of the proportion of components used, but the distribution information of the missing attribute values implicitly embodied in the other data is not taken into account. OCS-FCM treats missing attribute values as variables to be optimized so as to obtain the smallest possible value of the FCM clustering objective function. NPS-FCM, as a simple modification of OCS-FCM, replaces missing attribute values by the corresponding attribute values of the nearest prototype during each iteration. Compared with these four algorithms, FCM-PNNI achieves interval estimates of the missing attribute values by fully using the pseudo-nearest-neighbor information of the incomplete data. On one hand, the interval estimates represent the uncertainty of the missing attribute values; on the other hand, the convex hyper-polyhedrons formed by the interval prototypes can represent the shape of the clusters and the sample distribution to some extent.

(3) Comparison of FCM-PNNI and FCM-NNI: The difference between the two algorithms lies in the similarity measurement used to search for similar samples (nearest-neighbors or pseudo-nearest-neighbors): NNI is based on the partial Euclidean distance, while PNNI is based on the partial cosine similarity. From Tables 2-4 it can be seen that FCM-PNNI generally performs better than FCM-NNI, except for the 10% and 20% cases of the incomplete Bupa data set. Compared with FCM-NNI, FCM-PNNI takes the angle between incomplete data and other data into consideration. In addition, the incomplete data set is not normalized before recovery, which captures the essence of pattern similarities in the original, untouched data set.

(4) The number of pseudo-nearest-neighbors q: Following the determination approach for q given at the end of Section 2, 30 trials were run to eliminate the variation in the experiment.

Fig. 2 shows the average percentage probability that non-missing attribute values fall within the corresponding PNNI (the PNNI accuracy) for different values of q; the selected q is located at the inflection point of this percentage-probability curve. The selected values of the nearest-neighbors q are shown in Fig. 1; the procedure is similar to that for PNNI and is hence omitted here.

5 Conclusion

In this paper, a novel fuzzy c-means algorithm based on pseudo-nearest-neighbor intervals for incomplete data (FCM-PNNI) is given. The proposed algorithm has the following characteristics: 1) it uses the pseudo-similarity, which takes the angle between incomplete data and other data into consideration; 2) it estimates the missing attribute values in the original, untouched data set; 3) it represents the implicit uncertainty of missing attribute values by intervals. The experiments have demonstrated its capability compared with the other five approaches, especially with respect to the number of iterations to termination and the misclassification error.

References

[1] Hathaway R J, Bezdek J C. Fuzzy c-means clustering of incomplete data [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2001, 31(5): 735-744.

[2] Huang X, Zhu Q. A pseudo-nearest-neighbor approach for missing data recovery on Gaussian random data sets [J]. Pattern Recognition Letters, 2002, 23(13): 1613-1622.

[3] Zhang D Q, Chen S C. Clustering incomplete data using kernel-based fuzzy c-means algorithm [J]. Neural Processing Letters, 2003, 18(3): 155-162.

[4] Lim C P, Leong J H, Kuan M M. A hybrid neural network system for pattern classification tasks with missing features [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(4): 648-653.

[5] Li D, Gu H, Zhang L. A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data [J]. Expert Systems with Applications, 2010, 37(10): 6942-6947.

[6] Li D, Gu H, Zhang L. A hybrid genetic algorithm-fuzzy c-means approach for incomplete data clustering based on nearest-neighbor intervals [J]. Soft Computing, 2013, 17(10): 1787-1796.

[7] Zhang L, Wang B. Hybrid clustering methods for incomplete data with nearest-neighbor interval [J]. Journal of Computational Information Systems, 2014, 10(14): 6007-6014.

[8] Zhang S. Shell-neighbor method and its application in missing data imputation [J]. Applied Intelligence, 2011, 35(1): 123-133.

[9] Jahromi M Z, Parvinnia E, John R. A method of learning weighted similarity function to improve the performance of nearest neighbor [J]. Information Sciences, 2009, 179(17): 2964-2973.

[10] Yu C, Fan Z. A FCM clustering algorithm for multiple attribute information with interval numbers [J]. Xitong Gongcheng Xuebao, 2004, 19: 387-393.