Rough Fuzzy c-means Subspace Clustering
Chapter 4

Rough Fuzzy c-means Subspace Clustering

In this chapter, we propose a novel adaptation of the rough fuzzy c-means algorithm for high dimensional data, obtained by modifying its objective function. The proposed algorithm automatically detects the relevant cluster dimensions of a high dimensional data set. As the assignment of weights to attributes is specific to each cluster, an efficient subspace clustering scheme is generated. We also discuss the convergence of the proposed algorithm. The remainder of this chapter is organised as follows. Section 4.1 introduces rough set theory. Section 4.2, on related work, describes how classical clustering methods have been adapted to the requirements of high dimensional data. In section 4.3, we extend the rough fuzzy c-means algorithm for subspace clustering in the form of the Rough Fuzzy c-means Subspace (RFCMS) algorithm. Section 4.4 discusses the convergence of the proposed algorithm. In section 4.5, we present the results of applying the RFCMS algorithm to several UCI data sets, and finally section 4.6 summarizes the chapter.
4.1 Introduction

Pawlak introduced rough set theory as a new framework for dealing with imperfect knowledge [Pawlak, 1991]. Rough set theory provides a methodology for addressing the problem of relevant feature selection: it selects a set of information-rich features from a data set that retains the semantics of the original data and, unlike statistical approaches, requires no human inputs [Jensen, 1999]. It is often possible to arrive at a minimal feature set (called a reduct in rough set theory) that can be used for data analysis tasks such as classification and clustering [Lingras and West, 2004], [Mitra et al., 2006]. When feature selection approaches based on rough sets are combined with an intelligent classification system, such as one based on fuzzy systems or neural networks, they retain the descriptive power of the overall classifier and result in a simplified system structure, which enhances the understandability of the resultant system [Shen, 2007].

Following Rutkowski, we describe the notion of rough sets used to model uncertainty in information systems [Rutkowski, 2008]. Formally, an information system is a pair $(U, A)$, where $U$ is a non-empty finite set of objects and $A$ is a non-empty finite set of attributes such that each attribute $a \in A$ has an associated value set $V_a$, i.e. $a : U \to V_a$ for every $a \in A$. A decision system $DS$ is defined as a pair $(U, A \cup \{d\})$, where $d \notin A$ is called the decision attribute and the elements of $A$ are called condition attributes. For an attribute set $B \subseteq A$, the set of objects in the information system that are indiscernible w.r.t. $B$ is described by the indiscernibility relation $IND_{IS}(B)$ defined as:

$$IND_{IS}(B) = \{(x_1, x_2) \in U^2 \mid a(x_1) = a(x_2)\ \forall a \in B\}.$$

The objects $x_1$ and $x_2$ are indiscernible from each other by attributes from $B$ if $(x_1, x_2) \in IND_{IS}(B)$. The equivalence classes of the $B$-indiscernibility relation are denoted by $[x]_B$.
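For illustration, the equivalence classes of the $B$-indiscernibility relation can be computed by grouping objects on their $B$-attribute values. The following sketch (in Python, over a hypothetical toy information system of our own construction) shows this; the object and attribute names are illustrative only.

```python
from collections import defaultdict

def equivalence_classes(objects, B):
    """Partition `objects` into equivalence classes of the B-indiscernibility
    relation: two objects fall in the same class iff they agree on every
    attribute in B."""
    classes = defaultdict(list)
    for name, attrs in objects.items():
        key = tuple(attrs[a] for a in B)  # the B-signature of the object
        classes[key].append(name)
    return list(classes.values())

# Hypothetical toy information system: three objects, two attributes.
U = {
    "x1": {"colour": "red", "size": "small"},
    "x2": {"colour": "red", "size": "large"},
    "x3": {"colour": "red", "size": "small"},
}
print(equivalence_classes(U, ["colour"]))          # all objects agree on colour
print(equivalence_classes(U, ["colour", "size"]))  # x1 and x3 are indiscernible
```

Coarsening the attribute set $B$ merges equivalence classes; refining it splits them, which is the basic mechanism behind reduct computation.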
If $X \subseteq U$, then $X$ can be approximated using $B$ by constructing three sets, namely the $B$-lower approximation

$$\underline{B}X = \{x \mid [x]_B \subseteq X\},$$

the $B$-upper approximation

$$\overline{B}X = \{x \mid [x]_B \cap X \neq \phi\},$$

and the $B$-boundary region $\overline{B}X - \underline{B}X$ of $X$. Evidently, the boundary region consists of all objects in the upper approximation but not in the lower approximation of $X$.

Bazan et al. discuss various techniques for rough set reduct generation and argue that classical reducts, being static, may not be stable in randomly chosen samples of a given decision table [Bazan et al., 2000]. To deal with such situations, they focus on reducts that are stable over different subsets of samples chosen from a given decision table; such reducts are called dynamic reducts. They compute reducts using an order-based genetic algorithm and subsequently extract dynamic reducts, which are used to generate classification rules. Each rule set is associated with a measure called the rule strength, which is later used to resolve conflicts when several rules are applicable. Ślezak generalized the concept of reduct by introducing the notion of association reducts, corresponding to both association rules and rough set reducts [Ślezak, 2005]. He defined an association reduct as a pair (A, B) of disjoint subsets of attributes such that all data-supported patterns involving A approximately determine those involving B. He developed an information-theory-based algorithm to compute association reducts. As the algorithm needs to examine all association reducts, it has exponential time requirements. In order to alleviate this hardship, Ślezak targeted significantly smaller ensembles of dependencies providing reasonably rich knowledge, and developed an order-based genetic algorithm to achieve this [Ślezak, 2009]. Shen and Jensen proposed the concept of a retainer as an approximation of a reduct [Richard and Qiang, 2001]. The authors suggest a heuristic to compute the retainer and demonstrate its usefulness for the classification task. For clustering a textual database consisting of N documents with a vocabulary of size V, Li et al. developed an algorithm based on approximate reducts that works in time O(VN) [Li et al., 2006].
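Given the equivalence classes $[x]_B$, the lower and upper approximations defined earlier in this section follow by simple set tests. A minimal sketch (the partition and target set below are hypothetical examples of ours):

```python
def approximations(X, classes):
    """B-lower approximation, B-upper approximation, and boundary region of a
    target set X, given the equivalence classes [x]_B of the
    indiscernibility relation."""
    X = set(X)
    lower, upper = set(), set()
    for c in classes:
        c = set(c)
        if c <= X:   # [x]_B lies entirely inside X -> lower approximation
            lower |= c
        if c & X:    # [x]_B intersects X -> upper approximation
            upper |= c
    boundary = upper - lower  # objects with unclear membership status
    return lower, upper, boundary

# Hypothetical partition into equivalence classes and a target set X.
classes = [{"x1", "x2"}, {"x3"}, {"x4", "x5"}]
X = {"x1", "x2", "x4"}
low, up, bnd = approximations(X, classes)
print(low)  # x1, x2: their whole class is inside X
print(bnd)  # x4, x5: their class straddles X
```

Here the class {x4, x5} intersects X without being contained in it, so both its members land in the boundary region, exactly the situation the rough clustering algorithms below exploit.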
4.2 Related Work

Rough sets have been widely used for classification and clustering [Lingras and West, 2004], [Mitra et al., 2006], [Pawlak, 1991]. The classical k-means algorithm has been extended to the rough k-means algorithm by Lingras et al. [Lingras and West, 2004]. In the rough k-means algorithm, a cluster in the lower approximation, called the core cluster, is surrounded by a buffer or boundary set containing objects with unclear membership status [Lingras and West, 2004]. A data point in the lower approximation surely belongs to a cluster, whereas the membership of objects in an upper approximation is uncertain. The signature of each cluster is represented by its center, lower approximation, and upper approximation. If the lower and upper approximations are equal, then the buffer set is empty and the data objects are crisply assigned to the cluster. The rough k-means algorithm follows an iterative process wherein cluster centers are updated until a convergence criterion is met. Asharaf et al. have extended the rough k-means algorithm in such a way that it does not require prior specification of the number of clusters [Asharaf and Murty, 2004]. They have proposed a two-phase algorithm: the first phase identifies a set of leaders which act as prototypes; subsequently a set of supporting leaders are identified, which can act as leaders provided they yield a better partitioning. The evolutionary rough k-medoids algorithm [Peters et al., 2008] is based on the family of rough clustering algorithms and the classical k-medoids algorithm [Kaufman and Rousseeuw, 1990]. Malyszko et al. have extended rough k-means clustering to rough entropy clustering [Malyszko and Stepaniuk, 2009]. It is an iterative process: first a predefined number of weight pairs are selected; for each weight pair a new offspring clustering is determined and its rough entropy is computed; and the partition which gives the highest rough entropy is selected. Liu et al.
have proposed a feature selection method, ISODATA-RFE, for high dimensional gene expression data sets [Liu et al., 2012]. Bhattacharyya distance is used to rank the features of the training set, and features with low Bhattacharyya distance are removed from the feature set. For separating different classes, the fuzzy ISODATA algorithm is used to calculate a sensitivity index for each feature. A recursive feature elimination method is then applied to the feature set to remove unimportant features; it generates multiple nested candidate feature subsets. Finally, the feature subset with the least error is selected for use in classification and clustering algorithms. Own and Abraham have proposed a new weighted-rough-set-based classification framework for neonatal jaundice [Own and Abraham, 2012]. The weighted information table is built by applying class-equal sample weighting: samples in the majority class have smaller weight, while samples in the minority class have larger weight. A weighted reduction algorithm, MLEM2, exploits the significance of the attributes to extract a set of diagnosis rules from the decision system of the Neonatal Jaundice database. Deng et al. have proposed an enhanced entropy-weighting subspace clustering algorithm for high dimensional gene expression data [Deng et al., 2011]. Its objective function integrates the fuzzy within-cluster compactness and the between-cluster information simultaneously. Cordeiro de Amorim and Mirkin [Cordeiro de Amorim and Mirkin, 2012] have extended the weighted k-means algorithm proposed by Huang et al., replacing the Euclidean distance metric by the Minkowski metric, as the Euclidean distance cannot capture the relationship between the scales of the feature values and the feature weights. Bai et al. have proposed a novel weighting algorithm for categorical data [Bai et al., 2011]. The algorithm computes two weights for each dimension in each cluster; these weight values are used to identify the subsets of attributes which can categorize different clusters. Rough set theory has been applied in conjunction with fuzzy set theory in several domains such as fuzzy rule extraction, reasoning with uncertainty, fuzzy modelling, and feature selection [Maji and Pal, 2010].
The classical fuzzy c-means algorithm has been used in conjunction with rough sets to develop the rough fuzzy c-means (RFCM) algorithm [Mitra and Banka, 2007]. The concept of membership in FCM enables efficient handling of overlapping partitions, while rough sets are aimed at modelling uncertainty in data. Such hybrid techniques provide a strong paradigm for uncertainty handling in various application domains such as pattern recognition, image processing, mining stock prices, vocabulary for information retrieval, fuzzy clustering, dimensionality reduction, data mining, and knowledge discovery [Maji and Paul, 2011], [Maji and Pal, 2010]. Maji and Pal proposed an algorithm, RFCMdd, for selecting the most informative bio-bases (medoids), where each partition is represented by a medoid computed as the weighted average of the crisp lower approximation and the fuzzy boundary [Maji and Pal, 2007b]. Maji introduced a quantitative measure of similarity among genes, based on fuzzy rough sets, to develop the fuzzy-rough supervised attribute clustering (FRSAC) algorithm [Maji, 2011].

4.3 Rough Fuzzy c-means Subspace Clustering

In this section, we propose an algorithm based on the rough fuzzy c-means algorithm for subspace clustering.

4.3.1 Rough c-means

The rough c-means algorithm [Lingras and West, 2004] extends the concept of c-means by considering each cluster as an interval or rough set $X$, characterised by its lower and upper approximations $\underline{B}X$ and $\overline{B}X$. A rough set has the following properties:

(i) An object $x_j$ can belong to at most one lower approximation.
(ii) If $x_j \in \underline{B}X$ of cluster $X$, then $x_j \in \overline{B}X$ also.
(iii) If $x_j$ does not belong to any lower approximation, then it belongs to two or more upper approximations, i.e. overlap between clusters is possible.
The iterative steps of the rough c-means algorithm are as follows:

Algorithm 2 Rough c-means Algorithm

1. Choose initial means $z_i$, $1 \le i \le k$, for the $k$ clusters.

2. Assign each data point $x_j$, $1 \le j \le n$, to the lower approximation $\underline{B}U_i$ or to the upper approximations $\overline{B}U_i$, $\overline{B}U_{i'}$ of the cluster pair $U_i$, $U_{i'}$ by computing the difference $d_{i'j} - d_{ij}$, where $d_{ij}$ is the distance of the $j$-th data point $x_j$ from the $i$-th centroid $z_i$ of cluster $U_i$.

3. Let $d_{ij}$ be minimum and $d_{i'j}$ be next to minimum. If $d_{i'j} - d_{ij}$ is less than some threshold, then $x_j \in \overline{B}U_i$ and $x_j \in \overline{B}U_{i'}$ and $x_j$ cannot be a member of any lower approximation; else $x_j \in \underline{B}U_i$ such that the distance $d_{ij}$ is minimum over the $k$ clusters.

4. Compute the new mean $z_i$ for each cluster as

$$z_i = \begin{cases} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} x_j}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i = \phi \wedge \overline{B}U_i - \underline{B}U_i \neq \phi, \\[2ex] w_{low} \dfrac{\sum_{x_j \in \underline{B}U_i} x_j}{|\underline{B}U_i|} + w_{up} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} x_j}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i \neq \phi \wedge \overline{B}U_i - \underline{B}U_i \neq \phi, \\[2ex] \dfrac{\sum_{x_j \in \underline{B}U_i} x_j}{|\underline{B}U_i|} & \text{otherwise,} \end{cases}$$

where the parameters $w_{low}$ and $w_{up}$ represent the relative importance of the lower and upper approximations respectively. Thus, RCM generates three types of clusters, with objects (i) in both the lower and upper approximations, (ii) only in the lower approximation, and (iii) only in the upper approximation.

5. Repeat steps 2-4 until convergence, i.e., there are no more new assignments, or the upper limit on the number of iterations is reached.

Note: $w_{up} = 1 - w_{low}$, $0.5 < w_{low} < 1$, and $0 <$ threshold $< \dots$
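One iteration of the rough c-means scheme above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the parameter values and array layout are assumptions of ours, and NumPy is assumed to be available.

```python
import numpy as np

def rough_cmeans_step(X, Z, w_low=0.7, threshold=0.2):
    """One rough c-means iteration (sketch; parameter values illustrative).
    A point whose two nearest centres are closer than `threshold` apart is
    put in both boundary regions; otherwise it goes to one lower
    approximation. Centres are then recomputed as in step 4."""
    k = len(Z)
    lower = [[] for _ in range(k)]     # B-lower approximations
    boundary = [[] for _ in range(k)]  # upper minus lower approximation
    for x in X:
        d = np.linalg.norm(Z - x, axis=1)
        i, i2 = np.argsort(d)[:2]      # nearest and second-nearest centre
        if d[i2] - d[i] < threshold:
            boundary[i].append(x); boundary[i2].append(x)
        else:
            lower[i].append(x)
    w_up = 1.0 - w_low
    Znew = Z.copy()
    for i in range(k):
        L, B = np.array(lower[i]), np.array(boundary[i])
        if len(L) and len(B):          # both regions non-empty: weighted mean
            Znew[i] = w_low * L.mean(axis=0) + w_up * B.mean(axis=0)
        elif len(B):                   # empty lower approximation
            Znew[i] = B.mean(axis=0)
        elif len(L):                   # empty boundary region
            Znew[i] = L.mean(axis=0)
    return Znew, lower, boundary

# Toy data: two well-separated groups, so every point falls in a lower
# approximation and the boundary regions stay empty.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = np.array([[0.0, 0.1], [5.0, 5.1]])
Znew, lower, boundary = rough_cmeans_step(X, Z)
```

On overlapping data, points near the midpoint between two centres would instead populate both boundary lists and be down-weighted by $w_{up}$ in the centre update.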
4.3.2 Rough-Fuzzy c-means

The rough-fuzzy c-means algorithm [Mitra et al., 2006] incorporates a weighted distance in terms of the fuzzy membership value $u_{ij}$ of a data point $x_j$ to a cluster mean $z_i$, instead of the absolute individual distance $d_{ij}$ of the $j$-th data point from the $i$-th cluster center. The iterative steps of the algorithm are as follows:

Algorithm 3 Rough Fuzzy c-means Algorithm

1. Choose initial means $z_i$, $1 \le i \le k$, for the $k$ clusters.

2. Compute $u_{ij}$ by eq. 3.9 for $k$ clusters and $n$ data objects.

3. Assign each data point $x_j$ to the lower approximation $\underline{B}U_i$ or to the upper approximations $\overline{B}U_i$, $\overline{B}U_{i'}$ of the cluster pair $U_i$, $U_{i'}$ by computing the difference in its memberships $u_{ij} - u_{i'j}$.

4. Let $u_{ij}$ be maximum and $u_{i'j}$ be next to maximum. If $u_{ij} - u_{i'j}$ is less than some threshold, then $x_j \in \overline{B}U_i$ and $x_j \in \overline{B}U_{i'}$ and $x_j$ cannot be a member of any lower approximation; else $x_j \in \underline{B}U_i$ such that the membership $u_{ij}$ is maximum over the $k$ clusters.

5. Compute the new mean $z_i$ for each cluster as

$$z_i = \begin{cases} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} x_j}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} & \text{if } \underline{B}U_i = \phi \wedge \overline{B}U_i - \underline{B}U_i \neq \phi, \\[2ex] w_{low} \dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha} x_j}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}} + w_{up} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} x_j}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} & \text{if } \underline{B}U_i \neq \phi \wedge \overline{B}U_i - \underline{B}U_i \neq \phi, \\[2ex] \dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha} x_j}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}} & \text{otherwise.} \end{cases}$$

6. Repeat steps 2-5 until convergence, i.e., there are no more new assignments, or the upper limit on the number of iterations is reached.

Note: $w_{up} = 1 - w_{low}$, $0.5 < w_{low} < 1$, and $0 <$ threshold $< \dots$
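The membership computation of step 2 and the membership-difference assignment of steps 3-4 can be sketched as below. We assume eq. 3.9 is the standard FCM membership formula (the text defines it in an earlier chapter), and the fuzzifier value and threshold here are illustrative.

```python
import numpy as np

def fcm_memberships(X, Z, m=2.0):
    """Standard FCM memberships (our reading of eq. 3.9):
    u_ij = 1 / sum_l (d_ij / d_lj)^(2/(m-1)); on squared distances the
    exponent becomes 1/(m-1). Returns u with u[j, i] for object j, cluster i."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # n x k
    d2 = np.maximum(d2, 1e-12)                               # guard: point on a centre
    ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))
    return 1.0 / ratio.sum(axis=2)

def rough_assign(u, threshold=0.1):
    """Assign each object by the difference of its two largest memberships:
    below `threshold` -> boundary of both clusters, else -> lower of one."""
    lower, boundary = [], []
    for uj in u:
        i, i2 = np.argsort(uj)[::-1][:2]   # largest and second-largest
        if uj[i] - uj[i2] < threshold:
            boundary.append((i, i2)); lower.append(None)
        else:
            lower.append(i); boundary.append(None)
    return lower, boundary

# Toy example: each point sits on a centre, so memberships are near-crisp.
X = np.array([[0.0, 0.0], [4.0, 4.0]])
Z = np.array([[0.0, 0.0], [4.0, 4.0]])
u = fcm_memberships(X, Z)
lower, boundary = rough_assign(u)
```

Each row of `u` sums to 1, so the membership difference used in step 4 plays the same role the distance difference plays in rough c-means.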
4.3.3 Rough Fuzzy c-means Subspace Clustering Algorithm

The proposed algorithm, called Rough Fuzzy c-means Subspace (RFCMS), has been developed by hybridizing the concept of fuzzy membership for objects (in clusters) and for dimensions (where fuzzy membership serves as the weight of a dimension) with rough set based approximations of clusters.

Objective Function

Let $\underline{B}U_i$, $\overline{B}U_i$ and $\overline{B}U_i - \underline{B}U_i$ denote the lower approximation, upper approximation, and boundary region of the $i$-th cluster $U_i$ respectively. In [Lingras and West, 2004] the classical objective function of the fuzzy c-means algorithm has been modified in the rough framework by incorporating the lower and upper approximations of the clusters. We extend the objective function of the rough fuzzy c-means algorithm [Lingras and West, 2004] by incorporating the weights of the dimensions relevant to different clusters. We associate with the $i$-th cluster a weight vector $\omega_i$, which represents the relative relevance of the different attributes for that cluster. Thus, in the matrix $W = [\omega_{ir}]_{k \times d}$, $\omega_{ir}$ denotes the contribution of the $r$-th dimension to the $i$-th cluster. The contributions from all dimensions add to 1 for each cluster:

$$\sum_{r=1}^{d} \omega_{ir} = 1, \quad 1 \le i \le k, \qquad (4.1)$$

$$\omega_{ir} \in [0, 1], \quad 1 \le i \le k, \ 1 \le r \le d. \qquad (4.2)$$

The proposed RFCMS algorithm minimizes the following objective function $J_{RFCMS}$ to partition the data set into $k$ clusters:

$$J_{RFCMS} = \begin{cases} aA + bB & \text{if } \underline{B}U_i \neq \phi \wedge \overline{B}U_i - \underline{B}U_i \neq \phi, \\ A & \text{if } \underline{B}U_i \neq \phi \wedge \overline{B}U_i - \underline{B}U_i = \phi, \\ B & \text{otherwise,} \end{cases}$$

where

$$A = \sum_{i=1}^{k} \sum_{x_j \in \underline{B}U_i} \sum_{r=1}^{d} \mu_{ij}^{\alpha} \omega_{ir}^{\beta} d_{ijr}^2, \qquad B = \sum_{i=1}^{k} \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \sum_{r=1}^{d} \mu_{ij}^{\alpha} \omega_{ir}^{\beta} d_{ijr}^2. \qquad (4.3)$$

In the above formulation, $A$ and $B$ correspond to the lower approximations and the boundary regions, and the parameters $a$ and $b$ control the contribution of the lower approximation and boundary region of a cluster. Here

$$d_{ijr}^2 = (x_{jr} - z_{ir})^2 \qquad (4.4)$$

is the squared distance between the $i$-th cluster center and the $j$-th data object along the $r$-th dimension. The parameters $\alpha \in (1, \infty)$ and $\beta \in (1, \infty)$ are weighting exponents which control the fuzzification of $\mu_{ij}$ and $\omega_{ir}$ respectively. Solving 4.3 w.r.t. $\mu_{ij}$ and $\omega_{ir}$, we get:

$$\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[ \dfrac{\sum_{r=1}^{d} (\omega_{ir})^{\beta} d_{ijr}^2}{\sum_{r=1}^{d} (\omega_{lr})^{\beta} d_{ljr}^2} \right]^{1/(\alpha-1)}} \qquad (4.5)$$

$$\omega_{ir} = \frac{1}{\sum_{l'=1}^{d} \left[ \dfrac{\sum_{j=1}^{n} (\mu_{ij})^{\alpha} d_{ijr}^2}{\sum_{j=1}^{n} (\mu_{ij})^{\alpha} d_{ijl'}^2} \right]^{1/(\beta-1)}} \qquad (4.6)$$

The weights of the dimensions are computed using eq. 4.6, as in [Kumar and Puri, 2009].

Cluster Center

The cluster centers are computed as:

$$z_{ir} = \begin{cases} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} & \text{if } \underline{B}U_i = \phi \wedge \overline{B}U_i - \underline{B}U_i \neq \phi, \\[2ex] a \dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha} x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}} + b \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} & \text{if } \underline{B}U_i \neq \phi \wedge \overline{B}U_i - \underline{B}U_i \neq \phi, \\[2ex] \dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha} x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}} & \text{otherwise.} \end{cases}$$
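The coupled updates of eqs. 4.5 and 4.6 can be evaluated directly with array operations. The following sketch is ours (the intermediate array names `D` and `S` are not from the text, and NumPy is assumed); it checks the two normalisation constraints that the closed forms guarantee.

```python
import numpy as np

def rfcms_updates(X, Z, W, alpha=2.0, beta=2.0):
    """Membership update (eq. 4.5) and dimension-weight update (eq. 4.6),
    sketched with NumPy broadcasting. X: n x d data, Z: k x d centres,
    W: k x d dimension weights."""
    d2 = (X[None, :, :] - Z[:, None, :]) ** 2            # k x n x d: d^2_ijr
    # D[i, j] = sum_r w_ir^beta * d^2_ijr  (weighted distance in eq. 4.5)
    D = np.maximum((W[:, None, :] ** beta * d2).sum(axis=2), 1e-12)
    # eq. 4.5: mu_ij = 1 / sum_l (D_ij / D_lj)^(1/(alpha-1))
    mu = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (1.0 / (alpha - 1))).sum(axis=1)
    # S[i, r] = sum_j mu_ij^alpha * d^2_ijr  (per-dimension scatter in eq. 4.6)
    S = np.maximum((mu[:, :, None] ** alpha * d2).sum(axis=1), 1e-12)
    # eq. 4.6: w_ir = 1 / sum_l (S_ir / S_il)^(1/(beta-1))
    w = 1.0 / ((S[:, :, None] / S[:, None, :]) ** (1.0 / (beta - 1))).sum(axis=2)
    return mu, w

# Toy data: two groups separated along the first dimension only; the second
# dimension has zero scatter, so it should receive nearly all the weight.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
Z = np.array([[0.5, 0.0], [10.5, 0.0]])
W = np.full((2, 2), 0.5)
mu, w = rfcms_updates(X, Z, W)
```

Note that a dimension with small within-cluster scatter attracts a large weight, which is exactly the mechanism by which RFCMS recovers cluster-specific subspaces.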
(4.7)

As the objects lying in the lower approximation definitely belong to the cluster, they are assigned a higher weight than the objects lying in the boundary region. For the case $a = 1$, a cluster center may get stuck in a local optimum, because the cluster cannot see the objects lying in the boundary region and therefore may not be able to move towards the best cluster center. In order to maintain a greater degree of freedom to move, the values of the parameters $a$ and $b$ are set as $0 < b < a < 1$ such that $a + b = 1$ [Maji and Pal, 2007a]. Like FCM [Bezdek et al., 1987] and Yan's fuzzy curve tracing algorithm [Yan, 2004], the proposed RFCMS algorithm converges, at least along a subsequence, to a local optimum solution. The iterative steps of the algorithm are as follows:

Algorithm 4 Rough Fuzzy c-means Subspace Clustering Algorithm

1. Choose initial cluster centers $z_i$, $1 \le i \le k$, for the $k$ clusters.

2. Compute $\mu_{ij}$ by eq. 4.5 for $k$ clusters and $n$ data objects.

3. Let $\mu_{\hat{i}j}$ be maximum and $\mu_{i'j}$ be next to maximum for an object $x_j$. If $\mu_{\hat{i}j} - \mu_{i'j}$ is less than some threshold $\epsilon$, then $x_j \in \overline{B}U_{\hat{i}}$ and $x_j \in \overline{B}U_{i'}$ and $x_j$ cannot be a member of any lower approximation; else $x_j \in \underline{B}U_{\hat{i}}$ such that the membership $\mu_{\hat{i}j}$ is maximum over the $k$ clusters.

4. Compute $\omega_{ir}$ by eq. 4.6 for $k$ clusters and $d$ dimensions.

5. Compute the new cluster centers $z_i$ for each cluster, as in eq. 4.7.

6. Repeat steps 2-5 until convergence, i.e., there are no more new assignments, or the limit on the maximum number of iterations is reached.

Note: $a = 1 - b$, $0.5 < a < 1$, and $0 <$ threshold $< \dots$
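The centre update of eq. 4.7 can be sketched as below. This is our illustration only: the parameter values are placeholders, and we use the fuzzy mean of the lower approximation as written in eq. 4.7 (the convergence proof in section 4.4 also uses a crisp lower-approximation mean in eq. 4.26, so this choice is an assumption).

```python
import numpy as np

def rfcms_centres(X, mu, lower, boundary, a=0.85, b=0.15, alpha=2.0):
    """Cluster-centre update of eq. 4.7 (sketch): a weighted combination of
    the lower approximation and the fuzzy boundary region. `lower` and
    `boundary` hold object indices per cluster; a + b = 1 is assumed."""
    k, d = mu.shape[0], X.shape[1]
    Z = np.zeros((k, d))
    for i in range(k):
        def fuzzy_mean(idx):
            w = mu[i, idx] ** alpha           # mu_ij^alpha weights
            return (w[:, None] * X[idx]).sum(axis=0) / w.sum()
        L, B = lower[i], boundary[i]
        if L and B:                            # both regions non-empty
            Z[i] = a * fuzzy_mean(L) + b * fuzzy_mean(B)
        elif B:                                # empty lower approximation
            Z[i] = fuzzy_mean(B)
        else:                                  # empty boundary region
            Z[i] = fuzzy_mean(L)
    return Z

# Toy example: object 2 is ambiguous and sits in both boundary regions.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
mu = np.array([[0.9, 0.8, 0.5, 0.1],
               [0.1, 0.2, 0.5, 0.9]])
Z = rfcms_centres(X, mu, lower=[[0, 1], [3]], boundary=[[2], [2]])
```

Because $a > b$, the crisp core of each cluster dominates the update while the ambiguous boundary object only nudges the centre.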
4.4 Convergence

In this section, we discuss the convergence criteria of the proposed algorithm along with proofs. Along the same lines as the global convergence property of the FCM algorithm, the global convergence property of RFCMS states that for any data set and initialization parameters, an iteration sequence of the RFCMS algorithm either (i) converges to a local minimum, or (ii) contains a subsequence that converges to a stationary point. Theorems 4.1, 4.2 and 4.3 below show that the necessary and sufficient conditions hold for $U$, $W$, and $Z$ respectively.

Theorem 4.1 Let $\eta : M_f^{kn} \to \mathbb{R}$, $\eta(U) = J_{RFCMS}(U, W, Z)$, where $W \in M_f^{kd}$ and $Z \in \mathbb{R}^{kd}$ are fixed. Then $U \in M_f^{kn}$ is a strict local minimum of $\eta$ if and only if $U$ is calculated by the equation:

$$\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[ \dfrac{\sum_{r=1}^{d} (\omega_{ir})^{\beta} d_{ijr}^2}{\sum_{r=1}^{d} (\omega_{lr})^{\beta} d_{ljr}^2} \right]^{1/(\alpha-1)}}$$

Proof 4.1 We have to minimize $J_{RFCMS}$ with respect to $U$ and $W$, subject to constraints 2.11 and 4.1, where $\alpha \in (1, \infty)$ and $\beta \in (1, \infty)$. In order to ensure the non-negativity of $\mu_{ij}$ and $\omega_{ir}$, we set $\mu_{ij} = S_{ij}^2$ and $\omega_{ir} = P_{ir}^2$. The constraints 2.11 and 4.1 are adjoined to $J_{RFCMS}$ with sets of Lagrange multipliers $\{\lambda_j, 1 \le j \le n\}$ and $\{\phi_i, 1 \le i \le k\}$ to formulate the new objective function:

$$\tilde{J}_{RFCMS} = \sum_{j=1}^{n} \sum_{i=1}^{k} \sum_{r=1}^{d} S_{ij}^{2\alpha} P_{ir}^{2\beta} d_{ijr}^2 + \sum_{j=1}^{n} \lambda_j \left( \sum_{i=1}^{k} S_{ij}^2 - 1 \right) + \sum_{i=1}^{k} \phi_i \left( \sum_{r=1}^{d} P_{ir}^2 - 1 \right)$$

Now we compute the first order derivative of $\tilde{J}_{RFCMS}$ with respect to $S_{ij}$, a necessary condition for optimality:

$$\frac{\partial \tilde{J}_{RFCMS}}{\partial S_{ij}} = 2\alpha S_{ij}^{2\alpha - 1} \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^2 + 2\lambda_j S_{ij} = 0 \qquad (4.8)$$

$$= 2 S_{ij} \left[ \alpha S_{ij}^{2\alpha - 2} \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^2 + \lambda_j \right] = 0 \qquad (4.9)$$

Assuming that $S_{ij} \neq 0$, $1 \le j \le n$, $1 \le i \le k$, we get

$$\alpha S_{ij}^{2\alpha - 2} \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^2 + \lambda_j = 0, \quad \text{i.e.} \quad \lambda_j = -\alpha S_{ij}^{2\alpha - 2} \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^2,$$

so that

$$\mu_{ij} = S_{ij}^2 = \left[ \frac{-\lambda_j}{\alpha \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^2} \right]^{\frac{1}{\alpha - 1}} \qquad (4.10)$$

Using constraint eq. 2.11 in eq. 4.10, we get:

$$\sum_{i=1}^{k} \mu_{ij} = \sum_{i=1}^{k} \left[ \frac{-\lambda_j}{\alpha \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^2} \right]^{\frac{1}{\alpha - 1}} = 1$$

Substituting the value of $\lambda_j$ in eq. 4.10, we obtain:

$$\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[ \dfrac{\sum_{r=1}^{d} \omega_{ir}^{\beta} d_{ijr}^2}{\sum_{r=1}^{d} \omega_{lr}^{\beta} d_{ljr}^2} \right]^{1/(\alpha-1)}} \qquad (4.11)$$

Now, to prove the sufficiency condition, we compute the second order partial derivatives:

$$\frac{\partial^2 \tilde{J}_{RFCMS}}{\partial S_{ij} \partial S_{i'j'}} = \begin{cases} 2\alpha(2\alpha - 1) S_{ij}^{2\alpha - 2} \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^2 + 2\lambda_j & \text{if } i = i',\ j = j', \\ 0 & \text{otherwise} \end{cases} \qquad (4.12)$$

$$= 2\alpha(2\alpha - 1)\, \mu_{ij}^{(\alpha - 1)}\, \tilde{d}_{ij}^2 + 2\lambda_j, \qquad (4.13)$$

where $\tilde{d}_{ij}^2 = \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^2$. Substituting the values of $\mu_{ij}$ and $\lambda_j$ in 4.13 and simplifying, we get

$$\frac{\partial^2 \tilde{J}_{RFCMS}}{\partial S_{ij}^2} = 4\alpha(\alpha - 1) \left[ \sum_{l=1}^{k} (\tilde{d}_{lj}^2)^{1/(1-\alpha)} \right]^{(1-\alpha)} \qquad (4.14)$$

Letting

$$a_j = \left[ \sum_{l=1}^{k} (\tilde{d}_{lj}^2)^{1/(1-\alpha)} \right]^{(1-\alpha)}, \quad 1 \le j \le n, \qquad (4.15)$$

we have

$$\frac{\partial^2 \tilde{J}_{RFCMS}}{\partial S_{ij} \partial S_{i'j'}} = \gamma_j, \quad \text{where } \gamma_j = 4\alpha(\alpha - 1)\, a_j,\ 1 \le j \le n. \qquad (4.16)$$

Hence the Hessian matrix of $U$, which is diagonal, has $n$ distinct eigenvalues, each of multiplicity $k$. Under the assumptions $\alpha > 1$, $\beta > 1$ and $\tilde{d}_{lj}^2 > 0$ for all $l, j$, it follows that $\gamma_j > 0$ for all $j$. Thus the Hessian matrix of $U$ is positive definite, and the sufficiency condition is proved.

Theorem 4.2 Let $\zeta : M_f^{kd} \to \mathbb{R}$, $\zeta(W) = J_{RFCMS}(U, W, Z)$, where $U \in M_f^{kn}$ and $Z \in \mathbb{R}^{kd}$ are fixed. Then $W \in M_f^{kd}$ is a strict local minimum of $\zeta$ if and only if $W$ is calculated by the equation:

$$\omega_{ir} = \frac{1}{\sum_{l'=1}^{d} \left[ \dfrac{\sum_{j=1}^{n} (\mu_{ij})^{\alpha} d_{ijr}^2}{\sum_{j=1}^{n} (\mu_{ij})^{\alpha} d_{ijl'}^2} \right]^{1/(\beta-1)}}$$

Proof 4.2 In order to obtain the first order necessary condition for optimality, we set the gradient of $\tilde{J}_{RFCMS}$ w.r.t. $P_{ir}$ equal to zero:

$$\frac{\partial \tilde{J}_{RFCMS}}{\partial P_{ir}} = 2\beta \sum_{j=1}^{n} S_{ij}^{2\alpha} P_{ir}^{2\beta - 1} d_{ijr}^2 + 2\phi_i P_{ir} = 0 \qquad (4.17)$$

We assume that $P_{ir} \neq 0$, $1 \le i \le k$, $1 \le r \le d$. Computing in the same manner as in Theorem 4.1, we obtain:

$$\omega_{ir} = P_{ir}^2 = \left[ \frac{-\phi_i}{\beta \sum_{j=1}^{n} S_{ij}^{2\alpha} d_{ijr}^2} \right]^{\frac{1}{\beta - 1}}$$

Using constraint eq. 4.1 we get:

$$\sum_{r=1}^{d} \omega_{ir} = \sum_{r=1}^{d} \left[ \frac{-\phi_i}{\beta \sum_{j=1}^{n} S_{ij}^{2\alpha} d_{ijr}^2} \right]^{\frac{1}{\beta - 1}} = 1 \qquad (4.18)$$

Substituting the value of $\phi_i$ in eq. 4.18, we obtain:

$$\omega_{ir} = \frac{1}{\sum_{l'=1}^{d} \left[ \dfrac{\sum_{j=1}^{n} \mu_{ij}^{\alpha} d_{ijr}^2}{\sum_{j=1}^{n} \mu_{ij}^{\alpha} d_{ijl'}^2} \right]^{1/(\beta-1)}} \qquad (4.19)$$

Now, to prove the sufficiency condition, we compute the second order partial derivatives:

$$\frac{\partial^2 \tilde{J}_{RFCMS}}{\partial P_{ir} \partial P_{i'r'}} = \begin{cases} 2\beta(2\beta - 1) P_{ir}^{2\beta - 2} \sum_{j=1}^{n} S_{ij}^{2\alpha} d_{ijr}^2 + 2\phi_i & \text{if } i = i',\ r = r', \\ 0 & \text{otherwise} \end{cases} \qquad (4.20)$$

$$= 2\beta(2\beta - 1)\, \omega_{ir}^{(\beta - 1)}\, \hat{d}_{ir}^2 + 2\phi_i, \qquad (4.22)$$

where

$$\hat{d}_{ir}^2 = \sum_{j=1}^{n} S_{ij}^{2\alpha} d_{ijr}^2. \qquad (4.23)$$

Substituting the values of $\omega_{ir}$ and $\phi_i$ in 4.22 and simplifying as before, we get

$$\frac{\partial^2 \tilde{J}_{RFCMS}}{\partial P_{ir} \partial P_{i'r'}} = \kappa_i, \quad \text{where } \kappa_i = 4\beta(\beta - 1)\, b_i, \quad b_i = \left[ \sum_{l'=1}^{d} (\hat{d}_{il'}^2)^{1/(1-\beta)} \right]^{(1-\beta)},\ 1 \le i \le k. \qquad (4.24)$$

Hence the Hessian matrix of $W$, which is diagonal, has $k$ distinct eigenvalues, each of multiplicity $d$. Under the assumptions $\alpha > 1$, $\beta > 1$ and $\hat{d}_{il'}^2 > 0$ for all $i, l'$, it follows that $\kappa_i > 0$ for all $i$. Thus the Hessian matrix of $W$ is positive definite, and the sufficiency condition is proved.

Theorem 4.3 Let $\xi : \mathbb{R}^{kd} \to \mathbb{R}$, $\xi(Z) = J_{RFCMS}(U, W, Z)$, where $U \in M_f^{kn}$ and $W \in M_f^{kd}$ are fixed. Then $Z$ is a strict local minimum of $\xi$ if and only if $Z$ is calculated using eq. 4.7.

Proof 4.3 We discuss the necessary and sufficient conditions for the cluster centers $Z$ to converge. We compute the first order derivative of $J_{RFCMS}$ with respect to $Z$, which is again a necessary condition for optimality (the common positive factor $2\omega_{ir}^{\beta}$ cancels):

$$\frac{\partial J_{RFCMS}}{\partial z_{ir}} = 0 \ \Rightarrow\ a \sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}(x_{jr} - z_{ir}) + b \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}(x_{jr} - z_{ir}) = 0$$

Thus, the cluster centers are computed as a weighted average of the crisp lower approximation and the fuzzy boundary:

$$z_{ir} = a \frac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha} x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}} + b \frac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} \quad \text{when } \underline{B}U_i \neq \phi \wedge \overline{B}U_i - \underline{B}U_i \neq \phi. \qquad (4.25)$$

Hence eq. 4.25 can be written as:

$$z_{ir} = a\, z_{ir}^{lower} + b\, z_{ir}^{upper}$$

where

$$z_{ir}^{lower} = \frac{\sum_{x_j \in \underline{B}U_i} x_{jr}}{|\underline{B}U_i|} \qquad (4.26)$$

$$z_{ir}^{upper} = \frac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} \qquad (4.27)$$

As an object may not belong to both the lower approximation and the boundary region, the convergence of a cluster center depends on both components. Eqs. 4.26 and 4.27 can be written as:

$$|\underline{B}U_i|\, z_{ir}^{lower} = \sum_{x_j \in \underline{B}U_i} x_{jr} \qquad (4.28)$$

$$\left( \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} \right) z_{ir}^{upper} = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} x_{jr} \qquad (4.29)$$

Eqs. 4.28 and 4.29 represent a linear system of equations. In order to prove convergence, we treat eqs. 4.26 and 4.27 as Gauss-Seidel iterations for solving this system, with the $\mu_{ij}$ considered fixed. A sufficient condition of the Gauss-Seidel algorithm for assuring convergence is that the matrix representing each iteration be diagonally dominant. The matrices corresponding to eqs. 4.26 and 4.27 are:

$$\tilde{A} = \begin{bmatrix} |\underline{B}U_1| & & 0 \\ & \ddots & \\ 0 & & |\underline{B}U_k| \end{bmatrix}, \qquad \tilde{B} = \begin{bmatrix} \eta_1 & & 0 \\ & \ddots & \\ 0 & & \eta_k \end{bmatrix}, \quad \text{where } \eta_i = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}.$$

The sufficient conditions for the matrices $\tilde{A}$ and $\tilde{B}$ to be diagonally dominant are $|\underline{B}U_i| > 0$ and $\eta_i > 0$ respectively.
Also, going by the convergence theorem proposed by [Bezdek et al., 1987] for FCM, by [Maji and Pal, 2007a], and by the convergence analysis of the fuzzy curve tracing algorithm [Yan, 2004], the matrices $\tilde{A}$ and $\tilde{B}$ are the Hessians of $A$ and $B$ w.r.t. $z_{ir}^{lower}$ and $z_{ir}^{upper}$ respectively, with all positive eigenvalues, and hence these matrices are diagonally dominant. Thus, by Theorems 4.1, 4.2 and 4.3, the proposed algorithm RFCMS converges, at least along a subsequence, to a local optimum solution.

4.5 Experiments

In this section, we present the comparative performance of the proposed subspace clustering algorithm RFCMS with FCM, RCM, RFCM, DOC, and PROCLUS, using UCI data sets [uci, ]. While FCM, RCM, and RFCM are full dimensional clustering algorithms, PROCLUS and DOC are subspace clustering algorithms tailored for high-dimensional applications. We used the MATLAB version of FCM and the OpenSubspace Weka [osw, ] implementations of DOC and PROCLUS, and implemented the RCM, RFCM, and RFCMS algorithms in MATLAB. In all the experiments with the FCM, RCM, RFCM and RFCMS algorithms, the stopping criterion parameter $\epsilon$ was set to $10^{-3}$ and the maximum number of iterations was restricted to 100; however, in all the experiments we conducted, the algorithms always converged before the limit on the number of iterations was reached. The norm of the difference between successive iterates of the matrix $Z$ is compared with the threshold parameter $\epsilon$ to define the convergence criterion. Based on experimentation, we set the values of the parameters to $a = 0.85$ and $b = 0.25$ for the RCM, RFCM and RFCMS algorithms. The parameters for the DOC algorithm were used as mentioned in [Procopiuc et al., 2002]. The number of clusters $k$ was set equal to the number of classes given in each data set, as indicated in Table 4.1. We have evaluated the effect of the fuzzification parameters $\alpha$ and $\beta$ of the RFCMS algorithm and of the fuzzification parameter $m$ of the FCM and RFCM algorithms. We evaluated the performance of all the algorithms w.r.t. quality and validity measures. The sets of relevant dimensions computed by each of the subspace clustering algorithms RFCMS, DOC and PROCLUS are shown for all the data sets.

Table 4.1: Data Sets (instances, attributes, and classes for the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets)

Data Sets

We experimented with the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets from the UCI data repository [uci, ]. These data sets are heterogeneous in terms of size, number of clusters, and distribution of classes, and have no missing values. General characteristics of the data sets are summarized in Table 4.1.

Effect of Fuzzification Parameters

For the RFCMS algorithm, the best combination of the fuzzification parameters $\alpha$ and $\beta$ was determined by varying the values of $\alpha$ and $\beta$ in the range 2-10, independently of each other. This was done for each data set. Similarly, the best value of the fuzzification parameter $m$ for the FCM and RFCM algorithms was determined by varying the value of $m$. Table 4.2 shows the complete list of fuzzification parameters we found for the different data sets as a result of fine-tuning.
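The stopping test described above (the norm of the difference between successive centre matrices compared against $\epsilon$) can be sketched as follows; this is our reading of the criterion, not the authors' MATLAB code.

```python
import numpy as np

def converged(Z_new, Z_old, eps=1e-3):
    """Convergence test used in the experiments (as we read it): stop when
    the norm of the change in the centre matrix Z falls below eps."""
    return np.linalg.norm(Z_new - Z_old) < eps

# A tiny demonstration: a change of 1e-4 per entry is below the threshold,
# a change of 1.0 per entry is not.
Z0 = np.zeros((2, 3))
small_step = Z0 + 1e-4
large_step = Z0 + 1.0
```

In a driver loop this test would sit after the membership, weight, and centre updates of steps 2-5, alongside the cap of 100 iterations.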
Table 4.2: Fuzzifier Values: RFCMS ($\alpha$ and $\beta$), FCM ($m$), and RFCM ($m$) for each data set

Table 4.3: Accuracy: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

Cluster Validity

Table 4.3 shows accuracy results for all the algorithms and data sets. The RFCMS algorithm has the highest accuracy for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest accuracy for the Alzheimer data set, the RFCM algorithm achieves the highest accuracy for the Magic data set, and the DOC algorithm achieves the highest accuracy for the Diabetes data set. In Tables 4.4, 4.5, 4.6, and 4.7, we present the results of applying recall, specificity, precision and F1-measure to the outcomes of the clustering schemes produced by the different algorithms.

Table 4.4: Recall: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

Table 4.5: Specificity: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

The RFCMS algorithm achieves the highest recall and specificity for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest recall and specificity for the Alzheimer data set, the RFCM algorithm achieves the highest recall and specificity for the Magic data set, and the DOC algorithm achieves the highest recall and specificity for the Diabetes data set. The RFCMS algorithm has the highest precision for the Breast Cancer, Spambase, Diabetes, Magic and Wine data sets, while the FCM algorithm achieves the highest precision for the Alzheimer data set. The RFCMS algorithm achieves the highest F1-measure for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest F1-measure for the Alzheimer data set, and the RFCM algorithm achieves the highest F1-measure for the Magic data set. The FCM, RCM and RFCM algorithms achieve the highest F1-measure for the Diabetes data set. In summary, it can be seen that no algorithm is a clear winner w.r.t. all measures for all the data sets.
Table 4.6: Precision: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

Table 4.7: F1-measure: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

Subspaces Generated

The proposed algorithm RFCMS is an objective function based subspace clustering algorithm. For such algorithms, the fewer the number of dimensions, the lesser the error or scatter among the objects of a cluster. We have compared the RFCMS, DOC and PROCLUS algorithms in terms of the number of dimensions found. Tables 4.8, 4.9, 4.10, 4.11, 4.12 and 4.13 show the sets of dimensions found for the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets by the RFCMS, PROCLUS and DOC algorithms. For all the data sets mentioned above, the RFCMS algorithm finds subspaces with fewer dimensions.
Cluster No.   RFCMS            PROCLUS   DOC
1             4                4,6,7     1,2,3,4,5,6,7
2             4, 5, 7          4,5,6     1,2,3,4,5,6,7
3             4, 5, 6          4,5,6     1,2,3,4,5,6,7

Table 4.8: Dimensions: RFCMS, PROCLUS and DOC for Alzheimer

Cluster No.   RFCMS            PROCLUS   DOC
1             10, 15, ...      ...       ...
2             ..., 15, 20      1,2       1-3, ...

Table 4.9: Dimensions: RFCMS, PROCLUS and DOC for Breast Cancer

Cluster No.   RFCMS                        PROCLUS          DOC
1             28, 29, 32, 34, 38, 44, ...  46, 47, 51, 52   40, ...

Table 4.10: Dimensions: RFCMS, PROCLUS and DOC for Spambase

Cluster No.   RFCMS              PROCLUS                DOC
1             3, 8, 11           1,2,3,6,7,8,9,11, ...  ...
2             ..., 8, 11         1,3,6,7,8,9,11, ...    ...
3             ..., 7, 8, 9, 11   1, ...                 ...

Table 4.11: Dimensions: RFCMS, PROCLUS and DOC for Wine

Cluster No.   RFCMS    PROCLUS   DOC
1             1,6,7    1,6-8     1, ...
2             ...,7    1,4,5,7   1, 6-8

Table 4.12: Dimensions: RFCMS, PROCLUS and DOC for Diabetes
24 Cluster No. RFCMS PROCLUS DOC 1 4,5 3,4,5,8,9 2-6,8,9 2 4,5 1,2,3,4,5 1-5,8 Table 4.13: Dimensions: RFCMS, PROCLUS and DOC for Magic Experiments on Biological Datasets In this section, we present the comparative performance of proposed projected clustering algorithm RFCMS with EWKM, FWKM and LAC algorithms for biological data sets. RFCMS, EWKM, FWKM and LAC algorithms are subspace clustering algorithms tailored for high-dimensional applications. We used weka implementation for EWKM, FWKM and LAC [Peng and Zhang, 2011]. The parameters for EWKM, FWKM and LAC algorithm were used as mentioned in [Jing et al., 2007], [Jing et al., 2005] and [Domeniconi et al., 2007]. We have evaluated the effect of fuzzification parameters α and β of RFCMS algorithm. We evaluated the performance of all the algorithms w.r.t. validity measures. The set of relevant dimensions computed by each of the subspace clustering algorithms RFCMS have been shown for all the data sets Data Sets We experimented with Colon, Embroynal Tumours, Prostate and Leukemia data sets [bio, ]. These data sets are heterogeneous in terms of size, and have no missing values. We have chosen datasets which are pre-classified as it helps in evaluating the results of applying clustering algorithms. General characteristics of the data sets are summarized in Table Effect of Fuzzification Parameters For the RFCMS algorithm the best combination of fuzzification parameters α and β was determined by varying the values of α and β in the range 2-5 independent of each other. This was done for each data set. Table 4.15 shows the complete list of 64
Data Sets           Instances   Attributes   Classes
Colon Cancer        …           …            …
Embryonal Tumours   …           …            …
Leukemia            …           …            …
Prostate            …           …            …
Table 4.14: Data Sets

fuzzification parameters we found for the different data sets as a result of fine-tuning.

Data Sets           α   β
Colon Cancer        2   4
Embryonal Tumours   3   5
Leukemia            3   4
Prostate            2   2
Table 4.15: Fuzzifier Values

Cluster Validity

Table 4.16 shows the accuracy results for all the algorithms and data sets. The RFCMS algorithm achieves the highest accuracy for the Colon and Leukemia data sets. The FWKM and LAC algorithms achieve the highest accuracy for the Embryonal Tumours data set. The FWKM algorithm achieves the highest accuracy for the Prostate data set. However, the accuracy of

Data Sets           RFCMS   EWKM   FWKM   LAC
Colon Cancer        …       …      …      …
Embryonal Tumours   …       …      …      …
Leukemia            …       …      …      …
Prostate            …       …      …      …
Table 4.16: Accuracy: RFCMS, EWKM, FWKM and LAC
Data Sets           RFCMS   EWKM   FWKM   LAC
Colon Cancer        …       …      …      …
Embryonal Tumours   …       …      …      …
Leukemia            …       …      …      …
Prostate            …       …      …      …
Table 4.17: Specificity: RFCMS, EWKM, FWKM and LAC

RFCMS was comparable with that of the FWKM algorithm for both the Embryonal Tumours and Prostate data sets. In Tables 4.18, 4.17, 4.19 and 4.20, we present the recall, specificity, precision and F1-measure of the clustering schemes produced by the different algorithms. The RFCMS algorithm achieves the highest recall for the Leukemia data set. The EWKM, FWKM and LAC algorithms achieve the highest recall for the Embryonal Tumours data set. The FWKM algorithm achieves the highest recall for the Prostate data set. The RFCMS algorithm achieves the highest specificity for the Colon, Embryonal Tumours, Prostate and Leukemia data sets. The RFCMS algorithm has the highest precision for the Colon and Leukemia data sets. The EWKM, FWKM and LAC algorithms achieve the highest precision for the Embryonal Tumours data set. The EWKM and LAC algorithms achieve the highest precision for the Prostate data set. The RFCMS algorithm achieves the highest F1-measure for the Colon, Embryonal Tumours and Prostate data sets. FWKM achieves the highest F1-measure for the Leukemia data set.

Subspaces Generated

Figures 4.1 to 4.12 show the set of dimensions found for the Colon, Embryonal Tumours, Prostate and Leukemia data sets by the RFCMS, EWKM and LAC algorithms. The RFCMS algorithm finds fewer dimensions than the EWKM and LAC algorithms. For the Embryonal Tumours data set, the EWKM and LAC algorithms fail to
Data Sets           RFCMS   EWKM   FWKM   LAC
Colon Cancer        …       …      …      …
Embryonal Tumours   …       …      …      …
Leukemia            …       …      …      …
Prostate            …       …      …      …
Table 4.18: Recall: RFCMS, EWKM, FWKM and LAC

Data Sets           RFCMS   EWKM   FWKM   LAC
Colon Cancer        …       …      …      …
Embryonal Tumours   …       …      …      …
Leukemia            …       …      …      …
Prostate            …       …      …      …
Table 4.19: Precision: RFCMS, EWKM, FWKM and LAC

distinguish the relevance of dimensions for cluster 2. The RFCMS algorithm, however, separates the relevant from the non-relevant dimensions for cluster 2. For the Prostate data set, the RFCMS algorithm finds fewer dimensions than the EWKM and LAC algorithms. For the Leukemia data set, the results of the RFCMS, EWKM and LAC algorithms are comparable.

Data Sets           RFCMS   EWKM   FWKM   LAC
Colon Cancer        …       …      …      …
Embryonal Tumours   …       …      …      …
Leukemia            …       …      …      …
Prostate            …       …      …      …
Table 4.20: F1-measure: RFCMS, EWKM, FWKM and LAC
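Since the data sets are pre-classified, external validity measures such as those tabulated above can be obtained by mapping each cluster to its majority class and scoring the induced labelling against the ground truth. The following is a minimal sketch of that procedure; the majority-class mapping and the binary positive class (label 1) are our illustrative assumptions, not necessarily the exact evaluation protocol of the thesis.

```python
from collections import Counter

def external_validity(cluster_ids, true_labels):
    """Score a clustering against binary ground-truth labels (positive = 1)
    by relabelling each cluster with its majority class."""
    # Map every cluster id to the most common true label among its members.
    mapping = {}
    for c in set(cluster_ids):
        members = [t for k, t in zip(cluster_ids, true_labels) if k == c]
        mapping[c] = Counter(members).most_common(1)[0][0]
    pred = [mapping[c] for c in cluster_ids]
    # Confusion-matrix counts for the induced labelling.
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, true_labels))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, true_labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, true_labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, true_labels))
    accuracy = (tp + tn) / len(pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, f1=f1)
```

The same routine yields all five measures reported in Tables 4.16-4.20 from a single pass over the cluster assignments.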
Figure 4.1: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset
Figure 4.2: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset
Figure 4.3: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset
Figure 4.4: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset
Figure 4.5: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset
Figure 4.6: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset
Figure 4.7: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset
Figure 4.8: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset
Figure 4.9: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset
Figure 4.10: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset
Figure 4.11: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset
Figure 4.12: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset
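The per-cluster dimension memberships plotted in Figures 4.1 to 4.12 induce a set of relevant dimensions for each cluster once a cutoff is applied to the weights. The sketch below uses a mean-weight cutoff purely as an illustrative rule; the thesis does not prescribe this particular threshold.

```python
def relevant_dimensions(weights, factor=1.0):
    """Given one cluster's dimension-weight vector (as in the figures),
    return the 1-based dimensions whose weight exceeds `factor` times the
    mean weight. The mean-based cutoff is an illustrative choice."""
    cutoff = factor * sum(weights) / len(weights)
    return [d for d, w in enumerate(weights, start=1) if w > cutoff]
```

For a cluster whose weight mass concentrates on a few dimensions, this reproduces the kind of compact subspaces RFCMS reports, while a near-uniform weight vector (as EWKM and LAC produce for cluster 2 of Embryonal Tumours) yields no clear selection.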
4.6 Summary

In this chapter, we have proposed a novel subspace clustering algorithm that combines rough set and fuzzy set theory. The Rough Fuzzy c-means Subspace (RFCMS) algorithm extends the rough fuzzy c-means algorithm by incorporating fuzzy memberships of both data points and dimensions in each cluster. In each iteration, the cluster centers are updated and each data point is assigned to the lower approximation or the upper approximation of a cluster; this process is repeated until the convergence criterion is met. We have also discussed the convergence of the proposed algorithm. The results of applying the proposed approach to UCI data sets show that the proposed algorithm scores over its competitors in terms of several validity measures. The proposed algorithm can be used in conjunction with density-based algorithms to automatically detect the number of clusters.
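For reference, the iterative scheme summarized above, stripped of the per-dimension weights that RFCMS adds, can be sketched along the lines of classical rough fuzzy c-means: fuzzy memberships decide whether a point is committed to a cluster's lower approximation or placed in the boundary (upper approximation) of its two contending clusters, and each centroid blends the two regions. The threshold `delta`, the lower-approximation weight `w_low` and the initialization below are illustrative assumptions, not the exact formulation of the proposed RFCMS.

```python
import numpy as np

def rough_fuzzy_cmeans(X, c, m=2.0, delta=0.1, w_low=0.95, iters=50, init=None):
    """Minimal rough-fuzzy c-means loop: a point whose top two fuzzy
    memberships differ by more than delta goes to the lower approximation
    of its best cluster; otherwise it lies in the boundary of both."""
    X = np.asarray(X, dtype=float)
    V = np.asarray(init, dtype=float) if init is not None else X[:c].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = d ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)            # fuzzy memberships u_ik
        order = np.argsort(U, axis=1)
        best, second = order[:, -1], order[:, -2]
        rows = np.arange(len(X))
        crisp = (U[rows, best] - U[rows, second]) > delta
        for k in range(c):
            low = crisp & (best == k)                      # lower approximation
            bnd = ~crisp & ((best == k) | (second == k))   # boundary region
            if low.any() and bnd.any():
                w = U[bnd, k] ** m
                V[k] = (w_low * X[low].mean(axis=0)
                        + (1.0 - w_low) * (w @ X[bnd]) / w.sum())
            elif low.any():
                V[k] = X[low].mean(axis=0)
            elif bnd.any():
                w = U[bnd, k] ** m
                V[k] = (w @ X[bnd]) / w.sum()
    return V, U
```

RFCMS additionally maintains a fuzzy weight per dimension per cluster inside this loop, which is what restricts each centroid update to its relevant subspace.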