CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES


3.1 INTRODUCTION

In medical science, effective tools are essential for categorizing and systematically analyzing the huge volume of highly diverse medical records stored in heterogeneous databases, and the demand for accessing these data keeps increasing. The volume, complexity and variety of the databases used for data handling cause serious difficulties in deploying distributed information. Clustering algorithms are used to deliver properly structured data from the data warehouse for purposes such as report creation, querying and analysis. The main goal of cluster analysis is to group objects of a similar kind into appropriate categories. Most present-day data mining algorithms assume that the data to be mined have been brought together in a single, centralized data warehouse, and most research communities practice partitional and hierarchical approaches. Partitioning algorithms determine all clusters at once: they divide the data set into a chosen number of clusters, which are then assessed on the basis of a criterion. Hierarchical algorithms discover successive clusters by using previously established clusters; a divisive algorithm begins with the entire set and splits it into successively smaller clusters. The cluster label obtained from this method

does not provide a natural ordering in the way real numbers do. To overcome these issues the K-modes clustering algorithm is introduced; it is simple in nature and does not involve complex steps. The steps involved in this research work are given in Figure 3.1.

Figure 3.1 Data Mining in a Medical Informatics Data Warehouse

In this research work, the K-modes clustering technique is used for grouping similar data in medical databases. K-modes works with the attribute values of highest frequency: the attribute values that occur most often serve as modes. A dissimilarity measure is used to compare each object with the modes, and each object is allocated to the nearest cluster. After every object has been distributed to a cluster, the mode of each cluster is updated. All similar objects are thus placed in one cluster, and classification is then carried out using a fuzzy logic function. By applying these techniques, the relevant medical data are mined from the database, providing the required information. A medical informatics data warehouse is a beneficial technique for supporting medical data analysis.
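The grouping procedure just outlined — compare each object with the cluster modes through a dissimilarity measure and allocate it to the nearest cluster — can be sketched in Python; the record values below are hypothetical:

```python
def matching_dissimilarity(obj, mode):
    """Count the attribute positions where two categorical records differ."""
    return sum(a != b for a, b in zip(obj, mode))

def nearest_cluster(obj, modes):
    """Index of the mode with the smallest dissimilarity to obj."""
    return min(range(len(modes)), key=lambda l: matching_dissimilarity(obj, modes[l]))

# hypothetical medical records: (gender, blood group, residence)
modes = [("male", "A+", "urban"), ("female", "O-", "rural")]
record = ("male", "O-", "urban")
print(nearest_cluster(record, modes))  # → 0 (one mismatch with mode 0, two with mode 1)
```

The same two functions reappear, formalized, in Section 3.2 below.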

3.2 K-MODES CLUSTERING ALGORITHM

The K-modes algorithm is an extension of the familiar k-means algorithm that clusters large data sets by using:

a simple matching dissimilarity measure, called the chi-square distance, for categorical objects
modes instead of means for the clusters
a frequency-based method to update the modes and reduce the cost function of clustering

The simple matching dissimilarity measure is defined as follows. Let $M$ and $N$ be two categorical objects described by $x$ categorical attributes. The dissimilarity between $M$ and $N$ is the total number of mismatches between the corresponding attribute categories of the two objects; the smaller the number of mismatches, the more similar the two objects. Mathematically,

$d(M, N) = \sum_{j=1}^{x} \delta(m_j, n_j)$  (3.1)

where

$\delta(m_j, n_j) = \begin{cases} 0, & m_j = n_j \\ 1, & m_j \neq n_j \end{cases}$  (3.2)

The cost function of clustering is

$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l} \, d(X_i, Q_l)$  (3.3)

where $Q_l$ is the mode of cluster $l$, and

$w_{i,l} \in \{0, 1\}, \quad \sum_{l=1}^{k} w_{i,l} = 1, \quad 1 \leq i \leq n$  (3.4)

The K-modes algorithm reduces the cost function defined in Equation 3.3 and consists of the following steps.

K-modes Algorithm

1. Choose k initial modes, one for each cluster.
2. Allocate each object to the cluster whose mode is nearest to it according to Equation 3.1.
3. After all objects have been assigned to clusters, re-evaluate the dissimilarity of the objects against the current modes. If an object is found whose nearest mode belongs to a cluster other than its current one, reallocate the object to that cluster and update the modes of both clusters.
4. Repeat step 3 until no object has changed clusters after a full-cycle test of the entire data set.

3.2.1 Steps Involved in the Clustering Algorithm

The inputs to the K-modes algorithm are the data set and the number of clusters K. The K initial modes are selected either as K distinct objects or from the most frequently occurring attribute values. Figure 3.2 shows the steps involved in the K-modes clustering algorithm.

Step 1: Select k initial modes, one for each cluster, using the procedure given below.

Figure 3.2 Flow of the K-modes Clustering Algorithm

Initial K-mode selection method

a) For every attribute, calculate the frequencies of all categories and store them in a category array, in descending order of frequency. The category array is shown in Figure 3.3, which displays the category array of a data set with four categorical attributes having 4, 2, 5 and 3 categories respectively. For attribute $j$ the array can be written as

$A_j = \{C_{1,j}, C_{2,j}, \ldots\}, \quad f(C_{1,j}) \geq f(C_{2,j}) \geq \cdots$  (3.5)

Figure 3.3 Initial K-modes Selection Method

Here $C_{i,j}$ specifies category $i$ of attribute $j$, and $f(C_{i,j})$ represents the frequency of category $C_{i,j}$.

b) Allocate the most frequent categories equally among the initial k modes.

c) Start by choosing records that are similar in characteristics. Step c is applied to avoid the occurrence of empty clusters; the purpose of this selection process is to make the initial modes yield a better clustering.

Step 2: Calculate the dissimilarity measure between each categorical object, described by its categorical attributes, and the K modes.

Step 3: According to the dissimilarity measure, allocate each object to the cluster whose mode is nearest to it.

Step 4: After each allocation of an object, update the mode of the cluster.

Step 5: After all objects have been allocated to clusters, retest the dissimilarity of the objects against the current modes. If the nearest mode of an object belongs to another cluster, the object is reallocated to that cluster, and finally the modes of both clusters are updated.

Step 6: Repeat step 5 until no object moves between the clusters after a full-cycle test of the complete data set.

3.2.2 Attributes Involved in K-modes Clustering

Generally, two types of attributes appear in the input data of a clustering algorithm: numerical and categorical. Attributes that have a finite or infinite number of ordered values are called numerical attributes; attributes with finite, unordered values are called categorical attributes. Similarity measurements typically consider only numerical attributes, but medical databases contain both numerical and categorical data.

3.3 VARIANTS OF THE K-MODES ALGORITHM

In cluster analysis there exists a class of algorithms whose members vary greatly in the way the similarity between two data objects is examined. These algorithms include spherical k-means, k-means, K-modes and k-prototypes. Each can be regarded as a descendant of a common k-means-like archetype, in which each algorithm produces a partitioning of a data set given X, k and a specific similarity function. By altering the similarity function of an algorithm to suit a particular data type, any member of this class can be adapted to work on any data type; the modified algorithms are then called variants. For instance, a K-modes variant is an algorithm whose similarity function has been modified to handle categorical data, while a k-means variant is one whose similarity function has been changed to the Euclidean distance. The main reason for defining variants in this way is that different clustering algorithms

have various stopping criteria and handle ties differently, and it is simpler to discuss these various specifications as variants.

3.3.1 Cluster Variant

The cluster variant is developed from Huang's original K-modes algorithm, from which the type-2 tie-breaking policy is borrowed. Huang's algorithm recomputes the mode vectors every time a vector is moved; this variant estimates the mode vectors only once per iteration, and it halts when the clusters no longer change.

Step 1: Start with k initial mode vectors, one for each cluster:

$Q^{(0)} = \{q_1^{(0)}, q_2^{(0)}, \ldots, q_k^{(0)}\}$  (3.6)

Step 2: Assign each data vector in X to the cluster whose mode vector is most similar to it, to obtain the partitioning

$C^{(t)} = \{c_1^{(t)}, c_2^{(t)}, \ldots, c_k^{(t)}\}$  (3.7)

Step 3: Update the mode of each cluster to acquire a new mode vector for each cluster:

$Q^{(t+1)} = \{q_1^{(t+1)}, q_2^{(t+1)}, \ldots, q_k^{(t+1)}\}$  (3.8)

Step 4: Re-examine the similarity of all data vectors with every mode vector. If a vector is found to be nearer to the mode of a cluster other than its current one, reallocate that vector to the closer cluster to obtain

$C^{(t+1)} = \{c_1^{(t+1)}, c_2^{(t+1)}, \ldots, c_k^{(t+1)}\}$  (3.9)

Step 5: Repeat from step 2 until no object has changed clusters after a full cycle through the entire data set, that is, until

$C^{(t+1)} = C^{(t)}$  (3.10)

3.3.2 Center Variant

The second algorithm, the center variant, is analogous to the cluster variant but has a different stopping criterion: it terminates when no center object has changed upon recalculation. The first four steps of the center variant match those of the cluster variant, and both variants break type-2 ties in the same way; the only difference lies in the fifth and final step, which becomes: repeat until no center has changed upon recalculation of all centers, i.e.

$Q^{(t+1)} = Q^{(t)}$  (3.11)

3.3.3 Objective Function Variant

The third variant of K-modes, the objective function variant, is based on Dhillon's spherical k-means algorithm. It changes the typical domain of the data from $\mathbb{R}^m$ to the categorical/qualitative domain, and it replaces the cosine similarity with the matching similarity measure. This variant halts when the total change in the objective function falls below a given threshold. Unlike the cluster and center variants, where each data vector remains in its cluster until a better cluster is established, the objective function variant effectively dissolves the clusters during each iteration and reallocates

each data vector in X, although the number of assessments does not differ among the three variants.

Step 1: Start by specifying initial clusters:

$C^{(0)} = \{c_1^{(0)}, c_2^{(0)}, \ldots, c_k^{(0)}\}$  (3.12)

Step 2: Calculate the mode vector of each cluster to acquire a mode vector for each cluster:

$Q^{(t)} = \{q_1^{(t)}, q_2^{(t)}, \ldots, q_k^{(t)}\}$  (3.13)

Step 3: Assign each data vector from X to the cluster whose mode is most similar to it, to obtain the new clusters:

$C^{(t+1)} = \{c_1^{(t+1)}, c_2^{(t+1)}, \ldots, c_k^{(t+1)}\}$  (3.14)

Step 4: Recompute the mode vector of each cluster to obtain the new mode vectors:

$Q^{(t+1)} = \{q_1^{(t+1)}, q_2^{(t+1)}, \ldots, q_k^{(t+1)}\}$  (3.15)

Step 5: Repeat from step 3 until the change in the objective function is less than a given threshold $\varepsilon$:

$|P^{(t+1)} - P^{(t)}| < \varepsilon$  (3.16)

The significance of selecting these three variants of K-modes is that they have different convergence criteria, handle ties differently during data-vector assignment, and specify starting values differently.
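Taking the objective to be the total within-cluster matching dissimilarity of Equation 3.3 (an assumption; the thesis does not restate the objective here), the dissolve-and-reassign loop of the objective function variant can be sketched as:

```python
from collections import Counter

def matching_dissimilarity(obj, mode):
    # Eq. 3.1: number of mismatched attribute categories
    return sum(a != b for a, b in zip(obj, mode))

def mode_of(cluster):
    # most frequent category in each attribute position
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def objective(clusters, modes):
    # Eq. 3.3: total within-cluster dissimilarity
    return sum(matching_dissimilarity(x, q) for c, q in zip(clusters, modes) for x in c)

def objective_variant(data, clusters, epsilon=0.5, max_iter=100):
    prev = float("inf")
    for _ in range(max_iter):
        modes = [mode_of(c) for c in clusters]   # Steps 2/4: mode of each cluster
        clusters = [[] for _ in modes]           # dissolve the clusters...
        for x in data:                           # Step 3: ...and reassign every vector
            j = min(range(len(modes)), key=lambda l: matching_dissimilarity(x, modes[l]))
            clusters[j].append(x)
        cur = objective(clusters, modes)
        if abs(prev - cur) < epsilon:            # Step 5: threshold test (Eq. 3.16)
            break
        prev = cur
    return clusters, modes

data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
clusters, modes = objective_variant(data, [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]])
print(modes)  # [('a', 'x'), ('b', 'y')]
```

Empty-cluster handling is omitted for brevity; a production version would guard `mode_of` against empty clusters.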

3.4 DISSIMILARITY MEASURE OF THE K-MODES ALGORITHM

The dissimilarity measure of the K-modes algorithm involves two steps, and a new dissimilarity measure between two objects is defined based on rough membership functions.

Rationale 1: Let IS = (U, A, V, f) be a categorical information system and $P \subseteq A$. A binary relation $IND(P)$, known as the indiscernibility relation, is defined as

$IND(P) = \{(x, y) \in U \times U \mid \forall a \in P,\ f(x, a) = f(y, a)\}$  (3.17)

Informally, two objects are indiscernible in the context of a set of attributes if they have the same values for those attributes. $IND(P)$ is an equivalence relation on U, and $IND(P) = \bigcap_{a \in P} IND(\{a\})$. The relation $IND(P)$ induces a partition of U, represented by $U / IND(P) = \{[x]_P \mid x \in U\}$, where $[x]_P$ denotes the equivalence class determined by x with respect to P, i.e. $[x]_P = \{y \in U \mid (x, y) \in IND(P)\}$.

Rationale 2: Let IS = (U, A, V, f) be a categorical information system. For any $x, y \in U$, the similarity measure between x and y with respect to $P \subseteq A$ is defined as

$Sim_P(x, y) = \frac{1}{|P|} \sum_{a \in P} Sim_a(x, y)$  (3.18)

where

$Sim_a(x, y) = \frac{|[x]_a \cap [y]_a|}{|[x]_a|}$  (3.19)
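Rationale 1's indiscernibility relation simply partitions U into groups of objects that agree on every attribute in P. A small sketch, with a hypothetical attribute table, shows the partition $U / IND(P)$ as lists of object indices:

```python
from collections import defaultdict

def ind_partition(U, P):
    """U/IND(P): equivalence classes of objects sharing the same values on P."""
    classes = defaultdict(list)
    for i, x in enumerate(U):
        classes[tuple(x[a] for a in P)].append(i)
    return list(classes.values())

# hypothetical information system: row i holds f(x_i, a) for attributes a = 0, 1, 2
U = [("m", "A+", "urban"), ("m", "O-", "urban"), ("f", "A+", "rural")]
print(ind_partition(U, P=(0, 2)))  # → [[0, 1], [2]]
```

Objects 0 and 1 agree on attributes 0 and 2, so they fall in one equivalence class; refining P to include attribute 1 would split them apart.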

The K-modes algorithm with the new dissimilarity measure

initialize old-modes as an empty array
randomly choose k distinct objects from U and assign them to new-modes
for l = 1 to k
    for each attribute a_j
        calculate the similarity according to Rationale 1
    end
end
while old-modes <> new-modes do
    old-modes = new-modes
    for each object x_i in U
        for l = 1 to k
            calculate the similarity between the i-th object and the l-th mode
                according to Rationale 2, and classify the i-th object into
                the cluster whose mode is closest to it
        end
    end
    for l = 1 to k
        find the mode z_l of each cluster and assign it to new-modes
        for each attribute a_j
            calculate the similarity according to Rationales 1 and 2
        end
    end
    if old-modes == new-modes then break
end
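A runnable sketch of this loop, substituting the simple matching measure of Equation 3.1 for the rough-membership similarity of Rationale 2 (the seed and sample data are illustrative, not from the thesis):

```python
import random
from collections import Counter

def dissimilarity(x, q):
    # simple matching measure (Eq. 3.1), standing in here for the
    # rough-membership-based similarity of Rationale 2
    return sum(a != b for a, b in zip(x, q))

def k_modes(U, k, seed=0):
    rng = random.Random(seed)
    new_modes = [tuple(x) for x in rng.sample(U, k)]  # k randomly chosen objects
    old_modes = None
    while old_modes != new_modes:
        old_modes = new_modes
        clusters = [[] for _ in range(k)]
        for x in U:  # classify each object into the cluster with the closest mode
            l = min(range(k), key=lambda j: dissimilarity(x, old_modes[j]))
            clusters[l].append(x)
        # find the mode of each cluster: most frequent category per attribute;
        # an empty cluster keeps its previous mode
        new_modes = [
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*c)) if c else old_modes[l]
            for l, c in enumerate(clusters)
        ]
    return clusters, new_modes
```

The loop terminates exactly as in the pseudocode: once recomputing the modes leaves them unchanged, old-modes equals new-modes and the while condition fails.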

3.5 CLASSIFICATION OF CLUSTERS USING FUZZY LOGIC

Fuzzy inference is a technique for implementing a mapping from a known input to an output using fuzzy logic. The mapping provides a basis from which results can be produced or patterns discriminated. The fuzzy inference process makes use of membership functions, logical operations and if-then rules. The phases of a fuzzy inference system are:

Fuzzification
Fuzzy rule generation
Defuzzification

The main component of the method is the fuzzy logic reasoning unit, which holds two main kinds of information:

A database defining the number, labels and kinds of membership functions, and the fuzzy sets used as values for each system variable. There are two types of variables, input and output, and the designer has to define the corresponding fuzzy sets for every variable. The proper choice of these labels is one of the most critical steps in the design process, as it strongly affects system performance. The fuzzy sets of each variable make up the universe of discourse of that variable.

A rule base, which maps fuzzy values of the inputs to fuzzy values of the outputs and essentially replicates the decision-making policy. The control strategy is kept in the rule base, which is in fact a

group of fuzzy control rules, and typically involves weighting and combining a number of fuzzy sets resulting from the fuzzy inference process. The computation of this process provides a distinct crisp value for each output. The fuzzy rules are collected in the rule base and express the control relationship, usually in an IF-THEN format. For example, a two-input one-output fuzzy logic controller operates with control rules of the general form

Rule i: IF x is A_i AND y is B_i THEN z is C_i

where x and y are input variables, z is the output variable, and A_i, B_i and C_i are linguistic terms such as negative, positive or zero. The if part of the rule is termed the premise, condition or antecedent, and the then part is known as the consequence or action. The actual values obtained from or sent to the system of concern are usually crisp; therefore, fuzzification and defuzzification operations are required to map them to and from the fuzzy values used internally by the fuzzy inference system. The structure of a fuzzy inference system is illustrated in Figure 3.4.

Figure 3.4 Structure of a Fuzzy Inference System

The fuzzy reasoning unit performs a number of fuzzy logic operations to reach the decision implied by the given fuzzy inputs. During fuzzy inference, the following processes are carried out for each fuzzy rule:

Determination of the degree of match between the fuzzy input data and the predefined fuzzy sets for each system input variable.

Computation of the degree of relevance or applicability of each rule, based on the degree of match and the connectives used with the input variables in the antecedent part of the rule.

Derivation of the control outputs, based on the computed rule strength and the fuzzy sets defined for each output variable in the consequent part of each rule.

Several techniques are used to infer the fuzzy output from the rule base. The most commonly used inference methods are:

The Max-Min fuzzy inference method
The Max-Product fuzzy inference method

Assume that there are two input variables, e (error) and ce (change of error), one output variable, cu (change of output), and two rules:

Rule 1: IF e is A_1 AND ce is B_1 THEN cu is C_1
Rule 2: IF e is A_2 AND ce is B_2 THEN cu is C_2

In the Max-Min inference method, the fuzzy operator AND (intersection) takes the minimum of the antecedent membership values:

$\mu_{C_i} = \min\{\mu_{A_i}, \mu_{B_i}\}$  (3.20)

while in the Max-Product method the product of the antecedents is taken:

$\mu_{C_i} = \mu_{A_i} \cdot \mu_{B_i}$  (3.21)

for any two membership values $\mu_{A_i}$ and $\mu_{B_i}$ of the fuzzy subsets A and B respectively. The contributions of all rules are aggregated using the union operator, generating the output fuzzy space C.

3.5.1 Fuzzification

During the fuzzification process, the cluster quantities are transformed into fuzzy values. The input to this process consists of the clusters C-1, C-2 and C-3. Given this input, the maximum and minimum values of each cluster are calculated from the input features:

$Min_{Limit}(M) = \min(M)$  (3.22)

$Max_{Limit}(M) = \max(M)$  (3.23)

where $Min_{Limit}(M)$ denotes the minimum limit value of feature M and $Max_{Limit}(M)$ the maximum limit value of feature M. The maximum and minimum values of the other clusters, C-2 and C-3, are calculated in the same way. Using these values, three conditions are applied to generate the fuzzy values.
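Equations 3.20 and 3.21 differ only in how the two antecedent memberships are combined; with union-by-max aggregation over the rules, both methods can be sketched as follows (the membership values below are made up for illustration):

```python
def fire(rules, mu_e, mu_ce, method="max-min"):
    """Aggregate rule outputs: AND by min or by product, union of consequents by max."""
    out = {}
    for A, B, C in rules:  # rule: IF e is A AND ce is B THEN cu is C
        a, b = mu_e[A], mu_ce[B]
        w = min(a, b) if method == "max-min" else a * b
        out[C] = max(out.get(C, 0.0), w)  # union operator over rule contributions
    return out

mu_e = {"A1": 0.75, "A2": 0.25}   # degrees of match of the crisp error input
mu_ce = {"B1": 0.5, "B2": 1.0}    # degrees of match of the change-of-error input
rules = [("A1", "B1", "C1"), ("A2", "B2", "C2")]
print(fire(rules, mu_e, mu_ce))                     # {'C1': 0.5, 'C2': 0.25}
print(fire(rules, mu_e, mu_ce, method="product"))   # {'C1': 0.375, 'C2': 0.25}
```

Max-Min clips each consequent at the weakest antecedent, while Max-Product scales it by both, which is why the product method yields a lower strength for C1 here.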

All cluster 1 (C-1) values are compared with the minimum limit value $Min_{Limit}$. If a cluster 1 value is less than $Min_{Limit}$, it is labelled L.

All cluster 1 (C-1) values are compared with the maximum limit value $Max_{Limit}$. If a cluster 1 value is greater than $Max_{Limit}$, it is labelled H.

If a cluster 1 (C-1) value is greater than $Min_{Limit}$ and less than $Max_{Limit}$, it is labelled M.

The same conditions are implemented for the other clusters, C-2 and C-3, to generate their fuzzy rules.

3.5.2 Fuzzy Rule Generation

Fuzzy rules are generated from the fuzzy values produced for each feature during fuzzification. Fuzzy modelling involves initializing and then fine-tuning the fuzzy model. The model identification process consists of three stages: initialization, weight learning and tuning of the membership functions. The last two stages are repeated until the objective function meets the stopping criterion or the number of iterations exceeds a given limit. Rule generation is done in three steps:

Partition of the feature space: the membership functions of the trained FNN divide the feature space into fuzzy regions carrying fuzzy concepts.

Generation of fuzzy rules: fuzzy rules are generated from each pair of data by determining which subspace the data falls

into. The degree of membership of each feature is assessed, and the feature is considered to belong to the fuzzy set in which it has the maximal degree of membership.

Significance measure of the fuzzy rules: the number of fuzzy rules resulting from the above steps equals the number of data pairs, so the rule bank may contain conflicting and redundant rules. To resolve the conflicts and remove the redundancy, the support of a rule is examined by counting the number of data items that give the same rule in each class. The fuzzy rules in the rule bank are then ranked according to their supports.

3.5.3 Defuzzification Unit

Defuzzification typically involves weighting and combining a number of fuzzy sets resulting from the fuzzy inference process in a computation that gives a single crisp value for each output. The input to the defuzzification process is a fuzzy set, and the output obtained is a single number. Although fuzziness supports the rule assessment during the intermediate steps, the final output for every variable is generally a single number. Here the single output value is L, M or H: the output f_1 signifies whether the given input data set lies in the Low, Medium or High range. The FIS is trained using the fuzzy rules, and the testing process is carried out with the help of the data sets.

3.6 SYSTEM REQUIREMENTS

The proposed MPSO-AFKM algorithm is simulated using the MATLAB R2009b simulation tool on hardware with 1 GB DDR RAM and a 250 GB hard disk. The Image Processing Toolbox in MATLAB provides a

comprehensive set of reference-standard algorithms and graphical tools for image processing, analysis, visualization and algorithm development, supporting image enhancement, image deblurring, feature detection, noise reduction, image segmentation, spatial transformations and image registration. Many functions in the toolbox are multithreaded to take advantage of multicore and multiprocessor computers.

3.7 SUMMARY

A K-modes clustering algorithm with a new dissimilarity measure is used to warehouse large heterogeneous databases. Using the dissimilarity measure, each object was compared with the modes and allocated to the nearest cluster. After the distribution of the objects to the clusters, the mode of each cluster was updated, so that all similar objects were placed in one cluster. Classification was then done with the help of fuzzy logic. As a result, the user can readily gather the appropriate medical data and obtain the essential information in a direct, speedy and meaningful way. This approach confirms that a medical informatics data warehouse is a beneficial technique for supporting medical data analysis, and it will be one of the important data sources for medical data mining. The technique increased the speed of query processing and reduced the mining cost.