CHAPTER 3

ASSOCIATION RULE BASED CLUSTERING

3.1 INTRODUCTION

This chapter describes a clustering process based on association rule mining. As discussed in the introduction, clustering algorithms are efficient when the dimensionality is low, but the number of available documents is very large, and applying clustering directly to such collections degrades the performance of the clustering algorithm. This motivated us to reduce the dimension of the documents before clustering them, so that the performance of the clustering algorithm is improved.

Clustering algorithms are divided into two classes: hierarchical and partitioning. The hierarchical method has the disadvantage that once a clustering step (i.e. a merge or split) is done, it cannot be undone. Therefore, the documents have to be analysed properly before they are clustered. This led us to retrieve only the associated documents from the entire dataset before clustering. To retrieve the related documents, the association rule mining algorithm is used. Association rule mining is generally applied to transactional databases to find associated items; here it has been adapted to text databases to retrieve associated documents. The associated documents are then clustered using the hierarchical algorithm, and the results show improved performance.
Next, for the partitioning algorithms, the disadvantage is the initial selection of the k centroids. In the simple k-means algorithm the centroids are selected randomly, and this random selection affects the performance of the algorithm. The centroids should be placed carefully, because different locations lead to different results. It has been observed that locating the centroids at a larger distance from one another yields better results (Mushfeq-Us-Saleheen Shameem & Raihana Ferdous 2009). To place the centroids at a distance, we propose a method to locate the initial centroids. This method uses dissimilar documents as the initial centroids; these dissimilar documents are identified using the Apriori algorithm. The Apriori algorithm finds the similar documents, and the remaining documents are identified as dissimilar. These documents are selected as the initial centroids, and the K-means algorithm is applied on them. The experimental results show that selecting dissimilar documents as the initial centroids improves the clustering performance.

These two points motivated us to develop this work. The chapter is divided into two parts and is organized as follows. In Section 3.2, the association rule based hierarchical clustering algorithm is described. In Section 3.3, the K-means algorithm integrated with the association rule algorithm is described.

3.2 ASSOCIATION RULE BASED HIERARCHICAL CLUSTERING ALGORITHM

The Apriori algorithm is a popular association rule algorithm that is used in data mining to find frequent items. The same algorithm is used in
this work to find similar documents. The remaining dissimilar documents are removed, which reduces the size of the database. The associated documents alone are then clustered, which improves the efficiency of the clustering algorithm. The algorithm consists of several phases: pre-processing, feature selection and, finally, clustering. The block diagram of the proposed system is depicted in Figure 3.1.

Figure 3.1 Block Diagram of the Proposed Method (Input Text File → Preprocessing → Feature Selection → Association Rule Miner → Hierarchical Clustering → Clustered Documents as Output)

3.2.1 Pre-processing and Feature Selection Method

Pre-processing involves stop word removal and stemming. Stop words are common English words, such as "a" and "the", that carry little meaning in a document. These words are removed using a stop word list. The next step in pre-processing is stemming, which is defined as the process of reducing derived words to their base, stem or root form. The words are stemmed using the Porter stemmer algorithm.

After these pre-processing steps, the important features are selected using the feature selection method proposed by Xiao-Bing Xue & Zhi-Hua Zhou (2007). In this method, the importance of terms is calculated based on
the compactness of the appearances of the word, and the position of the first appearance of the word. The distribution of the word is calculated in two steps. First, the document is divided into several parts; then an array is built, where each entry corresponds to the count of appearances of the word in the corresponding part. In this work, the number of parts corresponds to the number of sentences in the document. Suppose a document d contains n sentences; the distributional array of the word t is represented as $array(t,d) = [c_0, c_1, \ldots, c_{n-1}]$. Compactness measures whether the word is concentrated in a specific part of the document or spread over the whole document. The compactness (ComPact) of the appearances of the word t and the position of the first appearance (FirstApp) of the word t are defined in Equations (3.1) to (3.6) as follows:

$$FirstApp(t,d) = \min_{i \in \{0..n-1\}} (c_i \neq 0 \;?\; i : n) \tag{3.1}$$

$$ComPactPartNum(t,d) = \sum_{i=0}^{n-1} (c_i \neq 0 \;?\; 1 : 0) \tag{3.2}$$

$$LastApp(t,d) = \max_{i \in \{0..n-1\}} (c_i \neq 0 \;?\; i : -1) \tag{3.3}$$

$$ComPactFLDist(t,d) = LastApp(t,d) - FirstApp(t,d) \tag{3.4}$$

$$centroid(t,d) = \frac{\sum_{i=0}^{n-1} c_i \cdot i}{count(t,d)}, \qquad count(t,d) = \sum_{i=0}^{n-1} c_i \tag{3.5}$$

$$ComPactPosVar(t,d) = \frac{\sum_{i=0}^{n-1} c_i \cdot |i - centroid(t,d)|}{count(t,d)} \tag{3.6}$$

The frequency of a word is calculated using the Term Frequency (TF) feature selection method. The compactness of the appearances of a word
is calculated using the ComPactness (CP) method, and the first appearance of a word is found using the FirstAppearance (FA) method. TF, CP and FA are calculated using Equations (3.7) to (3.12) as follows:

$$TF(t,d) = \frac{count(t,d)}{size(d)} \tag{3.7}$$

$$CPPN(t,d) = \frac{ComPactPartNum(t,d)}{len(d)} \tag{3.8}$$

$$CPFLD(t,d) = 1 - \frac{ComPactFLDist(t,d)}{len(d)} \tag{3.9}$$

$$CPPV(t,d) = 1 - \frac{ComPactPosVar(t,d)}{len(d)} \tag{3.10}$$

$$FA(t,d) = f(FirstApp(t,d), len(d)) \tag{3.11}$$

$$f(p, len(d)) = \frac{len(d) - p}{len(d)} \tag{3.12}$$

The importance of each word is calculated as the summation of TF, CP and FA. The words are then ranked by their values, highest first, and the top four words are selected for each text file. All the text files are processed in a similar manner and a final output is obtained.

3.2.2 Association Rule Based Hierarchical Clustering Method

The selected features are passed to the association rule miner, which uses the Apriori algorithm to find the association rules between the text documents. The association rule problem is defined as follows. Let I = {i_1, i_2, ..., i_m} be a set of words. Let D be a set of
documents, where each document T contains a set of words. An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The association rule X → Y holds in the set of documents D with confidence c if c% of the documents in D that contain X also contain Y. The association rule X → Y has support s if s% of the documents in D contain X ∪ Y. Mining association rules means finding all the rules whose support and confidence are greater than or equal to the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively (Agrawal & Srikant 1994). The threshold values for support and confidence lie between 0 and 1. In this work the threshold value is fixed at 0.01 for all the documents, so that the evaluation criteria remain the same throughout the process.

The documents that are associated with one another are obtained from the association rule miner, and the remaining documents are discarded. As a result, the database size is reduced to a great extent. The associated documents are given as input to the hierarchical algorithm. This improves the cluster quality and also reduces the time taken for clustering. The hierarchical algorithm performs agglomerative clustering to produce the final results. The clusters obtained by ARBHC are found to have better quality than the clusters produced using hierarchical clustering alone.

3.2.3 Experimental Results

The experiments were carried out on the Reuters-21578 dataset. Four Reuters files were taken for testing, each containing 180 documents. The topics include commodity, corporate, currency, energy, economic indicator and subject codes. The clusters are evaluated using the
measures F-measure and Entropy. The cluster quality is evaluated using the cophenetic correlation coefficient. The time taken to cluster the documents using both methods is also calculated.

3.2.3.1 Cophenetic Correlation Coefficient

This coefficient is used to evaluate the quality of the clusters. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodelled data points. The clusters are evaluated and the results are given in Table 3.1.

Table 3.1 Cluster Quality

        D0     D1     D2     D3
Hier    0.73  -1.52  -0.52  -1.29
ARBHC   4.13   3.54   0.18   5.17

The tabulated values show that the clusters formed using ARBHC have better quality than the clusters formed using the hierarchical algorithm.

3.2.3.2 F-measure

The F-measure is the weighted harmonic mean of precision and recall. The formula of the F-measure is given in Equation (3.13) as follows:

$$F = \frac{2 \cdot precision \cdot recall}{precision + recall} \tag{3.13}$$

Precision is the fraction of the retrieved documents that are relevant to the user's information need. Recall is the fraction of the documents relevant to the query that are successfully retrieved. The formulas are given in Equation (3.14) as follows:
$$precision(i,k) = \frac{N_{ik}}{N_k}, \qquad recall(i,k) = \frac{N_{ik}}{NC_i} \tag{3.14}$$

where N is the total number of documents, i is the number of classes (predefined), K is the number of clusters in the unsupervised classification, NC_i is the number of documents of class i, N_k is the number of documents of cluster k, and N_ik is the number of documents of class i in cluster k. The values are given in Table 3.2.

Table 3.2 F-Measure Values

        D0    D1    D2     D3
Hier    0.04  0.04  0.045  0.048
ARBHC   0.29  0.22  0.25   0.28

The tabulated values show that the ARBHC method has better F-measure values than those of the hierarchical method.

Figure 3.2 F-Measure Values
Figure 3.2 shows the diagrammatic representation of the tabulated values and confirms that the ARBHC method outperforms the hierarchical method.

3.2.3.3 Entropy

In information theory, entropy is a measure of the uncertainty associated with a random variable. The values are given in Table 3.3.

Table 3.3 Entropy Values

        D0    D1   D2    D3
Hier    5.66  5.8  6.1   5.9
ARBHC   3.4   3.5  3.66  3.57

The tabulated values show that the ARBHC method has lower entropy values than the hierarchical method, which indicates better performance.

Figure 3.3 Entropy Values
Figure 3.3 shows the diagrammatic representation of the tabulated values, which concludes that the ARBHC method outperforms the hierarchical method.

3.2.3.4 Time

The time taken to cluster the documents using the normal hierarchical method and the ARBHC method is calculated, and the results are tabulated in Table 3.4.

Table 3.4 Time Values (Seconds)

        D0   D1   D2   D3
Hier    178  178  178  178
ARBHC   54   46   50   38

Figure 3.4 Time Values

The tabulated values and Figure 3.4 show that the time taken to cluster the documents is less for the ARBHC method than for the hierarchical method.
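To make the filtering step of Section 3.2.2 concrete, the support and confidence computations can be sketched as follows. This is a minimal illustration rather than the full Apriori candidate-generation procedure; the example word sets and documents are hypothetical, while the 0.01 threshold mirrors the value fixed in this work.

```python
def support(itemset, docs):
    # Fraction of documents that contain every word in `itemset`.
    return sum(1 for d in docs if itemset <= d) / len(docs)

def confidence(x, y, docs):
    # conf(X -> Y) = support(X | Y) / support(X).
    return support(x | y, docs) / support(x, docs)

# Hypothetical collection, each document reduced to its set of selected words.
docs = [
    {"oil", "price", "barrel"},
    {"oil", "gas", "price"},
    {"trade", "deficit"},
]

minsup = 0.01  # threshold fixed at 0.01 in this work

# A rule such as {"oil"} -> {"price"} is accepted when its support and
# confidence meet the thresholds; the documents containing the rule's
# words are the "associated" documents kept for hierarchical clustering.
rule_support = support({"oil", "price"}, docs)
rule_conf = confidence({"oil"}, {"price"}, docs)
associated = [d for d in docs if {"oil", "price"} <= d]
```

With these toy documents, the rule {"oil"} → {"price"} has support 2/3 and confidence 1.0, so the first two documents are retained and the third is discarded, shrinking the database before clustering.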
3.3 MODIFIED K-MEANS ALGORITHM

Another fundamental clustering algorithm is the K-means algorithm. K-means, a heuristic algorithm, is a method of cluster analysis which partitions n observations into k clusters such that each observation is assigned to the cluster with the nearest mean. The k-means algorithm is less accurate than the hierarchical algorithm, but provides greater efficiency. In the simple k-means algorithm the centroids are selected randomly, which affects the performance of the algorithm.

3.3.1 Introduction

In this work, a method has been proposed to locate the initial centroids. It has been observed that locating the centroids at a larger distance from one another yields better results (Mushfeq-Us-Saleheen Shameem & Raihana Ferdous 2009). To place the centroids at a distance, we propose to use dissimilar documents as the initial centroids. In the proposed method, the text data are first preprocessed. Then the features are selected using an appropriate method, and the dissimilar documents are identified and selected as the initial centroids. With these initial centroids the k-means algorithm is run and the performance is evaluated.

3.3.2 Preprocessing

The stop words in the input text files are first removed, using a standard stop word list. Then the words are stemmed using the Porter stemming algorithm, which reduces each word to its base form.
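As a rough sketch of these two steps, the following uses a tiny illustrative stop word list and a naive suffix stripper standing in for the full Porter stemmer, which handles many more cases:

```python
# A tiny illustrative stop word list; a real run would use a standard list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}

def light_stem(word):
    # Naive stand-in for the Porter stemmer: strips a few common suffixes.
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, drop stop words, then stem the surviving tokens.
    tokens = [w.lower() for w in text.split()]
    return [light_stem(w) for w in tokens if w not in STOP_WORDS]
```

For example, preprocess("The trading of oil stocks") drops "the" and "of" and reduces the rest to ["trad", "oil", "stock"]; the real Porter stemmer would produce proper stems such as "trade".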
3.3.3 Feature Selection

This step picks out the important words from each file, which will eventually represent the entire document. It is done using the term frequency method and the title match method. Term frequency is a measure of how often a word appears in a document: the number of times the word appears in the document divided by the total number of words. In the title match method, if a word in the file matches a word in the title, it is given a weight value. The term frequency value and the title match value are added to calculate the weight of each word. The words are then ranked according to their values and the top four words are taken for each file. The process is repeated for all the text files.

3.3.4 Finding Dissimilar Documents

The Apriori algorithm is used to find the similar documents, and the remaining documents are considered dissimilar. The number of dissimilar documents thus found is given as the initial K value, and these documents are selected as the initial centroids for clustering. They are then given as input to the modified K-means algorithm.

3.3.5 Modified K-Means Algorithm

The dissimilar documents found using the Apriori algorithm are selected as the initial centroids. This helps the algorithm to separate the documents into disjoint sets.
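The weighting described in Section 3.3.3 can be sketched as follows. The title-match bonus of 0.1 is an assumed value for illustration only; the chapter says a matching word "is given a weight value" without fixing it.

```python
from collections import Counter

def top_terms(title, tokens, top_n=4, title_weight=0.1):
    # `title_weight` is an assumed bonus for words that appear in the title.
    counts = Counter(tokens)
    title_words = set(title.lower().split())
    scores = {}
    for term, c in counts.items():
        tf = c / len(tokens)                        # term frequency
        bonus = title_weight if term in title_words else 0.0
        scores[term] = tf + bonus                   # combined weight
    # Rank by weight, highest first, and keep the top words for this file.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]
```

For instance, with title "oil market" and tokens ["price", "price", "oil", "oil", "gas"], the word "oil" outranks "price" despite the equal frequency, because of the title match.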
The procedure for finding the clusters is defined as follows:

    Select the dissimilar documents as the initial centroids m_1, m_2, ..., m_k
    Until there is no change in the means:
        Use the estimated means to assign the samples to clusters
        For i from 1 to k:
            Replace m_i with the mean of all the samples of cluster i
        end_for
    end_until

The modified K-means algorithm, run by selecting the dissimilar documents as the initial centroids, has shown improved clustering performance. It was also observed that the number of iterations taken to converge is smaller than for the traditional K-means algorithm.

3.3.6 Experimental Results

All the experiments have been carried out on the Reuters-21578 dataset. Reut2-000, Reut2-001, Reut2-002 and Reut2-003 have been taken for the analysis, and are named D0, D1, D2 and D3 respectively. From each, 180 documents have been selected; in total, 720 documents on topics such as commodity, corporate, currency, energy, economic indicator and subject codes have been taken. The evaluation metrics used are the F-measure and Entropy. The number of iterations taken to converge is also evaluated.
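Assuming each document has already been mapped to a numeric feature vector, the loop of Section 3.3.5 can be sketched as follows; the seed points stand in for the dissimilar documents found by the Apriori algorithm:

```python
def modified_kmeans(points, seeds, max_iter=100):
    # `seeds` are the dissimilar documents chosen as the initial centroids.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = list(seeds)
    for _ in range(max_iter):
        # Assign each point to the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            clusters[j].append(p)
        # Replace each centroid with the mean of its cluster.
        new = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new == centroids:  # no change in the means: converged
            break
        centroids = new
    return centroids, clusters
```

Because the seeds are already far apart, the assignment step starts from disjoint regions, which is why fewer iterations are needed than with random initialisation.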
3.3.6.1 F-Measure

The formula of the F-measure is given in Equation (3.15) as follows:

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{3.15}$$

Precision, which is the measure of exactness, is defined as the fraction of the retrieved documents that are relevant to the user's information need. Recall, which is the measure of completeness, is defined as the fraction of the documents relevant to the query that are successfully retrieved.

Table 3.5 F-Measure Values

                   D0    D1     D2    D3
K-means            0.18  0.16   0.18  0.22
Modified K-means   0.17  0.154  0.14  0.2

The values tabulated in Table 3.5 show that the modified K-means algorithm has F-measure values comparable to those of the K-means algorithm.

3.3.6.2 Entropy

Entropy is a measure of the uncertainty associated with a random variable; it refers to the information contained in a message. The general form of entropy is given in Equation (3.16) as follows:

$$E_j = -\sum_i p_{ij} \log(p_{ij}) \tag{3.16}$$

where p_ij is the probability that a document of cluster j belongs to class i.
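Assuming gold-standard class labels are available, Equation (3.16) can be evaluated per cluster and combined across clusters; the weighting by cluster size below is a common convention, not one fixed by the chapter.

```python
import math
from collections import Counter

def total_entropy(classes, clusters):
    # classes[i]: true class of document i; clusters[i]: assigned cluster.
    n = len(classes)
    total = 0.0
    for k in set(clusters):
        members = [c for c, cl in zip(classes, clusters) if cl == k]
        e = 0.0
        for count in Counter(members).values():
            p = count / len(members)   # p_ij of Eq. (3.16)
            e -= p * math.log(p)       # E_j = -sum_i p_ij log(p_ij)
        total += (len(members) / n) * e  # cluster-size weighting (assumed)
    return total
```

A perfect clustering, where each cluster holds a single class, scores 0; mixed clusters score higher, which is why lower entropy indicates better performance.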
Table 3.6 Entropy Values

                   D0   D1     D2   D3
K-means            9.5  10.11  8.8  4.21
Modified K-means   8.3  8.9    8.4  5.7

The values tabulated in Table 3.6 show that the modified K-means algorithm has lower entropy values than the K-means algorithm on three of the four datasets, which indicates better performance; the lower the value, the better the performance.

3.3.6.3 Number of Iterations

The number of iterations taken by the algorithm to converge depends on the initial centroids. Therefore, the number of iterations taken to converge is calculated. It is found that the modified K-means algorithm takes fewer iterations than the simple K-means algorithm. The values are tabulated in Table 3.7.

Table 3.7 Number of Iterations

                   D0  D1  D2  D3
K-means            6   4   6   8
Modified K-means   4   3   6   6

The tabulated values show that the number of iterations taken by the modified K-means algorithm to converge is less than that taken by the K-means algorithm.
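For completeness, the precision, recall and F-measure of Equations (3.13) to (3.15) can be sketched in the same style. The chapter gives only the per-class, per-cluster formulas, so the aggregation below, taking each class's best cluster match weighted by class size, is an assumed but common convention.

```python
from collections import Counter

def f_measure(classes, clusters):
    # N_ik, NC_i and N_k follow Equation (3.14).
    n = len(classes)
    nc = Counter(classes)                  # NC_i: documents per class
    nk = Counter(clusters)                 # N_k:  documents per cluster
    nik = Counter(zip(classes, clusters))  # N_ik: class-i docs in cluster k
    total = 0.0
    for i, ci in nc.items():
        best = 0.0
        for k, ck in nk.items():
            overlap = nik.get((i, k), 0)
            if overlap == 0:
                continue
            precision = overlap / ck
            recall = overlap / ci
            f = 2 * precision * recall / (precision + recall)  # Eq. (3.15)
            best = max(best, f)
        total += (ci / n) * best           # class-size weighting (assumed)
    return total
```

A clustering that exactly recovers the classes scores 1.0, while merging two classes into one cluster lowers the score through reduced precision.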
3.4 SUMMARY

In this work, we have proposed a new hierarchical clustering algorithm based on association rules, evaluated on the Reuters-21578 dataset. The algorithm is motivated by the fact that an increased database size eventually reduces clustering performance. To overcome this, we have integrated the association rule algorithm with the hierarchical algorithm: only the associated documents identified by the association rule algorithm are clustered, which greatly improves the efficiency of the clustered output. The experimental results show that association rule based hierarchical clustering outperforms the traditional hierarchical clustering algorithm. The cluster quality, tested using the cophenetic correlation coefficient, also showed improved performance, and the ARBHC algorithm took less time to complete the clustering process than the hierarchical clustering algorithm.

A solution to improve the performance of the K-means algorithm has also been formulated. The clustering performance of the simple K-means algorithm depends on the initial centroids; since these are selected at random, the performance varies. To overcome this, dissimilar documents are identified using the Apriori algorithm and selected as the initial centroids, which has yielded better results. The number of iterations taken to converge is also calculated, and the modified K-means algorithm uses fewer iterations than the simple K-means algorithm.

The next chapter describes a new method for keyword extraction from text documents.