SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering

1 G. Loshma, 2 Nagaratna P Hedge
1 Jawaharlal Nehru Technological University, Hyderabad
2 Vasavi College of Engineering, Hyderabad

Abstract
Text document clustering has gained substantial research interest, owing to the rate of data growth. This paper presents a new text clustering algorithm, the Semantic Analysis based Ant Colony Optimization algorithm (SAACO). The work is decomposed into several phases: document pre-processing, similarity measure computation, semantic analysis, application of the clustering algorithm and cluster labelling. The pre-processing step removes stop words, performs the stemming operation and represents the documents in a suitable format. Similarity is computed with the cosine similarity measure, and the semantic analysis exploits WordNet. This is followed by the application of the SAACO algorithm and, finally, each cluster is labelled. The experimental results of the proposed algorithm are satisfactory, with a maximum accuracy rate.

Keywords: Text document clustering, WordNet, ant colony optimization algorithm.

I. INTRODUCTION
Data plays a vital role in all domains and grows hand-in-hand with time. Data management is therefore a complex task, whose main concern is hassle-free search and retrieval of the required data. At this juncture, the concept of clustering is beneficial. The main objective of a clustering algorithm is to group data that are similar to each other: documents within a cluster exhibit a higher degree of similarity to one another than to documents in other clusters. This makes the retrieval and search processes easier. Besides this, all manipulations can be performed effectively, as related data are clustered together.
This work aims at clustering text documents by exploiting an external knowledge base, namely WordNet; the clustering algorithm employed is the Ant Colony Optimization (ACO) algorithm. The proposed work is divided into four stages. The first stage is responsible for pre-processing, in order to make the documents appropriate for further processing. The second stage computes the similarity of the data. The actual clustering operation is performed in the third stage. Finally, the clustered documents are labelled, in order to achieve easier retrievability.

The rest of the paper is organised as follows. Section 2 reviews the related literature on text document clustering. Section 3 presents the proposed clustering algorithm. The proposed algorithm is tested for its effectiveness in section 4. Finally, the concluding remarks are presented in section 5.

II. BACKGROUND
This section reviews the foundational concepts behind the proposed text clustering algorithm.

2.1 WordNet
WordNet is one of the largest thesauruses of the English language. It connects every term to related terms with respect to meaning, and records synonyms and the relationships between terms. WordNet 2.1 consists of 155,327 words in 117,597 senses. A synset is the technical term of WordNet for a synonym set, which aggregates nouns, verbs, adjectives and adverbs. This lexical database is employed in text clustering applications to improve accuracy based on semantics [1].

2.2 Ant Colony Optimization algorithm
The Ant Colony Optimization (ACO) algorithm was initially introduced by M. Dorigo and his team in the early 1990s [2-4]. ACO is a bio-inspired algorithm that imitates the behaviour of ants. It is based on the idea that ants roam around the surroundings of their nest in order to find the best food source. As soon as a food source is located, the ants check its quality and quantity.
The verified food source is then brought to the nest. During this backward locomotion, the ants deposit a pheromone trail all along the way. The quantity of pheromone deposited depends on the quality and quantity of the food, so the concentration of pheromone indicates the quality of the food source. This pheromone trail paves the way for the discovery of the shortest path between the nest and the food source. The primary component of the ACO algorithm is the pheromone, whose values are updated over iterations. At every round, the ants construct solutions for the given problem on the basis of the pheromone. A local search procedure is then applied to the established
solutions. This is followed by the process of pheromone update. The proposed text clustering algorithm relies on semantic analysis and the ACO algorithm.

III. PROPOSED ALGORITHM
This section proposes a new text clustering algorithm based on semantic analysis and the ACO algorithm. The algorithm is divided into several phases: pre-processing, text document representation, text document clustering and labelling. The overall flow of the proposed algorithm is presented in Fig 1.

Fig 1: Overall flow of SAACO

3.1 Pre-processing
Data pre-processing is the preliminary step that makes the data ready for further processing; it enhances the execution speed of the clustering algorithm. The major pre-processing tasks are removing stop words and performing the stemming operation. Stop words are words that carry no meaning on their own; they are meaningful only when read within a sentence or text. In other words, stop words are included to enrich the grammatical context. Stop words include articles, prepositions, conjunctions, pronouns and so on. Sample stop words are listed in Table 1.

Table 1: Sample stop words
a, an, the, of, on, in, out, up, down, before, after, during, above, below, across, around, beside, behind, beneath, underneath, because, become, till, with, without, off, include, exclude, put, further, you, me, I, myself, neither, nor, either, or, and, not, towards, aside, over, under, until, into, to, for

After eliminating the stop words, the proposed work performs the stemming operation. Stemming can be defined as the clipping of words in order to arrive at the root of the word. This operation saves memory and boosts the speed of the algorithm.

3.1.1 Stemming operation with the Porter-Stemmer algorithm
The Porter-Stemmer algorithm is exploited for performing the stemming operation. The main advantages of this algorithm are the following.
- It eliminates the plural suffix -s.
- The suffixes -ed and -ing are removed.
- A terminal y is turned into i.
- It clips suffixes such as -full, -ness, -ant, -ence, etc.
- A terminal e is removed.

Some samples are listed below.
Example 1: Ants -> ant; Possesses -> possess
Example 2: Presented -> present
Example 3: Furry -> furri; Really -> realli
Example 4: Playful -> play; Completeness -> complete
Example 5: Precedent -> preced
Example 6: Bearable -> bearabl

Thus, the pre-processing step removes stop words and performs the stemming operation, which enhances the execution speed of the algorithm and saves memory by avoiding unwanted words [5].

3.2 Text document representation
The text documents are represented as vectors. Two
have a high degree of correlation between them. All the documents are organised as vectors in the vector space, forming a matrix. The term weights of a document are given by

doc_i = (wt_{1,i}, wt_{2,i}, ..., wt_{h,i})   (4)

where doc_i is the i-th document, wt_{1,i} is the weight of the first term in the i-th document and wt_{h,i} is the weight of the h-th term in the i-th document.

Vdoc_i = {wt_{1,i}, wt_{2,i}, ..., wt_{h,i}}   (5)

Equation (5) denotes the vector space model of the documents. The term weights are computed by

wt_{h,i} = tf_h × IDF   (6)
IDF = log(|Doc| / docf_h)   (7)

where tf_h is the occurrence frequency of term h in the i-th document, docf_h is the total count of documents that possess the term h, and |Doc| is the total number of documents in the dataset. The weight is fixed on the basis of the importance of the term. However, equations (4) to (7) focus on the occurrence frequency of the terms alone. This work formulates the vector space model by also taking the semantics of the terms into account, as presented below.

3.3 Semantic similarity
The semantic similarity between terms is computed by incorporating WordNet [6-10]. WordNet is a lexical database that accumulates terms into synsets. The semantic relationship between two terms is calculated by taking their semantic correlation: every word is checked for a semantic relationship with every other word in WordNet. Let α_{h1,h2} be the semantic relationship between two terms wd_1 and wd_2. If wd_2 is present in the synset of wd_1, then α_{h1,h2} is set to 1; otherwise α_{h1,h2} is set to 0, as represented in (8).

α_{h1,h2} = 1 if wd_2 ∈ synset(wd_1); α_{h1,h2} = 0 otherwise   (8)

The weight wd_{ij1} of term t_{i1} in document doc_x is then updated by (9):

wd_{ij1} = wd_{ij1} + Σ_{h2=1, h2≠h1}^{i} α_{h1,h2} · wd_{ij2}   (9)

In this way, the semantic relationship between every pair of terms is computed. This is followed by the computation of the similarity measure.
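The weighting and semantic-boosting steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `SYNONYMS` dictionary is an assumed toy stand-in for a WordNet synset lookup, and the corpus is invented for demonstration.

```python
import math
from collections import Counter

# Toy stand-in for WordNet: alpha(t1, t2) = 1 when t2 is in t1's synonym set,
# as in equation (8). A real system would query WordNet synsets instead.
SYNONYMS = {
    "car": {"auto", "automobile"},
    "auto": {"car"},
    "automobile": {"car"},
}

def tfidf_vectors(docs):
    """Compute tf-idf weights per document, as in equations (6)-(7)."""
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    docf = {t: sum(t in d for d in docs) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: tf[t] * math.log(n_docs / docf[t]) for t in vocab})
    return vectors

def add_semantic_weight(vec):
    """Boost each term's weight by the weights of its synonyms, as in (9)."""
    out = dict(vec)
    for t1 in vec:
        for t2 in vec:
            if t1 != t2 and t2 in SYNONYMS.get(t1, ()):
                out[t1] += vec[t2]
    return out

def cosine(v1, v2):
    """Cosine similarity between two weight vectors, as in (10)-(11)."""
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = [["car", "engine"], ["auto", "engine"], ["banana", "fruit"]]
vecs = [add_semantic_weight(v) for v in tfidf_vectors(docs)]
print(round(cosine(vecs[0], vecs[1]), 3))  # prints 1.0
```

Note how the semantic boost makes the first two documents, which share no surface terms besides "engine", come out as highly similar: the synonym relation transfers the weight of "car" onto "auto" and vice versa, which is exactly the effect equation (9) is after.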
This work exploits the cosine similarity between the documents, presented in (10) and (11).

Sim(Doc_a, Doc_b) = cosine(Doc_a, Doc_b)   (10)
cosine(Doc_a, Doc_b) = Σ_{i=1}^{n} wd_{ij1} · wd_{ij2} / ( sqrt(Σ_{i=1}^{n} wd_{ij1}²) · sqrt(Σ_{i=1}^{n} wd_{ij2}²) )   (11)

Thus, the semantic similarity between the terms and between the documents is found. This step is followed by the process of clustering.

3.4 Clustering algorithm
This work proposes a bio-inspired text clustering algorithm based on the Ant Colony Optimization algorithm, which mimics the behaviour of real ants. In this work, the clusters are formed by the ACO algorithm, which is efficient in providing results. The algorithm is presented below.

1. Initialize the algorithm parameters
2. Pre-process the documents
3. Assign the population of food sources in a random fashion
4. Calculate the fitness of the population by (12)
5. Do
6.   For each forward ant
7.     Store the address of the food source in memory;
8.     Select the next hop by (13);
9.     Calculate the fitness of the food source;
10.    Update the pheromone;
11.    Discard the ant when it reaches the source point;
12.    Save the best food source;
13. while (termination condition not met);

Algorithm description
Step 1: The first step initializes parameters such as the maximum iteration count, the maximum time bound for algorithm execution and the centre points of the clusters. The algorithm takes the similarity measure as the fitness function of the ACO algorithm.
Step 2: The document pre-processing explained in the previous section is performed: stop word removal and stemming.
Step 3: The fitness function of the ACO algorithm is computed by the following equation.

f_i = Σ_{cl=1}^{n} Σ_{dc_s ∈ CM_i} || dc_s − CM_i ||²   (12)

where cl is a cluster, CM_i is the cluster midpoint and dc_s is a document. This equation computes the distance between the documents and the cluster midpoints.
Step 4: As soon as the fitness is computed, a new food source is searched for by the ants.
These ants strive to find a new high-quality food source among their neighbourhood locations. If the similarity between the new document and the cluster midpoint is greater than in the previous iteration, the new document is loaded into memory.
Step 5: This step concerns the computation of the probability function, provided in (13). The
International Journal of Recent Advances in Engineering & Technology (IJRAET)

probability of an ant z moving from source s to destination d is given below.

p_z(s, d) = [T(s,d)]^0.5 · [E(d)]^0.5 / Σ_{u ∈ M_z} [T(s,u)]^0.5 · [E(u)]^0.5  if d ∈ M_z; 0 otherwise   (13)

where p_z(s, d) is the probability of ant z traversing from location s to d, T is the routing table of each node, which stores the pheromone concentration from s to d, and E is the visibility function, which can be computed by (11).

Step 6: The ant becomes a backward ant b, and the pheromone is updated along its path and loaded into memory. The ant selects a document on the basis of the probability function and strives to find a suitable document in its neighbourhood. When a document is detected in a neighbouring location, the similarity between the documents is computed. The best solution is found and stored in memory. This process is repeated until the stopping criterion is met.

Step 7: The amount of pheromone is calculated by T_z = 1 / (N · td_z), where N is the total number of documents and td_z is the distance travelled by forward ant z.

Step 8: When the backward ant returns to the source node s from d, the routing table is updated by

p_z(s, d) = (1 − ρ) · p_z(s, d) + T_z   (14)

where ρ is the evaporation coefficient and (1 − ρ) models the evaporation of the trail since the last update of p_z(s, d). When the ant reaches the node at which its journey started, the goal is attained and the ant is eliminated. After several iterations, the node can identify the most similar documents and cluster them together.

3.5 Cluster labelling
Cluster labelling is the most important step, as it makes the entire cluster understandable with a single keyword. A meaningful cluster label is always preferable, since it gives sense to the complete cluster; the label is always chosen as a distinctive keyword. The distinct words from the documents of a cluster are collected.
This is followed by the computation of a self-explanatory score, which can be computed with the help of WordNet. Finally, the term with the highest self-explanatory score is chosen as the cluster label.

4. Experimental Analysis
This section evaluates the performance of the proposed algorithm in terms of precision rate, recall rate, F-measure, accuracy and misclassification rate. The proposed work is compared with the outcomes of the k-means, bisecting k-means and UPGMA algorithms. The dataset exploited for evaluating the performance is Reuters-21578 R8, which has 8 classes [11]. In all, the dataset contains 7674 documents: 5485 training documents and 2189 testing documents. The experimental results are presented in graphical format in Figs 2 to 6.

Precision rate: The precision rate relates the documents misclassified into label y to the documents correctly labelled y:

P_rate = (doc_xy / doc_y) × 100   (14)

where doc_xy is the total number of documents with actual label x but wrongly classified as y, and doc_y is the number of documents correctly labelled as y. A clustering algorithm works well with greater precision rates.

Fig 2: Precision rate analysis

From the experimental results, it is evident that the proposed work shows a maximum precision rate of 97.2%, which indicates that the proposed algorithm clusters the documents more efficiently than the other comparative algorithms.

Recall rate: The recall rate relates the documents misclassified into label y to the documents with actual label x:

R_rate = (doc_xy / doc_x) × 100   (15)

where doc_xy is the total number of documents with actual label x but wrongly classified as y, and doc_x is the number of documents correctly labelled as x. A clustering algorithm works well with greater recall rates.
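The rate formulas above, together with the F-measure used later in this section, can be sketched numerically. The counts below are illustrative assumptions, not values from the paper's experiments.

```python
def precision_rate(doc_xy, doc_y):
    """P_rate = (doc_xy / doc_y) * 100, following the paper's precision formula."""
    return doc_xy / doc_y * 100

def recall_rate(doc_xy, doc_x):
    """R_rate = (doc_xy / doc_x) * 100, following the paper's recall formula."""
    return doc_xy / doc_x * 100

def f_measure(p_rate, r_rate):
    """Harmonic mean of the precision and recall rates (the paper's F-measure)."""
    return 2 * p_rate * r_rate / (p_rate + r_rate)

# Illustrative counts (assumed, not taken from the Reuters-21578 R8 runs).
p = precision_rate(doc_xy=97, doc_y=100)   # 97.0
r = recall_rate(doc_xy=96, doc_x=100)      # 96.0
print(round(f_measure(p, r), 2))           # prints 96.5
```

Because the F-measure is a harmonic mean, it is dragged toward the lower of the two rates, which is why a single high precision value cannot hide a poor recall value.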
Fig 3: Recall rate analysis

The recall rate analysis shows that the recall rate of the proposed work is 96.5%. This shows that the misplacement of documents into irrelevant clusters is prevented.

F-measure: The F-measure is computed by taking the precision and recall rates into account. The F-measure of a cluster and a class is given by

F(cls, cltr) = (2 · P_rate · R_rate) / (P_rate + R_rate)   (16)

Fig 4: F-measure analysis

The greater the value of the F-measure, the higher the quality of the cluster. The experimental results show that the proposed work attains the maximum cluster quality, with an F-measure of 96.4%.

Accuracy rate: The accuracy rate of the algorithm is the ratio of the sum of correctly clustered documents and correctly rejected documents (as they are not relevant) to the total number of clustered documents:

acc = (ccd + crd) / total clustered documents   (17)

Fig 5: Accuracy rate

The accuracy rate of the proposed work is comparatively better than that of the other algorithms, whereby the objective of the work is fulfilled.

Misclassification rate: The misclassification rate is the rate of wrong clustering of documents. It must be relatively low and is calculated by

mis_rate = 1 − acc   (18)

Fig 6: Misclassification rate analysis

Thus, the misclassification rate of the proposed work is the least among all the compared algorithms. The Semantic Analysis based ACO (SAACO) algorithm thus shows the maximum accuracy and the least misclassification rate.

V. CONCLUSION
This paper presents a new text document clustering algorithm, SAACO, which is based on semantic analysis and the Ant Colony Optimization algorithm.
As the algorithm relies on semantic analysis along with the ACO algorithm, it achieves a high accuracy rate and a high quality of clusters. The performance of the algorithm is compared with existing algorithms, and the experimental outcome of the proposed algorithm is satisfactory.

REFERENCES
[1] Liu, Y., Scheuermann, P., Li, X., and Zhu, X. 2007. Using WordNet to Disambiguate Word Senses for Text Classification. In Workshop on Text Data Mining, in conjunction with the 7th
International Conference on Computational Science.
[2] M. Dorigo, Optimization, learning and natural algorithms (in Italian), Ph.D. Thesis, Dipartimento di Elettronica, Politecnico di Milano, Italy, 1992.
[3] M. Dorigo, V. Maniezzo, A. Colorni, Positive feedback as a search strategy, Tech. Report 91-016, Dipartimento di Elettronica, Politecnico di Milano, Italy, 1991.
[4] M. Dorigo, V. Maniezzo, A. Colorni, Ant system: optimization by a colony of cooperating agents, IEEE Trans. Systems, Man, Cybernet.-Part B 26 (1) (1996) 29-41.
[5] Porter, Martin F. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.
[6] D. Hindle, Noun classification from predicate-argument structures, Proc. of the Annual Meeting of the Association for Computational Linguistics, pp. 268-275, 1990.
[7] S. Caraballo, Automatic construction of a hypernym-based noun hierarchy from text, Proc. of the Annual Meeting of the Association for Computational Linguistics, pp. 120-126, 1999.
[8] P. Velardi, R. Fabriani, and M. Missikoff, Using text processing techniques to automatically enrich a domain ontology, Proc. of the International Conference on Formal Ontology in Information Systems, pp. 270-284, 2001.
[9] P. Cimiano, A. Hotho, and S. Staab, Learning concept hierarchies from text corpora using formal concept analysis, Journal of Artificial Intelligence Research, 24 (2005), pp. 305-339.
[10] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.
[11] http://www.csmining.org/index.php/r52-and-r8-of-reuters-21578.html