SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering


(1) G. Loshma, (2) Nagaratna P Hedge
(1) Jawaharlal Nehru Technological University, Hyderabad
(2) Vasavi College of Engineering, Hyderabad

Abstract

Text document clustering has gained substantial research interest owing to the rate of data growth. This paper presents a new text clustering algorithm, the Semantic Analysis based Ant Colony Optimization algorithm (SAACO). The work is decomposed into several phases: document pre-processing, similarity measure computation, semantic analysis, application of the clustering algorithm and cluster labelling. The pre-processing step removes stop words, performs stemming and represents the documents in a suitable format. The similarity computation uses the cosine similarity measure, and the semantic analysis exploits WordNet. This is followed by the application of the SAACO algorithm and, finally, cluster labelling. The experimental results of the proposed algorithm are satisfactory, with the maximum accuracy rate among the compared algorithms.

Keywords: Text document clustering, WordNet, ant colony optimization algorithm.

I. INTRODUCTION

Data plays a vital role in every domain, and it grows hand-in-hand with time. Data management is therefore a complex task, whose main concern is the hassle-free search and retrieval of the required data. At this juncture, the concept of clustering is beneficial. The main objective of a clustering algorithm is to group data that are similar to each other: the degree of similarity between documents within a cluster is higher than the degree of similarity between documents in different clusters. Clustering therefore makes the retrieval and search processes easier. Besides this, all manipulations can be done effectively, as related data are clustered together.

This work clusters text documents by exploiting an external knowledge base, WordNet, and the clustering algorithm employed is the Ant Colony Optimization (ACO) algorithm. The proposed work is divided into four stages. The first stage is responsible for pre-processing, in order to make the documents suitable for further processing. The second stage computes the similarity of the data. The actual clustering operation is performed in the third stage. Finally, the clustered documents are labelled, in order to achieve easier retrievability.

The rest of the paper is organised as follows. Section 2 reviews the related literature on text document clustering. Section 3 presents the proposed clustering algorithm. The proposed algorithm is tested for its effectiveness in Section 4. Finally, the concluding remarks are presented in Section 5.

II. BACKGROUND

This section reviews the foundational concepts behind the proposed text clustering algorithm.

2.1 WordNet

WordNet is one of the largest thesauri of the English language. It connects each term to related terms with respect to meaning, recording synonyms and the relationships between terms. WordNet 2.1 consists of 155,327 words in 117,597 senses. A synset is the technical term of WordNet for the synonym sets into which nouns, verbs, adjectives and adverbs are aggregated. This lexical database is employed in text clustering applications to improve accuracy on the basis of semantics [1], as the short sketch below illustrates.
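The synset membership test used later in Section 3.3 can be sketched as follows with NLTK's WordNet interface (an assumed tool; the paper does not name the library used to access WordNet):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def in_synset(wd1: str, wd2: str) -> int:
    """Return 1 if wd2 appears in some synset of wd1, else 0 (cf. eq. (8))."""
    return int(any(wd2 in s.lemma_names() for s in wn.synsets(wd1)))

print(in_synset("car", "automobile"))  # 1: 'automobile' shares a synset with 'car'
print(in_synset("car", "banana"))      # 0: no shared synset
```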
2.2 Ant Colony Optimization algorithm

The Ant Colony Optimization (ACO) algorithm was introduced by M. Dorigo and his team in the early 1990s [2-4]. ACO is a bio-inspired algorithm that imitates the behaviour of ants. It is based on the observation that ants roam around the area surrounding their nest in order to find the best food source. As soon as a food source is located, the ants check its quality and quantity, and the verified food is then carried back to the nest. During this backward locomotion, the ants deposit a pheromone trail all along the way. The quantity of the pheromone deposit depends on the quality and quantity of the food, so the concentration of pheromone indicates the quality of the food source. This pheromone trail paves the way for the discovery of the shortest path between the nest and the food source. The primary component of the ACO algorithm is the pheromone, whose values are updated over iterations. At every round, the ants construct solutions to the given problem on the basis of the pheromone. A local search procedure is then applied to the constructed solutions, followed by the pheromone update. The proposed text clustering algorithm relies on semantic analysis together with this ACO algorithm.
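To make this construct/evaporate/deposit cycle concrete, here is a minimal self-contained sketch of a generic ACO loop on a toy shortest-tour instance. This is an illustrative example only, not the clustering algorithm of Section 3; all names and parameter values are assumptions:

```python
import math
import random

def aco_shortest_tour(points, n_ants=10, n_iters=50, rho=0.1):
    """Generic ACO cycle: ants build pheromone-biased tours, best tour is reinforced."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    tau = [[1.0] * n for _ in range(n)]                # pheromone matrix
    best, best_len = None, float("inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            tour, unvisited = [0], set(range(1, n))
            while unvisited:                           # choice weighted by pheromone * visibility
                s = tour[-1]
                cand = list(unvisited)
                w = [tau[s][d] * (1.0 / dist[s][d]) for d in cand]
                d = random.choices(cand, weights=w, k=1)[0]
                tour.append(d)
                unvisited.remove(d)
            length = sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))
            if length < best_len:
                best, best_len = tour, length
        for i in range(n):                             # evaporation ...
            for j in range(n):
                tau[i][j] *= (1 - rho)
        for a, b in zip(best, best[1:] + best[:1]):    # ... plus deposit along the best tour
            tau[a][b] += 1.0 / best_len
    return best, best_len

pts = [(0, 0), (1, 0), (1, 1), (0, 1), (2, 2)]
print(aco_shortest_tour(pts))
```

Each iteration mirrors the cycle described above: solutions are constructed probabilistically from the pheromone values, old trails evaporate, and good solutions are reinforced.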

III. PROPOSED ALGORITHM

This section proposes a new text clustering algorithm based on semantic analysis and the ACO algorithm. The algorithm is divided into several phases: pre-processing, text document representation, text document clustering and cluster labelling. The overall flow of the proposed algorithm is presented in Fig 1.

[Fig 1: Overall flow of SAACO]

3.1 Pre-processing

Data pre-processing is the preliminary step that makes the data ready for further processing, and it enhances the execution speed of the clustering algorithm. The major pre-processing tasks are removing stop words and performing the stemming operation. Stop words are words that have no meaning on their own; they are meaningful only when read within a sentence or text. In other words, stop words are included to enrich the grammatical context. Stop words include articles, prepositions, conjunctions, pronouns and so on. Sample stop words are listed in Table 1.

Table 1: Sample stop words

a, an, the, up, of, above, down, below, in, out, across, around, behind, beneath, underneath, into, before, after, put, off, on, beside, aside, because, become, till, with, without, during, include, exclude, neither, nor, towards, over, under, until, to, for, further, you, me, I, myself, either, or, and, not

After eliminating the stop words, the proposed work performs the stemming operation. Stemming can be defined as the clipping of words in order to arrive at the root of each word. This operation saves memory and boosts the speed of the algorithm.

3.1.1 Stemming operation with the Porter-Stemmer algorithm

The Porter-Stemmer algorithm is exploited for performing the stemming operation. Its main rules are the following:

- it eliminates the plural suffix;
- the suffixes -ed and -ing are removed;
- a terminal y is turned into i;
- suffixes such as -full, -ness, -ant, -ence etc. are clipped;
- a terminal e is removed.

Some samples are listed below.

Example 1: Ants → ant; Possesses → possess
Example 2: Presented → present
Example 3: Furry → furri; Really → realli
Example 4: Playful → play; Completeness → complete
Example 5: Precedent → preced
Example 6: Bearable → bearabl

Thus, the pre-processing step removes stop words and performs stemming, which enhances the execution speed of the algorithm and saves memory by discarding unwanted words [5]. A minimal sketch of these two operations follows.
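The sketch below uses NLTK's English stop-word list and PorterStemmer (assumed tooling; the paper does not name an implementation library):

```python
from nltk.corpus import stopwords   # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

def preprocess(text: str) -> list[str]:
    """Lower-case, drop stop words, and stem the remaining tokens."""
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in stop]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The ants presented a playful and bearable completeness"))
# -> ['ant', 'present', 'play', 'bearabl', 'complet']
```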

3.2 Text document representation

The text documents are represented as vectors, and two documents are claimed to be similar if they have a high degree of correlation between them. All the documents are organised as vectors in the vector space, forming a matrix. The term weights of a document are given by

doc_i = (wt_{1,i}, wt_{2,i}, \ldots, wt_{h,i})    (4)

where doc_i is the i-th document, wt_{1,i} is the weight of the first term in the i-th document and wt_{h,i} is the weight of the h-th term in the i-th document. The vector space model of the documents is

Vdoc_i = \{wt_{1,i}, wt_{2,i}, \ldots, wt_{h,i}\}    (5)

The term weights wt_{1,i}, wt_{2,i}, \ldots are computed by

wt_{h,i} = tf_h \cdot IDF    (6)

IDF = \log\left(\frac{|Doc|}{docf_h}\right)    (7)

where tf_h is the occurrence frequency of term h in the i-th document, docf_h is the total count of documents that contain the term h, and |Doc| is the total number of documents in the dataset. The weight is thus fixed on the basis of the importance of the term. However, equations (4) to (7) focus on the occurrence frequency of the terms alone. This work formulates the vector space model by also taking the semantics of the terms into account, as presented below.

3.3 Semantic similarity

The semantic similarity between terms is computed by incorporating WordNet [6-10]. WordNet is a lexical database that accumulates terms into synsets. The semantic relationship between terms is calculated by checking every word for a semantic relationship with every other word in WordNet. Let \alpha_{h1,h2} be the semantic relationship between two terms wd_1 and wd_2. If wd_2 is present in the synset of wd_1, then \alpha_{h1,h2} is set to 1; otherwise it is set to 0:

\alpha_{h1,h2} = \begin{cases} 1 & \text{if } wd_2 \in synset(wd_1) \\ 0 & \text{otherwise} \end{cases}    (8)

The weight wd_{ij1} of term t_{i1} in document doc_x is then updated by

wd_{ij1} = wd_{ij1} + \sum_{\substack{h2=1 \\ h2 \neq h1}}^{i} \alpha_{h1,h2} \, wd_{ij2}    (9)

In this way, the semantic relationship between every pair of terms is computed. This is followed by the computation of the similarity measure. This work exploits the cosine similarity between documents:

Sim(Doc_a, Doc_b) = cosine(Doc_a, Doc_b)    (10)

cosine(Doc_a, Doc_b) = \frac{\sum_{i=1}^{n} wd_{ij1} \, wd_{ij2}}{\sqrt{\sum_{i=1}^{n} wd_{ij1}^2} \, \sqrt{\sum_{i=1}^{n} wd_{ij2}^2}}    (11)

Thus, the semantic similarity between the terms and between the documents is found, as the sketch below illustrates. This step is followed by the process of clustering.
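The weighting and similarity computation of equations (6), (7) and (11) can be sketched as follows (a minimal illustration in plain Python, not the authors' implementation; the toy documents are assumptions):

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Build tf-idf weight vectors per equations (6) and (7)."""
    n_docs = len(docs)
    docf = Counter(term for doc in docs for term in set(doc))  # docf_h
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / docf[t]) for t in tf})
    return vectors

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity per equation (11)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [["ant", "coloni", "cluster"], ["ant", "coloni", "food"], ["text", "cluster"]]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))  # similarity of the first two documents
```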
3.4 Clustering algorithm

This work proposes a bio-inspired text clustering algorithm based on the Ant Colony Optimization algorithm, which mimics the behaviour of real ants. In this work, the clusters are formed by the ACO algorithm, which is efficient in providing results. The algorithm is presented below.

1. Initialize the algorithm parameters
2. Pre-process the documents
3. Assign the population of food sources in a random fashion
4. Calculate the fitness of the population by (12)
5. Do
6.   For each forward ant:
7.     store the address of the food source in memory;
8.     select the next hop by (13);
9.     calculate the fitness of the food source;
10.    update the pheromone;
11.    discard the ant when it reaches the source point;
12.    save the best food source;
13. While (termination condition not met)

Algorithm description

Step 1: The first step is concerned with the initialization of parameters such as the maximum iteration count, the maximum time bound for the algorithm's execution and the centre points of the clusters. The algorithm takes the similarity measure as the fitness function of the ACO algorithm.

Step 2: The document pre-processing explained in the previous section is performed in this step: stop word removal and stemming.

Step 3: The fitness function of the ACO algorithm is computed by the following equation:

f_i = \sum_{cl=1}^{n} \sum_{dc_s \in cm_i} \lVert dc_s - CM_i \rVert^2    (12)

where cl indexes the clusters, CM_i is the cluster midpoint and dc_s is a document. This equation computes the distance between the documents and the cluster midpoint.

Step 4: As soon as the fitness is computed, the ants search for a new food source, striving to provide a new high-quality food source from their neighbouring locations. If the similarity between the new document and the cluster midpoint is greater than in the previous iteration, the new document is loaded into memory.
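A hedged sketch of the fitness computation in (12), assuming documents and cluster midpoints are dense numeric vectors (an assumed representation; names are illustrative):

```python
import numpy as np

def fitness(clusters: list[np.ndarray], midpoints: list[np.ndarray]) -> float:
    """Sum of squared document-to-midpoint distances, per equation (12)."""
    total = 0.0
    for docs, cm in zip(clusters, midpoints):
        total += float(np.sum((docs - cm) ** 2))  # sum of ||dc_s - CM_i||^2
    return total

# toy example: two tight clusters of 2-D "document vectors"
clusters = [np.array([[1.0, 0.0], [0.9, 0.1]]), np.array([[0.0, 1.0], [0.1, 0.9]])]
midpoints = [c.mean(axis=0) for c in clusters]
print(fitness(clusters, midpoints))  # lower values indicate tighter clusters
```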

Step 5: This step concerns the computation of the probability function, provided in (13). The probability rate of an ant z to travel from source s to destination d is

p_z(s, d) = \begin{cases} \dfrac{[T(s,d)]^{0.5} \, [E(d)]^{0.5}}{\sum_{u \in M_z} [T(s,u)]^{0.5} \, [E(u)]^{0.5}} & \text{if } d \in M_z \\ 0 & \text{otherwise} \end{cases}    (13)

where p_z(s, d) is the probability rate of ant z traversing from location s to d, T is the routing table of each node, which stores the concentration of pheromone from s to d, and E is the visibility function, computed by (11).

Step 6: The ant becomes a backward ant, and the pheromone is updated along its path and loaded into memory. The ant selects a document on the basis of the probability function and strives to find a suitable document in its neighbourhood. When a document is detected in a neighbouring location, the similarity between the documents is computed. The best solution is found and stored in memory. This process is repeated until the stopping criterion is met.

Step 7: The amount of pheromone is calculated by

T_z = \frac{1}{N \cdot td_z}

where N is the total number of documents and td_z is the distance travelled by the forward ant z.

Step 8: When the backward ant is back at the source node s from d, the routing table is updated by

p_z(s, d) = (1 - \rho) \, p_z(s, d) + T_z

where \rho is a coefficient and (1 - \rho) represents the evaporation of the trail since the last update of p_z(s, d). When the ant reaches the node at which its journey started, the goal is attained and the ant is eliminated. After several iterations, the node can identify the most similar documents and cluster them together.

3.5 Cluster labelling

Cluster labelling is an important step that makes the entire cluster understandable with a single keyword. A meaningful cluster label is always preferable, since it gives sense to the complete cluster, so the label is chosen as a distinctive keyword. The distinct words from the documents of a cluster are collected, and a self-explanatory score is computed for each of them with the help of WordNet. Finally, the term with the highest self-explanatory score is chosen as the cluster label.

IV. EXPERIMENTAL ANALYSIS

This section evaluates the performance of the proposed algorithm in terms of precision rate, recall rate, F-measure, accuracy and misclassification rate. The proposed work is compared with the outcomes of the k-means, bisecting k-means and UPGMA algorithms; being based on semantic analysis, it produces accurate results. The dataset exploited for the evaluation is Reuters-21578 R8, which has 8 classes [11]. In total, the dataset contains 7674 documents: 5485 training documents and 2189 testing documents. The experimental results are presented graphically in Figures 2 to 6.

Precision rate: For a cluster labelled y, the precision rate is the fraction of documents assigned label y whose actual label is indeed y:

P_{rate} = \frac{doc_{xy}}{doc_y} \times 100    (14)

where doc_{xy} is the number of documents with actual label x that are assigned label y (correct assignments when x = y) and doc_y is the total number of documents assigned label y. A clustering algorithm works well when the precision rate is high.

[Fig 2: Precision rate analysis]

From the experimental results, it is evident that the proposed work achieves the maximum precision rate of 97.2%. This shows that the proposed algorithm clusters the documents more efficiently than the comparative algorithms.
Recall rate: For a class with actual label x, the recall rate is the fraction of documents whose actual label is x that are assigned that label:

R_{rate} = \frac{doc_{xy}}{doc_x} \times 100    (15)

where doc_{xy} is the number of documents with actual label x that are assigned label x, and doc_x is the total number of documents whose actual label is x. A clustering algorithm works well when the recall rate is high, as the sketch below illustrates for both metrics.
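A minimal sketch of these two metrics per equations (14) and (15); the label lists are illustrative assumptions, not the authors' evaluation harness:

```python
def precision_recall(actual: list[str], assigned: list[str], label: str) -> tuple[float, float]:
    """Precision and recall (in %) for one cluster label, per (14) and (15)."""
    correct = sum(1 for a, c in zip(actual, assigned) if a == label and c == label)
    n_assigned = sum(1 for c in assigned if c == label)   # doc_y
    n_actual = sum(1 for a in actual if a == label)       # doc_x
    p = 100.0 * correct / n_assigned if n_assigned else 0.0
    r = 100.0 * correct / n_actual if n_actual else 0.0
    return p, r

actual   = ["earn", "earn", "crude", "earn", "crude"]
assigned = ["earn", "crude", "crude", "earn", "crude"]
print(precision_recall(actual, assigned, "earn"))  # precision 100.0, recall ~66.7
```

The F-measure defined next in (16) is the harmonic mean of these two values.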

[Fig 3: Recall rate analysis]

The recall rate analysis shows that the recall rate of the proposed work is 96.5%, which shows that the misplacement of documents into irrelevant clusters is prevented.

F-measure: The F-measure is computed by taking both the precision rate and the recall rate into account. The F-measure of a class and a cluster is given by

F(cls, cltr) = \frac{2 \, P_{rate} \, R_{rate}}{P_{rate} + R_{rate}}    (16)

[Fig 4: F-measure analysis]

The greater the value of the F-measure, the higher the quality of the cluster. The experimental results show that the proposed work achieves the maximum cluster quality, with an F-measure of 96.4%.

Accuracy rate: The accuracy rate of the algorithm is the ratio of the sum of correctly clustered documents (ccd) and correctly rejected documents (crd, those that are not relevant) to the total number of clustered documents:

acc = \frac{ccd + crd}{\text{total clustered documents}}    (17)

[Fig 5: Accuracy rate analysis]

The accuracy rate of the proposed work is higher than that of the other algorithms, whereby the objective of the work is fulfilled.

Misclassification rate: The misclassification rate is the rate of wrong clustering of documents. It must be relatively low and is calculated by

mis_{rate} = 1 - acc    (18)

[Fig 6: Misclassification rate analysis]

The misclassification rate of the proposed work is the least among all the compared algorithms. Thus, the Semantic Analysis based ACO (SAACO) algorithm shows the maximum accuracy and the least misclassification rate.

V. CONCLUSION

This paper presents a new text document clustering algorithm, SAACO, based on semantic analysis and the Ant Colony Optimization algorithm. Because the algorithm relies on semantic analysis along with the ACO algorithm, it achieves the greatest accuracy rate, and the quality of the clusters is very high. The performance of the algorithm is compared with existing algorithms, and the experimental outcome of the proposed algorithm is satisfactory.

REFERENCES

[1] Y. Liu, P. Scheuermann, X. Li, and X. Zhu, Using WordNet to disambiguate word senses for text classification, in Workshop on Text Data Mining, in conjunction with the 7th International Conference on Computational Science, 2007.
[2] M. Dorigo, Optimization, learning and natural algorithms (in Italian), Ph.D. Thesis, Dipartimento di Elettronica, Politecnico di Milano, Italy, 1992.
[3] M. Dorigo, V. Maniezzo, and A. Colorni, Positive feedback as a search strategy, Tech. Report 91-016, Dipartimento di Elettronica, Politecnico di Milano, Italy, 1991.
[4] M. Dorigo, V. Maniezzo, and A. Colorni, Ant system: optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and Cybernetics - Part B, 26(1) (1996), pp. 29-41.
[5] M. F. Porter, An algorithm for suffix stripping, Program, 14(3) (1980), pp. 130-137.
[6] D. Hindle, Noun classification from predicate-argument structures, Proc. of the Annual Meeting of the Association for Computational Linguistics, pp. 268-275, 1990.
[7] S. Caraballo, Automatic construction of a hypernym-labeled noun hierarchy from text, Proc. of the Annual Meeting of the Association for Computational Linguistics, pp. 120-126, 1999.
[8] P. Velardi, R. Fabriani, and M. Missikoff, Using text processing techniques to automatically enrich a domain ontology, Proc. of the International Conference on Formal Ontology in Information Systems, pp. 270-284, 2001.
[9] P. Cimiano, A. Hotho, and S. Staab, Learning concept hierarchies from text corpora using formal concept analysis, Journal of Artificial Intelligence Research, 24 (2005), pp. 305-339.
[10] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.
[11] http://www.csmining.org/index.php/r52-and-r8-of-reuters-21578.html