Document Representation and Clustering with WordNet Based Similarity Rough Set Model


IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No 3, September 2011
ISSN (Online): 1694-0814
www.IJCSI.org

Document Representation and Clustering with WordNet Based Similarity Rough Set Model

Nguyen Chi Thanh and Koichi Yamada

Department of Management and Information System Science, Nagaoka University of Technology, Nagaoka-shi, 940-2188 Japan

Abstract
Most studies on document clustering to date use the Vector Space Model (VSM) to represent documents in the document space, where each document is denoted by a vector in a word vector space. The standard VSM does not take into account the semantic relatedness between terms. Thus, terms with some semantic similarity are dealt with in the same way as terms with no semantic relatedness. Since this disregard of semantics reduces the quality of clustering results, many studies have proposed various approaches to introduce knowledge of semantic relatedness into the VSM. Those approaches give better results than the standard VSM, but they still have their own issues. We propose a new approach that combines two of them, one of which uses Rough Sets theory and co-occurrence of terms, while the other uses WordNet knowledge, to solve these issues. Experiments for its evaluation show the advantage of the proposed approach over the others.

Keywords: document clustering, document representation, rough sets, text mining.

1. Introduction

Document clustering is an important text mining technique to generate useful information from text collections such as news articles, research papers, books, digital libraries, e-mail messages, and web pages. Text-based document clustering attempts to group documents into clusters, where each cluster might represent a topic that is different from the topics of the other clusters. Document clustering algorithms are divided into two categories in general: partitional clustering and hierarchical clustering. Partitional clustering divides a document collection into groups at a single level, while hierarchical clustering creates a tree structure of documents.
Various document clustering methods have been proposed in recent years, including hierarchical clustering algorithms using results from a k-way partitional clustering solution [1], spherical k-means [2], bisecting k-means [3], a method based on frequent word meaning sequences [4], and k-means with Harmony Search Optimization [5]. The vector space model is a popular model for document representation in document clustering, including the above methods. Documents are represented by vectors of weights, where each weight in a vector denotes the importance of a term in the document. In the standard VSM, however, semantic relations between terms are not taken into account: two terms with a close semantic relation and two other terms with no semantic relation are treated in the same way. This disregard of semantics could reduce the quality of the clustering result. Some approaches have been proposed to deal with this problem. The Tolerance Rough Set Model (TRSM) [6] and the Similarity Rough Set Model (SRSM) [7] extend the vector space model using Rough Sets theory and co-occurrence of terms, and both have been successfully applied to document clustering; the results showed that SRSM performs better than TRSM and some other conventional methods [7]. Other approaches employ WordNet based semantic similarity to enhance the performance of document clustering [8, 9]. They modify the VSM by readjusting term weights in the document vectors based on each term's relationships with other terms co-occurring in the document. SRSM and the WordNet based methods give better results than the standard VSM. However, they still have their own issues, as discussed later. We propose a new method that combines their strengths and reduces their weaknesses. The new method uses both Rough Sets theory and WordNet based semantic similarity to define a new representation model of documents. Experimental results show that it gives better clustering results than the other methods discussed in the paper. The paper is organized in six sections.
In Section 2 and Section 3 we discuss SRSM and the WordNet semantic similarity based methods, respectively. Section 4 describes our proposed method. Section 5 presents the results of our experiments on document collections. Finally, Section 6 concludes with a summary and a discussion of future research.

2. Similarity rough set model

The Similarity Rough Set Model is a mathematical model extended from Pawlak's Rough Set model [10], using a similarity relation instead of an equivalence relation [7]. It is also an expansion of the Tolerance Rough Set Model [6], which is built on a tolerance relation. Equivalence, tolerance and similarity relations are binary relations that could be used to represent relations between terms in document clustering. An equivalence relation must satisfy the reflexive, symmetric and transitive properties, while a tolerance relation does not have to satisfy the transitive one. A similarity relation must be reflexive, but is not required to be symmetric or transitive [11, 12]. TRSM, based on a tolerance relation, was successfully applied to information retrieval and document clustering in [6, 13, 14]. Recently, SRSM, based on a similarity relation, was proposed and applied to document clustering by the authors of this paper [7]. The results showed that SRSM produces better results than TRSM both in quality and robustness, where co-occurrence of terms was used to obtain the tolerance and similarity relations, respectively. SRSM is defined as follows. Let the pair apr = (U, R) be an approximation space, where U is the universe and R ⊆ U × U is a similarity relation on U. r(x): U → 2^U is an uncertainty function which corresponds to the similarity relation R, understood as yRx ⇔ y ∈ r(x), which might represent that y is similar to x. r(x) is the similarity class of all objects that are considered to have information similar to x. The function r(x) satisfies the reflexive property, x ∈ r(x), but it is not necessarily symmetric or transitive.
Given an arbitrary set X ⊆ U, X can be characterized by a pair of lower and upper approximations as follows:

apr(X) = {x ∈ U | r(x) ⊆ X},   (1)

apr̄(X) = ∪_{x ∈ X} r⁻¹(x),   (2)

where r⁻¹(x) denotes the inverse of the relation R, i.e., the class of referent objects to which x is similar:

r⁻¹(x) = {y ∈ U | xRy}.   (3)

We proposed a new model of document representation for document clustering using the above generalized rough set theory, the Similarity Rough Set Model [7]. The new model is defined as follows. The universe U of the approximation space (U, R) is the set T of all terms used in the document vectors. The binary relation R is defined by

t_i R t_j  iff  f_D(t_i, t_j) ≥ σ · f_D(t_j),   (4)

where f_D(t_i, t_j) is the number of documents in the document set D in which terms t_i and t_j co-occur, f_D(t_j) is the number of documents in D in which term t_j occurs, and σ is a parameter (0 < σ < 1). The relation R defined above is a similarity relation that is guaranteed to satisfy only reflexivity. An uncertainty function I_σ(t_j) corresponding to the similarity relation is defined as

I_σ(t_j) = {t_i ∈ T | t_i R t_j},   (5)

where I_σ(t_j) is the set of all terms similar to t_j. The lower and upper approximations of any subset X ⊆ T based on this model can be obtained using equations (1) and (2), where U and r are replaced by T and I_σ, respectively. In this case, I_σ⁻¹(t_i) is the set of terms to which t_i is similar, and is defined as

I_σ⁻¹(t_i) = {t_j ∈ T | f_D(t_i, t_j) ≥ σ · f_D(t_j)}.   (6)

In document clustering with SRSM (referred to as SRSM below, while the ordinary approach is referred to as VSM), we applied the spherical k-means algorithm [2] to term vectors that consist of the terms in the upper approximations of the ordinary document vectors (term sets). The usage of the upper approximation could give us better clustering results, because two documents become similar to each other if one contains many terms similar (in the sense of eq. (4)) to terms in the other, even if the two documents do not have many common terms. Since there are many synonyms in natural language and people use different terms to represent a certain thing, the upper approximation would have a positive effect on document clustering.
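Eqs. (4)-(6) can be computed directly from document frequencies. A minimal sketch (the toy corpus and the function names are illustrative, not the paper's implementation):

```python
from itertools import combinations

def similarity_classes(docs, sigma):
    """Build I_sigma(t_j) = {t_i | f_D(t_i, t_j) >= sigma * f_D(t_j)} (eqs. (4)-(5)).

    docs: list of sets of terms (one set per document); 0 < sigma < 1.
    """
    df = {}     # f_D(t): number of documents containing term t
    codf = {}   # f_D(t_i, t_j): number of documents containing both terms
    for d in docs:
        for t in d:
            df[t] = df.get(t, 0) + 1
        for a, b in combinations(sorted(d), 2):
            codf[(a, b)] = codf.get((a, b), 0) + 1

    def co(a, b):
        if a == b:
            return df[a]  # a term always co-occurs with itself (reflexivity)
        return codf.get((a, b) if a < b else (b, a), 0)

    terms = sorted(df)
    # reflexive but not symmetric: t_i in I(t_j) does not imply t_j in I(t_i)
    return {tj: {ti for ti in terms if co(ti, tj) >= sigma * df[tj]}
            for tj in terms}

docs = [{"rough", "set", "model"}, {"rough", "set"}, {"model", "vector"}]
cls = similarity_classes(docs, sigma=0.6)
```

Here "model" falls into the similarity class of the rarer term "vector" while the converse does not hold, illustrating the asymmetry that makes R a similarity rather than a tolerance relation.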
There would be another advantage of using the upper approximation. The number of terms in a document is usually small in comparison with the number of terms in a corpus. Therefore, document vectors are usually high dimensional and sparse. Hence, document similarity measurements often yield zero values, which can lead to poor clustering results. Since the proposed approach puts additional terms into the document vectors without increasing the dimension, this unwelcome tendency might be mitigated to some extent. We use the tf·idf weighting scheme to calculate the weights of terms in the upper approximations of the document vectors. The term weighting method is extended to define weights for terms that are not contained in a document but appear in its upper approximation. It ensures that such terms have a weight smaller than the weight of any other term in the document. The weight a_ij of term t_j in the upper approximation of document d_i is then defined as follows.

a_ij =
  (1 + log f_ij) · log(N / f_D(t_j)),                                    if t_j ∈ d_i,
  min_{t_h ∈ d_i} w_ih · log(N / f_D(t_j)) / (1 + log(N / f_D(t_j))),    if t_j ∈ apr̄(d_i) \ d_i,
  0,                                                                     if t_j ∉ apr̄(d_i),   (7)

where f_ij is the frequency of term t_j in document d_i, N is the number of documents, d_i is the set of terms appearing in document d_i, t_h is the term with the smallest weight in the document, and w_ih is the original weight of term t_h in the document. Normalization is then applied to the upper approximations of the document vectors, and the cosine similarity measure is used to calculate the similarity between two vectors. The algorithm is described as follows [7]:

1. Preprocessing (word stemming, stop-word removal).
2. Create document vectors.
   2.a. Obtain the sets of terms appearing in the documents.
   2.b. Create document vectors using tf·idf.
   2.c. Generate similarity classes of terms based on their co-occurrences.
   2.d. Create vectors of upper approximations of documents using equation (7), and then normalize the vectors.
3. Apply the clustering algorithm.
   3.a. Start with a random partitioning of the vectors of upper approximations of documents, namely C(0) = {C_1(0), C_2(0), ..., C_k(0)}. Let c_1(0), c_2(0), ..., c_k(0) denote the centroids of the given partitioning, with the index of iteration t = 0.
   3.b. For each document vector x_i, 1 ≤ i ≤ N, find the centroid closest in cosine similarity to its upper approximation apr̄(x_i). Then compute the new partitioning C(t+1) based on the old centroids c_1(t), c_2(t), ..., c_k(t): C_j(t+1) is the set of all document vectors whose upper approximations are closest to the centroid c_j(t). If the upper approximation of a document is closest to more than one centroid, it is randomly assigned to one of those clusters.
   3.c. Compute the new centroids c_j(t+1) = s_j / ‖s_j‖, where s_j = Σ_{x_i ∈ C_j(t+1)} apr̄(x_i), 1 ≤ j ≤ k; that is, c_j(t+1) is the normalized mean of the upper approximations of the documents in cluster C_j(t+1).
   3.d. If some stopping criterion is met, then set C* = C(t+1) and c_j* = c_j(t+1) for 1 ≤ j ≤ k, and exit. Otherwise, increment t by 1 and go to step 3.b above.
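Step 3 above can be sketched as follows. This is a toy illustration only: the initial partitioning is deterministic here for reproducibility, whereas the paper starts from a random one, and the input vectors stand in for upper-approximation vectors.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def spherical_kmeans(vectors, k, max_iter=100):
    """Spherical k-means: assign by cosine similarity, renormalize centroids."""
    xs = [normalize(v) for v in vectors]
    n, dim = len(xs), len(xs[0])
    # 3.a: initial partitioning (deterministic here for reproducibility;
    # the paper starts from a random partitioning)
    labels = [i * k // n for i in range(n)]
    for _ in range(max_iter):
        # 3.c: each centroid is the normalized mean of its cluster's vectors
        cents = []
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            mean = ([sum(col) / len(members) for col in zip(*members)]
                    if members else [1.0] + [0.0] * (dim - 1))
            cents.append(normalize(mean))
        # 3.b: reassign each vector to the centroid closest in cosine similarity
        new = [max(range(k), key=lambda j: sum(a * b for a, b in zip(x, cents[j])))
               for x in xs]
        # 3.d: stop when the partitioning matches the previous iteration
        if new == labels:
            break
        labels = new
    return labels

vecs = [[1, 0, 0], [0.9, 0.1, 0], [0, 0.1, 0.9], [0, 0, 1]]
labels = spherical_kmeans(vecs, k=2)
```

Because all vectors are unit length, maximizing the dot product is exactly the cosine-similarity assignment used in step 3.b.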
In our implementation, the iteration stops when the centroids of the generated clusters are identical to those generated in the previous iteration. In SRSM, we used co-occurrence of terms to calculate the semantic relation between terms. The usage of co-occurrence gives us the merit that similarity relations can be defined automatically, without any knowledge base. However, it also has a weakness: in some cases, co-occurrence of terms does not necessarily mean that they have similar meanings. In such cases, terms that neither appear in a document nor are similar to any term in the document may be contained in the upper approximation.

3. WordNet semantic similarity based model

WordNet is an electronic lexical database of English, available to researchers in computational linguistics and natural language processing [15]. WordNet was developed and is being maintained by the Cognitive Science Laboratory of Princeton University. In WordNet, a concept represents a meaning of a term. Terms which have the same concept are grouped in a synset. Each synset has its definition (gloss) and links with other synsets higher or lower in the hierarchy through different types of semantic relations. There are different methods to compute the semantic similarity of terms using WordNet, which can be divided into four categories: path based, information content based, gloss based and vector based methods. Path based methods use the length of the path between concept nodes to calculate the similarity relatedness [16, 17]. Information content based methods [18, 19] measure the relatedness of two concepts using the information content of their most specific shared parent. In gloss based methods [20, 21], the glosses of concepts are used to determine their relatedness. In vector based methods [22, 23], the relatedness between terms is computed using concept vectors derived from glosses. Recently, some studies have used WordNet-based semantic similarity to enhance the performance of document clustering [8, 9]. They modified the VSM by readjusting the weights of terms in the documents.
The basic idea is that a term is considered more important if other terms semantically related to it appear in the same document. They increase the weight values of such terms with the following equation:

w̃_i1 = w_i1 + Σ_{t_2 ∈ d_i, t_2 ≠ t_1} sim(t_1, t_2) · w_i2,   (8)

where w_i1 is the original weight of term t_1 in document d_i, and sim(t_1, t_2) is the semantic similarity between the two terms, calculated using a WordNet based measure. They proposed improved VSM models based on this idea and showed that the clustering performance based on the new models was better than that based on the standard VSM. The advantage of this approach is the high reliability of the similarity given by WordNet, and the basic idea behind eq. (8) also seems adequate. A possible weak point might come from the general-purpose nature of WordNet: since it is a general dictionary, it might not work well for documents in a specific field. Another is that it utilizes the knowledge of similarity only to adjust the importance of terms within a document. It does not let us find the similarity between two documents where one contains many terms similar to those in the other but the two do not have many common terms.

4. WordNet based similarity rough set model for document clustering

In document clustering, the effect of semantic similarity between terms is large, and it must be taken into account to enhance the performance of VSM. In SRSM, the semantic relation between terms is calculated using co-occurrence of terms. However, there seem to be cases where terms have high co-occurrence but low semantic similarity. WordNet-based approaches measure the relatedness of terms using the lexical database: based on the ontology structure of terms or the definitions of terms in WordNet, we can compute scores of semantic relatedness. However, as a general dictionary, WordNet does not cover all terms and term meanings in every specific subject. Moreover, in different fields, the semantic relations of terms may be different. Our idea is to exploit both approaches to get better clustering results. In SRSM, we defined the similarity classes of terms using the relation R given by eq. (4).
Here, we propose a new relation that integrates WordNet knowledge to eliminate terms having no similar meaning but a high frequency of co-occurrence:

t_i R t_j  iff  f_D(t_i, t_j) ≥ σ · f_D(t_j) ∧ ((t_i not in WordNet) ∨ (t_j not in WordNet) ∨ sim(t_i, t_j) > θ),   (9)

where θ is a threshold value. The relation defined by eq. (9) is a similarity relation, because it is reflexive, non-symmetric and non-transitive. The basic idea is that term t_i is similar to t_j when t_i is similar to t_j from the viewpoint of co-occurrence and the two are also similar in the semantics of WordNet. If t_i or t_j is not in WordNet, we use only the co-occurrence similarity. We can then define a new representation model based on this relation, in a way similar to the one in Section 2. Let the pair apr = (U, R) be an approximation space, where U is the set of all index terms T, in the same way as in SRSM, and R ⊆ U × U is a similarity relation on U. r(x): U → 2^U is an uncertainty function which corresponds to the relation R, understood as yRx ⇔ y ∈ r(x), which might represent that y is similar to x. r(x) is the similarity class of all objects that are considered to have information similar to x. The function r(x) satisfies the reflexive property, x ∈ r(x), but it is not necessarily symmetric or transitive. Given an arbitrary set X ⊆ U, X can be characterized by a pair of lower and upper approximations as in equations (1) and (2). The binary relation R corresponds to the uncertainty function defined by eq. (9). That is,

I_θ(t_j) = {t_i ∈ U | t_i R t_j}.   (10)

R is a similarity relation because it satisfies only the property of reflexivity. In SRSM, we assigned weights to terms that do not occur in a document but belong to the similarity classes of terms in the document, and did not change the weight values of the terms in the document. In the new method we improve SRSM by readjusting the weight values of terms based on the idea of the WordNet based methods. The weight a_ij of term t_j in the upper approximation of document d_i is then defined as follows.
a_ij =
  (1 + log f_ij) · log(N / f_D(t_j)) + Σ_{t_k ∈ d_i, k ≠ j} sim(t_j, t_k) · w_ik,   if t_j ∈ d_i,
  min_{t_h ∈ d_i} a_ih · log(N / f_D(t_j)) / (1 + log(N / f_D(t_j))),               if t_j ∈ apr̄(d_i) \ d_i,
  0,                                                                                if t_j ∉ apr̄(d_i),   (11)

where w_ik is the tf·idf weight of term t_k as in eq. (7). The newly proposed approach can be regarded as a combination of SRSM and WSSM (the WordNet Semantic Similarity based Model), which incorporates the advantages of both models. WSSM is completely included in the proposed approach, because the weights of the terms in a document are adjusted using eq. (8). In addition, it is an improved version of SRSM, because it calculates the upper approximation of the term set of a document and uses it as the document vector. The improvements are the similarity relation (eq. (9)) used to calculate the upper approximation, and eq. (11), which readjusts the weights of the terms that are contained in the document.
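The combined relation of eq. (9) is the co-occurrence test of eq. (4) with a WordNet filter that is skipped for uncovered terms. A minimal sketch, with hypothetical counts and similarity scores:

```python
def related(t1, t2, codf, df, in_wordnet, sim, sigma, theta):
    """Eq. (9): t1 R t2 iff the co-occurrence condition of eq. (4) holds and,
    whenever both terms are covered by WordNet, sim(t1, t2) exceeds theta."""
    if codf(t1, t2) < sigma * df(t2):
        return False                 # eq. (4) co-occurrence condition fails
    if not (in_wordnet(t1) and in_wordnet(t2)):
        return True                  # fall back to co-occurrence alone
    return sim(t1, t2) > theta       # WordNet-based semantic filter

# hypothetical co-occurrence counts and similarity scores (illustrative only)
counts = {frozenset(("photon", "algebra")): 3, frozenset(("photon", "light")): 3}
sims = {frozenset(("photon", "algebra")): 0.1, frozenset(("photon", "light")): 0.8}
codf = lambda a, b: counts.get(frozenset((a, b)), 0)
df = lambda t: 4
sim = lambda a, b: sims.get(frozenset((a, b)), 0.0)
in_wn = lambda t: True
```

With σ = 0.5 and θ = 0.3, "light" stays similar to "photon" while "algebra", despite an identical co-occurrence count, is filtered out, mirroring the photon/algebra example reported in the experiments.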

5. Experimental results

In the experiments, we use two test collections to evaluate the proposed approach in comparison with SRSM, WSSM and the methods in the CLUTO toolkit [24]. The algorithms provided in the CLUTO toolkit are based on the partitional, agglomerative, and graph-partitioning paradigms. They are denoted as rb, rbr, direct, agglo, graph, and bagglo. The rb is a repeated bisecting approach. The rbr is the same as the repeated bisecting method, except that at the end the overall solution is globally optimized. The direct is a partitional method which uses an iterative refinement algorithm to optimize a global clustering criterion function. The agglo is an agglomerative clustering algorithm. The graph uses a nearest-neighbor graph to model documents, and then divides the graph into k clusters using a min-cut graph partitioning algorithm. In the bagglo, an agglomeration process is used to cluster documents after the document collection is split into √N clusters using the rb method. The first test collection is a classic data set obtained by combining the CACM, CISI, CRANFIELD, and MEDLINE abstracts, available from [25]. The dataset includes abstracts of papers in different fields: CACM contains 3204 abstracts from Communications of the ACM, CISI contains 1460 abstracts of information science papers, CRANFIELD contains 1400 abstracts of aeronautical papers, and MEDLINE contains 1033 abstracts of medicine papers. The clustering algorithms are supposed to cluster the dataset containing 7097 abstracts into four groups. After preprocessing (stemming, stop-word elimination, and high frequency word pruning), we have 13177 terms in the document collection. With the 13177 terms we created 7097 document vectors using the tf·idf weighting scheme; each document vector has 13177 dimensions. We evaluate the clustering results obtained by each algorithm with three commonly-used measures: entropy, F measure and mutual information [3, 26]. There are different clustering quality measures rendering different results.
However, if a method performs better than the others on many of these measures, then we could say that the method is better than the others. Entropy, F measure and mutual information are external quality measures, which evaluate clustering results by comparing the clusters produced by the algorithm to the known classes of documents. With the entropy measure, the clustering quality is better if the entropy is smaller, while with the F measure and mutual information, the higher the evaluated values are, the better the clustering result is. We ran the experiments with the proposed method, SRSM and WSSM. We also ran the test collection with the CLUTO toolkit. The WordNet-based similarity measure used in the experiment is the Wu and Palmer measure [17], a path-based method. It computes the relatedness of two concepts using their lowest common subsumer lcs(c_1, c_2), which is the first shared concept on the paths from the concepts to the root concept of the ontology hierarchy:

sim(c_1, c_2) = 2 · depth(lcs) / (l(c_1, lcs) + l(c_2, lcs) + 2 · depth(lcs)),   (12)

where l(c_i, lcs) is the length of the path between the two nodes and depth(lcs) is the number of nodes on the path from the lcs to the root. We ran the experiment using the proposed method with the threshold value θ = 0.3. Table 1 shows the evaluation of the clustering results from the CLUTO toolkit's algorithms, and Table 2 shows the evaluation of the clustering results of SRSM and the newly proposed method with different values of the parameter σ. The best evaluation in each quality measure is shaded in Table 2. As for WSSM, the evaluation of the clustering result was 0.363, 0.332 and 0.894 for entropy, mutual information and F measure, respectively. As seen in these results, the best case in all three evaluation measures is the clustering by the proposed approach with σ = 0.55. With SRSM, co-occurrence of terms is used to determine the similarity classes of terms. In the proposed method, we use both co-occurrence of terms and WordNet based semantic similarity.
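Eq. (12) can be evaluated on any is-a hierarchy given parent links; a minimal sketch on a toy taxonomy (illustrative, not actual WordNet structure):

```python
def path_to_root(parent, c):
    """Nodes from concept c up to the root, inclusive."""
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def wu_palmer(parent, c1, c2):
    """Eq. (12): 2*depth(lcs) / (l(c1, lcs) + l(c2, lcs) + 2*depth(lcs))."""
    p1, p2 = path_to_root(parent, c1), path_to_root(parent, c2)
    lcs = next(c for c in p1 if c in p2)      # lowest common subsumer
    l1, l2 = p1.index(lcs), p2.index(lcs)     # path lengths (edges) to the lcs
    depth = len(path_to_root(parent, lcs))    # nodes from the lcs to the root
    return 2 * depth / (l1 + l2 + 2 * depth)

# toy is-a hierarchy (made up for illustration)
parent = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
          "salmon": "fish", "fish": "animal"}
```

Siblings under a deep subsumer score higher (cat/dog: 2·2/(1+1+2·2) = 2/3) than concepts whose only shared ancestor is the root (cat/salmon: 1/3).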
The new approach, as suggested by the figures in the "size of similarity classes" columns of Table 2, can remove irrelevant terms from similarity classes. For example, with the SRSM implementation in our experiment, the similarity class of photon contains integration and algebra, which have low semantic relatedness to photon itself. With the new method, integration and algebra are removed from the similarity class. Another example is the term program, which for SRSM is in the similarity class of glossary, while for the new approach this is not the case. The removal of irrelevant terms improves the quality of the similarity classes and could give better clustering results.

Table 1: Clustering results of the first data set from the CLUTO toolkit [7]

Method   Entropy   Mutual information   F measure
rb       0.562     0.261                0.641
rbr      0.561     0.261                0.651
direct   0.552     0.264                0.672
agglo    1.283     0.001                0.452
bagglo   0.455     0.299                0.721

Table 2: Evaluation of clustering results with SRSM and the new method for the first data set

       SRSM                                           New method
σ      Entropy  Mut. inf.  F measure  Max   Avg      Entropy  Mut. inf.  F measure  Max   Avg
0.40   0.375    0.328      0.859      83    4.80     0.331    0.344      0.896      75    3.69
0.45   0.348    0.337      0.877      69    3.61     0.319    0.348      0.902      67    2.88
0.50   0.327    0.345      0.892      69    3.52     0.291    0.358      0.915      67    2.81
0.55   0.309    0.352      0.900      60    2.45     0.286    0.360      0.916      60    2.07
0.60   0.309    0.352      0.905      60    2.19     0.288    0.359      0.915      60    1.89
0.65   0.306    0.353      0.907      60    2.15     0.297    0.356      0.913      60    1.86
0.70   0.308    0.353      0.908      28    1.37     0.294    0.358      0.914      20    1.29
0.75   0.311    0.351      0.907      28    1.34     0.299    0.356      0.913      20    1.27
0.80   0.310    0.352      0.908      17    1.21     0.300    0.355      0.913      17    1.17

The maximum (Max) and average (Avg) sizes of the similarity classes of SRSM and the proposed method are shown in Table 2. The number of terms that are not in WordNet is 4753 among the 13177 terms, or around 36%. We can see that the sizes of the similarity classes of the proposed method are smaller than those of SRSM. The difference results from removing terms with low WordNet based semantic relatedness from the similarity classes in the proposed method. As defined by eq. (9), the similarity class of a term t consists of terms that satisfy both the condition of co-occurrence with t and one of the following conditions: 1) at least one of the two terms does not exist in the WordNet database; 2) the WordNet based similarity between the two terms is greater than the threshold value. For example, when σ = 0.55, the average size of the similarity classes defined only by the co-occurrence condition is 2.45 (SRSM), while that of the classes defined by eq. (9) is 2.07 (the proposed method), which means that 0.38 terms on average are removed from the similarity classes of SRSM because they satisfy neither condition 1) nor condition 2).
Then, among the remaining 2.07 terms of the similarity classes of the proposed method, 0.88 terms satisfy condition 1) and 1.19 satisfy condition 2), on average. The contingency table of the best case of the proposed method is shown in Table 3. Precision and recall of CACM, CISI, CRANFIELD, and MEDLINE are 0.964, 0.797, 0.930, 0.962 and 0.851, 0.952, 0.976, 0.983, respectively.

Table 3: Contingency table of the best case of the proposed method

            CACM   CISI   CRANFIELD   MEDLINE
Cluster 1   2726     68          29         5
Cluster 2    347   1390           4         4
Cluster 3     94      0        1366         9
Cluster 4     37      2           1      1015

The computation time of the new method is almost the same as that of the SRSM method, which has a time complexity of O(M log M) [7], where M is the number of terms in the text collection. The difference between the new method and SRSM is the computation of term semantic relatedness based on WordNet. This computation is fast because we use a path based method and the maximum depth of the word hierarchy in WordNet is sixteen [9], a very small number in comparison with the number of terms in a text collection. The second test collection used in our experiment consists of abstracts of papers from several IEEE journals in several fields. We formed a collection of 1010 documents from IEEE Transactions on Knowledge and Data Engineering (378 abstracts), IEEE Transactions on Biomedical Engineering (311 abstracts) and IEEE Transactions on Nanotechnology (321 abstracts). These categories of documents are denoted as KDE, BIO and NANO. We use the clustering methods to cluster the data set into three clusters. After removing stop-words and stemming words, we have 5690 terms in the document collection. With the 5690 terms, we created 1010 document vectors using the tf·idf weighting scheme; each document vector has 5690 dimensions.

Table 4: Clustering results of the second data set from the CLUTO toolkit [7]

Method   Entropy   Mutual information   F measure
rb       0.290     0.366                0.898
rbr      0.198     0.408                0.954
direct   0.198     0.408                0.954
agglo    0.684     0.187                0.723
graph    0.254     0.383                0.936
bagglo   0.234     0.392                0.939
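The precision and recall figures quoted for Table 3 follow directly from its contingency counts; a quick check:

```python
# Table 3 counts: rows = clusters 1-4, columns = CACM, CISI, CRANFIELD, MEDLINE
table = [
    [2726,   68,   29,    5],
    [ 347, 1390,    4,    4],
    [  94,    0, 1366,    9],
    [  37,    2,    1, 1015],
]

# class i is dominated by cluster i, so precision/recall pair up on the diagonal
precision = [row[i] / sum(row) for i, row in enumerate(table)]
recall = [table[i][i] / sum(row[i] for row in table) for i in range(4)]
```

Rounded to three decimals, these reproduce the precision (0.964, 0.797, 0.930, 0.962) and recall (0.851, 0.952, 0.976, 0.983) values above.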

Table 5: Evaluation of clustering results with SRSM and the new method for the second data set

| Threshold | SRSM Entropy | SRSM Information | SRSM F measure | New Entropy | New Information | New F measure |
|-----------|--------------|------------------|----------------|-------------|-----------------|---------------|
| 0.30 | 0.205 | 0.405 | 0.953 | 0.133 | 0.438 | 0.971 |
| 0.35 | 0.141 | 0.434 | 0.970 | 0.125 | 0.442 | 0.974 |
| 0.40 | 0.155 | 0.428 | 0.965 | 0.122 | 0.443 | 0.975 |
| 0.45 | 0.175 | 0.418 | 0.960 | 0.137 | 0.436 | 0.971 |
| 0.50 | 0.179 | 0.417 | 0.959 | 0.150 | 0.430 | 0.967 |
| 0.55 | 0.172 | 0.420 | 0.962 | 0.186 | 0.414 | 0.956 |
| 0.60 | 0.182 | 0.416 | 0.956 | 0.161 | 0.425 | 0.963 |
| 0.65 | 0.196 | 0.408 | 0.952 | 0.174 | 0.419 | 0.959 |
| 0.70 | 0.202 | 0.406 | 0.951 | 0.188 | 0.413 | 0.955 |

Table 4 shows the evaluation of the clustering results produced by the CLUTO toolkit's algorithms. Table 5 shows the evaluation of the clustering results of SRSM and the newly proposed method for different values of the threshold parameter; the best value of each quality measure is shaded in Table 5. For the WordNet semantic similarity based method, the evaluation of the clustering result was 0.363, 0.332 and 0.894 for entropy, mutual information and F measure, respectively. The results show that the clustering results of the newly proposed method are better than those of the other methods in all three evaluation measures.

6. Conclusions

The vector space model is widely used in the field of document clustering. It represents a document as a vector of terms. However, the simple VSM treats terms as independent of each other, so the semantic relationships between terms are not considered, which reduces the effectiveness of document clustering methods. The SRSM method and the WordNet semantic similarity based method use the semantic relations between terms to improve the performance of document clustering, but these methods have their own issues, as discussed in the previous sections. We proposed a new method, a combination of SRSM and the WordNet semantic similarity based method, to solve these issues.
Our experimental results show that the quality of clustering with the proposed method is better than that with SRSM and with the WordNet semantic similarity based method. Its clustering results are also better than the results of the other methods in the CLUTO toolkit. In addition to WordNet, Wikipedia and Wiktionary are also promising tools for semantic relatedness measurement and analysis [22]. In future work, we will exploit these tools to further improve document clustering methods.

References
[1] Y. Zhao and G. Karypis, Hierarchical clustering algorithms for document datasets, Data Mining and Knowledge Discovery, 10 (2), pp. 141-168, 2005.
[2] I.S. Dhillon and D.S. Modha, Concept decompositions for large sparse text data using clustering, Machine Learning, 42 (1-2), pp. 143-175, 2001.
[3] M. Steinbach, G. Karypis and V. Kumar, A comparison of document clustering techniques, Proceedings of the KDD Workshop on Text Mining, 2000.
[4] Y. Li, S.M. Chung and J.D. Holt, Text document clustering based on frequent word meaning sequences, Data and Knowledge Engineering, 64 (1), pp. 381-404, 2008.
[5] M. Mahdavi and H. Abolhassani, Harmony K-means algorithm for document clustering, Data Mining and Knowledge Discovery, pp. 1-22, 2008.
[6] T.B. Ho and K. Funakoshi, Information retrieval using rough sets, Journal of Japanese Society for Artificial Intelligence, 13 (3), pp. 424-433, 1997.
[7] N.C. Thanh, K. Yamada and M. Unehara, A Similarity Rough Set Model for document representation and document clustering, Journal of Advanced Computational Intelligence and Intelligent Informatics, 15 (2), pp. 125-133, 2011.
[8] W.K. Gad and M.S. Kamel, Enhancing text clustering performance using semantic similarity, Lecture Notes in Business Information Processing, 24 LNBIP, pp. 325-335, 2009.
[9] L. Jing, M.K. Ng and J.Z. Huang, Knowledge-based vector space model for text clustering, Knowledge and Information Systems, 25 (1), pp. 35-55, 2010.
[10] Z. Pawlak, Rough sets, International Journal of Information and Computer Sciences, 11 (5), pp. 341-356, 1982.
[11] R. Slowinski and D. Vanderpooten, Similarity relation as a basis for rough approximation, Advances in Machine Intelligence and Soft Computing, Vol. 4, pp. 17-33, 1997.
[12] R. Slowinski and D. Vanderpooten, A generalized definition of rough approximations based on similarity, IEEE Transactions on Knowledge and Data Engineering, 12 (2), pp. 331-336, 2000.
[13] T.B. Ho and N.B. Nguyen, Nonhierarchical document clustering based on a tolerance rough set model, International Journal of Intelligent Systems, 17 (2), pp. 199-212, 2002.
[14] X.-J. Meng, Q.-C. Chen, and X.-L. Wang, A tolerance rough set based semantic clustering method for web search results, Information Technology Journal, 8 (4), pp. 453-464, 2009.
[15] Princeton University, "About WordNet", WordNet, Princeton University, 2010, http://wordnet.princeton.edu.
[16] R. Rada, H. Mili, E. Bicknell and M. Blettner, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man and Cybernetics, 19 (1), pp. 17-30, 1989.
[17] Z. Wu and M. Palmer, Verbs semantics and lexical selection, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL '94), pp. 133-138, 1994.
[18] P. Resnik, Using information content to evaluate semantic similarity, Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448-453, 1995.

[19] J.J. Jiang and D.W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, Proceedings of the 10th International Conference on Research in Computational Linguistics, Taipei, Taiwan, 1997.
[20] M. Lesk, Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone, Proceedings of the 5th Annual International Conference on Systems Documentation, pp. 24-26, 1986.
[21] S. Banerjee and T. Pedersen, An adapted Lesk algorithm for word sense disambiguation using WordNet, CICLing '02: Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pp. 136-145, 2002.
[22] T. Zesch and I. Gurevych, Wisdom of crowds versus wisdom of linguists - measuring the semantic relatedness of words, Natural Language Engineering, 16 (1), pp. 25-59, 2010.
[23] S. Patwardhan and T. Pedersen, Using WordNet-based context vectors to estimate the semantic relatedness of concepts, Proceedings of the EACL 2006 Workshop Making Sense of Sense - Bringing Computational Linguistics and Psycholinguistics Together, pp. 1-8, Trento, Italy, 2006.
[24] G. Karypis, CLUTO - A Clustering Toolkit, 2003, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download.
[25] ftp://ftp.cs.cornell.edu/pub/smart
[26] A. Strehl, J. Ghosh and R. Mooney, Impact of similarity measures on web-page clustering, Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000), Austin, TX, pp. 58-64, July 2000.