Concept Forest: A New Ontology-assisted Text Document Similarity Measurement Method

James Z. Wang, William Taylor
School of Computing, Clemson University, Box Clemson, SC , USA
{jzwang, wptaylo}@cs.clemson.edu

Abstract

Although using ontologies to assist information retrieval and text document processing has recently attracted more and more attention, existing ontology-based approaches have not shown advantages over the traditional keywords-based Latent Semantic Indexing (LSI) method. This paper proposes an algorithm to extract a concept forest (CF) from a document with the assistance of a natural language ontology, the WordNet lexical database. Using concept forests to represent the semantics of text documents, the semantic similarities of these documents are then measured as the commonalities of their concept forests. Performance studies of text document clustering based on different document similarity measurement methods show that the CF-based similarity measurement is an effective alternative to the existing keywords-based methods. In particular, this CF-based approach has obvious advantages over the existing keywords-based methods, including LSI, in processing text abstracts or in P2P environments where it is impractical to collect the entire document corpus for analysis.

1. Introduction

Currently, keywords-based techniques are commonly used in various information retrieval and text mining applications. Among them, the Vector Space Model (VSM) [1] and Latent Semantic Indexing (LSI) [2] are the most widely adopted. Using VSM, a text document is represented by a vector of the frequencies of the terms appearing in the document. The similarity between two text documents is measured as the cosine coefficient between their term frequency vectors. However, a major drawback of the keywords-based VSM approach is its inability to handle the polysemy and synonymy phenomena of natural language.
As meanings of words and understanding of concepts differ in different communities, different users might use the same word for different concepts (polysemy) or use different words for the same concept (synonymy). Thus, matching only keywords may not accurately reveal the semantic similarity among text documents or between search criteria and text documents, due to the heterogeneity and independence of data sources and data repositories. For example, the keyword "java" can represent three different concepts: coffee, an island, or a programming language, while the keywords "dog" and "canine" may represent the same concept in different documents.

LSI tries to overcome the limitation of VSM by using statistically derived conceptual indices to represent text documents and queries. LSI assumes that there is an underlying latent structure in word usage that is partially obscured by variability of word choice, and it tries to address the polysemy and synonymy problems by modeling the co-occurrence of keywords in documents. Though earlier studies contend that LSI may implicitly reveal concepts through the co-occurrence of keywords, we found that the co-occurrence of keywords does not necessarily reflect their contextual relatedness in the document, especially in multi-disciplinary research papers. This is exactly why using LSI-based tools to extract terms from commercial web documents, which may contain ads, headlines, and news feeds, is a questionable practice. Moreover, it is not clear how to map an LSI-based conceptual index onto the underlying concept, which makes it difficult to visualize the text mining results. In addition, some text document archives, such as the MEDLINE database [3] and web blogging entries, contain primarily short articles or abstracts instead of long papers. These short documents may not provide sufficient co-occurrence information for LSI-based semantic similarity measurement. Furthermore, in dynamic environments, such as live news feeds or P2P systems, it is impractical to collect the entire document corpus for analysis.
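The synonymy problem of VSM can be seen in a minimal sketch. The tokenizer, the sample sentences, and the function names below are illustrative assumptions, not part of the original study; only the cosine coefficient over term frequency vectors follows the description above.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count lowercase whitespace-separated tokens (toy tokenizer, no stop-word removal)."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine coefficient between two term frequency vectors stored as Counters."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented sample documents: "dog" and "canine" name the same concept.
doc_dog = term_frequencies("dog barks dog runs")
doc_canine = term_frequencies("canine barks canine runs")
doc_other = term_frequencies("canine howls canine sleeps")

sim_shared = cosine_similarity(doc_dog, doc_canine)        # only "barks" and "runs" match
sim_synonym_only = cosine_similarity(doc_dog, doc_other)   # no literal keyword in common
```

In this sketch the second pair scores zero even though both documents are about the same animal, because VSM matches only literal keywords.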
In this paper, to address the weaknesses of existing keywords-based approaches, we propose an ontology-assisted text document similarity measurement method that builds a concept forest to represent the semantics of a text document. The rest of this paper is organized as follows. We first discuss the existing ontology-based approaches and their weaknesses in section 2 and then discuss our ontology-assisted concept forest construction algorithm and the associated similarity measurement method in section 3. In section 4, we cluster various text document corpuses based on similarity values obtained by different methods to validate the advantages of our CF-based approach. Finally, we give our conclusion and discuss future work in section 5.

2. Background and existing approaches

Recently, to address the problems in keywords-based approaches, many studies have tried to use ontologies to assist information retrieval and text document processing. These ontology-based approaches can be divided into two categories.

One category of ontology-based methods [4, 5, 6, 7] applies machine learning methods, such as clustering analysis and fuzzy logic, to construct ontologies from text documents and then uses these ontologies to assist information retrieval and text document processing [8, 9]. However, these methods require analyzing the entire document corpus to construct a good ontology, and the performance of information retrieval and text document processing depends on how good the constructed ontologies are. During the corpus analysis, terms rarely appearing in the document corpus are often ignored because of their low frequencies of occurrence. However, according to information theory, the high information content of these rare terms is valuable for information retrieval. Ignoring these terms in the constructed ontologies may affect the performance of information retrieval and text document processing. Moreover, these ontology-based methods have not been fully evaluated against the keywords-based LSI method, arguably the best keywords-based method.
Another group of ontology-based methods utilizes an existing ontology, such as WordNet [10], to assist information retrieval and text document processing. These methods use three different approaches to take advantage of the existing ontological knowledge.

The first approach [11, 12] involves using WordNet to find synonyms or hypernyms of terms to improve the performance of information retrieval and text document processing. However, this approach may introduce noise by adding semantic content that is not present in the document corpus. For instance, given a document about beef and a document about pork, a hypernym-based method may use "meat" to replace "beef" and "pork" because the two terms have a common hypernym, "meat". This approach over-simplifies or over-generalizes the problem, making it impossible to distinguish documents containing "beef" from documents containing "pork". Another problem with this approach is that it does not perform word sense disambiguation. Instead, all synonyms or hypernyms related to a keyword are used to replace the keyword. These weaknesses often lead to disappointing information retrieval and text document processing performance [13, 14].

The second approach focuses on word sense disambiguation [15, 16, 17, 18] to address the synonymy and polysemy problems in natural language processing. However, this approach tries to determine an exact sense for a term, often resulting in misclassification of terms. This approach also ignores the impact of the semantic similarities and relationships among different terms in the same text document on the performance of information retrieval and text document processing.

To address the problems in the first two approaches, the third approach applies various techniques [19, 20, 21, 22] to discover the semantic similarities and relationships of terms and uses them to enhance the keywords-based information retrieval and text document processing methods, such as VSM. However, the techniques used to discover the term relationships and similarities have their weaknesses.
Sedding [19] used a naive, syntax-based disambiguation approach, assigning each word a part-of-speech (PoS) tag and enriching the bag-of-words data representation with synonyms and hypernyms extracted from WordNet for use in document clustering. Unfortunately, this study found that including synonyms and hypernyms, disambiguated only by PoS tags, does not improve the effectiveness of text document clustering. The authors attributed this underperformance to the noise introduced by incorrect senses retrieved from WordNet and concluded that disambiguation by PoS alone is insufficient to reveal the full potential of including background knowledge in information retrieval and text document processing. To further investigate this issue, de Simone [20] proposed a document search technique that uses other methods, in addition to PoS tagging, to cluster search results into meaningful categories according to the words that modify the original search term in the text document. This work focuses on determining whether the antonymy relation, instead of synonyms and hypernyms, could be used on the modifiers found in documents to decompose a set of search results into a hierarchy of sub-clusters. Unfortunately, their experimental studies again suggest that this approach cannot improve the performance of information retrieval.

While these two studies [19, 20] suggest that exploiting term relationships or similarities using WordNet may not improve the performance of information retrieval and text document processing, other studies using different methods imply that it is possible to use term relationships or similarities to improve the performance of the keywords-based VSM. Hung [21] used a guided self-organization map (SOM), a result of merging statistical methods, competitive neural models, and semantic relationships obtained from WordNet, to improve the performance of the traditional VSM. However, a certain amount of human involvement is required to build the guided SOM. Jing [22] calculates a mutual information matrix for all terms in the documents based on information obtained from WordNet and uses the mutual information to enhance the keywords-based VSM method. However, automatically computing term mutual information (TMI) is sometimes problematic and may lead to wrong conclusions about the quality of the learned mutual similarity [23]. Even though using SOM and TMI can improve the performance of the keywords-based VSM, their performance in comparison with LSI, the best keywords-based method, has not been investigated. Furthermore, these methods require analyzing the entire document corpus, as VSM and LSI do.

To address the problems in existing ontology-based methods, we propose a new ontology-assisted method to measure the semantic similarity of text documents. This new method constructs a concept forest (CF) from a text document, based on the co-occurrence of terms and their semantic relationships found in WordNet. Using the CF to represent the semantics of text documents, we propose a simple method to measure the semantic similarity of two text documents. A unique feature of our proposed CF-based method is that we derive the concept forest based only on analyzing the co-occurrences and relationships of terms within a single document. In contrast, existing approaches all require analyzing the entire text document corpus to determine the semantic similarity of two text documents. Therefore, our CF-based method is a practical alternative to the existing information retrieval and text document processing methods in dynamic environments such as P2P systems and live news feeds, where it is impractical to collect the entire document corpus for analysis.

3. Concept Forest and Semantic Similarity

Our CF-based method includes three steps: concept forest construction, semantic content purification, and similarity measurement.

3.1 Concept Forest Construction

We use WordNet [10] to assist our concept forest construction. WordNet is a large lexical database of English words, in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), with each synset representing a distinct concept. Synsets are interlinked by means of conceptual and lexical relations. There are approximately 150,000 words organized in over 115,000 synsets in WordNet. Every synset contains a group of synonymous words or collocations, with different senses (concepts) of a word being in different synsets. Most synsets are connected to other synsets through semantic relations, such as hypernym, hyponym, etc. The dominant semantic relationship in WordNet is hypernym, the is-a relationship. Most nouns and verbs are organized into hierarchies, defined by hypernym or is-a relationships. For example, Figure 1 depicts the hypernym hierarchy for the first sense of the word "dog".

Figure 1: Hierarchy of Hypernym Relationships.

Given a text document, we first extract all keywords and their occurrence frequencies from the document, excluding stop words such as pronouns, common verbs, common nouns, adjectives, and frilly words. These words add little or no value in determining the document's semantic content according to previous studies [1, 2]. We then use a simple WordNet morphology interface (function morphstr()) to stem these keywords, i.e., to map inflected (or sometimes derived) words to their stem, base or root form. For instance, "cared", "cares", and "caring" are all mapped to the root word "care". After word stemming, we determine the proper synset for a word based on the co-occurrence of terms in the document and the semantic relationships of senses defined in WordNet. In WordNet, each set of synonyms (synset) shares some common properties, such as a gloss (or dictionary) definition, and is indexed by a unique ID (called synsetid).
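The keyword extraction and stemming step can be sketched as follows. The stop-word list and the base-form table are tiny hypothetical stand-ins chosen for illustration; in the method described above, the base forms come from the WordNet morphology interface rather than from a lookup table.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "for",
              "she", "he", "it", "what", "about"}

# Hypothetical stand-in for the WordNet morphology lookup: inflected form -> base form.
BASE_FORMS = {"cared": "care", "cares": "care", "caring": "care",
              "patients": "patient"}

def extract_keywords(text):
    """Return base-form keywords with their occurrence frequencies."""
    counts = Counter()
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()").lower()
        if not token or token in STOP_WORDS:
            continue  # drop punctuation-only tokens and stop words
        counts[BASE_FORMS.get(token, token)] += 1
    return counts

keywords = extract_keywords(
    "She cared for patients, and caring for a patient is what she cares about."
)
```

Here the three inflected forms of "care" collapse into one keyword whose frequency is the sum of their occurrences.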
However, one word may be related to several synsets due to the polysemy of natural language. For instance, the word "java" has three different senses: (1) Coffee, cafe (synsetid: ); (2) Programming, Programming Language (synsetid: ); (3) An Island (synsetid: ). Therefore, simply retrieving all senses of the stemmed words to represent the semantic content of a document introduces a lot of noise [13, 14]. To address this issue, for any stemmed word obtained from a document, we only use the sense that clearly represents the concept of the word in this document for our concept forest construction.

Our procedure checks every pair of stemmed words obtained from the text document to determine whether there are semantic relationships between their senses defined in WordNet. We consider only the hypernym relationship in this study because, according to our experimental studies, the hypernym relationship dominates among the terms in text documents and is therefore adequate for measuring the semantic similarity of documents. Given two terms (T1 and T2) obtained from the same text document, if their respective synsets S1 and S2 have a hypernym relationship, the synsetids of S1 and S2 are used to represent the concepts of T1 and T2 respectively, and other senses of T1 and T2 are discarded. Meanwhile, an is-a relationship link is formed between the synsetids of S1 and S2. This process completes when all pairs of stemmed words have been investigated.

For instance, given a document containing the words "disease", "sickness", "influenza", "drug" and "medicine", we can construct a concept tree for the terms "disease", "sickness" and "influenza" using is-a relationship links based on the hypernym relationships among these terms, as shown in Figure 2. Similarly, a concept tree can be built for the terms "drug" and "medicine". These two concept trees form the concept forest depicted in Figure 3. We note that the terms, instead of their related synsetids, are shown in the concept forest for demonstration only. In actual concept forests, the synsetids are used to represent the concepts whenever possible. We also note that a concept tree may contain only a single stemmed word.

S: (n) influenza, flu, grippe
  direct hypernym / inherited hypernym / sister term
  S: (n) contagious disease, contagion
    S: (n) communicable disease
      S: (n) disease
        S: (n) illness, unwellness, malady, sickness

Figure 2: Hypernym hierarchy for terms influenza, disease and sickness.

influenza --is-a--> disease --is-a--> sickness
medicine --is-a--> drug

Figure 3: Concept forest derived from terms influenza, disease, sickness, drug and medicine.
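The pairwise procedure can be sketched on a toy ontology. Everything in the two tables below (sense identifiers, hypernym links, the extra senses of "sickness" and "medicine") is invented for illustration and is not real WordNet data. The fallback rules for words without a related sense, and the removal of links implied by a chain of shorter links, are assumptions of this sketch.

```python
from itertools import combinations

# Hypothetical toy ontology: word -> candidate senses (all identifiers invented).
SENSES = {
    "influenza": ["flu.1"],
    "disease": ["disease.1"],
    "sickness": ["illness.1", "nausea.1"],
    "drug": ["drug.1"],
    "medicine": ["medicine.remedy", "medicine.field"],
    "java": ["java.coffee", "java.island", "java.language"],
}
# Hypothetical hypernym links: child sense -> direct hypernym.
HYPERNYM = {
    "flu.1": "contagious_disease.1",
    "contagious_disease.1": "communicable_disease.1",
    "communicable_disease.1": "disease.1",
    "disease.1": "illness.1",
    "medicine.remedy": "drug.1",
    "medicine.field": "science.1",
}

def is_ancestor(ancestor, sense):
    """True if `ancestor` is reached from `sense` by following hypernym links."""
    current = HYPERNYM.get(sense)
    while current is not None:
        if current == ancestor:
            return True
        current = HYPERNYM.get(current)
    return False

def build_concept_forest(words):
    """Return (nodes, edges) where each edge is a (child, parent) is-a link."""
    kept = {w: set() for w in words}  # senses confirmed by another word in the document
    edges = set()
    for w1, w2 in combinations(words, 2):
        for s1 in SENSES.get(w1, []):
            for s2 in SENSES.get(w2, []):
                if is_ancestor(s2, s1):
                    edges.add((s1, s2))
                elif is_ancestor(s1, s2):
                    edges.add((s2, s1))
                else:
                    continue
                kept[w1].add(s1)
                kept[w2].add(s2)
    nodes = set()
    for w in words:
        senses = SENSES.get(w, [])
        if kept[w]:
            nodes |= kept[w]       # keep only the related senses
        elif len(senses) == 1:
            nodes.add(senses[0])   # unambiguous word: use its single sense
        else:
            nodes.add(w)           # cannot disambiguate: keep the stemmed word
    # Drop links implied by a chain of two other links, so each tree shows direct links only.
    edges = {(c, p) for (c, p) in edges
             if not any((c, m) in edges and (m, p) in edges for m in nodes)}
    return nodes, edges

nodes, edges = build_concept_forest(
    ["disease", "sickness", "influenza", "drug", "medicine", "java"]
)
```

On this toy input the sketch yields one tree linking the influenza, disease and illness senses, one tree linking the remedy sense of "medicine" to "drug", and a single-node tree holding the undisambiguated word "java".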
Unlike some existing approaches [1, 2], which use all terms in all synsets of the stemmed words to represent the semantic content of a document, we treat keywords differently according to their synset properties and the semantic relationships among the synsets of keywords. If a keyword has only one sense, its synsetid will be used in the concept forest. If a keyword has more than one sense and no other keyword's senses have semantic relationships with this keyword's senses, then this keyword will be kept as its original stemmed word in the concept forest since we cannot disambiguate the word sense. Finally, if a keyword has many senses, and one or more senses have semantic relationships with the senses of other keywords in the text document, only the synsetids of the senses having semantic relationships with the senses of other keywords will be kept in the concept forest. Other senses of this keyword will be discarded since they are irrelevant to the semantic content of the text document.

3.2 Semantic Content Purification

A concept forest constructed by the method described in Section 3.1 may contain terms or synsetids that are not closely related to the main topics of the text document, and these terms or synsetids may sometimes introduce noise to information retrieval and text document processing. To address this issue, we use the frequencies of terms occurring in the text document to calculate a semantic content rate (SCR) for each concept tree in the concept forest. Each stemmed word obtained from a text document has an associated word frequency value corresponding to the number of occurrences of this word in the text document. When a stemmed word is mapped to a particular synsetid during the CF construction, the associated word frequency value is transferred to the synsetid. If several stemmed words are mapped to the same synsetid, the word frequency value of this synsetid is the sum of the word frequency values of these associated words.
We further define the semantic content weight for a concept tree as the sum of the word frequency values of all its associated synsetids. For a single-node tree, its semantic content weight is the word frequency value of this single node. Assuming the semantic content weights of the concept trees in a concept forest are $w_1, w_2, \ldots, w_n$, respectively, the semantic content rate of concept tree $i$ is defined as:

$$SCR_i = \frac{w_i}{\sum_{j=1}^{n} w_j} \qquad (1)$$

The SCR values in a concept forest indicate the semantic organization of the associated text document. A concept forest obtained from a clearly and concisely written single-topic abstract may contain a concept tree having an SCR value greater than 75%, while the concept forest obtained from a long multiple-topic text document may contain several concept trees with much smaller SCR values. To purify the semantic content of a concept forest, we use a threshold (e.g., 5%) to filter out concept trees with low SCR values. Any concept tree whose SCR value falls below this threshold will be removed from the final purified concept forest.
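As a worked example of Equation 1 and the purification step, the sketch below computes SCR values from per-tree weights and drops the trees below the threshold. The tree names and weights are invented for illustration, and the function names are hypothetical.

```python
def semantic_content_rates(tree_weights):
    """Map each concept tree to its weight divided by the total weight (Equation 1)."""
    total = sum(tree_weights.values())
    if total == 0:
        return {name: 0.0 for name in tree_weights}
    return {name: weight / total for name, weight in tree_weights.items()}

def purify(tree_weights, threshold=0.05):
    """Keep only the concept trees whose SCR is not below the threshold."""
    rates = semantic_content_rates(tree_weights)
    return {name: weight for name, weight in tree_weights.items()
            if rates[name] >= threshold}

# Invented semantic content weights for a three-tree concept forest.
weights = {"disease tree": 30, "drug tree": 9, "java": 1}
rates = semantic_content_rates(weights)   # 30/40, 9/40 and 1/40
purified = purify(weights, threshold=0.05)
```

With these weights the dominant tree has an SCR of 75%, and the single-node "java" tree (2.5%) falls below the 5% threshold and is removed.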

3.3 Semantic Similarity Measurement

Using a concept forest to represent the semantic content of a text document, the semantic similarity of two text documents can be determined by comparing their concept forests. Formally, a concept forest is defined as a Directed Acyclic Graph (DAG): $CF = [T, E, R]$, where $T = \{t_1, t_2, \ldots, t_n\}$ is a set of stemmed words or synsetids, and $E = \{e_1, e_2, \ldots, e_m\}$ is a set of edges connecting synsetids with relationships defined in $R = \{r_1, r_2, \ldots, r_k\}$. Specifically, an edge $e$ is defined as a triplet $[t_1, t_2, r_j]$ where $t_1, t_2 \in T$ and $r_j \in R$. In addition, two terms can be linked by only one relationship, that is,

$$\forall\, l \neq k,\quad [t_i, t_j, r_k] \in E \Rightarrow [t_i, t_j, r_l] \notin E.$$

For instance, the concept forest in Figure 3 can be represented as CF = [{"disease", "sickness", "influenza", "drug", "medicine"}, {["influenza", "disease", "is-a"], ["disease", "sickness", "is-a"], ["medicine", "drug", "is-a"]}, {"is-a"}].

Given two documents $D_1$ and $D_2$, and their concept forests $CF_1 = [T_1, E_1, R_1]$ and $CF_2 = [T_2, E_2, R_2]$ respectively, determining the semantic similarity of these two documents requires considering the similarities of the term sets, edge sets, and relationship sets in their concept forests. However, we use only the hypernym ("is-a") relationship to construct the CF, and thus the relationship set R is the same for all CFs. In addition, the selection of terms during the CF construction already implies their relationships. Therefore, we calculate the semantic similarity of two text documents by simply comparing the term sets ($T_1$ and $T_2$) in their concept forests, hoping this simple measurement is sufficient for information retrieval and text document processing. That is:

$$\mathrm{Sim}(D_1, D_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \qquad (2)$$

4. Experimental Studies

To evaluate whether the obtained concept forest can represent the semantic content of a document, we cluster text documents based on their semantic similarity values calculated by Equation 2. The clustering results are then compared with the results of document clustering based on VSM and LSI respectively.
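A minimal sketch of the term-set comparison follows. It assumes that Equation 2 is the ratio of shared terms to all distinct terms of the two concept forests (intersection over union); the term identifiers are invented, and the function name is hypothetical.

```python
def cf_similarity(terms_a, terms_b):
    """Ratio of shared terms to all distinct terms of two concept forests."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        return 0.0  # two empty forests: treated here as no evidence of similarity
    return len(a & b) / len(union)

# Invented term sets (synsetids or undisambiguated stemmed words).
forest_1 = {"disease.1", "illness.1", "flu.1", "drug.1"}
forest_2 = {"disease.1", "flu.1", "vaccine.1"}

similarity = cf_similarity(forest_1, forest_2)  # 2 shared terms out of 5 distinct terms
```

Because the measure needs only the two term sets, it can be computed for any pair of documents without access to the rest of the corpus.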
4.1 Text Document Corpus

As in many previous studies, we derive our document corpuses from the Reuters Text Categorization Collection in the UCI KDD archive [24]. The Reuters dataset is a collection of documents that appeared on the Reuters newswire. The documents were assembled and indexed with categories. This dataset consists of approximately 21,500 files covering 132 (possibly overlapping) categories, with the file size per article ranging from 12 to 900 words. As we discussed previously, LSI is not effective in information retrieval and text document processing for short text documents due to the insufficient co-occurrence information within the short documents. We want to study whether our CF-based text document similarity measurement method can address this issue so that it can be used for information retrieval in text abstract databases, such as the MEDLINE database. Therefore, we intentionally select text documents containing fewer than 400 words. As shown in Table 1, four text document corpuses containing 50 to 500 documents are selected for our experimental studies.

Table 1: Selected text document corpuses

| Corpus | Corpus Characteristics |
|--------|------------------------|
| C-1 | 50 documents, 2 categories (Oil, Nat-Gas), 25 documents in each category. |
| C-2 | 100 documents, 2 categories (Coffee, Sugar), 50 documents in each category. |
| C-3 | 200 documents, 4 categories (Grain, Wheat, Ship, Crude), 50 documents in each category. |
| C-4 | 500 documents, 2 categories (Wheat, Grain), 250 documents in each category. |

4.2 Performance Evaluation Method

Although many studies have used the K-means clustering algorithm or its variants for text document clustering [13, 14], the K-means algorithm is not suitable for text document clustering using our CF-based similarity measurement because it does not make sense to calculate a mean similarity among a set of documents. Therefore, an agglomerative hierarchical clustering algorithm is used in our performance study. Given a text document corpus, each document initially belongs to its own individual cluster.
We set the initial similarity threshold to 1 and decrease the threshold in small steps so that documents with similar semantics are gradually merged into the same group. Since we already know the categories from which each document was obtained, the document clustering process stops when the majority of documents from different categories fall into their respective clusters and further decreasing the threshold would cause clusters containing documents primarily from two different categories to be merged into one cluster. After the document clustering, we calculate the clustering accuracy as the number of documents correctly clustered into their categories divided by the total number of documents.
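The evaluation procedure can be sketched as follows, under two assumptions that the description above leaves open: clusters are merged whenever any pair of their documents reaches the threshold (single linkage), and a document counts as correctly clustered when its category is the majority category of its cluster. The similarity matrix and labels are invented.

```python
from collections import Counter

def cluster_at_threshold(n_docs, similarity, threshold):
    """Group documents connected by chains of pairs with similarity >= threshold."""
    parent = list(range(n_docs))

    def find(i):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_docs):
        for j in range(i + 1, n_docs):
            if similarity(i, j) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters
    clusters = {}
    for i in range(n_docs):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def clustering_accuracy(clusters, labels):
    """Fraction of documents whose category is the majority category of their cluster."""
    correct = sum(Counter(labels[i] for i in cluster).most_common(1)[0][1]
                  for cluster in clusters)
    return correct / len(labels)

# Invented pairwise similarities for four documents from two categories.
SIM = [[1.0, 0.6, 0.1, 0.0],
       [0.6, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.5],
       [0.0, 0.1, 0.5, 1.0]]
labels = ["oil", "oil", "nat-gas", "nat-gas"]

clusters = cluster_at_threshold(4, lambda i, j: SIM[i][j], threshold=0.5)
accuracy = clustering_accuracy(clusters, labels)
over_merged = cluster_at_threshold(4, lambda i, j: SIM[i][j], threshold=0.1)
```

Lowering the threshold too far merges the two categories into one cluster, which is the situation the stopping rule is meant to avoid. The majority-based accuracy is only meaningful once the number of clusters is close to the number of categories, since singleton clusters trivially score as correct.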

Besides clustering text documents based on our CF-based similarity measurement method, we also perform the document clustering using VSM and LSI as the document similarity measurement methods. For VSM, the cosine coefficients of the document vectors are used as the similarity measures. For LSI, we calculate the rank-k approximation of the term vector for each document and calculate their similarities using the cosine coefficients of their term vectors. Then we use these similarity values to cluster the text documents. We repeat the same process under different k values and report the best clustering results for LSI.

4.3 Performance Results

We conducted our experimental studies on a DELL desktop computer equipped with a 1.0 GHz Intel Pentium IV processor and 512 MB RAM, running Red Hat Enterprise Linux. We cluster the text document corpuses listed in Table 1 based on three different similarity measurement methods: VSM, LSI, and the CF-based method. The accuracies of text document clustering using the different methods are listed in Table 2. In addition to the clustering accuracy, we also observe the total time needed to complete the corpus analysis and document clustering. The results are reported in Figure 4.

Table 2: Clustering accuracies on text corpuses listed in Table 1

| Corpus | VSM | LSI | CF |
|--------|-----|-----|----|
| C-1 | 64% | 64% | 74% |
| C-2 | 50% | 62% | 80% |
| C-3 | 25% | 34% | 48% |
| C-4 | 50% | 56.8% | 68% |

Figure 4: Total time (minutes) needed to complete the corpus analysis and text document clustering using different methods on text corpuses listed in Table 1. (Horizontal axis: corpuses C-1, C-2, C-3, C-4.)

The performance results in Table 2 show that the accuracy of document clustering based on our CF-based similarity measurement is much better than that based on VSM or LSI, with gains of 10 to 18 percentage points over LSI on every corpus. The clustering accuracies based on LSI are, in turn, at least as good as those based on VSM. The execution times depicted in Figure 4 exhibit the runtime efficiency of our CF-based text document processing method. The total time spent on corpus analysis and document clustering using the CF-based method is much less than that based on VSM or LSI.

5. Conclusion and Future Studies

In this paper, we propose a novel algorithm to extract a concept forest (CF) from a text document with the assistance of a natural language ontology, the WordNet lexical database. Using concept forests to represent the semantics of text documents, we measure the semantic similarity of two text documents by simply comparing the term sets in their respective concept forests. This CF-based similarity measurement does not require analyzing the entire document corpus, an advantage over most existing document similarity measurement methods, including the popular VSM and LSI. This unique advantage allows our CF-based text document similarity measurement method to be used in P2P environments where collecting the entire document corpus for analysis is impractical. Our experimental studies also show that the CF-based text document similarity measurement method performs much better than both the VSM and LSI methods when document sizes are relatively small. Furthermore, our CF-based document similarity measurement method is much more efficient regarding the total execution time used for corpus analysis and document clustering. Therefore, we believe the CF-based approach is a practical alternative to the existing keywords-based methods for information retrieval and text mining in text abstract databases, such as MEDLINE.

We are currently designing a graph-matching-based method to compare the similarity of two concept forests, hoping to provide a more sophisticated text document similarity measurement and improve the text document clustering accuracy. We are also implementing a CF-based information retrieval system to effectively retrieve text abstracts from the MEDLINE database.

6. References

[1] G. Salton, A. Wong, and C. S. Yang (1975), A Vector Space Model for Automatic Indexing, Communications of the ACM, vol. 18, no. 11, pages
[2] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6),

[3] MEDLINE Fact Sheet, l
[4] Lee, C.S., Jian, Z.W. and Huang, L.K., A fuzzy ontology and its application to news summarization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. Volume 35, Issue 5. pp
[5] Lipika Dey, Ashish Chandra Rastogi, Sachin Kumar, Generating Concept Ontologies through Text Mining, 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06), pp ,
[6] O. S. Chin, N. Kulathuramaiyer, A. W. Yeo, Automatic Discovery of Concepts from Text, IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), pp , December 2006
[7] Blaz Fortuna, Dunja Mladenic and Marko Grobelnik, Semi-Automatic Construction of Topic Ontology, Conference on Data Mining and Data Warehouses (SiKDD 2005) at multiconference IS
[8] Navigli, R., Velardi, P. and Gangemi, A., Ontology learning and its application to automated terminology translation. IEEE Intelligent Systems. Volume 18, Issue 1. pp
[9] Sugumaran, V. and Storey, V.C., Ontologies for conceptual modeling: their creation, use, and management. International Journal of Data and Knowledge Engineering. Volume 42, Issue 3. pp
[10] Christiane Fellbaum (ed.), WordNet: An Electronic Lexical Database. The MIT Press, May
[11] S. Scott and S. Matwin. Text Classification using WordNet Hypernyms. In S. Harabagiu, editor, Use of WordNet in Natural Language Processing Systems: Proceedings of the Conference, pages . Association for Computational Linguistics, Somerset, New Jersey,
[12] D. Koller and M. Sahami, Hierarchically classifying documents using very few words, Proceedings of the 14th international Conference on Machine Learning ECML98,
[13] A. Kehagias, V. Petridis, V.G. Kaburlasos, and P. Fragkou, A comparison of word- and sense-based text categorization using several classification algorithms, Journal of Intelligent Information Systems, 21(3),
[14] A. Hotho, S. Staab, and G. Stumme. Ontologies improve text document clustering. In Proceedings of the IEEE International Conference on Data Mining, pages ,
[15] Dimitrios Mavroeidis et al., Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification, A. Jorge et al. (Eds.): PKDD 2005, LNAI 3721, pp ,
[16] Youjin Chung and Jong-Hyeok Lee, Practical Word-Sense Disambiguation Using Co-occurring Concept Codes, Machine Translation (2005) 19:
[17] Ernesto William De Luca, Andreas Nürnberger: Using clustering methods to improve ontology-based query term disambiguation. Int. J. Intell. Syst. 21(7): (2006)
[18] Ying Liu, Peter Scheuermann, Xingsen Li, and Xingquan Zhu, Using WordNet to Disambiguate Word Senses for Text Classification, Y. Shi et al. (Eds.): ICCS 2007, Part III, LNCS 4489, pp ,
[19] J. Sedding and D. Kazakov. WordNet-based Text Document Clustering. In Proc. of the Third Workshop on Robust Methods in Analysis of Natural Language Data (ROMAND), pp , Geneva,
[20] Thomas de Simone and Dimitar Kazakov. Using WordNet Similarity and Antonymy Relations to Aid Document Retrieval. Recent Advances in Natural Language Processing (RANLP 2005), September 2005, Borovets, Bulgaria.
[21] Chihli Hung, Stefan Wermter and Peter Smith, Hybrid Neural Document Clustering Using Guided Self-Organization and WordNet, IEEE Intelligent Systems, Vol. 19, No. 2, pp ,
[22] L. Jing, L. Zhou, M. Ng and J. Huang, Ontology-based Distance Measure for Text Clustering, SIAM Text Mining 2006 Workshop.
[23] Marta Sabou, Learning Web Service Ontologies: an Automatic Extraction Method and its Evaluation, Ontology Learning and Population (Editors: P. Buitelaar, P. Cimiano, B. Magnini), IOS Press, 2005
[24] Reuters Text Categorization Collection, s21578.html
