Clustering of Words Based on Relative Contribution for Text Categorization


Jie-Ming Yang, Zhi-Ying Liu, Zhao-Yang Qu

Abstract: Term clustering tries to group words based on a similarity criterion between words, so that the groups can be used as the dimensions of the vector space in text categorization. We propose two new similarity criteria, which consider the relative contribution of a feature to one category relative to the other categories, and the difference between the relative contributions of two features. We used the proposed methods in a hierarchical clustering algorithm and generated a compact and efficient representation of documents. The proposed methods are evaluated on three benchmark corpora (20-Newsgroups, Reuters-21578 and Industry Sector), combined with two classification algorithms (Support Vector Machines, K-Nearest Neighbor), and compared with three similarity measures (weighted average KL divergence, City-block, Euclidean). The experimental results indicate that the performance of the proposed methods is comparable with that of the other methods when Support Vector Machines is used; the proposed methods significantly outperform Euclidean and City-block and achieve comparable performance with weighted average KL divergence when the K-Nearest Neighbor classifier is used.

Index Terms: term clustering, similarity measure, text categorization, relative contribution

I. INTRODUCTION

Automatic document categorization, which assigns predefined categories to a new text document, is an important tool for people to organize the vast amount of digital data on the Internet [1].

This research is supported by the National Natural Science Foundation of China under Grant no. Jie-Ming Yang is with the College of Information Engineering, Northeast Dianli University, Jilin, Jilin, China (e-mail: yjmlzy@gmail.com). Zhi-Ying Liu is with the College of Information Engineering, Northeast Dianli University, Jilin, Jilin, China (e-mail: yjmlzy@163.com). Zhao-Yang Qu is with the College of Information Engineering, Northeast Dianli University, Jilin, Jilin, China (e-mail: qzywww@mail.nedu.edu.cn).
Raw documents cannot be directly fed into a classifier, so they must be transformed into a uniform representation based on their content [2]. Many experiments show that the document representation approach based on the bag of words (BOW) is a well-established one [2]. A major characteristic of text categorization is the high dimensionality of the feature vector space, which can reach tens or hundreds of thousands of terms for even a moderate-sized dataset [3, 4]. This is a big hurdle in applying many sophisticated learning algorithms to text categorization [5]. Other major characteristics of text categorization are the high level of feature redundancy, feature irrelevance [4] and the sparseness problem [6]. These characteristics not only hinder the classification process and hurt the performance of the classifier but also bring about over-fitting. Because of this, dimensionality reduction is used to reduce the size of the feature vector space [2]. Term clustering is one of the dimensionality reduction methods [2, 7]. It can create a new, reduced-size feature space by grouping words with high similarity [1], so that many words can be replaced with the centroid or representative feature of the corresponding word cluster. There are three key benefits of term clustering: (1) features that have correlations with the class labels assigned to documents are merged into a new feature in the reduced-size feature space; (2) term clustering can result in higher classification accuracy; (3) term clustering can provide a good solution to the sparseness problem and generate extremely compact representations [1, 6, 7]. The crucial stage of term clustering is how to measure the similarity of the terms [8]. The measurement of the similarity between two elements is essential to most clustering procedures. Based on the similarity, one can decide which clusters should be combined or where a cluster should be split. McGill [9] listed more than 60 different similarity functions. The quality of clustering depends on whether the similarity metric is appropriate or not.
There are many similarity measures which are widely used in

clustering, such as weighted average KL divergence [1], the Euclidean metric [10, 11], and the City-block metric [12]. However, not all proximity measures are applicable in every environment [13]. Two new similarity measures are proposed for term clustering in this paper. The contribution of a feature to one category is represented by the sum of its term frequencies in that category. The relative contribution of a feature occurring in one category relative to the other categories is considered. If the relative contributions of one feature occurring in each category are the same as those of another feature, there exists the highest similarity between these two features in terms of contribution to categorization. The difference between the relative contributions of two features is regarded as a new similarity measure. To evaluate the proposed methods, we used two classification algorithms, Support Vector Machines (SVM) and K-Nearest Neighbor (KNN), on three benchmark text corpora (20-Newsgroups, Reuters-21578 and Industry Sector) and compared them with three well-known similarity measures (weighted average KL divergence, City-block, Euclidean). The experiments show that the performance of the proposed methods is comparable with that of the others when Support Vector Machines is used; the proposed methods significantly outperform Euclidean and City-block and achieve comparable performance with weighted average KL divergence when the K-Nearest Neighbor classifier is used.

II. RELATED WORK

Much study has been devoted to word clustering for text categorization in recent years. In this section we review the results which are most relevant to our study. Word clustering was first investigated and used in text categorization by Lewis [14]. Lewis used reciprocal nearest neighbor (RNN) clustering for clustering terms. A reciprocal nearest neighbor pair consists of two items, each of which is the nearest neighbor of the other according to the similarity measure. Lewis chose a probabilistic approach to text categorization, and his results were inferior to those obtained by word indexing.
Baker & McCallum [1] introduced distributional clustering to document classification with a Naïve Bayes classifier. Differing from other similarity metrics, distributional clustering calculates the probability distribution over the classes induced by the different words to be clustered. The Kullback-Leibler divergence, which is an information-theoretic measure, was used to measure the difference between two probability distributions. Baker and McCallum found that distributional clustering is better than feature selection with regard to preserving the information contained in redundant features. The experimental results showed that categorization based on word clustering can maintain good performance while keeping a significantly more compact representation.

III. ALGORITHM DESCRIPTION

A. Motivation

Baker and McCallum [1] considered that the probability distribution of a particular word w_t over the classes C can be described as P(C|w_t). When words w_t and w_s are clustered together, the new distribution is the weighted average of the distributions of w_t and w_s. Table I lists the term frequencies of three features in 10 categories and the class probability distributions for these features. The numbers in parentheses are the class probability distributions. The class distribution curves of the three features are shown in Fig 1. The horizontal axis is the list of class labels. The vertical axis indicates the probability of a term over each class. Each curve in Fig 1 can be interpreted as the probability distribution of a feature over the classes. It can be seen from Fig 1 that the shape of the probability distribution of feature 1 is quite similar to that of feature 2 and dissimilar to that of feature 3. Thus feature 1 and feature 2 can be clustered together according to the idea of distributional clustering. The probability distributions of words over the classes are used as the similarity metric in distributional clustering of words [1].
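The merge step described above can be sketched as follows. This is a minimal illustration with hypothetical word probabilities and class distributions; the function names `merged_distribution` and `kl_divergence` are illustrative, not from the original papers.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two class distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

def merged_distribution(p_wt, p_ws, dist_t, dist_s):
    """Weighted average of two word-class distributions, i.e. the class
    distribution of the cluster formed by merging the two words."""
    w = p_wt + p_ws
    return (p_wt / w) * np.asarray(dist_t) + (p_ws / w) * np.asarray(dist_s)

# Hypothetical word probabilities and class distributions over 3 classes:
dist_t = np.array([0.7, 0.2, 0.1])
dist_s = np.array([0.6, 0.3, 0.1])
merged = merged_distribution(0.02, 0.01, dist_t, dist_s)  # still sums to 1
```

Because the merged distribution is a convex combination of the two word distributions, it remains a valid probability distribution.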
Inspired by the probability distributions of one word over the classes, we believe that the differences in the term frequencies of a feature occurring in the various categories can be regarded as the level of the contribution of the feature to each class in the context of document classification. The term frequencies of a feature occurring in the various categories are points in a two-dimensional space consisting of term frequency and class label. Similar to the class probability distribution curve in distributional clustering, the term frequencies of a feature occurring in the various categories are joined to form a contribution line in the term-frequency-against-class-label plane. The contribution curves of the three features listed in Table I are shown in Fig 2. The horizontal axis has ticks for the list of class labels. The vertical axis

indicates the term frequencies of a feature occurring in the various classes. It can be seen from Fig 2 that the differences among the relative contributions of the three features are equal to zero if their positions in the two-dimensional space are not considered. In respect of document classification, the relative contributions of the three features to each class are identical. Therefore, these features should be clustered together.

TABLE I
THE TERM FREQUENCIES AND THE PROBABILITIES OF THREE FEATURES OVER 10 CATEGORIES.
          C1           C2           C3           C4           C5           C6           C7           C8           C9           C10
Feature 1 115(0.2174)  103(0.1947)  104(0.1966)  200(0.3781)  190(0.3592)  109(0.2060)  114(0.2155)  103(0.1947)  107(0.2023)  101(0.1909)
Feature 2 85(0.2191)   73(0.1881)   74(0.1907)   170(0.4381)  160(0.4124)  79(0.2036)   85(0.2191)   74(0.1907)   75(0.1933)   73(0.1881)
Feature 3 15(0.0605)   3(0.0121)    4(0.0161)    100(0.4032)  90(0.3629)   9(0.0363)    15(0.0605)   4(0.0161)    5(0.0202)    3(0.0121)

Fig 1. The probability distribution curves of features 1, 2 and 3 over the various categories.
Fig 2. The contribution curves of features 1, 2 and 3 for classification.
Fig 3. Two pairs of contribution curves; one pair is similar and the other is dissimilar.

B. Similarity measurement

It is assumed that the contribution curves of two features can be described as two vectors X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n}; x_i and y_i, i ∈ [1, n], n ≥ 2, are the term frequencies of the two features occurring in category c_i, respectively. We believe that if the distances (d_i = x_i − y_i) between corresponding elements of the two vectors are equal or approximately equal, the relative contributions of the two features to classification should be similar. Fig 3 shows two pairs of contribution curves. d_i, i ∈ [1, 10], is the distance between corresponding elements of the two vectors. The two contribution curves drawn in Fig 3(a) are similar to each other because the distances between corresponding elements of the two vectors are equal; the two curves drawn in Fig 3(b) are not similar to each other because d_4 and d_5 are significantly greater than the others. In this paper, we propose two approaches to measure the similarity between the contribution curves of two features. In the first approach, we use the variance of the differences between corresponding elements of the two vectors to measure the similarity between two contribution curves; this is called Relative Contribution 1. The smaller the variance is, the more similar the relative contributions of the two features are. Specifically, if the variance is equal to zero, the relative contributions of the two features are identical. The similarity between two vectors X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n} corresponding to two features is measured as follows:

d_1(X, Y) = (1/n) * Σ_{i=1}^{n} ((x_i − y_i) − m)^2,  where m = (1/n) * Σ_{j=1}^{n} (x_j − y_j)

In the second approach, we use the difference between two adjacent elements of a vector as the relative contribution of those elements to categorization; this is called Relative Contribution 2. We first calculate the differences between adjacent elements within each vector, and then calculate the difference between the two vectors.
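The two proposed measures can be sketched as follows, assuming Relative Contribution 1 is the variance of the element-wise differences and Relative Contribution 2 averages the absolute differences between adjacent-element slopes, as described above. The function names are illustrative.

```python
import numpy as np

def relative_contribution_1(x, y):
    """Variance of the element-wise differences between two term-frequency
    vectors; 0 means the relative contributions are identical."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean((d - d.mean()) ** 2))

def relative_contribution_2(x, y):
    """Mean absolute difference between the adjacent-element differences
    (slopes) of two term-frequency vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.abs(np.diff(x) - np.diff(y))))

# Two curves that differ only by a constant vertical shift have
# distance 0 under both measures:
x = np.array([115.0, 103.0, 104.0, 200.0, 190.0])
y = x - 30.0
```

Note that both measures are invariant to a constant shift of one curve, which matches the paper's intuition that only the shape of the contribution curve matters, not its absolute position.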
The similarity between two vectors X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n} corresponding to two features is measured as follows:

d_2(X, Y) = (1/(n−1)) * Σ_{i=1}^{n−1} |(x_{i+1} − x_i) − (y_{i+1} − y_i)|

C. Document representation

After all terms are grouped into K clusters, every document in the training set and test set is represented as a vector in which the words occurring in the document are mapped into the K clusters. For instance, Table II lists the words and their term frequencies in one document, and Table III indicates 10 clusters created from them. In our experiment, the document is represented as a vector whose element values are the sums of the term frequencies of the terms included in the same cluster. Thus, the document listed in Table II can be represented by {6, 13, 6, 1, 7, 2, 1, 2, 3, 2}.

IV. EXPERIMENTAL SETUP

A. Datasets and preprocessing

The document data that is to be used for term clustering is first converted into a term list. In the experiment, stop-words are removed and a stemmer program reduces the different forms of the same word root [15]. We used a stop-word list which contains 571 words. For the stemmer program, we used Porter's stemming algorithm. The details are as follows:
1. Identify all words in the training set and make a list of terms.
2. Remove stop-words and apply the stemming algorithm to each word in the list.
3. Build a matrix M: each column (j) is a category in the training set, and each row (i) is a term in the list. M(i, j) is the sum of the term frequencies of the i-th term occurring in the j-th category.

In order to evaluate the performance of the proposed methods, three benchmark datasets - 20-Newsgroups, Reuters-21578 and Industry Sector - were used in our study. The 20-Newsgroups corpus was collected by Ken Lang (1995) and has become one of the standard corpora for text categorization. It contains newsgroup postings, evenly assigned to 20 different UseNet groups. In our study, we only consider three categories: talk.politics.guns, talk.politics.mideast and talk.politics.misc.
We ignore the UseNet header and only consider the content of the document when tokenizing the document.
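The preprocessing steps above can be sketched as follows. The stop-word set and the suffix-stripping `stem` below are simplified stand-ins for the 571-word list and the Porter stemmer used in the paper.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "and", "to"}  # stand-in for the 571-word list

def stem(word):
    """Placeholder for the Porter stemmer assumed by the paper."""
    return word[:-1] if word.endswith("s") else word

def build_term_category_matrix(docs):
    """docs: list of (category, text) pairs. Returns a nested mapping where
    M[term][category] is the summed term frequency of the term in that
    category, i.e. the matrix M(i, j) of preprocessing step 3."""
    M = defaultdict(Counter)
    for category, text in docs:
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue
            M[stem(token)][category] += 1
    return M

# Two tiny hypothetical documents:
docs = [("grain", "wheat exports and wheat prices"),
        ("crude", "oil prices of crude shipments")]
M = build_term_category_matrix(docs)
```

Each row of M is exactly the term-frequency-per-category vector on which the relative-contribution measures operate.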

TABLE II
LIST OF WORDS AND THEIR TERM FREQUENCIES IN ONE DOCUMENT.
word        number   word         number   word      number
hong        2        share        3        own       1
kong        2        total        1        fund      1
firm        2        outstanding  1        publicly  1
stake       3        common       2        held      1
washington  1        stock        2        zealand   1
industrial  2        filing       1        company   1
equity      2        security     1        bought    2
pacific     1        exchange     1        disclose  1
investment  2        commission   1        earlier   1
raise       1        principally  1        month     1

TABLE III
LIST OF 10 CLUSTERS CREATED FROM THE WORDS OCCURRING IN TABLE II. THE NUMBER IN PARENTHESES IS THE SUM OF THE TERM FREQUENCIES OF THE WORDS IN THE SAME CLUSTER.
C-1(6)  C-2(13)   C-3(6)   C-4(1)       C-5(7)   C-6(2)      C-7(1)  C-8(2)      C-9(3)   C-10(2)
hong    stake     company  outstanding  common   month       total   disclose    held     principally
kong    exchange  own      publicly     earlier  security    raise   commission  zealand  investment
firm    pacific   filing   washington   fund     industrial  share   stock       bought   equity

The Reuters-21578 corpus contains stories taken from the Reuters newswire. The stories are non-uniformly divided into 135 categories. In this paper, we only consider the top 10 categories: Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Wheat, Ship and Corn. There are 9982 documents in the top 10 categories. The Industry Sector corpus, made available by Market Guide Inc., consists of company web pages classified in a hierarchy of industry sectors [16]. There are 6440 web pages attached to 12 root classes and then hierarchically partitioned into 71 classes. In this paper, we only consider 7 root classes: capital.goods.sector, conglomerates.industry, consumer.non-cyclical.sector, energy.sector, healthcare.sector, transportation.sector and utilities.sector.

B. Clustering algorithm

Clustering [17, 18], a method of unsupervised learning, is a common technique for statistical data analysis applied in many fields, such as machine learning, data mining, pattern recognition and information processing. There exist many different approaches to clustering, such as Hierarchical Clustering, Nearest Neighbor Clustering, K-means Clustering [19], Expectation Maximization, and so on.
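The cluster-based document representation of Section III-C can be sketched as follows. The word-to-cluster assignment here is a small hypothetical example, not the assignment of Table III.

```python
from collections import Counter

def document_vector(tokens, word_to_cluster, k):
    """Represent a document as a k-dimensional vector whose j-th element is
    the summed term frequency of the document's words assigned to cluster j.
    Words outside the clustered vocabulary are ignored."""
    vec = [0] * k
    for word, freq in Counter(tokens).items():
        if word in word_to_cluster:
            vec[word_to_cluster[word]] += freq
    return vec

# Hypothetical assignment of a few words to 3 clusters:
word_to_cluster = {"hong": 0, "kong": 0, "stake": 1, "share": 1, "company": 2}
tokens = ["hong", "kong", "hong", "stake", "share", "share", "company"]
vec = document_vector(tokens, word_to_cluster, 3)  # [3, 3, 1]
```

This is the mapping that turns the 30 distinct words of Table II into the 10-dimensional vector {6, 13, 6, 1, 7, 2, 1, 2, 3, 2}.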
The partitions generated by hierarchical clustering algorithms are more versatile than those generated by other clustering algorithms [20]. In cluster-based document retrieval, hierarchical clustering algorithms have performed better than other clustering algorithms [20]. The results of hierarchical clustering are usually presented in a dendrogram, which represents the nested grouping of the data and the similarity levels at which groupings change [20]. In hierarchical clustering, the number of clusters need not be known in advance and can be determined from the dendrogram according to the actual requirements [21]. The strategies of hierarchical clustering generally fall into two types. One is agglomerative hierarchical clustering, which begins with each element as a separate cluster and merges them into

larger clusters. The other is divisive hierarchical clustering, which begins with the whole set and proceeds to divide it into smaller clusters. In this paper, the agglomerative hierarchical clustering algorithm developed by Hoon, et al. [22] was adopted.

C. Similarity Measure

The similarity measures which are compared with the proposed methods are detailed in this section. The KL divergence to the mean, which is the average of the KL divergences of each distribution, is used in distributional clustering [1]. Baker and McCallum used a weighted average instead of the simple average. The weighted average KL divergence is defined as follows:

d(w_t, w_s) = P(w_t) * D(P(C|w_t) || P(C|w_t ∨ w_s)) + P(w_s) * D(P(C|w_s) || P(C|w_t ∨ w_s))

where P(w_t) and P(w_s) are the probabilities of the words w_t and w_s occurring in the training set, respectively; P(C|w_t) and P(C|w_s) are the contributions of the words w_t and w_s to classification, respectively; and P(C|w_t ∨ w_s) is the contribution to classification of the cluster in which the words w_t and w_s are combined. D(P(C|w_t) || P(C|w_t ∨ w_s)) is the measure of the inefficiency that occurs when messages are sent according to one distribution, P(C|w_t), but encoded with a code that is optimal for a different distribution, P(C|w_t ∨ w_s).

The City-block metric, also known as the Manhattan distance, is the sum of the distances along each element of the vectors X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n}:

d(X, Y) = Σ_{i=1}^{n} |x_i − y_i|

The Euclidean metric is one of the most common types of distance. It is the geometric distance in a multidimensional space. The Euclidean distance between two points X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n} in n-dimensional space can be computed as follows:

d(X, Y) = sqrt(Σ_{i=1}^{n} (x_i − y_i)^2)

D. Classifiers

Many classifiers have been used in text categorization in recent years, such as Naïve Bayes (NB), K-Nearest Neighbor (KNN), Support Vector Machines (SVM) [23], Rocchio, and so on. Compared with the state-of-the-art methods, Support Vector Machines is a highly efficient classifier in text categorization. There is much empirical support for using Support Vector Machines for text categorization [6, 24, 25].
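The three baseline measures can be sketched as follows. The function names are illustrative, and the merged distribution P(C|w_t ∨ w_s) is taken as the probability-weighted average of the two word distributions, as described in Section III-A.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D(p || q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

def weighted_average_kld(p_wt, p_ws, dist_t, dist_s):
    """Weighted average KL divergence between two words' class distributions
    and the class distribution of their merged cluster."""
    dist_t, dist_s = np.asarray(dist_t, float), np.asarray(dist_s, float)
    merged = (p_wt * dist_t + p_ws * dist_s) / (p_wt + p_ws)
    return p_wt * kl(dist_t, merged) + p_ws * kl(dist_s, merged)

def city_block(x, y):
    """Manhattan distance between two term-frequency vectors."""
    return float(np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float))))

def euclidean(x, y):
    """Euclidean distance between two term-frequency vectors."""
    return float(np.sqrt(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)))
```

Note that the weighted average KL divergence operates on class-conditional probability distributions, while City-block and Euclidean operate directly on the raw term-frequency vectors.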
Published results show that KNN is quite effective and that its performance is comparable to that obtained by SVM [2]. In this paper, we use SVM and KNN to compare the performance of the various clustering methods. Support Vector Machines is based on the structural risk minimization principle from computational learning theory; it was originally developed by Drucker, et al. [26] and applied to text categorization by Joachims [3]. Since Joachims [3] argued that most text categorization problems are linearly separable, the linear kernel for SVM is selected. In this study, we use the LIBSVM toolkit [27]. C-SVM [28, 29] is selected and the penalty parameter C is 1. KNN [30, 31] is a simple machine learning algorithm that makes its decision depending on the majority category labels attached to the k training documents which are most similar to the test object; it is a type of instance-based classifier or lazy learner, since no decision is made until all the objects in the training set are scanned [2]. We used k = 30 in this experiment, and the cosine distance was used as the measure of document similarity.

E. Evaluations

Classification effectiveness in text categorization is usually measured in terms of precision (P) and recall (R) [2], which were originally defined for binary classification [24]. To compute averaged estimates in the multiclass classification context, the micro-averaging and macro-averaging methods are used [25]. The micro-averaging measures are computed as follows:

P_micro = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FP_i)

R_micro = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FN_i)

F1_micro = 2 * P_micro * R_micro / (P_micro + R_micro)
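Given per-category (TP, FP, FN) counts, the two averaged F1 scores can be computed as in the following sketch. The function name is illustrative; macro averaging takes the mean of the per-category precisions and recalls.

```python
def micro_macro_f1(confusions):
    """confusions: list of (TP, FP, FN) tuples, one per category.
    Returns (micro_F1, macro_F1)."""
    # Micro: pool the counts over all categories, then compute P, R, F1.
    tp = sum(c[0] for c in confusions)
    fp = sum(c[1] for c in confusions)
    fn = sum(c[2] for c in confusions)
    p_micro = tp / (tp + fp)
    r_micro = tp / (tp + fn)
    micro = 2 * p_micro * r_micro / (p_micro + r_micro)

    # Macro: average per-category precision and recall, then compute F1.
    ps = [t / (t + f) if t + f else 0.0 for t, f, _ in confusions]
    rs = [t / (t + n) if t + n else 0.0 for t, _, n in confusions]
    p_macro = sum(ps) / len(confusions)
    r_macro = sum(rs) / len(confusions)
    macro = 2 * p_macro * r_macro / (p_macro + r_macro)
    return micro, macro

micro_f1, macro_f1 = micro_macro_f1([(80, 10, 5), (40, 20, 30)])
```

Micro averaging weights every document equally, so it is dominated by the large categories; macro averaging weights every category equally.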

The macro-averaging measures are computed as follows:

P_macro = (1/|C|) * Σ_{i=1}^{|C|} P_i

R_macro = (1/|C|) * Σ_{i=1}^{|C|} R_i

F1_macro = 2 * P_macro * R_macro / (P_macro + R_macro)

where TP_i is the number of documents which are correctly classified into category c_i; FP_i is the number of documents which are misclassified into category c_i; FN_i is the number of documents which belong to category c_i but are misclassified into other categories; and |C| is the number of categories. The accuracy, which is defined as the percentage of correctly labeled documents in the test set, is also widely used in text categorization [6, 15, 32-34]. The formula for accuracy is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F. Validation

In order to validate the text representation method based on the word clusters created by the proposed methods, 5x2-fold cross validation [35] is used in this paper. We perform 5 replications of 2-fold cross validation. In each replication, the documents in the corpus are randomly partitioned into two equal-sized subsets (A and B). The learning algorithm is trained on subset A and tested on subset B, and then trained on subset B and tested on subset A. Thus 10 performance estimates are produced by 5x2-fold cross validation.

V. RESULTS

A. Results on 20-Newsgroups corpus

Table IV and Table V show the micro F1 measures of Support Vector Machines and K-Nearest Neighbor when the five similarity measures are used in the hierarchical clustering algorithm on 20-Newsgroups. It can be seen from Table IV that the proposed methods outperform the other similarity measures in terms of micro F1 when the number of clusters is 100, 200 or 300. Table V shows that the performance of Relative Contribution 1 is superior to that of the other similarity measures when the number of clusters is 800, 900 or 1000. The macro F1 measures of Support Vector Machines and K-Nearest Neighbor when the five similarity measures are used in the hierarchical clustering algorithm on 20-Newsgroups are listed in Table VI and Table VII.
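The 5x2-fold cross-validation protocol of Section IV-F can be sketched as follows. `train_and_test` is a caller-supplied function returning one performance estimate; the name is illustrative.

```python
import random

def five_by_two_cv(documents, train_and_test, seed=0):
    """5 replications of 2-fold cross validation: in each replication the
    corpus is randomly split into two equal halves; the learner is trained
    on A and tested on B, then trained on B and tested on A.
    Returns the list of 10 performance estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(5):
        docs = documents[:]
        rng.shuffle(docs)
        half = len(docs) // 2
        a, b = docs[:half], docs[half:]
        estimates.append(train_and_test(a, b))  # train on A, test on B
        estimates.append(train_and_test(b, a))  # train on B, test on A
    return estimates

# Dummy learner that just reports the size of its test fold:
estimates = five_by_two_cv(list(range(10)), lambda train, test: len(test))
```

The 10 estimates can then be averaged, or fed into Dietterich's 5x2cv significance test [35].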
It can be seen from Table VI that the proposed methods outperform the other similarity measures in terms of macro F1 when the number of clusters is 100, 200 or 300. Table VII shows that the performance of Relative Contribution 1 is superior to that of City-block and Euclidean except when the number of clusters is 500 or 600. Moreover, the performance of Relative Contribution 1 is superior to that of WeightedKLD when the number of clusters is 800, 900 or 1000. Fig 4 shows the accuracy curves of SVM and KNN when the five similarity measures are used on 20-Newsgroups. It can be seen from Fig 4(a) that the proposed methods achieve better performance with fewer clusters. Fig 4(b) indicates that the accuracy of the proposed methods is superior to that of the other similarity measures when the number of clusters is greater than 700.

B. Results on Reuters-21578 corpus

Table VIII and Table IX show the micro F1 measures of Support Vector Machines and K-Nearest Neighbor when the five similarity measures are used in the hierarchical clustering

TABLE IV
THE MICRO F1 OF SUPPORT VECTOR MACHINES BASED ON 20-NEWSGROUPS. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

algorithm on Reuters-21578. Table X and Table XI show the macro F1 measures of Support Vector Machines and K-Nearest Neighbor when the five similarity measures are used in the hierarchical clustering algorithm on Reuters-21578. Fig 5 draws the accuracy curves of SVM and KNN when the five similarity measures are used in hierarchical clustering on Reuters-21578. It can be seen from Tables VIII-XI and Fig 5 that the performance of the proposed measures is inferior only to that of WeightedKLD and superior to that of City-block and Euclidean.

TABLE V
THE MICRO F1 OF K-NEAREST NEIGHBOR BASED ON 20-NEWSGROUPS. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

TABLE VI
THE MACRO F1 OF SUPPORT VECTOR MACHINES BASED ON 20-NEWSGROUPS. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

TABLE VII
THE MACRO F1 OF K-NEAREST NEIGHBOR BASED ON 20-NEWSGROUPS. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

Fig 4. The accuracy curves of Support Vector Machines and K-Nearest Neighbor on 20-Newsgroups.

TABLE VIII
THE MICRO F1 OF SUPPORT VECTOR MACHINES BASED ON REUTERS-21578. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

TABLE IX
THE MICRO F1 OF K-NEAREST NEIGHBOR BASED ON REUTERS-21578. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

TABLE X
THE MACRO F1 OF SUPPORT VECTOR MACHINES BASED ON REUTERS-21578. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

TABLE XI
THE MACRO F1 OF K-NEAREST NEIGHBOR BASED ON REUTERS-21578. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

Fig 5. The accuracy curves of Support Vector Machines and K-Nearest Neighbor on Reuters-21578.

C. Results on Industry Sector corpus

Table XII and Table XIII show the micro F1 measures of Support Vector Machines and K-Nearest Neighbor when the five similarity measures are used in the hierarchical clustering algorithm on the Industry Sector corpus. Table XIV and Table XV show the macro F1 measures of Support Vector Machines and K-Nearest Neighbor when the five similarity measures are used in the hierarchical clustering algorithm on the Industry Sector corpus. It can be seen from Table XII and Table XIV that the performance of SVM combined with Relative Contribution 2 is superior to that of the other similarity measures when the number of clusters is 900 or 1000. Table XIII and Table XV indicate that the performance of KNN combined with Relative Contribution 1 is superior to that of the other similarity measures when the number of clusters is 800 or 900. Fig 6 shows the accuracy of SVM and KNN when the five similarity measures are used in hierarchical clustering on the Industry Sector corpus. Fig 6(a) indicates that the accuracy curve of SVM combined with the proposed method is higher than those of the other methods when the number of clusters is 600, 700 or 800. Fig 6(b) shows that the accuracy curve of KNN combined with Relative Contribution 1 is higher than those of the other methods when the number of clusters is 700, 800 or 900.

TABLE XII
THE MICRO F1 OF SUPPORT VECTOR MACHINES BASED ON INDUSTRY-SECTOR. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

TABLE XIII
THE MICRO F1 OF K-NEAREST NEIGHBOR BASED ON INDUSTRY-SECTOR. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

TABLE XIV
THE MACRO F1 OF SUPPORT VECTOR MACHINES BASED ON INDUSTRY-SECTOR. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

TABLE XV
THE MACRO F1 OF K-NEAREST NEIGHBOR BASED ON INDUSTRY-SECTOR. Rows: Relative Contribution 1, Relative Contribution 2, WeightedKLD, Cityblock, Euclidean.

Fig 6. The accuracy curves of Support Vector Machines and K-Nearest Neighbor on Industry-Sector.

VI. ANALYSIS AND DISCUSSION

Evaluating clustering results to find the partitioning that best fits the underlying data is one of the most important issues in cluster analysis [36, 37]. Many cluster validation indices have been proposed in the literature to validate the quality of clusters [38]. Davies and Bouldin [39] presented a validation index which can infer the appropriateness of various divisions of the data. The Davies-Bouldin index has a low computational complexity, so we chose it to validate the quality of the clusters produced by the proposed methods. Its formula is as follows:

DB = (1/M) * Σ_{i=1}^{M} max_{j=1,...,M; j≠i} ((σ_i + σ_j) / d(c_i, c_j))

where M is the number of clusters; σ_i (σ_j) is the average distance of all elements in cluster i (j) to their cluster center c_i (c_j); and d(c_i, c_j) is the distance between the cluster centers c_i and c_j. The range of DB is [0, ∞), and a lower value indicates a better clustering [38]. Table XVI lists the Davies-Bouldin indices of the clusters when the various similarity measures are used and all words in the feature vector space are clustered into k clusters, where k is equal to 100, 200, 300, 400, 500, 600, 700, 800, 900 or 1000. Because their values are too large to display, the Davies-Bouldin indices of the clusters generated by City-block and Euclidean are not listed in Table XVI. It can be seen from Table XVI that the quality of the clusters generated by the proposed methods is the best. However, there exists a problem: the quality of a term clustering is not always consistent with the performance of the classifiers based on it. We ran our experiments on an Intel Core2 Q6600 PC with 2 GB RAM under Windows XP. Due to the limitation of memory, the size of the feature space had to be less than 25000; otherwise, the clustering software reported that the memory was insufficient for clustering.
In our experiments, we chose a part of each corpus (20-Newsgroups, Reuters-21578 and Industry Sector) to ensure that the size of the feature space was less than 25000. Clustering based on the proposed methods ran for about one hour.

TABLE XVI
THE DAVIES-BOULDIN INDICES OF THE CLUSTERS GENERATED BY VARIOUS SIMILARITY MEASURES ON THREE TEXT CORPORA. Rows: 20-Newsgroups (Relative Contribution, weightedKLD), Reuters-21578 (Relative Contribution, weightedKLD), Industry Sector (Relative Contribution, weightedKLD).
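The Davies-Bouldin index defined above can be computed as in the following sketch, assuming Euclidean distances for both the within-cluster scatter σ and the center separation d(c_i, c_j).

```python
import numpy as np

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a clustering; lower values indicate better
    clustering. points: (n, d) array; labels: one cluster id per point."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    # Cluster centers and average within-cluster distances (sigma).
    centers = np.array([points[labels == i].mean(axis=0) for i in ids])
    sigma = np.array([
        np.mean(np.linalg.norm(points[labels == i] - centers[k], axis=1))
        for k, i in enumerate(ids)])
    m = len(ids)
    total = 0.0
    for i in range(m):
        # Worst-case (largest) scatter-to-separation ratio for cluster i.
        ratios = [(sigma[i] + sigma[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in range(m) if j != i]
        total += max(ratios)
    return total / m

# Two tight, well-separated clusters give a small index:
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
db = davies_bouldin(points, [0, 0, 1, 1])  # 0.1
```

In the paper's setting the "points" are the term vectors (rows of the matrix M) and the labels are the cluster assignments produced by the hierarchical clustering.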

VII. CONCLUSION

We proposed two new similarity measures which use the relative contribution of a feature to the categories as the measurement criterion. The proposed similarity metrics are used in term clustering for text categorization and can reduce the feature space by one to three orders of magnitude while losing only a few percent in classification performance. We evaluated the proposed methods on three benchmark corpora (20-Newsgroups, Reuters-21578 and Industry Sector), using two classification algorithms (Support Vector Machines, K-Nearest Neighbor), and compared them with three well-known similarity measures (weighted average KL divergence, City-block, Euclidean). The results show that the performance of the proposed methods is comparable with that of the other methods when Support Vector Machines is used; the proposed methods significantly outperform Euclidean and City-block, and achieve comparable performance with weighted average KL divergence when the K-Nearest Neighbor classifier is used. Moreover, the quality of the clusters generated by the proposed methods is the best. In the future, the relationship between the quality of a cluster and the performance of classifiers based on the cluster is a research focus for us.

REFERENCES

[1] L. D. Baker and A. K. McCallum, "Distributional clustering of words for text classification," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia.
[2] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, pp. 1-47.
[3] T. Joachims, "Text categorization with Support Vector Machines: learning with many relevant features," in Machine Learning: ECML-98, vol. 1398, C. Nédellec and C. Rouveirol, Eds., Springer Berlin / Heidelberg, 1998.
[4] C. Lee and G. G. Lee, "Information gain and divergence-based feature selection for machine learning-based text categorization," Information Processing & Management, vol. 42.
[5] D.
Fragoudis, et al., "Best terms: an efficient feature-selection algorithm for text categorization," Knowledge and Information Systems, vol. 8.
[6] R. Bekkerman, et al., "Distributional word clusters vs. words for text categorization," J. Mach. Learn. Res., vol. 3.
[7] A.-M. Hisham, "A new text categorization technique using distributional clustering and learning logic," IEEE Transactions on Knowledge and Data Engineering, vol. 18.
[8] N. Slonim and N. Tishby, "The power of word clusters for text classification," in Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research, Darmstadt, Germany.
[9] M. McGill, "An evaluation of factors affecting document ranking by information retrieval systems," Technical report, Syracuse University School of Information Studies.
[10] S. Ramaswamy and K. Rose, "Adaptive cluster distance bounding for high-dimensional indexing," IEEE Transactions on Knowledge and Data Engineering, vol. 23.
[11] P.-E. Danielsson, "Euclidean distance mapping," Computer Graphics and Image Processing, vol. 14.
[12] R. M. C. R. de Souza and F. d. A. T. de Carvalho, "Clustering of interval data based on city-block distances," Pattern Recognition Letters, vol. 25.
[13] J. D'Hondt, et al., "Pairwise-adaptive dissimilarity measure for document clustering," Information Sciences, vol. 180.
[14] D. D. Lewis, "An evaluation of phrasal and clustered representations on a text categorization task," in Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark.
[15] E. Youn and M. K. Jeong, "Class dependent feature scaling method using naive Bayes classifier for text datamining," Pattern Recognition Letters, vol. 30.
[16] A. McCallum, et al., "Improving text classification by shrinkage in a hierarchy of classes," in ICML-98.
[17] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31.
[18] Y. Zhao and G. Karypis, "Data clustering in life sciences," Molecular Biotechnology, vol. 31.
[19] J. Y.
B. Tan, et al., "Tme Seres Predcton usng Backpropagaton Network Optmzed by Hybrd K-means-Greedy Algorthm," Engneerng Letters, vol. 20, pp , [20] A. K. Jan, et al., "Data clusterng: a revew," ACM Comput. Surv., vol. 31, pp , [21] Y. Mn-Shen and W. Kuo-Lung, "A smlarty-based robust clusterng method," IEEE Transactons on Pattern Analyss and Machne Intellgence,, vol. 26, pp , [22] M. J. L. d. Hoon, et al., "Open source clusterng software," Bonformatcs, vol. 20, pp [23] I. Nurtano, et al., "Classfyng Cyst and Tumor Leson Usng Support Vector Machne Based on Dental Panoramc Images Texture Features," IAENG Internatonal Journal of Computer Scence, vol. 40, pp , [24] T. Joachms, Learnng to Classfy Text Usng Support Vector

13 Machnes: Methods, Theory and Algorthms: Kluwer Academc Publshers [25] H. Ogura, et al., "Feature selecton wth a measure of devatons from Posson n text categorzaton," Expert Systems wth Applcatons, vol. 36, pp , [26] H. Drucker, et al., "Support Vector Machnes for Spam Categorzaton," IEEE TRANSACTIONS ON NEURAL NETWORKS, vol. 10, pp , [27] C.-C. Chang and C.-J. Ln. (2001). LIBSVM : a lbrary for support vector machnes Avalable: [28] C.-C. Chang and C.-J. Ln, "Tranng v-support Vector Classfers: Theory and Algorthms," Neural Computaton, vol. 13, pp , [29] M. A. Davenport, et al., "Controllng False Alarms wth Support Vector Machnes," presented at the IEEE Internatonal Conference on Acoustcs, Speech, and Sgnal Processng (ICASSP), [30] T. Cover and P. Hart, "Nearest neghbor pattern classfcaton," IEEE Transactons on Informaton Theory, vol. 13, pp , [31] E. Uchno, et al., "IVUS-Based Coronary Plaque Tssue Characterzaton Usng Weghted Multple k-nearest Neghbor," Engneerng Letters, vol. 20, pp , [32] Y. H. L and A. K. Jan, "Classfcaton of Text Documents," The Computer Journal, vol. 41, pp , January 1, [33] M. Benkhalfa, et al., "Integratng External Knowledge to Supplement Tranng Data n Sem-Supervsed Learnng for Text Categorzaton," Informaton Retreval, vol. 4, pp , [34] A. Markov, et al., "The hybrd representaton model for web document classfcaton," Internatonal Journal of Intellgent Systems, vol. 23, pp , [35] T. G. Detterch, "Approxmate Sratstcal Tests for Comparng Supervsed Classfcaton Learnng Algorthms," Neural Computaton, vol. 10, pp , [36] M. Halkd, et al., "On Clusterng Valdaton Technques," Journal of Intellgent Informaton Systems, vol. 17, pp , [37] H. Alzadeh, et al., "A Framework for Cluster Ensemble Based on a Max Metrc as Cluster Evaluator," IAENG Internatonal Journal of Computer Scence, vol. 39, pp , [38] S. Günter and H. Bunke, "Valdaton ndces for graph clusterng," Pattern Recognton Letters, vol. 24, pp , [39] D. L. Daves and D. W. 
Bouldn, "A Cluster Separaton Measure," IEEE Transactons on Pattern Analyss and Machne Intellgence, vol. 1, pp , Je-Mng Yang receved hs MSc and PhD degree n computer appled technology from Northeast DanL Unversty, Chna n 2008 and Computer Scence and Technology from Jln Unversty, Chna n 2013, respectvely. Hs research nterests are n machne learnng and data mnng. Zh-Yng Lu receved her MSc degree n computer appled technology from Northeast DanL Unversty, Chna n Hs research nterests are n nformaton retreval and personalzed recommendaton. Zhao-Yang Qu receved hs MSc and PhD degree n computer scence from Dalan Unversty of Technology, Chna n 1988 and North Chna Electrc Power Unversty, Chna n 2012, respectvely. Hs research nterests are n artfcal ntellgence, machne learnng and data mnng.


12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

Audio Content Classification Method Research Based on Two-step Strategy

Audio Content Classification Method Research Based on Two-step Strategy (IJACSA) Internatonal Journal of Advanced Computer Scence and Applcatons, Audo Content Classfcaton Method Research Based on Two-step Strategy Sume Lang Department of Computer Scence and Technology Chongqng

More information

A Lazy Ensemble Learning Method to Classification

A Lazy Ensemble Learning Method to Classification IJCSI Internatonal Journal of Computer Scence Issues, Vol. 7, Issue 5, September 2010 ISSN (Onlne): 1694-0814 344 A Lazy Ensemble Learnng Method to Classfcaton Haleh Homayoun 1, Sattar Hashem 2 and Al

More information

Correlative features for the classification of textural images

Correlative features for the classification of textural images Correlatve features for the classfcaton of textural mages M A Turkova 1 and A V Gadel 1, 1 Samara Natonal Research Unversty, Moskovskoe Shosse 34, Samara, Russa, 443086 Image Processng Systems Insttute

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Yan et al. / J Zhejiang Univ-Sci C (Comput & Electron) in press 1. Improving Naive Bayes classifier by dividing its decision regions *

Yan et al. / J Zhejiang Univ-Sci C (Comput & Electron) in press 1. Improving Naive Bayes classifier by dividing its decision regions * Yan et al. / J Zhejang Unv-Sc C (Comput & Electron) n press 1 Journal of Zhejang Unversty-SCIENCE C (Computers & Electroncs) ISSN 1869-1951 (Prnt); ISSN 1869-196X (Onlne) www.zju.edu.cn/jzus; www.sprngerlnk.com

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

The Shortest Path of Touring Lines given in the Plane

The Shortest Path of Touring Lines given in the Plane Send Orders for Reprnts to reprnts@benthamscence.ae 262 The Open Cybernetcs & Systemcs Journal, 2015, 9, 262-267 The Shortest Path of Tourng Lnes gven n the Plane Open Access Ljuan Wang 1,2, Dandan He

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton We-Chh Hsu, Tsan-Yng Yu E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton

More information

Impact of a New Attribute Extraction Algorithm on Web Page Classification

Impact of a New Attribute Extraction Algorithm on Web Page Classification Impact of a New Attrbute Extracton Algorthm on Web Page Classfcaton Gösel Brc, Banu Dr, Yldz Techncal Unversty, Computer Engneerng Department Abstract Ths paper ntroduces a new algorthm for dmensonalty

More information

Visual Thesaurus for Color Image Retrieval using Self-Organizing Maps

Visual Thesaurus for Color Image Retrieval using Self-Organizing Maps Vsual Thesaurus for Color Image Retreval usng Self-Organzng Maps Chrstopher C. Yang and Mlo K. Yp Department of System Engneerng and Engneerng Management The Chnese Unversty of Hong Kong, Hong Kong ABSTRACT

More information

A fast algorithm for color image segmentation

A fast algorithm for color image segmentation Unersty of Wollongong Research Onlne Faculty of Informatcs - Papers (Arche) Faculty of Engneerng and Informaton Scences 006 A fast algorthm for color mage segmentaton L. Dong Unersty of Wollongong, lju@uow.edu.au

More information

Comparison Study of Textural Descriptors for Training Neural Network Classifiers

Comparison Study of Textural Descriptors for Training Neural Network Classifiers Comparson Study of Textural Descrptors for Tranng Neural Network Classfers G.D. MAGOULAS (1) S.A. KARKANIS (1) D.A. KARRAS () and M.N. VRAHATIS (3) (1) Department of Informatcs Unversty of Athens GR-157.84

More information