Web Document Classification Based on Fuzzy Association


Choochart Haruechaiyasak, Mei-Ling Shyu
Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33124, USA

Shu-Ching Chen
Distributed Multimedia Information System Laboratory, School of Computer Science, Florida International University, Miami, FL 33199, USA

Xiuqi Li
NSF/FAU Multimedia Laboratory, Florida Atlantic University, Boca Raton, FL 33431, USA
xl@cse.fau.edu

Abstract

In this paper, a method of automatically classifying Web documents into a set of categories using the fuzzy association concept is proposed. Using the same word or vocabulary to describe different entities creates ambiguity, especially in the Web environment where the user population is large. To solve this problem, fuzzy association is used to capture the relationships among different index terms or keywords in the documents, i.e., each pair of words has an associated value to distinguish itself from the others. Therefore, the ambiguity in word usage is avoided. Experiments using data sets collected from two Web portals, Yahoo! and Open Directory Project (dmoz.org), are conducted. We compare our approach to the vector space model with the cosine coefficient. The results show that our approach yields higher accuracy than the vector space model.

Keywords: Information Processing on the Web, Data Mining, Document Classification, Fuzzy Association.

1. Introduction

The World Wide Web (WWW) can be viewed as a distributed database system, but it differs from one in two respects. First, the WWW contains a much larger amount of data than a typical database system. It is often referred to as the world's largest distributed database system, with the amount of data growing at an exponential rate [17]. These data can be of heterogeneous types such as text, image, audio, and video. Second, the WWW involves a huge user population that is not restricted to a certain demographic group or geographic area. The result is a wide variation in information content and quality.
In addition, unlike a typical database system where the majority of users only retrieve information through queries, the WWW allows its users to provide and share information publicly on the system.

With the large amount of information available on the Web, searching for specific information or discovering useful information becomes a difficult and challenging task. To alleviate this problem, many data mining techniques have been applied in the Web context. This is referred to as Web mining [3], defined as the discovery and analysis of useful information from the WWW. Web mining techniques include analysis of user access patterns [10][14], Web document clustering [1][15], and classification [2][4][5][16].

Document classification, or text categorization (as used in the information retrieval context), is the process of assigning a document to a predefined set of categories based on the document content. Document classification can be applied as an information filtering tool and can also be used to improve the retrieval results from a query process. To help users search and browse for specific information on the Web, many of the well-known Web portals such as Yahoo! [21] have organized the information, in the form of Web documents, into predefined categories such as Arts & Humanities, Computers & Internet, and Entertainment. However, this approach of organizing Web documents requires human effort and hence is very subjective and does not scale well.

In this paper, a method of automatically classifying Web documents into a set of categories using the fuzzy association concept is proposed. Fuzzy association uses the concept of fuzzy set theory [18] to model the vagueness in the information retrieval process. Examples of research work involving the fuzzy association technique include [6], [7], [8], and [9]. The basic concept of fuzzy association involves the construction of a pseudothesaurus of keywords or index terms from a set of documents [7]. By constructing a pseudothesaurus, the relationship among different index terms or keywords in the documents is captured, i.e., each pair of words has an associated value to distinguish itself from other pairs of words. Therefore, the ambiguity in word usage is minimized.

Several studies have addressed document classification or text categorization. Some of them perform experiments using only a document set from a specific topic. For example, in [5], the document collection, Reuters, which is business related, is used in the experiments. Other research work such as [2], [4], and [16] focuses on Web documents. However, all of these studies use only a set of documents obtained from a single Web directory. For example, [2] and [16] use the Yahoo! directory as their data set, and [4] uses LookSmart's directory. As mentioned earlier, the process of organizing Web directories is based on human effort and can be very subjective. Therefore, in this paper, we apply our approach and perform our experiments using data sets collected from two different Web portals: Yahoo! [21] and Open Directory Project [19].

In general, when dealing with data in a high-dimensional space, the performance, in terms of storage space and execution time, can be greatly affected by the high dimension. This problem is generally known as the curse of dimensionality.
For a document data set, this problem also holds, since a document collection can contain millions of different index terms or keywords. A classical document clustering approach, the vector space model [13], which represents each document using an n-dimensional vector (where n is the number of keywords), also suffers from this problem. By using the fuzzy association technique in our approach, the dimension of the keyword representation for the categories can be reduced without much performance degradation. Also, the selection of different keywords in representing each category does not affect the performance as much as it does in the vector space approach.

The rest of the paper is organized as follows. In the next section, the concept of fuzzy association as applied in the area of information retrieval systems is introduced, and our proposed fuzzy classification model is described. In Section 3, the experimental results and discussions are given. The paper is concluded in Section 4.

2. Fuzzy association for document classification

In this section, we first review the concept of fuzzy association that has been applied in the area of information retrieval systems. Then we describe our classification model based on the fuzzy association concept in detail.

2.1 Fuzzy association in information retrieval

Fuzzy set theory [18] deals with the representation of classes whose boundaries are not well defined. The key idea is to associate a membership function with the elements of the class. This function takes values on the interval [0, 1], with 0 corresponding to no membership in the class and 1 corresponding to full membership. Membership values between 0 and 1 indicate marginal elements of the class. Thus, membership in a fuzzy set is a notion intrinsically gradual instead of abrupt or crisp (as in conventional Boolean logic).

The fuzzy associative information retrieval (IR) mechanism is formalized within fuzzy set theory and based on the definition of fuzzy association. It captures the association between the keywords to improve the retrieval results from traditional IR systems.
By providing the association between the keywords, additional documents that are not directly indexed by the keywords in the query can also be retrieved.

Definition 1. A fuzzy association between two finite sets $X = \{x_1, \ldots, x_u\}$ and $Y = \{y_1, \ldots, y_v\}$ is formally defined as a binary fuzzy relation $f: X \times Y \rightarrow [0, 1]$, where $u$ and $v$ are the numbers of elements in $X$ and $Y$, respectively.

The construction of the association between index terms or keywords is generally known as the generation of the fuzzy pseudothesaurus. In [7], a formal definition and process of generating a fuzzy pseudothesaurus based on co-occurrences of keywords is given. It can be summarized as follows.

Definition 2. Given a set of index terms, $T = \{t_1, \ldots, t_u\}$, and a set of documents, $D = \{d_1, \ldots, d_v\}$, each $t_i$ is represented by a fuzzy set $h(t_i)$ of documents; $h(t_i) = \{F(t_i, d_k) \mid d_k \in D\}$, where $F(t_i, d_k)$ is the significance (or membership) degree of $t_i$ in $d_k$.

Definition 3. The fuzzy related terms (RT) relation is based on the evaluation of the co-occurrences of $t_i$ and $t_j$ in the set $D$ and can be defined as follows.

$$RT(t_i, t_j) = \frac{\sum_{k} \min\big(F(t_i, d_k),\, F(t_j, d_k)\big)}{\sum_{k} \max\big(F(t_i, d_k),\, F(t_j, d_k)\big)}$$

In [9], a simplification of the fuzzy RT relation based on the co-occurrence of keywords is given as follows.

$$r_{i,j} = \frac{n_{i,j}}{n_i + n_j - n_{i,j}} \qquad \text{(Eq. 1)}$$

where $r_{i,j}$ represents the fuzzy RT relation between keywords $k_i$ and $k_j$, $n_{i,j}$ is the number of documents containing both the $i$-th and $j$-th keywords, $n_i$ is the number of documents including the $i$-th keyword, and $n_j$ is the number of documents including the $j$-th keyword. Next, the calculation of the fuzzy RT relation between keywords is applied in our classification model.

2.2 Fuzzy classification model

The process of classifying Web documents is explained in detail as follows. Given $C = \{C_1, C_2, \ldots, C_m\}$, a set of categories, where $m$ is the total number of categories, the first step is to collect the training sets of Web documents, $TD = \{TD_1, TD_2, \ldots, TD_m\}$, from each category in $C$. This step involves crawling through the hypertext links encapsulated in each document. Once the document collections are obtained, they are cleaned through the stemming and stopword removal processes. Next, the most frequently occurring keywords from the document sets of each category are extracted and put into separate keyword sets, $K = \{K_1, K_2, \ldots, K_m\}$. These $m$ sets of keywords are combined into a set of all keywords, $A = \{k_1, k_2, \ldots, k_n\}$, where $n$ is the total number of distinct keywords and represents the vector dimension. Note that some of the keywords can appear in more than one category, but we only consider one instance of these.

Then we generate the keyword correlation matrix $M$ using the fuzzy RT relation equation (given in Eq. 1). The keyword correlation matrix is an $n \times n$ symmetric matrix whose element $m_{i,j}$ has a value on the interval [0, 1], where 0 indicates no relationship and 1 indicates full relationship between the keywords $k_i$ and $k_j$. Therefore, $m_{i,j}$ is equal to 1 for all $i = j$, since a keyword has the strongest relationship to itself.
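The construction of the keyword correlation matrix can be sketched in a few lines of Python. This is a minimal illustration of Eq. 1, not the authors' implementation: the function name `fuzzy_rt_matrix`, the representation of each document as a set of already stemmed keywords, and the choice to return 0 when neither keyword occurs in any document are all assumptions made for the example.

```python
from itertools import combinations


def fuzzy_rt_matrix(documents, keywords):
    """Build the n x n keyword correlation matrix from Eq. 1.

    documents: iterable of keyword collections, one per document (assumed
               to be cleaned by stemming and stopword removal already).
    keywords:  list of the n distinct keywords (the set A in the text).
    """
    index = {kw: i for i, kw in enumerate(keywords)}
    n = len(keywords)
    doc_count = [0] * n                        # n_i: documents containing k_i
    pair_count = [[0] * n for _ in range(n)]   # n_{i,j}: documents with both

    for doc in documents:
        present = sorted(index[kw] for kw in set(doc) if kw in index)
        for i in present:
            doc_count[i] += 1
        for i, j in combinations(present, 2):
            pair_count[i][j] += 1
            pair_count[j][i] += 1

    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # a keyword is fully related to itself
        for j in range(i + 1, n):
            union = doc_count[i] + doc_count[j] - pair_count[i][j]
            # Assumption: define r_{i,j} = 0 when neither keyword occurs.
            r = pair_count[i][j] / union if union else 0.0
            matrix[i][j] = matrix[j][i] = r
    return matrix
```

For instance, with three toy documents `{"fuzzy", "set"}`, `{"fuzzy", "web"}`, and `{"web"}`, the pair (fuzzy, set) co-occurs in one document out of the two that contain either keyword, so its relation value is 0.5.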
To classify the documents in the test data set into different categories, each category must first be represented with a set of keywords. The best way to represent each category is to select only the exclusive keywords, i.e., for category $C_j$, we consider the keywords in $K_j$ that do not belong to any other keyword set $K_l$, where $l = 1, \ldots, m$ and $l \neq j$. We refer to these as the category keyword sets, $CK = \{CK_1, CK_2, \ldots, CK_m\}$. Next, the test documents in the test data set are cleaned and the keywords are extracted by looking them up in $A$, the list of all keywords. This process gives us the representation of the test documents, $D = \{d_1, d_2, \ldots, d_p\}$, where $p$ is the total number of documents to be classified. After that, the membership degree of each document in each of the category sets is calculated using the following equation.

$$\mu_{i,j} = \max_{k_a \in d_i} \Big[\, 1 - \prod_{k_b \in CK_j} (1 - r_{a,b}) \,\Big] \qquad \text{(Eq. 2)}$$

where $\mu_{i,j}$ is the membership degree of $d_i$ belonging to $C_j$, and $r_{a,b}$ is the fuzzy relation between keyword $k_a \in d_i$ and keyword $k_b \in CK_j$. A document $d_i$ is classified into the category $C_j$ for which the membership degree $\mu_{i,j}$ is the maximum.

The keyword $k_a$ in $d_i$ is associated with category $C_j$ if the keywords $k_b$ in $CK_j$ (for category $C_j$) are related to the keyword $k_a$. Whenever there is at least one keyword in $CK_j$ that is strongly related to the keyword $k_a$ in $d_i$ (i.e., $r_{a,b} \approx 1$), Eq. 2 yields $\mu_{i,j} \approx 1$, and the keyword $k_a$ is a good fuzzy index for the category $C_j$. When all keywords in $CK_j$ are either loosely related or unrelated to $k_a$, the keyword $k_a$ is not a good fuzzy index for $C_j$ (i.e., $\mu_{i,j} \approx 0$).

3. Experiments and results

This section provides the descriptions and characteristics of the data sets used for performing our experiments. We also briefly review the vector space model with the cosine coefficient as a comparison approach. Then, the experimental results and discussions are presented.

3.1 Experimental data sets

Experiments using the predefined categories and the document sets collected from two Web portals, Yahoo! [21] and Open Directory Project (ODP) [19], are conducted.
A brief description and history of these two Web portals is provided in [20]. In our experiments, we only consider documents in English and ignore all non-English documents. Therefore, the categories World and Regional are excluded from our experimental data sets. Table 1 shows the selected categories from these two Web portals. Based on these predefined categories, we collect approximately 18,000 documents from each of the Web directories as the training and test data sets. To avoid the problem of over-fitting the data when performing the experiments, we randomly select two-thirds of the documents as the training data set and one-third as the test data set.

Table 1. Predefined category sets from two Web portals

Yahoo!                            ODP
Category                Abbr.     Category          Abbr.
Arts & Humanities       art       Arts              art
Business & Economy      bus       Business          bus
Computers & Internet    com       Computers         com
Education               edu       Games             game
Entertainment           et        Health            health
Government              gov       Home              home
Health                  health    Kids and Teens    kid
News & Media            news      News              news
Recreation & Sports     rec       Recreation        rec
Science                 sci       Science           sci
Social Science          sosci     Shopping          shop
Society & Culture       soc       Society           soc
                                  Sports            sport
TOTAL                   12        TOTAL             13

Considering only the training data sets from these two Web sites, we extract and select the most frequently occurring keywords from each category as follows. For the Yahoo! data set, the 350 most frequent keywords are selected from each of the 12 categories. Some of the keywords appear in more than one category, but we only consider one instance of each. The total number of distinct keywords is […]. For the ODP data set, we also select the 350 most frequent keywords from each of the 13 categories. The total number of distinct keywords is […].

3.2 Vector space model

The vector space model is one of the classical clustering methods, first proposed in [13]. This method has been successfully applied to many IR systems, including the well-known SMART system [12]. The vector space model maps the attributes (keywords in this context) into an n-dimensional space, where n is the number of attributes. Therefore, each document can be represented by an n-dimensional vector called a document vector. For the classification problem, we have a predefined set of categories, each of which can also be represented by an n-dimensional vector called a category vector.
To classify a document into one of the categories, the document vector is compared with all category vectors using a similarity metric. The document is classified into the category for which the similarity measure is the highest. Several approaches for calculating the similarity measure between documents have been proposed [11]. Two types of measures have been widely used. The first is a distance metric (representing dissimilarity) such as the Euclidean distance. The second is a similarity measure such as the cosine or Dice coefficient. In this paper, as a comparison approach, the cosine coefficient is used to calculate the similarity between a document and a category. The calculation of the cosine coefficient is given below.

$$\mathrm{COSINE}(\vec{f}_i, \vec{g}_j) = \frac{\sum_{k=1}^{n} f_{i,k}\, g_{j,k}}{\sqrt{\sum_{k=1}^{n} f_{i,k}^{2} \cdot \sum_{k=1}^{n} g_{j,k}^{2}}} \qquad \text{(Eq. 3)}$$

where $\vec{f}_i \in F$, $F$ is a set of n-dimensional document vectors, $\vec{g}_j \in G$, $G$ is a set of n-dimensional category vectors, and $n$ represents the total number of distinct keywords.

3.3 Results and discussions

To compare the performance of our method (denoted as Fuzzy) to the vector space model (denoted as Vector), we use the test data sets and measure the classification accuracy while varying the vector lengths of the category vectors. To see the effect of using different sets of keywords in representing the category vectors, we provide two ways of selecting the keywords: selecting from the most frequently occurring keywords (denoted as topmost), and selecting from the least frequently occurring keywords (denoted as bottommost).

Figure 1 shows the experimental result for the Yahoo! data set. As can be seen from this figure, in all cases the classification accuracy increases as the number of keywords used to represent the category vectors increases. Our approach yields a higher accuracy than the vector space model. For example, when the vector length is 10, our approach yields accuracies of 74.9% for the topmost sets and 41.1% for the bottommost sets, whereas the vector space model yields accuracies of 57.…% for the topmost sets and 12.2% for the bottommost sets. In Figure 2, the performance result for each of the 12 Yahoo! categories is presented. As expected, our approach yields higher accuracies for all categories.
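The vector space baseline used in these comparisons can be sketched as follows. This is only an illustration of the cosine coefficient in Eq. 3 and of the "highest similarity wins" rule; the names `cosine` and `classify_by_cosine`, the use of plain Python lists as vectors, and the convention of returning 0 for an all-zero vector are assumptions made for the example.

```python
import math


def cosine(f, g):
    """Cosine coefficient between a document vector f and a category vector g (Eq. 3)."""
    dot = sum(a * b for a, b in zip(f, g))
    norm = math.sqrt(sum(a * a for a in f) * sum(b * b for b in g))
    # Assumption: treat the similarity with an all-zero vector as 0.
    return dot / norm if norm else 0.0


def classify_by_cosine(doc_vector, category_vectors):
    """Return the category name whose vector is most similar to doc_vector.

    category_vectors: dict mapping a category name to its n-dimensional vector.
    """
    return max(category_vectors,
               key=lambda name: cosine(doc_vector, category_vectors[name]))
```

Both vectors must be indexed by the same ordered list of distinct keywords, so that position k in the document vector and position k in the category vector refer to the same keyword.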

We perform the same experiments on the ODP data set. The results are shown in Figure 3 and Figure 4, respectively. The results are similar to those obtained from the Yahoo! data set, with one additional observation. By using the bottommost keywords in our approach, the average accuracy is 78.1%, and by using the topmost keywords in the vector space model, the average accuracy is 67.1%. That is, by using either the topmost or bottommost representations, our approach performs better than the vector space model.

[Figure 1. Classification performance by varying the vector length - Yahoo! Series: Fuzzy (topmost), Fuzzy (bottommost), Vector (topmost), Vector (bottommost); x-axis: Vector Length.]

[Figure 2. Classification performance by categories - Yahoo! Series: Fuzzy (topmost), Vector (topmost); x-axis: Category.]

[Figure 3. Classification performance by varying the vector length - ODP. Series: Fuzzy (topmost), Fuzzy (bottommost), Vector (topmost), Vector (bottommost); x-axis: Vector Length.]

[Figure 4. Classification performance by categories - ODP. Series: Fuzzy (topmost), Vector (topmost); x-axis: Category.]

Table 2 shows the summarized results for both the Yahoo! and ODP data sets. The results are calculated by averaging the accuracy values over all the vector lengths. By using the topmost representation for the category vector, our approach yields average classification accuracies that are 13.7 and 17.7 percentage points higher than the vector space model for the Yahoo! and ODP data sets, respectively. Another observation is that, for our approach, the selection of different keywords in representing the category does not affect the performance as much as it does for the vector space model. For example, for the Yahoo! data set, by using the bottommost keywords instead of the topmost keywords, the accuracy drops by 21.4 percentage points in our approach (81.5% to 60.1%), whereas it drops by 39.0 percentage points in the vector space model approach (67.8% to 28.8%).

Table 2. Average classification accuracy

Data set    Fuzzy (topmost)    Fuzzy (bottommost)    Vector (topmost)    Vector (bottommost)
Yahoo!      81.5%              60.1%                 67.8%               28.8%
ODP         84.8%              78.1%                 67.1%               46.1%
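To make the fuzzy classification rule of Section 2.2 concrete, the following sketch computes the membership degree of a document in each category and picks the largest. It reads Eq. 2 as the maximum, over the document's keywords, of one minus the product of (1 - r) over the category's keywords; that reading, the function names `membership` and `classify`, and the nested-dictionary layout of the relation values are assumptions made for illustration, not the authors' code.

```python
def membership(doc_keywords, category_keywords, r):
    """Membership degree of a document in one category (Eq. 2, as read above).

    doc_keywords:      keywords extracted from the test document d_i.
    category_keywords: the exclusive keyword set CK_j of the category.
    r:                 nested dict with r[a][b] = fuzzy RT relation of a and b.
    """
    best = 0.0
    for a in doc_keywords:
        unrelated = 1.0
        for b in category_keywords:
            unrelated *= 1.0 - r[a][b]   # stays near 1 only if a is unrelated to all b
        best = max(best, 1.0 - unrelated)
    return best


def classify(doc_keywords, category_sets, r):
    """Return the category whose membership degree is the largest."""
    return max(category_sets,
               key=lambda name: membership(doc_keywords, category_sets[name], r))
```

With a document keyword related to two category keywords at 0.8 and 0.5, the membership degree is 1 - (0.2)(0.5) = 0.9, which illustrates how a single strong relation is enough to push the value toward 1.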

4. Conclusion

In this paper, an alternative approach of automatically classifying Web documents into predefined categories using the fuzzy association concept is proposed. Recognizing the ambiguity of word usage in English, the fuzzy association method avoids this problem by capturing the relationship or association among different index terms or keywords in the documents. The result is that each pair of words has an associated value to distinguish itself from other pairs of words. Experiments using the data sets obtained from two different Web directories, Yahoo! and ODP, are conducted. The two Web portals are independent and have different characteristics. We compare our fuzzy association approach to the vector space model approach. To see the effect of different keyword selections for category vectors, two alternatives are used with varying vector lengths: selecting from the most frequently occurring keywords (topmost) and selecting from the least frequently occurring keywords (bottommost). The results show that, on average, our approach yields higher classification accuracies than the vector space model for both the topmost and bottommost cases. In addition, with our approach, using fewer keywords for category representation does not degrade the accuracy as much as it does with the vector space model.

5. Acknowledgments

For Shu-Ching Chen, this research was supported in part by NSF CDA-[…].

References

[1] A.Z. Broder, S.C. Glassman, and M.S. Manasse, Syntactic Clustering of the Web, Proceedings of the 6th International World Wide Web Conference, April 1997.
[2] C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal, Web Search Using Automatic Classification, Proceedings of the 6th International World Wide Web Conference, April 1997.
[3] R. Cooley, B. Mobasher, and J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
[4] S.T. Dumais and H. Chen, Hierarchical Classification of Web Content, Proceedings of the 23rd International ACM Conference on Research and Development in Information Retrieval (SIGIR'00), August 2000.
[5] D. Koller and M. Sahami, Hierarchically Classifying Documents Using Very Few Words, Proceedings of the 14th International Conference on Machine Learning (ICML'97), July 1997.
[6] S. Miyamoto, Two Approaches for Information Retrieval Through Fuzzy Associations, IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 1, January/February 1989.
[7] S. Miyamoto, T. Miyake, and K. Nakayama, Generation of a Pseudothesaurus for Information Retrieval Based on Cooccurrences and Fuzzy Set Operations, IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 1, 1983.
[8] S. Miyamoto and K. Nakayama, Fuzzy Information Retrieval Based on a Fuzzy Pseudothesaurus, IEEE Transactions on Systems, Man, and Cybernetics, vol. 16, no. 2, March/April 1986.
[9] Y. Ogawa, T. Morita, and K. Kobayashi, A Fuzzy Document Retrieval System Using the Keyword Connection Matrix and a Learning Method, Fuzzy Sets and Systems, vol. 39, 1991.
[10] J. Pitkow and P. Pirolli, Mining Longest Repeating Subsequences to Predict World Wide Web Surfing, Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS'99), October 1999.
[11] E. Rasmussen, Chapter 16: Clustering Algorithms, in W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, Prentice Hall, 1992.
[12] G. Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall Series in Automatic Computation, Englewood Cliffs, New Jersey, 1971.
[13] G. Salton, A. Wong, and C.S. Yang, A Vector-Space Model for Information Retrieval, Communications of the ACM, vol. 18, no. 11, 1975.
[14] M.-L. Shyu, S.-C. Chen, and C. Haruechaiyasak, Mining User Access Behavior on the WWW, IEEE International Conference on Systems, Man, and Cybernetics, October 2001.
[15] M.-L. Shyu, S.-C. Chen, C. Haruechaiyasak, C.-M. Shu, and S.-T. Li, Disjoint Web Document Clustering and Management in Electronic Commerce, Proceedings of the Seventh International Conference on Distributed Multimedia Systems (DMS'01), September 2001.
[16] S. Tiun, R. Abdullah, and T.E. Kong, Automatic Topic Identification Using Ontology Hierarchy, Proceedings of the Second International Conference on Computational Linguistics and Intelligent Text Processing (CICLing'01), February 2001.
[17] J. Wang, A Survey of Web Caching Schemes for the Internet, ACM Computer Communication Review, October 1999.
[18] L.A. Zadeh, Fuzzy Sets, in D. Dubois, H. Prade, and R.R. Yager, editors, Readings in Fuzzy Sets for Intelligent Systems, Morgan Kaufmann Publishers.
[19] Open Directory Project (ODP).
[20] The Major Search Engines.
[21] Yahoo! Web Search Directory.
