On-line Hot Topic Recommendation Using Tolerance Rough Set Based Topic Clustering

Size: px

Start display at page:

Download "On-line Hot Topic Recommendation Using Tolerance Rough Set Based Topic Clustering"

Edith Patrick
5 years ago
Views:

1 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL On-lne Hot Topc Recommendaton Usng Tolerance Rough Set Based Topc Clusterng Yonghu Wu, Yuxn Dng, Xaolong Wang, Jun Xu Intellgence Computng Research Center Harbn Insttute of Technology, Shenzhen Graduate School, Shenzhen, People s Republc of Chna {yhwu,wangxl}@nsun.ht.edu.cn, yxdng@htsz.edu.cn, ht.xujun@gmal.com Abstract In ths paper we present our research of onlne hot topc detecton and label extracton method for our hot topc recommendaton system. Usng a new topcal feature selecton method, the feature space s compressed sutable for an onlne system. The tolerance rough set model s used to enrchng the small set of topcal feature words to a topcal approxmaton space. Accordng to the dstance defned on the topcal approxmaton space, the web pages are clustered nto groups whch wll be merged wth document overlap. The topc labels are extracted based on the approxmaton topcal space enrched wth the useful but hgh frequency topcal words dropped by the clusterng process. The experments show that our method could generate more nformaton abundant classes and more topcal class labels, allevate the topcal drft caused by the non-topcal and nose words. Index Terms topc detecton, tolerance rough set model, assocaton rule, clusterng, recommendaton system I. INTRODUCTION Internet users get nformaton by nputtng key words nto search engnes. However, lots of people are nterested n the mportant news or topcs n a specfc doman whch can not descrbed by key words. These users don t know the key words and don t need to nput any thng. For example, the stock owners need a close followng about the fnancal news to fnd the most potental stocks. But they don t know the key words before the related news come out of webstes. News recommendaton system can solve ths problem. In ths research, we present a topc detecton and news recommendaton method, whch could recommend the most mportant news and topcs about a certan doman wthout the user logs. The user log-based news recommendaton system needs both the user s data and the news data. Ths knd of recommendaton s a flterng problem: usng the user s data as a flter, whch s wdely used n e-commerce webste where the past shoppng records are used as a flter for new product recommendaton. Our recommendaton method could fnd the topcs n web page collectons, take out the topc labels, and recommend the most mportant topcs to the user. Ths knd of recommendaton s a detecton problem. There are many researches about the topc detecton and trackng (TDT). However, most of them are based on the offlne corpus. The tradtonal TF-IDF based feature selecton s not sutable n a topc detecton process for the topc drft caused by the non-topcal words and nose words. Ths topcal drft can be allevated by a tolerance rough set based topcal approxmaton feature space. Most of the onlne topc research are focused on the new topc detect task. The topc label extracton s also an mportant part of news recommendaton system. Assocaton rules extracton s wdely used n e- commerce recommendaton system for extracton of recommendaton rules from shoppng-record, whch s compact and short. The web page contans more nose words compared to the shoppng record. When used n topc label extracton from web pages, the assocaton rule methods also suffers the topc drftng problem caused by the non-topcal words. When the number of pages grows up, the tme complexty of assocaton rule method ncreased sharply. In ths research, we present a tolerance rough set based topcal clusterng and class label extracton method based on a tolerance approxmaton topcal feature space, whch could reduce the non-topcal nose words and generate more topcal cluster. By controllng the threshold parameter the tolerance approxmaton topcal feature space s more compact and topcal than the TF-IDF feature space. The experments show our method s sutable for an on-lne recommendaton system. The remanng of the paper s organzed as follows. In secton 2, the related research works are gven. Secton 3 presents the dscusson of web page crawlng and testng corpus. Secton 4 dscusses the topc detect usng tolerance rough set based clusterng, the topc label extracton usng assocaton rules. The experment results and future works are shown n secton 5 and secton 6, respectvely. II. PREVIOUS WORKS A. Topc Detecton and Webste News Recommendaton The tradtonal TDT research defned a topc as a semnal vent or actvty, along wth all drectly related events and actvtes [1]. In our news recommendaton system, a topc s defned as a group of news pages reportng the same or related events. There are many TDT research based on the TDT corpus, whch contans document labeled nto dfferent topcs. Most of the TDT research usng the nformaton flterng methods. James Allan Ron papka [2] use the surprsng features trackng new event from the Internet. Bngjun Sun and Prasenjt do: /jcp

550 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 Mtra [3] use mutual nformaton (MI) and weghted mutual nformaton (WMI) trackng the topcs n multdocument envronment.

2 550 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 Mtra [3] use mutual nformaton (MI) and weghted mutual nformaton (WMI) trackng the topcs n multdocument envronment. Abhnandan Das and Mayur Datar [4] use a collaboratve Flterng method n the Google s personalzed news recommendaton system. Kuan-Yu Chen and Luesak Luesukprasert [5] make research of the hot topc extracton usng sentence modelng and tme-lne analyss. Clusterng methods [6] s a useful tool n the topc detecton process. B. Rough Set Theory Rough set theory s used for approxmaton of a concept, whch s an extenson of the set theory. Ths theory s frst ntroduced by Pawlk [7], and then wdely used n mnng the assocaton between attrbutes. Rough set Theory s wdely used n many researches such as: latent semantc analyss, expert system wth ncomplete nformaton, Assocaton rules extracton... and so on. Several models based on rough set are proposed n the nformaton retreval. Funakosh and Ho [8] proposed Equvalence rough set model, whch usng the equvalence classes. Ths model can be used to calculate the latent semantc between terms, but t s hard to calculate the equvalence classes due to the strct transtve property restrct. Skowron and Stepanuk [9] proposed the tolerance rough set model (TRSM) n Instead of usng the strct equvalent relaton TRSM relax the transtve restrcton whch makes the TRSM feasble to use n Informaton retreval. C. Assocaton Rules The assocaton rules extracton s wdely used n the research of the transacton records contanng compact data. J.Han and Y.Fu [10], R.Agrawal [11] deal wth the large database assocaton rule extracton problem. S. Wesly Changchen and Tzu-Chuen Lu [12] usng a neural network based clusterng and assocaton rules extracton method n on-lne products recommendaton system. M.Delgado [13] present a fuzzy assocaton rules extracton method n text mnng based on fuzzy transacton. III. WEB PAGE COLLECTION The web pages used n ths research s collected by the crawler of our search engne Inar. In order to get the most useful web pages from a specfc doman webste a new edton s developed from the Inar crawler: Pooler hot news crawler. The detaled nformaton about the crawler can be seen from [14] [15]. Pooler uses an mproved aggregate rankng algorthm [16] to rankng the webstes n a specfc doman, fnd the most useful set as well as the ndex pages, refresh the page set ncrementally. In ths research, we use the fnancal webstes as an example. As an onlne topc recommendaton system, we are focused on the new web pages refreshed everyday. The Pooler collects about 10,650,000 pages from about 2600 fnancal webste. The top 70 ste are chosen for ths research by the mproved aggregate rankng method. The publsh tme of web pages are adjusted usng the tme label extracted from the page whch could prevent the lost of last-modfy part n many webste servers. In order to get the number of pages processed by the recommendaton system, we sampled the web pages wth a publsh tme n October, (a) (b) New page refreshed daly New page refreshed hourly Fgure 1. Web page refreshng n October, 2008 TABLE I. NEW PAGES REFRESHED IN OCTOBER, 2008 Number of Refreshng Pages Days Number of Webstes The number of pages refreshed n October 2008 can be seen from Table I. Fg. 1(a) shows the new page refreshed by days n October, Fg. 1(b) shows the refreshng new pages by hours regardless of the date. IV. TOPIC DETECTION IN APPROXIMATION SPACE The clusterng method always suffers the nose words. In topcal clusterng of web pages, we call the words not related wth ts topc as the non-topcal words. Nontopcal words could drft the topcal clusterng process. In order to get more topcal clusterng results, several methods are proposed to reduce the non-topcal words as

3 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL well as the feature space. The reducton of the feature space may cause the lost of co-occurrence relaton of words, whch s an mportant feature n clusterng. In ths research, we use a tolerance rough set based clusterng method to get more topcal class. After feature selecton the non-topcal words are deserted. Ths could generate a compact feature space sutable for an onlne recommendaton system. Then ths feature space s extended nto a topcal approxmaton space usng the TRSM. By changng the TRSM parameters, the sze of the topcal approxmaton space can be controlled whch could balance the clusterng result and the runnng tme. Then for each class the topc label s extracted usng assocaton rule on the tolerance approxmaton feature space. A. Topcal Feature Selecton In order to get a topcal clusterng result, the topcal words suffcently descrbe the topc must be extracted. Several term weghtng schemes are used n selectng the key words from the web pages. Term frequency-nverse document frequency (TF-IDF) s wdely used n text mnng. The TF-IDF method can drop the hgh frequency words whch are dsaster for the clusterng. However, the hot topcs occurs n many web page wll be dropped by ths term weghtng method. So we usng dfferent feature selecton n the topc label extracton, whch gves the topcal words whch occur frequently on many web stes a hgh score. Another dsadvantage of the TF-IDF as well as other frequency based feature selecton s all the worlds are treated equally, whch s not sutable for a topcal clusterng. The topcal words are more mportant than other context n a topcal cluster. If the feature space contans more topcal words, the cluster result wll be more topcal. Usually, the web page contans the ttle fled, whch n most tme s the key words of the content and should be gven a hgh score than the content. The words close to or co-occur wth the ttle words can be used to enrch the topcal words. At frst, the web pages collected by crawler Pooler are purfed by a cleanng module to drop the scrpts and nose texts such as the advertsements. Some of the nonnformatve parts are dropped by ths cleanng module, such as the numbers, the hyperlnks n text area, the error message generated by the scrpt, the foot note of web pages. These no-topcal words usually cause the topcal drft n clusterng process. For example, the web pages are usually generated from some predefned models. The pages collected from the same web ste usually contan the same navgaton text, copy rght nformaton, as well as the foot note nformaton. The page co-occurrence of these texts s even hgher than the topcal words, whch lead to a low NTD dstance. Ths non-topcal dstance could be allevated by web sted dstance (NSD). However, these non-topcal dstances are low enough to drft the clusterng process n many tmes. Then the texts are processed by the Chnese word segmentaton and Named Entty Recognton system ELUS [17]. The ELUS system uses a trgram model wth smoothng algorthm for Chnese word segmentaton and a Maxmum entropy model for Named Entty Recognton. In the thrd SIGHAN-2006 bakeoff the ELUS system acheves F-measure 96.8%, whch s the best n MSRA open test. The part of speech features can be extracted effectvely usng ELUS system. The ttle words and the noun word of web pages are extracted as the ntal topcal words set. B. Topcal Words Enrchng There are few topcal words derved from the ttle. More topcal words can be found from the co-occur set of these topcal words, whch s the latent semantc features. Latent Semantc Analyss (LSA) method can be used to fnd the topcal words relaton. However ths knd of method usually cost too much complex computng for an onlne system. TRSM s another useful tools to mnng the latent semantc feasble for on-lne clusterng. The orgnal topcal words set can be extend to a upper approxmaton topcal words set n whch the two web page are related to a topc even f they are not related n the co-occur feather space. Tolerance Rough Set Model: Let U be the unverse, R U U be an equvalence relaton defned on U. A = (U, R) s an approxmaton space. In the orgnal rough set theory the lower and upper approxmaton of set X U wth respect to relaton R s defned as: RX [ ] = { x U:[ x] X}. (1) R RX [ ] = { x U:[ x] I X }. (2) Where [ X ] R = { y U xry} s the equvalence class of objects ndscernble wth x regardng R. Then the rough boundary of X can be defned as: R BN ( ) ( ) ( ) R X = R X R X. (3) A topc s always overlapped wth other related topcs, so we have to deal wth the overlappng classes derved from the clusterng process. The approxmaton of rough set theory s useful to deal wth ths knd of overlappng concept. The orgnal small sze of topcal words set can be enrchng usng ths approxmaton. However, the relaton R n orgnal rough set theory, whch s reflectve, symmetrc and transtve are too strct for natural language processng (NLP). As a matter of fact, these transtve n topcal clusterng always lead to a topc drft. Skowron and Stepanuk proposed a tolerance rough set model (TRSM) by relaxng the transtve restrct. The restrct on of the relaton R n TRSM s reflexve and symmetrc. The approxmaton space derved usng TRSM s called tolerance space. TRSM can be defned as a quadruple: = ( U, I, v, P). (4) Whch U s a non-empty set of objects, I : U P( U), v: P( U) P( U) [0,1], P: I( U) {0,1} are uncertanty

4 552 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 vague ncluson and structure functon where PIx ( ( )) = 1 for any x U. The lower and upper approxmaton n R for any X U can be defned as: RX [ ] = { x U: PIx ( ( )) = 1 vix ( ( ), X) = 1}. (5) RX [ ] = { x U: PIx ( ( )) = 1 vix ( ( ), X) > 0}. (6) The relaton R n TRSM for text nformaton retreval s usually defned as the co-occurrence n document set D. In our topcal feature approxmaton the R s defned as the topcal co-occurrence n all document set D. We tred many ways to defne the topcal co-occurrence. In ths paper we defne the topcal co-occurrence as the words co-occurred n the topcal area. The web page ttle s an mportant topcal area to generate the orgnal topcal word set (TWS). The topcal area s defned as the text area around the topcal word n TWS wth a dstance of l. If the page ttle contans only few topcal words not enough to descrbe the document the part of speech (POS) feature s used to enrch the feature set. Let D = { d1, d2, L, d n } be a set of documents, and T = { t1, t2l, t m } be a word set co-occurred n the topcal area of D. The topcal approxmaton space R = { T, Iθ, v, P} can be defned over T usng uncertanty functon I θ : I ( t ) = { t f ( t, t ) θ} U { t}. (7) θ j TO j Where θ s a user defned threshold, fto ( t, t j ) s the topcal co-occurrence functon of words t and t j n dstance l. The lower and upper topcal approxmaton can be defned as: RX [ ] = { t T: vi ( ( t), X) = 1}. (8) θ RX [ ] = { t T: vi ( ( t), X) > 0}. (9) The orgnal topcal word set can be derved from the web page ttle as well as the POS feature words. The upper topcal approxmaton R s a set of words whch related to the topc n the topcal approxmaton feature space. C. Topcal Clusterng Semantc Dstance Measurement of Topcal words: The measurement of word dstance s an mportant factor of the clusterng method. There are many ways to compute the semantc dstance between two topcal words. The manually labeled dctonary such as WordNet and synonym word tables can be used n dstance measurng. Ths knd of manually label based tools suffers many dsadvantages n topcal clusterng. Frst, the nconsstent between the manually labeled relatons and the new arrvng web pages usually lead to a topc drft n the topc clusterng process. The Internet has no dearth of content. Every tme there comes thousands of new web pages contans topcal words nconsstent wth the manually labeled relatons. Second, the commonly used θ WordNet lke lexcons has rare sense when used n specfc domans. Mantanng of labeled lexcons of a specfc doman s not easy. Some statstcal based method s proposed based on the assumpton that the two words tend to be related f they have a hgh co-occurrence. Ths method makes good use of the specfc word co-occurrence features n web pages. The Normalzed Google Dstance (NGD) s used n many related papers. The NGD method uses the number of pages contanng the words returned by Google as the measurement of the dstance between two words. The normalzed Google dstance between words w 1, w2 can be defned as : NGD( w1, w2) = max{log f( w1),log f( w2)} log f( w1, w2). (10) log N mn{log f( w ),log f( w )} 1 2 Where f( w ) are documents contans word w, N s the total number of the document set D, f( w1, w 2) s the number of web pages contanng word w 1 and w 2. Words wth hgh topc relaton are tendng to have low Google dstance. The NGD dstance can be derved usng the Google engne by countng the page numbers returned from Google. However the topcal approxmaton set s large and changng every tme the new page comes whch make t not feasble n practce. In ths research, we use a page set from a specfc doman to compute the dstance lke the NGD. Ths web page collecton comes from the specfc doman contans specfc features whch can not be extracted from a common web page collecton. The NGD lke dstance s generated by the page cooccurrence of words. The web pages have more abundant nformaton than text fles, such as the source webste names. At the begnnng of ths research we use the NGD lke dstance derved from our web page collectons. We fnd that the topcal words co-occurred n many webstes s more mportant than the words co-occurred n many web pages belongs to the same webste. In another word, the topcal words occur n many webstes (hgh webste co-occurrence words), s more mportant than the words wth hgh page co-occurrence. The source webste name s a useful feature whch could gve us more nformaton about the dstance between two topcal words. The webste frequency of topcal words can be derved from the URL. The dstance used n ths research s the Normalzed Topcal Dstance (NTD), whch composed of the Normalzed Page Dstance (NPD) and the Normalzed Ste Dstance (NSD). NPD( w1, w2) = max{log fp( w1),log fp( w2)} log fp( w1, w2). (11) log NP mn{log fp( w1),log fp( w2)} NSD( w1, w2) = max{log fs( w1),log fs( w2)} log fs( w1, w2). (12) log N mn{log f ( w ),log f ( w )} S S 1 S 2

5 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL Where f ( w ) and f ( w ) are respectvely the number P S of pages contan word w and the number of webstes contan word w. N P s the total number of web pages. N S s the total number of webstes. Then we can defne our NGD lke dstance NTD as follows: NTD = α NPD + α NSD. (13) P The parameter α P and α S are used to control the weght of NPD and NSD. By ths defnton, f two topcal words group co-occurred n the same number of pages but deferent number of webstes, the dstance between two words co-occurred n more webstes are tends to be closer. The NSD between two topcal words can be computed along wth the NPD computng process. The computng complexty of NTD s the same wth tradtonal NGD. The NGD lke method suffers a bas when two words co-occurred only wth each other. Snce low document frequency words are useless n clusterng process, a threshold s used to drop these words. The hgh document frequency words are also dropped by an upper threshold. Topcal Clusterng Process: After the topcal words selecton all the web pages are presented by the topcal words set. These topcal words are clustered nto groups that each group contans the topcal words hghly corelated n the topcal approxmaton space usng complete-lnk clusterng. The complete-lnk clusterng tend to produce tghtly bounded compact clusters than the sngle-lnk clusterng, whch s helpful to generate as much topc as the topcal approxmaton space contans. Cluster Mergng: The words from the same topc may be clustered nto dfferent groups due to the tghtly bounded clusters generated by the complete-lnk clusterng. There are too many groups to be ntegrated nto our hot news recommendaton system. As we consder the NSD n our dstance measurement, the topcal words wth both hgh page co-occurrence and webste co-occurrence tends to have low NTD, whch makes these worlds more lkely to be clustered nto a small class wth only several topcal words. These groups should be merged usng other topcal relaton measure method. The document overlap of each class can be used to deal wth these small classes. All the clusters generated by the frst clusterng process are merged by the document overlap between them. The overlap between two class C1 and C 2 s defned as: S tradtonal cluster label extracton method can not well descrbe the topcal meanngs. Usng assocaton rules method we can generate topcal labels sutable for on-lne news recommendaton system. The assocaton rule extracton s proved to be effectve n recommendaton system based on DB or other short records. These knds of records are compact and commonly contan few words, such as the shoppng record. The web pages, however, contans a large set of words whch cost more computng tme. The computng cost could be allevated by usng the assocaton rules extracton algorthm n each class generated by the topcal clusterng. The output of the topc label generaton s a group of topcal words whch s the hot topcal words. The frst step s select canddate hot topcal words from the web page of each class. The topcal words sets are selected n the topcal clusterng process. But some hgh frequency topcal words are dropped by the feature selecton. In topc label generaton we select all the hgh frequency noun words nto the canddate topcal words set. The hotness H ( W ) of each topcal world W s defned as: Ste( W) Page( W) HW ( ) = λ1 + λ2. (15) S P Where Ste( W ) s the number of webste whch contans a web page related to topcal word W, S s the total number of webste select by the rankng component, Page( W ) s the number of web pages contanng topcal words W, P s the total number of pages. C1I C2 OL( C1, C2) =. (14) MIN{ C1, C2 } Where C s the document number n group C. In the experments the threshold of document overlap s set to D. Topcal Label Generaton The topcal label generaton requres more topcal related label. In feature selecton process the topcal words wth hgh document frequency s dropped, whch may be the topcal words. The label generated by

6 554 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 As the tradtonal assocaton rules extracton compute the support value between each record, we compute the support between each canddate hot topcal words. The support of two topcal word set WS 1 and WS 2 can be calculated as follow: Ste( WS1IWS2) SUP( WS1, WS2) = λ1 max{ Ste( WS1), Ste( WS2) }. (16) Page( WS1IWS2) + λ2 max{ Page( WS1), Page( WS2)} Where the support of two topcal word set composed of two parts: the webste frequency support as well as the document frequency support. Parameter λ 1 and λ 2 can be used to control the hot weght caused by webste and page. In ths research λ 1 s set to 0.75 and λ 2 s set to Takng the top N topcal words as the canddate hot word, the topc label generaton can be descrbed as Algorthm 1. In the topc label generaton, each of the hot topcal word s a member of the label set L. Then the support between each l, lj L s computed. If the support value reaches the threshold the unon set of l, l j becomes a new member of the label set. After each supportng compute loop, the new members n T wll be merged when the overlap of the two members set reach the cover threshold θ c. The fnal label set for ths class s the one contans the most of the topcal words. V. EXPERIMENTS AND RESULTS Ths research s desgned to mprove the fnancal news recommendaton system, whch s an mportant part of the hatanyuan knowledge servce platform. The topcs are ranked by the number of topcal words as well as the number of pages. Then the system returns the most mportant page ttles derved from the top topcs wthout the label generaton process. Wth the topcal clusterng method n ths research our recommendaton system wll be able to recommend hot topcs of current fnancal webste whch s more useful than the hot ttles. Usually, the user s experence s the most useful tools n the evaluaton of a clusterng algorthm, whch makes the comparng of dfferent clusterng algorthms a nontrval task. A. En vronment and Testng Corpus The recommendaton system s mplemented usng c++ and PHP on Lnux system. The crawler s runnng on a Lnux machne shows n table II. The web page corpus s randomly select from the news pages refreshed n October, The non-nformatve pages are deserted. B. Mnmum Word Frequency Mnmum word frequency threshold θ wf s very mportant for the topcal clusterng. The low threshold wll takes more words come nto the topcal word set, whch makes the clusterng generate more basc class and has a hgh Coverage. However, the tme cost s growng wth the class number. The nose word n the TF-IDF based feature selecton wll generate many useless class and the nontopcal words wll drft the topcal clusterng process. Fg. 2 shows the relaton between the mnmum term frequency threshold and the base class number n our topcal approxmaton space compared wth TF-IDF feature space n a corpus contanng 2500 web pages. The threshold dstance used n clusterng s set to 0.3. Both methods wll get a small number of classes usng a hgh threshold and a large number of classes usng a low threshold. The relaton between threshold and class coverage can be seen n Fg. 3. Both methods get a hgh coverage n low threshold and a low coverage n hgh threshold. As can be seen from Fg. 3, the POS feature and tolerance rough set based approxmaton feature space could drop the non-topcal relatons between non-topcal words; reduce the number of class and runnng tme, whch s sutable for onlne usng. The class coverage of topcal approxmaton feature space s decrease for the non-topcal words are dropped n feature selecton. C. Tme Cost In order to test the tme complexty of ths algorthm, we sampled sets contanng dfferent pages from 500 to Fg. 4 shows the tme cost lne for dfferent numbers of web page sets whch s near to lnear. As we can see from Fg. 1(a), there are 5000 to new web pages refreshed every day. The page number can be controlled by dscardng the redundant and topcal less pages to get a reasonable tme cost for a 24-hour on-lne hot topc recommendaton system. VI. CONCLUSION AND FUTURE WORK In ths paper, a tolerance rough set based clusterng method s presented to detect the topcs n fnancal web ste news pages. In the feature selecton we use the part of speech feature and the ttle words to form ntal topcal word set. Then, ths topcal word set are extended by the TRSM usng the co-occurrence of topcal words n dstance l. Based on the NGD method, the web ste dstance (NSD) s ntroduced nto the new NGD lke dstance measure method: NTD. Usng NTD dstance the news pages are clustered nto more topcal class than the tradtonal feature space. The class label s generated by the assocaton rule method. The experments show that the tolerance rough set based topcal approxmaton space can explot the latent semantc under the co-occurrence of topcal words. Ths method allevates the non-topcal word caused topcal drftng, reduces the non-topcal class numbers as well as the runnng tme. The topcal clusterng and label generaton method could generate more topcal class than the tradtonal TF-IDF features. For the applcaton of ths method, there are some future tasks as how to detect the web pages n whch the ttle nconsstent wth the content, fndng the most mportant topcs from the clusterng result. The ttle words of web pages are treated mportant n our method,

7 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL whch s a good topcal feature for the web pages wth page ttles consstent wth the page contents. For the web pages wth ttle nconsstent wth the contents, we must fnd method to fnd the real topcal words set. Part of ths research s ntegrated nto our hot fnancal news detecton system ( whch s an mportant part of knowledge servce plat form hatanyuan. TABLE II. SYSTEM RUNING ENVIRONMENT CPU Intel Pentum 3.0 Memory Dsk 2G 320G ACKNOWLEDGMENT The authors wsh to thank Dr. Xanjun Meng for hs helpful suggestons. Ths nvestgaton was supported n part by the Natonal Natural Scence Foundaton of Chna( No ) and the natonal 863 Program of Chna(No. 2006AA01Z197, No. 2007AA01Z194). Fgure 2. Mnmum word frequency threshold and class numbers Fgure 3. Mnmum world frequency threshold and class coverage Fgure 4. Runnng tme for dfferent web page numbers REFERENCES [1] TDT2004, The 2004 topc detecton and trackng task defnton and evaluaton plan, [Onlne]. Avalable: [2] J. Allan, R. Papka, and V. Lavrenko, On-lne new event detecton and trackng, SIGIR 98: Proceedngs of the 21st annual nternatonal ACM SIGIR conference on Research and development n nformaton retreval, pp , November [3] B. Sun, P. Mtra, H. Zha, C. Gles, and J. Yen, Topc segmentaton wth shared topc detecton and algnment of multple documents, SIGIR 07, pp , July [4] A. S. Das, M. Datar, A. Garg, and S. Rajaram, Google news personalzaton: scalable onlne collaboratve flterng, n WWW 07: Proceedngs of the 16th nternatonal conference on World Wde Web. New York, NY, USA: ACM, 2007, pp [5] K.-Y. Chen, L. Luesukprasert, and S. cho T.Chou, Hot topc extracton based on tmelne analyss and multdmensonal sentence modelng, IEEE Transactons on Knowledge and Data Engneerng, pp , August [6] A. K. Jan, M. N. Murty, and P. J. Flynn, Data clusterng: a revew, ACM Computer Survey, vol. 31, no. 3, pp , [7] Z. Pawlak, Rough sets, Internatonal Journal of Computer and Informaton Scences. 11(5), pp , [8] K. Funakosh and T. B. Ho, A Rough Set Approach to Informaton Retreval. In: L. Polkowsk and A. Skowron, (eds.) Rough Sets n Knowledge Dscovery. Erewhon, NC: Physca-Verlag, [9] A. Skowron and J. Stepanuk, Tolerance approxmaton space, 3 rd Internatonal Workshop on Rough Sets and Soft computng, pp , November [10] J. Han and Y. Fu, Dscovery of multple-level assocaton rules from large databases, n Proceedngs of the 21st VLDB Conference, 1995, pp [11] R. Agrawal, T. Imel nsk, and A. Swam, Mnng assocaton rules between sets of tems n large databases, n SIGMOD 93: Proceedngs of the 1993 ACM SIGMOD nternatonal conference on Management of data. New York, NY, USA: ACM, 1993, pp [12] S. W. Changchen and T. Lu, Mnng assocaton rules procedure to support on-lne recommendaton by customers and products fragmentaton, Expert System wth Applcatons, vol. 20, pp , [13] M.Degado, M. Martn-Bautsta, D. Sanchez, and J.M.Serrano, Assocaton rule extracton for text mnng, n Lecture Notes n Computer Scence, 2002, pp

8 556 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 [14] Y. Dng, X. Wang, L. Ln, and Y. Wu, The desgn and mplementaton of the crawler-nar, ICMLC, pp , August [15] Y. Wu, X. Wang, Y. dng, and J. Xu, Desgn and mplementaton of a c/s structured dstrbuted mult-thread crawler usng ace and pvfs, Journal of Computatonal Informaton System, vol. 4, no. 3, pp , August [16] G. Feng, T.-Y. Lu, Y. Wang, Y. Bao, Z. Ma, X.-D. Zhang and W.-Y. Ma, Aggregaterank: brngng order to web stes, n SIGIR 06: Proceedngs of the 29th annual nternatonal ACM SIGIR conference on Research and development n nformaton retreval. New York, NY, USA: ACM, 2006, pp [17] W. Jang, Y. Guan, amd XL. Wang A Pragmatc Chnese Word Segmentaton System Proceedngs of the Ffth SIGHAN Workshop on Chnese Language Processng, Sydney, July 2006, pp

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School