On-line Hot Topic Recommendation Using Tolerance Rough Set Based Topic Clustering

Size: px
Start display at page:

Download "On-line Hot Topic Recommendation Using Tolerance Rough Set Based Topic Clustering"

Transcription

1 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL On-lne Hot Topc Recommendaton Usng Tolerance Rough Set Based Topc Clusterng Yonghu Wu, Yuxn Dng, Xaolong Wang, Jun Xu Intellgence Computng Research Center Harbn Insttute of Technology, Shenzhen Graduate School, Shenzhen, People s Republc of Chna {yhwu,wangxl}@nsun.ht.edu.cn, yxdng@htsz.edu.cn, ht.xujun@gmal.com Abstract In ths paper we present our research of onlne hot topc detecton and label extracton method for our hot topc recommendaton system. Usng a new topcal feature selecton method, the feature space s compressed sutable for an onlne system. The tolerance rough set model s used to enrchng the small set of topcal feature words to a topcal approxmaton space. Accordng to the dstance defned on the topcal approxmaton space, the web pages are clustered nto groups whch wll be merged wth document overlap. The topc labels are extracted based on the approxmaton topcal space enrched wth the useful but hgh frequency topcal words dropped by the clusterng process. The experments show that our method could generate more nformaton abundant classes and more topcal class labels, allevate the topcal drft caused by the non-topcal and nose words. Index Terms topc detecton, tolerance rough set model, assocaton rule, clusterng, recommendaton system I. INTRODUCTION Internet users get nformaton by nputtng key words nto search engnes. However, lots of people are nterested n the mportant news or topcs n a specfc doman whch can not descrbed by key words. These users don t know the key words and don t need to nput any thng. For example, the stock owners need a close followng about the fnancal news to fnd the most potental stocks. But they don t know the key words before the related news come out of webstes. News recommendaton system can solve ths problem. In ths research, we present a topc detecton and news recommendaton method, whch could recommend the most mportant news and topcs about a certan doman wthout the user logs. The user log-based news recommendaton system needs both the user s data and the news data. Ths knd of recommendaton s a flterng problem: usng the user s data as a flter, whch s wdely used n e-commerce webste where the past shoppng records are used as a flter for new product recommendaton. Our recommendaton method could fnd the topcs n web page collectons, take out the topc labels, and recommend the most mportant topcs to the user. Ths knd of recommendaton s a detecton problem. There are many researches about the topc detecton and trackng (TDT). However, most of them are based on the offlne corpus. The tradtonal TF-IDF based feature selecton s not sutable n a topc detecton process for the topc drft caused by the non-topcal words and nose words. Ths topcal drft can be allevated by a tolerance rough set based topcal approxmaton feature space. Most of the onlne topc research are focused on the new topc detect task. The topc label extracton s also an mportant part of news recommendaton system. Assocaton rules extracton s wdely used n e- commerce recommendaton system for extracton of recommendaton rules from shoppng-record, whch s compact and short. The web page contans more nose words compared to the shoppng record. When used n topc label extracton from web pages, the assocaton rule methods also suffers the topc drftng problem caused by the non-topcal words. When the number of pages grows up, the tme complexty of assocaton rule method ncreased sharply. In ths research, we present a tolerance rough set based topcal clusterng and class label extracton method based on a tolerance approxmaton topcal feature space, whch could reduce the non-topcal nose words and generate more topcal cluster. By controllng the threshold parameter the tolerance approxmaton topcal feature space s more compact and topcal than the TF-IDF feature space. The experments show our method s sutable for an on-lne recommendaton system. The remanng of the paper s organzed as follows. In secton 2, the related research works are gven. Secton 3 presents the dscusson of web page crawlng and testng corpus. Secton 4 dscusses the topc detect usng tolerance rough set based clusterng, the topc label extracton usng assocaton rules. The experment results and future works are shown n secton 5 and secton 6, respectvely. II. PREVIOUS WORKS A. Topc Detecton and Webste News Recommendaton The tradtonal TDT research defned a topc as a semnal vent or actvty, along wth all drectly related events and actvtes [1]. In our news recommendaton system, a topc s defned as a group of news pages reportng the same or related events. There are many TDT research based on the TDT corpus, whch contans document labeled nto dfferent topcs. Most of the TDT research usng the nformaton flterng methods. James Allan Ron papka [2] use the surprsng features trackng new event from the Internet. Bngjun Sun and Prasenjt do: /jcp

2 550 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 Mtra [3] use mutual nformaton (MI) and weghted mutual nformaton (WMI) trackng the topcs n multdocument envronment. Abhnandan Das and Mayur Datar [4] use a collaboratve Flterng method n the Google s personalzed news recommendaton system. Kuan-Yu Chen and Luesak Luesukprasert [5] make research of the hot topc extracton usng sentence modelng and tme-lne analyss. Clusterng methods [6] s a useful tool n the topc detecton process. B. Rough Set Theory Rough set theory s used for approxmaton of a concept, whch s an extenson of the set theory. Ths theory s frst ntroduced by Pawlk [7], and then wdely used n mnng the assocaton between attrbutes. Rough set Theory s wdely used n many researches such as: latent semantc analyss, expert system wth ncomplete nformaton, Assocaton rules extracton... and so on. Several models based on rough set are proposed n the nformaton retreval. Funakosh and Ho [8] proposed Equvalence rough set model, whch usng the equvalence classes. Ths model can be used to calculate the latent semantc between terms, but t s hard to calculate the equvalence classes due to the strct transtve property restrct. Skowron and Stepanuk [9] proposed the tolerance rough set model (TRSM) n Instead of usng the strct equvalent relaton TRSM relax the transtve restrcton whch makes the TRSM feasble to use n Informaton retreval. C. Assocaton Rules The assocaton rules extracton s wdely used n the research of the transacton records contanng compact data. J.Han and Y.Fu [10], R.Agrawal [11] deal wth the large database assocaton rule extracton problem. S. Wesly Changchen and Tzu-Chuen Lu [12] usng a neural network based clusterng and assocaton rules extracton method n on-lne products recommendaton system. M.Delgado [13] present a fuzzy assocaton rules extracton method n text mnng based on fuzzy transacton. III. WEB PAGE COLLECTION The web pages used n ths research s collected by the crawler of our search engne Inar. In order to get the most useful web pages from a specfc doman webste a new edton s developed from the Inar crawler: Pooler hot news crawler. The detaled nformaton about the crawler can be seen from [14] [15]. Pooler uses an mproved aggregate rankng algorthm [16] to rankng the webstes n a specfc doman, fnd the most useful set as well as the ndex pages, refresh the page set ncrementally. In ths research, we use the fnancal webstes as an example. As an onlne topc recommendaton system, we are focused on the new web pages refreshed everyday. The Pooler collects about 10,650,000 pages from about 2600 fnancal webste. The top 70 ste are chosen for ths research by the mproved aggregate rankng method. The publsh tme of web pages are adjusted usng the tme label extracted from the page whch could prevent the lost of last-modfy part n many webste servers. In order to get the number of pages processed by the recommendaton system, we sampled the web pages wth a publsh tme n October, (a) (b) New page refreshed daly New page refreshed hourly Fgure 1. Web page refreshng n October, 2008 TABLE I. NEW PAGES REFRESHED IN OCTOBER, 2008 Number of Refreshng Pages Days Number of Webstes The number of pages refreshed n October 2008 can be seen from Table I. Fg. 1(a) shows the new page refreshed by days n October, Fg. 1(b) shows the refreshng new pages by hours regardless of the date. IV. TOPIC DETECTION IN APPROXIMATION SPACE The clusterng method always suffers the nose words. In topcal clusterng of web pages, we call the words not related wth ts topc as the non-topcal words. Nontopcal words could drft the topcal clusterng process. In order to get more topcal clusterng results, several methods are proposed to reduce the non-topcal words as

3 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL well as the feature space. The reducton of the feature space may cause the lost of co-occurrence relaton of words, whch s an mportant feature n clusterng. In ths research, we use a tolerance rough set based clusterng method to get more topcal class. After feature selecton the non-topcal words are deserted. Ths could generate a compact feature space sutable for an onlne recommendaton system. Then ths feature space s extended nto a topcal approxmaton space usng the TRSM. By changng the TRSM parameters, the sze of the topcal approxmaton space can be controlled whch could balance the clusterng result and the runnng tme. Then for each class the topc label s extracted usng assocaton rule on the tolerance approxmaton feature space. A. Topcal Feature Selecton In order to get a topcal clusterng result, the topcal words suffcently descrbe the topc must be extracted. Several term weghtng schemes are used n selectng the key words from the web pages. Term frequency-nverse document frequency (TF-IDF) s wdely used n text mnng. The TF-IDF method can drop the hgh frequency words whch are dsaster for the clusterng. However, the hot topcs occurs n many web page wll be dropped by ths term weghtng method. So we usng dfferent feature selecton n the topc label extracton, whch gves the topcal words whch occur frequently on many web stes a hgh score. Another dsadvantage of the TF-IDF as well as other frequency based feature selecton s all the worlds are treated equally, whch s not sutable for a topcal clusterng. The topcal words are more mportant than other context n a topcal cluster. If the feature space contans more topcal words, the cluster result wll be more topcal. Usually, the web page contans the ttle fled, whch n most tme s the key words of the content and should be gven a hgh score than the content. The words close to or co-occur wth the ttle words can be used to enrch the topcal words. At frst, the web pages collected by crawler Pooler are purfed by a cleanng module to drop the scrpts and nose texts such as the advertsements. Some of the nonnformatve parts are dropped by ths cleanng module, such as the numbers, the hyperlnks n text area, the error message generated by the scrpt, the foot note of web pages. These no-topcal words usually cause the topcal drft n clusterng process. For example, the web pages are usually generated from some predefned models. The pages collected from the same web ste usually contan the same navgaton text, copy rght nformaton, as well as the foot note nformaton. The page co-occurrence of these texts s even hgher than the topcal words, whch lead to a low NTD dstance. Ths non-topcal dstance could be allevated by web sted dstance (NSD). However, these non-topcal dstances are low enough to drft the clusterng process n many tmes. Then the texts are processed by the Chnese word segmentaton and Named Entty Recognton system ELUS [17]. The ELUS system uses a trgram model wth smoothng algorthm for Chnese word segmentaton and a Maxmum entropy model for Named Entty Recognton. In the thrd SIGHAN-2006 bakeoff the ELUS system acheves F-measure 96.8%, whch s the best n MSRA open test. The part of speech features can be extracted effectvely usng ELUS system. The ttle words and the noun word of web pages are extracted as the ntal topcal words set. B. Topcal Words Enrchng There are few topcal words derved from the ttle. More topcal words can be found from the co-occur set of these topcal words, whch s the latent semantc features. Latent Semantc Analyss (LSA) method can be used to fnd the topcal words relaton. However ths knd of method usually cost too much complex computng for an onlne system. TRSM s another useful tools to mnng the latent semantc feasble for on-lne clusterng. The orgnal topcal words set can be extend to a upper approxmaton topcal words set n whch the two web page are related to a topc even f they are not related n the co-occur feather space. Tolerance Rough Set Model: Let U be the unverse, R U U be an equvalence relaton defned on U. A = (U, R) s an approxmaton space. In the orgnal rough set theory the lower and upper approxmaton of set X U wth respect to relaton R s defned as: RX [ ] = { x U:[ x] X}. (1) R RX [ ] = { x U:[ x] I X }. (2) Where [ X ] R = { y U xry} s the equvalence class of objects ndscernble wth x regardng R. Then the rough boundary of X can be defned as: R BN ( ) ( ) ( ) R X = R X R X. (3) A topc s always overlapped wth other related topcs, so we have to deal wth the overlappng classes derved from the clusterng process. The approxmaton of rough set theory s useful to deal wth ths knd of overlappng concept. The orgnal small sze of topcal words set can be enrchng usng ths approxmaton. However, the relaton R n orgnal rough set theory, whch s reflectve, symmetrc and transtve are too strct for natural language processng (NLP). As a matter of fact, these transtve n topcal clusterng always lead to a topc drft. Skowron and Stepanuk proposed a tolerance rough set model (TRSM) by relaxng the transtve restrct. The restrct on of the relaton R n TRSM s reflexve and symmetrc. The approxmaton space derved usng TRSM s called tolerance space. TRSM can be defned as a quadruple: = ( U, I, v, P). (4) Whch U s a non-empty set of objects, I : U P( U), v: P( U) P( U) [0,1], P: I( U) {0,1} are uncertanty

4 552 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 vague ncluson and structure functon where PIx ( ( )) = 1 for any x U. The lower and upper approxmaton n R for any X U can be defned as: RX [ ] = { x U: PIx ( ( )) = 1 vix ( ( ), X) = 1}. (5) RX [ ] = { x U: PIx ( ( )) = 1 vix ( ( ), X) > 0}. (6) The relaton R n TRSM for text nformaton retreval s usually defned as the co-occurrence n document set D. In our topcal feature approxmaton the R s defned as the topcal co-occurrence n all document set D. We tred many ways to defne the topcal co-occurrence. In ths paper we defne the topcal co-occurrence as the words co-occurred n the topcal area. The web page ttle s an mportant topcal area to generate the orgnal topcal word set (TWS). The topcal area s defned as the text area around the topcal word n TWS wth a dstance of l. If the page ttle contans only few topcal words not enough to descrbe the document the part of speech (POS) feature s used to enrch the feature set. Let D = { d1, d2, L, d n } be a set of documents, and T = { t1, t2l, t m } be a word set co-occurred n the topcal area of D. The topcal approxmaton space R = { T, Iθ, v, P} can be defned over T usng uncertanty functon I θ : I ( t ) = { t f ( t, t ) θ} U { t}. (7) θ j TO j Where θ s a user defned threshold, fto ( t, t j ) s the topcal co-occurrence functon of words t and t j n dstance l. The lower and upper topcal approxmaton can be defned as: RX [ ] = { t T: vi ( ( t), X) = 1}. (8) θ RX [ ] = { t T: vi ( ( t), X) > 0}. (9) The orgnal topcal word set can be derved from the web page ttle as well as the POS feature words. The upper topcal approxmaton R s a set of words whch related to the topc n the topcal approxmaton feature space. C. Topcal Clusterng Semantc Dstance Measurement of Topcal words: The measurement of word dstance s an mportant factor of the clusterng method. There are many ways to compute the semantc dstance between two topcal words. The manually labeled dctonary such as WordNet and synonym word tables can be used n dstance measurng. Ths knd of manually label based tools suffers many dsadvantages n topcal clusterng. Frst, the nconsstent between the manually labeled relatons and the new arrvng web pages usually lead to a topc drft n the topc clusterng process. The Internet has no dearth of content. Every tme there comes thousands of new web pages contans topcal words nconsstent wth the manually labeled relatons. Second, the commonly used θ WordNet lke lexcons has rare sense when used n specfc domans. Mantanng of labeled lexcons of a specfc doman s not easy. Some statstcal based method s proposed based on the assumpton that the two words tend to be related f they have a hgh co-occurrence. Ths method makes good use of the specfc word co-occurrence features n web pages. The Normalzed Google Dstance (NGD) s used n many related papers. The NGD method uses the number of pages contanng the words returned by Google as the measurement of the dstance between two words. The normalzed Google dstance between words w 1, w2 can be defned as : NGD( w1, w2) = max{log f( w1),log f( w2)} log f( w1, w2). (10) log N mn{log f( w ),log f( w )} 1 2 Where f( w ) are documents contans word w, N s the total number of the document set D, f( w1, w 2) s the number of web pages contanng word w 1 and w 2. Words wth hgh topc relaton are tendng to have low Google dstance. The NGD dstance can be derved usng the Google engne by countng the page numbers returned from Google. However the topcal approxmaton set s large and changng every tme the new page comes whch make t not feasble n practce. In ths research, we use a page set from a specfc doman to compute the dstance lke the NGD. Ths web page collecton comes from the specfc doman contans specfc features whch can not be extracted from a common web page collecton. The NGD lke dstance s generated by the page cooccurrence of words. The web pages have more abundant nformaton than text fles, such as the source webste names. At the begnnng of ths research we use the NGD lke dstance derved from our web page collectons. We fnd that the topcal words co-occurred n many webstes s more mportant than the words co-occurred n many web pages belongs to the same webste. In another word, the topcal words occur n many webstes (hgh webste co-occurrence words), s more mportant than the words wth hgh page co-occurrence. The source webste name s a useful feature whch could gve us more nformaton about the dstance between two topcal words. The webste frequency of topcal words can be derved from the URL. The dstance used n ths research s the Normalzed Topcal Dstance (NTD), whch composed of the Normalzed Page Dstance (NPD) and the Normalzed Ste Dstance (NSD). NPD( w1, w2) = max{log fp( w1),log fp( w2)} log fp( w1, w2). (11) log NP mn{log fp( w1),log fp( w2)} NSD( w1, w2) = max{log fs( w1),log fs( w2)} log fs( w1, w2). (12) log N mn{log f ( w ),log f ( w )} S S 1 S 2

5 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL Where f ( w ) and f ( w ) are respectvely the number P S of pages contan word w and the number of webstes contan word w. N P s the total number of web pages. N S s the total number of webstes. Then we can defne our NGD lke dstance NTD as follows: NTD = α NPD + α NSD. (13) P The parameter α P and α S are used to control the weght of NPD and NSD. By ths defnton, f two topcal words group co-occurred n the same number of pages but deferent number of webstes, the dstance between two words co-occurred n more webstes are tends to be closer. The NSD between two topcal words can be computed along wth the NPD computng process. The computng complexty of NTD s the same wth tradtonal NGD. The NGD lke method suffers a bas when two words co-occurred only wth each other. Snce low document frequency words are useless n clusterng process, a threshold s used to drop these words. The hgh document frequency words are also dropped by an upper threshold. Topcal Clusterng Process: After the topcal words selecton all the web pages are presented by the topcal words set. These topcal words are clustered nto groups that each group contans the topcal words hghly corelated n the topcal approxmaton space usng complete-lnk clusterng. The complete-lnk clusterng tend to produce tghtly bounded compact clusters than the sngle-lnk clusterng, whch s helpful to generate as much topc as the topcal approxmaton space contans. Cluster Mergng: The words from the same topc may be clustered nto dfferent groups due to the tghtly bounded clusters generated by the complete-lnk clusterng. There are too many groups to be ntegrated nto our hot news recommendaton system. As we consder the NSD n our dstance measurement, the topcal words wth both hgh page co-occurrence and webste co-occurrence tends to have low NTD, whch makes these worlds more lkely to be clustered nto a small class wth only several topcal words. These groups should be merged usng other topcal relaton measure method. The document overlap of each class can be used to deal wth these small classes. All the clusters generated by the frst clusterng process are merged by the document overlap between them. The overlap between two class C1 and C 2 s defned as: S tradtonal cluster label extracton method can not well descrbe the topcal meanngs. Usng assocaton rules method we can generate topcal labels sutable for on-lne news recommendaton system. The assocaton rule extracton s proved to be effectve n recommendaton system based on DB or other short records. These knds of records are compact and commonly contan few words, such as the shoppng record. The web pages, however, contans a large set of words whch cost more computng tme. The computng cost could be allevated by usng the assocaton rules extracton algorthm n each class generated by the topcal clusterng. The output of the topc label generaton s a group of topcal words whch s the hot topcal words. The frst step s select canddate hot topcal words from the web page of each class. The topcal words sets are selected n the topcal clusterng process. But some hgh frequency topcal words are dropped by the feature selecton. In topc label generaton we select all the hgh frequency noun words nto the canddate topcal words set. The hotness H ( W ) of each topcal world W s defned as: Ste( W) Page( W) HW ( ) = λ1 + λ2. (15) S P Where Ste( W ) s the number of webste whch contans a web page related to topcal word W, S s the total number of webste select by the rankng component, Page( W ) s the number of web pages contanng topcal words W, P s the total number of pages. C1I C2 OL( C1, C2) =. (14) MIN{ C1, C2 } Where C s the document number n group C. In the experments the threshold of document overlap s set to D. Topcal Label Generaton The topcal label generaton requres more topcal related label. In feature selecton process the topcal words wth hgh document frequency s dropped, whch may be the topcal words. The label generated by

6 554 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 As the tradtonal assocaton rules extracton compute the support value between each record, we compute the support between each canddate hot topcal words. The support of two topcal word set WS 1 and WS 2 can be calculated as follow: Ste( WS1IWS2) SUP( WS1, WS2) = λ1 max{ Ste( WS1), Ste( WS2) }. (16) Page( WS1IWS2) + λ2 max{ Page( WS1), Page( WS2)} Where the support of two topcal word set composed of two parts: the webste frequency support as well as the document frequency support. Parameter λ 1 and λ 2 can be used to control the hot weght caused by webste and page. In ths research λ 1 s set to 0.75 and λ 2 s set to Takng the top N topcal words as the canddate hot word, the topc label generaton can be descrbed as Algorthm 1. In the topc label generaton, each of the hot topcal word s a member of the label set L. Then the support between each l, lj L s computed. If the support value reaches the threshold the unon set of l, l j becomes a new member of the label set. After each supportng compute loop, the new members n T wll be merged when the overlap of the two members set reach the cover threshold θ c. The fnal label set for ths class s the one contans the most of the topcal words. V. EXPERIMENTS AND RESULTS Ths research s desgned to mprove the fnancal news recommendaton system, whch s an mportant part of the hatanyuan knowledge servce platform. The topcs are ranked by the number of topcal words as well as the number of pages. Then the system returns the most mportant page ttles derved from the top topcs wthout the label generaton process. Wth the topcal clusterng method n ths research our recommendaton system wll be able to recommend hot topcs of current fnancal webste whch s more useful than the hot ttles. Usually, the user s experence s the most useful tools n the evaluaton of a clusterng algorthm, whch makes the comparng of dfferent clusterng algorthms a nontrval task. A. En vronment and Testng Corpus The recommendaton system s mplemented usng c++ and PHP on Lnux system. The crawler s runnng on a Lnux machne shows n table II. The web page corpus s randomly select from the news pages refreshed n October, The non-nformatve pages are deserted. B. Mnmum Word Frequency Mnmum word frequency threshold θ wf s very mportant for the topcal clusterng. The low threshold wll takes more words come nto the topcal word set, whch makes the clusterng generate more basc class and has a hgh Coverage. However, the tme cost s growng wth the class number. The nose word n the TF-IDF based feature selecton wll generate many useless class and the nontopcal words wll drft the topcal clusterng process. Fg. 2 shows the relaton between the mnmum term frequency threshold and the base class number n our topcal approxmaton space compared wth TF-IDF feature space n a corpus contanng 2500 web pages. The threshold dstance used n clusterng s set to 0.3. Both methods wll get a small number of classes usng a hgh threshold and a large number of classes usng a low threshold. The relaton between threshold and class coverage can be seen n Fg. 3. Both methods get a hgh coverage n low threshold and a low coverage n hgh threshold. As can be seen from Fg. 3, the POS feature and tolerance rough set based approxmaton feature space could drop the non-topcal relatons between non-topcal words; reduce the number of class and runnng tme, whch s sutable for onlne usng. The class coverage of topcal approxmaton feature space s decrease for the non-topcal words are dropped n feature selecton. C. Tme Cost In order to test the tme complexty of ths algorthm, we sampled sets contanng dfferent pages from 500 to Fg. 4 shows the tme cost lne for dfferent numbers of web page sets whch s near to lnear. As we can see from Fg. 1(a), there are 5000 to new web pages refreshed every day. The page number can be controlled by dscardng the redundant and topcal less pages to get a reasonable tme cost for a 24-hour on-lne hot topc recommendaton system. VI. CONCLUSION AND FUTURE WORK In ths paper, a tolerance rough set based clusterng method s presented to detect the topcs n fnancal web ste news pages. In the feature selecton we use the part of speech feature and the ttle words to form ntal topcal word set. Then, ths topcal word set are extended by the TRSM usng the co-occurrence of topcal words n dstance l. Based on the NGD method, the web ste dstance (NSD) s ntroduced nto the new NGD lke dstance measure method: NTD. Usng NTD dstance the news pages are clustered nto more topcal class than the tradtonal feature space. The class label s generated by the assocaton rule method. The experments show that the tolerance rough set based topcal approxmaton space can explot the latent semantc under the co-occurrence of topcal words. Ths method allevates the non-topcal word caused topcal drftng, reduces the non-topcal class numbers as well as the runnng tme. The topcal clusterng and label generaton method could generate more topcal class than the tradtonal TF-IDF features. For the applcaton of ths method, there are some future tasks as how to detect the web pages n whch the ttle nconsstent wth the content, fndng the most mportant topcs from the clusterng result. The ttle words of web pages are treated mportant n our method,

7 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL whch s a good topcal feature for the web pages wth page ttles consstent wth the page contents. For the web pages wth ttle nconsstent wth the contents, we must fnd method to fnd the real topcal words set. Part of ths research s ntegrated nto our hot fnancal news detecton system ( whch s an mportant part of knowledge servce plat form hatanyuan. TABLE II. SYSTEM RUNING ENVIRONMENT CPU Intel Pentum 3.0 Memory Dsk 2G 320G ACKNOWLEDGMENT The authors wsh to thank Dr. Xanjun Meng for hs helpful suggestons. Ths nvestgaton was supported n part by the Natonal Natural Scence Foundaton of Chna( No ) and the natonal 863 Program of Chna(No. 2006AA01Z197, No. 2007AA01Z194). Fgure 2. Mnmum word frequency threshold and class numbers Fgure 3. Mnmum world frequency threshold and class coverage Fgure 4. Runnng tme for dfferent web page numbers REFERENCES [1] TDT2004, The 2004 topc detecton and trackng task defnton and evaluaton plan, [Onlne]. Avalable: [2] J. Allan, R. Papka, and V. Lavrenko, On-lne new event detecton and trackng, SIGIR 98: Proceedngs of the 21st annual nternatonal ACM SIGIR conference on Research and development n nformaton retreval, pp , November [3] B. Sun, P. Mtra, H. Zha, C. Gles, and J. Yen, Topc segmentaton wth shared topc detecton and algnment of multple documents, SIGIR 07, pp , July [4] A. S. Das, M. Datar, A. Garg, and S. Rajaram, Google news personalzaton: scalable onlne collaboratve flterng, n WWW 07: Proceedngs of the 16th nternatonal conference on World Wde Web. New York, NY, USA: ACM, 2007, pp [5] K.-Y. Chen, L. Luesukprasert, and S. cho T.Chou, Hot topc extracton based on tmelne analyss and multdmensonal sentence modelng, IEEE Transactons on Knowledge and Data Engneerng, pp , August [6] A. K. Jan, M. N. Murty, and P. J. Flynn, Data clusterng: a revew, ACM Computer Survey, vol. 31, no. 3, pp , [7] Z. Pawlak, Rough sets, Internatonal Journal of Computer and Informaton Scences. 11(5), pp , [8] K. Funakosh and T. B. Ho, A Rough Set Approach to Informaton Retreval. In: L. Polkowsk and A. Skowron, (eds.) Rough Sets n Knowledge Dscovery. Erewhon, NC: Physca-Verlag, [9] A. Skowron and J. Stepanuk, Tolerance approxmaton space, 3 rd Internatonal Workshop on Rough Sets and Soft computng, pp , November [10] J. Han and Y. Fu, Dscovery of multple-level assocaton rules from large databases, n Proceedngs of the 21st VLDB Conference, 1995, pp [11] R. Agrawal, T. Imel nsk, and A. Swam, Mnng assocaton rules between sets of tems n large databases, n SIGMOD 93: Proceedngs of the 1993 ACM SIGMOD nternatonal conference on Management of data. New York, NY, USA: ACM, 1993, pp [12] S. W. Changchen and T. Lu, Mnng assocaton rules procedure to support on-lne recommendaton by customers and products fragmentaton, Expert System wth Applcatons, vol. 20, pp , [13] M.Degado, M. Martn-Bautsta, D. Sanchez, and J.M.Serrano, Assocaton rule extracton for text mnng, n Lecture Notes n Computer Scence, 2002, pp

8 556 JOURNAL OF COMPUTERS, VOL. 5, NO. 4, APRIL 2010 [14] Y. Dng, X. Wang, L. Ln, and Y. Wu, The desgn and mplementaton of the crawler-nar, ICMLC, pp , August [15] Y. Wu, X. Wang, Y. dng, and J. Xu, Desgn and mplementaton of a c/s structured dstrbuted mult-thread crawler usng ace and pvfs, Journal of Computatonal Informaton System, vol. 4, no. 3, pp , August [16] G. Feng, T.-Y. Lu, Y. Wang, Y. Bao, Z. Ma, X.-D. Zhang and W.-Y. Ma, Aggregaterank: brngng order to web stes, n SIGIR 06: Proceedngs of the 29th annual nternatonal ACM SIGIR conference on Research and development n nformaton retreval. New York, NY, USA: ACM, 2006, pp [17] W. Jang, Y. Guan, amd XL. Wang A Pragmatc Chnese Word Segmentaton System Proceedngs of the Ffth SIGHAN Workshop on Chnese Language Processng, Sydney, July 2006, pp

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval Proceedngs of the Thrd NTCIR Workshop Descrpton of NTU Approach to NTCIR3 Multlngual Informaton Retreval Wen-Cheng Ln and Hsn-Hs Chen Department of Computer Scence and Informaton Engneerng Natonal Tawan

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

A Method of Hot Topic Detection in Blogs Using N-gram Model

A Method of Hot Topic Detection in Blogs Using N-gram Model 84 JOURNAL OF SOFTWARE, VOL. 8, NO., JANUARY 203 A Method of Hot Topc Detecton n Blogs Usng N-gram Model Xaodong Wang College of Computer and Informaton Technology, Henan Normal Unversty, Xnxang, Chna

More information

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques Enhancement of Infrequent Purchased Product Recommendaton Usng Data Mnng Technques Noraswalza Abdullah, Yue Xu, Shlomo Geva, and Mark Loo Dscplne of Computer Scence Faculty of Scence and Technology Queensland

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

An Improved Image Segmentation Algorithm Based on the Otsu Method

An Improved Image Segmentation Algorithm Based on the Otsu Method 3th ACIS Internatonal Conference on Software Engneerng, Artfcal Intellgence, Networkng arallel/dstrbuted Computng An Improved Image Segmentaton Algorthm Based on the Otsu Method Mengxng Huang, enjao Yu,

More information

Document Representation and Clustering with WordNet Based Similarity Rough Set Model

Document Representation and Clustering with WordNet Based Similarity Rough Set Model IJCSI Internatonal Journal of Computer Scence Issues, Vol. 8, Issue 5, No 3, September 20 ISSN (Onlne): 694-084 www.ijcsi.org Document Representaton and Clusterng wth WordNet Based Smlarty Rough Set Model

More information

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images Internatonal Journal of Informaton and Electroncs Engneerng Vol. 5 No. 6 November 015 Usng Fuzzy Logc to Enhance the Large Sze Remote Sensng Images Trung Nguyen Tu Huy Ngo Hoang and Thoa Vu Van Abstract

More information

Domain Thesaurus Construction from Wikipedia *

Domain Thesaurus Construction from Wikipedia * Internatonal Conference on Computer, Networks and Communcaton Engneerng (ICCNCE 2013) Doman Thesaurus Constructon from Wkpeda * WenKe Yn 1, Mng Zhu 2, TanHao Chen 2 1 Department of Electronc Engneerng

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Design of Structure Optimization with APDL

Design of Structure Optimization with APDL Desgn of Structure Optmzaton wth APDL Yanyun School of Cvl Engneerng and Archtecture, East Chna Jaotong Unversty Nanchang 330013 Chna Abstract In ths paper, the desgn process of structure optmzaton wth

More information

Network Intrusion Detection Based on PSO-SVM

Network Intrusion Detection Based on PSO-SVM TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

Study of Data Stream Clustering Based on Bio-inspired Model

Study of Data Stream Clustering Based on Bio-inspired Model , pp.412-418 http://dx.do.org/10.14257/astl.2014.53.86 Study of Data Stream lusterng Based on Bo-nspred Model Yngme L, Mn L, Jngbo Shao, Gaoyang Wang ollege of omputer Scence and Informaton Engneerng,

More information

Innovation Typology. Collaborative Authoritativeness. Focused Web Mining. Text and Data Mining In Innovation. Generational Models

Innovation Typology. Collaborative Authoritativeness. Focused Web Mining. Text and Data Mining In Innovation. Generational Models Text and Data Mnng In Innovaton Joseph Engler Innovaton Typology Generatonal Models 1. Lnear or Push (Baroque) 2. Pull (Romantc) 3. Cyclc (Classcal) 4. Strategc (New Age) 5. Collaboratve (Polyphonc) Collaboratve

More information

Positive Semi-definite Programming Localization in Wireless Sensor Networks

Positive Semi-definite Programming Localization in Wireless Sensor Networks Postve Sem-defnte Programmng Localzaton n Wreless Sensor etworks Shengdong Xe 1,, Jn Wang, Aqun Hu 1, Yunl Gu, Jang Xu, 1 School of Informaton Scence and Engneerng, Southeast Unversty, 10096, anjng Computer

More information

A Method of Query Expansion Based on Event Ontology

A Method of Query Expansion Based on Event Ontology A Method of Query Expanson Based on Event Ontology Zhaoman Zhong, Cunhua L, Yan Guan, Zongtan Lu A Method of Query Expanson Based on Event Ontology 1 Zhaoman Zhong, 1 Cunhua L, 1 Yan Guan, 2 Zongtan Lu,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Cross-lingual Pseudo Relevance Feedback Based on Weak Relevant Topic Alignment

Cross-lingual Pseudo Relevance Feedback Based on Weak Relevant Topic Alignment Cross-lngual Pseudo Relevance Feedback Based on Weak Relevant opc Algnment WANG Xu-wen Insttute of Medcal Informaton & Lbrary, Chnese Academy of Medcal Scences, Beng 100020 wang.xuwen@mcams.ac.cn ZHANG

More information

ApproxMGMSP: A Scalable Method of Mining Approximate Multidimensional Sequential Patterns on Distributed System

ApproxMGMSP: A Scalable Method of Mining Approximate Multidimensional Sequential Patterns on Distributed System ApproxMGMSP: A Scalable Method of Mnng Approxmate Multdmensonal Sequental Patterns on Dstrbuted System Changha Zhang, Kongfa Hu, Zhux Chen, Lng Chen Department of Computer Scence and Engneerng, Yangzhou

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems Determnng Fuzzy Sets for Quanttatve Attrbutes n Data Mnng Problems ATTILA GYENESEI Turku Centre for Computer Scence (TUCS) Unversty of Turku, Department of Computer Scence Lemmnkäsenkatu 4A, FIN-5 Turku

More information

Parallel and Distributed Association Rule Mining - Dr. Giuseppe Di Fatta. San Vigilio,

Parallel and Distributed Association Rule Mining - Dr. Giuseppe Di Fatta. San Vigilio, Parallel and Dstrbuted Assocaton Rule Mnng - Dr. Guseppe D Fatta fatta@nf.un-konstanz.de San Vglo, 18-09-2004 1 Overvew Assocaton Rule Mnng (ARM) Apror algorthm Hgh Performance Parallel and Dstrbuted Computng

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some materal adapted from Mohamed Youns, UMBC CMSC 611 Spr 2003 course sldes Some materal adapted from Hennessy & Patterson / 2003 Elsever Scence Performance = 1 Executon tme Speedup = Performance (B)

More information

A Web Site Classification Approach Based On Its Topological Structure

A Web Site Classification Approach Based On Its Topological Structure Internatonal Journal on Asan Language Processng 20 (2):75-86 75 A Web Ste Classfcaton Approach Based On Its Topologcal Structure J-bn Zhang,Zh-mng Xu,Kun-l Xu,Q-shu Pan School of Computer scence and Technology,Harbn

More information

A Resources Virtualization Approach Supporting Uniform Access to Heterogeneous Grid Resources 1

A Resources Virtualization Approach Supporting Uniform Access to Heterogeneous Grid Resources 1 A Resources Vrtualzaton Approach Supportng Unform Access to Heterogeneous Grd Resources 1 Cunhao Fang 1, Yaoxue Zhang 2, Song Cao 3 1 Tsnghua Natonal Labatory of Inforamaton Scence and Technology 2 Department

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

TF 2 P-growth: An Efficient Algorithm for Mining Frequent Patterns without any Thresholds

TF 2 P-growth: An Efficient Algorithm for Mining Frequent Patterns without any Thresholds TF 2 P-growth: An Effcent Algorthm for Mnng Frequent Patterns wthout any Thresholds Yu HIRATE, Ego IWAHASHI, and Hayato YAMANA Graduate School of Scence and Engneerng, Waseda Unversty {hrate, ego, yamana}@yama.nfo.waseda.ac.jp

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

A CALCULATION METHOD OF DEEP WEB ENTITIES RECOGNITION

A CALCULATION METHOD OF DEEP WEB ENTITIES RECOGNITION A CALCULATION METHOD OF DEEP WEB ENTITIES RECOGNITION 1 FENG YONG, DANG XIAO-WAN, 3 XU HONG-YAN School of Informaton, Laonng Unversty, Shenyang Laonng E-mal: 1 fyxuhy@163.com, dangxaowan@163.com, 3 xuhongyan_lndx@163.com

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

Association Rule Mining with Parallel Frequent Pattern Growth Algorithm on Hadoop

Association Rule Mining with Parallel Frequent Pattern Growth Algorithm on Hadoop Assocaton Rule Mnng wth Parallel Frequent Pattern Growth Algorthm on Hadoop Zhgang Wang 1,2, Guqong Luo 3,*,Yong Hu 1,2, ZhenZhen Wang 1 1 School of Software Engneerng Jnlng Insttute of Technology Nanng,

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

FAHP and Modified GRA Based Network Selection in Heterogeneous Wireless Networks

FAHP and Modified GRA Based Network Selection in Heterogeneous Wireless Networks 2017 2nd Internatonal Semnar on Appled Physcs, Optoelectroncs and Photoncs (APOP 2017) ISBN: 978-1-60595-522-3 FAHP and Modfed GRA Based Network Selecton n Heterogeneous Wreless Networks Xaohan DU, Zhqng

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Personalized Concept-Based Clustering of Search Engine Queries

Personalized Concept-Based Clustering of Search Engine Queries IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID 1 Personalzed Concept-Based Clusterng of Search Engne Queres Kenneth Wa-Tng Leung, Wlfred Ng, and Dk Lun Lee Abstract The exponental growth of nformaton

More information

Alignment Results of SOBOM for OAEI 2010

Alignment Results of SOBOM for OAEI 2010 Algnment Results of SOBOM for OAEI 2010 Pegang Xu, Yadong Wang, Lang Cheng, Tany Zang School of Computer Scence and Technology Harbn Insttute of Technology, Harbn, Chna pegang.xu@gmal.com, ydwang@ht.edu.cn,

More information

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm Recommended Items Ratng Predcton based on RBF Neural Network Optmzed by PSO Algorthm Chengfang Tan, Cayn Wang, Yuln L and Xx Q Abstract In order to mtgate the data sparsty and cold-start problems of recommendaton

More information

Feature-Based Matrix Factorization

Feature-Based Matrix Factorization Feature-Based Matrx Factorzaton arxv:1109.2271v3 [cs.ai] 29 Dec 2011 Tanq Chen, Zhao Zheng, Quxa Lu, Wenan Zhang, Yong Yu {tqchen,zhengzhao,luquxa,wnzhang,yyu}@apex.stu.edu.cn Apex Data & Knowledge Management

More information

A User Selection Method in Advertising System

A User Selection Method in Advertising System Int. J. Communcatons, etwork and System Scences, 2010, 3, 54-58 do:10.4236/jcns.2010.31007 Publshed Onlne January 2010 (http://www.scrp.org/journal/jcns/). A User Selecton Method n Advertsng System Shy

More information

Information Retrieval

Information Retrieval Anmol Bhasn abhasn[at]cedar.buffalo.edu Moht Devnan mdevnan[at]cse.buffalo.edu Sprng 2005 #$ "% &'" (! Informaton Retreval )" " * + %, ##$ + *--. / "#,0, #'",,,#$ ", # " /,,#,0 1"%,2 '",, Documents are

More information

Performance Assessment and Fault Diagnosis for Hydraulic Pump Based on WPT and SOM

Performance Assessment and Fault Diagnosis for Hydraulic Pump Based on WPT and SOM Performance Assessment and Fault Dagnoss for Hydraulc Pump Based on WPT and SOM Be Jkun, Lu Chen and Wang Zl PERFORMANCE ASSESSMENT AND FAULT DIAGNOSIS FOR HYDRAULIC PUMP BASED ON WPT AND SOM. Be Jkun,

More information

LinkSelector: A Web Mining Approach to. Hyperlink Selection for Web Portals

LinkSelector: A Web Mining Approach to. Hyperlink Selection for Web Portals nkselector: A Web Mnng Approach to Hyperlnk Selecton for Web Portals Xao Fang and Olva R. u Sheng Department of Management Informaton Systems Unversty of Arzona, AZ 8572 {xfang,sheng}@bpa.arzona.edu Submtted

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

BIN XIA et al: AN IMPROVED K-MEANS ALGORITHM BASED ON CLOUD PLATFORM FOR DATA MINING

BIN XIA et al: AN IMPROVED K-MEANS ALGORITHM BASED ON CLOUD PLATFORM FOR DATA MINING An Improved K-means Algorthm based on Cloud Platform for Data Mnng Bn Xa *, Yan Lu 2. School of nformaton and management scence, Henan Agrcultural Unversty, Zhengzhou, Henan 450002, P.R. Chna 2. College

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Impact of a New Attribute Extraction Algorithm on Web Page Classification

Impact of a New Attribute Extraction Algorithm on Web Page Classification Impact of a New Attrbute Extracton Algorthm on Web Page Classfcaton Gösel Brc, Banu Dr, Yldz Techncal Unversty, Computer Engneerng Department Abstract Ths paper ntroduces a new algorthm for dmensonalty

More information

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment A Webpage Smlarty Measure for Web Sessons Clusterng Usng Sequence Algnment Mozhgan Azmpour-Kv School of Engneerng and Scence Sharf Unversty of Technology, Internatonal Campus Ksh Island, Iran mogan_az@ksh.sharf.edu

More information

Object-Based Techniques for Image Retrieval

Object-Based Techniques for Image Retrieval 54 Zhang, Gao, & Luo Chapter VII Object-Based Technques for Image Retreval Y. J. Zhang, Tsnghua Unversty, Chna Y. Y. Gao, Tsnghua Unversty, Chna Y. Luo, Tsnghua Unversty, Chna ABSTRACT To overcome the

More information

Scheduling Remote Access to Scientific Instruments in Cyberinfrastructure for Education and Research

Scheduling Remote Access to Scientific Instruments in Cyberinfrastructure for Education and Research Schedulng Remote Access to Scentfc Instruments n Cybernfrastructure for Educaton and Research Je Yn 1, Junwe Cao 2,3,*, Yuexuan Wang 4, Lanchen Lu 1,3 and Cheng Wu 1,3 1 Natonal CIMS Engneerng and Research

More information

Remote Sensing Image Retrieval Algorithm based on MapReduce and Characteristic Information

Remote Sensing Image Retrieval Algorithm based on MapReduce and Characteristic Information Remote Sensng Image Retreval Algorthm based on MapReduce and Characterstc Informaton Zhang Meng 1, 1 Computer School, Wuhan Unversty Hube, Wuhan430097 Informaton Center, Wuhan Unversty Hube, Wuhan430097

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information

PROPERTIES OF BIPOLAR FUZZY GRAPHS

PROPERTIES OF BIPOLAR FUZZY GRAPHS Internatonal ournal of Mechancal Engneerng and Technology (IMET) Volume 9, Issue 1, December 018, pp. 57 56, rtcle ID: IMET_09_1_056 valable onlne at http://www.a aeme.com/jmet/ssues.asp?typeimet&vtype

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Impact of Contextual Information for Hypertext Documents Retrieval

Impact of Contextual Information for Hypertext Documents Retrieval Impact of Contextual Informaton for Hypertext ocuments Retreval Idr Chbane and Bch-Lên oan SUPELEC Computer Scence dpt. Plateau de Moulon 3 rue Jolot Cure 9 92 Gf/Yvette France {Idr.Chbane Bch-Len.oan}@supelec.fr

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

A high precision collaborative vision measurement of gear chamfering profile

A high precision collaborative vision measurement of gear chamfering profile Internatonal Conference on Advances n Mechancal Engneerng and Industral Informatcs (AMEII 05) A hgh precson collaboratve vson measurement of gear chamferng profle Conglng Zhou, a, Zengpu Xu, b, Chunmng

More information

Combining Multiple Resources, Evidence and Criteria for Genomic Information Retrieval

Combining Multiple Resources, Evidence and Criteria for Genomic Information Retrieval Combnng Multple Resources, Evdence and Crtera for Genomc Informaton Retreval Luo S 1, Je Lu 2 and Jame Callan 2 1 Department of Computer Scence, Purdue Unversty, West Lafayette, IN 47907, USA ls@cs.purdue.edu

More information

User Tweets based Genre Prediction and Movie Recommendation using LSI and SVD

User Tweets based Genre Prediction and Movie Recommendation using LSI and SVD User Tweets based Genre Predcton and Move Recommendaton usng LSI and SVD Saksh Bansal, Chetna Gupta Department of CSE/IT Jaypee Insttute of Informaton Technology,sec-62 Noda, Inda sakshbansal76@gmal.com,

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Classic Term Weighting Technique for Mining Web Content Outliers

Classic Term Weighting Technique for Mining Web Content Outliers Internatonal Conference on Computatonal Technques and Artfcal Intellgence (ICCTAI'2012) Penang, Malaysa Classc Term Weghtng Technque for Mnng Web Content Outlers W.R. Wan Zulkfel, N. Mustapha, and A. Mustapha

More information

CUM: An Efficient Framework for Mining Concept Units

CUM: An Efficient Framework for Mining Concept Units CUM: An Effcent Framework for Mnng Concept Unts P.Santh Thlagam Ananthanarayana V.S Department of Informaton Technology Natonal Insttute of Technology Karnataka - Surathkal Inda 575025 santh_soc@yahoo.co.n,

More information

Data Preprocessing Based on Partially Supervised Learning Na Liu1,2, a, Guanglai Gao1,b, Guiping Liu2,c

Data Preprocessing Based on Partially Supervised Learning Na Liu1,2, a, Guanglai Gao1,b, Guiping Liu2,c 6th Internatonal Conference on Informaton Engneerng for Mechancs and Materals (ICIMM 2016) Data Preprocessng Based on Partally Supervsed Learnng Na Lu1,2, a, Guangla Gao1,b, Gupng Lu2,c 1 College of Computer

More information

KIDS Lab at ImageCLEF 2012 Personal Photo Retrieval

KIDS Lab at ImageCLEF 2012 Personal Photo Retrieval KD Lab at mageclef 2012 Personal Photo Retreval Cha-We Ku, Been-Chan Chen, Guan-Bn Chen, L-J Gaou, Rong-ng Huang, and ao-en Wang Knowledge, nformaton, and Database ystem Laboratory Department of Computer

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

Image Representation & Visualization Basic Imaging Algorithms Shape Representation and Analysis. outline

Image Representation & Visualization Basic Imaging Algorithms Shape Representation and Analysis. outline mage Vsualzaton mage Vsualzaton mage Representaton & Vsualzaton Basc magng Algorthms Shape Representaton and Analyss outlne mage Representaton & Vsualzaton Basc magng Algorthms Shape Representaton and

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Extraction of User Preferences from a Few Positive Documents

Extraction of User Preferences from a Few Positive Documents Extracton of User Preferences from a Few Postve Documents Byeong Man Km, Qng L Dept. of Computer Scences Kumoh Natonal Insttute of Technology Kum, kyungpook, 730-70,South Korea (Bmkm, lqng)@se.kumoh.ac.kr

More information

A Simple Methodology for Database Clustering. Hao Tang 12 Guangdong University of Technology, Guangdong, , China

A Simple Methodology for Database Clustering. Hao Tang 12 Guangdong University of Technology, Guangdong, , China for Database Clusterng Guangdong Unversty of Technology, Guangdong, 0503, Chna E-mal: 6085@qq.com Me Zhang Guangdong Unversty of Technology, Guangdong, 0503, Chna E-mal:64605455@qq.com Database clusterng

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

A Knowledge Management System for Organizing MEDLINE Database

A Knowledge Management System for Organizing MEDLINE Database A Knowledge Management System for Organzng MEDLINE Database Hyunk Km, Su-Shng Chen Computer and Informaton Scence Engneerng Department, Unversty of Florda, Ganesvlle, Florda 32611, USA Wth the exploson

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures A Novel Adaptve Descrptor Algorthm for Ternary Pattern Textures Fahuan Hu 1,2, Guopng Lu 1 *, Zengwen Dong 1 1.School of Mechancal & Electrcal Engneerng, Nanchang Unversty, Nanchang, 330031, Chna; 2. School

More information

Maximum Variance Combined with Adaptive Genetic Algorithm for Infrared Image Segmentation

Maximum Variance Combined with Adaptive Genetic Algorithm for Infrared Image Segmentation Internatonal Conference on Logstcs Engneerng, Management and Computer Scence (LEMCS 5) Maxmum Varance Combned wth Adaptve Genetc Algorthm for Infrared Image Segmentaton Huxuan Fu College of Automaton Harbn

More information