Classic Term Weighting Technique for Mining Web Content Outliers


International Conference on Computational Techniques and Artificial Intelligence (ICCTAI'2012) Penang, Malaysia

W.R. Wan Zulkifeli, N. Mustapha, and A. Mustapha

W. R. Wan Zulkifeli is with the Department of Computer Science, Faculty of Computer Science and Information Technology, University Putra Malaysia, Serdang, Selangor, Malaysia (e-mail: wanrusila@gmail.com). N. Mustapha is with the Department of Computer Science, Faculty of Computer Science and Information Technology, University Putra Malaysia, Serdang, Selangor, Malaysia (e-mail: norwati@fsktm.upm.edu.my). A. Mustapha is with the Department of Computer Science, Faculty of Computer Science and Information Technology, University Putra Malaysia, Serdang, Selangor, Malaysia (e-mail: aida@fsktm.upm.edu.my).

Abstract

Outlier analysis has become a popular topic in the field of data mining, but there has been less work on how to detect outliers in web content. Mining web content outliers is used to detect irrelevant web content within a web portal. Term Frequency (TF) techniques from Information Retrieval (IR) have been used to detect the relevancy of a term in a web document. However, when document length varies, relative frequency is preferred. This study used maximum frequency normalization and applied the Inverse Document Frequency (IDF) weighting technique, a traditional term weighting method in IR, to exploit the less frequent terms among documents, which are considered more discriminative than frequent terms. The dataset is from the 20 Newsgroups Dataset. TF.IDF is used in the dissimilarity measure, and the result achieves up to 91.10% accuracy, which is about 17.77% higher than the previous technique.

Keywords: information retrieval, outliers, term weighting, web content

I. INTRODUCTION

In the past few years, there has been a rapid expansion of activities in the Web Content Mining area. However, the focus was only on the technical aspects, visual design, and frequent web content patterns, while the less frequent web content patterns, called outliers, were undervalued. Web content outlier mining focuses on detecting an irrelevant web page among the rest of the web pages under the same category [3],[5]. It is not only helpful for detecting outliers when a web portal is hacked, but may also lead to the discovery of emerging business patterns and trends [12]. Unlike traditional outlier mining algorithms designed solely for numeric data sets, web outlier mining algorithms should be applicable to varying types of data such as text, hypertext, video, audio, image, and HTML tags [11]. There are two groups of web content outlier mining strategies: those that directly mine the content of documents to discover information about outliers, and those that reject outliers to improve the search content of other tools such as search engines.

Web content outlier mining is related to data outlier mining and text outlier mining, because many data mining techniques can be applied in Web content mining and most web content is text. However, it differs from data mining and text mining because Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data and text mining focuses only on unstructured text. Web content outlier mining thus requires creative applications of data outlier mining and/or text outlier mining techniques to build its own unique approaches.

Both n-gram based and word-based techniques are usable in the preprocessing part of mining web content outliers. The n-gram based technique is widely used to decompose and slice a word into substrings of size n. N-gram based techniques are suitable for web content outlier mining because the fixed-length concept helps memory utilization, and because they support partial matching of strings, which is good for outlier detection [11],[12],[14]. However, n-gram based systems become slow for very large datasets because of the huge number of n-gram vectors generated while mining web content outliers [14]. The word-based technique, in contrast, simply keeps each word at its original size.
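To make the contrast between the two techniques concrete, the following minimal sketch slices a word into overlapping character n-grams and measures partial matching between two strings. The function names `char_ngrams` and `ngram_overlap` are illustrative only and do not come from the cited systems.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Slice a word into overlapping substrings of fixed length n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_overlap(a: str, b: str, n: int) -> float:
    """Fraction of a's n-grams that also occur in b (a partial-match score)."""
    grams_a = char_ngrams(a, n)
    if not grams_a:
        return 0.0
    grams_b = set(char_ngrams(b, n))
    return sum(g in grams_b for g in grams_a) / len(grams_a)


# A misspelled word still shares most of its 3-grams with the correct word,
# whereas a full-word comparison simply fails.
print(char_ngrams("mining", 3))
print(ngram_overlap("mining", "minning", 3))
print("mining" == "minning")
```

The number of n-grams grows with the length of every word in every document, which is consistent with the slowdown on large datasets noted above, while a word-based profile stores one entry per distinct word.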
Although the words are of variable length, the efficiency of word-based web content outlier mining can be increased by indexing the words in a two-dimensional format (i, j) and indexing the domain dictionary by word length [4], [6]. The organized domain dictionary reduces the memory space, search time, and run time needed to check the relevancy of the web documents [4]. N-gram based systems take longer to complete a task than word-based systems even when the data is not very large. This problem increases the need for a word-based technique in web content outlier mining to speed up processing, given the exponential growth of data on the internet.

Term weighting techniques such as TF.IDF [7] have been used intensively for various text retrieval tasks. A wealth of approaches to model the term vector space has been proposed [1],[2],[8],[10], but interest in applying those techniques to mining web content outliers has so far been limited. In this paper, we used the classic vector space technique, TF.IDF, to assess its compatibility with mining web content outliers.

II. RELATED WORKS

Weighting techniques have been used in mining web content outliers, but the concept differs from term weighting techniques in Information Retrieval. The term weight assigned to text in web content depends on which HTML tags enclose the text. META and TITLE tags are given a larger weight than BODY tags because they give a better representation of the web content. Relative Document Weight (RDW) used this concept. It can compare documents of varying sizes in the same category, but the issue is that most web pages do not have a META tag description [11]. This technique was then modified into an n-gram weighting technique, which uses n-grams with a domain dictionary [12] and without a domain dictionary [13] to determine the similarity of strings, and expands it to include pages containing similar strings. N-grams are used because they support partial matching of strings with errors. The HyCOQ algorithm was developed to enhance the n-gram weighting technique without a domain dictionary by combining the strengths of n-gram based and word-based systems. The individual document dissimilarities were derived using k-dissimilarity, neighborhood dissimilarity, and nearest dissimilarity density, adapted from the local outlier concept [17].

Word-based systems apply different techniques than n-gram based systems. Besides applying full word matching, the domain dictionary was indexed by word length in order to enhance term searching quality [4].

There are three types of outlier detection in web content. The first type detects outliers within a piece of web content and removes them immediately from the original content to obtain the web content required by the user. That system used a clustering technique and mathematical set formulas such as subset, union, and intersection for detecting outliers [3]. The second type focuses on detecting outliers among web pages and returns the pages suspected of being outliers to the user [11], [12], [13], [17]. This application captures web content outliers to gain interesting values, which can lead to new emerging business patterns and trends.
The third type detects outliers among web pages, removes the outlier pages, and improves the search results by removing redundant web pages [5], [6]. Every type of application is important. This study focuses on the second type of outlier detection, where there is still much to improve, especially the quality of the returned outliers. A word-based system used TF [9] but did not implement it as a weighting technique, nor did it use TF.IDF [6]. The existing method used TF.IDF in its application, but implemented it with an n-gram based technique. Because n-gram based systems run slowly, this paper switched to a word-based technique while still implementing TF.IDF [7], to see how efficiently a word-based technique with TF.IDF detects web content outliers.

III. ARCHITECTURE DESIGN

The proposed algorithm uses the advantages of full word matching and of an organized domain dictionary that is indexed by word length [4]. The paper assumes the existence of a dictionary for the intended category. The full word frequency profile of each web page is generated. The web pages are weighted based on their word frequencies, and a penalty is applied to any word that is present in the document but not in the domain dictionary, because it contributes more to the dissimilarity of the document, while words found in the dictionary increase the similarity between the document and the dictionary [12].

The weight of a term corresponds to its frequency of occurrence in the document, and two types of frequency are distinguished. The term frequency corresponds to the number of occurrences of the term in the information concerned, while the absolute frequency corresponds to the frequency of the stemmed word in the whole collection of information [16]. Terms with a weak frequency are not representative of the document content, while the most significant terms are those whose frequency is intermediate. When document length varies, relative frequency is preferred, so the values are normalized.
Maximum Frequency Normalization is used with the Inverse Document Frequency (IDF) weighting technique because the less frequent terms among documents might be more discriminative. The relative weight of a document determines its dissimilarity weight compared with the other documents in the category, and the documents whose dissimilarity weights are higher than those of the other documents in the category are ranked as outliers. Fig. 1 shows the architecture design of the proposed system.

[Fig. 1 Architecture design of the proposed system. Blocks: Organized Domain Dictionary; Extracted Web Pages; Preprocessing; Full Word Profile Generation; Compute Dissimilarity Measure; Determine Outliers.]

A. Document Extraction

In the first phase, the web pages under the same category of interest are retrieved and extracted. This can be achieved using a web search engine or web crawlers [18]. The web pages are analyzed to eliminate text that is not enclosed in TITLE, META, or BODY tags. However, this paper used the already extracted dataset taken from the WEBKB data repository [14], [20].
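As an illustration of the extraction step, the sketch below keeps only text found in TITLE and BODY elements and in the content of META tags, using Python's standard `html.parser` module. It is a minimal sketch, not the extraction code used in the paper: the class name `TagTextExtractor`, the choice to read only `description` and `keywords` META tags, and the decision to skip SCRIPT and STYLE content are all assumptions.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text enclosed in TITLE or BODY tags and META content values."""

    KEPT = {"title", "body"}
    SKIPPED = {"script", "style"}          # assumption: code and CSS are not content
    META_NAMES = {"description", "keywords"}  # assumption: which META tags count

    def __init__(self):
        super().__init__()
        self._open = []    # kept tags that are currently open
        self._skip = 0     # nesting depth inside script/style
        self.chunks = []   # extracted text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in self.META_NAMES and attr.get("content"):
                self.chunks.append(attr["content"].strip())
        elif tag in self.SKIPPED:
            self._skip += 1
        elif tag in self.KEPT:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip > 0:
            self._skip -= 1
        elif tag in self.KEPT and tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if text and self._open and self._skip == 0:
            self.chunks.append(text)


def extract_text(html: str) -> list[str]:
    """Return the text fragments of one page that survive the tag filter."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For a page whose head holds a title and a description META tag and whose body holds one paragraph, `extract_text` returns those three fragments in document order and ignores anything outside the kept tags.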

B. Preprocessing

In the preprocessing phase, any data other than text embedded in the HTML tags, such as hyperlinks, images, sound, numeric characters, symbols, null values (whitespace and other predefined characters on both sides of a string), and stop words, is removed. Stop words, known as words with a frequency greater than a user-specified frequency, were removed from the web content using a public list of stop words [21]. The web content was also stemmed with the Porter Stemming Algorithm [22] to reduce words to their root form.

C. Generate Full Word Profile

The filtered dataset is then used to generate the full word profile. At this point, the domain dictionary has been indexed by word length [4]. It is important to use an organized domain dictionary because every word in the web pages is checked against the domain dictionary based on its length. If the word exists in both, it is flagged as 1; otherwise 0 is returned. Then the word frequency is counted. The full word profile is generated by indexing all words in a two-dimensional format (i, j) [4], and every word is stored with its word frequency, its word length, and the binary number that indicates whether or not it exists in the domain dictionary.

D. Compute Dissimilarity Measure

In the weighting computation, a classic term weighting technique, TF.IDF [7] from Information Retrieval (IR), was adopted to evaluate the representativeness of terms in the web content. The dissimilarity measure is computed to determine the difference among pages within the same category [11]. Maximum Frequency Normalization is applied to the Term Frequency (TF) weighting because, when document length varies, relative frequency is preferred [16]. Since term frequency alone may not have the discriminating power to pick out all relevant documents from the irrelevant ones, an IDF (Inverse Document Frequency) factor, which takes the collection distribution into account, has been proposed to help improve the performance of IR [15].
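The profile generation and weighting described in Sections C and D can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the input is assumed to be already preprocessed token lists, and the weight of a dictionary word is assumed to be its maximum-frequency-normalized TF scaled by 0.5 and multiplied by a natural-log IDF, log(N / k).

```python
import math
from collections import Counter, defaultdict


def index_dictionary_by_length(words):
    """Group dictionary words by length so a lookup only checks same-length words."""
    index = defaultdict(set)
    for word in words:
        index[len(word)].add(word)
    return index


def full_word_profile(tokens, dict_index):
    """Map each distinct word to (frequency, length, in-dictionary flag)."""
    counts = Counter(tokens)
    return {
        word: (freq, len(word), int(word in dict_index.get(len(word), ())))
        for word, freq in counts.items()
    }


def dictionary_term_weights(profiles):
    """Weight each dictionary word of each document by normalized TF times IDF.

    Assumed form: 0.5 * f / MaxFreq(d) * log(N / k), where k is the number of
    documents that contain the word. Words outside the dictionary get no weight.
    """
    n_docs = len(profiles)
    doc_freq = Counter(word for profile in profiles for word in profile)
    weights = []
    for profile in profiles:
        max_freq = max((freq for freq, _, _ in profile.values()), default=1)
        weights.append({
            word: 0.5 * freq / max_freq * math.log(n_docs / doc_freq[word])
            for word, (freq, _, flag) in profile.items()
            if flag
        })
    return weights


# Hypothetical toy data: a three-word course dictionary and three documents.
dict_index = index_dictionary_by_length(["course", "exam", "lecture"])
docs = [
    ["course", "course", "exam", "lecture"],
    ["course", "exam", "exam", "homework"],
    ["patient", "doctor", "doctor", "exam"],
]
profiles = [full_word_profile(doc, dict_index) for doc in docs]
weights = dictionary_term_weights(profiles)
```

In this toy collection the word "exam" occurs in every document, so its IDF factor is log(1) = 0 and it contributes nothing, which illustrates why less frequent terms are treated as more discriminative.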
$$\mathrm{DM}(d_i) = \sum_{j} e_j \cdot \frac{0.5\, f(t_j, d_i)}{\mathrm{MaxFreq}(d_i)} \cdot \log\frac{N}{k} \tag{1}$$

where $e_j$ indicates whether or not the word exists in the domain dictionary, $f(t_j, d_i)$ denotes the frequency of term $t_j$ in document $d_i$, $\mathrm{MaxFreq}(d_i)$ is the maximum frequency of a word in the document, $N$ is the total number of documents, and $k$ is the number of documents in which term $t_j$ appears. However, the dissimilarity measure (1) only computes the words that exist in the dictionary, because $e_j$ returns only a binary value. The words that do not exist in the domain dictionary are therefore not computed. The reason is that a word that exists in the dictionary is more relevant to the domain category and represents the power of the document. In an outlier, the words that exist in the dictionary have the lowest frequency, and only a few of its words exist in the domain dictionary. Therefore the dissimilarity measure returns a higher dissimilarity value for it than for other web pages. The same result is obtained with the dissimilarity function below:

$$\mathrm{DM}(d_i) = \sum_{j} \frac{0.5\, f(t_j, e_i)}{\mathrm{MaxFreq}(d_i)} \cdot \log\frac{N}{k} \tag{2}$$

where $e_i$ denotes the words in the document that exist in the domain dictionary. The other symbols have the same meaning and definition as in (1). Equation (2) is the dissimilarity measure simplified from (1); it computes only the words that exist in both the document and the domain dictionary.

E. Determine Outliers

The output of the dissimilarity measure is ranked to determine the outliers. The top n results (where n equals the total number of benchmark data) are declared outliers.

IV. ALGORITHM

Input: Domain dictionary and web documents d_i
Output: Outlying documents
1. Read the content of the documents and the domain dictionary.
2. Extract the documents and preprocess.
3. Generate the full word profile.
4. Generate the organized domain dictionary.
5. For (int i = 0; i < NoOfDoc; i++) {
6.   For (int j = 1; j < NoOfWords; j++) {
7.     If (t_j exists in the domain dictionary) {
8.       DM(d_i) = DM(d_i) + 0.5 f(t_j, e_i) / MaxFreq(d_i) * log(N / k)
9.   }} // end of inner loop
10.  DM(d_i) = DM(d_i) / number of words in the document that exist in the domain dictionary
   } // end of outer loop
11. Rank the result of DM(d_i).
12. The top n of the result are declared as outliers.

V. EXPERIMENTAL RESULTS

This technique has been tested with two datasets. The first dataset consists of 35 web pages from the Course folder of Cornell University, provided by the World Wide Knowledge Base (WEBKB). There is no benchmark data for testing web content outliers, so embedding motifs is the only way to know whether the outliers returned are real outliers. Therefore, the experiment used 10 benchmark web pages from the Science Medical folder of the 20 Newsgroups Dataset. Although outliers usually constitute less than 10% of the entire dataset [19], the rationale for choosing 10 web pages as embedded motifs in the first experiment is to see how the system performs in detecting outliers when the dataset contains more outliers.

[Fig. 2 Performance of outlier detection from the first dataset. Bar chart of F1-measure and accuracy (0% to 100%) for the N-gram, TF, and TF.IDF techniques.]

Fig. 2 shows the performance of outlier detection on the first dataset. The results are counted based on how many of the web content outliers (which come from the benchmark dataset) are returned by the system. The results are ranked and the top 10 web pages are categorized as web content outliers. Performance is quantified by two parameters: accuracy and F1-measure. The experimental result shows that the system using the TF.IDF technique achieves up to 91.10% accuracy, which is about 17.77% higher than the TF technique and 13.10% higher than the N-gram based technique. It also achieves up to 80% F1-measure, which is a 40% improvement over the TF technique and a 30% improvement over the N-gram based technique. Moreover, the recommended technique shows a faster execution time than the N-gram based system and is suitable for large datasets.

The second dataset consists of 200 web pages from the Course folders of the Universities of Texas, Washington, and Wisconsin, provided by the World Wide Knowledge Base. Twenty benchmark web pages (10% of the entire dataset) were again taken from the Science Medical folder of the 20 Newsgroups Dataset. Fig. 3 shows the performance of outlier detection on the second dataset. The top 20 results returned by the system were considered outliers.

[Fig. 3 Performance of outlier detection from the second dataset. Bar chart of F1-measure and accuracy (0% to 100%) for the N-gram, TF, and TF.IDF techniques.]

The second experiment shows that the TF.IDF technique achieves up to 93.63% accuracy, which is about 7.27% higher than the TF technique and 1.54% higher than the N-gram based technique.
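The relation between the reported accuracy and F1-measure can be checked with a short calculation. The sketch below assumes that the planted pages are added on top of the course pages (45 and 220 pages in total) and that the number of planted pages recovered in the top n is 8 of 10 and 13 of 20; these hit counts are not stated in the paper and are inferred here from the reported F1-measures of 80% and 65%. The function name `top_n_metrics` is hypothetical.

```python
def top_n_metrics(n_normal: int, n_planted: int, hits: int):
    """Accuracy and F1 when exactly n_planted pages are flagged as outliers.

    hits is the number of planted pages found among the flagged pages.
    """
    if n_planted <= 0 or not 0 <= hits <= n_planted:
        raise ValueError("need n_planted > 0 and 0 <= hits <= n_planted")
    tp = hits                   # planted pages that were flagged
    fp = n_planted - hits       # normal pages that took a flagged slot
    fn = n_planted - hits       # planted pages that were missed
    tn = n_normal - fp          # normal pages correctly left unflagged
    accuracy = (tp + tn) / (n_normal + n_planted)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return accuracy, f1


# First dataset: 35 course pages plus 10 planted pages, 8 assumed hits.
acc1, f1_1 = top_n_metrics(35, 10, 8)
# Second dataset: 200 course pages plus 20 planted pages, 13 assumed hits.
acc2, f1_2 = top_n_metrics(200, 20, 13)
print(f"{acc1:.4f} {f1_1:.2f} {acc2:.4f} {f1_2:.2f}")
```

Under these assumptions the accuracies come out as 41/45 and 206/220, which are close to the reported 91.10% and 93.63%. Because the number of flagged pages equals the number of planted pages, precision, recall, and F1 coincide, and accuracy stays high even when several planted pages are missed, since most pages are normal.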
On the second dataset, the TF.IDF technique also achieves up to 65% F1-measure, which is a 40% improvement over the TF technique and a 10% improvement over the N-gram based technique. The N-gram based system shows good performance, but it is not very efficient because it takes a very long time to process large datasets, owing to the huge number of n-gram vectors generated during mining [14].

VI. CONCLUSION AND FUTURE WORK

Mining web content outliers is related to mining text outliers and to Information Retrieval. Therefore, many techniques from both fields can be adopted for mining web content outliers. Some effort is needed to improve the quality of outlier detection in web content. This paper used a traditional weighting technique, TF.IDF [7], from Information Retrieval, which is commonly used in text mining. The experiment shows that the TF.IDF technique is not only compatible with detecting web content outliers, but also returns better results than the previous works. This encourages the use of other weighting techniques from those disciplines for mining web content outliers in the future. The technique can then be enhanced by adding a calculation to remove redundant web pages, if any exist.

REFERENCES

[1] A. Khan, B. Baharudin, and K. Khan, Efficient feature selection and domain relevance term weighting method for Document Classification, Second International Conference on Computer Engineering and Applications, IEEE.
[2] C. Deisy, M. Gowri, S. Baskar, S.M.A. Kalaiarasi, and N. Ramraj, A novel term weighting scheme MIDF for Text Categorization, Journal of Engineering Science and Technology, Vol. 5, No. 1.
[3] G. Poonkuzhali, K. Thiagarajan, and K. Sarukesi, Set theoretical approach for mining web content through outliers detection, International Journal on Research and Industrial Applications, Vol. 2, Jan.
[4] G. Poonkuzhali, K. Thiagarajan, K. Sarukesi, and G.V. Uma, Signed approach for mining web content outliers, Proceedings of World Academy of Science, Engineering and Technology, Vol. 56.
[5] G. Poonkuzhali, K. Thiagarajan, and K. Sarukesi, Elimination of Redundant Links in Web Pages - Mathematical Approach, World Academy of Science, Engineering and Technology 52.
[6] G. Poonkuzhali, K. Sarukesi, and G.V. Uma, Web content outlier mining through mathematical approach and trust rating, 10th WSEAS International Conference on Applied Computer and Applied Computational Science (ACACOS '11).
[7] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley Editors.
[8] G. Tsatsaronis and V. Panagiotopoulou, A generalized vector space model for Text Retrieval Based on Semantic Relatedness, Proceedings of the EACL, Association for Computational Linguistics, Athens, Greece, April.
[9] H.P. Luhn, A statistical approach to mechanized encoding and searching of literary information, IBM Journal of Research and Development (4).
[10] L-S. Chen and C-W. Chang, A new term weighting method by introducing class information for sentiment classification of Textual Data, Proceedings of the International MultiConference of Engineers and Computer Scientists Vol I, IMECS, Hong Kong, March.
[11] M. Agyemang, K. Barker, and R.S. Alhajj, Framework for Mining Web Content Outliers, ACM Symposium on Applied Computing, 2004.
[12] M. Agyemang, K. Barker, and R.S. Alhajj, Mining web content outliers using structure oriented weighting techniques and n-grams, Proceedings of ACM SAC, New Mexico.
[13] M. Agyemang, K. Barker, and R.S. Alhajj, WCOND-Mine: Algorithm for Detecting Web Content Outliers from Web Documents, Proceedings of the 10th IEEE Symposium on Computers and Communications (ISCC).
[14] M. Agyemang, K. Barker, and R.S. Alhajj, Hybrid approach to web content outlier mining without query vector, Springer Berlin, Vol. 3589.
[15] M. Lan, C. L. Tan, and J. Su, Supervised and traditional term weighting methods for Automatic Text Categorization, Journal of IEE PAMI, Vol. 10, July.
[16] M. Mohammadian, Intelligent Agents For Data Mining and Information Retrieval, University of Canberra, Australia, Idea Group Publishing, Hershey, London, Melbourne, Singapore, 2004.
[17] M.M. Breunig, H-P. Kriegel, R.T. Ng, and J. Sander, LOF: Identifying Outliers in Large Dataset, Proc. of ACM SIGMOD, Dallas, TX.
[18] S. Chakrabarti, M. Berg, and B. Dom, Focused crawling: A new approach to topic-specific Web Resource Discovery, Computer Networks, Amsterdam, Netherlands.
[19] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley.
[20] July
[21] July
[22] July
