Feature Selection for Web Page Classification Using Swarm Optimization

B. Leela Devi, A. Sankar

Abstract

The web's increased popularity has brought a huge amount of information, which makes automated web page classification systems essential for improving search engine performance. Web pages have many features, such as HTML or XML tags, hyperlinks, URLs and text contents, which can be considered during an automated classification process. Web page classification is known to be enhanced by hyperlinks, as they reflect the linkages between web pages. The aim of this study is to reduce the number of features used, in order to improve the accuracy of web page classification. In this paper, a novel feature selection method using an improved Particle Swarm Optimization (PSO) based on the principle of evolution is proposed. The extracted features were tested on the WebKB dataset using a parallel Neural Network to reduce the computational cost.

Keywords: Web page classification, WebKB Dataset, Term Frequency-Inverse Document Frequency (TF-IDF), Particle Swarm Optimization (PSO).

I. INTRODUCTION

The increase in usage of the Web and its growth are well known. Textual data on the Web is estimated at one terabyte, in addition to audio and video images, which imposes new challenges on Web directories. Web directories enable users to search the Web by classifying Web documents into subjects. Manual classification of web pages becomes impractical as the number of Web documents increases [1]. Text classification aims to categorize documents into a specific number of predefined classes using document features. Text classification has a crucial role in retrieval and management tasks such as information extraction, information retrieval, document filtering, and building hierarchical directories [2]. When text classification focuses on web pages, it is called web classification or web page classification. Classification assigns predefined class labels to unseen or test data. For this, a set of labelled data trains a classifier, which then labels unseen data. Classification is therefore a form of supervised learning [3].
The process is no different in web page classification, where there are one or more predefined class labels. A classification model assigns class labels to web pages, which are hypertext with many features such as textual tokens, markup tags, URLs and host names in URLs. As web pages have these additional properties, their classification differs from traditional text classification. Web page classification has subfields such as subject classification and functional classification. In the former, the classifier is concerned with web page content and determines the subject of the web page. For example, online newspaper categories such as finance, sport, and technology are examples of subject classification. Functional classification deals with the function or type of a web page. For example, determining whether a web page is a personal homepage or a course page is functional classification. Subject and functional classification are popular classification types [2].

Leela Devi B is with Professional Group of Institutions, Palladam, Tamilnadu, India (e-mail: leeladevi_2008@rediffmail.com). Sankar A is with PSG College of Technology, Coimbatore, Tamilnadu, India.

The individual component of an HTML document is the HTML element. A document is made up of a tree of HTML elements and other nodes such as text nodes, with all elements having specified attributes. Elements have content and text. HTML represents semantics, or meaning [4]. HTML markup has key components, including character references, character-based data types, elements (and attributes), and entity references. Another component is the document type declaration, which triggers standards mode rendering. Semantic HTML is a way of writing HTML that emphasizes the meaning of the encoded information over its presentation (looks). HTML has included semantic markup from its inception, as well as presentational markup such as the <font>, <i> and <center> tags. There are also semantically neutral tags. HTML and its associated protocols were accepted quickly from their inception, but there were no standards in the early years of the language.
Though HTML was conceived as a semantic language without presentation details, practical use pushed presentational elements and attributes into it, driven by the various browser vendors. The latest HTML standards are efforts to overcome this chaotic language development and to create a rational foundation for building meaningful and well-presented documents.

Feature selection is an important classification step. Web pages are in HTML format, meaning that web pages are semi-structured data, with HTML tags and hyperlinks in addition to pure text. Due to this property of web pages, feature selection in web page classification differs from that in traditional classification. Feature selection reduces the dimension of data that has tens, hundreds or thousands of features and cannot otherwise be processed further. A major problem of web page classification is the high dimensionality of the feature space. The best feature subsets contain the fewest features that contribute most to classification accuracy. To improve web page classification performance, many approaches imported from feature selection or text classification have been applied. Information gain [3], mutual information [5], document frequency [6], and term strength [7] are popular feature selection techniques. Information gain (IG) measures the information, in bits, about the class prediction when the only information available is the presence of a feature and the corresponding

class distribution. This study proposes a new feature selection technique using the PSO algorithm for web page classification. Section II reviews related work in the literature. Section III describes the methods used and Section IV discusses the results of the experiments. Section V concludes the paper.

II. RELATED WORKS

Automatic web page classification using a minimum number of features was emphasized by [8]. They also proposed a procedure to generate optimum features for web pages. The optimum features are then used to model machine learning classifiers. Experiments on a benchmark data set showed that these machine learning classifiers achieved improved classification accuracy. To improve classification results with efficient resource use, the full web page is not needed. As web pages contain high-dimensional data, they are preprocessed to identify the best representative features reflecting the categories. A multilevel feature selection without bias towards frequent terms was proposed. Results reveal that the new feature selection process identifies fewer features with high information gain while ensuring classification accuracy.

Previous works show that a web page can be partitioned into many segments or blocks, and that the importance of these blocks to a page is not equal. It was shown that differentiating noisy and unimportant blocks from the rest of the page helps web mining, search, and accessibility. However, those works presented no uniform approach to measuring the importance of web page portions. A user study by [9] found that people have consistent views on the importance of web page blocks. That work investigates how to find a model that automatically assigns importance values to web page blocks, defining block importance estimation as a learning problem. First, the Vision-based Page Segmentation (VIPS) algorithm partitions a web page into semantic blocks with a hierarchy structure. Then spatial features such as position and size, and content features such as the number of images and links, are extracted to construct a feature vector for each block. Learning algorithms such as SVM and NN train block importance models based on these vectors. In the reported experiments, the best models achieve 79% Micro-F1 and 85.9% Micro-Accuracy.
Web classification has been attempted with different technologies. Xhemali et al. [10] compared Neural Network (NN), Naive Bayes (NB), and Decision Tree (DT) classifiers for the automatic analysis and classification of attribute data from training course web pages. The study introduced an enhanced NB classifier and ran the same data sample through the DT and NN classifiers to determine the success rate of the classifiers in the training courses domain. The research revealed that the enhanced NB classifier outperformed the traditional NB classifier and performed as well as, if not better than, popular rival techniques. The study concluded that the NB classifier is the best choice for the training courses domain, achieving an F-Measure value of over 97%, despite being trained with fewer samples than any of the classification systems encountered.

A graph-based semi-supervised learning algorithm applied to Web page classification was proposed by [11]. The algorithm used a similarity measure between Web pages to construct a k-nearest neighbor graph. Preliminary experiments on the WebKB dataset showed that the algorithm exploited unlabeled data as well as labeled data to achieve higher Web page classification accuracy.

The effect of considering named entities as web page classification features was investigated by [12]. The tests covered five different domains (baseball, football, health, politics and science) with web pages from online news providers. Results showed that incorporating named entities leads to slight gains in classifier performance for narrow domains, but this is not true for all domains. Results also showed that classification based on named entities alone can be good for some domains (baseball) but is lower than that of a representation based on lexical terms.

Saraç and Özel [13] applied a recent optimization technique called the Firefly Algorithm (FA) to choose the best features for Web page classification. FA selected a feature subset, and the J48 classifier of the Weka data mining tool was used to evaluate the fitness of the selected features. The WebKB and Conference datasets were used to evaluate the effectiveness of the proposed feature selection system.
The results showed that when a feature subset was selected using FA, the WebKB and Conference datasets were classified without accuracy loss, and as the number of features decreased, the time needed to classify new Web pages was reduced.

A Genetic Algorithm (GA) to select the best features for the Web page classification problem, in order to improve the accuracy and run time performance of classifiers, was proposed by [14]. The increased information on the Web has raised a need for accurate automated classifiers for Web pages to maintain Web directories and increase search engine performance. To decrease the feature space, a GA that determines the best features for a set of Web pages was developed. It was found that when the features proposed by the GA were used with a kNN classifier, accuracy went up to 96%.

A new centroid-based approach to classify web pages by genre using character n-grams from different information sources such as title, URL, headings and anchors was proposed by [15]. To deal with the complexity of web pages and the rapid evolution of web genres, the approach implemented a multi-label and adaptive classification scheme in which web pages were classified one by one, and each could be assigned to more than one genre. According to the similarity between a new page and every genre centroid, the approach either adapted the genre centroid under consideration or considered the new page a noise page and discarded it. The approach showed better results than current multi-label classifiers.

An approach to Web page classification using an NB classifier based on Independent Component Analysis (ICA) was proposed by [16]. To perform classification, a Web page was first represented by a vector of features with different weights, and the weight calculation method was improved. As the number of features was large, Principal Component Analysis (PCA) selected relevant features in a preprocessing section as input for the improved ICA algorithm (MFICA). Finally, the MFICA output was sent to an NB classifier for classification, to boost classifier performance. Evaluation showed that the ICA-based NB classifier ensured acceptable classification accuracy.

III. METHODOLOGY

The WebKB dataset is used here for evaluation. Four different feature selection techniques were described and a new feature selection method using the PSO algorithm was proposed.

[Fig. 1 Flowchart for the proposed method: WebKB dataset → stop word removal → Porter stemming → TF-IDF computation (TFxIDF) → feature selection using PSO algorithm → classification performed using ANN]

WebKB Data Set

The WebKB dataset [17] is a set of web pages collated by the World Wide Knowledge Base (WebKB) project of the CMU text learning group and downloaded from The 4 Universities Dataset Homepage [18]. The pages were collected from the computer science departments of various universities in 1997 and manually classified into 7 different classes: student, faculty, staff, department, course, project, and others. For every class, the collection has web pages from four universities, i.e. Cornell, Wisconsin, Texas, and Washington, and miscellaneous pages from other universities. All 8,282 web pages are classified manually into the seven categories, so that the student category has 1641 pages, faculty 1124, staff 137, department 182, course 930, project 504, and others 3764 pages. The class "others" is a collection of pages that are not deemed the main page and do not represent any instance of the earlier six classes. The WebKB dataset includes 867 web pages from Cornell University, 827 pages from Texas University, 1205 from Washington University, 1263 from Wisconsin University, and finally 4120 miscellaneous pages from other universities. This study uses the Project, Faculty, Student, and Course classes from the WebKB dataset. As the Staff and Department classes have limited positive examples, they are not considered.

Stopwords

Stop words are commonly used words that are frequently filtered from text in information retrieval tasks; they are filtered prior to, or after, natural language data processing. Removing stop words gets rid of noise and saves space for storing documents. For example, consider the instance "I am a student of computer science at Wisconsin University." The stop words "I", "am", "a", "of", and "at" are left out of the full-text index.
Thus, on removal of the stop words, the instance is represented by "student computer science Wisconsin University".

Stemming

Stemming is a tool used for the vocabulary mismatch problem, where query words do not match document words. Stemmers conflate certain variant forms of the same word, such as (paper, papers) and (hold, holds, holding) [19]. After removing high frequency words, indexing conflates word variants into the same stem or root using a stemming algorithm. For example, the words "engineering", "engineers" or "engineered" are reduced to the stem "engineer". In information retrieval, grouping words with the same root under the same stem (or indexing term) increases the success rate when matching documents to a query [20]. Porter's algorithm is based on a series of steps, each of which removes a type of suffix by substitution rules. These rules only apply when specific conditions hold; for example, the resulting stem must have a minimal length. Most rules have a condition based on a so-called measure. The measure is the number of vowel-consonant sequences (where consecutive vowels or consonants are counted as one) present in the resulting stem. This condition prevents letters that resemble a suffix, but are actually part of the stem, from being removed [21]. For example, Porter stemming reduces "student" and "students" to "student". Similarly, "studied", "studies", "study" and "studying" are reduced to "studi".

Term Frequency-Inverse Document Frequency (TF-IDF)

Inverse Document Frequency (IDF) represents a scaling factor. If a term t occurs frequently across documents, its IDF value is lower, as the term has lower discriminative power [22]. IDF(t) is defined as (1):

$$IDF(t) = \log \frac{|d|}{|d_t|} \qquad (1)$$

where d is the set of all documents and d_t is the set of documents containing term t. Similar documents have similar relative term frequencies. Similarity is measured among document sets or between a document and a query. The cosine measure is used to find similar documents [23]; it is given by (2):

$$sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \, |v_2|} \qquad (2)$$

where v_1 and v_2 are two document vectors, v_1 \cdot v_2 is defined as \sum_i v_{1i} v_{2i}, and |v_1| = \sqrt{v_1 \cdot v_1}.

The TF-IDF function weighs each vector component (each relating to a word of the vocabulary) of every document as follows. First, it incorporates the word frequency in a document.
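As a rough illustration of the preprocessing discussed in this section, the sketch below removes stop words, computes IDF as in (1), builds TF-IDF vectors by multiplying term frequency and IDF, and compares vectors with the cosine measure in (2). The stop word list, the function names and the use of raw counts as TF are assumptions made for illustration; the paper does not specify them, and stemming is omitted.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"i", "am", "a", "of", "at", "the", "is"}

def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def idf(term, docs):
    """IDF(t) = log(|d| / |d_t|) as in (1); docs is a list of token lists."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing) if containing else 0.0

def tfidf_vector(doc, docs, vocab):
    """Weight each vocabulary term by raw term frequency times IDF."""
    tf = Counter(doc)
    return [tf[t] * idf(t, docs) for t in vocab]

def cosine(v1, v2):
    """Cosine measure from (2); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Usage on the example sentence from the text plus two made-up documents.
docs = [
    tokenize("I am a student of computer science at Wisconsin University."),
    tokenize("The course is computer graphics."),
    tokenize("Student project page."),
]
vocab = sorted({t for d in docs for t in d})
vectors = [tfidf_vector(d, docs, vocab) for d in docs]
print(docs[0])
print(cosine(vectors[0], vectors[1]))
```

A term that appears in every document gets IDF 0 under (1), so it contributes nothing to the cosine measure; this is the downscaling effect the section describes.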
Thus, the more a word appears in a document (its TF, term frequency, is high), the more it is thought to be significant in the document. IDF, in turn, measures how infrequent a word is in a collection. This is estimated using the entire training text collection at hand. TF-IDF combines the weights of TF and IDF by multiplying them. TF gives more weight to a frequent term

in an essay, while IDF downscales the weight if a term occurs in many essays [24].

Feature Selection Using PSO

PSO originated from the simulation of the social behavior of birds in a flock. In PSO, a particle flies through the search space with a velocity adjusted by its own flying memory and its companions' flying experience. Each particle has an objective function value determined by a fitness function, and its velocity is updated by (3):

$$v_{id}^{t+1} = w v_{id}^{t} + c_1 r_1 (p_{id} - x_{id}^{t}) + c_2 r_2 (p_{gd} - x_{id}^{t}) \qquad (3)$$

where i represents the ith particle and d is a dimension of the solution space, w is the inertia weight, c_1 denotes the cognition learning factor, c_2 indicates the social learning factor, r_1 and r_2 are random numbers uniformly distributed in (0,1), p_{id} and p_{gd} stand for the position with the best fitness found so far by the ith particle and the best position in the neighborhood, v_{id}^{t+1} and v_{id}^{t} are the velocities at time t+1 and time t, and x_{id}^{t} is the position of the ith particle at time t. Every particle moves to a new potential solution based on (4):

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}, \quad d = 1, 2, \ldots, D \qquad (4)$$

A binary PSO, in which a particle moves in a state space restricted to 0 and 1 on each dimension, in terms of changes in the probabilities that a bit will be in one state or the other, was proposed by [25] as in (5), (6):

$$x_{id} = \begin{cases} 1, & rand() < S(v_{id}) \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

$$S(v) = \frac{1}{1 + e^{-v}} \qquad (6)$$

The function S(v) is a sigmoid limiting transformation and rand() is a random number drawn from a uniform distribution in [0.0, 1.0]. This study uses the binary version of the PSO algorithm. Each particle's position is given as a binary string representing a feature selection situation.

Feature Selection Using Proposed PSO

The process for the proposed PSO is as follows:

1. Start PSO.
2. Find gbest and pbest.
3. Update each particle according to (7):

$$v_{id}^{t+1} = w v_{id}^{t} + c_1 r_1 (p_{id} - x_{id}^{t}) + c_2 r_2 (p_{gd} - x_{id}^{t}) \qquad (7)$$

4. Start random mutation hill climbing for gbest as follows. Random mutation hill climbing is a local search method that has a stochastic component.
   i. Choose a binary string at random. Call this string best_evaluated.
   ii. Mutate a bit chosen at random in best_evaluated.
   iii. Compute the fitness of the mutated string. If the fitness is greater than the fitness of best_evaluated, then set best_evaluated to the mutated string.
   iv. If the maximum number of iterations has been performed, return best_evaluated; otherwise, go to Step ii.
5. Start parallel hill climbing on the new pbest values.

Artificial Neural Network (ANN)

Realistic cognitive simulation and Computer Assisted Learning (CAL) have inspired the search for optimal learning and teaching methodology and the study of classical teaching performance. The use of an ANN learning model led to a fair assessment of the performance of the suggested learning and teaching topics, so the optimal tutoring method is reached after analysis and evaluation of simulation results. Fig. 2 depicts the block diagram of an ANN learning and teaching model, which presents the simulation of 2 diverse learning paradigms. Both are related to the interactive tutoring and learning process and to self-organized learning. The first is related to classical (tutor-supervised) learning as seen in classrooms (face-to-face tutoring). This proceeds interactively through a bidirectional communication process between tutor and learner(s). The second paradigm undertakes a self-organized (unsupervised) tutoring process.

[Fig. 2 Generalized ANN block diagram]

Referring to Fig. 2, the error vector observed at a time instant (n) during the learning process is (8):

$$e(n) = d(n) - y(n) \qquad (8)$$

where e(n) is the error correcting signal that adaptively controls the learning process, x(n) is the input stimulus, y(n) is the output response vector, and d(n) the desired numeric value(s). Equation (9) is deduced:

$$V_k(n) = X_j(n) W_{kj}^{T}(n)$$
$$Y_k(n) = \varphi(V_k(n)) = \frac{1 - e^{-\lambda V_k(n)}}{1 + e^{-\lambda V_k(n)}}$$
$$e_k(n) = d_k(n) - y_k(n)$$
$$W_{kj}(n+1) = W_{kj}(n) + \Delta W_{kj}(n) \qquad (9)$$

where X is the input vector, W the weight vector, \varphi an activation (odd sigmoid) function characterized by \lambda as gain factor, and Y the output. e_k is the error value and d_k the desired output. Note that \Delta W_{kj}(n) is the dynamic change of the weight vector value connecting the kth and jth neurons. The dynamic changes of the weight vector value for the supervised phase are given by (10):

$$\Delta W_{kj}(n) = \eta \, e_k(n) X_j(n) \qquad (10)$$

where \eta is the learning rate value during the learning process. For the unsupervised paradigm, the dynamic change of the weight vector value is given by (11):

$$\Delta W_{kj}(n) = \eta \, Y_k(n) X_j(n) \qquad (11)$$

Note that e_k(n) in (10) is substituted by y_k(n) at an arbitrary time instant (n) during learning.

The proposed parallel NN has 2 sub-networks. Every network has 2 hidden layers, with differing transfer functions. This study uses the sigmoid and tanh transfer functions. The advantage of using different functions is that mutual interference is reduced during the simultaneous processing and execution of complex tasks. Table I gives the proposed NN parameters.

TABLE I
PARAMETERS FOR THE PROPOSED MLP NN

Parameter                                    Value
Input Neuron                                 50
Output Neuron                                4
Number of Hidden Layer                       2
Number of processing elements, upper         4
Number of processing elements, lower         4
Transfer function of hidden layer, upper     Tanh
Transfer function of hidden layer, lower     Sigmoid
Learning Rule of hidden layer                Momentum
Step size                                    0.1
Momentum                                     0.7
Transfer function of output layer            Tanh
Learning Rule of output layer                Momentum
Step size                                    0.1
Momentum                                     0.7

TABLE II
THE TOP 15 WORDS SELECTED BY VARIOUS FEATURE SELECTION TECHNIQUES

CFS          MI             PSO          Proposed PSO
annual       professor      annual       annual
associate    science        aspects      assignment
compute      hu             assignment   associate
course       cornell        associate    component
define       universal      component    define
develop      return         define       direct
direct       compute        direct       people
geometric    annual         note         professor
hall         departmental   people       research
hour         research       professor    return
note         public         research     phd
overview     lecture        resume       universal
page         people         return       public
people       professional   phd          lecture
phd          universal      note         (none listed)

IV. EXPERIMENTAL RESULTS

The proposed PSO based feature selection for web page classification using HTML tags is evaluated using the 4 Universities Dataset and compared with Correlation Feature Selection (CFS), Mutual Information (MI) and the PSO feature extraction method. Five classes are classified (Student, Course, Faculty, Project, and Others).
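The proposed PSO-based feature selection evaluated in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter values (w, c1, c2, swarm size, iteration counts), the velocity clamp v_max, and the toy_fitness function are all hypothetical, and the hill climber starts from the current gbest rather than from a random string. In the paper the fitness would come from classifier performance on the selected features, and Step 5 (parallel hill climbing on pbest values) is omitted here.

```python
import math
import random

def sigmoid(v):
    """S(v) = 1 / (1 + e^(-v)), the limiting transformation in (6)."""
    return 1.0 / (1.0 + math.exp(-v))

def rmhc(start, fitness, iters, rng):
    """Random mutation hill climbing: flip one random bit, keep it only if fitness improves."""
    best = list(start)
    best_fit = fitness(best)
    for _ in range(iters):
        cand = list(best)
        j = rng.randrange(len(cand))
        cand[j] = 1 - cand[j]
        f = fitness(cand)
        if f > best_fit:
            best, best_fit = cand, f
    return best, best_fit

def binary_pso(fitness, n_bits, n_particles=10, iters=30, w=0.7,
               c1=2.0, c2=2.0, v_max=4.0, rmhc_iters=10, seed=0):
    """Binary PSO (velocity as in (7), bit sampling as in (5)) with RMHC applied to gbest."""
    rng = random.Random(seed)
    x = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    v = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [list(p) for p in x]
    pbest_fit = [fitness(p) for p in x]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = list(pbest[g]), pbest_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                # Clamp velocity (an assumption) so the sigmoid does not saturate.
                v[i][d] = max(-v_max, min(v_max, v[i][d]))
                x[i][d] = 1 if rng.random() < sigmoid(v[i][d]) else 0
            f = fitness(x[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = list(x[i]), f
            if f > gbest_fit:
                gbest, gbest_fit = list(x[i]), f
        # Step 4: local search around the current global best.
        gbest, gbest_fit = rmhc(gbest, fitness, rmhc_iters, rng)
    return gbest, gbest_fit

def toy_fitness(bits):
    """Hypothetical stand-in for classifier accuracy: the first 3 features are useful, the rest cost 0.1 each."""
    return sum(bits[:3]) - 0.1 * sum(bits[3:])

best, best_fit = binary_pso(toy_fitness, n_bits=8)
print(best, best_fit)
```

Each bit of the returned string marks whether the corresponding feature (for example, a TF-IDF term) is kept. Because the hill climber accepts only strict improvements, the global best fitness never decreases from one iteration to the next.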
The accuracy, precision, recall and f measure are computed. The top 15 words selected by the various feature selection techniques are tabulated in Table II. The Neural Network classifies the web pages based on these keywords. The following tables and figures show the experimental results in detail. Tables III-V show the average precision, recall and F measure obtained for the various feature extraction methods. The results are shown graphically in Figs. 3-6.

TABLE III
PRECISION

Class      CFS       MI        PSO       Proposed PSO
Student    0.8747    0.879     0.8823    0.8848
Faculty    0.883     0.887     0.8924    0.8876
Course     0.7672    0.7525    0.8077    0.8402
Project    0.669     0.6596    0.7063    0.7809
Other      0.928     0.8973    0.9237    0.9438

[Fig. 3 Precision by class label (student, faculty, course, project, other) for CFS, MI, PSO and Proposed PSO]

From Table III and Fig. 3, it is observed that precision is improved for Proposed PSO when compared to CFS, MI, and PSO. On average, precision increases for Proposed PSO by 2.3% when compared to PSO, by 6.23% when compared to MI and by 5.47% when compared to CFS. For the class label course, the precision of Proposed PSO increases by 3.94% when compared to PSO.

TABLE IV
RECALL

Class      CFS       MI        PSO       Proposed PSO
Student    0.7773    0.7404    0.8034    0.8492
Faculty    0.8749    0.8625    0.9025    0.909
Course     0.84      0.8203    0.8628    0.8785
Project    0.759     0.7294    0.7864    0.7982
Other      0.928     0.9346    0.9275    0.9397
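For reference, the per-class measures reported in these tables can be computed as in the sketch below. The function names and the sample labels are illustrative assumptions. The percent_difference helper reflects our reading of how the reported increases are calculated: taking the difference relative to the mean of the two values reproduces the 3.94% quoted for the course class (0.8402 against 0.8077), although the paper does not state this formula.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f_measure) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def percent_difference(a, b):
    """Difference between a and b as a percentage of their mean (assumed formula)."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)

# Made-up labels for illustration only.
y_true = ["student", "student", "course", "faculty", "course"]
y_pred = ["student", "course", "course", "faculty", "course"]
print(per_class_scores(y_true, y_pred, "course"))

# Course precision, Proposed PSO versus PSO, from Table III.
print(round(percent_difference(0.8402, 0.8077), 2))
```

In the sample labels one student page is misclassified as a course page, which lowers precision for the course class and recall for the student class.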

[Fig. 4 Recall by class label (student, faculty, course, project, other) for CFS, MI, PSO and Proposed PSO]

From Table IV and Fig. 4, it is observed that recall is improved for Proposed PSO when compared to CFS, MI, and PSO. On average, recall increases for Proposed PSO by 2.94% when compared to PSO, by 6.79% when compared to MI and by 4.7% when compared to CFS. For the class label faculty, the recall of Proposed PSO increases by 3.82% when compared to CFS.

TABLE V
F MEASURE

Class      CFS       MI        PSO       Proposed PSO
Student    0.8225    0.8037    0.8402    0.8667
Faculty    0.8784    0.8745    0.8975    0.8982
Course     0.803     0.7856    0.8343    0.8595
Project    0.76      0.6924    0.7448    0.7898
Other      0.975     0.962     0.925     0.946

[Fig. 5 f measure by class label (student, faculty, course, project, other) for CFS, MI, PSO and Proposed PSO]

From Table V and Fig. 5, it is observed that the f measure is improved for Proposed PSO when compared to CFS, MI, and PSO. On average, the f measure increases for Proposed PSO by 2.65% when compared to PSO, by 6.72% when compared to MI and by 5.30% when compared to CFS. For the class label student, the f measure of Proposed PSO increases by 7.54% when compared to MI.

V. CONCLUSION

Automatic Web page classification using hypertext is a major approach to categorizing large quantities of Web pages. Two major approaches have been studied for Web page classification: content-based and context-based approaches. Content-based classification methods use the words or phrases of a target document to build the classifier and achieve limited accuracy. This study proposed a new feature selection method using the PSO algorithm. Results from experiments showed that the new method outperformed the other feature selection methods and ensured good classification accuracy.

REFERENCES

[1] Mangai, J. A., & Kumar, V. S. (2011). A Novel Approach for Web Page Classification using Optimum. IJCSNS, 11(5), 252.
[2] X. Qi and B. D. Davison, Web page classification: features and algorithms, ACM Computing Surveys, vol. 41, no. 2, article 12, 2009.
[3] T. M. Mitchell, Machine Learning, McGraw-Hill, New York, NY, USA, 1st edition, 1997.
[4] Golub, K. and A. Ardo (2005, September). Importance of HTML structural elements and metadata in automated subject classification. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Volume 3652 of LNCS, Berlin, pp. 368-378. Springer.
[5] C. E. Shannon, A mathematical theory of communication, The Bell System Technical Journal, vol. 27, pp. 379-423, 1948.
[6] Y. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization, in Proceedings of the 14th International Conference on Machine Learning (ICML 97), pp. 412-420, Nashville, Tenn, USA, July 1997.
[7] W. J. Wilbur and K. Sirotkin, The automatic identification of stop words, Journal of Information Science, vol. 18, no. 1, pp. 45-55, 1992.
[8] Mangai, J. A., & Kumar, V. S. (2011). A Novel Approach for Web Page Classification using Optimum. IJCSNS, 11(5), 252.
[9] Song, R., Liu, H., Wen, J. R., & Ma, W. Y. (2004, May). Learning block importance models for web pages. In Proceedings of the 13th international conference on World Wide Web (pp. 203-211). ACM.
[10] Xhemali, D., Hinde, C. J., & Stone, R. G. (2009). Naive bayes vs. decision trees vs. neural networks in the classification of training web pages.
[11] Liu, R., Zhou, J., & Liu, M. (2006, October). Graph-based semi-supervised learning algorithm for web page classification. In Intelligent Systems Design and Applications, 2006. ISDA'06. Sixth International Conference on (Vol. 2, pp. 856-860). IEEE.
[12] Samarawickrama, S., & Jayaratne, L. (2012, September). Effect of Named Entities in Web Page Classification. In Computational Intelligence, Modelling and Simulation (CIMSiM), 2012 Fourth International Conference on (pp. 38-42). IEEE.
[13] Saraç, E., & Ozel, S. A. (2013, June). Web page classification using firefly optimization. In Innovations in Intelligent Systems and Applications (INISTA), 2013 IEEE International Symposium on (pp. 1-5). IEEE.
[14] Ozel, S. A. (2011, June). A genetic algorithm based optimal feature selection for web page classification. In Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on (pp. 282-286). IEEE.
[15] Jebari, C., & Wani, M. A. (2012, December). A Multi-label and Adaptive Genre Classification of Web Pages. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on (Vol. 1, pp. 578-581). IEEE.
[16] He, Z., & Liu, Z. (2008, October). A Novel Approach to Naïve Bayes Web Page Automatic Classification. In Fuzzy Systems and Knowledge Discovery, 2008. FSKD'08. Fifth International Conference on (Vol. 2, pp. 361-365). IEEE.
[17] Sun, A., Lim, E. P., & Ng, W. K. (2002, November). Web classification using support vector machine. In Proceedings of the 4th international workshop on Web information and data management (pp. 96-99). ACM.
[18] Kan, M. Y., & Thi, H. O. N. (2005, October). Fast webpage classification using URL features. In Proceedings of the 14th ACM international conference on Information and knowledge management (pp. 325-326). ACM.
[19] Larkey, L. S., Ballesteros, L., & Connell, M. E. (2002, August). Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 275-282). ACM.
[20] Savoy, J. (1999). A stemming procedure and stopword list for general French corpora. JASIS, 50(10), 944-952.
[21] Kraaij, W., & Pohlmann, R. (1994). Porter's stemming algorithm for Dutch. Informatiewetenschap, 167-180.
[22] Papineni, K. (2001, June). Why inverse document frequency?. In Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies (pp. 1-8). Association for Computational Linguistics.
[23] Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2), 103-134.
[24] Soucy, P., & Mineau, G. W. (2005, July). Beyond TFIDF weighting for text categorization in the vector space model. In IJCAI (Vol. 5, pp. 1130-1135).
[25] Kennedy, J.; Eberhart, R.C., A discrete binary version of the particle swarm algorithm, Systems, Man, and Cybernetics, 1997. 'Computational Cybernetics and Simulation'., 1997 IEEE International Conference on Volume 5, 12-15 Oct. 1997 Page(s):4104-4108 vol.5.

Leela Devi B is with the School of Computer Applications at Professional Group of Institutions, Palladam. She is currently pursuing her doctorate in India.

Sankar A is currently working in PSG College of Technology, Coimbatore, India.