IJCSNS International Journal of Computer Science and Network Security, VOL.7 No.6, June 2007

Web Text Feature Extraction with Particle Swarm Optimization

Song Liangtu, Zhang Xiaoming

Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, 230031 China
Automation Department, University of Science and Technology of China, Hefei, 230036 China

Summary
The Internet continues to grow at a phenomenal rate, and the amount of information on the web is overwhelming. It provides us with a great deal of information resources. Because of their wide distribution, openness and high dynamics, the resources on the web are greatly scattered and have no unified management or structure. This greatly reduces the efficiency of using web information. Web text feature extraction is considered the main problem in text mining. We use the Vector Space Model (VSM) to describe web text and present a novel feature extraction algorithm based on an improved particle swarm optimization with reverse thinking particles (PSORTP). This algorithm can greatly improve the efficiency of web text processing.

Key words:
Web mining, VSM, Particle swarm optimization, Text feature extraction.

1. Introduction

The web is growing at a tremendous rate. It contains a huge amount of unstructured, distributed data. This content is a great potential source for information extraction, but it needs to be filtered, organized, and maintained in order to permit efficient use [1]. However, because of their wide distribution, openness and high dynamics, the resources on the web are greatly scattered and have no unified management or structure. This greatly reduces the efficiency of using web information. How to search the massive information on the web speedily and accurately has become a major problem [2]. There is an urgent need for tools that can quickly and effectively find resources and knowledge on the Web. The main form of information on the web is web text, so how to process web texts becomes the key question.
Data mining can help discover potential knowledge and information in massive raw data and effectively address the problem that information is abundant but knowledge is lacking. However, most classical data mining tools require structured information. Therefore, Web-based text mining has become a new theme of data mining. In this paper, a review of Web mining is presented, outlining the techniques of Web mining and data processing. A new Web text feature extraction method is then proposed, which uses a modified particle swarm optimization algorithm.

2. Web Mining and Description of Web Text

2.1 Definition of Web Mining

Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web [3]. This broad definition describes, on the one hand, the automatic search and retrieval of information and resources available from millions of sites and on-line databases, i.e., Web content mining, and on the other hand, the discovery and analysis of user access patterns from one or more Web servers or on-line services, i.e., Web usage mining. Web mining targets the large, heterogeneous, distributed data of the Web. The Web is semi-structured or unstructured and lacks machine-understandable semantics. Therefore, natural language understanding and text processing are the main methods used in the field of Web mining.

2.2 Description of Web Text

Most of the information on the Internet is in the form of Web texts. How to express the semi-structured and unstructured information of Web texts is the basic preparatory work of Web mining. The vector space model (VSM) is a good method that has been widely applied in recent years. It represents each document as a vector of weighted word frequencies. In a document vector, the value on each term axis represents the importance of the term in the document. The document vector is therefore simply the set of all terms, each paired with a value representing its importance in the document, like $V(d) = (t_1, w_1(d); \ldots; t_i, w_i(d); \ldots; t_n, w_n(d))$, where $t_i$ is term $i$ and $w_i(d)$ is the value representing the importance of term $i$ in document $d$.
$w_i(d) = \psi(tf_i(d))$ is usually defined as a function of $tf_i(d)$, the frequency of $t_i$ in document $d$. TF-IDF is a common method of determining the weights of the terms in the VSM. The weight of a term is proportional to its term frequency (TF) and inversely proportional to its document frequency (DF) [4].

Manuscript received June 5, 2007. Manuscript revised June 20, 2007.

The following is a commonly used formula for calculating the term weight:
$$W_{ik} = \frac{tf_{ik}\,\log(N/n_k + 0.01)}{\sqrt{\sum_{k=1}^{n} (tf_{ik})^2\,\log^2(N/n_k + 0.01)}} \qquad (1)$$

where $tf_{ik}$ is the frequency of term $T_k$ in document $D_i$, $N$ is the total number of sample documents, and $n_k$ is the number of documents that contain the term $T_k$. Terms that appear in headings, subheadings and keywords should always be emphasized.

2.3 Feature Extraction

Words and phrases are the basic elements of a document, and the frequency of each term shows a certain regularity across different documents, so term frequencies can be used to distinguish documents with different content. Text feature vectors are obtained through word segmentation algorithms and term frequency statistics. The dimension of a vector obtained in this way is immense. If the original text vector is not processed further, the cost of vector calculations will be tremendous and the whole process will be extremely inefficient. Therefore, we need to refine the text vector so that the feature vector becomes compact while the original meaning is preserved.

Extracting text features is an NP-complete problem. Many novel approaches to feature extraction have been proposed. Liu Li et al. presented a novel term selection and weighting approach based on key words [5]. Zou Juan et al. proposed a method of Chinese text eigenvalue extraction using multiple heuristic rules [6]. We believe that Web texts can be seen as a multi-dimensional information space made up of the feature terms. Text feature selection is then an optimization process in this multi-dimensional information space, so an efficient optimization algorithm is needed to find the text features. As a general intelligent search algorithm, Particle Swarm Optimization (PSO) was discovered through simulation of a simplified social model, and it can search a multidimensional complex space efficiently.
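As an illustration of the term weighting in Eq. (1) of Section 2.2, the following is a minimal Python sketch rather than the authors' implementation. It assumes the natural logarithm and a smoothing constant of 0.01; the names `tfidf_weights`, `term_freqs`, `doc_freqs` and `n_docs`, as well as the toy counts, are hypothetical and do not come from the paper.

```python
import math


def tfidf_weights(term_freqs, doc_freqs, n_docs):
    """Normalized TF-IDF weights of one document, in the form of Eq. (1).

    term_freqs: {term: tf_ik}, term frequencies in document D_i
    doc_freqs:  {term: n_k}, number of sample documents containing the term
    n_docs:     N, total number of sample documents
    """
    # Numerator of Eq. (1) for every term of the document.
    raw = {
        term: tf * math.log(n_docs / doc_freqs[term] + 0.01)
        for term, tf in term_freqs.items()
    }
    # Denominator of Eq. (1): Euclidean norm of the unnormalized weights.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: (w / norm if norm > 0 else 0.0) for term, w in raw.items()}


# Toy example (assumed counts): a frequent but ubiquitous term gets a low weight.
weights = tfidf_weights(
    term_freqs={"web": 3, "mining": 1, "the": 5},
    doc_freqs={"web": 10, "mining": 5, "the": 100},
    n_docs=100,
)
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Because of the normalization in the denominator, the squared weights of one document sum to 1, which keeps documents of different lengths comparable.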
In this paper, a novel approach to Web text feature extraction with a particle swarm optimization algorithm is presented.

3. Web Text Feature Extraction with Improved PSO Algorithm

Particle Swarm Optimization (PSO) was first introduced by James Kennedy and Russell C. Eberhart in 1995, and it was discovered through simulation of a simplified social model. It can search a multidimensional complex space efficiently through cooperation and competition among the individuals in a population of particles [7]. We think that text feature selection can be transformed into an optimization process in the multi-dimensional information space. First, the Web text is encoded and the text vector is viewed as the position vector of a particle. Because the number of text features is unknown, we propose a novel particle swarm optimization with alterable dimension. Reverse thinking particles are also added to the algorithm in order to improve the global search ability of the PSO. The main process of the approach is presented in the following.

3.1 The Selection of Text Features

Before extracting text features, we need to preprocess the text to get the feature terms. Extracting good features is a very important means and a basic requirement of text processing. From the point of view of natural language understanding, nouns and verbs form the main content of a text. Currently, the main methods used in text preprocessing include stemming of English documents and segmentation of Chinese documents. In this paper, our approach is dictionary-based forward maximum matching segmentation.

3.2 Code the Feature

The weight of a feature term of the web document is one dimension of a particle's position vector. Because the number of document features is unknown, we propose a dynamic PSO: the dimension of a particle's position is dynamic so that it can adapt to the uncertain number of document features. The particles in the swarm are initialized with feature terms selected at random from the term list produced by word segmentation.
A particle's position vector is formed mainly from the standardized web document vector model.

3.3 Optimization Procedure

When the PSO starts, a population of particles is first generated with random positions, and a random velocity is assigned to each particle [8]. The fitness of each particle is then evaluated according to the objective function, and the particles begin to search for the best solution. Each particle's trajectory is adjusted by dynamically altering its velocity according to its own search experience and the experience of the other particles. The position vector and the velocity vector of particle $j$ in the $d$-dimensional search space can be represented as $X_j =$
$(x_{j1}, x_{j2}, x_{j3}, \ldots, x_{jd})$ and $V_j = (v_{j1}, v_{j2}, v_{j3}, \ldots, v_{jd})$ respectively. According to the objective function, let the best position of each particle be $P_j = (p_{j1}, p_{j2}, p_{j3}, \ldots, p_{jd})$ and the best position of all the particles be $G = (g_1, g_2, g_3, \ldots, g_d)$. The formulas that the particles use to adjust their velocity and position are:

$$V_j(t+1) = V_j(t) + c_1\,rand_1()\,(P_j - X_j(t)) + c_2\,rand_2()\,(G - X_j(t)) \qquad (2)$$

$$X_j(t+1) = X_j(t) + V_j(t+1) \qquad (3)$$

where $t$ is the iteration index, $c_1$ and $c_2$ are acceleration coefficients that are usually set to 2, and $rand_1()$ and $rand_2()$ are random numbers in [0, 1] [9]. The first part of (2) represents the previous velocity. The second part represents the personal experience. The third part represents the collaborative effect of the particles, and it always pulls the particles toward the global best solution found so far. At each iteration of PSO, the velocity of each particle is calculated according to (2) and the position is updated according to (3). Generally, a maximum velocity vector is defined in order to control $V_j$. Whenever a component $v_{jd}$ exceeds the defined limit, it is set to $v_{max,d}$. If a particle finds a better position than its previously found best position, the new position is stored in memory. The algorithm goes on until a satisfactory solution is found or the predefined number of iterations is reached.

3.4 Evaluation Function

The evaluation function is an important criterion for judging the merit of a particle's location. Once the terms of a web document have been acquired, the location of each particle is composed of feature weights of that same document. The best particle should therefore reflect the web document well and should also carry the distribution information of the other particles. Accordingly, an individual's fitness is measured against all particles that represent the same document: the more similar a particle is to them, the greater its fitness. The similarity of two position vectors $P$ and $Q$ is measured as:

$$similar(P, Q) = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2 \,\sum_{i=1}^{n} q_i^2}} \qquad (5)$$
The fitness function is defined as follows:

$$fitness(P_i^d) = \sum_{j=1}^{swarm\_size} similar(P_i^d, P_j^d) \,/\, swarm\_size \qquad (4)$$

The similarity of two vectors can be measured by the angle between them or by the distance between the two particles. The angle is easy to use, and the corresponding formula is Eq. (5), where $p_i$ and $q_i$ are the elements of the vectors $P$ and $Q$.

In addition, the dimension of a vector is also an important criterion. If several particles have the same fitness, the smaller the dimension is, the better the particle is. Therefore, we need to add a penalty function $p(d)$ to the evaluation function in order to lead the particles to the best region. In this paper, we set the penalty function $p(d)$ as follows:

d + 2 pd ( ) = (6) d + 2

where $d$ is the dimension of the vectors. The final evaluation function is formed by adding the penalty function to the fitness formula:

$$value(d) = p(d)\,fitness(d) \qquad (7)$$

4. Feasibility Analysis and Design

Theoretically, the dimension of a particle's location should be dynamic. In practice, it is not: we use empty elements to fill the diminished dimensions. When the weight of a feature is less than 0.05, we also replace it with an empty element. The structure of a particle's location is as follows:

W1  W2  …  Wm  B1  B2  …  Bn

Fig. 1. The structure of a particle

After the particle design is finished, we use the particle swarm optimization algorithm with reverse thinking particles [10]. The distinguishing trait of the reverse thinking particles is that they do not update their position according to (3); they move in the opposite direction, as illustrated in Fig. 2. They are selected randomly from the particle swarm, and the formula that updates their position is:

$$X'_j(t+1) = X_j(t) - V_j(t+1) \qquad (8)$$
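The update rules (2), (3) and (8), together with the similarity-based fitness of Eqs. (4) and (5), can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: empty elements are represented by 0.0, the penalty $p(d)$ of Eq. (6) is left out, a single scalar `v_max` clamps every velocity component, and all function and parameter names are hypothetical.

```python
import math
import random


def cosine_similarity(p, q):
    """Eq. (5): cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def swarm_fitness(i, positions):
    """Eq. (4): mean similarity between particle i and every particle in the swarm."""
    total = sum(cosine_similarity(positions[i], other) for other in positions)
    return total / len(positions)


def prune(position, threshold=0.05):
    """Replace weights below the threshold by an empty element (assumed to be 0.0)."""
    return [w if w >= threshold else 0.0 for w in position]


def step(x, v, p_best, g_best, reverse=False, c1=2.0, c2=2.0, v_max=1.0, rng=random):
    """One move of one particle: Eq. (2), then Eq. (3) or, if reverse, Eq. (8)."""
    new_x, new_v = [], []
    for xd, vd, pd, gd in zip(x, v, p_best, g_best):
        vel = vd + c1 * rng.random() * (pd - xd) + c2 * rng.random() * (gd - xd)
        vel = max(-v_max, min(v_max, vel))  # clamp to the velocity limit
        new_v.append(vel)
        # Normal particles add the velocity; reverse thinking particles subtract it.
        new_x.append(xd - vel if reverse else xd + vel)
    return new_x, new_v


# Toy usage with assumed values: the same particle moved normally and in reverse.
x, v = [0.2, 0.4], [0.5, -0.25]
print(step(x, v, p_best=x, g_best=x))
print(step(x, v, p_best=x, g_best=x, reverse=True))
```

A reverse thinking particle uses the same velocity as a normal particle would, so the only difference between the two kinds of particle in this sketch is the sign applied in the position update.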
Fig. 2. Trait of reverse thinking particles

When the algorithm starts, the normal particles follow formulas (2) and (3), and the reverse thinking particles follow formulas (2) and (8). If one of the reverse thinking particles finds a better solution than G (the best position of all the particles), that reverse thinking particle changes into a normal particle, and one particle is selected randomly from the normal particles to become a reverse thinking particle. If G is acquired by a normal particle, the reverse thinking particles continue to move in the opposite direction. When the position of a reverse thinking particle reaches the border of the problem space, it is set to be a normal particle that has the best position G. The flow chart of PSORTP is depicted in Fig. 3.

Fig. 3. The flow chart of PSORTP

5. Experiments

The main purpose of the experiments is to verify the effectiveness of the novel web text feature extraction algorithm. The experimental materials are web pages (HTML) parsed by HTMLParser. We use two criteria: accuracy rate (AR) and reduction rate (RR).

$$AR(t) = \frac{hl(ao(t))}{ao(t)} \times 100\% \qquad (9)$$

$$RR(t) = \frac{fv}{ov} \times 100\% \qquad (10)$$

where
ao(t): the number of feature terms of text t after optimization;
hl(ao(t)): the number of high weight feature terms of HLSegment [11];
fv: the number of feature terms of text t;
ov: the number of original terms of text t.

We tested four web pages. The results are as follows:

Table 1: The test results of the effectiveness of the novel web text feature extraction algorithm.

No.   Text size   ov(t)   ao(t)   AR(t)    RR(t)
1     996         396     187     88.9%    47.2%
2     7           331     156     93.2%    47.1%
3     56          293     98      95.8%    33.4%
4     78          88      41      96.1%    46.6%
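As a quick arithmetic check of Eq. (10), the sketch below recomputes RR for texts 1 and 3 from the ov(t) and ao(t) columns of Table 1. It assumes that fv equals the value in the ao(t) column, which the source does not state explicitly; the function name `reduction_rate` is hypothetical.

```python
def reduction_rate(feature_terms, original_terms):
    """Eq. (10): feature terms as a percentage of the original terms."""
    return 100.0 * feature_terms / original_terms


# (ov(t), ao(t)) for texts 1 and 3 of Table 1; fv is assumed to equal ao(t).
rows = {1: (396, 187), 3: (293, 98)}
for number, (ov, ao) in rows.items():
    print(f"text {number}: RR = {reduction_rate(ao, ov):.1f}%")
# prints:
# text 1: RR = 47.2%
# text 3: RR = 33.4%
```

Under this assumption the recomputed values agree with the RR(t) column, so a lower RR means that fewer of the original terms are kept as features.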
From the results, we can see that this method optimizes the text features: on the four test pages the accuracy rate is at least 88.9%, while the retained feature terms amount to between 33.4% and 47.2% of the original terms. Based on this text feature extraction algorithm, we can greatly improve the efficiency of web document processing.

6. Conclusion

In this paper, a novel Web text feature extraction algorithm is proposed. The algorithm is based on particle swarm optimization with reverse thinking particles, and the structure of the particles is also improved. The algorithm can search the multidimensional complex space efficiently. This method presents a good idea for feature dimension reduction and for improving the efficiency of web document processing. We are continuing our experiments on the algorithm and will release more of our results in the future. We will also concentrate on applying the method to fields like Web document filtering, classification and clustering.

References

[1] Maya Rupert, et al. The Web and Complex Adaptive Systems. Proceedings of the 20th International Conference on Advanced Information Networking and Applications (AINA'06), pp. 200-204, 2006.
[2] Wang Shi, et al. Web mining [J]. Computer Science, 27(4): 28~31, 2000.
[3] http://maya.cs.depaul.edu/~mobasher/webminer/survey/survey.html
[4] Shi Zhongzhi. Knowledge Discovery. Tsinghua University Press, 2002.
[5] Liu Li, et al. Term selection and weighting approach based on key words in text categorization. Computer Engineering and Design, 2006.6.
[6] Zou Juan, et al. An Eigenvalue Extraction Method for Chinese Texts Using Multiple Heuristic Rules. Computer Engineering & Science, 2006.8.
[7] Kennedy J., Eberhart R.C.: Particle swarm optimisation. Proceedings of the IEEE International Conference on Neural Networks. (1995) 1942-1948.
[8] Ratnaweera, et al.: Self-Organizing Hierarchical Particle Swarm Optimizer With Time-Varying Acceleration Coefficients. IEEE Transactions on Evolutionary Computation. (2004) 240-255.
[9] Shi Y.H., Eberhart R.C.: Parameter Selection in Particle Swarm Optimization.
Evolutionary Programming VII: Proc. EP 98. (1998) 591-600.
[10] Zhang Xiaoming, Wang Rujing. Particle Swarm Optimization Algorithm with Reverse Thinking Particles. Computer Science, 2006, 33(10): 56~59.
[11] www.hylanda.com

Song Liangtu received the B.E. and M.E. degrees from Anhui Agri. Univ. in 1987 and 1990, respectively. He is working toward the Ph.D. in the Automation Department, University of Science and Technology of China. He has been an associate professor in the Institute of Intelligent Machines, Chinese Academy of Sciences, since 2000. His research interests include data acquisition, pattern recognition and intelligent systems.

Zhang Xiaoming received the B.E. from Shandong University of Technology in 2002. He received the M.E. from the Graduate University of Chinese Academy of Sciences in 2006. He is now a Research Assistant at the Institute of Intelligent Machines, Chinese Academy of Sciences. His research interests include computational intelligence, web mining and machine learning.